CSCI 6356 Fall 2000
Xiannong Meng Programming Assignment Two
Assigned: Thursday September 21, 2000 Due: Thursday October 12, 2000

Web search has become a hot topic. Efficient, fast search is key to a search engine's success. One important component of a search engine is searching for a keyword in a piece of text. In this project, you are to implement a parallel program that searches and retrieves text in parallel.

Assume your program receives a piece of text (such as one passed back by a crawler). Your task consists of two parts. Part one is to parse the received text so that all words are extracted from it. Part two is to search the text to see whether a particular keyword given by the user appears in it.

Here are some details for part one of the program.

You may assume for now that the text file is stored on your local disk. The program needs to open the file for reading and read it into an array of characters. For a small file, the program may read the whole file at once and store it in the array. For large files, however, the program may have to distribute the text to other processes while reading.
The second step in the program is to distribute the text to the participating hosts using any MPI function you'd like. You may use the data-parallel model or the processor-farm model. Each piece of text the control program distributes to the workers should have a unique id.
Each node, after receiving the part of the text assigned to it, works on that text using the same algorithm to extract words and their locations. A location is defined as a pair (text-id, order), where order n means the word is the n-th word in this piece of text. If a word appears more than once, only the first location is recorded. (An added bonus would be to record all occurrences.)
After all the nodes have collected their information, they send it to the control process, which combines it to form one master file.
How do you extract a word from a sequence of characters (text)? There are many ways of doing it. A simple way is to use the function fscanf to extract the word. After extracting a word, you may have to remove extra characters that do not belong to the original word. For example, fscanf would read <html> as one word; your program should remove the leading and trailing bracket symbols.
You need to select proper data structures to store the words and their locations. Remember that duplicate words should be removed.
The resulting words and their locations should be written to a file so that they can be searched later.
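
The extraction steps above can be sketched in plain C as follows. This is only a minimal illustration under several assumptions, not a required design: the names Entry, clean_word, find_word, and index_text are hypothetical, the table is a fixed-size array searched linearly, words are compared case-sensitively, and tokens that consist only of punctuation are not counted toward order. Because each worker already holds its piece of text in a character array, the sketch uses sscanf rather than fscanf. No MPI calls appear here; the function is what each worker might run on the piece it receives.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 64     /* assumed upper bound on word length, including '\0' */
#define MAX_WORDS 1024  /* assumed upper bound on distinct words per piece */

/* One index entry: a word and the location of its first occurrence. */
typedef struct {
    char word[MAX_WORD];
    int text_id;  /* id of the piece of text the word came from */
    int order;    /* the word is the order-th word in that piece (1-based) */
} Entry;

/* Strip leading and trailing non-alphanumeric characters in place,
   so that "<html>" becomes "html". Interior characters are kept. */
void clean_word(char *w)
{
    size_t start = 0, end = strlen(w);
    while (start < end && !isalnum((unsigned char)w[start])) start++;
    while (end > start && !isalnum((unsigned char)w[end - 1])) end--;
    memmove(w, w + start, end - start);
    w[end - start] = '\0';
}

/* Return the position of word in table[0..n-1], or -1 if it is absent. */
int find_word(const Entry *table, int n, const char *word)
{
    for (int i = 0; i < n; i++)
        if (strcmp(table[i].word, word) == 0) return i;
    return -1;
}

/* Extract the words of text, storing each distinct word once together
   with the location of its first occurrence. Returns the entry count. */
int index_text(const char *text, int text_id, Entry *table, int max)
{
    char buf[MAX_WORD];
    int n = 0, order = 0, used = 0;
    /* %63s reads one whitespace-delimited token (longer tokens are split);
       %n reports how many characters were consumed so far. */
    while (sscanf(text, "%63s%n", buf, &used) == 1) {
        text += used;
        clean_word(buf);
        if (buf[0] == '\0') continue;                 /* only punctuation */
        order++;
        if (find_word(table, n, buf) >= 0) continue;  /* duplicate word */
        if (n == max) break;                          /* table is full */
        strcpy(table[n].word, buf);
        table[n].text_id = text_id;
        table[n].order = order;
        n++;
    }
    return n;
}
```

A linear scan for duplicates is quadratic in the number of distinct words, which is acceptable for small pieces of text; a hash table or a sorted array would be a natural replacement for larger inputs.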

Here are some details for part two of the program. (You may write this as a separate program or as another function in the same program.)

The second part of the program does a search for a keyword given by a user. The program should report if the keyword is found in the text file or not. If it is, the location of the keyword (as defined above) should be printed. If not, the program should print a message indicating so.
The control process should read the master list of word structures (each holding the word itself and its location), then distribute the list to the worker processes.
When the user types in a keyword, the control process broadcasts this word to all the worker processes.
Each node searches for the word in its part of the list and reports the result to the control process. If the word is found, the control process reports its location as (text-id, order). If the word is not found, the control process prints a message to that effect.
The program should keep looping until the user chooses to quit (e.g., by typing the keyword quit).
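
The search steps above can be sketched as follows. This is a single-process simulation under stated assumptions rather than an MPI program: search_slice stands for the work of one worker, search_all stands for the control process combining the workers' answers, and query_loop is the interactive loop. All three names, the Entry layout, and the contiguous slicing of the list are hypothetical choices made for illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORD 64  /* assumed upper bound on word length, including '\0' */

/* One entry of the master list: a word and its (text-id, order) location. */
typedef struct {
    char word[MAX_WORD];
    int text_id;
    int order;
} Entry;

/* Work of one worker: linear search over its own slice of the list.
   Returns a pointer to the matching entry, or NULL if there is none. */
const Entry *search_slice(const Entry *slice, int n, const char *keyword)
{
    for (int i = 0; i < n; i++)
        if (strcmp(slice[i].word, keyword) == 0) return &slice[i];
    return NULL;
}

/* Work of the control process, simulated in one process: cut the list
   into nworkers contiguous slices, search each, and combine the results. */
const Entry *search_all(const Entry *list, int n, int nworkers,
                        const char *keyword)
{
    for (int w = 0; w < nworkers; w++) {
        int lo = (int)((long)n * w / nworkers);
        int hi = (int)((long)n * (w + 1) / nworkers);
        const Entry *hit = search_slice(list + lo, hi - lo, keyword);
        if (hit) return hit;
    }
    return NULL;
}

/* Read keywords from in until end of input or the word "quit", writing
   one result line per keyword to out. Returns the number of searches. */
int query_loop(FILE *in, FILE *out, const Entry *list, int n, int nworkers)
{
    char keyword[MAX_WORD];
    int searched = 0;
    while (fscanf(in, "%63s", keyword) == 1 && strcmp(keyword, "quit") != 0) {
        const Entry *hit = search_all(list, n, nworkers, keyword);
        if (hit)
            fprintf(out, "%s found at (%d, %d)\n",
                    keyword, hit->text_id, hit->order);
        else
            fprintf(out, "%s not found\n", keyword);
        searched++;
    }
    return searched;
}
```

Calling query_loop(stdin, stdout, list, n, nworkers) would give the interactive behavior. In an actual MPI version, the loop over slices in search_all would presumably be replaced by a broadcast of the keyword (for example with MPI_Bcast) followed by each worker sending its result back to the control process.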

Hand in: Please email me, by the due date, the location of your program on our lab computers. I'll go there and check it on-line. If you work outside our lab, please ftp your program to the lab so that I can check it on-line.

Xiannong Meng
Thu Sep 21 12:44:22 CDT 2000