CSCI 6356 Fall 2000
Xiannong Meng
Programming Assignment Two
Assigned: Thursday, September 21, 2000
Due: Thursday, October 12, 2000
Web search has become a very active topic of discussion. Efficient,
fast search is key to a search engine's success. One important
component of a search engine is searching for a keyword in a piece of
text. In this project, you are to implement a parallel program that
searches and retrieves text in parallel.
Assume your program receives a piece of text (such as the one passed
back by a crawler). Your task consists of two parts. Part one is to
parse the received text so that all words are extracted from it.
Part two is to search the text to see whether a particular
keyword given by the user is in the text.
Here are some details for part one of the program.
- You may assume for now that the text file is stored on your local
disk. The program needs to open the file for reading and read it
into an array of characters. For a small file, the program may read the whole
file at once and store it in the array. For large files, however, the program
may have to distribute the text to other processes while reading.
- The second step in the program is to distribute the text to
the participating hosts using any MPI function you'd like. You may use
the data-parallel model or the processor-farm model. Each piece of
text the control program distributes to the workers should have a
unique id.
- Each node, after receiving the part of the text that is
assigned to it, works on the text using the same algorithm to extract
words and their locations. The location here is defined as a pair
(text-id, order), where order n means the word is the n-th word in this
piece of text. If a word appears more than once, only the first
location is recorded. (An added bonus would be to record all occurrences.)
- After all the nodes collect the information, they send the
information to the control process which will combine the information
to form one master file.
- How do you extract a word out of a sequence of characters (text)?
There are many different ways of doing it. A simple way is to use
the function fscanf to extract the word. After extracting a
word, you may have to remove extra characters that do not
belong to the original word. For example, fscanf would read
<html> as one word, so your program should remove the leading and
trailing bracket symbols.
- You need to select proper data structures to store the
words and their location. Remember that duplicated words should be removed.
- The resulting words and their locations should be written
to a file so that they can be searched later.
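The word-extraction steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not a required design: the names Entry, strip_word, find_entry, and index_text are hypothetical, as is the MAX_WORD limit. It uses sscanf on an in-memory piece of text (the same idea as fscanf on a file), strips leading and trailing non-alphanumeric characters, and records only the first location of each word as (text-id, order). Here order counts only the tokens that are still non-empty after stripping, and words are compared case-sensitively; both are assumptions you may choose differently.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 64   /* assumed limit on word length, including '\0' */

/* One record: a word and its first location (text-id, order). */
typedef struct {
    char word[MAX_WORD];
    int  text_id;
    int  order;
} Entry;

/* Remove leading and trailing non-alphanumeric characters in place,
   so that "<html>" becomes "html". */
void strip_word(char *w)
{
    size_t start = 0, end = strlen(w);

    while (start < end && !isalnum((unsigned char)w[start]))
        start++;
    while (end > start && !isalnum((unsigned char)w[end - 1]))
        end--;
    memmove(w, w + start, end - start);
    w[end - start] = '\0';
}

/* Linear lookup: return the index of word in tab, or -1 if absent. */
int find_entry(const Entry *tab, int n, const char *word)
{
    for (int i = 0; i < n; i++)
        if (strcmp(tab[i].word, word) == 0)
            return i;
    return -1;
}

/* Extract the words of one piece of text and store the first location
   of each distinct word in tab. Returns the number of entries stored. */
int index_text(const char *text, int text_id, Entry *tab, int max)
{
    char raw[MAX_WORD];
    int n = 0, order = 0, used = 0;

    /* %n reports how many characters were consumed, so we can advance. */
    while (sscanf(text, "%63s%n", raw, &used) == 1) {
        text += used;
        strip_word(raw);
        if (raw[0] == '\0')
            continue;               /* token was punctuation only */
        order++;
        if (find_entry(tab, n, raw) >= 0)
            continue;               /* duplicate: keep first location */
        if (n == max)
            break;                  /* table is full */
        strcpy(tab[n].word, raw);
        tab[n].text_id = text_id;
        tab[n].order = order;
        n++;
    }
    return n;
}
```

Each worker could call index_text on its own piece of text and send the resulting array back to the control process. The linear lookup keeps the sketch short; for large texts a hash table or a sorted array would be a better fit. Tokens longer than 63 characters are split by the format width, which a real program should handle explicitly.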
Here are some details for part two of the program. (You may write this
as a separate program or as another function in the same program.)
- The second part of the program does a search for a keyword
given by a user. The program should report if the keyword is found in
the text file or not. If it is, the location of the keyword (as
defined above) should be printed. If not, the program should print a
message indicating so.
- The control process should read the master list of word
structures (each holding the word itself and its location) and then
distribute the list to the worker processes.
- When a keyword is typed in by the user, the control process
broadcasts this word to all the worker processes.
- Each node searches for the word in its part of the list and
reports the result to the control process. If the word is found, the
control process reports its location as (text-id, order). If the word
is not found, the control process prints a message to indicate such a
result.
- The program should keep looping until the user chooses to
quit (e.g., by typing the keyword quit).
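The search steps above can be sketched sequentially as follows. This is a hedged illustration only: the names Entry, search_slice, lookup, and query_loop are hypothetical, and the loop over slices merely stands in for the worker processes. In an MPI version each slice would live on a different worker, the keyword would be sent with a collective such as MPI_Bcast, and each worker would report its own result to the control process.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORD 64   /* assumed limit on word length, including '\0' */

/* One record of the master list: a word and its (text-id, order). */
typedef struct {
    char word[MAX_WORD];
    int  text_id;
    int  order;
} Entry;

/* What one worker does: scan its slice for the keyword.
   Returns a pointer to the matching record, or NULL. */
const Entry *search_slice(const Entry *slice, int n, const char *keyword)
{
    for (int i = 0; i < n; i++)
        if (strcmp(slice[i].word, keyword) == 0)
            return &slice[i];
    return NULL;
}

/* Split the list into nworkers nearly equal slices (nworkers >= 1)
   and search each one in turn, standing in for the parallel workers. */
const Entry *lookup(const Entry *list, int n, int nworkers,
                    const char *keyword)
{
    int base = n / nworkers, extra = n % nworkers, start = 0;

    for (int w = 0; w < nworkers; w++) {
        int len = base + (w < extra ? 1 : 0);
        const Entry *hit = search_slice(list + start, len, keyword);
        if (hit != NULL)
            return hit;
        start += len;
    }
    return NULL;
}

/* Control-process loop: read keywords until "quit" or end of input. */
void query_loop(const Entry *list, int n, int nworkers)
{
    char keyword[MAX_WORD];

    for (;;) {
        printf("keyword> ");
        fflush(stdout);
        if (scanf("%63s", keyword) != 1 || strcmp(keyword, "quit") == 0)
            break;
        const Entry *hit = lookup(list, n, nworkers, keyword);
        if (hit != NULL)
            printf("found at (%d, %d)\n", hit->text_id, hit->order);
        else
            printf("\"%s\" not found\n", keyword);
    }
}
```

Since the master list holds each word at most once, at most one slice can report a hit, so the control process only has to check whether any worker found the word.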
Hand in: Please email me, by the due date, the location of your
program on our lab computers. I'll go there and check it on-line.
Those who work outside our lab should ftp their program to the lab so
that I can check it on-line.
Xiannong Meng
Thu Sep 21 12:44:22 CDT 2000