Introduction
In the previous phases of the project, you built a web server that can interact with any browser, created an indexer that can parse and index a collection of text documents, and developed a crawler that can collect live pages from the web. In this phase, you will put the three previous phases together to make a working search engine. The expected product of this phase is a system that can do the following:
- Collect live web pages;
- Parse and index these web pages;
- Take user queries from a web browser;
- Retrieve the web pages in the collection that are relevant to the query;
- Send the URLs of these relevant web pages back to the browser and have the browser display them in a reasonable, readable format.
Together, these components make a working search engine. The crawler collects web pages and sends them to the indexer for parsing and indexing. The indexer builds an inverted index that links each term in the document collection to all the documents that contain the term. The indexer also writes the index to a disk file so that it can be reused without crawling again. On the other side, the user interface component takes the user's query and sends it to the ranking/retrieval component for processing. The ranking component looks up the query in the inverted index to find the relevant documents and returns the list of their URLs to the user.
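The indexing and lookup steps described above can be sketched as follows. This is only a minimal illustration, not the required design: the class name InvertedIndexSketch, its method names, the tokenization rule (lowercase, split on non-alphanumeric characters), and the one-line-per-term file format are all assumptions made for the example, and the example.com URLs are placeholders.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of an inverted index: term -> URLs of the pages that contain it.
final class InvertedIndexSketch {
    // TreeMap keeps terms sorted; LinkedHashSet keeps URLs unique and in indexing order.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Lowercases the text, splits it on non-alphanumeric characters,
    // and records the URL under every term found.
    void addPage(String url, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(url);
            }
        }
    }

    // Returns the posting list for a single-term query (unranked, in indexing order).
    List<String> lookup(String term) {
        Set<String> urls = postings.get(term.toLowerCase(Locale.ROOT));
        return urls == null ? new ArrayList<>() : new ArrayList<>(urls);
    }

    // Writes one line per term: the term followed by its URLs, separated by spaces.
    // Assumes URLs contain no spaces (true for properly encoded URLs).
    void save(Path file) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Set<String>> e : postings.entrySet()) {
                out.write(e.getKey() + " " + String.join(" ", e.getValue()));
                out.newLine();
            }
        }
    }

    // Rebuilds an index from a file written by save(), so no new crawl is needed.
    static InvertedIndexSketch load(Path file) throws IOException {
        InvertedIndexSketch index = new InvertedIndexSketch();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(" ");
                Set<String> urls = new LinkedHashSet<>();
                for (int i = 1; i < parts.length; i++) {
                    urls.add(parts[i]);
                }
                index.postings.put(parts[0], urls);
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addPage("http://example.com/a", "Computer science basics");
        index.addPage("http://example.com/b", "A computer and a network");

        Path file = Files.createTempFile("index", ".txt");
        index.save(file);

        // Query the reloaded index instead of the in-memory one.
        System.out.println(InvertedIndexSketch.load(file).lookup("computer"));
    }
}
```

Because the posting list is returned in the order the pages were indexed, the result is unranked, which matches what this phase asks for.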
The first three phases of the project accomplish all of the above, except that the returned list is not ranked in any particular order. Because of limited time, the computation of ranking is left as the last phase of the project.
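As one possible illustration of how the user interface component might take the query from the browser, the sketch below extracts a query term from the first line of an HTTP GET request. The path /search, the parameter name q, and the class QueryParamSketch are assumptions for the example; your web server from the earlier phase may receive the query in a different form.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: pulls one parameter out of an HTTP request line.
final class QueryParamSketch {
    // Returns the decoded value of the named parameter, or null if it is absent.
    // Note: URLDecoder.decode throws IllegalArgumentException on malformed %-escapes.
    static String extractParam(String requestLine, String name) {
        // A request line looks like: METHOD SP request-target SP HTTP-version
        String[] parts = requestLine.split(" ");
        if (parts.length < 2) {
            return null;
        }
        String target = parts[1];
        int q = target.indexOf('?');
        if (q < 0) {
            return null; // no query string at all
        }
        for (String pair : target.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // parameter without a value
            }
            String key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8);
            if (key.equals(name)) {
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractParam("GET /search?q=computer+science HTTP/1.1", "q"));
    }
}
```

The decoded term can then be handed to the retrieval component. The Charset overload of URLDecoder.decode used here requires Java 10 or later.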
Your Tasks
As described in the Introduction, your task in this phase is to combine the previous three phases into a working search engine. One necessary piece of functionality that we did not emphasize in the first three phases is formatting the output for display in the browser. When a list of URLs is returned from the ranker to the browser, you need to add HTML tags to each URL in the list. The following example illustrates this idea.
Assume the search query term computer produced the following URLs from your ranker in the given order (e.g., the order on the posting list).
Figure 1: Raw List of URLs Generated by Ranker
Given this list, your ranker or user interface component is responsible for adding the necessary HTML tags so that the list is formatted properly for display in the browser. (Use the Java class Html.java as your code base, if you'd like.) Here is how the list may look with the added HTML tags.
Figure 2: List of URLs with Added HTML Tags
With the HTML tags in place, here is how the returned page containing the list of URLs may look.
Figure 3: Formatted Output Displayed on Screen
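The formatting step illustrated by the figures above could be sketched along these lines. This is a minimal example under stated assumptions: the class UrlListHtmlSketch and its methods are hypothetical and are not the API of Html.java, the choice of an ordered list with one link per item is only one reasonable layout (the tags in Figure 2 may differ), and the example.com URLs are placeholders.

```java
import java.util.List;

// Hypothetical formatter: wraps a raw list of URLs in a simple HTML page.
final class UrlListHtmlSketch {
    // Escapes the characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds a complete page with a heading and one clickable link per URL,
    // keeping the URLs in the order they were received.
    static String toHtml(String query, List<String> urls) {
        String safeQuery = escape(query);
        StringBuilder html = new StringBuilder();
        html.append("<html>\n");
        html.append("<head><title>Results for ").append(safeQuery).append("</title></head>\n");
        html.append("<body>\n");
        html.append("<h2>Results for ").append(safeQuery).append("</h2>\n");
        html.append("<ol>\n");
        for (String url : urls) {
            String safeUrl = escape(url);
            html.append("  <li><a href=\"").append(safeUrl).append("\">")
                .append(safeUrl).append("</a></li>\n");
        }
        html.append("</ol>\n");
        html.append("</body>\n");
        html.append("</html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.com/a", "http://example.com/b");
        System.out.print(toHtml("computer", urls));
    }
}
```

Escaping the query and the URLs before inserting them into the page keeps characters such as & and < from breaking the markup when the browser renders the list.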
What to Hand In
Your team needs to hand in the following in the order given.
- A team report for phase four of the project with a cover page. The report shouldn't be too long, maybe two to three pages. The report should include the following as a minimum.
  - The name of your team (it should be the same as the name of your search engine);
  - The name of each team member;
  - A description of the role and contributions of each team member;
  - A summary of the working process, e.g., what the team started with, what the team has accomplished, any problems encountered and how the team solved them, and any thoughts on the project.
- Team meeting minutes during this phase of the work.
- Source code for the programs and any other supporting documents. (This should be a complete set of a working search engine, without ranking.)
- Snapshots of sample runs. Use any screen-capture or copy-and-paste feature to capture screenshots.
- Email the instructor a copy of the complete source code and sample runs in zip or tar format.