Introduction
In this phase of the project you are to implement a crawler (robot) to collect information from the web. The crawler starts with a given set of one or more starting URLs. It retrieves the pages specified by this starting set, parses them, and extracts further URLs from those pages. The crawler then visits the pages behind these newly harvested URLs. This process continues until the time allowed has expired, the number of retrieved web pages has reached a preset limit, or there are no new pages to visit.
The retrieved web pages are passed to the Indexer you built in the second phase of the project for processing. The Indexer builds an inverted index over all the retrieved pages, ready for a user to search. In the index-building phase your Indexer took a set of local files as its input source; now it takes its input from the web pages retrieved by the Crawler.
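For illustration only, the hand-off from crawler to indexer might look like the sketch below. The Indexer interface shown here (an indexPage method taking a URL and the page text) is hypothetical; substitute whatever interface your phase-two Indexer actually exposes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class CrawlerIndexerHandoff {

        // Hypothetical phase-two Indexer interface; replace with your own class.
        interface Indexer {
            void indexPage(String url, String text);
        }

        // Fetch a page over HTTP and pass its text to the Indexer
        // instead of reading the text from a local file.
        static void fetchAndIndex(String pageUrl, Indexer indexer) throws Exception {
            StringBuilder text = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(pageUrl).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    text.append(line).append('\n');
                }
            }
            indexer.indexPage(pageUrl, text.toString());
        }
    }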
General Algorithm
Traversing the web is very similar to traversing a general graph. Each web page can be considered a node in the graph, and each hyperlink an edge. From this point of view, crawling the web is not much different from the graph traversal algorithms you learned in a typical data structures course.
The following is a general web-traversal algorithm that we discussed in lectures.
Figure 1: General Algorithm for Crawling the Web
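The figure is not reproduced in this text; as a rough sketch of the idea, assuming placeholder fetchPage, indexPage, and extractLinks methods that you would replace with your own code, the loop might look like this:

    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    public class CrawlSketch {

        static final int MAX_PAGES = 500;   // stop after this many pages

        static void crawl(List<String> startUrls) {
            Queue<String> frontier = new ArrayDeque<>(startUrls); // URLs waiting to be visited
            Set<String> visited = new HashSet<>();                // URLs already fetched

            while (!frontier.isEmpty() && visited.size() < MAX_PAGES) {
                String url = frontier.remove();
                if (!visited.add(url)) {
                    continue;                 // already seen this page
                }
                String page = fetchPage(url); // download the page (placeholder)
                indexPage(url, page);         // hand the page to the Indexer (placeholder)
                for (String link : extractLinks(url, page)) {
                    if (!visited.contains(link)) {
                        frontier.add(link);   // harvest new URLs for later visits
                    }
                }
            }
        }

        // Placeholders for your own fetching, indexing, and parsing code.
        static String fetchPage(String url) { return ""; }
        static void indexPage(String url, String page) { }
        static List<String> extractLinks(String baseUrl, String page) { return Collections.emptyList(); }
    }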
Some Issues To Be Considered
Because of the vast size of the web, there are technical and engineering issues to consider in building a successful, unobtrusive crawler. Some of them are listed here.
- Obey the robots protocol. See http://www.robotstxt.org/robotstxt.html for specific details. The basic idea is to first check the server (typically at the site root) to see whether the administrator has put a robots.txt file in place. If the file is there, check its contents to see which directories are excluded from visiting. You may ignore the per-page meta tags for now. Do not retrieve, index, or analyze the directories and pages that are excluded. If you do not follow the protocol, you may receive a direct complaint from the server administrator, and your access to these web sites may be blocked. Pay attention to this issue.
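As a rough illustration only (real robots.txt handling has more rules, such as matching the correct User-agent record), a crawler might fetch and parse the file along these lines:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsSketch {

        // Collect the Disallow: prefixes from http://<host>/robots.txt.
        // Simplified: applies every record to us, not just "User-agent: *".
        static List<String> disallowedPaths(String host) {
            List<String> disallowed = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL("http://" + host + "/robots.txt").openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("disallow:")) {
                        String path = line.substring("disallow:".length()).trim();
                        if (!path.isEmpty()) {
                            disallowed.add(path);
                        }
                    }
                }
            } catch (Exception e) {
                // No robots.txt (or it is unreachable): nothing is excluded.
            }
            return disallowed;
        }

        // A URL path may be visited only if it does not start with a disallowed prefix.
        static boolean mayVisit(String path, List<String> disallowed) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }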
- When visiting a web site you should report to the server the name of the user agent, the host where your program is running, and a valid email address in case the site administrator wants to contact you. This can be done when establishing contact with the server, as shown in the following example in Java. The syntax in other languages varies, but the idea is the same.
Figure 2: Crawler Information Reporting
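The figure itself is not reproduced in this text; the following is a minimal sketch of this kind of reporting over a raw socket, assuming an HTTP/1.0 GET request (the crawler name and email address are placeholders you should replace with your own):

    import java.io.PrintWriter;
    import java.net.InetAddress;
    import java.net.Socket;

    public class ReportSketch {

        static void sendRequest(String server, String path) throws Exception {
            InetAddress local = InetAddress.getLocalHost();   // the machine running the crawler
            try (Socket socket = new Socket(server, 80);
                 PrintWriter out = new PrintWriter(socket.getOutputStream())) {
                out.print("GET " + path + " HTTP/1.0\n");
                // Identify the crawler and the host it is running on.
                out.print("User-Agent: MyTeamCrawler/1.0 (" + local.getHostName() + ")\n");
                // Contact address, followed by the blank line that ends the HTTP header.
                out.print("From: crawler-admin@example.edu\n\n");
                out.flush();
                // ... read the response from socket.getInputStream() ...
            }
        }
    }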
The InetAddress class allows you to get various pieces of information about an internet host (here, the computer where your program is running). We retrieve the relevant information about the local host and send the local host's name to the server as part of the HTTP request. We also send the "User-Agent" string and the email address of the person running the crawler. Note that the two new-line characters now go at the end of the email address, instead of right after the protocol HTTP/1.0 as we did before, because all of these pieces of information are part of the HTTP header. Check the example WebClient.java. If you use the URLClient class as the base for your client, there is no easy way to identify yourself; in that case you may skip the identification part.
- Identifying yourself (the crawler) is part of good robot behavior. When site administrators can see who is running the crawler and why, they are less likely to block the access or report it as an intrusion. Other good behaviors include not visiting the same site in rapid succession; rather, wait a few seconds before the next visit. In practice you may run a number of threads, with each thread visiting one site and waiting a few seconds between visits to that site. Because multiple threads are running, the overall idle time of the crawler need not be high.
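One simple way to enforce such a pause, sketched here with an assumed delay of five seconds per host, is to remember when each host was last visited and sleep until enough time has passed:

    import java.util.HashMap;
    import java.util.Map;

    public class PolitenessSketch {

        static final long DELAY_MS = 5000;   // assumed: about 5 seconds between visits to one host
        static final Map<String, Long> lastVisit = new HashMap<>();

        // Call this just before fetching a URL from the given host.
        static void waitPolitely(String host) throws InterruptedException {
            long sleepFor;
            synchronized (lastVisit) {
                long now = System.currentTimeMillis();
                Long last = lastVisit.get(host);
                sleepFor = (last == null) ? 0 : Math.max(0, last + DELAY_MS - now);
                lastVisit.put(host, now + sleepFor);  // reserve the time of this visit
            }
            if (sleepFor > 0) {
                Thread.sleep(sleepFor);   // sleep outside the lock so other threads are not held up
            }
        }
    }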
- If you decide to save the downloaded pages to disk files for further processing, make sure you use synchronized thread behavior to avoid inconsistencies in the file status (see the first sketch below).
- Complete partial URLs. Many web pages contain partial URLs, that is, URLs relative to the current path or to the base URL specified in the page header. For example, if your program is currently visiting some.edu/x/y/z/test.html and the crawler encounters URLs within a page in the form ../../home/page.html or mypage.html, the crawler needs to expand them to full URLs so that it can access them later. You may need to keep track of the current host and path for this purpose: in this example the current host is some.edu and the current path is /x/y/z/, so the two URLs above expand to http://some.edu/x/home/page.html and http://some.edu/x/y/z/mypage.html, respectively. See the program example UrlUtil.java for the handling of these cases (a sketch of the idea also appears below).
- During the crawl your program has to keep track of the pages that have already been visited in order to avoid infinite loops. A Hashtable might be a reasonable choice for keeping the visited list; you can also devise your own data structure to maintain it.
- Test your crawler with some small sites first. Do not crawl off-campus sites until you are fairly confident that your crawler is working properly.
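For the file-saving point above, a minimal sketch of synchronized writing, assuming several crawler threads save downloaded pages through a single shared method:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class SaveSketch {

        private static final Object FILE_LOCK = new Object();

        // Multiple crawler threads can call this; the lock ensures that
        // two threads never write a file at the same time.
        static void savePage(String fileName, String contents) throws IOException {
            synchronized (FILE_LOCK) {
                try (PrintWriter out = new PrintWriter(new FileWriter(fileName))) {
                    out.print(contents);
                }
            }
        }
    }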
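For the URL-completion point, the expansion is essentially what java.net.URI.resolve already does. The UrlUtil.java example referenced above handles these cases for the project, so treat the following only as an illustration of the idea:

    import java.net.URI;

    public class ExpandSketch {

        // Expand a possibly relative URL against the page it was found on.
        static String expand(String baseUrl, String foundUrl) {
            return URI.create(baseUrl).resolve(foundUrl).toString();
        }

        public static void main(String[] args) {
            String base = "http://some.edu/x/y/z/test.html";
            // Prints http://some.edu/x/home/page.html
            System.out.println(expand(base, "../../home/page.html"));
            // Prints http://some.edu/x/y/z/mypage.html
            System.out.println(expand(base, "mypage.html"));
        }
    }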
- You may use the example WebClient.java or URLClient.java as a starting point for a crawler.
Testing Your Crawler
Test your programs on a small website first. For example, you can test your crawler against the small website you built in the first phase of this project. Our goal is to crawl the English web site of Southeast University.
If all components work correctly, you should be able to combine this phase with the second phase to provide a query/answer system. In the second phase of the project, your program was able to answer queries and return the names of the documents containing the query term along with the term frequency count. If you feed the crawled web pages into that program, it should be able to return the URLs of the web pages that contain the query term.
What to Hand In
Your team needs to hand in the following in the order given.
- A team report for phase three of the project with a cover page. The report shouldn't be too long, maybe two to three pages. The report should include the following as a minimum.
- The name of your team (it should be the same as the name of your search engine);
- The name of each team member;
- A description of the roles of each team member and the contributions of each member;
- A summary of the working process, e.g., what the team started with, what the team has accomplished, any problems encountered, how the team solved them, any thoughts on the project, among others.
- Team meeting minutes during this phase of the work.
- Source code for the programs and any other supporting documents.
- Snapshots of sample runs. You can use copy-and-paste to save the output of the program runs to a text file.
- Email the instructor a copy of the complete source code and sample runs in zip or tar format.