Project 1: Search Engine - A Client Server Program Pair
Extra Credit Work
This section lists a few pieces of extra-credit work. Students who have the time and interest are encouraged to work on them. The pieces are more or less independent of each other, so you can pick any one you'd like to tackle; if you have time, you are encouraged to do more than one. The pieces are not all of the same difficulty. If you are not sure where to start, please come and discuss them with the instructor.
- Implementing the tf-idf ranking algorithm: In the required project, the relevant URLs are returned in no particular order. While many factors can affect the ordering of URLs in a real search engine, a basic component of ranking is tf-idf, or term frequency and inverse document frequency. For a detailed discussion of what it means and how to implement it, please refer to my project site for the Summer Web Information Retrieval course, in particular the discussion of its ranking component.
- Implementing a threaded server and a threaded crawler: The default requirement for the server and the client is a single-process program. You can add threads to the server, as well as to the crawler. If you add threads to the crawler, be extremely careful not to let the crawler threads run wild. Keep the number of threads small, e.g., two or three, and make sure you are able to stop the threads as needed.
- Implementing a stemming algorithm: In our project, words with the same root are treated as different words. For example, computers, computing, computation, and computational are all different words in our basic system. In fact these words come from the same root and should be considered one word. The process of reducing a word to its root is called stemming. Why do we prefer stemming? Consider the example above: if we don't stem, those four words with the same root take four nodes in the inverted index; if we stem them, one index node suffices! To implement a stemming algorithm and incorporate it into the search engine system, please refer to my project site for the Summer Web Information Retrieval course, in particular the discussion of its stemming part.
- Implementing the basic PageRank algorithm: The PageRank algorithm is what set Google apart from many other search engines at the time. Read the original Google paper and their original PageRank algorithm. (The original source is at http://ilpubs.stanford.edu/422/, but it seems difficult to access.)
- HTTPS in C: HTTPS is also known as HTTP over TLS (Transport Layer Security), HTTP over SSL (Secure Sockets Layer), or HTTP Secure. It is a secure HTTP protocol in the sense that all traffic between an HTTP server and a client is encrypted, and thus secure. More recent languages such as Java and Python provide libraries for using HTTPS. In C, I haven't found one so far. If you can either implement it yourself, or find one somewhere and use it appropriately, it should be an interesting project.
- Parsing and indexing Chinese (or any other non-English language): While a large portion of the web is in English, working with other languages is an important issue in networking. The basic ideas of working with Chinese are very much the same as with English. You'd have the same software architecture and code for the most part, except for parsing, indexing, and the user interface. Given a Chinese website, your program will read it as a regular web page, but parse it using a Chinese encoding, in practice Unicode, a coding scheme that works for all languages. Your task is to implement this part of the search and crawling process.
- Create and maintain a server log file: All production web servers keep a log file to record a set of basic information. Search engines such as Google use it to keep track of search trends; commercial web sites such as Amazon use it to identify customer behavior. You can do the same, keeping such information as the search query, the user's IP address (or host name), the time the query was issued, and the response code the server sends back. You are asked to follow the Common Log Format; see the Wikipedia page on it for more details. To get the host information of the client, please read the Linux manual page for getpeername().