The second phase of the project asks you to create a web server (a simple search engine) that can answer user queries about the web pages your program has visited and indexed. The actual data (web pages) can be collected manually by using the web client program from the first phase. The types of queries your search engine needs to handle are very simple. For example, if the query is network, your server should return a list of web pages that contain the word network. A typical user interaction would look as follows (here we only list text; you should be able to do it in a browser).
The following sections discuss some of the technical details of how we can implement such a simple search engine.
A web server is essentially a TCP-based server program that uses HTTP as its application-layer protocol. Take a look at the echo-server.c program below. (You can certainly study a slightly more complicated and more complete set of programs in web-server.c.)
In this program, the server waits for client connection requests at an agreed-upon port. Once a connection is accepted, the server reads a string from the client and sends it right back to the client with an inserted phrase "Echo --> ". From the client's point of view, after a connection is accepted by the server, the client sends a message to the server, reads the message sent back by the server, and prints it on the screen. This is the application protocol for this particular echo service!
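The server side of this echo exchange can be sketched as follows. This is a minimal illustration, not the course's echo-server.c: the helper names open_listener and echo_once, the 1024-byte buffer, and the backlog of 5 are assumptions made for the example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define ECHO_PREFIX "Echo --> "

/* Hypothetical helper: create a TCP socket listening on the given port.
 * Returns the listening descriptor, or -1 on failure. */
int open_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int yes = 1;    /* allow quick restarts on the same port */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 5) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: read one message from a connected socket and
 * send it back with the prefix inserted. Returns bytes sent, or -1. */
ssize_t echo_once(int connfd)
{
    char in[1024];
    char out[sizeof(in) + sizeof(ECHO_PREFIX)];

    ssize_t n = read(connfd, in, sizeof(in) - 1);
    if (n <= 0)
        return -1;              /* client closed or read failed */
    in[n] = '\0';

    int len = snprintf(out, sizeof(out), "%s%s", ECHO_PREFIX, in);
    return write(connfd, out, (size_t)len);
}
```

A main loop would call open_listener() once, then repeatedly accept() a connection, pass the returned descriptor to echo_once(), and close() it.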
For a search engine, the interaction between a client and a server follows the HTTP protocol, which is slightly more complicated than a service such as echo. To understand how HTTP works, first let's do the following experiment.
In your echo-server.c program, after reading the input from a client, instead of echoing the message back to the client, your echo server program prints what is read from the client on the screen, and then sends back the following message to the client.
Note that your server now sends back an HTTP response code first (HTTP/1.0 200 OK), followed by two carriage-return/newline pairs. The second pair forms the blank line that separates the response header from the body. The HTTP response code is then followed by a simple but complete web page.
Now start a web browser. Assuming your server program is running on a lab machine, e.g., dana132-lnx-4, enter your echo server program's URL in the browser's address bar, e.g., http://dana132-lnx-4:2500, where "2500" is the port number on which your echo server is running. Observe the behavior of the server.
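As a rough sketch, such a response could be assembled like this. The page content and the function name build_response are assumptions made for illustration; the course's actual message may differ.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical page content, used only for this example. */
static const char PAGE[] =
    "<html><head><title>Test page</title></head>"
    "<body><h1>Hello from the echo server</h1></body></html>\r\n";

/* Build the response code, the blank line, and the page into out.
 * Returns the number of bytes to send, or -1 if out is too small. */
int build_response(char *out, size_t outsz)
{
    int n = snprintf(out, outsz, "HTTP/1.0 200 OK\r\n\r\n%s", PAGE);
    return (n < 0 || (size_t)n >= outsz) ? -1 : n;
}
```

The server would then pass the buffer and the returned length to write() on the connected socket in place of the echoed message.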
When your server program prints what it reads from the client (a browser), you should see something similar to the following on your screen. We will explore the meaning of this request later.
You should see the browser display the content sent back by the server.
The above is the simplest scenario of interaction between a web client and a web server. How does a client request a specific web page from a web server? In the same directory where your simple web server resides, create a simple web page with the following content (you can certainly make a more elaborate web page if you like), and save it as simple.html or a name of your choice.
Set the access permission of simple.html as readable by the world. If you are not familiar with how to set permissions, please read the manual page on the command chmod. For what we need, you can simply do
chmod 644 simple.html
which sets the file readable by all and writable by the owner (you) only.
Revise your echo-server.c program with the following steps.
First, change the write() statement in the echo-server.c program so that it sends back only the HTTP response code ("HTTP/1.0 200 OK"), not the in-line web page code. Remember that the HTTP response code must be followed by a blank line (a newline on a line by itself).
Second, open simple.html, read its contents, and send them to the client with the write() system call. Your web browser should then display simple.html, and you should be able to click the hypertext link within that page.
Now let's read the original client request. The client (a web browser) sends a request to the web server when the browser connects to the server. The command GET / HTTP/1.1 indicates that the browser wants to read the root HTML page at the server. If a browser requests a specific file, e.g., simple.html, the GET command looks as follows.
GET /simple.html HTTP/1.1
Confirm this by changing the URL that the browser uses to access the web page to
http://host-name:2500/simple.html
Load the web request again (refresh the browser). You should still see the content of simple.html displayed on the browser screen. In addition, you should see on the server side that the parameter of the GET command has changed from the root "/" to "/simple.html". Other pieces of a typical browser request remain the same, including the host name of the server, the agent name (the browser), the types of content the browser can handle (e.g., text/html or application/xml), the accepted language, and the accepted encoding mechanism, among others.
Now that we know how a web client (e.g., a browser) interacts with a web server, we can turn our attention to how to make a web server a search engine. In particular, we look at how a web client such as a browser sends a query to a search engine and how the search engine sends the search results back to the client.
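The request handling described above can be sketched in C as follows: pull the path out of the GET line, then send the response code, a blank line, and the file contents. The function names parse_get_path, write_all, and send_file are hypothetical, and error handling is kept to a minimum.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Copy the path from a request line such as "GET /simple.html HTTP/1.1"
 * into path. Returns 0 on success, -1 if this is not a usable GET line. */
int parse_get_path(const char *request, char *path, size_t pathsz)
{
    if (strncmp(request, "GET ", 4) != 0)
        return -1;
    const char *start = request + 4;
    size_t len = strcspn(start, " \r\n");   /* path ends at the next space */
    if (len == 0 || len >= pathsz)
        return -1;
    memcpy(path, start, len);
    path[len] = '\0';
    return 0;
}

/* Write all n bytes, retrying after short writes. */
static int write_all(int fd, const char *buf, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, buf, n);
        if (w <= 0)
            return -1;
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Send the response code, a blank line, then the contents of filename. */
int send_file(int connfd, const char *filename)
{
    static const char status[] = "HTTP/1.0 200 OK\r\n\r\n";
    char buf[4096];
    ssize_t n = 0;

    int fd = open(filename, O_RDONLY);
    if (fd < 0)
        return -1;      /* a fuller server would report an error page */

    int rc = write_all(connfd, status, sizeof(status) - 1);
    while (rc == 0 && (n = read(fd, buf, sizeof(buf))) > 0)
        rc = write_all(connfd, buf, (size_t)n);
    if (n < 0)
        rc = -1;        /* read error on the file */
    close(fd);
    return rc;
}
```

A server could map the parsed path "/simple.html" to a file name by skipping the leading "/", and fall back to a default page for "/". A server exposed beyond the lab should also reject paths containing "..".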
A web client can send a piece of information such as a query to a web server by using the HTTP POST command. Let's first concentrate on the client side to see how we can post a request to the server.
The basic mechanism for posting a query from a web client is form submission in HTML. Once again, let's change the program echo-server.c so that it accepts and processes a query from a client. Instead of sending simple.html to the client, let's have the server send back form.html, which reads as follows.