Homework 2: HTTP WebGet (25 Points)
Chris Tralie
Learning Objectives
- Implement internet communication using sockets in C
- Practice reading documentation and man pages for Unix system calls
- Handle command line arguments and file descriptors in C
- Implement the HTTP GET protocol
Description / Overview
The wget command can be used to download information from URLs using HTTP over TCP connections. In this assignment, students will implement a program called mywget to replicate a subset of wget functionality. In particular, mywget will be capable of downloading arbitrary file types from arbitrary URLs using the unencrypted HTTP 1.0 protocol. For example, with a working implementation of this homework, running
will download a (probably outdated!) version of my CV to a file called out.pdf in the same directory as mywget. We'll be using the HTTP 1.0 protocol as opposed to newer versions of the HTTP protocol because HTTP 1.0 terminates connections after each individual file request, which makes it easier to tell programmatically when the file has finished transmitting. This does come at the cost of the additional overhead of re-establishing a connection for each new file, but this is fine for a proof of concept in this assignment.
Getting Started / What To Submit
You can obtain the starter code for this assignment by using git:
You can build this program by typing make. You will be editing the file mywget.c. This is the only file you need to submit to Canvas when you're finished.
As an example of how to run this program, passing http://www.ctralie.com/ctralie_cv.pdf as the --url and out.pdf as the --target would grab the file ctralie_cv.pdf from www.ctralie.com and save it to a file called out.pdf.
Also, before you start coding, I'd highly recommend that you watch the video below:
Programming Tasks:
In this assignment, you'll walk through the sequence of
I have set up a command line argument parser following this scheme. This sets up required parameters --url for specifying the URL and --target for specifying the filename to which to save the result. You can also specify the --port, but the default value of 80 will work for most examples on the web. Regardless, you can access all of the values in struct myargs ret.
I have also implemented a method parseURL that separates a URL into a domain (ignoring the http://) and a path, which will make it easier for you to set up your GET request. For instance, the URL http://www.ctralie.com/ctralie_cv.pdf will split into domain www.ctralie.com and path ctralie_cv.pdf. The domain and path fields are also available to you in struct myargs ret.
Part 1: Establishing A Connection (5 Points)
This part will get you practice with the fundamentals of establishing socket connections in C.
Your Task
To begin, initiate a TCP connection with the specified domain on the specified port using the socket and connect system calls. You will need to look up the IP address of the domain using DNS via getaddrinfo, traversing the resulting linked list until you find an address that accepts a connection, or terminating gracefully if no connections can be established. Your program should not continue if any of these steps fail; instead, print informative error messages to stderr explaining where it failed.
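A minimal sketch of the lookup-and-connect loop described above. The helper name connect_to_host is hypothetical and not part of the starter code, and it assumes the port is available as a string, which is what getaddrinfo expects for its service argument (an integer port would first need converting, e.g. with snprintf).

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not in the starter code): returns a connected TCP
 * socket descriptor, or -1 after printing the reason to stderr. */
int connect_to_host(const char *domain, const char *port) {
    struct addrinfo hints;
    struct addrinfo *results = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    int status = getaddrinfo(domain, port, &hints, &results);
    if (status != 0) {
        fprintf(stderr, "getaddrinfo(%s, %s): %s\n", domain, port,
                gai_strerror(status));
        return -1;
    }

    /* Walk the linked list until one address accepts the connection. */
    int sockfd = -1;
    for (struct addrinfo *node = results; node != NULL; node = node->ai_next) {
        sockfd = socket(node->ai_family, node->ai_socktype, node->ai_protocol);
        if (sockfd == -1) {
            perror("socket");
            continue;
        }
        if (connect(sockfd, node->ai_addr, node->ai_addrlen) == 0) {
            break; /* connected */
        }
        perror("connect");
        close(sockfd);
        sockfd = -1;
    }
    freeaddrinfo(results);

    if (sockfd == -1) {
        fprintf(stderr, "Could not connect to %s on port %s\n", domain, port);
    }
    return sockfd;
}
```

A caller would stop as soon as this returns -1, since nothing later in the program can work without a connected socket.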
Part 2: Setup And Send An HTTP GET Request (8 points)
If the above steps succeeded, then you can set up an HTTP GET request to the domain.
Your Task
Implement an appropriate HTTP 1.0 GET request, following RFC 1945 for the HTTP 1.0 specification. We've gone through examples in class, but this is actually fairly simple. That said, if you need examples, you can use netcat and your browser like we did on homework 0. You may also want to review Kurose's videos in section 2.2.
In addition to the other minimum requirements, you should include a Connection: close line in the header, as this will make the next step easier.
Be sure to set up the string and print it to the command line first to make sure it's what you intended, and then send the string over the socket stream you established in the first task.
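One way the request string could be assembled is sketched here. The helper build_get_request is hypothetical, and it assumes the path has no leading slash (as in the parseURL example), so it adds one. The Host line is also an assumption on my part: RFC 1945 does not require it, but many servers that host several sites on one IP address rely on it.

```c
#include <stdio.h>

/* Hypothetical helper: writes an HTTP 1.0 GET request into buf.
 * Returns the request length, or -1 if buf is too small. */
int build_get_request(char *buf, size_t bufsize,
                      const char *domain, const char *path) {
    /* Assumption: path arrives without a leading slash, so add one here. */
    int n = snprintf(buf, bufsize,
                     "GET /%s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "Connection: close\r\n"
                     "\r\n", /* blank line ends the header */
                     path, domain);
    if (n < 0 || (size_t)n >= bufsize) {
        return -1; /* encoding error or truncated output */
    }
    return n;
}
```

Printing the result with printf("%s", buf) before sending makes it easy to confirm that every line ends in \r\n and that the header finishes with a blank line.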
NOTE: As per the documentation, there is no guarantee that a single call to send will send all of the data you intended to. So for full credit on this task, you should examine the return value from send to see how many bytes actually went out, and continue sending in a loop until all bytes have gone out.
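The loop described in the note might look like the following sketch. The name send_all is hypothetical, and the sketch treats any -1 from send as fatal (it does not retry after an interrupted call).

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: returns 0 once all len bytes have been sent,
 * or -1 if send reports an error. */
int send_all(int sockfd, const char *data, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        /* send may accept fewer bytes than requested, so resume from
         * wherever the previous call stopped. */
        ssize_t n = send(sockfd, data + sent, len - sent, 0);
        if (n == -1) {
            perror("send");
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}
```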
Part 3: Receive And Parse Response (12 Points)
If the send was successful, you can receive the response back from the HTTP server and find the file that you requested.
Your Task
Read back the response from the server using recv in a loop, and react as follows:
- If the server reported a 200 OK status code, then find the body and write the data to the specified target file as binary data using fwrite. Be sure not to write anything in the header by accident (including the final line break), as this will cause binary files (e.g. pdfs) to be corrupt.
- If the server reported any other status code, print this status code to the terminal and terminate the connection.
- When you go to write the body to a file, do not use strlen to figure out the length of the body! This will not work for bodies that are binary, like pdf files, as a null byte (which strlen treats as the end of the string) could show up anywhere in the body! Instead, you should be able to figure out how long the body is by keeping track of the number of bytes received.
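The parsing steps in the list above can be sketched as follows, assuming the whole response is already in memory along with its byte count. All three helper names are hypothetical. The body length is computed from the total byte count and the header length, never from strlen.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the status code from a first line shaped
 * like "HTTP/1.0 200 OK", or -1 if the line does not have that shape. */
int parse_status_code(const char *response, size_t len) {
    char line[32];
    size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;
    memcpy(line, response, n); /* response may not be null-terminated */
    line[n] = '\0';
    int code = -1;
    if (sscanf(line, "HTTP/%*d.%*d %d", &code) != 1) {
        return -1;
    }
    return code;
}

/* Hypothetical helper: returns the offset of the first body byte, i.e. just
 * past the blank line that ends the header, or -1 if there is none. */
long find_body_offset(const char *response, size_t len) {
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(response + i, "\r\n\r\n", 4) == 0) {
            return (long)(i + 4);
        }
    }
    return -1;
}

/* Hypothetical helper: writes only the body bytes to target in binary mode.
 * Returns 0 on success, -1 on failure. */
int write_body(const char *response, size_t len, const char *target) {
    long offset = find_body_offset(response, len);
    if (offset < 0) {
        fprintf(stderr, "No end of header found in response\n");
        return -1;
    }
    FILE *fp = fopen(target, "wb");
    if (fp == NULL) {
        perror("fopen");
        return -1;
    }
    size_t body_len = len - (size_t)offset; /* from byte counts, not strlen */
    size_t written = fwrite(response + offset, 1, body_len, fp);
    if (fclose(fp) != 0 || written != body_len) {
        fprintf(stderr, "Could not write all %zu body bytes\n", body_len);
        return -1;
    }
    return 0;
}
```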
HINT: Since we asked the server to close the connection in the last step, there is a special return code that comes back from recv that you can check to know that you're finished. Check the man pages to see how this works.
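As a rough sketch of such a receive loop: on a stream socket, recv returns 0 once the peer has performed an orderly shutdown, and -1 on error. The helper recv_all is hypothetical, and it grows a plain buffer with realloc purely as a stand-in for whatever container you actually accumulate into.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: reads until the peer closes the connection.
 * On success returns a malloc'd buffer (caller frees) and stores the number
 * of bytes received in *out_len. Returns NULL on error. */
char *recv_all(int sockfd, size_t *out_len) {
    size_t capacity = 4096;
    size_t length = 0;
    char *data = malloc(capacity);
    if (data == NULL) {
        perror("malloc");
        return NULL;
    }
    for (;;) {
        if (length == capacity) { /* buffer full: double it */
            capacity *= 2;
            char *bigger = realloc(data, capacity);
            if (bigger == NULL) {
                perror("realloc");
                free(data);
                return NULL;
            }
            data = bigger;
        }
        ssize_t n = recv(sockfd, data + length, capacity - length, 0);
        if (n == -1) {
            perror("recv");
            free(data);
            return NULL;
        }
        if (n == 0) {
            break; /* peer closed the connection: response is complete */
        }
        length += (size_t)n; /* running total gives the response size */
    }
    *out_len = length;
    return data;
}
```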
HINT 2: It will be much easier if you use my provided ArrayList to accumulate the entire HTTP response before you try to parse it. With the whole response in one place, you can find the body and write it all to a file in one chunk. Basic usage is as follows: