Homework 2: HTTP WebGet (25 Points)

Chris Tralie

Learning Objectives

  • Implement internet communication using sockets in C
  • Practice reading documentation and man pages for Unix system calls
  • Handle command line arguments and file descriptors in C
  • Implement the HTTP GET protocol

Description / Overview

The wget command can be used to download information from URLs using HTTP over TCP connections. In this assignment, students will implement a program called mywget to replicate a subset of wget functionality. In particular, mywget will be capable of downloading arbitrary file types from arbitrary URLs using the unencrypted HTTP 1.0 protocol. For example, with a working implementation of this homework, running

will download a (probably outdated!) version of my CV to a file called out.pdf in the same directory as mywget. We'll be using HTTP 1.0 as opposed to newer versions of HTTP because HTTP 1.0 terminates the connection after each individual file request, which makes it easier to tell programmatically when the file has finished transmitting. This does come at the cost of re-establishing a connection for each new file, but that is fine for a proof of concept in this assignment.

Getting Started / What To Submit

You can obtain the starter code for this assignment by using git:

You can build this program by typing make. You will be editing the file mywget.c. This is the only file you need to submit to Canvas when you're finished.

As an example of how to run this program, running mywget with --url http://www.ctralie.com/ctralie_cv.pdf and --target out.pdf would grab the file ctralie_cv.pdf from www.ctralie.com and save it to a file called out.pdf.

Also, before you start coding, I'd highly recommend that you watch the video below:


Programming Tasks:

In this assignment, you'll walk through the following sequence:

  1. Establishing a connection
  2. Setting up and sending an HTTP GET request
  3. Receiving and parsing the response

I have set up a command line argument parser following this scheme. It sets up the required parameters --url, which specifies the URL, and --target, which specifies the filename to which to save the result. You can also specify the --port, but the default value of 80 will work for most examples on the web. Regardless, you can access all of the values in struct myargs ret.

I have also implemented a method parseURL that separates a URL into a domain (ignoring the http://) and a path, which will make it easier for you to set up your GET request. For instance, the URL http://www.ctralie.com/ctralie_cv.pdf will split into the domain www.ctralie.com and the path ctralie_cv.pdf. The domain and path fields are also available to you in struct myargs ret.

Part 1: Establishing A Connection (5 Points)

This part will give you practice with the fundamentals of establishing socket connections in C.

Your Task

To begin, initiate a TCP connection with the specified domain on the specified port using the socket and connect system calls. You will need to look up the IP address of the domain using DNS via getaddrinfo, traversing the resulting linked list until you find a connection that works, or terminating gracefully if no connections can be established. Your program should not continue if any of these steps fail; instead, print informative error messages to stderr explaining where it failed.
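The lookup-then-connect loop described above can be sketched roughly as follows. This is only one possible shape, not the reference solution: the helper name connect_to_host is made up for illustration, and it assumes the port is available as a string (getaddrinfo takes the service as a string, so a numeric port from the argument parser would need converting first).

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper (not part of the starter code): resolve host via
 * getaddrinfo and return a connected TCP socket, or -1 if nothing worked. */
int connect_to_host(const char* host, const char* port) {
    struct addrinfo hints;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* Accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    struct addrinfo* results = NULL;
    int status = getaddrinfo(host, port, &hints, &results);
    if (status != 0) {
        fprintf(stderr, "getaddrinfo failed for %s: %s\n", host, gai_strerror(status));
        return -1;
    }

    /* Walk the linked list until one address accepts the connection */
    int sockfd = -1;
    for (struct addrinfo* p = results; p != NULL; p = p->ai_next) {
        sockfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (sockfd == -1) {
            continue; /* Could not even create a socket for this entry */
        }
        if (connect(sockfd, p->ai_addr, p->ai_addrlen) == 0) {
            break; /* Connected */
        }
        close(sockfd);
        sockfd = -1;
    }
    freeaddrinfo(results);

    if (sockfd == -1) {
        fprintf(stderr, "Could not connect to %s on port %s\n", host, port);
    }
    return sockfd;
}
```

A caller would check for -1 and exit before attempting to send anything, and would close the returned descriptor when finished.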

Part 2: Setup And Send An HTTP GET Request (8 points)

If the above steps succeeded, then you can set up an HTTP GET request to the domain.

Your Task

Implement an appropriate HTTP 1.0 GET request, following RFC 1945 for the HTTP 1.0 specification. We've gone through examples in class, and the request itself is actually fairly simple. That said, if you need examples, you can use netcat and your browser like we did on homework 0. You may also want to review Kurose's videos in section 2.2.

In addition to the other minimum requirements, you should include a Connection: close line in the header, as this will make the next step easier.

Be sure to set up the string first and print it to the terminal to make sure it's what you intended; then send the string over the socket you established in the first task.
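As a rough illustration of what such a request string might look like, here is a minimal sketch. The helper name build_get_request is hypothetical, and it assumes the path has no leading slash (as in the ctralie_cv.pdf example above), so the slash is added here. The Host line is not defined by RFC 1945, but many servers that host several sites on one address expect it, so it is included as an assumption.

```c
#include <stdio.h>

/* Hypothetical helper: format an HTTP/1.0 GET request into buf.
 * Returns the length of the request, or -1 if it did not fit in buf. */
int build_get_request(char* buf, size_t size, const char* domain, const char* path) {
    int n = snprintf(buf, size,
                     "GET /%s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "Connection: close\r\n"
                     "\r\n", /* A blank line ends the header */
                     path, domain);
    if (n < 0 || (size_t)n >= size) {
        return -1; /* Formatting error or truncated output */
    }
    return n;
}
```

Printing buf right after this call is a quick way to check the line breaks before anything goes over the socket.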

NOTE: As per the documentation, there is no guarantee that a single call to send will send all of the data you intended to. So for full credit on this task, you should examine the return value from send to see how many bytes actually went out, and continue sending in a loop until all bytes have gone out.
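The partial-send loop described in this note could look something like the sketch below. The helper name send_all is hypothetical, and the sketch does no special handling of interrupted calls; it simply stops on the first error.

```c
#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: keep calling send until all len bytes have gone out.
 * Returns 0 on success, or -1 if send reports an error. */
int send_all(int sockfd, const char* data, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        /* send may accept fewer bytes than requested, so resume after them */
        ssize_t n = send(sockfd, data + sent, len - sent, 0);
        if (n == -1) {
            perror("send");
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}
```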

Part 3: Receive And Parse Response (12 Points)

If the send was successful, you can receive the response back from the HTTP server and find the file that you requested.

Your Task

Read back the response from the server using recv in a loop, and react as follows:

  • If the server reported a 200 OK status code, then find the body and write the data to the specified target file as binary data using fwrite. Be sure not to write anything in the header by accident (including the final line break), as this will cause binary files (e.g. PDFs) to be corrupt.
  • If the server reported any other status code, print this status code to the terminal and terminate the connection.
  • When you go to write the body to a file, do not use strlen to figure out the length of the body! This will not work for bodies that are binary, like PDF files, as a null byte (which strlen treats as the end of the string) could show up anywhere in the body! Instead, you should be able to figure out how long the body is by keeping track of the number of bytes received.

Regardless of the outcome, be sure to close all file descriptors and free all dynamic memory before the program closes.
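To make the header/body split concrete, here is a small sketch that assumes the entire response already sits in one buffer whose length in bytes is known. The helper names are hypothetical. Both functions work from the byte count rather than strlen, so a null byte inside a binary body does not cut anything short.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the offset of the first body byte, which is
 * just past the blank line ("\r\n\r\n") ending the header, or -1 if absent. */
long find_body_offset(const char* resp, size_t len) {
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(resp + i, "\r\n\r\n", 4) == 0) {
            return (long)(i + 4);
        }
    }
    return -1;
}

/* Hypothetical helper: pull the 3-digit code out of a status line such as
 * "HTTP/1.0 200 OK". Returns -1 if the line is malformed. */
int parse_status_code(const char* resp, size_t len) {
    size_t i = 0;
    while (i < len && resp[i] != ' ') {
        i++; /* Skip the "HTTP/1.x" token */
    }
    if (i + 4 > len) {
        return -1; /* No room for a space plus three digits */
    }
    int code = 0;
    for (size_t k = 1; k <= 3; k++) {
        char c = resp[i + k];
        if (c < '0' || c > '9') {
            return -1;
        }
        code = code * 10 + (c - '0');
    }
    return code;
}

/* Hypothetical helper: write only the body to path in binary mode.
 * Returns 0 on success, -1 on failure. */
int write_body(const char* path, const char* resp, size_t len) {
    long off = find_body_offset(resp, len);
    if (off < 0) {
        return -1;
    }
    FILE* f = fopen(path, "wb");
    if (f == NULL) {
        perror("fopen");
        return -1;
    }
    size_t body_len = len - (size_t)off; /* From bytes received, not strlen */
    size_t written = fwrite(resp + off, 1, body_len, f);
    fclose(f);
    return written == body_len ? 0 : -1;
}
```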

HINT: Since we asked the server to close the connection in the last step, there is a special return code that comes back from recv that you can check to know that you're finished. Check the man pages to see how this works.
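For reference, the man page for recv states that a return value of 0 on a stream socket means the peer has performed an orderly shutdown. A receive loop built on that might look like the sketch below. The helper name recv_all is hypothetical, and the growing malloc/realloc buffer is only a stand-in for illustration; it is not the provided ArrayList, whose interface is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: read from sockfd until the server closes the
 * connection. Returns a malloc'd buffer (caller frees) and stores the
 * number of bytes received in *out_len, or returns NULL on error. */
char* recv_all(int sockfd, size_t* out_len) {
    size_t cap = 4096;
    size_t len = 0;
    char* buf = malloc(cap);
    if (buf == NULL) {
        return NULL;
    }
    for (;;) {
        if (len == cap) {
            /* Buffer is full, so double it before the next recv */
            char* bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        ssize_t n = recv(sockfd, buf + len, cap - len, 0);
        if (n == 0) {
            break; /* Orderly shutdown: nothing more is coming */
        }
        if (n == -1) {
            perror("recv");
            free(buf);
            return NULL;
        }
        len += (size_t)n;
    }
    *out_len = len;
    return buf;
}
```

The value stored in *out_len is the total response length, which is what the body length should be computed from later.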

HINT 2: It will be much easier if you use my provided ArrayList to accumulate the entire HTTP response before you try to parse it and find the body. This will make it much easier to find the body and write it all to a file in one chunk. Basic usage is as follows:
