UC Berkeley Home IP Web Traces
Main Author: | Steven D. Gribble |
---|---|
Format: | info dataset eJournal |
Language: | eng |
Published: | 1997 |
Subjects: | |
Online Access: | https://zenodo.org/record/4020425 |
Table of Contents:
Description

This dataset consists of 18 days' worth of HTTP traces gathered from the Home IP service offered by UC Berkeley to its students, faculty, and staff. Home IP provides dial-up PPP/SLIP IP connectivity using 2.4 kb/s, 9.6 kb/s, 14.4 kb/s, or 28.8 kb/s wireline modems, or Metricom Ricochet (approximately 20-30 kb/s) wireless modems. These client traces were unobtrusively gathered through the use of a packet sniffing machine placed at the head-end of the Home IP modem bank; the tracing program used was a custom module written on top of the Internet Protocol Scanning Engine (IPSE) created by Ian Goldberg. Only traffic destined for port 80 was traced; all non-HTTP protocols and HTTP connections to other ports were excluded from these traces.

The traces contain a total of 9,244,728 references spanning the period from Friday, November 1st, 1996 at 15:18:59 PST through Tuesday, November 19th, 1996 at 05:52:03 PST. 8,377 unique clients were seen in the traces. Each trace entry contains the following information:

- the time at which the client made the request
- the time at which the first byte of the server response was seen
- the time at which the last byte of the server response was seen
- the client IP address (suitably anonymized)
- the client port
- the server IP address (suitably anonymized)
- the server port (always 80 for these traces)
- the presence of the no-cache, keep-alive, cache-control, if-modified-since, and unless client headers
- the presence of the no-cache, cache-control, expires, and last-modified server headers
- the values of the client if-modified-since, the server expires, and the server last-modified headers, if present
- the length of the response HTTP header
- the length of the response data
- the request URL (suitably anonymized)

Format

For the sake of storage efficiency, the (gzipped) traces are stored in a binary representation.
This archive of tools includes the following code to parse and manipulate the traces:

- showtrace: this program prints a human-readable ASCII representation of what is in the traces. To use, type:
    gzcat <tracefile> | showtrace
  Take a look at the source file showtrace.c to see how you can use logparse.[ch] to write code that parses and manipulates the traces. All times displayed are as reported by the gettimeofday() system call.
- anon_clients: this is the program that we used to anonymize the traces. I include this program under the principle that the anonymization used is strong enough that distributing the anonymization code cannot help anybody break the anonymization.
- timeconvert: a program that accepts a calendar time (i.e. time in seconds since the Epoch, as reported by showtrace and as used in the trace filenames) and outputs the local time corresponding to that calendar time.

The showtrace tool displays lines in the following format:

    848278028:829593 848278028:893670 848278028:895350 23.240.8.98:1462 207.36.205.194:80 2 8 4294967295 4294967295 835418853 170 844 37 GET 9168504434183313441..gif HTTP/1.0

where:

- 848278028:829593 is the time at which the client made the request
- 848278028:893670 is the time at which the first byte of the server response was seen
- 848278028:895350 is the time at which the last byte of the server response was seen
- 23.240.8.98:1462 is the anonymized client IP address and the client port number
- 207.36.205.194:80 is the anonymized server IP address and the server port number
- 2 is the decimal representation of the client headers bitfield
- 8 is the decimal representation of the server headers bitfield
- the first 4294967295 is the if-modified-since client header value (note that 4294967295 is 0xFFFFFFFF, which means this header value was not present for this entry)
- the second 4294967295 is the expires server header value (again not present)
- 835418853 is the last-modified server header value
- 170 is the length of the HTTP response header
- 844 is the length of the response data
- 37 is the length of the anonymized request URL
- "GET 9168504434183313441..gif HTTP/1.0" is the anonymized request URL

The interpretation of the client and server header bitfields is as defined in the logparse.h header in the tools code.

The tools code has been tested on both Linux and Solaris. The provided Makefile assumes Solaris; you may have to adjust the LIBS definition for other platforms. HPUX is a mess; I didn't even try, but it should be possible to get these tools to work with little effort. If you do, please let me know what you did so that I can make your changes available to the world.

Measurement

The Home IP population gains IP connectivity using PPP or SLIP across their 2.4 kb/s, 9.6 kb/s, 14.4 kb/s, or 28.8 kb/s wireline modem, or their (approximately) 20-30 kb/s wireless Metricom Ricochet modem. There are a total of roughly 600 modems available via the Home IP bank. All traffic from these modems feeds over a single 10 Mb/s shared Ethernet segment, on which we placed a network monitoring computer (a Pentium Pro 200 MHz running Linux 2.0.27). The monitor was running the IPSE user-level packet scanning engine and a custom-written HTTP module that reconstructed HTTP connections from the gathered IP packets on-the-fly and emitted an unanonymized trace file. Each trace file was then anonymized and transmitted to our research workstations for further postprocessing and analysis.

The trace gathering engine was brought down and restarted approximately every 4 hours (for administrative and address-space-growth reasons). This implies that there are two weaknesses in these traces that you should be aware of:

- Any connection active when the engine was brought down will have a possibly incorrect timestamp for the last byte seen from the server, and a possibly incorrect reported size. We estimate that no more than 150 such entries (out of roughly 90,000-100,000) are misreported for each 4 hour period.
- Any connection that was established in the very small time window (about 300 milliseconds) between when the engine was shut down and restarted will not appear in the logs. We estimate that no more than 30 such drops occur for each 4 hour period.

The packet capture tool reported no packet drops. Considering that a Pentium Pro 200 MHz was used to capture the traces on a 10 Mb/s Ethernet segment, it is virtually certain that no trace drops besides those mentioned above occurred. There may be periods of uncharacteristically low activity in the traces; these correspond to network outages from Berkeley's ISP, rather than trace failures.

The traces do contain entries for requests that were issued by the client but not completed (because, for instance, the user pressed the STOP button and the TCP connection was shut down before the request completed). Unknown timestamps in the traces contain the value 0xFFFFFFFF (reported by showtrace as 4294967295), and incomplete requests contain header and data length values that report as much header/data as was seen.

The trace data is sorted by completion time (i.e. the time at which the last byte of the server response was seen, or the time at which the connection was dropped). However, because of inaccuracies and apparent time travel in the Linux system clock, some trace entries appear slightly out of order. All timestamps within the traces are as reported by the gettimeofday() system call, so these timestamps ostensibly have microsecond resolution.

Privacy

To maintain the privacy of each individual Home IP user, we have stripped identity information out of the traces through a post-processing phase. Because it is trivial to identify a user based solely on the pages that the user has visited, we were forced to anonymize the URL and destination IP address of each web request as well as the source IP address. All anonymization was done using a keyed MD5 hash of the data (32 bits for client and server IP addresses, 64 bits for URLs).
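
To make the idea of a keyed, truncated MD5 hash concrete, here is a minimal Python sketch. It is not the anon_clients implementation: the way the key is combined with the data (simple prefixing), the choice of which digest bytes to keep, the dotted-quad and integer renderings, and every function name are assumptions made purely for illustration.

```python
import hashlib
import ipaddress


def keyed_md5(key: bytes, data: bytes, nbytes: int) -> bytes:
    """Return the first nbytes of MD5(key || data).

    Assumption: prefixing the key and keeping the leading digest bytes are
    illustrative choices; the description only says "keyed MD5 hash".
    """
    return hashlib.md5(key + data).digest()[:nbytes]


def anonymize_ip(key: bytes, ip: str) -> str:
    """Map an IPv4 address to a 32-bit pseudonym, shown as a dotted quad."""
    packed = ipaddress.IPv4Address(ip).packed
    return str(ipaddress.IPv4Address(keyed_md5(key, packed, 4)))


def anonymize_url(key: bytes, url: str) -> int:
    """Map a URL to a 64-bit pseudonym, returned as an unsigned integer."""
    return int.from_bytes(keyed_md5(key, url.encode("utf-8"), 8), "big")


if __name__ == "__main__":
    key = b"hypothetical-secret-key"  # placeholder, not the real key
    print(anonymize_ip(key, "192.0.2.1"))
    print(anonymize_url(key, "/image.gif"))
```

The same input always maps to the same pseudonym under a given key, which is what lets repeated requests for one URL or from one client be correlated within the traces without revealing the original value.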
We ourselves do not know the key used to salt the MD5 hash, so don't bother asking us for it. Similarly, don't bother asking us for unanonymized traces.

In order to preserve some information about the URLs, the post-processed URLs have the following format:

    COMMAND URLHASH.[flags][.suffix] [HTTPVERS]

where:

- COMMAND is one of GET, HEAD, POST, or PUT,
- URLHASH is the string representation of the 64-bit MD5 hash of the URL,
- flags contains the character q to indicate that a question mark was seen in the URL, and the character c to indicate that the string CGI or cgi was seen in the URL,
- suffix is the filename suffix, if present, and
- HTTPVERS is the HTTP version field of the HTTP command issued by the client, and is one of HTTP/1.0, HTTP/1.1, or the NULL string (indicating HTTP/0.9). To our knowledge, however, no HTTP/1.1 requests were observed during the tracing period.

Here are some examples of URLs contained in the traces:

- GET 8252631242092696791.q.map HTTP/1.0 - the client issued a GET request, the URL contained a question mark, the URL ended in the suffix .map, and HTTP/1.0 was used by the client. An example of a request that may generate this anonymized URL is GET /foo.map?BAR=BAZ HTTP/1.0.
- POST 36782605103285618862.c HTTP/1.0 - the client issued a POST, the URL contained the substring CGI or cgi, the URL did not end with a dotted suffix, and HTTP/1.0 was used by the client. An example of a request that may generate this anonymized URL is POST /cgi-bin/foo HTTP/1.0.
- GET 103551731373256697..gif HTTP/1.0 - the client issued a GET request, the URL contained neither the substring [CGI|cgi] nor a question mark, the filename ended with the .gif suffix, and HTTP/1.0 was used. An example of a request that may generate this anonymized URL is GET /image.gif HTTP/1.0.
- GET 41438582632480924518. HTTP/1.0 - the client issued a GET request, the URL contained neither the substring [CGI|cgi] nor a question mark, the filename didn't end with a dotted suffix, and HTTP/1.0 was used. An example of a request that may generate this anonymized URL is GET /foo HTTP/1.0.

Privacy was the foremost concern during this trace gathering experiment. UC Berkeley and the CS department consider the privacy of the student body to be paramount, and whenever we had the choice of putting more information in these published logs at the cost of sacrificing the privacy of the traced users, we have invariably chosen to maintain the users' privacy at the cost of losing this information. It is our hope that someday the web protocols and servers will become secure enough to make a tracing effort of the kind we have done impossible.

Acknowledgements

Steven D. Gribble contributed the traces to the ITA. He also maintains the official UC Berkeley page dedicated to this tracing effort. For inquiries, contact Steve Gribble at gribble [at] gmail [dot] com. These traces, documentation, and associated trace tools were created by Steve Gribble with the assistance of Armando Fox, Ian Goldberg, Eric Brewer, and Cliff Frost.
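
For readers who want to post-process showtrace output without the C tools, the following Python sketch parses one output line and the anonymized URL format described above. It is an illustration, not part of the distributed tools: the names Entry, parse_line, parse_url, and to_utc are hypothetical, the regular expression is one reading of the COMMAND URLHASH.[flags][.suffix] [HTTPVERS] layout, and it assumes that an unknown timestamp is printed with 4294967295 in its seconds field. The URL hash is kept as a string because some of the example hashes do not fit in a 64-bit integer.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

ABSENT = 0xFFFFFFFF  # showtrace prints 4294967295 for unknown or missing values

# One reading of: COMMAND URLHASH.[flags][.suffix] [HTTPVERS]
URL_RE = re.compile(
    r"^(GET|HEAD|POST|PUT) (\d+)\.([qc]*)(?:\.(\S+))?(?: (HTTP/\d\.\d))?$"
)

SAMPLE = (
    "848278028:829593 848278028:893670 848278028:895350 "
    "23.240.8.98:1462 207.36.205.194:80 2 8 4294967295 4294967295 "
    "835418853 170 844 37 GET 9168504434183313441..gif HTTP/1.0"
)


@dataclass
class Entry:
    request_us: Optional[int]      # microseconds since the Epoch, or None
    first_byte_us: Optional[int]
    last_byte_us: Optional[int]
    client: str                    # anonymized "ip:port"
    server: str
    client_flags: int              # bitfields, see logparse.h for meanings
    server_flags: int
    if_modified_since: Optional[int]
    expires: Optional[int]
    last_modified: Optional[int]
    header_len: int
    data_len: int
    url_len: int
    url: str


def _stamp(text: str) -> Optional[int]:
    """Convert "sec:usec" to microseconds; assumes sec == ABSENT means unknown."""
    sec, usec = (int(part) for part in text.split(":"))
    return None if sec == ABSENT else sec * 1_000_000 + usec


def _opt(text: str) -> Optional[int]:
    value = int(text)
    return None if value == ABSENT else value


def parse_line(line: str) -> Entry:
    """Split a showtrace line into 13 fixed fields plus the trailing URL."""
    f = line.strip().split(None, 13)
    if len(f) != 14:
        raise ValueError("unexpected field count: %r" % line)
    return Entry(_stamp(f[0]), _stamp(f[1]), _stamp(f[2]), f[3], f[4],
                 int(f[5]), int(f[6]), _opt(f[7]), _opt(f[8]), _opt(f[9]),
                 int(f[10]), int(f[11]), int(f[12]), f[13])


def parse_url(url: str) -> dict:
    """Decode the anonymized request line into its documented components."""
    m = URL_RE.match(url.strip())
    if m is None:
        raise ValueError("unrecognized anonymized URL: %r" % url)
    command, urlhash, flags, suffix, version = m.groups()
    return {"command": command, "hash": urlhash,
            "has_query": "q" in flags, "has_cgi": "c" in flags,
            "suffix": suffix, "version": version or "HTTP/0.9"}


def to_utc(microseconds: int) -> datetime:
    """Render a trace timestamp in UTC (timeconvert prints local time instead)."""
    return datetime.fromtimestamp(microseconds // 1_000_000, tz=timezone.utc)


if __name__ == "__main__":
    entry = parse_line(SAMPLE)
    print(to_utc(entry.request_us), entry.first_byte_us - entry.request_us)
    print(parse_url(entry.url))
```

Storing timestamps as integer microseconds keeps differences such as time-to-first-byte exact, and mapping 0xFFFFFFFF to None makes absent headers and unknown times hard to mistake for real values in later analysis.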