Packet Header Collection and Very Large Databases
Packet header collection can result in a very large database (VLDB) because packets arrive on Internet wires continuously at high speed. Even when the total throughput on a wire is modest, the data accumulate steadily, so the database eventually becomes large if collection continues. We collect packet headers on MHWire1, a wire that connects a Bell Labs network of 3000 hosts to the rest of the Internet. MHWire1 carries just a trickle of traffic compared with the high-throughput wires of Internet ISPs, but after one year of collection, our databases of 328 million TCP connection flows with 6866 million TCP/IP packet headers occupied about 350 gigabytes. An Internet wire with 100 times the throughput of MHWire1 would reach the same size in about 3.6 days.
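The growth figures above follow from the rough assumption that data volume scales linearly with throughput and collection time. A minimal sketch of the arithmetic (all variable names are illustrative; only the 350-gigabyte, one-year figure comes from the text):

```python
# Back-of-the-envelope check of the database growth figures.
DAYS_PER_YEAR = 365.0
db_size_gb = 350.0          # one year of collection on MHWire1
throughput_ratio = 100.0    # hypothetical wire with 100x the traffic

# Assuming volume scales linearly with throughput and time, a 100x
# wire reaches the same 350 GB in 1/100 of the time.
days_to_same_size = DAYS_PER_YEAR / throughput_ratio
print(round(days_to_same_size, 1))  # about 3.6 days, matching the text
```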
The success of analyzing Internet traffic data depends heavily on the ability to analyze the traffic database intensively and in great detail. We need to explore the raw data in its full complexity; relying only on summaries is inadequate. We need to study packet-level processes taking many variables into account; studying only byte counts in equally spaced intervals is inadequate. Success in detailed, intensive analysis depends on the analyst's computing environment.
To cope with very large traffic databases, we developed S-Net, a traffic measurement and analysis system that begins with packet header collection on network wires and ends with data analysis on a cluster of Linux PCs running S, a language and system for organizing, visualizing, and analyzing data. Packet capture employs a PC running Berkeley Unix with a kernel altered to enhance performance, the program tcpdump, time-stamping based on GPS (Global Positioning System) clock discipline, and careful attention to packet drops. The compressed header files are moved to the cluster of Linux PCs, which are linked by fast switches. Each PC has 1, 2, or 4 processors and 300 to 2000 megabytes of memory, and all have large amounts of disk space. An algorithm then organizes the header information by TCP connection flow, and the flows are processed to create flow objects in S. Analysis is carried out in S. Flows and S flow objects are computed in parallel on all of the PC processors and are stored on the disks of all machines. S is run on the high-end PCs, which have large amounts of memory. Each analyst has a low-end PC that stores that user's S directories. The analyst logs onto a high-end machine from the home machine to run S, mounting the home S directories as well as the directories across the cluster that house the S objects. In other words, each data analysis session is distributed across the cluster.
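The step that organizes header information by TCP connection flow can be sketched as grouping packet records by their endpoint pair, with both directions of a connection mapped to the same key. This is a hedged illustration, not the actual S-Net algorithm; the record layout and the `canonical_key` helper are assumptions:

```python
from collections import defaultdict

def canonical_key(src, sport, dst, dport):
    """Order the two endpoints so that both directions of a TCP
    connection map to the same flow key."""
    a, b = (src, sport), (dst, dport)
    return (a, b) if a <= b else (b, a)

def group_flows(packets):
    """packets: iterable of (src, sport, dst, dport, timestamp) tuples.
    Returns a dict mapping flow key -> sorted list of packet timestamps."""
    flows = defaultdict(list)
    for src, sport, dst, dport, ts in packets:
        flows[canonical_key(src, sport, dst, dport)].append(ts)
    for key in flows:
        flows[key].sort()
    return dict(flows)

# Illustrative packet records (addresses and ports are made up).
packets = [
    ("10.0.0.1", 40000, "192.0.2.7", 80, 1.0),  # client -> server
    ("192.0.2.7", 80, "10.0.0.1", 40000, 1.1),  # server -> client
    ("10.0.0.2", 40001, "192.0.2.7", 80, 1.2),  # a second connection
]
flows = group_flows(packets)
print(len(flows))  # 2 distinct connection flows
```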
We have implemented S-Net-MH in the Bell Labs research facility in Murray Hill, NJ. This is a family sedan: 6 PCs (1 with 4 processors, 1 with 2 processors, and 4 with 1 processor) linked by 100 Mb/s Ethernet. We have also implemented S-Net-Helios, part of the DARPA-funded Helios project, a racing car: 5 PCs (2 with 4 processors and 3 with 2 processors), with the three duals and one quad linked by 1 Gb/s Ethernet and the two quads linked by a Lucent OLS 40G system (packet over SONET) at OC-48 (2.5 Gb/s).
S-Net has worked quite well. Because the PCs and switches can be inexpensive and Linux is free, the cluster has a low overall cost. The cluster architecture scales readily; in our case, PCs and disks have been added and replaced incrementally as our database has grown. The S flow objects vary according to the specific analysis tasks; each is designed to enhance computational performance and to make the S commands that carry out the analysis as simple as possible. S is well suited to the task of analyzing Internet traffic data; its elegant design, which won it the ACM Software System Award for 1999, allows very rapid development of new tools.
Flow Processing Tools
To convert the raw packet header files produced by tcpdump into flow summary files and S/S-PLUS objects, we developed a set of tools based on shell and Perl scripts.
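The summarization step can be illustrated by reducing each flow's packets to one summary record. This is only a sketch of the idea; the field names and record format are assumptions, not the actual output of our tools:

```python
def summarize_flow(packets):
    """packets: nonempty list of (timestamp, length_bytes) pairs for
    one TCP connection flow. Returns a per-flow summary record."""
    times = sorted(t for t, _ in packets)
    return {
        "start": times[0],                     # first packet time
        "duration": times[-1] - times[0],      # last minus first
        "npackets": len(packets),              # packet count
        "nbytes": sum(n for _, n in packets),  # total bytes on the wire
    }

# Illustrative three-packet flow (timestamps in seconds, sizes in bytes).
s = summarize_flow([(1.0, 60), (1.1, 1500), (2.5, 60)])
print(s["npackets"], s["nbytes"], s["duration"])
```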