Map-tools Online Supplement


This page shows example visualizations of network traffic data, and descriptions of the tools used to generate them, as described in the paper, Manifold Learning Visualization of Network Traffic Data, by Neal Patwari, Alfred O. Hero III, and Adam Pacholski, accepted to the ACM Workshop on Mining Network Data, August 26, 2005, Philadelphia, PA.

Map-tools code

Map-tools is a set of C-code and bash-script utilities for command-line processing of NetFlow data.  There are several tools used to process NetFlow data into sensor map visualizations.  The flow of the several tools is shown below.
Program Flow Chart

Executable Name
Description
sensorRouter, sensorPort, and sensorTime
These bash shell scripts run flow-tools and extract the desired data when the sensors are routers, ports, or time, respectively.  The measurements can be flows, octets, or packets, separated in any way that flow-stat supports.  For example, traffic can be divided by source or destination port, IP address, or autonomous system (AS).  An arbitrary filter using flow-filter or flow-nfilter can also be applied to limit, for example, the ports or IP addresses of the input traffic.  (Flow-tools was created by Mark Fullmer and information is available online.)  The output is a set of sparse data vectors in a two-column text format.
spl2dist
This C-code executable inputs the two-column sparse data vectors and outputs the distance between each pair of vectors.  When N sparse data vectors are input, an N by N matrix is output.  The data vectors can be optionally normalized, to use percent of total rather than absolute traffic numbers. Distance is calculated as L2 (Euclidean) distance.
wmds
This C-code executable inputs the N by N distance matrix and outputs low-dimensional coordinates. The number of dimensions defaults to 2, but can be set to any positive integer.  The dimension reduction is done using the weighted multi-dimensional scaling (wMDS) method, as described in the paper.  Arbitrary prior coordinates can be set, along with the weights and weighting scheme.  Neighbors can be selected via K-nearest-neighbors, with an arbitrary integer for K.
coords2eps
This C-code executable inputs N 2-dimensional coordinates and residuals e_i, and produces an EPS file which plots the sensor map.  The axis limits can be chosen automatically or set on the command line.

The code was developed in part using Will Naylor and Bill Chapman's WNLIB subroutine library, which is a free, unrestricted ANSI C subroutine library. 

Map-tools source code is currently available upon request from Neal Patwari, whose email is below.  Man pages are available:

Image Database

NetFlow data was collected from January 2 to January 29, 2005 from the 11 routers in the Abilene backbone network.  Sample visualizations are shown below.  Router names are abbreviated as follows:
ATLA
Atlanta
CHIN
Chicago North
DNVR
Denver
HSTN
Houston
IPLS
Indianapolis
KSCY
Kansas City
LOSA
Los Angeles
NYCM
New York City
SNVA
Sunnyvale
STTL
Seattle
WASH
Washington


Map-Tools Visualization
Description
Jan 05 Average Router Map
Summary of the four weeks from 02-Jan to 29-Jan:  Mean location (O) and 1-standard-deviation uncertainty ellipse (- - - -) of the router maps over that period.  Most location estimates fall within the ellipse.  Router maps are calculated every 5 minutes, with routers as sensors measuring the number of flows per source IP address.  Solid lines show actual connections in the Abilene backbone network.
Router Map 2005-01-02 at 2:40
Sunday, 02-Jan-2005 at 2:40 UTD:  This is a `typical' map.  Although there is some deviation from the mean (e.g., WASH and DNVR), the routers are placed very close to their 4-week mean.  Immediately after this time (at 2:45), a large port scan dramatically changes the router map.  Compare this map to the following map.

Legend:  The 4-week mean location (o) is connected to the current estimate (O) by a dashed red line (- - - -).  The shading of the circle is proportional to the residual value e_i: dark indicates high residual and white indicates low residual.
Router Map 2005-01-02 at 03:00
Sunday, 02-Jan-2005 at 3:00 UTD:  A port scan occurring between 2:45 and 3:30 involves two source IP addresses sending a total of about 61,000 flows per 5 minutes.  The traffic is measured only at CHIN, IPLS, DNVR, and KSCY.  The flows come from source IPs 198.59.80.0 (unknown) and 140.113.200.0 (nctu.edu.tw), from port 48775, to destination IP 140.113.200.0 (du.se, Högskolan Dalarna, Sweden).  The source AS number is zero.  Almost all of the flows are single 29-byte UDP packets sent to a wide range of destination ports.  There are also a few larger flows (100-300 kB) to ports 22, 53, 6667, and 6669.  Because of the low traffic level (it is a Sunday and the day after New Year's Day), this traffic corresponds to 40% of the total number of flows, so the map changes dramatically: the affected routers are pushed North, while all other routers are pushed far South.  CHIN, IPLS, KSCY, and DNVR are equally affected by traffic from source IP 198.59.80.0, but only CHIN is affected by traffic from 140.113.200.0 (nctu.edu.tw).  This is why CHIN is not located at exactly the same place as IPLS, KSCY, and DNVR.

Router Map 2005-01-02 at 8:10
Sunday, 02-Jan-2005 at 8:10 UTD:  There are about 13,000 flows going between two IP addresses: 129.171.184.0 (University of Miami, FL) and 64.4.16.0 (hotmail.com).  About 6,000 flows originate from the U. Miami address, from a wide variety of source ports to destination port 80 (TCP) of the hotmail.com address.  Each flow contains 1-6 (an average of 2) 40-byte packets.  The hotmail.com address replies with 1500-byte packets, from source port 80 to a wide range of destination ports.  While there are normally many flows from the hotmail.com address, this traffic accounts for about 80% of the total flows coming from that address.  The map shows the source and destination routers, ATLA and STTL, being mapped very far from their mean locations.  Routers DNVR, KSCY, and IPLS are also affected by the anomalous traffic and are grouped very close together.  HSTN traffic is usually very similar to ATLA traffic, but at this time it is very different, so HSTN is placed very far away.

Router Map 2005-01-05 at 8:05
Wednesday, 5-Jan-2005 at 08:55 UTD:  At this time, there is scheduled maintenance on the CHIN-IPLS link.  Usually, IPLS and CHIN traffic are very similar, but during the downtime, much of the traffic on Abilene reroutes through different links, such as a more Southern route through WASH and ATLA.  As a result, the router map shows a much larger distance between IPLS and CHIN, and a much flatter map, since traffic on the Southern routers is temporarily highly correlated with Northern traffic.

Router Map 2005-01-06 at 17:55
Thursday, 6-Jan-2005 at 17:55 UTD:  There is an anomaly that totals 90,000 flows at the CHIN router.  These are single, 40-byte packet flows from two source IP addresses in Taiwan to a small range of destination IP addresses in Hungary.  This volume corresponds to about 25% of the typical flow volume on CHIN.  The traffic from the two Taiwanese source IP addresses was observed on CHIN and no other router, thus distances between sensor data recorded at CHIN and other routers are unusually high, and the 2-D coordinates for CHIN must be kept very distant from all other sensors.  Also, because normalized distances are used to keep the map size reasonably constant, the rest of the distances between routers have shrunk to compensate. 

Router Map 2005-01-07 at 15:45
Friday, 7-Jan-2005 at 15:45 UTD:  There is an anomaly that totals 20,000 flows at the CHIN, NYCM, and WASH routers.  These are single, 40-byte TCP packet flows from source IP address 140.123.64.0 (ccu.edu.tw) in Taiwan to destination IP address 128.112.128.0 (princeton.edu).  There is a range of source ports (between 1024 and 2048) and a range of low destination ports (between 1 and 139).  Since the traffic was observed on CHIN, NYCM, and WASH but no other router, these three routers are moved East in the router map, while IPLS and ATLA are moved West, to keep the two groups far apart from each other.

Router Map 2005-01-12 at 20:15
Wednesday, 12-Jan-2005 at 20:15 UTD:  There is a large anomaly of 71,000 flows at the STTL, LOSA, and SNVA routers.  These flows are single, 29-byte UDP packet flows from source IP address 163.30.88.0 (possibly tyc.edu.tw) to destination IP address 134.71.24.0 (csupomona.edu, Cal Poly Pomona).  The packets are sent from source port 40150 to random destination ports.  Since the traffic was observed on LOSA, SNVA, and STTL but no other router, these routers are placed far away to the West, while the rest of the routers, due to the constraint on total distances, are placed very close together.

Router Map 2005-01-20 at 01:00
Thursday, 20-Jan-2005 at 01:00 UTD:  There is a large number (14,000) of 29-byte packets from a 129.25.0.0 (Drexel U.) source IP address sent to a 131.252.120.0 (Portland State U.) destination.  The packets are UDP, from source port 3095 or 3096 to a wide range of random destination ports >1024.  These packets travel through the WASH, NYCM, CHIN, IPLS, KSCY, DNVR, and STTL backbone routers.  The other routers (SNVA, LOSA, HSTN, and ATLA) do not see any flows from this source address at this time.  Distances between the listed Northern routers and the other, Southern routers are unusually high, and the router map shows a clear split between the two sets of routers.

Port Map 2005-01-01 at 03:35
Destination port map for 01-Jan-2005 at 3:35 UTD (O dpo#) along with past 1 hour map history (dotted line circles). Sensors are attached to the top 30 destination ports (by total flows) and measure number of flows per source IP address.
Other examples will be added to this table in the future.


Online Visualization Tool

A Java-based visualization applet is publicly available.  The applet provides both temporal and spatial views of Abilene traffic data by port.  A router map is calculated within the applet using time-filtered total packet data by port.  More details are available along with the visualization applet.
Screenshot of Visualization Applet

Questions? Comments?

Please contact Neal Patwari at npatwari (at) umich (dot) edu. More contact info.


Last Updated: August 12, 2005