Access to MLab traceroute data from Google BigQuery

M-Lab (short for Measurement Lab) is an Internet measurement platform especially focusing on performance: their goal is to enable users to diagnose Internet performance issues. To this end, they perform measurements on the Internet: for instance, they launch traceroute from their servers to some other machines (in particular, Paris Traceroute). All of their data is freely available online, through Google Cloud Storage (for raw measurements) and Google BigQuery (once measurements are parsed).

Specifically for traceroute, the measurements are stored in an awkward format: each hop is stored in a different row, mixed with all the other traceroute measurements. This is not really convenient for people wanting to exploit the actual routes (series of hops) and not more general statistics. To this end, users have to reconstruct the traceroute by themselves, by sorting the scattered hops.

(By the way, if you want to download the raw data, here are the links: from May 2013 until June 2017, from April 2017 onwards.)

The official solution is to write code by yourself to find the edges of the path one by one, starting from the server IP until the destination IP (both of which are known, of course). A major issue is that, once a node misses in the traceroute, reconstruction becomes problematic (and close to impossible once two nodes miss).

This blog post contains a sample code to reconstruct the whole series of traceroutes using BigQuery (instead of downloading the raw files, which do not have this problem, but are harder to work with). It processes in two steps: first, finding all rows that correspond to the same measurement (all rows that have the same test_id, corresponding to the name of a raw file available on Google Cloud Storage); then, parsing this set of rows to find back the traceroute. Parsing is not exactly complicated: the source and destination IPs are defined as a part of the measurement (within the connection_spec field); the first hop starts with the source IP and ends at some other IP, say IP_A; the second hop starts with the address IP_A; you then follow this path until reaching the destination IP. The following code stops reconstructing the path as soon as one host did not respond to the requests (which unfortunately happens quite often—roughly 40% of the times!).

The output format is not strictly identical to the base BigQuery table schema: it only contains two fields, the measurement identifier test_id and a JSON string describing the whole traceroute. The entries within the JSON string are copied from the table structure, except for those corresponding to the hops: the fields paris_traceroute_hop of each row of the table are merged in a list paris_traceroute_hops.

If you want to try it for yourself, first register to the M-Lab Google Group (this is required to get access to M-Lab’s BigQuery), then access the traceroute table, start a new query (Ctrl+Space), and copy the following code (it limits itself to only ten measurements, but you can remove the LIMIT clause to get the whole data set).

Enjoy!

#standardSQL
CREATE TEMPORARY FUNCTION ProcessJson(s STRING)
  RETURNS STRING LANGUAGE js AS """
    var entry = JSON.parse(s);
    var newEntry = new Object(); 
    newEntry['paris_traceroute_hops'] = []; 
    
    // Reconstruct the path from the rows. 
    var soughtSource = entry[0]['connection_spec']['server_ip']; // Gets updated at each crossed hop. 
    const soughtDestination = entry[0]['connection_spec']['client_ip']; // Never gets updated. 
    
    do {
      // Find the first (hopefully, only) hop that matches. 
      var hopId = -1; 
      for (var i = 0; i < entry.length; i += 1) {
        if (entry[i]['paris_traceroute_hop']['src_ip'] == soughtSource) {
          hopId = i;
          break;
        }
      }
      
      // If not found, exit right now: the entry makes little sense. Mark it as such.
      if (hopId == -1) {
        newEntry['error'] = 'true';
        break; 
      }
      
      // Push the hop onto the new entry.
      newEntry['paris_traceroute_hops'].push(entry[hopId]['paris_traceroute_hop']); 
    
      // Next source is current hop's destination. 
      soughtSource = entry[hopId]['paris_traceroute_hop']['dest_ip']; 
    } while (soughtSource != soughtDestination);
    
    // Copy the remaining entries.
    for (var property in entry[0]) {
      if (property !== 'paris_traceroute_hop') {
        newEntry[property] = entry[0][property];
      }
    }
    
    return JSON.stringify(newEntry);
""";


WITH Tests AS (
  SELECT DISTINCT test_id
  FROM `measurement-lab.base_tables.traceroute`
  LIMIT 10 -- Limit to ten measurements (which already takes two to three minutes!)
)
SELECT test_id, ProcessJson(TO_JSON_STRING(ARRAY(
  SELECT AS STRUCT *
  FROM `measurement-lab.base_tables.traceroute`
  WHERE test_id = Tests.test_id
)))
FROM Tests

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s