Binary UTSDF (Unified Transportation Sensor Data Format) for Traffic Data

For many years, MN/DOT has collected traffic data from sensors (detectors) embedded in the roadway. As of March 2000, data is being collected every 30 seconds from more than 4,000 detectors in the Twin Cities metro area. This raw data consists of volume (number of vehicles, sometimes called "flow") and occupancy (percentage of time a detector is "occupied"). As you might guess, this adds up to a very large amount of data every day. There is so much data that the advantages of storing it in a traditional database are far outweighed by the complications. This data storage problem led to the development of the MN/DOT Unified Traffic Data File Format. This format is now a special case of TDRL's UTSDF.

There are many benefits to UTSDF. Probably the most important benefit is simplicity. Earlier file formats required complicated bit field manipulation, which made it harder to develop data analysis tools. This problem has been eliminated, since all data is stored as either 8-bit or 16-bit binary integers. Another benefit is the compactness of the format. In earlier formats, a day's worth of data would occupy 33 megabytes (MB) of disk space. In this format, the same data is compressed into about 13 MB (with no loss of precision). Another problem with earlier formats was that the distinction between 30-second, 5-minute, and station data made accessing the data more complex than it needed to be. This format unifies all the data into a single file, simplifying the software needed to access the data. Another key benefit of the format is extensibility. It will be possible in the future to add different types of data (such as speed) to the format without sacrificing backward compatibility.

Each traffic data file consists of one day's worth of traffic data. The files are conventionally named with an eight-digit date (four-digit year, two-digit month and two-digit day), plus an extension of ".traffic". For example, a file called "20000323.traffic" would contain all the detector data for March 23, 2000. The file itself is actually in the popular ZIP compression format, making it easy to extract data using tools such as WinZip. Within the traffic file, there are two files for each detector, one containing the detector volumes for the whole day, and the other containing occupancies. These files are named using the detector index number as the base file name, with an extension of ".v30" (for volume) or ".o30" (for occupancy). So, if there were a detector number 100, the traffic file would contain two files, "100.v30" and "100.o30" (in addition to the files for all the other defined detectors).
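
As an illustration of the layout described above, here is a minimal Python sketch that opens a .traffic file with the standard zipfile module, lists the detectors it contains, and pulls out the raw bytes for one detector. The function names are hypothetical, and the sketch assumes the per-detector files sit at the top level of the archive with no directory prefix, which the description does not state explicitly.

```python
import zipfile


def list_detectors(traffic_file):
    """Return the sorted detector index numbers found in a .traffic archive.

    traffic_file may be a path such as "20000323.traffic" or a file-like
    object. Assumes member names look like "100.v30" / "100.o30".
    """
    detectors = set()
    with zipfile.ZipFile(traffic_file) as archive:
        for name in archive.namelist():
            base, _, ext = name.rpartition(".")
            if ext in ("v30", "o30") and base.isdigit():
                detectors.add(int(base))
    return sorted(detectors)


def read_detector_file(traffic_file, detector, ext):
    """Return the raw bytes of one member, e.g. detector=100, ext="v30"."""
    with zipfile.ZipFile(traffic_file) as archive:
        return archive.read(f"{detector}.{ext}")
```

For example, read_detector_file("20000323.traffic", 100, "v30") would return the raw volume bytes for detector 100 on March 23, 2000, provided that file and detector exist.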

The volume files (*.v30) are flat binary files of 2880 bytes each. Each byte is an 8-bit signed volume for the corresponding 30-second period in the day. A negative value (-1) indicates missing data. The first 8-bit value represents the first 30 seconds of the day (midnight to 12:00:30 a.m.), and the last value represents the last 30 seconds of the day (11:59:30 p.m. to midnight).
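
A volume file can be decoded in a few lines. The following Python sketch is one possible way to do it, using the standard struct module; the function names are hypothetical, and mapping every negative byte (not only -1) to None is a choice made for this example.

```python
import struct

SAMPLES_PER_DAY = 2880  # 24 hours * 120 thirty-second periods per hour


def parse_volumes(raw):
    """Decode the contents of a .v30 file into a list of 2880 volumes.

    Each byte is read as a signed 8-bit integer. Negative values are
    returned as None to mark missing data (an assumption of this sketch).
    """
    if len(raw) != SAMPLES_PER_DAY:
        raise ValueError(f"expected {SAMPLES_PER_DAY} bytes, got {len(raw)}")
    values = struct.unpack(f"{SAMPLES_PER_DAY}b", raw)
    return [None if v < 0 else v for v in values]


def sample_start_time(index):
    """Return the 24-hour start time (HH:MM:SS) of sample number index."""
    seconds = index * 30
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"
```

Here sample_start_time(0) gives "00:00:00" and sample_start_time(2879) gives "23:59:30", matching the first and last periods of the day.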

The occupancy files (*.o30) are in a format very similar to the volume files, except that each value is a 16-bit signed occupancy. Each file is 5760 bytes in length (2880 * 2). The occupancy values are fixed-point integers ranging from 0 to 1000 (in units of a tenth of a percent). A negative value (-1) indicates missing data, as with the volume files. The 16-bit values are stored high byte first (big-endian).
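
The byte order matters when decoding occupancy files on little-endian machines such as PCs. The Python sketch below shows one way to read the values, using the ">" (big-endian) prefix of the standard struct module; the function name is hypothetical, and returning None for negative values is a choice made for this example.

```python
import struct

SAMPLES_PER_DAY = 2880  # one 16-bit value per 30-second period


def parse_occupancies(raw):
    """Decode the contents of a .o30 file into a list of 2880 values.

    Values are signed 16-bit integers, high byte first, in tenths of a
    percent (0 to 1000). Negative values are returned as None.
    """
    if len(raw) != SAMPLES_PER_DAY * 2:
        raise ValueError(f"expected {SAMPLES_PER_DAY * 2} bytes, got {len(raw)}")
    values = struct.unpack(f">{SAMPLES_PER_DAY}h", raw)
    return [None if v < 0 else v for v in values]
```

For instance, the two bytes 0x03 0xE8 decode to 1000, which is 100.0 percent occupancy.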

Last modified: 23 March 2000 

Addendum to the original description of the file format (8/3/2001)

1. The .c30 files are recorded in "scans" and are more precise than the .o30 files.  Soon, all the data will use the .c30 format.  Scans are defined as 1/60 second, so the valid range for data is 0 to 1800 (30 seconds * 60 scans/second).  The old .o30 files are in 1/10th percent occupancy, so they range from 0 to 1000.  That is the only difference between the two file formats.  If you want to get numbers in the range of 0 to 100, divide scan data by 18 or occupancy data by 10.  Any data outside the valid ranges should be considered "bad".

2. For volume data, it doesn't really matter whether you treat it as signed or unsigned. Since the samples are 30-second volumes, a count of 40 vehicles would correspond to a flow rate of 4,800 vehicles per hour, which translates to an average headway of 0.75 seconds between vehicles! This does not happen in real life. I suggest you treat any data not in the range of 0 to 40 as "bad" data.

3. The different negative numbers that show up in the data are the result of minor bugs in the data collection software. These will be fixed some time in the future. Just as with volume, any value not in the valid range should be considered "bad" data.
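
The three points above can be sketched in Python as follows. This is only an illustration: the names are hypothetical, the ranges are the ones given in this addendum, and returning None for "bad" data is a choice made for the example. The flow and headway helpers simply reproduce the arithmetic in point 2 (120 thirty-second periods per hour, 30 seconds divided by the vehicle count).

```python
# Valid ranges per file type, taken from the addendum.
VALID_RANGES = {
    "v30": (0, 40),    # vehicles per 30 seconds
    "o30": (0, 1000),  # tenths of a percent
    "c30": (0, 1800),  # scans (1/60 second each)
}


def is_valid(value, kind):
    """Return True if value lies inside the valid range for this file type."""
    low, high = VALID_RANGES[kind]
    return low <= value <= high


def to_percent(value, kind):
    """Convert an .o30 or .c30 sample to percent occupancy (0 to 100).

    Returns None for values outside the valid range ("bad" data).
    """
    if kind not in ("o30", "c30"):
        raise ValueError("kind must be 'o30' or 'c30'")
    if not is_valid(value, kind):
        return None
    divisor = 10.0 if kind == "o30" else 18.0
    return value / divisor


def hourly_flow(volume):
    """Scale a 30-second volume to vehicles per hour."""
    return volume * 120


def mean_headway_seconds(volume):
    """Average time between vehicles in a 30-second sample, or None if empty."""
    if volume <= 0:
        return None
    return 30.0 / volume
```

With these helpers, hourly_flow(40) is 4800 and mean_headway_seconds(40) is 0.75, the figures quoted in point 2. Note that a missing volume byte read as unsigned is 255, which the 0 to 40 range check rejects just as it rejects -1.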

This format information was provided by Doug Lau at TMC Mn/DOT.