Methodology & Results
Data overview
At a high level, this project compares real-time data (data that indicates which bus trips actually happened) with schedule data (the trips that the CTA said in advance would happen).
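The comparison described above can be sketched as a per-route ratio of observed trips to scheduled trips. This is a minimal illustration, not the project's actual analysis code: the function name `trip_ratio_by_route`, the `(route, trip_id)` pair representation, and the sample values are all assumptions made for the example.

```python
from collections import Counter


def trip_ratio_by_route(scheduled_trips, observed_trips):
    """Return {route: observed / scheduled} for every route with scheduled trips.

    Both arguments are iterables of (route, trip_id) pairs. Duplicates are
    collapsed first, so a trip that shows up in many real-time snapshots
    is counted once.
    """
    scheduled = Counter(route for route, _ in set(scheduled_trips))
    observed = Counter(route for route, _ in set(observed_trips))
    # Counter returns 0 for routes that were scheduled but never observed.
    return {route: observed[route] / count for route, count in scheduled.items()}


# Hypothetical sample data, invented for illustration only.
scheduled = [("66", "a"), ("66", "b"), ("66", "c"), ("66", "d"), ("9", "x"), ("9", "y")]
observed = [("66", "a"), ("66", "a"), ("66", "b"), ("66", "c"), ("9", "x")]

ratios = trip_ratio_by_route(scheduled, observed)
print(ratios)
```

A ratio below 1.0 for a route would suggest that fewer trips appeared in the real-time data than the schedule promised for the same period.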
The real-time data is scraped from the CTA bus tracker API getvehicles feed every five minutes. Data collection started on 20 May 2022 and continues to date, though the data on this site is not yet automatically updated. Every day we scrape the raw data in the form of hundreds of JSON files, which we store in Amazon Web Services’ Simple Storage Service (S3). We use only the getvehicles feed, not the feed that predicts bus arrivals at specific stops or any of the other bus data feeds that the CTA makes available.
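As a rough sketch of how one of those JSON snapshots might be read, the example below parses a small hand-written payload and collects the distinct trips it contains. The field names (`bustime-response`, `vehicle`, `rt`, `tatripid`, `vid`, `tmstmp`) are assumptions about the feed's JSON layout, the sample values are invented, and the helper names are hypothetical. It does not show the project's own scraper.

```python
import json

# Hand-written sample shaped like an assumed getvehicles response.
SAMPLE_SNAPSHOT = """
{"bustime-response": {"vehicle": [
  {"vid": "1001", "rt": "66", "tatripid": "101", "tmstmp": "20220520 14:05"},
  {"vid": "1002", "rt": "66", "tatripid": "102", "tmstmp": "20220520 14:05"},
  {"vid": "2001", "rt": "9", "tatripid": "201", "tmstmp": "20220520 14:04"}
]}}
"""


def vehicles_in_snapshot(raw_json):
    """Return the list of vehicle records in one snapshot (empty if none)."""
    payload = json.loads(raw_json)
    return payload.get("bustime-response", {}).get("vehicle", [])


def trips_seen(snapshots):
    """Collect the distinct (route, trip ID) pairs across several snapshots."""
    seen = set()
    for raw in snapshots:
        for vehicle in vehicles_in_snapshot(raw):
            seen.add((vehicle["rt"], vehicle["tatripid"]))
    return seen


# The same bus appears in consecutive five-minute snapshots, so the same
# trip is only counted once.
print(trips_seen([SAMPLE_SNAPSHOT, SAMPLE_SNAPSHOT]))
```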
Schedule data comes from the CTA’s General Transit Feed Specification (GTFS) feed. A list of schedule versions is compiled from transitfeeds.com. A schedule version is deemed to be in effect based on when it was online. Days on which the CTA published a new schedule feed are excluded from the analysis. At the time of writing, we don’t scrape these feeds automatically, so we have to process schedule feed versions manually, although there are plans to automate more of this. More information on the GTFS schedule files can be found on the CTA's GTFS page or on gtfs.org.
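One way to picture the schedule-version rule is a lookup that, for a given day, returns the most recently published version and skips any day on which a new version was published. This is a sketch under assumptions: the function name `schedule_in_effect` and the publication dates are invented for illustration and are not the real CTA feed dates.

```python
from bisect import bisect_right
from datetime import date


def schedule_in_effect(publish_dates, day):
    """Return the publish date of the schedule version in effect on `day`.

    `publish_dates` must be sorted in ascending order. Returns None when
    `day` precedes the first version, or when `day` is itself a publish
    day (those days are excluded from analysis).
    """
    if day in publish_dates:
        return None
    index = bisect_right(publish_dates, day)
    return publish_dates[index - 1] if index else None


# Hypothetical publication dates, invented for illustration only.
versions = [date(2022, 5, 7), date(2022, 6, 1), date(2022, 7, 8)]

print(schedule_in_effect(versions, date(2022, 5, 20)))  # version from 7 May
print(schedule_in_effect(versions, date(2022, 6, 1)))   # publish day, excluded
```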
See our data and code
For more information on data collection, or to access our raw data yourself, see the documentation on our GitHub repo. Our S3 bucket is public and updates continuously. Our data scraping code and data analysis code are also available. If you find a bug in our data or code, please feel free to submit an issue in the repo or contact us directly.
Data notes and limitations
It is important to acknowledge that real-time data is not perfect. GPS devices can fail or break, for example, so it’s possible that there are bus trips that did happen but are not captured in the real-time data.
That being said, the data we are scraping represents the data that the CTA is making available to third-party bus-tracking apps and to riders. Riders make decisions based on the data that is available at the time they travel. This means that inaccurate or missing real-time data can affect riders’ ability to plan their journeys, even if more trips actually occur than are counted in the real-time data. Real-time data captures the trips that a rider could have known about if they were checking a tracking app or service, even if it does not completely capture all trips that occurred.
This is an all-volunteer project, and for our current launch we have not done granular row-level data cleaning. We observed, for example, that for a small number of trips on the #66 Chicago bus, the trip ID was missing in the source data from the API and was only listed as a series of asterisks (like ********). We have not added any special handling or cleaning for that, so we may be slightly undercounting the number of actual trips that occurred on that route. We saw the same issue on the #74 Fullerton bus and assumed it was not widespread. However, after the data had been displayed on the site for several months, we investigated further and realized that the issue is common enough for the #74 that it made our display inaccurate. We have removed the #74 bus for the time being, and we are not sure whether we will find a valid method to overcome the issue and reinstate the #74 bus on our map.
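A simple check for the masked trip IDs described above could measure, per route, what share of records has a trip ID that is missing or made up only of asterisks. This is a hedged sketch, not cleaning logic the project has adopted: the record keys `rt` and `tatripid` are assumptions about the feed's field names, the helper names are hypothetical, and the sample records are invented and do not reflect the real rates on any route.

```python
from collections import defaultdict


def is_masked(trip_id):
    """True when a trip ID is missing, blank, or consists only of asterisks."""
    text = "" if trip_id is None else str(trip_id).strip()
    return text == "" or set(text) == {"*"}


def masked_share_by_route(records):
    """Return {route: fraction of records whose trip ID is masked}."""
    total = defaultdict(int)
    masked = defaultdict(int)
    for record in records:
        route = record["rt"]
        total[route] += 1
        if is_masked(record.get("tatripid")):
            masked[route] += 1
    return {route: masked[route] / total[route] for route in total}


# Invented sample records, for illustration only.
records = [
    {"rt": "74", "tatripid": "********"},
    {"rt": "74", "tatripid": "301"},
    {"rt": "74", "tatripid": "********"},
    {"rt": "74", "tatripid": "302"},
    {"rt": "66", "tatripid": "101"},
    {"rt": "66", "tatripid": "********"},
    {"rt": "66", "tatripid": "102"},
    {"rt": "66", "tatripid": "103"},
    {"rt": "9", "tatripid": "201"},
    {"rt": "9", "tatripid": "202"},
]

shares = masked_share_by_route(records)
print(shares)
```

Running a check like this per route would show where masked IDs are rare enough to tolerate and where they are frequent enough to distort trip counts.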