Transit Data Quality
What is data quality?
Introduction
Data increasingly plays a ubiquitous role in public and shared transportation, influencing everything from journey planning and ticketing to service improvement, governance and transport policy.
Improving the efficiency of multi-modal transport networks is crucial for reducing congestion and pollution. There is also a focus on making data accessible and interoperable through initiatives such as the National Access Points in the EU and the Bus Open Data Service in the UK.
Data quality is key. If a multi-modal journey planning app offers an impressive interface, but the passenger information is wrong, the travel experience will be poor, and the user might not use the app again. If inaccurate data is used to analyse network performance or inform policy outcomes, the analysis is likely to be flawed and reduces confidence in the insight it offers. High-quality data is at the heart of a positive user experience.
Public transit data quality is often remarked upon, but rarely defined or explored in detail. This article examines the concept of data quality in public transportation, including the types of data involved and the challenges encountered when aggregating public transit data for journey planning and for assessing the service performance of a multi-modal network.
There are several data quality dimensions that are important:
• Complete – does the data represent all the stops and journeys in the network?
• Up to date and timely – does the data represent the actual services available to the traveller, now and for future dates?
• Accurate – does the data accurately represent ground truth?
• Consistent – are data consistently described, from stop to stop, service to service, mode to mode, from one day to the next?
• Rich – can the data offer a rich travel experience (e.g. step-free access or real-time travel options during disruptions)?
• Interoperable – is the data described using industry standard data formats that allow it to easily pass from system to system?
Not all quality dimensions are equally important for all use cases. For example, real-time data dimensions are important to travellers as they navigate their journey, whereas, for some planning use cases, complete and consistent schedule data can be sufficient.
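Some of these dimensions lend themselves to automated checks. As a minimal sketch (all field names and thresholds here are illustrative assumptions, not from any specific standard), a completeness check can flag stops that never appear in any journey, and a timeliness check can flag journeys whose calendar expires within the look-ahead period travellers need:

```python
from datetime import date, timedelta

# Hypothetical trip records: (trip_id, stop_id, service_end_date)
trips = [
    ("t1", "stop_a", date(2024, 9, 1)),
    ("t2", "stop_b", date(2024, 6, 1)),
]
known_stops = {"stop_a", "stop_b", "stop_c"}

def check_completeness(trips, known_stops):
    """Flag stops that never appear in any trip (possible missing coverage)."""
    served = {stop_id for _, stop_id, _ in trips}
    return known_stops - served

def check_timeliness(trips, today, look_ahead_days=28):
    """Flag trips whose calendar expires within the look-ahead window."""
    horizon = today + timedelta(days=look_ahead_days)
    return [trip_id for trip_id, _, end in trips if end < horizon]

print(check_completeness(trips, known_stops))      # stops with no service
print(check_timeliness(trips, date(2024, 7, 1)))   # trips expiring soon
```

Checks like these only surface candidates for review; as discussed later, deciding whether a flagged item is genuinely wrong still requires transit knowledge.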
Overview of data quality challenges
Various challenges must be tackled to ensure the delivery of high-quality data. Representing the ground truth of a transport network in data is very complex. In this section, we examine the over-arching obstacles.
Complex inter-relationships between data
There are four main layers to the data. They inter-relate, so a data quality issue at one data layer can also adversely affect the data in the layers above.
Within the foundational layer lies the physical data, encompassing the exact location and names of transit stops, platforms and bike share docking stations, and more. In more comprehensive data models, it can also specify major interchanges’ entrances and exits, as well as elements of intra-station navigation, like escalators.
In the network topology layer, the logical arrangement of stops, platforms and other elements is connected to construct a service pattern. For example, it captures which stops sit on the same side of a road, or which stops are linked by a left-hand or right-hand turn at a junction.
The schedule layer provides a description of the planned and intended stop sequence, the frequency of the operating journeys, and the scheduled arrival or departure times at each stopping point. It also includes any variations to the journeys across different times of the day, days of the week and holiday periods.
The real-time data layer provides up-to-date information on the current status of journeys, including predicted arrival and departure times, as well as any delays, cancellations or diversions caused by disruptions.
Richer data models can include information about the specific vehicle in operation, including wheelchair access, occupancy, location and whether the vehicle is zero or low emission.
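The dependency between the four layers can be sketched with a minimal data model (all class and field names below are illustrative, not drawn from any standard). Each layer references the one beneath it, which is why an error in a stop's location or ID propagates upwards into patterns, schedules and real-time status:

```python
from dataclasses import dataclass

@dataclass
class Stop:                      # physical layer
    stop_id: str
    name: str
    lat: float
    lon: float

@dataclass
class ServicePattern:            # network topology layer
    pattern_id: str
    stop_ids: list               # ordered sequence of Stop.stop_id

@dataclass
class ScheduledJourney:          # schedule layer
    journey_id: str
    pattern_id: str
    departure_times: list        # one entry per stop in the pattern

@dataclass
class VehicleStatus:             # real-time layer
    journey_id: str
    delay_seconds: int
    wheelchair_accessible: bool = False

# Each layer references the one below, so an error in the Stop record
# corrupts everything built on top of it:
stop = Stop("s1", "High Street", 51.5, -0.1)
pattern = ServicePattern("p1", [stop.stop_id])
journey = ScheduledJourney("j1", pattern.pattern_id, ["08:00"])
status = VehicleStatus(journey.journey_id, delay_seconds=120)
```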
Data created by systems designed for logistics, not third-party data use
The original design of the software systems focused on managing the logistics of moving a fleet of vehicles around a network, rather than delivering high-quality passenger information for journey planning or data output for analytics. Consider a scenario where a scheduled service operates from A to C, but a driver change is necessary at B. In this case, the logistics system would represent it as two distinct services, A to B and B to C, potentially leading to confusion when analysing the data for passenger or network performance purposes.
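One hedged sketch of how an aggregator might repair this: where consecutive legs share a vehicle and the first leg ends at the stop where the next begins, they can be stitched back into a single passenger-facing journey. The data shapes below are illustrative assumptions:

```python
# Hypothetical export: a logistics system emits two legs (A→B, B→C)
# for one passenger-facing service because of a driver change at B.

def stitch_legs(legs):
    """legs: dicts with 'vehicle', 'start_time', and ordered 'stops' IDs."""
    merged = []
    for leg in sorted(legs, key=lambda l: l["start_time"]):
        if (merged
                and merged[-1]["vehicle"] == leg["vehicle"]
                and merged[-1]["stops"][-1] == leg["stops"][0]):
            # Same vehicle, continuous at the handover stop: merge,
            # dropping the duplicated changeover stop.
            merged[-1]["stops"] = merged[-1]["stops"] + leg["stops"][1:]
        else:
            merged.append(dict(leg))
    return merged

legs = [
    {"vehicle": "bus7", "start_time": "08:00", "stops": ["A", "B"]},
    {"vehicle": "bus7", "start_time": "08:20", "stops": ["B", "C"]},
]
print(stitch_legs(legs))  # one journey calling at A, B, C
```

In practice the join condition needs more care (timing gaps, block IDs where available), but the principle is the same.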
Typically, schedule data and real-time data systems are separate; real-time capabilities came later, to track individual vehicle locations. Schedules are created by skilled schedulers in dedicated scheduling systems, while real-time data originates from GPS hardware on each vehicle. The systems sometimes handle common data elements differently, which can lead to different IDs being assigned to the same data element. This presents challenges in reconciling schedule and real-time data to achieve the complete data view necessary for passenger information and performance analytics.
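When the two systems assign different IDs to the same physical stop, one common reconciliation tactic is matching by geographic proximity. The sketch below is a simplified assumption-laden example (the ID schemes, coordinates and 30 m threshold are all illustrative):

```python
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Approximate distance in metres over short ranges (equirectangular)."""
    dx = (lon2 - lon1) * 111_320 * math.cos(math.radians((lat1 + lat2) / 2))
    dy = (lat2 - lat1) * 110_540
    return math.hypot(dx, dy)

def reconcile(rt_stops, sched_stops, max_m=30):
    """Map real-time stop IDs to schedule stop IDs within max_m metres."""
    mapping = {}
    for rt_id, (rlat, rlon) in rt_stops.items():
        best_id, best_loc = min(sched_stops.items(),
                                key=lambda s: distance_m(rlat, rlon, *s[1]))
        if distance_m(rlat, rlon, *best_loc) <= max_m:
            mapping[rt_id] = best_id
    return mapping

rt = {"AVL_001": (51.50010, -0.10002)}
sched = {"STOP_490": (51.50000, -0.10000), "STOP_491": (51.51000, -0.10000)}
print(reconcile(rt, sched))  # {'AVL_001': 'STOP_490'}
```

Real deployments usually combine location with name similarity and route context, since nearby stops on opposite sides of a road can otherwise be confused.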
Data creation is a dynamic process
Static data creation is also a dynamic process. Heavy rail systems with highly complex timetables may change their base schedules two to three times a year, and stations themselves typically change infrequently. However, shorter-term platform changes, or service changes due to engineering works, are often updated more regularly.
Changes in bus and coach base schedules occur more frequently, usually in correlation with school holidays. The road network is susceptible to disruptions due to road works or closures for major events, requiring updates to both stops and schedules. Since these updates often affect the fundamental components of the schedules, data refreshes are usually conducted at intervals no shorter than twenty-four hours. Therefore, other real-time systems, such as messaging systems for real-time disruptions, should be employed for notifying passengers of last-minute changes.
There are sometimes challenges to changing schedules that lie outside of technical limitations. For example, a schedule may need to run for a set period because of agreements with drivers’ unions, or to measure performance against services operated by a specific operator.
The process of creating and updating a schedule or a stop in data involves some manual work. Often the process of file version control and distribution is also quite manual. Therefore, human error can affect data quality during the data creation and updating process.
There is also a temporal dimension to changing static data that can be overlooked. Some travellers may wish to plan a journey that starts immediately, while others may wish to plan a journey three to four weeks ahead of time. To ensure a positive user experience, the journey planning application needs reliable data that includes this future ‘look ahead’ period.
The process of real-time data creation involves converting location and other data from a GPS tracker or ticket machine on a vehicle into information that describes the vehicle’s current status, journey or the entire network.
The source data should ideally include a driver-activated indication of the vehicle’s route. This allows convenient matching with the schedule information, along with the bus’s location and vehicle ID.
The quality of the real-time data therefore depends upon the GPS tracking device working correctly on each vehicle, the frequency with which the location message is updated (typically every thirty seconds, but ideally as frequently as every five seconds), high-quality GPS network coverage to send the message, and the vehicle driver or operator activating the system when there is a change of route.
The Real Time Passenger Information (RTPI) system is the critical software that knits together the network’s intended operation (the schedules), with the reality of the live operations as described through real-time data.
As discussed earlier, real-time sources vary in quality and may not always have the required attributes to understand and link this information to scheduled journeys. The RTPI system bridges the gap between an unpredictable world and the demand for accurate and reliable real-time predictions, for both passengers on the street and downstream analytics platforms.
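The core matching step an RTPI system performs can be sketched roughly as follows: given a vehicle's reported route and observed departure, find the scheduled journey on that route closest in time, within a tolerance. The data shapes and the ten-minute tolerance are illustrative assumptions, not how any particular RTPI product works:

```python
from datetime import datetime, timedelta

# Hypothetical schedule index: (route, scheduled departure) -> journey ID
schedules = {
    ("route_12", datetime(2024, 7, 1, 8, 0)):  "journey_a",
    ("route_12", datetime(2024, 7, 1, 8, 30)): "journey_b",
}

def match_journey(route, observed_departure, tolerance=timedelta(minutes=10)):
    """Pick the scheduled journey on this route closest in time, if any."""
    candidates = [(abs(dep - observed_departure), jid)
                  for (r, dep), jid in schedules.items() if r == route]
    if not candidates:
        return None
    delta, jid = min(candidates)
    return jid if delta <= tolerance else None

print(match_journey("route_12", datetime(2024, 7, 1, 8, 4)))  # journey_a
```

If the driver never activates the route indication, the `route` key is missing or wrong, and this matching fails; that is exactly the dependency on correct on-vehicle operation described above.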
Different transportation modes have different software systems and practices
There are considerable differences in the operation of a bus, heavy rail, light rail, ferry or bike share mode. The software systems and practices optimised for each mode can handle the data differently, leading to challenges in aggregating data across modes into a single data view.
For example, rail-based systems tend to represent a stop as a single central data point, when on the ground a complex interchange may involve multiple platforms on different sides of the track or junction. This limitation also affects the ability to provide passengers with accurate information about entrances and exits, step-free access, and traversal times across the interchange. These challenges are further compounded when approaches designed for rail data are used to describe bus and coach services. Some practices and systems focus on the stop activity, while others focus on the journey.
As new micro-mobility modes have become popular, such as bike and scooter share, new systems have been designed to describe their data. As the sector matures, work is being done to standardise the data outputs from these systems and to handle the data within the recognised industry standard data formats for describing transit data.
Transit agencies in the same mode have different software systems and practices
There is considerable variation in how different scheduling teams choose to create their schedules. Practices can vary from depot to depot, as well as from transit agency to transit agency. The software systems they use also vary in the way they create, interpret and output the data.
Data can be output in different data formats
While some scheduling and real-time systems continue to output data in custom or proprietary standards, the industry is shifting towards major industry standard data formats. The General Transit Feed Specification (GTFS and GTFS-RT), managed by MobilityData.org, originated in North America but is widely used by journey planning software around the world. The European CEN standards, Network Timetable Exchange (NeTEx) and the Service Interface for Real Time Information (SIRI), are used in Europe, but also more widely, to express more complex data. Some European regions use variants of the CEN standards, for example the TransXChange (TxC) format in the UK and the VDV formats in the DACH region. GTFS/GTFS-RT uses a flat file format, is relatively easy for developers to master, and focusses on “downstream” journey planning use cases. The CEN standards use XML, support richer peer-to-peer integration, and cover “upstream” back office and operational use cases where richer data is required. Both major industry standard groups have extended their formats to include micro-mobility data: MobilityData administers the General Bikeshare Feed Specification (GBFS), while NeTEx and SIRI can now handle micro-mobility services.
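The flat-file nature of GTFS is part of why it is easy for developers to work with: each file is plain CSV. As a small illustration, parsing a fragment of a `stops.txt` file (the stop IDs, names and coordinates below are made up) needs nothing beyond a CSV reader:

```python
import csv
import io

# A made-up fragment of a GTFS stops.txt file.
stops_txt = """\
stop_id,stop_name,stop_lat,stop_lon
S1,High Street,51.5007,-0.1246
S2,Market Square,51.5033,-0.1195
"""

# Index stop locations by stop_id.
stops = {row["stop_id"]: (float(row["stop_lat"]), float(row["stop_lon"]))
         for row in csv.DictReader(io.StringIO(stops_txt))}

print(stops["S1"])  # (51.5007, -0.1246)
```

NeTEx and SIRI, by contrast, are XML schemas with deep nesting and many optional elements, which is what makes them both richer and harder to consume consistently.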
Data can be output in different profiles within the same data standard
Increasingly, the software systems that specialise in creating data for a specific mode will output the data in industry standard formats. However, they can interpret the standards differently, leading to inconsistencies in the data when aggregating data from multiple sources.
When using richer CEN data standards, the issue becomes more complex as they provide additional options for interpretation. In order to mitigate the issue, experts are specifying more tightly defined data profiles within standards. These specifications are often made at the national level. For example, the UK has developed a NeTEx profile for fares data and a TransXChange profile for schedules. The Nordics, France and Germany have also developed national NeTEx profiles for schedule and fares data. There is a risk that for large-scale data aggregation, the aggregator will have to handle data from multiple profiles, which still introduces a level of data inconsistency.
Combined system latency
Post-COVID, there is a renewed focus on improving the quality of real-time information, both for passengers and for operational purposes. This includes analysing actual service and network performance, as well as supporting integrated ‘Be in – Be out’ (BiBo) or ‘Check in – Check out’ (CiCo) ticketing systems.
Even a simple use case like providing real-time arrival time predictions at stop and on an app requires the coordination and transfer of different datasets across multiple systems.
When aggregated across multiple agencies and multiple modes, latency can build up. For example, it can take several hours to build a large aggregated schedule dataset, and another several hours for a journey planning app to integrate the data and build its routing graphs. Currently, the standard practice is to allow approximately twenty-four hours for new schedule data to become visible on global journey planning platforms, while local apps can update overnight. To provide accurate real-time predictions, the real-time vehicle position data needs to be matched to the most up-to-date schedule data. The overall architecture of the end-to-end system, with handoffs between the multiple systems, means that latency can increase from thirty seconds to two to three minutes, with a noticeable impact on data quality for the most time-sensitive use cases.
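A back-of-the-envelope latency budget makes the accumulation concrete. The stage names and durations below are illustrative assumptions, not measurements of any real pipeline, but they show how per-stage delays of tens of seconds compound into minutes:

```python
# Hypothetical worst-case delay contributed by each hand-off in the
# real-time pipeline, in seconds.
stages_seconds = {
    "vehicle GPS ping interval":       30,
    "operator AVL system processing":  20,
    "RTPI schedule matching":          30,
    "national aggregation feed":       40,
    "journey planner ingestion":       40,
}

total = sum(stages_seconds.values())
print(f"worst-case end-to-end latency: {total} s (~{total / 60:.1f} min)")
```

Even with each stage individually fast, five hand-offs of 20–40 seconds each already put the total in the two-to-three-minute range discussed above.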
Machine learning vs human judgement in transit data quality
Transit data requires an integrated approach, connecting unique elements to create an accurate representation for passengers, journey planning and operational use cases. Transit data describes both the operational intent of an organisation (schedules) and the reality of operating that plan (real-time). The published data cannot, on its own, reveal whether it is right or wrong; understanding how the system operates is crucial for delivering the best data.
Writing algorithms to identify problems is possible. However, acting on complex, identified data issues is hard without deep human transit knowledge and appropriate technical tools. The complexity increases when aggregating multiple transit agency datasets from different modes into one reliable network source. This ultimately calls for a holistic, network-level approach to ensure data quality, combining both machine and human expertise for optimal outcomes.
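To illustrate the first half of that point, here is a hedged sketch of the kind of rule-based checks that can flag likely problems automatically. The field names are illustrative, and note that a flagged item is only a candidate: for example, a repeated stop can be legitimate on a loop route, which is precisely where human transit knowledge comes in:

```python
def find_issues(journeys):
    """Flag journeys whose stop times or stop sequences look implausible."""
    issues = []
    for j in journeys:
        times = j["stop_times"]  # minutes past midnight, one per stop
        if sorted(times) != times:
            issues.append((j["id"], "stop times not monotonically increasing"))
        if len(set(j["stop_ids"])) != len(j["stop_ids"]):
            issues.append((j["id"], "duplicate stop in sequence"))
    return issues

journeys = [
    {"id": "j1", "stop_ids": ["A", "B", "C"], "stop_times": [480, 485, 490]},
    {"id": "j2", "stop_ids": ["A", "A", "C"], "stop_times": [480, 475, 490]},
]
print(find_issues(journeys))  # only j2 is flagged, for both rules
```

The machine narrows thousands of journeys down to a short list; deciding which flags are genuine errors, and how to fix them at source, remains a human task.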