Spatio-temporal quality control: implications and applications for data consumers and aggregators

Authors: Douglas E. Galarus and Rafal A. Angryk

DOI: 10.1186/s40965-016-0003-2

How to cite: Galarus, D. E., & Angryk, R. A. (2016). Spatio-temporal quality control: implications and applications for data consumers and aggregators. Open Geospatial Data, Software and Standards, 1(2).

Information:
The amount and availability of data from sensor networks have grown rapidly in recent years due to increased computing power, greater coverage and bandwidth of communication networks, and reduced costs for both storage and sensing equipment. As a result, the types of monitoring have expanded from environmental sensing, industrial monitoring and control, and traffic monitoring to the monitoring of household appliances, power consumption, and the control of household heating and cooling. The evolving “Internet of Things” will surely make even more data available for new applications from numerous, overlapping providers. Increased attention must therefore be given to quality control from the perspective of the aggregator and disseminator of data, and to the impact of quality control on their processes and products.

“Quality” is inherently subjective and depends on the user and the use of the data. Quality control is an exercise in measuring the quality of data, assessing it for a given use, and applying the results to that use. For instance, measures such as accuracy, precision, timeliness, and reliability can be formulated and used by data consumers to determine which data is “good” and which is “bad” relative to their applications. For real-time applications, data that is not timely (i.e., data that is stale or old when it first becomes available) may be of little use even if it is accurate, whereas it may still be useful for applications that are not time-sensitive. Data may be accurate in representing real-world conditions such as ambient air temperature, yet it becomes unusable if associated metadata such as location or time is incorrect. Having access to provider quality control measures, and to data and metadata from which quality control measures can be formulated, is critical to successful use by consumers.
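
As a concrete illustration, not taken from the paper, of how a consumer might quantify timeliness, the minimal Python sketch below treats staleness as the gap between an observation’s own timestamp and the time it first became available from the provider, and filters out records that exceed an application-specific threshold. The record fields, values, and threshold are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical observation records: field names and values are illustrative
# only, not drawn from any specific provider feed.
observations = [
    {"station": "A1", "air_temp_f": 41.2,
     "observed_at": datetime(2016, 3, 1, 12, 0),
     "available_at": datetime(2016, 3, 1, 12, 4)},
    {"station": "B7", "air_temp_f": 39.8,
     "observed_at": datetime(2016, 3, 1, 11, 10),
     "available_at": datetime(2016, 3, 1, 12, 5)},
]

MAX_STALENESS = timedelta(minutes=20)  # application-specific threshold

def staleness(obs):
    """Delay between when the value was observed and when it became available."""
    return obs["available_at"] - obs["observed_at"]

# Keep only observations fresh enough for a real-time display.
timely = [o for o in observations if staleness(o) <= MAX_STALENESS]

for o in observations:
    print(o["station"], "staleness:", staleness(o),
          "timely" if staleness(o) <= MAX_STALENESS else "stale")
```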

Quality control measures, if included at all, are generally presented from the perspective of the original data provider, with a focus on sensor accuracy, precision, and other measures assessing the direct performance of the sensor. Differing quality control measures and policies from providers pose further challenges for data aggregators. For instance, one data provider may present quality control indicators at the sensor level while another flags data only at the station level, leaving uncertainty as to which of multiple sensor readings is in question. Aggregating such data into a uniform and cohesive offering is a challenge, as is the task of selecting which providers should be used from multiple, overlapping offerings.
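
One way an aggregator might cope with such differing conventions, sketched here with purely hypothetical flag codes rather than any real provider’s scheme, is to map each provider’s flags onto a common internal scale and conservatively propagate a station-level flag to every sensor reading from that station:

```python
# Hypothetical unified flags; provider codes below are illustrative only.
PASS, SUSPECT, FAIL = "pass", "suspect", "fail"

# Provider X flags each sensor individually.
PROVIDER_X_MAP = {"V": PASS, "Q": SUSPECT, "X": FAIL}

def normalize_provider_x(record):
    """Map per-sensor flags from a provider that flags at the sensor level."""
    return {sensor: PROVIDER_X_MAP.get(flag, SUSPECT)
            for sensor, flag in record["sensor_flags"].items()}

def normalize_provider_y(record):
    """Provider Y flags at the station level only: propagate the station flag
    conservatively to every sensor reported by that station."""
    station_flag = PASS if record["station_flag"] == "OK" else SUSPECT
    return {sensor: station_flag for sensor in record["sensors"]}

rec_x = {"sensor_flags": {"air_temp": "V", "wind_speed": "Q"}}
rec_y = {"station_flag": "WARN", "sensors": ["air_temp", "wind_speed"]}

print(normalize_provider_x(rec_x))  # {'air_temp': 'pass', 'wind_speed': 'suspect'}
print(normalize_provider_y(rec_y))  # both sensors marked 'suspect'
```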

Spatio-temporal data, used in the absence of quality control measures, will likely yield questionable or poor results. Because of these challenges, we must investigate ways to aggregate and derive quality control measures from provided data, including sensor observations and timestamps corresponding not only to the original observation but also to the times at which the data is made available, processed, and redistributed. The best approach to improving the quality of data is to start at the source: the sensors. But we must recognize and work with what is within our control. As aggregators of data from sensor networks controlled by other agencies, we make the best of what they give us and ideally add value to it. In all likelihood, we will have no control over the content, format, and distribution mechanisms used by the providers. We might not even have a direct mechanism for reporting problems to the provider and seeking resolution. What we can do is implement our own quality control mechanisms and use them to optimize the performance of our systems.

For instance, we can evaluate the spatio-temporal coverage of provider data in the presence of multiple, overlapping providers and in light of bandwidth and processing constraints. We can seek answers to questions of whether to include data from one provider relative to others. For example, what do we gain in terms of spatial coverage by using data from one provider versus two, and what is the cost in terms of bandwidth and storage? What is the overlap in data from multiple providers? Does it improve spatial and temporal coverage? We can evaluate the impact of quality control processes implemented on our systems and in provider systems. We can compare providers to determine overlap, decide which data to use based on quality control measures, and identify spatial and temporal gaps in the data we are provided.
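
To make the coverage question concrete, the sketch below bins station locations into grid cells and compares the cells covered by one provider against the union of two, a simple proxy for the spatial coverage gained by adding a feed. The coordinates and the 0.5-degree cell size are arbitrary illustrations, not values from the paper.

```python
# Minimal sketch: approximate spatial coverage by counting distinct grid cells
# that contain at least one reporting station.
CELL_SIZE_DEG = 0.5

def covered_cells(stations, cell_size=CELL_SIZE_DEG):
    """Return the set of (lat, lon) grid cells touched by the given stations."""
    return {(int(lat // cell_size), int(lon // cell_size))
            for lat, lon in stations}

provider_a = [(45.67, -111.05), (45.70, -111.10), (46.60, -112.04)]
provider_b = [(45.68, -111.04), (47.50, -111.30), (46.59, -112.00)]

cells_a = covered_cells(provider_a)
cells_ab = covered_cells(provider_a + provider_b)

print("cells covered by A alone:      ", len(cells_a))
print("cells covered by A and B:      ", len(cells_ab))
print("additional cells gained from B:", len(cells_ab - cells_a))
```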

Quite often, sensor-level quality control processes utilize domain-specific, rule-based systems or general outlier detection techniques to flag “bad” values. For instance, NOAA’s Meteorological Assimilation Data Ingest System (MADIS) [1] applies a range of [–60 °F, 130 °F] in its validity check for air temperature observations [2], while the University of Utah’s MesoWest [3] uses the range [–75 °F, 135 °F] in its quality control checks for air temperature [4]. These ranges are intended to represent the possible air temperature values that could be observed in real-world conditions, at least within the coverage area of the given provider. If an observation falls outside the range, the provider will flag that observation as having failed the range test, and the observation will for all practical purposes be considered “bad”. Obviously, range tests are not perfect checks. For instance, the record high United States temperature would fail MADIS’s range test, although it would pass MesoWest’s test. Both MADIS and MesoWest apply a suite of tests to observations that goes beyond these simple range tests. “Buddy” tests are used to compare observations at a given point to neighboring observations. MADIS uses Optimal Interpolation in conjunction with cross-validation to measure the conformity of an observation to its neighbors [2]. MesoWest uses multivariate linear regression to estimate observations [5]. A real observation is compared to the estimate for its location, and if the deviation between the estimated and observed values is high, the observation is flagged as questionable.
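
The sketch below captures the flavor of these checks: a range test using the MADIS and MesoWest air-temperature bounds quoted above, plus a deliberately simplified neighbor comparison against the median of nearby values with an arbitrary tolerance. The latter is only a stand-in for the Optimal Interpolation and regression estimates those systems actually use.

```python
from statistics import median

# Published validity ranges for air temperature (degrees F).
MADIS_RANGE = (-60.0, 130.0)
MESOWEST_RANGE = (-75.0, 135.0)

def range_test(value_f, bounds):
    """Return True if the observation falls inside the provider's valid range."""
    lo, hi = bounds
    return lo <= value_f <= hi

def simple_buddy_test(value_f, neighbor_values_f, tolerance_f=15.0):
    """Highly simplified stand-in for a buddy check: flag the observation as
    questionable if it deviates from the median of its neighbors by more than
    an arbitrary tolerance. Real systems use Optimal Interpolation (MADIS) or
    multivariate regression (MesoWest) to form the estimate."""
    if not neighbor_values_f:
        return True  # nothing to compare against; do not flag
    return abs(value_f - median(neighbor_values_f)) <= tolerance_f

obs = 134.0  # record U.S. high temperature (Death Valley, 1913), in degrees F
print("passes MADIS range test:   ", range_test(obs, MADIS_RANGE))     # False
print("passes MesoWest range test:", range_test(obs, MESOWEST_RANGE))  # True
print("passes simple buddy test:  ", simple_buddy_test(72.0, [70.5, 73.1, 69.8]))
```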

These approaches help to assess the accuracy of a given observation, yet quality and performance in general need to be assessed along further dimensions that account for the spatial and temporal aspects of applications. For instance, we may want to maximize visual “coverage” of a map displayed in a web application at “critical usage times” with “good” data values while working within limited bandwidth. Such problems involve multiple, conflicting objectives, making them challenging to solve. Formulating such problems is also challenging because we generally view “quality” as more subjective than “quantity”. Our challenge is to express quality in quantifiable terms.
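
One simple way to reason about such a trade-off, offered here only as a sketch under assumed inputs rather than as the optimization approach of the paper, is a greedy selection that repeatedly adds the provider feed yielding the most additional map coverage per unit of bandwidth until a bandwidth budget is exhausted:

```python
# Hypothetical provider feeds: "cells" is the set of map grid cells each feed
# would fill, "kb" its bandwidth cost per update cycle. All values illustrative.
feeds = {
    "provider_a": {"cells": {1, 2, 3, 4}, "kb": 120},
    "provider_b": {"cells": {3, 4, 5, 6, 7}, "kb": 200},
    "provider_c": {"cells": {7, 8}, "kb": 40},
}
BANDWIDTH_BUDGET_KB = 250

def greedy_select(feeds, budget_kb):
    """Greedily pick feeds maximizing new cells covered per KB within budget."""
    chosen, covered, spent = [], set(), 0
    remaining = dict(feeds)
    while remaining:
        # Score each remaining feed by new coverage gained per KB of bandwidth.
        name, info = max(remaining.items(),
                         key=lambda kv: len(kv[1]["cells"] - covered) / kv[1]["kb"])
        if spent + info["kb"] > budget_kb or not (info["cells"] - covered):
            break
        chosen.append(name)
        covered |= info["cells"]
        spent += info["kb"]
        del remaining[name]
    return chosen, covered, spent

print(greedy_select(feeds, BANDWIDTH_BUDGET_KB))
```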

In this paper, we present specific spatio-temporal quality control measures, applicable to a wide variety of spatio-temporal provider data distribution mechanisms. We present practical methods using these quality control measures, and demonstrate their utility.

We do not attempt to correct erroneous data or improve collection at the source. Others correctly state that correction at the source is the best way to improve data quality [6]. Our objective in this paper is to make the most of the data from providers as-is. We do not perform outlier detection or otherwise attempt to assess accuracy, precision, or other direct quality measures on individual sensors. Instead, we use provider quality control descriptors to label “bad” data. In separate work, we tackle the problem of identifying “bad” data [7, 8]. We do not directly address system or network performance, nor do we present a distributed approach that would interact directly with sensors in the field. Building on prior work [9], we optimize measures such as coverage relative to bandwidth and the scheduling of provider data downloads. Our interest is that of a data aggregator/consumer, and we work within the constraints of what can and cannot be controlled from this role.

This is an Open Access article. The full text, including the references cited above, is available via the DOI listed above.
