Real-time Data Quality with Lidar Flow Analytics
I’m going to wrap up this series on lidar data cleaning and data quality issues with some discussion of real-time data. In doing people measurement and flow analytics, most of our primary use-cases aren’t about real-time. Even in digital analytics, real-time isn’t a big thing. And in the physical store, analytics typically involves at least a week’s worth of data. Yet real-time data – particularly with lidar – has become very important to us, because while real-time analytics isn’t a big deal, real-time operational control is. From queue management in airports to dynamic associate allocation in retail stores to remote control of lighting, sound and HVAC, a lot of our current people-measurement isn’t flow analytics at all – it’s operational monitoring and control.
When it comes to real-time data, all of the data quality issues that I’ve cataloged (ghosts, fragments, track breaks, object misclassification, etc.) are exactly the same. Unfortunately, your ability to fix those problems is significantly constrained.
Real-time presents two challenges for the data quality engineer. The first is obvious. Performance is always a huge deal in real-time operations. With historical data, we have the luxury of processing data hourly or even daily and taking our time with it. That means we can afford to do a fair amount of processing on every record. That’s not always possible in real-time. That’s especially true since with real-time operational control, we often have to do edge-processing. That isn’t the case for things like queue management (queues don’t change in milliseconds), but it is true when we’re working in device control to adjust lighting and sound, or when we’re working in traffic where we often don’t have a high-bandwidth connection to the cloud. With real-time edge processing, you have to pick your data quality spots and maximize the performance of the pipeline.
Even more significant and fundamental than these processing limitations is the fact that a LOT of lidar data-cleaning techniques rely on being able to look at the entire history of a track. When we stitch broken tracks, we get to look ahead (and even behind – negative time stitches do happen) to find the most appropriate match. When we eliminate fragments, we know exactly how long they lasted. We know, for every track, its duration and its movement. We can even eliminate ghosts by matching them to other tracks. None of that is usually possible in real-time. When a record shows up, we don’t know how long it’s going to last or where it’s going to go.
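To make the contrast concrete, here’s a minimal sketch of a historical fragment filter in Python. The Track shape and the duration/frame-count thresholds are illustrative assumptions, not values from our pipeline; the point is simply that the filter can’t run until a track’s end time is known – exactly what a real-time stream never gives you.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    start_ts: float  # first frame timestamp, seconds
    end_ts: float    # last frame timestamp, seconds
    frames: int      # number of frames observed

def remove_fragments(tracks, min_duration=2.0, min_frames=10):
    # Historical fragment filter: requires the *complete* track,
    # because duration is only known once the track has ended.
    return [t for t in tracks
            if (t.end_ts - t.start_ts) >= min_duration and t.frames >= min_frames]

tracks = [
    Track(1, 0.0, 45.0, 450),  # plausible shopper
    Track(2, 3.0, 3.4, 4),     # fragment: too short-lived
    Track(3, 10.0, 30.0, 5),   # fragment: too few frames
]
kept = remove_fragments(tracks)
```

In a live stream, neither `end_ts` nor the final frame count exists when the record first arrives, which is why this filter only works after the fact (or after a deliberate delay).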
Unfortunately, that means that many of the most powerful techniques in lidar data cleaning are severely limited when applied to real-time data.
What Works and What Doesn’t
So, if you’re tasked with cleaning lidar real-time data, what can you do?
The first question to ask is how long you have. There’s real-time and then there’s REAL-TIME. When we’re doing some forms of device control, our latency needs to be in the 100-millisecond range. That’s hard. It means that we’re generally confined to the edge, and it means that stitching, fragment removal, and ghost removal are all off the table. On the other hand, there are still things we can do to improve the data. First, we do know where a record is and we know where it has been. We can use that information to apply static detection maps, classify Associates and Shoppers, and tune object classification. Similarly, we know the current velocity and all historical velocity figures; so we have a key data item for object classification. We also have the current dimensions and all previous dimensions as well. That means that while we can do no better than the Perception software on the first frame of an object, we may well be able to improve object classification in real-time as additional frames are seen.
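That incremental improvement can be sketched as a streaming classifier that accumulates per-frame speed and size observations. The thresholds and class names here are hypothetical, chosen only to illustrate the idea that later frames can overturn a first-frame classification:

```python
from collections import defaultdict

class StreamingClassifier:
    """Refine an object's class as frames accumulate.
    Thresholds and class names are illustrative, not production values."""
    def __init__(self, max_person_speed=7.0, max_person_length=2.5):
        self.max_person_speed = max_person_speed    # m/s
        self.max_person_length = max_person_length  # m
        self.speeds = defaultdict(list)             # track_id -> speeds seen
        self.lengths = defaultdict(list)            # track_id -> lengths seen

    def update(self, track_id, speed, length, perception_class):
        self.speeds[track_id].append(speed)
        self.lengths[track_id].append(length)
        if len(self.speeds[track_id]) == 1:
            # First frame: we can do no better than the Perception call.
            return perception_class
        # A speed or size no pedestrian reaches implies a vehicle,
        # whatever the first-frame classification said.
        if (max(self.speeds[track_id]) > self.max_person_speed
                or max(self.lengths[track_id]) > self.max_person_length):
            return "vehicle"
        return perception_class

clf = StreamingClassifier()
first = clf.update(7, speed=1.2, length=0.6, perception_class="pedestrian")
second = clf.update(7, speed=9.0, length=0.6, perception_class="pedestrian")
```

The per-track state is tiny (two growing lists, or in practice just running maxima), so this kind of refinement stays cheap enough for edge deployment.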
For data cleaning in the 100-millisecond range, that’s about it.
However, as your time to process goes up, so does the range of applicable techniques. In queue management, we frequently delay the data by about 15 seconds. This is “near” real-time data and, for queue purposes, nearly always good enough. With a fifteen-second delay, we can remove fragments, apply some ghost logic, and even do basic stitching. The end-result isn’t quite as good as what we’re used to with historical data, but it’s often a very significant improvement over the raw data from the Perception layer.
Although stitching probably suffers the most in comparison to historical data, the overwhelming majority of stitches occur within 15 seconds of the break. The main driver of difference is that, in historical mode, we have richer data about each potential stitch and we do more sophisticated analysis of paths/velocity.
There’s nothing magic about fifteen seconds, either. You can delay 2 seconds or 10 seconds or 30 seconds. The longer the period you pick, the closer your “real-time” data will be to the historical data.
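A configurable delay of this kind is often implemented as a hold-back buffer: records sit in a queue for the chosen delay, and anything learned during that window is applied before release. A minimal sketch, where a simple frame-count test stands in for the real fragment logic and the record shape is hypothetical:

```python
import heapq

class DelayBuffer:
    """Hold records for `delay` seconds before release, so that anything
    learned in the window can be applied first. The fragment rule here
    (a minimum frame count) is a deliberately simple stand-in for the
    real cleaning steps."""
    def __init__(self, delay=15.0, min_frames=3):
        self.delay = delay
        self.min_frames = min_frames
        self.pending = []        # min-heap of (timestamp, track_id)
        self.frame_counts = {}   # track_id -> frames seen so far

    def push(self, ts, track_id):
        self.frame_counts[track_id] = self.frame_counts.get(track_id, 0) + 1
        heapq.heappush(self.pending, (ts, track_id))

    def emit(self, now):
        # Release records older than the delay, dropping fragments.
        out = []
        while self.pending and self.pending[0][0] <= now - self.delay:
            ts, track_id = heapq.heappop(self.pending)
            if self.frame_counts[track_id] >= self.min_frames:
                out.append((ts, track_id))
        return out

buf = DelayBuffer(delay=15.0, min_frames=3)
for i in range(5):
    buf.push(float(i), track_id=1)   # five frames: a real track
buf.push(2.0, track_id=2)            # one frame: a fragment
early = buf.emit(now=10.0)           # nothing is 15 s old yet
released = buf.emit(now=20.0)        # track 1 released, fragment dropped
```

The `delay` parameter is exactly the knob described above: set it to 2 seconds and almost no cleaning fits in the window; set it to 30 and the output starts to approach historical quality.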
When we set up clients, we typically configure three different real-time feeds. The first feed, Raw, is exactly what comes from the Perception layer. The second feed, Basic, is the Perception layer with minimal cleaning (typically with < 1-2 seconds of latency). The third feed is the Cleaned feed and is configured with a specific delay (e.g. 15 seconds) and includes at least elements of all the cleaning techniques we provide. That cleaned feed will also provide Associate identification and improved object classification.
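As a rough illustration (field names and values here are hypothetical, not DM1’s actual schema), such a three-feed setup might be configured like this, with a small helper that picks the most-cleaned feed a use-case’s latency budget allows:

```python
# Hypothetical feed configuration; names and values are illustrative,
# not DM1's actual schema.
FEEDS = {
    "raw":     {"delay_s": 0.0,  "cleaning": []},
    "basic":   {"delay_s": 1.5,  "cleaning": ["static_detection_map",
                                              "object_class_tuning"]},
    "cleaned": {"delay_s": 15.0, "cleaning": ["fragment_removal",
                                              "ghost_logic",
                                              "basic_stitching",
                                              "associate_identification",
                                              "object_class_tuning"]},
}

def pick_feed(max_latency_s):
    # Choose the most-cleaned feed that still meets the latency budget:
    # the longest tolerable delay buys the most cleaning.
    eligible = [(cfg["delay_s"], name) for name, cfg in FEEDS.items()
                if cfg["delay_s"] <= max_latency_s]
    return max(eligible)[1]
```

Device control with a 100 ms budget lands on the raw feed; queue management, which can tolerate 15 seconds, gets the cleaned one.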
Providing all three feeds makes sense for a couple of reasons. First, we often want to be able to compare the Raw and the Cleaned data. Especially when we are first setting up a location, playing the Raw feed in Journey Playback and comparing it to the Cleaned historical data is one of the ways we evaluate stitching and identify potential problems. In addition, it’s useful to have each feed since different use-cases will require different latencies. If a use-case isn’t super latency-sensitive, why not use cleaner data? This also provides us with a forensic check on all of our post-processing. If we triggered a false positive, we can look at the output from the Perception software and the cleaning layer to see if we could have done better or if, God forbid, we’re the ones who got it wrong!
All of that being said, you should always think a bit about the data quality necessary and appropriate to your specific real-time use-case. When it comes to queue management, the key lidar data issue is proper people identification. As long as it’s counting the right number of people, things like track breakage don’t matter. Similarly, if you’re doing curbside, understanding the number of vehicles is much more important than understanding the number of pedestrians. Getting object classification right for vehicles is going to be critical. In general, lidar data is quite good at people identification. If we’re targeting real-time queue management or HVAC control in an indoor area, we’re usually confident that we can get by with light cleaning. On the other hand, if you’re targeting dwells, journey metrics or relying on object classifications, the more cleaning you can do, the better.
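The queue-management point can be made concrete: if all you need is the number of people in a zone per frame, you can count positions and ignore track IDs entirely, so track breaks can’t hurt you. A minimal sketch with a hypothetical rectangular zone and detection format:

```python
def in_zone(x, y, zone):
    # Axis-aligned rectangle test; real queue zones are often polygons.
    x0, y0, x1, y1 = zone
    return x0 <= x <= x1 and y0 <= y <= y1

def queue_length(frame_detections, zone, cls="person"):
    """Count people inside the queue zone for a single frame.
    Uses only per-frame positions, so a broken track that reappears
    with a new ID still counts exactly once per frame."""
    return sum(1 for d in frame_detections
               if d["class"] == cls and in_zone(d["x"], d["y"], zone))

zone = (0.0, 0.0, 5.0, 2.0)  # hypothetical queue area, metres
frame = [
    {"class": "person", "x": 1.0, "y": 1.0},
    {"class": "person", "x": 4.5, "y": 0.5},
    {"class": "person", "x": 9.0, "y": 1.0},  # outside the queue zone
    {"class": "cart",   "x": 1.5, "y": 1.0},  # wrong class, not counted
]
count = queue_length(frame, zone)
```

Note that the count is only as good as the `class` field – which is why object classification, not track continuity, is the quality metric that matters for this use-case.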
Handling Differences
One question that the existence of 3 feeds will always raise is about the single source of truth. Back when I was leading digital analytics at EY, we cared a lot about making sure that our clients had a single source of truth for every metric. If you have two systems giving you different answers, people always just pick the one they like the best.
In our world, though, we have multiple feeds to support different real-time operational use-cases. When it comes to a single source of truth there is one – it’s the final, cleaned feed used to process all historical data. Having said that, we do sometimes track what we generated in real-time. For example, with queue metrics, we track the real-time line lengths and estimated wait times and save them. By doing that, we can compare them to the historically accurate measured line lengths and wait times.
Not only does that help us improve the real-time processing, it helps us tune the latency we force into the system. If we know that a five-second latency and a fifteen-second latency match the historical data nearly equally well, we can opt for the shorter delay.
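That tuning step can be sketched as a simple replay comparison: run the real-time pipeline at several candidate delays against recorded data, score each against the historically cleaned line lengths, and take the shortest delay that stays within tolerance. All numbers below are made up for illustration:

```python
def mean_abs_error(estimates, truth):
    return sum(abs(e - t) for e, t in zip(estimates, truth)) / len(truth)

def shortest_acceptable_delay(runs, truth, tolerance=0.5):
    # runs: {delay_seconds: [real-time line-length estimates]} produced
    # by replaying recorded data at each candidate delay.
    acceptable = [d for d, est in runs.items()
                  if mean_abs_error(est, truth) <= tolerance]
    return min(acceptable) if acceptable else max(runs)

truth = [4, 5, 5, 6, 4]          # historically cleaned line lengths
runs = {
    2.0:  [6, 7, 4, 8, 2],       # noisy: fragments/ghosts still present
    5.0:  [4, 5, 6, 6, 4],       # close enough
    15.0: [4, 5, 5, 6, 4],       # matches the historical data
}
best = shortest_acceptable_delay(runs, truth)
```

Saving the real-time outputs is what makes this possible at all: without them, there is nothing to score against the single source of truth.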
So yes, one source of truth, but sometimes it’s nice to know how close to right your original answer was!
The Wrap
I’ll wrap this series up by reiterating a few of the key takeaways. First and foremost, writing a whole series of posts about lidar data quality and cleaning techniques makes it seem like the data must be pretty bad. It’s not. People measurement sensors have improved dramatically in the five plus years I’ve been at this, and lidar is one of the key reasons. It does an excellent job of object identification and high frame-rate precise tracking of movement. It supports almost every people-measurement and flow-analytics use-case we’ve found, including some of the most demanding ones (at opposite ends of the spectrum, full journey tracking and real-time device control).
That being said, data quality is the single most important aspect of any analytics system. Data quality is never, ever perfect and usually it’s not as good as the analyst would like. I’ve spent decades in analytics, and I’ll just reiterate what every user of data will tell you – data quality is always a problem. It’s a problem because analytics starts out hard and the more noise you put in the system, the harder it gets.
It's also really important to realize that from a flow-analytics perspective, data quality is not one number. The biggest mistake I see in people-measurement RFPs is the assumption that data quality is a specific thing that can be represented by a single number like 95% accuracy. That’s wrong. Data quality is specific to use-case, environment, and conditions. And anyone who tells you different is either ignorant or lying.
We get asked for these numbers all the time in RFPs, and we grit our teeth and do our best. But we always try to explain the real facts on the ground.
Finally, and this is the biggest takeaway, the data quality that comes from your sensors and Perception software isn’t the final data quality you can or should expect. I’ve never run into a situation where the data quality couldn’t be significantly improved with intelligent post-Perception software. There are a host of techniques for improving data quality from eliminating bad object identifications to improving object classification to fixing track breakage.
In these past few posts, I’ve tried to provide some fairly detailed guidance to the DIYer on how to get better lidar data. But if you just want to consume high-quality lidar data as quickly as possible, you can also just take advantage of our DM1 platform. Even if you don’t need all the viz tools and analytics, you can take full advantage of the cleaning and distribution layer that underpins the whole system.