Working with high-frequency market data: Data integrity and cleaning (Part 1)

This is Part 1 of a multi-part series on using high-frequency market data by Databento. We'll cover topics such as data integrity and cleaning, along with basic concepts like timestamping, resolution, subsampling, and more.

High-frequency market data here broadly refers to any full order book data (sometimes called "L3"), market depth (also called "L2"), and tick data (sometimes called "last sale"). Because these terms tend to be used interchangeably without clear convention, we'll refer to them as "MBO", "MBP", and "tick-by-tick trades" respectively.


Background

We'll start with a hot take: You should not be cleaning your market data.

Much of the public literature on market data cleaning and data integrity seems to cite prior work by the Olsen & Associates team, who published their approach to "adaptive data cleaning" in An Introduction to High-Frequency Finance (2001).

One of the top Google search results for "cleaning tick data" yields an article titled "Working with High-Frequency Tick Data", which suggests removing outliers and cites the now-dated practices of Brownlees and Gallo (2006). That paper, in turn, cites Falkenberry's white paper (2002) for tickdata.com, which recurses back to Gençay, Dacorogna, Müller, Pictet, and Olsen. Publications as recent as 2010 still refer to these methods.

While these approaches make sense from an academic angle, they're different from best practices in actual trading environments.

What these adaptive trading filters often consider to be "outliers" are false positives—examples of true trading activity that we want to keep. We'll suggest some better ways of handling this.

True trading behavior: SPY on NYSE National's prop feed. Outliers and large dislocations in the BBO are commonplace even for widely-traded symbols on less liquid venues. Extracted from Databento Equities Basic.
Another example of true behavior: AAPL on NYSE National's prop feed. The BBO fluctuates with a different, more deterministic pattern. Extracted from Databento Equities Basic.


Modern market data infrastructure

To understand why it's usually unnecessary to clean market data these days, it's important to know a few things about modern market data infrastructure:

  1. Raw feeds and direct connectivity are usually overprovisioned to the point that you rarely have packet loss if your infrastructure is set up properly downstream of the venue handoff.
  2. Modern feeds and matching engines are well-designed enough that the days of spurious prints are largely gone. Most of the academic literature and white papers on cleaning tick data and real-time filters for tick data errors are based on the behavior of older iterations of the SIPs and matching engines.
  3. The widespread adoption of electronic trading over the last two decades has mostly squashed tick data anomalies. Any bug in a raw feed invites microstructural exploits and quickly gets reported away. Participants scrutinize edge cases, matching scenarios, and sequencing and timestamping issues so thoroughly that such problems are actually uncommon.
  4. Many trading venues use white-labeled versions of more mature trading platforms from larger market operators like Nasdaq (e.g. J-GATE), Deutsche Börse, Currenex, etc., so pricing data errors from poorly implemented matching engines have become much less common.
  5. Checksumming and other practices are usually in place to prevent silent bit rot and other physical events that can actually cause price anomalies.

Exceptions

That said, there are exceptions where it's important to scrub and clean your market data:

  • If you're using a data redistributor that provides a very lossy normalized feed or is known to mutate the raw data silently, the benefits of scrubbing the data could exceed the downsides.
  • If you're primarily trading in OTC, open outcry or auction markets or more exotic trading venues with outdated infrastructure (e.g. a UDP unicast feed with no recovery mechanism), perhaps it's still important to clean the data.
  • If you're using non-pricing data which requires manual data entry or scraping, such as financial statements, SEC filings, corporate actions, etc., it is often better to scrub the data and preprocess it upstream of your application logic so you don't have code repetition.
  • If you're using low-resolution market data from a redistributor with an opaque subsampling process, which can be prone to vendor-introduced errors that are not present on the direct feeds.

Databento architecture: We generate all of the low-resolution and subsampled data, even daily OHLCV aggregates, in one step, on the same server that ingests and parses the direct feeds, to ensure consistency.
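
To make the one-step aggregation idea concrete, here's a minimal sketch of deriving daily OHLCV bars directly from tick trades with pandas. The column names (ts_event, price, size) and the sample values are illustrative assumptions, not any particular vendor's schema.

```python
import pandas as pd

# Illustrative tick trades; the column names and values are assumptions,
# not a specific vendor's schema.
trades = pd.DataFrame(
    {
        "ts_event": pd.to_datetime(
            [
                "2024-01-02 14:30:00.001",
                "2024-01-02 14:30:00.250",
                "2024-01-02 15:10:12.003",
                "2024-01-03 14:31:05.500",
            ],
            utc=True,
        ),
        "price": [472.10, 472.15, 471.90, 473.05],
        "size": [100, 50, 200, 75],
    }
).set_index("ts_event")

# Derive the lower-resolution series straight from the tick stream in one
# pass, rather than resampling another vendor's bars.
daily = trades["price"].resample("1D").ohlc()
daily["volume"] = trades["size"].resample("1D").sum()
daily = daily.dropna(subset=["open"])  # drop calendar days with no trades
print(daily)
```

The same pass can emit 1-second, 1-minute, and daily aggregates from the identical tick stream, which is what keeps the different resolutions consistent with each other.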


An example: How fast do venues fix bugs in their direct feeds?

"Any sort of bug in the raw feeds invite microstructural exploits quickly get reported away. Participants scrutinize any edge cases[...] very thoroughly so they're actually uncommon."

Going off on a tangent: proprietary trading firms are very quick to test new feeds, protocols, and connectivity. You'd be surprised how fast bugs get caught in the direct feeds when there's a significant amount of daily PnL on the line.

I was at one of the early trading firms to implement a client for CME Group's MDP 3.0 feed, over a year before it went into production.

On our first attempt to test my parser in the new release environment in March 2014, we found a bug where the packet sequence number would reset to 1 whenever tag 369-LastMsgSeqNumProcessed changed on the market recovery feed. This seemed like a middle-ground issue: just trivial enough that no one would report it right away.

In any case, we reported this to the exchange and learned that it was already a known issue; someone had beaten me to the punch that same day! The issue was fixed on the exchange's side by the weekend.

(One reason we pulled together Databento's core team from various prop trading firms is that it takes a certain team culture and patience to chase these types of issues down to the venue side, and it's hard to instill this in people who haven't had production trading experience. It's so much easier for a support team to dismiss bug reports and blame them on the user, as most vendors do.)


Data degradation from cleaning

The more important reason to avoid "cleaning" your data, in the conventional sense, is that the cleaning process almost always degrades the data quality.

  1. Ex-post cleaning introduces desync between real-time and historical data. Most "data cleaning" strategies I've seen are applied offline on a second pass of stored historical data. This is the worst kind of data cleaning because it introduces desync between the historical data that you're doing analysis on and the real-time data that you'll see in production. If your data provider advertises that their data is scrubbed for errors and touts their data cleaning techniques, or if you report missing intraday data to your provider and it's suddenly and quietly patched, they're probably guilty of this antipattern.
  2. Anomalies are usually "true" behavior. If you have a lossless feed (no sequence gaps), then any gaps that remain are likely real. You usually want to see these anomalies because they are real, they affect a vast number of other trading participants, and you want to take advantage of them. Examples include times when the source feed fails over to its disaster recovery site. What better time to be quoting a wide spread than when most participants arbitrarily refuse to trade because of their tick data filters or operational risk mitigations during a feed failover?
  3. Real-time or adaptive filters have non-determinism that is very difficult to replicate in simulation. Even if you're applying "cleaning" inline on real-time data, it usually imparts non-deterministic latency and other effects that are hard to replicate in backtest or simulation.

With a lossless feed, most remaining data gaps are actually real. For example, SIAC's failover of OPRA on Oct 19, 2023 would've led to real gaps in most redistributors' feeds. It's usually better to retain these gaps in the historical data.


Better strategies for data quality assurance

Before considering cleaning your market data, here are some strategies that we recommend first:

  1. It's usually better to build robustness into your trading model or business logic. For example, truncating or winsorizing your features can be done in constant time and also improves model fit (see the winsorizing sketch after this list). It's better to leave the decision of what counts as an "anomaly" to downstream application logic, e.g. your trading strategies. Taking this point to the extreme, it's almost better for your data to have errors, because that forces you to write explicit trading strategy behavior to handle data errors and events like gateway failover.
  2. If you do see spurious prints and anomalies, it's better to systematically fix the parser bugs or normalization edge cases that are most likely responsible. It's also often better to do a full regeneration of history after fixing those bugs than to surgically patch specific days of data, to ensure consistent data versioning.
  3. Instead of checking values, perform internal consistency checks: construct book snapshots, convert market-by-order (MBO) data to a level book with aggregated depth (MBP), or compute OHLCV aggregates from incremental trade events and compare them to the session statistics messages provided directly by the trading venue (see the consistency-check sketch after this list).
  4. Report the anomalies to the trading venue rather than address them in your own code. In rare cases, you're seeing a recurring bug that the venue needs to fix. This is one of the reasons we include the original sequence numbers and channel IDs in Databento's normalized data—it allows you to work directly with the trading venue on trade breaks, matching errors, and data issues.
  5. Inconsistent book state can be marked with flags. This is preferable to discarding data altogether. At Databento, we use bit flags like F_BAD_TS_RECV and F_MAYBE_BAD_BOOK to indicate this (see the flag-check sketch after this list).
  6. Out-of-order events and missing data may be self-healed with "natural refresh". Natural refresh is a common strategy in low-latency trading systems that don't want to pay the cost of A/B arbitration or gap retransmission. For example, if you see a cancel message for an order ID n that you've never seen before, you can assume that you missed the original add message for order ID n and that the gap has now corrected itself (see the natural-refresh sketch after this list).
  7. If you have to use a redistributor's normalized feed, pick a data provider that has strong data versioning practices, writes its own parsers, and can trace the provenance of the data back to the original raw packets. (More on this in Part 2.) We address data versioning at Databento partly with an API endpoint for dataset condition and last-modified date, plus a public portal for reporting data errors (see the dataset-condition sketch after this list).
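
For item 1, here's a minimal sketch of winsorizing a feature vector with numpy. The percentile bounds are arbitrary illustrative choices; in a live system the bounds would typically be estimated offline, so each new observation is clamped in constant time.

```python
import numpy as np

def winsorize(x: np.ndarray, lower_pct: float = 0.5, upper_pct: float = 99.5) -> np.ndarray:
    """Clamp features to percentile bounds instead of deleting "outliers".

    The underlying market data is left untouched; only the model's view
    of it is bounded.
    """
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

# A return series with one extreme-but-real print that we bound rather
# than discard.
returns = np.array([0.0001, -0.0002, 0.0003, 0.0450, -0.0001])
print(winsorize(returns, lower_pct=1.0, upper_pct=99.0))
```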
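
For item 3, a small sketch of an internal consistency check: aggregates computed from incremental trade events are compared against the venue's session statistics. The SessionStats fields are hypothetical; real statistics messages vary by feed.

```python
from dataclasses import dataclass

@dataclass
class SessionStats:
    # Hypothetical fields; real session statistics messages differ per venue.
    high: float
    low: float
    volume: int

def check_session(trades: list[tuple[float, int]], stats: SessionStats) -> list[str]:
    """Cross-check trade-derived aggregates against the venue's own statistics."""
    problems: list[str] = []
    prices = [price for price, _ in trades]
    total_volume = sum(size for _, size in trades)
    if max(prices) != stats.high:
        problems.append(f"high mismatch: {max(prices)} vs {stats.high}")
    if min(prices) != stats.low:
        problems.append(f"low mismatch: {min(prices)} vs {stats.low}")
    if total_volume != stats.volume:
        problems.append(f"volume mismatch: {total_volume} vs {stats.volume}")
    return problems

# An empty result means the trade stream and the venue's statistics agree.
print(check_session([(100.0, 5), (100.5, 10)], SessionStats(high=100.5, low=100.0, volume=15)))
```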
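
For item 5, a sketch of the mark-and-keep approach using a record's flags byte. The bit positions below follow our reading of the DBN flag definitions and should be treated as assumptions; check the current Databento documentation before relying on them.

```python
# Assumed bit positions for the DBN flags field; verify against the spec.
F_BAD_TS_RECV = 1 << 3     # capture-server timestamp is unreliable
F_MAYBE_BAD_BOOK = 1 << 2  # book may be inconsistent, e.g. during recovery

def is_suspect(flags: int) -> bool:
    """Flag suspect records for downstream logic instead of discarding them."""
    return bool(flags & (F_BAD_TS_RECV | F_MAYBE_BAD_BOOK))

for flags in (0b0000_0000, 0b0000_1000, 0b0000_0100):
    print(f"flags={flags:08b} suspect={is_suspect(flags)}")
```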
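
For item 6, a minimal order-book sketch of natural refresh: a cancel for an unknown order ID is taken as evidence that the missed add has already resolved itself, rather than as a reason to request retransmission. The class and field names are illustrative.

```python
class NaturalRefreshBook:
    """Toy MBO book that self-heals missed adds instead of requesting gaps."""

    def __init__(self) -> None:
        self.orders: dict[int, tuple[float, int]] = {}  # order_id -> (price, size)

    def on_add(self, order_id: int, price: float, size: int) -> None:
        self.orders[order_id] = (price, size)

    def on_cancel(self, order_id: int) -> None:
        # If the add was never seen, the order is already absent from our
        # book state: drop the cancel silently rather than rebuilding.
        self.orders.pop(order_id, None)

book = NaturalRefreshBook()
book.on_add(1, 100.25, 10)
book.on_cancel(2)   # cancel for an order we never saw: self-heals
print(book.orders)  # {1: (100.25, 10)}
```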
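
For item 7, a hedged sketch of querying per-day dataset condition and last-modified dates. It assumes the databento Python client exposes a metadata.get_dataset_condition method as described in its documentation; the dataset code, date range, and return shape are assumptions to verify against the current API reference.

```python
import databento as db

# Assumes databento's historical client and metadata.get_dataset_condition;
# check the current API reference for exact parameters and return fields.
client = db.Historical("YOUR_API_KEY")
conditions = client.metadata.get_dataset_condition(
    dataset="GLBX.MDP3",
    start_date="2023-10-01",
    end_date="2023-10-31",
)
for entry in conditions:
    # Each entry is expected to report the per-day condition and a
    # last-modified date, which lets you pin the data version you tested on.
    print(entry)
```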

Databento uses bit flags like F_BAD_TS_RECV and F_MAYBE_BAD_BOOK to mark records with unreliable timestamps or potentially inconsistent book state, rather than dropping them.

We embed the original sequence numbers and channel IDs where available, so that you can interact with the trading venue's support desk and sequence order execution events.


Part 2

In the next part of this series, we'll talk about white-labeling and its hidden implications for data integrity, along with more best practices for performing data integrity checks.


About Databento

Databento is a simpler, faster way to get market data. We provide APIs for real-time and historical market data, sourced directly from primary colocation sites.

We designed our APIs after years of experience writing feeds at leading market making firms, building platforms that trade at a scale of 2-10% of ADV on major markets like US equities and futures, and implementing popular data solutions like Bloomberg's Corporate Actions v2, Pico's 100G Corvil capture device, and more.

To learn more about us, see our About page. For more articles like this, check out our blog.

Comments

Sam Horowitz

FX Electronic Markets, Risk Management and Data, ACI Global FXC, ACI FMA UK Exec Committee #Neurodivergent #Neurodiverse


Great piece. The bottom line is that the firms that thrive rather than simply exist in this space usually do so by owning this aspect of their destiny. Which is at times a labour of love, but is, by and large, the only way to guarantee true confidence in your data.

Lou Lindley, great post as always, but I think I might have to counter-post you, as I have just been through this process with very positive results, since we have non-deterministic data feeds. Basically, a couple of your assumptions don't hold for crypto, namely:
  • Well-designed matching engines
  • Overprovisioned data feeds
  • High-speed data cleaning considered an edge, so little push for venues to fix it
  • Crappy non-deterministic network routing

Fred Viole

OVVO Financial Systems | ovvolabs.com


"Outliers not noise, they happened, they were caused by some exquisitely rare combination of events. Sometimes most valuable data point." https://twitter.com/EdwardTufte/status/435588215496396800
