How we serve real-time tick data of the entire US options market over internet—on a single server
Earlier this year, Databento launched its US equity options data coverage.
One of the magical things about our options feed is that you can get a firehose of real-time tick data, every trade on all 17 exchanges timestamped to the nanosecond—streaming 1.4+ million tickers over the internet.
Just pass it symbols='ALL_SYMBOLS', and that's it. No expensive colo, cross-connect, or private extranet connection needed.
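Here's a minimal sketch of that firehose subscription with our Python client. It assumes your API key is set in the DATABENTO_API_KEY environment variable; iterating the client is one way to consume records, and callbacks are another (check the client docs for details).

import databento as db

# Firehose sketch: every OPRA trade on the feed, one subscription.
live = db.Live()
live.subscribe(
    dataset="OPRA.PILLAR",
    schema="trades",
    symbols="ALL_SYMBOLS",
)
live.start()

# Each record is a decoded, normalized message.
for record in live:
    print(record)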
We're able to do this and deliver it to multiple clients, even with a single server. We're not sharding the feed or buffering it on a distributed streaming platform like Kafka or Pulsar. There's no autoscaling or other trickery. Just simple load balancing for redundancy and horizontal scaling. Let's go into some details of how we do it.
Challenges
OPRA bandwidth and line-rate processing
Before we delve further, if you aren't familiar with US equity options, here's a chart to put this in perspective.
We capture direct prop feeds to get full depth and order book data from practically every US stock exchange and the three TRFs. Each of these alone is notorious for microbursts that are hard to handle, but OPRA dwarfs every single one of them when plotted side-by-side.
A single day of OPRA pcaps takes about 7 TB compressed—after erasure coding and parity, we're basically filling one large hard disk drive per day. OPRA themselves estimate a bandwidth requirement of 37.3 Gbps to support the feed, or 53 Gbps when combined with the CTA SIP alone.
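To make the numbers concrete, here's the napkin math. The 6.5-hour session length and the assumption of running flat-out at the provisioned peak are only there to bound the problem:

# Back-of-envelope: a full regular session at OPRA's provisioned peak rate.
PEAK_GBPS = 37.3
SESSION_SECONDS = 6.5 * 3600  # 09:30-16:00 ET

bytes_at_peak = PEAK_GBPS * 1e9 / 8 * SESSION_SECONDS
print(f"{bytes_at_peak / 1e12:.0f} TB if the feed ran at peak all session")
# Roughly 109 TB, so average utilization plus compression has to bring that
# down by over an order of magnitude to land near 7 TB of pcaps per day.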
Multiplexing all-symbol and filtered subscriptions
But perhaps the most impressive technical feat is that we also let users handpick specific combinations of symbols and multiplex subscriptions to combinations of message types, so one user might be listening to the VIX and SPY options chains for tick-by-tick quotes, while another user might be listening to 1-second and 1-minute aggregates of all symbols.
import databento as db

live = db.Live()
live.subscribe(
    dataset="OPRA.PILLAR",
    schema="trades",
    stype_in="parent",
    symbols=["TSLA.OPT", "SPY.OPT"],  # symbol selection
)
live.start()
Our server needs to provide both a simple firehose and the more stateful capability to manage customized subscriptions and filter symbols. Other vendors bypass this technical challenge by picking only one of the two strategies: a firehose usually requires less processing and fewer queues, while limited symbol subscriptions reduce bandwidth requirements.
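A single live session can also stack subscriptions. Here's a sketch of multiplexing a filtered tick subscription with an all-symbol aggregate subscription; whether a given combination of schemas and symbol lists is available on your session depends on your plan and client version, so check the docs:

import databento as db

# Sketch: one live session multiplexing two subscriptions -- tick-by-tick
# trades for two option chains plus 1-second aggregates for everything.
live = db.Live()
live.subscribe(
    dataset="OPRA.PILLAR",
    schema="trades",
    stype_in="parent",
    symbols=["SPY.OPT", "TSLA.OPT"],
)
live.subscribe(
    dataset="OPRA.PILLAR",
    schema="ohlcv-1s",
    symbols="ALL_SYMBOLS",
)
live.start()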
How we do it
1. Keep things simple and choose boring technology.
The funny thing is, coming into this from the high-frequency market making world, our team wasn't very familiar with the modern tech stack and frameworks for distributed streaming. In HFT and many financial trading systems within the colo, distributed processing is often an antipattern.
When we surveyed the landscape, many of the existing vendors were using Kubernetes, Kafka, WebSocket, multiple flavors of databases like kdb and Vertica, and needed a whole rack of servers. But we did our napkin math and thought all that was unnecessary.
Instead, within a day we mocked up a design based on bare metal servers, simple multicast, binary flat files, lock-free queues, and BSD sockets.
Simple architectures are great here because they mean less traversal down the I/O hierarchy, which is usually the most expensive part of any data-intensive application—it's far cheaper to keep everything on the same thread, or rely on interprocess communication on one processor, than to pass data to another server and turn everything on your stack into a networked call.
We're not alone. There are many great stories about scaling with simple architectures:
2. Flat files and embedded databases are amazing.
Our real-time API is also capable of another interesting trick: it lets you replay intraday historical data from any start time in the session in an event-driven manner, as if it were real-time, and then seamlessly join the actual real-time subscription after the client is caught up.
import databento as db

live = db.Live()
live.subscribe(
    dataset="OPRA.PILLAR",
    schema="trades",
    symbols="ALL_SYMBOLS",
    start="2023-08-31T13:30",  # intraday replay
)
live.start()
This is useful in many trading scenarios: for example, if you started your application late and need to burn in signals over a minimum lookback window, or if your application crashed and needs to be started back up. Having the historical portion emulate the real-time feed also means you can use the same code and the same set of callbacks to handle both, avoiding costly implementation mistakes.
Despite this common use case, no vendor or trading venue offers this. For example, with Coinbase's API, to get the initial book state, you'd have to subscribe to the real-time WebSocket feed, and while buffering up the real-time messages, dispatch an initial book snapshot request over the REST API, then stitch the two parts together by arbitrating sequence numbers.
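For reference, that client-side stitching dance looks roughly like this. The fetch_snapshot and stream_deltas helpers below are hypothetical stand-ins rather than a real venue API; the point is the sequence-number arbitration:

import queue
import threading

from my_venue_client import fetch_snapshot, stream_deltas  # hypothetical helpers

def build_book(product: str):
    buffered: queue.Queue = queue.Queue()

    # 1. Start buffering real-time deltas before requesting the snapshot,
    #    so nothing is missed while the snapshot request is in flight.
    def pump():
        for delta in stream_deltas(product):
            buffered.put(delta)

    threading.Thread(target=pump, daemon=True).start()

    # 2. Request the initial book snapshot; it carries a sequence number.
    book, snapshot_seq = fetch_snapshot(product)

    # 3. Drain the buffer, discard deltas already reflected in the snapshot,
    #    and apply the rest in order.
    while True:
        delta = buffered.get()
        if delta.sequence <= snapshot_seq:
            continue
        book.apply(delta)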
Pushing this onto Databento's server adds a lot of complexity because it means the server is responsible for buffering the real-time feed while the client catches up, which opens the door to denial-of-service attacks—but we'll save that, and the story of our mitigation strategy, for another day.
The way we implemented this is quite simple: plain flat files with some indexing structure, zero-copy transfer and serialization techniques, and a C driver. The whole library, including supporting utilities and data structures, is less than 2,000 lines of pure C.
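Our implementation is in C, but the core idea fits in a few lines of Python: fixed-width records in a flat file plus a sparse in-memory index from timestamp to byte offset, so a replay from any start time is a seek followed by a sequential read. The record layout and index granularity below are illustrative, not our actual on-disk format:

import bisect
import struct

RECORD = struct.Struct("<Q56s")  # illustrative: 8-byte ns timestamp + 56-byte payload

def build_index(path: str, every: int = 10_000):
    """Sparse index: (timestamp, byte offset) for every Nth record."""
    index = []
    with open(path, "rb") as f:
        i = 0
        while chunk := f.read(RECORD.size):
            if len(chunk) < RECORD.size:
                break  # ignore a trailing partial record
            ts, _ = RECORD.unpack(chunk)
            if i % every == 0:
                index.append((ts, i * RECORD.size))
            i += 1
    return index

def replay(path: str, index, start_ts: int):
    """Seek close to start_ts via the index, then stream records from there."""
    pos = bisect.bisect_right(index, (start_ts, -1)) - 1
    offset = index[pos][1] if pos >= 0 else 0
    with open(path, "rb") as f:
        f.seek(offset)
        while chunk := f.read(RECORD.size):
            if len(chunk) < RECORD.size:
                break
            ts, payload = RECORD.unpack(chunk)
            if ts >= start_ts:
                yield ts, payload

Once the reader catches up to the end of the file, the session hands off to the live stream.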
A fancier term for this is a "serverless database," as coined by Man Group's team for Arctic, or an "embedded database," as Google's LevelDB and Meta's RocksDB are described.
Many notable projects have achieved performance gains by adopting embedded databases. Ceph saw significantly improved performance after an architectural change to its BlueStore backend, which is based on RocksDB. RocksDB is also the default state store in Kafka Streams and the entry index database for bookies in Pulsar. One of the major departures of Arctic's current design from its previous iteration is the switch from a MongoDB backend to its new serverless design.
3. Know when to throw more (expensive) hardware at the problem.
Developer time is extremely expensive. A $10k FPGA offload NIC, a $20k switch, an extra server—all that's cheap compared to a month of wasted developer time.
The earliest design of our options stack involved sharding the OPRA lines across several servers, then muxing the smaller, normalized output on a separate layer of distribution servers. This hierarchical architecture would have let us use cheaper NICs and scale linearly and independently with growth in incoming feed message rate versus growth in client demand—which would theoretically reduce server cost in the long run. The architecture is in fact not novel; we know of at least one major retail brokerage that uses it.
But we ended up with a shared-nothing, distributed monolith design that only needed a single beefier server, with a 100G Napatech SmartNIC, to process the entire feed and fan out customer traffic. Scaling and redundancy are achieved by adding more servers and simple load balancing (direct server return) over them.
4. Kernel bypass networking is cheap.
One of the first "wow" moments of our early careers in electronic trading was simply compiling Mellanox's userspace networking library, then called Voltaire Messaging Accelerator. The whole ordeal took less than 30 minutes, and we saw a nearly five-fold decrease in UDP round-trip latency and a significant improvement in throughput.
Two parts of our OPRA real-time architecture employ kernel bypass: the live gateway application and the load balancer.
These days, you can pick up a Mellanox ConnectX-4 NIC for as little as $70 on eBay and a brand new Xilinx (Solarflare) NIC for just over $1k. If you're providing any hyperscale or performance-sensitive web service, you should know about kernel bypass. There's a good introduction on Cloudflare's blog by Marek Majkowski.
5. Use a binary format and efficient encoding/decoding.
For anything involving tick data or full order book data, binary encoding is almost a must. It's strange to put in all the effort to support a massive stack of Apache data frameworks, bare metal Kubernetes, IaC tools, and microservices, and then lose a significant amount of time to JSON encoding and decoding.
We wrote our own binary format, Databento Binary Encoding (DBN), which comes with a fixed set of struct definitions that enforce the way we normalize market data. Among the many optimizations in DBN is that we squeeze every MBO ("Level 3") message into a single cache line.
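The constraint is easiest to see with a fixed-width record you can pack and unpack in one shot. The field choices below are only an illustration of the single-cache-line idea, not DBN's actual layout:

import struct

# Illustrative fixed-width "Level 3" record padded to exactly one 64-byte
# cache line: decoding is a single unpack with no parsing state.
MBO_RECORD = struct.Struct("<QQqIIBcc29x")
# Fields: ts_event (u64), order_id (u64), price (i64, fixed-point),
# size (u32), instrument_id (u32), flags (u8), action, side, then padding.
assert MBO_RECORD.size == 64

def decode(buf: bytes):
    return MBO_RECORD.unpack(buf)

def encode(ts, order_id, price, size, instrument_id, flags, action, side):
    return MBO_RECORD.pack(ts, order_id, price, size, instrument_id,
                           flags, action, side)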
6. Build it yourself.
Because of the storage and bandwidth requirements, it's important to note that there aren't many official, real-time OPRA distributors like us.
When you whittle down to the API distributors that can deliver over internet or public cloud, there are only a handful. Of the remaining ones, several are actually white-labeling and redistributing another vendor's feed or using other ISVs' feed handlers, a practice that's common in the market data space. (Some vendors whose feeds and parsers we've seen get white-labeled include OnixS, Exegy, dxFeed, and QUODD.) Instead, we write our own feed parsers and book builders.
It's no unique insight that building it yourself allows you to optimize further. But one important benefit specific to market data is that a lot of data integrity is lost in the way data is normalized. As an aside, a good normalization format will usually allow you to "compress" the raw multicast feed into a more lightweight feed and discard unnecessary fields and bloat.
Writing a feed parser can seem daunting to a newcomer in this space, but at a market making firm, you usually get to a point where each engineer is churning out a new parser every 2 weeks.
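The overall shape of a parser is fairly mechanical. Here's a toy sketch under an assumed wire format (a 2-byte body length and a 1-byte message type per message), which is not OPRA's actual encoding; a real parser follows the venue's published spec field by field:

import struct

HEADER = struct.Struct("<HB")  # assumed framing: u16 body length, u8 message type

def parse_packet(payload: bytes):
    """Walk one packet's payload and yield (msg_type, body) pairs."""
    offset = 0
    while offset + HEADER.size <= len(payload):
        length, msg_type = HEADER.unpack_from(payload, offset)
        offset += HEADER.size
        yield msg_type, payload[offset : offset + length]
        offset += length

# Dispatch table: one small decoder per message type the book builder cares
# about; unknown types are skipped rather than treated as errors.
HANDLERS = {
    0x01: lambda body: ("trade", body),
    0x02: lambda body: ("quote", body),
}

def handle_packet(payload: bytes):
    for msg_type, body in parse_packet(payload):
        handler = HANDLERS.get(msg_type)
        if handler is not None:
            yield handler(body)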
7. Network.
At the end of the day, delivering order book data over internet requires a significant investment in your network. There's no way around this but to aggregate a lot of transit bandwidth from several tier 1 providers—something we've done with the help of our partners like Point5, Netris, NVIDIA Networking, and TOWARDEX.
If you're connecting over internet or cloud, one metric that correlates surprisingly strongly with the quality of a financial data or connectivity provider is its ASN: look at the observed AS paths and the quality of its peers. This gives you a sense of how extensively the provider has built up its network.
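If you want to poke at this yourself, public BGP data services such as RIPEstat expose this information. The endpoint and response fields below are assumptions made to illustrate the idea; check the RIPEstat documentation before relying on them:

import requests

def asn_overview(asn: str) -> dict:
    """Fetch a basic overview of an ASN from RIPEstat's public data API.
    (Endpoint and field names are assumptions; verify against the docs.)"""
    resp = requests.get(
        "https://stat.ripe.net/data/as-overview/data.json",
        params={"resource": asn},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]

# From here, related endpoints (announced prefixes, AS neighbours) give a
# rough picture of how well-connected a provider's network really is.
print(asn_overview("AS3356"))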
Some caveats about this methodology:
With that said, here's a loose list of what we consider to be the top 15 financial data providers for a cloud or internet-based solution according to this methodology:
Aside from this, having dedicated interconnects to major cloud on-ramps and proximity hosting locations allows us to provide stable feeds and distribute our traffic further. At the moment, we support dedicated interconnects to AWS, Google, and Azure at the Equinix CH1 and NY5 on-ramps, as well as major colocation sites in Europe like Equinix LD4 and FR2.
8. Invest in people and a strong engineering culture.
This is self-explanatory. It's fitting to add parting words from our CEO and CTO:
"All this wouldn't have been possible without the work of the awesome engineers on our core and systems teams. These folks took breaks from lucrative careers in high-frequency trading and large tech firms to join a unheard startup based in Salt Lake City. They'll have our eternal respect."
For more articles like this, see Databento's engineering blog.