Introducing M3
The Time-Series Infrastructure that Powers Metrics at Uber

Time series analysis is one of the fastest growing areas across disciplines such as machine learning and probabilistic programming. The emergence of markets such as the internet of things (IoT) and social networks has increased the relevance of time series infrastructure that can power analysis over real-time data. The importance of time series analysis has influenced the release of open source stacks such as Graphite and Prometheus. However, many of the top internet companies have regularly outgrown those stacks and pursued the path of building their own time series infrastructure. Uber is one of the companies that has contributed the most to the time series data infrastructure space. Earlier this year, the transportation giant decided to open source the stack that has been powering its time series analysis for years: M3.

Time is a core element of the Uber experience across its different apps. As a result, time series analysis is arguably even more relevant to Uber than to other types of large-scale businesses. Initially, Uber relied on traditional time series stacks such as Graphite, Nagios, StatsD, and Prometheus to power its time series metrics. While that technology stack worked for a while, it was not able to keep up with Uber’s stratospheric growth and, by 2015, the company was in need of a proprietary time series infrastructure. That was the origin of M3, which was designed with five key guiding principles:

  • Improved reliability and scalability: to ensure we can continue to scale the business without worrying about loss of availability or accuracy of alerting and observability.
  • Capability for queries to return cross-data center results: to seamlessly enable global visibility of services and infrastructure across regions.
  • Low latency service level agreement: to make sure that dashboards and alerting provide a reliable query latency that is interactive and responsive.
  • First-class dimensional “tagged” metrics: to offer the flexible, tagged data model that Prometheus’ labels and other systems made popular (illustrated briefly just after this list).
  • Backwards compatibility: to guarantee that hundreds of legacy services emitting StatsD and Graphite metrics continue to function without interruption.
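To make that last data-model point concrete, the snippet below contrasts a Graphite/StatsD-style dotted metric name with a dimensional tagged metric. It is only an illustrative sketch; the metric names, tag keys, and values are made up for the example and are not taken from Uber's systems.

```go
package main

import "fmt"

func main() {
	// Graphite/StatsD-style metric: every dimension is encoded into a dotted
	// path, so slicing the data by a new dimension means parsing names.
	graphiteName := "production.sjc1.driver-api.http.requests.count"

	// Dimensional "tagged" metric (the style Prometheus labels popularized):
	// the name stays stable and dimensions are explicit key/value pairs that
	// can be filtered and aggregated independently. Values here are made up.
	taggedMetric := map[string]string{
		"__name__": "http_requests_total",
		"env":      "production",
		"zone":     "sjc1",
		"service":  "driver-api",
	}

	fmt.Println(graphiteName)
	fmt.Println(taggedMetric)
}
```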

M3

High scalability and low latency are key principles of the M3 architecture. In any given second, M3 processes 500 million metrics and persists another 20 million aggregated metrics. Extrapolating those numbers to a 24-hour cycle (roughly 520 million metrics per second × 86,400 seconds) indicates that M3 handles around 45 trillion metrics per day, which is far beyond the performance of any conventional time series infrastructure. To handle that throughput, M3 relies on an architecture based on the following components:

· M3DB: M3DB is a distributed time series database that provides scalable storage and a reverse index of time series. It is optimized as a cost-effective and reliable real-time and long-term retention metrics store and index.

· M3Query: M3 Query is a service that houses a distributed query engine for querying both real-time and historical metrics, supporting several different query languages. It is designed to support both low latency real-time queries and queries that can take longer to execute, aggregating over much larger datasets, for analytical use cases.

· M3 Aggregator: M3 Aggregator is a service that runs as a dedicated metrics aggregator and provides stream-based downsampling, based on dynamic rules stored in etcd (a conceptual sketch of such a rule follows this list).

· M3 Coordinator: M3 Coordinator is a service that coordinates reads and writes between upstream systems, such as Prometheus, and M3DB.

· M3QL: A query language optimized for time series data.
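To give a rough sense of what those dynamic aggregation rules express, here is a conceptual sketch. The type, field names, and values below are assumptions made for illustration and are not M3's actual rule format; a rule essentially says which metrics to match, at what resolution to keep the rolled-up stream, and for how long.

```go
package main

import (
	"fmt"
	"time"
)

// DownsampleRule is a hypothetical representation of an aggregation rule:
// metrics matching Filter are rolled up to Resolution and kept for Retention.
type DownsampleRule struct {
	Filter     map[string]string // tag matchers selecting the input metrics
	Resolution time.Duration     // output sample interval after aggregation
	Retention  time.Duration     // how long the downsampled data is stored
}

func main() {
	rule := DownsampleRule{
		Filter:     map[string]string{"service": "driver-api"}, // hypothetical service name
		Resolution: time.Minute,
		Retention:  40 * 24 * time.Hour,
	}
	fmt.Printf("downsample %v to %s, keep for %s\n", rule.Filter, rule.Resolution, rule.Retention)
}
```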

The relationship between the core M3 components is shown in the following figure:

Let’s explore some of these architectural building blocks in more detail.

M3DB

M3DB is the core storage component of the M3 infrastructure. The stack was built in Go and designed for large-scale time series analysis from the ground up. The storage model is both distributed and strongly consistent, which facilitates scalability while maintaining reliable write semantics. M3DB uses both in-memory and on-disk storage, keeping frequently accessed records in memory while data used only for long-term calculations lives on disk. From a management standpoint, M3DB is highly configurable and supported on a wide range of runtime environments.

One of the main contributions of M3DB is its clever storage model. Most transformations within a specific query are applied across different series for each time interval. For that reason, M3DB stores data in a columnar format, improving the memory locality of the data. Additionally, data is split across time into blocks, enabling most transformations to work in parallel on different blocks and thereby increasing computation speed.
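As a minimal conceptual sketch of that idea (the types below are illustrative only and are not M3DB's actual data structures), a series can be thought of as a set of time blocks, each holding a contiguous column of samples, so a transformation over one window touches contiguous memory and separate blocks can be processed in parallel:

```go
package main

import (
	"fmt"
	"time"
)

// Block holds one series' samples for a single time window as a contiguous column.
type Block struct {
	Start      time.Time     // beginning of the block's time window
	Resolution time.Duration // spacing between consecutive samples
	Values     []float64     // column of samples, contiguous in memory
}

// Series groups the blocks for one tagged time series; blocks can be processed in parallel.
type Series struct {
	Tags   map[string]string
	Blocks []Block
}

func main() {
	s := Series{
		Tags: map[string]string{"service": "driver-api", "zone": "sjc1"}, // hypothetical tags
		Blocks: []Block{
			{Start: time.Now().Truncate(2 * time.Hour), Resolution: 10 * time.Second, Values: []float64{1, 3, 2, 5}},
		},
	}
	fmt.Printf("%d block(s); first block has %d samples\n", len(s.Blocks), len(s.Blocks[0].Values))
}
```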

M3QL

Since the early days, M3 has supported the Prometheus Query Language (PromQL) and Graphite’s path navigation language. To extend the data access capabilities of M3, Uber decided to build M3QL, a pipe-based language that complements the capabilities of path navigation with richer data access routines. Just like other pipe-based languages, M3QL allows users to read queries from left to right, offering a rich syntax as shown in the following figure.
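The figure in the original post shows real M3QL queries; as a rough, hypothetical illustration of the left-to-right pipe style (the metric names, tags, and functions below are assumptions and should not be read as verified M3QL syntax), such a query chains a fetch through successive transformations:

```
fetch service:driver-api status:5xx
  | perSecond
  | sum region
```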

M3 Query Engine

Just like other M3 components, the query engine was written in Go from the ground up and optimized for high throughput. Recent numbers from Uber put M3’s query engine at around 2,500 queries per second. The query engine workflow is structured into three main phases: parsing, execution, and data retrieval. The query parsing and execution components work together as part of a common query service, and retrieval is done through a thin wrapper over the storage nodes. To support multiple query languages such as M3QL or PromQL, M3 introduces an intermediate representation based on a directed acyclic graph (DAG), which abstracts the query that needs to be executed. The current implementation of the query engine is tied to M3DB, but the design can support other time series databases.
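As a minimal sketch of what such an intermediate representation could look like (the node shape and operator names below are assumptions for illustration, not M3's actual internals), each query-language front end parses its own syntax and emits the same DAG, whose leaves retrieve data and whose inner nodes apply transformations:

```go
package main

import "fmt"

// Node is one step in a query plan; Inputs are the edges of the DAG.
type Node struct {
	Op     string            // e.g. "fetch", "perSecond", "sum"
	Args   map[string]string // operator parameters such as tag matchers or grouping keys
	Inputs []*Node           // upstream nodes whose output this node consumes
}

func main() {
	// The same plan could have been produced from a PromQL or an M3QL front end.
	fetch := &Node{Op: "fetch", Args: map[string]string{"service": "driver-api"}} // hypothetical tag
	rate := &Node{Op: "perSecond", Inputs: []*Node{fetch}}
	plan := &Node{Op: "sum", Args: map[string]string{"by": "region"}, Inputs: []*Node{rate}}

	fmt.Println(plan.Op, "<-", plan.Inputs[0].Op, "<-", plan.Inputs[0].Inputs[0].Op)
}
```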

M3 Coordinator

M3 is a fairly complete platform on its own, but it also integrates with mainstream time series analysis systems such as Prometheus. M3 Coordinator is a service that provides APIs for reading from and writing to M3DB at a global and placement-specific level. It also acts as a bridge between Prometheus and M3DB: using this bridge, M3DB serves as long-term storage for Prometheus via the remote read/write endpoints.
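As a minimal sketch of that integration, Prometheus is pointed at the coordinator through its standard remote read/write configuration. The host, port, and endpoint paths below are assumptions based on common M3 coordinator defaults; check the M3 documentation for the exact values in your deployment.

```yaml
# prometheus.yml (excerpt) -- endpoint paths are assumed defaults, verify against the M3 docs
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```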

Getting started with M3 is relatively easy, as the entire platform is packaged as Docker containers. The infrastructure has been tested on major cloud platforms such as Google Cloud, and the entire source code is available in its GitHub repository.

M3 is certainly one of the most advanced infrastructures for time series analysis in the current market. While M3 might lack the support of commercial alternatives, it comes with the robustness developed during years of supporting Uber’s time series analysis processes. Doesn’t get much better than that.

