The “Aggregate” Cloud Data Pattern

As part of my re:Invent 2024 Innovation talk, I shared three data patterns that many of our largest AWS customers have adopted as they move to the cloud and start using S3 as generalized application storage. This article focuses on the “Aggregate” cloud data pattern, which is the most commonly adopted across AWS customers. You can also watch this six-minute video clip on the Aggregate data pattern for a quick summary.

We started to see the first data lakes, which typically use the Aggregate data pattern, emerge on Amazon S3 about five years after Amazon S3 launched in March 2006. As Don MacAskill, CEO and co-founder of SmugMug, shared, Amazon S3 was adopted immediately by customers like SmugMug to store rapidly growing unstructured data such as images. But developers also wanted to take advantage of the security, availability, durability, scalability, and low cost of S3 for other business uses, and that led to the integration of Amazon S3 into the Hadoop ecosystem for business analytics. Developers who wanted to use Amazon S3 instead of HDFS as their data store depended on the open-source S3A connector, which is part of the Apache Hadoop project, or on solutions like Amazon Elastic MapReduce (EMR) with its built-in S3 integration, so that Hadoop could directly read and write Amazon S3 objects. If you are interested in how customers with large-scale data lakes thought about “Aggregate” over ten years ago, you can read this Netflix blog from January 2013 about how they built a Hadoop-based system on Amazon S3 – a super interesting glimpse into the evolution of data lakes at scale.
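To make that integration concrete, here is a minimal sketch of a Spark job that reads and writes S3 objects through s3a:// paths, the same mechanism the S3A connector provides to the Hadoop ecosystem. The bucket, prefixes, and column names are hypothetical, and the setup details (credentials, the hadoop-aws dependency, Spark and Hadoop versions) will vary by environment.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark relies on the Hadoop S3A connector to address S3 objects
# with the s3a:// scheme, so it can read and write them directly.
spark = SparkSession.builder.appName("s3a-sketch").getOrCreate()

# Hypothetical bucket and prefix: read raw event data straight from S3.
events = spark.read.json("s3a://example-data-lake/raw/events/")

# A trivial transformation, then write the result back to S3 as Parquet.
(events
    .filter(F.col("event_type") == "click")
    .write
    .mode("overwrite")
    .parquet("s3a://example-data-lake/processed/clicks/"))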

If you fast-forward to today, what Netflix said in 2013 (“store all of our data on Amazon’s Storage Service (S3), which is the core principle on which our architecture is based”) is still the core of the Aggregate data pattern, and the pattern has expanded well beyond data lakes. Companies that use Aggregate send data from many different sources (sensor feeds, data processing pipelines, applications that track consumer patterns, data streams, log feeds, databases, data warehouses, and so on) into Amazon S3 to store and use with any application, across compute types, application architectures, and use cases. Because so many customers have adopted Aggregate as a data pattern, Amazon S3 is increasingly used as a generalized application storage layer, either as-is or with additional infrastructure that optimizes S3 for integration, much like what S3A did for open-source Hadoop so many years ago. More than a million data lakes run on AWS these days, but there is much more storage in S3 beyond data lakes. Today Amazon S3 holds more than 400 trillion objects, exabytes of data, and averages over 150 million requests per second. Ten years ago, just under 100 Amazon S3 customers were storing 1 petabyte or more of data; now thousands of customers store over a petabyte, and some manage more than an exabyte. For context, a petabyte is about a thousand terabytes, and an exabyte is about a million terabytes.

The Aggregate data pattern is super common among AWS customers for a few reasons. First, it lets application developers across organizations take advantage of the volume and diversity of data in a company. This is very different from the old on-premises world, where data sets tended to be locked away in vertically integrated applications. By aggregating data in Amazon S3, application owners and other members of the team (such as data scientists or AI researchers) have access to a wide variety of raw and processed data sets to use for experiments and application innovation. The simple act of bringing data together in one place can significantly change the speed of the business.

Second, the Aggregate data pattern uses a federated ownership model, which many companies like because it decentralizes data ownership and fits the culture of their organization. Different organizations own the delivery of data into Amazon S3, and each organization also owns its own usage of the data sets. With Amazon S3 as the foundation of Aggregate, this data pattern gives different organizations the most flexibility in how they use data.

And third, the Aggregate data pattern offers the most choice in tools, because no matter which ISV or native AWS service you choose, you can generally expect an integration with Amazon S3 for data storage.

The key to success with the Aggregate data pattern is standardization on the building blocks of your data infrastructure. The underlying storage of Amazon S3 is one form of standardization, but many customers apply other standards as well. That is because aggregated data sets often grow very quickly, and you want some consistency around that data across your organizations. While federated ownership optimizes for flexibility, AWS customers also want to make sure that teams use the data in the right way and do not duplicate work in data processing, governance, and other data workflows across the organization. For example, Roche, a pioneer in healthcare, uses Amazon S3 to store various data types. Data is standardized in Amazon S3 by running it through a single ETL pipeline that enforces consistent and accurate results across diverse document types, which helps various users, like analysts and business users, get to the right data for the task at hand faster.

There are many other ways that customers apply standards across an Aggregate data pattern, but one of the most common is to standardize on file formats. For example, many of our largest data lake customers, including Netflix, Nubank, Lyft, and Pinterest, commonly use a file format called Apache Parquet to store business data. Any text or numerical data that can be represented as a table, such as credit scoring, transaction records, or inventory reporting, can be stored in a Parquet file. In fact, Parquet is one of the most common and fastest-growing data types in Amazon S3, with exabytes of Parquet data (which is highly compressed) stored today. Customers make over 15 million requests per second to this data, and we serve hundreds of petabytes of Parquet every day. As one example of standardization, Pinterest standardizes their storage on S3, their tabular data on Parquet, and their open table format (OTF) on Apache Iceberg. They have thousands of business-critical Iceberg tables, and last year they adopted large language models (LLMs) to automate query generation against the right Iceberg table.
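To show how little it takes to land tabular data in this format, here is a minimal sketch, using the pyarrow library, of writing a small table as a compressed Parquet object directly to S3. The bucket, prefix, region, and columns are hypothetical, and production pipelines typically do this through engines like Spark, Trino, or AWS Glue rather than hand-written scripts.

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Any data that can be represented as a table, such as credit scores,
# can be written as a columnar, compressed Parquet file.
table = pa.table({
    "customer_id": [1001, 1002, 1003],
    "credit_score": [712, 655, 780],
    "as_of_date": ["2024-11-01", "2024-11-01", "2024-11-01"],
})

# Hypothetical bucket and region; credentials come from the standard
# AWS credential chain (environment, profile, or instance role).
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(
    table,
    "example-data-lake/curated/credit_scores/part-000.parquet",
    filesystem=s3,
    compression="zstd",
)

Because Parquet stores data column by column with per-column compression, analytics engines can scan only the columns a query needs, which is one reason it has become such a common standard for tabular data in S3.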

If you are using the Aggregate data pattern, you are in good company. Many of our AWS customers moved to the cloud with Aggregate and scaled with it. However, particularly in the last 12-18 months, as customers look to put their analytics data to work for AI, more of them are moving to the Curate data pattern, either as the primary data pattern across the organization or for specific teams in the business.


