How to Build a Scalable Big Data Analytics Pipeline

Set up an end-to-end system at scale

Data is a vital element of today’s innovative enterprise. Data-driven decision making allows corporations to adapt to an unpredictable world. The ability to report on data is the backbone of business analytics. With the unprecedented growth of data in the 21st century, big data is no longer a buzzword but a reality that companies have to face.

Thou shalt love thy data as thyself

Data expands exponentially, and data systems must scale with it at all times. Building a big data pipeline at scale, and integrating it into an existing analytics ecosystem, is a real challenge for those who are not familiar with either.


To build a scalable big data analytics pipeline, you must first identify three critical factors:

Input data

Whether your input data is time-series or not, you must understand its nature. It determines the format in which you store your data, what you do when data is missing, and which technologies you use in the rest of the pipeline.

Output data

When building an analytics pipeline, you need to care about the end-users. Data analysts use your pipeline to build reporting dashboards and visualizations. The output data must be accessible and easy to manipulate, since end-users may lack strong technical expertise in data engineering. Nowadays, popular analytics engines ease the integration between big data ecosystems and analytics warehouses.

How much data can the pipeline ingest?

The scalability of your data system can decide the long-term viability of the business. Handling 100 GB a day and handling 1 TB a day are nothing alike. The hardware and software infrastructure must keep up with sudden changes in data volume. You don’t want to overload your data system as your business grows organically. Scale your data pipeline for the best!

Typical big data analytics pipeline. Credit: author

Data collection

Data collection is the first and foremost module of a data pipeline, where you have to assess the origin of your data. Is it coming from another data source or from top-level applications? Will the data be structured or unstructured? Do you need to perform any data cleaning? We might think of big data as a chaotic volume of data, but actually, most big data is structured. Unstructured data requires additional techniques to build a data pipeline upon it.

Your pipeline’s architecture will vary with the method you choose to collect the data: in batch or via a streaming service. A batch processing pipeline demands an efficient storage system for I/O operations, whilst a streaming pipeline prefers a fault-tolerant transmission protocol.
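
As an illustration of the streaming side, here is a minimal sketch using the kafka-python client; the broker address and the topic name "events" are assumptions made for the example, not part of any particular setup.

# Minimal streaming-ingestion sketch (kafka-python); broker and topic are placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                              # hypothetical topic name
    bootstrap_servers=["localhost:9092"],  # hypothetical broker address
    auto_offset_reset="earliest",
    value_deserializer=lambda v: v.decode("utf-8"),
)

for message in consumer:
    # Each record carries the payload plus metadata (offset, partition, timestamp).
    print(message.offset, message.value)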

When it comes to structured data, whether it is text, numbers, or images, it must go through one requisite process before being fed into the pipeline: data serialization. Data serialization is the process of converting structured data into a format that allows it to be shared or stored in a form from which its original structure can be recovered.


Data serialization leads to a homogeneous data structure across the pipeline, thus keeping all data processing modules consistent. XML, CSV, YAML, and JSON are some of the most popular serialization formats. Serialized data is more optimized in terms of storage and transmission. Transferring data from one system to another can run into compatibility problems, and a bit-wise communication ensures there is no information loss.

JSON is quite handy for handling both flat and nested data structures across the Internet. It offers a human-readable format and integrates well with JVM-based systems. However, in big data processing, JSON is less favored than other formats due to its unoptimized storage and lack of structure validation.

{
  "people": [
    {
      "name": "John Doe",
      "id": 1,
      "email": "[email protected]",
      "phoneType": "Mobile",
      "phoneNumbers": [
        {
          "number": "1-541-754-3010",
          "type": "Mobile"
        }] 
    }],
  "numberOfContact": 1
}        

Protocol Buffers (or protobuf) is Google’s mechanism for serializing structured data, originally developed for internal use. With protobuf, you define a generic schema and then perform read/write operations in your favorite programming language. Think of a language-neutral format like XML, but smaller and faster. Apart from not being human-readable, protobuf has few downsides and performs up to 6 times faster than JSON.

// source code reference: https://developers.google.com/protocol-buffers/docs/javatutorial
syntax = "proto2";

package tutorial;

option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phones = 4;
}

message AddressBook {
  repeated Person people = 1;
}        
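
As a rough sketch of the read/write step, the snippet below assumes the schema above has been saved as addressbook.proto and compiled with protoc --python_out=., which produces an addressbook_pb2 module; the values are placeholders.

# Hedged sketch: assumes `protoc --python_out=. addressbook.proto` generated addressbook_pb2.
import addressbook_pb2

book = addressbook_pb2.AddressBook()
person = book.people.add()
person.name = "John Doe"
person.id = 1
person.email = "john.doe@example.com"   # placeholder address

phone = person.phones.add()
phone.number = "1-541-754-3010"
phone.type = addressbook_pb2.Person.MOBILE

# Serialize to a compact binary payload, then recover the original structure.
payload = book.SerializeToString()

restored = addressbook_pb2.AddressBook()
restored.ParseFromString(payload)
print(restored.people[0].name)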

Key takeaways:

  • Storage is essential for batch processing, while transmission is critical for streaming services
  • Serialization maintains stable communication between systems
  • Use protobuf to serialize data for better performance

Data storage

Suppose you have the data collection modules up and running: where will you store all that data? It depends on many things: hardware resources, data management expertise, maintenance budget, etc. Make up your mind carefully before deciding where to spend your money, because this is a long-term play.

Data is the new oil, so it’s best to keep the oil in your backyard

If you have big money, the best option is setting up your own data infrastructure. Data is the new oil, so it’s best to keep the oil in your backyard. Hire the best hardware engineers, assemble a proper data center, and build your pipeline upon it. The Hadoop Distributed File System (HDFS) has long been the number one choice for in-house data architecture. It offers a tightly integrated ecosystem with tools and platforms available for data storage and ETL, and a viable Hadoop stack requires minimal effort to set up. Its power lies in horizontal scaling: bundling commodity hardware side by side to maximize performance and minimize costs.

You can even go the extra mile by optimizing the storage format. Storing files as .txt or .csv might not be the brightest idea on HDFS. Apache Parquet is a columnar format available to any project in the Hadoop ecosystem, and it is widely recommended by data engineers. As a column-based storage format, Parquet offers better compression and therefore optimized I/O operations. Its main drawback is the constraint on schema modification: for example, adding or removing a column takes more effort with Parquet.
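
As a small illustration, the pandas snippet below writes and reads a Parquet file; the file name and columns are made up for the example, and pyarrow is assumed to be installed as the Parquet engine.

# Illustrative Parquet write/read with pandas; file name and columns are placeholders.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event": ["click", "view", "click"],
    "ts": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
})

# Columnar storage with compression, which is where the I/O savings come from.
df.to_parquet("events.parquet", compression="snappy")

# Column pruning: read back only the columns a query actually needs.
subset = pd.read_parquet("events.parquet", columns=["user_id", "event"])
print(subset.head())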

Coming from a SQL background, you can also set up a more accessible query system. The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides a SQL-like query language (HiveQL) to execute queries directly on HDFS. Even though it does not follow every SQL standard, HiveQL still eases the querying process for those who don’t speak Hadoop. Another common query engine is Presto, which was largely developed by Facebook engineers.
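
For instance, a HiveQL query can be run from Python through the PyHive library; the host, port, username, and table name below are placeholders for your own cluster.

# Hedged sketch: querying Hive with HiveQL via PyHive; connection details are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL executes directly against tables stored on HDFS.
cursor.execute("SELECT event, COUNT(*) AS cnt FROM events GROUP BY event")
for event, cnt in cursor.fetchall():
    print(event, cnt)

cursor.close()
conn.close()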

Again, if you don’t have enough resources to build your own data warehouse, you can outsource the whole system to a cloud-based platform. Many major tech companies offer all-in-one big data architectures, such as Google BigQuery, Amazon AWS, and Microsoft Azure. By outsourcing, you don’t have to bother setting up or maintaining the ecosystem, but you give up some control over your pipeline. It is a trade-off between high cost with low maintenance and low cost with high maintenance. Nevertheless, you can place your bet on the expertise of the tech giants to manage your pipeline.
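
As one cloud-side example, the google-cloud-bigquery client lets you run the same kind of query against BigQuery; the project, dataset, and table names below are placeholders.

# Hedged sketch using the google-cloud-bigquery client; project/dataset/table are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # credentials are picked up from the environment

query = """
    SELECT event, COUNT(*) AS cnt
    FROM `my-project.analytics.events`
    GROUP BY event
"""
for row in client.query(query):  # iterating the job waits for and streams the results
    print(row["event"], row["cnt"])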

Key takeaways:

  • If you have big money, go for a DIY data architecture; if not, outsourcing is your answer
  • Use Parquet to store files in the Hadoop ecosystem
  • Set up a query system on top of Hadoop for easy access

Analytics engines

The Hadoop ecosystem and its alternatives are favorable for a big data storage system, but they are not fit to serve as analytics engines: they aren’t built to execute fast queries. For analytics purposes, we frequently run ad hoc queries and therefore need a system that returns results quickly. A subordinate storage layer, built on an analytics engine, is needed for this purpose.

Vertica is a database management system designed for analytics at scale and fast query performance. It stores data in a columnar format and creates projections to distribute data across its nodes for high-speed queries. Vertica is widely used by many tech companies thanks to its reputation for providing a robust analytics engine and an efficient querying system. Vertica can play the role of a database for numerous data-related external applications thanks to its easy integration with Java, Scala, Python, and C++.
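
A minimal sketch of that integration from Python, using the vertica-python driver; all connection details and the table name are placeholders.

# Hedged sketch with the vertica-python driver; connection details are placeholders.
import vertica_python

conn_info = {
    "host": "vertica.example.com",
    "port": 5433,
    "user": "analyst",
    "password": "********",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("SELECT event, COUNT(*) FROM events GROUP BY event")
    for event, cnt in cur.fetchall():
        print(event, cnt)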

An analytics dashboard example.

However, Vertica shows some disadvantages when working with real-time data. Its constraints on changing schemas or modifying projections limit its use for data that transforms rapidly. Druid is an open-source analytics database designed specifically for Online Analytical Processing (OLAP). Time-series data, which consists mostly of timestamps and metrics, requires an optimized storage mechanism and fast aggregators. Druid stores metrics as columns and partitions data, together with indexes, for quick access, therefore providing agile aggregation operations.
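
As an illustration, Druid exposes a SQL endpoint on its broker that can be queried over HTTP; the broker URL and the "events" datasource below are assumptions.

# Hedged sketch: querying Druid's SQL API over HTTP; broker URL and datasource are placeholders.
import requests

broker = "http://druid-broker.example.com:8082"
payload = {
    "query": """
        SELECT TIME_FLOOR(__time, 'PT1H') AS hour, SUM(clicks) AS clicks
        FROM events
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
        GROUP BY TIME_FLOOR(__time, 'PT1H')
    """
}

resp = requests.post(f"{broker}/druid/v2/sql/", json=payload)
resp.raise_for_status()
for row in resp.json():
    print(row)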

Key takeaways:

  • Vertica is great for low-latency analytics but requires much expertise to scale
  • Druid is built for time-series data and provides a fast access system
  • Choose an analytics database with maximum integration with your visualization tools

Monitoring and quality

After finishing data collection, storage, and visualization integration, you might want to just plug and play. But there is one last thing to consider: what to do in case of incidents. Where do you turn when your pipeline crashes for no reason? That is the purpose of the monitoring process. It helps you track, log, and observe your system’s health and performance. Some tools even allow you to debug on the fly. With that said, a proper monitoring system is a must if you want to build a data pipeline that lasts. Here we distinguish between two kinds: IT monitoring and data monitoring.

IT monitoring is necessary for any software development. It shows various system-related metrics such as CPU and disk usage, resource consumption, and allocated memory. You can look at an IT monitoring dashboard and tell whether you can double or triple the pipeline’s capacity. With pre-optimized ecosystems like Hadoop or Vertica, we don’t need to worry much about IT performance. You can choose any basic IT monitoring tool, like Grafana or Datadog, to set up a simple dashboard that keeps track of your metrics.
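
As a hedged sketch, the snippet below exposes a couple of pipeline metrics with the prometheus_client library, which a Grafana dashboard could then read through a Prometheus data source; using Prometheus here, and the metric names, are assumptions for the example.

# Hedged sketch: exposing pipeline metrics with prometheus_client for a Grafana dashboard.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

ROWS_INGESTED = Counter("pipeline_rows_ingested_total", "Rows ingested by the pipeline")
BATCH_LAG_SECONDS = Gauge("pipeline_batch_lag_seconds", "Seconds since the last completed batch")

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        ROWS_INGESTED.inc(random.randint(100, 200))   # stand-in for real ingestion counts
        BATCH_LAG_SECONDS.set(random.uniform(0, 60))  # stand-in for real batch lag
        time.sleep(5)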

An example Grafana dashboard.

Data monitoring is as crucial as the other modules in your big data analytics pipeline. It detects data-related issues such as latency, missing data, and inconsistent datasets. The quality of your data pipeline reflects the integrity of the data circulating within your system. These metrics ensure minimal or zero data loss when transferring data from one place to another, without affecting business outcomes. We cannot list every metric logged by data monitoring tools, because each data pipeline has its specific needs and hence specific tracking. If you are building a time-series data pipeline, focus on latency-sensitive metrics. If your data comes in batches, make sure you track the transmission processes properly. Some data monitoring tools can help you build a straightforward data monitoring dashboard, but to suit your particular uses, it is often best to build one yourself, as sketched below.
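
A deliberately simple, illustrative check on a daily batch might look like the snippet below; the file path, column names, and thresholds are assumptions, and a real pipeline would route the alert somewhere rather than print it.

# Illustrative data-quality check for a daily batch; path, columns, and thresholds are placeholders.
import pandas as pd

df = pd.read_parquet("events.parquet")

# Completeness: fraction of missing values per column.
missing = df.isna().mean()

# Freshness: how far the newest event lags behind now (assumes a tz-naive "ts" column).
lag = pd.Timestamp.now() - df["ts"].max()

if (missing > 0.01).any() or lag > pd.Timedelta(hours=1):
    # In a real pipeline this would raise an alert (Slack, PagerDuty, etc.) instead of printing.
    print("Data quality alert:", missing.to_dict(), lag)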

Key takeaways:

  • Monitoring tools are indispensable in a data pipeline, but not all metrics are equally important
  • Data pipeline quality means the integrity of your data

Conclusion

We have spent quite some time walking through a basic end-to-end big data analytics pipeline, and I hope you have acquired some useful knowledge. There is no all-in-one formula for building such a pipeline, but you can use this fundamental blueprint to craft your own.
