AWS Data Engineering Essentials Guidebook
Data engineering lays the foundation for data science and analytics by combining in-depth knowledge of data technology, reliable data governance and security, and a solid understanding of data processing. Data engineers build and maintain data pipelines, the infrastructure behind modern data analytics, so that analysis can run smoothly.
With Amazon Web Services (AWS), data engineers can create data pipelines, manage data transfer and ensure efficient data storage.
Now, let us look at the AWS services used to build data engineering pipelines, frameworks and end-to-end workflow integrations:
Batch Processing
Amazon Simple Storage Service (S3) is an object store that can hold any amount of data and is reachable from anywhere on the internet. It is highly scalable, fast and affordable. In most storage classes, S3 stores each object redundantly across multiple Availability Zones within a Region, and data engineers can additionally replicate buckets to other Regions using S3 replication.
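As an illustration of how a batch job might organize data in S3, the sketch below builds Hive-style partitioned object keys (year=/month=/day=) and uploads an object with boto3's put_object. The dataset name, file name and helper names are hypothetical, and the partition layout is one common convention rather than anything S3 requires. boto3 is imported lazily so the key-building helper runs without AWS credentials.

```python
from datetime import date


def partitioned_key(dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key for one day's data."""
    return (
        f"{dataset}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def upload_bytes(bucket: str, key: str, body: bytes, s3_client=None) -> None:
    """Upload one object; creates a boto3 S3 client if none is passed in."""
    if s3_client is None:
        import boto3  # lazy import; needs AWS credentials at call time

        s3_client = boto3.client("s3")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)


if __name__ == "__main__":
    # Hypothetical dataset and file names, for illustration only.
    print(partitioned_key("orders", date(2024, 1, 5), "part-0000.parquet"))
```

Keeping the date in the key prefix lets downstream engines that understand this layout skip whole prefixes when a query filters on date.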
AWS Glue is a fully managed ETL (Extract, Transform and Load) service for processing, enriching and moving data between different data stores and data streams easily and cost-effectively. Data engineers can analyze and process data interactively using AWS Glue Interactive Sessions, and can visually develop, run and monitor ETL workflows in AWS Glue Studio with a few clicks. Glue runs ETL jobs on Apache Spark in a serverless environment, so jobs are processed in parallel without any cluster to manage.
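Once a Glue job exists, it can be triggered programmatically. The sketch below starts a job run through boto3's start_job_run call and returns the run ID. The job name and parameter names in the usage note are hypothetical; the client is passed in so the function can be exercised without AWS access.

```python
def start_glue_job(glue_client, job_name: str, params: dict) -> str:
    """Start a Glue job run and return its JobRunId.

    glue_client is expected to be boto3.client("glue") or a compatible stub.
    """
    # Glue job arguments are passed as "--name": "value" string pairs.
    arguments = {f"--{name}": str(value) for name, value in params.items()}
    response = glue_client.start_job_run(JobName=job_name, Arguments=arguments)
    return response["JobRunId"]
```

A call might look like start_glue_job(boto3.client("glue"), "orders-etl", {"run_date": "2024-01-05"}), assuming a job named orders-etl has already been defined.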
Amazon EMR (formerly Amazon Elastic MapReduce) is one of the primary AWS services for large-scale data processing with Big Data technologies such as Apache Hadoop, Apache Spark and Apache Hive. Data engineers can use EMR to launch a temporary cluster that runs a Spark, Hive or Flink task and then shuts down. It lets engineers define dependencies, set the cluster configuration and choose the underlying EC2 instance types.
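The temporary-cluster pattern described above can be sketched as a request for boto3's run_job_flow call. The function below only builds the request dictionary; the cluster name, S3 URIs, instance type and release label are illustrative assumptions, and the role names are the EMR default roles, which must already exist in the account.

```python
def transient_spark_cluster(
    name: str,
    script_s3_uri: str,
    log_s3_uri: str,
    release_label: str = "emr-6.15.0",   # assumed release; pick one you support
    instance_type: str = "m5.xlarge",    # assumed instance type
    core_nodes: int = 2,
) -> dict:
    """Build run_job_flow parameters for a cluster that ends after one Spark step."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # False makes the cluster terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

The resulting dictionary would be passed as boto3.client("emr").run_job_flow(**config). Building it separately keeps the cluster definition easy to review and test before anything is launched.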
Amazon Athena is an interactive query service for analyzing data in Amazon S3 with standard SQL. Once a dataset's metadata has been added to the Data Catalog, data engineers can query it with Athena to gain insights. When querying gigabytes of well-partitioned data in Parquet format, engineers typically get results within seconds.
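To illustrate why partitioning matters, the sketch below builds a query that filters on partition columns and submits it with boto3's start_query_execution. The table name, the year/month/day partition columns (assumed to be stored as strings) and the output location are hypothetical.

```python
def daily_count_sql(table: str, year: int, month: int, day: int) -> str:
    """Build a row-count query restricted to a single day's partition.

    The table name is interpolated directly, so pass trusted values only.
    """
    return (
        f"SELECT COUNT(*) AS row_count FROM {table} "
        f"WHERE year = '{year:04d}' AND month = '{month:02d}' AND day = '{day:02d}'"
    )


def run_athena_query(athena_client, sql: str, database: str, output_s3_uri: str) -> str:
    """Submit a query and return its QueryExecutionId.

    athena_client is expected to be boto3.client("athena") or a compatible stub.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3_uri},
    )
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so a caller would poll get_query_execution with the returned ID until the query finishes before reading results.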
AWS Lambda is a serverless compute service that runs your code in response to events and manages the underlying compute resources for you. Lambda is particularly useful for collecting raw data: data engineers can write a Lambda function that calls an API endpoint, retrieves the result, processes the data and stores it in S3 or DynamoDB.
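The ingestion flow just described can be sketched as a Lambda handler. This is a minimal example under stated assumptions: the event fields api_url, bucket and key are hypothetical, the API is assumed to return a JSON array of objects, and the output is written to S3 as newline-delimited JSON.

```python
import json
import urllib.request


def fetch_json(url: str, timeout: float = 10.0):
    """Call an HTTP endpoint and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def to_json_lines(records) -> bytes:
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


def lambda_handler(event, context):
    """Fetch records from an API and store them as one S3 object."""
    import boto3  # imported here so the helpers above run without boto3

    records = fetch_json(event["api_url"])
    boto3.client("s3").put_object(
        Bucket=event["bucket"], Key=event["key"], Body=to_json_lines(records)
    )
    return {"records": len(records)}
```

Newline-delimited JSON is used here because line-oriented files are straightforward for downstream batch tools to read; a different raw format would work equally well.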
Real-time Processing
Amazon Kinesis is a family of managed cloud services for collecting and analyzing streaming data in real time. Data engineers use Kinesis to create new streams, specify their requirements and start streaming data. Kinesis also lets engineers retrieve and analyze data as it arrives instead of waiting for a batch output report.
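As a small illustration of writing to a stream, the sketch below sends one event to Kinesis Data Streams with boto3's put_record. The stream name and the user_id partition field are hypothetical; the client is passed in so the function can be tested with a stub.

```python
import json


def put_event(kinesis_client, stream_name: str, event: dict,
              partition_field: str = "user_id"):
    """Send one JSON-encoded event to a Kinesis data stream.

    kinesis_client is expected to be boto3.client("kinesis") or a compatible stub.
    Records with the same partition key are routed to the same shard.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event[partition_field]),
    )
```

Choosing a partition key with many distinct values, such as a user or device ID, helps spread records evenly across shards.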
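AWS Database Migration Service (DMS) is a managed migration and replication service that helps move database and analytics workloads to AWS quickly, securely, and with minimal downtime and no data loss. Its ongoing replication mode (change data capture) keeps a target continuously in sync with the source, which is what makes it useful in real-time pipelines.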
Apache Flink is a streaming dataflow engine for real-time stream processing of high-throughput data sources. Flink supports event-time semantics for out-of-order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications. Amazon EMR supports Flink as a YARN application, so you can manage its resources alongside other applications within a cluster.
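To make event-time semantics for out-of-order events more concrete, here is a plain-Python illustration of the idea. It is not Flink code and does not use the Flink API. It counts events in tumbling event-time windows and uses a watermark (the highest event time seen minus an allowed out-of-orderness) to decide when a window is complete. Updating the watermark on every event is a simplification, and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size: int, max_out_of_orderness: int):
    """Count events per event-time tumbling window.

    events: iterable of (event_time, payload) pairs in arrival order.
    A window [start, start + window_size) is emitted once the watermark
    reaches its end; events arriving for an already-complete window are
    counted as late and dropped.
    Returns (list of (window_start, count), number_of_late_events).
    """
    open_windows = defaultdict(int)  # window start -> running count
    emitted = []
    late = 0
    watermark = float("-inf")

    for event_time, _payload in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            late += 1  # the window this event belongs to is already closed
            continue
        open_windows[start] += 1
        watermark = max(watermark, event_time - max_out_of_orderness)
        # Emit every open window whose end is at or before the watermark.
        for w in sorted(s for s in open_windows if s + window_size <= watermark):
            emitted.append((w, open_windows.pop(w)))

    # End of input: flush whatever is still open.
    for w in sorted(open_windows):
        emitted.append((w, open_windows[w]))
    return emitted, late


if __name__ == "__main__":
    arrivals = [(1, "a"), (4, "b"), (3, "c"), (12, "d"),
                (9, "e"), (14, "f"), (2, "g"), (25, "h")]
    print(tumbling_window_counts(arrivals, window_size=10, max_out_of_orderness=3))
    # → ([(0, 4), (10, 2), (20, 1)], 1)
```

In this run the event at time 9 arrives after the event at time 12 but is still counted in the first window, because the watermark has not yet passed 10. The event at time 2 arrives after the watermark has reached 11, so it is treated as late.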