Overview of Amazon Kinesis
Amazon Kinesis is designed to make it simple to collect, process, and analyze streaming data in real time. This includes any data generated at high speed, such as:
- Application logs
- Metrics
- Website clickstreams
- IoT telemetry data
If data is produced continuously and needs to be analyzed as it arrives, it qualifies as real-time streaming data, making Kinesis an ideal solution.
Core Components in Kinesis
Amazon Kinesis comprises four services, each designed for specific streaming data needs:
- Kinesis Data Streams: Captures, processes, and stores data streams. Allows real-time data streaming with reliable throughput and data persistence.
- Kinesis Data Firehose: Loads data streams into AWS storage and analytics services, or even external systems. Simplifies the process of delivering streaming data to destinations like Amazon S3, Redshift, Elasticsearch, or third-party solutions.
- Kinesis Data Analytics: Analyzes data streams using SQL or Apache Flink. Enables real-time data analysis for monitoring, reporting, and quick insights, using familiar languages and tools.
- Kinesis Video Streams: Captures, processes, and stores video streams.
Kinesis Data Streams
Overview of Kinesis Data Streams
Kinesis Data Streams is designed for real-time, big data streaming within AWS. It enables continuous ingestion of data from various sources and provides a flexible, scalable way to process data in near real time.
Core Components
- A Kinesis Data Stream comprises multiple shards, each identified by a unique number (Shard 1, Shard 2, and so on).
- When creating a Kinesis Data Stream, we specify the number of shards, which determines our stream's capacity in terms of ingestion and consumption rates.
- Shards can be scaled up or down based on demand.
- Producers are responsible for sending data to Kinesis Data Streams.
- Examples of producers include applications, desktop and mobile clients, and AWS tooling such as the Kinesis Producer Library (KPL) and the Kinesis Agent, which streams application logs into Kinesis Data Streams.
- Each record produced consists of:
  - Partition Key: determines which shard the record is routed to.
  - Data Blob: the actual data, up to 1 MB in size.
- Producers can send data at a rate of 1 MB/sec or 1,000 messages per second per shard. Therefore, if we have 6 shards, our stream's capacity is 6 MB/sec or 6,000 messages per second.
- Consumers retrieve data from Kinesis Data Streams and can take many forms (see the producer/consumer sketch after this list):
  - Applications using the AWS SDK or the Kinesis Client Library (KCL)
  - AWS Lambda functions, for serverless processing
  - Kinesis Data Firehose or Kinesis Data Analytics
- When a consumer reads a record, it receives the partition key, sequence number (indicating its position in the shard), and data blob.
- Enhanced Fan-Out mode allows each consumer to have a throughput of 2 MB/sec per shard.
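To make the producer and consumer sides concrete, here is a minimal sketch using boto3, the AWS SDK for Python. The stream name, partition key, and payload are hypothetical, and AWS credentials/region are assumed to be configured:

```python
import boto3

kinesis = boto3.client("kinesis")

# Producer side: send one record. The partition key decides which shard
# the record is routed to; the data blob can be up to 1 MB.
kinesis.put_record(
    StreamName="demo-stream",                     # hypothetical stream
    PartitionKey="user-123",                      # same key -> same shard
    Data=b'{"event": "click", "page": "/home"}',
)

# Consumer side: read from the oldest available record of the first shard.
shard_id = kinesis.describe_stream(StreamName="demo-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="demo-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",             # start at the oldest record
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=10)
for record in response["Records"]:
    # Each record carries the partition key, a sequence number (its
    # position within the shard), and the data blob.
    print(record["PartitionKey"], record["SequenceNumber"], record["Data"])
```

In practice the KCL handles shard discovery, checkpointing, and load balancing across workers; the raw GetRecords loop above is only to show what a consumer receives.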
Data Retention and Immutability
- Kinesis Data Streams retain data for a specified period, which can be set from 1 day up to 365 days.
- Data in Kinesis is immutable, meaning it cannot be deleted once it is inserted, allowing for reprocessing and replay of data when needed.
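Immutability plus retention is what makes replay possible: a consumer can re-read past data simply by requesting a shard iterator anchored at an earlier point in time. A hedged sketch with boto3 (stream and shard names are hypothetical):

```python
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

# Replay everything written since a given timestamp, as long as it is
# still within the stream's retention period (1 to 365 days).
iterator = kinesis.get_shard_iterator(
    StreamName="demo-stream",                 # hypothetical stream
    ShardId="shardId-000000000000",           # hypothetical shard ID
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
)["ShardIterator"]

while iterator:
    response = kinesis.get_records(ShardIterator=iterator)
    for record in response["Records"]:
        print(record["SequenceNumber"], record["Data"])
    if not response["Records"]:
        break                                 # caught up with the stream
    iterator = response.get("NextShardIterator")
```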
Capacity Modes
- Provisioned Mode: We manually set the number of shards. Each shard supports an ingestion rate of 1 MB/sec (or 1,000 records per second) and a consumption rate of 2 MB/sec. We pay per provisioned shard, so careful capacity planning is required.
- On-Demand Mode: Ideal for unpredictable or spiky workloads, as capacity scales automatically based on usage. The default capacity is 4 MB/sec (or 4,000 records per second) of ingestion, with automatic scaling based on the observed peak throughput over the last 30 days. Pricing is per stream per hour plus per GB of data in and out.
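The capacity mode is chosen when the stream is created (and can be switched later via the UpdateStreamMode API). A sketch of both modes with boto3, using hypothetical stream names:

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: we pick the shard count up front and pay per shard.
# 6 shards -> 6 MB/sec (6,000 records/sec) in and 12 MB/sec out.
kinesis.create_stream(
    StreamName="orders-provisioned",          # hypothetical
    ShardCount=6,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count to manage; capacity scales with traffic.
kinesis.create_stream(
    StreamName="orders-on-demand",            # hypothetical
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```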
Security in Kinesis Data Streams
- IAM Policies: Control access to Kinesis Data Streams for both producers and consumers.
- Encryption: Data is encrypted in transit (HTTPS) and at rest (KMS). Client-side encryption is also available for additional security but requires a custom implementation for encryption and decryption (see the sketch after this list).
- VPC Endpoints: Allow private access to Kinesis from within a VPC, bypassing the public internet.
- Monitoring: All API calls are logged and can be monitored via AWS CloudTrail.
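Since client-side encryption is left to us (per the Encryption bullet above), one way to sketch it is to encrypt the data blob ourselves before calling PutRecord. This example assumes the third-party cryptography package and a hypothetical stream; key management is entirely our responsibility:

```python
import boto3
from cryptography.fernet import Fernet  # pip install cryptography (assumption)

kinesis = boto3.client("kinesis")

# We own the key and the encrypt/decrypt logic end to end.
key = Fernet.generate_key()    # in practice, load from a secure key store
cipher = Fernet(key)

plaintext = b'{"ssn": "123-45-6789"}'
kinesis.put_record(
    StreamName="demo-stream",           # hypothetical stream
    PartitionKey="user-123",
    Data=cipher.encrypt(plaintext),     # Kinesis only ever sees ciphertext
)

# A consumer holding the same key would call cipher.decrypt(record["Data"])
# after reading the record.
```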
Overview of Kinesis Data Firehose
Kinesis Data Firehose is a fully managed service for ingesting data from multiple producers and delivering it to various destinations. Producers can include:
- Applications, clients, SDKs, and Kinesis agents
- Kinesis Data Streams, Amazon CloudWatch Logs, and Amazon CloudWatch Events
Once data enters Kinesis Data Firehose, it can optionally be transformed using a Lambda function. After optional transformation, the data is written in batches to specified destinations, with no additional coding required for the writing process.
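The transformation Lambda follows a fixed contract: Firehose invokes it with a batch of base64-encoded records, and each record must be returned with its original recordId and a result of Ok, Dropped, or ProcessingFailed. A minimal sketch in Python (the uppercasing step is just a placeholder transformation):

```python
import base64

def lambda_handler(event, context):
    """Transform a batch of Firehose records and hand them back."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        transformed = payload.upper()            # placeholder transformation
        output.append({
            "recordId": record["recordId"],      # must echo the original ID
            "result": "Ok",                      # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```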
Destination Types
Kinesis Data Firehose supports multiple types of destinations:
- Amazon S3: Stores data directly.
- Amazon Redshift: Uses Amazon S3 as an intermediary before issuing a COPY command to transfer data from S3 to Redshift.
- Amazon OpenSearch: Allows for analytics and search capabilities.
Third-Party Partner Destinations
Firehose can deliver data to third-party services such as Datadog, Splunk, New Relic, MongoDB, and others.
Custom HTTP endpoints can be used for specific use cases, enabling data delivery to our own applications via APIs.
In addition to the primary destinations, Firehose offers options to:
- Backup all data to an S3 bucket
- Backup only failed data to a separate S3 bucket if there are issues writing to the primary destination
Key Features of Kinesis Data Firehose
- Fully Managed and Serverless: Requires no server management or manual scaling. Firehose automatically scales to accommodate the incoming data.
- Cost Efficiency: Charges only for the data volume processed.
- Near Real-Time Delivery: Data is delivered in batches, making it “near real-time.” Buffer intervals range from 0 to 900 seconds, and buffer sizes start at a minimum of 1 MB. Even with a 0-second buffer, slight delays (a few seconds) classify Firehose as near real-time.
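Buffering is configured per delivery stream. A hedged sketch of creating an S3 delivery stream with explicit buffering hints via boto3 (the bucket, role ARN, and names are hypothetical, and the IAM role must allow Firehose to write to the bucket):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",     # hypothetical
    DeliveryStreamType="DirectPut",             # producers write straight to Firehose
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",  # hypothetical
        "BucketARN": "arn:aws:s3:::my-clickstream-bucket",             # hypothetical
        # Deliver when either threshold is hit first: 5 MB or 300 seconds.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```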
Data Formats and Transformations
Kinesis Data Firehose supports multiple data formats, compressions, and conversions. We can also use AWS Lambda for custom transformations.
Backup and Recovery
Firehose provides the option to back up all data or only failed data into S3, ensuring a reliable data recovery option.
Kinesis Data Streams vs. Kinesis Data Firehose Comparison
Here’s a quick comparison to clarify when to use each:
- Kinesis Data Streams: For high-scale data ingestion with custom code for producers and consumers. Supports real-time processing (latency of roughly 200 ms, or about 70 ms with enhanced fan-out). Requires manual scaling and management of shards in provisioned mode (e.g., shard splitting and merging). Allows multiple consumers, retains data for 1 to 365 days, and supports data replay.
- Kinesis Data Firehose: For automated data delivery to AWS services (like S3, Redshift, OpenSearch), third-party applications, or custom HTTP endpoints. Fully managed with automated scaling and is near real-time. No data storage or replay capability, meaning once data is delivered, it cannot be accessed for reprocessing. Ideal when we want ease of use without worrying about infrastructure or scaling.