Introduction
We have three options in Amazon Kinesis.
- Data Streams
- Data Firehose
- Data Analytics
For the purposes of this script, we are going with Data Streams. A data stream in Amazon Kinesis is the continuous flow of data ingested into the service, which can be processed and analyzed in real time by various AWS services and applications.
Prerequisites
Before diving into reading data from an AWS Kinesis stream, ensure you have the following.
- AWS Account: A valid AWS account with appropriate permissions to interact with Kinesis services.
- Kinesis Stream: Set up a Kinesis stream through the AWS Management Console or AWS CLI.
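If the stream does not exist yet, it can also be created with a few lines of boto3. The stream name, region, and shard count below are placeholders, so adjust them for your own setup:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a stream with a single shard (placeholder name and count).
kinesis.create_stream(StreamName="my-data-stream", ShardCount=1)

# create_stream is asynchronous; wait until the stream becomes ACTIVE before using it.
waiter = kinesis.get_waiter("stream_exists")
waiter.wait(StreamName="my-data-stream")
```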
Key Steps in Reading Data
- Authentication: Ensure the necessary AWS access credentials are configured in your working environment to authenticate the requests.
- Stream Identification: Specify the name of the Kinesis stream to read data from.
- Shard Iteration: Kinesis streams are divided into shards, and we need to iterate through these shards to access the data.
Obtain a shard iterator, which acts as a pointer to the position in the stream from which we want to start reading.
- Retrieve and Process Data: Use the shard iterator to fetch records (data entries) from the stream. Process the retrieved data according to the application's requirements. A minimal sketch of these steps is shown below.
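Putting these steps together, a minimal boto3 sketch could look like the following. The stream name and region are placeholders, and credentials are assumed to be available in the environment:

```python
import boto3

# Credentials are picked up from the environment (env vars, profile, or IAM role).
kinesis = boto3.client("kinesis", region_name="us-east-1")

stream_name = "my-data-stream"  # placeholder stream name

# List the shards in the stream.
shards = kinesis.list_shards(StreamName=stream_name)["Shards"]

for shard in shards:
    # Obtain a shard iterator pointing at the oldest available record in this shard.
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    # Fetch a single batch of records from this shard and process them.
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["SequenceNumber"], record["Data"])
```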
When we search for Kinesis in the Services section of the AWS console, we land on the Kinesis dashboard page.
In our data architecture, we have established 26 distinct data streams, each linked to a Lambda function. The operational flow is orchestrated by a Lambda function created specifically for this purpose. This Lambda function can be triggered manually or configured to run whenever a new file is detected in the specified S3 bucket location. As soon as the Lambda is triggered, it sends the data to the associated data stream (the Lambda code is provided in the attachment for reference).
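The attached Lambda code is not reproduced here, but a simplified sketch of that flow, assuming an S3 event trigger, a plain JSON input file, and a placeholder stream name, could look like this (the real script additionally handles GZIP files, a YAML config, and retries):

```python
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

STREAM_NAME = "my-data-stream"  # placeholder; the real flow maps each file to one of the 26 streams


def lambda_handler(event, context):
    # The S3 trigger passes the bucket and key of the newly created object.
    for s3_record in event["Records"]:
        bucket = s3_record["s3"]["bucket"]["name"]
        key = s3_record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(body)  # assuming the file contains a JSON array of records

        # Convert the rows into Kinesis record entries.
        entries = [
            {"Data": json.dumps(row).encode("utf-8"), "PartitionKey": str(idx)}
            for idx, row in enumerate(rows)
        ]
        # put_records accepts at most 500 records per call, so send in batches.
        for i in range(0, len(entries), 500):
            kinesis.put_records(StreamName=STREAM_NAME, Records=entries[i : i + 500])

    return {"statusCode": 200}
```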
Once the data is in Kinesis, it is distributed across the stream's shards. To read data from a shard in a Kinesis stream, you use a shard iterator, which is a pointer to a specific position in the shard. Shard iterators can be of types such as TRIM_HORIZON (oldest available data) or LATEST (most recent data).
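To keep reading beyond the first batch, each get_records response returns a NextShardIterator pointing at the next position in the shard. A minimal sketch of tailing a single shard (the stream name and shard ID are placeholders) might be:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Start at the tip of the shard; use "TRIM_HORIZON" instead to start from the oldest retained record.
iterator = kinesis.get_shard_iterator(
    StreamName="my-data-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while iterator:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["Data"])
    # get_records returns the iterator for the next position in the shard.
    iterator = response.get("NextShardIterator")
    time.sleep(1)  # avoid exceeding the per-shard read throughput
```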
There are a few terms we should be aware of.
- Retention Period: The retention period is the duration for which Kinesis retains data records in the stream. By default, data is retained for 24 hours, but we can configure this period based on our specific use case.
- Capacity Mode: Amazon Kinesis provides two capacity modes for Kinesis Data Streams: the Provisioned Capacity Mode and the On-Demand Capacity Mode.
In Provisioned Capacity Mode, we manually specify the number of shards we need for our stream. Each shard provides a certain amount of capacity for both read and write operations. On-Demand Capacity Mode is a more flexible option where we don't need to specify the number of shards; the service automatically adjusts the number of shards based on the traffic to the stream. Both the retention period and the capacity mode can be changed on an existing stream, as sketched below.
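A minimal sketch of adjusting both settings with boto3 (the stream name and values are placeholders) might look like this:

```python
import boto3

kinesis = boto3.client("kinesis")

# Extend the retention period beyond the 24-hour default (value is in hours).
kinesis.increase_stream_retention_period(
    StreamName="my-data-stream",
    RetentionPeriodHours=72,
)

# Switch the stream to on-demand capacity mode; update_stream_mode requires the stream ARN.
stream_arn = kinesis.describe_stream_summary(StreamName="my-data-stream")[
    "StreamDescriptionSummary"
]["StreamARN"]
kinesis.update_stream_mode(
    StreamARN=stream_arn,
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```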
The attached Python script, designed to run in AWS Lambda, reads data from S3, processes JSON and GZIP files, and streams the records to an Amazon Kinesis Data Stream. It incorporates error handling and retry mechanisms for throttling issues, and it logs information about the process. The script is parameterized and configured through environment variables and a YAML config file retrieved from S3.
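The script itself is in the attachment; the snippet below is only a simplified sketch of the kind of throttling-retry logic it is described as containing, built around put_records and a hypothetical put_with_retries helper:

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")


def put_with_retries(stream_name, entries, max_attempts=5):
    """Send a batch to Kinesis, retrying records that were throttled or failed."""
    attempt = 0
    while entries and attempt < max_attempts:
        try:
            response = kinesis.put_records(StreamName=stream_name, Records=entries)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            # The whole call was throttled: treat every record as failed and retry.
            response = {
                "FailedRecordCount": len(entries),
                "Records": [{"ErrorCode": "ProvisionedThroughputExceededException"}] * len(entries),
            }

        if response["FailedRecordCount"] == 0:
            return

        # put_records can partially succeed; keep only the records that carry an ErrorCode.
        entries = [
            entry
            for entry, result in zip(entries, response["Records"])
            if result.get("ErrorCode")
        ]
        attempt += 1
        time.sleep(2 ** attempt)  # exponential backoff before retrying

    if entries:
        raise RuntimeError(f"{len(entries)} records could not be delivered")
```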