Speech-to-Text with AWS Batch + S3 + Transcribe + ECR
Introduction
AWS Batch takes advantage of multi-threaded EC2 instances, a queue-like system for managing jobs, job definitions, and Dockerized production code. This AWS resource forms the cornerstone of what I will be discussing today: leveraging Batch for resource-intensive jobs.
S3 and Batch Operations
Batch Operations were added to S3 in 2019, allowing users to submit batch jobs against their S3 buckets. As an article by AWS Chief Evangelist Jeff Barr suggests, the ability to invoke a Lambda function per object becomes an invaluable resource for users who wish to perform actions on some or all of their S3 items (be sure to look into creating inventories and using manifests).
My goal was to run Transcribe on every MP3 in an S3 bucket. I decided to test a batch operation against 750+ MP3s, and to my surprise a good portion of it failed. Unfortunately, things aren't as simple as invoking a Lambda function for each object. In this case, I had discovered a limitation of the Transcribe API, exposed by this CloudWatch log:
After a certain point, Transcribe refuses to create more jobs, as the Boto3 client error in the log shows. One likely reason is that the audio files vary in length and quality, so some take longer for Transcribe to process, and unfinished jobs pile up until the service's limit on concurrent transcription jobs is hit. Another potential cause is simply the rate of Transcribe API calls being made, which throttles the creation of new jobs.
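To make the failure mode concrete, here is a minimal sketch of the per-object approach that runs into this limit. The event shape, bucket, and language code are assumptions for illustration rather than my exact setup; the point is that calling start_transcription_job once per object, with no pacing, eventually surfaces limit and throttling errors in CloudWatch.

import boto3
from botocore.exceptions import ClientError

transcribe = boto3.client('transcribe')

def lambda_handler(event, context):
    # Hypothetical event shape: one S3 object per invocation.
    bucket = event['bucket']
    key = event['key']
    job_name = key.replace('/', '-').replace('.mp3', '')
    try:
        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={'MediaFileUri': f's3://{bucket}/{key}'},
            MediaFormat='mp3',
            LanguageCode='en-US'  # assumed language for this example
        )
    except ClientError as err:
        # With hundreds of objects, this is where the limit/throttling
        # errors from the Transcribe API show up in the logs.
        print(f'Failed to start a transcription job for {key}: {err}')
        raise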
AWS Batch
The solution? Enter AWS Batch. This resource allows several EC2 instances to share the workload; compute instances are scheduled as jobs arrive in the compute environment's queue, allowing jobs to run concurrently.
Amazon EC2 instances support multithreading, which enables multiple threads to run concurrently on a single CPU core. Each thread is represented as a virtual CPU (vCPU) on the instance - AWS EC2 Documentation
Batch and Transcribe
Let's get a quick rundown of the terminology and pieces of Batch that you should understand. For more detail on all Batch parameters, check out the docs here.
- Job Definition - The job definition is required before submitting jobs to Batch. It specifies how your jobs will run on a case-by-case basis. This is where you will provide parameters to your containerized solutions stored in Elastic Container Registry (more on that later). An important thing to note about job definitions is that their values may be overridden at submission time, giving you even more flexibility when submitting jobs.
- Job Queue - Job queues hold incoming jobs while they wait to be scheduled to run within a compute environment. Jobs pass through several stages, each of which can provide powerful debugging information via CloudWatch. Additionally, job queues have a priority parameter so you can control the order in which jobs are executed.
- Compute Environment - The compute environment contains the ECS-managed compute instances that will process the jobs in the queue. Within the compute environment you can specify which instance types you'd like your jobs to run on, or let Batch do the decision making for you: with the optimal instance type, Batch will evaluate which instance is right for your job at runtime. A minimal sketch of setting up such an environment and queue follows this list.
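To make those pieces a bit more concrete, here is a minimal Boto3 sketch of creating a managed compute environment that uses optimal instances, plus a job queue that feeds it. Every name, subnet, security group, and role ARN below is a placeholder you would swap for your own; treat this as the shape of the calls, not a production-ready setup.

import boto3

batch = boto3.client('batch')

# Managed compute environment: with 'optimal', Batch chooses suitable
# EC2 instance types based on the jobs sitting in the queue.
batch.create_compute_environment(
    computeEnvironmentName='demo-env',                                 # placeholder
    type='MANAGED',
    computeResources={
        'type': 'EC2',
        'minvCpus': 0,
        'maxvCpus': 16,
        'instanceTypes': ['optimal'],
        'subnets': ['subnet-0123456789abcdef0'],                       # placeholder
        'securityGroupIds': ['sg-0123456789abcdef0'],                  # placeholder
        'instanceRole': 'ecsInstanceRole'                              # placeholder
    },
    serviceRole='arn:aws:iam::123456789012:role/AWSBatchServiceRole'   # placeholder
)

# Job queue attached to that environment; when queues share an
# environment, the higher-priority queue is scheduled first.
# (In practice you wait for the environment to become VALID first.)
batch.create_job_queue(
    jobQueueName='demo',
    priority=1,
    computeEnvironmentOrder=[{'order': 1, 'computeEnvironment': 'demo-env'}]
)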
There are a ton of parameters I'm not going over, but I think you will have more fun if you investigate further on your own.
NOTE: Be very aware of the costs that may be incurred by processing many API calls with Batch, and in general by everything you do on AWS. If it's associated with a resource, it is likely associated with a cost.
Submit Job API
Submitting Transcribe jobs to Batch was then a question of writing code to retrieve the S3 bucket items where my target audio files were stored and sending each one as a parameter to my Batch queue's job definition.
Let's take a quick look at the submit_job API. This is where you submit jobs directly to the Batch service.
import boto3

batch = boto3.client('batch')
batch.submit_job(
    jobQueue='demo', jobDefinition='myJobDef:1', jobName='myJob',
    parameters={
        'param1': 'My first param!',
        'param2': 'My second param!'
    }
)
It's as simple as that! As you can see, we have provided the following parameters:
- jobQueue - the job queue I created in Batch named "demo", which uses a demo compute environment I set up to run on optimal compute instances.
- jobDefinition - the job definition I am telling Batch to run the job with; notice the colon and revision number. In Batch you can revise job definitions, i.e., edit them without having to get rid of older versions, which lets you keep several revisions under one name. Handy feature!
- jobName - the name Batch will give the job. You can access this name through the Batch API, which is useful if you have to look up specific job statuses or other metadata in a service like Lambda. There is a gotcha here: job names act as identifiers, so if you're wondering why your for-loop sending jobs to Batch isn't working, make sure you're changing and recording the name of each job you submit (see the sketch after this list for one way to do that).
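Putting those pieces together, the submission loop I described earlier could look roughly like this: list the MP3 objects in the bucket and submit one Batch job per object with a unique name. The bucket, parameter key, and naming scheme are illustrative assumptions, not the exact ones from my project.

import boto3

s3 = boto3.client('s3')
batch = boto3.client('batch')

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='audio'):            # placeholder bucket
    for obj in page.get('Contents', []):
        key = obj['Key']
        if not key.endswith('.mp3'):
            continue
        # Derive a unique, Batch-friendly job name from the object key
        # and record it so failed jobs can be traced back later.
        job_name = 'transcribe-' + key.replace('/', '-').replace('.mp3', '')
        batch.submit_job(
            jobQueue='demo',
            jobDefinition='myJobDef:1',
            jobName=job_name,
            parameters={'jobname': job_name}               # hypothetical parameter
        )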
AWS ECR
AWS ECR is used to store Docker images, and it's within this image that the Transcribe job specifications are made, using your preferred language's AWS client. Here I'm using Boto3 for Python. This workflow is an ace in your pocket: "Dockerized" code works for virtually any batch computing task you can think of.
Let's look at this example script that resides in the ECR image I have pushed:
import sys
import boto3

transcribe = boto3.client('transcribe')

def send_to_transcribe(job):
    # Audio file to transcribe; the transcription job is named after the
    # argument passed in from the Batch job's parameters.
    myrecording = "s3://audio/linkedindemo.mp3"
    transcribe.start_transcription_job(
        TranscriptionJobName=job,
        Media={'MediaFileUri': myrecording},
        MediaFormat='mp3',
        LanguageCode='en-US'  # language code assumed for this example
    )

send_to_transcribe(sys.argv[1])
As mentioned earlier, the job definition tells Batch how to run each job. If you supply parameters like in the example above, Batch substitutes them into the container's command, so the script inside the ECR image can capture them as plain command-line arguments via sys.argv. Now how cool is that?
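For completeness, here is a hedged sketch of a job definition that wires a parameter through to the container. The image URI, names, and the jobname parameter are placeholders; the key idea is the Ref::jobname placeholder in the command, which Batch replaces at runtime with the value supplied in parameters, so it arrives in the script as sys.argv[1].

import boto3

batch = boto3.client('batch')

batch.register_job_definition(
    jobDefinitionName='myJobDef',                                       # placeholder
    type='container',
    containerProperties={
        'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/transcribe-demo:latest',  # placeholder
        'vcpus': 1,
        'memory': 1024,
        # Ref::jobname is substituted with the 'jobname' value from
        # submit_job's parameters and lands in the script as sys.argv[1].
        'command': ['python', 'transcribe_job.py', 'Ref::jobname']
    }
)

Submitting a job with parameters={'jobname': 'my-audio-1'} would then run python transcribe_job.py my-audio-1 inside the container.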
Need help getting your code ready for production? Check out this ECR cheat sheet I made! For more ECR details and troubleshooting, I would recommend checking out the full docs here.
Conclusion
After the jobs in the Batch queue had all moved into the SUCCEEDED status, I could finally rest. More than two weeks had gone by since I first embarked on creating this process, and I finally had my reward: I opened the bucket I had set to receive the Transcribe results, all of my transcripts were present, and I was happy.
In this article we went over using several AWS services together to achieve high-throughput results. I am eager to see these resources developed further in the near future. If you found this article interesting, feel free to drop me a line, or get in contact with us over at JAYA Company!
I'd also like to briefly thank the AWS community on the official forums, Reddit, and beyond. Your help, advice, and friendliness have helped me get on my way to a beautiful career in the cloud :) Thanks again!
Note on Transcribe Custom Vocabularies
I would like to take a quick moment to mention that Custom Vocabularies in AWS Transcribe are an underdeveloped feature, and I would not recommend relying on them for enterprise solutions that require specific "find-and-identify-words-and-phrases-from-a-vocabulary" functionality. To my knowledge, the Transcribe resource is currently not accepting contributions from the community (correct me if I am wrong).