Course: MLOps Essentials: Model Development and Integration
Managed data pipelines
- [Instructor] In this chapter, we will look at the elements of MLOps that deal with the data engineering part. We start with building managed data pipelines. A data pipeline is an integral part of an ML workflow. A robust, managed data pipeline helps create repeatable machine learning processes while reducing manual effort and cost. Data issues usually become blockers for data scientists, who end up waiting for processed data before they can build models. Hence, it's important to invest in a well-managed data pipeline at the beginning of the ML project.

Let's review the typical functions of a data pipeline. It starts with acquiring raw data. This may be batch data or streaming data, possibly coming from the production environment. The data then goes through a process of cleansing, filtering, and validation. The resulting output is transformed and enriched to suit machine learning needs. Data elements that are ready for ML are known as features, and they are stored in a feature store. The feature store is then accessed by data scientists to train models. After the models are deployed in production, new data is collected and the processing repeats. This forms a continuous cycle of acquiring data, processing it, and improving models.

Now let's look at managed data pipelines. The word managed carries a lot of significance. Teams that do not manage their data pipelines properly end up with more issues, blockages, and additional effort in troubleshooting and fixing them. What constitutes a managed data pipeline? It begins with having an engineering life cycle for data pipelines. Because data pipelines are owned by developers rather than operations engineers, there is a tendency not to follow engineering practices; instead, data pipelines need to be treated as production code. A managed development life cycle like Agile needs to be followed for developing data pipeline code. There should be separate development, test, and production environments for these pipelines, with proper promotion policies and practices between them. Integrated deployment pipelines should be used to deploy new code into these data pipelines.

Next comes traceability of data. The lineage of data needs to be tracked: the source it was acquired from, the timeline, the processing it has gone through, and any exceptions found. Similarly, pipeline code should go through proper code versioning and deployment tracking. An operating data pipeline should include observability features like logging, audits, and monitoring.

Then comes reproducibility of results. It should be possible to reproduce the results in the feature store by simply reprocessing the raw data. Developers tend to do ad hoc data manipulations, which should be avoided. A strict data-as-code approach should be taken, and all steps needed to transform the raw data into the feature store should be available as version-controlled code.

Finally, there should be automation wherever possible in the pipeline. This includes how processing is triggered and workflows are executed: it could be on the arrival of new data or on a set schedule. Also, errors and exceptions should automatically trigger rollbacks as well as reprocessing of data. The feature store should be kept in a consistent state, and all errors should be reported and analyzed.
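To make the acquire-cleanse-transform-store cycle concrete, here is a minimal Python sketch of those stages. It is only an illustration, not something prescribed in the course: it assumes raw data arrives as a CSV batch, uses a local Parquet file to stand in for a feature store, and the file paths and column names (order_id, amount, created_at) are entirely hypothetical.

```python
# Minimal sketch of the pipeline stages: acquire raw data, cleanse and
# validate it, transform/enrich it into features, and store the features.
# Paths and column names are illustrative assumptions, not from the course.
import numpy as np
import pandas as pd


def acquire(path: str) -> pd.DataFrame:
    """Acquire a raw batch, e.g. data exported from the production environment."""
    return pd.read_csv(path)


def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Cleanse, filter, and validate: drop incomplete rows and impossible values."""
    df = df.dropna(subset=["order_id", "amount", "created_at"])
    df = df[df["amount"] >= 0]            # filter out records that cannot be valid
    assert df["order_id"].is_unique       # validation: fail loudly rather than guess
    return df


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform and enrich the cleansed data into ML-ready features."""
    df = df.copy()
    df["created_at"] = pd.to_datetime(df["created_at"])
    df["order_hour"] = df["created_at"].dt.hour        # simple enrichment example
    df["log_amount"] = np.log1p(df["amount"])           # scale-friendly feature
    return df


def store_features(df: pd.DataFrame, path: str) -> None:
    """Publish features so data scientists can train models against them."""
    df[["order_id", "order_hour", "log_amount"]].to_parquet(path, index=False)


if __name__ == "__main__":
    features = transform(cleanse(acquire("raw/orders_batch.csv")))
    store_features(features, "feature_store/orders.parquet")
```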
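Traceability and observability can be added to such a pipeline by recording, for every processed batch, where the data came from, when it was acquired, what processing it went through, what exceptions were found, and which version of the pipeline code produced it. The sketch below is one possible way to do that; the metadata fields and the idea of writing a JSON "sidecar" record are assumptions for illustration, and it presumes the pipeline code lives in a git repository.

```python
# Sketch of lineage and observability for each batch: a JSON lineage record
# plus standard logging. Field names and the sidecar approach are assumptions.
import json
import logging
import subprocess
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def current_code_version() -> str:
    """Version of the pipeline code (assumes the code is tracked in git)."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()


def record_lineage(source: str, steps: list[str], rows_in: int, rows_out: int,
                   exceptions: list[str], out_path: str) -> None:
    """Write a lineage record: where the data came from, when, and what was done to it."""
    record = {
        "source": source,
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "processing_steps": steps,
        "rows_in": rows_in,
        "rows_out": rows_out,
        "exceptions": exceptions,
        "code_version": current_code_version(),
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    log.info("lineage recorded for %s (%d -> %d rows)", source, rows_in, rows_out)
```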
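Finally, here is one way the automation and rollback behaviour could look in code. This is a sketch under stated assumptions: the run would be triggered by a scheduler or a new-data event (for example cron or an orchestrator, which is not shown), the feature store is a single file, and writing to a temporary file followed by an atomic swap is used to keep the store consistent if a run fails partway through.

```python
# Sketch of an automated, rollback-safe pipeline run. The trigger (schedule
# or data-arrival event) and the build_and_store callable are assumptions;
# build_and_store would run the acquire/cleanse/transform/store stages.
import logging
import os
import tempfile
from typing import Callable

log = logging.getLogger("pipeline")


def run_pipeline(raw_path: str, feature_path: str,
                 build_and_store: Callable[[str, str], None]) -> None:
    """One pipeline run: build features into a temp file, then swap it in atomically."""
    tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(feature_path) or ".")
    os.close(tmp_fd)
    try:
        build_and_store(raw_path, tmp_path)   # all processing goes to the temp file
        os.replace(tmp_path, feature_path)    # atomic swap: readers never see a half-written store
        log.info("feature store updated from %s", raw_path)
    except Exception:
        # "Rollback": the previous feature store file is untouched. Report the
        # failure so the batch can be analyzed and reprocessed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        log.exception("run failed for %s; feature store left unchanged", raw_path)
        raise
```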