Setting up Data Collection (ML4Devs Newsletter, Issue 5)
Photo by Javier Miranda on Unsplash: https://unsplash.com/photos/MrWOCGKFVDg

Cliché: Without data, there can be no data science.

But it is true.

While learning data science, we mostly use public data sets or scrape data off the web. But in ML-assisted products, most of the data is generated and collected through business applications.

The first step in any data pipeline is instrumenting your application to:

  • Capture needed data when an interesting event happens in the application
  • Ingest the captured data into your data storage (typically an event queue like Kafka)

This sequence of data is commonly known as an event stream or clickstream. Data quality depends on the accuracy and completeness of what you capture and ingest.
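
For instance, here is a minimal sketch of the capture-and-ingest step, assuming the kafka-python client; the topic name and event fields are placeholders, not a prescription:

```python
# Capture an application event and ingest it into a Kafka topic.
# Assumes the kafka-python package; topic name and fields are illustrative.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def track_event(user_id: str, event_name: str, properties: dict) -> None:
    """Record one interesting application event on the event queue."""
    event = {
        "user_id": user_id,
        "event": event_name,
        "properties": properties,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("clickstream-events", value=event)

# Example: the user added an item to the cart.
track_event("user-42", "add_to_cart", {"sku": "SKU-123", "price": 19.99})
producer.flush()
```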

There are several alternatives for capturing and ingesting a clickstream.

Do It Yourself (DIY)

Write a small library in the language of your application that captures each event and sends it to a microservice endpoint or a cloud function for further processing and storage on your platform of choice (AWS, Google Cloud, Azure, Snowflake, Databricks, etc.).
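
As a rough illustration, such a capture library can be tiny. The endpoint URL, field names, and use of the requests package below are my assumptions, not a prescription:

```python
# A DIY capture library: serialize the event and POST it to your own
# collection endpoint (a microservice or cloud function). URL is hypothetical.
from datetime import datetime, timezone

import requests

COLLECTOR_URL = "https://api.example.com/v1/events"  # hypothetical endpoint

def capture(event_name: str, user_id: str, properties: dict) -> None:
    """Send one application event to the collection endpoint."""
    payload = {
        "event": event_name,
        "user_id": user_id,
        "properties": properties,
        "sent_at": datetime.now(timezone.utc).isoformat(),
    }
    # Short timeout so event capture never blocks the application for long.
    requests.post(COLLECTOR_URL, json=payload, timeout=2)

capture("signup_completed", "user-42", {"plan": "free", "referrer": "newsletter"})
```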

This is the most flexible alternative: you can optimize it to the needs of your application and your data requirements.

It also takes the most development effort. You need to write code to process the data and store it in a data lake or data warehouse. If you use 3rd-party analytics/ML applications, most of them can consume data from a popular lake or warehouse.
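
The storage side can start equally small. Here is a minimal sketch assuming an AWS setup with boto3; the bucket, prefix, and required fields are placeholders:

```python
# Lightly validate a batch of events and append them to a data lake as
# newline-delimited JSON. Bucket, prefix, and required fields are placeholders.
import json
from datetime import datetime, timezone
from uuid import uuid4

import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake"          # hypothetical bucket
PREFIX = "clickstream/raw"       # hypothetical prefix
REQUIRED_FIELDS = {"event", "user_id", "sent_at"}

def store_events(events: list) -> str:
    """Write valid events as one NDJSON object, partitioned by date."""
    valid = [e for e in events if REQUIRED_FIELDS.issubset(e)]
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    key = f"{PREFIX}/{day}/{uuid4()}.ndjson"
    body = "\n".join(json.dumps(e) for e in valid)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key
```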

Fully Outsource It

If you are doing analytics or business intelligence, you may use a tool like Google Analytics, Mixpanel, Amplitude, or Heap.

This is the quickest and easiest way to get started. These tools provide SDKs with simple APIs for sending the data, and they can also compute and display common analytics charts.
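
For example, with the Mixpanel Python library, sending an event is a one-liner; the project token and event fields below are placeholders:

```python
# Outsourced approach: one SDK call per event; the tool handles storage and charts.
# Assumes the mixpanel package; token and fields are placeholders.
from mixpanel import Mixpanel

mp = Mixpanel("YOUR_PROJECT_TOKEN")  # hypothetical project token

mp.track("user-42", "signup_completed", {"plan": "free", "referrer": "newsletter"})
```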

This approach is also the least flexible. I recommend it for analytics, but not for collecting data for data science or machine learning.

You should carefully examine the pricing tiers for data volume to decide whether it is cost-effective for your data load.

The Middle Path

There are a number of tools that provide a library to send events, plus a rich set of connectors to filter, lightly process, and route that data to multiple destinations (e.g., data lakes, warehouses, and popular 3rd-party tools).
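
The application-side usage typically looks like the sketch below, shown with the Segment-style analytics-python interface that RudderStack's SDK also follows; the module name, write key, and setup vary by tool and version, so treat this as illustrative:

```python
# Middle path: the application makes one simple call; routing and light
# processing to warehouses, lakes, and 3rd-party tools are configured in the tool.
# Module name and configuration are illustrative and vary by tool/version.
import analytics

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

analytics.track("user-42", "signup_completed", {"plan": "free", "referrer": "newsletter"})
analytics.flush()
```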

What is the best solution for you?

Tools like Fivetran and RudderStack offer valuable convenience and rich connectors. Whether they are worth it depends on how diverse your needs are and how deep your pockets are.

[Image omitted. Source: Fivetran]

[Image omitted. Source: RudderStack]

I recommend Do It Yourself if:

  • you have a high volume of events/data (convenience will most likely be expensive), or
  • your data processing is limited to a single cloud provider.

Consider Fully Outsource It only if you are collecting a moderate amount of data with a typical schema and are mostly doing analytics.

For the rest of the use cases, the tradeoffs will depend on your in-house data engineering expertise and the diversity of your data sources and processors. I suggest checking out the Snowplow and RudderStack GitHub repositories.

ML4Devs is a biweekly newsletter for software developers. The aim is to curate resources for practitioners to design, develop, deploy, and maintain ML applications at scale to drive measurable positive business impact. Each issue discusses a topic from a developer’s viewpoint.

Enjoyed this? Originally published on ML4Devs.com. Don't miss the next issue. Join 1.2K+ subscribers and get it in your email:
