Setting up Data Collection (ML4Devs Newsletter, Issue 5)
Photo by Javier Miranda on Unsplash: https://unsplash.com/photos/MrWOCGKFVDg

Cliché: Without data, there can be no data science.

But it is true.

While learning data science, we mostly use public data sets or scrape data off the web. But in ML-assisted products, most of the data is generated and collected through business applications.

The first step in any data pipeline is instrumenting your application to:

  • Capture needed data when an interesting event happens in the application
  • Ingest the captured data into your data storage (typically an event queue like Kafka)

This sequence of data is commonly known as an event stream or clickstream. Data quality depends on the accuracy and completeness of what you capture and ingest.
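
For instance, here is a minimal sketch of the capture-and-ingest step, assuming the kafka-python client; the topic name and event fields are placeholders, not a prescription:

```python
# Capture an application event and ingest it into a Kafka topic.
# Assumes the kafka-python package; topic name and fields are illustrative.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def track_event(user_id: str, event_name: str, properties: dict) -> None:
    """Record one interesting application event on the event queue."""
    event = {
        "user_id": user_id,
        "event": event_name,
        "properties": properties,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("clickstream-events", value=event)

# Example: the user added an item to the cart.
track_event("user-42", "add_to_cart", {"sku": "SKU-123", "price": 19.99})
producer.flush()
```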

There are several alternatives for capturing and ingesting a clickstream.

Do It Yourself (DIY)

Write a small library in the language of your application that captures each event and sends it to a microservice endpoint or a cloud function for further processing and storage on your platform of choice (AWS, Google Cloud, Azure, Snowflake, Databricks, etc.).
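
As a rough illustration, such a capture library can be tiny. The endpoint URL, field names, and use of the requests package below are my assumptions, not a prescription:

```python
# A DIY capture library: serialize the event and POST it to your own
# collection endpoint (a microservice or cloud function). URL is hypothetical.
from datetime import datetime, timezone

import requests

COLLECTOR_URL = "https://api.example.com/v1/events"  # hypothetical endpoint

def capture(event_name: str, user_id: str, properties: dict) -> None:
    """Send one application event to the collection endpoint."""
    payload = {
        "event": event_name,
        "user_id": user_id,
        "properties": properties,
        "sent_at": datetime.now(timezone.utc).isoformat(),
    }
    # Short timeout so event capture never blocks the application for long.
    requests.post(COLLECTOR_URL, json=payload, timeout=2)

capture("signup_completed", "user-42", {"plan": "free", "referrer": "newsletter"})
```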

This is the most flexible alternative: you can optimize it to the needs of your application and your data requirements.

It also takes the most development effort. You need to write code to process the data and store it in a data lake or data warehouse. If you use 3rd-party analytics/ML applications, most of them can consume data from a popular lake or warehouse.
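
The storage side can start equally small. Here is a minimal sketch assuming an AWS setup with boto3; the bucket, prefix, and required fields are placeholders:

```python
# Lightly validate a batch of events and append them to a data lake as
# newline-delimited JSON. Bucket, prefix, and required fields are placeholders.
import json
from datetime import datetime, timezone
from uuid import uuid4

import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake"          # hypothetical bucket
PREFIX = "clickstream/raw"       # hypothetical prefix
REQUIRED_FIELDS = {"event", "user_id", "sent_at"}

def store_events(events: list) -> str:
    """Write valid events as one NDJSON object, partitioned by date."""
    valid = [e for e in events if REQUIRED_FIELDS.issubset(e)]
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    key = f"{PREFIX}/{day}/{uuid4()}.ndjson"
    body = "\n".join(json.dumps(e) for e in valid)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key
```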

Fully Outsource It

If you are doing analytics or business intelligence, you may use a tool like Google Analytics, Mixpanel, Amplitude, or Heap.

This is the quickest and easiest way to get started. These tools provide SDKs with simple APIs for sending the data, and they can also compute and display common analytics charts.
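
For example, with the Mixpanel Python library, sending an event is a one-liner; the project token and event fields below are placeholders:

```python
# Outsourced approach: one SDK call per event; the tool handles storage and charts.
# Assumes the mixpanel package; token and fields are placeholders.
from mixpanel import Mixpanel

mp = Mixpanel("YOUR_PROJECT_TOKEN")  # hypothetical project token

mp.track("user-42", "signup_completed", {"plan": "free", "referrer": "newsletter"})
```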

This approach is also the least flexible. I recommend it for analytics, but not for collecting data for data science or machine learning.

You should carefully examine the pricing tiers for data volume to decide whether it is cost-effective for your data load.

The Middle Path

There are a number of tools that provide a library to send events, plus a rich set of connectors to filter, lightly process, and route that data to multiple destinations (e.g., data lakes, warehouses, and popular 3rd-party tools).
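
The application-side usage typically looks like the sketch below, shown with the Segment-style analytics-python interface that RudderStack's SDK also follows; the module name, write key, and setup vary by tool and version, so treat this as illustrative:

```python
# Middle path: the application makes one simple call; routing and light
# processing to warehouses, lakes, and 3rd-party tools are configured in the tool.
# Module name and configuration are illustrative and vary by tool/version.
import analytics

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

analytics.track("user-42", "signup_completed", {"plan": "free", "referrer": "newsletter"})
analytics.flush()
```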

What is the best solution for you?

Tools like Fivetran and RudderStack offer valuable convenience and rich connectors. Whether they are worth it depends on how diverse your needs are and how deep your pockets are.

[Image omitted. Source: Fivetran]

[Image omitted. Source: RudderStack]

I recommend Do It Yourself if:

  • you have a high volume of events/data (convenience will most likely be expensive), or
  • your data processing is limited to a single cloud provider.

Consider Fully Outsource It only if you are collecting a moderate amount of data with a typical schema and are mostly doing analytics.

For the rest of the use cases, the tradeoffs will depend on your in-house data engineering expertise and the diversity of your data sources and processors. I suggest checking out the Snowplow and RudderStack GitHub repositories.

ML4Devs is a biweekly newsletter for software developers. The aim is to curate resources for practitioners to design, develop, deploy, and maintain ML applications at scale to drive measurable positive business impact. Each issue discusses a topic from a developer’s viewpoint.

Enjoyed this? Originally published on ML4Devs.com. Don't miss the next issue. Join 1.2K+ subscribers and get it in your email:
