Data Engineer vs. Model Engineer: What Is the Future of ML Development?

Qingsong Yao

Principal Software Engineer at Salesforce

Published: September 3, 2023

Below are conversations about data engineering, ML engineering, training data, and model fine-tuning. Let me know your opinion.

D: I think the future will lean more toward data engineering, since data quality will be the key to a model's success.

A: If this hypothesis is true, I think this argument makes sense. Personally, I feel this issue is quite complex, and often the bottleneck may not be on the data engineer’s side.

B: Yep, historically, the work of data cleaning hasn’t always belonged to DA/DS/model engineers, although everyone dislikes it a lot and feels this work is subordinate.

A: Yes, this is a typical relationship issue between infra engineers and modeling scientists. In some cases, the importance of the infra team stands out more. For example, when a modeling team wants to build a data collection and labeling pipeline, DEs, who have a lot of experience working with different ML teams, may point out that the ML scientist's design is unreasonable in certain areas. They can offer suggestions based on issues other teams have encountered before. In this case, the infra team can provide tremendous help to the scientists, saving them a lot of time. From this perspective, data quality mainly depends on DEs.

Another example where DEs dominate is when, due to some business needs, the ML model needs a large amount of new training data added, and the traditional approach of adding data pipelines one by one may no longer be applicable. At this point, a comprehensive plan is needed, to efficiently and reliably increase data collection through a platform, while also ensuring the system is maintainable without issues. In this case, the data engineer is the key factor determining the success of the whole project.

A: From another perspective, in some cases, the modeling team’s impact on data quality is dominant. For example, the data collection of a project may be relatively simple, but annotation is very tricky. At this time, scientists and annotators need to iterate to refine the annotation guidelines, especially finding all kinds of corner cases. In this situation, from the DE’s platform perspective, their role is not very significant. Another example is, the data annotation itself may not have issues, but the data sampling problem is tricky, like highly imbalanced positive and negative samples. How to balance sampling bias and sampling efficiency requires the ML scientist’s expertise. In this case, data quality mainly depends on the modeling scientists.
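The sampling trade-off A mentions can be made concrete with a minimal sketch. It is not taken from the conversation: it assumes binary labels (1 = positive, 0 = negative), and the function names and the `keep_rate` parameter are hypothetical. It keeps every positive, keeps each negative with probability `keep_rate` (the efficiency side), and attaches a weight of `1 / keep_rate` to the kept negatives so that weighted statistics still reflect the original class balance (the bias side).

```python
import random


def downsample_negatives(labels, keep_rate, seed=0):
    """Keep every positive; keep each negative with probability keep_rate.

    Returns (indices, weights). Kept negatives get weight 1 / keep_rate so
    that weighted counts stay unbiased estimates of the original counts.
    All names here are illustrative, not from any particular library.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    rng = random.Random(seed)
    indices, weights = [], []
    for i, y in enumerate(labels):
        if y == 1:
            # Positives are rare, so none are dropped.
            indices.append(i)
            weights.append(1.0)
        elif rng.random() < keep_rate:
            # Each kept negative stands in for 1 / keep_rate originals.
            indices.append(i)
            weights.append(1.0 / keep_rate)
    return indices, weights


def weighted_positive_rate(labels, indices, weights):
    """Estimate the original positive rate from the weighted sample."""
    pos = sum(w for i, w in zip(indices, weights) if labels[i] == 1)
    return pos / sum(weights)


if __name__ == "__main__":
    # Synthetic data for illustration only: 1% positives.
    labels = [1] * 100 + [0] * 9900
    idx, w = downsample_negatives(labels, keep_rate=0.1)
    print(len(idx), "rows kept out of", len(labels))
    print("weighted positive rate:", weighted_positive_rate(labels, idx, w))
```

Choosing `keep_rate` is where the judgment comes in: a lower rate shrinks the training set faster but makes the weighted estimates noisier.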

C: One viewpoint is that fine-tuning quality is determined by the training data, while the difficulty of training the model itself has decreased.

A: I disagree with this viewpoint, because a scientist’s core value is not in training models in the first place. A project may spend half the time on building the measurement set and training set (including pipeline construction), 30% on iterating and handling various newly discovered failure patterns, and only about 20% on training the model itself, which is simple. So further reducing this part is not very important for improving overall efficiency, nor is it very important for scientists’ job security. From the perspective of time allocation above, how to collaborate with DE is actually the core skill for ML scientists.
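The time-allocation argument can be written out as a rough bound. This is a sketch under an added assumption: the three phases run one after another and do not overlap. Let p = 0.2 be the fraction of project time spent on training, and let s be a hypothetical speedup factor that easier tooling gives to that phase alone.

```latex
\[
  \text{overall speedup}
  = \frac{1}{(1 - p) + \dfrac{p}{s}}
  \;\le\; \frac{1}{1 - p}
  = \frac{1}{0.5 + 0.3}
  = 1.25
\]
```

Even if training became instantaneous, the project would finish at most 1.25 times faster, because the 80% spent on data sets and failure-pattern iteration is unchanged.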

D: If we change the question to data engineering vs. model engineering (not considering who does the work, purely discussing from an ML algorithm perspective): I feel there is still a lot of potential in the short term in doing data engineering on the existing transformer architectures (this was my point yesterday). There has been a lot of relevant academic work recently: doing only data engineering (whether data selection or generating synthetic data from large models) can greatly improve transformer performance and scaling behavior. But in the long run, model engineering is definitely still needed.

I agree with some classmates from yesterday that the transformer is unlikely to be the final architecture. Some work has already started investigating this.

In fact, looking back at history, data engineering and model engineering often exist in cycles. Each time a new useful model comes out, doing data engineering without model modifications can unlock a lot of improvements.

But when this space has been fully exploited, we need to go back to model engineering.

C: My viewpoint is that DEs and researchers should move more toward the engineering side and work more closely with the business, while engineers should shift left: understand basic ML principles, know how to use models, and understand model training. In the future, ML pipelines will become simpler and simpler (large companies have readily available solutions, and Databricks will have data and ML pipelines together), and tools will become easier to use. This gives everyone something to think about: what are the current issues with the composition of infra teams and ML teams in companies, and how will this evolve? Looking back, the cost of ML infra is very high in many companies; sometimes the data scientists who are hired are underutilized, or DS and infra teams compete for projects, leading to poor results for many ML projects. OpenAI's great success cannot be separated from Greg Brockman's engineering management capabilities.

F: It takes the scientists and researchers a long time to recognize that AI needs to be data centric. Engineers are much more shrewd about this.
