Data engineer vs. model engineer: what is the future of ML development?
Below is a conversation about data engineering, ML engineering, training data, and model fine-tuning. Let me know your opinion.
D: I think the future will lean more toward data engineering, since data quality will be the key to a successful model.
A: If this hypothesis is true, I think this argument makes sense. Personally, I feel this issue is quite complex, and often the bottleneck may not be on the data engineer’s side.
B: Yep, historically, the work of data cleaning hasn’t always belonged to DA/DS/model engineers, although everyone dislikes it a lot and feels this work is subordinate.
A: Yes, this is a typical relationship issue between infra engineers and modeling scientists. In some cases, the importance of the infra team stands out more. For example, when a modeling team wants to build a data collection and labeling pipeline, DEs, who have a lot of experience working with different ML teams, may point out that the ML scientist's design is unreasonable in certain areas. They can make suggestions based on issues other teams have run into before. In this case, the infra team can provide tremendous help to the scientists, saving them a lot of time. From this perspective, data quality mainly depends on DEs.
Another example where DEs dominate is when, due to some business needs, the ML model needs a large amount of new training data added, and the traditional approach of adding data pipelines one by one may no longer be applicable. At this point, a comprehensive plan is needed, to efficiently and reliably increase data collection through a platform, while also ensuring the system is maintainable without issues. In this case, the data engineer is the key factor determining the success of the whole project.
A: From another perspective, in some cases, the modeling team’s impact on data quality is dominant. For example, the data collection of a project may be relatively simple, but annotation is very tricky. At this time, scientists and annotators need to iterate to refine the annotation guidelines, especially finding all kinds of corner cases. In this situation, from the DE’s platform perspective, their role is not very significant. Another example is, the data annotation itself may not have issues, but the data sampling problem is tricky, like highly imbalanced positive and negative samples. How to balance sampling bias and sampling efficiency requires the ML scientist’s expertise. In this case, data quality mainly depends on the modeling scientists.
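To make the sampling trade-off above concrete, here is a minimal sketch of one common approach: keep every positive, keep only a fraction of the negatives, and give each kept negative a weight that compensates for the ones dropped. This is an illustration, not the method anyone in the conversation described; the function name `downsample_negatives` and the `keep_rate` parameter are hypothetical.

```python
import random

def downsample_negatives(examples, keep_rate, seed=0):
    """Keep all positives and a random fraction of negatives.

    examples: list of (features, label) pairs, label is 1 or 0.
    keep_rate: probability of keeping each negative (hypothetical knob).
    Returns a list of (features, label, weight) triples, where each kept
    negative gets weight 1 / keep_rate so weighted counts stay unbiased
    in expectation.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    rng = random.Random(seed)
    sampled = []
    for features, label in examples:
        if label == 1:
            sampled.append((features, label, 1.0))
        elif rng.random() < keep_rate:
            sampled.append((features, label, 1.0 / keep_rate))
    return sampled
```

A lower `keep_rate` makes training cheaper but increases the variance of anything estimated from the negatives, which is the bias versus efficiency judgment call that A attributes to the ML scientist.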
C: One viewpoint is that model fine-tuning quality is determined by the training data, while the model trainer's job has become easier.
A: I disagree with this viewpoint, because a scientist’s core value is not in training models in the first place. A project may spend half the time on building the measurement set and training set (including pipeline construction), 30% on iterating and handling various newly discovered failure patterns, and only about 20% on training the model itself, which is simple. So further reducing this part is not very important for improving overall efficiency, nor is it very important for scientists’ job security. From the perspective of time allocation above, how to collaborate with DE is actually the core skill for ML scientists.
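The time-allocation argument above can be checked with a few lines of arithmetic. This is a sketch that takes A's rough 50/30/20 split at face value; the function name `remaining_time_fraction` is hypothetical.

```python
def remaining_time_fraction(phase_share, speedup):
    """Fraction of the original project time left after one phase is sped up.

    phase_share: share of total time spent in the phase (0.2 for training,
                 using the rough split quoted in the conversation).
    speedup: factor by which that phase alone gets faster.
    """
    if not 0.0 <= phase_share <= 1.0:
        raise ValueError("phase_share must be in [0, 1]")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - phase_share) + phase_share / speedup

# Training made 2x faster versus data work made 2x faster.
training_faster = remaining_time_fraction(0.2, 2.0)
data_work_faster = remaining_time_fraction(0.5, 2.0)
```

Under these assumed shares, doubling training speed leaves about 90% of the project time, and even an unlimited training speedup leaves 80%. Doubling the speed of the measurement-set and training-set work leaves 75%, which is why A treats the data side as the bigger lever.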
D: If we change the question to data engineering vs. model engineering (not considering who does the work, purely discussing from an ML algorithm perspective): I feel there is still a lot of potential in doing data engineering in the short term on the existing transformer architectures (this was my point yesterday). Recently there has been a lot of relevant academic work: doing only data engineering (whether data selection or generating synthetic data from large models) can greatly improve transformer performance and scaling laws. But in the long run, model engineering is definitely still needed.
I agree with some classmates yesterday that the transformer is unlikely to be the final architecture. Some work has already started investigating this.
In fact, looking back at history, data engineering and model engineering often exist in cycles. Each time a new useful model comes out, doing data engineering without model modifications can unlock a lot of improvements.
But when this space has been fully exploited, we need to go back to model engineering.
C: My viewpoint is that DEs and researchers should move more to the engineering side and work more closely with the business, while engineers should move more to the left: understand basic ML principles, know how to use models, and understand model training. In the future, ML pipelines will become simpler and simpler (large companies have readily available solutions, and Databricks will have data and ML pipelines together), and tools will become easier to use. This gives everyone something to think about: what are the current issues with the composition of infra teams and ML teams in companies, and how will this evolve in the future? Looking back, the cost of ML infra is very high in many companies. Sometimes the data scientists a company hires are underutilized, or DS and infra teams compete for projects, leading to poor results for many ML projects. OpenAI's great success cannot be separated from Greg Brockman's engineering management capabilities.
F: It takes the scientists and researchers a long time to recognize that AI needs to be data centric. Engineers are much more shrewd about this.