Intents for Data Science Workflows
A typical data science project involves the following stages:
1. Exploratory data analysis
2. Data representation experiments
3. Modeling experiments
As we work through these stages, there is a typical set of intentions that we wish to communicate and share with our team members. Explicitly documenting these intentions as we go can facilitate understanding and aid reproducibility in future iterations of the project. Use-case complexity varies: for simple use cases, it may make sense to capture all stages in a single workflow; when the nature of the use case or the nature of the data introduces complexity, we may want to capture each intention as its own workflow. The following are the typical intentions in an analytics or machine learning use case implementation.
1. Data Understanding: Data models used in business applications reflect what customers need to interact with the business or what employees need to perform a business function. Your analytics or machine learning use case, by contrast, needs a data representation tailored to a reporting or modeling task. Converting operational data into what your use case needs can require experimentation. This intent fits in the exploratory data analysis stage. Of course, a complete specification of data understanding also includes the statistical characteristics of the data representations you derive (a sketch follows the list).
2. Feature Assessments: Data representations need to be evaluated for their usefulness in a learning task. The learning task could be supervised (regression or classification) or unsupervised (for example, clustering). This intent fits in the data representation experiments stage. When the learning task is complex, it is often better to perform feature assessment as an independent workflow. The learning task may be complex because of non-linear relationships in regression, complex decision boundaries in classification, or non-spherical, non-separable (in raw attribute space) clusters in clustering (think of the Swiss roll dataset). Typically, feature importance and the visualizations that capture the complexities in the dataset (for example, non-linear relationships or decision boundaries) are communicated in a feature profile (see the feature-importance sketch after this list). Many teams also wish to rigorously test the utility of introducing new features with A/B testing; such assessments are best captured as an independent workflow.
3. Data Heterogeneity Assessment: Your data may include multiple sub-populations, and you need to assess whether this is the case for your dataset. This intent fits in the data representation stage. If you do have multiple sub-populations, you may want to choose a modeling method that can account for this fact (for example, models that explicitly use a hierarchy). A data profile characterizes each of these sub-populations. Characterizing many sub-populations can produce a lot of information, so if this is the case with your data, you may want to use a separate workflow for the data profile (a clustering-based sketch follows the list).
4. Model Explanations: Having separate models for prediction and interpretation is common today. If your data is heterogeneous, you should use a separate workflow for model explanations, because explaining datasets with multiple sub-populations requires producing an explanation for each sub-population using a method such as LIME or SHAP, and this can generate a lot of information. You may want to pick sample data from the different regions identified in your data profile and show how your model explains data points in those regions (see the per-segment explanation sketch after this list). This intent belongs to the modeling stage.
5. Model Selection: You may wish to consider different modeling methods as candidates for your use case and then make a selection based on the performance of these candidates on a hold-out dataset. If you have many candidate models, it is best to use a separate workflow for model selection, because you need to compare and discuss the performance of the candidates and communicate the strengths and weaknesses of each choice. This is a lot of information, and capturing it as a separate analysis makes the communication better in my view (a hold-out comparison sketch follows the list).
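To make these intents concrete, the sketches below illustrate one possible starting point for each of them. They are minimal illustrations on synthetic or hypothetical data, not a prescribed implementation. The first sketch corresponds to the data understanding intent: it derives an analytical, per-customer representation from hypothetical operational tables (the table and column names here are purely illustrative) and then summarizes its statistical characteristics.

```python
import pandas as pd

# Hypothetical operational tables; table and column names are illustrative only.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_value": [20.0, 35.5, 12.0, 50.0, 8.25, 19.99],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-11", "2024-01-20",
         "2024-01-02", "2024-03-15", "2024-04-01"]),
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "wholesale"],
})

# Derive a per-customer analytical representation from the operational data.
analytical = (
    orders.groupby("customer_id")
    .agg(order_count=("order_value", "size"),
         total_spend=("order_value", "sum"),
         last_order=("order_date", "max"))
    .reset_index()
    .merge(customers, on="customer_id")
)

# Statistical characteristics of the derived representation.
print(analytical.describe(include="all"))
```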
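For the feature assessment intent, one common ingredient of a feature profile is model-based feature importance. The sketch below uses scikit-learn's permutation importance on held-out data with a synthetic regression dataset standing in for a candidate data representation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a candidate data representation.
X, y = make_regression(n_samples=500, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data: one ingredient of a feature profile.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```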
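For the data heterogeneity assessment intent, a simple way to probe for sub-populations is to cluster the data at several cluster counts and inspect the silhouette score, then profile each cluster. This is only one of many ways to assess heterogeneity; the synthetic blobs below stand in for your dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three sub-populations standing in for your dataset.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.2, random_state=0)
X = StandardScaler().fit_transform(X)

# Probe a range of cluster counts; a clear silhouette peak suggests sub-populations.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")

# A simple per-cluster data profile for the chosen cluster count (here, 3).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
profile = pd.DataFrame(X, columns=["x1", "x2"]).assign(cluster=labels)
print(profile.groupby("cluster").agg(["mean", "std", "count"]))
```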
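For the model explanation intent, the sketch below produces per-segment explanations, assuming the shap package's Explainer API and a tree-based model. The segment labels here are randomly generated placeholders; in practice they would come from the regions identified in your data profile.

```python
import numpy as np
import shap  # assumes the shap package is installed
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=6, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Placeholder segment labels; in practice, use the clusters from your data profile.
segments = np.random.default_rng(0).integers(0, 3, size=len(X))

explainer = shap.Explainer(model)
for seg in np.unique(segments):
    # Explain a small sample drawn from each region identified in the data profile.
    sample = X[segments == seg][:5]
    shap_values = explainer(sample)
    print(f"segment {seg}: mean |SHAP| per feature =",
          np.abs(shap_values.values).mean(axis=0).round(3))
```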
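Finally, for the model selection intent, the sketch below compares a few candidate modeling methods on a hold-out dataset. The candidates and the single accuracy metric are illustrative; a real selection workflow would also discuss the strengths and weaknesses of each choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate modeling methods; the hold-out scores drive the selection discussion.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_hold, model.predict(X_hold))
    print(f"{name}: hold-out accuracy = {score:.3f}")
```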
The above set of intents is adequate to cover the documentation of an analytics or machine learning use case implementation. These intents can be specified as tags applied to a set of observations: the tag, which specifies the intent, together with the details captured in the observations, provides the complete context for that set of observations. An implementation sketch in KMDS will follow in a subsequent post.