A Three-Step Summary of How I Approach Data
I’m writing this to briefly summarize how I pragmatically approach data science problems. I don’t believe intractable problems exist, although I’ll admit we currently lack the tools and insight to solve every problem. Creativity is the ingredient needed to make up for this deficiency, but I digress.
First, I identify the atomic unit of data. For images, this would be the pixel. For financial data, this would be the transaction. For language processing, this would be the character. For social network analysis, this would be the node. These atomic units carry little information in isolation, and I have trouble coming up with business problems that hinge on a single pixel (outside of, say, Facebook ad campaigns). Instead, atomic units form local structures with similar units in close proximity. The exact definitions of “similar” and “local” change with the question being asked, but I conceptualize them as informational analogs of elements and molecules. Once these local structures are aggregated, we have information of substance.
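As a minimal sketch of the idea, take language processing, where the atomic unit is the character. One possible definition of “local” (an assumption chosen for illustration, not the only choice) is a window of n adjacent characters. The helper name `char_ngrams` is hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams.

    Each character is an atomic unit; a run of n adjacent
    characters is one "local structure".
    """
    # range() is empty when the text is shorter than n, so the result is empty too.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Aggregating the local structures gives a simple summary of the text.
counts = char_ngrams("banana", n=2)
print(counts.most_common())
```

A single character says almost nothing, but the aggregated counts already hint at repeated patterns in the text. Changing `n` changes what “local” means, which is the kind of choice that depends on the question being asked.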
The second step is to build data structures with feature engineering. While machine learning algorithms are fantastic at finding anomalies and classifying data, it is the job of the data scientist to create the initial structures on which the algorithms will operate. This is where art meets science, and experience in the field reduces development time.
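To make this concrete, here is a small sketch for financial data, assuming a toy list of (account, amount) transactions. The data, the function name `account_features`, and the three features are all hypothetical; real features would come out of conversations with the people who know the domain.

```python
from collections import defaultdict
from statistics import mean

# Made-up transactions for illustration: (account_id, amount)
transactions = [
    ("a1", 20.0), ("a1", 25.0), ("a1", 400.0),
    ("a2", 10.0), ("a2", 12.0),
]


def account_features(rows):
    """Group transactions by account and derive per-account features."""
    by_account = defaultdict(list)
    for account_id, amount in rows:
        by_account[account_id].append(amount)

    features = {}
    for account_id, amounts in by_account.items():
        avg = mean(amounts)
        features[account_id] = {
            "n_transactions": len(amounts),
            "mean_amount": avg,
            # A large ratio flags one transaction far above the account's norm.
            "max_to_mean_ratio": max(amounts) / avg,
        }
    return features


features = account_features(transactions)
for account_id, row in features.items():
    print(account_id, row)
```

The algorithm never sees the raw transactions here, only the per-account rows. Whether a ratio like this is useful is exactly the kind of judgment that experience and iteration are meant to settle.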
Which structures to create is decided through in-depth conversations with subject matter experts and key stakeholders. In the CRISP-DM model, this falls under business understanding. Rapid prototyping to explore the design space is paramount: experience can inform which direction to head in initially, but only iterative development will get you to an optimized answer. The best solution always depends on what executive leadership wants to solve. In the future, I’ll elaborate on which structures I have a proclivity for and what I’ve found works for the problems I’ve worked on.
The third step is algorithm selection and tuning. The type of algorithm chosen depends on the business problem, the feature engineering, and the desired output. The exact algorithm selected needs to meet a variety of practical requirements, such as model size, execution time, maintainability, community support, and available documentation. Performance on the relevant metrics is also important but should be considered in tandem with the other constraints. For example, a business problem that requires optimizing for single-class precision may not be best solved by the model with the highest ROC AUC.
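The precision example can be illustrated with a toy comparison. The labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption. Both metrics are written out in plain Python so the sketch stays self-contained. In practice a library implementation would be the usual choice.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at(labels, scores, threshold=0.5):
    """Precision for the positive class at a fixed decision threshold."""
    flagged = [y for y, s in zip(labels, scores) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else 0.0


# Made-up data: 4 positives followed by 6 negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.65, 0.55, 0.3, 0.2, 0.1, 0.05]
model_b = [0.9, 0.8, 0.3, 0.2, 0.4, 0.35, 0.25, 0.1, 0.05, 0.02]

auc_a, auc_b = roc_auc(labels, model_a), roc_auc(labels, model_b)
prec_a, prec_b = precision_at(labels, model_a), precision_at(labels, model_b)

print(f"model A: ROC AUC={auc_a:.3f}, precision={prec_a:.3f}")
print(f"model B: ROC AUC={auc_b:.3f}, precision={prec_b:.3f}")
```

On this toy data, model A ranks the classes better overall, yet model B is the one whose flagged cases are all true positives at the chosen threshold. Model B pays for that with lower recall, which is why the metric has to be read alongside the business constraint.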
In summary, I’m echoing what experienced data scientists keep reiterating: the solution is found in understanding the data, and cleaning the data is foundational to constructing a robust and performant solution to business problems.