AI/Machine Learning And Data Wrangling
AI/Machine Learning solution creators often need to wrangle "raw" data sets (i.e., prepare data to "better expose" the underlying pattern) to create algorithm-consumable inputs and, specifically, to identify and eliminate data leakage issues, where information that would not be available at prediction time seeps into the training data. This brief article gives a quick tour of Data Wrangling tools, a new category of productivity boosters aimed at AI/Machine Learning solutions (another perspective on these tools can be found here).
With data preparation democratized through the availability of cloud service infrastructure, frameworks, and low-cost scaling support, wrangled inputs are poised to create differentiated value for the real-world applicability of AI/Machine Learning models.
"Democratizing data preparation increases throughput and allows you to leverage the collective wisdom of the broader organization to achieve better outcomes faster"
Source: Adam Wilson, "Let the people who know the data best do the wrangling"
Think of Data Wrangling as a discovery process, or an Extract-Transform-Load (ETL) Lite that is friendly to business subject matter experts. Note that a number of articles take pains to distinguish ETL from Data Wrangling; however, in the age of AWS Glue, I would opine that ETL and Data Wrangling activities are rapidly converging.
Data Wrangling background. In 2011, tool transformations typically covered dropping null values, filtering, and selecting the right data, including time series representation as appropriate. The combination of SQL + Python (Pandas) + Spark that the practitioner had to dabble with was replaced by interactive tools such as OpenRefine and DataWrangler.
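To make those basic transformations concrete, here is a minimal Pandas sketch of the hand-written route: drop null values, filter, select the relevant columns, and represent the result as a time series. The data, the column names (timestamp, reading, notes), the rule that negative readings are invalid, and the wrangle helper are all hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw data: sensor readings with a gap, a bad value,
# and a free-text column the model does not need.
raw = pd.DataFrame(
    {
        "timestamp": ["2011-01-03", "2011-01-01", "2011-01-02", "2011-01-04"],
        "reading": [12.5, 10.0, None, -1.0],
        "notes": ["ok", "ok", "missing", "sensor fault"],
    }
)


def wrangle(df: pd.DataFrame) -> pd.Series:
    """Return the valid readings as a time-indexed series."""
    cleaned = df.dropna(subset=["reading"])        # drop null values
    cleaned = cleaned[cleaned["reading"] >= 0]     # filter (assumed rule: negatives are invalid)
    cleaned = cleaned[["timestamp", "reading"]]    # select only the columns needed
    # Time series representation: parse the dates and use them as a sorted index.
    cleaned = cleaned.assign(timestamp=pd.to_datetime(cleaned["timestamp"]))
    return cleaned.set_index("timestamp")["reading"].sort_index()


series = wrangle(raw)
print(series)
```

Interactive wrangling tools aim to let a business user reach the same result by pointing and clicking rather than by writing this kind of script.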
Data Wrangling current state. DataWrangler from the Stanford Visualization Group laid the groundwork with some of its key contributors founding Trifacta.
Alongside Trifacta, today Alation, Crowdflower, Clearstory, Datameer, Datiku, Paxata, Platfora, SiSense, and Tamr are among the others that address this category of data-prep productivity boosters for AI/Machine Learning solutions. Interestingly, some of these tools themselves feature AI/Machine Learning to "automatically gather knowledge" about the data, enabling users who are not data scientists to experiment with models driven by transformed business features instead of requiring extensive IT team support.
Data Wrangling future? Data Wrangling tools do not yet provide a scored quantification of the "nudge-worthiness" of a data set. Nor do they, by default, provide outcome-driven measures of relevance classification for a consumer segment. The data science to support these (entropy scores and Brier curves) already exists; however, these techniques have not been incorporated in a way a business user would employ to wrangle data. I foresee Data Wrangling tools progressing rapidly to encompass this "value-scoring" capability. In a nutshell, Data Wrangling tools might accelerate their establishment in the AI/Machine Learning market by supporting autonomous evaluation of the relative value of data sets based on differently wrangled transformations.
Another key area of wrangling is whether data needs to be in the clear or could remain encrypted while being wrangled. This has important implications for privacy support, but it requires 20X or more the compute power of even current-day Deep Learning-optimized GPUs. It is conceivable that wrangling encrypted data may be one of the most important capabilities of these tools in the future.
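As a rough illustration of what such value scoring could build on, the sketch below computes two standard quantities in plain Python: the Shannon entropy of a wrangled column, and the Brier score of predicted probabilities against observed outcomes. Note that it uses the Brier score, a simpler relative of the Brier curves mentioned above, and that the function names and sample values are hypothetical. It is not a description of how any existing wrangling tool scores data.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical comparison of two wranglings of the same raw field.
coarse_segments = ["a", "a", "a", "a"]   # everything collapsed into one bucket
finer_segments = ["a", "b", "a", "b"]    # two equally likely buckets

print(shannon_entropy(coarse_segments))  # no variation left to learn from
print(shannon_entropy(finer_segments))   # one bit of information per record

# Lower Brier score means better-calibrated predictions from the downstream model.
print(brier_score([0.5, 0.5], [1, 0]))
```

A tool could, in principle, compute scores like these for each candidate transformation and present the comparison to the business user, rather than leaving the evaluation to a data scientist.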
Where do you see Data Wrangling as a process heading? How about the associated tools? What role will differential privacy measures and anonymization in general play within the scope of these tools? Do share your thoughts - drop me a note privately or via the comment section below.
About the Author:
Solutions created by Madhu center around AI/Machine Learning. Madhu, in his seventh year with his current employer, has three decades of experience nurturing the emergence of beachhead markets worldwide. Note that the views Madhu expresses here are his own and in no way reflect those of his employer.