AI/Machine Learning And Data Wrangling

Creators of AI/Machine Learning solutions often need to wrangle "raw" data sets (i.e. prepare the data to "better expose" the underlying pattern) in order to create algorithm-consumable inputs and, specifically, to identify and eliminate what are called data leakage issues. This brief article provides a quick tour of Data Wrangling tools, which form a new category of productivity-boosters aimed at AI/Machine Learning solutions (another perspective on these tools can be found here).

With the democratization of data preparation through the availability of cloud service infrastructure, frameworks, and low-cost scaling support, wrangled inputs are poised to create differentiated value for the real-world applicability of AI/Machine Learning models.

"Democratizing data preparation increases throughput and allows you to leverage the collective wisdom of the broader organization to achieve better outcomes faster"
Source: Adam Wilson, "Let the people who know the data best do the wrangling"

Think of Data Wrangling as a discovery process, or an Extract-Transform-Load (ETL) Lite, that is friendly to business subject matter experts. A number of articles take pains to distinguish ETL from Data Wrangling; however, in the age of AWS Glue, I would opine that ETL and Data Wrangling activities are rapidly converging.

Data Wrangling background. In 2011, tool transformations typically covered dropping null values, filtering, and selecting the right data, including time series representation as appropriate. The combination of SQL + Python (Pandas) + Spark that the practitioner would have to dabble with was replaced by interactive tools such as OpenRefine and DataWrangler.
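To make those basic transformations concrete, here is a minimal sketch in plain Python (standard library only, rather than Pandas or Spark) of dropping null values, filtering, and arranging the result as a time series. The records, the field names, and the `wrangle` helper are all hypothetical and chosen purely for illustration.

```python
from datetime import date

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"day": "2011-03-02", "region": "west", "units": 12},
    {"day": "2011-03-01", "region": "west", "units": 7},
    {"day": "2011-03-01", "region": "east", "units": None},
    {"day": "2011-03-03", "region": "west", "units": 9},
]


def wrangle(rows, region):
    """Drop rows with nulls, keep one region, return a date-ordered series."""
    # Step 1: drop any row that contains a missing value.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    # Step 2: filter to the rows of interest.
    selected = [r for r in complete if r["region"] == region]
    # Step 3: select the needed fields and order them as a time series.
    series = [(date.fromisoformat(r["day"]), r["units"]) for r in selected]
    return sorted(series)


west_series = wrangle(raw_rows, "west")
for day, units in west_series:
    print(day.isoformat(), units)
```

Interactive wrangling tools let a business user perform these same three steps by pointing and clicking instead of writing the comprehension logic by hand.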

Data Wrangling current state. DataWrangler from the Stanford Visualization Group laid the groundwork, with some of its key contributors going on to found Trifacta.

Alongside Trifacta, today Alation, Crowdflower, Clearstory, Datameer, Datiku, Paxata, Platfora, SiSense, and Tamr are a few of the others that address this category of data-prep productivity-boosters for AI/Machine Learning solutions. Interestingly, some of these tools themselves use AI/Machine Learning to "automatically gather knowledge" about the data, enabling users who are not data scientists to experiment with models driven by transformed business features instead of requiring extensive IT team support.

Data Wrangling future? Data Wrangling tools do not yet provide a scored quantification of the "nudge-worthiness" of a data set. Nor do they, by default, provide outcome-driven measures of relevance classification for a consumer segment. The data science to support these (entropy scores and Brier curves) already exists; however, these techniques have not been incorporated in a way that a business user would employ to wrangle data. I see Data Wrangling tools progressing rapidly to encompass this "value-scoring" capability. In a nutshell, Data Wrangling tools might accelerate their establishment in the AI/Machine Learning market by supporting autonomous evaluation of the relative value of data sets under differently wrangled transformations.
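As a rough illustration of the building blocks mentioned above, the sketch below computes a Shannon entropy score for a wrangled column and a Brier score for probabilistic predictions. It uses the plain Brier score rather than a full Brier curve, and the sample data, variable names, and the idea of comparing two wranglings side by side are assumptions made for illustration, not a feature of any existing tool.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def brier_score(predicted_probs, outcomes):
    """Mean squared gap between predicted probabilities and 0/1 outcomes.

    Lower is better; 0.0 means every prediction was both certain and correct.
    """
    pairs = list(zip(predicted_probs, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# Hypothetical: the same customer-tier column under two wrangling choices.
raw_tiers = ["gold", "silver", "silver", "bronze", "gold", "silver", "bronze", "gold"]
collapsed_tiers = ["paid", "paid", "paid", "free", "paid", "paid", "free", "paid"]

# Hypothetical: predictions from models trained on each wrangled version.
outcomes = [1, 0, 1, 0]
probs_from_raw = [0.9, 0.1, 0.8, 0.3]
probs_from_collapsed = [0.6, 0.4, 0.6, 0.5]

print("entropy raw:", shannon_entropy(raw_tiers))
print("entropy collapsed:", shannon_entropy(collapsed_tiers))
print("Brier raw:", brier_score(probs_from_raw, outcomes))
print("Brier collapsed:", brier_score(probs_from_collapsed, outcomes))
```

In this made-up comparison, collapsing the tiers lowers the entropy of the column, and the model fed the collapsed version scores worse on the Brier measure. A "value-scoring" tool could surface that kind of side-by-side comparison to a business user without exposing the underlying math.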

Another key question for wrangling is whether data needs to be in the clear or can remain encrypted while being wrangled. This has important implications for privacy support, but wrangling encrypted data requires 20X or more the compute power of even current-day Deep Learning-optimized GPUs. It is conceivable that wrangling encrypted data may be one of the most important capabilities of these tools in the future.

Where do you see Data Wrangling as a process heading? How about the associated tools? What role will differential privacy measures and anonymization in general play within the scope of these tools? Do share your thoughts - drop me a note privately or via the comment section below.

About the Author:

Solutions created by Madhu center around AI/Machine Learning. Madhu, in his seventh year with his current employer, has three decades of experience nurturing the emergence of beachhead markets worldwide. Note that what is expressed by Madhu here is of his own interest and is in no way reflective of his employer.
