Piecing Together Inconsistent Data: A Generative AI Approach

Introduction

Ever tried solving a jigsaw puzzle with pieces from different sets? Welcome to the world of data cleaning.

In today's data-driven landscape, one of the most formidable challenges organizations face is the task of cleaning up years of inconsistently entered data. Imagine having tens of thousands of rows of data that are as mismatched as jigsaw puzzle pieces from different sets. Now, what if solving this complex puzzle was the key to unlocking your company's potential for growth? Intriguing, isn't it?


The Business Problem

The stakes for our client were high. Amidst a significant reorganization, they were in dire need of accurate, reliable data to make informed decisions about which business lines were poised for rapid growth. The repercussions of not solving this data inconsistency issue were severe. They risked making ill-advised investments, potentially pouring resources into areas of their business that weren't as profitable as they seemed.


The Technical Challenge

Traditional data cleansing methods can feel like trying to solve a jigsaw puzzle while blindfolded. They are often limited to surface-level fixes, lacking the nuanced understanding of semantic meaning behind the data. These methods rely heavily on keyword matching, pattern recognition, and other rule-based techniques. As for manual data entry, that was a road our client had already gone down. It was a time-consuming, labor-intensive process that, despite weeks of effort, was still prone to human error and inconsistencies.
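To make that limitation concrete, here is a minimal sketch of a keyword-based classifier. The category names come from the prompt used later in this article; the keyword lists and sample titles are hypothetical and are not the client's actual rules or data.

```python
# Hypothetical keyword rules: illustrative only, not the client's actual rules.
KEYWORD_RULES = {
    "IT Support": ["help desk", "helpdesk", "service desk"],
    "Accounting": ["accounts", "bookkeeper", "tax"],
    "Call Center": ["call center", "customer service"],
    "Warehouse": ["warehouse", "forklift", "picker"],
}


def classify_by_keyword(title):
    """Return the first category whose keyword appears in the title, else 'Other'."""
    lowered = title.lower()
    for category, keywords in KEYWORD_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"


if __name__ == "__main__":
    # Made-up titles showing abbreviations and spelling variants.
    samples = ["Help Desk Tech", "Hlpdsk technician", "AP Clerk (temp)", "Fork lift driver"]
    for title in samples:
        print(f"{title!r} -> {classify_by_keyword(title)}")
```

Only the first sample matches a rule. The abbreviated and misspelled variants all fall through to "Other", which is the kind of gap that kept pushing the work back to manual review.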


Discovering PandasAI

The idea came while I was scrolling through LinkedIn and stumbled upon an article about the newly released PandasAI (https://docs.pandas-ai.com/en/latest/). What immediately caught my eye was its promise of easy integration with existing datasets and its compatibility with the latest generative AI models such as GPT-4. It seemed like the perfect tool to tackle our client's complex data puzzle.

Easy Implementation

After a quick tutorial on how the new library worked, I was able to throw together a simple proof of concept. The entire program is fewer than 40 lines long.

import pandas as pd
from pandasai.llm import OpenAI
from pandasai import SmartDataframe
from datetime import datetime

llm = OpenAI(api_token="<yourKeyHere>")

# Load a CSV file into a SmartDataframe object.
def load_dataframe(csv_file):
    with open(csv_file, 'r', encoding='utf-8', errors='replace') as f:
        document = pd.read_csv(f)
    return SmartDataframe(document, config={"llm": llm})

def generate_timestamp():
    now = datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S")

def save_dataframe(dataframe, filename):
    dataframe.to_csv(filename, index=False)
    print(f"Output saved to {filename}")

csv_file = 'active.csv'
smart_dataframe = load_dataframe(csv_file)

with open('classifyPromptV4.txt', 'r') as f:
    prompt = f.read()

# Use the chat method to ask a question and get a response from the SmartDataframe
response = smart_dataframe.chat(prompt)

# Create a new dataframe from the response; this assumes chat() returned tabular data rather than plain text
result_dataframe = pd.DataFrame(response)

timestamp = generate_timestamp()
filename = f"output_{timestamp}.csv"

save_dataframe(result_dataframe, filename)
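
One small hardening step worth considering before sharing a script like this is keeping the API key out of the source file. The sketch below is an assumption on my part rather than part of the original proof of concept: the helper name get_api_token is hypothetical, and OPENAI_API_KEY is simply a conventional variable name.

```python
import os


def get_api_token(env_var="OPENAI_API_KEY"):
    """Read the API token from an environment variable, failing loudly if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return token


# Possible usage with the script above (not executed here):
# llm = OpenAI(api_token=get_api_token())
```

Failing early with a clear message is usually easier to debug than an authentication error surfacing in the middle of a run.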

Some Prompt Engineering

Now that I had the code working, it was time to get the prompt working. Here's where it landed:

You are an expert human resources coordinator. You will assist me in a data classification activity related to position descriptions. I want you to analyze the data in the "Position Title" column. I am looking to classify the data into one of several categories.

The categories and some common titles that go in each category are listed below. Pay attention to the context of each category. Look for words that are synonymous with these to help you decide which category the titles should be grouped into.

1. IT Support - Operations, operations support, helpdesk, support, production support, help desk, help desk tech, service desk, tech writer, contract IT roles, computer operator

2. Accounting - Accounting Temp, Temporary Accounts Clerk, Interim Accounts Assistant, Accounts Payable Temp, Accounts Receivable Temp, Financial Data Entry Temp, Temporary Bookkeeper, Short-term Account Analyst, Contract Account Coordinator, Seasonal Tax Assistant

3. Call Center - Customer Service Temp, Temporary Support Agent, Interim Sales Representative, Short-term Billing Specialist, Contract Help Desk Operator, Seasonal Order Processor, Temporary Survey Coordinator, Emergency Response Operator, Temporary Quality Analyst, Bilingual Call Center Temp

4. Warehouse - Warehouse Temp, Temporary Picker, Interim Packer, Short-term Forklift Operator, Contract Inventory Clerk, Seasonal Shipping Coordinator, Temporary Receiving Clerk, Emergency Response Team Member, Temporary Quality Inspector, Contract Material Handler

5. Other - Anything that doesn't clearly fit into one of these categories
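
Because the category list changes with every prompt iteration, it can help to generate that part of the prompt from a single data structure and to check returned labels against the same structure. The following is a minimal sketch under that assumption. The helper names are hypothetical, the example titles are a shortened subset of the list above, and the original project kept its prompt in a plain text file.

```python
# Shortened subset of the categories above; extend each list as the prompt evolves.
CATEGORIES = {
    "IT Support": ["helpdesk", "service desk", "computer operator"],
    "Accounting": ["Accounting Temp", "Temporary Bookkeeper"],
    "Call Center": ["Customer Service Temp", "Temporary Support Agent"],
    "Warehouse": ["Warehouse Temp", "Temporary Picker"],
}
FALLBACK = "Other"


def build_category_block(categories, fallback=FALLBACK):
    """Render the numbered category list in the same layout as the prompt."""
    lines = []
    for number, (name, examples) in enumerate(categories.items(), start=1):
        lines.append(f"{number}. {name} - {', '.join(examples)}")
    lines.append(
        f"{len(categories) + 1}. {fallback} - "
        "Anything that doesn't clearly fit into one of these categories"
    )
    return "\n\n".join(lines)


def normalize_label(label, categories, fallback=FALLBACK):
    """Map a returned label onto the allowed set, ignoring case and surrounding whitespace."""
    allowed = {name.lower(): name for name in categories}
    return allowed.get(label.strip().lower(), fallback)


if __name__ == "__main__":
    print(build_category_block(CATEGORIES))
    print(normalize_label(" warehouse ", CATEGORIES))
    print(normalize_label("Logistics", CATEGORIES))
```

Routing any label outside the allowed set to "Other" keeps the output column limited to the five categories, even if the model invents a new one.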


Rapidly Improving Results

The initial results were promising. The first pass yielded an accuracy rate of nearly 70%. By the fourth iteration, the model's accuracy had soared to over 95%, effectively turning our jigsaw puzzle into a coherent picture.
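
One straightforward way to track a figure like this between prompt iterations is to compare the model's labels against a small hand-labeled sample. The sketch below assumes that approach; the function names and sample rows are made up for illustration, and this is not necessarily how the numbers above were produced.

```python
def accuracy(predicted, expected):
    """Return the fraction of predicted labels that match the hand-assigned labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one labeled row")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def mismatches(titles, predicted, expected):
    """List (title, predicted, expected) for every row the model got wrong."""
    return [(t, p, e) for t, p, e in zip(titles, predicted, expected) if p != e]


if __name__ == "__main__":
    # Hypothetical hand-labeled sample, not client data.
    titles = ["Help Desk Tech", "AP Clerk", "Forklift Op", "CSR Temp", "Picker"]
    expected = ["IT Support", "Accounting", "Warehouse", "Call Center", "Warehouse"]
    predicted = ["IT Support", "Other", "Warehouse", "Call Center", "Warehouse"]

    print(f"Accuracy: {accuracy(predicted, expected):.0%}")
    for title, got, want in mismatches(titles, predicted, expected):
        print(f"  {title!r}: predicted {got}, expected {want}")
```

The list of mismatches is often more useful than the score itself, since it shows which titles the next version of the prompt needs to cover.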

Results

The client's reaction was nothing short of ecstatic. We had not only solved a problem that had been a thorn in their side for weeks but also laid down a roadmap for efficiently tackling similar data-related challenges in the future. This was a win-win, saving them considerable time and resources while also opening doors for further collaboration. The same pattern could be applied to a range of other data quality issues their analytics group had uncovered.


Conclusion

The broader implications of this project are nothing short of revolutionary. Generative AI technologies like PandasAI are changing the game in data analysis. What used to be a daunting task, requiring specialized skills and endless hours of manual labor, can now be accomplished with an open-source library and a few dozen lines of Python code.

As AI technologies continue to evolve and mature, we can expect even more robust and efficient solutions for data analysis and other complex tasks. The jigsaw puzzles of yesterday are rapidly transforming into the crystal-clear, actionable insights of tomorrow.

#GenerativeAI #DataCleaning #DataAnalysis #AIInnovation

