Data mining and data enrichment with ChatGPT

In this article, we will explore how ChatGPT can be used to analyze and enhance data sets for a more comprehensive understanding of a particular subject. ChatGPT is a highly effective technology that has recently gained significant traction in the tech and business industries, despite some misconceptions surrounding its use.

Already an engineer and know how to code?

Jump directly to the Gist on GitHub.

Plants are amazing organisms

For the past decade, I've developed a keen interest in permaculture, plants, and their interdependence with people, our society, and the way we perceive and engage with them. Despite their magical properties and unique purposes, plants are often disregarded and misunderstood.

If you haven't had the opportunity to cultivate plants before, I encourage you to start by growing a simple houseplant and researching its care. Not only will you grow the plant, but you'll also grow as a person.

Before we delve into the technological aspects, here are some fascinating facts about plants that you may not be aware of and that I find particularly intriguing.

The mathematical secret of plants

Plants are complex living beings that use mathematical processes to construct themselves. They build themselves by absorbing carbon dioxide from the atmosphere through the stomata in their leaves, drawing up water through their roots, and taking in nutrients from the soil, with photosynthesis turning the carbon dioxide and water into the sugars they grow from.

  • The arrangement of leaves on a stem often follows a pattern in which each leaf is positioned at a 137.5-degree angle from the previous one, creating a spiral that reflects the Fibonacci sequence.
  • This arrangement of leaves and branches is called phyllotaxis, and it helps maximize exposure to sunlight. The pattern is based on the Golden Ratio.
  • The shape of some leaves, such as the Venus flytrap, can be described using mathematical models called fractal curves. These curves appear to be infinitely complex, but can be described using simple equations.
  • The pattern of leaves on a stem can be described using the phyllotactic spiral. This spiral is formed by the interaction of two opposing forces: the tendency of leaves to grow at a certain angle from the stem, and the need to space them out evenly to avoid shading each other.
  • The arrangement of flowers on a stem can be described using a mathematical model called a phyllotactic pattern. This pattern is based on the Golden Ratio, and it explains why some flowers, such as daisies, have a spiral pattern of petals.
  • The arrangement of seeds in some plant fruits, such as sunflowers, can be described using the Fibonacci spiral. This spiral reflects the Fibonacci sequence and allows seeds to be packed efficiently and uniformly.
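
The 137.5-degree figure in the list above can be derived from the Golden Ratio. The sketch below is illustrative only: it computes the golden angle and then places seeds using Vogel's sunflower model (radius = c × √n, angle = n × golden angle), a common textbook model that is assumed here rather than taken from any source cited above. The name `seedPositions` and the scale factor `c` are hypothetical.

```javascript
// Golden Ratio: phi = (1 + sqrt(5)) / 2, roughly 1.618
const PHI = (1 + Math.sqrt(5)) / 2;

// Golden angle in degrees: 360 / phi^2, roughly 137.5
const GOLDEN_ANGLE_DEG = 360 / (PHI * PHI);

// Place `count` seeds using Vogel's model (an assumed, illustrative model):
// seed n sits at radius c * sqrt(n) and angle n * golden angle.
function seedPositions(count, c = 1) {
  const positions = [];
  for (let n = 0; n < count; n++) {
    const radius = c * Math.sqrt(n);
    const theta = (n * GOLDEN_ANGLE_DEG * Math.PI) / 180; // degrees to radians
    positions.push({ x: radius * Math.cos(theta), y: radius * Math.sin(theta) });
  }
  return positions;
}

console.log(GOLDEN_ANGLE_DEG.toFixed(1)); // prints: 137.5
console.log(seedPositions(5));
```

Plotting a few hundred of these points produces the familiar interlocking spirals of a sunflower head.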


ChatGPT is an AI language model

AI only has a single emotion, sorriness.

ChatGPT wants to remind you - often - that it is an AI language model. Under the hood, however, it is a model trained with a process called deep learning.

This predictive model is designed to anticipate subsequent words. During training it predicts the next word in a passage and checks each prediction against the original material, improving as it goes through more iterations and receives more data. The GPT model has demonstrated a high success rate in prediction across numerous data sets. With each subsequent iteration, the model grows in size, resulting in GPT-2, GPT-3, and eventually GPT-4, GPT-5, GPT-6, and so forth.

After the model has been trained, it can utilize its understanding of language patterns to generate responses to prompts or queries. The model's answers are based on the patterns and relationships it learned during the training process.
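
To make "anticipating subsequent words" concrete, here is a toy sketch. It is not how GPT works internally (GPT uses a large neural network); it only counts which word most often follows another in a tiny sample text, which is about the simplest possible form of next-word prediction. The names `trainBigrams` and `predictNext` are hypothetical.

```javascript
// Count, for each word, how often every other word follows it.
function trainBigrams(text) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const counts = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    const followers = counts.get(words[i]) || new Map();
    followers.set(words[i + 1], (followers.get(words[i + 1]) || 0) + 1);
    counts.set(words[i], followers);
  }
  return counts;
}

// Return the most frequently seen follower of `word`, or null if unseen.
function predictNext(counts, word) {
  const followers = counts.get(word.toLowerCase());
  if (!followers) return null;
  let best = null;
  let bestCount = 0;
  for (const [candidate, count] of followers) {
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}

const model = trainBigrams("plants need light plants need water plants grow");
console.log(predictNext(model, "plants")); // prints: need
```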

Data mining and ChatGPT

Data mining involves several key steps, including data collection, organization, formatting, and analysis. Achieving high-quality results has typically required a custom framework or toolchain, which I have often used in the past to gather information on specific plant types and their cultivars.

Fortunately, ChatGPT's deep learning model has already streamlined the majority of these steps, with the exception of formatting and validation. This is where our prompt and understanding of existing data formatting models come into play.

By utilizing a GraphQL schema to describe the data types we require, we can instruct ChatGPT to generate models that return the desired data. We can define the data types we want to retrieve and even nest them within the model.

This approach enables the description of data types and setting of data requirements as models in ChatGPT. This means that prompts can receive a response in the specified model format.
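
As a rough sketch of this idea, a prompt can be assembled by pasting the GraphQL-style model into the request text. The function name `buildPrompt` and the exact wording of the instruction are assumptions for illustration; the fields shown are a small subset of the model used later in this article.

```javascript
// A GraphQL-style description of the data we want back.
const plantModel = `
type Plant {
  name: String!
  scientificName: String!
  family: String!
  genus: String!
}`;

// Combine the instruction, the model, and the plant name into one prompt.
function buildPrompt(model, binomial) {
  return [
    "Create a profile for the plant below and return it in this format,",
    "formatted as a JSON object.",
    "",
    "Here is the model:",
    model.trim(),
    "",
    `Supply the plant: ${binomial}`,
  ].join("\n");
}

console.log(buildPrompt(plantModel, "Salix matsudana"));
```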

Using your model to output quality data

Once you have defined the model you want to use with ChatGPT, you can instruct it to output that model in JSON format or any other preferred format, such as YAML or XML.

It is important to note that ChatGPT will typically describe this process in its responses, so you will need to specifically request that it only returns the JSON object you require.
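
Even with that instruction, a response can arrive wrapped in explanatory text. One defensive approach, sketched below, is to pull out the text between the first `{` and the last `}` and parse it. The helper name `extractJsonObject` is hypothetical, and this simple approach assumes the response contains a single top-level JSON object.

```javascript
// Extract and parse the JSON object embedded in a chat response.
// Returns null when no parseable object is present.
function extractJsonObject(responseText) {
  const start = responseText.indexOf("{");
  const end = responseText.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(responseText.slice(start, end + 1));
  } catch (err) {
    return null; // The text between the braces was not valid JSON.
  }
}

const reply =
  'Sure! Here is the profile:\n{"name": "Dragon Tree", "genus": "Dracaena"}\nLet me know if you need more.';
console.log(extractJsonObject(reply).genus); // prints: Dracaena
```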


"Lets get this boat on the road." - The AI education project

ChatGPT data mining in action

As an example, I will illustrate how I was able to data mine thousands of plant profiles, along with their validation reference data, and store it in an organized JSON format. This was achieved by utilizing the GraphQL models that were initially created in the application I have been developing.

Describing your model to ChatGPT

There are various methods to accomplish this step, but in my example, I will initially provide a simple model to ChatGPT and then refine it iteratively with the help of ChatGPT. To interpret the model, ChatGPT requires you to specify the model in GraphQL format within your prompt. The structure of the model/format you provide will serve as the foundation for the format in which ChatGPT will return the data.

Setup of the initial model

To establish the initial structure of our model with ChatGPT, we will start by defining the basics for the first iteration of our model.

lets create profiles for known plants by their binomials 
and return them in this format, but format them into JSON object

Here is the model:

  name: String!
  scientificName: String!
  family: String!
  genus: String!
  description: String!
  imageUrl: String!
  growthHabit: String!
  toxicity: String!
  cultivation: String!
  distribution: String!
  habitat: String!
  conservationStatus: String!

Supply the first plant: Aldrovanda vesiculosa
The first basic plant profile!
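
In the model above, the `!` suffix marks a field as required. A quick check like the following can confirm that a returned profile actually includes every required field. This is a minimal sketch: `requiredFields` only understands the simple `name: Type!` lines shown above and is not a full GraphQL parser, and the sample profile is hand-written for illustration rather than copied from a ChatGPT response.

```javascript
// Collect the names of fields whose type ends in "!" (required).
function requiredFields(model) {
  return model
    .split("\n")
    .map((line) => line.trim().match(/^(\w+):\s*.+!$/))
    .filter(Boolean)
    .map((match) => match[1]);
}

// List required fields that are missing or empty in a profile object.
function missingFields(profile, model) {
  return requiredFields(model).filter(
    (field) => profile[field] === undefined || profile[field] === null || profile[field] === ""
  );
}

const basicModel = `
  name: String!
  scientificName: String!
  family: String!
  genus: String!
`;

// Hand-written sample profile, deliberately missing "genus".
const profile = {
  name: "Waterwheel plant",
  scientificName: "Aldrovanda vesiculosa",
  family: "Droseraceae",
};
console.log(missingFields(profile, basicModel)); // prints: [ 'genus' ]
```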

Continual revision of the model

Although the basic model appears to be functioning correctly and producing the desired output, I believe that we can achieve even better and more comprehensive results. Let's enhance the model by utilizing the full GraphQL model as described in my API schema. It is important to note that this is being developed within an application created using AWS Amplify, for context.

The updated model request prompt

extend the model to the below and rewrite

  name: String!
  scientificName: String
  kingdom: Kingdom!
  phylum: String
  class: String
  order: String
  family: String
  genus: String
  species: String
  cultivar: Cultivar
  description: String
  images: [ImageMedia!]
  growthHabit: String
  toxicity: String
  cultivation: String
  distribution: String
  habitat: String
  conservationStatus: ConservationStatus!

ChatGPT knows where this is going thanks to its predictive language model.

Somehow it already knows nearly what I mean

The updated output for the plant profile

Looking a lot more complete, but can we do better?

My final revision of the model

For my final revision, I reiterate the model using GraphQL notation and specify that the plant profile should be returned in JSON format. It is important to reaffirm your expectations with ChatGPT, as it can be forgetful at times.

refine the model to this and rewrite the plant profile in JSON

type ImageMedia
{
  image: String!
  altText: String!
  order: Int!
}

type ReferenceLink
{
  url: String!
  title: String!
  order: Int!
}

type PlantSize
{
  minCM: Int!
  maxCM: Int!
}

type Plant
{
  primaryCommonName: String!
  commonNames: [String]
  scientificName: String
  description: String
  habit: String
  plantType: String
  kingdom: Kingdom!
  phylum: String
  class: String
  order: String
  family: String
  genus: String
  species: String
  cultivar: Cultivar
  images: [ImageMedia!]
  growthHabit: String
  toxicity: String
  cultivation: String
  distribution: String
  habitat: String
  physicalManagement: String
  biologicalManagement: String
  cultivationOptions: String
  edible: Boolean
  heightCM: PlantSize
  widthCM: PlantSize
  referenceLinks: [ReferenceLink]
  conservationStatus: ConservationStatus!
}

Next, I request the plant profile for Dracaena marginata, also known by its common name, Dragon Tree.

Now this is getting really cool, right?

Now ChatGPT improves my model

I have requested that ChatGPT provide additional fields that could enhance the model.

Yes please extend the model and amaze me!

Now provide me the full picture ChatGPT

This does help lol.

Now let's gather binomials

Now that the model has been established, it's time to move forward with requesting ChatGPT to provide me with the scientific/botanical names for plants, also known as binomials. These binomials will be utilized to create the plant profiles, as you have already seen in my earlier request for the original plant profiles.

The list continues on, but as expected it will exhaust its response at some point.

Resume the list to continue gathering binomials

As previously mentioned, the model will eventually run out of information, but it is possible to continually request it to resume. However, there will come a point where it may lose track of itself. To prevent this from occurring, I compiled a collection of the most widely used crop and plant types and created an array while eliminating duplicates. I later expanded this list by locating additional sources for names.
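
Eliminating duplicates from the collected names can be sketched as follows. The normalization rules here (trim whitespace, collapse internal spaces, compare case-insensitively) are assumptions for illustration, not a record of the exact cleanup used, and `dedupeBinomials` is a hypothetical name.

```javascript
// Remove duplicate binomials, keeping the first spelling encountered.
function dedupeBinomials(names) {
  const seen = new Set();
  const unique = [];
  for (const raw of names) {
    const cleaned = raw.trim().replace(/\s+/g, " ");
    const key = cleaned.toLowerCase();
    if (cleaned && !seen.has(key)) {
      seen.add(key);
      unique.push(cleaned);
    }
  }
  return unique;
}

console.log(dedupeBinomials(["Salix matsudana", "salix  matsudana ", "Dracaena marginata"]));
// prints: [ 'Salix matsudana', 'Dracaena marginata' ]
```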

Resume the list and have it continue to provide binomials

Be persistent in reminding it what you require

As you can see from the two previous responses, ChatGPT has been inconsistent in how it labels its JSON array of strings, first calling it JSON and then a string in CSS format. Whatever you say, ChatGPT; I'll use Sublime Text to clean this up.

Here is the output in Sublime Text as a single line. Let's format it as proper JSON.

ChatGPT's last response was a single-line JSON array of strings

I'll use the Sublime Text plugin Pretty JSON to format the JSON the way I actually need it.

Pretty JSON is built for formatting JSON data

Pretty JSON puts each entry on its own line (this isn't required; it just makes each entry easier to validate).
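
The same reformatting can be done without an editor plugin. Here is a minimal sketch using the built-in `JSON.parse` and `JSON.stringify`; the sample names are placeholders rather than the actual list from the response.

```javascript
// A single-line JSON array of strings, similar to what the chat returned.
const singleLine = '["Salix matsudana","Dracaena marginata","Aldrovanda vesiculosa"]';

// Parse, then re-serialize with two-space indentation: one entry per line.
const pretty = JSON.stringify(JSON.parse(singleLine), null, 2);
console.log(pretty);
```

A side benefit is that `JSON.parse` throws on malformed input, so a broken response is caught before it reaches the next step.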

Salix is one of my favorite plants - it's the Willow tree. One of my favorite trees to grow is the Corkscrew Willow - Salix matsudana. Line 9!

Salix matsudana - aka Corkscrew Willow - Photo credit: F.D. Richards @ Creative Commons


Let's bottle this magic up with the OpenAI API and Node.js

Automate the process w/ OpenAI + Node

Now that we have established the process with ChatGPT and have created our model, prompt, and anticipated outcome, let's automate the process.

To begin, we must establish a few patterns. As we will be collecting many binomials over time, we want to ensure that we only process unique binomials. This eliminates the need to continually query OpenAI, which, to be transparent, can be costly.

Our script will establish the following pattern:

# Include packages for HTTP interaction
# Load your OpenAI API key
# Load a JSON file containing the binomials we want to process
# Iterate the list of binomials
# Check if a file exists for the binomial, if it does skip it
# Query OpenAI
# Write the results to a file         
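
The outline above can be sketched in Node.js roughly as follows. This is not the script from the Gist: the helper names, the `profiles` output directory, and the file naming scheme are assumptions. The request targets OpenAI's legacy completions endpoint with the `text-davinci-003` model mentioned later in this article; OpenAI has since retired that model, so a current model name would need to be substituted. The sketch relies on the global `fetch` available in Node 18 and later.

```javascript
const fs = require("fs");
const path = require("path");

// Turn "Salix matsudana" into "salix-matsudana.json" (assumed naming scheme).
function fileNameFor(binomial) {
  return binomial.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-") + ".json";
}

// Keep only binomials that do not already have an output file.
function pendingBinomials(binomials, outDir) {
  return binomials.filter((b) => !fs.existsSync(path.join(outDir, fileNameFor(b))));
}

// Query the legacy completions endpoint and return the generated text.
async function queryOpenAI(apiKey, prompt) {
  const response = await fetch("https://api.openai.com/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model: "text-davinci-003", prompt, max_tokens: 1200 }),
  });
  if (!response.ok) throw new Error(`OpenAI request failed: ${response.status}`);
  const data = await response.json();
  return data.choices[0].text;
}

// Process each unprocessed binomial and write the result to its own file.
// `buildPrompt` is supplied by the caller and turns a binomial into a prompt.
async function run(apiKey, binomials, outDir, buildPrompt) {
  fs.mkdirSync(outDir, { recursive: true });
  for (const binomial of pendingBinomials(binomials, outDir)) {
    console.log(`Processing ${binomial}`);
    const text = await queryOpenAI(apiKey, buildPrompt(binomial));
    fs.writeFileSync(path.join(outDir, fileNameFor(binomial)), text.trim());
  }
}

// Example invocation (commented out because it makes paid API calls):
// const apiKey = fs.readFileSync("apikey.txt", "utf8").trim();
// const binomials = JSON.parse(fs.readFileSync("plants.json", "utf8"));
// run(apiKey, binomials, "profiles", (b) => `Return a JSON plant profile for ${b}`);
```

Checking for an existing file before each request is what makes the script safe to rerun: an interrupted run can be resumed without paying again for profiles that were already written.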

The Node JS Script

You can view the entire setup in a Gist on my GitHub, which also includes the full prompt and a description of plants.json and apikey.txt.

Check out the Gist above on my GitHub account for the full code, including the prompt and additional files

The additional files apikey.txt and plants.json

As you review this script, you will notice several key components. First, we load the plants.json file, which is a JSON-formatted array of strings containing the binomials that we will request in our prompt. Additionally, on line 6, we load the apikey.txt file, which contains the API key that you will need to generate within your OpenAI profile.
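
Loading the two files might look like the sketch below. The function name `loadInputs` and the validation rules are assumptions; only the file names come from the script described above.

```javascript
const fs = require("fs");
const path = require("path");

// Read apikey.txt and plants.json from `dir` and validate their shape.
function loadInputs(dir) {
  const apiKey = fs.readFileSync(path.join(dir, "apikey.txt"), "utf8").trim();
  if (!apiKey) throw new Error("apikey.txt is empty");

  const binomials = JSON.parse(fs.readFileSync(path.join(dir, "plants.json"), "utf8"));
  const valid = Array.isArray(binomials) && binomials.every((b) => typeof b === "string");
  if (!valid) throw new Error("plants.json must be a JSON array of strings");

  return { apiKey, binomials };
}

// plants.json is expected to look like: ["Salix matsudana", "Dracaena marginata"]
```

Keeping the key in its own file, outside the script, also makes it easier to avoid committing it to a repository by accident.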

Click View API keys to generate yours

After you run the script, the plants are processed to JSON

The screenshot below shows the outcome of running the script above with the apikey.txt and plants.json files set up correctly.

Processing, file exists, and processed data

Final thoughts and costing

As demonstrated in this brief example, ChatGPT is an incredibly powerful tool and marks the beginning of a new era of computing that utilizes collective data to produce exceptional outcomes. While it is crucial to validate all of the data you collect, the quality of the validated data and the starting point it provides is unparalleled.

In the process of constructing this solution, I used ChatGPT Plus, as well as the OpenAI API for the davinci-003 model, with up to 1200 tokens per request. This number of tokens ensures that I receive the full response I require. However, it comes at a high cost.

As you can see, I accrued almost $50 in charges in just two days of queries and only built 1800 plant profiles. There are currently 250,000 described plants, which at this rate, would cost approximately $6,950 to fully build out the entire data set, assuming that ChatGPT covers all of the plants.
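
The estimate follows from simple proportional scaling of the figures above, shown here as a quick calculation. It assumes the cost per profile stays constant across the whole data set.

```javascript
// Figures quoted above.
const totalSpentUSD = 50;
const profilesBuilt = 1800;
const describedPlants = 250000;

// Cost per profile, then scaled linearly to the full data set.
const costPerProfile = totalSpentUSD / profilesBuilt;
const projectedCost = costPerProfile * describedPlants;

console.log(costPerProfile.toFixed(4)); // prints: 0.0278
console.log(Math.round(projectedCost)); // prints: 6944
```

That works out to a little under three cents per profile, or roughly $6,950 for the full set.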

It would be nice to have an OpenAI Plus account with unlimited tokens.

This won't be sustainable - use it for enrichment

At this rate and cost this type of mining won't be sustainable or valuable for businesses that can't leverage the data they mine. It will make more sense to use traditional tools to gather data and then enrich the missing portions of the data with a tool like ChatGPT/OpenAI.

I expect the open source community to continue to contribute models they have trained and reach an eventual global model that far exceeds what the corporations in this space create. The open source community is known for serving the need without the greed and I believe this will produce models and tools that will eventually replace the AI gold rush. I expect this cycle of hype to die quickly as those solutions arrive.

If you found this article helpful, follow or connect with me here on LinkedIn.

Until next time - remember - HACK THE PLANET!

The movie that changed my life
