Exploring the Impact of ChatGPT Code Interpreter on Data Science
Tony Gillett
Product Director at Morningstar | Product Leader | FinTech | ML and Data Science | API and SaaS Evangeliser | Deep experience in the investment funds industry | Cross Functional Collaborator | IoT | Automation
In this article, we delve into the world of generative AI, focusing on a new development: the ChatGPT Code Interpreter. This innovative tool is an extension of OpenAI's ChatGPT, but with a twist. Unlike the original ChatGPT, which excels at generating human-like text, the Code Interpreter is designed to understand, generate and execute code. This makes it a potentially powerful tool in the field of data science, where coding is a fundamental skill. By exploring a simple predictive model built using this technology, we aim to shed light on its capabilities and potential impact on data science.
Is ChatGPT Code Interpreter a game changer for data science or is it hype? Let’s start with some observations.
What it’s good at: Loading, interpreting and relating data; evaluating its results, drawing conclusions, making suggestions and offering options, and setting out the strengths and weaknesses of those suggestions.
What can be frustrating: It’s forgetful, at least as far as the code it writes — it struggles to revisit code it wrote a few steps back, often having to write it again (and encountering the same problems it had already overcome).
What’s essential: As with all generative AI, at this stage of development at least, good prompt technique is essential for good results. If you want to use the code it creates in your own environment, keep track of it as you go rather than expecting ChatGPT to give you a complete dump of all the code when you’ve finished the task — it seems to guess at the code it previously wrote, often making simplistic assumptions. The good news is that the chat history retains every segment of code, so you can review (and copy) them whenever you wish.
Let’s get into our sample task. Can we build a simple model to predict the onset of ionospheric storms from a set of ionosonde data? We obtain our observation data from the UK Solar System Data Centre and, so that we know when storms occurred, we get the Ap Index from NOAA.
This is the data we provide to ChatGPT. It’s a zip file containing multiple files.
Let’s see how the conversation goes. First, I ask it to load the index data.
Take a look in the “ap-index-data” directory in the zip file. You’ll find a series of files, each of which contains data for the year identified by its filename — the data we’re looking to extract is called the “ap index”. Here’s how to read the data. Characters 1–2 are the YEAR (note that this is two digits and refers to years in the twentieth century if greater than 50, or the twenty-first century otherwise — for example, a value of “90” refers to 1990, a value of “03” refers to 2003). Characters 3–4 are the MONTH, characters 5–6 are the DAY. Characters 32–34 are the “ap index” value for time 0100, 35–37 for time 0400, 38–40 for time 0700, 41–43 for time 1000, 44–46 for time 1300, 47–49 for time 1600, 50–52 for time 1900, and 53–55 for time 2200. Bear in mind the data we’re comparing to is intraday, so please make a separate row for each of the above, per day, i.e. each day will have 8 rows, one for each of the ap index values during the day. Here’s how to interpret the data in terms of ionospheric storms: ap-index values from 0–7 are considered “quiet”, 8–15 “unsettled”, 16–29 “active”, 30–49 “minor storm”, 50–99 “major storm”, and 100 and above classified as a “severe storm”. Please store these categories against the data. I’d strongly suggest that you convert the year, month and day values into a datetime field.
Some points to note: you’ll see I’ve given the information necessary to load the appropriate data from the files but I’ve done so informally. ChatGPT is easily smart enough to understand. I’ve told it what the data means, in terms of storm status. I’ve asked it to create separate records for each intraday value (the original source is one record per day). From experience, I’ve found that it can be inclined to store dates and times as strings, just as it finds in the original data, so I’ve nudged it to convert to a datetime value. This can help us later on, when it comes to combining data from multiple sources.
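For reference, the fixed-width layout and storm categories described in the prompt can be sketched in Python. This is my own illustration of the parsing rules above, not the code ChatGPT generated; the columns between the date and the first ap value are assumed to hold fields we don’t use.

```python
from datetime import datetime

# Column slices (0-based) for the eight intraday ap values,
# i.e. characters 32-34, 35-37, ... 53-55 in the prompt's 1-based terms.
AP_SLICES = [(31, 34), (34, 37), (37, 40), (40, 43), (43, 46), (46, 49), (49, 52), (52, 55)]
AP_HOURS = [1, 4, 7, 10, 13, 16, 19, 22]  # 0100, 0400, ... 2200

def classify_ap(ap):
    """Map an ap-index value to the storm category defined in the prompt."""
    if ap <= 7:
        return "quiet"
    if ap <= 15:
        return "unsettled"
    if ap <= 29:
        return "active"
    if ap <= 49:
        return "minor storm"
    if ap <= 99:
        return "major storm"
    return "severe storm"

def parse_ap_line(line):
    """Yield one (datetime, ap, category) row per intraday reading."""
    yy = int(line[0:2])
    # Two-digit year: > 50 means twentieth century, otherwise twenty-first.
    year = 1900 + yy if yy > 50 else 2000 + yy
    month = int(line[2:4])
    day = int(line[4:6])
    for (start, end), hour in zip(AP_SLICES, AP_HOURS):
        ap = int(line[start:end])
        yield datetime(year, month, day, hour), ap, classify_ap(ap)
```

Each source line expands into eight rows, matching the "8 rows per day" instruction in the prompt.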
Here’s its response to the first instruction.
Let’s visualise this data — please do so in the way that you think is best.
What patterns do you see in the data?
Can you look into this further, please? I’d like to identify specific patterns.
It’s doing a great job of taking initiative, identifying trends which suggest seasonality and correlation with the solar cycle, and suggesting further analyses.
Let’s move on to the task of building a predictive model.
I’d like to explore if we can predict the onset of a storm a few days in advance. How do we best approach that question?
Now I’m going to introduce our observations data. The format of the data is rather unusual as it was likely just written out to a terminal from a data logging system, so I’m going to detail how it should load and interpret the data.
Let’s define a storm as ap-index >= 50. In the zip file, in the ‘Observations’ folder, you’ll find a file called RL052_199501_209501.txt which contains Ionosonde data from the Chiltern station operated by Rutherford Appleton Laboratory. Note that the data file isn’t particularly clean. Here’s what I’ve noticed about it. The line “Time is given as UT+00” is repeated a few times — it can be ignored. It appears that it was probably a number of files joined together — notice that the header “YYYYMMDD HHMMSS 00 10 03 07 16 04 “ is repeated a number of times throughout the file. The header is important, of course, because it tells you what the file contains, but don’t be fooled by it being repeated. Here’s a description of the codes referenced in the data (Parameter, Code, Units, Description):
foF2, 0, .1 MHz, F2 layer o-mode (ordinary) critical frequency.
fxF2, 1, .1 MHz, F2 layer x-mode (extraordinary) critical frequency.
fzF2, 2, .1 MHz, F2 layer z-mode critical frequency.
M3000F2, 3, 0.01, F2 layer M factor (the ratio of the maximum usable frequency divided by the critical frequency).
h’F2, 4, km, F2 layer o-mode minimum virtual height.
hpF2, 5, km, An estimate of the true height of the F2 layer (measurement of the ordinary mode virtual height at a frequency of 83.4% of the foF2).
h’Ox, 6, km, F layer minimum virtual height of the x-mode trace at a frequency equal to the foF2.
MUF3000F2, 7, .1 MHz, F2 layer maximum usable frequency for a 3000 km path.
hc, 8, km, The height of the maximum obtained by fitting a theoretical h’F curve for the parabola of best fit to the observed ordinary mode trace near foF2 and correcting for underlying ionization.
qc, 9, km, EF layer scale height.
foF1, 10, .01 MHz, F1 layer o-mode critical frequency.
fxF1, 11, .01 MHz, F1 layer x-mode critical frequency.
M3000F1, 13, .01 MHz, F1 layer M factor (see code 03).
h’F1, 14, km, F1 layer o-mode minimum virtual height.
h’F, 16, km, F layer o-mode minimum virtual height.
MUF3000F1, 17, .1 MHz, F1 layer maximum usable frequency (see code 07).
The format is fixed length, and note that the numeric values are accompanied by characters. Here’s how to load the file: From position 1, the next 8 characters are the date in YYYYMMDD. From 10, the next 6 characters are the time in HHMMSS. From 17, the next 3 characters are parameter code 0 from the above list. It’s an integer. From 23, the next 3 characters are parameter code 10. It’s an integer. From 29, the next 3 characters are parameter code 3. It’s an integer. From 35, the next 3 characters are parameter code 7. It’s an integer. From 41, the next 3 characters are parameter code 16. It’s an integer. From 47, the next 3 characters are parameter code 4. It’s an integer. Therefore, please load up the values for data point 00 (foF2), 10 (foF1), 03 (M3000F2), 07 (MUF3000F2), 16 (h’F) and 04 (h’F2). I suggest you convert the date and time values into a single datetime value — that’ll help when we come to relate the data to other time series data sets. We have two objectives — first, to chart the data so that it can be explored interactively; secondly, to create a model that can identify ionospheric storms and, ideally, predict them if a real-time feed is provided. Note that we are focused on “F layer” data. In the zip file we also have data that will tell us when a storm is in progress, but we’ll come to that later. Let’s get the observations data cleaned and loaded first.
ChatGPT encounters a few coding problems, which it swiftly overcomes without further prompting, and asks how to proceed.
Let’s take a moment to look behind the scenes at what ChatGPT is doing. At any point, we can view the code it's generating by selecting “Show Work”. For example, here’s the code it wrote to load the index data.
import pandas as pd

# Define a function to clean the values
def clean_value(value):
    """Extract the numeric part of the value and convert to int, ignoring any trailing characters."""
    return int(''.join(filter(str.isdigit, value)))

# Define a function to load and clean the data
def load_ionosonde_data(filename):
    """Load and clean the Ionosonde data from the specified file."""
    # Initialize lists to store the data
    datetimes = []
    foF2s = []
    foF1s = []
    M3000F2s = []
    MUF3000F2s = []
    hFs = []
    hF2s = []

    # Open the file and read the data
    with open(filename, 'r') as file:
        for line in file:
            if line.strip() and not line.startswith("Time is given as UT+00") and not line.startswith("YYYYMMDD HHMMSS"):
                # Parse the date and time
                year = int(line[0:4])
                month = int(line[4:6])
                day = int(line[6:8])
                hour = int(line[9:11])
                minute = int(line[11:13])
                second = int(line[13:15])
                datetime = pd.Timestamp(year, month, day, hour, minute, second)

                # Parse the data values
                foF2 = clean_value(line[16:19])
                foF1 = clean_value(line[22:25])
                M3000F2 = clean_value(line[28:31])
                MUF3000F2 = clean_value(line[34:37])
                hF = clean_value(line[40:43])
                hF2 = clean_value(line[46:49])

                # Append the data to the lists
                datetimes.append(datetime)
                foF2s.append(foF2)
                foF1s.append(foF1)
                M3000F2s.append(M3000F2)
                MUF3000F2s.append(MUF3000F2)
                hFs.append(hF)
                hF2s.append(hF2)

    # Convert the lists to a DataFrame
    data = pd.DataFrame({
        'datetime': datetimes,
        'foF2': foF2s,
        'foF1': foF1s,
        'M3000F2': M3000F2s,
        'MUF3000F2': MUF3000F2s,
        'hF': hFs,
        'hF2': hF2s
    })
    return data

# Load the Ionosonde data
ionosonde_data = load_ionosonde_data("/mnt/data/Observations/RL052_199501_209501.txt")
ionosonde_data.head()
One foible I noticed working through this task is that the runtime environment running in the background seems to be quite volatile. As you ask it to do more, variables it previously defined disappear and code it previously wrote seems to be forgotten, forcing it to rework previous accomplishments. If you leave the conversation dormant for a while, the runtime appears to be deleted entirely, forcing it to start from scratch. I wonder whether these are symptoms of its beta status, or whether it’s by design.
Back to work.
Please visualise as you think best. Just to say, though, please visualise all parameters on a common x-axis (time).
But where is the chart? Sometimes ChatGPT gets confused and simply forgets to complete its work, or purports to be providing something that it simply hasn’t. In such a case, it needs a nudge.
Where is the chart?
I’d like to see this data annotated with storm status, so I ask ChatGPT to do exactly that. Again, I have to prompt it to continue by asking what the current status is.
Can you annotate the chart with the storm status, please?
As you can see, it encounters and fixes some problems. It then just seems to drift off without completing its work, so I bring its attention back with another nudge.
What’s the status?
It’s drifted off again! Another nudge is necessary.
What’s the status?
Note that it’s remembered the storm-classification information I provided at the start of the conversation. It’s also smart enough to guess that we’re only interested in major storms and above, though it asks whether I want the lesser statuses plotted as well.
Now that we have a feel for the index data, let’s turn attention to building a predictive model. Quite deliberately, I issue a vague prompt and allow ChatGPT to drive the process.
Let’s create the model to see if we can predict a major storm (and above) 12 hours in advance.
It offers a suggested approach, which seems quite reasonable. I prompt it to continue.
Ok let’s go.
It sets things up and runs an initial model, noting that we have our work cut out for us because of the nature of the data. I ask it for detail on the work it’s just done.
Can you summarise the method used that yielded these last results? Step by step, please.
As you can see, it’s done quite a lot of work here and evaluated the results — which weren’t great; a high false positive rate in particular. I decide to make life a little easier by shortening the prediction time window.
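The article doesn’t reproduce the code for this step, but the heart of any lead-time prediction setup is labelling each observation with whether a storm occurs a fixed number of hours later. Here’s a hypothetical pandas sketch of that labelling (my own illustration, not ChatGPT’s code; `df` is assumed to hold a `datetime` column and an `ap` column):

```python
import pandas as pd

def add_lead_label(df, lead_hours, threshold=50):
    """Add a binary 'storm_ahead' column: 1 if the ap index reaches the
    storm threshold `lead_hours` after each row's timestamp."""
    out = df.set_index("datetime").sort_index()
    # Shifting the index back by lead_hours puts the *future* ap value
    # at each row's own timestamp.
    future = out["ap"].shift(freq=pd.Timedelta(hours=-lead_hours))
    # Timestamps with no reading lead_hours ahead become NaN, i.e. label 0.
    out["storm_ahead"] = (future.reindex(out.index) >= threshold).astype(int)
    return out.reset_index()
```

A classifier trained on the features of each row against `storm_ahead` then attempts the "predict N hours in advance" task; shortening `lead_hours`, as done next, makes the target easier.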
Let’s try the whole thing again, but attempt to predict 3 hours ahead instead. Use the method you’ve just described.
It’s beginning to look like it’s just not possible to predict storms with the data available. And ChatGPT is coming to the same conclusion. Rather than just give up, let’s see if the data can tell us whether a storm is occurring in the moment, rather than predictively. If that doesn’t work then we have no chance of being predictive!
Can we try to see if the observations data can be used to determine whether a storm — and which kind of storm — is in progress at a given moment in time?
At first sight this looks promising, with 50% accuracy. But accuracy for anything other than “quiet” storm status is very poor. It’s a bit like forecasting lightning storms simply by saying “there will not be a lightning storm today”. Your forecast might often be accurate, but it would be useless because you’d never forecast a storm, only a non-storm. But ChatGPT has some ideas. Let’s see where it takes us.
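To make the lightning analogy concrete, here’s a toy calculation (my own illustration, not output from the conversation) showing how a model that never predicts a storm can still look accurate:

```python
# Suppose storms occur in 5% of periods, and the model always says "no storm".
y_true = [0] * 95 + [1] * 5   # 1 = storm period
y_pred = [0] * 100            # "there will not be a storm", always

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
storm_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)      # 0.95 — looks great
print(storm_recall)  # 0.0 — yet not a single storm is ever caught
```

This is why per-class metrics such as recall, rather than overall accuracy, matter for rare events like storms.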
Let’s go with your suggestions.
ChatGPT tries a Random Forest classifier and finds that it also cannot identify storms. It tries some other ideas, but bumps up against (and recognises) limitations in its runtime environment. It still has ideas, however. It suggests adjusting the decision threshold of the model.
Please try your suggestion.
As we see, it knows that it can’t apply the suggested approach to what I’ve previously asked it to do, so it asks me to make a choice — and I ask it to proceed with a binary choice of storm/no-storm.
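Threshold adjustment itself is a simple idea: rather than using the classifier’s default 0.5 cut-off on the predicted storm probability, lower the threshold so borderline cases are also flagged as storms, trading extra false positives for better storm recall. A minimal sketch with hypothetical probabilities (my own illustration; scikit-learn classifiers expose such probabilities via `predict_proba`):

```python
import numpy as np

# Hypothetical per-period storm probabilities from a binary classifier,
# e.g. clf.predict_proba(X)[:, 1] in scikit-learn.
proba = np.array([0.05, 0.10, 0.35, 0.45, 0.60, 0.90])

default_pred = (proba >= 0.5).astype(int)  # standard 0.5 cut-off
eager_pred = (proba >= 0.3).astype(int)    # lowered threshold flags more storms

print(default_pred.tolist())  # [0, 0, 0, 0, 1, 1]
print(eager_pred.tolist())    # [0, 0, 1, 1, 1, 1]
```

Lowering the threshold can only increase the number of periods flagged as storms, which is how the recall improvement seen next becomes possible.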
This is encouraging — 66% of storms have been correctly identified, which is considerably more impressive than what we’ve seen so far. Not bad, considering the small dataset.
Let’s start to wrap this up.
Can you summarise the results, draw conclusions and suggest next steps?
Let’s ask ChatGPT to discuss its performance.
If you were describing your performance in this task, what would you say? What did you do well, and what did you do that could be better — and what were your weaknesses? What are your limitations today that people should be aware of?
Not a bad assessment. As to the question that started our discussion — well, again, let’s give ChatGPT the last word:
As an AI language model developed by OpenAI, I’m designed to assist with a wide range of tasks, including data science. However, whether I am a “game changer” may depend on the specific context and how I’m utilized. Here are a few ways I can potentially transform the data science field:
That being said, there are also limitations and challenges. I rely on the quality and quantity of the data provided, and my responses are limited to what I’ve been trained on and can’t exceed the state of knowledge as of my last training cut-off in September 2021. I don’t have the ability to understand or incorporate personal experiences or tacit knowledge that many data scientists bring to their work. Furthermore, I don’t have the ability to independently verify the correctness of the data or the ethical implications of data use.
In conclusion, while I can be a powerful tool in the data science toolbox, I am not a replacement for human data scientists. The most effective use of AI in data science is likely to be in a collaborative role, where AI and human expertise complement each other.
For those interested in exploring more about the ChatGPT Code Interpreter and its applications in data science, there are several resources available. OpenAI's official documentation provides a comprehensive overview of the technology and its capabilities. For a more technical deep-dive into the workings of generative AI models, Andrej Karpathy's blog post on The Unreasonable Effectiveness of Recurrent Neural Networks is an excellent read.
For readers interested in the practical applications of AI in data science, the book 'Data Science for Business' by Foster Provost and Tom Fawcett offers insightful perspectives. Additionally, the Medium publication Towards Data Science regularly features articles on the latest developments in the field.
Remember, the field of AI and data science is rapidly evolving, and staying updated with the latest research and developments is key to understanding and leveraging these technologies effectively.