Using GPT-4 for Data Science: Experiment 1
David Carnahan MD MSCE
Healthcare IT expert leading Veteran and Military Health initiatives at AWS
NOTE: The opinions expressed in this blog are my own and do not represent those of my employer, Amazon/AWS.
I have been using GPT-4 a lot for my novel writing. It has been a tremendous help with research, description, metaphor generation, story analysis, and more. Recently, though, I started exploring how to use it with Amazon SageMaker as I dive deeper into that tool. I figured I'd share a few 'static demos' for those who might be interested.
Let's say you want to process a large PDF file so you can do some Natural Language Processing (NLP) on the information stuck in the document. For this experiment, I'm going to pull in a 312-page document the VA put out to let industry partners know where it intends to focus its efforts. I'll use Amazon SageMaker Studio and Amazon Textract to do the extraction, and RStudio on the desktop to generate a word cloud.
How could we use GPT-4 for this project?
As you can see, it did a great job generating code, and it took only seconds to complete. You then take the generated code (complete script in the Appendix below) and swap the fake bucket and file names for your real ones. I wouldn't put anything sensitive into GPT-4, since it is a PUBLIC model ... but using fake names as substitutes lets you use this awesome capability without compromising your company's IP or sensitive information.
Extracting Text from PDF Using Amazon SageMaker Studio & Amazon Textract
It ran successfully, and I didn't change the code at all. It doesn't always succeed on the first try (as you will soon see) ... if you get an error, you can paste that error into GPT-4 and ask it to rewrite the code. It usually apologizes (such a polite collaborator) and then fixes the error.
It turns out this first go-round only processed the first page of the 312-page document. But have no fear: you can simply tell the model what happened, reaffirm what you want, and it will adjust your code. Here is what I got in round two.
Round Two with Pagination
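The round-two fix boils down to following Textract's NextToken through every page of results instead of stopping after the first response. A minimal sketch of that pagination loop, with a hypothetical `fetch_page` callable standing in for `textract.get_document_text_detection(JobId=..., NextToken=...)` so the logic is easy to see in isolation:

```python
def collect_text(fetch_page):
    """Collect the text of every LINE block across all Textract result pages.

    `fetch_page` is a hypothetical stand-in for
    textract.get_document_text_detection: it takes the current
    next_token (None for the first call) and returns a dict with
    'Blocks' and, while more pages remain, a 'NextToken' key.
    """
    lines = []
    next_token = None
    while True:
        response = fetch_page(next_token)
        for block in response.get('Blocks', []):
            if block.get('BlockType') == 'LINE':
                lines.append(block['Text'])
        # Textract omits NextToken on the final page of results
        next_token = response.get('NextToken')
        if not next_token:
            break
    return '\n'.join(lines)
```

The full script in the Appendix applies this same loop with the real Textract client.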
Wordcloud in RStudio
Now, let's write the code for R on the desktop. You can use RStudio as the IDE of choice in SageMaker Studio if you have an enterprise license, but I'm using RStudio on the desktop because many data scientists and analysts out there use the desktop version.
Next we pull the text file into RStudio, process it with the usual NLP steps (corpus, term-document matrix, word frequencies), and generate a word cloud.
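For readers who want to sanity-check the counts the R script produces, the same preprocessing pipeline (lowercase, strip punctuation and numbers, drop stopwords, count frequencies) can be sketched in plain Python. The stopword set here is a tiny hand-rolled stand-in, not the full English list the `tm` package uses:

```python
import re
from collections import Counter

# Tiny illustrative stopword set; tm's stopwords("english") is much larger
STOPWORDS = {'the', 'and', 'of', 'to', 'a', 'in', 'for', 'on', 'is'}

def word_freqs(text, min_freq=2):
    """Lowercase, keep alphabetic tokens only, drop stopwords, count words."""
    words = re.findall(r'[a-z]+', text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w: n for w, n in counts.items() if n >= min_freq}
```

For example, `word_freqs("Veterans care and veterans health, care for all")` keeps only the words that appear at least twice after stopword removal.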
Conclusion
I believe GPT-4 is a powerful LLM that lets you write code for data science use cases. It was easy to get this code, correct the errors, and get the result I was looking for. There are many more useful things you can do than create a word cloud, but I didn't want this first experiment to run too long. I'll be venturing into those other areas in future blogs.
I would love to hear how you are using GPT-4 in your data science, or even novel-writing, work.
Note: Amazon CodeWhisperer is great for DevSecOps in common IDEs, but it isn't really embedded in Amazon SageMaker at this time, and it probably can't write every language the way GPT-4 can. Stay tuned, though: it wouldn't surprise me if AWS added CodeWhisperer to Amazon SageMaker in the future.
Appendix
Python code to extract text from a PDF using Amazon Textract
import boto3
import time
# Set the S3 bucket name and document file name
bucket_name = 'xyz'
document_name = 'your_312_page_document.pdf'
result_prefix = 'result/'
# Initialize the S3 and Textract clients
s3 = boto3.client('s3')
textract = boto3.client('textract')
# Start a Textract job to analyze the document
response = textract.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    }
)
# Get the job ID from the response
job_id = response['JobId']
# Wait for the Textract job to complete
job_status = 'IN_PROGRESS'
while job_status == 'IN_PROGRESS':
    response = textract.get_document_text_detection(JobId=job_id)
    job_status = response['JobStatus']
    print('Job status:', job_status)
    if job_status == 'IN_PROGRESS':
        time.sleep(5)
# Extract the text from the Textract response
extracted_text = ''
next_token = None
while True:
    if next_token:
        response = textract.get_document_text_detection(JobId=job_id, NextToken=next_token)
    else:
        response = textract.get_document_text_detection(JobId=job_id)
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            extracted_text += block['Text'] + '\n'
    next_token = response.get('NextToken', None)
    if not next_token:
        break
# Save the extracted text to a file in the S3 bucket with the 'result/' prefix
result_key = f'{result_prefix}extracted_text.txt'
s3.put_object(Body=extracted_text, Bucket=bucket_name, Key=result_key)
print(f"Extracted text saved to: s3://{bucket_name}/{result_key}")
R code to create a wordcloud from the text file created above, using RStudio on the desktop
First,
install.packages(c("tm", "wordcloud", "RColorBrewer"))
Then,
# Load required libraries
library(tm)
library(wordcloud)
library(RColorBrewer)
# Read the text file
file_path <- "path/to/your/text_file.txt"
text_data <- readLines(file_path)
# Create a Corpus object
corpus <- Corpus(VectorSource(text_data))
# Text preprocessing: convert to lowercase, remove punctuation, remove numbers, remove stopwords
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Create a TermDocumentMatrix
tdm <- TermDocumentMatrix(corpus)
# Convert TermDocumentMatrix to a matrix
matrix <- as.matrix(tdm)
# Calculate the frequency of words
word_freqs <- sort(rowSums(matrix), decreasing = TRUE)
# Create a data frame with words and their frequencies
word_freqs_df <- data.frame(word = names(word_freqs), freq = word_freqs)
# Generate a word cloud using earth tone palette
set.seed(42)
wordcloud(words = word_freqs_df$word, freq = word_freqs_df$freq,
          min.freq = 2, max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "YlOrBr"))