Using GPT4 for Data Science: Experiment 1

NOTE: The opinions expressed in this blog are my own and do not represent those of my employer, Amazon / AWS.


I have been using GPT4 a lot for my novel writing. It has been a tremendous help with research, description, metaphor generation, story analysis, etc. But recently, I started exploring how to use it with Amazon SageMaker as I dive deeper into this valuable tool. I figured I'd share a few 'static demos' for those who might be interested.

Let's say you want to process a large PDF file so you can do some Natural Language Processing (NLP) on the information stuck in the doc. For this experiment, I'm going to pull in a 312-page document the VA put out to show industry partners where it intends to focus its efforts. I'll use Amazon SageMaker Studio and Amazon Textract to do the extraction, and RStudio on the desktop to generate a wordcloud.
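(One setup step the screenshots below skip: the PDF has to be in S3 before Textract can read it. Here's a minimal upload sketch, where the bucket and file names are placeholders you'd swap for your own.)

import boto3

# Placeholder names: swap in your own bucket and local file
bucket_name = 'xyz'
local_path = 'va_industry_partner_doc.pdf'
document_name = 'your_312_page_document.pdf'

# Upload the PDF so Textract can reference it as an S3 object
s3 = boto3.client('s3')
s3.upload_file(local_path, bucket_name, document_name)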

How could we use GPT4 for this project?

[Screenshots: the GPT4 prompt and the Python code it generated]

As you can see, it did a great job generating the code, and it took only seconds. Then you take the code it wrote (complete script below in the Appendix) and swap out the fake bucket and file names. I wouldn't put anything sensitive into GPT4 since it is a PUBLIC model ... but using fake names as substitutes lets you use this awesome capability without compromising your company's IP or sensitive information.
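One way to do that swap without ever hardcoding the real names is to read them from the environment. This is my own suggestion, sketched with hypothetical variable names, not part of the GPT4 output:

import os

# Hypothetical environment variable names; the script you paste into
# GPT4 only ever contains the fake defaults
bucket_name = os.environ.get('TEXTRACT_BUCKET', 'xyz')
document_name = os.environ.get('TEXTRACT_DOCUMENT', 'your_312_page_document.pdf')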

Extracting Text from PDF Using Amazon SageMaker Studio & Amazon Textract

[Screenshots: the generated code running in Amazon SageMaker Studio]

It ran successfully, and I didn't change the code at all. It doesn't always succeed on the first try (as you will soon see) ... if you get an error, you can paste that error into GPT4 and ask it to rewrite the code. It usually apologizes (such a polite collaborator) and then fixes the problem.

It turns out this first go-round only processed the first page of the 312-page document. But have no fear: you can simply tell the model what happened, reaffirm what you want, and it will adjust your code. Here is what I got in round two.

Round Two with Pagination

[Screenshots: GPT4's revised code with pagination]
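The key change in round two is the pagination loop: Textract returns results in chunks, so the code has to keep calling get_document_text_detection with the NextToken value from each response until none remains. Condensed from the full script in the Appendix:

# 'textract' and 'job_id' come from the earlier steps of the
# appendix script; follow NextToken until no result chunks remain
extracted_text = ''
next_token = None
while True:
    kwargs = {'JobId': job_id}
    if next_token:
        kwargs['NextToken'] = next_token
    response = textract.get_document_text_detection(**kwargs)
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            extracted_text += block['Text'] + '\n'
    next_token = response.get('NextToken')
    if not next_token:
        break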

Wordcloud in R Studio

Now, let's write the R code for the desktop. You can use RStudio as your IDE of choice inside SageMaker Studio if you have an enterprise license, but I'm using RStudio on the desktop because a lot of data scientists and analysts out there still use the desktop version.

[Screenshots: GPT4 generating the R wordcloud code]

Now to pull the text file into RStudio and process it with the standard NLP steps: build a corpus, create a term-document matrix, compute word frequencies, and generate a wordcloud.
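Since the extracted text is sitting in S3 and R is running on the desktop, the file first has to come down locally. A minimal boto3 sketch (the AWS CLI would do the same job), assuming the bucket and result key from the appendix script:

import boto3

# Download the Textract output from S3 to the local machine
s3 = boto3.client('s3')
s3.download_file('xyz', 'result/extracted_text.txt', 'extracted_text.txt')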

[Screenshots: the R session and the resulting wordcloud]
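As an aside, if you'd rather stay in Python inside SageMaker Studio, the open source wordcloud package gives you a rough equivalent. This sketch is my own substitution, not part of the GPT4 output:

from wordcloud import WordCloud, STOPWORDS

# Read the extracted text and render a wordcloud image
with open('extracted_text.txt') as f:
    text = f.read()

wc = WordCloud(width=800, height=400, max_words=200,
               stopwords=STOPWORDS, background_color='white')
wc.generate(text)
wc.to_file('wordcloud.png')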

Conclusion

I believe GPT4 is a powerful LLM for writing data science code. It was pretty easy to get this code, correct the errors, and get the result I was looking for. There are a lot more useful things you can do than create a wordcloud, but I didn't want this first experiment to run too long. I'll venture into those other areas in future blogs.

I would love to hear how you are using GPT4 in your data science work, or even your novel writing.

Note: Amazon CodeWhisperer is great for DevSecOps work in common IDEs, but it isn't really embedded into Amazon SageMaker at this time, and it probably won't match GPT4's breadth of language coverage for a while. Stay tuned, though: it wouldn't surprise me if AWS added CodeWhisperer to Amazon SageMaker in the future.

Appendix

Python code to extract text from a PDF using Amazon Textract

import boto3
import time

# Set the S3 bucket name and document file name
bucket_name = 'xyz'
document_name = 'your_312_page_document.pdf'
result_prefix = 'result/'

# Initialize the S3 and Textract clients
s3 = boto3.client('s3')
textract = boto3.client('textract')

# Start a Textract job to analyze the document
response = textract.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    }
)

# Get the job ID from the response
job_id = response['JobId']

# Wait for the Textract job to complete
job_status = 'IN_PROGRESS'
while job_status == 'IN_PROGRESS':
    response = textract.get_document_text_detection(JobId=job_id)
    job_status = response['JobStatus']
    print('Job status:', job_status)
    if job_status == 'IN_PROGRESS':
        time.sleep(5)

# Extract the text from the Textract response, following NextToken
# through every chunk of results
extracted_text = ''
next_token = None

while True:
    if next_token:
        response = textract.get_document_text_detection(JobId=job_id, NextToken=next_token)
    else:
        response = textract.get_document_text_detection(JobId=job_id)

    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            extracted_text += block['Text'] + '\n'

    next_token = response.get('NextToken', None)
    if not next_token:
        break

# Save the extracted text to a file in the S3 bucket with the 'result/' prefix
result_key = f'{result_prefix}extracted_text.txt'
s3.put_object(Body=extracted_text, Bucket=bucket_name, Key=result_key)

print(f"Extracted text saved to: s3://{bucket_name}/{result_key}")

R code to create a wordcloud from the text file created above, using RStudio on the desktop

First,

install.packages(c("tm", "wordcloud", "RColorBrewer"))        

Then,


# Load required libraries
library(tm)
library(wordcloud)
library(RColorBrewer)

# Read the text file
file_path <- "path/to/your/text_file.txt"
text_data <- readLines(file_path)

# Create a Corpus object
corpus <- Corpus(VectorSource(text_data))

# Text preprocessing: convert to lowercase, remove punctuation, remove numbers, remove stopwords
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Create a TermDocumentMatrix
tdm <- TermDocumentMatrix(corpus)

# Convert the TermDocumentMatrix to a matrix
matrix <- as.matrix(tdm)

# Calculate the frequency of words
word_freqs <- sort(rowSums(matrix), decreasing = TRUE)

# Create a data frame with words and their frequencies
word_freqs_df <- data.frame(word = names(word_freqs), freq = word_freqs)

# Generate a word cloud using an earth tone palette
set.seed(42)
wordcloud(words = word_freqs_df$word, freq = word_freqs_df$freq,
          min.freq = 2, max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "YlOrBr"))
Comments

Daniel Austin, SPHR, PMP, ICP

Business Data Analyst | HRIS | Python | SQL | Tableau | Excel


Wow David, thank you so much for this very insightful read. As I am currently transitioning into the world of data, GPT has been a tool that I cannot wait to explore. It seems that the possibilities are endless! Best of luck with your writing, and I look forward to any further "gems" that you decide to drop in the future!

John Bonfardeci II, M.Sc.

Data Scientist, Data Engineer, Principal Software Engineer IV, Solutions Architect, Board of Directors Member


It also generates pretty good training data to fine-tune your own LLM models with custom NER labels. I used it to generate sentences for a CoNLL data set, so my NER model could recognize specific terms in context and label them accordingly.

Gabriel Juarez

Lead Architect, Legacy Data Consolidation Solution (LDCS) at Vista Defense Technologies, LLC


Thank you, David, for making these topics more accessible!
