Using GPT-4 for Data Science: Experiment 1
David Carnahan MD MSCE
Healthcare IT expert leading Veteran and Military Health initiatives at AWS
NOTE: The opinions expressed in this blog are my own and do not represent those of my employer, Amazon/AWS.
I have been using GPT-4 a lot for my novel writing. It has been a tremendous help with research, description, metaphor generation, story analysis, and more. Recently, though, I started exploring how to use it with Amazon SageMaker as I dive deeper into that tool. I figured I'd share a few 'static demos' for those who might be interested.
Let's say you want to process a large PDF file so you can do some Natural Language Processing (NLP) on the information stuck in the document. For this experiment, I'm going to pull in a 312-page document the VA put out to let industry partners know where it intends to focus its efforts. I'll use Amazon SageMaker Studio and Amazon Textract to do the extraction, and RStudio on the desktop to generate a word cloud.
How could we use GPT-4 for this project?
As you can see, it did a great job generating code, and it took only seconds to complete. You then take the generated code (complete script in the Appendix below) and swap the fake bucket and file names for your real ones. I wouldn't put anything sensitive into GPT-4, since it is a PUBLIC model ... but using fake names as substitutes lets you use this awesome capability without compromising your company's IP or sensitive information.
Extracting Text from PDF Using Amazon SageMaker Studio & Amazon Textract
It ran successfully, and I didn't change the code at all. It doesn't always succeed on the first try (as you will soon see) ... if you get an error, you can paste that error into GPT-4 and ask it to rewrite the code. It usually apologizes (such a polite collaborator) and then fixes the error.
It turns out this first go-round only processed the first page of the 312-page document. But have no fear: you can simply tell the model what happened, reaffirm what you want, and it will adjust your code. Here is what I got in round two.
Round Two with Pagination
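The round-two fix boils down to following Textract's NextToken through every page of results instead of stopping after the first response. A minimal sketch of that pagination loop, with a hypothetical `fetch_page` callable standing in for `textract.get_document_text_detection(JobId=..., NextToken=...)` so the logic is easy to see in isolation:

```python
def collect_text(fetch_page):
    """Collect the text of every LINE block across all Textract result pages.

    `fetch_page` is a hypothetical stand-in for
    textract.get_document_text_detection: it takes the current
    next_token (None for the first call) and returns a dict with
    'Blocks' and, while more pages remain, a 'NextToken' key.
    """
    lines = []
    next_token = None
    while True:
        response = fetch_page(next_token)
        for block in response.get('Blocks', []):
            if block.get('BlockType') == 'LINE':
                lines.append(block['Text'])
        # Textract omits NextToken on the final page of results
        next_token = response.get('NextToken')
        if not next_token:
            break
    return '\n'.join(lines)
```

The full script in the Appendix applies this same loop with the real Textract client.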
Wordcloud in RStudio
Now, let's write the code for R on the desktop. You can use RStudio as the IDE of choice in SageMaker Studio if you have an enterprise license, but I'm using RStudio on the desktop because many data scientists and analysts out there use the desktop version.
Next we pull the text file into RStudio, process it with the usual NLP steps (corpus, term-document matrix, word frequencies), and generate a word cloud.
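For readers who want to sanity-check the counts the R script produces, the same preprocessing pipeline (lowercase, strip punctuation and numbers, drop stopwords, count frequencies) can be sketched in plain Python. The stopword set here is a tiny hand-rolled stand-in, not the full English list the `tm` package uses:

```python
import re
from collections import Counter

# Tiny illustrative stopword set; tm's stopwords("english") is much larger
STOPWORDS = {'the', 'and', 'of', 'to', 'a', 'in', 'for', 'on', 'is'}

def word_freqs(text, min_freq=2):
    """Lowercase, keep alphabetic tokens only, drop stopwords, count words."""
    words = re.findall(r'[a-z]+', text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w: n for w, n in counts.items() if n >= min_freq}
```

For example, `word_freqs("Veterans care and veterans health, care for all")` keeps only the words that appear at least twice after stopword removal.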
Conclusion
I believe GPT-4 is a powerful LLM that lets you write code for data science use cases. It was easy to get this code, correct the errors, and get the result I was looking for. There are many more useful things you can do than create a word cloud, but I didn't want this first experiment to run too long. I'll be venturing into those other areas in future blogs.
I would love to hear how you are using GPT-4 in your data science, or even novel-writing, work.
Note: Amazon CodeWhisperer is great for DevSecOps in common IDEs, but it isn't really embedded in Amazon SageMaker at this time, and it probably can't write every language the way GPT-4 can. Stay tuned, though: it wouldn't surprise me if AWS added CodeWhisperer to Amazon SageMaker in the future.
Appendix
Python code to extract text from a PDF using Amazon Textract
import boto3
import time
# Set the S3 bucket name and document file name
bucket_name = 'xyz'
document_name = 'your_312_page_document.pdf'
result_prefix = 'result/'
# Initialize the S3 and Textract clients
s3 = boto3.client('s3')
textract = boto3.client('textract')
# Start a Textract job to analyze the document
response = textract.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    }
)
# Get the job ID from the response
job_id = response['JobId']
# Wait for the Textract job to complete
job_status = 'IN_PROGRESS'
while job_status == 'IN_PROGRESS':
    response = textract.get_document_text_detection(JobId=job_id)
    job_status = response['JobStatus']
    print('Job status:', job_status)
    if job_status == 'IN_PROGRESS':
        time.sleep(5)
# Extract the text from the Textract response
extracted_text = ''
next_token = None
while True:
    if next_token:
        response = textract.get_document_text_detection(JobId=job_id, NextToken=next_token)
    else:
        response = textract.get_document_text_detection(JobId=job_id)
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            extracted_text += block['Text'] + '\n'
    next_token = response.get('NextToken', None)
    if not next_token:
        break
# Save the extracted text to a file in the S3 bucket with the 'result/' prefix
result_key = f'{result_prefix}extracted_text.txt'
s3.put_object(Body=extracted_text, Bucket=bucket_name, Key=result_key)
print(f"Extracted text saved to: s3://{bucket_name}/{result_key}")
R code to create a wordcloud from the text file created above, using RStudio on the desktop
First,
install.packages(c("tm", "wordcloud", "RColorBrewer"))
Then,
# Load required libraries
library(tm)
library(wordcloud)
library(RColorBrewer)
# Read the text file
file_path <- "path/to/your/text_file.txt"
text_data <- readLines(file_path)
# Create a Corpus object
corpus <- Corpus(VectorSource(text_data))
# Text preprocessing: convert to lowercase, remove punctuation, remove numbers, remove stopwords
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Create a TermDocumentMatrix
tdm <- TermDocumentMatrix(corpus)
# Convert TermDocumentMatrix to a matrix
matrix <- as.matrix(tdm)
# Calculate the frequency of words
word_freqs <- sort(rowSums(matrix), decreasing = TRUE)
# Create a data frame with words and their frequencies
word_freqs_df <- data.frame(word = names(word_freqs), freq = word_freqs)
# Generate a word cloud using earth tone palette
set.seed(42)
wordcloud(words = word_freqs_df$word, freq = word_freqs_df$freq,
          min.freq = 2, max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "YlOrBr"))