Using GPT4 for Data Science: Experiment 2
David Carnahan MD MSCE
Healthcare IT expert leading Veteran and Military Health initiatives at AWS
NOTE: The opinions expressed in this blog are my own and do not represent those of my employer, Amazon / AWS.
Now, let's start having fun. Let's take the text file we created in the last blog using Amazon Textract and run it through Amazon Comprehend, Amazon's Natural Language Processing (NLP) service. Comprehend has a lot of features, but we will focus on just two: Entity Recognition and Key Phrase Extraction. You can find all the other features here if you like.
Using Amazon Comprehend for NLP
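For reference, here is a minimal sketch of what that first pass looks like ... the bucket name and key are placeholders matching the Appendix code, and this is the un-chunked version that runs into the size limit you'll see shortly.

import boto3

# Placeholder S3 location -- swap in your own bucket and key
bucket_name = 'xyz'
input_key = 'result/extracted_text.txt'

s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# Pull the Textract output down from S3
response = s3.get_object(Bucket=bucket_name, Key=input_key)
text = response['Body'].read().decode('utf-8')

# Run entity recognition and key phrase extraction on the full text
entities = comprehend.detect_entities(Text=text, LanguageCode='en')['Entities']
key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')['KeyPhrases']

print(entities[:5])
print(key_phrases[:5])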
Wow, that was fast. Literally took < 1 min. I guess Data Science is easy now with GPT4. [maniacal laughter in the background].
The reality is that this is really just the start of any data science project ... getting data, transforming it, processing it, storing the output, presenting the output, and much, much more are what make data science projects complex. However, I do believe GPT4 can help you learn how to do these things and accelerate the time to completion of your projects. Which segues nicely into a common issue in this line of work ... the dreaded error.
So I asked GPT4 what happened. If you look at the error, you can probably guess it yourself.
The nice thing is it made it very easy for me to understand this error.
Tip 1: Use GPT4 to learn from your mistakes. You'll save time. I used to go to Stack Overflow and scan many, many entries ... trying to find the one that matched my issue ... and then discover the accepted answer no longer worked because the question was asked YEARS ago and the tech has changed since then. Bottom line: GPT4 will get you to a solution faster.
So, in this case ... I forgot to change my fake bucket name to the one I am using in my environment. Easy fix. Now, let's run it again.
Wah, wah, wahhhh. Another error. This time it is a TextSizeLimitExceededException ... Comprehend will only accept so much text per request. Very straightforward to understand, so I went to GPT4 to refactor the code, which it did easily.
Tip 2: if you look above, you'll see that when I began, I gave the main code it first produced a label ... the "comprehend output" code. That way I could refer to it by name and make my future modification requests clear to GPT4.
So I took the refactored code GPT4 produced and inserted it into the original script, as you see below.
Step 1: insert it where I think it goes ...
Step 2: delete the old parts that were replaced.
I'm guessing you are probably way ahead of me on Tip 3, but let me play this out before we identify a key tip that makes things much easier.
I ran the code and got another error ... this one I won't post here, but the chunk size it used was 1 byte too large. So I asked it to adjust the chunk size, and then it worked.
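Here is a minimal sketch of that chunking refactor ... essentially the same helper that appears in the final Appendix code. The variables text_file_content and comprehend come from the earlier snippet, and the exact chunk size may need the kind of nudge described above.

# Split the text into chunks under Comprehend's per-request size limit,
# run each chunk through the given processor, and collect the results
def process_text_chunks(text, chunk_size, processor):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [processor(chunk) for chunk in chunks]

chunk_size = 5000  # adjust downward if you still hit the size limit

entities_response = process_text_chunks(
    text_file_content, chunk_size,
    lambda chunk: comprehend.detect_entities(Text=chunk, LanguageCode='en')['Entities'])
key_phrases_response = process_text_chunks(
    text_file_content, chunk_size,
    lambda chunk: comprehend.detect_key_phrases(Text=chunk, LanguageCode='en')['KeyPhrases'])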
Using Amazon SageMaker to process output from Amazon Comprehend
Next, I took the code it produced back at the top of the blog to get counts of the Entity Recognition and Key Phrase Extraction output.
!@#&^*^%$.
Another error. But that is life in the fast lane. And to be perfectly honest, it took me longer to create this blog than to actually do the work. So, it is not nearly as frustrating as I make it out to be ... especially when you finally get across the finish line.
Now, this is where things get interesting. Check out this conversation I had with GPT4 and how it eventually made it very easy for me.
Tip 3: when GPT4 starts making changes to code and tells you to insert a certain part into the previous code ... it is much easier to ask it to rewrite the whole code block with the new changes inserted. That way, you don't make any mistakes when patching the old code by hand. Unless, of course, you like doing things the hard way ... and there is something to be said for doing so.
Dataframes and Finishing Touches
I ran the code, and then just sat there for a while, not realizing it was done. Haha. If the code doesn't tell you it's done (meaning you ask GPT4 to include a print statement announcing that it ran successfully ... this is probably Tip 4), then you need to watch the Jupyter notebook cell for that little number to appear in the brackets. That means it's done.
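Tip 4 in practice can be as simple as one extra line at the end of the processing cell (the message text here is just an illustration).

# Last line of the cell, so the notebook announces completion
print("Done: Comprehend output processed and dataframes created.")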
Then I just printed out the dataframe and counter objects created by the code above, and here is what you'll see.
One last thing ... I wanted it to give me a one-liner to sort the counter so I could see the most relevant entities and key phrases in that 312-word PDF I processed. Here is what it gave me.
I then took this snippet and simply swapped the 'counter_output' placeholder for one of my own objects ... like 'key_phrases_count' ... to get the descending sort of key phrases I was looking for.
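Concretely, the swap looks like this ... key_phrases_count is the Counter built in the Appendix code, and the same substitution works for entities_count.

# Sort the key phrase Counter in descending order of frequency
sorted_key_phrases = sorted(key_phrases_count.items(), key=lambda x: x[1], reverse=True)
for phrase, count in sorted_key_phrases:
    print(phrase, count)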
So, you can see the VA is very focused on the PACT Act, Operations and Maintenance, Modernization, etc., etc.
Conclusion
If you made it this far, congratulations. I hope this was helpful, and I will post the final code I used in the Appendix below.
Feel free to reach out to me if you have a mini-project you would like me to blog on, or specific data science questions or problems you are trying to solve. I would love to help.
Appendix
Textract file to S3 to Amazon Comprehend
import boto3
import json
import pandas as pd
from collections import Counter

# S3 locations -- the bucket name and output keys below are placeholders; swap in your own
bucket_name = 'xyz'
input_key = 'result/extracted_text.txt'
entities_key = 'result/entities_output.json'
key_phrases_key = 'result/key_phrases_output.json'

# Initialize the Boto3 clients for S3 and Comprehend
s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# Download the text file from the S3 bucket
response = s3.get_object(Bucket=bucket_name, Key=input_key)
text_file_content = response['Body'].read().decode('utf-8')

# Define the text chunk processing function
def process_text_chunks(text, chunk_size, processor):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    results = []
    for chunk in chunks:
        results.append(processor(chunk))
    return results

chunk_size = 5000

# Perform entity recognition and key phrase extraction on text chunks
entities_response = process_text_chunks(text_file_content, chunk_size, lambda chunk: comprehend.detect_entities(Text=chunk, LanguageCode='en')['Entities'])
key_phrases_response = process_text_chunks(text_file_content, chunk_size, lambda chunk: comprehend.detect_key_phrases(Text=chunk, LanguageCode='en')['KeyPhrases'])

# Create a dataframe for entity recognition results
entities = [entity['Type'] for entity_list in entities_response for entity in entity_list]
entities_count = Counter(entities)
entities_df = pd.DataFrame(list(entities_count.items()), columns=['Entity', 'Count'])

# Create a dataframe for key phrase extraction results
key_phrases = [key_phrase['Text'] for key_phrase_list in key_phrases_response for key_phrase in key_phrase_list]
key_phrases_count = Counter(key_phrases)
key_phrases_df = pd.DataFrame(list(key_phrases_count.items()), columns=['KeyPhrase', 'Count'])

# Save the raw Comprehend output back to S3
s3.put_object(Body=json.dumps(entities_response), Bucket=bucket_name, Key=entities_key)
s3.put_object(Body=json.dumps(key_phrases_response), Bucket=bucket_name, Key=key_phrases_key)

print(f"Entity recognition saved to: s3://{bucket_name}/{entities_key}")
print(f"Key phrase extraction saved to: s3://{bucket_name}/{key_phrases_key}")
Descending order code snippet
sorted_output = sorted(counter_output.items(), key=lambda x: x[1], reverse=True)
for item in sorted_output:
    print(item)