Using GPT4 for Data Science: Experiment 2

NOTE: The opinions expressed in this blog are my own and do not represent those of my employer, Amazon / AWS.


Now, let's start having fun. Let's take the text file we created in the last blog using Amazon Textract and run it through Amazon Comprehend, Amazon's Natural Language Processing (NLP) service. It has a lot of features, but we will focus on just two: 'Entity Recognition' and 'Key Phrase Extraction'. You can find all the other features here if you like.

Using Amazon Comprehend for NLP

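Since the screenshots don't carry over well, here is the gist of the code GPT4 generated, condensed into a sketch (the full, final version is in the Appendix; the bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# Pull the Textract output we saved in the last blog (placeholder names)
response = s3.get_object(Bucket='my-bucket', Key='result/extracted_text.txt')
text = response['Body'].read().decode('utf-8')

# Run entity recognition and key phrase extraction on the text
entities = comprehend.detect_entities(Text=text, LanguageCode='en')['Entities']
key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')['KeyPhrases']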

Wow, that was fast. Literally took < 1 min. I guess Data Science is easy now with GPT4. [maniacal laughter in the background].

The reality is that this is really just the start of any data science project ... getting data, transforming it, processing it, storing the output, presenting the output, and much, much more are what make data science projects complex. However, I do believe GPT4 can help you learn how to do these things and accelerate the time to completion of your projects. Which segues nicely into a common issue in this line of work ... the dreaded error.


So I asked GPT4 what happened. If you look at the error, you can probably guess the answer.


The nice thing is it made it very easy for me to understand this error.

Tip 1: Use GPT4 to learn from your mistakes. You'll save time. I used to go to Stack Overflow and scan many, many entries ... trying to find the one that matched my issue ... and then the answer wouldn't quite work because the question was asked YEARS ago and the tech had changed since then. Bottom line: GPT4 will get you to a solution faster.

So, in this case ... I forgot to change my fake bucket name to the one I am using in my environment. Easy fix. Now, let's run it again.


Wah, wah, wahhhh. Another error. This time it is a TextSizeLimitExceededException. Very straightforward to understand, so I asked GPT4 to refactor the code, which it did easily.

Tip 2: If you look above, you'll see that when I began, I gave the first code GPT4 produced a label ... the "comprehend output" code. This was so I could refer to it by name and make my future requests to modify the code clear to GPT4.

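In place of the screenshots, here is the chunking helper GPT4 came up with, lightly condensed (it appears in full in the Appendix):

# Split the text into fixed-size slices and run each one through
# the supplied Comprehend call, collecting the per-chunk results
def process_text_chunks(text, chunk_size, processor):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [processor(chunk) for chunk in chunks]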

So I took this snippet and then inserted it into the original code as you see below.

Step 1: Insert it where I think it goes ...


Step 2: Delete the old parts that were replaced.


I'm guessing you are probably way ahead of me on Tip 3, but let me play this out before we identify a key tip that makes things much easier.

I ran the code and got another error ... this one I won't post here, but the chunk size it used was 1 byte too large. So I asked it to adjust the chunk size, and then it worked.
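My best guess at the root cause, for what it's worth: the synchronous Comprehend APIs cap each request at 5,000 bytes, and slicing a Python string counts characters, so a chunk containing even one multi-byte UTF-8 character can land a byte or two over the limit. A defensive check along those lines (my own addition, not from the GPT4 session):

text = text_file_content          # the Textract output from earlier
chunk_size = 4999                 # one under the 5,000-byte request limit

chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
for chunk in chunks:
    # len(chunk) counts characters; the API counts UTF-8 bytes
    assert len(chunk.encode('utf-8')) <= 5000, 'chunk still too large in bytes'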

Using Amazon SageMaker to process output from Amazon Comprehend

Now I took the code it produced back at the top of the blog and used it to get counts of the Entity Recognition and Key Phrase Extraction output.


!@#&^*^%$.

Another error. But that is life in the fast lane. And to be perfectly honest, it took me longer to create this blog than to actually do the work. So it is not nearly as frustrating as I make it out to be ... especially when you finally get across the finish line.

Now, this is where things get interesting. Check out this conversation I had with GPT4 and how it eventually made it very easy for me.


Tip 3: when GPT4 starts making changes to code and tells you to insert a certain part into the previous code ... it is much easier to ask it to rewrite the whole thing with the new changes inserted. That way, you don't make any mistakes when correcting the old code. Unless, of course, you like doing things the hard way ... and there is something to be said for doing so.
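The rewrite GPT4 eventually handed me is what you'll find in the Appendix; the counting step itself boils down to this pattern (shown here with toy data so it runs standalone):

from collections import Counter
import pandas as pd

# entities_response: list of per-chunk entity lists, as returned by the
# chunked detect_entities calls (stubbed here with toy data)
entities_response = [[{'Type': 'ORGANIZATION'}, {'Type': 'DATE'}],
                     [{'Type': 'ORGANIZATION'}]]

entities = [entity['Type'] for chunk in entities_response for entity in chunk]
entities_count = Counter(entities)
entities_df = pd.DataFrame(list(entities_count.items()), columns=['Entity', 'Count'])
print(entities_df)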

Dataframes and Finishing Touches

I ran the code, and then just sat there for a while, not realizing it was done. Haha. If the code doesn't tell you it's done (meaning you ask GPT4 to include a print statement telling you it ran successfully ... this is probably Tip 4), then you need to look at the Jupyter notebook cell to see whether that little number in the brackets has appeared. That means it's done.
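In code terms, Tip 4 is a one-line habit:

# Last line of the long-running cell, so it announces itself when finished
print('Comprehend processing complete')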


Then I just printed out the dataframe and counter objects that I created with the above code, and here is what you'll see.


One last thing ... I wanted it to give me a one-liner to sort the counter so I could see the most relevant entities and key phrases in that 312-word PDF I processed. Here is what it gave me.


I then took this code ... swapped in the object names from my output, like 'key_phrases_count', in place of the 'counter_output' in the sample code ... and got the descending sort of the key phrases I was looking for.
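With that substitution made, the result looks like this (key_phrases_count comes from the Appendix code):

# 'counter_output' from the sample, swapped for my key_phrases_count object
sorted_key_phrases = sorted(key_phrases_count.items(), key=lambda x: x[1], reverse=True)

for item in sorted_key_phrases:
    print(item)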


So, you can see the VA is very focused on the PACT Act, Operations and Maintenance, Modernization, etc., etc.

Conclusion

If you made it this far, congratulations. I hope this was helpful, and I will post the final code I used in the Appendix below.

Feel free to reach out to me if you have a mini-project you would like me to blog on, or specific data science questions or problems you are trying to solve. I would love to help.

Appendix

Textract file to S3 to Amazon Comprehend

import json

import boto3
import pandas as pd
from collections import Counter

# Bucket and input key are placeholders -- swap in your own
bucket_name = 'xyz'
input_key = 'result/extracted_text.txt'

# Output locations (my assumption; the originals were not shown in the post)
entities_key = 'result/entities.json'
key_phrases_key = 'result/key_phrases.json'

# Initialize the Boto3 clients for S3 and Comprehend
s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# Download the text file from the S3 bucket
response = s3.get_object(Bucket=bucket_name, Key=input_key)
text_file_content = response['Body'].read().decode('utf-8')

# Define the text chunk processing function
def process_text_chunks(text, chunk_size, processor):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    results = []

    for chunk in chunks:
        response = processor(chunk)
        results.append(response)

    return results

# Comprehend's synchronous APIs limit the request size, hence the chunking
chunk_size = 5000

# Perform entity recognition and key phrase extraction on text chunks
entities_response = process_text_chunks(text_file_content, chunk_size, lambda chunk: comprehend.detect_entities(Text=chunk, LanguageCode='en')['Entities'])
key_phrases_response = process_text_chunks(text_file_content, chunk_size, lambda chunk: comprehend.detect_key_phrases(Text=chunk, LanguageCode='en')['KeyPhrases'])

# Create a dataframe for entity recognition results
entities = [entity['Type'] for entity_list in entities_response for entity in entity_list]
entities_count = Counter(entities)
entities_df = pd.DataFrame(list(entities_count.items()), columns=['Entity', 'Count'])

# Create a dataframe for key phrase extraction results
key_phrases = [key_phrase['Text'] for key_phrase_list in key_phrases_response for key_phrase in key_phrase_list]
key_phrases_count = Counter(key_phrases)
key_phrases_df = pd.DataFrame(list(key_phrases_count.items()), columns=['KeyPhrase', 'Count'])

# Save the raw responses to S3
s3.put_object(Body=json.dumps(entities_response), Bucket=bucket_name, Key=entities_key)
s3.put_object(Body=json.dumps(key_phrases_response), Bucket=bucket_name, Key=key_phrases_key)

print(f"Entity recognition saved to: s3://{bucket_name}/{entities_key}")
print(f"Key phrase extraction saved to: s3://{bucket_name}/{key_phrases_key}")

Descending order code snippet

sorted_output = sorted(counter_output.items(), key=lambda x: x[1], reverse=True)

for item in sorted_output:
    print(item)
