Using GPT4 for Data Science: Experiment 2
David Carnahan MD MSCE
Healthcare IT expert leading Veteran and Military Health initiatives at AWS
NOTE: The opinions expressed in this blog are my own and do not represent those of my employer, Amazon / AWS.
Now, let's start having fun. Let's take the text file we created in the last blog using Amazon Textract and run it through Amazon Comprehend, Amazon's Natural Language Processing (NLP) service. Comprehend has a lot of features, but we will focus on just two: Entity Recognition and Key Phrase Extraction. You can find all the other features here if you like.
Using Amazon Comprehend for NLP
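For reference, here is a minimal sketch of what that first pass looks like ... the bucket name and key are placeholders matching the Appendix code, and this is the un-chunked version that runs into the size limit you'll see shortly.

import boto3

# Placeholder S3 location -- swap in your own bucket and key
bucket_name = 'xyz'
input_key = 'result/extracted_text.txt'

s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# Pull the Textract output down from S3
response = s3.get_object(Bucket=bucket_name, Key=input_key)
text = response['Body'].read().decode('utf-8')

# Run entity recognition and key phrase extraction on the full text
entities = comprehend.detect_entities(Text=text, LanguageCode='en')['Entities']
key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')['KeyPhrases']

print(entities[:5])
print(key_phrases[:5])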
Wow, that was fast. Literally took < 1 min. I guess Data Science is easy now with GPT4. [maniacal laughter in the background].
The reality is that this is really just the start of any data science project ... getting data, transforming it, processing it, storing the output, presenting the output, and much, much more are what make data science projects complex. However, I do believe GPT4 can help you learn how to do these things and accelerate the time to completion of your projects. Which segues nicely into a common issue in this line of work ... the dreaded error.
So I asked GPT4 what happened. If you look at the error, you can probably guess it yourself.
The nice thing is it made it very easy for me to understand this error.
Tip 1: Use GPT4 to learn from your mistakes. You'll save time. I used to go to Stack Overflow and scan many, many entries ... trying to find the one that matched my issue ... and then discover the accepted answer no longer worked because the question was asked YEARS ago and the tech has changed since then. Bottom line: GPT4 will get you to a solution faster.
So, in this case ... I forgot to change my fake bucket name to the one I am using in my environment. Easy fix. Now, let's run it again.
Wah, wah, wahhhh. Another error. This time it is a TextSizeLimitExceededException ... Comprehend will only accept so much text per request. Very straightforward to understand, so I went to GPT4 to refactor the code, which it did easily.
Tip 2: if you look above, you'll see that when I began, I gave the main code it first produced a label ... the "comprehend output" code. That way I could refer to it by name and make my future modification requests clear to GPT4.
So I took the refactored code GPT4 produced and inserted it into the original script, as you see below.
Step 1: insert it where I think it goes ...
Step 2: delete the old parts that were replaced.
I'm guessing you are probably way ahead of me on Tip 3, but let me play this out before we identify a key tip that makes things much easier.
I ran the code and got another error ... this one I won't post here, but the chunk size it used was 1 byte too large. So I asked it to adjust the chunk size, and then it worked.
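Here is a minimal sketch of that chunking refactor ... essentially the same helper that appears in the final Appendix code. The variables text_file_content and comprehend come from the earlier snippet, and the exact chunk size may need the kind of nudge described above.

# Split the text into chunks under Comprehend's per-request size limit,
# run each chunk through the given processor, and collect the results
def process_text_chunks(text, chunk_size, processor):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [processor(chunk) for chunk in chunks]

chunk_size = 5000  # adjust downward if you still hit the size limit

entities_response = process_text_chunks(
    text_file_content, chunk_size,
    lambda chunk: comprehend.detect_entities(Text=chunk, LanguageCode='en')['Entities'])
key_phrases_response = process_text_chunks(
    text_file_content, chunk_size,
    lambda chunk: comprehend.detect_key_phrases(Text=chunk, LanguageCode='en')['KeyPhrases'])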
Using Amazon SageMaker to process output from Amazon Comprehend
Next, I took the code it produced back at the top of the blog to get counts of the Entity Recognition and Key Phrase Extraction output.
!@#&^*^%$.
Another error. But that is life in the fast lane. And to be perfectly honest, it took me longer to create this blog than to actually do the work. So, it is not nearly as frustrating as I make it out to be ... especially when you finally get across the finish line.
Now, this is where things get interesting. Check out this conversation I had with GPT4 and how it eventually made it very easy for me.
Tip 3: when GPT4 starts making changes to code and tells you to insert a certain part into the previous code ... it is much easier to ask it to rewrite the whole code block with the new changes inserted. That way, you don't make any mistakes when patching the old code by hand. Unless, of course, you like doing things the hard way ... and there is something to be said for doing so.
Dataframes and Finishing Touches
I ran the code, and then just sat there for a while, not realizing it was done. Haha. If the code doesn't tell you it's done (meaning you ask GPT4 to include a print statement announcing that it ran successfully ... this is probably Tip 4), then you need to watch the Jupyter notebook cell for that little number to appear in the brackets. That means it's done.
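Tip 4 in practice can be as simple as one extra line at the end of the processing cell (the message text here is just an illustration).

# Last line of the cell, so the notebook announces completion
print("Done: Comprehend output processed and dataframes created.")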
Then I just printed out the dataframe and counter objects created by the code above, and here is what you'll see.
One last thing ... I wanted it to give me a one-liner to sort the counter so I could see the most relevant entities and key phrases in that 312-word PDF I processed. Here is what it gave me.
I then took this snippet and simply swapped the 'counter_output' placeholder for one of my own objects ... like 'key_phrases_count' ... to get the descending sort of key phrases I was looking for.
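Concretely, the swap looks like this ... key_phrases_count is the Counter built in the Appendix code, and the same substitution works for entities_count.

# Sort the key phrase Counter in descending order of frequency
sorted_key_phrases = sorted(key_phrases_count.items(), key=lambda x: x[1], reverse=True)
for phrase, count in sorted_key_phrases:
    print(phrase, count)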
So, you can see the VA is very focused on the PACT Act, Operations and Maintenance, Modernization, etc., etc.
Conclusion
If you made it this far, congratulations. I hope this was helpful, and I will post the final code I used in the Appendix below.
Feel free to reach out to me if you have a mini-project you would like me to blog on, or specific data science questions or problems you are trying to solve. I would love to help.
Appendix
Textract file to S3 to Amazon Comprehend
import boto3
import json
import pandas as pd
from collections import Counter

# S3 locations -- the bucket name and output keys below are placeholders; swap in your own
bucket_name = 'xyz'
input_key = 'result/extracted_text.txt'
entities_key = 'result/entities_output.json'
key_phrases_key = 'result/key_phrases_output.json'

# Initialize the Boto3 clients for S3 and Comprehend
s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# Download the text file from the S3 bucket
response = s3.get_object(Bucket=bucket_name, Key=input_key)
text_file_content = response['Body'].read().decode('utf-8')

# Define the text chunk processing function
def process_text_chunks(text, chunk_size, processor):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    results = []
    for chunk in chunks:
        results.append(processor(chunk))
    return results

chunk_size = 5000

# Perform entity recognition and key phrase extraction on text chunks
entities_response = process_text_chunks(text_file_content, chunk_size, lambda chunk: comprehend.detect_entities(Text=chunk, LanguageCode='en')['Entities'])
key_phrases_response = process_text_chunks(text_file_content, chunk_size, lambda chunk: comprehend.detect_key_phrases(Text=chunk, LanguageCode='en')['KeyPhrases'])

# Create a dataframe for entity recognition results
entities = [entity['Type'] for entity_list in entities_response for entity in entity_list]
entities_count = Counter(entities)
entities_df = pd.DataFrame(list(entities_count.items()), columns=['Entity', 'Count'])

# Create a dataframe for key phrase extraction results
key_phrases = [key_phrase['Text'] for key_phrase_list in key_phrases_response for key_phrase in key_phrase_list]
key_phrases_count = Counter(key_phrases)
key_phrases_df = pd.DataFrame(list(key_phrases_count.items()), columns=['KeyPhrase', 'Count'])

# Save the raw Comprehend output back to S3
s3.put_object(Body=json.dumps(entities_response), Bucket=bucket_name, Key=entities_key)
s3.put_object(Body=json.dumps(key_phrases_response), Bucket=bucket_name, Key=key_phrases_key)

print(f"Entity recognition saved to: s3://{bucket_name}/{entities_key}")
print(f"Key phrase extraction saved to: s3://{bucket_name}/{key_phrases_key}")
Descending order code snippet
sorted_output = sorted(counter_output.items(), key=lambda x: x[1], reverse=True)
for item in sorted_output:
    print(item)