Recipe - Python Text Analyzer

Emilio Calderon

I'm passionate about data, information, knowledge management, and ontology for the betterment of humanity.

Published: September 23, 2023

Anyone who knows me knows that manipulating data is my bread and butter. So here is a recipe for a very simple text analyzer that only uses two ingredients, neither of which is bread or butter.

Although fresh bread and butter are kind of my jam.

Note: There is also no jam.

Ingredients:

1 18oz can - Visual Studio Code (or preferred flavor of code editor)

10 bytes - Python

This text analyzer will take a given string as input and output a map of each unique word and its frequency, as well as query for instances of specific words.

When you're all done your spaghetti code should look like this:

Just look at that moist slow cooked variable... those finely diced text format attributes... not to mention the wonderfully toasted methods and functions with an ever so subtle drizzle of twice reduced arguments. But I'm probably just being nostalgic; I vividly remember being a young lad and watching my grandfather analyze text for hours back on the farm. But I digress:

Instructions:

Step 1: Create a variable to store the text to analyze

Before you can analyze text, first you need text to analyze. Take a large bowl and create a variable to store the string of text you want to analyze. For this recipe I will call my variable 'givenstring' and assign a string to it, seen here in quotes:

givenstring="Clinica Sierra Vista continues to be at the forefront of diagnosing, treating, and managing the health conditions of the most vulnerable residents in our community. We are proud to provide the best healthcare to the underserved and lend a voice to the voiceless. Our medical team is ready to care for you, whether it’s a joyous occasion like welcoming a new baby to the world, or something that is less welcome, like any given ailment. We offer support for infectious diseases such as COVID-19, patient-centered behavioral health services for patients of all ages, teen family planning, and much more. Clinica Sierra Vista offers comprehensive health care as an organization that serves the primary medical, dental, and behavioral health needs of the 200,000-person population of Kern, Fresno, and Inyo Counties. We are proud to reach out to the underserved, underinsured, and uninsured of our community. We understand how those who are under or near the poverty line don’t get the medical treatment they sorely need. We are here to fill that gap, no matter your ability to pay."

Step 2: Create the TextAnalyzer class

To get Python to analyze the text we have to provide the actions we want it to take. But we might want to analyze text multiple times a day, as we have no self-control, so rather than cooking the entire dish from scratch every time, we bundle those actions together under a single name, 'TextAnalyzer'.

TextAnalyzer

This named bundle is a class. Just about everything in Python is an object, which is literally just a thing that holds data and does things. A class is the blueprint for a particular kind of object: it defines the attributes (the values the object stores) and the methods (the actions it can take and how). Every object built from that blueprint is called an instance of the class.

After we tell Python what properties this class of object has, we simply need to reference the name of the class and all of its attributes will already be baked in with a fine flaky crust. Because sometimes reheating a frozen pizza is so much quicker. It's not delivery, it's diobject.

class TextAnalyzer(object):

Now that Python knows this class is an object named TextAnalyzer, we can start to define its attributes. The first of which will be a method that will format our text.

Step 3: Define the text formatting method

A function in Python is a set of code that performs some pre-defined action and a method is just a function that is part of a class. So, the first method we toss into the TextAnalyzer class pot will remove punctuation, convert all text to lowercase, and store it in a nice little variable.

We begin by adding two tablespoons of: def __init__ (self, text):

class TextAnalyzer(object):
   def __init__ (self, text):

A class does not strictly need an __init__ method, but it is the standard place to define an object's properties or values when it is initialized, or created.

def lets Python know that the following code defines a function. Because it is indented inside the TextAnalyzer class, that function becomes a method of the class.

__init__ is the method Python runs automatically every time a new object is created from the class. It assigns the values, which we will be defining later.

(self, ...): as the first parameter, is used to reference the current instance of the class when assigning attributes to it. It's basically a nickname for the particular TextAnalyzer object being worked on. As we continue and create the methods for this class, we will need to tell Python explicitly that a specific action reads or modifies that object. We will do this shortly, which will help this concept make sense. For now just know that every regular method in a class takes this parameter first.

(..., text): any parameters after the first one are up to us. They can be named whatever we want but, in this case, we will add a parameter called text to receive the string we pass in, because these actions will primarily be manipulating text.
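To see self and an extra parameter at work before we start cooking, here is a minimal sketch. The Greeter class and its name attribute are made up for illustration and are not part of the recipe.

```python
class Greeter:
    def __init__(self, name):
        # self is the specific object being created;
        # name is whatever value was passed in when the class was called.
        self.name = name

    def greet(self):
        # self lets this method reach the value stored by __init__.
        return "Hello, " + self.name


first = Greeter("Ada")
second = Greeter("Grace")
print(first.greet())   # Hello, Ada
print(second.greet())  # Hello, Grace
```

Each object keeps its own copy of name, which is why the two calls print different greetings.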

We begin by skimming the scum that is punctuation using the .replace() method, which swaps each (.), (!), (?), (,), and (-) in the text for a space. The result is stored in a variable named formattedText.

class TextAnalyzer(object):
   def __init__ (self,text):
      formattedText = text.replace('.',' ').replace('!',' ').replace('?',' ').replace(',',' ').replace('-',' ')

Next into the pot is the .lower() function which will turn the output of formattedText, now without punctuation, into lowercase, for that velvety smooth data texture.

class TextAnalyzer(object):
   def __init__ (self,text):
      formattedText = text.replace('.',' ').replace('!',' ').replace('?',' ').replace(',',' ').replace('-',' ')
      formattedText = formattedText.lower()

And finally, for this method, we store formattedText on the object itself as self.fmtText by simmering for two hours. Notice that we're bringing back the fan favorite self. Assigning to self.fmtText makes fmtText an attribute of the instance, so the other methods in the class can reach the formatted text later.

class TextAnalyzer(object):
   def __init__ (self,text):
      formattedText = text.replace('.',' ').replace('!',' ').replace('?',' ').replace(',',' ').replace('-',' ')
      formattedText = formattedText.lower()
      self.fmtText = formattedText

We have now completed the text formatting portion of our Text Analyzer class. Let rest for 30 minutes or until slightly tacky to the touch while we work on the dictionary part of our analyzer.
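While that rests, here is a hedged alternative for the same formatting step. Instead of chaining .replace() calls, it uses str.maketrans() and str.translate(), which are standard string methods. This is a different technique from the one in the recipe, and the function name format_text and the sample string are my own assumptions.

```python
def format_text(text):
    # Build a table mapping each punctuation mark the recipe handles to a space.
    table = str.maketrans({mark: " " for mark in ".!?,-"})
    # Apply the table in a single pass, then lowercase the result.
    return text.translate(table).lower()


sample = "Well-done, chef! Ready?"
print(repr(format_text(sample)))  # 'well done  chef  ready '
```

Note the double spaces and the trailing space: every punctuation mark becomes a space, including the ones that were already followed by a space.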

Step 4: Define the dictionary method

This method will take the output of the text formatting method (fmtText), split the text into individual words, create an empty dictionary to store the word frequency, then loop over the unique words and record how many times each one appears.

We begin by creating the freqAll() method to put our actions into.

def freqAll(self):

The first action in this method will take self.fmtText as an argument and split each word apart and finally store it into a variable called wordlist.

def freqAll(self):
   wordlist = self.fmtText.split(' ')

wordlist is the name of the final output as a variable.

self.fmtText is the variable which stored the output of the text formatting method.

.split() is the function that splits the text. However, it needs to know where to split the text so here we tell it to split the text every time it encounters a space by putting a space within single quotes (' '). The space would be known as the delimiter. So if you wanted to split the text into individual sentences you could use a period as the delimiter like this : split('.')
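One side effect worth knowing about: because punctuation was replaced with spaces, the formatted text contains double spaces, and splitting on a single space turns those into empty strings. That is where the '' key in the final dictionary comes from. The short sketch below uses a made-up string to show the difference between split(' ') and split() with no argument, which splits on any run of whitespace.

```python
# Two spaces after "care" and one trailing space, as the formatting step would leave them.
formatted = "health care  for all "

print(formatted.split(' '))  # ['health', 'care', '', 'for', 'all', '']
print(formatted.split())     # ['health', 'care', 'for', 'all']
```

If the empty strings are not wanted in the word count, calling split() without a delimiter is one simple way to avoid them.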

Now that the text is split into byte sized chunks, we need to keep track of them. A convenient way of doing this is by using a dictionary which is basically a table that stores items in key : value pairs. Each word will go into the dictionary as a key and the frequency of the word will later on be stored as the value.

We will create an empty dictionary to store our words.

def freqAll(self):
   wordlist = self.fmtText.split(' ')
   freqMap = {}

freqMap is the name of the variable.

{ } curly brackets indicate that this is a dictionary which is one of many ways to store elements in Python. Other examples include lists which use [ ] or tuples which use ( ). For now just know that { } represents a dictionary and is currently empty.
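For a quick taste test of those brackets, here is a small sketch. The variable names are placeholders of my own choosing, not names used in the recipe.

```python
word_counts = {"clinica": 2, "health": 4}   # dictionary: key : value pairs
word_list = ["health", "care", "health"]    # list: ordered, duplicates allowed
word_pair = ("sierra", "vista")             # tuple: ordered, cannot be changed
empty_map = {}                              # an empty dictionary, like freqMap

print(word_counts["health"])  # 4
print(len(word_list))         # 3
print(type(empty_map))        # <class 'dict'>
```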

Here is the fun bit where we create a loop that will go over each unique word and record its frequency in the dictionary.

For example our dictionary is currently empty:

freqMap = {}

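But let's assume the unique words have already been added to the previously empty dictionary as keys, with their counts still to be filled in. Conceptually (this is not valid Python yet, since the values are missing) it would look something like this:

freqMap = {
   "clinica": ,
   "health" : ,
   "centered": ,
}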

Remember that a dictionary is a key : value pair. On the left hand side of the colon are the unique words (key) but the right hand side of the colon (value) is empty.

So now let's assume that the loop we are about to create has run and updated the frequency. Then the dictionary might look like this:

freqMap = {
   "clinica": 4,
   "health" : 6,
   "centered": 2,
}

This would tell us that in the initial text we provided, the word "clinica" shows up 4 times, "health" appears 6 times, and "centered" 2 times. (These counts are placeholders for illustration.)

So let's go ahead and start building the loop:

def freqAll(self):
   wordlist = self.fmtText.split(' ')
   freqMap = {}
   for word in set(wordlist):

for word in tells Python what element to loop over. In this case each word.

set(wordlist) takes wordlist, which is the variable currently storing the words, and converts it into a set. Sets, like lists, store multiple items in a single variable. A set is one of four built-in collection types in Python. We've mentioned the other three previously: {dictionaries}, [lists], and (tuples). Each one has unique properties, and the reason we are converting the assortment of words in wordlist into a set is that sets don't allow duplicate values. We don't need a dictionary that lists the word 'the' 38 different times.
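A small sketch of that deduplication, using a made-up word list rather than the recipe's text:

```python
wordlist = ["the", "best", "care", "the", "care", "the"]
unique_words = set(wordlist)

print(len(wordlist))         # 6
print(len(unique_words))     # 3
# Sets have no guaranteed order, so sort before printing for a stable result.
print(sorted(unique_words))  # ['best', 'care', 'the']
```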

So now that we have a variable that stores each unique word, we can count them and update the dictionary accordingly:

def freqAll(self):
   wordlist = self.fmtText.split(' ')
   freqMap = {}
   for word in set(wordlist):
      freqMap[word] = wordlist.count(word)
   return freqMap

freqMap[word] uses the current word as a key in the dictionary. Assigning to it creates the key : value pair for that word.

wordlist.count(word) uses the .count() method to count how many times word appears in wordlist. That number becomes the value stored for the word.

return freqMap hands the finished dictionary back to whatever code called the method, which allows us to use it later on.
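As a cross-check, the same counting can be done with collections.Counter from the standard library. The sketch below pulls the loop out into a standalone function (freq_all is a hypothetical name, not the method from the recipe) and compares its result with Counter on a made-up sentence.

```python
from collections import Counter


def freq_all(wordlist):
    # Same logic as the recipe's loop: one count per unique word.
    freq_map = {}
    for word in set(wordlist):
        freq_map[word] = wordlist.count(word)
    return freq_map


words = "we are proud we are here".split()
print(freq_all(words)["we"])                    # 2
print(freq_all(words) == dict(Counter(words)))  # True
```

Counter walks the list once, while the loop calls .count() once per unique word. For a paragraph of text either approach is fine.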

The freqAll() method will output the entire list of unique words with their frequency but what if we only want to know how often a single word appears?

Conveniently enough, we have a leftover method sitting in the back of the fridge that does exactly that.

Step 5: Creating a method to query specific words

First we thaw by tossing into the microwave for 3 minutes:

def freqOf(self,word):

Then we let the method know to look at the output of the freqAll() method, which is where all of the words currently are, and put it in a variable called freqDict:

def freqOf(self,word):
   freqDict = self.freqAll()

And lastly the method will look for the specific word, which we can specify later, and return its frequency, or simply 0 if it does not appear in the dictionary:

def freqOf(self,word):
   freqDict = self.freqAll()
   if word in freqDict:
      return freqDict[word]
   else:
      return 0
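The if/else lookup above can also be written with the dictionary's .get() method, which takes a default value to return when the key is missing. Here is a sketch on a small hand-made dictionary; the standalone freq_of function is an assumption for illustration, not the method from the recipe.

```python
freq_dict = {"clinica": 2, "health": 4}


def freq_of(freq_dict, word):
    # Return the stored count, or 0 when the word is not a key.
    return freq_dict.get(word, 0)


print(freq_of(freq_dict, "health"))   # 4
print(freq_of(freq_dict, "butter"))   # 0
print(freq_of(freq_dict, "Clinica"))  # 0, because the keys are all lowercase
```

The last line is a reminder that the formatted text was lowercased, so the word being queried should be lowercase too.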

Step 6: Putting it all together

We are near the final steps of preparation. But before we can present the meal we must first plate it.

The main course will be the TextAnalyzer which provides the bulk of our calories. But we have to make it a little more palatable:

analyzed = TextAnalyzer(givenstring)

analyzed is the name we give to the object created by TextAnalyzer(givenstring), which will make it much easier to access when we call its methods.

Next we want to make the analyzed text more presentable:

analyzed = TextAnalyzer(givenstring)
print("Formatted Text:", analyzed.fmtText)

print("Formatted Text:", analyzed.fmtText) will output the words within quotes followed by the value of analyzed.fmtText, which is our formatted string with punctuation removed and everything converted to lowercase.

This is what the output will look like on our screen:

Formatted Text: clinica sierra vista continues to be at the forefront of diagnosing  treating  and managing the health conditions of the most vulnerable residents in our community  we are proud to provide the best healthcare to the underserved and lend a voice to the voiceless  our medical team is ready to care for you  whether it’s a joyous occasion like welcoming a new baby to the world  or something that is less welcome  like any given ailment  we offer support for infectious diseases such as covid 19  patient centered behavioral health services for patients of all ages  teen family planning  and much more  clinica sierra vista offers comprehensive health care as an organization that serves the primary medical  dental  and behavioral health needs of the 200 000 person population of kern  fresno  and inyo counties  we are proud to reach out to the underserved  underinsured  and uninsured of our community  we understand how those who are under or near the poverty line don’t get the medical treatment they sorely need  we are here to fill that gap  no matter your ability to pay 

Next we call the function that counts the frequency of our unique words and ask it to print the results:

freqMap = analyzed.freqAll()
print(freqMap)

The output of print(freqMap) will look like this:

{'': 23, 'medical': 3, 'an': 1, 'clinica': 2, 'such': 1, 'underserved': 2, 'something': 1, 'get': 1, 'matter': 1, 'your': 1, 'understand': 1, 'and': 6, 'be': 1, 'reach': 1, 'support': 1, 'diagnosing': 1, 'ready': 1, 'is': 2, 'dental': 1, 'patients': 1, '200': 1, 'joyous': 1, 'no': 1, 'forefront': 1, 'planning': 1, 'it’s': 1, 'are': 4, 'centered': 1, 'healthcare': 1, 'at': 1, 'who': 1, 'inyo': 1, 'don’t': 1, 'infectious': 1, 'treating': 1, 'community': 2, 'more': 1, 'fill': 1, 'how': 1, 'under': 1, 'patient': 1, 'given': 1, 'primary': 1, 'a': 3, 'managing': 1, 'whether': 1, 'team': 1, 'less': 1, 'continues': 1, 'of': 6, 'lend': 1, 'you': 1, 'organization': 1, 'population': 1, 'underinsured': 1, 'needs': 1, 'health': 4, 'offers': 1, 'voiceless': 1, 'family': 1, 'sorely': 1, 'like': 2, 'diseases': 1, 'they': 1, 'vista': 2, 'voice': 1, 'the': 12, '19': 1, 'much': 1, 'our': 3, 'covid': 1, 'line': 1, 'for': 3, 'person': 1, 'teen': 1, 'world': 1, 'sierra': 2, 'poverty': 1, 'care': 2, 'serves': 1, 'best': 1, 'as': 2, 'vulnerable': 1, 'gap': 1, 'near': 1, '000': 1, 'residents': 1, 'welcoming': 1, 'we': 5, 'conditions': 1, 'comprehensive': 1, 'welcome': 1, 'in': 1, 'provide': 1, 'those': 1, 'ailment': 1, 'kern': 1, 'that': 3, 'treatment': 1, 'or': 2, 'occasion': 1, 'offer': 1, 'to': 10, 'need': 1, 'behavioral': 2, 'services': 1, 'most': 1, 'ages': 1, 'baby': 1, 'any': 1, 'pay': 1, 'counties': 1, 'new': 1, 'fresno': 1, 'here': 1, 'out': 1, 'ability': 1, 'uninsured': 1, 'all': 1, 'proud': 2}

Now if we want to query for a specific word we can create a variable, word, to store that word:

word = "clinica"

... in order to pass that word to the function that looks for its frequency in the dictionary:

frequency = analyzed.freqOf(word)

And if we really want to, rather than displaying the results by using:

print(frequency)

Whose output would look like this:

2

We can add a bit of garnish by using code like this:

print("The word",word,"appears",frequency,"times.")

Whose output would look like this:

The word clinica appears 2 times.
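For a slightly different garnish, the same sentence can be built with an f-string, which places the variables directly inside the text. In this sketch frequency is set by hand to the value from the run above, and message is a name I made up.

```python
word = "clinica"
frequency = 2  # the value analyzed.freqOf(word) returned in the run above

message = f"The word {word} appears {frequency} times."
print(message)  # The word clinica appears 2 times.
```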

And bam! You've got yourself a meal fit for a king, or at least a prince... well, maybe a jester on a good day.

You can scroll back to the top for a view of this entire delicacy, otherwise here is the same code with annotation:
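The original annotated image is not available here, so what follows is a reconstruction assembled from the steps above, with comments added. To keep it short it runs on a brief stand-in string (sample, my own assumption) instead of givenstring from Step 1.

```python
class TextAnalyzer(object):
    def __init__(self, text):
        # Replace punctuation with spaces, then lowercase everything.
        formattedText = text.replace('.', ' ').replace('!', ' ').replace('?', ' ').replace(',', ' ').replace('-', ' ')
        formattedText = formattedText.lower()
        # Store the result on the instance so the other methods can use it.
        self.fmtText = formattedText

    def freqAll(self):
        # Split on single spaces, then count each unique word.
        wordlist = self.fmtText.split(' ')
        freqMap = {}
        for word in set(wordlist):
            freqMap[word] = wordlist.count(word)
        return freqMap

    def freqOf(self, word):
        # Look up one word; 0 means it never appeared.
        freqDict = self.freqAll()
        if word in freqDict:
            return freqDict[word]
        else:
            return 0


# A short stand-in string; swap in givenstring from Step 1 for the full dish.
sample = "Fresh bread, fresh butter. No jam!"
analyzed = TextAnalyzer(sample)
word = "fresh"
frequency = analyzed.freqOf(word)
print("The word", word, "appears", frequency, "times.")  # The word fresh appears 2 times.
```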
