Process transcript with Python
Orsan Awawdi
Software Engineer | Extensive experience in automation using Python and Jenkins
Let's find the most frequent word in a Donald Trump speech.
The transcript of the speech can be found by googling "deal of the century transcript". LinkedIn does not like hyperlinks pasted into an article, which is why I am not attaching a link here.
To process the text, I first want to decide what text I do NOT need. The words we keep should be:
# 1) only alphabetical tokens
# 2) normalized to lowercase
# 3) clean of punctuation and quotes
# 4) clean of undesirable words (like names, stop words, etc...)
For this, I created a #Python method that takes the transcript file and the destination file to be written to as its two arguments. The method loops over the content of the input file, splits it by space, keeps only alphabetical words, removes punctuation according to a mapping table, normalizes the words to lowercase, removes any quotes, selects only the desirable words, and finally appends each word to a list.
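The cleaning steps above could look roughly like the sketch below. This is not the original code from the repository: the names clean_tokens, process_transcript and UNDESIRABLE are hypothetical, and the stop-word set is only a tiny placeholder. One assumption to note: the sketch strips punctuation before the alphabetical check, so that a token such as "century," is kept as "century" instead of being dropped.

```python
import string

# Hypothetical placeholder list; the article does not show the real
# set of undesirable words (names, stop words, and so on).
UNDESIRABLE = {"the", "and", "a", "to", "of", "in", "is", "that", "we", "i"}


def clean_tokens(text, undesirable=UNDESIRABLE):
    """Return the lowercase, alphabetical, desirable words found in text."""
    # Mapping table that deletes ASCII punctuation and curly quotes.
    table = str.maketrans("", "", string.punctuation + "“”‘’")
    words = []
    for token in text.split():
        # Strip punctuation and quotes, then normalize to lowercase.
        word = token.translate(table).lower()
        # Keep only purely alphabetical words that are not undesirable.
        if word.isalpha() and word not in undesirable:
            words.append(word)
    return words


def process_transcript(src_path, dst_path):
    """Clean the transcript at src_path and write one word per line to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        words = clean_tokens(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(words))
    return words
```

For example, clean_tokens('The "Deal" of the Century, 2020!') keeps only "deal" and "century": the stop words and the number are filtered out, and the quotes and comma are stripped.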
Now, let's count the frequency of each word. How common is each word?
For this, we use a dictionary that holds each word as a key and the frequency of that word as the value. Finally, we write the result to a text file or print it. Nice!
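A minimal sketch of the counting step, assuming the cleaned words are already in a list. The function names count_words, most_common and write_report are hypothetical, the output format is my own choice, and the sample words are made up for illustration, not taken from the speech.

```python
def count_words(words):
    """Build a dictionary mapping each word to how often it appears."""
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq


def most_common(freq, n=10):
    """Return the n most frequent (word, count) pairs, ties in alphabetical order."""
    return sorted(freq.items(), key=lambda item: (-item[1], item[0]))[:n]


def write_report(freq, dst_path, n=10):
    """Write the n most frequent words to dst_path, one 'word count' pair per line."""
    with open(dst_path, "w", encoding="utf-8") as dst:
        for word, count in most_common(freq, n):
            dst.write(f"{word} {count}\n")


if __name__ == "__main__":
    # Illustrative input only, not real words from the transcript.
    sample = ["peace", "deal", "peace", "people", "deal", "peace"]
    for word, count in most_common(count_words(sample), n=3):
        print(word, count)
```

The standard library's collections.Counter does the same job in fewer lines, but a plain dictionary shows the idea more directly.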
The code can be found on GitHub: