ANALYSING WHATSAPP GROUP ACTIVITY WITH OPEN-SOURCE PROGRAMMING
Johanson Onyegbula
Remote Sensing Researcher | Geospatial Data Scientist | Software Engineer
My journey as a Master's student in the US wouldn't be complete without mentioning the group chats I belong to, and my experience in those groups gave birth to the idea I'll share with you in this piece. You see, these groups were in high demand because they prepared prospective graduate students for studies in the US and other Western countries. Hence, the admins wanted only those dedicated to the process to remain in the group. This led to a weekly routine in which the admins removed inactive members and added new, enthusiastic members as replacements.
The activity level in these groups is measured using the number of messages each participant sends. But imagine being an admin in multiple groups of over 300 participants each and having to "weed" out inactive members weekly or monthly from each of them, as the case may be. It's exhausting and often unrealistic to search through each participant's phone number or contact name to count the number of messages sent within a time frame. Such a strenuous task is not worth it for admins with any form of daily life activity beyond obsessing about a social media group.
As an admin of several such groups who regularly has to remove inactive members, I found it a strenuous challenge to determine the activity level in each large group. Therefore, I set out to find easier ways of identifying inactive participants. During this search, I came across many third-party APIs and online services that need access to my data (raising privacy concerns) and often sit behind paywalls for any significant use, so I disregarded them immediately.
After a while, I settled on developing a program that could analyze activity levels in WhatsApp groups using open-source programming languages. The code was initially developed in R using regular expressions, and it was successfully tested a few times. Regular expressions are patterns used for searching and filtering text strings based on the character combinations they tend to follow. However, since Python is more popular and I received several requests for the method I used to separate active from inactive members, I decided to rewrite it in Python to share with others.
For the sake of clarity, I'll walk you through the steps involved in analyzing WhatsApp group activity under the subheadings that follow.
Aggregate and Export WhatsApp Chats:
Before any analysis can be made, you need to access the data and messages each participant has sent to the group. The first step is to export the entire group's chat to a document. To do this:
PS: I prefer to export Without Media as my option, as it saves time and data used in the export. Moreover, the media won’t be useful for the analysis.
This process might take a few minutes to generate a text (.txt) file, which you should save and transfer to the folder your program can access. If you’re exporting from a phone and using a laptop for your program, you can send the generated file to your email as a draft, to another WhatsApp chat, Google Drive, or any means where you'll be eventually able to access it and copy it elsewhere for use in programming.
Inspecting and Discovering Trends with the Data:
This section is optional and can be skipped if you care only about results. But for those keen on understanding the thought process behind the code that follows, keep reading. Studying the patterns in the exported data is what makes it possible to write regular expressions that pick out each element of a message.
Here's what you need to do:
Open the text file on a computer with any editor (I use Notepad++) and view the data in it while comparing it to your WhatsApp chat. You'll notice messages are sorted by timestamps, with each starting on a new line and lots of weird characters. I noticed messages have the following characteristics:
month/day/year, hour:minute(strange character)AM/PM
month/day/year, hour:minute(strange character)AM/PM - number/contact: message
month/day/year, hour:minute(strange character)AM/PM - number/contact: <Media omitted>
A few exceptions exist for new members joining or old members leaving the group. In these cases, the colon between the contact and the message is missing, as seen below:
month/day/year, hour:minuteAM/PM - number/contact joined using this group's invite link
month/day/year, hour:minuteAM/PM - number/contact left
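To make the distinction concrete, here is a minimal sketch that labels a few lines by the layouts listed above. The sample lines, the `stamp` pattern and the `kind` helper are hypothetical illustrations rather than part of the program developed later, and the strange character is assumed to be a narrow no-break space (U+202F), which may differ in your export.

```python
import re

# Hypothetical sample lines imitating the export layout described above.
sample = (
    "10/5/23, 9:00\u202fAM - Ada: Good morning everyone\n"
    "10/5/23, 9:02\u202fAM - +1 555 010 0000: <Media omitted>\n"
    "10/5/23, 9:05\u202fAM - Bola joined using this group's invite link\n"
    "10/5/23, 9:07\u202fAM - Chidi left\n"
)

# A timestamp followed by " - "; the gap before AM/PM may be a plain
# space, a narrow no-break space, or absent.
stamp = r"\d+/\d+/\d+, \d+:\d+[ \u202f]?[AP]M - "


def kind(line):
    """Label a line as 'message', 'event' (join/leave notice) or 'other'."""
    if re.match(stamp + r"[^:]+: ", line):
        return "message"  # contact is followed by a colon and the text
    if re.match(stamp, line):
        return "event"    # timestamp present, but no colon after the contact
    return "other"        # e.g. the continuation of a multi-line message


kinds = [kind(line) for line in sample.splitlines()]
print(kinds)
```

Only the lines labelled as messages should count toward a participant's activity, which is why the colon matters in the patterns used later.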
Regular expressions:
While this article isn’t a tutorial on regular expressions (regex), it is essential to discuss them briefly. Regular expressions are patterns that describe the sequence of characters a piece of text must follow in order to match. They are a simple but powerful search tool, much more effective for complex and bulky text than looking up words by their literal occurrences.
An instance where regex wins over a conventional literal search is looking up every occasion since October 2023 on which a particular participant named “Drake” sent a media file (document, video, image, audio, sticker, etc.) to the group chat. Simply put:
“Find each message from October to December where Drake sent a media file.”
A conventional keyword search would involve finding all instances of the word “Drake” and probably copying those lines over to a new file, then deleting all lines whose timestamp shows a month before October (which is easy to do, as messages are ordered), and finally looking up the phrase “<Media omitted>”. Though simple, this process is time-consuming, and it has to be repeated by hand for every variation of the search.
Regular expressions, however, let us search with a pattern. From the previous section, we know the pattern is of the form:
month/day/year, hour:minute(strange character)AM/PM - number/contact: <Media omitted>
To be more specific, this becomes:
month/day/year, (something here) - Drake: <Media omitted>
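Here the month is 2 digits (10, 11, or 12, which is what limits the search to October through December, since earlier months are written with a single digit), the day can be 1 or 2 digits, and the year is 2 digits. This translates to: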
\d{2}/\d{1,2}/\d{2}.+Drake: <Media omitted>
“\d” refers to any digit between 0 and 9; a number within curly braces gives the exact number of times the digit must appear, while two numbers give the range of times it may appear. “.” refers to any character except newline characters, and “+” means the preceding character must occur at least once. Documentation and standard patterns associated with regex can be easily found online, for those seeking to learn this powerful tool.
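As an illustration, the pattern can be tried on a few made-up lines with Python's re module. The sample text is hypothetical, and the “^” and “$” anchors (used with re.MULTILINE so they apply to each line) are an added precaution, not part of the pattern above.

```python
import re

# Hypothetical export lines; U+202F is assumed to be the strange character.
sample = (
    "9/30/23, 8:15\u202fPM - Drake: <Media omitted>\n"
    "10/2/23, 7:40\u202fAM - Drake: Hello all\n"
    "10/14/23, 1:05\u202fPM - Drake: <Media omitted>\n"
    "11/3/23, 6:30\u202fPM - Ada: <Media omitted>\n"
    "12/25/23, 10:00\u202fAM - Drake: <Media omitted>\n"
)

# Two-digit month, so only lines from October to December can match.
pattern = r"^\d{2}/\d{1,2}/\d{2}.+Drake: <Media omitted>$"

matches = re.findall(pattern, sample, flags=re.MULTILINE)
for line in matches:
    print(line)
```

The September line is skipped because its month has one digit, the second line because it is a text message, and the November line because it comes from a different contact.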
Python program to Analyze Messages and Classify Participants:
To analyze the activity level of WhatsApp group members based on the number of messages sent over a period, key details needed are the contact details/phone number of every user as well as the timestamp of each message. This information can then be aggregated to determine how many messages a single user sent over a time interval.
The Python code for this analysis is illustrated below:
First, the relevant modules are imported. The “os” module is only used to change the directory to the exported text file, which would be opened and analyzed programmatically. The “re” module is for regular expressions in Python, while pandas is used for making data frames which ease data storing and categorizing. Finally, the “datetime” module is also imported for handling the timestamp of the messages.
import os, re
import pandas as pd
from datetime import datetime
Next, we switch to the exported text file's directory or folder. Replace “mydirectory” with the full (absolute) path to the folder containing the text file containing chats you wish to analyze.
os.chdir("mydirectory")
The exported text file is opened in read-only mode with “utf-8” encoding, as that is the typical encoding of the characters in the file. Replace the value of the “filename” variable with the actual name of your text file. The entire contents are then read into the “data” variable at once, and the file is closed once reading is done. The “data” variable is what our regular expressions will search for patterns and relevant information.
filename = "whatsapp_chat.txt"
# The with-block closes the file automatically once its contents are read
with open(filename, 'r', encoding='utf-8') as file:
    data = file.read()
As previously identified, the time stamp is the first element of every message in the exported chats:
month/day/year, hour:minute(strange character)AM/PM - number/contact: message
It consists of the date and time. The date has one or two digits for the month, a forward slash, one or two digits for the day, another forward slash, and then two digits for the year. A comma separates it from the time, given in hours and minutes on a 12-hour clock and ending with either AM or PM. Because the time section contains a variable number of unspecified characters, it can be lumped into the pattern “.*?”, where “.” stands for any character except a newline, “*” means it can occur any number of times, and “?” keeps the match as short as possible so that it stops at the first AM or PM on the line. Since we are interested in just the timestamp, we capture its pattern in parentheses and ignore everything that follows it. The re.findall function finds every part of the text that matches the pattern and returns the captured results as a list, which is assigned to the “times” variable. The pattern is written as a raw string (the r prefix) so that Python passes backslashes such as “\d” to the regex engine unchanged. The regex is shown below:
times = re.findall(r"(\d+/\d+/\d+.*?[AP]M) - [^:]+: .*", data)
To extract contact information, we capture the part immediately after the timestamp, based on the identified structure of messages in the exported chats. The saved contact name or phone number follows the hyphen after the AM/PM of the timestamp and comes before a colon and the message. This is stored in the “contactNames” variable.
Note that the two patterns are the same apart from what is captured, and both require a colon before the message (which is represented by the final “.*”). This is because we are not interested in the lines marking the addition or departure of participants, which have no colon after the contact information. It ensures that in the subsequent analysis, members won’t be counted as having an extra message simply because they left or joined the group over a period.
contactNames = re.findall(r"\d+/\d+/\d+.*?[AP]M - ([^:]+): .*", data)
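A quick way to check both patterns is to run them on a handful of made-up lines. The sketch below uses hypothetical sample data and writes the patterns as raw strings with a non-greedy “.*?”, which is an assumption about the safest form rather than a requirement.

```python
import re

# Hypothetical export lines: three messages plus a join and a leave notice.
data = (
    "8/26/23, 9:00\u202fAM - Ada: Good morning\n"
    "8/26/23, 9:03\u202fAM - Bola joined using this group's invite link\n"
    "8/26/23, 9:10\u202fAM - +1 555 010 0000: <Media omitted>\n"
    "8/26/23, 9:12\u202fAM - Chidi left\n"
    "8/26/23, 9:15\u202fAM - Ada: Welcome, Bola\n"
)

# Same pattern twice; only the captured group differs.
times = re.findall(r"(\d+/\d+/\d+.*?[AP]M) - [^:]+: .*", data)
contacts = re.findall(r"\d+/\d+/\d+.*?[AP]M - ([^:]+): .*", data)

for when, who in zip(times, contacts):
    print(repr(when), who)
```

Both lists hold one entry per message, in the same order, and the join and leave notices contribute nothing to either list.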
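The WhatsApp timestamp is then converted, using a list comprehension, to a more practical and easy-to-read format before further analysis. The “\u202f” in the format string is the strange character noted earlier, a narrow no-break space that sits between the minutes and AM/PM. The result is stored in the “adjusted_times” variable.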
adjusted_times = [datetime.strptime(instance, '%m/%d/%y, %I:%M\u202f%p').strftime('%Y-%m-%d %H:%M:%S') for instance in times]
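Exports from other phones or app versions may use a plain space, or no space at all, before AM/PM, in which case the format string above would raise an error. A more tolerant variant could look like the sketch below; the `parse_stamp` helper is hypothetical and assumes the same month/day/year order and 12-hour clock.

```python
from datetime import datetime


def parse_stamp(stamp):
    """Parse 'm/d/yy, h:mm AM' whether the gap before AM/PM is a plain
    space, a no-break space, or missing altogether."""
    # Normalise the special space characters to an ordinary space.
    cleaned = stamp.replace("\u202f", " ").replace("\u00a0", " ")
    for fmt in ("%m/%d/%y, %I:%M %p", "%m/%d/%y, %I:%M%p"):
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next layout
    raise ValueError(f"Unrecognised timestamp: {stamp!r}")


evening = parse_stamp("8/26/23, 9:05\u202fPM")
midnight = parse_stamp("11/11/23, 12:30 AM")
no_gap = parse_stamp("1/2/23, 7:00AM")
print(evening, midnight, no_gap, sep="\n")
```

The helper returns datetime objects directly, so they can be placed in a data frame column without first being formatted back into strings.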
Next, we create a data frame for pairing up the extracted information. The lists of captured contacts and timestamps should be of the same length, with one entry of each per message. Hence, a 1:1 correspondence is established, and each pair can be stored in two columns of the same row in a data frame made with the pandas library. Each entry in “adjusted_times” is a formatted string, but it needs to be converted to a datetime format if date intervals are to be used for subsequent analysis. Hence, this conversion is done on the Time column, which is originally a string.
Whatsapp_group = pd.DataFrame()
Whatsapp_group['Contacts'] = contactNames
Whatsapp_group['Time'] = adjusted_times
Whatsapp_group['Time'] = pd.to_datetime(Whatsapp_group['Time'])
Now that a data frame has been created with WhatsApp contact names and timestamps of messages, we can filter for the period in which we want to analyze group participants’ activity levels. This assumes you’ve been in the group throughout that period, since the export only contains messages you received. For example, to analyze members' activity between the 26th of August, 2023, and the 11th of November, 2023, filter as shown below. The upper bound is written as the day after the end date because a bare date is treated as midnight at the start of that day, so “<= '2023-11-11'” would leave out messages sent later on the 11th.
relevant_members = Whatsapp_group[(Whatsapp_group['Time'] >= '2023-08-26') & (Whatsapp_group['Time'] < '2023-11-12')]
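One detail worth checking when filtering on dates is how the end of the range is handled, because a date without a time is read as midnight at the start of that day. The sketch below uses a small hypothetical data frame (the names and times are made up) to compare an inclusive comparison against the bare end date with a half-open range ending on the following day.

```python
import pandas as pd

# Hypothetical messages around the edges of the period of interest.
chat = pd.DataFrame({
    "Contacts": ["Ada", "Bola", "Chidi", "Ada"],
    "Time": pd.to_datetime([
        "2023-08-25 23:59:00",
        "2023-08-26 00:00:00",
        "2023-11-11 18:30:00",
        "2023-11-12 00:00:00",
    ]),
})

start = pd.Timestamp("2023-08-26")
end = pd.Timestamp("2023-11-11")

# Comparing with the bare end date stops at midnight on that day.
upto_midnight = chat[(chat["Time"] >= start) & (chat["Time"] <= end)]

# A half-open range ending on the next day keeps the whole end day.
next_day = end + pd.Timedelta(days=1)
whole_days = chat[(chat["Time"] >= start) & (chat["Time"] < next_day)]

print(list(upto_midnight["Contacts"]))
print(list(whole_days["Contacts"]))
```

In this sample, the message sent at 18:30 on the 11th of November is only kept by the second filter.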
Once this is done, the data frame can be aggregated by contact names to calculate the total number of messages each participant has sent to the group over that time. This, by default, outputs the result in descending order of total number of messages. Hence, the most active users come first, and the top 5 can be viewed with the “.head()” method of data frames, or modified as needed.
all_members = relevant_members['Contacts'].value_counts()
all_members.head()
Next, choose a threshold, that is, the minimum number of messages a participant must have sent to fit your definition of active. Participants whose message count meets the threshold are classified as active. The “active” variable can be printed to reveal the unique set of contact details of such active members.
activity_threshold = 5
active = all_members[all_members >= activity_threshold]
active = set(active.index)
If you simply wish to print the unique contact names of every member who sent a message in the chosen period, sorted in ascending order, that can be easily done as well.
sorted(relevant_members['Contacts'].unique())
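Finally, the inactive members are obtained in the same way as the active members, with the comparison operator flipped to select those whose number of messages falls below the threshold. The “inactive” variable can be printed out to reveal the contact names of each inactive participant. Note that members who sent no messages at all during the period never appear in the counts, so they will not be listed here either.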
inactive = all_members[all_members < activity_threshold]
inactive = set(inactive.index)
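For repeated use, the steps above can be gathered into a single function. The following is a minimal sketch under a few assumptions: the `classify_members` name, its parameters and the sample text are hypothetical, the export uses the month/day/year layout described earlier, and lines are parsed one at a time instead of with re.findall on the whole text.

```python
import re
from datetime import datetime

import pandas as pd

# Timestamp, optional gap, AM/PM, then " - contact: " at the start of a line.
LINE = re.compile(r"(\d+/\d+/\d+, \d+:\d+)[ \u202f]?([AP]M) - ([^:]+): ")


def classify_members(text, start, end, threshold=5):
    """Return (active, inactive) sets of contacts for messages sent from
    the start date through the end of the end date."""
    rows = []
    for line in text.splitlines():
        found = LINE.match(line)
        if found:  # join/leave notices and continuation lines are skipped
            stamp = f"{found.group(1)} {found.group(2)}"
            when = datetime.strptime(stamp, "%m/%d/%y, %I:%M %p")
            rows.append((found.group(3), when))
    chat = pd.DataFrame(rows, columns=["Contacts", "Time"])
    chat["Time"] = pd.to_datetime(chat["Time"])
    first = pd.Timestamp(start)
    last = pd.Timestamp(end) + pd.Timedelta(days=1)  # keep the whole end day
    in_range = chat[(chat["Time"] >= first) & (chat["Time"] < last)]
    counts = in_range["Contacts"].value_counts()
    active = set(counts[counts >= threshold].index)
    inactive = set(counts[counts < threshold].index)
    return active, inactive


# Hypothetical sample export.
sample = (
    "8/26/23, 9:00\u202fAM - Ada: Good morning\n"
    "8/27/23, 9:10\u202fAM - Ada: Any updates?\n"
    "9/1/23, 4:45\u202fPM - Bola: Not yet\n"
    "9/2/23, 8:00\u202fAM - Chidi joined using this group's invite link\n"
    "11/11/23, 7:30\u202fPM - Ada: Deadline is tonight\n"
    "11/12/23, 8:00\u202fAM - Bola: Too late for me\n"
)

active, inactive = classify_members(sample, "2023-08-26", "2023-11-11", threshold=2)
print("Active:", active)
print("Inactive:", inactive)
```

With real data, the sample text would be replaced by the contents of the exported file, and the dates and threshold adjusted to suit each group.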
Conclusion:
With the above steps and the entire code run as a script, the activity level of group members can be analyzed within a second, and whatever you want to do next with that information becomes easy. There are many other reasons one might want to carry out this kind of analysis. It is left to you to decide what to do with the information and to extend the code beyond its current capacity for personalized use.