Leveraging NLP for Actionable Insights: Transforming Customer Engagement in the Australian Health Insurance Industry
Kithmal Bulathsinhala
Business Analyst specialized in Data Science | Insight Development | Transforming data into strategies that drive customer value and business success
In June 2021, Australia saw a remarkable shift in its healthcare landscape. After years of steady decline, the number of Australians with private health insurance climbed to nearly 14 million, making up about 54.3% of the population. This was a 1.4% increase from June 2020 and marked the first annual rise in private health insurance uptake since 2015.
By 2023, the coverage landscape had further evolved. An impressive 54.6% of Australians had extras cover, which includes services like dental, optical, and physiotherapy. Meanwhile, 44.9% of Australians had hospital cover, providing access to private hospital services.
As Australians paid almost $25.7 billion in private health insurance premiums in the 2020-21 period, the government also played a significant role in healthcare spending. In that same year, the Australian government poured $37.6 billion into medical services and benefits, primarily through Medicare and Private Health Insurance rebate expenses.
Looking ahead, the health insurance market in Australia is projected to reach a gross written premium of US$26.97 billion by 2024. Despite these substantial investments and projections, opinions on Australia's healthcare system remain divided. Concerns and discussions have echoed across social media platforms, particularly X (Twitter), where many Australians voice their thoughts and experiences.
To understand these sentiments better, we turned to X (Twitter) data spanning 2013 to 2022. By diving into this data, we aim to uncover patterns and common themes in conversations about insurance, providing valuable insights into what people are saying and feeling about their healthcare options.
NLP pipeline
To carry out an NLP task, it's essential to follow a structured pipeline, which includes the following steps:
1. Collecting Data: Gather the raw text data from relevant sources, such as tweets in our case.
2. Data Cleaning: Remove irrelevant elements like Twitter IDs, hashtags, and punctuation, and handle issues such as retweets and duplicates.
3. Pre-processing: Prepare the text for analysis through several key techniques:
- Stop Words Removal: Eliminate common but non-informative words.
- Tokenization: Break down the text into individual words or tokens.
- Stemming & Lemmatization: Reduce words to their base or root form to unify similar words.
4. Feature Extraction: Transform text data into numerical features for analysis:
- N-grams: Capture sequences of words to understand context.
- Term Frequency (TF): Measure how frequently each word appears.
- Bag of Words (BoW): Represent text by word counts, ignoring word order.
- TF-IDF: Evaluate the importance of words based on their frequency in the document and across the corpus.
- Word Embedding: Use models like Word2Vec or GloVe to capture semantic meaning through dense vector representations.
5. Text Analytics: Apply various techniques to extract insights:
- Classification: Categorize text into predefined categories.
- Topic Modeling: Identify and extract underlying topics from the text.
- Sentiment Analysis: Determine the sentiment expressed in the text (positive, negative, neutral).
- Summarization: Generate a concise summary of the text content.
By following these steps, we can effectively process and analyze text data, uncovering valuable insights and patterns in the information.
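To make the feature extraction step more concrete, here is a minimal standard-library sketch of bag-of-words counts and TF-IDF weights. The toy documents and the function name `tf_idf` are assumptions for illustration only. A real pipeline would more likely use a library vectorizer, and libraries apply their own smoothing, so their numbers will differ from this plain definition.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the term count divided by the document length and
    IDF is log(N / document frequency), with no smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # the bag-of-words representation
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "premium increase again",
    "premium rebate freeze",
    "great customer service",
]
scores = tf_idf(docs)
```

In this toy corpus "premium" occurs in two of the three documents, so it receives a lower weight in the first document than "again", which occurs in only one.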
Exploratory Analysis
To carry out any analysis, gaining a comprehensive understanding of the data is crucial. The figure below showcases Australia's market share for private healthcare providers.
For our analysis, we will focus on the tweets related to specific providers: Medibank, HBF, and Bupa, as well as Medicare, which is the publicly funded universal healthcare insurance scheme in Australia. Using Python, we will employ natural language processing (NLP) techniques to explore and analyze these tweets, uncovering valuable insights into public sentiment and discussions surrounding these key players in Australia's healthcare landscape.
Our dataset comprises various attributes, but for our analysis, it is essential to focus only on the relevant data. Specifically, we will consider only the 'date' and 'tweet' columns. The 'date' column will allow us to track trends and changes over time, while the 'tweet' column will provide the actual content for our natural language processing (NLP) analysis. By narrowing our focus to these key attributes, we can efficiently analyze the data and uncover meaningful patterns and insights related to the public's views on Medibank, HBF, Bupa, and Medicare.
Our dataset contains a total of 82,772 tweets. Upon closer examination, we found that the HBF dataset alone comprises 7,024 tweets. Notably, 4,186 of these tweets were posted directly by the HBFHealth account itself. Since our analysis aims to understand public opinion on the healthcare system in Australia, it is crucial to consider only the perspectives of the customers. Therefore, we will exclude the tweets posted directly by the HBFHealth account from our dataset to ensure our analysis reflects genuine consumer sentiment.
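A minimal sketch of this filtering step is shown below. It assumes each record carries a `username` field alongside `date` and `tweet`; that column name and the sample records are hypothetical, since the article does not list the raw attributes.

```python
# Hypothetical records; 'username' is an assumed column name.
tweets = [
    {"username": "HBFHealth", "date": "2021-03-01", "tweet": "Sorry to hear that, please DM us."},
    {"username": "some_customer", "date": "2021-03-01", "tweet": "Waited an hour on the phone today."},
    {"username": "hbfhealth", "date": "2021-03-02", "tweet": "Book your free health check."},
]

OFFICIAL_ACCOUNTS = {"hbfhealth"}  # stored in lowercase, compared case-insensitively

def is_customer_tweet(record):
    """True when the tweet was not posted by an official provider account."""
    return record["username"].lower() not in OFFICIAL_ACCOUNTS

# Keep only customer tweets, and only the two columns used later.
customer_tweets = [
    {"date": r["date"], "tweet": r["tweet"]}
    for r in tweets
    if is_customer_tweet(r)
]
```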
Preprocessing
Data cleaning is essential before using NLP and AI techniques. We convert all text to lowercase for consistency and remove Twitter IDs, retweets, and duplicate tweets. Additionally, we use regular expressions to strip punctuation and other unwanted patterns, and we remove common stop words that don't add value to the analysis. Numbers in the text column are also removed so that they do not interfere with sentiment analysis. These steps enhance dataset quality and relevance for NLP tasks.
Transforming to lowercase
Tweets from users can vary greatly in their casing: some may be in lowercase, uppercase, or a mix of both. In NLP tasks, such variations can cause the same word to be treated as several separate entities. To ensure consistency and accuracy in our analysis, we will convert all tweets to lowercase. This preprocessing step will help us treat similar words uniformly and enhance the reliability of our NLP analysis.
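The effect of casing on token counts can be seen in a short sketch. The sample tweets are invented for illustration.

```python
from collections import Counter

# Hypothetical tweets with three different casings of the same word.
tweets = [
    "Medibank raised my premium",
    "MEDIBANK support was helpful",
    "medibank app is down",
]

# Without normalization, the three spellings count as three different tokens.
raw_counts = Counter(word for t in tweets for word in t.split())

# After lowercasing, they collapse into a single token.
lower_counts = Counter(word for t in tweets for word in t.lower().split())
```

Here `raw_counts` holds separate entries for "Medibank", "MEDIBANK", and "medibank", while `lower_counts` holds one entry with a count of three.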
Removing Twitter IDs
Twitter IDs (any sequence of characters that follows the '@' sign) and hashtags are commonly used elements in tweets. However, for our analysis, these elements do not contribute meaningful insights. Therefore, we will remove all Twitter IDs and hashtags from our dataset. This will help us focus on the core content of the tweets and ensure that our analysis is centered on the relevant text without the distraction of these non-essential components.
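One way to do this is with a regular expression, sketched below. The pattern and the function name are assumptions, not necessarily what was used in the project. Note that this simple pattern also clips anything else containing '@' or '#', such as email addresses.

```python
import re

# Matches an '@' or '#' followed by letters, digits, or underscores.
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")

def strip_handles_and_hashtags(text):
    """Remove @mentions and #hashtags, then collapse leftover whitespace."""
    cleaned = MENTION_OR_HASHTAG.sub(" ", text)
    return " ".join(cleaned.split())

example = "@medibank my claim is still pending #healthinsurance #fail"
result = strip_handles_and_hashtags(example)
# result == "my claim is still pending"
```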
Duplicate removal
Next, we will address retweets and duplicate tweets. Users often retweet content, which can lead to repetitive entries in our dataset. Retweets and duplicate tweets can obscure the true sentiment and dilute the meaningful insights we aim to extract. To ensure our analysis reflects genuine and diverse opinions, we will remove retweets and duplicate tweets from our dataset. This will help us uncover clearer, more accurate insights into the public's views.
In our analysis, we identified and removed 16,735 duplicate tweets.
Removing punctuation
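A minimal sketch of this step follows. It assumes retweets can be recognized by the conventional "RT @" prefix, which is an assumption about the raw export, so it would need to run before Twitter IDs are stripped. The function name and sample tweets are hypothetical.

```python
def drop_retweets_and_duplicates(tweets):
    """Keep the first occurrence of each tweet and skip retweets.

    Tweets are compared after lowercasing and whitespace normalization,
    so trivially different copies count as duplicates.
    """
    seen = set()
    unique = []
    for text in tweets:
        normalized = " ".join(text.lower().split())
        if normalized.startswith("rt @"):
            continue  # assumed retweet marker
        if normalized in seen:
            continue  # already kept an equivalent tweet
        seen.add(normalized)
        unique.append(text)
    return unique

sample = [
    "Bupa raised premiums again",
    "RT @user: Bupa raised premiums again",
    "bupa raised premiums  again",
    "Medicare rebate freeze hurts",
]
kept = drop_retweets_and_duplicates(sample)
# kept == ["Bupa raised premiums again", "Medicare rebate freeze hurts"]
```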
Removing punctuation is another important step in our preprocessing. Punctuation marks, such as periods, commas, and exclamation points, can be scattered throughout tweets and might interfere with our analysis. By stripping out these punctuation marks, we simplify the text, allowing us to focus more effectively on the content and meaning of the words. This helps ensure that our NLP tasks can accurately interpret the text without the distraction of unnecessary symbols.
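A minimal sketch using only the standard library is shown below. It removes ASCII punctuation only, so typographic characters such as curly quotes would need extra handling. The function name is an assumption.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Delete ASCII punctuation characters from the text."""
    return text.translate(PUNCT_TABLE)

cleaned = remove_punctuation("Great service, thanks!!!")
# cleaned == "Great service thanks"
```

Apostrophes are removed too, so "don't" becomes "dont". If negation matters for sentiment analysis, it may be worth expanding contractions before this step.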
Stop word and domain-related word removal
In our analysis, we will also remove stop words and domain-specific terms. Stop words are commonly used words like 'I', 'me', 'my', 'we', 'our', 'you', 'myself', and 'you’ve' that often don't carry significant meaning for our purposes. Domain-specific terms are words that appear throughout the dataset because of its subject matter and are therefore too generic to distinguish one tweet from another. By eliminating both kinds of words, we can focus on the more meaningful content of the tweets, enhancing the clarity and relevance of our analysis.
In Python NLP libraries such as NLTK, the word "not" is included in the default English stop word list, but it plays a crucial role in sentiment analysis by indicating negation. To preserve the integrity of our sentiment analysis, we will retain such words that can significantly impact the meaning of the text.
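The idea of removing stop words while protecting negations can be sketched as follows. The word lists here are tiny, hypothetical stand-ins; a real stop word list from a library would be much longer.

```python
# Tiny illustrative stop word list (an assumption, not a library list).
BASE_STOP_WORDS = {"i", "me", "my", "we", "our", "you", "am", "with", "not", "no"}
NEGATIONS = {"not", "no"}               # words to keep for sentiment analysis
DOMAIN_WORDS = {"health", "insurance"}  # hypothetical domain-specific terms

# Final list: base stop words minus negations, plus domain words.
STOP_WORDS = (BASE_STOP_WORDS - NEGATIONS) | DOMAIN_WORDS

def remove_stop_words(tokens):
    """Drop every token that appears in the final stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "i am not happy with my health insurance".split()
filtered = remove_stop_words(tokens)
# filtered == ["not", "happy"]
```

Had "not" been removed, the remaining token "happy" would suggest the opposite sentiment.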
Standardization of tweets
After cleaning our dataset, we applied both stemming and lemmatization techniques to further process the text data. Stemming involves stripping away prefixes, suffixes, and other affixes to derive the base or root form of a word. In contrast, lemmatization reduces words to their root or dictionary form, ensuring that the resulting root word belongs to the language.
Upon comparing the outcomes of each technique with the original text, we observed that lemmatization preserved the integrity of words better than stemming. Stemming sometimes alters important words, such as changing "everyone" to "everyon" in our dataset, which can lead to semantic inaccuracies.
Therefore, in our analysis, lemmatization proved to be more effective for maintaining the contextual meaning of words, which is crucial for accurate NLP and sentiment analysis tasks. This step enhances the quality and reliability of our text data for subsequent analysis and modeling.
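The difference between the two techniques can be illustrated with a deliberately simplified sketch. Both functions below are toy stand-ins written for this illustration: the stemmer strips a fixed list of suffixes, and the lemmatizer looks words up in a hypothetical mini-dictionary. Real stemmers and lemmatizers use far richer rules and lexicons.

```python
SUFFIXES = ("ing", "ed", "es", "s", "e")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lexicon.
LEMMAS = {"claims": "claim", "claimed": "claim", "paying": "pay", "policies": "policy"}

def toy_lemmatize(word):
    """Return the dictionary form when known, otherwise leave the word as is."""
    return LEMMAS.get(word, word)

for word in ("everyone", "policies", "claims"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
# everyone -> everyon | everyone
# policies -> polici | policy
# claims -> claim | claim
```

The rule-based stemmer produces non-words such as "everyon" and "polici", while the dictionary lookup only ever returns valid words.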
Word frequency analysis
Word frequency analysis is a key step in understanding the content of our tweets. By examining how often specific words appear, we can identify common themes and prominent topics within the dataset.
Word frequency for Medibank
The figure above illustrates the commonly used words for Medibank. We will delve into the details of these words for each provider individually. By identifying the unique words associated with each provider, we can pinpoint common, less meaningful words that are frequently used across the dataset. This process will help us eliminate such words from our analysis, ensuring that our insights are focused on the most relevant and impactful terms for each provider.
Word frequency for Medicare
Word frequency for BUPA
Word frequency for HBF
Overall Word frequency analysis
By analyzing the individual word frequency for each provider, we identify that words such as "health," "HBF," "Medibank," "Bupa," "Medicare," and "Australia" are frequently mentioned. Given that our dataset focuses on health insurance providers based in Australia, these words naturally appear in the tweets. However, they may not contribute meaningful insights for our analysis as they are common across the dataset.
Before removing these frequently mentioned words, it’s crucial to carefully examine their impact on the overall meaning of the tweets. We need to ensure that eliminating these terms will not distort the context or sentiment conveyed in the text. This detailed review helps us maintain the integrity of our analysis while filtering out non-informative common words.
After a thorough investigation, we identified that some of these frequently mentioned words do not add meaningful value to our analysis. Consequently, we have removed these terms from our dataset to refine our focus on more impactful and insightful content.
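A minimal sketch of how shared high-frequency words might be identified is shown below. The token lists, the `top_terms` helper, and the cutoff `TOP_K` are all hypothetical. Words that rank among the top terms for every provider become candidates for removal, subject to the manual review described above.

```python
from collections import Counter

# Hypothetical cleaned tokens per provider.
tokens_by_provider = {
    "medibank": ["health", "health", "health", "premium", "premium", "app"],
    "bupa": ["health", "health", "premium", "premium", "dental"],
    "hbf": ["health", "health", "premium", "premium", "ceo"],
}

def top_terms(tokens, k):
    """Return the set of the k most frequent tokens."""
    return {term for term, _ in Counter(tokens).most_common(k)}

TOP_K = 2  # assumed cutoff for "frequent"

# Terms that are among the top k for every provider.
shared = set.intersection(
    *(top_terms(tokens, TOP_K) for tokens in tokens_by_provider.values())
)
# shared == {"health", "premium"}
```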
Text Feature Extraction
In our analysis, we employed both bi-gram and tri-gram techniques. Bi-gram analysis involves grouping words into pairs (two-word combinations), while tri-gram analysis groups words into triplets (three-word combinations). By applying these methods, we uncover meaningful word combinations that frequently appear together in the tweets. This approach helps us identify patterns and associations within the text data, shedding light on prevalent topics, phrases, or expressions used by customers and the companies.
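A minimal sketch of n-gram extraction and counting is given below. The helper name `ngrams` and the sample tweets are assumptions. N-grams are built within each tweet so that word combinations never span two tweets.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical tokenized tweets.
tweets = [
    "private health insurance is expensive".split(),
    "private health insurance rebate".split(),
]

bigram_counts = Counter(gram for t in tweets for gram in ngrams(t, 2))
trigram_counts = Counter(gram for t in tweets for gram in ngrams(t, 3))

most_common_trigram = trigram_counts.most_common(1)[0]
# most_common_trigram == (("private", "health", "insurance"), 2)
```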
We conducted this analysis separately for each provider to identify unique patterns specific to each one, as well as to recognize overarching trends. Detailed findings and additional insights from this analysis are available in my GitHub repository.
Based on the tri-gram analysis for HBF, it's evident that the company tweets mainly focus on promoting their free health check benefits and highlighting their involvement in sports initiatives, such as the "junior sports heroes" program. This suggests that the company is actively engaged in promoting health-related services and community activities through their tweets.
Additionally, the bi-gram analysis revealed specific word combinations like "sorry hear," which could potentially indicate the company's use of apologies or expressions of sympathy in response to customer tweets.
References were also made to an employee named Lauren (Lauren Underhill), who is likely a senior communication consultant responsible for handling customer inquiries or complaints on Twitter.
In the tri-gram analysis of customer tweets, the most common combinations concern private health insurance and the company’s new CEO, John Van Der Wielen. The bi-gram analysis likewise shows that most customers posted about their health insurance, the company CEO, and customer service.
The detailed findings above, along with additional insights, are available in our GitHub repository.
By examining the bi-gram and tri-gram analysis across all providers, we identified several key topics that users frequently discuss. These include social security, aged care, and the rebate freeze. These recurring themes highlight the primary concerns and areas of interest for users when discussing health insurance and related services on Twitter.
Temporal Analysis
To gain a deeper understanding of how public sentiment and discussion topics evolve over time, we conducted a temporal analysis of the tweets.
In April 2022, the number of tweets about health insurance in Australia reached its highest point. This surge in activity suggests that something significant occurred in the Australian health insurance industry, prompting a wave of tweets from users discussing the event. This spike in online chatter indicates a period of heightened interest and engagement, reflecting the public's response to the notable developments during that time.
From the above figures, it's clear that most users were talking about something related to Medicare in April 2022. This indicates a significant event or development regarding Medicare during that time. We will delve into the details of this discussion later to uncover what drove this spike in user engagement and sentiment.
The figures above illustrate that 2022 recorded the highest number of tweets, with a significant portion of them related to Medicare. This indicates that Medicare had a notable impact on the increased tweet activity for that year. However, it's also clear that, in general, users have been consistently more engaged with content related to Medibank over the years.
By looking at the above figures, we can note that overall user engagement on Twitter is higher at the beginning of the week and gradually drops towards the end.
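Counts of this kind can be produced with a short sketch like the one below. It assumes the 'date' column holds strings in `YYYY-MM-DD` format, which is an assumption about the raw data, and the sample dates are invented.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical tweet dates.
dates = ["2022-04-04", "2022-04-05", "2022-04-11", "2021-12-31"]
parsed = [datetime.strptime(d, "%Y-%m-%d") for d in dates]

# Tweets per month, keyed as "YYYY-MM".
per_month = Counter(d.strftime("%Y-%m") for d in parsed)

# Tweets per day of the week; weekday() returns 0 for Monday.
per_weekday = Counter(WEEKDAYS[d.weekday()] for d in parsed)

busiest_month = per_month.most_common(1)[0][0]
# busiest_month == "2022-04"
```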
Conducting a temporal analysis helps us identify user interaction patterns, which, in turn, reveal the most substantial topics of interest to users. This analysis can also highlight what decisions or changes contributed to these patterns for each provider at different times of the year.
Additionally, understanding these engagement patterns can be valuable for marketing strategies. By recognizing when users are most active on Twitter, companies can effectively time their messages to maximize reach and impact. This insight allows health insurance providers to better connect with their audience and tailor their communication strategies accordingly.
Sentiment Analysis
Following the temporal and engagement analysis, a sentiment analysis was conducted to capture customer reactions towards the companies.
We can observe that overall, the tweets have recorded an average sentiment value of 0.0819, indicating that users generally hold a neutral sentiment towards these providers. The highest sentiment value was recorded by HBF. However, examining the boxplots reveals that both HBF and Medibank exhibit a wide spread of sentiment, as indicated by their large interquartile range (IQR). In contrast, Medicare shows less variability in sentiment, meaning that users' opinions about Medicare are more consistent and less polarized than those about HBF and Medibank.
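The two summary statistics discussed here, the mean and the IQR, can be computed as in the sketch below. The scores are invented and assumed to lie on a scale from -1 (negative) to 1 (positive). The article does not state which sentiment scorer was used, so scoring itself is left out.

```python
from statistics import mean, quantiles

# Hypothetical sentiment scores per provider.
scores_by_provider = {
    "hbf": [-0.6, -0.1, 0.2, 0.5, 0.8],
    "medicare": [-0.1, 0.0, 0.05, 0.1, 0.15],
}

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

summary = {
    provider: {"mean": mean(values), "iqr": iqr(values)}
    for provider, values in scores_by_provider.items()
}
```

In this toy data the HBF scores have a much larger IQR than the Medicare scores, which is the pattern a wider box in a boxplot represents.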
Next, we summarized the sentiment values based on the year and the provider. This allows us to easily identify which years recorded the most positive sentiments and which years had the most negative sentiments. By analyzing these trends, we can uncover the decisions and reasons that influenced these shifts in sentiment. This detailed breakdown helps us understand the impact of specific events, policy changes, or other factors on public opinion towards each health insurance provider over time.
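A minimal sketch of this summary is shown below, using invented `(year, provider, score)` records. The record layout is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year, provider, sentiment score) records.
records = [
    (2020, "bupa", 0.3),
    (2020, "bupa", 0.1),
    (2021, "bupa", -0.2),
    (2021, "hbf", 0.4),
]

# Collect the scores for each (year, provider) pair.
grouped = defaultdict(list)
for year, provider, score in records:
    grouped[(year, provider)].append(score)

# Average sentiment per (year, provider) pair.
average = {key: mean(values) for key, values in grouped.items()}

most_negative = min(average, key=average.get)
# most_negative == (2021, "bupa")
```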
The visualization below illustrates that the average sentiment value for health insurance providers is declining over time. This trend indicates that users are becoming increasingly negative or less positive about these providers. Such a decline in sentiment could be due to various factors, including policy changes, service issues, or broader industry trends. By identifying and understanding these underlying causes, providers can address the concerns and improve their public perception.
Topic Modeling
Topic modeling is an NLP technique used to uncover the main themes or topics present in a collection of documents. It is a form of unsupervised machine learning that aims to discover underlying patterns in text data without any predefined labels or categories. For our analysis, we will utilize Latent Dirichlet Allocation (LDA), a popular topic modeling technique. LDA helps us identify the topics that are most prevalent in the tweets, providing valuable insights into the key subjects of discussion among users.
The LDA model identifies 10 distinct topics, each represented by a combination of keywords with specific weights contributing to the topic. Many of these topics revolve around various subjects, such as CEOs, Australian politics, and Medicaid, a U.S. government program that provides health insurance for adults and children with limited income and resources. Additionally, some topics seem to focus on customer service, likely related to call center interactions. These insights help us understand the diverse range of discussions and concerns expressed by users in their tweets.
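To show what LDA does under the hood, here is a compact sketch that fits the model by collapsed Gibbs sampling using only the standard library. It is not the implementation used for the results above, which is not described in this article and was more likely a library model. The function name, hyperparameters, and toy documents are all assumptions, and the example uses 2 topics rather than 10 because the toy corpus is tiny.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # The three most frequent words assigned to each topic.
    top_words = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[k][v])[:3]]
        for k in range(n_topics)
    ]
    return top_words, doc_topic

# Hypothetical tokenized tweets.
docs = [
    ["premium", "increase", "premium", "cost"],
    ["rebate", "freeze", "medicare", "rebate"],
    ["premium", "cost", "increase"],
    ["medicare", "rebate", "freeze"],
]
top_words, doc_topic = lda_gibbs(docs, n_topics=2, n_iter=50)
```

`top_words` lists the leading keywords per topic, and each row of `doc_topic` shows how many tokens of a document were assigned to each topic. Because sampling is random, results depend on the seed and on the number of iterations.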
Findings
We identified that in 2022, the tweet count for Medicare was high, possibly due to discussions about a rebate freeze during that time. This trend was further highlighted by our bi-gram analysis. Additionally, there was significant discussion around politics, as the 2022 Federal election brought some policy changes affecting healthcare insurance.
Further, the above figure highlights the reasons behind some of the tweet spikes during certain periods. For instance, the HBF CEO's announcement of premium increase cancellations, Medibank’s penalties, and BUPA’s lowest premium increase significantly contributed to the increased tweet activity. These insights provide a clearer understanding of the events and topics that drove public interest and engagement on Twitter.
Recommendations
Based on the comprehensive analysis conducted on the dataset, several key recommendations emerge.
Firstly, healthcare providers should leverage insights from temporal analysis to strategically schedule content and engagement activities during peak periods of customer interaction, such as the beginning of the year and early in the week, maximizing reach and responsiveness.
Allocating resources to promptly respond to customer inquiries, comments, and feedback on social media platforms is crucial, with a structured approach to address both positive and negative comments with empathy and professionalism.
To further enhance customer sentiment and strengthen relationships, providers can implement personalized engagement strategies, proactive customer service, and transparent communication. Initiating proactive outreach to customers through personalized messages, acknowledging their specific needs or concerns, and regularly sharing valuable content such as health tips, FAQs, or success stories can demonstrate care and expertise.
Maintaining transparent communication, especially during policy updates or changes, is essential. Clearly explaining the reasons behind decisions and addressing concerns openly will build trust and credibility.
Continuous monitoring of sentiment analysis is crucial to gauge customer perceptions and promptly address any negative sentiments to uphold a positive brand reputation. Using insights from topic modeling (e.g., LDA), providers can tailor marketing communications around identified key themes, aligning messaging with customer interests and priorities. By incorporating these strategies into their customer engagement approach, providers can cultivate positive sentiment, strengthen brand loyalty, and differentiate themselves as customer-centric organizations within the healthcare insurance industry.
GitHub repository: https://github.com/nkbulath/NLP_task.git
Special thanks to my team members Thamalshika Wijesundara, Dulani Ekanayaka & Himashi Hemachandra.
#NLP, #HealthInsurance, #CustomerEngagement, #DataScience, #MachineLearning, #AI, #Healthcare, #SentimentAnalysis, #TopicModeling, #DigitalHealth, #TechInHealthcare, #BigData, #AustralianHealthcare, #CustomerExperience, #HealthcareInnovation