Content intelligence in the age of AI & Machine Learning
Content = Publishers. But who are publishers?
As the world evolves, the publishing space expands. In the past, we considered those who distributed newspapers to be publishers; in the internet era, the scope includes websites, blogs, music and video game publishers, and even micropublishers. As you might have noticed, publishing revolves more around the content than around the medium it is published through, which is a key thing for a publisher to understand in order to take advantage of ML.
Data Strategy
The first step to utilising ML and AI in general, regardless of the business domain, is to develop a data strategy, or what you could call a Maslow's hierarchy of needs for data. Monica Rogati puts it nicely:
“Think of AI as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water and shelter (data literacy, collection and infrastructure).”
For publishers, here is how you can look at it:
Data Warehouse
Why centralised data? If all your data is in one place, it is easier to connect the dots across different datasets, not to mention easier for users to access it from a single system. In a large organisation, a centralised data warehouse also helps avoid data duplication, since everyone works from the same source. That said, it is important to have the right resources (e.g. data engineers), technology (data pipelines, workflows, cloud services), and practices (documentation, design, change control, project management) to keep the warehouse healthy and reliable. Data is simply the foundation for ML, so having clean, organised, and easily accessible data is crucial before going any further; it's your food, water, and shelter. Once you have that, you can do:
- Business Intelligence: informs people in Marketing, Product, or Editorial so they can make decisions based on historical data.
- Machine Learning & AI: using the same data, you can predict the future and plan accordingly, or develop products and solutions. These can be built with ML APIs (for example from Google or Amazon) that provide services such as natural language processing, speech recognition, video intelligence, and computer vision. Alternatively, you can develop your own custom ML models using the likes of TensorFlow, Dataproc/Spark, or IBM Watson (your data scientists will know these). The purpose of this article is not the technical side, so feel free to explore those on your own if you have a technical background, or skip them; actual use cases are discussed later in the article.
In addition to your own data, external data can be acquired using APIs or services. A good example is Google BigQuery. Besides being a data warehouse service itself, it gives you access to several datasets such as Google Analytics, newsletters, Google AdWords, DoubleClick, subscription data, and many more. It also has a nice feature called BigQuery ML (BQML) that lets you run ML models using standard SQL, so any engineer who knows SQL can use data and ML models in a much simpler fashion (this is not a paid ad for Google :) ).
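For a flavour of what that looks like in practice, here is a minimal sketch of training and querying a BQML model from Python with the google-cloud-bigquery client. The dataset, table, and column names are hypothetical placeholders, not a real schema:

```python
# A minimal sketch of BQML from Python; `mydataset.pageviews`,
# `mydataset.article_model`, and the columns are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Train a simple logistic regression inside BigQuery using plain SQL.
train_sql = """
CREATE OR REPLACE MODEL `mydataset.article_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  category,
  word_count,
  publish_hour,
  clicked AS label
FROM `mydataset.pageviews`
"""
client.query(train_sql).result()  # blocks until training finishes

# Score new rows with the trained model.
predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL `mydataset.article_model`,
                (SELECT category, word_count, publish_hour
                 FROM `mydataset.new_articles`))
"""
for row in client.query(predict_sql).result():
    print(dict(row))
```

The whole workflow stays in SQL; Python only ships the queries, which is exactly why SQL-literate engineers can get started without a deep ML background.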
Natural Language Processing
If you already know what NLP is, skip this section and jump to the use cases.
So what is Natural Language Processing, or NLP? Simply put, as human beings we have been writing things down for thousands of years (since between 3400 and 3300 BC), and over that long history we have developed an exceptional skill for understanding text. We don't just understand it; we also feel and imagine the written text in relation to real life. NLP is a sub-field of AI that focuses on understanding human language, which is a tedious job. Why? Let me give you an example; look at the sentence below:
Salah was on fire yesterday, he destroyed Chelsea
As a human, you can easily understand that this is about Mo Salah the football player and that he played very well yesterday against Chelsea. Even if you are not a football fan, you will understand that it's some sort of player. A computer, on the other hand, might think that someone called Salah was literally on fire yesterday, and that this person literally destroyed and brought to the ground the area of Chelsea in London! Funny as that is, you can imagine how dangerous it could be if such a system were used for fraud detection in a bank, or, in the case of a publisher, for understanding readers' interests.
The core value of NLP is that it helps us identify entities in text (persons, places, organizations, events, etc.) and categories of text (e.g. sports, art). You can go to the Google NLP API website and test it yourself on any text. The text of the Game of Thrones wiki page, for instance, results in a list of recognised entities and the category /Arts & Entertainment/TV & Video/TV Shows & Programs.
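If you prefer code to the web demo, here is a short sketch of calling the same API through the google-cloud-language Python client; the sample text is an arbitrary stand-in (note that text classification needs reasonably long input):

```python
# A sketch of entity extraction and classification with the
# Google Natural Language API; the sample text is arbitrary.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
text = (
    "Game of Thrones is an American fantasy drama television series "
    "created by David Benioff and D. B. Weiss for HBO. The series "
    "premiered in April 2011 and follows the noble houses of Westeros."
)
document = language_v1.Document(
    content=text, type_=language_v1.Document.Type.PLAIN_TEXT
)

# Entity extraction: people, places, organizations, events, ...
entities = client.analyze_entities(request={"document": document})
for entity in entities.entities:
    print(entity.name, language_v1.Entity.Type(entity.type_).name)

# Content classification (works best on longer documents).
categories = client.classify_text(request={"document": document})
for category in categories.categories:
    print(category.name, category.confidence)
```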
The Process/Architecture
From a process and operational perspective, and leaving the technical side aside, a publisher can simply run any content added or ingested into their CMS through an API to extract the NLP entities discussed above and classify that content. The result is additional metadata for the content (NLP tags). This data should also be stored in the data warehouse so it is accessible to any solution or product built on top of it, whether a BI report or an ML engine (e.g. a recommendation engine).
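To make that flow concrete, here is a simplified sketch; `extract_entities`, `classify`, and the `warehouse` client are hypothetical stand-ins for your NLP API and data-warehouse tooling, not real libraries:

```python
# A simplified sketch of the ingestion flow described above.
# `nlp_client.extract_entities`, `nlp_client.classify`, and
# `warehouse.insert` are hypothetical stand-ins for your own tooling.
def enrich_and_store(article, nlp_client, warehouse):
    """Run one CMS article through NLP and store the tags as metadata."""
    entities = nlp_client.extract_entities(article["body"])  # people, places, ...
    categories = nlp_client.classify(article["body"])        # e.g. /Sports/Football
    metadata = {
        "article_id": article["id"],
        "nlp_entities": entities,
        "nlp_categories": categories,
    }
    # One row per article in the warehouse, available to BI and ML alike.
    warehouse.insert("content_metadata", metadata)
    return metadata
```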
Use Cases
Now all of that sounds good, but do these categories or entities have any business value? Out of the box, probably not. But using NLP, as a publisher you can:
- Semantic data within the CMS for editors to view
- BI reports for content-based analysis
- Content recommendation
- Content-to-video recommendation
- Customer segmentation
- Ad matching to content and other 3rd-party ads using categories
Here are some use cases you can kick off with as a publisher.
Segmentation
Using categories generated by NLP (whether from your own internal API or an external one such as Google's), you can segment your users/readers based on their reading behaviour. For example, you can identify users who read food content versus sports content, and approach each segment with the appropriate marketing message. So if you know a group of readers is interested in football, for instance, you can push a newsletter targeted at that segment. ML can also give you more granular segments than manual segmentation.
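As a rough illustration, here is a minimal rule-based segmentation in pandas, assuming a hypothetical reads table with one row per article a user has read:

```python
# A minimal segmentation sketch; the `reads` table and its values are
# illustrative placeholders for your own warehouse data.
import pandas as pd

reads = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "nlp_category": ["Sports", "Sports", "Food", "Food", "Food", "Sports"],
})

# Each user's most-read category becomes their segment.
segments = (
    reads.groupby("user_id")["nlp_category"]
    .agg(lambda cats: cats.mode().iloc[0])
    .rename("segment")
    .reset_index()
)
print(segments)  # user 1 -> Sports, user 2 -> Food, user 3 -> Sports

# The audience for a football/sports-targeted newsletter.
sports_readers = segments.loc[segments["segment"] == "Sports", "user_id"]
```

A real pipeline would use richer rules or a clustering model, but the principle is the same: NLP categories turn raw pageviews into addressable audiences.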
Ad Targeting
By running your historical content through NLP and extracting categories, you can create key-value pairs mapping each page/content item's actual category to its NLP-generated category. Using that, you can run ad campaigns targeting, for example, World Cup-related content. Previously, to achieve similar behaviour you would have had to label such pages manually, which is time-consuming, hard to maintain, and prone to errors.
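A tiny illustration of that mapping, with made-up URLs and categories:

```python
# Illustrative key-value mapping: page URL -> NLP-generated category.
page_categories = {
    "/news/england-vs-france-preview": "/Sports/Football/World Cup",
    "/recipes/summer-salads": "/Food & Drink/Recipes",
    "/news/world-cup-final-report": "/Sports/Football/World Cup",
}

# Pages eligible for a World Cup ad campaign, with no manual labelling.
world_cup_inventory = [
    url for url, category in page_categories.items()
    if "World Cup" in category
]
print(world_cup_inventory)
```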
BI Analytics
You can join your usual Google Analytics numbers, such as user visits, to the categories generated by your NLP engine. That way you can figure out which categories people visit and read most, and how many articles already exist in each category. This can indicate, for instance, which categories to focus on, or what your users are currently interested in. And that is just one example; there is plenty more you can generate from this, such as trends around a certain topic or a certain celebrity.
Google Analytics + Content + NLP + BI = Powerful content analytics
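As a sketch of that equation, here is a toy join of per-article visits with NLP categories, showing demand (visits) versus supply (article count) per category; the numbers are illustrative:

```python
# A sketch of joining (hypothetical) analytics exports with NLP categories.
import pandas as pd

visits = pd.DataFrame({            # per-article visits from analytics
    "article_id": [1, 2, 3, 4],
    "visits": [900, 150, 300, 1200],
})
metadata = pd.DataFrame({          # NLP tags from the data warehouse
    "article_id": [1, 2, 3, 4],
    "nlp_category": ["Sports", "Food", "Food", "Sports"],
})

report = (
    visits.merge(metadata, on="article_id")
    .groupby("nlp_category")
    .agg(total_visits=("visits", "sum"), articles=("article_id", "count"))
)
print(report)  # demand (visits) vs supply (article count) per category
```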
Another common usage is analysing 3rd-party content. Since publishers nowadays receive plenty of 3rd-party content, you can create filters based on categories and content sources. If, for instance, source X provides mostly sports content, you can tailor your offerings based on that.
Recommendation Engines
A company like Netflix values its recommendation engine at 1 billion dollars. As a publisher, using machine learning (ML) and NLP you can resurface existing content that nobody is looking at, which can create more ad revenue or subscribers. Recommendation can be content-to-content (the usual type: recommending articles similar or related to the one being read), personalized (based on browsing history and/or similar users, which is what Netflix does), or, lastly, video-to-content.
How is that technically possible? Without going into technical details (as promised), since we have tags generated for each content item in our data warehouse, any two articles with a high overlap of NLP entities (aka tags) are related. You just set some rules that run on a periodic basis, and the overlapping content is sent to the website to be displayed to the reader.
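Here is a minimal sketch of that overlap rule, using Jaccard similarity over tag sets; the articles and tags are made up:

```python
# Content-to-content matching by NLP tag overlap (Jaccard similarity).
# Articles and tags are illustrative placeholders.
def jaccard(tags_a, tags_b):
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0

articles = {
    "salah-hat-trick": ["Salah", "Liverpool", "Premier League"],
    "chelsea-transfer-news": ["Chelsea", "Premier League", "transfers"],
    "best-ramen-in-town": ["ramen", "restaurants", "food"],
}

def related(article_id, threshold=0.2):
    base = articles[article_id]
    scored = [
        (other, jaccard(base, tags))
        for other, tags in articles.items() if other != article_id
    ]
    return [a for a, s in sorted(scored, key=lambda x: -x[1]) if s >= threshold]

print(related("salah-hat-trick"))  # -> ['chelsea-transfer-news']
```

A periodic job runs this over the catalogue and pushes the resulting pairs to the website.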
Video Intelligence
Video is one of the most popular content types nowadays, whether as a paid ad or as content created for users; this is what we now call video marketing. One way to analyse your video content is to convert video to speech, and speech to text. Once you have that text, you can apply the usual NLP tricks discussed in this post, such as tagging it, and then use the tags for content recommendation or ad placement. This approach saves the money you would otherwise spend on external video-to-text vendors, and generates transcripts for your videos along the way.
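As a sketch, the Google Video Intelligence API exposes speech transcription roughly like this; the bucket path is a placeholder, and you would feed the resulting transcript into the same NLP tagging as your articles:

```python
# A sketch of speech transcription with the Google Video Intelligence API;
# the gs:// path is a placeholder.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
config = videointelligence.SpeechTranscriptionConfig(language_code="en-US")
context = videointelligence.VideoContext(speech_transcription_config=config)

operation = client.annotate_video(
    request={
        "input_uri": "gs://my-bucket/my-video.mp4",
        "features": [videointelligence.Feature.SPEECH_TRANSCRIPTION],
        "video_context": context,
    }
)
result = operation.result(timeout=600)

# Stitch the transcript together, then run it through the NLP tagging step.
transcript = " ".join(
    t.alternatives[0].transcript
    for t in result.annotation_results[0].speech_transcriptions
)
print(transcript)
```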
Custom in-house solutions
Does using external APIs, like Google's, always work? Well, in most cases, yes, but sometimes the solution does not exist off the shelf, or you have novel or sensitive proprietary data (e.g. user data). In that case, you need to build your personalized recommendation engine from scratch, typically as a deep learning model, a neural network built with TensorFlow or other methods. Fancy tech words, but your data scientist will know them, and there are plenty of examples of how to build such models. So if you have users' reading history, the nature of the content they have read, and the reading habits of similar users, you can train a model on this data and provide even more interesting recommendations tailored to your offerings. In the end, this same box pushes the recommended content to the website, just as in the external-API case.
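To give a flavour of what such a model could look like, here is a toy TensorFlow/Keras sketch: user and article embeddings whose dot product predicts whether a user will read an article. The sizes are placeholders and the training call is commented out, as this is a sketch rather than a production design:

```python
# A toy neural recommendation model: user/article embeddings whose dot
# product predicts a read/click. Sizes and data are placeholders.
import tensorflow as tf

num_users, num_articles, dim = 10_000, 50_000, 32

user_in = tf.keras.Input(shape=(1,), name="user_id")
item_in = tf.keras.Input(shape=(1,), name="article_id")

user_vec = tf.keras.layers.Flatten()(
    tf.keras.layers.Embedding(num_users, dim)(user_in))
item_vec = tf.keras.layers.Flatten()(
    tf.keras.layers.Embedding(num_articles, dim)(item_in))

# Similarity between user taste and article content drives the prediction.
score = tf.keras.layers.Dot(axes=1)([user_vec, item_vec])
out = tf.keras.layers.Dense(1, activation="sigmoid")(score)

model = tf.keras.Model([user_in, item_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit([user_ids, article_ids], clicked, epochs=5)  # your reading history
```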
Beyond recommendations, you can also build custom content classification, forecasting reports, content-to-revenue predictions, content virality predictions, or propensity models.
Churn Prediction
Retaining subscribers is one of the most challenging things for publishers; a subscriber saved is basically money made! Can ML help with that? We can predict who is going to unsubscribe by training a model on datasets of who cancelled and who did not. Gathering enough data points (features, if you are more tech-savvy) is mandatory here: data like subscription length, demographics (age, income, etc.), subscribed newsletters, and web browsing can help create a decent churn model. All of this data should be in your data warehouse; as you can see, we keep coming back to the warehouse, which shows its importance as a basic need. Join a couple of tables, and the data is ready for your data scientist to work on.
Once such a model is functional, your retention team can look into the subscribers expected to churn and take action, or use it as a forecast. The model can also keep learning over time, and you can monitor that learning and performance, for example through false positives and negatives. A false positive is when you predict a user will churn but in reality they wouldn't; a false negative is when you predict a user won't churn but in reality they would. It depends on the case, but here a false positive just leads to sending extra emails, which is not a big deal, while a false negative is something to avoid, as it means missing users who would have churned if we hadn't acted.
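A minimal sketch of such a churn model with scikit-learn, using a tiny made-up dataset with the kinds of features mentioned above; the confusion matrix yields exactly the false positives and false negatives just discussed:

```python
# A minimal churn-model sketch; the dataset is tiny and illustrative.
# In practice, you would join these columns from your data warehouse.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "subscription_months": [1, 24, 3, 36, 2, 18, 5, 48],
    "age": [22, 45, 31, 52, 27, 40, 29, 60],
    "newsletters": [0, 3, 1, 2, 0, 2, 1, 4],
    "weekly_visits": [1, 9, 2, 12, 1, 7, 3, 10],
    "churned": [1, 0, 1, 0, 1, 0, 1, 0],  # 1 = cancelled
})

features = ["subscription_months", "age", "newsletters", "weekly_visits"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Off-diagonal cells: false positives (extra retention emails sent)
# and false negatives (churners we missed).
tn, fp, fn, tp = confusion_matrix(
    y_test, model.predict(X_test), labels=[0, 1]
).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
```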
Evergreen Content Detection
Evergreen content is content that naturally has a longer lifespan, such as a review of a historical place or someone's biography, whereas an article about an accident, or a tech article about the iPhone 4, wouldn't stay relevant enough to recommend. ML can be used to tag your content as evergreen or not, which is useful for filtering it. In reality, though, this is much harder than it sounds, since the computer needs to understand the text in more depth; it is doable, however, with newly popular methods such as LSTMs.
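As a toy illustration, here is a small Keras LSTM classifier that labels text as evergreen or not; the training examples are made up and far too few for a real model:

```python
# A toy LSTM classifier labelling articles evergreen (1) or not (0).
# Training texts/labels are illustrative stand-ins, not real data.
import tensorflow as tf

texts = [
    "A walking tour of Rome's ancient monuments",
    "Traffic accident closes motorway this morning",
    "The life and work of Marie Curie",
    "iPhone 4 review: first impressions",
]
labels = [1, 0, 1, 0]  # 1 = evergreen

vectorize = tf.keras.layers.TextVectorization(
    max_tokens=10_000, output_sequence_length=50
)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,                               # raw strings -> token ids
    tf.keras.layers.Embedding(10_000, 32),
    tf.keras.layers.LSTM(32),                # reads the text sequentially
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(tf.constant(texts), tf.constant(labels), epochs=3, verbose=0)

print(model.predict(tf.constant(["A history of the Roman Forum"])))
```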
Conclusion
Yes, Artificial Intelligence and Machine Learning can help you, as a publisher or anyone dealing with content or ads, to understand and utilise that content better. That said, you must start with a data strategy, then figure out which applications or use cases serve your business's short- and long-term strategy: which will help scale your existing cash-cow offerings, which will open new opportunities, inform you about your performance and users, or optimise your operations by saving time.
If you have any questions, please leave them in the comments section.