Email Subject Line Optimization
Ekansh Verma, Manish Barnwal, Zeta AI/ML Group


The email subject line is one of the first things your eyes seek when deciding whether to open an email. So, crafting an appropriate subject line is of utmost importance to marketers targeting their audience via email. Imagine you are running an email campaign and have multiple candidate subject lines. How do you decide which subject line to choose for the campaign?

The answer, resoundingly, is whichever subject line generates the higher open rate. You can read more about the attributes that can boost a subject line's open rate here: 8 Tips for Optimizing Email Subject Lines According to Zeta AI.

That is the business problem we have addressed over the last few months. We developed a data-driven machine learning system that predicts the estimated open rate for an email subject line. Zeta offers end-to-end services for customers to run their campaigns. These campaigns generate data, and we used this campaign data to train various ML models to predict the open rate for a given email subject line.

This project led us to build an ML product that estimates open rates for email subject lines, and it is being used by many of our existing customers. In this post, we discuss the ML approaches, and their nuances, that led to building the product. Let us begin with the data-gathering aspects.

Data acquisition

Data pre-processing

The email subject line is the first impression of an incoming email. Subject lines contain a variety of text, emojis, and symbols. They often include user-specific information such as the first name, location, or interests to add a layer of personalization.

The same campaign is sent to many users in the targeted audience, and the campaign template inserts each user's first and last name. We needed to remove these user-specific details from the subject line and then group the resulting subject lines to compute an overall open rate for each one.
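To make this concrete, here is a minimal sketch of the grouping step. The template token, column names, and aggregation are simplifying assumptions for illustration; the production pipeline handles many more personalization fields.

```python
import re
import pandas as pd

# Hypothetical send-level data: one row per delivered email.
sends = pd.DataFrame({
    "subject": ["John, your weekend deals are here 🎉",
                "Mary, your weekend deals are here 🎉",
                "Flash sale ends tonight"],
    "opened": [1, 0, 1],
})

def normalize_subject(subject: str) -> str:
    """Replace personalization tokens (assumed here to be a leading first name)
    with a placeholder so identical templates group together."""
    return re.sub(r"^\w+,", "<NAME>,", subject)

sends["subject_norm"] = sends["subject"].apply(normalize_subject)

# One row per normalized subject line with its observed open rate.
open_rates = (sends.groupby("subject_norm")["opened"]
                    .agg(open_rate="mean", num_sends="count")
                    .reset_index())
print(open_rates)
```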

Machine learning model requirements

Now that we have covered the business objective and data gathering, let us briefly touch upon the requirements for the machine learning model. The most important business requirement was interpretability: we are not only interested in the best model performance, but we also want to understand why the model predicts a certain open rate. This meant we had to choose a model that is interpretable on a per-example basis.

But what does interpretability mean exactly? We discuss a couple of related ideas next.

What is interpretability?

In the literature, interpretability is defined as the degree to which a human can understand the cause of a decision. In machine learning, the decision could be at the per-example level or at the overall model level; that is, depending on our business objectives, we might be interested in two different forms of interpretability.

  1. Prediction interpretability: Explains why a prediction is made for a single input by the trained model.
  2. Model interpretability: Explains what the trained model has learned. We investigate how the model works globally by looking at the architecture and parameters of a deep model. For instance, the famous feature importance plot of the random forest model is an example of model interpretability.

For our use case, we are interested in prediction interpretability: we want to understand why a prediction is made for each individual input. This constraint made us selective about the types of models we should consider for the problem at hand. Let us dig into the various models we tried.

Further, interpretability methods fall into two broad categories.

  1. Intrinsic interpretability is achieved by constructing self-explanatory models that incorporate interpretability directly into their structure. Linear and logistic regression fall under this category.
  2. Post-hoc methods require a separate procedure to provide explanations for an already-trained model. Integrated Gradients is one such method that can be used to explain deep networks.

We will discuss these interpretability methods in subsequent sections.

Iterative model development

When it comes to model training, it is always a good exercise to find a baseline model that is easily trained and serves as the bare-minimum reference for complex models. There is no point in training a complex model if it can't beat a baseline model. So, we started with training a baseline model and then graduated to training complex models.

We tried many models, but for the scope of this blog we will discuss the following three:

1. Baseline: We use a bag of words as the feature extraction method and a linear predictor as the baseline model (a minimal sketch appears after this list).

2. Interpretable Convolutional Neural Network: We discuss this approach in detail below.

3. Transformer-based model: We fine-tuned a pre-trained Twitter-based RoBERTa model on our dataset.
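For the baseline in item 1, the setup can be as simple as the sketch below. The n-gram range, the choice of ridge regression, and the toy data are assumptions for illustration, not our exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical training data: normalized subject lines and their observed open rates.
subjects = ["<NAME>, your weekend deals are here 🎉",
            "Flash sale ends tonight",
            "Last chance: 20% off everything"]
open_rates = [0.21, 0.14, 0.18]

# Bag-of-words features (word uni- and bi-grams) feeding a linear predictor.
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),
    Ridge(alpha=1.0),
)
baseline.fit(subjects, open_rates)
print(baseline.predict(["Your weekend deals end tonight"]))
```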

Let us dive deep into the model that best served our business objective: a CNN-based model.

Building an interpretable Convolutional Neural Network model

Convolutional neural networks (CNNs) are particularly well suited for processing data that has a spatial structure, such as images. CNNs have been successfully applied to a variety of tasks in computer vision, including image classification, object detection, and image segmentation.

CNNs are also well-suited for processing text data. Text data often has a structure that can be represented as a sequence of words or characters. CNNs can learn features from this data that are useful for a variety of tasks, such as text classification and sentiment analysis.

Just as convolution filters capture local patterns in images, filters over text capture the n-gram features that matter for the downstream open rate prediction task. For language tasks, the input is a one-dimensional sequence of text, so the architecture is adapted to 1D convolution-and-pooling operations. Moreover, single-layer text convolution networks with multi-size filters are intrinsically interpretable: we can derive word attribution scores and the n-gram features for each filter.

These two reasons motivated our choice of a convolution model for the task.

Architecture

Figure 1: Single Layer Convolution Neural Network with multiple filter sizes for open rate prediction

We use pre-trained FastText word vectors along with Emojional embeddings (an emoji embedding model) to embed the subject lines as dense vector representations.

We apply a 1D convolution layer to the word vectors, followed by an activation function and a max-pooling layer. Finally, a fully connected linear layer with sigmoid activation outputs the open rate. We use multiple convolution layers in parallel with different window sizes for enhanced predictive power.
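A minimal PyTorch sketch of this architecture is shown below. The embedding dimension, number of filters, and window sizes are illustrative assumptions rather than our production settings.

```python
import torch
import torch.nn as nn

class SubjectLineCNN(nn.Module):
    """Single-layer 1D CNN with parallel multi-size filters for open rate regression."""

    def __init__(self, embed_dim: int = 300, num_filters: int = 64,
                 window_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in window_sizes]
        )
        self.fc = nn.Linear(num_filters * len(window_sizes), 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim) from the FastText + emoji lookups.
        x = embeddings.transpose(1, 2)                   # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values  # max-pool over positions
                  for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return torch.sigmoid(self.fc(features))          # predicted open rate in [0, 1]

# Example: a batch of 8 subject lines, each padded to 12 tokens of 300-d embeddings.
model_cnn = SubjectLineCNN()
print(model_cnn(torch.randn(8, 12, 300)).shape)  # torch.Size([8, 1])
```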

Interpretability

We adapt the work in Understanding Convolutional Neural Networks for Text Classification to our regression task. The open rate prediction for a subject line is based on the input word embeddings, the trained convolution filters, and the weights of the fully connected layer.

Let us visualize how we arrive at the word attribution scores for a subject line. During the forward pass, each convolution filter scores the n-grams in the text according to its window size. Max-pooling across these n-gram scores selects the highest-scoring n-gram for every filter. These n-grams and their corresponding scores are responsible for the final open rate prediction.

We further decompose the n-gram scores into individual word attribution scores, calculated as the inner product of the filter slot values and the word embeddings.

Figure 2: Overview of Convolution Neural Network for open rate prediction
Figure 3: Decomposition of n-gram score to individual word scores
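Continuing the SubjectLineCNN sketch from above, this decomposition can be probed roughly as follows (a simplified illustration that ignores bias terms):

```python
def word_attributions(model: SubjectLineCNN, embeddings: torch.Tensor):
    """For one subject line, return (filter_idx, window, start, word_scores):
    the start of the max-pooled n-gram per filter and each word's contribution,
    i.e. the inner product of a filter slot with the corresponding word embedding."""
    x = embeddings.transpose(1, 2)                # (1, embed_dim, seq_len)
    results = []
    for conv in model.convs:
        window = conv.kernel_size[0]
        activations = torch.relu(conv(x))         # (1, num_filters, positions)
        best_pos = activations.argmax(dim=2)[0]   # winning n-gram start per filter
        for f, start in enumerate(best_pos.tolist()):
            ngram = embeddings[0, start:start + window]  # (window, embed_dim)
            # conv.weight[f] has shape (embed_dim, window); slot k pairs with word k.
            word_scores = [torch.dot(conv.weight[f, :, k], ngram[k]).item()
                           for k in range(window)]
            results.append((f, window, start, word_scores))
    return results

print(word_attributions(model_cnn, torch.randn(1, 12, 300))[:2])
```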

Going from a single model to multiple site-specific models

At Zeta, we work with clients from multiple diverse domains, such as sports and news. As one can imagine, the user demographics subscribed to each client vary hugely across these domains. Hence, to build a reliable open rate prediction model we need to capture the text styles, user interests, and past campaign performance specific to each client.

As a first step towards incorporating client-specific domain traits, we trained an individual CNN model for every client. Based on the point above, we expected the client-specific models to perform better than a single model on average, and this is exactly the conclusion we arrived at after evaluating the two approaches.

A single model with site-specific features

Serving and maintaining multiple client-specific models in a production setting is a daunting task. To address this challenge, we built a single model that incorporates client-specific features. We use client-specific important keywords as an additional supervision signal for this single model.

Interestingly, we can extract these important keywords from the client-specific CNN models trained for open rate prediction. We probe each filter in a trained model and select its highest-scoring n-grams. We then arrive at word importance scores by taking the inner product of the word embedding and the filter slot values.

At test time, we can make predictions for any new client by extracting its keywords and using them as features in our trained keyword-based model. This eliminates the need to alter the model architecture as our client base evolves over time.

The introduction of transformer models has been a revolutionary step for natural language processing (NLP) research and NLP-based real-world applications. Consequently, they have become the go-to models for most NLP tasks. In the next section, we describe our experiments with transformer models for open rate prediction.

Transformer

Model selection

One of the key steps in selecting an appropriate transformer model is to check the pre-training task and the in-domain text corpus the model was trained on. We use twitter-roberta-base-emoji, since subject lines tend to be shorter than regular sentences and emojis are often used to capture the user's attention.
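As a rough sketch, the checkpoint can be loaded for regression with the Hugging Face transformers library as follows; treating the head as a single-output regressor (and discarding the original emoji-classification head) is our assumption for illustration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "cardiffnlp/twitter-roberta-base-emoji"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=1 gives a single-output (regression) head; ignore_mismatched_sizes
# drops the original emoji-classification head that no longer fits.
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1, ignore_mismatched_sizes=True
)

batch = tokenizer(["Flash sale ends tonight 🔥"],
                  padding=True, truncation=True, return_tensors="pt")
print(model(**batch).logits)  # raw open-rate score before fine-tuning
```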

Training strategies

Training large transformer models with limited labeled data can lead to degenerate training runs and subpar performance. We use the strategies demonstrated in Revisiting Few-sample BERT Fine-tuning to fine-tune the transformer model for the open rate prediction task. The fine-tuning strategies that helped us stabilize training runs and speed up convergence are listed below:

  • AdamW optimizer with a small initial learning rate in the range of 2 × 10⁻⁵ to 5 × 10⁻⁵
  • Re-initialize some of the top layers of the pre-trained transformer model
  • Discriminative learning rates: use different learning rates for different layers, with higher learning rates for the top layers and lower learning rates for the bottom layers (a sketch follows this list)
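Below is a minimal sketch of discriminative learning rates combined with a cosine schedule and warmup, reusing the transformer model loaded earlier. The decay factor, warmup length, and grouping are assumptions for illustration.

```python
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

def discriminative_param_groups(model, base_lr=5e-5, decay=0.9):
    """AdamW parameter groups with layer-wise learning rate decay: the top encoder
    layer keeps base_lr, the layer below gets base_lr * decay, and so on."""
    encoder_layers = model.roberta.encoder.layer       # RoBERTa encoder stack
    num_layers = len(encoder_layers)
    groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
    for i, layer in enumerate(encoder_layers):
        lr = base_lr * decay ** (num_layers - 1 - i)   # lower layers -> smaller lr
        groups.append({"params": layer.parameters(), "lr": lr})
    groups.append({"params": model.roberta.embeddings.parameters(),
                   "lr": base_lr * decay ** num_layers})
    return groups

optimizer = AdamW(discriminative_param_groups(model), weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=100,
                                            num_training_steps=1000)
```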

Figure 4: Discriminative learning rates with cosine scheduler and warmup

Understanding Integrated Gradients

Integrated Gradients was introduced as a post-hoc interpretability method in Axiomatic Attribution for Deep Networks. The paper lays out a set of axioms that an interpretability method ought to satisfy to be considered reliable.

For linear models, the product of coefficients and input features is used as the attribution score. A natural extension for deep neural networks is to use the product of the input gradient and the input features for assigning attributions. However, plain gradients break a desirable property of explanation methods called sensitivity, discussed in the paper. The proposed method therefore starts from gradients and integrates them along a straight-line path from a baseline to the original input.
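Concretely, the paper defines the attribution for the i-th input feature as

IntegratedGrads_i(x) = (x_i − x'_i) × ∫₀¹ ∂F(x' + α (x − x')) / ∂x_i dα

where x is the input, x' is the baseline, and F is the model. In practice the integral is approximated by a Riemann sum over a small number of interpolation steps.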

For text tasks, Integrated Gradients can be used to compute each word's attribution towards the model's prediction. For text models, we take the all-zero embedding vector as the baseline.
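A minimal sketch of this computation is below, applied for brevity to the SubjectLineCNN sketch from earlier since it consumes word embeddings directly; the same idea applies to the transformer by attributing with respect to its input embeddings (for example via the Captum library). The step count and toy inputs are assumptions.

```python
def integrated_gradients(model, embeddings, steps: int = 50):
    """Approximate Integrated Gradients for a model that takes word embeddings
    as input, using the all-zero embedding baseline and a Riemann-sum estimate."""
    baseline = torch.zeros_like(embeddings)
    total_grads = torch.zeros_like(embeddings)
    for step in range(1, steps + 1):
        alpha = step / steps
        point = (baseline + alpha * (embeddings - baseline)).requires_grad_(True)
        output = model(point).sum()                 # scalar open-rate prediction
        grads, = torch.autograd.grad(output, point)
        total_grads += grads
    avg_grads = total_grads / steps
    attributions = (embeddings - baseline) * avg_grads
    return attributions.sum(dim=-1)                 # (batch, seq_len) word scores

subject_embeddings = torch.randn(1, 12, 300)        # hypothetical embedded subject line
print(integrated_gradients(model_cnn, subject_embeddings))
```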

Figure 5: Integrated gradient for fireboat image


Evaluating explanations for faithfulness

It is hard to empirically evaluate the quality of explanations when ground-truth explanations are unavailable. However, evaluation metrics have been proposed in NLP to test specific properties of explanation methods. We follow the framework of the ERASER benchmark by DeYoung et al. (2020) to measure the explainability aspect of a model, which we quantify through faithfulness.

Faithfulness is the most desirable property of our explanation method given our business use case. Faithfulness refers to how accurately explanations reflect the true reasoning process of the model. We consider two metrics to evaluate faithfulness.

  1. Comprehensiveness computes the change in the model's prediction that results from perturbing the features deemed influential by a given post-hoc explanation.
  2. Sufficiency is the counterpart of comprehensiveness: it perturbs the unimportant features instead and measures the change in the prediction.

Following ERASER, we create perturbed samples by deleting tokens from the original email subject line. Higher comprehensiveness implies greater explanation faithfulness; likewise, lower sufficiency implies more faithful explanations.
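A toy version of these two metrics is sketched below, reusing the CNN and embeddings from the Integrated Gradients example. Here token deletion is approximated by zeroing out embeddings, and the binary rationale mask is an assumption for illustration.

```python
def comprehensiveness(model, embeddings, important_mask):
    """Prediction drop when the tokens flagged as important are removed;
    a larger drop indicates a more faithful explanation."""
    full = model(embeddings).item()
    without_important = model(embeddings * (1 - important_mask)).item()
    return full - without_important

def sufficiency(model, embeddings, important_mask):
    """Prediction change when only the important tokens are kept;
    a smaller change indicates the explanation alone suffices."""
    full = model(embeddings).item()
    only_important = model(embeddings * important_mask).item()
    return full - only_important

# important_mask: (1, seq_len, 1) with 1.0 for the top attributed tokens, else 0.0.
mask = torch.zeros(1, 12, 1)
mask[0, :3] = 1.0  # pretend the first three tokens form the rationale
print(comprehensiveness(model_cnn, subject_embeddings, mask),
      sufficiency(model_cnn, subject_embeddings, mask))
```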

Summary

We describe our systematic experiments to effectively predict open rates for email subject lines while integrating client-specific features from multiple industry domains.

  1. We trained two deep learning models, a Convolutional Neural Network and a Transformer network, to predict open rates for crafted subject lines. We discussed training strategies for both models in terms of optimization parameters and the role of architectural choices.
  2. One of the primary business needs for our machine learning system is interpretability. In this blog, we worked through two ways to address it: an intrinsically interpretable model and a post-hoc method.
  3. Finally, we looked at two evaluation metrics that can be used to score various interpretability methods for Natural Language Processing tasks.

References:

  1. Understanding Convolutional Neural Networks for Text Classification
  2. cardiffnlp/twitter-roberta-base-emoji · Hugging Face
  3. Revisiting Few-sample BERT Fine-tuning
  4. Axiomatic Attribution for Deep Networks
  5. ERASER: A Benchmark to Evaluate Rationalized NLP Models
  6. Slides: Understanding Convolutional Neural Networks for Text Classification
  7. Emojional Embedding
  8. FastText
  9. Integrated Gradients in TensorFlow
