The Marketer's Curse
Cookie Monster contemplates the most sensible cookie groupings prior to consumption...


    I.         Intro

Wouldn’t the world be easier for CEOs if all their customers were the same (and profitably produced growing, sustainable revenues)? No doubt doctors, too, would embrace predictable diagnosis and treatment paradigms in the wake of this challenging COVID-19 crisis.

Yes, marketing’s budget (and headcount) would be but an afterthought, since customer acquisition (and retention) would quickly become trivial, and doctors likely wouldn’t be compensated nearly as well as their true skill actually warrants. One set of products, recommendations, and content would work seamlessly for every customer, just as one set of symptoms would lead universally to a known diagnosis, followed by one guaranteed-effective treatment protocol. Things would be so much simpler!

But, alas, customers are as diverse as the world’s population, and so are patients. In fact, in many instances, one could argue that there exist more customer types (as defined by the combination of their interactions and engagements with a company’s website, staff, and products [web, mobile, and mobile-web apps]) than the demographic profiles that characterize that same population. Add to that the additional information available on the customer – whether it relates to financial profiles, health events, automotive or insurance data, etc. – and the problem of getting the right material to the best customer at the perfect time quickly grows into a non-trivial challenge!

Before we delve into customer-product recommendation engines (upcoming in another post!), let’s first tackle the vital task of customer segmentation, and that’s where “clustering” comes in…

Let me share some recent work I performed for a fintech client in outline form, in the spirit of learning by example (which, personally, has been most effective for me):

    II.         What is Clustering?

a.    Intuition

                                                 i.    Clustering is the process of partitioning a set of objects into clusters such that objects in the same cluster are more similar (in some sense or another) to each other than objects in different clusters.

                                                ii.    Clustering is an unsupervised machine learning problem, meaning that unlabeled data are used, and there may be anywhere from one to n clusters of some set of objects, D.

                                               iii.    In other words, a cluster is some collection of observations more alike than not.

1.    For example, certain income ranges and age groups may combine to form clusters. With two clusters, we might hypothesize groupings of under $100k & under 40 vs. over $100k & over 40 (as sketched below).
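A minimal sketch of this intuition on synthetic data (all numbers are hypothetical; k-means with k = 2 is used purely for illustration here, since the density-based approach is introduced later in this post):

```python
# Two hypothesized age/income clusters on synthetic data. Standardization
# keeps income's large scale from dominating the distance computation.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
young_modest = np.column_stack([rng.normal(30, 4, 100),
                                rng.normal(60_000, 10_000, 100)])
older_affluent = np.column_stack([rng.normal(50, 4, 100),
                                  rng.normal(150_000, 20_000, 100)])
X = np.vstack([young_modest, older_affluent])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
for c in (0, 1):
    members = X[labels == c]
    print(f"cluster {c}: mean age {members[:, 0].mean():.0f}, "
          f"mean income ${members[:, 1].mean():,.0f}")
```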

b.    Visualized

[Figure: comparison of clustering algorithms on toy datasets. Credit: scikit-learn]

c.    General Clustering Definition

                                                 i.    More formally, clustering can be defined as follows:

                                                ii.    Let D be a set of objects. A clustering C ⊆ {Ci | Ci ⊆ D} of D is a division of D into sets for which the following conditions hold: ⋃Ci∈C Ci = D, and ∀Ci, Cj ∈ C, i ≠ j: Ci ∩ Cj = ∅. The sets Ci are called clusters.

                                               iii.    With respect to the set of objects D the following shall be stipulated:

1.    |D| = n

2.    The objects in D represent points in the Euclidean space of dimension m.

3.    Based on a metric d : D × D → ℝ, the similarity or dissimilarity between any two points in D can be stated.

For a given, or suitably chosen, K ∈ ℕ,

                       min f(C)

subject to C = (C1, . . . , CK), C1 ∪̇ . . . ∪̇ CK = Π,

where Π = {x1, . . . , xn} is the set of objects to be grouped into K disjoint clusters Ck.

Finally, f is a nonnegative objective function; its minimization aims at optimizing the quality of the clustering.

The minimization of f(C) corresponds to the maximization of the minimum distance between any two clusters.


  III.         Applications – Market Segmentation

a.    Email targeting:

                                                 i.    Timing, frequency, content most salient to individual consumer

b.    Product recommendations for base consumer tranches

c.    Lifetime-Value Model (LTV):

                                                 i.    Actionable insights on user acquisition and retention

d.    Forecasting

                                                 i.    Increase accuracy

e.    Predictive analytics

  IV.         Techniques

[Figure: taxonomy of clustering techniques. From Stein & Bush (2005).]

[Figure: overview of clustering algorithms. Image from scikit-learn.org.]

   V.         Density-Based Spatial Clustering of Applications with Noise (DBSCAN) – with proprietary modifications

a.    Algorithm Considerations

                                                 i.    Don’t know, and thus can’t use as an input, the number of user types beforehand, as would be required by such popular unsupervised machine-learning clustering techniques as k-means.

1.    Reasoning: Don’t want assessment to be a self-fulfilling prophecy; want true added value from clustering.

                                                ii.    [Modification] Don’t want boundary points to be mutually exclusive across groups.

1.    Reasoning: A user may be extremely interested in credit cards, for example, but also may be very interested in, say, auto loans, and auto loans may actually be a better fit for certain, user-specific reasons. The traditional DBSCAN algorithm would group this consumer type into credit cards altogether, if grouping by vertical, and we wouldn't be aware of the user's significant interest (and potential need/creditworthiness) in auto loans from the results.

2.    Reasoning: Not all points are representative of the relevant relationships, so having an algorithm that does not have to utilize all points in its evaluation is essential.

                                               iii.    Consumer types likely fall in dense regions in terms of likeness across engagement-based features, as opposed to cardinal (or even ordinal) relative distances in user characteristics across the feature set.

                                              iv.    Because of the generally high dimensionality of the data and the characteristics of the variables, clusters will likely be of varying size and structure, i.e., not necessarily spherical.

                                                v.    [Modification] Want to mitigate the curse of dimensionality, which refers to the fact that the convergence of any estimator to the true value of a smooth function defined on a space of high dimension is very slow.

1.    This phenomenon relates to “overfitting”: the predictive power of a classifier or regressor first increases as the number of dimensions/features increases, but then decreases (noise increases), requiring relatively many more observations to remain robust. (A small demonstration follows the figure below.)

[Figures: curse-of-dimensionality illustrations. Credit: Erik Bernhardsson]
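To make the curse concrete, here is a small, self-contained demonstration on synthetic uniform data (illustrative only) of how the relative contrast between nearest and farthest neighbors collapses as dimensionality grows:

```python
# Curse of dimensionality in action: as dimension grows, the gap between
# the nearest and farthest neighbor of a query point shrinks relative to
# the nearest distance, so "proximity" carries less and less information.
import numpy as np

rng = np.random.default_rng(7)
for dim in (2, 10, 100, 1_000):
    X = rng.random((500, dim))                 # 500 uniform random points
    d = np.linalg.norm(X[1:] - X[0], axis=1)   # distances to a query point
    contrast = (d.max() - d.min()) / d.min()
    print(f"dim={dim:>5}: relative contrast {contrast:6.2f}")
```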

  VI.         Applying DBSCAN

a.    After reducing the feature space (the set of variables), e.g., by running [Oblique] Principal Component Analysis (PCA) and keeping the first k components such that 85% of the variance is explained [suggested rule of thumb], invoke, for the reasons mentioned previously, the following algorithm (a sketch appears at the end of this section):

b.    Density-Based Spatial Clustering of Applications with Noise (DBSCAN), with proprietary modifications to allow for boundary points to not be mutually exclusive:

                                                 i.    (1) One operation to define a region R ⊆ D, which forms the basis for density analyses;

                                                ii.    (2) another to propagate the density information (the provisional cluster label) of R.

c.    In DBSCAN a region is defined as the set of points that lie in the ε-neighborhood of some point p. Cluster label propagation from p to the other points in R happens if |R| exceeds a given MinPts threshold.
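A hedged sketch of this reduce-then-cluster pipeline follows. Standard orthogonal PCA stands in for the oblique variant mentioned above, synthetic blob data stands in for real engagement features, and the proprietary non-exclusive-boundary modification is not reproduced; the eps and min_samples values are illustrative assumptions.

```python
# PCA keeps enough components for >= 85% of variance; vanilla DBSCAN then
# clusters the reduced space.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=1_000, n_features=40, centers=5,
                  cluster_std=1.0, random_state=0)

# Smallest number of principal components explaining >= 85% of variance
pca = PCA(n_components=0.85, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(f"{pca.n_components_} components retain "
      f"{pca.explained_variance_ratio_.sum():.0%} of the variance")

labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_reduced)
print(f"clusters found: {len(set(labels) - {-1})}, "
      f"noise points: {int((labels == -1).sum())}")
```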

 VII.         Sample DBSCAN Algorithm Parameters to Use

a.    MinPts = max(2 * dim, ln(# observations)) [suggested rule of thumb]

b.    Epsilon = chosen iteratively (trial and error) over a discrete grid of candidate values

c.    *Note that a modification of DBSCAN called “Ordering Points to Identify the Clustering Structure (OPTICS)” does this automatically, but it runs much more slowly than DBSCAN and produces hierarchical results; moreover, OPTICS expects some kind of density decline to find cluster borders and cannot [easily] be modified to treat border points as not mutually exclusive across groups.

d.    [Modification] Distance function (metric): fractional Minkowski distance (contrary to the popular choice of standard Euclidean distance), based on the degree of hubs and anti-hubs in the data set (hubness analysis), choosing a fractional ℓp norm accordingly (disregarding the triangle inequality for the moment; that is, fractional norms are technically not norms, but this fact is not of material interest in this context). A sketch of these parameter choices appears at the end of this section.

                                                 i.    d(x, y) = (Σi=1..m |xi − yi|^p)^(1/p),

                                                ii.    where, in this special case, 0 < p < 1

                                               iii.    Motivation: Aggarwal et al. showed in “On the Surprising Behavior of Distance Metrics in High Dimensional Space” (2001) that “the relative contrasts of the distances to a query point depend heavily on the Lk metric used. This provides considerable evidence that the meaningfulness of the Lk norm worsens faster with increasing dimensionality for higher values of [p]…[f]ractional distance metrics, in which [p] is allowed to be a fraction smaller than 1…[are] even more effective at preserving the meaningfulness of proximity measures.”
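Here is a hedged sketch of these parameter choices: the MinPts rule of thumb, a k-distance heuristic for epsilon (one common stand-in for the trial-and-error search), and a fractional Minkowski “distance” passed to DBSCAN as a callable. The value p = 0.5 and the 90th-percentile cut are assumptions; the post derives p from a hubness analysis instead.

```python
# Parameter-selection sketch for DBSCAN with a fractional Minkowski metric.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def fractional_minkowski(x, y, p=0.5):
    """Fractional 'norm' distance, 0 < p < 1 (triangle inequality fails)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def suggest_min_pts(n_obs, dim):
    # MinPts = max(2 * dim, ln(# observations)), per the rule of thumb above
    return int(max(2 * dim, np.log(n_obs)))

def suggest_eps(X, k, quantile=0.90):
    # k-distance heuristic: a high quantile of each point's k-th neighbor
    # distance (the query point itself counts as neighbor 0)
    nn = NearestNeighbors(n_neighbors=k + 1, metric=fractional_minkowski)
    dists, _ = nn.fit(X).kneighbors(X)
    return float(np.quantile(dists[:, -1], quantile))

X = np.random.default_rng(1).normal(size=(300, 5))  # placeholder feature data
min_pts = suggest_min_pts(*X.shape)
eps = suggest_eps(X, k=min_pts)
labels = DBSCAN(eps=eps, min_samples=min_pts,
                metric=fractional_minkowski).fit_predict(X)
print(f"MinPts={min_pts}, eps={eps:.2f}, "
      f"clusters={len(set(labels) - {-1})}, noise={int((labels == -1).sum())}")
```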

 

VIII.         When Convergence of the Density-Based Algorithm Limits Actionability

a.    If, as in our case, the density portion of the hybrid algorithm converges at a number of segments/clusters/groups too large to pair with tailored content, then employ the hierarchical component of the algorithm.

                                                 i.    Think of the hierarchical component as a March Madness bracket – a nostalgic image during these strange post-COVID-19 times…

1.    At the very top of the “bracket,” the winning team prevails – analogous to one single segment for all data points (e.g., users or customers).

2.    At the bottom, each and every team is alive – representative of a segment for each and every data point (e.g., again, a user or customer).

b.    Say, hypothetically, that the density-based algorithm naturally converges at 100+ user segments.

                                                 i.    Just testing these segments retrospectively against content offerings, to the extent that a taxonomy of [email] content was even created and available, would require substantial resources, and the number of permutations quickly explodes!

                                                ii.    Alternatively, we can now employ the hierarchical shell on top of the density base (see the sketch at the end of this section).

1.    Consider the following example’s dendrogram (please forgive the differing context):

[Figure: example dendrogram]

2.    With supererogatory numbers of clusters, we need only move up a level (or levels) to reach a more manageable number of clusters that retain their innate within-cluster similarity and across-cluster differences.

a.    In the pictured example, say the algorithm converged at the base, or left-hand origin, of twelve (12) clusters, and we wanted fewer.

b.    Moving a level up (to the right) results in just four (4) clusters.

c.    Moving yet again one more segment towards agglomeration yields merely two clusters.

d.    Hopefully, the point has now been sufficiently illustrated.
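A minimal sketch of this hierarchical “shell,” assuming the density stage has already produced fine-grained segments and that a per-segment centroid is an acceptable summary (an assumption; any per-segment representative would do):

```python
# Agglomerate fine-grained segments, then cut the dendrogram at the level
# that yields a manageable number of clusters (12 -> 4 -> 2, as in the text).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
centroids = rng.normal(size=(12, 5))    # e.g., 12 density-stage centroids

Z = linkage(centroids, method="ward")   # bottom-up merge tree

for k in (4, 2):                        # mirror the dendrogram example above
    merged = fcluster(Z, t=k, criterion="maxclust")
    print(f"{k} clusters:", merged)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the bracket-like figure
```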

  IX.         Turning DBSCAN market segmentation results into actionable insights

a.    Research and characterize groups, e.g., through a Jobs-to-Be-Done framework (by motivation).

b.    Input the results into any relevant business model (e.g., LTV by customer group) and iterate the model for each group (e.g., product recommendations by customer group).

   X.         Test!

Without knowledge of the ground truth classes, there are limited performance metrics available for evaluation.

Among them (Silhouette Coefficient, Calinski–Harabasz Index, Davies–Bouldin Index, Dunn’s Index), Dunn’s Index seems to penalize density-based clusters the least relative to convex clusters.

Dunn’s Index:

D = min i≠j δ(Ci, Cj) / max k Δ(Ck),

where δ(Ci, Cj) is the distance between clusters Ci and Cj, and Δ(Ck) is the diameter of cluster Ck.

It compares the size of the groups with the distances between groups: the further apart the groups are, relative to their size, the larger the index and the “better” the clustering. (A sketch of the computation follows below.)

Results correspond to a Dunn’s Index value of 0.8, which can be interpreted as effective clustering.
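A hedged sketch of the computation, using single-linkage inter-cluster distances and max-pairwise-distance diameters (one of several common variants of the index; the post does not specify which was used):

```python
# Dunn's Index: min inter-cluster distance / max cluster diameter.
# DBSCAN noise points (label -1) are excluded from the computation.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels) if c != -1]
    min_between = min(cdist(a, b).min()                 # closest pair across
                      for i, a in enumerate(clusters)   # any two clusters
                      for b in clusters[i + 1:])
    max_diameter = max(pdist(c).max()                   # widest single cluster
                       for c in clusters if len(c) > 1)
    return min_between / max_diameter

# e.g., dunn_index(X_reduced, labels) with the DBSCAN results from earlier
```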

a.    First, if at all possible, arrive at hypotheses of the relative performance (e.g., conversions, CTR, directly prompted monetization, etc.) of each cluster against each content offering (e.g., email campaign theme, product recommendation(s), etc.), and validate:

                                                 i.    It is recommended to begin testing retrospectively, such that each user or customer segment exposed to multiple content offerings is tracked in terms of whichever performance metrics are considered most critical.

1.    This way, current and future performance metrics are unaffected by testing.

                                                ii.    Once sufficient information is collected, employ one of the most effective means of dynamic, multiple-iteration testing in this context: “multi-armed bandit testing” (another topic for another day). Specifically, use a “decreasing epsilon” strategy, where epsilon effectively represents randomness, so that the information obtained in earlier tests is leveraged going forward, increasing returns at each iteration as much as possible (a sketch follows below).
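A minimal sketch of an epsilon-decreasing bandit for pairing content offerings (arms) with a given segment; the decay schedule and the simulated conversion rates are assumptions for illustration:

```python
# Epsilon-decreasing multi-armed bandit: explore less as evidence accumulates.
import numpy as np

rng = np.random.default_rng(3)
true_rates = [0.02, 0.05, 0.03]        # hidden conversion rate per content arm
counts = np.zeros(3)                   # times each arm was served
values = np.zeros(3)                   # running mean reward per arm

for t in range(1, 10_001):
    epsilon = min(1.0, 100 / t)        # randomness decays over time
    if rng.random() < epsilon:
        arm = int(rng.integers(3))     # explore: random content offering
    else:
        arm = int(np.argmax(values))   # exploit: best-performing offering
    reward = float(rng.random() < true_rates[arm])       # simulated conversion
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("estimated rates:", np.round(values, 3), "serves:", counts.astype(int))
```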

  XI.         Arrive at a finished set of user/customer segments, which can be uniquely addressed with distinct content/recommendations in such a way as to maximize the key metrics of interest. Evaluate on an ongoing basis!

Hope you found this contribution informative, and PLEASE PLEASE PLEASE, share your own comments, concerns, suggestions, questions, etc.!!

Chad Fite

Chief Data Officer and Member, Board of Directors at Machine Learning - Data Science Company


Paul Karner, just to follow up, I recently spearheaded a successful engagement down these lines, and you should feel free to check out the corresponding deck I presented at 2018's AI Innovation Summit in SF using the following Google Drive link: https://drive.google.com/file/d/1T0GvDjoj6xSo4ygqM8da5u93Bgo-i6EF/view?usp=drivesdk Please don't hesitate to reach back out with any further questions, and thanks for your interest!

Chad Fite

Chief Data Officer and Member, Board of Directors at Machine Learning - Data Science Company


Thanks, Paul Karner, and great question! This approach allows execs to be as data-driven as possible; in addition to learning more about the customers, marketing, product, and customer-success efforts can be tailored more effectively. Let's consider the simple hypothetical example of arriving at two customer segments: (1) under 40 years of age with income under $100k; (2) 40+ years with $100k+ income. The next thing a leader may want to do is understand how those segments interact with the product, retrospectively, in order to improve the customer experience (and optimize revenue and margins) going forward. A good practical application of this tactic lies in email marketing intended to spur increased product usage: the most salient email content for each segment is created (in the hopefully unlikely event that no such content already exists); optimal content-segment pairings are hypothesized and tested retrospectively in an A/B/C/D format; the hypotheses and content are then revised as appropriate; and forward "quasi-testing," such as a decreasing-epsilon strategy within the multi-armed-bandit testing framework, is applied. More to follow...

Paul Karner

"The Dashboard Guy" | Expert data analysis and instrumentation for founders, operators, and investors


Chad, nice post! Can you share some quick thoughts on what makes the output of the approach you outline here actionable for execs? How do I know the customer "segments" the algorithm identifies will be helpful, what kinds of things should the business expect to be able to do with that information? Best, Paul

