Modern Visual RecSys: Intro to Visual RecSys
In this chapter, we will explore the “hello world” dataset for visual models, Zalando’s FashionMNIST, using PyTorch, Tensorboard and Colab.
This is part of my Modern Visual RecSys series; feel free to check out the rest of the series at the end of the article.
FashionMNIST and the visual challenge
Source: FashionMNIST by Kashif Rasul & Han Xiao
The data consists of:
- A training set of 60,000 images and a test set of 10,000 images.
- Each image is 28x28 grayscale, across 10 classes: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot.
Our goal is to gain an intuitive understanding of two critical concepts, embeddings and distance, as they are fundamental building blocks for the following chapter on Convolutional Neural Networks (CNNs).
What are embeddings, and why do we need them?
Top: The images from FashionMNIST. Bottom: The numerical representation of images
Traditionally, we represent images as massive arrays of integers (a 3D array for RGB images and a 2D array for grayscale images). These arrays are unwieldy and grow very quickly — analyzing just hundreds of high-resolution images already means keeping track of millions of numbers. Modeling directly on raw integer arrays does not scale; hence the modern approach of embeddings was created.
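To make this concrete, here is a minimal sketch (assuming PyTorch and torchvision are installed) that loads FashionMNIST and shows how many numbers even a single small grayscale image takes up:

from torchvision import datasets, transforms

train_set = datasets.FashionMNIST(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor(),
)

image, label = train_set[0]   # image is a 1 x 28 x 28 tensor of pixel intensities
print(image.shape)            # torch.Size([1, 28, 28])
print(image.numel())          # 784 numbers for one tiny grayscale image
# A single 1024 x 1024 RGB photo already needs 1024 * 1024 * 3 ≈ 3.1 million numbers.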
One excellent illustration of the embeddings concept is the example in “Neural Network Embeddings Explained” by Will Koehrsen. Traditionally, we made use of techniques like one-hot encoding to represent items in a matrix. For example, given three books, we would have a 3x3 matrix, where each item is represented by an array of discrete numbers that grows with every new item added (4x4 with four items, 5x5 with five items…). Furthermore, there is no sense of similarity or distance, as the items are not connected by any form of relationship.
# One Hot Encoding Categoricals
books = ["Harry Potter and The Philosopher's Stone",
         "Harry Potter and The Chamber of Secrets",
         "The Lean Startup"]

books_encoded = [[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]]

Similarity (dot product) between First and Second = 0
Similarity (dot product) between Second and Third = 0
Similarity (dot product) between First and Third = 0
Once we apply a transformation that converts the objects into embeddings, each item is represented by a fixed, small number of elements (two in this example) on a continuous scale, and the values carry relationship-based meaning: objects that are close to each other under the similarity measure (dot product) are highly related.
# Idealized Representation of Embedding
books = ["Harry Potter and The Philosopher's Stone",
         "Harry Potter and The Chamber of Secrets",
         "The Lean Startup"]

books_encoded_ideal = [[0.53, 0.85],
                       [0.60, 0.80],
                       [-0.78, -0.62]]

Similarity (dot product) between First and Second = 0.99
Similarity (dot product) between Second and Third = -0.94
Similarity (dot product) between First and Third = -0.97
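As a quick sanity check, the dot products above can be reproduced with a few lines of NumPy (the vectors are the idealized embeddings from Koehrsen’s example):

import numpy as np

one_hot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
embedded = np.array([[0.53, 0.85], [0.60, 0.80], [-0.78, -0.62]])

print(one_hot[0] @ one_hot[1])    # 0: one-hot vectors carry no notion of similarity
print(embedded[0] @ embedded[1])  # close to 1: the two Harry Potter books point the same way
print(embedded[1] @ embedded[2])  # negative: The Lean Startup points the opposite way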
Embeddings are vector representations of discrete variables: instead of analyzing millions of raw values for each image, we work with a compact vector per item. In deep learning, we usually make use of neural network embeddings to reduce the dimensionality of categorical variables into something manageable.
FashionMNIST embeddings projection on Tensorboard — for details, see “The Code” section
Because we can control the size of the vector representation, we can scale a huge image array down into a small vector of far fewer numbers. The result can be seen with the FashionMNIST objects in the image above, where the objects are projected into a 3D vector space. Through the embedding process, images that are similar are projected close to each other. Thus, once we have the embeddings, we can:
- Project the objects into vector space and formulate the concepts of distance and neighbors for visualization and simple recommendations (this chapter; see the sketch after this list).
- Make use of embeddings to train deep learning models (next chapter).
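Here is a minimal sketch of what “recommend the nearest neighbors” looks like once embeddings exist. The embeddings array, the recommend helper, and the random vectors are all hypothetical stand-ins for whatever embedding method is actually used:

import numpy as np

def recommend(query_idx, embeddings, top_k=5):
    # Euclidean distance from the query item to every item in the embedding space
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    dists[query_idx] = np.inf          # never recommend the query item itself
    return np.argsort(dists)[:top_k]   # indices of the nearest neighbors

# Random vectors stand in for real image embeddings in this illustration
item_embeddings = np.random.rand(100, 3)
print(recommend(0, item_embeddings))

Swapping Euclidean distance for cosine similarity or dot product is a one-line change, and dedicated nearest-neighbor libraries handle the same lookup at scale.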
Let’s take a look at how we can build such embeddings.
The tools
A quick overview of the tools that we will be using for the coding sections:
- PyTorch vs. Tensorflow: They are the two dominant frameworks in the deep learning space. PyTorch is gaining momentum in research due to its simplicity and Pythonic nature, which makes it easier to integrate into Python workflows; even organizations like OpenAI are switching to PyTorch. I find Tensorflow verbose and confusing to use because the changes in Tensorflow 2.0 break old code with new function names and parameters. PyTorch’s simplicity fits our workflow, so we will use it in all our code examples. I have added PyTorch learning materials (a book and tutorials) under the further readings section.
- Tensorboard: Tensorboard used to be one of Tensorflow’s key differentiating factors. Now that Tensorboard supports PyTorch natively, we can write code in PyTorch and visualize the results in Tensorboard.
- Colab: Google’s Colab hosts Jupyter notebooks in the cloud with GPU access for free. This is an excellent way for us to share code and explore deep learning frameworks without the hassle and cost of setting up the GPU environment (you just need a free Google Account).
The Code
You will need to start Colab in order to interact with the Tensorboard visualization (the most important part of this chapter).
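For reference, below is a condensed sketch of the core steps, assuming torchvision and tensorboard are installed. The raw pixels stand in here as a naive “embedding”; the Colab notebook remains the authoritative version of the code:

import torch
from torchvision import datasets, transforms
from torch.utils.tensorboard import SummaryWriter

train_set = datasets.FashionMNIST(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor(),
)

# Sample 1,000 images and flatten each 28 x 28 image into a 784-dimensional vector.
# A trained model would give better embeddings than raw pixels.
images = torch.stack([train_set[i][0] for i in range(1000)])        # 1000 x 1 x 28 x 28
labels = [train_set.classes[train_set[i][1]] for i in range(1000)]
features = images.view(1000, -1)                                    # 1000 x 784

writer = SummaryWriter("runs/fashionmnist_embeddings")
writer.add_embedding(features, metadata=labels, label_img=images)
writer.close()

# In Colab: %load_ext tensorboard, then %tensorboard --logdir runs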
What have we learned
In this chapter, we learned about embeddings, how they work, and why they matter. We also explored the code in PyTorch and experimented with the visualization in Tensorboard to gain an intuitive understanding of how recommendations can be made via embeddings.
In the next chapter, we will build on this understanding to develop a Convolutional Neural Network (CNN) based recommender.
Explore the rest of Modern Visual RecSys Series
- How does a Recommender Work?
- How to Design a Recommender?
- Intro to Visual RecSys [we are here]
- Convolutional Neural Networks Recommender [Pro]
- COVID-19 Case Study with CNN [Pro]
- Building a Personalized Real-Time Fashion Collection Recommender [Pro]
- Temporal Modeling [Pro]
- The Future of Visual Recommender Systems: Four Practical State-Of-The-Art Techniques [Foundational]
Series labels:
- Foundational: general knowledge and theories, minimum coding experience needed.
- Core: more challenging materials with code.
- Pro: difficult materials and code, with production-grade tools.