Shape of Data: An Introduction to Topological Data Analysis, Part 1

This two-part series explores how Topological Data Analysis (TDA) and specifically Persistent Homology (PH) can provide unique insights into data analysis. In Part 1, we’ll cover the fundamental concepts of TDA, understanding data shapes, simplicial complexes, and how persistent homology works. Part 2 will focus on a practical application of these concepts to time series analysis, particularly in analyzing financial transaction patterns, along with real-world applications across different domains.

1. Understanding the Data Landscape

Imagine you’re exploring a dataset that looks like this:

Each row corresponds to one sample of the data, and the columns represent some attribute of this sample. In reality, the columns often have physical meanings associated with them. For example:

● Credit card transactions with features such as amount, day, and average spend.

● Time series of sensor readings in a power plant.

● Facial keypoints in a KYC journey.

● Token embeddings in language models.

● Pixel values or embeddings of images.

For most of these datasets, one common trait is that they are high-dimensional. Dimension refers to the number of columns or features that each sample has. Our example dataset lies in ℝ⁸, which means that each sample can be represented as a point in an 8-dimensional hyperspace. This collection of data points forms what we call a “point cloud”.

While we can’t directly visualize high-dimensional spaces, we can get glimpses through 2D or 3D slices of the dataset:
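As a minimal sketch of what such a slice means in practice, the snippet below keeps only a few feature columns of an 8-dimensional point cloud. The data is random and the names point_cloud and take_slice are hypothetical, used only for illustration; they do not come from the example dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the example dataset: 200 samples, 8 features.
point_cloud = rng.normal(size=(200, 8))


def take_slice(points, feature_indices):
    """Keep only the chosen feature columns (an axis-aligned projection)."""
    return points[:, list(feature_indices)]


# A 2D slice (features 0 and 3) and a 3D slice (features 0, 3 and 5).
slice_2d = take_slice(point_cloud, (0, 3))
slice_3d = take_slice(point_cloud, (0, 3, 5))

print(point_cloud.shape, slice_2d.shape, slice_3d.shape)
```

Each slice can be passed to a scatter plot, but it discards every feature that was not selected, so structure that only shows up across the other dimensions stays hidden.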

But this may not give us much insight. We can also use dimensionality reduction techniques such as removing correlated columns, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or t-SNE, although these often lose some information.
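To make the information loss concrete, here is a rough sketch of PCA implemented with a singular value decomposition in NumPy. The data is synthetic and the function name pca_project is an assumption made for this illustration; a library implementation such as scikit-learn's PCA would normally be used instead.

```python
import numpy as np


def pca_project(points, n_components):
    """Project points onto their top principal components via SVD."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Squared singular values are proportional to the variance per component.
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    return projected, retained


rng = np.random.default_rng(0)
# Synthetic 8-dimensional data with a different spread along each axis.
scales = np.array([5.0, 3.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.1])
points = rng.normal(size=(200, 8)) * scales

projected, retained = pca_project(points, n_components=2)
print(projected.shape)
print(f"fraction of variance retained: {retained:.2f}")
```

The retained fraction is below 1 whenever the discarded components carry any variance, which is one way to quantify the loss of information mentioned above.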

2. Traditional Approaches to Data Analysis

That takes us to the next logical question — what do we want to do with this data? Typically, our goal with such datasets is to extract meaningful insights or make predictions. We might want to:

● Classify transactions as fraudulent or genuine.

● Predict machine breakdowns.

● Detect live vs. pre-recorded KYC videos.

● Predict the next token in a language model.

● Generate image descriptions.

We usually approach these tasks using domain expertise, rule-based logic, statistical methods, or machine learning algorithms. These methods often involve transforming the data from one space to another. For instance, a neural network might transform 8D data to 16D, then 4D, and finally to a single score, aiming to increase class separability at each step.
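The 8D to 16D to 4D to single-score pipeline can be sketched as a forward pass through three dense layers. This is only an illustration of the changes in shape: the weights are random and untrained, and the choice of ReLU and sigmoid activations is an assumption, not something the example above specifies.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(inputs, weights, bias):
    """Affine map that moves points from one space to another."""
    return inputs @ weights + bias


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Random, untrained weights: 8 -> 16 -> 4 -> 1.
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 4)), np.zeros(4)
w3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

batch = rng.normal(size=(5, 8))                 # five samples in 8D
hidden_16 = relu(dense(batch, w1, b1))          # now in 16D
hidden_4 = relu(dense(hidden_16, w2, b2))       # now in 4D
score = sigmoid(dense(hidden_4, w3, b3))        # one score per sample

print(batch.shape, hidden_16.shape, hidden_4.shape, score.shape)
```

Training would adjust the weights so that the classes become easier to separate at each step; the sketch only shows how the dimensionality of each sample changes.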

3. Introducing Topology: A New Lens for Data

So we have a bunch of tools available to work with our data and solve the business problem at hand. Why do we need to know about the “shape of data”? What do we even mean by shape? Does the data have any distinguishable shape? And if it does, can it help us in any way?

To answer these, let’s first look at what we mean by the topology of data, a particular technique in topological data analysis (TDA) called Persistent Homology (PH), and how to apply it to our data. That sounds like a lot, but we’ll slowly build up the concepts and finally look at some real use cases of PH. Hopefully, by the end of the article, we’ll have our answers.

3.1 What is Topology?

Topology is a branch of mathematics that studies the properties of geometric objects that remain unchanged under continuous deformations. The classic example is that a coffee mug and a donut are topologically equivalent — they can be deformed into each other without tearing or gluing.

From a topological perspective, an important property of the object is the number of ‘holes’ it has. A hole can be of any dimension:

● 0-dimensional holes: connected components

● 1-dimensional holes: loops (like the hole in a ring)

● 2-dimensional holes: voids or cavities (like the inside of a basketball)

● Higher-dimensional holes in more complex spaces

We represent these holes using Betti numbers (named after mathematician Enrico Betti): β₀, β₁, β₂, and so on. Take a look at these objects and their associated Betti numbers.
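For the simplest kind of shape, one made only of vertices and edges, the first two Betti numbers can be computed by hand. The sketch below counts connected components (β₀) with a union-find structure and then obtains the number of independent loops (β₁) from the Euler characteristic of a graph, V − E = β₀ − β₁. The function name and the three example shapes are hypothetical, and the formula applies only to graphs; shapes with filled-in triangles or higher-dimensional pieces need the full homology machinery.

```python
def betti_numbers_of_graph(num_vertices, edges):
    """Return (b0, b1) for a graph viewed as a 1-dimensional complex."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = num_vertices
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1

    b0 = components
    # Euler characteristic of a graph: V - E = b0 - b1.
    b1 = len(edges) - num_vertices + b0
    return b0, b1


hollow_triangle = [(0, 1), (1, 2), (2, 0)]
figure_eight = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)]
two_segments = [(0, 1), (2, 3)]

print(betti_numbers_of_graph(3, hollow_triangle))  # (1, 1): one piece, one loop
print(betti_numbers_of_graph(5, figure_eight))     # (1, 2): one piece, two loops
print(betti_numbers_of_graph(4, two_segments))     # (2, 0): two pieces, no loops
```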

3.2 From Continuous Shapes to Discrete Points

So far, the objects we’ve been looking at have continuous shapes. But what about point cloud datasets, which are made up of discrete points? How do we count holes in those? Let’s look at some sample point clouds.

These look like they’ve been sampled from actual continuous distributions with some noise. Even though all the points are discrete, can you guess the Betti numbers β₀, β₁, and β₂ looking at these plots?
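One hedged way to see why this question needs care: for discrete points, even the number of connected components depends on the scale at which we look. The sketch below links any two points whose distance is at most a threshold epsilon and counts the resulting components. The point cloud (two noisy circles), the helper names, and the linking rule are assumptions made for this illustration, not the method this series goes on to define.

```python
import math
import random


def count_components(points, epsilon):
    """Count connected components when points within epsilon are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = len(points)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= epsilon:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j
                    components -= 1
    return components


def noisy_circle(center_x, center_y, n, noise, rng):
    """Sample n evenly spaced points on a unit circle, with small jitter."""
    points = []
    for k in range(n):
        angle = 2 * math.pi * k / n
        x = center_x + math.cos(angle) + rng.uniform(-noise, noise)
        y = center_y + math.sin(angle) + rng.uniform(-noise, noise)
        points.append((x, y))
    return points


rng = random.Random(0)
cloud = noisy_circle(0, 0, 20, 0.02, rng) + noisy_circle(5, 0, 20, 0.02, rng)

print(count_components(cloud, 0.1))  # 40: every point is its own component
print(count_components(cloud, 0.4))  # 2: each circle is one component
print(count_components(cloud, 3.5))  # 1: the two circles merge
```

The same 40 points give three different answers for β₀ depending on epsilon, so any hole count for a point cloud has to say something about scale.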

Read more at https://medium.com/perfiostechblog/shape-of-data-an-introduction-to-topological-data-analysis-part-1-ab25004d56b

Akshay G Rao

Postgrad @ CSE, IIT Madras | Deep Learning Researcher | Ex-Perfios Fullstack developer

2 weeks ago

Insightful blog. Eagerly waiting to see how these concepts are useful for analysis in Part 2. The intuition and motivation for these techniques could have been stronger, for example by explaining what holes indicate about the data, or by giving an example case where PCA and t-SNE fail badly but PH analysis works far better.
