Shape of Data: An Introduction to Topological Data Analysis, Part 1

Perfios

Lead with brilliance / Leap to outdo

Published: March 5, 2025

This two-part series explores how Topological Data Analysis (TDA) and specifically Persistent Homology (PH) can provide unique insights into data analysis. In Part 1, we’ll cover the fundamental concepts of TDA, understanding data shapes, simplicial complexes, and how persistent homology works. Part 2 will focus on a practical application of these concepts to time series analysis, particularly in analyzing financial transaction patterns, along with real-world applications across different domains.

1. Understanding the Data Landscape

Imagine you’re exploring a dataset that looks like this:

Each row corresponds to one sample of the data, and each column represents an attribute of that sample. In practice, the columns usually have a physical meaning. For example:

● Credit card transactions with features like amount, day, average spend, etc.

● Time series of sensor readings in a power plant.

● Facial keypoints in a KYC journey.

● Token embeddings in language models.

● Pixel values or embeddings of images.

For most of these datasets, one common trait is that they are high-dimensional. Dimension refers to the number of columns or features that each sample has. Our example dataset lies in ℝ⁸, which means that each sample can be represented as a point in an 8-dimensional space. This collection of data points forms what we call a “point cloud”.

While we can’t directly visualize high-dimensional spaces, we can get glimpses through 2D or 3D slices of the dataset:

But this may not give us much insight. We can also apply dimensionality reduction, such as removing correlated columns, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or t-SNE, although this usually discards some information.
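To make the "loss of information" concrete, here is a minimal sketch that projects an 8-feature dataset onto its top two principal components and reports how much of the total variance survives. The data is synthetic random noise standing in for the example dataset, and the helper name `pca_retained_variance` is made up for this illustration.

```python
import numpy as np

def pca_retained_variance(X, k):
    """Project X onto its top-k principal components.

    Returns the projected data and the fraction of total variance kept.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                            # coordinates in the top-k directions
    variance = S ** 2                            # proportional to variance per component
    return Z, float(variance[:k].sum() / variance.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                    # synthetic stand-in: 200 samples in 8 dimensions
Z, kept = pca_retained_variance(X, k=2)
print(Z.shape, f"variance kept: {kept:.2%}")
```

Whatever fraction is not kept is the information the 2D view cannot show. Real datasets with correlated features typically retain far more variance than this uncorrelated toy data does.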

2. Traditional Approaches to Data Analysis

That takes us to the next logical question — what do we want to do with this data? Typically, our goal with such datasets is to extract meaningful insights or make predictions. We might want to:

● Classify transactions as fraudulent or genuine.

● Predict machine breakdowns.

● Detect live vs. pre-recorded KYC videos.

● Predict the next token in a language model.

● Generate image descriptions.

We usually approach these tasks using domain expertise, rule-based logic, statistical methods, or machine learning algorithms. These methods often involve transforming the data from one space to another. For instance, a neural network might transform 8D data to 16D, then 4D, and finally to a single score, aiming to increase class separability at each step.
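The 8D to 16D to 4D to score pipeline can be sketched as a tiny forward pass. This is only an illustration of how the shape of the data changes from layer to layer: the weights are random and untrained, and the layer sizes, activations, and names (`dense`, `forward`) are assumptions rather than any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, w, b, activation):
    """One fully connected layer: affine map followed by a nonlinearity."""
    return activation(x @ w + b)

# Random, untrained weights: 8 -> 16 -> 4 -> 1
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(X):
    h1 = dense(X, W1, b1, relu)         # points now live in a 16D space
    h2 = dense(h1, W2, b2, relu)        # then in a 4D space
    score = dense(h2, W3, b3, sigmoid)  # finally a single score per sample
    return h1, h2, score

X = rng.normal(size=(5, 8))             # five synthetic samples with 8 features
h1, h2, score = forward(X)
print(X.shape, h1.shape, h2.shape, score.shape)
```

Training would adjust the weights so that the classes become easier to separate in each successive space. Here only the sequence of spaces is shown.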

3. Introducing Topology: A New Lens for Data

So we have plenty of tools for working with our data and solving the business problem at hand. Why, then, do we need to know about the “shape of data”? What do we even mean by shape? Does the data have any distinguishable shape? And if it does, can it help us in any way?

To answer these, let’s first look at what we mean by the topology of data, a particular technique in topological data analysis (TDA) called Persistent Homology (PH), and how to apply it to our data. That sounds like a lot, but we’ll slowly build up the concepts and finally look at some real use cases of PH. Hopefully, by the end of the article, we’ll have our answers.

3.1 What is Topology?

Topology is a branch of mathematics that studies the properties of geometric objects that remain unchanged under continuous deformations. The classic example is that a coffee mug and a donut are topologically equivalent — they can be deformed into each other without tearing or gluing.

From a topological perspective, an important property of the object is the number of ‘holes’ it has. A hole can be of any dimension:

● 0-dimensional holes: connected components

● 1-dimensional holes: loops (like the hole in a ring)

● 2-dimensional holes: voids or cavities (like the inside of a basketball)

● Higher-dimensional holes in more complex spaces

We represent these holes using Betti numbers (named after mathematician Enrico Betti): β₀, β₁, β₂, and so on. Take a look at these objects and their associated Betti numbers.
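For readers who want to check Betti numbers by computation, here is a rough sketch that counts holes in small shapes built from points, edges, and triangles. It uses the standard linear-algebra recipe (the number of k-dimensional pieces minus the ranks of two boundary matrices), computed with real-valued ranks. The function name `betti_numbers` and the way shapes are encoded as vertex tuples are choices made for this illustration.

```python
from itertools import combinations
import numpy as np

def betti_numbers(top_simplices, max_dim=2):
    """Betti numbers [b0, ..., b_max_dim] of the complex generated by top_simplices."""
    # Include every face of every listed simplex.
    simplices = set()
    for s in top_simplices:
        s = tuple(sorted(s))
        for k in range(1, len(s) + 1):
            simplices.update(combinations(s, k))
    by_dim = {d: sorted(x for x in simplices if len(x) == d + 1)
              for d in range(max_dim + 2)}

    def boundary_rank(d):
        """Rank of the boundary map from d-simplices to (d-1)-simplices."""
        if d == 0:
            return 0
        rows, cols = by_dim[d - 1], by_dim[d]
        if not rows or not cols:
            return 0
        index = {face: i for i, face in enumerate(rows)}
        B = np.zeros((len(rows), len(cols)))
        for j, s in enumerate(cols):
            for i in range(len(s)):
                B[index[s[:i] + s[i + 1:]], j] = (-1) ** i  # alternating signs
        return int(np.linalg.matrix_rank(B))

    return [len(by_dim[d]) - boundary_rank(d) - boundary_rank(d + 1)
            for d in range(max_dim + 1)]

hollow_triangle = [(0, 1), (1, 2), (0, 2)]             # a loop, like a ring
filled_triangle = [(0, 1, 2)]                          # a solid disk
hollow_tetrahedron = list(combinations(range(4), 3))   # a sphere-like surface

print(betti_numbers(hollow_triangle))     # [1, 1, 0]
print(betti_numbers(filled_triangle))     # [1, 0, 0]
print(betti_numbers(hollow_tetrahedron))  # [1, 0, 1]
```

The hollow triangle has one component and one loop, filling it in removes the loop, and the hollow tetrahedron encloses a single cavity, matching the ring and basketball examples above.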

3.2 From Continuous Shapes to Discrete Points

So far, the objects we’ve been looking at have continuous shapes. But what about point cloud datasets, which are made up of discrete points? How do we count holes in those? Let’s look at some sample point clouds.

These look like they’ve been sampled from actual continuous distributions with some noise. Even though all the points are discrete, can you guess the Betti numbers β₀, β₁, and β₂ looking at these plots?
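One way to see why the answer depends on how closely we look: the sketch below samples a noisy circle and counts connected components (β₀) after linking any two points closer than a chosen distance `eps`. The sampling scheme, the noise level, and the helper `count_components` are all assumptions made for illustration. Counting loops (β₁) would need more machinery than this and is not attempted here.

```python
import numpy as np

def count_components(points, eps):
    """Number of connected components when points closer than eps are linked."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] < eps:
                parent[find(i)] = find(j)   # merge the two groups
    return len({find(i) for i in range(n)})

# 40 points at evenly spaced angles on a unit circle, with small radial noise.
rng = np.random.default_rng(0)
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
radii = 1.0 + rng.uniform(-0.05, 0.05, size=40)
circle = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

for eps in (0.01, 0.3):
    print(f"eps={eps}: {count_components(circle, eps)} component(s)")
```

At the tiny scale every point is its own component (40 of them), while at the larger scale neighbouring points link up into a single connected ring. Guessing β₀ = 1 by eye amounts to implicitly choosing a scale like the second one.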
