Visualizing Multivariate Datasets Using Radviz
Dipankar Mazumdar, M.Sc
Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"
As the dimensionality of complex datasets keeps growing, representing high-dimensional data remains a persistent challenge. Although the field of visualization has made considerable advances in representing multidimensional datasets effectively, problems such as difficult user interpretation and cluttered displays still hurt usability. As part of my graduate-level Visualization course, I had the opportunity to develop a prototype that addresses these problems with a radial visualization called RadViz.
Radviz:
Radviz is a point-based projection technique that maps high-dimensional multivariate data onto the 2D plane, placing each data point according to a spring-force analogy. For an m-dimensional dataset, m anchors are created and distributed around the circumference of a circle. Each data point is attached by a spring to every dimension anchor, and the stiffness of each spring is proportional to the point's value for that dimension. The data point is drawn at the position where the spring forces sum to zero. The position Aj of the jth anchor (j = 0, …, m-1) is calculated as:
where r is the radius of the circle drawn and c = (cx, cy) is its center.
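Assuming the anchors are spaced evenly around the circle, starting at angle zero (the usual Radviz convention, not something specific to this prototype), the anchor position is commonly written as:

```latex
A_j = \left( c_x + r \cos\frac{2\pi j}{m},\; c_y + r \sin\frac{2\pi j}{m} \right),
\qquad j = 0, \dots, m-1
```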
An instance of data is mapped to a position z inside the convex hull of the dimension anchors. The position z of a data instance is calculated as:
where x = (x0, x1, …, xm-1) is the attribute vector of the instance and Aj is the position of the jth anchor.
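Setting the net spring force, the sum of x_j (A_j - z) over all anchors, to zero and solving for z gives the weighted average usually quoted for Radviz. This assumes each attribute value x_j has been scaled to be non-negative (typically to the range [0, 1]) and that at least one value is non-zero:

```latex
\sum_{j=0}^{m-1} x_j \,(A_j - z) = 0
\quad\Longrightarrow\quad
z = \frac{\sum_{j=0}^{m-1} x_j A_j}{\sum_{j=0}^{m-1} x_j}
```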
The Prototype:
To address the problems described above, I developed a prototype comprising a set of visualizations and a couple of interaction techniques. The prototype is a web application served by a Python Flask server for evaluation purposes.
The datasets used in this project are purely numerical. The main prototype was tested and evaluated with the ‘Spotify’s Song attribute 2017 data’. Two complementary datasets (Winequality-red.csv, Iris) are provided with the prototype to demonstrate the capabilities of the visualization in different domains.
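The anchor placement and weighted-average projection described in the Radviz section can be tried on a handful of numeric rows. The sketch below is plain Python and is not the prototype's D3 code. The function names, the min-max scaling step, and the sample values are assumptions made for illustration.

```python
import math


def anchor_positions(m, r=1.0, center=(0.0, 0.0)):
    """Place m anchors evenly on a circle of radius r (assumed to start at angle 0)."""
    cx, cy = center
    return [
        (cx + r * math.cos(2 * math.pi * j / m),
         cy + r * math.sin(2 * math.pi * j / m))
        for j in range(m)
    ]


def min_max_normalize(rows):
    """Scale every column to [0, 1] so that all spring weights are non-negative."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def radviz_project(row, anchors, center=(0.0, 0.0)):
    """Weighted average of the anchors, using the row's values as weights."""
    total = sum(row)
    if total == 0:
        # No spring pulls on the point, so leave it at the circle's center.
        return center
    x = sum(w * ax for w, (ax, _) in zip(row, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(row, anchors)) / total
    return (x, y)


# Illustrative four-attribute rows (values chosen for the example only).
rows = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
anchors = anchor_positions(m=4)
points = [radviz_project(row, anchors) for row in min_max_normalize(rows)]
for point in points:
    print(point)
```

A row dominated by one attribute lands close to that attribute's anchor, while a row with equal values in every attribute lands at the center of the circle. That is why Radviz tends to crowd balanced records near the middle.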
User Interface.
The user interface is the main explorer of this project: it lets a user view the Radviz and interact with it. The explorer renders the Radviz with the default dataset, ‘Spotify’s Song attributes’, and lets the user hover over individual data points inside the Radviz to see their attribute values. The Radviz anchors show a histogram on hover, so the user can understand the data distribution for each attribute. The interface also includes several other functions associated with the Radviz visualization.
To address the problem of node occlusion, an ‘Opacity Slider’ controls the transparency of densely projected data points. Our dataset contains a column (Target) that classifies songs as liked or disliked, so the Radviz uses two colors, Green and Orange, to visually distinguish the two categories. If a user later changes their preferences and wants to redo the analysis, they can edit the Target values in the dataset (1 or 0) and the Radviz will automatically reflect the changes.
Since the overall purpose of this project is to let users represent multidimensional multivariate data more efficiently and to understand its application in multiple domains, the user interface also provides a feature to load other numerical datasets. The user interface also presents three Scatterplots (Figure above) that show the correlation between different attributes and allow a comparison of the two 2D visualization solutions (Radviz and Scatterplots). In addition to the functions mentioned above, the project presents two important add-on features. Because our data is high-dimensional and clutter is an obvious problem, Machine Learning capabilities have been enabled in the backend to group similar types of songs together based on their attributes. Also, to give users an alternative view of multidimensional datasets, a Parallel Coordinates plot has been developed (Figure below).
Machine Learning capability.
Machine Learning techniques have been applied widely in the area of Visualization. Techniques such as clustering and dimensionality reduction have long been used for projecting high-dimensional data into 2D space. They also address some of the common problems with data visualization, such as node occlusion and cluttered representations. Taking inspiration from the work of Jorge et al. and Hyunwoo et al., the prototype developed as part of this project runs a Machine Learning clustering technique in the background.
Clustering is an unsupervised learning method that tries to find groups among the elements of a dataset. These groups, or clusters, are formed based on the similarities between data points. The K-means algorithm assigns data points to clusters so that the sum of the squared distances between each data point and its cluster’s centroid is minimized. To keep the implementation simple, two clusters are defined by default, and they are represented inside the Radviz using two colors: Orange and Blue. When users click on the ‘Group Similar tracks’ button, the backend Machine Learning algorithm is triggered and songs are grouped into two categories based on the similarity of their attributes. The Radviz is designed to handle the results of clustering and allows users to drag and drop data points from individual clusters into the 2D space and make cluster-wise comparisons. Incorporating a Machine Learning technique helps us achieve two important goals:
1. User perspective: Clustering helps to render similar types of songs into 2 different groups, making analysis easier for users.
2. Designer perspective: Techniques like clustering allow high-dimensional data to be represented effectively in 2D space.
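The K-means objective described above (minimizing the sum of squared distances to each cluster's centroid) can be illustrated with a small plain-Python sketch. This is not the prototype's scikit-learn code: it seeds the centroids with a simple deterministic farthest-point rule instead of k-means++, and the two-attribute sample points are made up for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, max_iter=100):
    """Lloyd's algorithm with farthest-point seeding (a stand-in for k-means++)."""
    # Seeding: start from the first point, then repeatedly add the point
    # that is farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids

    # The quantity K-means minimizes: the within-cluster sum of squared distances.
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia


# Hypothetical (attribute_1, attribute_2) values for six tracks.
tracks = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
          (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
labels, centroids, inertia = kmeans(tracks, k=2)
print(labels, centroids, inertia)
```

With two well-separated groups like these, the algorithm settles after a couple of iterations. On real song attributes the result depends on the seeding, which is why library implementations typically run several initializations and keep the one with the lowest objective.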
Force-based Layouts:
The visualization components of this project have been developed using D3.js. D3 uses a physics-based simulator to position visual elements. As described by Eades, a force-directed layout treats each visual element as a physical body: nodes attract or repel one another until the system settles into a stable equilibrium. The Radviz developed as part of this work uses a force-based algorithm built on Hooke’s law. Force parameters such as link distance, charge, and friction are responsible for adjusting the position and velocity of elements. The force-based technique also provides a charge function, which makes data points attract or repel each other depending on whether its value is positive or negative. To deal with the cluttered representation of high-dimensional data points, we have used repulsion and enabled charge-based distancing, which is expected to reduce overlap between points and produce a more localized layout.
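The link between Hooke's law and the Radviz position can be checked numerically. The sketch below repeatedly nudges a point along the net spring force until it stops moving, then compares the result with the weighted-average formula. It is a deliberately simplified relaxation loop in plain Python, not D3's force simulation, and the weights, step size, and function names are assumptions for the example.

```python
import math


def spring_equilibrium(weights, anchors, steps=500, step_size=0.1):
    """Move a point along the net Hooke's-law force until it settles."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        # Each spring pulls toward its anchor with force w * (anchor - position).
        fx = sum(w * (ax - x) for w, (ax, _) in zip(weights, anchors))
        fy = sum(w * (ay - y) for w, (_, ay) in zip(weights, anchors))
        x += step_size * fx
        y += step_size * fy
    return (x, y)


def closed_form(weights, anchors):
    """Weighted average of the anchors, i.e. the point where the net force is zero."""
    total = sum(weights)
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)


# Three anchors on the unit circle and one data point's normalized values.
anchors = [(math.cos(2 * math.pi * j / 3), math.sin(2 * math.pi * j / 3)) for j in range(3)]
weights = [0.9, 0.1, 0.4]

simulated = spring_equilibrium(weights, anchors)
expected = closed_form(weights, anchors)
print(simulated, expected)
```

The loop only converges when the step size times the sum of the weights stays below 2. A full force layout adds velocity, friction, and repulsion between points on top of this, so the final positions there can deviate from the pure weighted average.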
Implementation:
The prototype has been developed as a full-stack solution using the Python Flask framework. Flask is a lightweight and extensible web framework that has become widely used in recent years. A Flask server can expose multiple RESTful APIs, which helps in building scalable and manageable web applications. The different components of the code are described below:
Front-end: The front-end part of the project has been developed using HTML and CSS, along with a couple of images for the background, logos, etc. In total, two webpages have been implemented to serve two basic purposes: a) a homepage with the proposed visualization and its functionalities, and b) a page that renders the results of the Machine Learning operations.
Back-end: The back-end comprises the code-behind logic and is developed primarily in vanilla JavaScript.
Machine Learning API: The prototype allows users to group music tracks based on similarities. To achieve this, I have used scikit-learn’s K-means implementation with k-means++ initialization. The K-means algorithm is exposed as an API endpoint inside the Python Flask framework and is triggered by an onclick event in the front-end. A snippet of the code is shown below:
    import json

    import pandas as pd
    from flask import request
    from sklearn.cluster import KMeans

    # d3_flask is the Flask app (or blueprint) object defined elsewhere in the project.
    @d3_flask.route('/cluster/new', methods=['POST'])
    def cluster_new_param():
        dataset = pd.read_csv('./app/static/spotify.csv')
        ds_new = pd.read_csv('./app/static/spotify_base.csv')
        # Feature matrix: every column except the last one.
        X = dataset.iloc[:, :-1].values

        # Number of clusters requested by the front-end.
        json_data = json.loads(request.data)
        n_clusters = int(json_data['n_clusters'])

        km = KMeans(
            n_clusters=n_clusters, init='k-means++',
            n_init=10, max_iter=300,
            tol=1e-04, random_state=0
        )
        y_km = km.fit_predict(X)

        # Store each track's cluster label and write the file used by the results page.
        ds_new['prediction'] = y_km
        ds_new.to_csv(f'./app/static/predictedn{n_clusters}.csv', index=False)
        return f"You have successfully applied clustering with {n_clusters} params Please click on Clustering results"
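For reference, a client could call this endpoint by POSTing a JSON body with an `n_clusters` key, which is what the handler reads from `request.data`. The sketch below only builds the request using Python's standard library. The base URL `http://localhost:5000` and the helper name `build_cluster_request` are assumptions, and the prototype's own front-end issues this call from JavaScript rather than Python.

```python
import json
from urllib.request import Request


def build_cluster_request(n_clusters, base_url="http://localhost:5000"):
    """Build (but do not send) the POST request expected by /cluster/new."""
    payload = json.dumps({"n_clusters": n_clusters}).encode("utf-8")
    return Request(
        url=f"{base_url}/cluster/new",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_cluster_request(2)
print(req.get_method(), req.full_url, req.data)

# With the Flask server running, the request could be sent like this:
# from urllib.request import urlopen
# print(urlopen(req).read().decode("utf-8"))
```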
Visualizations: The visual representations in this project work have been implemented using the D3.js library.
The overall idea of this project was to build a robust and effective visualization technique for representing multivariate, multidimensional datasets. For the detailed implementation and the GitHub link, feel free to reach out to me; I will be happy to share.
~Dipankar