Visualization of the Graph Techniques and Different Layouts

Kinjal Ami

Data Science and Machine Learning

Published: September 17, 2018

Introduction

Much of our effort in data analysis ends up before the eyes of stakeholders or clients. Effective visualization is essential for gaining attention and for providing insightful, readily digestible representations, so an appreciation of the available techniques and tools is most helpful. Experience or a qualification in graphic design (or a similar discipline) also complements data analysis skills, at least in my own experience. Here I briefly describe graph techniques and the different layouts available for data visualization in Gephi, using an example dataset.

Data visualization is the process of representing information visually to provide a clear, effective and interactive understanding of abstract business or scientific data. In relational data visualization, analysis steps such as extracting a graph from the data and clustering it are automated. The data is then illustrated (“visualized”) by drawing a graph with one of various tools (discussed below). Graph visualization has a wide range of applications, such as organization charts, data flow diagrams, state-transition diagrams, entity-relationship diagrams, and networks. It constructs a 2D or 3D representation of a graph according to the user’s requirements.

Graphs are a pictorial representation of the vertices and edges of a graph dataset. The arrangement of these vertices and edges makes the data easier to understand. There are different graph layout methods, such as Force-based Layout, Tree Layout, Spectral Layout, Arc Diagrams, Circular Layout, Orthogonal Layout Methods, Dominance Drawing, and Sugiyama-style Drawing ("Graph drawing" 2018).

Gephi 0.9.2

Gephi is open-source visualization and exploration software used by data analysts and scientists for all kinds of graphs and networks. It is an interactive tool for analyzing the properties of graphs and networks in detail without requiring coding skills. With Gephi a user can interact with the final representation, manipulate structures and appearance properties, and understand patterns in the visualization. The primary purpose of the tool is to make it easier for data analysts to create visualizations from complex datasets, form hypotheses, reveal patterns in the data and isolate structures. Gephi provides high performance and customizable plug-ins, and it reads common graph file formats natively ("Gephi - The Open Graph Viz Platform" 2018).

Applications of Gephi:

· Exploratory Data Analysis

· Link Analysis

· Social Network Analysis

· Biological Network Analysis

· Poster Creation

Gephi provides various layouts for visualization, including force-based algorithms and optimization techniques that improve graph readability. The layout palette in the software lists the available layouts, so the user can try different layouts while they run and get immediate feedback. The layouts are listed below:

· Batch Partial Expansion

· Circle Pack Layout

· Circular Layout

· Contraction

· Dual Circle Layout

· Event Graph Layout

· Expansion

· Force Atlas Layout

· Force Atlas 2 Layout

· Fruchterman Reingold Layout

· Geo Layout

· Graphviz

· Isometric Layout

· MDS Layout

· MultiGravity Force Atlas 2 Layout

· Network Splitter 3D

· Noverlap Layout

· OpenOrd Layout

· Yifan Hu Layout

· Yifan Hu Proportional Layout

Gephi supports file formats such as GDF, GraphML (NodeXL), GML, NET (Pajek), GEXF and more. It can also be extended with plug-ins such as layouts, metrics, data sources, rendering presets, and manipulation tools ("Gephi - The Open Graph Viz Platform" 2018).
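As an illustration of one of these formats, the sketch below writes a small graph as GDF text, which Gephi can open. It is a minimal sketch only: the function name `to_gdf` and the toy nodes and edges are hypothetical, and it assumes that ids and labels contain no commas or quotes.

```python
def to_gdf(nodes, edges):
    """Return GDF text for nodes ({id: label}) and edges ([(source, target)])."""
    # A GDF file has a node section followed by an edge section,
    # each introduced by a header line that declares the columns.
    lines = ["nodedef>name VARCHAR,label VARCHAR"]
    for node_id, label in nodes.items():
        lines.append(f"{node_id},{label}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR")
    for source, target in edges:
        lines.append(f"{source},{target}")
    return "\n".join(lines) + "\n"


# Hypothetical toy data, not taken from the cars dataset.
nodes = {"v1": "Vehicle 1", "v2": "Vehicle 2", "suv": "SUV"}
edges = [("v1", "suv"), ("v2", "suv")]

gdf_text = to_gdf(nodes, edges)
print(gdf_text)
```

Saving `gdf_text` to a file with a `.gdf` extension should be enough to load it through Gephi's file import.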

Advantages and Disadvantages of Graph visualization

Graph visualization is the visual representation of real-world datasets using the nodes and edges of a graph. Dedicated layout algorithms, chosen according to the dataset and the requirement, calculate the node positions and display the data in 2D or 3D space (Devaux 2018).

Visualization using graph techniques helps to understand and discover patterns quickly in large graphs ("Features" 2018). It shows trends in the relationships within the data while retaining exact data values and sample size. Graphs can represent multiple dimensions, make relationships between nodes easy to understand, and are fast to traverse. Graph techniques produce good layouts, are easy to implement, and are well suited to general graphs with different variations.

One disadvantage of graph visualization is that it can be inappropriate for very large datasets, for which it may give inconclusive results. Proper labeling is crucial to allow a viewer to understand what is being shown. In graph drawing the labeling problem is compounded, not only because of the potential for many nodes but also because labels may be needed for the links between nodes (Ward, Grinstein & Keim 2015). Navigation and interaction are further issues in graph visualization.

The choice of an appropriate method and tool for a dataset depends on the physical nature of the data, the visual variables, and the relationships between those variables. Many visualization methods are available, so one should take care to determine the most suitable and efficient representation for a particular dataset. For example, a bar chart is the most suitable method for comparing categories (comparisons between relative and absolute sizes of categorical values). A pie chart is suited to assessing hierarchies and part-to-whole relationships. Changes over a continuous time frame (temporal data) can be represented with line charts. Scatter plots are very efficient for plotting connections and relationships; this technique is used for multivariate datasets, which produce the most complex visual outcomes. For mapping geospatial data, the choropleth map is a popular approach (Kirk et al. n.d.).

Dynamic graphs and multilevel graphs are two special kinds of graph that are the most efficient for visualizing data that keeps changing over time in real-world systems. In such data, the relationships between variables are much more complex (Khokhar 2015).

Dataset

The dataset describes new cars and trucks from 2004 and has 428 observations and 19 variables. The 19 variables cover the prices of the cars, measurements relating to the size of the vehicle, fuel efficiency and other factors ("2004 Cars and Trucks | Interactive Data Visualization" 2018). The dataset flags each vehicle with indicator columns for sports car, sport utility vehicle (SUV), wagon, minivan, pickup, all-wheel drive (AWD) and rear-wheel drive (RWD); in these columns 1 means yes and 0 means no for the given vehicle model. The remaining columns contain parameters such as retail price, dealer cost, engine size, Cyl, HP, City MPG, Highway MPG, weight, wheel base, length, and width.
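The article does not state how the table was turned into nodes and edges, so the following is only one plausible sketch: each vehicle is linked to every type whose indicator column is 1. The column names, the three sample rows and the function `flags_to_edges` are hypothetical and chosen for illustration.

```python
import csv
import io

# Hypothetical three-row excerpt; column names are illustrative only.
SAMPLE_CSV = """name,sports_car,suv,wagon,minivan,pickup,awd,rwd
Vehicle A,0,1,0,0,0,1,0
Vehicle B,1,0,0,0,0,0,1
Vehicle C,0,0,1,0,0,0,0
"""


def flags_to_edges(csv_text, name_column="name"):
    """Return (vehicle, type) pairs for every indicator column whose value is 1."""
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        vehicle = row[name_column]
        for column, value in row.items():
            if column != name_column and value.strip() == "1":
                edges.append((vehicle, column))
    return edges


edges = flags_to_edges(SAMPLE_CSV)
for source, target in edges:
    print(source, "->", target)
```

An edge list of this shape can be imported into Gephi as an edges table, with the vehicle names and type names becoming the nodes.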

Visualization and Analysis

The different Gephi layouts applied to the cars dataset are described and illustrated below:

· The Initial representation of the Data set

· Force Atlas Layout

· Force Atlas 2 Layout

· Fruchterman Reingold Layout

· OpenOrd Layout

· Circular Layout

· MultiGravity Force Atlas 2 Layout

· Yifan Hu Layout

Graph Output:

The initial visualization of the cars dataset in Gephi is shown below.

Figure 1. Initial Data Visualization

Figure 2 below shows the initial visualization of the cars dataset after altering the appearance properties (color, size, label color and label size) of nodes and edges. The node size is set to 8. Using other Gephi features such as labels, borders, and opacity, we will customize this graph and produce different layouts for better visualization.

Figure 2. Initial Data Visualization applying Appearance Properties

· Force Atlas Layout :

Figure 3. Force Atlas Layout Visualization

Figure 3 above shows the Force Atlas Layout with 444 nodes and 5217 edges, with color properties applied. The layout uses the following settings: Autostab strength is 2000, Repulsion strength is 1000, Attraction strength is 1, Gravity is 100, and Attraction distribution is checked. Force Atlas is a spatial layout algorithm in the force-directed category and is suited to small-world networks (Khokhar 2015).

· Force Atlas 2 Layout :

Figure 4. Force Atlas 2 Layout Visualization

The graph above uses the Force Atlas 2 Layout with an “indegree range” setting of 289. The context panel reports a directed graph with 433 nodes (97.52% visible) and 627 edges (12.02% visible). Force Atlas 2 is the newer version of the Force Atlas Layout, scaled for small to medium-sized graphs. It is very fast because it offers more options and innovative optimizations, and it resolves shortcomings of the previous version by balancing the quality of the final visualization against the speed of computation. The complexity drops from O(N²) to O(N log N) because the direct-sum simulation used in Force Atlas is replaced with a Barnes-Hut simulation (Khokhar 2015). The Force Atlas 2 layout has the following settings.

The LinLog mode option switches the force model from lin-lin (the default) to lin-log, which makes clusters tighter. In this layout, scaling is 100 and edge weight influence is set to 0, which calculates forces without edge weights. This algorithm works efficiently with large networks.
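To make these settings concrete, here is a minimal sketch of the force magnitudes as they are commonly described for ForceAtlas2: attraction grows with edge length (or with its logarithm in LinLog mode) and is multiplied by the edge weight raised to the edge weight influence, while repulsion is scaled by node degrees. The function names are hypothetical and this is not Gephi's source code.

```python
import math


def attraction(distance, weight=1.0, edge_weight_influence=1.0, lin_log=False):
    """Attraction magnitude along an edge of the given length."""
    # LinLog mode replaces the linear term with a logarithmic one.
    base = math.log(1.0 + distance) if lin_log else distance
    # With influence 0 the factor is weight ** 0 == 1, so weights are ignored.
    return (weight ** edge_weight_influence) * base


def repulsion(distance, degree_a, degree_b, scaling=100.0):
    """Repulsion magnitude between two nodes, larger for high-degree nodes."""
    return scaling * (degree_a + 1) * (degree_b + 1) / distance


print(attraction(50.0, weight=3.0, edge_weight_influence=0.0))  # weight ignored
print(attraction(50.0, weight=3.0, edge_weight_influence=1.0))  # weight applied
print(attraction(50.0, lin_log=True))                           # weaker pull on long edges
print(repulsion(50.0, degree_a=2, degree_b=4))
```

In LinLog mode long edges pull less strongly than in the default mode, which is one way to read the tighter clusters noted above.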

· Fruchterman Reingold Layout :

Figure 5. Fruchterman Reingold Layout Visualization

The layout above is the Fruchterman Reingold Layout with appearance properties applied. It models the graph as a system of a large number of particles: nodes act as mass particles and edges behave as springs between the nodes. The algorithm is very slow, but it tries to minimize the energy of this system. It is a force-directed layout with O(N²) complexity. The layout above has the following settings: Area is 10000.0, Gravity is 10 (gravity attracts all nodes to the center to avoid dispersion of disconnected components), and Speed is 1.0. There are 433 nodes (97.52% visible) and 627 edges (12.02% visible), and the degree range setting is 376. This is a directed graph. The disadvantage of this algorithm is that, because of its high computational complexity, it works better with small and medium-sized graphs than with large ones.
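The particle-and-spring idea can be sketched in a few lines of plain Python. This is a simplified illustration of the classic Fruchterman-Reingold scheme, not Gephi's implementation: it omits the Gravity and Speed settings, and the iteration count, starting temperature and cooling factor are assumptions.

```python
import math
import random


def fruchterman_reingold(nodes, edges, area=10000.0, iterations=50, seed=0):
    """Return {node: [x, y]} after a fixed number of spring-embedder iterations."""
    rng = random.Random(seed)
    side = math.sqrt(area)
    k = math.sqrt(area / len(nodes))      # ideal distance between nodes
    pos = {n: [rng.uniform(0.0, side), rng.uniform(0.0, side)] for n in nodes}
    temperature = side / 10.0             # caps how far a node moves per iteration

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes: the O(N^2) part.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[a][0] += dx / dist * push
                disp[a][1] += dy / dist * push
                disp[b][0] -= dx / dist * push
                disp[b][1] -= dy / dist * push

        # Attraction along each edge: the "springs".
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            disp[a][0] -= dx / dist * pull
            disp[a][1] -= dy / dist * pull
            disp[b][0] += dx / dist * pull
            disp[b][1] += dy / dist * pull

        # Move each node, limited by the current temperature, then cool down.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.95

    return pos


# Hypothetical toy graph: a path a-b-c plus an isolated node d.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c")]
positions = fruchterman_reingold(toy_nodes, toy_edges)
for node, (x, y) in positions.items():
    print(f"{node}: ({x:.1f}, {y:.1f})")
```

The nested loop over all node pairs is what gives the O(N²) cost per iteration mentioned above.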

Figure 6. Fruchterman Reingold Layout (Customized) Visualization

The visualization above shows the Fruchterman Reingold Layout with a topology filter applied (out-degree range 0-10). The layout uses the following settings: Area 10000.0, Gravity 10.0, Speed 1.0.
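A degree-range filter of this kind can be sketched as follows, assuming it keeps the nodes whose out-degree falls inside the range together with the edges between them. The helper names and the toy graph are hypothetical. The last line reproduces the visibility percentages quoted earlier from the node and edge counts in the article (433 of 444 nodes, 627 of 5217 edges).

```python
def out_degree_filter(nodes, edges, low, high):
    """Keep nodes with out-degree in [low, high] and the edges between them."""
    out_degree = {n: 0 for n in nodes}
    for source, _target in edges:
        out_degree[source] += 1
    kept_nodes = [n for n in nodes if low <= out_degree[n] <= high]
    kept = set(kept_nodes)
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept_nodes, kept_edges


def percent_visible(visible, total):
    """Share of the full graph that remains visible, as a percentage."""
    return round(100.0 * visible / total, 2)


# Hypothetical toy graph: node "a" has out-degree 3 and is filtered out.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
kept_nodes, kept_edges = out_degree_filter(nodes, edges, low=0, high=2)
print(kept_nodes, kept_edges)
print(percent_visible(len(kept_nodes), len(nodes)), "% of nodes visible")

# Counts reported in the article.
print(percent_visible(433, 444), percent_visible(627, 5217))
```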

· OpenOrd Layout :

Figure 7. OpenOrd Layout Visualization

The visualization above shows the OpenOrd Layout for the cars dataset. This layout takes undirected weighted graphs and produces more clearly distinguishable clusters. OpenOrd speeds up computation by running in parallel, and it stops automatically when it is finished. The algorithm is originally based on Fruchterman-Reingold. It works with a fixed number of iterations controlled by a simulated-annealing-type schedule (liquid, expansion, cool-down, crunch, and simmer). To separate clusters, long edges are cut. There are 444 nodes and 5217 edges in the layout. The OpenOrd layout has the following settings.

Figure 8. OpenOrd Layout Settings

This visualization method is very helpful for large graphs.

Figure 9. OpenOrd Layout with Labels

The visualization above is the OpenOrd Layout with an out-degree range setting of 0-10. The layout uses the technical settings shown in Figure 8.

· Circular Layout :

Figure 10. Circular Layout with Labels

The figure above shows the Circular Layout of the cars dataset. This is a directed graph with 61.05% node visibility and 56.62% edge visibility. The out-degree range is set to 0-8. The layout settings are shown in Figure 11 below. The complexity of the algorithm is O(N).

Figure 11. Circular Layout Technical Settings
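The O(N) cost of a circular layout follows from the fact that each node is placed once, at an evenly spaced angle. A minimal sketch, with a hypothetical function name and an assumed radius:

```python
import math


def circular_layout(nodes, radius=100.0):
    """Place nodes evenly on a circle in a single pass over the list."""
    count = len(nodes)
    positions = {}
    for index, node in enumerate(nodes):
        angle = 2.0 * math.pi * index / count
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


circle = circular_layout(["a", "b", "c", "d"])
for node, (x, y) in circle.items():
    print(f"{node}: ({x:.1f}, {y:.1f})")
```

Sorting the node list first, for example by degree, changes the order around the circle without changing the cost of the placement step itself.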

· MultiGravity Force Atlas Layout :

Figure 12. MultiGravity Force Atlas Layout

Figure 12 shows the MultiGravity Force Atlas Layout of the cars dataset. The number of threads is 3, Tolerance (speed) is 1.0, Approximate Repulsion is checked and Approximation is set to 1.2. The tuning settings are Scaling 2.0, Gravity X and Y scaling 2.5, and Gravity 1.0. Under Behavior Alternatives, LinLog mode and Prevent Overlap are both checked. Edge Weight Influence is set to 1.0. Appearance settings for nodes and edges are also applied to give the graph a better visual effect.

Figure 13a. MultiGravity Force Atlas Layout with type labels

The figure above shows the vehicle types: small, sporty, compact, large sedan, RWD, AWD, pickup, wagon, minivan, and SUV. The layout is the MultiGravity Force Atlas Layout.

Figure 13b. MultiGravity Force Atlas Layout depth 1

The figure above shows the relationships of all vehicles of type “Wagon” in the MultiGravity Force Atlas Layout with depth 1. The degree range is set to 7-281 using topology filters, and Gravity is set to 1.0 for this layout.

Figure 14. SUV Cars Relationships

The figure above shows all vehicles that have the “SUV” car type. The degree range is 7-281, and the layout is MultiGravity Force Atlas 2.

· Yifan Hu Layout :

Figure 15. Yifan Hu Layout

The figure above shows the Yifan Hu layout with appearance properties for nodes and edges. The layout properties are: Optimal Distance 100.0 (to place nodes further apart), Relative Strength 0.2, Initial Step Size 20.0, Step Ratio 0.95 (for a better balance of quality against speed), and Adaptive Cooling checked. Yifan Hu is a very fast force-directed algorithm that works well with large datasets; its complexity is O(N log N). The complexity is reduced because forces are computed individually only for nearby nodes, while distant groups of nodes are approximated. The algorithm uses an “adaptive cooling scheme”, which helps it converge faster and avoid getting stuck in local minima (Khokhar 2015).
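The adaptive cooling idea can be sketched as a step-size update: the step shrinks by the step ratio whenever the system energy fails to improve, and grows again after a run of consecutive improvements. The function name and the threshold of five improvements are assumptions for illustration; the initial step of 20.0 and ratio of 0.95 are the settings listed above.

```python
def update_step(step, energy, previous_energy, progress, step_ratio=0.95):
    """Return the new (step, progress) after one layout iteration."""
    if energy < previous_energy:
        # The layout improved: count it, and enlarge the step after a streak.
        progress += 1
        if progress >= 5:          # assumed streak length
            progress = 0
            step /= step_ratio
    else:
        # The layout got worse: reset the streak and cool down.
        progress = 0
        step *= step_ratio
    return step, progress


step, progress = 20.0, 0
# Hypothetical energy values: five improvements followed by one setback.
energies = [100.0, 90.0, 80.0, 70.0, 60.0, 50.0, 55.0]
for previous, current in zip(energies, energies[1:]):
    step, progress = update_step(step, current, previous, progress)
    print(f"energy {current:5.1f} -> step {step:.3f}")
```

Letting the step grow again after steady progress is what allows the layout to move quickly while it is improving and slow down only when it stalls.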

Conclusion

For the 2004 cars dataset above, the visualizations give a brief picture of the connections among the 19 variables recorded for new vehicles. The different layouts represent variable values such as fuel efficiency, length, weight, price, and type of vehicle as edges of the graph. Graph techniques are fast and well suited to general graphs with different variations; they are widely used in practice and easy to implement. Gephi is an excellent tool for producing different graph layouts, with topology filters and other settings for getting the desired outcome. In terms of performance, force-directed methods provide good layouts but do not scale well to large graphs (those with more than thousands of nodes), and they are slow, with a computational cost of O(N³) or O(N² log N). Various other methods and tools are available for visualizing different kinds of data and understanding its nature.
