Exploratory Data Analysis Using D-Tale Library

Amit Jain

Actively looking for new job | 7.2+ YoE as a Data Scientist

Published: November 11, 2021

D-Tale for interactive data exploration

D-Tale is a Python library that lets us visualize a pandas DataFrame through an interactive graphical interface.

D-Tale presents a variety of details about the data provided. It supports a range of file formats, including CSV, TSV, XLS, and XLSX. It is built with a Flask back end and a React front end.

There are two ways in which we can start a D-Tale interface and load the data in Jupyter notebooks:

1. Pass the dataframe object to the dtale.show function. This renders the GUI inside the Jupyter cell itself.

import dtale
import pandas as pd

dataset = pd.read_csv("eda_train.csv")

dtale.show(dataset)

2. Initialize the D-Tale interface without passing a dataframe. This opens a menu in the GUI for loading data, along with various other options.

import dtale

dtale.show(open_browser=True)

As soon as we run this code, we get this GUI menu:

First-time interface

Here we have the following options:

Loading data from a file
Loading data from websites. Here we need to pass the link of the website from where we can fetch files such as CSV, JSON, TSV, or Excel.
Loading sample datasets. These datasets may require some background downloading to fetch the datasets from the server.

Once we load the dataset, a table is displayed just like a pandas dataframe. Every cell of this table can be edited, so we can change values directly, just as in Excel.

Column Menu Functions

Whenever we click on a column header, we get a list of options that depends on the type of data the column contains: string, integer, or datetime. All three types share the option to sort in ascending or descending order. Beyond that, each data type has its own filtering approach.

String columns have no Heat Map or Variance Report option, but they do have a Clean Columns option that is not available for integer and datetime columns.

1. Lock

The lock option pins the column to the left of the screen, allowing us to freely scroll to other columns without the locked column being displaced. This is useful when we want to compare columns that are placed far apart.

2. Hide and Delete

The hide option removes the column from the dataframe view. It is not deleted from the actual dataframe. We can simply unhide the column from the top right strip.

The delete option will remove the column from the dataframe permanently. It is similar to the pandas drop function. In the backend, it is iterating over the list of columns to select the column to delete from the dataframe.
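The difference between hiding and deleting can be sketched in plain pandas. This is only an illustration with a made-up dataframe (the column names are hypothetical), not D-Tale's internal code.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [25, 32, 47],
    "city": ["Pune", "Delhi", "Goa"],
})

# "Hide": build a view without the column; the original dataframe is untouched.
visible = df[[col for col in df.columns if col != "city"]]

# "Delete": drop the column. Assigning the result back to df would make
# the removal permanent for the rest of the session.
after_delete = df.drop(columns=["city"])
```

After hiding, `df` still contains `city`; after a delete that is assigned back, the column is gone.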

3. Replacements and Type conversion

The replacement option replaces some values of a column with a constant or NaN. We can make the replacement in place or write the result to a separate column. The replacement type can be specific values, whitespace, or a specific substring.
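A rough pandas equivalent of such a replacement is shown below. The data and the column names `grade` and `grade_clean` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a placeholder value and a whitespace-only entry.
df = pd.DataFrame({"grade": ["A", "unknown", "B", " "]})

# Replace a specific value with NaN and store the result in a new column,
# leaving the original column as it was.
df["grade_clean"] = df["grade"].replace("unknown", np.nan)

# Replace whitespace-only strings with NaN as well (regex-based replacement).
df["grade_clean"] = df["grade_clean"].replace(r"^\s*$", np.nan, regex=True)
```

Assigning the result back to `df["grade"]` instead would correspond to an in-place replacement.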

4. Describe

The describe function in pandas provides a statistical summary of a column or of the whole dataset. The Describe option here works the same way but provides considerably more information than the pandas function. Named Column Analysis in the interface, it produces a summary tailored to each data type.

It also generates a histogram and a value-counts graph for each feature.

For integer columns, it provides measures of central tendency and spread, the frequency of the most frequent value, and kurtosis and skewness. It also presents the data as a box plot, histogram, value-counts plot, and Q-Q plot.

For string columns, it provides the most frequent word and its frequency, a detailed summary of the characters present, a word value-counts plot, and a value-counts plot.
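For comparison, several of the numeric statistics mentioned above can be computed by hand in pandas. This is a minimal sketch on made-up values; it does not reproduce D-Tale's full report.

```python
import pandas as pd

# Hypothetical integer column with one large value on the right tail.
s = pd.Series([1, 2, 2, 3, 4, 5, 20], name="value")

summary = s.describe()            # count, mean, std, min, quartiles, max
skewness = s.skew()               # asymmetry of the distribution
kurtosis = s.kurt()               # excess kurtosis (tail heaviness)
counts = s.value_counts()         # frequency of each distinct value
most_frequent = counts.idxmax()   # the most frequent value
```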

5. Filters

Filters create a subset of the data. Filtering in D-Tale only requires us to specify the type of filter we want.
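The same kind of subsetting can be written in pandas with a boolean mask or a query string. The dataframe and thresholds here are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Pune", "Delhi", "Goa", "Pune"],
})

# Boolean-mask filter: a numeric condition combined with a categorical match.
subset = df[(df["age"] > 30) & (df["city"].isin(["Pune", "Goa"]))]

# The same filter expressed as a query string.
subset_q = df.query("age > 30 and city in ['Pune', 'Goa']")
```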

6. Variance Report

This option is not available for string-type values. Variance report shows whether the feature has low variance or not. It decides this based on two checks:

Count of unique values in a feature / sample size < 10%
Count of most common value / Count of second most common value > 20

It displays the result along with the calculations and a histogram that presents the findings.

7. Clean columns
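The two checks listed above can be sketched as a small pandas function. The function name is hypothetical, and combining the checks with a logical AND is an assumption; the text above only states that both checks are evaluated.

```python
import pandas as pd

def low_variance_checks(values: pd.Series) -> dict:
    """Evaluate the two low-variance checks for a single column."""
    values = values.dropna()
    if values.empty:
        raise ValueError("column has no non-missing values")

    counts = values.value_counts()  # sorted from most to least common

    # Check 1: count of unique values / sample size < 10%
    unique_ratio = len(counts) / len(values)
    check1 = bool(unique_ratio < 0.10)

    # Check 2: count of most common value / count of second most common > 20
    if len(counts) >= 2:
        top_ratio = float(counts.iloc[0] / counts.iloc[1])
        check2 = top_ratio > 20
    else:
        top_ratio = None  # no second value to compare against
        check2 = False

    return {
        "unique_ratio": unique_ratio,
        "top_ratio": top_ratio,
        "check1": check1,
        "check2": check2,
        # Assumption: a column is flagged only when both checks pass.
        "low_variance": check1 and check2,
    }

# One dominant value: 3 unique values in 100 rows, and 95 / 3 > 20.
skewed = low_variance_checks(pd.Series([0] * 95 + [1] * 3 + [2] * 2))

# Every value distinct: neither check passes.
uniform = low_variance_checks(pd.Series(range(100)))
```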

This option is only available for string-type values. D-Tale provides a set of common text-cleaning methods that can be applied to the text. We simply select the methods we want to apply, and the work is done in the back end.
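As an illustration of what such cleaning steps amount to, here is a hand-written pandas chain. The particular steps and their order are assumptions chosen for the example, not the list of methods D-Tale offers.

```python
import pandas as pd

# Hypothetical text column with stray whitespace, punctuation, and digits.
s = pd.Series(["  Hello,  World! ", "D-Tale 123", None])

cleaned = (
    s.str.strip()                                # drop leading/trailing whitespace
     .str.lower()                                # normalise case
     .str.replace(r"[^\w\s]", "", regex=True)    # remove punctuation
     .str.replace(r"\d+", "", regex=True)        # remove digits
     .str.replace(r"\s+", " ", regex=True)       # collapse repeated whitespace
     .str.strip()                                # trim again after the removals
)
```

Missing values pass through the chain unchanged.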

8. Formats

With the option Formats we can define how the numbers are displayed.

Main Menu Options

The main menu offers almost all the options of the column menu, but in generalized form: we can apply operations to multiple columns in one place rather than manually picking them from the display. The options below are exclusive to the main menu or work differently there.

1. Build Column

This option allows us to create new features/columns out of the already available columns. We can create these new features by applying an arithmetic operation to a single column or by combining two columns. We can also specify the name of the new column and its data type.
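In pandas, building such columns looks roughly like this. The dataframe and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [3, 1, 2]})

# New column from an arithmetic operation on two existing columns.
df["revenue"] = df["price"] * df["quantity"]

# New column from one column and a constant, with an explicit data type.
df["price_cents"] = (df["price"] * 100).astype("int64")
```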

2. Summarize Data

In pandas, we summarize data with group-by operations or pivot tables, and we can do the same with this package. pandas requires us to write code for every group-by and pivot table, but in D-Tale we simply select the columns, the aggregation function, and the columns we want in the final dataset.
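For reference, this is the kind of pandas code that the GUI saves us from writing. The data is made up for the example.

```python
import pandas as pd

# Hypothetical sales data, for illustration only.
df = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "product": ["a", "b", "a", "b"],
    "sales": [10, 20, 30, 40],
})

# Group-by: total sales per region.
grouped = df.groupby("region", as_index=False)["sales"].sum()

# Pivot table: regions as rows, products as columns, summed sales as values.
pivot = df.pivot_table(index="region", columns="product",
                       values="sales", aggfunc="sum")
```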

3. Missing Analysis

D-Tale uses the missingno Python package to visualize the missing values in the dataset. It provides matrix, bar, heatmap, and dendrogram plots.
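The numbers behind such plots can be inspected with pandas alone. The sketch below uses made-up data and computes per-column missing counts plus the correlation of missingness between columns, which is the kind of quantity a missingness heatmap summarises. It does not call missingno itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data where columns a and b are never missing together.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column

# Correlation of missingness, restricted to columns with at least one gap.
nullity = df.isna().astype(int)
nullity_corr = nullity.loc[:, nullity.sum() > 0].corr()
```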

4. Charts

D-Tale uses Plotly to create interactive plots on the go. It offers line, bar, scatter, pie, word cloud, heatmap, 3D scatter, surface, map, candlestick, treemap, and funnel charts. Different types of data support different types of plots.

5. Highlighters

Highlighters draw attention to selected parts of the dataset. Just as we use stylers in pandas to bring out odd values, highlighters do the same job here. We can highlight missing values, data types, outliers, and value ranges.
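The cells that such highlighters would mark can be expressed as boolean masks in pandas. The 1.5 × IQR outlier rule and the range bounds below are assumptions chosen for the example; the text above does not state which outlier rule D-Tale applies.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value and one missing value.
s = pd.Series([10, 12, 11, 13, 12, 95, np.nan])

# Missing values.
missing_mask = s.isna()

# Outliers, using the common 1.5 * IQR rule (an assumption for this sketch).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outlier_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Values inside a chosen range (bounds are inclusive).
low, high = 11, 13
range_mask = s.between(low, high)
```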

6. Code Export and Data Export

All the operations we perform on our dataframe in D-Tale are automatically converted into equivalent Python/pandas/Plotly code. The code can be accessed by clicking the Export Code option available in every operation and chart GUI.

The code export option in the main menu captures all the changes done on the dataframe. We can directly export the final dataset after changes to CSV or TSV using the export option.
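Outside the GUI, the same CSV or TSV export can be done with pandas directly. This is a minimal sketch on a made-up dataframe.

```python
import pandas as pd

# Hypothetical final dataset, for illustration only.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# With no path given, to_csv returns the text instead of writing a file.
csv_text = df.to_csv(index=False)             # comma-separated
tsv_text = df.to_csv(index=False, sep="\t")   # tab-separated
```

Passing a file path as the first argument writes the output to disk instead.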

You can check my GitHub profile for code.

