Using DataDistillr to Clean and Analyze Data (Startup Pt. 15)

Charles Givre

Experienced cyber security data scientist and data engineer. CISSP | Ex CIA, JP Morgan. GenAI | NLP | Python | SQL | Java | Speaker | Blackhat Instructor and O'Reilly Author | Classic car enthusiast.

Published: July 13, 2022

My Medium feed has thankfully returned to tech articles, and one that popped up caught my attention: "Data Analysis project- Using SQL to Clean and Analyse Data". I created my startup, DataDistillr, to help in exactly such situations, and I wondered whether we could accomplish the same tasks with less time and effort. In that article, the author takes some data from Real World Fake Data in a CSV file and does the following:

Step 1. Create a MySQL database from the CSV file
Step 2. Load the CSV file into the database
Step 3. Clean the data
Step 4. Exploratory Data Analysis
Step 5. Create a dashboard

I thought that walking through the same use case in DataDistillr would be a great example of how it can accelerate your time to value.

Step 1: Skip Steps One and Two

Here's the cool part. Using DataDistillr, we can skip steps one and two. All we have to do is upload the file. Technically, since this data is hosted on data.world, we could query it directly without even doing that, but for this example, we're going to download the data and upload it to DataDistillr. Since uploading is very straightforward, we won't walk through it here, but you can read about how to upload files in the DataDistillr documentation.

You can see in the screenshot above that DataDistillr automagically figured out the schema of the CSV file. By running a simple SELECT * query, we can sample the data and see what the first rows look like as well.

Step 3: Exploratory Data Analysis

The next step is figuring out the shape of the data. The original author does this by running some INFORMATION_SCHEMA queries. That is not necessary in DataDistillr, as the columns are viewable in the nav tree, and simply running the original SELECT * query will tell you how many rows you have.
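To make "the shape of the data" concrete outside of any particular tool, here is a minimal Python sketch that reads a CSV header and counts rows. It is not DataDistillr code. The sample rows and the `describe_shape` helper are invented for illustration; only the two column names come from the query later in this article.

```python
import csv
import io

# Made-up sample standing in for the call-center CSV. Only the column
# names appear in the article; the values are invented.
SAMPLE_CSV = """call_timestamp,call_duration_in_minutes
10/01/2020,12
10/01/2020,45
10/02/2020,30
"""


def describe_shape(csv_text):
    """Return (column_names, row_count) for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)  # consumes the data rows; the header is not counted
    return list(reader.fieldnames), len(rows)


columns, row_count = describe_shape(SAMPLE_CSV)
print(columns, row_count)
```

This is the same information the INFORMATION_SCHEMA queries and a row count would give you in a database.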

Next, the author counts the values in some columns by running a number of SELECT DISTINCT queries. DataDistillr saves you time on this front as well. As we are scanning the data, DataDistillr calculates summary descriptive statistics about each column, as shown on the left.

As you can see, this is a major time saver in that you don't have to run additional queries for every column that you wish to summarize.
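The idea of summarizing every column in one scan, rather than one SELECT DISTINCT query per column, can be sketched in a few lines of Python. This is only an illustration of the single-pass idea, not how DataDistillr computes its statistics. The sample values and the `summarize_columns` helper are hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample; only the column names come from the article.
SAMPLE_CSV = """call_timestamp,call_duration_in_minutes
10/01/2020,12
10/01/2020,45
10/02/2020,30
"""


def summarize_columns(csv_text):
    """Count values for every column in a single pass over the rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counters = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            counters[name][value] += 1
    return {
        name: {
            "distinct": len(counter),
            # (value, count) of the most frequent value, or None if no rows
            "most_common": counter.most_common(1)[0] if counter else None,
        }
        for name, counter in counters.items()
    }


summary = summarize_columns(SAMPLE_CSV)
for column, stats in summary.items():
    print(column, stats)
```

Each row is read once and every column's counter is updated along the way, which is why no extra query per column is needed.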

Once this is done, the author executes a series of queries that aggregate and count values in columns. Some of these can be answered from the column summary view, and the others are exactly the same in DataDistillr, so we'll skip them here. As you can see in the screenshot below, the results are basically the same.

For the last query, the author uses a window function to figure out the maximum call duration per day. I wasn't sure why he chose to do this, as the query below does the trick without window functions. I ran this query and got different results from the author's. After further investigation, I found that the author's query with the window function was not correct. I'm not sure what it was doing, but what I found was that for most days, the max call length was 45 min.

-- Maximum call duration for each call_timestamp value, longest first
SELECT call_timestamp,
       MAX(CAST(call_duration_in_minutes AS INT)) AS max_call_length
FROM `demoupload`.`Call-Center.csv`
GROUP BY call_timestamp
ORDER BY max_call_length DESC
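If you want to try the shape of this query without a DataDistillr account, the sketch below runs an equivalent GROUP BY in an in-memory SQLite database from Python. SQLite stands in for the real query engine here, and the `calls` table name, the `max_call_length_per_day` helper, and the rows are all invented for illustration. The example also shows one reason the CAST matters: values read from a CSV are text, and MAX on text compares strings rather than numbers.

```python
import sqlite3

# Invented rows. Durations are stored as text, as they would be when read
# straight from a CSV file.
ROWS = [
    ("10/01/2020", "9"),
    ("10/01/2020", "45"),
    ("10/02/2020", "30"),
    ("10/02/2020", "7"),
]


def max_call_length_per_day(rows, cast_to_int=True):
    """Return (call_timestamp, max_call_length) pairs, one per timestamp."""
    value = (
        "CAST(call_duration_in_minutes AS INT)"
        if cast_to_int
        else "call_duration_in_minutes"
    )
    query = f"""
        SELECT call_timestamp, MAX({value}) AS max_call_length
        FROM calls
        GROUP BY call_timestamp
        ORDER BY call_timestamp
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE calls "
            "(call_timestamp TEXT, call_duration_in_minutes TEXT)"
        )
        conn.executemany("INSERT INTO calls VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


with_cast = max_call_length_per_day(ROWS)
without_cast = max_call_length_per_day(ROWS, cast_to_int=False)
print(with_cast)
print(without_cast)
```

With the cast, the per-day maxima are the numeric ones (45 and 30). Without it, the text comparison picks "9" and "7", because "9" sorts after "45" as a string.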

Step 5: Visualizing The Results

The final step is to visualize the results. The author used Tableau for this and created a public dashboard. DataDistillr has a web data connector for Tableau, so you could do exactly the same thing using DataDistillr as your data source. Alternatively, you can visualize the results right in DataDistillr. The screenshot below demonstrates what that might look like.

TL;DR

As you can see, with DataDistillr you can quickly analyze tabular data, such as that found in CSV files, using standard SQL, without having to create databases or move data around. In the next example, I'll show you how you can pull data directly from an API and do the same thing! If you are interested in kicking the tires, please submit the form for our private beta!

Alvinus Melius, BSc, MSc

Educator @ Government of Saint Lucia | Junior Machine Learning Engineer @ Omdena

2 years ago

DataDistillr seems like a neat solution. It seems to save time and headaches. It would be awesome to try it.
