Using DataDistillr to Clean and Analyze Data (Startup Pt. 15)

My Medium feed has thankfully returned to tech articles, and one popped up that caught my attention: "Data Analysis project- Using SQL to Clean and Analyse Data". I created my startup, DataDistillr, to help in exactly such situations, and I wondered whether we could accomplish the same tasks with less time and effort. In that article, the author takes some data from Real World Fake Data in a CSV file and does the following:

  • Step 1. Create a MySQL database from the CSV file
  • Step 2. Load the CSV file into the database
  • Step 3. Clean the data
  • Step 4. Exploratory Data Analysis
  • Step 5. Create a dashboard

I thought that walking through this use case in DataDistillr would be a great example of how it can accelerate your time to value.

Step 1: Skip Steps One and Two

Here’s the cool part. Using DataDistillr, we can skip steps one and two. All we have to do is upload the file. Technically, since this data is hosted on data.world, we could query it directly without even uploading anything, but for this example, we’re going to download the data and upload it to DataDistillr. Since uploading is very straightforward, we won’t walk through it here, but you can read about how to upload files in the DataDistillr documentation.

[Screenshot: DataDistillr showing the schema it detected for the uploaded CSV file]

You can see in the screenshot above that DataDistillr automagically figured out the schema of the CSV file. By running a simple SELECT * query, we can sample the data and see what the first rows look like as well.
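
To make "figuring out the schema and sampling the first rows" concrete, here is a minimal sketch of the same idea outside DataDistillr, using only Python's standard library. It is not how DataDistillr works internally, which this article does not describe. The sample rows are invented for illustration, and the `channel` column and the `peek` helper are hypothetical; only `call_timestamp` and `call_duration_in_minutes` come from the query later in this article.

```python
import csv
import io
from itertools import islice

# Invented sample rows for illustration only; not the real dataset.
# The "channel" column is a hypothetical name.
SAMPLE_CSV = """call_timestamp,call_duration_in_minutes,channel
2020-10-01,17,Call-Center
2020-10-01,45,Chatbot
2020-10-02,8,Email
"""


def peek(csv_text, n=2):
    """Return the column names and the first n rows of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Accessing fieldnames reads the header line, which gives the column names.
    columns = list(reader.fieldnames or [])
    # Roughly what a SELECT * with a small row limit would show.
    first_rows = list(islice(reader, n))
    return columns, first_rows


columns, first_rows = peek(SAMPLE_CSV)
print(columns)
for row in first_rows:
    print(row)
```

Note that every value comes back as text here, which is one reason a query over CSV data may need an explicit CAST before doing numeric work.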

Step 3: Exploratory Data Analysis

The next step is figuring out the shape of the data. The original author does this by running some INFORMATION_SCHEMA queries. That is not necessary in DataDistillr, as the columns are viewable in the nav tree, and simply running the original SELECT * query will tell you how many rows you have.

[Screenshot]


The author’s next step is to count the values in some columns by running a number of SELECT DISTINCT queries. DataDistillr saves you time on this front as well. As we are scanning the data, DataDistillr calculates summary descriptive statistics about each column, as shown on the left.

As you can see, this is a major time saver in that you don’t have to run additional queries for every column that you wish to summarize.
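
For a sense of what a per-column summary saves you, here is a rough sketch that computes distinct counts and the most frequent values for every column in a single pass over the rows, instead of one SELECT DISTINCT query per column. This is only an illustration under stated assumptions, not DataDistillr's implementation: the sample data is invented, and the `channel` column and the `summarize_columns` helper are hypothetical names.

```python
import csv
import io
from collections import Counter

# Invented sample rows for illustration only; "channel" is a hypothetical column.
SAMPLE_CSV = """call_timestamp,call_duration_in_minutes,channel
2020-10-01,17,Call-Center
2020-10-01,45,Chatbot
2020-10-02,8,Email
2020-10-02,45,Call-Center
"""


def summarize_columns(csv_text, top_n=3):
    """Build a small summary (row count, distinct count, top values) per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    column_names = list(rows[0].keys()) if rows else []
    summary = {}
    for col in column_names:
        counts = Counter(row[col] for row in rows)
        summary[col] = {
            "rows": len(rows),
            "distinct": len(counts),              # like COUNT(DISTINCT col)
            "top": counts.most_common(top_n),     # most frequent values first
        }
    return summary


summary = summarize_columns(SAMPLE_CSV)
for col, stats in summary.items():
    print(col, stats)
```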

Once this is done, the author executes a series of queries which involve aggregating and counting values in columns. Some of these can be accomplished with the column summary view, but others are exactly the same in DataDistillr so we’ll skip them here. As you can see in the screen shot below, the results are basically the same.

[Screenshot: query results in DataDistillr]

In the last query, the author uses a window function to figure out the maximum call duration per day. I wasn’t sure why he chose to do this, as the query below does the trick without window functions. I ran this query and got different results from the author’s. After further investigation, I found that the author’s query with the window function was not correct. I’m not sure what it was doing, but I found that for most days, the max call length was 45 minutes.

SELECT call_timestamp,
       MAX(CAST(call_duration_in_minutes AS INT)) AS max_call_length
FROM `demoupload`.`Call-Center.csv`
GROUP BY call_timestamp
ORDER BY max_call_length DESC
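
If you want to convince yourself that a plain GROUP BY is enough here, the sketch below runs the query above against a tiny in-memory SQLite table and compares it with one correct way to write the same thing using a window function. A few loud assumptions: the rows are invented, the table name `calls` is hypothetical, SQLite stands in for DataDistillr's SQL engine, and the window query is my own formulation, not the original author's query (which is not reproduced in this article). Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Values are stored as text, mimicking data that comes straight from a CSV file.
conn.execute(
    "CREATE TABLE calls (call_timestamp TEXT, call_duration_in_minutes TEXT)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [  # invented sample rows, for illustration only
        ("2020-10-01", "17"),
        ("2020-10-01", "45"),
        ("2020-10-02", "8"),
        ("2020-10-02", "30"),
    ],
)

# Same shape as the query in the article: one row per day via GROUP BY.
group_by_sql = """
SELECT call_timestamp,
       MAX(CAST(call_duration_in_minutes AS INT)) AS max_call_length
FROM calls
GROUP BY call_timestamp
ORDER BY max_call_length DESC
"""

# A window-function equivalent: the per-day maximum is attached to every row,
# so DISTINCT is needed to collapse the result back to one row per day.
window_sql = """
SELECT DISTINCT call_timestamp,
       MAX(CAST(call_duration_in_minutes AS INT))
           OVER (PARTITION BY call_timestamp) AS max_call_length
FROM calls
ORDER BY max_call_length DESC
"""

group_by_rows = conn.execute(group_by_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()
print(group_by_rows)
print(window_rows)
```

Both queries return the same per-day maximums on this sample. The CAST matters because the durations are text: without it, SQLite would compare the values as strings, and "8" would sort above "30".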

Step 5: Visualizing The Results

The final step is to visualize the results. The author used Tableau for this and created a public dashboard. DataDistillr has a web data connector for Tableau, so you could do exactly the same thing using DataDistillr as your data source. Alternatively, you can visualize the results right in DataDistillr. The screenshot below demonstrates what that might look like.

[Screenshot: the results visualized in DataDistillr]

TL;DR

As you can see, with DataDistillr, you can quickly analyze tabular data, such as that found in CSV files, using standard SQL, without having to create databases or move data around. In the next example, I’ll show you how you can pull data directly from an API and do the same thing! If you are interested in kicking the tires, please submit the form for our private beta!
