Google Data Analyst Capstone Project: How does a bike-share navigate speedy success?

Hi everyone, these are the things I encountered while analysing data for my Google Data Analyst Practitioner Certificate. The Capstone project is where you put all of your skills learnt during the course together. It's the first time you get to analyze data. So here are some insights!

Source Data:

I had to analyze a year's worth of bike ride data. There are 12 csv files, 1 for each month. I choose June 2022 to May 2023. The data was quite sizeable.

For realistic estimates sake, when combined (using R Desktop), the combined file was over 1.3 gigabytes in size. There are over 5 million rows of data. That's a lot of data. Each row represents a bicycle ride.

What I did:

I looked at the data via Excel, and tried to get a feel of it. I attempted to combine it into a data source, using Microsoft Power Query: (for those of you, it's a feature in Microsoft Excel and PowerBI). However, the long loading times and times my laptop hung meant that I needed something a little more reliable. I decided to use R Desktop to manage my data.

Power Query Editor, it's probably useful for smaller datasets. Not 5 million rows worth.

Using R-Desktop

After reading thru a few other articles written by my coursemates who have completed their projects, I decided to try it out on my own to combine the data into 1 CSV file, and to clean up and filter out any abnormalities.

Here are the major steps I took in R Desktop:

  • Loaded the major packages and libraries like dplyr, ggplot2, lubridate, and data.table.
  • Set the working directory to my source file directory.
  • Using lapply() and rbindlist(), combine the data into a single data frame called merged_data.
  • I had a bit of issue with merging the data as not all the files and column names were of the same data type. But managed to remedy it by redefining the data type and changing the format.
  • After that, using R Desktop, regenerate new columns. These columns are: Day_name, trip_duration, trip_duration_hms, month_name, start_hour, start_time_classification

new columns

  • Do note that MS Power Query also has the feature to create day names and month names and start hour timings, however they don't have the feature to automatically allocate classifications for timings unless you do Index Match or Vlookup.
  • Some trips only last less than a minute. That is not exactly correct. We will remove them or filter them out. Filter out all trip data below one minute.
  • Next we will also check for duplicates within the TripID. No duplicates found.
  • Next we will output all the data (filtered data, that is, with all negative values removed, all sub 1 min trips removed.)
  • The command is write.csv(unique_filtered_data, "unique_filtered_data.csv", row.names = FALSE)

I am largely finished with R Desktop at this point in time, so I decided to try MS Power Query on Excel, and using Pivot Tables for visualizations and data viewing. The reason why I decided to use MS Excel and Power Query is because, we are all familiar with Excel, and with Excel, even if we are limited to about a million rows, Pivot Tables are still very useful.


Using Excel and Power Query

Now that the csv file is combined (after 15 minutes of loading, on my old laptop), we will proceed to get the data. I'm showing you the button that is used to get the data from our combined csv file.

  • On Power Query, click Transform and Load
  • Select: Connection Only
  • Close Power Query, and when it's done, click on Insert Pivot Table



And now, we'll use Insert Pivot Table to add in our data, and manipulate what you want to see.

in general, shorter duration rides for electric bikes, longer for classic bikes.

Once you have reached the stage of Pivot Tables, you are able to create Pivot Charts from them, and you should be able to manipulate and present your findings.

I also referred to and 'discussed' my process with ChatGPT and also read other examples of accounts that are published.

Good luck!


Adaku Obikwere

PRINCE2?| PMD-Pro| Assistant Manager - Agricultural Value Chain Production, Integration and Services at NIRSAL Connect

11 个月

Congratulations ....... This will be helpful for my capstone project

Torben Nils von Spreckelsen ????

Question to question more. Ai Integration Advocate | Transcultural Coach Posting Daily Provocations

1 年

Congratulations

回复

要查看或添加评论,请登录

Julian Lim的更多文章

  • My Part in Busting Another Scam

    My Part in Busting Another Scam

    Dear friends, this is the time I busted a 2nd scam, and saved my client possibly tens of thousands of dollars. One…

    10 条评论
  • Planning for the Probable vs Planning for the Confirmed

    Planning for the Probable vs Planning for the Confirmed

    Dear friends, many of you would have seen or heard of that old saying, 'Failure to plan is planning to fail'. In my…

  • The Building Blocks of Prosperity

    The Building Blocks of Prosperity

    I’m going to let you in on a little secret. It is the secret to wealth.

社区洞察

其他会员也浏览了