My naïve Data analytics tutorial: Python vs. SQLite

Tired of Excel for number crunching? Want to check out other tools? Well, there are many great alternatives out there!

Why not bring powerful languages like SQL and Python into your data analysis universe?

In this tutorial, we will learn how to prepare an analytics environment on a Windows desktop, install development tools, and perform a simple data analysis of the kind you might otherwise do in Excel.

My assumption is that you are a total beginner in SQL and Python and that you are using the Microsoft Windows environment. That said, switching to Linux later is easy, since SQL and Python are fully multiplatform.

We will not go in-depth into either programming language; instead, we will work on a real analytics problem that can be solved with SQL or Python. After your analytics environment is prepared, we will take the five steps that are mandatory in any research with data:

  • Set your analytics goal – what is a real-life problem that needs research
  • Collect data – find sources of data
  • Prepare data – check data quality, integrity, completeness and perform data cleaning (avoid GIGO – “Garbage In, Garbage Out” situation)
  • Analyze data – the fun part
  • Interpret results – understand the results, do a reality check, go back to phase 2 or 3 if needed, create data visualizations, prepare actions

I think this is more than enough for the beginning of this naïve tutorial, and it is time to do something practical!

Ready? Here we go!

***

Step 1: Set your data analytics environment

a) Install SQLite on your Windows machine

SQLite is a relational database, and the special thing about it is that it is not a classical client-server database. Rather, it is embedded into the end program, and it can even run as a portable application without installation on your system.

SQLite is mostly compliant with the SQL standard, generally following PostgreSQL syntax. It is lightweight and can easily be used for small analyses and training.
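To illustrate what “embedded” means in practice: Python, which we will install in the next section, ships with SQLite in its standard library, so a database can be created and queried directly from code, with no server process at all. Here is a minimal sketch (the file and table names are just examples):

# Python's built-in sqlite3 module talks to SQLite directly - no server needed
import sqlite3

# connecting creates the database file if it does not exist yet
con = sqlite3.connect('example.db')
con.execute('CREATE TABLE IF NOT EXISTS demo (id INTEGER, name TEXT)')
con.execute("INSERT INTO demo VALUES (1, 'hello')")
print(con.execute('SELECT * FROM demo').fetchall())  # on the first run: [(1, 'hello')]
con.close()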

  • Go to the download page:

https://sqlitebrowser.org/dl/

  • Click on “DB Browser for SQLite - PortableApp” and download the archive file containing the SQLite portable application.

Locate the file “SQLiteDatabaseBrowserPortable_3.12.0_English.paf.exe” on your file system and double-click it to run the extraction program and unpack the portable application.

  • Select the location for the portable program and click “Install”.
  • When it is done, click “Finish” to close the dialog.
  • Optionally, you can create a desktop shortcut by going to the directory where the SQLite portable application was extracted and choosing “Desktop (create shortcut)” from the right-click menu.

b) Install the Anaconda distribution of Python on your Windows machine

My recommendation is to use the Anaconda distribution of Python for data analysis and other purposes. Anaconda is a free and open-source distribution of the Python programming language for scientific computing (data science, machine learning, etc.). Its main advantage is simplified package management and deployment (over 250 packages installed automatically, with over 7,500 additional open-source packages available).

  • On the Anaconda web site, scroll down to the “Download” section.
  • Click on the link for the “64-Bit Graphical Installer (466 MB)” version.

This will start the download, which can take a while.

  • Start the installation by double-clicking the installation file in your download directory.
  • Install Anaconda Individual Edition with default features.

Be patient, as this can take some time depending on your system performance. A lot of cool packages for data analytics are coming your way!

  • When it is done, click “Next”, “Next” and “Finish” to end the installation process.

If you have time, I encourage you to review the documentation, but since this is a quick tutorial, we will not do it now. Let’s instead use our data analytics environment to do some serious data processing!


Step 2: Set your data analytics goal

This is the shortest section, but it is the most important. Here we define the reason why we are starting this project task.

Our data analytics project will deal with the following problem:

  • Determine the effectiveness of COVID-19 testing per country by comparing the number of tests performed on a single day with the number of new cases.

Reasoning: A high detection rate (a large share of tests coming back positive) could mean that there is not enough testing, i.e., that significantly more people in the population are infected than have been detected. For example, 500 new cases out of 2,000 tests is a 25% detection rate, which would suggest that many infections are going undetected.

DISCLAIMER: The purpose of the following data analysis is educational, and it is performed on a dataset whose accuracy was not verified. The results of this analysis are therefore not to be interpreted for any purpose other than education, nor used as accurate reports on COVID-19 tests and cases.

I had to mention this since there is always a chance that the intention of this exercise could be misinterpreted. I wanted to show you how important data analysis is for real-life situations, and this is the main reason for using the COVID-19 dataset.

 

Step 3: Collect data

Sample data for the analysis can be found at this excellent web resource:

https://ourworldindata.org/coronavirus-testing


You need to download the ".CSV" text file that contains all the data in comma-separated format:

https://covid.ourworldindata.org/data/owid-covid-data.csv

On the “Our World in Data” web site you can find the current version of the data. However, I recommend using the version of the data that I used for this analysis, since it will be easier to follow the exact results and progress of the activities. This version can be found at my GitHub location:

https://github.com/nevendujmovic/Data_analysis_COVID-19_dataset

  • Click on "Code" -> "Download ZIP" to download the "owid-covid-data.csv" file with the sample data.
  • Make a project directory on your file system (e.g. in my case I created the “Data_analysis” folder on the desktop).
  • Extract and copy “owid-covid-data.csv” to the project directory (e.g. in my case “C:\Users\<user name>\Desktop\Data_analysis”).


a) Load data to the SQLite database

  • Open DB Browser for SQLite by clicking the desktop shortcut or the “SQLiteDatabaseBrowserPortable.exe” file in the SQLite portable application directory.
  • Select the “New database” icon (or go to “File” -> “New database…” in the menu) and, in the “Choose a filename to save under” dialog, provide the name “covid19data” and a location (e.g. the “Data_analysis” directory). Click “Save” to create the database file.
  • Click “Cancel” in the “Edit table definition” dialog that opens. We will not create a table this way; instead, we will import the table from the sample data ".CSV" file.
  • Go to “File” -> “Import” -> “Table from CSV file…”.
  • Locate and select “owid-covid-data.csv” file from the “Data_analysis” directory and click “Open”.
  • In the “Import CSV file” dialog check “Column names in first line” and click “OK”.
  • The new table “owid-covid-data” with all sample data is created. Review the list of columns to get an idea of what data we have available.

The table name contains the not-recommended character “-”, and it is best to change it immediately. Right-click on the table name and select the “Modify table” option.

  • Click “Save” on the dialog that will pop up.
  • Change the name of the table to “CovidData” and click “OK”.
  • Now, let’s make a simple query to review data entries and find useful data for our analysis. Go to the “Execute SQL” tab and write the following SQL statement in the editor.
SELECT * FROM CovidData;

  • Select the text of the SQL statement and run it by clicking the little right arrow or by pressing the F5 key on the keyboard.
  • The data will be displayed, and by reviewing the entries we can identify the data columns that are relevant to our analysis.

Data columns that we need are the following:

  • location – country
  • date – a specific date
  • new_cases – detected positive COVID-19 cases on a specific date
  • new_tests – number of tests done on a specific date

You can also see that 37194 rows were returned by this query. That is the length (number of records) of our dataset.
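If you want to confirm that count with a query instead of reading it from the result grid, a simple aggregate in the “Execute SQL” tab will do:

-- count all records in the imported table
SELECT COUNT(*) FROM CovidData;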

Let’s check the assigned data types of the relevant columns to prevent potential misinterpretation of the data due to improper conversion.

  • Go to the “Database Structure” tab, expand the “CovidData” table, and you will see all columns listed.
  • There is a problem with the detected data type of the “new_tests” column. For our calculation, we need a numeric value. The likely reason for the incorrect data type assignment is the NULL values in that column at the beginning of the data file. The “new_cases” column should probably also be an integer, although this is not an issue for our calculation. No problem: we can easily change the data types by right-clicking on the “CovidData” table and selecting the “Modify table” option, as described above.
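If you prefer to check the declared column types from the “Execute SQL” tab instead of the structure view, SQLite also exposes the table definition through a PRAGMA statement:

-- list each column of the table together with its declared data type
PRAGMA table_info(CovidData);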
  • Change the data types of both columns to INTEGER and click “OK”. Now is also a good time to save our project: go to “File” -> “Save Project”.
  • Save the project file, in which we will store SQL statements and configuration, to the “Data_analysis” directory. You can use “covid19data” as the name.
  • Go back to the “Execute SQL” tab to start the next data analytics phase.


b) Load data to the Python Pandas environment

  • Use Windows search to find “Anaconda Navigator” and start it. This can take some time, so wait patiently for everything to load.
  • In the main “Anaconda Navigator” window, choose “Spyder” and select “Launch” to load the Python development environment.
  • When the “Spyder” editor has loaded, save the empty Python file to your “Data_analysis” directory.
  • Provide a name for the file (“covid19data.py”) and select “Python files” for type. Click the “Save” button.

Note: it is critical that you place the Python script file “covid19data.py” and the sample data ".CSV" file “owid-covid-data.csv” in the same location on your file system (e.g. the “Data_analysis” directory). Otherwise, you will have to specify the full path to the ".CSV" file in your Python code.
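For example, if the ".CSV" file were stored somewhere else, the read call would need the full path, something like the sketch below (the path is illustrative; replace <user name> with your Windows user name):

import pandas as pd

# only needed when the CSV file is NOT in the same directory as the script
covid_data = pd.read_csv(r'C:\Users\<user name>\Desktop\Data_analysis\owid-covid-data.csv')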

  • Write or paste the following code in your “Spyder” editor and click the green arrow (or press the F5 key) to execute the code.
# import pandas
import pandas as pd

# Load the data from the comma-separated values (CSV) text file
# to the "covid_data" variable.
covid_data = pd.read_csv('owid-covid-data.csv')

Note: On the first line of the code, we imported the "pandas" library, which holds everything you will need for data analytics.

  • All data from the sample data ".CSV" file “owid-covid-data.csv” is now loaded into the “covid_data” variable.
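You can also sanity-check the load from code with a few standard pandas calls:

# quick checks on the loaded DataFrame
print(covid_data.shape)    # (number of rows, number of columns)
print(covid_data.columns)  # names of the available columns
print(covid_data.head())   # first five records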
  • Click on the “Variable explorer” on the right side of the “Spyder” editor window and you will see the “covid_data” variable listed. Click on it and a window with the data will be displayed.

Great! Data collection and loading into the analytical tools are finished.

On top of this, you now know how to execute SQL and Python code to get results. We will utilize what we learned in the next section, where we will deal with code and data.

 

Step 4: Prepare data

In this section we will run some queries, both with SQL and with Python Pandas, in order to check the data, clean it, and prepare it for analysis.

This will be done in the following way:

  • Python Pandas code will be executed for data extraction or data analytics
  • and then checked with the corresponding SQL statement.

In real-life situations you can, of course, use only one language for your analytical purposes. However, this is a training course, and I think it is very important to use both Python Pandas and SQL side by side to compare the tools and language syntax.

Note: As mentioned in the introduction, we will not go in-depth into either programming language. There are a lot of tutorials and free resources online dealing with SQL and Python syntax, and I encourage you to use them during this tutorial or later. However, that takes time, depending on your previous experience, and it would extend this data analytics education to a level of detail that could be too technical. My idea is to explore the steps of data analysis and look into the tools and their usage; once that is done, you can decide whether or not to look under the hood and see how it all works on the detailed technical side. In other words, the low-level complexity of the system should not be a stopper for using the tools. That is, in a nutshell, what is meant by the “naive” approach in education.

a) Check and remove NULL (empty) values

With a targeted query we will extract only the data required to fulfill our analytical goal. In the initial analysis above, we decided that we only need the following columns: “location”, “date”, “new_cases”, “new_tests”.

  • In the Python (“Spyder” editor) execute the following code:
# Make a copy of dataset to use it for data preparation.
ds = covid_data.copy()

# We identified the following four columns that are needed for our analysis.
covid_ds = ds[['location', 'date', 'new_tests', 'new_cases']]

  • In SQLite, execute the following statement to get the same result:
SELECT location, date, new_tests, new_cases FROM CovidData;


Let's do a simple check of data quality concerning the number of records and the number of missing values.

  • In the Python (“Spyder” editor) execute the following code:
number_of_records1 = covid_ds.count()\
                   .reset_index(name='count')\
                   .sort_values(['count'], ascending=False)

Note: "\" is used to continue a Python line of code onto the next line.

We notice a different number of records per column, which implies the presence of a significant number of NULL values, or "NaN" values as Python calls them. This must be investigated further with a nice line of Python code.

  • Let's identify the columns with missing values along with the count and print it on the console.
print ('')
print ('Missing values in the dataset')
print (covid_ds.isnull().sum(axis=0))


  • We will get the same result if we execute the following SQL statements in SQLite:
SELECT COUNT(*) FROM CovidData WHERE location IS NULL;
-- no NULL entry found

SELECT COUNT(*) FROM CovidData WHERE date IS NULL;
-- no NULL entry found

SELECT COUNT(*) FROM CovidData WHERE new_tests IS NULL;
-- there are 25588 rows where test data for the specific date is missing

SELECT COUNT(*) FROM CovidData WHERE new_cases IS NULL;
-- there are 352 rows where case data for the specific date is missing

The good news is that the "location" and "date" have no missing values.

The bad news is that, for the "new_tests" column, only 11606 out of 37194 records have a value.

DECISION: It is not possible to perform the analysis without the "new_tests" value. We will remove all rows where the "new_tests" value is missing.

  • Execute the following Python code to drop NULL values for the “new_tests” column and print the resulting status after this action:
covid_ds = covid_ds.dropna(subset = ["new_tests"])

print ('')
print ('Missing values in the dataset after we dropped \"new_tests\"')
print (covid_ds.isnull().sum(axis=0))


Also, it is not good that the "new_cases" column still has missing values.

  • Let's select the records where the "new_cases" column has NaN values.
new_cases_nan = covid_ds[covid_ds['new_cases'].isnull()]

Well, there are some discoveries:

  1. The rows where "new_tests" were performed but "new_cases" are missing all fall in the period from 2020-02-06 to 2020-03-19. This was at the beginning of the pandemic, so this could be a test accuracy or reporting problem.
  2. Also, we can see that the United Arab Emirates conducted 33555 tests in that period while the "new_cases" data are missing. Those United Arab Emirates records are clear outliers that would significantly influence the results for that country.

Note: Outliers are unusual values in a dataset that vary significantly from the rest of the data. Outliers are very problematic for many analyses because they can distort results and cause tests to miss significant findings.
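For instance, the suspected United Arab Emirates outliers from discovery no. 2 can be pulled out for a closer look (the “uae_rows” variable name is just for this ad-hoc check):

# inspect the rows with missing "new_cases" for a single country
uae_rows = new_cases_nan[new_cases_nan['location'] == 'United Arab Emirates']
print(uae_rows)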

DECISION: We will remove all rows where the "new_cases" value is missing.

  • Execute the following Python code to drop NULL values for the “new_cases” column and print the resulting status after this action:
covid_ds = covid_ds.dropna(subset = ["new_cases"])

print ('')
print ('Missing values in the dataset after we dropped \"new_cases\"')

print (covid_ds.isnull().sum(axis=0))

  • Again, we will check the number of records for each column.
number_of_records2 = covid_ds.count()\
                   .reset_index(name='count')\
                   .sort_values(['count'], ascending=False)

  • The SQL statement that reflects this elimination of NULL values looks like this (conditions added to the WHERE clause):
/* Count the number records when NULL values are removed. */

SELECT Count(*) FROM CovidData
WHERE new_tests IS NOT NULL AND new_cases IS NOT NULL;


Great! There are now 11457 records without missing values.

b) Check and remove inconsistent and error values

We will now check for possible records where there are more new positive COVID-19 cases than tests in a single day. Such records make no sense for our analysis.

  • Execute the following Python code to find those cases:
temp1_before = covid_ds\
      .loc[covid_ds['new_cases'] > covid_ds['new_tests']]

Unfortunately, we have found such cases, although only for 65 records. It is also worrying that we have seen negative values. This situation should be an alarm for the analyst: perhaps the data collection was not adequate, and it may be best to collect the data again or reach for some other data sources.

DECISION: We will remove all rows where the "new_cases" > "new_tests".

  • Execute the following Python code to drop all rows where the "new_cases" > "new_tests":
covid_ds.drop(covid_ds[covid_ds['new_cases'] > covid_ds['new_tests']]\
             .index, inplace = True)

temp1_after = covid_ds.loc[covid_ds['new_cases'] > covid_ds['new_tests']]

Also, an equal number of tests and positive cases is unrealistic, since 100% test accuracy is not expected. This scenario is most likely an error.

  • Execute the following Python code to find those cases:
temp2_before = covid_ds\
      .loc[covid_ds['new_cases'] == covid_ds['new_tests']]

Six such records are found! Inspection of the data shows that they are probably errors.

DECISION: We will remove all rows where the "new_cases" == "new_tests".   

  • Execute the following Python code to drop all rows where the "new_cases" == "new_tests".
covid_ds.drop(covid_ds[covid_ds['new_cases'] == covid_ds['new_tests']]\
             .index, inplace = True) 

temp2_after = covid_ds\
      .loc[covid_ds['new_cases'] == covid_ds['new_tests']]

  • Again, we will check the number of records for each column.
number_of_records3 = covid_ds.count()\
                   .reset_index(name='count')\
                   .sort_values(['count'], ascending=False)

  • The SQL statement that reflects the elimination of these errors looks like this:
SELECT Count(*) FROM CovidData
WHERE new_tests IS NOT NULL AND new_cases IS NOT NULL AND new_cases < new_tests;


Now we will also check for negative values of "new_tests" and "new_cases". Those values must be removed from our dataset, since it is evident that we are dealing with erroneous entries.

  • Execute the following Python code to get all negative entries for the "new_tests" column.
temp3 = covid_ds.loc[covid_ds['new_tests'] < 0 ]

No action is required, since there are no records with negative value entries for "new_tests".

  • Execute the following Python code to get all negative entries for the "new_cases" column.
temp4_before = covid_ds.loc[covid_ds['new_cases'] < 0]

There are 7 records where the "new_cases" column has a negative value entry.

  • Execute the following Python code to remove all negative entries for the "new_cases" column.
covid_ds.drop(covid_ds[covid_ds['new_cases'] < 0]\
             .index, inplace = True)

temp4_after = covid_ds.loc[covid_ds['new_cases'] < 0]

  • Again, we will check the number of records for each column.
number_of_records4 = covid_ds.count()\
                   .reset_index(name='count')\
                   .sort_values(['count'], ascending=False)

  • The SQL statement that reflects the elimination of negative values looks like this:
SELECT Count(*) FROM CovidData
WHERE new_tests IS NOT NULL AND new_cases IS NOT NULL AND new_cases < new_tests AND new_tests > 0 AND new_cases >= 0;

  • Let's see the list of countries to verify the entries by executing the following Python code.
list_of_countries = covid_ds.drop_duplicates('location')[['location']]\
                                 .reset_index()

Note: reset_index() is used to add a new index, starting with 0, to the resulting dataset. Its usage is optional; if it is not used, the original index entries remain, which in some cases can be useful for linking records back to the original data.
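To see the difference, here is a tiny standalone demonstration with toy data (not part of our dataset):

import pandas as pd

toy = pd.DataFrame({'location': ['Spain', 'Spain', 'Italy']}, index=[10, 11, 12])
print(toy.drop_duplicates('location'))                # keeps the original labels 10 and 12
print(toy.drop_duplicates('location').reset_index())  # old labels become an "index" column; rows are renumbered 0, 1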

  • The corresponding SQL statement would look like this:
SELECT DISTINCT location AS Country FROM CovidData 
WHERE new_tests IS NOT NULL AND new_cases IS NOT NULL AND new_cases < new_tests 
      AND new_tests > 0 AND new_cases >= 0
ORDER BY location;


We are left with 87 countries for the analysis after the data cleaning activity.
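A quick one-line check in Python confirms the count:

print(len(list_of_countries))  # number of countries left after cleaning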

This ends the "data preparation" step, which is usually the most time-consuming and complicated part of the analytical process. After these adjustments and data cleaning, the "covid_ds" dataset is prepared for the analysis. As mentioned above, it is not a perfect data source, and its use would probably be questionable with regard to data quality, but for this training it will be just fine.


Step 5: Analyze data

Now it is time for the main show! Soon we will see the fruits of our labor.

We will add an additional calculated column ("test_percentage") that stores, for each day, the number of new COVID-19 cases as a percentage of the tests performed that day.

  • Execute the following Python code to add the calculated column:
covid_ds['test_percentage'] = covid_ds['new_cases'] / covid_ds['new_tests'] * 100


Now it is time to approach the goal of our analysis and list the average percentages (cases vs. tests) grouped by country.

  • Execute the following Python code to get the results of our analysis on the full data sample:
avg_percentage_test_cases =\
         covid_ds.groupby(['location'])['test_percentage']\
         .mean()\
         .reset_index()\
         .sort_values(['test_percentage'], ascending=False)

  • Execute the following SQL statement to get the same results on the full data sample:
SELECT location AS Country, AVG(new_cases*1.0 / new_tests) * 100 AS [Average percentage of detection] FROM CovidData
WHERE new_tests IS NOT NULL AND new_cases IS NOT NULL AND new_cases < new_tests AND new_tests > 0 AND new_cases >= 0
GROUP BY location
ORDER BY [Average percentage of detection] DESC;


We can also use a time dimension in our query by providing date values to analyze data only in a specific period.

  • Execute the following Python code to get results for a specific time period:
covid_ds_time_filter = covid_ds\
   .loc[(covid_ds['date'] >= '2020-07-01') & (covid_ds['date'] <= '2020-08-01')]

avg_percentage_time_filter =\
         covid_ds_time_filter.groupby(['location'])['test_percentage']\
         .mean()\
         .reset_index()\
         .sort_values(['test_percentage'], ascending=False)

  • Execute the following SQL statement to get the same results for the specific time period:
SELECT location AS Country, ROUND((AVG(new_cases*1.0 / new_tests) * 100), 4) AS [Average percentage of detection] FROM CovidData
WHERE new_tests IS NOT NULL AND new_cases IS NOT NULL AND new_cases < new_tests AND new_tests > 0 AND new_cases >= 0
     AND DATE(date) BETWEEN DATE('2020-07-01') AND DATE('2020-08-01')
GROUP BY location
ORDER BY [Average percentage of detection] DESC;

Note: The BETWEEN operator is inclusive. It returns true when the test_expression is greater than or equal to low_expression and less than or equal to high_expression: test_expression >= low_expression AND test_expression <= high_expression.

Also, note the use of the ROUND function to display results with 4 decimal places.



Step 6: Interpret results

The data analysis in the previous section produced results, and we can now put them to real use. It is important to always be critical and do a reality check before drawing conclusions. As mentioned, there are concerns about the quality of the data, and the analysis itself could contain errors. Therefore, before proceeding any further, ask yourself a simple question:

Does it make sense?

I will leave you with this question and let you decide. Remember that information is power, and as we gain more knowledge, we become wiser human beings.


But it is not time to finish our lecture yet! Allow me to show you one very cool thing before we end today's session.

How about some visualization of the results, as is usual in Excel or similar applications?

Python Pandas packages provide excellent resources for data visualization. We will use the "bar" chart type to give our results a visual representation. This can be done right from our Python code.

  • Add the following code to pass our result dataset to a visualization object and execute the Python script:
# import libraries that will enable plotting of graphs
import matplotlib.pyplot as plt

# define x and y axes from the results dataset and select "bar" type graph
ax = avg_percentage_time_filter.plot(x='location', y='test_percentage', kind='bar')

plt.show()

The graph is shown under the “Plots” tab on the right side of the “Spyder” editor window.


All the data is there, but the x-axis looks a little overcrowded with labels containing the country names. With a little Python magic, we can rotate the country-name labels by 90 degrees and display only every third country. And we can do even more customization by changing the axis labels and the title of the graph.

# import libraries that will enable plotting of graphs
import matplotlib.pyplot as plt

# define x and y axes from the results dataset and select "bar" type graph
ax = avg_percentage_time_filter.plot(x='location', y='test_percentage', kind='bar')

# to avoid crowded labels on x-axes, we will show every third country and 
# rotate labels by 90 degrees
for i, t in enumerate(ax.get_xticklabels()):
    if (i % 3) != 0:
        t.set_visible(False)

plt.xticks(rotation=90)

# additionally we can set chart title, legend and axes labels
plt.title('COVID-19 data results')
plt.legend(['from 2020-07-01 to 2020-08-01'])
plt.xlabel('Countries')
plt.ylabel('cases vs. test ratio in %')

plt.show()

Now, it looks much more readable and the graph can be saved or copied directly to your presentation.
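If you would rather save the figure from code than from the “Plots” pane, matplotlib can write it to an image file; note that savefig must be called before plt.show(), and the file name and resolution below are just examples:

# save the chart as a PNG file in the working directory (call before plt.show())
plt.savefig('covid19_chart.png', dpi=150, bbox_inches='tight')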


There are a lot of ways to adjust and use visualization tools in Python, and I encourage you to explore further. This topic is broad and would probably require a separate, dedicated education session dealing with visualization in detail. But for now, the example provided is enough to get an idea of the basic mechanics of this powerful feature.


Step 7: Test the scripts on the newest data

Before we finish, I suggest running our script on current data, which can be downloaded from:

https://covid.ourworldindata.org/data/owid-covid-data.csv

As explained above, on the “Our World in Data” web site, you can find the current version of the data.

  • Rename the ".CSV" file (e.g. to “owid-covid-data_new.csv”), copy it to the “Data_analysis” directory, and use the new name in the Python script:
covid_data = pd.read_csv('owid-covid-data_new.csv')

  • Also adjust the time filter (and the graph legend) to get results for the last month, and run the Python script.
covid_ds_time_filter = covid_ds\
    .loc[(covid_ds['date'] >= '2020-07-25') & (covid_ds['date'] <= '2020-08-25')]

      
avg_percentage_time_filter =\
         covid_ds_time_filter.groupby(['location'])['test_percentage']\
         .mean()\
         .reset_index()\
         .sort_values(['test_percentage'], ascending=False)

# and on graph legend
plt.legend(['from 2020-07-25 to 2020-08-25'])

  • Execute the Python script to get the new results; you will see some changes.

Note that, once made, a script can easily be reused on a new set of data.
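To take reuse one step further, the whole pipeline could be wrapped in a single function. The sketch below is my own illustration rather than part of the tutorial (the function name and parameters are invented); it condenses the cleaning rules from Step 4 and the calculation from Step 5:

import pandas as pd

def average_positivity(csv_file, date_from, date_to):
    # load and keep only the columns the analysis needs
    ds = pd.read_csv(csv_file)[['location', 'date', 'new_tests', 'new_cases']]
    # cleaning rules from Step 4: drop missing, impossible and negative values
    ds = ds.dropna(subset=['new_tests', 'new_cases'])
    ds = ds[(ds['new_cases'] < ds['new_tests']) & (ds['new_cases'] >= 0)]
    # calculated column and time filter from Step 5
    ds['test_percentage'] = ds['new_cases'] / ds['new_tests'] * 100
    ds = ds[(ds['date'] >= date_from) & (ds['date'] <= date_to)]
    return ds.groupby('location')['test_percentage'].mean()\
             .reset_index().sort_values('test_percentage', ascending=False)

results = average_positivity('owid-covid-data_new.csv', '2020-07-25', '2020-08-25')
print(results.head(10))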


As an exercise, you can verify the results with SQL. Once again, you are the one tasked with interpreting the results.

The complete source code and data can be found at my GitHub location for this project:

https://github.com/nevendujmovic/Data_analysis_COVID-19_dataset


My friends, this is the end of today's tutorial, and I hope it was time well spent.

Now you know how to prepare your desktop environment for data analysis in a short time and perform the necessary steps of an analytical process.

And I hope you enjoyed using this tutorial as much as I have enjoyed writing it!

I wish you all the best, and stay healthy!

 

Author

Neven Dujmovic

 

 

P.S. If you like this tutorial, be sure to check out the other tutorials in my “naive” series on the LinkedIn platform.

During the COVID-19 pandemic, I provided free education via the LinkedIn articles platform. All tutorials cover useful and advanced information technology concepts but are presented in an "easy to learn" way and adjusted for beginners.

Each tutorial is:

-         completely free to use,

-         based only on non-commercial software,

-         documented with a high level of detail,

-         illustrated with live screenshots from the IT systems used at each tutorial step,

-         quick to finish (6-8 hours at most), with knowledge that has direct application,

-         usable as a reference in further work with the technology.

The following IT tutorials are available:

-         My "naive" Java tutorial with Quarkus on Linux (https://www.dhirubhai.net/pulse/my-naive-java-tutorial-quarkus-linux-neven-dujmovic/)

-         My “naive” Python tutorial (https://www.dhirubhai.net/pulse/my-naive-python-tutorial-neven-dujmovic/)

-         My “naive” virtualization & Linux installation tutorial (https://www.dhirubhai.net/pulse/my-naive-virtualization-linux-installation-tutorial-neven-dujmovic)

-         My “naive” Python development tutorial on Linux (https://www.dhirubhai.net/pulse/my-naive-python-development-tutorial-linux-neven-dujmovic)

-         My “naive” Microsoft SQL Server desktop installation and usage tutorial (https://www.dhirubhai.net/pulse/my-naive-microsoft-sql-server-desktop-installation-usage-dujmovic/)

 

 
