My “naive” Data analytics tutorial: Python vs. SQLite
Tired of Excel for number crunching? Want to check out other tools? Well, there are many great alternatives out there!
Why not use powerful languages like SQL and Python in your data analysis universe?
In this tutorial, we will learn how to prepare an analytics environment on the Windows desktop, install development tools, and perform a simple data analysis like one that could be done with Excel.
My assumption is that you are a total beginner in SQL and Python and that you are using the Microsoft Windows environment. Note that switching to Linux later is very easy, since SQL and Python are fully multiplatform.
We will not go in-depth with either programming language; instead, we will work through real analytics problems that can be solved using SQL or Python. After your analytics environment is prepared, we will take the five steps that are mandatory in any research with data:
- Set your analytics goal – what is a real-life problem that needs research
- Collect data – find sources of data
- Prepare data – check data quality, integrity, completeness and perform data cleaning (avoid GIGO – “Garbage In, Garbage Out” situation)
- Analyze data – the fun part
- Interpret results – understand the results, do a reality check, go back to phase 2 or 3 if needed, do data visualization, prepare actions
I think this is more than enough for the beginning of this “naive” tutorial, and it is time to do something practical!
Ready? Here we go!
***
Step 1: Set your data analytics environment
a) Install SQLite on your Windows machine
SQLite is a relational database, and the special thing about it is that it is not a classical client-server database. Rather, it is embedded into the end program, and it can be used as a portable application that runs even without installation on your system.
SQLite is mostly compliant with the SQL standard, generally following PostgreSQL syntax. It is lightweight, and it can easily be used for small analyses and training.
- Go to the DB Browser for SQLite download page.
- Click on the “DB Browser for SQLite - PortableApp” and download an archive file containing SQLite portable application.
- Locate the file “SQLiteDatabaseBrowserPortable_3.12.0_English.paf.exe” on your file system and double-click on it to run the extraction program and unpack the portable application.
- Select the location for the portable program and click “Install”.
- When it is done, click “Finish” to close the dialog.
- Optionally you can create a desktop shortcut icon by going to the directory where SQLite portable application was extracted and choosing “Desktop (create shortcut)” on the right-click menu.
b) Install the Anaconda distribution of Python on your Windows machine
My recommendation is to use the Anaconda distribution of Python for data analysis and other purposes. Anaconda offers a free and open-source distribution of the Python programming language for scientific computing (data science, machine learning applications, etc.). The main advantage is simplified package management and deployment (over 250 packages installed automatically, and over 7,500 additional open-source packages available).
- Download Anaconda Individual Edition via the link: https://www.anaconda.com/products/individual
- Scroll down to the “Download” section.
- Click on the link for the “64-Bit Graphical Installer (466 MB)” version.
This will start the download and that can take a while.
- Start the installation by double-clicking on the installation file in your download directory
- Install Anaconda Individual Edition with default features.
Be patient as this can take some time depending on your system performance. A lot of cool packages for data analytics are coming your way!
- When it is done, click “Next”, “Next” and “Finish” to end the installation process.
If you have time, I encourage you to review the documentation, but since this is a quick tutorial, we will not do that now. Let's instead use our data analytics environment to do some serious data processing!
Step 2: Set your data analytics goal
This is the shortest section, but it is the most important. Here we define the reason why we are starting this project task.
Our data analytics project will deal with the following problem:
- Find how effective COVID-19 testing is per country by comparing the number of tests performed on a single day with the number of new cases detected.
Reasoning: A high ratio of positive cases to tests could mean that there is not enough testing, i.e., that there are significantly more infected people in the population than have been detected.
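To make this metric concrete, here is a tiny worked example with made-up numbers (they are not taken from the dataset):
# made-up example: 10000 tests performed in one day, 500 of them positive
new_tests = 10000
new_cases = 500

# the "cases vs. tests" ratio we will compute later in the analysis
test_percentage = new_cases / new_tests * 100
print(test_percentage)  # prints 5.0, i.e. 5% of the tests came back positive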
DISCLAIMER: The purpose of the following data analysis is educational, and it is done on a dataset that was not verified for accuracy. The results of this analysis are therefore not to be interpreted for any purpose other than education, nor used as accurate reports on COVID-19 tests and cases.
I had to mention this since there is always a chance that the intention of this exercise could be misinterpreted. I wanted to show you how important data analysis is for real-life situations, and this is the main reason for using the COVID-19 dataset.
Step 3: Collect data
Sample data for the analysis can be found on this excellent web resource:
https://ourworldindata.org/coronavirus-testing
You need to download the ".CSV" text file that contains all data in the comma-separated format:
https://covid.ourworldindata.org/data/owid-covid-data.csv
On the “Our World in Data” web site you can find the latest version of the data. However, I recommend using the version of the data that I used for this analysis, since it will be easier to follow the exact results and the progress of the activities. This version can be found on my GitHub location:
https://github.com/nevendujmovic/Data_analysis_COVID-19_dataset
- Click on "Code" -> "Download ZIP" to download "owid-covid-data.csv" file with sample data.
- Make a project directory on your file system (e.g. in my case I created the “Data_analysis” folder on the desktop).
- Extract and copy “owid-covid-data.csv” to the project directory (e.g. in my case “C:\Users\<user name>\Desktop\Data_analysis”).
a) Load data to the SQLite database
- Open DB Browser for SQLite by clicking on the desktop shortcut or the “SQLiteDatabaseBrowserPortable.exe” file in the SQLite portable application directory.
- Select the “New database” icon (or on the menu go to “File” -> “New database…”) and in the “Choose a filename to save under” dialog provide the name “covid19data” and a location (e.g. the “Data_analysis” directory). Click “Save” to create the database file.
- Click “Cancel” in the “Edit table definition” dialog that opens. We will not create a table this way; instead, we will import the table from the sample data ".CSV" file.
- Go to “File” -> “Import” -> “Table from CSV file…”.
- Locate and select “owid-covid-data.csv” file from the “Data_analysis” directory and click “Open”.
- In the “Import CSV file” dialog check “Column names in first line” and click “OK”.
- The new table “owid-covid-data” with all sample data is created. Review the list of columns to get an idea of what data we have available.
- The table name contains the character “-”, which is not recommended, so it is best to change it immediately. Right-click on the table name and select the “Modify table” option.
- Click “Save” on the dialog that will pop up.
- Change the name of the table to “CovidData” and click “OK”.
- Now, let’s make a simple query to review the data entries and find useful data for our analysis. Go to the “Execute SQL” tab and write the following SQL statement in the editor.
SELECT * FROM CovidData;
- Select the text of the SQL statement and run it by clicking the little right arrow or by pressing the F5 key on the keyboard.
- The data will be displayed and, by reviewing the entries, we can identify the data columns that are relevant for our analysis.
Data columns that we need are the following:
- location – country
- date – a specific date
- new_cases – detected positive COVID-19 cases on a specific date
- new_tests – number of tests done on a specific date
You can also see that 37194 rows were returned by this query. That is the length (number of records) of our dataset.
Let’s check the assigned data types for relevant columns to prevent potential misinterpretation of data due to improper conversion.
- Go to “Data Structure”, expand the “CovidData” table and you will see all columns listed.
- There is a problem with the detected data type of the “new_tests” column. For our calculation, we need a numeric value. The reason for this incorrect assignment could be the NULL values in that column at the beginning of the data file. Also, “new_cases” should probably be an integer, but this is not an issue for our calculation. No problem, we can easily change the data type by right-clicking on the “CovidData” table and selecting the “Modify table” option as described above.
- Change the data types of both columns to INTEGER and click “OK”. Now is also a good time to save our project. Go to “File” -> “Save Project”.
- Save the project file, in which we will store SQL statements and configuration, to the “Data_analysis” directory. You can use “covid19data” as the name.
- Go back to the “Execute SQL” tab to start the next data analytics phase.
b) Load data to the Python Pandas environment
- Use Windows search to find “Anaconda Navigator” and start it. This can take some time, so wait patiently for everything to load.
- On the main “Anaconda Navigator” window choose “Spyder” and select “Launch” to load the Python development environment.
- When the “Spyder” editor is loaded, save the empty Python file to your “Data_analysis” directory.
- Provide a name for the file (“covid19data.py”) and select “Python files” for type. Click the “Save” button.
Note: it is critical that you place the Python script file “covid19data.py” and the sample data ".CSV" file “owid-covid-data.csv” in the same location on your file system (e.g. the “Data_analysis” directory). Otherwise, you will have to specify the full path to the sample data ".CSV" file in your Python code.
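For example, if the ".CSV" file were kept somewhere else, the load call would have to include the full path. A small sketch (the path mirrors the example directory from above; replace <user name> with your actual Windows user name):
import pandas as pd

# full path to the sample data file (raw string so the backslashes are not treated as escapes)
covid_data = pd.read_csv(r'C:\Users\<user name>\Desktop\Data_analysis\owid-covid-data.csv')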
- Write or paste the following code in your “Spyder” editor and click the green arrow (or press the F5 key) to execute the code.
# import pandas
import pandas as pd

# Load the data from the comma-separated values (CSV) text file
# to the "covid_data" variable.
covid_data = pd.read_csv('owid-covid-data.csv')
Note: On the first line of the code, we imported the "pandas" library, which holds everything you will need for data analytics.
- All the data from the sample ".CSV" file “owid-covid-data.csv” is loaded into the “covid_data” variable.
- Click on the “Variable explorer” on the right side of the “Spyder” editor window and you will see the “covid_data” variable listed. Click on the “covid_data” variable and a window with the data will be displayed.
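You can also do the same quick sanity check directly from code. A minimal sketch, using the variable loaded above:
# first few rows of the loaded data
print(covid_data.head())

# number of records (should match the 37194 rows we saw in SQLite)
print(len(covid_data))

# data types detected by pandas for the columns we will use later
print(covid_data[['location', 'date', 'new_cases', 'new_tests']].dtypes)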
Great! Data collection and loading into the analytical tools are finished.
On top of this, you now know how to execute SQL and Python code to get results. We will utilize what we learned in the next section, where we will deal with code and data.
Step 4: Prepare data
In this section we will do some queries both with SQL and Python Pandas in order to check data, clean it, and prepare it for analysis.
This will be done in the following way:
- Python Pandas code will be executed for data extraction or data analytics
- and then checked with the corresponding SQL statement.
In real-life situations you can, of course, use only one language for your analytical purposes. However, this is a training course, and I think it is very important to use both Python Pandas and SQL side by side to compare the tools and the language syntax.
Note: As mentioned in the introduction, we will not go in-depth with either of the programming languages. There are a lot of tutorials and free resources online that deal with SQL and Python syntax, and I encourage you to use them during this tutorial or later. Going through them takes time, depending on your previous experience, and it would extend this data analytics education to a level of detail that could be too technical. My idea is to explore the steps of data analysis and look into the tools and their usage; once that is done, you can decide whether or not to look under the hood and see how things work from the detailed technical side. In other words, low-level complexity should not stop you from using the tools. That is, in a nutshell, what is meant by the “naive” approach to education.
a) Check and remove NULL (empty) values
With a targeted query, we will extract only the data that we require to fulfill our analytical goal. In the initial analysis above, we decided that we only need the following columns: “location”, “date”, “new_cases”, “new_tests”.
- In the Python (“Spyder” editor) execute the following code:
# Make a copy of the dataset to use it for data preparation.
ds = covid_data.copy()

# We identified the following four columns that are needed for our analysis.
covid_ds = ds[['location', 'date', 'new_tests', 'new_cases']]
- In the SQLite execute the following statement to get the same result:
SELECT location, date, new_tests, new_cases FROM CovidData;
Let's do a simple check of the quality of data concerning the number of records and the number of missing values.
- In the Python (“Spyder” editor) execute the following code:
number_of_records1 = covid_ds.count()\
    .reset_index(name='count')\
    .sort_values(['count'], ascending=False)
Note: "\" is used to continue a Python statement onto the next line.
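As a side note, the same statement can be written without backslashes by wrapping the whole expression in parentheses; this is just an alternative style, not a required change:
# the same chain using parentheses instead of backslash continuations
number_of_records1 = (
    covid_ds.count()
    .reset_index(name='count')
    .sort_values(['count'], ascending=False)
)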
We notice a different number of records per column, which implies the presence of a significant number of NULL values, or “NaN” values as pandas calls them. Let's investigate this further with a nice line of Python.
- Let's identify the columns with missing values along with the count and print it on the console.
print('')
print('Missing values in the dataset')
print(covid_ds.isnull().sum(axis=0))
- We will get the same result if we execute the following SQL statement in the SQLite:
SELECT COUNT(*) FROM CovidData WHERE location IS NULL;
-- no NULL entry found

SELECT COUNT(*) FROM CovidData WHERE date IS NULL;
-- no NULL entry found

SELECT COUNT(*) FROM CovidData WHERE new_tests IS NULL;
-- there are 25588 rows where test data for the specific date is missing

SELECT COUNT(*) FROM CovidData WHERE new_cases IS NULL;
-- there are 352 rows where case data for the specific date is missing
The good news is that the "location" and "date" have no missing values.
The bad news is that in the "new_tests" column only 11606 out of 37194 records have a value.
DECISION: it is not possible to perform the analysis without a "new_tests" value. We will remove all rows where the "new_tests" value is missing.
- Execute the following Python code to drop NULL values for the “new_tests” column and print the resulting status after this action:
covid_ds = covid_ds.dropna(subset=["new_tests"])

print('')
print('Missing values in the dataset after we dropped "new_tests"')
print(covid_ds.isnull().sum(axis=0))
Also, it is not good that the "new_cases" column still has missing values.
- Let's select records where the "new_cases" column has NaN.
new_cases_nan = covid_ds[covid_ds['new_cases'].isnull()]
Well, there are some discoveries:
- Records where "new_tests" were done but "new_cases" are missing all fall in the period from 2020-02-06 to 2020-03-19. This was at the beginning of the pandemic, and it could be a test accuracy or reporting problem.
- Also, we can see that the United Arab Emirates conducted 33555 tests in that period while the "new_cases" data are missing. We can say that those records from the United Arab Emirates are clear outliers that would significantly influence the results for that country.
Note: Outliers are unusual values in a dataset that vary significantly from the other data. Outliers are very problematic for many analyses because they can distort results and cause tests to miss significant findings.
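Just to illustrate the concept (this is not one of our cleaning steps, and the 1.5 * IQR threshold is only a common rule of thumb), a simple way to flag unusually large "new_tests" values in pandas is the interquartile range rule:
# flag values that lie far above the bulk of the data (simple IQR rule)
q1 = covid_ds['new_tests'].quantile(0.25)
q3 = covid_ds['new_tests'].quantile(0.75)
iqr = q3 - q1

# values above q3 + 1.5 * iqr are conventionally treated as outliers
possible_outliers = covid_ds[covid_ds['new_tests'] > q3 + 1.5 * iqr]
print(possible_outliers[['location', 'date', 'new_tests']].head())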
DECISION: We will remove all rows where the "new_cases" value is missing.
- Execute the following Python code to drop NULL values for the “new_cases” column and print the resulting status after this action:
covid_ds = covid_ds.dropna(subset=["new_cases"])

print('')
print('Missing values in the dataset after we dropped "new_cases"')
print(covid_ds.isnull().sum(axis=0))
- Again, we will check the number of records for each column.
number_of_records2 = covid_ds.count()\
    .reset_index(name='count')\
    .sort_values(['count'], ascending=False)
- The SQL statement that reflects this elimination of NULL values would look like this (the conditions are added to the SQL WHERE clause):
/* Count the number of records when NULL values are removed. */
SELECT Count(*) FROM CovidData
WHERE new_tests IS NOT NULL
  AND new_cases IS NOT NULL;
Great! There are now 11457 records without missing values.
b) Check and remove inconsistent and error values
We will now check for records where there are more new positive COVID-19 cases than tests in a single day. Such records make no sense for our analysis.
- Execute the following Python code to find those cases:
temp1_before = covid_ds\
    .loc[covid_ds['new_cases'] > covid_ds['new_tests']]
Unfortunately, we have found such cases; however, there are only 65 such records. It is also worrying that we have seen negative values. This situation should be an alarm for the analyst: perhaps the data collection was not adequate, and it may be best to do a new collection or reach for some other data sources.
DECISION: We will remove all rows where the "new_cases" > "new_tests".
- Execute the following Python code to drop all rows where the "new_cases" > "new_tests":
covid_ds.drop(covid_ds[covid_ds['new_cases'] > covid_ds['new_tests']]\
    .index, inplace=True)

temp1_after = covid_ds.loc[covid_ds['new_cases'] > covid_ds['new_tests']]
Also, an identical number of tests and positive cases would be unrealistic, since 100% test accuracy is not expected. This scenario is most likely an error.
- Execute the following Python code to find those cases:
temp2_before = covid_ds\
    .loc[covid_ds['new_cases'] == covid_ds['new_tests']]
There are six records found! The inspection of data shows that it is probably an error.
DECISION: We will remove all rows where the "new_cases" == "new_tests".
- Execute the following Python code to drop all rows where the "new_cases" == "new_tests".
covid_ds.drop(covid_ds[covid_ds['new_cases'] == covid_ds['new_tests']]\
    .index, inplace=True)

temp2_after = covid_ds\
    .loc[covid_ds['new_cases'] == covid_ds['new_tests']]
- Again, we will check the number of records for each column.
number_of_records3 = covid_ds.count()\
    .reset_index(name='count')\
    .sort_values(['count'], ascending=False)
- The SQL statement that will reflect the elimination of errors will look like this:
SELECT Count(*) FROM CovidData
WHERE new_tests IS NOT NULL
  AND new_cases IS NOT NULL
  AND new_cases < new_tests;
Now, we will also check for negative values in "new_tests" and "new_cases". Those values must be removed from our dataset since it is evident that we are dealing with erroneous entries.
- Execute the following Python code to get all negative entries for the "new_tests" column.
temp3 = covid_ds.loc[covid_ds['new_tests'] < 0 ]
No action is required since there are no records with negative value entries for the "new_tests."
- Execute the following Python code to get all negative entries for the "new_cases" column.
temp4_before = covid_ds.loc[covid_ds['new_cases'] < 0]
There are 7 records where the "new_cases" column has a negative value entry.
- Execute the following Python code to remove all negative entries for the "new_cases" column.
covid_ds.drop(covid_ds[covid_ds['new_cases'] < 0]\
    .index, inplace=True)

temp4_after = covid_ds.loc[covid_ds['new_cases'] < 0]
- Again, we will check the number of records for each column.
number_of_records4 = covid_ds.count()\
    .reset_index(name='count')\
    .sort_values(['count'], ascending=False)
- The SQL statement that will reflect the elimination of negative values will look like this:
SELECT Count(*) FROM CovidData
WHERE new_tests IS NOT NULL
  AND new_cases IS NOT NULL
  AND new_cases < new_tests
  AND new_tests > 0
  AND new_cases >= 0;
- Let's see a list of countries to verify entries by executing the following Python code.
list_of_countries = covid_ds.drop_duplicates('location')[['location']]\
    .reset_index()
Note: The reset_index() call adds a new index column starting from 0 to the resulting dataset. Its usage is optional; if it is not used, the original index entries remain, which in some cases can be useful for linking records back to the original data.
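If you do not want the old index kept as a separate column at all, a small variation is to pass drop=True:
# discard the old index entirely instead of keeping it as a column
list_of_countries = covid_ds.drop_duplicates('location')[['location']]\
    .reset_index(drop=True)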
- The corresponding SQL statement would look like this:
SELECT DISTINCT location AS Country
FROM CovidData
WHERE new_tests IS NOT NULL
  AND new_cases IS NOT NULL
  AND new_cases < new_tests
  AND new_tests > 0
  AND new_cases >= 0
ORDER BY location;
We are left with 87 countries for the analysis after data cleaning activity.
This ends the "data preparation" step, which is usually the most time-consuming and complicated part of the analytical process. After these adjustments and data cleaning, the "covid_ds" dataset is prepared for the analysis. As mentioned above, it is not a perfect data source, and its use would probably be questionable in terms of data quality, but for this training it will be just fine.
Step 5: Analyze data
Now it is time for the main show! Soon we will see the fruits of our labor.
We will add a calculated column ("test_percentage") that stores, for each record, the number of new cases as a percentage of the tests done on that day.
- Execute the following Python code to add the calculated column:
covid_ds['test_percentage'] = covid_ds['new_cases'] / covid_ds['new_tests'] * 100
Now it is time to approach the goal of our analysis and list average percentages (cases vs. tests) grouped by countries.
- Execute the following Python code to get the results for our analysis on the full data sample:
avg_percentage_test_cases =\
    covid_ds.groupby(['location'])['test_percentage']\
    .mean()\
    .reset_index()\
    .sort_values(['test_percentage'], ascending=False)
- Execute the following SQL statement to get the same results for our analysis on the full data sample:
SELECT location AS Country,
       AVG(new_cases*1.0 / new_tests) * 100 AS [Average percentage of detection]
FROM CovidData
WHERE new_tests IS NOT NULL
  AND new_cases IS NOT NULL
  AND new_cases < new_tests
  AND new_tests > 0
  AND new_cases >= 0
GROUP BY location
ORDER BY [Average percentage of detection] DESC;
We can also use a time dimension in our query by providing date values to analyze data only in a specific period.
- Execute the following Python code to get the results for a specific time period:
covid_ds_time_filter = covid_ds\
    .loc[(covid_ds['date'] >= '2020-07-01') & (covid_ds['date'] <= '2020-08-01')]

avg_percentage_time_filter =\
    covid_ds_time_filter.groupby(['location'])['test_percentage']\
    .mean()\
    .reset_index()\
    .sort_values(['test_percentage'], ascending=False)
- Execute the following SQL statement to get the same results for the specific time period:
SELECT location AS Country,
       ROUND((AVG(new_cases*1.0 / new_tests) * 100), 4) AS [Average percentage of detection]
FROM CovidData
WHERE new_tests IS NOT NULL
  AND new_cases IS NOT NULL
  AND new_cases < new_tests
  AND new_tests > 0
  AND new_cases >= 0
  AND DATE(date) BETWEEN DATE('2020-07-01') AND DATE('2020-08-01')
GROUP BY location
ORDER BY [Average percentage of detection] DESC;
Note: The BETWEEN operator is inclusive. It returns true when the test_expression is less than or equal to high_expression and greater than or equal to the value of low_expression: test_expression >= low_expression AND test_expression <= high_expression.
Also, note the use of the ROUND function to display results with 4 decimal places.
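In the pandas version, an equivalent rounding could be applied to the result column, for example (a small sketch using the variable created above):
# round the averaged percentages to 4 decimal places, as in the SQL version
avg_percentage_time_filter['test_percentage'] =\
    avg_percentage_time_filter['test_percentage'].round(4)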
Step 6: Interpret results
The data analysis in the previous section provided results, and we can now put them to real use. It is important to always be critical and do a reality check before drawing conclusions. As we mentioned, there are concerns about the quality of the data, and the analysis itself could contain errors. Therefore, before proceeding any further, ask yourself a simple question:
Does it make sense?
I will leave you with this question and let you decide. Remember that information is power, and as we gain more knowledge, we become wiser human beings.
But it is not time to finish our lecture yet! Allow me to show you one very cool thing before we end today's session.
How about some visualization of the results, like you would usually do in Excel or similar applications?
The Python pandas package provides excellent resources for data visualization. We will use the “bar” chart type to show our results as a visual representation. This can be done right from our Python code.
- Enter the following code to plot our result dataset and execute the Python script:
# import libraries that will enable plotting of graphs
import matplotlib.pyplot as plt

# define x and y axes from the results dataset and select "bar" type graph
ax = avg_percentage_time_filter.plot(x='location', y='test_percentage', kind='bar')
plt.show()
The graph is shown under the “Plots” tab on the right side of the “Spyder” editor window.
All the data is there, but the x-axis looks a little overcrowded with labels containing the countries’ names. With a little Python magic, we can rotate the country labels by 90 degrees and display only every 3rd country. And we can do even more customization by changing the axis labels and the title of the graph.
# import libraries that will enable plotting of graphs
import matplotlib.pyplot as plt

# define x and y axes from the results dataset and select "bar" type graph
ax = avg_percentage_time_filter.plot(x='location', y='test_percentage', kind='bar')

# to avoid crowded labels on the x-axis, we will show every third country and
# rotate labels by 90 degrees
for i, t in enumerate(ax.get_xticklabels()):
    if (i % 3) != 0:
        t.set_visible(False)
plt.xticks(rotation=90)

# additionally we can set chart title, legend and axes labels
plt.title('COVID-19 data results')
plt.legend(['from 2020-07-01 to 2020-08-01'])
plt.xlabel('Countries')
plt.ylabel('cases vs. test ratio in %')
plt.show()
Now, it looks much more readable and the graph can be saved or copied directly to your presentation.
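If you also want to save the chart to an image file from code (for example, to drop it into a presentation later), a minimal sketch could look like this; the file name is just an assumption, and the call should run before plt.show() if you execute the script outside Spyder:
# save the current figure to a PNG file in the working directory
fig = ax.get_figure()
fig.savefig('covid19_results.png', dpi=150, bbox_inches='tight')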
There are a lot of ways to adjust and use visualization tools in Python, and I encourage you to explore further. This topic is broad and would probably require a separate, dedicated education session dealing with visualization in detail. But for now, the example provided is enough to get an idea of the basic mechanics of this powerful feature.
Step 7: Testing scripts on the newest data
Before we finish, I suggest running our script on the current data, which can be downloaded from:
https://covid.ourworldindata.org/data/owid-covid-data.csv
As explained above, on the “Our World in Data” web site, you can find the current version of the data.
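As a side note, pandas can also read the ".CSV" file straight from that URL, which skips the manual download. A small sketch (it requires an active internet connection):
# load the latest data directly from the Our World in Data URL
covid_data = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')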
- Rename the downloaded ".CSV" file, copy it to the “Data_analysis” directory, and provide the new file name to the Python script.
covid_data = pd.read_csv('owid-covid-data_new.csv')
- Also adjust the time filter to get results for the last month and run the Python script.
covid_ds_time_filter = covid_ds\
    .loc[(covid_ds['date'] >= '2020-07-25') & (covid_ds['date'] <= '2020-08-25')]

avg_percentage_time_filter =\
    covid_ds_time_filter.groupby(['location'])['test_percentage']\
    .mean()\
    .reset_index()\
    .sort_values(['test_percentage'], ascending=False)

# and on the graph legend
plt.legend(['from 2020-07-25 to 2020-08-25'])
- Execute the Python script to get the new results; you will see some changes compared to the earlier run.
Note that, once made, the script can easily be reused on a new set of data.
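To make that reuse even easier, the whole pipeline could be wrapped in a small function that takes the file name and the date range as parameters. This is only a sketch of the idea (the function name and parameters are mine, not part of the tutorial steps); it reuses the cleaning rules from Step 4:
import pandas as pd

def average_positivity(csv_file, date_from, date_to):
    # load the data and keep only the columns we need
    data = pd.read_csv(csv_file)
    ds = data[['location', 'date', 'new_tests', 'new_cases']]

    # cleaning rules from Step 4: drop missing values, errors and negative entries
    ds = ds.dropna(subset=['new_tests', 'new_cases'])
    ds = ds[(ds['new_cases'] < ds['new_tests']) &
            (ds['new_cases'] >= 0) & (ds['new_tests'] > 0)].copy()

    # calculated column and time filter
    ds['test_percentage'] = ds['new_cases'] / ds['new_tests'] * 100
    ds = ds[(ds['date'] >= date_from) & (ds['date'] <= date_to)]

    # average percentage per country, highest first
    return ds.groupby(['location'])['test_percentage'].mean()\
        .reset_index()\
        .sort_values(['test_percentage'], ascending=False)

# example call on the renamed file from this step
result = average_positivity('owid-covid-data_new.csv', '2020-07-25', '2020-08-25')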
As an exercise, you can verify the results with SQL. Once again, you are the one with the task of interpreting the results.
Complete source code and data can be found on my GitHub location for this project:
https://github.com/nevendujmovic/Data_analysis_COVID-19_dataset
My friends, this is the end of today's tutorial, and I hope it was time well spent.
Now you know how to prepare your desktop environment for data analysis in a short time and perform the necessary steps of an analytical process.
And I hope you enjoyed using this tutorial as much as I have enjoyed writing it!
I wish you all the best, and stay healthy!
Author
Neven Dujmovic
P.S. If you liked this tutorial, be sure to check out the other tutorials in my “naive” series on the LinkedIn platform.
During the COVID-19 pandemic, I provided free education via the LinkedIn articles platform. All tutorials cover useful and advanced information technology concepts, but they are presented in an “easy to learn” way and are adjusted for beginners.
Each tutorial:
- is completely free to use,
- uses only non-commercial software,
- is documented in a high level of detail,
- provides live screenshots from the IT systems used for each tutorial step,
- is quick to finish (maximum 6-8 hours), and the adopted knowledge has direct application,
- provides materials that can serve as a reference in further usage of the technology.
The following IT tutorials are available:
- My "naive" Java tutorial with Quarkus on Linux (https://www.dhirubhai.net/pulse/my-naive-java-tutorial-quarkus-linux-neven-dujmovic/)
- My “naive” Python tutorial (https://www.dhirubhai.net/pulse/my-naive-python-tutorial-neven-dujmovic/)
- My “naive” virtualization & Linux installation tutorial (https://www.dhirubhai.net/pulse/my-naive-virtualization-linux-installation-tutorial-neven-dujmovic)
- My “naive” Python development tutorial on Linux (https://www.dhirubhai.net/pulse/my-naive-python-development-tutorial-linux-neven-dujmovic)
- My “naive” Microsoft SQL Server desktop installation and usage tutorial (https://www.dhirubhai.net/pulse/my-naive-microsoft-sql-server-desktop-installation-usage-dujmovic/)