Innovating the Social Sciences with Cutting-Edge Data Science
Personally, I identify as a striving intellectual whose main interests lie squarely in the Social Sciences. I have been thoroughly inspired by the likes of Adam Grant, Jonathan Haidt, and many others; amazed by the research methods and literature available and applied in the modern world; and driven by the prospect of projects such as, say, analyzing which economic system would best suit a particular region given its demographic composition, geographic location, and current economic policies.
Thus, hypothesizing about how the power of modern, cutting-edge Data Science tools could be applied to the vast amounts of well-researched and, thankfully, highly available social science data, I decided to take on a new project.
This would be another full-scale investigation into a topic of my choosing - one that would serve not only as a strong addition to my professional portfolio and a display of my intermediate-to-advanced Data Science skills, but also as an eye-opening and entertaining read for my audience.
And, in this work, I decided to integrate more Generative AI tools into my skill set as well. In the modern era, it is almost a necessity to be well-versed in this revolutionary technology and tools such as ChatGPT - hence, this project would also serve as a showcase of my ability to use Generative AI efficiently alongside core Data Science skills.
Let's Get To It
Before we proceed to the work itself, it should be made clear that this project is focused on the exploratory side - on genuinely having fun and truly "connecting" with the data by applying higher-level Data Science techniques.
The insights gleaned are a secondary (though still important) outcome - readers should focus on the process of unearthing and understanding surprising patterns in the data, rather than on the results themselves.
Additionally, unlike my other projects, this one does not follow Google's six-stage approach to Data Analytics; it is organized around two main "phases", each containing several focused sub-stages of work.
Also, all readers are more than welcome to take inspiration from my code and use it on their own. You can find my entire GitHub repository for this project here. Have fun and enjoy learning!
Preliminary Phase: Obtaining & Preparing the Data
A total of six different datasets were exported from Our World in Data (ourworldindata.org), in the form of CSV files. The main purpose was to obtain reliable data especially pertaining to demographics, historical economics, and socioeconomic information about the nations of the world and across time.
Once these six datasets were downloaded from the website and stored in a folder created specifically for this project's files, one of the files was opened and saved as an Excel Workbook (.xlsx format). Five new sheets were then created in this workbook, and the entire dataset from each of the other five CSV files was pasted into its own sheet. This way, there was one consolidated Excel workbook to work from.
Once the sheets were finalized, all of the CSV files were deleted and work resumed on the one Excel workbook. The sheets were renamed according to the topic of their respective datasets.
Quite crucially, one sheet contained information regarding the share of a nation's population living in urban areas, as a percentage. That was the only main measurement it contained. So, it was decided to use the incisive XLOOKUP function shown below to extract that column and align its values perfectly in sync with its matching "Country" and "Year" values in the first sheet, AVG_AdultHeight_1896to1996, in which there was a new matching column made titled "Urban Population (% of Total Population)":
The sheet with only the original urban population share values was hidden.
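For readers who prefer code to spreadsheet formulas, a rough pandas equivalent of that Country-and-Year lookup might look like the sketch below (the workbook file name, the urban sheet's name, and its original column name are assumptions, not my exact ones):

```python
import pandas as pd

# File, sheet, and original column names here are illustrative assumptions
workbook = "project_workbook.xlsx"
heights = pd.read_excel(workbook, sheet_name="AVG_AdultHeight_1896to1996")
urban = pd.read_excel(workbook, sheet_name="UrbanPopulationShare")

# Rename the measure column to the title used in the heights sheet
urban = urban.rename(
    columns={"Urban population (% of total)": "Urban Population (% of Total Population)"}
)

# A left join on Country + Year mirrors what XLOOKUP does row by row
heights = heights.merge(
    urban[["Country", "Year", "Urban Population (% of Total Population)"]],
    on=["Country", "Year"],
    how="left",
)
```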
Subsequently, a new column was added to each sheet to verify that the values in the "Country" and "Code" columns were in Proper and UPPER case, respectively. This column was named "Case Check", and the formula applied to each row (assuming Country in column A and Code in column B) was:
=IF(AND(EXACT(A2, PROPER(A2)), EXACT(B2, UPPER(B2))), "Fine", "Improper Case Found")
It may not appear so, but the case check is essential: inconsistent casing can create false distinct values, cause rows to be missed during aggregation, and so on. Thankfully, none of the sheets yielded any "Improper Case Found" results.
Further data verification "checks" were completed: removing duplicates, ensuring appropriate data types for each column, and detecting any disguised NULL values.
Once this preliminary and base-building phase of data collection and organization was complete, the project moved on to the Data Wrangling, Exploring, and Manipulation phase.
Main Phase: Exploring, Dissecting, and Visualizing the Data, and Conclusions Derived
The Data Wrangling stage began in PostgreSQL, through its pgAdmin 4 graphical user interface. However, to get each sheet of the workbook into the PostgreSQL database "selfproject1" (created specifically for this project) as a distinct table, Python was used in a Jupyter Notebook!
By utilizing Python's "pandas" and "psycopg2" libraries:
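One way to wire this up - routing pandas' to_sql through SQLAlchemy with psycopg2 as the driver - is sketched below; the connection details and workbook file name are placeholders rather than my actual setup:

```python
import pandas as pd
from sqlalchemy import create_engine  # psycopg2 acts as the PostgreSQL driver here

# Placeholder credentials - replace with your own
engine = create_engine("postgresql+psycopg2://username:password@localhost:5432/selfproject1")

# Read every sheet of the consolidated workbook into a dict of DataFrames
sheets = pd.read_excel("project_workbook.xlsx", sheet_name=None)

# Write each sheet to its own table, named after the sheet
for sheet_name, df in sheets.items():
    df.to_sql(sheet_name.lower(), engine, if_exists="replace", index=False)
```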
SQL
Now that the tables were in PostgreSQL, the Data Wrangling phase finally began.
It started with the basic check of querying all the data from each table (SELECT * FROM table_name) to ensure everything had loaded correctly. This might seem purely ceremonial, but it was very helpful in spotting immediately apparent anomalies - and it did: every table contained continent names in the column meant for countries (e.g., "Africa" listed in the Country column).
Hence, all such misleading rows were deleted from the tables.
Subsequently, from all columns which contained important quantitative measures, such as the mean male height (in cm) or the average performance of 15-year-olds on the science scale, all values which contained non-numeric characters were converted to NULL values.
These preliminaries or "base preparations" were then followed by converting any quantitative measure column with an inappropriate data type to a numeric type, and finished with renaming longer column names to shorter, more fitting ones.
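To give a flavour of these cleaning steps, here is an illustrative sketch, wrapped in psycopg2 purely to keep all the examples in this article in Python (I actually ran the statements in pgAdmin 4); the table names, column names, and credentials are placeholders:

```python
import psycopg2

# Placeholder connection details
conn = psycopg2.connect(dbname="selfproject1", user="username", password="password", host="localhost")
cur = conn.cursor()

# Remove continent rows that were sitting in the country column
cur.execute("""
    DELETE FROM avg_adultheight_1896to1996
    WHERE country IN ('Africa', 'Asia', 'Europe', 'North America', 'South America', 'Oceania');
""")

# Blank out non-numeric values in a measure column, then cast the column to NUMERIC
cur.execute(r"""
    UPDATE avg_adultheight_1896to1996
    SET mean_male_height = NULL
    WHERE mean_male_height !~ '^[0-9]+(\.[0-9]+)?$';
""")
cur.execute("""
    ALTER TABLE avg_adultheight_1896to1996
    ALTER COLUMN mean_male_height TYPE NUMERIC USING mean_male_height::NUMERIC;
""")

# Shorten an unwieldy column name
cur.execute("""
    ALTER TABLE avg_adultheight_1896to1996
    RENAME COLUMN urban_population_percent_of_total_population TO urban_pop_share;
""")

conn.commit()
cur.close()
conn.close()
```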
Finally, I moved on to properly exploring the data! All-time average scores of 15-year-olds on each of the science, mathematics, and reading scales by country; population and GDP per capita of each country from 1965 to 2023; pulling up 'High Income' countries by utilizing an effective subquery for a threshold of at least $20,000 in GDP per capita - these were all examples of the data exploration done.
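For instance, the "High Income" pull could take a shape like the following - whether the $20,000 threshold is applied to a country's average or latest GDP per capita is my assumption here (the sketch uses the average), and the names remain placeholders:

```python
import psycopg2

conn = psycopg2.connect(dbname="selfproject1", user="username", password="password", host="localhost")
cur = conn.cursor()

# Countries whose average GDP per capita meets the $20,000 'High Income' threshold
cur.execute("""
    SELECT country, year, gdp_per_capita
    FROM socioeconomicchanges_1965to2023
    WHERE country IN (
        SELECT country
        FROM socioeconomicchanges_1965to2023
        GROUP BY country
        HAVING AVG(gdp_per_capita) >= 20000
    )
    ORDER BY country, year;
""")
high_income_rows = cur.fetchall()

cur.close()
conn.close()
```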
Some of the most significant and in-depth data manipulations were done to glean, for every nation, fascinating year-over-year changes in average height and urban population share, with records going back as far as 1896. Such measurements are better shown in visual form, and we will get to that later.
And so it went with Data Wrangling and Exploration in SQL; many intriguing peculiarities were noticed, which only reinforced my belief that SQL is a fun, powerful, and important language for every Data Scientist.
If you would like to review my complete SQL code for this project, you may easily visit the GitHub link to my project.
Python
Moving forward, the Python programming language was chosen as the platform for further data manipulation.
Once in a new Jupyter Notebook, the four main libraries for this project were imported: Pandas, NumPy, Matplotlib, and Seaborn. The consolidated Excel workbook was then loaded, each sheet was looped through and stored in its own dataframe, and each dataframe was assigned to a distinct variable (from "df1" to "df6", for a total of six).
The "Case Check" column was also removed, through a "for" loop, from each sheet and then I proceeded on to Exploratory Data Analysis (EDA) in Python.
Beginning with the basics (the .info() and .describe() methods for preliminary descriptive statistics, dropping any duplicates, and setting data types to more appropriate values - yes, these steps had to be repeated on the Python side), the mining of the data moved into more detailed procedures such as extracting all unique country names, sorting them alphabetically, and assigning a serial number to each for subsequent trend-based visualizations.
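Continuing with df1 from the sketch above as an example (column names assumed), the gist of those steps was:

```python
# Structural overview and descriptive statistics
df1.info()
print(df1.describe())

# Drop exact duplicate rows and coerce the year column to an integer type
df1 = df1.drop_duplicates()
df1["Year"] = df1["Year"].astype(int)

# Unique country names, alphabetically sorted, each given a serial number for later plotting
countries = sorted(df1["Country"].unique())
country_ids = {name: i + 1 for i, name in enumerate(countries)}
```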
EDA was not done as much in Python for this project, as SQL had already sort-of fulfilled that role.
Some Amazing Visualizations Indeed
Python was also chosen as the main tool for data visualization in this endeavor, along with some Excel - unlike my other projects, I decided not to use Tableau or Power BI and to instead focus on Python's excellent data visualization capabilities.
This "trend-based" visualization manifested itself in its "ultimate form"; a linear graph displaying average mathematics, science, and reading scores as well as average male and female height and average urban population share, across time:
Mind you, this visualization demanded a lengthy coding process of carefully merging dataframes, plotting each measure's line, and customizing the chart for optimal display. You are welcome to view the complete Python code here.
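A stripped-down sketch of the approach - merge on Country and Year, average each measure by year, then plot one line per measure - could look like this; the file, sheet, and column names are assumptions, not my exact ones:

```python
import pandas as pd
import matplotlib.pyplot as plt

workbook = "project_workbook.xlsx"  # file and sheet names assumed
df_scores = pd.read_excel(workbook, sheet_name="AVG_Scores_15YearOlds")
df_height = pd.read_excel(workbook, sheet_name="AVG_AdultHeight_1896to1996")
df_urban = pd.read_excel(workbook, sheet_name="UrbanPopulationShare")

# Combine the measures on their shared Country/Year keys
merged = (
    df_scores.merge(df_height, on=["Country", "Year"], how="outer")
             .merge(df_urban, on=["Country", "Year"], how="outer")
)

# Average every measure across countries for each year (column names assumed)
measures = [
    "Math Score", "Science Score", "Reading Score",
    "Mean Male Height (cm)", "Mean Female Height (cm)",
    "Urban Population (% of Total Population)",
]
yearly = merged.groupby("Year")[measures].mean()

# One line per measure on a shared time axis
ax = yearly.plot(figsize=(12, 6))
ax.set_xlabel("Year")
ax.set_ylabel("Average value (units vary by measure)")
ax.set_title("Scores, heights, and urban population share over time")
plt.show()
```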
Similarly, in Excel, more advanced analytics and visualization techniques were applied by creating a Pivot Table and a corresponding Pivot Chart. The SocioeconomicChanges_1965to2023 sheet was chosen for its wide array of insightful quantitative measures; from it, a pivot table was built showing average GDP per capita for each year from 1965 to 2023, aggregated across all nations. Additionally, the "Country" column was added as a filter - anyone wanting to view a single nation's average GDP per capita over time could simply select that nation in the filter.
Based on the aggregated measures, a pivot chart was made in line graph form:
The overall trend is healthy and takes an expected shape; materially, the world has improved in terms of living standards, technological advancement, and human rights. Furthermore, noticeable dips in GDP per capita appear in expected periods, such as 2020, the year COVID-19 hit.
Coming back to Python, box-and-whisker plots were made too. The standout examined the share of each country's population living in urban areas in the latest year of measurement, 2023. And here it is:
As can be deduced, on average, 51.38% of any given country's population can be expected to be living in urban areas in the year 2023.
This is not a bad sign, as the average includes many poorer countries. Residing in urban areas usually gives people more access to modern resources - education, healthcare, infrastructure, public services, and the like - which improves their chances of contributing to the nation's economy. So, if about half of a typical nation's population resided in urban areas in 2023, that can be classified as a good sign of progress made over time.
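For completeness, here is a compact sketch of how such a box plot could be drawn with seaborn; the file, sheet, and column names are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# File, sheet, and column names are assumptions
df_urban = pd.read_excel("project_workbook.xlsx", sheet_name="UrbanPopulationShare")
urban_2023 = df_urban[df_urban["Year"] == 2023]

# Box-and-whisker plot of urban population share across all countries in 2023
plt.figure(figsize=(8, 4))
sns.boxplot(x=urban_2023["Urban Population (% of Total Population)"])
plt.title("Share of population living in urban areas, 2023")
plt.xlabel("Urban population (% of total population)")
plt.show()
```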
Finally, and most intriguingly of all, an all-encompassing correlation matrix was designed for all quantitative measures. This would be a great overview, providing an overarching insight into how each measure correlated with all others as well as helping identify key correlations between metrics to capitalize on. To top it all off, it would look very aesthetically pleasing!
At last, there we have it:
Very incisive and appealing. Served its purpose well as a modern visualization for wide-ranging insights. Some results were self-explanatory, like the total population of a nation being highly correlated (0.98) with the total food available (in kilograms) in the nation due to the country having a higher workforce to produce more food as well as needing more food imported to feed a larger population.
Some results, on the other hand, were a bit shocking, such as urban population share being only weakly to moderately correlated (0.41-0.44) with scores in mathematics, reading, and science. This was alarming because, as previously mentioned, living in urban areas is expected to grant people more access to modern resources for education and overall quality of life.
But that is why we do Data Science and Analytics: to bring these hidden patterns to the surface!
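For anyone curious how such a heatmap is produced, a minimal seaborn sketch follows; "df_final" stands in for the merged dataframe of all quantitative measures and is an assumed name:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# df_final: the merged dataframe containing all quantitative measures (name assumed)
corr = df_final.select_dtypes(include="number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of all quantitative measures")
plt.tight_layout()
plt.show()
```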
Ending it in Python
By the end, a "final" dataframe had been assembled along the way that included all the important quantitative measures (e.g., average performance of 15-year-olds in science, mean male height) as well as some qualitative ones (e.g., country, year), and it was exported as a CSV file into my project folder. This file would be used to build a multiple regression model in the R statistical programming language.
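That export itself is a one-liner in pandas (the dataframe and file names here are illustrative):

```python
# Write the consolidated dataframe to disk for the R stage (names illustrative)
df_final.to_csv("final_data.csv", index=False)
```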
R
The Multiple Regression model was built to predict the average performance of 15-year-olds in science (the dependent variable) from these independent variables: average performance of 15-year-old students in reading, GDP per capita, urban population share (% of total population), and total food available (kilocalories per day).
These independent variables were selected because they appeared intuitively intertwined with the dependent variable, science performance. If a person performs well in reading (and the correlation matrix gives a whopping 0.97 correlation between the two!), they can be expected to perform better in science, on average, than someone who performed worse in reading.
Similarly, the more of a country's people live in urban areas, the greater their access to superior facilities and, thus, the higher the chances of a better science score. Lastly, if more food is available in a nation, it is more likely that a larger and healthier population can be sustained, which should make higher science scores more probable than in a poorer, more rural setting.
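The model itself was built in R; to keep the code examples in this article in a single language, here is a hypothetical Python (statsmodels) sketch of the same specification - the CSV file name and column names are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf

# File and column names are assumptions
data = pd.read_csv("final_data.csv")

# Multiple regression: science score predicted from reading score, GDP per capita,
# urban population share, and food availability
model = smf.ols(
    "science_score ~ reading_score + gdp_per_capita + urban_pop_share + food_supply_kcal",
    data=data,
).fit()

print(model.summary())
```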
As we finally delve into the summary statistics of the regression model, it is worthwhile to give readers an overview of the core aspects of the results to come:
Regression Coefficients:
Unstandardized Coefficients: Practical impact in original units.
Example: A coefficient of -0.0928 for urban population share means a 1 percentage point increase in urban population share is associated with a 0.0928-point decrease in science scores.
Standardized Coefficients: Relative importance of variables, unitless and directly comparable.
Example: reading scores with a coefficient of 0.97 have a significantly stronger positive impact on science scores (1 standard deviation increase in reading scores leads to a 0.97 standard deviation increase in science scores, holding all else constant) than urban population share, which has a small negative impact of -0.03.
Both unstandardized and standardized coefficients help interpret regression models: one shows effect size in real units, the other the relative importance of each variable (see the short illustration below).
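As a quick, generic illustration of the relationship between the two: a standardized coefficient is simply the unstandardized one rescaled by the predictor's and the outcome's standard deviations. The standard deviations below are invented purely to show the mechanics:

```python
def standardized_coefficient(b_unstd: float, sd_x: float, sd_y: float) -> float:
    """Rescale an unstandardized regression coefficient into a standardized (beta) coefficient."""
    return b_unstd * sd_x / sd_y

# Invented standard deviations, purely to demonstrate the conversion
print(standardized_coefficient(b_unstd=-0.0928, sd_x=22.0, sd_y=55.0))  # about -0.037
```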
At last, the summary statistics of the Multiple Regression model that was built:
The average performance of 15-year-olds in reading appeared to have a significant impact on average performance in science: a 1-point increase in reading scores was associated with a 4.34-point increase in science scores, on average. Massive, but not surprising.
What was surprising was the negative correlation of food availability and urban population share with science scores - these are metrics which would normally be assumed to be positively related to science performance. After evaluating the Standardized Regression Coefficients in the following part, we will discuss this in more detail.
The Standardized Regression Coefficients:
In the analysis, the standardized coefficients gave a clear view of the relative importance of each predictor variable in the model. Here's a breakdown of how to interpret these results:
Urban population share: the (unstandardized) coefficient of this variable in the regression model was -0.0928, meaning that for every 1 percentage point increase in urban population share, the average performance in science decreases by 0.0928 points. Not a big decrease but, still, the inverse relationship is shocking and worthy of further examination.
Food availability: just like its predecessor, the (unstandardized) coefficient of this variable is extremely small in magnitude (1.3 x 10^-11) - far smaller than even urban population share's. The impact of food availability on science scores can thus be considered negligible. It is still not the expected positive relationship, but that is exactly why we dig into data: to test our ideas against it from all angles.
This was a section full of pertinent findings. Regression is definitely a core statistical modeling concept and I am glad I got to utilize it in an efficient and exciting way to detect massive hidden trends.
This concluded my analysis in the statistical programming language of R.
Wrapping It All Up: Final Results & Implications
And with that, this project reaches its conclusion. This was a healthily challenging endeavor in which I got to employ a meticulous methodology that examined the situation from all possible angles. My training and experience - technical and mental, both - in Data Science were exemplified through this social science-oriented analysis, in efficiently organizing and analyzing the information while exploring various potential pathways; it demonstrates the rigor and dedication of a Data Analyst committed to continuous growth.
I had a great time in this endeavor - having the freedom to choose a topic of my liking, formulating potentially rich hypotheses to test with my acquired skills in data science and its array of popular tools, and coalescing all varying parts of the process together into a coherent whole also digestible for the public.
Truly, I realized that the role of a Data Scientist is much more than being able to interpret the data and analyze it; it also delves into meshing all its parts together wonderfully with acute business sense and creating a good story to convey to stakeholders and enthusiasts alike. Additionally, the Data Science professional must always be on the lookout for new skills to acquire and to perpetually refine his skill set by integrating new technologies (as Generative AI is for this era) into his arsenal.
It is my sincere hope that you enjoyed the process behind this work of mine as much as I did. And, as mentioned before, my work is free for any fellow enthusiast to take inspiration from.
Lastly, before ending, even though I mentioned that this project was focused on the process and not the result, I would still like to share some key takeaways from the process which I found simultaneously intriguing and insightful:
Thank you, dear reader, for your time. If you would like to connect with me or contact me for prospective discussions, feel free to reach out to me via LinkedIn.