Innovating the Social Sciences with Cutting-Edge Data Science

Personally, I identify as a striving intellectual whose main interests lie squarely in the Social Sciences. I have been thoroughly inspired by the likes of Adam Grant, Jonathan Haidt, and many others; amazed by the research methods and literature available and applied in the modern world; and drawn to projects such as, say, analyzing the best economic system for a particular area given its demographic composition, geographic location, current economic policies, and so on.

Thus, wondering how the power of modern, cutting-edge Data Science tools could be applied to the vast amounts of well-researched and, thankfully, highly available social science data, I decided to take on a new project.

This would be another full-scale investigation into a topic that would serve not only as a strong addition to my professional portfolio and a display of my intermediate- to advanced-level Data Science skills, but also as an eye-opening and entertaining read for my readers.

In this work, I also decided to integrate Generative AI tools further into my skill set. In the modern era, it is almost a necessity to be well-versed in this revolutionary technology and its tools, such as ChatGPT - hence, this project would also serve as a showcase of my ability to efficiently utilize Generative AI tools alongside core Data Science skills.


Let's Get To It

Before we proceed to the work itself, it should be made clear that this project is focused more on the exploratory side - on authentically having fun and truly "connecting" with the data by applying higher-level Data Science techniques.

The insights gleaned were a secondary (yet important) consequence; the reader's main focus should be on the process - unearthing and comprehending the surprising patterns across the data - rather than on the result.

Additionally, unlike my other projects, this one does not utilize the Google Six-Stage approach to Data Analytics; instead, it is organized around two main "phases", each containing subcategories of more concentrated work.

Also, all readers are more than welcome to take inspiration from my code and utilize it on their own. You can find my entire GitHub portfolio for this project here. Have fun and enjoy learning!


Preliminary Phase: Obtaining & Preparing the Data

A total of six different datasets were exported from Our World in Data (ourworldindata.org), in the form of CSV files. The main purpose was to obtain reliable data especially pertaining to demographics, historical economics, and socioeconomic information about the nations of the world and across time.

Once these six datasets were downloaded from the website and stored in a folder specifically created for this project's files, one of the files was opened and saved as an Excel Workbook (.xlsx format). In this workbook, then, five new sheets were created and on each sheet was pasted the entire dataset from each of the other CSV files originally downloaded. This way, there was one consolidated Excel workbook to use.

Once the sheets were finalized, all of the CSV files were deleted and work resumed on the one Excel workbook. The sheets were renamed according to the topic of their respective datasets.

Quite crucially, one sheet contained information on the share of a nation's population living in urban areas, as a percentage - the only main measurement it contained. So, it was decided to use the handy XLOOKUP function to extract that column and align its values with their matching "Country" and "Year" values in the first sheet, AVG_AdultHeight_1896to1996, where a new matching column was created titled "Urban Population (% of Total Population)".
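For readers following along in Python rather than Excel, here is a minimal pandas sketch of the same lookup logic; the workbook file name and the urban-population sheet name are assumptions, not the original names:

```python
import pandas as pd

# Load the two sheets involved (file and urban-sheet names are assumed here).
heights = pd.read_excel("consolidated_workbook.xlsx", sheet_name="AVG_AdultHeight_1896to1996")
urban = pd.read_excel("consolidated_workbook.xlsx", sheet_name="UrbanPopulationShare")

# Equivalent of the XLOOKUP step: bring the urban-share column across,
# matched on the Country and Year of each height row.
heights = heights.merge(
    urban[["Country", "Year", "Urban Population (% of Total Population)"]],
    on=["Country", "Year"],
    how="left",  # keep every height row even when no urban-share value matches
)
```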


The sheet with only the original urban population share values was hidden.

Subsequently, a new column was added to each sheet to verify that the values in the "Country" and "Code" columns were in Proper case and UPPER case, respectively. This column was named "Case Check", and the formula applied to verify the cases (here assuming the country names sit in column A and the codes in column B) was:

=IF(AND(EXACT(A2, PROPER(A2)), EXACT(B2, UPPER(B2))), "Fine", "Improper Case Found")

It may not seem so, but the case-checking was essential: inconsistent casing can produce spurious distinct values, cause rows to be missed in groupings and aggregations of column measures, and so on. Thankfully, none of the sheets yielded any "Improper Case Found" results.

Further data verification measures or "checks" were completed in the form of removing duplicates, ensuring appropriate data types for each column, and detecting any misdiagnosed NULL values.

Once this preliminary and base-building phase of data collection and organization was complete, the project moved on to the Data Wrangling, Exploring, and Manipulation phase.


Main Phase: Exploring, Dissecting, and Visualizing the Data, and Conclusions Derived

The Data Wrangling stage began in PostgreSQL, through its pgAdmin 4 graphical user interface. However, to get each sheet of the workbook into the PostgreSQL database "selfproject1" (created specifically for this project) as a distinct table, Python was used from a Jupyter Notebook!

By utilizing Python's "pandas" and "psycopg2" libraries (a minimal sketch follows the list below):

  1. A connection was established to my PostgreSQL database.
  2. The Excel workbook file was loaded in and saved as an object.
  3. Each sheet within the workbook was looped through, its column names and sheet name itself checked for and cleaned of any special or inapplicable characters.
  4. A table for each sheet was created with that respective sheet's data inserted into it.
  5. Each table was added to the "selfproject1" database.
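A minimal sketch of these five steps, assuming placeholder file name, connection credentials, and an all-TEXT initial column typing (column types were refined later in SQL):

```python
import re
import pandas as pd
import psycopg2
from psycopg2.extras import execute_values

# 1. Connect to the project database (credentials are placeholders).
conn = psycopg2.connect(dbname="selfproject1", user="postgres",
                        password="***", host="localhost", port=5432)
cur = conn.cursor()

def clean(name: str) -> str:
    """Lower-case a sheet/column name and strip characters SQL dislikes."""
    return re.sub(r"\W+", "_", str(name).strip()).strip("_").lower()

def to_native(value):
    """Convert pandas/NumPy scalars to plain Python values psycopg2 can adapt."""
    if pd.isna(value):
        return None
    return value.item() if hasattr(value, "item") else value

# 2. Load the consolidated workbook; sheet_name=None returns a dict of DataFrames.
sheets = pd.read_excel("consolidated_workbook.xlsx", sheet_name=None)

# 3-5. Loop through the sheets, create one table per sheet, and insert its rows.
for sheet_name, df in sheets.items():
    table = clean(sheet_name)
    df.columns = [clean(c) for c in df.columns]
    cols = ", ".join(f'"{c}" TEXT' for c in df.columns)   # types refined later in SQL
    cur.execute(f'DROP TABLE IF EXISTS "{table}"; CREATE TABLE "{table}" ({cols});')
    rows = [tuple(to_native(v) for v in row) for row in df.itertuples(index=False)]
    execute_values(cur, f'INSERT INTO "{table}" VALUES %s', rows)

conn.commit()
cur.close()
conn.close()
```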


SQL

Now that the tables were in PostgreSQL, the Data Wrangling phase finally began.

Work started with the basic check of running a query to see all the data in each table (SELECT * FROM table_name), ensuring every table had loaded correctly. This might seem purely ceremonial, but it was very helpful for spotting immediately apparent anomalies. And it did: in all of the tables, continents turned out to be listed in the columns meant to hold countries (e.g., "Africa" appearing in the Country column).

Hence, all rows with such misleading values were deleted.

Subsequently, from all columns which contained important quantitative measures, such as the mean male height (in cm) or the average performance of 15-year-olds on the science scale, all values which contained non-numeric characters were converted to NULL values.

These preliminaries, or "base preparations", were then followed by converting any quantitative measure column with an inappropriate data type to a numeric type, and finished with renaming longer column names to shorter, more fitting ones.
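To make these cleanup steps concrete, here is a minimal sketch of the kind of SQL involved - run here through psycopg2 from Python purely for convenience (the original work was done in pgAdmin), with table, column, and continent names assumed rather than copied from the original queries:

```python
import psycopg2

# Placeholder connection details; table, column, and continent names below are assumed.
conn = psycopg2.connect(dbname="selfproject1", user="postgres", password="***", host="localhost")
cur = conn.cursor()

# 1. Drop continent-level rows that were sitting in the Country column.
continents = ("Africa", "Asia", "Europe", "North America", "South America", "Oceania", "World")
cur.execute("DELETE FROM avg_adultheight_1896to1996 WHERE country IN %s;", (continents,))

# 2. Turn non-numeric junk in a measure column into NULLs, then cast the column to NUMERIC.
cur.execute(r"""
    UPDATE avg_adultheight_1896to1996
       SET mean_male_height_cm = NULL
     WHERE mean_male_height_cm !~ '^[0-9]+(\.[0-9]+)?$';
""")
cur.execute("""
    ALTER TABLE avg_adultheight_1896to1996
    ALTER COLUMN mean_male_height_cm TYPE NUMERIC
    USING mean_male_height_cm::NUMERIC;
""")

# 3. Rename an unwieldy column to something shorter.
cur.execute("""
    ALTER TABLE avg_adultheight_1896to1996
    RENAME COLUMN urban_population_pct_of_total_population TO urban_pop_pct;
""")

conn.commit()
cur.close()
conn.close()
```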

Finally, I moved on to properly exploring the data! All-time average scores of 15-year-olds on each of the science, mathematics, and reading scales by country; population and GDP per capita of each country from 1965 to 2023; pulling up 'High Income' countries by utilizing an effective subquery for a threshold of at least $20,000 in GDP per capita - these were all examples of the data exploration done.

Some of the most significant and in-depth data manipulations were done to glean:

  • Each country's average science, mathematics, and reading scores, along with the overall average for each measure and each country's difference from that overall average - computed using a Common Table Expression (CTE) and a JOIN (a pandas sketch of this and of the window-function step below follows this list).

  • Since science is arguably the most crucial and all-encompassing measure of knowledge and competence, the results were ordered from highest to lowest all-time average science score. And the results were revealing indeed:

  • Utilizing window functions to measure raw mean male and female height by year, as well as the raw and percentage change in average height from each year to the next - and applying the same measurements to urban population share (% of total population).

For every nation, there were fascinating yearly measurements of changes in height and urban population share, with the earliest recorded year being 1896. Such measurements are better shown in visual form, and we will get into that later.

  • The dynamic measurements - each country's average science, mathematics, and reading scores over the years, coupled with each country's all-time average population, urban population share, and GDP per capita.
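For readers who want a concrete feel for the first and third computations above, here is a minimal pandas sketch of their logic; the dataframes, column names, and values below are purely illustrative and are not taken from the original SQL:

```python
import pandas as pd

# Tiny illustrative data; the real dataframes came from the project's tables (names assumed).
scores = pd.DataFrame({
    "country": ["Finland", "Finland", "Brazil", "Brazil"],
    "science": [545, 540, 400, 405],
    "math":    [540, 535, 390, 395],
    "reading": [535, 530, 410, 412],
})
heights = pd.DataFrame({
    "country": ["Finland", "Finland", "Brazil", "Brazil"],
    "year": [1995, 1996, 1995, 1996],
    "mean_male_height_cm": [179.0, 179.2, 172.5, 172.8],
})

# CTE + JOIN equivalent: per-country averages, the overall average, and each country's gap from it.
country_avgs = scores.groupby("country")[["science", "math", "reading"]].mean()
overall_avgs = scores[["science", "math", "reading"]].mean()
diffs = country_avgs - overall_avgs                      # difference from the overall average
summary = country_avgs.join(diffs.add_suffix("_diff")).sort_values("science", ascending=False)

# Window-function equivalent: raw and percentage year-over-year change in mean male height per country.
heights = heights.sort_values(["country", "year"])
heights["height_change"] = heights.groupby("country")["mean_male_height_cm"].diff()
heights["height_change_pct"] = heights.groupby("country")["mean_male_height_cm"].pct_change() * 100
```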

And so on it went with Data Wrangling and Exploration in SQL; many intriguing peculiarities were noticed, further strengthening my belief that SQL is a fun, powerful, and important language for any Data Scientist.

If you would like to review my complete SQL code for this project, you may easily visit the GitHub link to my project.


Python

Moving forward, the Python programming language was chosen as the platform for further Data Manipulation.

Once in a new Jupyter Notebook, the four main libraries for this project were imported: Pandas, NumPy, Matplotlib, and Seaborn. Following that, the consolidated Excel workbook was loaded in, and each sheet was looped through and stored in a separate dataframe. Lastly, each sheet was assigned to a distinct variable (starting with "df1" and ending at "df6", for a total of six variables).

The "Case Check" column was also removed, through a "for" loop, from each sheet and then I proceeded on to Exploratory Data Analysis (EDA) in Python.

Beginning with the basics (the .info() and .describe() methods for preliminary descriptive statistics, dropping duplicates, and casting columns to more appropriate data types - yes, these checks had to be repeated on the Python side), the mining of the data then moved into more detailed procedures, such as extracting all unique country names, sorting them alphabetically, and assigning a serial number to each for subsequent trend-based visualizations.
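A minimal sketch of this setup, with the workbook file name, sheet order, and "Country" column name assumed:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load every sheet of the consolidated workbook into a dict of dataframes.
sheets = pd.read_excel("consolidated_workbook.xlsx", sheet_name=None)

# Assign each sheet to its own variable, df1..df6, in sheet order (six sheets assumed).
dfs = list(sheets.values())
df1, df2, df3, df4, df5, df6 = dfs

# Drop the helper "Case Check" column wherever it exists, and repeat the basic checks.
for df in dfs:
    df.drop(columns=["Case Check"], errors="ignore", inplace=True)
    df.drop_duplicates(inplace=True)

df1.info()             # preliminary structure and dtype overview
print(df1.describe())  # descriptive statistics for the numeric columns

# Unique, alphabetically sorted country names with a serial number for later plots.
countries = pd.DataFrame({"Country": sorted(df1["Country"].dropna().unique())})
countries["Serial"] = range(1, len(countries) + 1)
```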

EDA was not done as much in Python for this project, as SQL had already sort-of fulfilled that role.

Some Amazing Visualizations Indeed

Python was also chosen as the main tool for data visualization in this endeavor, along with some Excel - unlike other projects, it was decided not to utilize Tableau or Power BI and instead to focus on Python's excellent data visualization capabilities.

This "trend-based" visualization manifested itself in its "ultimate form"; a linear graph displaying average mathematics, science, and reading scores as well as average male and female height and average urban population share, across time:

Mind you, this visualization demanded a lengthy coding process of astutely merging dataframes, carefully plotting each measure's line, and customizing the chart for optimal display. You are welcome to view the complete Python code here.
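A stripped-down sketch of how such a multi-measure trend chart could be drawn; the merged dataframe, column names, and values here are placeholders rather than the project's actual data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder merged data: yearly averages of each measure (values illustrative only).
merged = pd.DataFrame({
    "Year":          [2000, 2006, 2012, 2018],
    "Science":       [460, 465, 470, 468],
    "Mathematics":   [455, 460, 466, 463],
    "Reading":       [458, 462, 467, 465],
    "Male Height":   [171.0, 171.4, 171.8, 172.0],   # cm
    "Female Height": [158.9, 159.2, 159.5, 159.7],   # cm
    "Urban Share":   [46.0, 49.0, 52.0, 55.0],       # % of population
})

fig, ax = plt.subplots(figsize=(10, 6))
for column in merged.columns.drop("Year"):
    ax.plot(merged["Year"], merged[column], marker="o", label=column)

ax.set_xlabel("Year")
ax.set_ylabel("Average value (mixed units)")
ax.set_title("Average scores, heights, and urban population share over time")
ax.legend(loc="best")
plt.tight_layout()
plt.show()
```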

Similarly, in Excel, advanced analytics and visualization techniques were applied in the form of a Pivot Table and a corresponding pivot chart. The SocioeconomicChanges_1965to2023 sheet was chosen for its wide array of insightful quantitative measures; a pivot table was built from it showing average GDP per capita for each year from 1965 to 2023, aggregated across all nations. Additionally, the "Country" column was added as a filter - anyone wanting to view a single nation's average GDP per capita across time could simply select that nation in the filter.

Based on the aggregated measures, a pivot chart was made in line graph form:

The overall trend is healthy and has an expected shape; the world has, materially, improved in terms of living standards, technological advancement, and human rights. Furthermore, noticeable dips in GDP per capita appear in expected periods, such as 2020, the year COVID-19 hit.

Coming back to Python, box-and-whisker plots were made too. The standout dissected the share of each country's population living in urban areas in the latest year of measurement, 2023. And here it is:

As can be deduced, on average, 51.38% of any given country's population can be expected to be living in urban areas in the year 2023.

This is not a bad sign, considering that the figure factors in many poor countries. Residing in urban areas usually gives people more access to modern resources - education, healthcare, infrastructure, public services, and the like - which increases their capacity to contribute to the nation's economy. So, if, on average, about half of any nation's population in 2023 resided in urban areas, that can be read as a good sign of progress made over time.
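For reference, a minimal seaborn sketch of how such a box plot can be produced; the dataframe, column name, and values are placeholders:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder data: one urban-population-share value (in %) per country for 2023.
urban_2023 = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F", "G"],
    "Urban Population (% of Total Population)": [18.4, 35.0, 47.2, 51.5, 63.8, 79.1, 92.0],
})

plt.figure(figsize=(4, 6))
sns.boxplot(data=urban_2023, y="Urban Population (% of Total Population)")
plt.title("Share of population living in urban areas, 2023")
plt.tight_layout()
plt.show()
```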

Finally, and most intriguingly of all, an all-encompassing correlation matrix was designed for all quantitative measures. This would be a great overview, providing an overarching insight into how each measure correlated with all others as well as helping identify key correlations between metrics to capitalize on. To top it all off, it would look very aesthetically pleasing!

At last, there we have it:

Very incisive and appealing, and it served its purpose well as a modern visualization for wide-ranging insights. Some results were self-explanatory, such as a nation's total population being highly correlated (0.98) with its total food availability (in kilograms): a larger country has both a larger workforce to produce food and a larger population to feed through imports.

Some results, on the other hand, were a bit surprising, such as urban population share being only weakly to moderately related (0.41-0.44) to scores in mathematics, reading, and science. This was unexpected because, as previously mentioned, living in urban areas is expected to grant people more access to modern-day resources for education and overall quality of life.

But, that is why we do Data Science and Analytics; to bring to the surface these hidden patterns!
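For readers curious about the mechanics, here is a minimal sketch of how such a correlation heatmap can be produced with seaborn; the dataframe here is random placeholder data standing in for the project's merged quantitative measures:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder dataframe of quantitative measures (values are random, for illustration only).
rng = np.random.default_rng(42)
merged = pd.DataFrame(
    rng.normal(size=(50, 5)),
    columns=["Science", "Mathematics", "Reading", "GDP per Capita", "Urban Share (%)"],
)

corr = merged.corr(numeric_only=True)   # pairwise Pearson correlations

plt.figure(figsize=(7, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of quantitative measures")
plt.tight_layout()
plt.show()
```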

Ending it in Python

In the end, it was realized that a "final" dataframe had been built along the way which included all the important quantitative measures (e.g., average performance of 15-year-olds in science, mean male height, etc.) as well as some qualitative ones (e.g., country, year), and it was exported as a CSV file to my project folder. This file would be used to build a multiple regression model in the R statistical programming language.


R

The Multiple Regression model was built to predict the average performance of 15-year-olds in science (i.e., this was the dependent variable) based on these independent variables: average performance of 15-year-old students in reading, GDP per capita, share of urban population per total population, and total food available (kilocalories per day).
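In equation form, this is the standard multiple linear regression specification (variable names abbreviated here for readability):

```latex
\text{Science}_i = \beta_0 + \beta_1\,\text{Reading}_i + \beta_2\,\text{GDPpc}_i
                 + \beta_3\,\text{UrbanShare}_i + \beta_4\,\text{Food}_i + \varepsilon_i
```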

These independent variables were selected because they appeared intuitively intertwined with the dependent variable: science performance. If a person performs well in reading (and the correlation matrix puts the correlation between the two at a whopping 0.97), they can be expected, on average, to perform better in science than someone who performs worse in reading.

Similarly, the more of a country's people living in urban areas, the greater the access to superior facilities and thus, presumably, the higher the chances of better science scores. Lastly, if more food is available in a nation, it is more likely that a larger and healthier population can be maintained, which in turn should support higher science scores.

As we finally delve into the descriptive statistics of the regression model, it would be worthwhile to give readers an overview of the core aspects of the results to come:

Regression Coefficients:

Unstandardized Coefficients: Practical impact in original units.

  • Show how much the dependent variable changes for a 1-unit change in the independent variable, assuming all other independent variables are constant.
  • Units match the original variables (e.g., percentage, dollars).

Example: a coefficient of -0.0928 for urban population share means a 1 percentage point increase in urban population share is associated with a 0.0928-point decrease in science scores, holding the other predictors constant.

Standardized Coefficients: Relative importance of variables, unitless and directly comparable.

  • Show the relative strength of each independent variable’s influence on the dependent variable.
  • Scaled to have no units (mean = 0, SD = 1), allowing direct comparison of variables with different scales.
  • Larger values = stronger influence; smaller values = weaker influence.

Example: reading scores with a coefficient of 0.97 have a significantly stronger positive impact on science scores (1 standard deviation increase in reading scores leads to a 0.97 standard deviation increase in science scores, holding all else constant) than urban population share, which has a small negative impact of -0.03.


Both unstandardized and standardized coefficients help interpret regression models: one conveys effect size in real-world units, the other the relative importance of each variable.
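For reference, the two sets of coefficients are linked by a standard identity: a standardized coefficient is simply the unstandardized one rescaled by the ratio of the predictor's standard deviation to the outcome's standard deviation:

```latex
\beta_j^{\text{std}} = \beta_j^{\text{unstd}} \cdot \frac{s_{x_j}}{s_y}
```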

At last, the summary statistics of the Multiple Regression model that was built:

The average performance of 15-year-olds in reading appeared to have a significant impact on average performance in science: a 1-point increase in reading scores was associated with a 4.34-point increase in science scores, on average. Massive, but not surprising.

What was surprising was the negative association of food availability and urban population share with science scores in the model - metrics that would normally be assumed to relate positively to science performance. After evaluating the standardized regression coefficients in the following part, we will discuss this in more detail.

The Standardized Regression Coefficients:

In the analysis, the standardized coefficients gave a clear view of the relative importance of each predictor variable in the model. Here's a breakdown of how to interpret these results:

  • Reading Scale Performance (0.970): This has the largest standardized coefficient, indicating that improvements in the average reading performance of 15-year-old students are strongly associated with improvements in their science performance. Since this coefficient is near 1, it suggests that reading performance and science performance are highly correlated, with a nearly proportional relationship.
  • GDP per Capita (0.036): This coefficient is positive but much smaller, suggesting that while wealthier countries (as indicated by GDP per capita) do see better science performance, the effect is relatively small compared to performance on reading. A country's economic factors play a role but are not as significant.
  • Urban Population Share (-0.029): Interestingly, the negative sign indicates a slight inverse relationship between the proportion of urban population and science performance. This could imply that higher urbanization doesn't necessarily boost science performance. However, once analyzed in more detail, the magnitude of the surprise diminishes.

The (unstandardized) coefficient of this variable in the regression model was -0.0928; that means that for every 1 percentage point increase in urban population share, the predicted average performance in science decreases by 0.0928 points. Not a big decrease, but the inverse relationship is still surprising and worthy of further examination.

  • Total Food Availability (-0.028): The negative coefficient here shows that greater food availability is slightly associated with lower science performance. Again, it is surprising to see a negative sign on this metric, as greater food availability would be expected to mean better nutrition, and with it more energy for studying and maintaining a good routine overall.

Yet, just like its predecessor, the (unstandardized) coefficient of this variable is extremely small in magnitude (1.3 x 10^-11) - far smaller even than urban population share's. The impact of food availability on science scores can thus be considered negligible. It is still not the expected positive relationship, but that is exactly why we dig into data: to test our ideas from all angles.

This was a section full of pertinent findings. Regression is definitely a core statistical modeling concept and I am glad I got to utilize it in an efficient and exciting way to detect massive hidden trends.

This concluded my analysis in the statistical programming language of R.


Wrapping It All Up: Final Results & Implications

And with that, this project reaches its conclusion. This was a healthily challenging endeavor in which I got to employ a meticulous methodology that examined the situation from all possible angles. My training and experience - technical and mental, both - in Data Science were exemplified through this social science-oriented analysis, in efficiently organizing and analyzing the information while exploring various potential pathways; it demonstrates the rigor and dedication of a Data Analyst committed to continuous growth.

I had a great time in this endeavor - having the freedom to choose a topic of my liking, formulating potentially rich hypotheses to test with my acquired skills in data science and its array of popular tools, and coalescing all varying parts of the process together into a coherent whole also digestible for the public.

Truly, I realized that the role of a Data Scientist is much more than being able to interpret the data and analyze it; it also delves into meshing all its parts together wonderfully with acute business sense and creating a good story to convey to stakeholders and enthusiasts alike. Additionally, the Data Science professional must always be on the lookout for new skills to acquire and to perpetually refine his skill set by integrating new technologies (as Generative AI is for this era) into his arsenal.

It is my sincere hope that you enjoyed the process behind this work of mine as much as I did. And, as mentioned before, my work is free for any fellow enthusiast to take inspiration from.

Lastly, before ending, even though I mentioned that this project was focused on the process and not the result, I would still like to share some key takeaways which I found simultaneously intriguing and insightful:

  • Reading skills are a very strong predictor of science performance, making literacy programs a potential area of focus for nations intending to improve their overall educational quality.
  • GDP and economic factors have fairly lower-than-expected effects on educational output, and social or educational investments might play a more crucial role than sheer wealth.
  • The negative associations of science performance with urbanization and food availability are subtle and may require further exploration to understand contextual factors that could explain this relationship.
  • Average female and male height and average science, reading, and mathematics scores have grown fairly slowly since the 1950s, whereas the urbanization of nations has grown faster over the same period; average GDP per capita across the world has increased steadily - with a few expected occasional dips - from 1965 until the latest year of measurement (2023).


Thank you, dear reader, for your time. If you would like to connect with me or contact me for prospective discussions, feel free to reach out to me via LinkedIn.

