Relational Databases in Data Science
In the age of big data, data scientists should consider leveraging relational databases in their workflows. By doing so, analysts can combine the power of a database engine for data munging and computation with a streamlined workflow for summarization, visualization, modeling, and high-performance computing, all while keeping the process reproducible and efficient.
However, many data analysts today still work with flat files of all sizes: delimited text files (.csv, .tab, .txt); Excel workbooks (.xls, .xlsx, .xlsb, .xlsm); binary formats from other software (.sas7bdat, .dta, .sav); nested XML/JSON files; and other formats that can easily be changed, moved, deleted, or corrupted, interrupting workflows and undermining any version control in place. These formats also tend to store redundant, repetitive information, making inefficient use of disk storage.
Relational databases offer a sound alternative in most workflows, as they provide:
- Relational model of primary/foreign keys between related tables to avoid repetitive, redundant information and orphaned records;
- Powerful engine that adheres to query optimization with indexes and execution plans;
- Expressive declarative, universal SQL language for easy set-based operations (SELECT, JOIN, UNION, GROUP BY) compared to counterpart operations in programming languages (i.e., Java, C++, Python, R) or software (SAS, Stata, SPSS, Matlab);
- Reproducibility in data analysis and research, with stable data access and sourcing plus constraints that guard against data type and value mismatches;
- Secure, reliable ACID-based platform with user access controls and backup recoveries in place.
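The first two points above can be sketched in a few lines of Python with the built-in sqlite3 driver. The schema below (departments/employees) is a hypothetical example of my own, not from any project mentioned here; it shows primary/foreign keys, a value constraint, and a set-based JOIN + GROUP BY summary that would otherwise require manual loops in a flat-file workflow.

```python
import sqlite3

# In-memory SQLite database; a hypothetical two-table schema for illustration
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this pragma to enforce FKs

conn.executescript("""
    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL
    );
    CREATE TABLE employees (
        emp_id   INTEGER PRIMARY KEY,
        emp_name TEXT NOT NULL,
        salary   REAL CHECK (salary >= 0),               -- value constraint
        dept_id  INTEGER REFERENCES departments(dept_id) -- foreign key, no orphans
    );
""")

conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [(1, "Analytics"), (2, "Engineering")])
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                 [(1, "Ada", 95000, 1), (2, "Grace", 105000, 2),
                  (3, "Edgar", 88000, 1)])

# Set-based summary: JOIN + GROUP BY instead of hand-written aggregation loops
rows = conn.execute("""
    SELECT d.dept_name, COUNT(*) AS n, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
""").fetchall()

for dept, n, avg_salary in rows:
    print(dept, n, avg_salary)
conn.close()
```

Because department names live once in their own table and are referenced by key, the redundancy problem of flat files disappears, and the `CHECK` and `REFERENCES` constraints reject bad rows at load time rather than at analysis time.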
Today, practically all programming languages and analytics software maintain database APIs, including popular tools in data science:
- Python with its PEP 249 (DB-API 2.0) specification: generalized (pyodbc, JayDeBeApi) and database-specific drivers (cx_Oracle, pymssql, pymysql, ibm_db, psycopg2, and sqlite3);
- R with its DBI standard including: generalized (odbc, RJDBC) and specific APIs (RPostgreSQL, RMySQL, RSQLite, ROracle, and others);
- Julia databases including: general interfaces (ODBC.jl, JDBC.jl, DBI.jl), and specific (MySQL.jl, SQLite.jl, PostgreSQL.jl);
- Excel ODBC/OLEDB connections via ADO or DAO modules;
- SAS ODBC and JDBC drivers via the `libname` statement and `proc sql`;
- Stata ODBC drivers and DSN connections via `odbc` command;
- SPSS ODBC/OLEDB connection via `GET DATA /TYPE` command;
- Matlab ODBC/JDBC connection via `database` command.
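What makes these APIs convenient is that each standard (PEP 249 for Python, DBI for R, and so on) fixes one connect/cursor/execute/fetch pattern across back ends. The sketch below uses sqlite3 as a stand-in; under PEP 249, swapping in a call such as psycopg2's or pymysql's `connect()` leaves the rest of the code largely unchanged (the sample `rides` table is invented for illustration).

```python
import sqlite3

# PEP 249 pattern: connect -> cursor -> execute -> fetch -> commit/close.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE rides (day TEXT, boardings INTEGER)")
cur.executemany("INSERT INTO rides VALUES (?, ?)",
                [("Mon", 1200), ("Tue", 1350), ("Wed", 990)])
conn.commit()

# Parameterized query: the driver binds the value safely,
# avoiding string concatenation and SQL injection
cur.execute("SELECT day, boardings FROM rides WHERE boardings > ?", (1000,))
busy_days = cur.fetchall()
print(busy_days)

cur.close()
conn.close()
```

Note that parameter placeholder styles vary by driver (`?` for sqlite3 and pyodbc, `%s` for psycopg2 and pymysql); each module advertises its style in its `paramstyle` attribute.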
Consider integrating such database solutions into your data science workflow. Your IT department may provide instances for your team. Additionally, open-source relational databases (PostgreSQL, MySQL, MariaDB, SQLite) are free and can be set up on most platforms (Windows, Mac, Linux). Even some proprietary vendors, including Oracle, IBM DB2, Microsoft SQL Server (now with R and Python machine learning services), Sybase, and Teradata, offer free express editions with limitations. Finally, many provide convenient text file load procedures to get you started.
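As one way to get started, a CSV extract can be loaded into a database directly from Python with the standard library alone. The payload below is an invented in-memory sample standing in for a real file you would pass to `open()`; this is a minimal sketch, not a substitute for a vendor's bulk-load utility (e.g., PostgreSQL's COPY or SQLite's `.import`), which is usually faster for large files.

```python
import csv
import io
import sqlite3

# Hypothetical CSV payload; in practice this would come from open("your_file.csv")
csv_text = "route,riders\n22,7400\n36,5100\n8,6900\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ridership (route TEXT, riders INTEGER)")

# DictReader yields one dict per row; named placeholders map columns by header
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany("INSERT INTO ridership VALUES (:route, :riders)", reader)
conn.commit()

total = conn.execute("SELECT SUM(riders) FROM ridership").fetchone()[0]
print(total)
conn.close()
```

SQLite's type affinity coerces the text values from the CSV into the `INTEGER` column here; stricter engines would require explicit casts or typed staging tables.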
For illustration, see a few of my working Python and R data analysis projects that connect to relational databases:
- Chicago Bus/Rail CTA Ridership - Python/PostgreSQL and R/PostgreSQL, Python/SQLite and R/SQLite
- Chicago Divvy Ridership - R/DB2
- Analysis Graphs (with bar charts, network trees, word cloud, and DiGraph outputs) - Python/SQL Server, Python/Oracle, R/Oracle, R/PostgreSQL