Relational Databases in Data Science

Relational Databases in Data Science

In the age of big data, data scientists should leverage relational databases in their workflow. Doing so, analysts can integrate the power of a database engine for munging and calculating and streamlined workflow for data summarization, visualization, modeling, and high end computing, all while rendering a reproducible, efficient process.

However, many data analysts today continue to work with small to large flat files including delimited text (.csv, .tab, .txt) files; Excel (.xls, .xlsx, .xlsb, .xlsm) files; other software binary types (.sas7bdat, .dta, .sav); nested XML/JSON files; and other formats that can easily be changed, moved, deleted, and corrupted to interrupt workflows and version controls set in place. Additionally these formats can maintain redundant, repetitive indicator information for inefficient disk storage use.

As a solution, relational databases provide a sound solution in most workflows as they provide:

  • Relational model of primary/foreign keys between related tables to avoid repetitive, redundant information and orphaned records;
  • Powerful engine that adheres to query optimization with indexes and execution plans;
  • Expressive declarative, universal SQL language for easy set-based operations (SELECT, JOIN, UNION, GROUP BY) compared to counterpart operations in programming languages (i.e., Java, C++, Python, R) or software (SAS, Stata, SPSS, Matlab);
  • Ensure reproducibility in data analytics and research process with stable data access and sourcing and constraints for data typology and value mismatches;
  • Secure, reliable ACID-based platform with user access controls and backup recoveries in place.

Today, practically all programming languages and software maintain database APIs, including popular tools in data science:.

  • Python with its PEP 249 specification: generalized (pyodbc, JayDeBeApi) and specific (cx_Oracle, pymssql, pymysql, ibmdb, psycopg2, and sqlite3);
  • R with its DBI standard including: generalized (odbc, RJDBC) and specific APIs (RPostgreSQL, RMySQL, RSQLite, ROracle, and others);
  • Julia databases including: general interfaces (ODBC.jl, JDBC.jl, DBI.jl), and specific (MySQL.jl, SQLite.jl, PostgreSQL.jl);
  • Excel ODBC/OLEDB connections via ADO or DAO modules;
  • SAS drivers for JDBC and ODBC driver via `libname` and `proc sql` module;
  • Stata ODBC drivers and DSN connections via `odbc` command;
  • SPSS ODBC/OLEDB connection via `GET DATA /TYPE` command;
  • Matlab ODBC/JDBC connection via `database` command.

Consider integrating such database solutions in your data science workflow. Your IT departments may provide instances for your team. Additionally, all open-source relational databases (PostgreSQL, MySQL, MariaDB, SQLite) are free and can be set up on most platforms (Windows, Mac, Linux). Even some proprietary vendors provide free express versions with limitations including Oracle, IBM DB2, Microsoft SQL Server (now with R+Python machine learning services), Sybase, Teradata and others. Finally, many provide convenient text file load procedures to get you started.

For illustration, see a few of my working Python and R data analysis projects connecting to relational databases:




Adam Ginensky, Ph.D.

Associate Director, Risk for AG OTC Structured Products at RJO

6 年

Very informative.?

Bartek Malysz ??

SRE Technical Lead US

6 年

I support you view 100%. Sad thing is people tend to have very little understanding of the nature of the data amd how to handle it correctly. I've seen dates stored as string, int as string, even whole essays in an excel cell, where essentialy you need to provide a brief comment. Also, incredible redundancy due to multiple versions of spreadhseets across multiple folders, sharepoints, etc. Just recently discovered Flask + sql alchemy and it works like magic. Also Flask migrate allows dynamically adjust DB should your project require changes and it's very intuitive. Tools exist, but change need to happen in mindset. Work smarter not harder.

要查看或添加评论,请登录

Parfait Gasana的更多文章

  • Importance of Long Format Data

    Importance of Long Format Data

    Often in the world of data science, there is the need to reshape data. Nearly every data analytics tool and language…

    4 条评论
  • Vectorized Functions in R and Python

    Vectorized Functions in R and Python

    Data analytics tools such as the popular open source languages, R and Python, often have nuanced functions and…

    1 条评论

社区洞察

其他会员也浏览了