Relational Databases in Data Science
In the age of big data, data scientists should consider leveraging relational databases in their workflows. By doing so, analysts can combine the power of a database engine for data munging and computation with a streamlined workflow for summarization, visualization, modeling, and high-performance computing, all while keeping the process reproducible and efficient.
However, many data analysts today still work with flat files of all sizes: delimited text files (.csv, .tab, .txt); Excel workbooks (.xls, .xlsx, .xlsb, .xlsm); binary formats from other software (.sas7bdat, .dta, .sav); nested XML/JSON files; and other formats that can easily be changed, moved, deleted, or corrupted, interrupting workflows and undermining any version control in place. These formats also tend to store redundant, repetitive information, making inefficient use of disk storage.
Relational databases offer a sound alternative in most workflows, as they provide:
- Relational model of primary/foreign keys between related tables to avoid repetitive, redundant information and orphaned records;
- Powerful engine that adheres to query optimization with indexes and execution plans;
- Expressive declarative, universal SQL language for easy set-based operations (SELECT, JOIN, UNION, GROUP BY) compared to counterpart operations in programming languages (i.e., Java, C++, Python, R) or software (SAS, Stata, SPSS, Matlab);
- Reproducibility in data analysis and research, with stable data access and sourcing plus constraints that guard against data type and value mismatches;
- Secure, reliable ACID-based platform with user access controls and backup recoveries in place.
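The first two points above can be sketched in a few lines of Python with the built-in sqlite3 driver. The schema below (departments/employees) is a hypothetical example of my own, not from any project mentioned here; it shows primary/foreign keys, a value constraint, and a set-based JOIN + GROUP BY summary that would otherwise require manual loops in a flat-file workflow.

```python
import sqlite3

# In-memory SQLite database; a hypothetical two-table schema for illustration
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this pragma to enforce FKs

conn.executescript("""
    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL
    );
    CREATE TABLE employees (
        emp_id   INTEGER PRIMARY KEY,
        emp_name TEXT NOT NULL,
        salary   REAL CHECK (salary >= 0),               -- value constraint
        dept_id  INTEGER REFERENCES departments(dept_id) -- foreign key, no orphans
    );
""")

conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [(1, "Analytics"), (2, "Engineering")])
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                 [(1, "Ada", 95000, 1), (2, "Grace", 105000, 2),
                  (3, "Edgar", 88000, 1)])

# Set-based summary: JOIN + GROUP BY instead of hand-written aggregation loops
rows = conn.execute("""
    SELECT d.dept_name, COUNT(*) AS n, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
""").fetchall()

for dept, n, avg_salary in rows:
    print(dept, n, avg_salary)
conn.close()
```

Because department names live once in their own table and are referenced by key, the redundancy problem of flat files disappears, and the `CHECK` and `REFERENCES` constraints reject bad rows at load time rather than at analysis time.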
Today, practically all programming languages and analytics software maintain database APIs, including popular tools in data science:
- Python with its PEP 249 (DB-API 2.0) specification: generalized (pyodbc, JayDeBeApi) and database-specific drivers (cx_Oracle, pymssql, pymysql, ibm_db, psycopg2, and sqlite3);
- R with its DBI standard including: generalized (odbc, RJDBC) and specific APIs (RPostgreSQL, RMySQL, RSQLite, ROracle, and others);
- Julia databases including: general interfaces (ODBC.jl, JDBC.jl, DBI.jl), and specific (MySQL.jl, SQLite.jl, PostgreSQL.jl);
- Excel ODBC/OLEDB connections via ADO or DAO modules;
- SAS ODBC and JDBC drivers via the `libname` statement and `proc sql`;
- Stata ODBC drivers and DSN connections via `odbc` command;
- SPSS ODBC/OLEDB connection via `GET DATA /TYPE` command;
- Matlab ODBC/JDBC connection via `database` command.
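What makes these APIs convenient is that each standard (PEP 249 for Python, DBI for R, and so on) fixes one connect/cursor/execute/fetch pattern across back ends. The sketch below uses sqlite3 as a stand-in; under PEP 249, swapping in a call such as psycopg2's or pymysql's `connect()` leaves the rest of the code largely unchanged (the sample `rides` table is invented for illustration).

```python
import sqlite3

# PEP 249 pattern: connect -> cursor -> execute -> fetch -> commit/close.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE rides (day TEXT, boardings INTEGER)")
cur.executemany("INSERT INTO rides VALUES (?, ?)",
                [("Mon", 1200), ("Tue", 1350), ("Wed", 990)])
conn.commit()

# Parameterized query: the driver binds the value safely,
# avoiding string concatenation and SQL injection
cur.execute("SELECT day, boardings FROM rides WHERE boardings > ?", (1000,))
busy_days = cur.fetchall()
print(busy_days)

cur.close()
conn.close()
```

Note that parameter placeholder styles vary by driver (`?` for sqlite3 and pyodbc, `%s` for psycopg2 and pymysql); each module advertises its style in its `paramstyle` attribute.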
Consider integrating such database solutions into your data science workflow. Your IT department may provide instances for your team. Additionally, open-source relational databases (PostgreSQL, MySQL, MariaDB, SQLite) are free and can be set up on most platforms (Windows, Mac, Linux). Even some proprietary vendors, including Oracle, IBM DB2, Microsoft SQL Server (now with R and Python machine learning services), Sybase, and Teradata, offer free express editions with limitations. Finally, many provide convenient text file load procedures to get you started.
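As one way to get started, a CSV extract can be loaded into a database directly from Python with the standard library alone. The payload below is an invented in-memory sample standing in for a real file you would pass to `open()`; this is a minimal sketch, not a substitute for a vendor's bulk-load utility (e.g., PostgreSQL's COPY or SQLite's `.import`), which is usually faster for large files.

```python
import csv
import io
import sqlite3

# Hypothetical CSV payload; in practice this would come from open("your_file.csv")
csv_text = "route,riders\n22,7400\n36,5100\n8,6900\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ridership (route TEXT, riders INTEGER)")

# DictReader yields one dict per row; named placeholders map columns by header
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany("INSERT INTO ridership VALUES (:route, :riders)", reader)
conn.commit()

total = conn.execute("SELECT SUM(riders) FROM ridership").fetchone()[0]
print(total)
conn.close()
```

SQLite's type affinity coerces the text values from the CSV into the `INTEGER` column here; stricter engines would require explicit casts or typed staging tables.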
For illustration, see a few of my working Python and R data analysis projects that connect to relational databases:
- Chicago Bus/Rail CTA Ridership - Python/PostgreSQL and R/PostgreSQL, Python/SQLite and R/SQLite
- Chicago Divvy Ridership - R/DB2
- Analysis Graphs (with bar charts, network trees, word cloud, and DiGraph outputs) - Python/SQL Server, Python/Oracle, R/Oracle, R/PostgreSQL