Data Science Storage Tools

Babak Rezaei Bastani

Senior Web Developer

Published: June 6, 2019

The data science ecosystem offers a set of tools that we use to build our solutions. The capabilities of this environment are developing rapidly, and new developments appear every day. Data processing tools support two basic methods of storing data: schema-on-write and schema-on-read. The advantages and disadvantages of each are described below.

Schema-on-Write Ecosystems

In a traditional relational database management system (RDBMS), you need a schema before you can load the data. To retrieve data from structured data schemas, we use standard SQL. Advantages of this method include:

In the traditional data ecosystem, the tools accept the schema and work exactly as it defines, so there is only one view of the data.
It is an extremely valuable approach for expressing relationships between data points, because the relationships are configured in advance.
It is an efficient way to store "dense" data.
All the data is in the same data warehouse.

On the other hand, schema-on-write is not the answer to every data science problem. The drawbacks of this approach include:

Its designs are typically purpose-built, which makes them hard to change and maintain.
The raw / atomic data is generally lost as a source for future analysis.
Significant modeling and implementation work is needed before we can work with the data.
If a specific type of data cannot be stored in the schema, it cannot be processed effectively from the schema.

Currently, schema-on-write is a common method for storing data.
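
As a minimal sketch of the schema-on-write idea, the following Python example uses the standard-library sqlite3 module as a stand-in for an RDBMS. The customer table, its columns, and the helper names build_store and load are hypothetical and chosen only for illustration. The schema has to exist before any row is loaded, and a row that does not fit the schema is rejected at write time.

```python
import sqlite3


def build_store() -> sqlite3.Connection:
    """Create an in-memory database whose schema is fixed before loading."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT    NOT NULL,
            age  INTEGER NOT NULL CHECK (age >= 0)
        )
        """
    )
    return conn


def load(conn: sqlite3.Connection, row: tuple) -> bool:
    """Try to insert one row; return False if it violates the schema."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (id, name, age) VALUES (?, ?, ?)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_store()
accepted = load(conn, (1, "Ada", 36))   # fits the schema
rejected = load(conn, (2, None, 41))    # missing name violates NOT NULL

# Retrieval uses standard SQL against the single, predefined view of the data.
rows = conn.execute("SELECT name, age FROM customer ORDER BY id").fetchall()
print(accepted, rejected, rows)
```

The second row never reaches the table, which illustrates both the single consistent view of the data and the drawback that data not fitting the schema cannot be stored or processed.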

Schema-on-Read Ecosystems

This method does not require a schema before the data can be loaded. Basically, you store the data with minimal structure. The essential schema is applied during the query phase.

Advantages include:

It provides the flexibility to store unstructured, semi-structured, and disorganized data.
It provides unlimited flexibility when querying data from the structure.
The leaf-level (raw) data remains unchanged for future reference.
The methodology supports experimentation and exploration.
It increases the speed at which new actionable knowledge is produced.
It reduces the cycle time between data generation and the availability of actionable knowledge.
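
The schema-on-read idea can be sketched in a few lines of Python, assuming the raw data is kept as JSON lines. The sample records, the field names, and the helper read_with_schema are hypothetical. The records are stored as they arrive, with differing fields and types, and a schema (here a mapping from field name to converter) is applied only when the data is queried.

```python
import json

# Raw records are stored untouched; fields and types vary from line to line.
RAW_LINES = [
    '{"id": 1, "name": "Ada", "age": "36"}',
    '{"id": 2, "name": "Grace", "tags": ["navy"]}',
    '{"id": 3, "age": 52, "city": "Oslo"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto `schema` (field name -> converter) at read time.

    Fields missing from a record come back as None.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two different "views" of the same raw data, defined only at query time.
ages = list(read_with_schema(RAW_LINES, {"id": int, "age": int}))
names = list(read_with_schema(RAW_LINES, {"id": int, "name": str}))
print(ages)
print(names)
```

Because the raw lines are never modified, a new question only needs a new schema, not a reload of the data.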

In general, a combination of schema-on-read and schema-on-write ecosystems is recommended for data science and engineering.
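
One possible way to combine the two approaches, sketched here under stated assumptions, is to keep the raw records unchanged (schema-on-read) and derive a curated, schema-bound table from them (schema-on-write). The event data, the payment table, and the curate helper are hypothetical, and sqlite3 again stands in for an RDBMS.

```python
import json
import sqlite3

# Raw landing zone: records are kept exactly as received.
RAW_EVENTS = [
    '{"customer": "ada", "amount": "12.50", "note": "first order"}',
    '{"customer": "grace", "amount": 7}',
    '{"customer": "linus"}',
]


def curate(raw_lines):
    """Load the records that fit a fixed schema into a curated table.

    Returns the database connection and the number of records skipped.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payment (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    skipped = 0
    for line in raw_lines:
        record = json.loads(line)
        try:
            row = (record["customer"], float(record["amount"]))
        except (KeyError, ValueError, TypeError):
            skipped += 1  # record does not fit the curated schema
            continue
        conn.execute("INSERT INTO payment (customer, amount) VALUES (?, ?)", row)
    conn.commit()
    return conn, skipped


conn, skipped = curate(RAW_EVENTS)
total = conn.execute("SELECT SUM(amount) FROM payment").fetchone()[0]
print(total, skipped)
```

The skipped record is not lost: it is still present in the raw data and can be picked up later if the curated schema changes.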
