Data Science Storage Tools
The data science ecosystem has a set of tools that we use to build our solutions. The capabilities of this environment are developing rapidly, and new developments take place every day. Data processing tools support two basic approaches to storing data: schema-on-write and schema-on-read. The advantages and disadvantages of each are described below.
Schema-on-Write Ecosystems
In a traditional relational database management system (RDBMS), you need a schema before you can load the data. To retrieve data from structured data schemas, we use standard SQL. Advantages of this method include:
- In the traditional data ecosystem, the tools assume the schema and work exactly as the schema defines, so there is only one view of the data.
- It is an extremely valuable approach for expressing relationships between data points, because the relationships are configured in advance.
- This is an efficient way to store "dense" data.
- All data is in the same data warehouse.
On the other hand, schema-on-write is not the answer to every data science problem. The drawbacks of this approach include:
- Its schemas are usually designed for a specific purpose, which makes them hard to change and maintain.
- Generally, the raw/atomic data is lost as a source for future analysis.
- Significant modeling and implementation effort is required before we can work with the data.
- If we cannot store a specific type of data in the schema, we cannot effectively process it from the schema.
Currently, schema-on-write is a common method for storing data.
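The behavior described above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and chosen only for illustration; a production RDBMS would differ in detail, but the pattern is the same: define the schema first, load data that fits it, and query with SQL.

```python
import sqlite3

# Hypothetical table, columns, and rows, used only for illustration.
conn = sqlite3.connect(":memory:")

# Schema-on-write: the structure must exist before any data is loaded.
conn.execute(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        country     TEXT NOT NULL
    )
    """
)

# Rows that match the schema load without issue.
conn.executemany(
    "INSERT INTO customer (customer_id, name, country) VALUES (?, ?, ?)",
    [(1, "Ada", "UK"), (2, "Grace", "US")],
)

# Retrieval uses standard SQL against the single, predefined view of the data.
rows = conn.execute(
    "SELECT name FROM customer WHERE country = ? ORDER BY name", ("UK",)
).fetchall()


def try_insert_with_extra_field(connection):
    """Attempt to load a record that carries a field the schema does not define."""
    try:
        connection.execute(
            "INSERT INTO customer (customer_id, name, country, twitter) "
            "VALUES (?, ?, ?, ?)",
            (3, "Linus", "FI", "@linus"),
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the statement because the column does not exist.
        return False


rejected = not try_insert_with_extra_field(conn)
```

Here the last insert fails because the extra field has no place in the schema, which is the drawback noted above: data that does not fit the schema cannot be stored until the schema itself is changed.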
Schema-on-Read Ecosystems
This method does not require a schema before data can be loaded. Basically, you store the data with minimal structure. The necessary schema is applied later, during the query phase.
Advantages include:
- Provides flexibility to store unstructured, semi-structured, and unorganized data.
- Provides unlimited flexibility when querying data from the structure.
- The leaf-level data remains unchanged for future reference.
- This methodology supports testing and exploration.
- Increases the speed at which new knowledge is produced.
- Reduces the cycle time between data production and the availability of practical knowledge.
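As a minimal sketch of the schema-on-read idea, the example below keeps raw records as JSON text and applies a schema only when a query runs. The sample events, the field names, and the helper read_with_schema are hypothetical, introduced purely for illustration; real schema-on-read platforms apply the same principle at much larger scale.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
raw_lines = [
    '{"user": "ada", "action": "login"}',
    '{"user": "grace", "action": "purchase", "amount": 19.5}',
    '{"user": "ada", "action": "purchase", "amount": 5.0, "coupon": "NEW"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at query time: project each raw record onto `fields`,
    using None for any field a record does not contain."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Query 1: total purchase amount per user.
totals = {}
for row in read_with_schema(raw_lines, ["user", "action", "amount"]):
    if row["action"] == "purchase":
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]

# Query 2: a different schema over the same untouched raw data.
coupons = [
    row["coupon"]
    for row in read_with_schema(raw_lines, ["coupon"])
    if row["coupon"] is not None
]
```

Both queries read the same stored lines but impose different structures on them, and the raw data is never modified, so it stays available for questions that have not been asked yet.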
In general, a combination of schema-on-read and schema-on-write ecosystems is recommended for data science and engineering.