Data Storage vs Database

Nowadays, massive amounts of data are generated. It is natural to ask where this data goes and how it is stored. The answer is a storage system.

Data storage is the concern of where and how you store information in a digital system. Let's talk about how you store data. Normally you store data through two different mechanisms: datafiles and databases. So, basically, a database is a way of organizing how your data is stored in files.

In the Azure cloud, storage and databases are two different things. Azure databases include Azure SQL Database, Azure Database for MariaDB, Azure Database for MySQL, and Azure Cosmos DB. However, remember that databases ultimately store their database files in a storage system. Normally a storage account in Azure is used to store raw, unstructured files. It could be any file, including a database file or a virtual machine image file. Cosmos DB internally uses Azure page storage to store its database files. Azure SQL Database also uses Azure page storage. In Azure, a storage account is how you physically store data, and it also provides tables and queues.

Before you go further, let's understand the file system.

File System

On a computer you use the file system to name, store, locate, and read your data. It is the digital equivalent of an organized file cabinet. You put information in a text file, put the file in a folder, and then put that folder in a larger folder on your computer.

You can put anything you want in a particular location. You can store any data, including unstructured data such as documents, videos, spreadsheets, pictures, and music. The file system does not care what you store. It reads and writes data on the physical hard disk of your computer. Any application that you create or use, such as a media player, Visual Studio, a calculator, or Notepad, is also stored through your machine's file system, which writes it to your hard disk.
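
The naming, storing, locating, and reading described above can be sketched in a few lines of Python. This is only an illustration: the folder names, the file name, and the helper `store_and_find` are made up for the example, and a temporary directory stands in for your disk.

```python
import tempfile
from pathlib import Path


def store_and_find(root: Path) -> str:
    """Create nested folders, store a text file, then locate it and read it back."""
    # Name and store: a file inside a folder inside a larger folder.
    folder = root / "documents" / "notes"
    folder.mkdir(parents=True)
    (folder / "todo.txt").write_text("buy milk", encoding="utf-8")

    # Locate: search the whole tree for .txt files, however deeply nested.
    found = sorted(root.rglob("*.txt"))

    # Read: the file system hands back the bytes it stored, nothing more.
    return found[0].read_text(encoding="utf-8")


# A temporary directory is used so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmp:
    content = store_and_find(Path(tmp))
```

Note that the file system only hands back what was stored; it knows nothing about what the text means.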

How to install a File System?

When you install an OS on a computer, a file system is set up on your machine as part of the installation. On Windows the file system is NTFS (New Technology File System); on Linux it is usually EXT (Extended File System). You see these file system names when you format a pen drive or hard disk.

Common file systems are:

  • FAT (File Allocation Table, used by legacy Windows OS)
  • NTFS (New Technology File System)
  • HFS (Hierarchical File System)
  • EXT (Extended File System)

EXT

The Extended File System (EXT) was developed for Linux in 1992. In its current version, ext4, the maximum file size is 16 TiB with the default 4 KiB block size.

NTFS

The New Technology File System (NTFS) was developed by Microsoft for Windows in 1993. NTFS is a journaling file system: operations such as creating, renaming, and deleting files and folders are logged as transactions on the file system metadata, so an interrupted operation can be rolled back or completed after a crash without affecting other files. By design, one file can be up to 16 EiB (1 EiB = 2^60 bytes), although Windows implementations impose lower limits. File and directory names can be up to 255 characters long, including any extensions.

HFS

The Hierarchical File System (HFS) was developed by Apple for Mac OS in 1985; its successor, HFS+, followed in 1998. On the original HFS, one file can be up to 2 GB only.

Linux is most compatible with EXT, Windows with NTFS and FAT, and macOS with HFS+ and APFS (Apple File System).

HDFS

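One important example of a distributed file system is the Hadoop Distributed File System (HDFS). It stores big data by splitting files into blocks and spreading them, with replicas, across many machines so that they can be processed in parallel.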

What are the types of Datafiles?

  • Delimited Text Files: the file contains data that represents a two-dimensional table with columns and rows. The data itself is stored as text, with columns separated by a special character (tab, comma, or pipe) and rows separated by line breaks. A common extension is ‘.csv’; plain text is also stored as ‘.txt’. Text files are understood by a wide variety of systems.
  • Extensible Markup Language (XML) Files: XML is a flexible structure for encoding documents and data. Web pages, applications, and messaging systems use XML. It requires a more sophisticated interface (a parser) to decode it for reading or analysis.
  • Log Files: these capture event data from machines, messaging systems, and web applications. Log files require a parser to read them, and some specialized log formats need special tools.
  • Application-Specific Files: most tools have their own proprietary file format for storing data along with additional data about it, called metadata. Examples are Microsoft Excel files, used for data analytics and calculations, and the file formats of SAS tools.
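
To make the difference between these datafile types concrete, here is a small Python sketch that reads a delimited text file, an XML document, and a log line using only the standard library. The sample data, the log layout, and the function names are assumptions made up for the example; real log formats vary widely and usually need their own pattern.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Made-up sample data for each datafile type.
CSV_TEXT = "id,name\n1,Ada\n2,Linus\n"
XML_TEXT = "<users><user id='1'>Ada</user><user id='2'>Linus</user></users>"
LOG_LINE = "2024-01-15 10:32:07 ERROR disk quota exceeded"

# Assumed log layout: date, time, level, then a free-text message.
LOG_PATTERN = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def read_csv(text):
    """Delimited text: the header row names the columns, each line is a row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_xml(text):
    """XML: a parser is needed to walk the element tree."""
    root = ET.fromstring(text)
    return [{"id": user.get("id"), "name": user.text} for user in root.iter("user")]


def read_log(line):
    """Log line: a format-specific pattern turns the text into fields."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

All three readers end up with the same kind of Python dictionaries, but each needed a different parser to get there, which is the extra effort the list above refers to.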

What is a Database?

A database is an organized collection of data. When you say database, you mean both the structure and design of the data environment and the data itself. It stores data in a richer way than a regular datafile can. Databases usually store a number of different data entities together with unifying information about how those entities are arranged or related. This gives access to a wider array of information in one common environment, instead of storing that information in multiple datafiles that may or may not be tied together.

Databases are usually built with a database management system (DBMS). A DBMS is a software application used for creating, maintaining, and accessing databases.

The most commonly used databases are relational databases. They store information in two-dimensional tables and define specific relationships between those tables. E.F. Codd at IBM proposed the relational model of data in 1969 and published it in 1970.

Relational Databases are:

  • Microsoft SQL Server (Azure SQL Database in the cloud)
  • MySQL
  • PostgreSQL
  • MariaDB
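
As a minimal sketch of tables and a relationship between them, the following uses SQLite through Python's built-in `sqlite3` module. SQLite is not one of the databases listed above; it is used here only because it needs no server. The `customers` and `orders` tables and their contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        -- The relationship: every order points at one customer row.
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# A join follows the relationship to combine both tables in one query.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The same `CREATE TABLE`, `JOIN`, and `GROUP BY` ideas carry over to the server-based databases in the list, although each has its own SQL dialect.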

NoSQL Databases: 4 common alternative non-relational databases are:

  1. Graph Databases
  2. Document Stores
  3. Columnar Databases
  4. Key-value Stores

Note: non-tabular or non-relational databases are also called NoSQL (aka “not only SQL”) databases.

Graph Databases

Graph databases are based on graph theory. They work well with highly interconnected data, such as relationships between people or locations. They are also a common fit for social media applications and for recommendation systems of the kind used by services like Netflix.

Graph Database Examples:

  • Amazon Neptune
  • Neo4j
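
To illustrate the kind of question a graph database is built to answer, here is a small Python sketch that finds everyone reachable within a given number of "follows" hops. It uses a plain dictionary and breadth-first search rather than a real graph database, and the people and the function name `within_hops` are made up for the example.

```python
from collections import deque

# Made-up "who follows whom" data: each key points at the people it follows.
FOLLOWS = {
    "ana": ["ben", "cy"],
    "ben": ["dee"],
    "cy": ["dee", "eli"],
    "dee": [],
    "eli": ["ana"],
}


def within_hops(graph, start, max_hops):
    """Return everyone reachable from start in at most max_hops edges (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    seen.discard(start)
    return seen
```

For example, `within_hops(FOLLOWS, "ana", 2)` returns the people ana follows plus the people they follow. In a relational database the same query needs one self-join per hop, which is why heavily connected data is often moved to a graph model.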

Document Stores

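Document stores are designed to store and read documents (typically JSON-like) along with key pieces of metadata describing the data. They are useful for storing semi-structured data or mixed data types in a way that is a little more useful than a typical file system. Azure Blob Storage plays a similar role for raw files, storing each blob with its metadata, although it is object storage rather than a document database.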

Document Store Examples:

  • Azure Cosmos DB
  • Amazon DynamoDB
  • Google Cloud Firestore
  • MongoDB
  • Couchbase Server
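
The idea of a document store can be sketched in Python as a dictionary of JSON documents that can be searched by a nested field. This is a toy, in-memory illustration and not the API of any product listed above; the class `DocumentStore`, its methods, and the sample documents are all made up for the example.

```python
import json

_MISSING = object()  # marks "this document has no such field"


class DocumentStore:
    """A toy in-memory document store holding JSON-compatible documents."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        # Round-trip through JSON: rejects non-serializable data and stores a copy.
        self._docs[doc_id] = json.loads(json.dumps(document))

    def find(self, field_path, value):
        """Return ids of documents whose dotted field path equals value."""
        hits = []
        for doc_id, doc in self._docs.items():
            current = doc
            for key in field_path.split("."):
                if isinstance(current, dict) and key in current:
                    current = current[key]
                else:
                    current = _MISSING
                    break
            if current is not _MISSING and current == value:
                hits.append(doc_id)
        return hits


store = DocumentStore()
# Documents need not share a schema: "u3" has no address at all.
store.insert("u1", {"name": "Ada", "address": {"city": "London"}, "tags": ["admin"]})
store.insert("u2", {"name": "Linus", "address": {"city": "Helsinki"}})
store.insert("u3", {"name": "Grace"})
```

Calling `store.find("address.city", "London")` looks inside the nested `address` object of each document, which is something a plain file system cannot do for you.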

Columnar Databases

Columnar databases are storage mechanisms that seek to improve data-access performance by organizing data by column rather than by row, as relational databases traditionally do. Transactional records are normally stored row by row, so all fields of one record sit together.

However, if you are mostly interested in reading or writing a few specific columns across many records, columnar databases are a good fit, because the values of each column are stored together.

Columnar databases are useful for analyzing big data.

Columnar Databases Examples:

  • Amazon Redshift: belongs to Amazon AWS
  • Apache Cassandra: open-source wide-column store
  • MariaDB ColumnStore: open-source
  • SAP HANA: belongs to SAP
  • MonetDB: open-source, focused on analytics and data mining
  • HDInsight: from Azure (managed Hadoop ecosystem, including HBase)
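
The row-versus-column difference can be sketched with plain Python lists. This is only an illustration of the layout idea, not how any of the products above are implemented; the sample orders and function names are made up for the example.

```python
# Row-oriented layout: each record keeps all of its fields together.
ROWS = [
    {"order_id": 1, "customer": "Ada", "amount": 25.0},
    {"order_id": 2, "customer": "Ada", "amount": 15.0},
    {"order_id": 3, "customer": "Linus", "amount": 40.0},
]


def to_columns(rows):
    """Pivot row records into one list per column (column-oriented layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)


def total_from_rows(rows):
    # Every whole record is visited even though only one field is needed.
    return sum(row["amount"] for row in rows)


def total_from_columns(columns):
    # Only the "amount" column is touched; other columns are never read.
    return sum(columns["amount"])
```

Both functions return the same total, but the columnar version reads one contiguous list. On disk, with millions of rows and dozens of columns, that difference is what makes analytical queries faster.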

Key-value Stores

Key-value stores hold information as key and value pairs. They have low memory overhead and are very fast for lookups by key. However, they need more sophisticated programming to manipulate and extract data, because the store does not interpret the values.

Examples of Key-value Stores are:

  • Apache Cassandra
  • Couchbase Server
  • Redis
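
Here is a toy Python sketch of a key-value store, written to show both sides of the trade-off described above: lookups by key are trivial, while any question about the contents of the values has to be programmed by hand. The class `KeyValueStore`, the `user:` key convention, and the sample profiles are assumptions made up for the example and do not mirror the API of Redis or any other product.

```python
import json


class KeyValueStore:
    """A toy in-memory key-value store; it never looks inside a value."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def keys(self):
        return list(self._data)


def find_users_in_city(store, city):
    """Querying by value means scanning keys and decoding every value ourselves."""
    matches = []
    for key in store.keys():
        if not key.startswith("user:"):
            continue
        profile = json.loads(store.get(key))
        if profile.get("city") == city:
            matches.append(key)
    return sorted(matches)


store = KeyValueStore()
store.put("user:1", json.dumps({"name": "Ada", "city": "London"}).encode("utf-8"))
store.put("user:2", json.dumps({"name": "Linus", "city": "Helsinki"}).encode("utf-8"))
store.put("user:3", json.dumps({"name": "Alan", "city": "London"}).encode("utf-8"))
```

`store.get("user:1")` is a single dictionary lookup, whereas `find_users_in_city` has to walk every key. That scan is the "more sophisticated programming" the store pushes onto the application.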

Why are File Storage Systems not good for Data Analysis?

In a regular file system it is unclear:

  • how you access the data.
  • how you perform data analysis on documents and images without some intermediate process.

Why do you want to analyze your data, and what exactly does analysis mean? Suppose you are building a YouTube-like system that lets users upload video files, and you want to prevent inappropriate videos from being uploaded. How would you perform this task? You would use file analysis technology; in Azure, for example, you can use Azure Media Services to analyze your video files. Similarly, if you are designing a Twitter-like system and want to stop inappropriate tweets, you can use Azure Stream Analytics to analyze the tweets live and decide which are appropriate.

It turns out that each source system usually has its own storage system to hold the data it produces. Source storage systems are normally optimized for functional performance, that is, for transactions (create, update, delete). However, these data stores are not good for data extraction and analysis.

Storage systems optimized for business transactions are called online transaction processing (OLTP) storage. Storage systems optimized for data analysis are called online analytical processing (OLAP) storage. Analytical storage systems are best for analyzing your data, for example to find out whether inappropriate videos or images have been uploaded to your system. Transactional storage systems are a poor fit for this, for three reasons. First, they normally contain lots of environmental details and metadata that are not useful for analytics. Second, they are already serving real-time business transactions, so running analytics on them would slow down your business operations. Finally, they deal with a high volume of data and may not keep it for long, because the source system itself does not need the history.

Therefore, if you want to do analysis, you have to copy the source data into a separate, analysis-optimized store that keeps it for a longer duration. This could be a central repository, where all data is put in one place; a virtual repository, where data is physically stored in many locations but appears to the end user as one place; or a semi-centralized repository.
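
The copy step from a transactional source into an analysis store can be sketched with two in-memory SQLite databases. This is a deliberately small illustration under made-up assumptions: the `uploads` and `upload_facts` tables, their columns, and the helper names are invented for the example, and a real pipeline would copy incrementally rather than all at once.

```python
import sqlite3


def make_source():
    """Build a stand-in transactional (OLTP) store with operational metadata."""
    src = sqlite3.connect(":memory:")
    src.executescript("""
        CREATE TABLE uploads (
            id            INTEGER PRIMARY KEY,
            uploader      TEXT,
            uploaded_at   TEXT,
            size_bytes    INTEGER,
            session_token TEXT,   -- operational detail, useless for analytics
            client_ip     TEXT    -- operational detail, useless for analytics
        );
        INSERT INTO uploads VALUES
            (1, 'ada',   '2024-01-15T10:00:00', 100, 'tok-a', '10.0.0.1'),
            (2, 'ada',   '2024-01-15T11:00:00', 300, 'tok-b', '10.0.0.1'),
            (3, 'linus', '2024-01-16T09:00:00',  50, 'tok-c', '10.0.0.2');
    """)
    return src


def copy_for_analysis(src, dst):
    """Copy only the analytically useful columns into the analysis store."""
    dst.execute(
        "CREATE TABLE IF NOT EXISTS upload_facts "
        "(uploader TEXT, day TEXT, size_bytes INTEGER)"
    )
    # Keep the date part only and drop tokens and IP addresses on the way.
    rows = src.execute(
        "SELECT uploader, substr(uploaded_at, 1, 10), size_bytes FROM uploads"
    ).fetchall()
    dst.executemany("INSERT INTO upload_facts VALUES (?, ?, ?)", rows)
    dst.commit()
    return len(rows)


source = make_source()
analytics = sqlite3.connect(":memory:")
copied = copy_for_analysis(source, analytics)

# The analytical query runs against the copy, not the busy source system.
totals = analytics.execute(
    "SELECT uploader, SUM(size_bytes) FROM upload_facts "
    "GROUP BY uploader ORDER BY uploader"
).fetchall()
```

The aggregate query never touches the source database, so the business transactions running there are not slowed down, and the analysis store can keep the history for as long as it is useful.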

Databases for Analytics in Azure are:

What is NoSQL Database?

A NoSQL (aka “Non SQL” or “Not Only SQL”) database is any non-relational database. A common misconception is that NoSQL databases don’t store relationship data. NoSQL databases can store relationship data; they just store it differently than relational databases do. Related data doesn’t have to be split between tables: NoSQL data models allow related data to be nested within a single data structure. NoSQL databases allow developers to store huge amounts of unstructured data, giving them a lot of flexibility.

NoSQL databases can scale out instead of scaling up. Scaling out by splitting data across machines is also called sharding or horizontal partitioning. In the Agile world NoSQL often works better, because you can focus on the model and domain and care less about how the data is stored. Domain-Driven Design and a code-first strategy encourage the use of NoSQL databases. Cloud providers also scale databases horizontally, which is one reason they often prefer storing data in NoSQL form.
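
Sharding can be sketched in Python as routing each key to one of several independent stores based on a hash of the key. This is a simplified illustration, not how any particular database does it: the class `ShardedStore` and its methods are made up for the example, and plain dictionaries stand in for separate machines. A stable hash (SHA-256) is used because Python's built-in `hash()` for strings changes between runs.

```python
import hashlib


class ShardedStore:
    """A toy store that spreads keys across several shards (horizontal partitioning)."""

    def __init__(self, shard_count: int):
        # Each dictionary stands in for a separate machine.
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, key: str) -> int:
        """Pick a shard from a stable hash so a key always lands in the same place."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key: str, value) -> None:
        self.shards[self.shard_for(key)][key] = value

    def get(self, key: str, default=None):
        return self.shards[self.shard_for(key)].get(key, default)


cluster = ShardedStore(shard_count=4)
for n in range(100):
    cluster.put(f"user:{n}", {"id": n})
```

A read such as `cluster.get("user:42")` only ever touches one shard, which is what lets capacity grow by adding machines. One design caveat: with simple modulo routing, changing the shard count moves most keys, so real systems typically use schemes such as consistent hashing instead.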

What is SQL Database?

Relational databases accessed through SQL (Structured Query Language) are called SQL databases. Data is stored in a tabular format with fixed columns, data types, and schema. In the Agile world you evolve your system every sprint, and changing your model incurs changes to the data tables. This is one reason software engineers nowadays often prefer NoSQL databases. In the waterfall days, applications typically used SQL databases with a data-first approach.

Summary

Datafiles

  • Delimited Text Files
  • XML Files
  • Log Files
  • Application-specific Files

Databases

  • Relational Databases
  • Graph Databases
  • Document Stores
  • Columnar Databases
  • Key-value Stores


Reference:

https://www.geeksforgeeks.org/understanding-file-system/

https://www.youtube.com/watch?v=l-NmE1ix0MI

https://www.mongodb.com/nosql-explained

https://docs.microsoft.com/en-us/azure/architecture/guide/technology-choices/data-store-overview

