Data Vectorization [Series#2: I am Data!]
Mustafa Qizilbash
Open for New Opportunities (Globally) | Author & Podcaster of “Let’s Talk About Data!” | Data & AI Practitioner & CDMP Certified | Innovator of DAC Architecture & PVP Approach
Data Vectorization is very common nowadays, especially since the inception of Big Data and Hadoop. It is not that it was unused in the past, but I would say it was not well known.
All of us must have heard about MPP (Massively Parallel Processing), right? But do we know how it works at the back end? Data Vectorization is about enabling parallel processing to fetch data.
Let’s decode it...
Flynn's taxonomy describes four ways a machine can combine instruction streams and data streams: SISD, SIMD, MISD and MIMD (all explained in a separate topic). In this topic, we will be referring to SISD and SIMD only.
Traditionally, a computer, machine or server works in SISD mode, i.e., Single Instruction, Single Data, which means each instruction fetches and processes the required data one item at a time.
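To make the "one by one" idea concrete, here is a minimal Python sketch of SISD-style processing. The function name `total_sisd` and the sample values are made up for illustration; the point is only that each step of the loop touches exactly one data item.

```python
def total_sisd(values):
    """Sum values one at a time: each step applies one operation to one item."""
    total = 0
    steps = 0
    for v in values:      # one data element per iteration
        total += v        # one operation on that single element
        steps += 1
    return total, steps


# Hypothetical sample data, for illustration only.
readings = [4, 8, 15, 16, 23, 42]
total, steps = total_sisd(readings)
print(total, steps)  # six items need six sequential steps
```

The number of steps grows in line with the number of items, which is why this style becomes slow on very large datasets.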
In Data Vectorization, we change our approach to MPP mode: the computer, machine or server starts working in SIMD mode, i.e., Single Instruction, Multiple Data. This means that if one query is executed and the data resides on multiple data nodes, data from all nodes is pulled in parallel, making computation much faster compared to SISD.
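The "one query, many data nodes" pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the partitions in `DATA_NODES`, and the helpers `scan_node` and `query_all_nodes` are hypothetical, and Python threads stand in for real data nodes. A real MPP engine distributes the work across separate machines, and hardware SIMD applies one CPU instruction to several values at once, so treat this as a model of the idea rather than of any specific database.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data nodes: each holds one partition of the same table.
DATA_NODES = {
    "node_a": [4, 8, 15],
    "node_b": [16, 23],
    "node_c": [42],
}


def scan_node(rows, threshold):
    """The single 'instruction': filter and sum one node's partition."""
    return sum(r for r in rows if r > threshold)


def query_all_nodes(nodes, threshold):
    """Send the same operation to every node at once, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(
            pool.map(lambda rows: scan_node(rows, threshold), nodes.values())
        )
    return sum(partials)


print(query_all_nodes(DATA_NODES, threshold=10))
```

Each node answers the same question about its own slice of the data, and the coordinator only has to combine a handful of partial results instead of reading every row itself.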
Data Vectorization has become a key component of any data solution, especially since Hadoop, NoSQL databases and the Cloud surfaced. Most databases are now Data Vectorization or MPP enabled.
The question is, why is Data Vectorization so important? In the Big Data era, datasets from social media, CCTV, audio and similar sources can also produce valuable insights, so organizations have started to store and use them. But to use such data, one must first process it. SISD was not a suitable technique for datasets of this size, because processing gigabytes and terabytes of data sequentially could take days. SIMD, MPP or Data Vectorization therefore became the chosen technique: it processes data in massively parallel mode, which can make computation hundreds of times faster than SISD.
Cheers.
I am an Enterprise Data Management, Data Governance, Data Modeling Experienced Professional | As a Team Leader, I ensure the highest data quality, security, and compliance standards.
2 years ago: Well explained. Data Vectorization: I understood this topic from data science, where Python and R have vector-type variables. The implementation in databases makes a real difference. I believe in-memory databases keep the storage or table space in vector form, which makes a real difference when we retrieve or summarize a massive amount of data.