Hadoop & NoSQL Data Modeling & OLTP

The origin of Hadoop is the UNIX file system database with NFS enabled. NFS stands for Network File System, a distributed file system protocol originally developed by Sun Microsystems in 1984 that allows a user on a client computer to access files over a network as if they were local. UNIX, the operating system for which NFS was designed, and its file system database date back to November 3, 1971, but neither was called Hadoop. The open source community modified and re-branded the approach as Hadoop on December 10, 2011, but it is simply a legacy file system database with a network file system from the 1970s and 1980s.

Hadoop and NoSQL databases have never had DIRECT support for data modeling or OLTP. They were designed for OLAP queries, very limited numbers of concurrent users, data archiving, or individual desktop NoSQL databases such as MS Access or Excel. Hadoop relies on pre-constrained data from an external RDBMS because it does not support primary keys, foreign keys, or the other RDBMS features that allow business requirements to be encoded into the data structure through data modeling. An Enterprise Data Modeler is required to create a conceptual, logical, and physical data model; only then can data from the physical model be exported to Hadoop.

To get data into Hadoop correctly, run it through a correctly designed multi-dimensional physical data model, then create a procedure that selects the data and spools it to a CSV or TSV file (comma-separated values or tab-separated values). Spool the file into a directory or file that matches the table name so you can keep track of where the data came from. The process is exactly the same as it was with the original UNIX file system. Hadoop uses standard UNIX commands such as mkdir, cp, rcp, ftp, and rsync; simply place "hdfs dfs -" before the standard UNIX command. You may also compare the Hadoop command list to the commands published by Oracle for Solaris / UNIX.

Below is a sample CSV export script for Oracle. Note that the linesize parameter must be large enough to fit the data, otherwise the export may omit the comma delimiter. The numwidth parameter may also need adjustment to fit your data. The colsep parameter specifies a comma as the column separator for CSV.

set linesize 60     -- total width of each output line
set colsep ,        -- separate columns with a comma
set headsep off     -- remove dashes that separate headers
set pagesize 0      -- remove header row from output
set trimspool on    -- remove trailing blank spaces 
set numwidth 3      -- size of the number column

spool employees.csv -- create output file for data

select EMPLOYEE_ID, FIRST_NAME, LAST_NAME from hr.employees;

spool off           -- stop appending to data file 

The output file should be similar to the following. Note that the file format matches the order of columns in the SQL statement above. The same column order must be maintained when the data is imported with the table create statement in Hive or other ETL tools.

    100,Steven       ,King
    101,Neena        ,Kochhar
    102,Lex          ,De Haan
    103,Alexander    ,Hunold
    104,Bruce        ,Ernst
    105,David        ,Austin
    106,Valli        ,Pataballa
    107,Diana        ,Lorentz
    108,Nancy        ,Greenberg
    109,Daniel       ,Faviet
    110,John         ,Chen
    111,Ismael       ,Sciarra
    112,Jose Manuel  ,Urman
    113,Luis         ,Popp
    114,Den          ,Raphaely
    115,Alexander    ,Khoo

Once the data is in CSV or TSV form, loading it with Hadoop commands is about the same as using SQL and a few UNIX-style commands, although the SQL layer has been re-branded as Hive. Hive can be used through a web browser by going to the URL provided by your Hadoop instance, for example https://yourHadoop.yourURL.com:8080 or https://localhost:8080. Use the menu to go to Hive 2.0 to get a command window.

The Hive command to create tables is nearly identical to standard SQL, with the exception of specifying a delimiter. Remember that Hadoop is not an RDBMS. Create table does not create a traditional table, just a flat file with references that mimic the functionality of a table. In case you're wondering, you should probably never put social security numbers into this type of system. If you do, you must first plan a way to secure the data with encryption before it is exported to CSV and long before it is imported to Hadoop.

The line that reads FIELDS TERMINATED BY ',' tells Hadoop that you are importing a comma-separated value (CSV) file. The "skip.header.line.count"="1" property tells the Hive script to skip the first line of the CSV file while importing data if the file has headers. The export script above removes the headers, so this property is not strictly needed here. However, one must be careful to match the order of the columns between the commas to the table creation statement. Otherwise, Hadoop will create data integrity errors, such as placing the employee ID where the employee name should be. Even if you leave the headers in, Hadoop will not automatically sort and match the data to the table design. Again, both the data order in the CSV file and the table creation statement must match precisely.

create table myTableNameHere
(myTableID int,
 name string,
 autoIDNum bigint,
 location string,
 phone string,
 email string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

To load the data, type LOAD DATA INPATH followed by the path to the CSV file you exported from your RDBMS, between single quotes. Don't forget that the data must be pre-constrained by running it through a relational data model, preferably a multi-dimensional 2NF star schema. Although it appears that Hadoop supports data types and IDs, it does not; the data is simply in a flat file. Therefore, the data must be literally perfect before it is loaded. Also note that the data at each position in the file must exactly match the specification in the create table statement. For example, the first value must be an integer because myTableID is defined as int. You cannot import it and then change it later. Again, it must be literally perfect on the way in. If you do change it later, you can only overwrite the entire table or append to it, as sketched after the load statement below. Again, Hadoop was meant for archival data which never changes; transactional updates are not supported.

LOAD DATA INPATH '/tmp/data/myTableNameHere.csv' 
OVERWRITE INTO TABLE myTableNameHere;
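
If a later extract arrives, the only options are to overwrite the table again, as above, or to append. The following is a minimal HiveQL sketch under that assumption; the second file path and the verification query are illustrative only and are not part of the original export.

-- Append a later extract instead of overwriting (hypothetical path)
LOAD DATA INPATH '/tmp/data/myTableNameHere_archive2.csv'
INTO TABLE myTableNameHere;

-- Spot-check a few rows; Hive will not validate types,
-- uniqueness, or column order for you
SELECT myTableID, name, location
FROM myTableNameHere
LIMIT 10;

Note that the append simply places another file under the table's directory, which is consistent with the point above that a Hive table is only a flat file with references.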

If you cannot find a Hadoop resource, look for a UNIX System Administrator. He or she can learn the minor differences between Hadoop and the original UNIX file system in about one to three hours. Usually, only an experienced UNIX System Administrator or an Enterprise Data Modeler whose skill set includes UNIX would recognize NFS and the old file system database on sight, which is why most people believe Hadoop is new and "better" than an RDBMS. However, file system databases do not support primary keys for uniquely identifying data, nor foreign keys for constraining data according to business requirements. In 1971, no one had petabytes of data or even gigabytes of data, so small file system databases were adequate. Most disk drives held about 40 megabytes, and the largest available disk was only 100 megabytes. The first one gigabyte disk was produced in 1980, so file system databases were still adequate until 1987, the year RAID and large disks composed of multiple smaller disks were invented. At that point, the data was too big and too complex for NoSQL and file system based databases, but for some reason, almost everyone has forgotten the wisdom of the past. Most importantly, the skill to tell when to use structured data, with foreign keys, primary keys, and relational constraints, and when to use unstructured file system data has been largely lost.

Over 40 years ago, in developed countries, the file system database was widely abandoned in favor of relational database management systems for CRUD transaction support, security, scalability, interoperability, performance, and most of all modern Enterprise Data Modeling, which enables one set of data to be constrained by another set of data automatically while implementing security and encoded business requirements. CRUD support covers create, read, update, and delete, instead of just the overwrite and append supported by Hadoop. Again, none of the RDBMS features are supported on file system databases. Additionally, because Hadoop is file system based, there is no way to control the read and write operations that may corrupt the data if thousands of end-users start to read and write to the system simultaneously. That is why OLAP and file system databases were usually limited to one or very few users operating concurrently. For example, a process or ad-hoc query could try to read from a file before it is completely written and create a corrupted copy of the data, either in the interface or in the HDFS distributed copy of the data. This was a known issue in the 1970s and is still listed by Apache as a "known issue" because there is no way to resolve it using file system databases.

Over 40 years ago, the file system read/write data corruption issue was resolved by implementing Relational Database Management Systems, which have optimistic and pessimistic locking features. Optimistic locking allows one to read the previous copy of the data, noting the version number, while it is being updated, but does not allow one to read or write a partially changed record. Pessimistic locking blocks all reads and writes for a record until it has been completely updated. That is the advanced transaction management required for OLTP. OLTP capability is a business requirement for over 99 percent of all IT projects, which also require end-user support for hundreds to over a billion users.
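
As a minimal sketch of the two locking styles in a standard RDBMS, assume a hypothetical orders table with a version column; the table, column, and values here are illustrative only, not part of any specific system.

-- Optimistic locking: read the row and note its version number
SELECT order_id, status, version
FROM orders
WHERE order_id = 1001;

-- Update only if no one else changed the row since it was read;
-- zero rows updated means another session changed it first and the read must be retried
UPDATE orders
SET status = 'SHIPPED',
    version = version + 1
WHERE order_id = 1001
  AND version = 7;

-- Pessimistic locking: lock the row so concurrent updates wait until this transaction commits
SELECT order_id, status
FROM orders
WHERE order_id = 1001
FOR UPDATE;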

While many claim Hadoop can support billions of users, Hadoop HDFS does not and cannot have transactional locking features because it is a file system type database based on NFS, not an RDBMS (Relational Database Management System). Despite distributing its file systems over thousands of nodes, it cannot control reads and writes directly on any file system for more than one user per node, and definitely not for millions to over a billion users. While it is true that one can connect a billion people to tens of thousands of distributed Hadoop nodes, one cannot do so with data integrity, high transactional performance, and business logic intact.

Additionally, such a large distributed system would cost approximately five billion dollars per quarter, or about twenty billion dollars per year, to maintain due to its size, power costs, administrator labor, land, building maintenance, cooling costs, and other data center costs. Using distributed computing for scalability is a forty-year-old strategy that was abandoned in favor of Enterprise Class Super Computer Systems and object oriented code that can use cluster-capable systems and large amounts of memory on 64-bit systems.

The distributed computer file system database strategy was only valid before 64-bit systems were widely available at low cost, which was before 2006. In 2003, Apple adopted the 64-bit chip. By 2006, Intel had produced its "clone copy" of the AMD 64-bit chip. Previously, only UNIX systems were 64-bit capable. By the time Hadoop was created in 2011, it was already obsolete in countries with economies that could afford new 64-bit Enterprise Computer Systems. Countries such as India, which are less developed, are still using 32-bit systems, which have a 4 GB memory limit compared to the 16 exabyte memory limit of a 64-bit system, which is 16 billion gigabytes. For example, the price of a new Oracle T7-2 or T7-4 Enterprise Blade Server is greater than the price of an apartment in an Indian city center or a family home outside a city center. Given the choice of buying a house or a computer, most people there would buy a house. US currency is also worth over 73 times the value of the Indian rupee and some other foreign currencies, and $50,000 USD will still buy a very nice home in India. So, many workers are willing to work for much lower wages, often taking jobs that traditionally pay $200,000 to over $1,000,000 per year for less than $50,000 to $100,000, even if they must live in poor conditions, which drives down wages in the US.

Unfortunately, because of national financial limits, most people in countries such as India could only use old strategies designed for older, cheaper systems, and the only way to scale on old computers is distributed computing. Unfortunately, many US companies have been set back 40 years in development because those workers use and recommend only what they know when they are hired in the US. That, combined with the H-1B strategy, has created mass chaos in the IT data industry, with 70 to 85 percent failure for Big Data projects, over 99 percent failure and cancellation for ERP projects, and an overall 93 percent failure and cancellation rate for all IT projects combined. Many of the issues causing these failures can be traced back to problems that were resolved 40 years ago, such as optimistic and pessimistic locking.

However, to fully implement optimistic locking, the data model structure must have timestamps for each table and a time dimension for each fact table in multi-dimensional designs. Object oriented code can then use the locking features of the database and the data within the structure. The timestamp data type is often misunderstood and misused by non-data modelers who attempt a "do-it-yourself" database design, and the time dimension is usually missing entirely from their designs, assuming they know enough to transform the logical design into a multi-dimensional design at all. Misunderstanding how time is integrated into data model designs is one of the primary causes of system failure, data corruption, and performance issues. A data governance policy would mandate time integration for all data models along with professional data modeling services.
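
As a hedged illustration of that structure, a minimal DDL sketch might look like the following, assuming a hypothetical sales star schema; the table and column names are examples, not a prescribed design.

-- Time dimension shared by every fact table (hypothetical names)
CREATE TABLE dim_time (
  time_id      INTEGER PRIMARY KEY,
  full_date    DATE    NOT NULL,
  year_num     INTEGER NOT NULL,
  quarter_num  INTEGER NOT NULL,
  month_num    INTEGER NOT NULL,
  day_num      INTEGER NOT NULL
);

-- Fact table with a foreign key to the time dimension and a
-- row-level timestamp that optimistic locking logic can check
CREATE TABLE fact_sales (
  sale_id       INTEGER        PRIMARY KEY,
  time_id       INTEGER        NOT NULL REFERENCES dim_time (time_id),
  customer_id   INTEGER        NOT NULL,
  sale_amount   NUMERIC(12,2)  NOT NULL,
  last_updated  TIMESTAMP      NOT NULL
);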

Since Hadoop does not have these locking mechanisms, it can also corrupt data and then replicate the corrupted blocks to all nodes, causing system-wide data loss. One way to mitigate this issue is to estimate the number of concurrent end-users and create one distributed node per user, so that each node runs a single thread. For example, if a company has 10 million global users spread evenly over 24 hours, that works out to roughly 10,000,000 users divided by 86,400 seconds, or about 116 users arriving per second, so the company may need approximately 115 nodes. However, despite mitigation, this and other issues directly prevent OLTP, because file system databases were originally made for OLAP workloads, which are slower, usually limited to one user, and less likely to cause a file locking or data corruption issue. Neither Hadoop nor NoSQL supports high volume OLTP transactions, and they are therefore useless in over 99 percent of modern IT projects. The only practical use for NoSQL is small single-user spreadsheets. The only practical use for Hadoop is archiving data and OLAP style analysis. Even in those rare cases, the data must be constrained in a properly designed database before it is exported to Hadoop. Because data modeling is almost never done in big data projects, most years up to 85 percent of those projects fail.

Additionally, once one dumps data to a file system or NoSQL system, there is usually ZERO security. The file system is the first thing hackers can access if they break into a computer. Dumping one's entire data set to a flat file in a file system is like committing suicide. One does not even have to be a hacker to steal private data from a text file in a file system. If one attempts to encrypt the files in Hadoop, one usually also breaks compatibility with most of the tools that use Hadoop for data analytics. That makes Hadoop useless or nearly useless for enterprise level projects. Hadoop is meant for very small projects with a very small or zero budget, no high security requirements, no OLTP requirement, and only non-personal data to archive.

If you have hired someone who is telling you that Hadoop or any NoSQL system should be your primary transactional database, fire him or her immediately! If they try to tell you, "Google was founded on Hadoop," look up when Google was founded (September 4, 1998, in Menlo Park, CA) and when Hadoop was initially released (2011). Even if the claim were true, the data architecture of a search engine would only apply to a search engine, not to health care, manufacturing, e-commerce, etc. Most people who try to go back to the old technologies while claiming they are "new" have no real-world experience and are likely repeating information from marketing materials meant solely to increase IT consulting resource sales without regard for ethics. They are hoping that the management of their target company will not know the difference between out-dated, inadequate, insecure legacy technology and technology actually designed to meet their business requirements. For example, health care and financial data require structured data, encryption, and maximum security at the row level. Hadoop simply dumps the data into the file system unstructured, without row level security and without encryption. Even if one tries to encrypt the data, the system usually corrupts it as soon as it is used, decrypted, and then re-encrypted.

Don't believe it so far? Review the known issues page for Hadoop. Pay special attention to the Hive SQL injection attack vulnerability, which allows external users and hackers to copy your data if specific precautions are not taken. It's like reliving IT in the 1970s and 1980s and repeating the classic mistakes modern technology was made to prevent. For example, instead of replicating file system blocks, including corrupted blocks, over multiple distributed nodes using Hadoop HDFS, which is a poor derivative of NFS, one should be using a relational database management system with replication between multi-master or single-master nodes.

Also, beware if the HDFS distribution is not connecting with SSL or other types of encryption. More than 30 major organizations, including banks, have been hacked because they did not encrypt connections between Hadoop nodes or while migrating data to a cloud hosting system such as AWS. Without network encryption, the data streams in clear text between file systems. Clear text is visible to any network card on the network in promiscuous mode, so the data can be covertly copied with minimal hacking skills.

Additionally, modern file systems such as ZFS are transactional and prevent the creation, use, and replication of bad blocks on disk. With a transactional file system, data is managed using copy-on-write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored, so the file system cannot be corrupted through accidental loss of power, poorly written applications, system errors, or a system crash. ZFS acts as both the file system and the volume manager and has a full record of the complete state of the hardware and the data. In case you're wondering, you need a real 64-bit UNIX system to implement ZFS. A real enterprise level system with enterprise level security would combine an OLTP capable relational database management system, ZFS, UNIX, and an object oriented programming language capable of encryption, scalability, high availability, and multi-platform compatibility.

Why are most people wrong about Hadoop? Because they did not do a complete SDLC analysis, document their business requirements, or hire real IT professionals at living wages for their area, and, most of all, because they copied what someone else did, or claimed to have done, without validating it. Most companies that implement Hadoop do not check whether it matches their business requirements and consequently never complete the prerequisite Enterprise Data Modeling which constrains the data before it is exported to Hadoop. Additionally, for security, the data must be encrypted and Hadoop should only be implemented on a lockable UNIX file system. Most of all, whoever implements Hadoop must also be a Master Physical Enterprise Data Modeler, usually with more than 15 years of experience.

I once worked with a company that specialized in health care systems and found that their data modeling resource had done nothing for five years; he simply acted arrogant and ignored meeting requests while pretending to be too busy. Upon deeper research, I discovered that all the work he claimed to have done did not exist and that he had no idea how to do it. Additionally, most of the other staff members were H-1B workers from India who had convinced the managers to implement MS SQL Server, which is not actually OLTP capable despite what the documentation says and has no query result cache or in-memory query capability. They did not complete the analysis before recommending the solution. The analysis would have revealed the performance goals, such as the number of concurrent end-users. The number of expected end-users was over ten million per day. The system only supported one to five simultaneous users, with response times of up to two minutes on web interfaces to the database.

I also discovered that none of their developers knew how to write object oriented code, even while using an object capable language. They simply wrote the code to execute top to bottom like a giant external procedure and then called stored procedures in the database. Out of the blue, one of the H-1B workers from India said, "Just go to Hadoop!" after the system failed miserably. The managers did not even know why it was important to avoid stored procedures in OLTP systems, even with PROCEDURE in the name of what they were using. Anyone with basic computer science knowledge knows the difference between procedural code and object oriented code, or between an RDBMS and a legacy file system database, and, most importantly, when to use each one. For weeks, I heard, "Just go to Hadoop!" while I covertly created a real data model and documented a real enterprise system design.

Again, because Hadoop does not support data modeling, it cannot support the encoding of business requirements at the core data level, which would prevent anyone or any application from violating data security, data governance policies, business rules, business requirements, and approved business processes. It relies on standard data modeling in an external RDBMS to pre-constrain the data. Yet I have been sent over 970 job query emails indicating that Hadoop is "mandatory" or "critical." Immediately, I knew that no one in management at those companies knew that their data was simply being dumped into a file system and distributed to multiple nodes across the network with no security and no data integrity. Some of those companies are large health care organizations, large banks, publishers, and insurance agencies holding hundreds of millions of sets of private customer data! Meanwhile, major corporations are losing their customer data to hackers without understanding why.

Additionally, one must note that the code above to export data from Oracle and import it into Hadoop is not complex and is simple to write and maintain. In fact, the entire Hadoop system is extremely simple because it is based on the legacy UNIX file system and NFS. What causes Hadoop projects to fail is the lack of traditional conceptual, logical, and physical data modeling BEFORE exporting data to Hadoop, which is a very common failure of business analytics. The failure of business analytics and of encoding requirements at the core data level is the primary cause of IT project failure and data breaches. Take my advice: if you implement Hadoop, do the SDLC analysis, design, and data modeling first. Alternatively, one can implement UNIX, Java, ZFS, and replication on OLTP capable databases such as Oracle, PostgreSQL, and IBM DB2/UDB, which also requires completing the SDLC analysis, design, and data modeling first. If you don't want to "spend the money," simply use Erwin to reverse engineer your data models to PostgreSQL, a free open source database that rivals Oracle and IBM DB2, then archive your data to single-master replicated PostgreSQL instances for Big Data processing, as sketched below.
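
As a hedged sketch of that PostgreSQL alternative, the CSV extract produced earlier could be archived straight into a PostgreSQL table with the built-in COPY command; the table definition and file path below are assumptions carried over from the earlier examples, not a prescribed design.

-- Hypothetical archive table matching the earlier employees extract
CREATE TABLE employees_archive (
  employee_id  INTEGER PRIMARY KEY,
  first_name   VARCHAR(50),
  last_name    VARCHAR(50)
);

-- Bulk-load the spooled CSV file (assumed server-side path)
COPY employees_archive (employee_id, first_name, last_name)
FROM '/tmp/data/employees.csv'
WITH (FORMAT csv);

Unlike the Hive load above, the primary key here will reject duplicate or malformed employee IDs at load time, which is exactly the pre-constraining behavior the file system approach lacks.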

Lastly, I suggest hiring the most experienced professionals one can find at a living wage in the area in which one operates, to avoid a complete loss of project investments. Additionally, most managers are hired and deployed as if they were working for a manufacturing plant with an assembly line. The difference between IT management and manufacturing management is that IT management must be as technical as, or more technical than, the staff they manage. If they are not, the entire organization could be led to implement the wrong technology, lose billions of dollars, and possibly go out of business. Implementing the correct Information Technology is no longer optional; it is critical to business success and, in some cases, survival.
