Apache Cassandra Database

Apache Cassandra is one of the newer breeds of database system. Aimed squarely at the Big Data market, Cassandra is a NoSQL database engine. It differs significantly from traditional relational databases in that it can store and access largely unstructured data.

Cassandra has been designed from the ground up to be massively scalable. Its development took a big leap forward in 2008, when Facebook released what had been proprietary technology as an open source project. Its design draws on ideas from Amazon's Dynamo and Google's Bigtable, and the project has since attracted contributors from across the industry. For large-scale commercial use, Cassandra has been proven to work: it powers some of the highest-traffic websites in the world.

Apache Cassandra Architecture

Alongside its NoSQL data storage engine, Cassandra is also extremely scalable. This is achieved through a fully peer-to-peer distributed architecture: every database node that makes up the service platform participates equally. There is no master/slave relationship in which one node is in overall control.

This delivers a platform that is highly fault tolerant. If one node goes down, the remaining nodes are still available and hold replicas of its data. Furthermore, developers do not need to write code to exploit this peer-to-peer architecture, as it is all handled transparently in the background.
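
A rough sketch of why losing one node need not lose data, assuming each row is copied to a fixed number of nodes placed around a hash ring. The node names, the key, the use of MD5, and the helper `replicas_for` are all hypothetical choices for illustration; Cassandra's real partitioner and replication strategies are more involved.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def replicas_for(key: str, nodes: list[str], replication_factor: int) -> list[str]:
    """Return the nodes holding a copy of `key`: the first node clockwise
    from the key's token, followed by the next nodes on the ring."""
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
owners = replicas_for("user:42", nodes, replication_factor=3)

# Simulate one replica going down: the key is still held by the others.
failed = owners[0]
survivors = [n for n in owners if n != failed]
print(f"replicas: {owners}; after losing {failed}: {survivors}")
```

With a replication factor of 3, any single node can fail and two copies of the row remain reachable, which is the behavior the paragraph above describes.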

A final benefit of this peer-to-peer style of architecture is that the entire database service can be split across multiple physical sites, or even across one or more cloud services. The ability to host the same database in several physical data centers adds a very high level of physical data protection.

A Note on Data Consistency

As individual Cassandra nodes serve, store and modify data, changes need to be propagated to the other nodes that hold copies of that data. System administrators are able to set how many copies are kept (the replication factor) and how strict data consistency should be across the node cluster.

This can be controlled at a very granular level, down to the individual operation. For example, a write may be required to reach a majority of replicas before it is acknowledged, while a read is allowed to return as soon as a single replica answers.
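
The trade-off behind tunable consistency can be sketched with a little arithmetic. This is a minimal illustration, not Cassandra's implementation: the function names are made up, and the rule shown is the general overlap rule for replicated stores (a read is guaranteed to see the latest acknowledged write when the replicas written plus the replicas read exceed the number of copies).

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(write_acks: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """True when the set of replicas written and the set read must overlap,
    so at least one replica consulted by the read holds the newest value."""
    return write_acks + read_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                   # 2
print(read_sees_latest_write(q, q, rf))    # True: quorum writes + quorum reads
print(read_sees_latest_write(1, 1, rf))    # False: fast, but only eventually consistent
```

Choosing smaller numbers favors latency and availability; choosing larger ones favors consistency.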

How Does Cassandra Perform?

As we might expect from a NoSQL platform used by Facebook, Amazon, and Netflix, Cassandra tends to perform well in large-scale deployments. As a comparison, in tests alongside HBase, the closest comparable technology, Cassandra has been reported to outperform HBase by a factor of 8 to 10, although results like these depend heavily on the workload and configuration.

In Conclusion

Apache Cassandra is well suited to large-scale applications that need to access huge volumes of unstructured data. That being said, Cassandra is still a good choice for smaller applications, as it delivers a high level of data protection out of the box.

Developing for Cassandra is relatively simple, as most of the truly clever aspects of this technology are handled transparently, so developers rarely need to write platform-specific code. This makes Cassandra easy to adopt, as developers need little ramp-up time before they can start creating applications.


MySQL vs Cassandra DB

What is Cassandra

Cassandra is an open-source NoSQL database from Apache. It is also a distributed database. A distributed database runs on multiple machines, but to the users it looks like only one, because the machines act as a unified whole. This happens through multiple nodes, each of which is an instance of Cassandra. The nodes communicate with each other to distribute the workload. If this node logic sounds familiar, it’s because Cassandra is designed to be organized into a cluster. From there, you can have multiple data centers if you so choose.


Cassandra is also flexible in its scalability. Because Cassandra is so dynamic, you can grow or shrink the database as you need. This isn’t like MySQL, where scaling up involves heavy downtime and you eventually hit a ceiling again. Instead, Cassandra allows on-the-fly expansion: all you need to do is add more nodes to increase storage capacity, CPU power, or RAM. This means little to no downtime is required, and if you go overboard you can scale back just as easily.
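
One reason nodes can be added without taking the database down is that only a fraction of the data has to move. The sketch below illustrates this with a toy hash ring, assuming each node owns several positions on the ring (similar in spirit to virtual nodes). The node names, key names, `vnodes=64`, and the helper `assign` are hypothetical, and MD5 is used only for illustration.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def assign(nodes, keys, vnodes=64):
    """Return {key: owning node}, where each node owns `vnodes` ring positions
    and a key belongs to the first position clockwise from its token."""
    ring = sorted((token(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    tokens = [t for t, _ in ring]
    return {k: ring[bisect_right(tokens, token(k)) % len(ring)][1] for k in keys}


keys = [f"row-{i}" for i in range(1000)]
before = assign(["node-a", "node-b", "node-c", "node-d"], keys)
after = assign(["node-a", "node-b", "node-c", "node-d", "node-e"], keys)

# Only keys that now belong to the new node change owner; the rest stay put.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key that moves lands on the new node, so the existing nodes keep serving the rest of the data while the new one fills up.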

Open-Source Database

As we’ve talked about in the past, MySQL and Cassandra are both open source. A few articles ago we talked about the proprietary edition that MySQL offers, which is a paid service with additional support and capabilities. With Cassandra, I found information in their open-source documentation, but I couldn’t find anything about it having paid features or proprietary code. If this is incorrect please let me know in the comments, but from what I’m seeing, Cassandra is true open source.


Database Capabilities

So, first, let’s talk about the structure of the databases. MySQL, as you know, is an RDBMS (Relational Database Management System). Cassandra, however, is a NoSQL database. In practice, MySQL deployments tend to follow a master/worker architecture, while Cassandra follows a peer-to-peer architecture.


We already know that MySQL supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions. Cassandra, however, does not provide full ACID transactions out of the box. This does not mean its behavior cannot be brought closer to those properties. For example, tuning Cassandra’s replication and fault tolerance improves durability. Another example is tuning the consistency. Cassandra is an AP (Available, Partition-tolerant) database, but you can configure the consistency level on a per-query basis.

When we’re looking at scalability, MySQL more commonly supports vertical scaling. Horizontal scaling is also possible through replication or sharding. Cassandra, on the other hand, supports both horizontal and vertical scalability.

Although this one is a little more specific, let’s also look at the performance of Read transactions. But first, we need to look at JOINs to understand the logic. As you know well by now, MySQL, or any RDBMS for that matter, supports JOINs between multiple tables in a query. Cassandra, on the other hand, does not support JOINs. Instead, each query SELECTs from only one table. So, because multiple tables can be joined in a MySQL Read, the performance is often described as O(log(n)). With only one table being read at a time and rows located by their key, Cassandra’s performance is often described as O(1).

When looking at Write statements, MySQL’s performance can be slowed because a search is performed before the write. Instead of a search, Cassandra uses the append model, which provides higher performance when writing.
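
The append model can be sketched in a few lines. This is a deliberately simplified toy, not Cassandra's storage engine (which uses a commit log, in-memory tables, and immutable files on disk): the class `AppendOnlyTable` and its methods are hypothetical. It only shows the idea that a write never looks up the existing row, and that the newest version is picked when the data is read.

```python
import itertools


class AppendOnlyTable:
    """Toy log-structured table: writes append, reads pick the newest version."""

    def __init__(self):
        self._log = []                    # list of (sequence, key, value)
        self._clock = itertools.count()   # stands in for a write timestamp

    def write(self, key, value):
        # No search for an existing row: just append a new version.
        self._log.append((next(self._clock), key, value))

    def read(self, key):
        # Resolve at read time: the version with the highest sequence wins.
        versions = [(seq, value) for seq, k, value in self._log if k == key]
        return max(versions)[1] if versions else None


table = AppendOnlyTable()
table.write("user:1", "alice@example.com")
table.write("user:1", "alice@new.example.com")   # an "update" is just another append
print(table.read("user:1"))                      # alice@new.example.com
```

The cost of skipping the search on write is paid on read, where several versions may have to be reconciled.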

Administrative

Maybe a given, but MySQL, because it is an RDBMS, supports Referential Integrity and has Foreign Keys. Because Cassandra is a NoSQL database, it does not enforce Referential Integrity and therefore does not have Foreign Keys.


To ensure consistency within a distributed system, MySQL provides the Immediate Consistency method, but it is the only type provided. Cassandra allows both Immediate Consistency methods and Eventual Consistency methods.

As far as operating systems go, MySQL is used on FreeBSD, Linux, OS X, Solaris, and Windows. Cassandra, however, is only supported on BSD, Linux, OS X, and Windows. MySQL, as we have learned, is written in C and C++. Cassandra, on the other hand, is written in Java. MySQL is developed by Oracle, while Cassandra is developed under the Apache Software Foundation.

Advantages of Cassandra

As we talked about while describing Cassandra, its scalability is a great advantage. Scaling can be done quickly with no downtime, as you do not have to shut the database down to do it. Both horizontal and vertical scalability are options, and because Cassandra scales close to linearly, adding nodes adds capacity roughly in proportion.


Along with scalability, the data storage is flexible. Because it is a NoSQL database, it can deal with structured, unstructured, or semi-structured data. In the same way, the data distribution is flexible. Several different data centers can be used, which makes it easier to distribute the data.

Performance was another factor we discussed. The benefit we’ll talk about here is how it handles simultaneous read and write statements. Even multiple write requests are handled quickly and do not affect the read requests.

Another benefit to using Cassandra is its easy language, CQL (Cassandra Query Language), which is offered as an alternative to SQL. Cassandra also has the benefit of decentralization. Because of the structure of the nodes, there is no single point of failure. If a node were to fail, another node could serve the same data, and therefore the data would still be available.

Disadvantages of Cassandra

One disadvantage of Cassandra is that because it is NoSQL, it does not offer full SQL syntax, so there is a list of features Cassandra doesn’t have. For example, there is no enforcement of Referential Integrity, and there are no subqueries or JOINs. Cassandra has limited querying capabilities: clauses such as GROUP BY and ORDER BY and aggregate functions are either unavailable or far more restricted than in MySQL. In addition, Read requests can run slowly. We mentioned that Write requests can run quickly, but multiple Read requests can delay results.


Data in Cassandra is modeled around the queries you expect to run. Tables are often designed one per query, which means there is the potential for duplicate data. Especially with Cassandra being NoSQL, you may have to deal with duplicate data yourself, as it will not be automatically rejected the way it would be in MySQL or other SQL databases.
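
A small sketch of what modeling around predicted queries looks like, using plain dictionaries to stand in for tables. The table names, the `record_order` helper, and the order fields are all hypothetical. It shows two things: the same order is stored once per anticipated query, and writing the same key twice overwrites the earlier row instead of raising a duplicate-key error.

```python
from collections import defaultdict

# One "table" per anticipated query, keyed the way that query looks data up.
orders_by_customer = defaultdict(dict)   # customer_id -> {order_id: order}
orders_by_product = defaultdict(dict)    # product_id  -> {order_id: order}


def record_order(order_id, customer_id, product_id, quantity):
    """Write the order into every table that a known query will read from."""
    order = {"order_id": order_id, "customer_id": customer_id,
             "product_id": product_id, "quantity": quantity}
    orders_by_customer[customer_id][order_id] = order
    orders_by_product[product_id][order_id] = dict(order)  # deliberate duplicate copy


record_order("o-1", "c-7", "p-3", quantity=2)
record_order("o-1", "c-7", "p-3", quantity=5)   # same key again: overwrites, no error

print(orders_by_customer["c-7"]["o-1"]["quantity"])   # 5
print(len(orders_by_customer["c-7"]))                 # 1
```

Keeping the copies in step is the application's job, which is the price paid for reading each query from a single table.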

Advantages of MySQL when compared to Cassandra

The largest advantage of MySQL, when compared to Cassandra, is the fact that it is an RDBMS. Primarily, we are talking about JOINs, aggregates, and other functionality such as enforcing Referential Integrity. There is also more flexible querying, and you can combine clauses freely to yield different results. However, MySQL is more permissive than some other SQL systems, so its compliance with SQL standards is limited.


MySQL also tries to prevent duplicate records, which Cassandra does not. Not only this, but MySQL is also ACID compliant, which may be that extra structure you need for your database.

Disadvantages of MySQL

The first major disadvantage, when compared to Cassandra, is the flexibility when scaling. Although MySQL can scale, there is downtime associated, and even then, you can still hit ceilings. In addition, query speed can suffer because of all the potential combinations in the queries. For example, if you’re joining multiple tables, that could slow your results.


MySQL is also only partially open source. Alongside the open-source edition, there is a paid version that includes proprietary code as well as additional support.

Conclusion

In today’s MySQL versus series, we looked at Cassandra, a NoSQL database. Cassandra is another open-source and highly scalable database that has been gaining popularity. It is an Apache project designed to run as a distributed database across a series of nodes. This means there can be multiple data centers, and if you want to scale you simply add or remove nodes. Although Cassandra is flexible and useful, it does not follow standard SQL practices, such as enforcing Referential Integrity, and it requires the user to write separate queries instead of JOINing, which is not supported.



Cassandra is a really good solution for many projects and a helping hand for Big Data scientists. I think there is one more limitation: for example, other users' keyspaces can be displayed, which can be a real security problem. We also cannot limit a client's performance; the only option is to let the client in or not. I think we will see an update in the next few months that tries to solve this security issue. Still, I find Cassandra really powerful as a database, one of the best ever. Thank you for your great posts, Omar.

