The NoSQL Ecosystem
Introduction:
Unlike traditional SQL databases, NoSQL is an ecosystem of diverse, complementary, and competing tools offering alternatives for data storage. Understanding NoSQL requires exploring this landscape and the design choices each tool presents. Choosing a NoSQL system necessitates a deeper understanding of systems architecture.
1.1 What's in a Name?
"NoSQL" literally means "Not Only SQL," reflecting a broader meaning than simply "not using SQL." NoSQL systems provide alternatives to relational databases, allowing developers to use SQL alongside other approaches. They might replace a relational database entirely or integrate it with other systems.
1.1.1 SQL and the Relational Model
SQL is a declarative query language. The programmer specifies?what?to do, not?how?to do it. Examples include:
SQL's abstraction hides details like data layout, indexing, and algorithms. Query optimizers determine the most efficient execution plan. However, this abstraction can lead to unpredictability in query costs.
1.1.2 NoSQL Inspirations
Two influential research papers inspired the NoSQL movement:
Many NoSQL systems blend features from BigTable and Dynamo. For example:
1.1.3 Characteristics and Considerations
NoSQL systems simplify database operations for better performance prediction. Complex query logic often shifts to the application layer. They also relax traditional guarantees like ACID properties (Atomicity, Consistency, Isolation, Durability) in exchange for performance. This makes partitioning across multiple machines easier. However, NoSQL systems are still evolving, and specific features change rapidly.
Key considerations when choosing a NoSQL system include:
1.2 NoSQL Data and Query Models
NoSQL systems often use simpler data and query models than SQL. Common models include key-oriented storage and graph models, with query languages like key lookups and MapReduce.
1.2.1 Key-based NoSQL Data Models
NoSQL databases frequently restrict data retrieval to a single key field. Complex operations like joins are handled in application logic. This improves performance predictability but reduces abstraction.
Different types of key-based stores exist:
1.2.2 Graph Storage
Graph databases (e.g., HyperGraphDB, Neo4J) are best suited for graph-structured data, differing significantly from other NoSQL types in data model, query patterns, and physical storage.
1.2.3 Complex Queries
Some NoSQL systems offer more advanced query capabilities:
1.2.4 Transactions
Most NoSQL systems prioritize performance over full ACID transaction support. They often provide key-level serialization but leave more responsibility for transactional integrity to the application. Redis is a notable exception, offering atomic operations within a single server.
1.2.5 Schema-free Storage
Many NoSQL systems lack schema enforcement, providing flexibility but requiring more defensive programming in the application layer to handle schema variations.
1.3 Data Durability
Data durability involves persisting data to stable storage and replicating it across multiple locations. Different NoSQL systems offer varying durability guarantees to balance performance and data safety.
1.3.1 Single-server Durability
Techniques for improving single-server durability include:
1.3.2 Multi-server Durability
Techniques for multi-server durability involve replication across multiple machines:
1.4 Scaling for Performance
Scaling involves distributing the workload across multiple machines.
1.4.1 Do Not Shard Until You Have To
Before sharding, consider:
1.4.2 Sharding Through Coordinators
Coordinators (e.g., Lounge, BigCouch for CouchDB, Twitter's Gizzard) manage sharding across multiple independent instances of a database.
1.4.3 Consistent Hash Rings
Consistent hashing distributes keys across servers using a hash function. A "ring" represents the keyspace. Servers are placed on the ring, and keys are mapped to the nearest server. Replication is achieved by assigning keys to multiple servers on the ring. (Diagram below)
Diagram 1.1: Consistent Hash Ring
+-----------------+ +-----------------+ +-----------------+
| Server A |---->| Server B |---->| Server C |
+-----------------+ +-----------------+ +-----------------+
^ |
| V
+-----------------+ +-----------------+ +-----------------+
| Server D |---->| Server E |---->| Server A |
+-----------------+ +-----------------+ +-----------------+
To improve distribution, virtual nodes can be used.
1.4.4 Range Partitioning
Range partitioning divides the keyspace into ranges, each managed by a server. Metadata tracks the key range-to-server mapping. This allows for more granular load balancing. (Diagram below)
Diagram 1.2: BigTable-based Range Partitioning
Client -> Metadata Server A (Level 0) -> Metadata Server B (Level 1) -> Data Server C (Tablet)
1.4.5 Which Partitioning Scheme to Use
The choice between consistent hashing and range partitioning depends on workload characteristics. Range partitioning is suitable for frequent range scans, while consistent hashing is simpler for random key lookups.
1.5 Consistency
Maintaining consistency across replicated data is challenging. Two main approaches exist: strong consistency and eventual consistency.
1.5.1 A Little Bit About CAP
The CAP theorem states that a distributed system can only satisfy two of the following three properties: Consistency, Availability, and Partition tolerance. NoSQL systems typically prioritize availability over consistency in the face of network partitions.
1.5.2 Strong Consistency
Strong consistency ensures that all replicas agree on the data's value. This is achieved by requiring a quorum of replicas to acknowledge writes and reads (R+W > N).
1.5.3 Eventual Consistency
Eventual consistency allows replicas to temporarily diverge, eventually synchronizing. Techniques include:
1.6 A Final Word
NoSQL systems are constantly evolving. The key takeaway is understanding the design trade-offs that shape these systems, and the implications for application design.