Here's what you missed on MongoDB
Confession time: I'm a Glee fan. Glee, like many series, starts with a recap of what has happened recently to remind you or catch you up. If you are coming to MongoDB World in 2022, or just haven't looked at MongoDB for a while, I thought I'd put together my personal edit of the highlights to bring you up to speed. It's OK if you haven't ever seen Glee or looked at MongoDB before; this starts with season 1.
What is MongoDB in 2022
MongoDB is a database designed to store records and allow multiple processes to retrieve, aggregate and update them simultaneously without blocking each other or overwriting each other's changes. This is pretty much the textbook definition of a database.
MongoDB does almost everything you expect a database to do, and in very much the same way, but it does offer a little more design flexibility than a traditional RDBMS if required. Much of the code that makes up MongoDB is very similar to the code and algorithms that make up Oracle, SQL Server or Postgres.
What are Documents
Rather than store data in predefined tables of scalar values, MongoDB's primary data store holds binary serialised objects in a format called BSON (Binary JSON). When you give MongoDB an object to store, it takes the field names and data types from that object - you do not need to specify these in advance, and it will not enforce any constraints by default. If you wish, it can enforce the same type, range and value co-dependency constraints as an RDBMS, but most developers do not do this in the early stages of design. The one constraint not supported is foreign keys.
Document databases differ from a tabular model in that any field can be not just a typed scalar but also either an array of values of any type or a nested object with its own named fields. We often represent this visually using JSON as it's hard to show in CSV. JSON has the same relationship to MongoDB that CSV has to an RDBMS.
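To make that concrete, here is a small sketch of one document written as a Python dict. The "order" shape, field names and values are all invented for illustration; the point is only that a single field can hold a nested object or an array as well as a scalar.

```python
import json

# A hypothetical "order" document: every name and value here is made up.
order = {
    "_id": 1001,
    "customer": {"name": "Rachel Berry", "city": "Lima"},  # nested object
    "items": [                                              # array of nested objects
        {"sku": "MIC-01", "qty": 1, "price": 59.99},
        {"sku": "SHEET-7", "qty": 3, "price": 4.50},
    ],
    "tags": ["gift", "priority"],                           # array of scalars
}

# JSON is just the human-readable rendering of the object.
print(json.dumps(order, indent=2))
```

Flattening the same order into CSV would need either repeated rows or several files, which is why JSON is the usual way to draw these objects.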
These objects are grouped in Collections (analogous to tables) and Collections are logically grouped in Databases (analogous to Schemas or Databases in an RDBMS).
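The optional constraints mentioned above are typically expressed as a JSON Schema style validator attached to a collection. The following is a minimal sketch of such a validator built as a plain Python dict. The collection shape and field names are assumptions for illustration, and nothing here talks to a server.

```python
# A hypothetical validator for an "orders" collection (field names are invented).
# With the Python driver, a dict like this would be passed when creating the
# collection, e.g. db.create_collection("orders", validator=order_validator).
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["customer", "total"],          # fields that must be present
        "properties": {
            "customer": {"bsonType": "string"},     # type constraint
            "total": {
                "bsonType": ["int", "long", "double", "decimal"],
                "minimum": 0,                       # range constraint
            },
        },
    }
}
```

Because the validator is optional, it can be added once the shape of the data has settled rather than on day one.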
Time Series
MongoDB has an option to declare a collection as being for time-series data. When this is set, the underlying storage changes from simply storing the documents to bucketing, clustering and converting them to a columnar format, seamlessly, to increase compression and speed up retrieval for time-series use cases.
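As a rough mental model only, and not the actual storage format, the bucketing idea can be sketched in a few lines of Python: readings from the same source and the same hour are grouped together, and each field is kept as a column (a list) instead of being repeated per document. The sensor names, field names and the one-hour bucket size are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sensor readings, one small document per measurement.
readings = [
    {"ts": datetime(2022, 6, 7, 9, 0), "sensor": "a", "temp": 20.5},
    {"ts": datetime(2022, 6, 7, 9, 1), "sensor": "a", "temp": 20.7},
    {"ts": datetime(2022, 6, 7, 9, 0), "sensor": "b", "temp": 18.2},
    {"ts": datetime(2022, 6, 7, 10, 0), "sensor": "a", "temp": 21.0},
]

def bucket(readings):
    """Group readings by (sensor, hour) and store each field as a column."""
    buckets = defaultdict(lambda: {"ts": [], "temp": []})
    for r in readings:
        hour = r["ts"].replace(minute=0, second=0, microsecond=0)
        column_store = buckets[(r["sensor"], hour)]
        column_store["ts"].append(r["ts"])
        column_store["temp"].append(r["temp"])
    return dict(buckets)

buckets = bucket(readings)
```

Columns of similar values sitting next to each other compress well, and a query for one sensor over one time range only has to touch a few buckets.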
Replication and Sharding
A running MongoDB system is referred to as a cluster. A cluster will normally have a minimum of three identical nodes, each with the same data, to provide high availability - this is called a replica set. Any write which has been written to a majority (two) of the three nodes is durable in the event of any single node failing, and MongoDB will continue to work uninterrupted while you repair that node. Replica sets are for availability, not performance scaling - kinda like RAID 1.
Multiple replica sets can be formed into a sharded cluster, where collections are partitioned across them to provide more hardware resources - normally RAM but sometimes CPU or disk. This is like disk striping (RAID 0) and is always done on top of replica sets (like RAID 10). Sharding can also be used to enforce geo-locality of data for performance or compliance.
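The majority rule is simple arithmetic, sketched below. The function names are invented for this example and are not part of any MongoDB API.

```python
def majority(node_count):
    """Smallest number of nodes that is more than half of the replica set."""
    return node_count // 2 + 1

def is_durable(acknowledged_by, node_count):
    """A write survives any single-node failure once a majority holds it."""
    return acknowledged_by >= majority(node_count)

print(majority(3))        # number of nodes needed in a three-node replica set
print(is_durable(1, 3))   # a write held by only one node is not yet safe
```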
This high availability and horizontal scaling are key to MongoDB's ability to support very large systems.
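The partitioning behind sharding can be pictured with a toy router like the one below. It is a deliberate simplification: a real cluster assigns ranges of shard key values (or of hashed values) to shards rather than taking a modulo, and the function name and choice of hash are assumptions made for this sketch.

```python
import hashlib

def shard_for(shard_key_value, num_shards):
    """Toy router: hash the shard key and map it onto one of the shards."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same key always routes to the same shard, so reads know where to look.
for customer_id in ["cust-1", "cust-2", "cust-3", "cust-4"]:
    print(customer_id, "-> shard", shard_for(customer_id, 3))
```

Each "shard" in this picture is itself a replica set, which is where the RAID 10 comparison comes from.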
The Query API and Drivers
The optimal way to interact with MongoDB is with a library for your language called a driver. Drivers exist for many languages and implement a common API adapted to the programming style, data types and idioms of each language. This makes MongoDB feel very natural for programmers regardless of language: you can basically persist and retrieve language objects.
The Query API itself is the secret sauce of MongoDB - it was designed to allow smart, safe, efficient interaction with in-database objects and to provide transaction-like semantics. An update consists of a match portion, including all required pre-conditions, and a mutation portion, which can not only set values in a record but also change them relative to their current value - for example pushing to an array or incrementing a value. These operations allow atomic updates to the same document by multiple processes, with the updates being serialised at the last moment on the server to ensure safety with minimal contention. This is a far cry from a typical key/value database and very much analogous to an SQL update statement.
The Query API also supports traditional Begin/Commit style transactions for modifying multiple records simultaneously with full ACID behaviour, but this, like in an RDBMS, means records inside uncommitted transactions are locked for writing (and some forms of reading), which increases contention and is discouraged where a better approach exists.
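Here is a sketch of that match-plus-mutation shape. The filter and update dicts use the real `$gte`, `$inc` and `$push` operators, but the tiny `apply_update` interpreter is a toy written for this example so it can run without a server; it handles only those operators, and the stock-keeping scenario is invented.

```python
import copy

def matches(doc, flt):
    """Toy matcher: supports plain equality and the $gte operator only."""
    for field, cond in flt.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            if "$gte" in cond and (value is None or value < cond["$gte"]):
                return False
        elif value != cond:
            return False
    return True

def apply_update(doc, flt, update):
    """Return an updated copy if the pre-conditions hold, otherwise None."""
    if not matches(doc, flt):
        return None
    new_doc = copy.deepcopy(doc)
    for field, amount in update.get("$inc", {}).items():
        new_doc[field] = new_doc.get(field, 0) + amount
    for field, item in update.get("$push", {}).items():
        new_doc.setdefault(field, []).append(item)
    return new_doc

# Match portion: the right item AND enough stock (the pre-condition).
flt = {"_id": "MIC-01", "stock": {"$gte": 2}}
# Mutation portion: changes relative to the current values.
update = {"$inc": {"stock": -2}, "$push": {"reservations": "order-1001"}}

doc = {"_id": "MIC-01", "stock": 3, "reservations": []}
first = apply_update(doc, flt, update)     # succeeds, stock drops from 3 to 1
second = apply_update(first, flt, update)  # refused, only 1 left in stock
```

With the Python driver the same two dicts would be passed as `collection.update_one(flt, update)`, and the server evaluates the match and the mutation as one atomic step on the document, so two competing processes cannot both take the last two items.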
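For completeness, a Begin/Commit style transaction with the Python driver looks roughly like the sketch below. The `bank` database, `accounts` collection and `transfer` function are hypothetical, and the function expects an already connected client, so nothing here runs against a server on its own.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two hypothetical account documents in one transaction."""
    accounts = client["bank"]["accounts"]  # hypothetical database and collection
    with client.start_session() as session:
        # Leaving this block normally commits; an exception aborts the transaction.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )
```

If both balances lived in a single document, one atomic update would do the same job without holding locks across two operations, which is the "better approach" hinted at above.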
Aggregation
The Query API supports a full set of CRUD operations, but it also supports a Turing-complete data aggregation and processing model called an aggregation pipeline. This can perform tasks like grouping, calculation, data transformation, clustering, joins and sliding time window functions inside the database. It is possible to translate almost any SQL statement into a MongoDB aggregation pipeline to process data in-database.
In short, for simple and not-so-simple data analysis it is often better to perform it in-database than to remove the data and use an external tool. It's faster and has no cloud egress costs.
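As an illustration of the SQL-to-pipeline translation, the sketch below shows a three-stage pipeline next to the SQL it corresponds to, plus a plain-Python function that computes the same result so the example runs without a server. The `orders` data and field names are invented.

```python
from collections import defaultdict

# Roughly: SELECT city, SUM(amount) AS total FROM orders
#          WHERE status = 'shipped' GROUP BY city ORDER BY total DESC
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$city", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]

# Hypothetical input documents.
orders = [
    {"city": "Lima", "status": "shipped", "amount": 30},
    {"city": "Lima", "status": "shipped", "amount": 20},
    {"city": "New York", "status": "shipped", "amount": 70},
    {"city": "New York", "status": "pending", "amount": 5},
]

def run_equivalent(orders):
    """Plain-Python stand-in for what the three stages compute."""
    totals = defaultdict(int)
    for order in orders:
        if order["status"] == "shipped":              # $match
            totals[order["city"]] += order["amount"]  # $group with $sum
    rows = [{"_id": city, "total": total} for city, total in totals.items()]
    return sorted(rows, key=lambda row: row["total"], reverse=True)  # $sort

result = run_equivalent(orders)
```

With the Python driver the pipeline list would be passed as `collection.aggregate(pipeline)`, so only the two summary rows leave the database instead of every order.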
Indexing
MongoDB indexes data much like an RDBMS, with B(+)Tree indexes on fields as defined by the developer. It supports simple, compound and geospatial indexes, and it correctly handles indexing all values in an array, varying data types and, if desired, all fields under a given path.
Indexes are used to optimise retrieval and sorting in exactly the same way as in an RDBMS.
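The array case is the one that differs most from a tabular database, so here is a toy sketch of it: one index entry is created per array element, each pointing back at the document. A real index is a sorted B-tree rather than a Python dict, and the function and field names are assumptions for this example.

```python
from collections import defaultdict

def build_multikey_index(docs, field):
    """Toy index: one entry per value, with arrays contributing every element."""
    index = defaultdict(list)
    for doc in docs:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].append(doc["_id"])
    return index

# Hypothetical documents with an array field.
docs = [
    {"_id": 1, "tags": ["gift", "priority"]},
    {"_id": 2, "tags": ["bulk"]},
    {"_id": 3, "tags": ["gift"]},
]

tag_index = build_multikey_index(docs, "tags")
print(tag_index["gift"])  # ids of documents whose tags array contains "gift"
```

With the Python driver, an index on such a field would be requested the same way as any other, for example `collection.create_index([("tags", 1)])`, and a compound index simply lists more (field, direction) pairs.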
Atlas
You can run MongoDB yourself on-premises or in the cloud, but a lot of MongoDB's customers choose to have MongoDB Inc host and manage it for them on AWS, GCP and Azure. A customer selects the cloud provider (or can span multiple), the locations and the architecture (hardware and topology), and MongoDB instantiates and manages the server instances, charging by the minute. This starts with a free tier and goes up. MongoDB manages hundreds of thousands of servers, if not millions.
In 2021 MongoDB launched a beta of Atlas Serverless, which is the same hosting except that you as a customer do not define the topology or infrastructure. That's automatic, and you pay by the data read and written - this is better for very variable workloads or handling sudden growth.
Atlas Search
If your data is hosted by MongoDB in Atlas, you can create Lucene-based indexes as well as the traditional BTree ones, allowing you to do a full range of full-text and relevance searching. MongoDB maintains the Lucene servers and services for you, so there is no requirement to run Elastic or similar.
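A search is issued as an aggregation stage. The following sketch builds such a pipeline as plain Python data; the index name, query text and field names are assumptions, and it only runs for real against a collection in Atlas that has a search index defined.

```python
# Hypothetical full-text search over a "description" field.
search_pipeline = [
    {
        "$search": {
            "index": "default",  # assumed name of the search index
            "text": {"query": "show choir", "path": "description"},
        }
    },
    {"$limit": 5},
    # Return the title together with the relevance score of each hit.
    {"$project": {"title": 1, "score": {"$meta": "searchScore"}}},
]
```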
Atlas Services/Realm Services
MongoDB has a serverless middleware platform, forming part of what is called the Atlas Data Platform, designed to make it easier to expose and manage your data. This includes application user management (email/Facebook/Google signup, as distinct from database service users), configurable REST and GraphQL APIs, fine-grained security rules, hosted JavaScript functions, static CDN hosting and HTTPS endpoints. This platform means you can create and host entire applications with no need to run any of your own servers.
Charts
Charts is a Cloud-hosted BI Visualisation tool designed to create dynamic charts and graphs from your data whether standalone or as dashboards. These can then be embedded in your own application.
Realm and Realm Sync
MongoDB purchased Realm, an on-mobile object persistence database optimised for speed on the mobile device and for offline storage and retrieval. It includes a sync mechanism to keep the on-device database in sync with Atlas, allowing online and offline query and update, with a "latest wall-clock time wins" deconfliction protocol for offline modifications to a field.
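The "latest wall-clock time wins" idea can be sketched per field as below. This is a simplified toy to show the rule, not Realm's actual sync implementation, and the function name, edit format and sample values are invented.

```python
def merge_field_edits(edits):
    """edits: iterable of (field, value, timestamp); the newest edit wins per field."""
    winners = {}
    for field, value, timestamp in edits:
        if field not in winners or timestamp > winners[field][1]:
            winners[field] = (value, timestamp)
    return {field: value for field, (value, _) in winners.items()}

# Two devices edited the same contact while offline (timestamps are arbitrary ticks).
device_a = [("name", "Rachel", 10), ("email", "r@example.com", 9)]
device_b = [("name", "Rachel B.", 12), ("phone", "555-0100", 11)]

merged = merge_field_edits(device_a + device_b)
```

Because the rule is applied field by field, device A's email edit survives even though device B's later edit wins for the name.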
Data-Lake and Online Archive
MongoDB offers the same Query API and aggregation not only over data in the database but also over files in Parquet, CSV, JSON and BSON stored in Amazon S3, albeit read-only and indexed differently. This allows the same query language and aggregation to be used over the much cheaper storage that AWS S3 provides.
You can also migrate data from the live database to files in S3 either on-demand or automatically and access both live data and S3 data simultaneously from a virtual view that queries both/multiple sources.
Connectors
As well as the drivers and the REST and GraphQL APIs, MongoDB has plugins for Spark, Hadoop and Kafka (source and sink), plus the BI Connector, a facade that makes MongoDB look like MySQL, speaking its binary protocol and SQL dialect, so that a read-only MySQL tool can interact with the database.
And... that's what you missed on Glee.