People in the database world have long since become accustomed to tradeoffs – performance vs. scalability, availability vs. consistency, and so on. As the NoSQL (‘Not Only SQL’) movement took hold, additional tradeoffs joined in, in particular, relational vs. document-centric.
Relational databases delivered the benefits of ACID compliance, SQL queries, and in particular, the power of JOINs. Document-centric databases, in contrast, provided native JavaScript and JSON support and cloud-friendly horizontal scalability.
To bridge these approaches, Clusterpoint has been working to build a database that delivers the best of both worlds. Native JavaScript/JSON support as well as ACID compliance. Horizontal scalability coupled with SQL support. On-premise as well as cloud-based database-as-a-service (DBaaS) options. And with the recent announcement of version 4.1, Clusterpoint now offers JOINs as well.
High Performance JOINs across Collections of Documents
JOINs across JSON documents have always been a challenge for NoSQL database vendors to implement because of JSON’s arbitrary, schemaless structure. Clusterpoint has resolved these challenges with its innovative approach to sharding and hyper-replicating collections of documents. Here’s how it works.
Within a typical NoSQL data model, Clusterpoint manages self-contained documents by identifying them with a primary key. (Documents are typically in JSON format, but Clusterpoint also supports XML.)
Clusterpoint then assigns each document to a database shard based upon this key, and then groups documents together into collections. Finally, Clusterpoint schedules collections to reside on particular computational nodes within the Clusterpoint infrastructure.
Clusterpoint provides two separate distribution mechanisms for each collection of documents. First, for collections that contain particularly large documents, Clusterpoint shards the collection, splitting its documents among many computational nodes. This sharding enables parallel processing of each collection, and Clusterpoint also replicates each shard to ensure high availability of the data.
The second distribution mechanism is hyper-replication, which is best suited for small collections. In this situation, Clusterpoint places a copy of each collection on every computational node it has scheduled to handle the collections.
Because of the large quantities of replicated data, hyper-replication comes at a cost of storage space and additional processing of updates, but in return, Clusterpoint is able to execute JOINs between sharded and hyper-replicated collections or among multiple hyper-replicated collections without sacrificing performance.
Executing JOINs in Clusterpoint
In version 4.1, Clusterpoint supports the LEFT JOIN and INNER JOIN keywords, and also adds a custom EMBED function.
Both JOINs operate on different collections within the FROM clause of the SELECT operator in Clusterpoint’s JavaScript-savvy extension of SQL it calls JS/SQL.
The LEFT JOIN keyword returns documents from the main collection with the matching fields from the joined collection, while INNER JOIN returns documents from both collections as long as there is a matching document ID field shared between both collections.
In order to maintain performance, the JOINed collection should be hyper-replicated.
For example, Clusterpoint executes the following LEFT JOIN query on the sales collection, joining it with the people collection, according to the common values in the fields people._id and sales.person_id:
SELECT people.name as salesperson, sales.amount, sales.price FROM sales LEFT JOIN people ON people._id == sales.person_id
Furthermore, to achieve grouped results, the query should contain the EMBED function, as the following example illustrates.
SELECT people.name, EMBED({amount: sales.amount, price: sales.price}) as sales FROM sales LEFT JOIN people ON people._id == sales.person_id WHERE people.name == "Daniel Spring" GROUP BY sales.person_id LIMIT 0, 20
In the example above, the EMBED function creates a grouped result set for the selected fields, returning a JSON array of grouped values.
It’s also possible to join nested JSON documents with Clusterpoint. For example, let’s use the following INSERT statements to populate a product document with several inventory sub-documents:
INSERT INTO product["34A40855"] JSON VALUE { name: "Schwinn S29 Full Suspension Mountain Bike", price: 259.16 }; INSERT INTO product["34A40855"].inventory JSON VALUE { location: "Warehouse-East", items: 17 }; INSERT INTO product["34A40855"].inventory JSON VALUE { location: "Warehouse-West", items: 17 };
It would then be possible to access the array of nested objects with the following SELECT statement:
SELECT price, inventory FROM product
Similarly, use the SUPER () function to access the super-document:
SELECT location, items, SUPER().name FROM inventory WHERE SUPER().price > 30
Subqueries as well as joining collections through references will be available in future versions of Clusterpoint, but the current version 4.1 provides powerful JOIN capabilities that will help organizations transition from on-premise relational databases to Clusterpoint.
Resolving the Denormalization Tradeoff
Relational databases have enormous power and flexibility when dealing with normalized data. However, JOINs are necessary for implementing useful queries across such data, and overly complex JOINs can slow a relational database to a crawl.
Data denormalization is a common solution to this problem, but relational databases aren’t well-suited to such data – which is one of the reasons to move to a document-oriented database.
In fact, when organizations transition from relational databases to document-oriented NoSQL databases, they typically denormalize their data, thus exporting data from multiple tables into a single JSON object.
As a result, they obtain the power and scalability of the NoSQL environment – but not without tradeoffs. In particular, to change a single element in a JSON object in a document database, it’s typically necessary to change the entire object – a task that impacts performance, especially for databases with many large documents.
However, unlike traditional relational databases, which lose efficiency when they take normalization to the extreme, Clusterpoint manages the normalization/denormalization tradeoff. Its secret: Clusterpoint keeps some data denormalized in rich hierarchical JSON objects, while using normalization where it makes sense in order to efficiently JOIN data.
However, Clusterpoint allows JOINs to normalized data similar to relational databases. Even highly normalized data architectures like the classical star schema for data warehouses, for example, becomes efficient in distributed environment when running on Clusterpoint.
Additionally, Clusterpoint parallelizes the processing of JOINs via sharding, while executing JOINs between documents on the same computational node. As a result, Clusterpoint maintains unlimited horizontal scalability without sacrificing performance – essential requirements in today’s big data-centric world.
The Intellyx Take
It’s necessary to appropriately shard and hyper-replicate the data in order to support JOINs, while maintaining the performance and horizontal scalability that are Clusterpoint’s core value proposition.
Sharding, however, is an increasingly standard tool in the data expert’s tool belt, as data challenges increasingly fall under the big data umbrella. And while hyper-replication consumes storage capacity, such capacity is becoming less expensive by the day.
In the final analysis, Clusterpoint requires a certain level of expertise in order to squeeze the most value from the platform – but with the addition of JOINs to its repertoire, more customers will be able to take advantage of Clusterpoint to leave their relational databases far behind.
Intellyx advises companies on their digital transformation initiatives and helps vendors communicate their agility stories. Clusterpoint is an Intellyx client. Intellyx retains full editorial control over the content of this article. Sample code source: Clusterpoint. Image credit: Clusterpoint.