Demystifying the Multi-Model Capabilities in Azure Cosmos DB

When someone asks me, “Hey, what is Cosmos DB?” I casually respond, “Well, that’s Microsoft’s globally distributed, massively scalable, horizontally partitioned, low latency, fully indexed, multi-model NoSQL database, of course.” One of two things happens then: either I get a long weird look, after which the other person politely excuses themselves and moves along, or I hear “wow, that sounds awesome, tell me more!” If you’re still reading this, then I’m guessing you’re in the latter category.

When you start elaborating on each of the bullet points in my soundbite response, there’s a lot to discuss before you get to “multi-model NoSQL” at the tail end. Starting with “globally distributed,” Cosmos DB is – first and foremost – a database designed for modern web and mobile applications, which are typically global in nature. Simply by clicking the mouse on a map in the portal, your Cosmos DB database is instantly replicated anywhere and everywhere Microsoft hosts a data center (there are nearly 50 of them worldwide, to date). This delivers high availability and low latency to users wherever they’re located.

Cosmos DB also delivers virtually unlimited scale, both in terms of storage (via server-side horizontal partitioning) and throughput (by provisioning a prescribed number of request units, or RUs, per second). This ensures low latency and is backed by comprehensive SLAs to yield predictable database performance for your applications. In addition, its unique “inverted indexing” scheme enables automatic indexing of every property that you store, with minimal overhead.
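To make the throughput model concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the published baseline that a point read of a 1 KB item costs roughly 1 RU, and approximates writes at about 5 RU per KB; actual charges depend on document size, indexing, and consistency level.

```python
# Rough sketch of sizing provisioned throughput (RU/s) for a workload.
# Assumes ~1 RU per 1 KB point read and ~5 RU per 1 KB write; real
# charges vary with document size, indexing, and consistency level.
def estimate_rus(reads_per_sec, writes_per_sec, doc_kb=1.0):
    read_cost = 1.0 * doc_kb    # ~1 RU for a 1 KB point read
    write_cost = 5.0 * doc_kb   # ~5 RU for a 1 KB write (approximation)
    return reads_per_sec * read_cost + writes_per_sec * write_cost

print(estimate_rus(500, 100))  # 1000.0 RU/s for 500 reads + 100 writes of 1 KB docs
```

Provisioning at or above this estimate is what keeps latency predictable under the SLA-backed model.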

Whew. That’s quite a lot to digest before we even start pondering Cosmos DB’s multi-model support, mentioned at the end of my lengthy description. In fact, it’s very deliberately placed at the end, because regardless of which data model you choose, all the capabilities around global distribution, horizontal partitioning, provisioned throughput, and automatic indexing remain the same. These are durable concepts that transcend whatever data model you choose, which actually makes no difference to Cosmos DB. So, you get to pick and choose among any of the supported data models, without compromising any of the core features of the Cosmos DB engine.

This segues right into the topic of this article: what exactly is “multi-model,” and specifically, what does it mean for a database platform like Cosmos DB to support multiple data models?

It all boils down to how you’d like to treat your data – and this is where the developer comes in. While massive scale is clearly important (if not critical), developers don’t really care about such details as long as it all “just works,” and ensuring that is the job of Cosmos DB. Actually building applications, on the other hand, is the developer’s job, and this is where the choice of data model comes into play.

Depending on the type of application being built, it could be more appropriate to use one data model over another. For example, if the application focuses more on relationships between entities than on the entities themselves, then a graph data model may work better than a document model. In other cases, a developer may want to migrate an existing NoSQL application to Cosmos DB; for example, an existing MongoDB or Cassandra application. In these scenarios, the data model is pre-determined by the back-end database dependency of the application being ported; the developer would choose either the MongoDB-compatible or Cassandra-compatible data model, and such a migration would require minimal (to no) changes to the existing application code. And in other “green field” situations, developers who are very opinionated about how data should be modeled are free to choose whichever data model they prefer.

Each data model has an API for developers to work with in Cosmos DB. Put another way, the developer chooses an API for their database, and that determines the data model that is used. So, let’s break it down:

Document Data Model (SQL & MongoDB APIs)

The first thing to point out is that the SQL API is, essentially, the original DocumentDB programming model from the days when Cosmos DB was called DocumentDB. This is, arguably, the most robust and capable of all the APIs because it is the only one that exposes a server-side programming model that lets you build fully transactional stored procedures, triggers, and user-defined functions.

Both the SQL and MongoDB APIs give you a document data model, but the two APIs themselves are radically different. Yes, they are similar from a data modeling perspective: you store complete denormalized entities as hierarchical key-value documents – pure JSON in the case of the SQL API, or BSON in the case of the MongoDB API (BSON is MongoDB’s binary-encoded version of JSON that extends it with additional data types and multi-language support).

The critical difference between the two APIs is the programming interface itself. The SQL API uses Microsoft’s innovative variant of Structured Query Language (SQL) that is tailored for searching across hierarchical JSON documents. It also supports the server-side programming model (for example, stored procedures), which none of the other APIs do.
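As an illustration, the snippet below shows the general shape of such a SQL query over hierarchical JSON. The documents and property names are hypothetical, and the filtering is simulated locally against plain Python dicts rather than sent to the service.

```python
# Hypothetical documents, as the SQL API would store them (pure JSON).
docs = [
    {"id": "1", "name": "Alice", "address": {"city": "NY"}},
    {"id": "2", "name": "Bob",   "address": {"city": "LA"}},
]

# Dotted paths in the WHERE clause reach into nested JSON properties.
query = "SELECT c.name FROM c WHERE c.address.city = 'NY'"

# Local stand-in for what the service would return for the query above.
results = [{"name": d["name"]} for d in docs if d["address"]["city"] == "NY"]
print(results)  # [{'name': 'Alice'}]
```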

In contrast, the MongoDB API provides wire-level support: a compatibility layer that understands the protocol used by the MongoDB driver for sending packets over the network. Rather than the SQL support found in the SQL API, you query documents with MongoDB’s built-in find method. The MongoDB API therefore appeals to existing MongoDB developers, who gain the scale-out, throughput, and global distribution capabilities of Cosmos DB without abandoning the MongoDB ecosystem: it offers a high degree of compatibility with existing MongoDB application code and lets you continue working with familiar MongoDB tools.
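To show the contrast, here is the same kind of filter expressed MongoDB-style: a find filter document with a dotted path, matched against in-memory dicts. This is a sketch of the query semantics, not a live driver call, and the names are hypothetical.

```python
docs = [
    {"_id": 1, "name": "Alice", "address": {"city": "NY"}},
    {"_id": 2, "name": "Bob", "address": {"city": "LA"}},
]

# The filter document a MongoDB driver's find() call would send.
filter_doc = {"address.city": "NY"}

def matches(doc, flt):
    # Walk each dotted path into the nested document and compare values.
    for path, expected in flt.items():
        value = doc
        for part in path.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        if value != expected:
            return False
    return True

print([d["name"] for d in docs if matches(d, filter_doc)])  # ['Alice']
```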

Key-Value Data Model (Table API)

You can also model your data as a key-value store using the Table API. This API is actually the evolution of Azure Table Storage – one of the very first NoSQL databases available on Azure. In fact, all existing Azure Table Storage customers will eventually be migrated over to Cosmos DB and the Table API.

With this data model, each entity consists of a key and a value, but the value itself is a set of key-value pairs. This is nothing like a table in a relational database, where each row has the same columns; with the Table API in Cosmos DB, each entity’s value can have a different set of key-value pairs.
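A quick sketch of that idea: two entities in the same table, keyed by PartitionKey and RowKey as in Azure Table Storage, each carrying a different set of properties (the property names here are hypothetical).

```python
entities = [
    {"PartitionKey": "customers", "RowKey": "1", "name": "Alice", "tier": "gold"},
    {"PartitionKey": "customers", "RowKey": "2", "name": "Bob", "phone": "555-0100"},
]

# Unlike relational rows, the two entities' value shapes differ:
print(sorted(set(entities[0]) ^ set(entities[1])))  # ['phone', 'tier']
```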

The Table API appeals primarily to existing Azure Table Storage customers because it emulates the Azure Table Storage API. Using this API, existing Azure Table Storage apps can be migrated quickly and easily to Cosmos DB. For a new project, however, there is little reason to consider the Table API, since the SQL API is far more capable.

So when would you actually choose to use the Table API? Again, the primary use case is migrating an existing Azure Table Storage account over to Cosmos DB without having to change any code in your applications. Remember that Microsoft is planning to do this for every customer as part of a long-term migration, but there’s no reason to wait for them. You can migrate the data yourself now, and immediately start enjoying the benefits of Cosmos DB as a back end without making any changes whatsoever to your existing Azure Table Storage applications. You just change the connection string to point to Cosmos DB, and the application continues to work seamlessly against the Cosmos DB Table API.

Graph Data Model (Gremlin API)

You can also choose the Gremlin API, which gives you a graph database derived from the Apache TinkerPop open source project. Graph databases are becoming increasingly popular in the NoSQL world, and it’s easy to see why. Since graph databases implement a collection of interconnected entities and relationships, they can be used to model many scenarios in the real, interconnected world.

So, what do you put in a graph? One of two things: either a vertex or an edge. Don’t let these terms intimidate you; they’re just fancy words for entities and relationships, respectively. A vertex is an entity, and an edge is a one-way relationship between any two vertices – nothing more and nothing less. These are the building blocks of any graph database. Whether you’re storing a vertex or an edge, you can attach any number of arbitrary properties to it, much like the arbitrary key-value pairs you can define for a row using the Table API or a flat JSON document using the SQL API.
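Modeled as plain dicts, the building blocks look something like this. The labels, property names, and the "from"/"to" keys are hypothetical; they just illustrate the vertex/edge distinction.

```python
vertices = [
    {"id": "v1", "label": "person", "name": "Alice", "age": 30},
    {"id": "v2", "label": "person", "name": "Bob"},  # different properties are fine
]
edges = [
    # A one-way relationship from v1 to v2, with its own properties.
    {"id": "e1", "label": "knows", "from": "v1", "to": "v2", "since": 2015},
]

edge = edges[0]
print(edge["from"], "-[" + edge["label"] + "]->", edge["to"])  # v1 -[knows]-> v2
```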

The Gremlin API provides a succinct graph traversal language that enables you to efficiently query across the many relationships that exist in a graph database. For example, in a social networking application, you could easily find a user, then look for all of that user’s posts where the location is NY, and of those, find all the relationships where some other user has commented on or liked those posts.
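That social-networking example could be written as a Gremlin traversal along these lines; the vertex and edge labels ('user', 'posted', 'liked', and so on) are hypothetical and would match your own graph’s schema.

```python
traversal = (
    "g.V().has('user', 'name', 'alice')"  # find the user vertex
    ".out('posted')"                      # follow outgoing 'posted' edges to the user's posts
    ".has('location', 'NY')"              # keep only posts located in NY
    ".in('liked', 'commented')"           # walk incoming edges to users who liked or commented
)
print(traversal)
```

Each step narrows or walks the graph, which is what makes multi-hop relationship queries so succinct compared to SQL joins.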

Columnar (Cassandra API)

There’s a fourth option for choosing a data model, and that’s columnar, using the Cassandra API (in preview at the time of this writing). Columnar is yet another way of modeling your data where – in a departure from the typical way of dealing with schema-free data in the NoSQL world – you define the schema of your data up front. However, data is still stored physically in a column-oriented fashion, so sparse columns are fine, and there is good support for aggregations. Columnar is somewhat similar to the key-value data model of the Table API, except that items in the container adhere to the defined schema. In that sense, columnar is most similar to a columnstore in SQL Server, except of course that it is implemented using a NoSQL architecture, so it’s distributed and partitioned to massively scale out big data.
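For instance, a hypothetical time-series table declared up front in CQL might look like the following; rows may still leave declared columns unset, which the column-oriented storage handles as sparse data.

```python
# Hypothetical CQL schema, declared up front (Cassandra API).
create_table = """
CREATE TABLE readings (
    sensor_id    text,
    reading_time timestamp,
    temperature  double,
    humidity     double,
    PRIMARY KEY (sensor_id, reading_time)
);
"""

# A row need not populate every declared column ('humidity' is unset here).
row = {"sensor_id": "s1", "reading_time": "2017-06-01T00:00:00Z", "temperature": 21.5}
print("humidity" in row)  # False
```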

Atom Record Sequence (ARS)

The fact of the matter is, these APIs merely project your data as different data models; internally, your data is always stored as ARS – or Atom Record Sequence – a Microsoft creation that defines the persistence layer for key-value pairs. You don’t need to know anything about ARS; you don’t even need to know that it’s there. But it is there, under the covers, storing all your data as key-value pairs in a manner that’s agnostic to the data model you’ve chosen to work with.

At the end of the day, it’s all just keys and values – not just the key-value data model, but all these data models. They’re all some form of keys and values. A JSON or BSON document is a collection of keys and values, where values can either be simple values, or they can contain nested key-value pairs. The key-value model is clearly based on keys and values, but so are graph and columnar. The vertices and edges you define in a graph database are, themselves, key-value pairs, and certainly, all the columns defined for a columnar data model can be viewed as key-value pairs as well.
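That observation can be sketched in a few lines: flatten a document, a vertex, and a columnar row, and each reduces to a list of key-value pairs (the sample data is hypothetical).

```python
document = {"name": "Alice", "address": {"city": "NY"}}
vertex = {"id": "v1", "label": "person", "name": "Alice"}
column_row = {"sensor_id": "s1", "temperature": 21.5}

def flatten(obj, prefix=""):
    # Reduce any nested structure to a flat list of (key path, value) pairs.
    pairs = []
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            pairs.extend(flatten(value, path + "."))
        else:
            pairs.append((path, value))
    return pairs

print(flatten(document))  # [('name', 'Alice'), ('address.city', 'NY')]
```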

So these APIs are here to broaden your choices in terms of how you get to treat your data; they have no bearing on your ability to scale your database. For example, if you want to be able to write SQL queries, you would choose the SQL API and not the Table API, but if you want MongoDB or Azure Table Storage compatibility, then you’d go with the MongoDB or Table API, respectively. Your decision is irrelevant as far as Cosmos DB is concerned, and your database will scale just the same regardless.

Switching Between Data Models

As I’ve explained, when you choose an API, you are also choosing a data model. As of today (since the release of Cosmos DB in May 2017), you choose an API when you create a Cosmos DB account. This means that today, a Cosmos DB account is tied to one API, which ties it to one data model.

Yet again, each data model is merely a projection of the same underlying ARS format. Eventually you will be able to create a single account in which you can switch freely between different APIs, allowing you to access one database as graph, key-value, document, or columnar – all at once, if you wish.

Although such interoperability is not yet supported, Cosmos DB, today, does offer a glimpse of how this works across the SQL and Gremlin APIs when working with a graph data model. With the Gremlin API, you store vertices and edges in a graph, which are actually persisted as GraphSON, a JSON document under the covers. So while you would normally use the Gremlin API to populate the graph using the Gremlin DSL (Domain Specific Language), you can also interchangeably use the SQL API to manipulate that GraphSON as native JSON documents.
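To give a feel for this, here is a hedged sketch of a vertex persisted as a JSON document. The exact GraphSON shape Cosmos DB uses internally is an implementation detail; the property-array layout below is an approximation, not the authoritative format.

```python
import json

# Approximate shape of a vertex stored as a JSON document (hypothetical).
vertex_doc = {
    "id": "v1",
    "label": "person",
    "properties": {
        "name": [{"_value": "Alice"}],  # property values carried in arrays
    },
}

# Because it is just JSON, the SQL API can read and write it directly.
print(json.dumps(vertex_doc))
```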

For example, you can use stored procedures (supported only with the SQL API, and not available for the Gremlin API) to populate the graph with properly formatted GraphSON documents representing vertices and edges. You can then switch to using the Gremlin API for querying the very same graph that was populated by the stored procedures using the SQL API. Furthermore, you can use the SQL API to store entities other than vertices and edges to supplement the graph with additional documents that are available for your application - essentially creating a hybrid container that’s both a graph and document collection at the same time.

You can also expect to see additional APIs in the future, as Cosmos DB broadens its compatibility support for other database systems. This will enable an even wider range of developers to stick with their database of choice, while leveraging Cosmos DB as a back end for horizontal partitioning, provisioned throughput, global distribution, and automatic indexing.


Summary

Azure Cosmos DB has multiple APIs and supports multiple data models. In this article, we explored the multi-API, multi-model capabilities of Cosmos DB, including the document data model with either the SQL or MongoDB APIs, key-value with the Table API, graph with the Gremlin API, and columnar with the Cassandra API.

Regardless of which data model you choose, Cosmos DB stores everything in ARS and merely projects different data models based on the different APIs. This provides developers with a wide range of choices for how they’d like to model their data without making any compromises in scale, partitioning, throughput, indexing, or global distribution.