How is data stored in IBM Cloudant?
Every database in IBM® Cloudant® for IBM Cloud® is formed of one or more distinct shards, where the number of shards is referred to as Q. A shard contains a distinct subset of documents from the database.
Sharding
As noted above, every IBM® Cloudant® for IBM Cloud® database is formed of Q shards, and all Q shards together contain the data within a database. Further, each shard is replicated three times. Each replica is stored on a different server, and replicas are spread across availability zones within a single region. Shard replicas are used to provide high availability within a region.
A document is assigned to a particular shard by using the hash of its ID. This assignment means that a document is always stored on a known shard and a known set of servers.
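The idea of hash-based assignment can be sketched as follows. This is a minimal illustration only: the text does not say which hash function or range layout the service uses, so `zlib.crc32` and the equal-range split are assumptions, and `shard_for` is a hypothetical helper name.

```python
import zlib


def shard_for(doc_id: str, q: int) -> int:
    """Map a document ID to one of q shards (illustrative sketch only)."""
    # Assumption: CRC32 stands in for whatever hash the service really uses.
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
    h = zlib.crc32(doc_id.encode("utf-8"))
    # Split the 32-bit hash space into q equal, contiguous ranges and
    # return the index of the range that contains the hash.
    return h * q // 2**32
```

Because the result depends only on the document ID and Q, the same ID always maps to the same shard, which is why a read by ID can go straight to a known shard.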
IBM® Cloudant® for IBM Cloud® selects a standard default value for Q based on observed workloads, and may update this default over time. The number of replicas for each shard is always three, chosen to balance performance and data safety.
How does sharding affect performance?
The number of shards for a database can affect the database's performance in a number of ways:
- Each document in the database is stored on a single shard. In a partitioned database, all documents in a partition are stored on a single shard. Each shard hosts multiple partitions.

  When reading data, having more shards enables greater parallelism for workloads that read documents by ID and restrict their querying to partitioned queries. This is because the database is able to retrieve data from a single shard rather than needing to access every database shard.

  When writing documents, greater shard counts also tend to improve performance. Data is written directly to the shards involved. As database indexes are local to each shard, index building can benefit from extra parallelism.

  Bulk document reads and writes also benefit from larger shard counts because the database is able to determine the exact shards needed for the request.

- To respond to a global query request, a database must process the results from all the shards.

  Therefore, having more shards introduces a greater processing demand for global queries. The reason is that the database retrieves data from all database shards, then combines the results before it returns the response to the client. This cost currently scales more than linearly with the number of shards, so it is important to avoid global queries in latency-sensitive request paths on databases with more than 16 shards.
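The difference between the two request shapes can be sketched with a toy model. This is not how the service is implemented. It only shows that a global query has to gather and merge rows from every shard, while a partitioned query reads from one. The function names and the representation of a shard as a sorted Python list are assumptions made for illustration.

```python
import heapq
from itertools import islice


def global_query(shards, limit):
    """Merge already-sorted rows from every shard, then apply the limit."""
    # Every shard contributes an input stream, so the merge work grows
    # with the number of shards.
    merged = heapq.merge(*shards)
    return list(islice(merged, limit))


def partitioned_query(shards, shard_index, limit):
    """Read rows from the single shard that holds the partition."""
    # Only one shard is consulted, however many shards the database has.
    return shards[shard_index][:limit]
```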
Advanced sharding topics
As changing the number of shards for a database is an advanced topic, most customers should keep the default values. The defaults have been tuned over time to balance database performance across the different considerations discussed below.
Selecting a shard count for a database
The tension when choosing a shard count is between data volume and the types of requests used by an application:
- Larger data volumes and document counts benefit from greater numbers of shards.
- Performance of global queries decreases as the number of shards increases.
We need to take account of both data size and the application's required data access patterns to decide on a good sharding strategy. To start calculating an appropriate shard count, first evaluate your expected number of documents and the data volume. Then refer to these guidelines:
- Use power-of-two steps to increase shard count: 4, 8, 16, and so on.
- Select a shard count that will keep each shard beneath both 20 GB and 50 million documents for your long-term estimated data needs.
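As a rough sketch, the two guidelines can be turned into a small calculation. The helper name `estimate_q` is hypothetical, and the function treats a shard that lands exactly on a limit as acceptable, which is a simplification. The result is only a starting point, and the service default may already be higher.

```python
import math

MAX_GB_PER_SHARD = 20             # size guideline from the text
MAX_DOCS_PER_SHARD = 50_000_000   # document-count guideline from the text


def estimate_q(total_gb: float, total_docs: int) -> int:
    """Smallest power-of-two shard count that satisfies both guidelines."""
    # Minimum number of shards needed for each limit taken separately.
    by_size = math.ceil(total_gb / MAX_GB_PER_SHARD)
    by_docs = math.ceil(total_docs / MAX_DOCS_PER_SHARD)
    needed = max(by_size, by_docs, 1)
    # Round up to the next power of two.
    q = 1
    while q < needed:
        q *= 2
    return q
```

For example, 100 GB holding 10 million documents needs at least 5 shards by size, so `estimate_q(100, 10_000_000)` returns 8.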
Now that you have an estimate for the number of shards your data requires, the next step is to understand whether your data model allows your data to be in one database or needs to be spread over several databases:
- If your data requires 16 or fewer shards, most data models will be successful. Single document read/write, partitioned queries and global queries will perform well.
- If your data requires more than 16 shards, evaluate whether your application needs to execute global queries in latency-sensitive paths:
  - If the data model naturally supports partitioned queries, then using a larger shard count (for example, 32 or 64) in a single database is recommended.
  - If the data model requires the ability to execute global queries, then the application will need to use several databases with 16 or fewer shards. The application will need to handle merging query responses from several databases.
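The decision steps above can be sketched as a small function. The name `recommend_layout` is hypothetical, and splitting the shard estimate evenly into databases of 16 shards is an assumption: the guidance only says to use several databases with 16 or fewer shards each.

```python
import math

GLOBAL_QUERY_SHARD_LIMIT = 16  # threshold named in the guidelines


def recommend_layout(q_needed: int, needs_global_queries: bool):
    """Return (number_of_databases, shards_per_database) for an estimate."""
    # 16 or fewer shards: one database suits most data models.
    if q_needed <= GLOBAL_QUERY_SHARD_LIMIT:
        return (1, q_needed)
    # More than 16 shards, but queries can be partitioned: one larger database.
    if not needs_global_queries:
        return (1, q_needed)
    # Latency-sensitive global queries: several databases of at most 16 shards.
    # Assumption: an even split; the application merges the query responses.
    databases = math.ceil(q_needed / GLOBAL_QUERY_SHARD_LIMIT)
    return (databases, GLOBAL_QUERY_SHARD_LIMIT)
```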
The numbers in these guidelines are derived from observation and experience rather than precise calculation.
Regardless of shard count, both partitioned and global queries must ensure they use appropriate indexes and selective queries to avoid significant latency increases as your data set grows. Good indexes allow scanning consistent amounts of data to respond to queries, regardless of how large your data grows.
Setting shard count when creating a database
The number of shards, Q, for a database is set when the database is created. The Q value cannot be changed later.
To specify the Q value when you create a database, use the q query string parameter.
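A minimal sketch of such a request, built with the Python standard library, might look like the following. The service URL and database name are placeholders, authentication is omitted, and `build_create_db_request` is a hypothetical helper. The function only builds the request and does not send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_create_db_request(base_url: str, db_name: str, q: int) -> Request:
    """Build (but do not send) a PUT request that creates a database with q shards."""
    # The shard count is passed as the q query string parameter.
    query = urlencode({"q": q})
    url = f"{base_url.rstrip('/')}/{quote(db_name, safe='')}?{query}"
    return Request(url, method="PUT")


# Placeholder values; substitute your own service URL and add credentials.
request = build_create_db_request("https://example.invalid", "orders", 16)
```

Sending the request, for example with `urllib.request.urlopen(request)`, would also require valid credentials for your service instance.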