In one of the previous posts we saw how to set up a Python microservice. This post is a continuation and shows how to scale it. Specifically, we’ll look at handling larger volumes of data by sharding it across databases. We’ll shard based on the customer (or tenant, or user, etc.) making the request, and route it to the right database. While load tests on my local machine show promise, the pattern outlined is far from production-ready.
Before We Begin
We’ll assume that we’d like to create a service that can handle a growing volume of data. To accommodate this, we’d like to shard the data across databases based on the customer making the request. For the sake of simplicity we’ll assume that the customer is a user of a SaaS platform. The customers are heterogeneous in their volume of data: some small, some large, some so large that they require their own dedicated database.
We’ll develop a library, and a sharding microservice. Everything outlined in this post is very specific to the Python ecosystem and the libraries chosen but I am hopeful that the ideas can be translated to a language of your choice. We’ll use Flask and Peewee to do the routing and create a pattern that allows transitioning from a single database to multiple databases.
The setup is fully Dockerized and consists of three Postgres databases, a sharding service, and an api service.
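The post doesn’t reproduce its Docker setup, but the topology described above could be expressed in a docker-compose.yml roughly like the following; service names, images, and ports are assumptions, not the post’s actual configuration.

```yaml
# Illustrative compose file: three Postgres shards, the sharding
# service, and the API service. Names and images are assumptions.
version: "3.8"
services:
  postgres-1:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: postgres
  postgres-2:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: postgres
  postgres-3:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: postgres
  sharding:
    build: ./sharding
  api:
    build: ./api
    ports:
      - "5000:5000"
    depends_on: [postgres-1, postgres-2, postgres-3, sharding]
```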
Getting Started
In a nutshell, we’d like to look at a header in the request and decide which database to connect to. This needs to happen as soon as the request is received. Flask makes this easy by letting us register functions that run before and after each request is handled, and we’ll leverage these hooks to do the routing.
Library
We’ll begin by creating a library that allows connecting to the right database.
An instance of Shard is responsible for connecting to the right database depending on how it is configured. In “standalone” mode, it connects to a single database for all customers. This is helpful when creating the microservice for the first time. In “api” mode it makes a request to the sharding microservice. This is helpful when we’d like to scale the service. The API returns the credentials for the appropriate database depending on the identifier passed to it. The identifier_field is a column which must be present in all tables. For example, every table must have a “customer_id” column.
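The class itself isn’t reproduced in this post, so below is a minimal sketch of the shape it could take. The constructor arguments, the credentials_for method, and the query-parameter name are assumptions; the real class would hand back live Peewee database handles rather than raw credential dictionaries.

```python
import json
import urllib.request
from typing import Optional


class Shard:
    """Resolves which database a request should use (illustrative sketch)."""

    def __init__(self, mode: str, identifier_field: str,
                 standalone_credentials: Optional[dict] = None,
                 api_url: Optional[str] = None):
        if mode not in ("standalone", "api"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.identifier_field = identifier_field  # e.g. "customer_id"
        self.standalone_credentials = standalone_credentials
        self.api_url = api_url

    def credentials_for(self, identifier: str) -> dict:
        if self.mode == "standalone":
            # One database for all customers: the starting point.
            return self.standalone_credentials
        # "api" mode: ask the sharding service which database to use.
        url = f"{self.api_url}?identifier={identifier}"
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read())
```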
We’ll add a helper function to create an instance of the Shard. This makes it easy to transition from standalone to api mode by simply setting a few environment variables.
```python
def from_env_variables() -> Shard:
    ...
```
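The body of the helper isn’t shown in the post. One plausible implementation is sketched below; the environment-variable names (SHARD_MODE, SHARD_IDENTIFIER_FIELD, and so on) are invented for illustration, and a small dataclass stands in for the library’s Shard class.

```python
import json
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shard:  # minimal stand-in for the library's Shard class
    mode: str
    identifier_field: str
    standalone_credentials: Optional[dict] = None
    api_url: Optional[str] = None


def from_env_variables() -> Shard:
    # Environment-variable names here are assumptions, not the post's.
    mode = os.environ.get("SHARD_MODE", "standalone")
    identifier_field = os.environ.get("SHARD_IDENTIFIER_FIELD", "customer_id")
    if mode == "api":
        # Scaling out: point the library at the sharding service.
        return Shard(mode, identifier_field, api_url=os.environ["SHARD_API_URL"])
    # Starting out: a single database for everyone.
    credentials = json.loads(os.environ.get("SHARD_CREDENTIALS", "{}"))
    return Shard(mode, identifier_field, standalone_credentials=credentials)
```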
API service
We’ll add request hooks to the API service which will use an instance of the Shard to connect to the right database.
```python
_shard = sharding.from_env_variables()
```
We’re creating an instance of Shard and using it to retrieve the appropriate database. This is then stored in the per-request global g. This lets us use the same database throughout the context of the request. Finally, we register the before and after hooks.
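Since the hook code itself is elided above, here is a self-contained sketch of the pattern: a static customer-to-database mapping stands in for the Shard, and the X-Customer-Id header name and the fallback behaviour are assumptions.

```python
from flask import Flask, g, request

app = Flask(__name__)

# Stand-in for sharding.from_env_variables(); a real Shard would
# return live database handles instead of names.
DATABASES = {"1": "shard_1", "2": "shard_2", "3": "shard_3"}


def _before_request():
    # Route as soon as the request arrives: read the header, pick the
    # database, and stash it in the per-request global.
    customer_id = request.headers.get("X-Customer-Id", "1")
    g.database = DATABASES.get(customer_id, DATABASES["1"])


def _teardown_request(exc):
    # A real service would close the per-request connection here.
    g.pop("database", None)


# Register the before and after hooks on the app.
app.before_request(_before_request)
app.teardown_request(_teardown_request)


@app.route("/debug/database")
def which_database():
    # Debug-only route showing which database this request resolved to.
    return {"database": g.database}
```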
We’ll now add functions to the library which allow saving and retrieving the data using the database stored in g.
```python
def save_with_db(
    ...
```
What allows us to switch the database at runtime is the bind_ctx method of the Peewee Database instance. This temporarily binds the model to the database that was retrieved using the Shard. In essence, we’re storing and retrieving the data from the right database.
Next we’ll add a simple Peewee model that represents a person.
```python
class Person(Model):
    ...
```
We’ll add an endpoint which will let us save a row with some randomly generated data.
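The endpoint’s code was not preserved, so the following is a sketch of what it might look like; the route path, the POST method, the X-Customer-Id header, and the commented save_with_db call are all assumptions.

```python
import random
import string

from flask import Flask, request

app = Flask(__name__)


@app.route("/person", methods=["POST"])
def create_person():
    # Generate a random name and save it for the customer making the
    # request. In the real service, save_with_db would write through
    # the database stored in g by the before-request hook:
    # save_with_db(Person(name=name, customer_id=customer_id), g.database)
    name = "".join(random.choices(string.ascii_lowercase, k=10))
    customer_id = request.headers.get("X-Customer-Id", "")
    return {"name": name, "customer_id": customer_id}, 201
```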
Sharding service
We’ll add an endpoint to the sharding service which will return the database to connect to over an API.
Here we’re selecting the right database depending on the customer making the request. For the sake of this demo the information is mostly static but in a production scenario this would come from a meta database that the sharding service connects to.
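A minimal sketch of such an endpoint is below. The route, the identifier query parameter, and the container hostnames are assumptions; the static SHARDS mapping plays the role of the metadata database mentioned above.

```python
from flask import Flask, request

app = Flask(__name__)

# Static mapping for the demo; a production service would look this
# up in a metadata database instead.
SHARDS = {
    "1": {"host": "postgres-1", "port": 5432, "database": "app"},
    "2": {"host": "postgres-2", "port": 5432, "database": "app"},
    "3": {"host": "postgres-3", "port": 5432, "database": "app"},
}


@app.route("/shard")
def resolve_shard():
    identifier = request.args.get("identifier", "1")
    # Unknown customers fall back to the first database.
    return SHARDS.get(identifier, SHARDS["1"])
```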
Databases
We’ll now create tables in each of the three databases.
```sql
-- Run against each of the three databases. The post's exact schema
-- wasn't preserved; customer_id is the one column the sharding
-- library requires, the rest are illustrative.
CREATE TABLE person (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    customer_id TEXT NOT NULL
);
```
Testing
We’ll add a small script which uses Apache Bench to send requests for three different customers.
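The script itself wasn’t preserved; a sketch of what it could look like is below. The endpoint path, the header name, and the request counts are assumptions (ab’s -p/-T flags send an empty POST body).

```shell
#!/bin/sh
# bench.sh -- illustrative sketch. For each of three customers, send
# 1000 requests, 10 concurrently, letting the X-Customer-Id header
# drive the routing.
for id in 1 2 3; do
  ab -n 1000 -c 10 -p /dev/null -T 'application/json' \
     -H "X-Customer-Id: ${id}" \
     http://localhost:5000/person
done
```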
We’ll run the script and wait for it to complete.
```shell
./bench.sh
```
We’ll now connect to one of the databases and check the data. I’m connecting to the third instance of Postgres which should have data for the customer with ID “3”. We’ll first check the count of the rows to see that all 1000 rows are present, and then check the customer ID to ensure that requests are properly routed in a multithreaded environment.
```sql
SELECT COUNT(*) FROM person;
```
This returns the following:
```
 count
-------
  1000
(1 row)
```
We’ll now check for the customer ID stored in the database.
```sql
SELECT DISTINCT customer_id FROM person;
```
This returns the following:
```
 customer_id
-------------
 3
(1 row)
```
Conclusion
That’s it. That’s how we can connect a Flask service to multiple databases dynamically.