Setting up a data catalog with DataHub

In a previous post we’d seen how to create a realtime data platform with Pinot, Trino, Airflow, and Debezium. In this post we’ll see how to setup a data catalog using DataHub. A data catalog, as the name suggests, is an inventory of the data within the organization. Data catalogs make it easy to find the data within the organisation like tables, data sets, reports, etc.

Before we begin

My setup consists of Docker containers required to run DataHub. While DataHub provides features like data lineage, column assertions, and much more, we will look at three of the simpler featuers. One, we’ll look at creating a glossary of the terms that will be used frequently in the organization. Two, we’ll catalog the datasets and views that we saw in the previous post. Three, we’ll create an inventory of dashboards and reports created for various departments within the organisation.

The rationale for this as follows. Imagine a day in the life of a business analyst. Their responsibilities include creating reports and dashboards for various departments. For example, the marketing team may want to see an “orders by day” dashboard so that they can correlate the effects of advertising campaigns with an uptick in the volume of orders. Similarly, the product team may want a report of which features are being used by the users. The requests of both of these teams will be served by the business analyts using the data that’s been brought into the data platform. While they create these reports and dashboards, it’s common for them to receive queries asking where a team member can find a certain report or how to interpret a data point within a report. They may also have to search for tables and data sets to create new reports, acquaint themselves with the vocabulary of the various departments, and so on.

A data catalog makes all of this a more efficient process. In the following sections we’ll see how we can use DataHub to do it. For example, we’ll create the definition of the term “order”, create a list of reports created for the marketing department, and bring in the views and data sets so that they become searchable.

The work of data scientists is similar, too, because they create data sets that can be reused across various models. For example, data sets representing features for various customers can be stored in the platform, made searchable, and used with various models. They, too, benefit from having a data catalog.

Finally, it helps bring people up to speed with the data that is consumed by their department or team. For example, when someone joins the marketing team, pointing them to the data catalog helps them get productive quickly by finding the relevant reports, terminology, etc.

Ingesting data sets and views

To ingest the tables and views, we’ll create a data pipeline which ingets metadata from AWS Glue and writes it to the metadata service. This is done by creating a YAML configuration in DataHub that specifies where to ingest the metadata from, and where to write it. Once this is created, we can schedule it to run periodically so that it stays updated with Glue.

The image above shows how we define a “source” and how we ingest it into a “sink”. Here we’ve specified that we’d like to read from Glue and write it to DataHub’s metadata service.

Once the source and destination are defined, we can set a schedule to run the ingestion. This will bring in the metadata about the data sets and views we’ve created in Glue.

The image above shows that a successful run of the ingestion pipeline brings in the views and data sets. These are then browsable in the UI. Similarly, they are also searchable as shown in the following image.

This makes it possible for the analysts and the data scientists to quickly locate data sets.

Defining the vocabulary

Next, we’ll create the definition of the word “order”. This can be done from the UI as shown below. The definition can be added by editing the documentation.

Once created, this is available under “Glossary” and in search results.

Data products

Finally, we’ll create a data product. This is the catalog of reports and dashboards created for various departments. For example, the image below shows a dashboard created for the marketing team.

Expanding the dashboard allows us to look at the documentation for the report. This could contain the definition of the terms used in the report, as shown on the bottom right, a link to the dashboard in Superset, definitions of data points, report owners, and so on.

That’s it. That’s how a data catalog helps streamline working with data.

Programming Puzzles 2

As I continue working my way through the book on programming puzzles, I came across those involving permutations. In this post I’ll collect puzzles with the same theme, both from the book and from the internet.

All permutations

The first puzzle is to compute all the permutations of a given array. By extension, it can be used to compute all the permutations of a string, too, if we view it as an array of characters. To do this we’ll implement Heap’s algorithm. The following is its recursive version.

1
2
3
4
5
6
7
8
9
10
11
12
13
def heap(permutations: list[list], A: list, n: int):
if n == 1:
permutations.append(list(A))
else:
heap(permutations, A, n - 1)

for i in range(n - 1):
if n % 2 == 0:
A[i], A[n - 1] = A[n - 1], A[i]
else:
A[0], A[n - 1] = A[n - 1], A[0]

heap(permutations, A, n - 1)

The array permutations is the accumulator which will store all the permutations of the array. The initial arguments to the function would be an empty acuumulator, the list to permute, and the length of the list.

Next permutation

The next puzzle we’ll look at is computing the next permutation of the array in lexicographical order. The following implementation has been taken from the book.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def next_permutation(perm: list[int]) -> list[int]:
inversion_point = len(perm) - 2

while (inversion_point >= 0) and (perm[inversion_point] >= perm[inversion_point + 1]):
inversion_point = inversion_point - 1

if inversion_point == -1:
return []

for i in reversed(range(inversion_point + 1, len(perm))):
if perm[i] > perm[inversion_point]:
perm[inversion_point], perm[i] = perm[i], perm[inversion_point]
break

perm[inversion_point + 1:] = reversed(perm[inversion_point + 1:])

return perm

Previous permutation

A variation of the puzzle is to compute the previous permutation of the array in lexicographical order. The idea is to “reverse” the logic for computing the next permutation. If we look closely, we’ll find that all we’re changing are the comparison operators.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def previous_permutation(perm: list[int]) -> list[int]:
inversion_point = len(perm) - 2

while (inversion_point >= 0) and (perm[inversion_point] <= perm[inversion_point + 1]):
inversion_point = inversion_point - 1

if inversion_point == -1:
return []

for i in reversed(range(inversion_point + 1, len(perm))):
if perm[i] < perm[inversion_point]:
perm[inversion_point], perm[i] = perm[i], perm[inversion_point]
break

perm[inversion_point + 1:] = reversed(perm[inversion_point + 1:])

return perm

kth smallest permutation

The final puzzle we’ll look at is the one where we need to compute the k’th smallest permutation. The solution to this uses the previous_permutation function that we saw above. The idea is to call this function k times on the lexicographically-largest array. Sorting the array in decreasing order results is the largest. This becomes the input to the previous_permutation function.

1
2
3
4
5
6
7
8
9
def kth_smallest_permutation(perm: list[int], k: int) -> list[int]:
# -- Arrange the numbers in decreasing order
# -- thereby creating the lexicographically largest permutation
perm = sorted(perm, reverse=True)

for _ in range(k):
perm = previous_permutation(perm)

return perm

That’s it. These are puzzles involving permutations.

Programming Puzzles 1

I am working my way through a classic book of programming puzzles. As I work through more of these puzzles, I’ll share what I discover to solidify my understanding and help others who are doing the same. If you’ve ever completed puzzles on a site like leetcode, you’ll notice that the sheer volume of puzzles is overwhelming. However, there are patterns to these puzzles, and becoming familiar with them makes it easier to solve them. In this post we’ll take a look at one such pattern - two pointers - and see how it can be used to solve puzzles involving arrays.

Two Pointers

The idea behind two pointers is that there are, as the name suggests, two pointers that traverse the array, with one pointer leading the other. Using these two pointers we update the array and solve the puzzle at hand. As an illustrative example, let us consider the puzzle where we’re given an array of even and odd numbers and we’d like to move all the even numbers to the front of the array.

Even and Odd

1
2
3
4
5
6
7
8
9
10
11
def even_odd(A: list[int]) -> None:
"""Move even numbers to the front of the array."""
write_idx = 0
idx = 0

while idx < len(A):
if A[idx] % 2 == 0:
A[write_idx], A[idx] = A[idx], A[write_idx]
write_idx = write_idx + 1

idx = idx + 1

The two pointers here are idx and write_idx. While idx traverses the array and indicates the current element, write_idx indicates the position where the next even number should be written. Whenever idx points to an even number, it is written at the position indicated by write_idx. With this logic, if all the numbers in the array are even, idx and write_idx point to the same element i.e. the number is swapped with itself and the pointers are moved forward.

We’ll build upon this technique to remove duplicates from the array.

Remove Duplicates

Consider a sorted array containing duplicate numbers. We’d like to keep only one occurrence of each number and overwrite the rest. This can be solved using two pointers as follows.

1
2
3
4
5
6
7
8
9
10
11
12
def remove_duplicates(A: list[int]) -> int:
"""Remove all duplicates in the array."""
write_idx, idx = 1, 1

while idx < len(A):
if A[write_idx - 1] != A[idx]:
A[write_idx] = A[idx]
write_idx = write_idx + 1

idx = idx + 1

return write_idx

In this solution, idx and write_idx start at index 1 instead of 0. The reason is that we’d like to look at the number to the left of write_idx, and starting at index 1 allows us to do that. Notice also how we’re writing the if condition to check for duplicity in the vicinity of write_idx; the number to the left of write_idx should be different from the one that idx is presently pointing to.

As a varitation, move the duplicates to the end of the array instead of overwriting them.

As another variation, remove a given number from the array by moving it to the end.

With this same pattern, we can now change the puzzle to state that we want at most two instances of the number in the sorted array.

Remove Duplicates Variation

1
2
3
4
5
6
7
8
9
10
def remove_duplicates(A: list[int]) -> int:
"""Keep at most two instances of the number."""
write_idx = 2

for idx in range(2, len(A)):
if A[write_idx - 2] != A[idx]:
A[write_idx] = A[idx]
write_idx = write_idx + 1

return write_idx

Akin to the previous puzzle, we look for duplicates in the vicinity of write_idx. While in the previous puzzle the if condition checked for one number to the left, in this variation we look at two positions to the left of write_idx to keep at most two instances. The remainder of the logic is the same. As a variation, try keeping at most three instances of the number in the sorted array.

Finally, we’ll use the same pattern to solve the Dutch national flag problem.

Dutch National Flag

In this problem, we sort the array by dividing it into three distinct regions. The first region contains elements less than the pivot, the second region contains elements equal to the pivot, and the third region contains elements greater than the pivot.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
def dutch_national_flag(A: list[int], pivot_idx: int) -> None:
"""Divide the array into three distinct regions."""
pivot = A[pivot_idx]
write_idx = 0
idx = 0

# --- Move all elements less than pivot to the front
while idx < len(A):
if A[idx] < pivot:
A[write_idx], A[idx] = A[idx], A[write_idx]
write_idx = write_idx + 1
idx = idx + 1

idx = write_idx

# -- Move all elements equal to the pivot to the middle
while idx < len(A):
if A[idx] == pivot:
A[write_idx], A[idx] = A[idx], A[write_idx]
write_idx = write_idx + 1
idx = idx + 1

# -- All elements greater than pivot have been moved to the end.

This problem combines everything we’ve seen so far about two pointers and divides the array into three distinct regions. As we compute the first two regions, the third region is computed as a side-effect.

We can now solve a variation of the Dutch national flag partitioning problem by accepting a list of pivot elements. In this variation all the numbers within the list of pivots appear together i.e. all the elements equal to the first pivot element appear first, equal to second pivot element appear second, and so on.

1
2
3
4
5
6
7
8
9
10
11
12
def dutch_national_flag(A: list[int], pivots: list[int]) -> None:
"""This is a variation in which all elements with same key appear together."""
write_idx = 0

for pivot in pivots:
idx = write_idx

while idx < len(A):
if A[idx] == pivot:
A[write_idx], A[idx] = A[idx], A[write_idx]
write_idx = write_idx + 1
idx = idx + 1

That’s it, that’s how we can solve puzzles involving two pointers and arrays.

Creating a realtime data platform with Pinot, Airflow, Trino, and Debezium

I’d previously written about creating a realtime data warehouse with Apache Doris and Debezium. In this post we’ll see how to create a realtime data platform with Pinot, Trino, Airflow, Debezium, and Superset. In a nutshell, the idea is to bring together data from various sources into Pinot using Debezium, transform it using Airflow, use Trino for query federation, and use Superset to create reports.

Before We Begin

My setup consists of Docker containers for running Pinot, Airflow, Debezium, and Trino. Like in the post on creating a warehouse with Doris, we’ll create a person table in Postgres and replicate it into Kafka. We’ll then ingest it into Pinot using its integrated Kafka consumer. Once that’s done, we’ll use Airflow to transform the data to create a view that makes it easier to work with it. Finally, we can use Superset to create reports. The intent of this post is to create a complete data platform that makes it possible to derive insights from data with minimal latency. The overall architecture looks like the following.

Getting Started

We’ll begin by creating a schema for the person table in Pinot. This will then be used to create a realtime table. Since we want to use Pinot’s upsert capability to maintain the latest record of each row, we’ll ensure that we define the primary key correctly in the schema. In the case of the person table, it is the combination of the id and the customer_id field. The schema looks as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
{
"schemaName": "person",
"dimensionFieldSpecs": [
{
"name": "id",
"dataType": "LONG"

},
{
"name": "customer_id",
"dataType": "LONG"
},
{
"name": "source",
"dataType": "JSON"
},
{
"name": "op",
"dataType": "STRING"
}
] ,
"dateTimeFieldSpecs": [{
"name": "ts_ms",
"dataType": "LONG",
"format" : "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS"
}],
"primaryKeyColumns": [
"id",
"customer_id"
],
"metricFieldSpecs": []
}

We’ll use the schema to create the realtime table in Pinot. Using ingestionConfig we’ll extract fields out of the Debezium payload and into the columns defined above. This is defined below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
"ingestionConfig":{
"transformConfigs":[
{
"columnName": "id",
"transformFunction": "jsonPath(payload, '$.after.id')"
},
{
"columnName": "customer_id",
"transformFunction": "jsonPath(payload, '$.after.customer_id')"
},
{
"columnName": "source",
"transformFunction": "jsonPath(payload, '$.after')"
},
{
"columnName": "op",
"transformFunction": "jsonPath(payload, '$.op')"
}
]
}

Next we’ll create a table in Postgres to store the entries. The SQL query is given below.

1
2
3
4
5
6
CREATE TABLE person (
id BIGSERIAL NOT NULL,
customer_id BIGINT NOT NULL,
name TEXT NOT NULL,
PRIMARY KEY (id, customer_id)
);

Next we’ll create a Debezium source connector to stream change data into Kafka.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
{
"name": "person",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "db",
"database.user": "postgres",
"database.password": "my-secret-pw",
"database.dbname": "postgres",
"database.server.name": "postgres",
"table.include.list": ".*\\.person",
"plugin.name": "pgoutput",
"publication.autocreate.mode": "filtered",
"time.precision.mode": "connect",
"tombstones.on.delete": "false",
"snapshot.mode": "initial",
"heartbeat.interval.ms": "1000",
"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"event.processing.failure.handling.mode": "skip",
"producer.override.compression.type": "snappy",
"topic.prefix": "user_service"
}
}

Finally, we’ll use curl to send these configs to their appropriate endpoints, beginning with Debezium.

1
curl -H "Content-Type: application/json" -XPOST -d @debezium/person.json localhost:8083/connectors | jq .

To create a table in Pinot we’ll first create the schema followed by the table. The curl command is given below.

1
curl -F schemaName=@tables/001-person/person_schema.json localhost:9000/schemas | jq .

The command to create the table is given below.

1
curl -XPOST -H 'Content-Type: application/json' -d @tables/001-person/person_table.json localhost:9000/tables | jq .

With these steps done, the change data from Debezium will be ingested into Pinot. We can view this using Pinot’s query console.

This is where we begin to integrate Airflow and Trino. While the data has been ingested into Pinot, we’ll use Trino for querying. There are two main reasons for this. One, this allows usto federate queries across multiple sources. Two, Pinot’s SQL capabilities are limited. For example, there is no support, as of writing, for creating views. To circumvent these we’ll create a Hive connector in Trino and use it to query Pinot.

The first step is to connect Trino and Pinot. We’ll do this using the Pinot connector.

1
2
3
4
CREATE CATALOG pinot USING pinot 
WITH (
"pinot.controller-urls" = 'pinot-controller:9000'
);

Next we’ll create the Hive connector. This will allow us to create views, and more importantly materialized views which act as intermediate datasets or final reports, which can be queried by Superset. I’m using AWS Glue instead of Hive so you’ll have to change the configuration accordingly.

1
2
3
4
5
6
7
8
9
10
11
12
CREATE CATALOG hive USING hive
WITH (
"hive.metastore" = 'glue',
"hive.recursive-directories" = 'true',
"hive.storage-format" = 'PARQUET',
"hive.insert-existing-partitions-behavior" = 'APPEND',
"fs.native-s3.enabled" = 'true',
"s3.endpoint" = 'https://s3.us-east-1.amazonaws.com',
"s3.region" = 'us-east-1',
"s3.aws-access-key" = '...',
"s3.aws-secret-key" = '...'
);

We’ll create a schema to store the views and point it to an S3 bucket.

1
2
3
4
CREATE SCHEMA hive.views
WITH (
"location" = 's3://your-bucket-name-here/views/'
);

We can then create a view on top of the Pinot table using Hive.

1
2
3
4
5
6
CREATE OR REPLACE VIEW hive.views.person AS
SELECT id,
customer_id,
JSON_EXTRACT_SCALAR(source, '$.name') AS name,
op
FROM pinot.default.person;

Finally, we’ll query the view.

1
2
3
4
5
6
7
trino> SELECT * FROM hive.views.person;
id | customer_id | name | op
----+-------------+-------+----
1 | 1 | Fasih | r
2 | 2 | Alice | r
3 | 3 | Bob | r
(3 rows)

While this helps us ingest and query the data, we’ll take this a step further and use Airflow to create the views instead. This allows us to create views which are time-constrained. For example, if we have an order table which contains all the orders placed by the customers, using Airflow allows to create views which are limited to, say, the last one year by adding a WHERE clause.

We’ll use the TrinoOperator that ships with Airflow and use it to create the view. To do this, we’ll create an sql folder under the dags folder and place our query there. We’ll then create the DAG and operator as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
dag = DAG(
dag_id="create_views",
catchup=False,
schedule="@daily",
start_date=pendulum.now("GMT")
)

person = TrinoOperator(
task_id="person",
trino_conn_id="trino",
sql="sql/views/person.sql",
dag=dag
)

Workflow

The kind of workflow this setup enables is the one where the data engineering team is responsible for ingesting the data into Pinot and creating the base views on top of it. The business intelligence / analytics engineering, and data science teams can then use Airflow to create datasets that they need. These can be created as materialized views to speed up reporting or training of machine learning models. Another advantage of this setup is that bringing in older data, say, of the last two years instead of one, is a matter of changing the query of the base view. This avoids complicated backfills and speeds things up significantly.

As an aside, it is possible to use DBT instead of TrinoOperator. It can be used in conjunction with TrinoOperator, too. However, I preferred using the in-built operator to keep the stack simpler.

Cost

Before we conclude, we’ll quickly go over how to keep the cost of the Pinot cluster low while using this setup. In the official documentation it says that data can be seperated by age; older data can be stored in HDDs while the newer data can be stored in SSDs. This allows lowering the cost of the cluster.

An alternative approach is to keep all the data in HDDs and load subsets into Hive for querying. This also allows changing the date range of the views by simply updating the queries. In essence, Pinot becomes the permanent storage for data while Trino and Hive become the intermediate query and storage layer.

That’s it. That’s how we can create a realtime data platform using Pinot, Trino, Debezium, and Airflow.

A note on exponents

In the chapter on exponents the authors mention that if a base is raised to both a power and a root, we should calculate the root first and then the power. This works perfectly well. However, reversing the order produces correct results, too. In this post we’ll see why that works using the properties of exponents.

Let’s say we have a base that is raised to power and root . We could write this as . From the properties of exponents, we could rewrite this as . Alternatively, it can be written as . Since multiplication is commutative, we can switch the order of operations and rewrite it as . This means we can calculate the power first and then take the root.

Let’s take a look at a numerical example. Consider . We know that the answer should be equal to . If we were to calculate the root first, we get . If we were to calculate the power first and then take the root, we’d get . As we can see, we get the same result.

Therefore, we can apply the operations in any order.

Factoring Quadratics

While reading through the chapter on solving quadratics of a math textbook, I came across a paragraph where the authors mention that factoring quadratics takes a bit of ingenuity, experience, and dumb luck. I spent some time creating an alternative method from the one mentioned in the book which makes factoring quadratics a matter of following a simple set of steps, and removes the element of luck from it. In this post I will review some of the concepts mentioned in the book, and solve through one of the more difficult problems to illustrate my method.

Concepts

Let’s begin by looking at the quadratic . This can be factored as . Multiplying the terms gives us . Comparing this to the coefficients of the the original quadratic gives us , and . For a simple quadratic, we can guess that and . For quadratics where it is not so obvious, we need hints to guide us along the way.

We can get insights into the signs of and by looking at the product and the sum of coefficients of the quadratic. If the product is positive, then they have the same signs. If the product is negative, they have different signs. This makes intuitive sense. In case the product is positive, the sum tells us whether they are both positive or both negative.

We will use this again when we look at the alternative method to factor a quadratic. First, however, we will look at a different type of quadratic where the coefficient of is not 1.

Consider the quadratic . It can be factored as . Multiplying the terms gives us . As in the previously mentioned quadratic, we can get the product and the sum terms by comparing the quadratic with the general form we just derived. Here , and . A small nuance to keep in mind is that if the coefficient of were negative, we’d factor out a to make it positive.

Now we move on to the problem and the method. You’ll find that although the method is tedious, it will remove the element of luck from the process.

Method

Consider the quadratic . Here, , , and . From the guiding hints mentioned in the previous section, we notice that the signs of and are the same; they are either both positive or both negative. We begin the method be defining a function which returns a set of pairs of all the factors of . Therefore, we can write and as follows.

We will now get to the tedious part. We will create combinations of . We pick the first pair of factors of and match it with the first pair of factors of . For the sake of brevity, we will only some of the examples to illustrate the process.

7 7 2 66 476
7 7 -2 -66 -476
7 7 66 2 476
7 7 -66 -2 -476

Notice how we swap the values of and in the columns above. This is because we’re trying to find the product and its value will change depending on what the value of and are. Similarly, we’ll have to swap the values of and ; this is not apparent in the table above because both the values are . We will have to continue on with the table since we we are yet to find a combination of numbers which equals . The table is quite large so we’ll skip the rest of the entries for the sake of brevity and look at the one which gives us the value we’re looking for.

49 1 -22 -6 -316

We can now write our factors as . This gives us or .

That’s it. That’s how we can factor quadratics by creating combinations of the factors of their coefficients.

Scaling Python Microservices Part 3

In a previous blog post we’d seen how to use OpenTelemetry and Parca to instrument two microservcies. Imagine that our architecture has grown and we now have many microservices. For example, there is a microservice that is used to send emails, another which keeps track of user profile information, etc. We’d previously hardcoded the HTTP address of the second microservice to make a call from the first microservice. We could have, in a production environment, added a DNS entry which points to the load balancer and used that instead. However, when there are many microservcies, it becomes cumbersome to add these entries. Furthermore, any new microservice which depends on another microservice now needs to have the address as a part of its configuration. Maintaining a single repository, which can be used to find the microservices and their addresses, becomes difficult to maintain. An easier way to find services is to use service discovery - a dynamic way for one service to find another. In this post we’ll modify the previous architecture and add service discovery using Zookeeper.

Before We Begin

In a nutshell, we’ll use Zookeeper for service registration and discovery. We’ll modify our create_app factory function to make one more function call which registers the service as it starts. The function call creates an ephemeral node in Zookeeper at a specified path and stores the host and port. Any service which needs to find another service looks for it on the same path. We’ll create a small library for all of this which will make things easier.

Getting Started

We’ll begin by creating a simple attr class to represent a service.

1
2
3
4
@define(frozen=True)
class Service:
host: str
port: int

We’ll be using the kazoo Python library to interact with Zookeeper. The next step is to create a private variable which will be used to store the Zookeeper client.

1
_zk: KazooClient | None = None

Next we’ll add the function which will be called as a part of the create_app factory function.

1
2
3
def init_zookeeper_and_register():
_init_zookeeper()
_register_service()

In this function we’ll first initialize the connection to Zookeeper and then register our service. We’ll see both of these functions next, starting with the one to init Zookeeper.

1
2
3
4
5
6
7
def _init_zookeeper():
global _zk

if not _zk:
_zk = KazooClient(hosts='127.0.0.1:2181')
_zk.start(timeout=5)
_zk.add_listener(_state_change_listener)

The _zk variable is meant to be a singleton. I was unable to find whether this is thread-safe or not but for the sake of this post we’ll assume it is. The start method creates a connection to Zookeeper synchronously and raises an error if no connection could be made in timeout seconds. We also add a listener to respond to changes in the state of a Zookeeper connection. This enables us to respond to scenarios where the connection drops momentarily. Since we’ll be creating ephemeral nodes in Zookeeper, they’ll be removed in the case of a session loss and would need to be recreated. In the listener we recreate the node upon a successful reconnection.

Next we’ll look at the function to register the service.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def _register_service():
global _zk

service = os.environ.get("SERVICE")
host = socket.getfqdn()
port = os.environ.get("PORT")

assert service
assert host
assert port

identifier = str(uuid4())
path = f"/providers/{service}/{identifier}"
data = {"host": host, "port": int(port)}
data_bytes = json.dumps(data).encode("utf-8")

_zk.create(path, data_bytes, ephemeral=True, makepath=True)

We get the name of the service and the port it is running on as environment variables. The host is retrieved as the fully-qualified domain name of the machine we’re on. Although the example is running on my local machine, it may work on a cloud provider like AWS. We then create a UUID identifier for the service, and a path on which the service will be registered. The information stored on the path is a JSON containing the host and the port of the service. Finally, we create an ephemeral node on the path and store the host and port.

Next we’ll look at the function which handles changes in connection state.

1
2
3
def _state_change_listener(state):
if state == KazooState.CONNECTED:
_register_service()

We simply re-register the service upon reconnection.

Finally, we’ll look at the function which retrieves an instance of the service from Zookeeper.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def get_service_from_zookeeper(service: str) -> Service | None:
global _zk
assert _zk

path = f"/providers/{service}"
children = _zk.get_children(path)

if not children:
return None

idx = random.randint(0, len(children) - 1)

child = children[idx]
config, _ = _zk.get(f"/providers/{service}/{child}")
config = config.decode("utf-8")
config = json.loads(config)

return Service(**config)

To get a service we get all the children at the path /providers/SERVICE_NAME. This returns a list of UUIDs, each of which represents an instance of the service. We randomly pick one of these instances and fetch the information associated with it. What we get back is a two-tuple containing the host and port, and an instance of ZStat which we can ignore. We decode and parse the returned information to get an instance of dict. This is then used to return an instance of Service. We can then use this information to make a call to the returned service.

This is all we need to create a small set of utility functions to register and discover services. All we need to do to register the service as a part of the Flask application process is to add a function call in the create_app factory function.

1
2
3
4
5
6
7
8
def create_app() -> Flask:
...

# -- Initialize Zookeeper and register the service.
init_zookeeper_and_register()

return app

Finally, we’ll retrieve the information about the second service in the first service right before we make the call.

1
2
3
4
def make_request() -> dict:
service = get_service_from_zookeeper(service="second")
url = f"http://{service.host}:{service.port}"
...

As an aside, the service will automatically be deregistered once it stops running. This is because we created an ephemeral node in Zookeeper and it only persists as long as the session that created it is alive. When the service stops, the session is disconnected, and the node is removed from Zookeeper.

Rationale

The rationale behind registering and discovering services from Zookeeper is to make it easy to create new services and to find the ones that we need. It’s even more convenient when there is a shared library which contains the code that we saw above. For the sake of simplicity, we’ll assume all our services are written in Python and there is a single library that we need. The library could also contain an enum that represents all the services that are registered with Zookeeper. For example, to make a call to an email microservice, we could have code that looks like the following.

1
2
email_service = Service.EMAIL_SERVICE.value
service = get_service_from_zookeeper(service=email_service)

This, in my opinion, is much simpler than adding a DNS entry. The benefits add up over time and result in code that is both readable and maintainable.

That’s it. That’s how we can register and retrieve services from Zookeeper. Code is available on Github.

Scaling Python Microservices Part 2

In one of the previous posts we saw how we can scale a Python microservice and allow it to connect, theoretically, to an infinite number of databases. The way we did this is by fetching the database connection information at runtime from another microservice using the unique identifier of the customer. This allowed us to scale horizontally to some extent. However, there is still the limitation that the data of the customer may be so large that it would exceed the limit of the database when it is scaled vertically. In this post we’ll look at how to extend the architecture we saw previously and shard the data across servers. We’ll continue to use relational databases and see how we can shard the data using Postgres PARTITION BY.

Before We Begin

The gist of scaling by sharding is to split the table into multiple partitions and let each of these be hosted on a separate host. For the purpose of this post we’ll use a simple setup that consists of four partitions that are spread over two hosts. We’ll use Postgres’ Foreign Data Wrapper (FDW) to connect one instance of Postgres to another instance of Postgres. We’ll store partitions in both these hosts, and create a table which uses these partitions. Querying this table would allow us to query data from all the partitions.

Getting Started

My setup has two instances of Postgres, both of which will host partitions. One of them will also contain the base table which will use these partitions. We’ll begin by logging into the first instance and creating the FDW extension which ships natively with Postgres.

1
CREATE EXTENSION postgres_fdw;

Next, we’ll tell the first instance that there is a second instance of Postgres that we can connect to. Since both of these instances are running as Docker containers, I will use the hostname in the SQL query.

1
CREATE SERVER postgres_5 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'postgres_5', dbname 'postgres');

Next, we’ll create a user mapping. This allows the user of the first instance to log into the second instance as one of its users. We’re simply mapping the postgres user of the first instance to the postgres user of the second instance.

1
CREATE USER MAPPING FOR postgres SERVER postgres_5 OPTIONS (user 'postgres', password 'my-secret-pw');

Next, we’ll create the base table. There are a couple of things to notice. First, we use the PARTITION BY clause to specify that the table is partitioned. Second, there is no primary key on this table. Specifying a primary key prevents us from using foreign tables so we’ll omit them.

1
2
3
4
5
6
7
CREATE TABLE person (
id BIGSERIAL NOT NULL,
quarter BIGINT NOT NULL,
name TEXT NOT NULL,
address TEXT NOT NULL,
customer_id TEXT NOT NULL
) PARTITION BY HASH (quarter);

Next, we’ll create two partitions that reside on the first instance. We could, if the data were large enough, host each of these on separate instances. For the purpose of this post, we’ll host them on the same instance.

1
2
CREATE TABLE person_0 PARTITION OF person FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE person_1 PARTITION OF person FOR VALUES WITH (MODULUS 4, REMAINDER 1);

We’ll now switch to the second instance and create two tables which will host the remaining two partitions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
CREATE TABLE person_2 (
id BIGSERIAL NOT NULL,
quarter BIGINT NOT NULL,
name TEXT NOT NULL,
address TEXT NOT NULL,
customer_id TEXT NOT NULL
);


CREATE TABLE person_3 (
id BIGSERIAL NOT NULL,
quarter BIGINT NOT NULL,
name TEXT NOT NULL,
address TEXT NOT NULL,
customer_id TEXT NOT NULL
);

Once this is done, we’ll go back to the first instance and designate these tables as partitions of the base table.

1
2
CREATE FOREIGN TABLE person_2 PARTITION OF person FOR VALUES WITH (MODULUS 4, REMAINDER 2) SERVER postgres_5;
CREATE FOREIGN TABLE person_3 PARTITION OF person FOR VALUES WITH (MODULUS 4, REMAINDER 3) SERVER postgres_5;

That’s it. This is all we need to partition data across multiple Postgres hosts. We’ll now run a benchmark to insert data into the table and its partitions.

1
ab -p /dev/null -T "Content-Type: application/json" -n 5000 -c 100 -H "X-Customer-ID: 4" http://localhost:5000/person

Once the benchmark is complete, we can query the base table to see that we have 5000 rows.

1
2
3
4
SELECT COUNT(*) FROM person;

count
5000

What I like about this approach is that it is built using functionality that is native to Postgres - FDW, partitions, and external tables. Additionally, the sharding is transparent to the application; it sees a single Postgres instance.

Finito.

Implementing TSum with Dask

In one of the previous blog posts I’d written about implementing TSum, a table-summarization algorithm from Google Research. The implementation was written using Javascript and was meant for small datasets that can be summarized within the browser itself. I recently ported the implementation to Dask so that it can be used for larger datasets that consist of many rows. In a nutshell, it lets us summarize a Dask DataFrame and find representative patterns within it. In this post we’ll see how to use the algorithm to summarize a Dask DataFrame, and run benchmarks to see its performance.

Before We Begin

Although the library is designed to be used in production on data stored in a warehouse, it can also be used to summarize CSV or Parquet files. In essence, anything that can be read into a Dask DataFrame can be summarized.

Getting Started

Summarizing data

Imagine that we have customer data stored in a datawarehouse that we’d like to summarize. For example, how would we best describe the customer’s behavior given the data? In essence, we’d like to find patterns within this dataset. In scenarios like these, TSum works well. As an example of data summarization, we’ll use the patient data given in the research paper and pass it to the summarization algorithm.

We’ll begin by adding a function to generate some test data.

1
2
3
4
5
6
7
8
9
10
11
12
13
def data(n=1):
return [
{"gender": "M", "age": "adult", "blood_pressure": "normal"},
{"gender": "M", "age": "adult", "blood_pressure": "low"},
{"gender": "M", "age": "adult", "blood_pressure": "normal"},
{"gender": "M", "age": "adult", "blood_pressure": "high"},
{"gender": "M", "age": "adult", "blood_pressure": "low"},
{"gender": "F", "age": "child", "blood_pressure": "low"},
{"gender": "M", "age": "child", "blood_pressure": "low"},
{"gender": "F", "age": "child", "blood_pressure": "low"},
{"gender": "M", "age": "teen", "blood_pressure": "high"},
{"gender": "F", "age": "child", "blood_pressure": "normal"},
] * int(n)

We’ll then add code to summarize this data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import json
import time

import cattrs
import dask.dataframe as dd
import pandas as pd
import tabulate

from tsum import summarize

if __name__ == "__main__":
from dask.distributed import LocalCluster

cluster = LocalCluster(n_workers=1, nthreads=8, diagnostics_port=8787)
client = cluster.get_client()

df = pd.DataFrame.from_records(data=data(n=1))
ddf = dd.from_pandas(df, npartitions=4)
t0 = time.perf_counter()
patterns = summarize(ddf=ddf)
t1 = time.perf_counter()

dicts = [cattrs.unstructure(_) for _ in patterns]
print(json.dumps(dicts, indent=4))

Upon running the script we get the following patterns.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
[
{
"pattern": {
"gender": "M",
"age": "adult"
},
"saving": 3313,
"coverage": 50.0
},
{
"pattern": {
"age": "child",
"blood_pressure": "low"
},
"saving": 1684,
"coverage": 30.0
}
]

This indicates that the patterns that best describe our data are “adult males”, which comprise 50% of the data, followed by “children with low blood pressure”, which comprise 30% of the data. We can verify this by looking at the data returned from the data function, and from the patterns mentioned in the paper.

Running benchmarks

To run the benchmarks, we’ll modify the script and create DataFrames with increasing number of rows. The benchmarks are being run on my local machine which has an Intel i7-8750H, and 16GB of RAM. The script which runs the benchmark is given below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
if __name__ == "__main__":
from dask.distributed import LocalCluster

cluster = LocalCluster(n_workers=1, nthreads=8, diagnostics_port=8787)
client = cluster.get_client()
table = []

for n in [1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6]:
df = pd.DataFrame.from_records(data=data(n=n))
ddf = dd.from_pandas(df, npartitions=4)
t0 = time.perf_counter()
summarize(ddf=ddf)
t1 = time.perf_counter()
table.append(
{
"Rows": len(ddf),
"Time Taken (seconds)": (t1 - t0),
}
)

print(tabulate.tabulate(table))

This is the output generated. As we can see, it takes 17 minutes for 1e6 rows.

1
2
3
4
5
6
7
8
9
--------  ---------
10 14.5076
100 24.1455
1000 23.4862
10000 23.4842
100000 32.8378
1000000 121.013
10000000 1050.46
-------- ---------

Conclusion

That’s it. That’s how we can summarize a Dask DataFrame using TSum. The library is available on PyPI and can be installed with the following command.

1
pip install tsum

The code is available on GitHub. Contributions welcome.

Running Database Migrations

Let’s start with a question: how do you run database migrations? Depending on the technology you are using, you may choose something like Flyway, Alembic, or some other tool that fits well with your process. My preference is to write and run the migrations as SQL files. I recently released a small library, yoyo-cloud, that allows storing the migrations as files in S3 and then applying them to the database. In this post we will look at the rationale behind the library, and the kind of workflow it enables.

Before We Begin

We’ll start with an example that shows how to run migrations on a Postgres instance. There are a couple of SQL files that I have stored in an S3 bucket — one to create a table, and another to insert a row after the table is created. The bucket is public so you should be able to run the snippet below.

1
2
3
4
5
6
7
8
9
10
from yoyo import get_backend
from yoyo_cloud import read_s3_migrations

if __name__ == "__main__":
migrations = read_s3_migrations(paths=["s3://yoyo-migrations/"])
backend = get_backend(f"postgresql://postgres:my-secret-pw@localhost:5432/postgres")

with backend.lock():
# -- Apply any outstanding migrations
backend.apply_migrations(backend.to_apply(migrations))

As you can see from the imports, we use a combination of the original library, yoyo, which provides a helper function to connect to the database, and the new library, yoyo_cloud, which provides a helper function to read the migrations that are stored in S3.

The read_s3_migrations function reads the files that are stored in S3. It takes as input a list of S3 paths that point to directories where the files are stored. This function is similar in interface to the read_migrations function in yoyo which reads migrations from a directory on the file system except that it reads from S3; the value returned is the same — a list of migrations.

Finally, we apply the migrations to the Postgres instance. The purpose of this example was to demonstrate the simplicity with which migrations stored in S3 can be applied to a database using yoyo_cloud. We will now look at the rationale behind the library in the next section.

Rationale

The rationale behind the library, as mentioned at the start of this post, is to store SQL migrations as files and apply them one-by-one. Additionally, it allows migrating multiple similar databases easily. In the previous post on scaling Python microservices we’d seen how a Flask microservice can dynamically connect to multiple databases. If we want to apply a migration, say adding a new column to a table, we’d want it to be applied to all the databases that the service connects to. Using yoyo_cloud we can read the migration from S3, and apply to every table. Since the migrations are idempotent, we can safely reapply the previous migrations.

Workflow

Let’s assume we’d like to create an automation which applies database migrations. We’ll assume we have two environments — stage, and prod. Whenever a migration is to be released to production, it is first tested in the dev environment, and then committed to version control to be applied to the staging environment. These could be stored as a part of the same repository or in a separate repository that contains only migrations. Let’s assume it is the latter. We could have a directory structure as follows.

1
2
3
4
5
6
7
migrations
- order_service
- prod
- 001-create-table.sql
- stage
- 001-create-table.sql
- 002-add-column.sql

The automation could then apply these migrations to the relevant environment of the service. An additional benefit of this approach is that it makes setting up new environments easier because the migrations can applied to a new database. Additionally, committing migrations to version control, as opposed to running them in an adhoc manner, allows keeping track of when the migration was introduced, why, and by whom.

Conclusion

That’s it. That’s how we can apply migrations stored in S3 using yoyo_cloud. If you’re using it, or considering using it, please leave a comment.

Scaling Python Microservices

In one of the previous posts we saw how to set up a Python microservice. This post is a continuation and shows how to scale it. Specifically, we’ll look at handling larger volumes of data by sharding it across databases. We’ll shard based on the customer (or tenant, or user, etc.) making the request, and route it to the right database. While load tests on my local machine show promise, the pattern outlined is far from production-ready.

Before We Begin

We’ll assume that we’d like to create a service that can handle growing volume of data. To accommodate this we’d like to shard the data across databases based on the user making the request. For the sake of simplicity we’ll assume that the customer is a user of a SaaS platform. The customers are heterogenous in their volume of data — some small, some large, some so large that they require their own dedicated database.

We’ll develop a library, and a sharding microservice. Everything outlined in this post is very specific to the Python ecosystem and the libraries chosen but I am hopeful that the ideas can be translated to a language of your choice. We’ll use Flask and Peewee to do the routing and create a pattern that allows transitioning from a single database to multiple databases.

The setup is fully Dockerized and consists of three Postgres databases, a sharding service, and an api service.

Getting Started

In a nutshell, we’d like to look at a header in the request and decide which database to connect to. This needs to happen as soon as the request is received. Flask makes this easy by allowing us to execute functions before and after receiving a request and we’ll leverage them to do the routing.

Library

We’ll begin by creating a library that allows connecting to the right database.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
@attr.s(auto_attribs=True, frozen=True)
class Shard:
config: APIConfig | StandaloneConfig
identifier_field: str

def db(self, identifier: Identifier) -> Database:
credentials = self._db_credentials(identifier=identifier)

credentials_dict = attr.asdict(credentials) # noqa
credentials_dict.pop("flavor")

if credentials.flavor == DatabaseFlavor.POSTGRES:
return PostgresqlDatabase(**credentials_dict)

if credentials.flavor == DatabaseFlavor.MYSQL:
return MySQLDatabase(**credentials_dict)

def _db_credentials(self, identifier: Identifier) -> DBCredentials:
if isinstance(self.config, StandaloneConfig):
credentials = attr.asdict(self.config) # noqa
return DBCredentials(**credentials)
return self._fetch_credentials_from_api(identifier=identifier)

def _fetch_credentials_from_api(self, identifier: Identifier) -> DBCredentials:
url = f"{self.config.endpoint}/write/{str(identifier)}"
response = requests.get(url)
response.raise_for_status()
json = response.json()
return cattrs.structure(json, DBCredentials)

An instance of Shard is responsible for connecting to the right database depending on how it is configured. In “standalone” mode, it connects to a single database for all customers. This is helpful when creating the microservice for the first time. In “api” mode it makes a request to the sharding microservice. This is helpful when we’d like to scale the service. The API returns the credentials for the appropriate database depending on the identifier passed to it. The identifier_field is a column which must be present in all tables. For example, every table must have a “customer_id” column.

We’ll add a helper function to create an instance of the Shard. This makes it easy to transition from standalone to api mode by simply setting a few environment variables.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
def from_env_variables() -> Shard:
mode = os.environ.get(EnvironmentVariables.SHARDING_MODE)
identifier_field = os.environ.get(EnvironmentVariables.SHARDING_IDENTIFIER_FIELD)

if mode == ShardingMode.API:
endpoint = os.environ.get(EnvironmentVariables.SHARDING_API_ENDPOINT)
config = APIConfig(endpoint=endpoint)
return Shard(config=config, identifier_field=identifier_field)

if mode == ShardingMode.STANDALONE:
host = os.environ.get(EnvironmentVariables.SHARDING_HOST)
port = os.environ.get(EnvironmentVariables.SHARDING_PORT)
user = os.environ.get(EnvironmentVariables.SHARDING_USER)
password = os.environ.get(EnvironmentVariables.SHARDING_PASSWORD)
database = os.environ.get(EnvironmentVariables.SHARDING_DATABASE)
flavor = os.environ.get(EnvironmentVariables.SHARDING_FLAVOR)

config = StandaloneConfig(
host=host,
port=int(port),
user=user,
password=password,
database=database,
flavor=flavor,
)

return Shard(config=config, identifier_field=identifier_field)

API service

We’ll add request hooks to the API service which will use an instance of the Shard to connect to the right database.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
_shard = sharding.from_env_variables()


def _before_request():
identifier = request.headers.get("X-Customer-ID")
g.db = _shard.db(identifier=identifier)


def _after_request(response):
g.db.close()
return response


api = Blueprint("api", __name__)
api.before_app_request(_before_request)
api.after_app_request(_after_request)

We’re creating an instance of Shard and using it to retrieve the appropriate database. This is then stored in the per-request global g. This lets us use the same database throughout the context of the request. Finally, we register the before and after hooks.

We’ll now add functions to the library which allow saving and retrieving the data using the database stored in g.

1
2
3
4
5
6
7
8
9
10
11
def save_with_db(
instance: Model,
db: Database,
force_insert: bool = False,
):
identifier_field = os.environ.get(EnvironmentVariables.SHARDING_IDENTIFIER_FIELD)
identifier = getattr(instance, identifier_field)
assert identifier, "identifier field is not set on the instance"

with db.bind_ctx(models=[instance.__class__]):
instance.save(force_insert=force_insert)

What allows us to switch the database at runtime is the bind_ctx method of the Peewee Database instance. This temporarily binds the model to the database that was retrieved using the Shard. In essence, we’re storing and retrieving the data from the right database.

Next we’ll add a simple Peewee model that represents a person.

1
2
3
4
5
6
7
8
class Person(Model):
class Meta:
model_metadata_class = ThreadSafeDatabaseMetadata

id = BigAutoField(primary_key=True)
name = TextField()
address = TextField()
customer_id = TextField()

We’ll add an endpoint which will let us save a row with some randomly generated data.

1
2
3
4
5
6
7
8
9
10
@api.post("/person/<string:customer_id>")
def post_person(customer_id: str) -> dict:
fake = Faker()
person = Person(
name=fake.name(),
address=fake.address(),
customer_id=customer_id,
)
save_with_db(instance=person, db=g.db, force_insert=True)
return {"success": True}

Sharding service

We’ll add an endpoint to the sharding service which will return the database to connect to over an API.

1
2
3
4
5
6
7
8
9
10
@api.get("/write/<string:identifier>")
def get_write_db(identifier: str):
return {
"host": f"postgres_{identifier}",
"port": 5432,
"flavor": "postgres",
"user": "postgres",
"password": "my-secret-pw",
"database": "postgres",
}

Here we’re selecting the right database depending on the customer making the request. For the sake of this demo the information is mostly static but in a production scenario this would come from a meta database that the sharding service connects to.

Databases

We’ll now create tables in each of the three databases.

1
2
3
4
5
6
CREATE TABLE person (
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL,
address TEXT NOT NULL,
customer_id TEXT NOT NULL
);

Testing

We’ll add a small script which uses Apache Bench to send requests for three different customers.

1
2
3
4
5
#!/bin/sh

ab -p /dev/null -T "Content-Type: application/json" -n 1000 -c 100 -H "X-Customer-ID: 1" http://localhost:5000/person/1 &
ab -p /dev/null -T "Content-Type: application/json" -n 1000 -c 100 -H "X-Customer-ID: 2" http://localhost:5000/person/2 &
ab -p /dev/null -T "Content-Type: application/json" -n 1000 -c 100 -H "X-Customer-ID: 3" http://localhost:5000/person/3 &

We’ll run the script and wait for it to complete.

1
./bench.sh

We’ll now connect to one of the databases and check the data. I’m connecting to the third instance of Postgres which should have data for the customer with ID “3”. We’ll first check the count of the rows to see that all 1000 rows are present, and then check the customer ID to ensure that requests are properly routed in a multithreaded environment.

1
SELECT COUNT(*) FROM person;

This returns the following:

1
2
count
1000

We’ll now check for the customer ID stored in the database.

1
SELECT DISTINCT customer_id FROM person;

This returns the following:

1
2
customer_id
3

Conclusion

That’s it. That’s how we can connect a Flask service to multiple databases dynamically.

Hosting Python Packages

In the previous post we saw how to create Python microservices. It’s likely that these microservices will share code. For example libraries for logging, accessing the database, etc. In this post we’ll see how to create and host Python packages.

Before We Begin

The setup is intentionally chosen for simplicity. We’ll separate the common code into a repository of its own, and use Poetry to package it. We’ll deploy it to S3, and use dumb-pypi to generate a static site which can be used to list the hosted packages.

The setup can be expanded to accommodate larger organizations with more teams, but it is primarily intended for smaller organizations with fewer teams that need to ramp up quickly. The main idea behind package distribution for Python is to arrange the files in a particular directory structure that can be accessed by a web server. While I am utilizing S3, you are free to use any comparable technology. You may even host everything on virtual machines (VMs) that are placed behind a load balancer.

Getting Started

Let’s start with a package called “common” which contains the OTel code we saw previously. In essence, we’ve simply copied the contents of package over into a new repository. Next we’ll intialize a Poetry project and enter the relevant information interactively.

1
poetry init --name="common" --description="A collection of utilities" --python="^3.12"

Next we’ll add the relevant dependencies with Poetry. The first three are for the common code, and the fourth is for generating static pages.

1
2
3
4
poetry add opentelemetry-sdk@latest
poetry add opentelemetry-api@latest
poetry add setuptools@latest
poetry add --group dev dumb-pypi@latest

Next we’ll write a small bash script which will create the package, generate a static site to display hosted packages, and upload them to S3.

deploy.sh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
#!/bin/sh
PROJECT="common"
BUCKET="... bucket name ..."
TEAM="devtools"

# -- Generate the package file for dumb-pypi
PACKAGE=$(poetry build | awk '{print $3}' | sed -n '3p')
echo "$PACKAGE" >> packages
sort -u packages -o packages

# -- Upload package
aws s3 cp "dist/$PACKAGE" "s3://$BUCKET/packages/$TEAM/$PROJECT/$PACKAGE"

# -- Generate static files for PyPi
dumb-pypi \
--package-list packages \
--packages-url "https://$BUCKET.s3.us-west-2.amazonaws.com/packages/$TEAM/$PROJECT" \
--output-dir index

# -- Upload the files in S3
aws s3 cp --recursive index/ "s3://$BUCKET/dumb-pypi/$TEAM/index/"

# -- Open static pages in the browser
open "https://$BUCKET.s3.us-west-2.amazonaws.com/dumb-pypi/$TEAM/index/index.html"

# -- Serve static pages locally
# -- python -m http.server -b 0.0.0.0 -d index 8080

There’s a lot going on in the script. First we build the package using Poetry on line #7. The output contains the name of the zip file and we extract that into a variable. This is then stored into a file on line #8 and will be used by dumb-pypi to generate the static files. We sort and deduplicate the packages on line #9. On line #12 we store copy the package to S3. On line #15 we generate the static files into a folder named index which we then copy to S3 on line #21. On line #24 we open the documentation in the browser.

The S3 bucket contains a directory structure as mentioned in the Python packaging docs.[1]

Before we run the script, however, we will have to configure the S3 bucket to be publicly accessible. We do so by enabling it to serve static content, and adding a policy which enables access to the content. The policy is given below.

1
2
3
4
5
6
7
8
9
10
11
12
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadIndex",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::yourbucketnamehere/*"
}
]
}

Now we’ll run the script.

lang;bash
1
./deploy.sh

The browser displays the following page.

We can now test the installation of the package with both Poetry and Pip. We will do this in a different Poetry project so that the install does not conflict with the existing “common” package.

1
2
3
4
5
6
7
8
9
10
11
12
13
pip install --extra-index-url https://selfhostedpackages.s3.us-west-2.amazonaws.com/packages/devtools/ common

Looking in indexes: https://pypi.org/simple, https://selfhostedpackages.s3.us-west-2.amazonaws.com/packages/devtools/
Collecting common
Using cached common-0.1.2.tar.gz (3.5 kB)
Preparing metadata (setup.py) ... done
Building wheels for collected packages: common
Building wheel for common (setup.py) ... done
Created wheel for common: filename=common-0.1.2-py3-none-any.whl size=3708 sha256=51d2f21c15829e49375762f5ca246d7f0e4d0bc82c425b25b5e77fcc83e97eae
Stored in directory: /home/fasih/.cache/pip/wheels/12/f4/3f/8982873f5bfad3134251f605011de0c35f93d64b78cb07e3b8
Successfully built common
Installing collected packages: common
Successfully installed common-0.1.2

Notice how we added the location an extra index using --extra-index-url. The URL points to the devtools directory in the bucket which is the root directory for the packages created by the “devtools” team. The subdirectories follow the layout mentioned previously.

Next we’ll try the same with Poetry after uninstalling the package using pip. First, we’ll check how Poetry allows us to do it.

1
2
3
poetry add --help
...
A url (https://example.com/packages/my-package-0.1.0.tar.gz)

Let’s go ahead and add the package in that format.

1
2
3
4
5
6
7
poetry add --dry-run https://selfhostedpackages.s3.us-west-2.amazonaws.com/packages/devtools/common/common-0.1.0.tar.gz

Updating dependencies
Resolving dependencies... (0.1s)

Package operations: 3 installs, 0 updates, 0 removals, 29 skipped
...

We can verify that the package was installed successfully by importing it in a Python shell.

1
2
3
4
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct  2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from common import create_histogram
>>>

Conclusion

That’s it. That’s how we can host our own Python packages using S3. Note that this is a setup to get started quickly. If you’re looking for a more matured setup, please take a look at the devpi project.[2] The code is available on Github.

Footnotes and References

[1] https://packaging.python.org/en/latest/guides/hosting-your-own-index/#manual-repository
[2] https://devpi.net/docs/devpi/devpi/latest/+doc/index.html