Ingesting third-party data into Apache Doris with Meltano

A common requirement for a data warehouse is to store data from third-party sources like SaaS tools: for example, the results of surveys conducted by Product or Marketing teams, or the status of JIRA tickets. While most of these tools provide an option to export data, it is often very limited; they may only allow a one-off download as a CSV file from their dashboard. Some of them provide APIs to export data, which is a lot more convenient. Of those that provide a direct integration with a warehouse, it is usually with commercial ones like Snowflake. If we’re using a different warehouse, it is going to take some engineering effort to periodically fetch this data.

In this post we’ll look at how we can streamline this ingestion. As always, we’ll use open-source technologies to do this. Like in my previous post, we’ll use Doris as our data warehouse. We’ll use the open-source tool Meltano to fetch data from a Github repository. While this is not going to fully replace the need to write custom code, it will help us benefit from the collection of open-source connectors that come with Meltano.

Setting things up

Like in my previous post, my setup consists of Docker containers for running Doris locally. There’s a Postgres container that runs as a part of this setup. We’ll install Meltano, and ingest data into a Postgres database. We’ll then create a catalog in Doris that points to the data ingested in Postgres. While it may be possible to ingest the data directly into Doris by letting Meltano write it to S3, I find it much easier to write it to a database.

Getting started

For the purpose of this post we’ll be modifying the example on Meltano’s website to ingest data from Github to Postgres. Meltano is available as a pip package, and I already have a Conda environment in which it is installed. To keep the post brief, I’ll gloss over the steps that show how to create a Meltano project and jump right into the ones where we add the Github extractor, the Postgres loader, and configure them.

Let’s start by adding the loader, and the extractor plugins.

meltano add extractor tap-github
meltano add loader target-postgres

We’ll then configure them one-by-one, starting with the Github plugin. We’ll add the repository we’d like to fetch data from, and the auth token that will be used to authenticate with Github.

meltano config tap-github set repositories '["thescalaguy/blog"]'
meltano config tap-github set auth_token '...'

Next we’ll configure the Postgres plugin. We’ll ingest the data into a schema called “github” in the “ingest” database.

meltano config target-postgres set user postgres
meltano config target-postgres set password my-secret-pw
meltano config target-postgres set database ingest
meltano config target-postgres set port 6432
meltano config target-postgres set host 127.0.0.1
meltano config target-postgres set default_target_schema github

Finally, we run the pipeline.

meltano run tap-github target-postgres

A successful run of the pipeline will create a whole bunch of tables in Postgres. One such table is “repositories”, and we’ll query it next.

SELECT id, repo, org, name FROM repositories;

The query above gives us the following row.

| id       | repo | org         | name |
|----------|------|-------------|------|
| 95359149 | blog | thescalaguy | blog |

We can now set the pipeline to run on a schedule by slightly modifying the example given in Meltano’s “Getting Started” guide.

meltano schedule add github-to-postgres --extractor tap-github --loader target-postgres --interval @daily

We’ll now create a catalog in Doris that points to the database in Postgres. To do this we need to execute the CREATE CATALOG command. Notice that we’ve specified the driver_url that points to Maven. This is a lot easier than manually downloading the JAR, and making it available in the Docker container.

CREATE CATALOG meltano PROPERTIES (
    "type" = "jdbc",
    "user" = "postgres",
    "password" = "my-secret-pw",
    "jdbc_url" = "jdbc:postgresql://192.168.0.107:6432/ingest",
    "driver_url" = "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.1/postgresql-42.7.1.jar",
    "driver_class" = "org.postgresql.Driver"
);

We have one last step before we can query the data, and that is to fix a few symlinks in the containers. It looks like the code looks for CA certificates at a location different from where they actually are. From what I could glean from the documentation, perhaps the base image has changed and the paths need to be updated. In any case, here’s how to fix it in both the FE and BE containers by docker exec-ing into them.

root@be:/# mkdir -p /etc/pki/tls/certs/
root@be:/# ln -s /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

Now we’ll run a SELECT query. Notice how we’re writing the FROM clause; it is catalog.schema.table.

SELECT id, repo, org, name 
FROM meltano.github.repositories;

There are, however, limitations to what we can read from the Postgres table. Some of the column types are not supported. Changing the query to SELECT * will raise an error saying that the “topics” column is not supported. We can describe the table in the catalog to check which of the columns cannot be read.

DESC meltano.github.repositories;

The result shows “UNSUPPORTED_TYPE” next to the “topics” column. This is perhaps because Doris uses the MySQL protocol and there’s no equivalent mapping between Postgres and MySQL for that type. One way to mitigate this is to create a new column using Meltano’s stream_maps, and parse it in Doris.

That’s it. That’s how we can ingest data from third-party systems into Doris.

Setting up a SQL IDE with Apache Superset

I have a few requirements when choosing a SQL IDE. I’d like to use something that is relatively modern, has good support for a wide variety of databases, and allows making charts and graphs. I looked at Metabase, TablePlus, DBeaver, DataGrip, and Apache Superset. Each of these tools has a feature or two that is missing in the others. For example, DataGrip allows creating entity relationship diagrams that make it easy to understand how the tables are related to each other; many of the other tools do not have this feature. The free version of TablePlus allows creating only one set of charts; Metabase and Superset allow creating many. For my day-to-day requirements, as you may have guessed from the title of this post, Apache Superset is the best fit. This post shows how to quickly set up Superset using Docker Compose.

Why Superset

One of the things I often need to do is monitor the performance of a system. For example, building upon the previous post where I talked about a hypothetical notification delivery system, counting how many notifications are sent out per day, both as an aggregate metric across channels and per channel. This can be tracked by generating StatsD metrics and charting them in Grafana. However, chances are that we’d need to display them to the customer, and also to other non-technical stakeholders. To do this we’d need our systems to generate records in the database, and have these be stored in a data warehouse that we can eventually query. This is where Superset is useful; it has a built-in SQL IDE that allows exploratory data analysis, and the ability to create charts once we’ve finalised the SQL query. The query can then be made part of the application serving analytics, or of the BI tool used by the stakeholders.

Another reason is that it is an open-source, actively-developed project. This means bug fixes and improvements will be shipped at a decent cadence; we have a tool that will stay modern as long as we keep updating the local setup.

Finally, because it is an open-source project, it has good documentation, and a helpful community. Both of these make using a technology a pleasant experience.

Setting up Superset

Superset supports a wide variety of databases. However, we need to install drivers within the Docker containers to support additional databases. For the sake of this example, we’ll install the Snowflake driver. To set up Superset locally we need to clone the git repo. Let’s start by doing that.

git clone https://github.com/apache/superset.git

Next we navigate to the repo and add the Snowflake driver.

cd superset
echo "snowflake-sqlalchemy" >> ./docker/requirements-local.txt
echo "cryptography==39.0.1" >> ./docker/requirements-local.txt

I had to add the cryptography package manually because the first time I set up Superset, the logs showed that database migrations did not run because it was missing.

Next we check out the latest stable version of the repo. As of writing, it is 3.0.1.

git checkout 3.0.1

Now we bring up the containers.

TAG=3.0.1 docker-compose -f docker-compose-non-dev.yml up -d

This will start all the containers we need for version 3.0.1 of Superset. Note that if we add newer drivers, we’ll need to rebuild the images. Navigate to the Superset login page and use admin as both the username and password. We can now follow the documentation on how to set up a data source, and use the SQL IDE to write queries.

That’s it. We’ve set up Superset.

Creating facades in Python

One of the GoF design patterns is the facade. It lets us create a simple interface that hides the underlying complexity. For example, we can have a facade which lets the client book meetings on various calendars like Google, Outlook, Calendly, etc. The client specifies details about the meeting such as the title, description, etc. along with which calendar to use. The facade then executes appropriate logic to book the meeting, without the client having to deal with the low-level details.

This post talks about how we can create a facade in Python. We’ll first take a look at singledispatch to see how we can call different functions depending on the type of the argument. We’ll then build upon this to create a function which dispatches based on the value instead of the type. We’ll use the example given above to create a function which dispatches to the right function based on what calendar the client would like to use.

Single Dispatch

The official documentation defines single dispatch to be a function where the implementation is chosen based on the type of a single argument. This means we can have one function which handles integers, another which handles strings, and so on. Such functions are created using the singledispatch decorator from the functools package. Here’s a trivial example which prints the type of the argument handled by the function.

import functools


@functools.singledispatch
def echo(x):
    ...


@echo.register
def echo_int(x: int):
    print(f"{x} is an int")


@echo.register
def echo_str(x: str):
    print(f"{x} is a str")


if __name__ == "__main__":
    echo(5)
    echo("5")

We start by decorating the echo function with singledispatch. This is the function we will pass our arguments to. We then create echo_int and echo_str, which are the different implementations that handle the various types of arguments. These are registered using the echo.register decorator.

When we run the example, we get the following output. As expected, the function to execute is chosen based on the type of the argument. Calling the function with a type which is not handled results in a no-op since we’ve set the body of echo to an ellipsis.

5 is an int
5 is a str

When looking at the source code of singledispatch, we find that it maintains a dictionary which maps the type of the argument to its corresponding function. In the following sections, we’ll look at how we can dispatch based on the value of the argument.
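
If it helps to see this in action, functions decorated with singledispatch expose a registry mapping and a dispatch() method we can inspect; a minimal sketch, reusing the echo functions from the example above.

# The types registered so far; `object` maps to the base `echo` implementation.
print(echo.registry.keys())

# `dispatch` returns the implementation that would run for a given argument type.
print(echo.dispatch(int))    # echo_int
print(echo.dispatch(float))  # no specific handler, falls back to the base `echo`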

Example

Let’s say we’re writing a library that lets the users book meetings on a calendar of their choosing. We expose a book_meeting function. The argument to this function is an instance of the Meeting data class which contains information about the meeting, and the calendar on which it should be booked.

Code

Model

We’ll start by adding an enum which represents the calendars that we support.

import enum


class Calendar(str, enum.Enum):
    GOOGLE = "google"
    OUTLOOK = "outlook"

Next we’ll add a dataclass which represents the meeting.

import dataclasses as dc
import datetime


@dc.dataclass(frozen=True)
class Meeting:
    title: str
    description: str
    time: datetime.datetime
    calendar: Calendar

Finally, we’ll start creating the facade by adding functions which will dispatch based on the value of calendar contained within the instance of Meeting.

Dispatch

We’ll create a registry which maps the enum to its corresponding function. The function takes as input a Meeting object and returns a boolean indicating whether the meeting was successfully booked or not.

from typing import Callable, TypeVar 
MeetingT = TypeVar("MeetingT", bound=Callable[[Meeting], bool])
registry: dict[Calendar, MeetingT] = {}

Next we’ll add the book_meeting function. This is where we dispatch to the appropriate function depending on the meeting object that is received as the argument.

def book_meeting(meeting: Meeting) -> bool:
    func = registry.get(meeting.calendar)

    if not func:
        raise Exception(f"No function registered for calendar {meeting.calendar}")

    return func(meeting)

To be able to register functions which contain the logic for a particular calendar, we’ll create a decorator called register.

def register(calendar: Calendar):
    def _(func: MeetingT):
        if registry.get(calendar):
            raise Exception(f"A function has already been registered for {calendar}")
        registry[calendar] = func
        return func
    return _

register accepts as argument the calendar for which we’re registering a function. It returns another higher-order function which puts the actual function in the registry. Since the actual logic of the execution is in the decorated function, we simply return the original function func.

Finally, we register functions for different calendars.

@register(Calendar.GOOGLE)
def book_google(meeting: Meeting) -> bool:
    print("Booked Google meeting")
    return True


@register(Calendar.OUTLOOK)
def book_outlook(meeting: Meeting) -> bool:
    print("Booked Outlook meeting")
    return True

We’ll put all of this code in action by trying to book a meeting on Google calendar.

if __name__ == "__main__":
    meeting = Meeting(
        title="Hello",
        description="World",
        time=datetime.datetime.now(),
        calendar=Calendar.GOOGLE
    )

    book_meeting(meeting=meeting)

This prints “Booked Google meeting”, like we’d expect. We can now continue to add more functions which contain logic for specific calendars. Our library can evolve without any change to the exposed interface. It’s also possible to organise functions into their own modules, import the register decorator, and decorate them to add them to the registry. This has two main benefits. One, we keep the code well structured. Two, the code for different versions of the same calendar can stay separated; we avoid having to write if checks to see the calendar version since that can be made part of the enum itself, like GOOGLE_V1.
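
As a rough illustration of that layout, a calendar-specific module might look like the following; this is only a sketch, and the facade module name and the GOOGLE_V1 enum member are hypothetical.

# calendars/google_v1.py -- a hypothetical module layout
from facade import Calendar, Meeting, register  # assumes the code above lives in facade.py


@register(Calendar.GOOGLE_V1)  # assumes GOOGLE_V1 = "google_v1" was added to the Calendar enum
def book_google_v1(meeting: Meeting) -> bool:
    print("Booked meeting using the v1 Google Calendar API")
    return True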

That’s it. That’s how you can create a facade in Python.

Software Architecture as Code

Let’s start with an exercise: ask two of your colleagues to draw the architecture of a moderately-complex CRUD app. Chances are that the two diagrams will look very different from each other. They’d look different because the notations used to depict different parts of the system, like a database, for example, are different. Additionally, everyone has a favorite tool like DrawIO, LucidChart, paper towel, etc. for drawing these diagrams. This approach only works well for rapid prototyping. The drag-and-drop nature of most drawing tools, and paper towel’s susceptibility to getting lost, make documenting large software systems a less-than-pleasant experience.

This post describes how to use code to create software architecture diagrams and then commit them to version control. We’ll look at the process and tools for producing consistent diagrams at various abstraction levels, ranging from high-level to low-level. To achieve this, we will create a fictitious notification delivery system. After reading this you should be able to draw diagrams for your own systems at various levels of detail, and be able to explain them to different types of audiences.

In the first part we will talk about why it is necessary to represent the system at various levels of abstraction. In the second part we will talk about notations we can use to draw our system. In the last part we will work through the notification delivery system and create different types of diagrams for it. If you’re looking for an introduction-by-example, feel free to skip to the third part.

Abstraction

Consider a typical software engineering team. You can categorize the people in the team as “engineers” or “not engineers”. Product managers and other stakeholders make up the “not engineers” portion of the team. They require a broad overview of the system. Within the engineering team, too, people look at the system at different levels of detail. A junior engineer, for example, would focus on the subsystem they are working on while only glancing at the other subsystems. A more senior engineer, on the other hand, would be expected to be intimately familiar with many systems. This requires presenting the system at different levels of abstraction, depending on the intended audience.

The C4 model allows us to create such views of the system - from the least to the most detailed. The four C’s of the model stand for context, containers, components, and code. A system is made up of many containers (applications, databases, etc), each of which contains multiple components (various API controllers, etc), which are made up of code elements (individual classes, functions, interfaces, etc.). The diagram below, taken from C4 model’s website, shows how they are arranged in a hierarchy.

Given the hierarchy of the four C’s, we can create diagrams at different levels of abstraction. For example, a high-level diagram which only shows different software systems. The highest-level diagram is called the system landscape diagram. This shows how our system fits into the larger scheme of things. The system context diagram narrows the focus by showing our system and the systems that it interacts with. A container diagram adds more detail by showing the responsibilities of different subsystems that together make up our system. Finally, a code diagram shows the most detailed view consisting of class hierarchies, ER diagrams, etc.

Creating such diagrams is helpful because it lets us communicate effectively. For example, we’d use the system landscape or system context diagram when discussing the scope of changes with the product team. A junior engineer, on the other hand, could work their way through the different types of diagrams to get a sense of how the system works.

Writing code to generate these multiple perspectives of the system is a really convenient method, as you’ll see later in the post. All of the views can be generated automatically by the same code. In comparison, you could use a drag-and-drop tool. However, keeping all the diagrams in sync with each other gets tiresome and error-prone.

Notation

Languages like UML are very heavy on notation. For example, the notation to draw a component diagram is very different from a sequence diagram. In contrast, the C4 model is very light on notation. As a starting point, you could represent different diagrams using boxes and arrows, and then augment them with shapes and colors to convey more information. This makes it easy for different people to understand the diagrams without getting bogged down in notation.

As you’ll see in the example that follows, we’ll start with boxes and arrows. For example, the database would be represented as a grey box. We’ll then change its shape and color to make it a blue cylinder.

Example

Now let’s look at our notification delivery system. It consists of a REST API which allows callers to send notifications across various channels like email, SMS, or push notifications. Every incoming request is added to a Kafka topic, depending on the channel of delivery, and is eventually picked up by background consumers. The status of each request is updated in the database. A CDC pipeline reads from this database and stores it into a warehouse. The diagram below gives an overview of what the system would look like.

Getting Started

My tool of choice for creating architecture diagrams as code is Structurizr. It provides a DSL which lets us create C4 model of our systems. When you visit its homepage, you’d find an example of how the same code can generate different views of the system at different levels of abstraction. We will use the open-source version of the tool, called Structurizr Lite, which allows a single person to create, edit, and view diagrams; in other words, there is no collaborative editing. The workflow we are aiming for lets individuals submit architecture diagrams as a part of their pull requests. In the next section, we will create a script which will let us run Structurizr Lite as a Docker container. We will then create the diagram that we saw above using the Structurizr DSL.

start.sh

We’ll start by organizing our Git repository. At the top level is the design folder where we will keep all the Structurizr code. The start.sh script, also at the top level, will launch the Docker container and reference the files in the design folder.

.
├── app.py
├── design
└── start.sh

The content of the start script is the following:

#!/bin/bash

if [ -z "$1" ]; then
    echo "You should provide a file name"
    exit 1
fi

docker run --rm -it -p 8080:8080 \
    -e STRUCTURIZR_WORKSPACE_FILENAME="$1" \
    -v ~/Personal/notifications/design:/usr/local/structurizr structurizr/lite

We pass the name of the file as the first argument when we run the script. It is stored in the /usr/local/structurizr directory of the container which we have mapped to the design folder of our Git repository. When we run the script and the file does not exist, Structurizr creates it for us. This is helpful when we are creating a new diagram. If the file exists, Structurizr will read from it and render the diagrams. Let’s start by creating our first file.

./start.sh 001-overview

This will create two files under the design folder. The one we’d like to edit ends with the dsl extension and will be committed to the repository. It is okay to add the json file to gitignore as it will be generated again when the container runs.

.
├── app.py
├── design
│   ├── 001-overview.dsl
│   └── 001-overview.json
└── start.sh

If you navigate to localhost:8080 you’ll see Structurizr rendering a default diagram. We’ll update this to reflect our system.

Creating our first diagram

Open the dsl file with your favorite editor and replace the contents of the file with the following:

workspace {
    model {
        user = softwareSystem "Client"
        notificationSystem = softwareSystem "Notification System"

        sendgrid = softwareSystem "Sendgrid"
        apns = softwareSystem "Apple Push Notification Service"
        fcm = softwareSystem "Firebase Cloud Messaging"
        cdc = softwareSystem "Change Data Capture"
        warehouse = softwareSystem "Data Warehouse"

        user -> notificationSystem "Uses"
        notificationSystem -> sendgrid "Sends emails using"
        notificationSystem -> apns "Sends push notifications using"
        notificationSystem -> fcm "Sends push notifications using"
        cdc -> notificationSystem "Reads database changes from"
        cdc -> warehouse "Writes database changes to"
    }

    views {
        systemLandscape notificationSystem "Overview" {
            include *
            autoLayout lr
        }

        systemContext notificationSystem "Context" {
            include *
            autoLayout lr
        }
    }
}

Let’s unpack what is going on here. At the top-level is the workspace which is a container for the model and views. A model contains the different pieces of our architecture. This includes the software system we are describing, the people and external systems it interacts with, and the relationships between them. views contains the different levels of abstractions at which we’d like to see the system. Here we’ve defined the system landscape view, and the system context view.

In the model we’ve defined various software systems and the relationships between them. The system that we’d like to diagram is the notificationSystem. At this stage we’ve just added a one-line definition of the system, as softwareSystem, which results in it being rendered as a box in the diagram with its name on it.

We’ll continue to refine this diagram by defining other parts which together comprise the notification system. We will do so by adding them as children of the notification system which we’ve defined in the DSL. In other words, we will add one more level of detail to the system.

Relationships between the different parts of the system are defined using an arrow ->. These can be added as children of a softwareSystem or defined outside of them, as we’ve done above.

Adding more detail

We will now zoom in and add the containers which make up the notification system. We will add the background consumers which consume the notifications enqueued in Kafka and send them to the relevant channel. Update the contents of the dsl file to the following:

workspace {

    model {
        user = softwareSystem "Client"
        notificationSystem = softwareSystem "Notification System" {

            api = container "API" "Receives requests to send notifications"
            db = container "Database" "Stores the status of each request to send a notification"
            email = container "Email Consumer" "Sends emails"
            ios = container "iOS Consumer" "Sends iOS push notifications"
            android = container "Android Consumer" "Sends Android push notifications"

        }

        sendgrid = softwareSystem "Sendgrid"
        apns = softwareSystem "Apple Push Notification Service"
        fcm = softwareSystem "Firebase Cloud Messaging"
        cdc = softwareSystem "Change Data Capture"
        warehouse = softwareSystem "Data Warehouse"

        # -- Relationships between systems
        user -> api "Uses"
        api -> email "Enqueues request to send emails, through Kafka, to"
        email -> sendgrid "Sends emails using"

        api -> ios "Enqueues request to send push notifications, through Kafka, using"
        ios -> apns "Sends push notifications using"

        api -> android "Enqueues request to send push notifications, through Kafka, using"
        android -> fcm "Sends push notifications using"

        cdc -> db "Reads database changes from"
        cdc -> warehouse "Writes database changes to"

        # -- Relationships between components
        api -> db "Stores incoming request in"
    }

    views {
        systemLandscape notificationSystem "Overview" {
            include *
            autoLayout lr
        }

        systemContext notificationSystem "Context" {
            include *
            autoLayout lr
        }

        container notificationSystem "Container" {
            include *
            autoLayout tb
        }
    }

}

Notice that we’ve added containers which are children of the notification system. We’ve also changed the relationships to point to these containers instead of the parent software system. This keeps the system landscape and the system context diagrams the same but allows us to add a new container diagram to the system. This diagram is shown below.

You can add more detail by creating a component diagram. This will let you show individual parts of your software that perform specific tasks. For example, a REST API controller which handles incoming requests to send notifications. Although possible, it is recommended to add a component diagram only if it adds value. Usually it’d be sufficient to only create a container diagram.

Styling our diagram

The diagram we’ve created so far conveys the architecture. However, the only indication of what is internal to the notification system, i.e. within the scope of the system, is a faint dashed line which groups the various containers together. Let’s go ahead and style our diagram by making containers that are a part of the notification system render with a blue background. We’ll start by creating a few directories under the design directory.

mkdir -p design/commons/styles
mkdir -p design/commons/icons
mkdir -p design/commons/themes

We’ve created three directories. styles will contain files which store styling information. icons is where we will keep PNG icons. themes is where we will store predefined themes. This is what the directory structure will look like.

design/
├── 001-overview.dsl
└── commons
    ├── icons
    │   ├── debezium.png
    │   └── postgres.png
    ├── styles
    │   └── default.dsl
    └── themes

Let’s create default.dsl under the styles directory. The contents of the file are given below.

styles {

    element "internal" {
        background #4287f5
        color #ffffff
    }

    element "database" {
        shape Cylinder
    }

}

Here we are creating a couple of custom elements. We are creating an internal element which has a blue background color, and a database element which is a cylinder. We will then use these elements to style our diagram. Let’s update the dsl file to include these elements. I’m only showing the parts that have changed for the sake of brevity.

workspace {

    model {
        user = softwareSystem "Client"
        notificationSystem = softwareSystem "Notification System" {

            tags "internal"

            api = container "API" "Receives requests to send notifications" "Python" "internal"

            db = container "Database" "Stores the status of each request to send a notification" "Postgres" {
                tags "internal, database"
            }

            email = container "Email Consumer" "Sends emails" "Python" "internal"
            ios = container "iOS Consumer" "Sends iOS push notifications" "Python" "internal"
            android = container "Android Consumer" "Sends Android push notifications" "Python" "internal"

        }
        ...
    }

    views {
        ...
        !include ./commons/styles/default.dsl
    }
}

We begin by including the styling information using the !include directive under views. Styling information is associated with individual systems and components by using tags. You can combine tags to apply multiple styles. Tags can be specified in-line or as a child of a software system or a component. We’ve specified the internal tag in-line for the api container, and as a child of the db container. Notice how we’ve applied both the internal and database tags to the db container, causing it to be rendered as a blue cylinder. With styling, the container diagram now visually communicates what is within and outside the scope of the notification system.

Dynamic diagrams

The diagrams we’ve created so far are static and only show the structure of the system. Dynamic diagrams allow us to show the flow of information between systems as a sequence of steps. They are like UML sequence diagrams but allow a more free-form arrangement of the various parts of the system. The steps are numbered to show the order in which they happen.

As an example, let’s model how an email notification would be triggered. The API receives a request to send an email. It stores this incoming request in the database, and then puts it into a Kafka topic. The consumer then picks up this request, and sends the email using a third-party system.

Dynamic diagrams are created using the dynamic element under views. To model the scenario mentioned above, we’d add the following to our dsl file.

views {
    ...

    dynamic notificationSystem {
        title "Send an email notification"
        user -> api "Sends request to trigger an email notification to"
        api -> db "Stores the incoming request in"
        api -> email "Enqueues the request in Kafka for"
        email -> sendgrid "Sends email using"
        autoLayout lr
    }

    ...
}

The steps are written in the order in which they appear. Numbers indicating their sequence are added automatically by Structurizr. The dynamic diagram for the code we just added is given below.

Deployment diagrams

Deployment diagrams allow us to show how the software will actually run. For example, the various components of the software can be deployed as containers running on a Kubernetes cluster. We’ll start by creating a simple deployment diagram which shows the local development environment. All of the services, except the ones that are third-party, will run on the developer’s laptop.

Deployment diagrams are created by adding a deploymentEnvironment element under model. Once that is defined, we include it in the views. Let’s model the development environment.

model {
    ...
    development = deploymentEnvironment "Development" {
        deploymentNode "Laptop" {
            containerInstance api
            containerInstance email
            containerInstance ios
            containerInstance android
            containerInstance db
        }

        deploymentNode "Third-Party Services" "Development" {
            softwareSystemInstance sendgrid
            softwareSystemInstance fcm
            softwareSystemInstance apns
        }
    }
    ...
}

We’re creating a node called “Laptop” which will host all the containers running on the developer’s local machine. To this we add the various container instances. We add another node to contain all the third-party services. Relationships between these are automatically inferred from the container diagram created previously. Finally, we add it to views.

views {
    ...
    deployment * development {
        include *
        autoLayout lr
    }
    ...
}

This gives us the following diagram.

We can also depict how the software would be deployed to production. Let’s say that our system is deployed to AWS. Our database is an RDS cluster, and the API servers and email consumers are deployed to EC2 instances. We use Route 53 for DNS, and an Elastic Load Balancer to forward requests to the API servers. Like we did for the development environment above, we’ll create a deployment diagram. This diagram, however, will be styled some more to indicate the various AWS services.

Structurizr allows importing themes. Themes provide predefined tags which you can use to style elements of your diagram. Themes are added to views. Let’s start by adding the AWS theme.

views {
    ...
    theme https://static.structurizr.com/themes/amazon-web-services-2023.01.31/theme.json
    ...
}

Next we’ll model the production environment. In the model below, we’re nesting deployment nodes. The outermost node represents the AWS environment as a whole. The next node represents the AWS region. Being able to nest deployment nodes allows us to reuse elements of a diagram. For example, if we had multiple database clusters that were all hosted on Amazon RDS, we could nest them all under a single deployment node representing Amazon RDS.

We’ve added Route 53 and the ELB as infrastructure nodes, and defined relationships between them. They’re styled using tags that we imported from the AWS theme. These are defined in the JSON that’s returned from the theme’s URL.

You don’t need to add relationships between deployment nodes that host your containers. These relationships are implied from the container diagram that was created above.

model {
    ...
    production = deploymentEnvironment "Production" {
        deploymentNode "Amazon Web Services" {
            tags "Amazon Web Services - Cloud"

            deploymentNode "US-East-1" {
                tags "Amazon Web Services - Region"

                route53 = infrastructureNode "Route 53" {
                    tags "Amazon Web Services - Route 53"
                }

                elb = infrastructureNode "Elastic Load Balancer" {
                    tags "Amazon Web Services - Elastic Load Balancing"
                }

                deploymentNode "Amazon RDS" {
                    tags "Amazon Web Services - RDS"

                    deploymentNode "Postgres" {
                        tags "Amazon Web Services - RDS Postgres instance"
                        containerInstance db
                    }
                }

                deploymentNode "API Servers" {
                    tags "Amazon Web Services - EC2"

                    deploymentNode "Ubuntu Server" {
                        apiInstance = containerInstance api
                    }
                }

                deploymentNode "Email Consumers" {
                    tags "Amazon Web Services - EC2"

                    deploymentNode "Ubuntu Server" {
                        containerInstance email
                    }
                }
            }

            route53 -> elb "Forwards requests to" "HTTPS"
            elb -> apiInstance "Forwards requests to" "HTTPS"
        }

        deploymentNode "Third-Party Services" "Production" {
            softwareSystemInstance sendgrid
        }
    }
    ...
}

Finally, we’ll add the production environment to views.

views {
    ...
    deployment * production {
        include *
        autoLayout lr
    }
    ...
}

This gives us the following diagram. Notice how adding tags has caused AWS icons to be added and styling to be applied, wherever applicable. This is a very small subset of the overall system. Some of the background consumers, CDC, and warehouse have been omitted.

That’s it. That’s how you can create software architecture using code. I’ve created a Github repository for this blog post which contains the complete code for creating the notification system we discussed above.

Creating a data warehouse with Apache Doris, Kafka, and Debezium

Over the last few weeks I’ve been looking at ways to create a data warehouse. The aim is to bring together disparate sources of data so that they can be analysed to create reports and dashboards. There are many open-source and closed-source options available like Snowflake, Clickhouse, Apache Pinot, etc. However, I ended up settling for Apache Doris. In this post I will go over how you can replicate data into Doris using Kafka. I will use Debezium to replicate a Postgres table into the warehouse.

Setting things up

My setup consists of Docker containers for Kafka, Debezium, Postgres, and Doris. In a nutshell, I’ll create a table in Postgres, have Debezium stream it to Kafka, and then use the Kafka consumer in Doris to ingest this into a table. After the JSON generated by Debezium has been written to a table, I’ll create a view on top of this data to make it look relational again.

Before we begin

Doris ships with “external table” ingestion for Postgres. However, the caveat is that you’d have to import the data in small batches manually. Additionally, the external connector is configured for a single database table. If you have a setup where every tenant gets their own table, the data can be split among multiple databases (for MySQL) or schemas (for Postgres) on the same cluster. Setting up external tables then becomes difficult, because new databases or schemas can be added dynamically and each would require an external table to be set up for it. In contrast, you can have Debezium fetch tables based on a regex pattern. Any table in any database or schema that matches the pattern would be a valid candidate for Debezium to stream from.

Another reason for using Debezium is to avoid reingesting the data when new columns are added to the table. Since we will store the raw JSON payload that Debezium sends, we can simply update the view and extract the new column from the JSON.

The final reason for using Debezium is to be able to support databases that Doris doesn’t have external table support for. For example, MongoDB. Using Debezium allows us to stream these databases into Kafka and then into Doris.

Getting started

We’ll begin by bringing up the containers. For the purpose of this demo, we will create a table called people and add some rows to it. Here is what a row from this table looks like:

| customer_id | id     | name         | created_at                 | updated_at | deleted_at |
|-------------|--------|--------------|----------------------------|------------|------------|
| 1           | 108923 | Cynthia Ford | 2023-11-02 15:50:42.158417 | NULL       | NULL       |

After the table is populated, we’ll create a Debezium source connector to stream the table to Kafka. The configuration for this connector looks like this:

{
    "name": "postgres_source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.user": "postgres",
        "database.password": "my-secret-pw",
        "database.dbname": "postgres",
        "database.server.name": "postgres",
        "table.include.list": ".*\\.people",
        "plugin.name": "pgoutput",
        "publication.autocreate.mode": "filtered",
        "time.precision.mode": "connect",
        "tombstones.on.delete": "false",
        "snapshot.mode": "initial",
        "heartbeat.interval.ms": "1000",
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
        "transforms.route.replacement": "$3",
        "event.processing.failure.handling.mode": "skip",
        "producer.override.compression.type": "snappy"
    }
}

Notice how we’re specifying a regex pattern for table.include.list in the configuration. This would match people table in any schema within the postgres database. We are also using the route transformer to merge the tables from various schemas into a single Kafka topic.
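
To see what the route transform does, we can replay the same regex and replacement in Python; a small sketch assuming Debezium’s default server.schema.table topic naming, e.g. postgres.public.people (the schema names here are made up).

import re

# Same pattern as transforms.route.regex; "$3" in Kafka Connect corresponds to "\3" here.
pattern = r"([^.]+)\.([^.]+)\.([^.]+)"

for topic in ["postgres.public.people", "postgres.tenant_42.people"]:
    print(re.sub(pattern, r"\3", topic))  # both print "people", so they land in one topic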

Now that the data is in Kafka, we can begin ingesting it into Doris. We will do this as a two-step process. One, we will ingest this into a landing area where we store the raw JSON payload along with the primary key. Two, we will create a view on top of this data and extract the keys that we need.

We’ll create a database called ingest where we will store the raw data.

CREATE DATABASE ingest;

Within this database, we will create a table called people where the data from Kafka will be stored.

CREATE TABLE IF NOT EXISTS ingest.people (
    id BIGINT,
    customer_id BIGINT,
    source JSONB,
    op VARCHAR,
    ts_ms BIGINT
)
UNIQUE KEY (id, customer_id)
DISTRIBUTED BY HASH (customer_id) BUCKETS 1
PROPERTIES (
    "replication_num" = "1",
    "enable_unique_key_merge_on_write" = "true"
);

The source column will store the JSON generated by Debezium. We will extract column values from this when we create the view. We are specifying the combination of customer_id and id as the UNIQUE KEY and enabling merge-on-write in the properties. This allows us to keep only the latest version of a row as we read updates from Debezium. Since all of this is running on my local machine, I’ve set the replication to 1. Without this setting Doris would not allow creating tables, as there aren’t at least 3 instances of the Doris backend running.

To begin loading data from Kafka, we will create a ROUTINE LOAD. This will read data from Kafka in batches and write them to the people table.

CREATE ROUTINE LOAD ingest.load_people ON people
PROPERTIES
(
    "max_batch_interval" = "20",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200",
    "format" = "json",
    "jsonpaths" = "[
        \"$.payload.after.id\",
        \"$.payload.after.customer_id\",
        \"$.payload.after\",
        \"$.payload.op\",
        \"$.payload.ts_ms\"
    ]"
)
FROM KAFKA
(
    "kafka_broker_list" = "192.168.0.108:9092",
    "kafka_topic" = "people",
    "property.group.id" = "people",
    "property.client.id" = "1",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);

In the definition of the routine load we specify the table to write into using the ON clause. In the properties we specify that we are ingesting JSON objects from Kafka. We also specify jsonpaths, which allows us to select keys from within the JSON payload. We are extracting the keys that become columns in the raw people table. Since jsonpaths is a stringified JSON list, we need to escape the quotes. Finally, we specify the Kafka broker and the topic to read from.
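
To make the jsonpaths concrete, here is roughly the shape of a Debezium change event and the parts the paths above select; a simplified sketch with made-up values, not the full event envelope.

# Simplified shape of a Debezium event value; real events carry more metadata.
event = {
    "payload": {
        "after": {"id": 108923, "customer_id": 1, "name": "Cynthia Ford"},
        "op": "c",           # c = create, u = update, d = delete
        "ts_ms": 1698934242158,
    }
}

# What the routine load's jsonpaths pull out, in order:
row = (
    event["payload"]["after"]["id"],           # $.payload.after.id
    event["payload"]["after"]["customer_id"],  # $.payload.after.customer_id
    event["payload"]["after"],                 # $.payload.after (stored as raw JSON in `source`)
    event["payload"]["op"],                    # $.payload.op
    event["payload"]["ts_ms"],                 # $.payload.ts_ms
)
print(row)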

You can see the running task using

SHOW ALL ROUTINE LOAD FOR ingest.load_people;

After waiting for some time to let the task finish ingesting the rows we can create the view. We’ll start by creating the database.

CREATE DATABASE warehouse;

Finally, we will create the view

CREATE VIEW warehouse.people AS
SELECT
    id,
    customer_id,
    REPLACE(json_extract(source, '$.name'), "\"", '') AS name,
    FROM_UNIXTIME(CAST(REPLACE(json_extract(source, '$.created_at'), "\"", '') AS BIGINT) / 1000) created_at,
    FROM_UNIXTIME(CAST(REPLACE(json_extract(source, '$.updated_at'), "\"", '') AS BIGINT) / 1000) updated_at,
    FROM_UNIXTIME(CAST(REPLACE(json_extract(source, '$.deleted_at'), "\"", '') AS BIGINT) / 1000) deleted_at
FROM ingest.people;

In the SQL above, we are extracting the fields from the JSON that we stored. We are processing them to remove quotation marks, and converting the epoch sent by Debezium into a DATETIME object. Since we have enabled merge-on-write, subsequent updates to the source table in Postgres will be reflected automatically in the view.
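
The division by 1000 is there because Debezium sends these timestamps as epoch milliseconds while FROM_UNIXTIME expects seconds; a quick sketch of the same conversion in Python, using a made-up value.

import datetime

ts_ms = 1698934242158  # made-up epoch value in milliseconds, as Debezium would send it

# FROM_UNIXTIME expects seconds, hence the division by 1000 in the view.
print(datetime.datetime.fromtimestamp(ts_ms / 1000))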

We can now see the data stored in the view.

SELECT * FROM warehouse.people LIMIT 1;

and this gives us

| customer_id | id    | name           | created_at          | updated_at | deleted_at |
|-------------|-------|----------------|---------------------|------------|------------|
| 1           | 28910 | Douglas Becker | 2023-11-03 11:42:09 | NULL       | NULL       |

That’s it. That’s how you can set up a real-time data warehouse using Doris, Debezium, and Kafka.

Learn you a SQL for Great Good

Lately I have been spending some time on Reddit’s /r/SQL and a question that comes up frequently is where to find good resources to learn SQL. This post traces a learning path. I will list down courses, books, and websites which you can use to learn SQL.

Getting Started

The first step is to get comfortable with the relational model. Depending on whether you prefer books or courses, you can pick one of the following.

Course - Databases: Relational Databases and SQL

The course, offered by Stanford on edX, provides an introduction to relational databases and SQL. It assumes no prior knowledge of either relational databases or SQL. It is the first in a series of 5 courses with each course covering a more advanced aspect of SQL and relational model. As of writing, the course costs USD 50 for the certificate track.

Book - Learning SQL, 3rd Edition

If you are a reader, Learning SQL by Alan Beaulieu is the book for you. It begins with a quick introduction of the relational model and then dives head-first into writing SQL queries. It explains everything by examples and also provides exercises at the end of each chapter to solidify learning. The book is intended to be read cover to cover as each chapter builds upon the previous one. You will find a comprehensive coverage of SQL from the basics of creating tables, to inserting, retrieving, and deleting data, to the more advanced analytical functions.

Website - SQLBolt

If you’re looking for a free alternative to learn SQL then you can use SQLBolt. It begins with a quick one-page introduction of relational databases and then takes you straight into writing SQL queries. The good part about SQLBolt is that you do not need to set up any database on your local machine. You can do all the exercises right in the browser.

SQLBolt is a good website to learn the basics of SQL. However, there are a few topics which are yet to be covered. These are topics of intermediate difficulty and you will have to find alternative resources to learn them.

Practice, Practice, Practice

Once you have a firm grasp of the basics, the next step is to get some practice.

Website - PgExercises

PgExercises is a good place to start practicing your new SQL skills. You are presented with the database of a fictitious country club on which the problems are based. The problems begin with simple retrieval of data and gradually become more involved by introducing conditional statements, and so on. The nice thing about PgExercises is that you can solve all the problems on the website itself and get immediate feedback on whether your answer is correct or not. There is an option to get a hint if you are stuck and to view the answer if you are really stuck.

You can also use PgExercises to prepare for an interview. Set a timer when solving the exercises and don’t look at the hints. This will give you an indication of how prepared you are.

Website - Window Functions

If you’re asked to find out the maximum or minimum value among a group of rows, you’d probably come up with a combination of GROUP BY clause along with MIN or MAX functions. What if, instead, you were asked to rank your results in decreasing order? SQL allows you to do that using the RANK function.

MIN, MAX, and AVG are aggregate functions whereas RANK is a window function. If you’d like to learn how to use window functions, click on the link in the title. The Window Functions website begins with a refresher on the GROUP BY and HAVING clauses and then proceeds to introduce you to window functions. Like PgExercises, you’ll be solving problems in an interactive editor using Postgres. All this on a cute dataset of cats.

Use Window Functions and PgExercises to prepare for your upcoming interviews.

Website - AdvancedSQLPuzzles

If you are looking for something more mind-bending then head over to AdvancedSQLPuzzles. It is a collection of 40 puzzles that will help you determine your level of expertise with SQL. The only caveat to using the website is that you will have to put in some effort to set up a local database before you can start solving the puzzles. Additionally, the solutions are written in T-SQL so you might need to translate the solutions to your favorite dialect of SQL.

Finito.

Frequentism vs Bayesianism

When it comes to statistics, there are two schools of thought: frequentism and Bayesianism. In the coming posts we’ll be looking at hypothesis testing and interval estimation, and knowing the difference between the two schools is important. In this post I’ll go over what frequentism and Bayesianism are and how they differ.

What is ‘probability’?

The reason for these two schools of thought is the difference in their interpretation of probability. For frequentists, probabilities are about frequencies of occurrence of events. For Bayesians, probabilities are about degrees of certainty of events. This fundamental divide in the definition of probability leads to vastly different methods of statistical analysis.

The two schools

The aim of both the frequentists and Bayesians is the same - to estimate some parameters of a population that are unknown.

The assumption of the frequentist approach is that the parameters of a population are fixed but unknown constants. Since these are constants, no statements of probability can be made about them. The frequentist procedures work by drawing a large number of random samples, calculating a statistic using each of these samples, and then finding the probability distribution of the statistic. This is called the sampling distribution. Statements of probability can be made about the statistic.
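
A small simulation makes the idea of a sampling distribution concrete; a sketch using numpy, with an arbitrary population and sample size.

import numpy as np

rng = np.random.default_rng(42)

# An arbitrary population with a fixed but (pretend) unknown mean.
population = rng.normal(loc=170, scale=10, size=100_000)

# Draw many random samples and compute the statistic (the sample mean) for each.
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

# The distribution of these means is the sampling distribution of the statistic.
print(np.mean(sample_means), np.std(sample_means))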

The assumption of the Bayesian approach is that the parameters of a population are random variables. This allows making probability statements about them. There is a notion of some true value that the parameters can take with certain probability. The Bayesian approach thus allows adding in some prior information. The cornerstone of the Bayesian approach is Bayes’ theorem:

$$P(H \mid D) = \frac{P(D \mid H) \, P(H)}{P(D)}$$

Here, $H$ is the hypothesis and $D$ is the data. What the theorem lets us calculate is the posterior $P(H \mid D)$, which is the probability of the hypothesis being true given the data. The prior $P(H)$ is the probability that $H$ is true before the data is considered. The prior lets us encode our beliefs about the parameters of the population into the equation. $P(D \mid H)$ is the likelihood, i.e. the evidence about hypothesis $H$ given data $D$. Finally, $P(D)$ is the probability of getting the data regardless of the hypothesis.

What’s important here is the prior $P(H)$. This is our degree of certainty about the hypothesis $H$ being true. This probability can itself be calculated using frequentist methods, but what matters is the fact that the Bayesian approach lets us factor it in.
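
As a toy illustration of the theorem, here is a discrete Bayes update in Python; the numbers are made up purely to show how the prior, likelihood, and evidence combine.

# Hypothesis H: a coin is biased towards heads (P(heads) = 0.8).
# Alternative: the coin is fair (P(heads) = 0.5).
prior_biased = 0.3          # made-up prior belief that the coin is biased
prior_fair = 1 - prior_biased

# Data D: we observe a single head.
likelihood_given_biased = 0.8
likelihood_given_fair = 0.5

# Evidence P(D): total probability of seeing a head under either hypothesis.
evidence = likelihood_given_biased * prior_biased + likelihood_given_fair * prior_fair

# Posterior P(H | D) via Bayes' theorem.
posterior_biased = likelihood_given_biased * prior_biased / evidence
print(posterior_biased)  # roughly 0.41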

| Frequentist | Bayesian |
|-------------|----------|
| Parameters are fixed, unknown constants. No statements of probability can be made about them. | Parameters are random variables. Since random variables have an underlying probability distribution, statements of probability can be made about them. |
| Probability is about long run frequencies. | Probability is about specifying the degree of (un)certainty. |
| No statements of probability are made about the data or the hypothesis. | Statements of probability are made about both data and hypothesis. |
| Makes use only of the likelihood. | Makes use of both the prior and the likelihood. |

The procedures

In the frequentist approach, the parameters are unknown constants and it is the data that changes (by repeated sampling). In the Bayesian approach, the parameters are random variables and it is the data that stays constant (the data that has been observed). In this section we will contrast the frequentist confidence interval with the Bayesian credible interval.

Both confidence intervals and credible intervals are interval estimators. Interval estimators provide a range of values that the true parameters can take.

The frequentist confidence interval

Let’s assume that there’s a true parameter $\mu$, representing the mean of a population, that we are trying to estimate. From a sample of values we can then construct two estimators $\theta_1$ and $\theta_2$ and say that the true value $\mu$ lies between $\theta_1$ and $\theta_2$ with a certain level of confidence.

To be confident that $\theta_1 \le \mu \le \theta_2$, we need to know the sampling distribution of the estimator. If the random variable $X$ is normally distributed, then the sample mean $\bar{X}$ is also normally distributed with mean $\mu$ (the true mean) and variance $\sigma^2 / n$, i.e. $\bar{X} \sim \mathcal{N}(\mu, \sigma^2 / n)$. In a normal distribution, 95% of the area under the curve is covered by two standard deviations. Therefore, if we want 95% of the intervals to contain the true value of $\mu$, we can construct the interval $\bar{X} \pm 2 \sigma / \sqrt{n}$.

What this means is that if we were to keep drawing samples and constructing these intervals, 95% of these random intervals will contain the true value of the parameter $\mu$. This can be stated more generally as $P(\theta_1 \le \mu \le \theta_2) = 1 - \alpha$. This means that with probability $1 - \alpha$ (with $0 \le \alpha \le 1$), the interval between $\theta_1$ and $\theta_2$ contains the true value of the parameter $\mu$. So, if we set $\alpha = 0.05$, we’re constructing a 95% confidence interval. $\alpha$ is called the level of significance and is the probability of committing a type I error.

Let’s suppose we’re trying to find the average height of men. It is normally distributed with mean $\mu$ and standard deviation $\sigma$ inches. We take a sample of $n$ men and find that the sample mean is $\bar{x}$ inches. The 95% confidence interval would then be $\bar{x} \pm 2 \sigma / \sqrt{n}$; plugging in the sample values gives us a concrete interval. This is one such interval. If we were to construct 100 such intervals, 95 of them would contain the true parameter $\mu$.

The caveat here is that for simplicity I’ve assumed the critical value to be 2 instead of 1.96 for constructing the interval.
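
To tie the pieces together, here is a sketch of computing such an interval and checking its coverage by simulation; the population parameters and sample size are arbitrary, and it uses the exact 1.96 critical value.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 69, 3, 50   # arbitrary "true" mean, known std dev, and sample size
z = 1.96                   # 95% critical value of the standard normal

covered = 0
trials = 10_000
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    x_bar = sample.mean()
    lo, hi = x_bar - z * sigma / np.sqrt(n), x_bar + z * sigma / np.sqrt(n)
    covered += lo <= mu <= hi

# Roughly 95% of the intervals contain the true mean.
print(covered / trials)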

The Bayesian credible interval

A Bayesian credible interval is an interval that has a high posterior probability, $1 - \alpha$, of containing the true value of the parameter. Compared to the frequentist confidence intervals, which say that $100(1 - \alpha)\%$ of all the intervals calculated will contain the true parameter, the Bayesian credible interval says that the calculated interval has a probability of $1 - \alpha$ of containing the true parameter.

Let’s suppose we’re interested in the proportion of the population [1] that gets 8 hours of sleep every night. The parameter of interest, $p$, now represents a proportion. A sample of 27 people is drawn, of which 11 get 8 hours of sleep every night. Therefore, the observed data is $y = 11$ successes out of $n = 27$ trials, with $y \sim \text{Binomial}(27, p)$.

To calculate a Bayesian credible interval, we need to assume a subjective prior. Suppose the prior was $p \sim \text{Beta}(3.3, 7.2)$. The posterior would thus be $p \mid y \sim \text{Beta}(3.3 + 11, 7.2 + 16) = \text{Beta}(14.3, 23.2)$. This can be calculated using scipy as follows:

from scipy import stats
stats.beta.interval(0.9, 14.3, 23.2)
(0.256110060437748, 0.5138511051170076)

The 90% credible interval is (0.256, 0.514).
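
For contrast, a frequentist could compute a 90% confidence interval for the same data using the normal approximation to the proportion; this is only an illustrative sketch to highlight the different interpretations of the two intervals.

import math

y, n = 11, 27
p_hat = y / n                             # sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
z = 1.645                                 # 90% critical value of the standard normal

print((p_hat - z * se, p_hat + z * se))   # approximately (0.25, 0.56)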

In summary, here are some of the frequentist approaches and their Bayesian counterparts.

| Frequentist | Bayesian |
|-------------|----------|
| Max likelihood estimation (MLE) | Max a posteriori (MAP) |
| Confidence interval | Credible interval |
| Significance test | Bayes factor |

A thing that I have glossed over is handling of nuisance parameters. Bayesian procedures provide a general way of dealing with nuisance parameters. These are the parameters we do not want to make inference about, and we do not want them to interfere with the inference we are making about the main parameter. For example, if we’re trying to infer the mean of a normal distribution, the variance is a nuisance parameter.

The critique

The prior is subjective and can change from person to person. This is the frequentist critique of the Bayesian approach; it introduces subjectivity into the equation. The frequentist approach makes use of only the likelihood. On the other hand, the Bayesian criticism of the frequentist approach is that it uses an implicit prior. Bayes’ theorem can be restated as $\text{posterior} \propto \text{likelihood} \times \text{prior}$. For the frequentist approach, which makes use of the likelihood alone, the prior would need to be set to 1, i.e. a flat prior. The Bayesian approach makes the use of a prior explicit even if it is subjective.

The Bayesian criticism of frequentist procedures is that they do not answer the question that was asked but rather skirt around it. Suppose the question posed was “in what range does the true value of the parameter lie?”. The Bayesian credible interval will give one, albeit subjective, interval. The frequentist confidence interval, however, will give many different intervals. In that sense, frequentism isn’t answering the question posed.

Another Bayesian criticism of frequentist procedures is that they rely on the possible samples that could have occurred but did not, instead of relying on the one sample that did occur. Bayesian procedures treat this sample as the fixed data and vary the parameters around it.

The end.

Max Likelihood

So far what we’ve been doing is point estimation using OLS. Another method of estimation with more mathematical rigor is max likelihood estimation. In this post we will look at likelihood, likelihood function, max likelihood, and how all this ties together in estimating the parameters of our bivariate linear regression model.

Probability

When we think of probability, we are thinking about the chance of something happening in the future. For example, if a coin is tossed, what is the chance of getting a head? It’s $\frac{1}{2}$.

But how did we arrive at the conclusion that the probability of getting a head is $\frac{1}{2}$? This leads us to two varying schools of thought on probability - frequentism and Bayesianism.

If the coin were tossed 1000 times, about half the time the outcome would be a head and half the time it would be a tail. The more frequently an event happens, the more probable / certain it is. This is the frequentist definition of probability. Under this definition, probability is the ratio of the number of times we make the desired observation to the total number of observations.
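As a quick illustration of the frequentist definition, here’s a small simulation (my own sketch, not part of the original example) that tosses a fair coin 1000 times and computes the relative frequency of heads:

import numpy as np

rng = np.random.default_rng(42)

# Simulate 1000 fair coin tosses; 1 represents a head.
tosses = rng.integers(0, 2, size=1000)

# The relative frequency of heads approaches 0.5 as the number of tosses grows.
print(tosses.mean())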

In the Bayesian approach, probability is a measure of our certainty about something. In a coin toss, there’s a 50% chance of getting a head and a 50% chance of getting a tail because there’s only two possible outcomes.

This subtle difference in the definition of probability leads to vastly different statistical methods.

Under frequentism, it is all about the frequency of observations. The event that occurs more frequently is the more probable event. There is no concept of an underlying probability distribution for the event we are trying to observe.

Under Bayesianism, in contrast, each event that occurs follows an observed probability distribution which is based on some underlying true distribution. Calculating the probabilities involves estimating the parameters of the true distribution based on the observed distribution.

Likelihood

Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome[1].

For example, if a coin were tossed 10 times and out of those 10 times a person guessed the outcome correctly 8 times then what’s the probability that the person would guess equally correctly the next time a coin is tossed 10 times? This is the likelihood. Unlike probability, likelihood is about past events.

The likelihood of observing the data is represented by the likelihood function. The likelihood function is the probability density function of observing the given data. Suppose that we have $n$ observations - $x_1, x_2, \ldots, x_n$ - of data which are realizations of a random variable $X$. Let the function $f(x; \theta)$ represent the probability density function of the random variable $X$ with parameters $\theta$. Then the likelihood of observing $x_1, x_2, \ldots, x_n$ is given by the likelihood function as:

$L(\theta; x_1, x_2, \ldots, x_n) = f(x_1; \theta) \cdot f(x_2; \theta) \cdots f(x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$
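As a small illustration, assuming a normal model and made-up observations, the likelihood is just the product of the density evaluated at each data point:

import numpy as np
from scipy import stats

# A few observations assumed to come from a normal distribution (hypothetical data).
x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])

def likelihood(mu, sigma, data):
    # Likelihood is the product of the density evaluated at each observation.
    return np.prod(stats.norm.pdf(data, loc=mu, scale=sigma))

print(likelihood(5.0, 0.5, x))
print(likelihood(3.0, 0.5, x))  # a poorer guess for mu gives a smaller likelihood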

Max Likelihood Estimation

If we were to maximize the likelihood function, we’d end up with the highest probability of observing the data that we have observed, i.e. the max likelihood. The data that we have observed follows some distribution we’d like to estimate. This requires us to estimate the parameters $\theta$ such that they maximize the likelihood function, i.e. $\hat{\theta} = \arg\max_{\theta} L(\theta)$. The estimator $\hat{\theta}$ is therefore the max likelihood estimator of the parameters, the value that maximizes the likelihood function.

The procedure followed in maximizing the likelihood function to compute $\hat{\theta}$ is called max likelihood estimation (MLE). MLE answers the following question: given that we’ve observed data drawn from some distribution with parameters $\theta$, to what value should we set these parameters in the sampling distribution so that the probability of observing this data is the maximum?

The mechanics of finding $\hat{\theta}$ is an optimization problem involving a little bit of differential calculus. It involves calculating the partial derivative of the likelihood function with respect to each of the parameters in $\theta$ and setting them to zero. This yields a system of equations which can then be solved to obtain estimates of the parameters that maximize the function.

Calculating the partial derivatives like this becomes unwieldy very quickly since the likelihood function involves the multiplication of a lot of terms. We can instead maximize the log-likelihood, since taking the logarithm converts the product into a summation and, because the logarithm is monotonic, both are maximized at the same $\hat{\theta}$. It is defined as $\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)$.
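Here’s a minimal sketch of MLE done numerically, assuming a normal model and simulated data; it minimizes the negative log-likelihood with scipy.optimize rather than solving the equations by hand:

import numpy as np
from scipy import optimize, stats

# Hypothetical sample assumed to be drawn from a normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=500)

def negative_log_likelihood(params, data):
    mu, sigma = params
    # Sum of log densities; negated because we minimize.
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = optimize.minimize(negative_log_likelihood, x0=[0.0, 1.0],
                           args=(data,), bounds=[(None, None), (1e-6, None)])
print(result.x)  # estimates of mu and sigma, close to 10 and 2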

This provides us with enough background to apply MLE to find our regression parameters.

MLE for Regression

In regression, we have our data $Y_i$ which we assume is normally distributed with mean $\beta_1 + \beta_2 X_i$ and variance $\sigma^2$. The likelihood function for this can be written as:

$L(\beta_1, \beta_2, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \frac{(Y_i - \beta_1 - \beta_2 X_i)^2}{\sigma^2} \right\}$

In the equation above, the mean of $Y_i$ is $\beta_1 + \beta_2 X_i$ and the variance is $\sigma^2$ because we assume $u_i$ to be normally distributed. As mentioned previously, computing the log-likelihood is easier. The log-likelihood is given as:

$\ln L = -\frac{n}{2} \ln \sigma^2 - \frac{n}{2} \ln(2\pi) - \frac{1}{2} \sum_{i=1}^{n} \frac{(Y_i - \beta_1 - \beta_2 X_i)^2}{\sigma^2}$

To find the ML estimators, we need to differentiate the log-likelihood function partially with respect to $\beta_1$, $\beta_2$, and $\sigma^2$ and set the resulting equations to zero. This gives us:

$\frac{\partial \ln L}{\partial \beta_1} = \frac{1}{\sigma^2} \sum (Y_i - \beta_1 - \beta_2 X_i)$

$\frac{\partial \ln L}{\partial \beta_2} = \frac{1}{\sigma^2} \sum (Y_i - \beta_1 - \beta_2 X_i) X_i$

$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum (Y_i - \beta_1 - \beta_2 X_i)^2$

Setting the above equations to zero and letting $\tilde{\beta}_1$, $\tilde{\beta}_2$, and $\tilde{\sigma}^2$ represent the ML estimators, we get:

$\sum (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i) = 0$

$\sum (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i) X_i = 0$

$\tilde{\sigma}^2 = \frac{1}{n} \sum (Y_i - \tilde{\beta}_1 - \tilde{\beta}_2 X_i)^2$

Simplifying the first two equations above, we get:

$\sum Y_i = n \tilde{\beta}_1 + \tilde{\beta}_2 \sum X_i$

$\sum Y_i X_i = \tilde{\beta}_1 \sum X_i + \tilde{\beta}_2 \sum X_i^2$

The equations above are the normal equations for the OLS estimators. What this means is that the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ are the same as the ML estimators $\tilde{\beta}_1$ and $\tilde{\beta}_2$ when we assume the disturbances are normally distributed. The estimator for the homoscedastic variance under ML is given by $\tilde{\sigma}^2 = \frac{1}{n} \sum \hat{u}_i^2$. This estimator underestimates the true $\sigma^2$, i.e. it is biased downwards. However, in large samples, $\tilde{\sigma}^2$ converges to the true value, which means that it is asymptotically unbiased.
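A small simulation, under assumed parameter values of my own choosing, illustrates the downward bias: the ML estimator of the variance averages below the true value, while the estimator with the $n - 2$ denominator does not:

import numpy as np

rng = np.random.default_rng(1)

n = 20
beta1, beta2, sigma = 2.0, 0.5, 2.0  # true sigma^2 is 4.0
X = np.linspace(0, 10, n)

ml_estimates, unbiased_estimates = [], []
for _ in range(5000):
    Y = beta1 + beta2 * X + rng.normal(0, sigma, size=n)
    # OLS / ML point estimates of the slope and intercept coincide.
    b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b1 = Y.mean() - b2 * X.mean()
    resid = Y - (b1 + b2 * X)
    ml_estimates.append(np.sum(resid ** 2) / n)              # ML estimator of sigma^2
    unbiased_estimates.append(np.sum(resid ** 2) / (n - 2))  # unbiased estimator

print(np.mean(ml_estimates), np.mean(unbiased_estimates))  # the ML average sits below 4.0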

Summary

In summary, MLE is an alternative to OLS. The method, however, requires that we make an explicit assumption about the probability distribution of the data. Under the normality assumption, the estimators for $\beta_1$ and $\beta_2$ are the same in both MLE and OLS. The ML estimator for $\sigma^2$ is biased downwards but converges to the true value as the sample size increases.

Classic Normal Linear Regression Model

The posts so far have covered estimators. An estimator looks at the sample data and comes up with an estimate for some true population parameter. However, how do we know how close, say, $\hat{\beta}_2$ is to the true $\beta_2$? This is a question of statistical inference. To be able to make an inference, $\hat{\beta}_2$ would need to follow some distribution. CLRM makes no assumptions about the probability distribution of the data. CNLRM (Classic Normal Linear Regression Model), however, adds the assumption of normality, i.e. the disturbances - and hence the data and the estimators - are normally distributed.

In this post we will cover the normal distribution, CNLRM, and how the assumption of normality helps us.

Normal Distribution

A probability distribution is a function which provides the probabilities of the outcomes of a random phenomenon. For example, a random variable $X$ could describe the outcomes of rolling a fair die. The probability distribution would then be a function that returns $\frac{1}{6}$ for each of the six outcomes of $X$.

The normal distribution is the most commonly encountered probability distribution. The central limit theorem states that averages (or other statistics like the sum) calculated from samples of observations drawn from a random variable tend to converge to the normal distribution as the sample size grows. The distribution is defined by two parameters - the mean ($\mu$) and the variance ($\sigma^2$). If a random variable $X$ is normally distributed, it is written notationally as $X \sim N(\mu, \sigma^2)$. The probability density function of the normal distribution is given by:

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
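As a quick check with scipy, the density can be evaluated directly, and roughly 95% of the area does lie within two standard deviations of the mean:

from scipy import stats

# Density of a standard normal (mean 0, variance 1) evaluated at the mean.
print(stats.norm.pdf(0.0, loc=0, scale=1))

# Area within two standard deviations of the mean, roughly 0.954.
print(stats.norm.cdf(2) - stats.norm.cdf(-2))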

Distribution of $u_i$, $\hat{\beta}_1$, and $\hat{\beta}_2$

For us to be able to draw any inference about the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$, we will have to make some assumption about the probability distribution of $u_i$. In one of the previous posts we saw that $\hat{\beta}_2 = \frac{\sum x_i y_i}{\sum x_i^2}$. This can alternatively be written as $\hat{\beta}_2 = \sum k_i Y_i$ where $k_i = \frac{x_i}{\sum x_i^2}$. So we have

$\hat{\beta}_2 = \sum k_i Y_i = \sum k_i (\beta_1 + \beta_2 X_i + u_i)$

The $X_i$, the betas, and $k_i$ are fixed and therefore $\hat{\beta}_2$ is a linear function of $u_i$. Under CNLRM, we assume $u_i$ to be normally distributed. A property of normally distributed random variables is that their linear functions are also normally distributed. What this means is that since $\hat{\beta}_1$ and $\hat{\beta}_2$ are linear functions of $u_i$, they are also normally distributed.
As mentioned previously, a normal distribution is described by its mean and variance. This means $u_i$ can also be described by its mean and variance as $u_i \sim N(0, \sigma^2)$. What this shows us is that we assume $u_i$, on average, to be zero and to have a constant variance $\sigma^2$ (homoscedasticity).

A further assumption we make is that $\text{cov}(u_i, u_j) = 0$ where $i \neq j$. This means that $u_i$ and $u_j$ are uncorrelated and independently distributed, because zero correlation between two normally distributed variables implies statistical independence.

Also note that since $u_i$ is normally distributed, so is $Y_i$. We can state this as $Y_i \sim N(\beta_1 + \beta_2 X_i, \sigma^2)$.
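A short simulation, with parameter values chosen arbitrarily, shows the sampling distribution of the slope estimator centring on the true value when the disturbances are normal:

import numpy as np

rng = np.random.default_rng(7)

n = 50
beta1, beta2, sigma = 1.0, 3.0, 2.0
X = np.linspace(0, 10, n)  # X is fixed in repeated sampling

slopes = []
for _ in range(10000):
    u = rng.normal(0, sigma, size=n)   # normally distributed disturbances
    Y = beta1 + beta2 * X + u          # Y is a linear function of u, hence normal
    slopes.append(np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2))

slopes = np.array(slopes)
print(slopes.mean(), slopes.std())  # centred on the true beta2 = 3.0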

Why do we assume normality?

We assume normality for the following reasons:

First, the $u_i$ subsume the effect of a large number of factors that affect the dependent variable but are not included in the regression model. By the central limit theorem, the distribution of the sum of a large number of such factors tends to be normal.

Second, because we assume $u_i$ to be normal, both $\hat{\beta}_1$ and $\hat{\beta}_2$ are normal since they are linear functions of $u_i$.

Third, the normal distribution is fairly easy to understand as it involves just two parameters, and is extensively studied.

Properties of estimators under CNLRM

  1. They are unbiased.
  2. They have minimum variance.

Combining 1. and 2. means that the estimators are efficient - they are minimum-variance unbiased estimators.

  3. They are consistent. This means that as the size of the sample increases, the values of the estimators tend to converge to the true population value.

  4. The estimators have the least variance in the class of all estimators - linear or otherwise. This makes the estimators best unbiased estimators (BUE) as opposed to best linear unbiased estimators (BLUE).

Misc

Earlier in the post I mentioned that $\hat{\beta}_2 = \sum k_i Y_i$. This section gives the derivation of this expression. We will look at the numerator $\sum x_i y_i$ for ease of calculation:

$\sum x_i y_i = \sum x_i (Y_i - \bar{Y}) = \sum x_i Y_i - \bar{Y} \sum x_i = \sum x_i Y_i$

since $\sum x_i = 0$. Dividing by $\sum x_i^2$ then gives $\hat{\beta}_2 = \sum \frac{x_i}{\sum x_i^2} Y_i = \sum k_i Y_i$.

Finito.

Gauss-Markov Theorem

In one of the previous posts we looked at the assumptions underlying the classic linear regression model. In this post we’ll take a look at the Gauss-Markov theorem, which uses these assumptions to state that the OLS estimators have the minimum variance in the class of linear unbiased estimators.

Gauss-Markov Theorem

To recap briefly, some of the assumptions we made were that:

  1. On average, the stochastic error term is zero, i.e. $E(u_i \mid X_i) = 0$.
  2. The data is homoscedastic, i.e. the variance / standard deviation of the stochastic error term is a finite constant regardless of $X_i$.
  3. The model is linear in parameters, i.e. $\beta_1$ and $\beta_2$ have power 1.
  4. There is no autocorrelation between the disturbance terms.

The Gauss-Markov theorem states the following:

Given the assumptions underlying the CLRM, the OLS estimators, in the class of linear unbiased estimators, have minimum variance. That is, they are BLUE (Best Linear Unbiased Estimators).

We’ll go over what it means for an estimator to be best, linear, and unbiased.

Linear

The dependent variable $Y$ is a linear function of the parameters in the model; the model must be linear in parameters. Under this assumption, $Y_i = \beta_1 + \beta_2 X_i + u_i$ and $Y_i = \beta_1 + \beta_2 X_i^2 + u_i$ are both valid models since the parameters enter linearly in both of them.

Unbiased

This means that on average, the value of the estimator is the same as the parameter it is estimating, i.e. $E(\hat{\beta}_2) = \beta_2$.

Best

This means that among the class of linear unbiased estimators, the OLS estimators have the least variance. Consider the two normal distributions shown in the graph above for two linear, unbiased estimators - the OLS estimator $\hat{\beta}_2$ (represented by orange) and an alternative estimator $\beta_2^*$ (represented by blue). Also let’s assume that the true value of $\beta_2$ is zero. Both of these are unbiased since, on average, their value is the same as the true value of the parameter. However, the OLS estimator is better since it has less variance compared to the other estimator.
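As a rough illustration (my own sketch, with made-up parameter values), we can compare the OLS slope with another linear unbiased estimator - the slope through the first and last points - and see that both are unbiased but OLS has the smaller variance:

import numpy as np

rng = np.random.default_rng(3)

n = 30
beta1, beta2, sigma = 5.0, 2.0, 3.0
X = np.linspace(1, 10, n)

ols_slopes, alt_slopes = [], []
for _ in range(10000):
    Y = beta1 + beta2 * X + rng.normal(0, sigma, size=n)
    # OLS slope estimator.
    ols_slopes.append(np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2))
    # An alternative linear unbiased estimator: slope through the first and last points.
    alt_slopes.append((Y[-1] - Y[0]) / (X[-1] - X[0]))

print(np.mean(ols_slopes), np.var(ols_slopes))  # unbiased, small variance
print(np.mean(alt_slopes), np.var(alt_slopes))  # unbiased, but larger variance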

Finito.

Precision of OLS Estimates

The calculation of the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ is based on sample data. As the sample drawn changes, the value of these estimators also changes. This leaves us with the question of how reliable these estimates are, i.e. we’d like to determine the precision of these estimators. This can be determined by calculating the standard error or the coefficient of determination.

Standard Error

The standard error of a statistic[1] is the standard deviation of the sampling distribution of that statistic.

Suppose we have a dataset which contains incomes of people. From this dataset we start drawing samples of size n and calculating the mean. Now if we plot the distribution (the sampling distribution) of the means we calculated, we’ll get a normal distribution centered on the population mean. The standard deviation of this sampling distribution is the standard error of the statistic which in our case is mean.

Standard error is given by the formula:

$\text{se} = \frac{\sigma}{\sqrt{n}}$

where
$\sigma$ is the standard deviation of the population.
$n$ is the sample size.

Calculating the standard error shows the sampling fluctuation. Sampling fluctuation shows the extent to which a statistic takes on different values.

Notice that the standard error has an inverse relation with the sample size n. This means that the larger the sample we draw from the population, the smaller the standard error will be. This will also result in a tighter normal distribution since the standard deviation will be smaller.
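A tiny sketch, assuming a hypothetical population standard deviation, shows the standard error shrinking as the sample size grows:

import numpy as np

sigma = 15.0  # hypothetical population standard deviation of incomes (in thousands)

for n in [25, 100, 400, 1600]:
    # The standard error of the sample mean shrinks as the sample size grows.
    print(n, sigma / np.sqrt(n))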

Standard Error of OLS Estimates

The standard error of the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ is given by:

$\text{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_i^2} \qquad \text{se}(\hat{\beta}_2) = \frac{\sigma}{\sqrt{\sum x_i^2}}$

$\text{var}(\hat{\beta}_1) = \frac{\sum X_i^2}{n \sum x_i^2} \sigma^2 \qquad \text{se}(\hat{\beta}_1) = \sqrt{\frac{\sum X_i^2}{n \sum x_i^2}} \, \sigma$

where $\sigma$ is the square root of the true but unknown constant of homoscedastic variance $\sigma^2$.

All of the terms in the equations above except $\sigma^2$ can be calculated from the sample drawn. Therefore, we will need an unbiased estimator $\hat{\sigma}^2 = \frac{\sum \hat{u}_i^2}{n - 2}$. The denominator $n - 2$ represents the degrees of freedom, and $\sum \hat{u}_i^2$ is the residual sum of squares.

Although $\hat{u}_i = Y_i - \hat{Y}_i$, its sum of squares can be computed with the alternative formula $\sum \hat{u}_i^2 = \sum y_i^2 - \hat{\beta}_2^2 \sum x_i^2$, where $y_i$ and $x_i$ are in deviation form.

All of the terms in the above formula would have already been computed as a part of computing the estimators.
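As a sanity check, here’s a small sketch with simulated data verifying that the alternative formula gives the same residual sum of squares as summing the squared residuals directly:

import numpy as np

rng = np.random.default_rng(5)

# Hypothetical sample to check sum(ucap^2) == sum(y^2) - beta2cap^2 * sum(x^2).
X = np.linspace(1, 10, 20)
Y = 3.0 + 1.5 * X + rng.normal(0, 2.0, size=20)

x, y = X - X.mean(), Y - Y.mean()          # deviation form
beta2cap = np.sum(x * y) / np.sum(x ** 2)
beta1cap = Y.mean() - beta2cap * X.mean()
ucap = Y - (beta1cap + beta2cap * X)       # residuals

print(np.sum(ucap ** 2))
print(np.sum(y ** 2) - beta2cap ** 2 * np.sum(x ** 2))  # same value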

How it all ties together

As you calculate the estimators and draw the sample regression curve that passes through your data, you need some numeric measure of how well the curve fits the data, i.e. a measure of “goodness of fit”. This can be given as:

$\hat{\sigma} = \sqrt{\frac{\sum \hat{u}_i^2}{n - 2}}$

This is the positive square root of the estimator of homoscedastic variance, and is the standard deviation of the $Y$ values about the regression curve.

The standard errors of the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ will show you how much they fluctuate as you draw different samples. The smaller their standard error, the better.

The square of the standard error is called the mean squared error.

Let’s see some code

To start off, let’s load the Boston housing dataset.

import pandas as pd
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2; this snippet needs an older version.
boston = datasets.load_boston()
data, target = boston.data, boston.target

df = pd.DataFrame(data=data, columns=boston.feature_names)
df = df[['RM']]
df['RM'] = df['RM'].apply(round)
df['price'] = target

I’m choosing to use the average number of rooms (RM) as the variable to see how it affects price. I’ve rounded it off to make the calculations easier. The scatter plot of the data looks like the following:

This, of course, is the entire population data, so let’s draw a sample.

sample = df.sample(n=100)

Now let’s bring back our function that calculates the OLS estimates:

def ols(sample):
    # Means of the explanatory and dependent variables.
    Xbar = sample['X'].mean()
    Ybar = sample['Y'].mean()

    # Deviation-form columns used in the OLS formulas.
    sample['x'] = sample['X'] - Xbar
    sample['x_sq'] = sample['x'] ** 2
    sample['y'] = sample['Y'] - Ybar
    sample['xy'] = sample['x'] * sample['y']

    # Slope and intercept estimates.
    beta2cap = sample['xy'].sum() / sample['x_sq'].sum()
    beta1cap = Ybar - (beta2cap * Xbar)

    # Fitted values and residuals.
    sample['Ycap'] = beta1cap + beta2cap * sample['X']
    sample['ucap'] = sample['Y'] - sample['Ycap']

    return beta1cap, beta2cap

and now let’s see the coefficients

sample.rename(columns={'RM': 'X', 'price': 'Y'}, inplace=True)
intercept, slope = ols(sample)

The plot of the regression curve on the sample data looks like the following:

The sample DataFrame now has our intermediate calculations, and we can use them to calculate the standard error. Let’s write a function which does that.

from math import sqrt

def standard_error(sample):
    return sqrt((sample['ucap'] ** 2).sum() / (len(sample) - 2))

Finally, let’s see how much standard deviation we have around the regression line.

standard_error(sample)
7.401174774558201

Coefficient of Determination

The coefficient of determination, denoted by $r^2$ (for bivariate regression) or $R^2$ (for multivariate regression), is the ratio of the explained variance to the total variance of the data. The higher the coefficient of determination, the more accurately the regressors (average number of rooms in the house) explain the regressand (the price of the house), i.e. the better your regression model is.

For bivariate linear regression, the $r^2$ value is given by:

$r^2 = \frac{\text{ESS}}{\text{TSS}} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$

$r^2$ can take on a value between 0 and 1.

Turning this into code, we have:

def coeff_of_determination(sample):
    Ybar = sample['Y'].mean()
    return ((sample['Ycap'] - Ybar) ** 2).sum() / ((sample['Y'] - Ybar) ** 2).sum()

For the sample we’ve drawn, the $r^2$ value comes out to be 0.327. This means that only 32.7% of the total variance in the data is explained by the regression model.

Finito.


[1] A “statistic” is a numerical quantity such as mean, or median.

Assumptions Underlying Ordinary Least Squares

In the previous post we calculated the values of $\hat{\beta}_1$ and $\hat{\beta}_2$ from a given sample, and in closing I mentioned that we’ll take a look at finding out whether the sample regression line we obtained is a good fit. In other words, we need to assess how close $\hat{\beta}_1$ and $\hat{\beta}_2$ are to $\beta_1$ and $\beta_2$, or how close $\hat{Y}_i$ is to $E(Y \mid X_i)$. Recall that the population regression function (PRF) is given by $Y_i = \beta_1 + \beta_2 X_i + u_i$. This shows that the value of $Y_i$ depends on $X_i$ and $u_i$. To be able to make any statistical inference about $Y_i$ (and $\hat{\beta}_1$ and $\hat{\beta}_2$) we need to make assumptions about how the values of $X_i$ and $u_i$ were generated. We’ll discuss these assumptions in this post as they’re important for the valid interpretation of the regression estimates.

Classic Linear Regression Model (CLRM)

There are 7 assumptions that the Gaussian, or classic, linear regression model (CLRM) makes. These are:

1. The regression model is linear in parameters

Recall that the population regression function is given by $Y_i = \beta_1 + \beta_2 X_i + u_i$. This is linear in parameters since $\beta_1$ and $\beta_2$ have power 1. The explanatory variables, however, do not have to be linear.

2. Values of X are fixed in repeated sampling

In this series of posts, we’ll be looking at fixed regressor models[1] where the values of X are considered to be fixed when drawing a random sample. For example, in the previous post the values of X were fixed between 80 and 260. If another sample were to be drawn, the same values of X would be used.

3. Zero mean value of the error term

This assumption states that for every given $X_i$, the error term $u_i$ is zero, on average. This should be intuitive. The regression line passes through the conditional mean $E(Y \mid X_i)$ for every $X_i$. The error terms are the distances of the individual points from the conditional mean. Some of them lie above the regression line and some of them lie below. However, they cancel each other out; the positive cancel out the negative, and therefore $u_i$ is zero, on average. Notationally, $E(u_i \mid X_i) = 0$.

What this assumption means is that all those variables that were not considered to be a part of the regression model do not systematically affect $Y_i$, i.e. their effect on $Y_i$ is zero on average.

This assumption also means that there is no specification error or specification bias. This happens when we choose the wrong functional form for the regression model, exclude necessary explanatory variables, or include unnecessary ones.

4. Constant variance of $u_i$ (homoscedasticity)

This assumption states that the variation of $u_i$ is the same regardless of $X$.

With this assumption, we assume that the variance of $u_i$ for a given $X_i$ (i.e. the conditional variance) is a positive constant value given by $\sigma^2$. This means that the variance for the various $Y$ populations is the same. What this also means is that the variance around the regression line is the same and it neither increases nor decreases with changes in the values of $X$.

5. No autocorrelation between disturbances

Given any two X values, $X_i$ and $X_j$ ($i \neq j$), the correlation between the corresponding $u_i$ and $u_j$ is zero. In other words, the observations are independently sampled. Notationally, $\text{cov}(u_i, u_j \mid X_i, X_j) = 0$. This assumption states that there is no serial correlation or autocorrelation, i.e. $u_i$ and $u_j$ are uncorrelated.

To build an intuition, let’s consider the population regression function

$Y_t = \beta_1 + \beta_2 X_t + u_t$

Now suppose that $u_t$ and $u_{t-1}$ are positively correlated. This means that $Y_t$ depends not only on $X_t$ but also on $u_{t-1}$, because $u_{t-1}$ affects $u_t$. Autocorrelation occurs in timeseries data like stock market trends, where the observation of one day depends on the observation of the previous day. What we assume is that each observation is independent, as it is in the case of the income-expense example we saw.
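As a rough illustration (my own sketch, not part of the original example), we can simulate disturbances where each value depends on the previous one and see that the correlation between successive disturbances is far from zero:

import numpy as np

rng = np.random.default_rng(11)

# Simulate AR(1)-style disturbances: each u_t depends on u_{t-1}, as in timeseries data.
n, rho = 500, 0.8
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.normal(0, 1)

# Correlation between u_t and u_{t-1} is well away from zero, violating the assumption.
print(np.corrcoef(u[1:], u[:-1])[0, 1])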

6. The sample size must be greater than the number of parameters to estimate

This is fairly simple to understand. To be able to estimate $\beta_1$ and $\beta_2$, we need at least 2 pairs of observations.

7. Nature of X variables

There are a few things we assume about the X values. One, for a given sample they are not all the same, i.e. $\text{var}(X)$ is a positive number. Two, there are no outliers among the values.

If we have all the X values the same in a given sample, the regression line would be a horizontal line and it would therefore be impossible to estimate $\hat{\beta}_2$ and thus $\hat{\beta}_1$. This happens because in the equation $\hat{\beta}_2 = \frac{\sum x_i y_i}{\sum x_i^2}$, the denominator becomes zero since $X_i$ and $\bar{X}$ are the same.
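A tiny sketch makes this concrete - with identical X values the denominator of the slope estimator is zero:

import numpy as np

X = np.array([5.0, 5.0, 5.0, 5.0])  # all X values identical
x = X - X.mean()
print(np.sum(x ** 2))  # 0.0 -- the denominator of the slope estimator, so it is undefined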

Furthermore, we assume that there are no outliers in the X values. If there are a few values which lie far from the mean, the regression line we get can be vastly different depending on whether the sample drawn contains these outliers or not.

These are the 7 assumptions underlying OLS. In the next section, I’d like to elucidate on assumptions 3 and 5 a little more.

More on Assumptions 3 and 5

In assumption 3 we state that $E(u_i \mid X_i) = 0$, i.e. on average the disturbance term is zero. For the expected value of a variable given another variable to be zero means the two variables are uncorrelated. We assume these two variables to be uncorrelated so that we can assess the impact each has on the predicted value of $Y_i$. If the disturbance $u_i$ and the explanatory variable $X_i$ were correlated positively or negatively, it would mean that $u_i$ contains some factor that affects the predicted value of $Y_i$, i.e. we’ve skipped an important explanatory variable that should’ve been included in the regression model. This means we have functionally misspecified our regression model and introduced specification bias.

Functionally misspecifying our regression model will also introduce autocorrelation in our model. Consider the two plots shown below.

In the first plot the line fits the sample points properly. In the second one, it doesn’t. The sample points seem to form a curve of some sort while we try to fit a straight line through them. What we get is runs of negative residuals followed by runs of positive residuals, which indicates autocorrelation.

Autocorrelation can be introduced not just by functionally misspecifying the regression model but also by the nature of the data itself. For example, consider timeseries data representing stock market movement. The value of $u_t$ depends on $u_{t-1}$, i.e. the two are correlated.

When there is autocorrelation, the OLS estimators are no longer the best estimators for predicting the value of $Y_i$.

Conclusion

In this post we looked at the assumptions underlying OLS. In the coming post we’ll look at the Gauss-Markov theorem and the BLUE property of OLS estimators.


[1] There are also stochastic regression models where the values of X are not fixed.