In the previous post we looked at computing the Fibonacci series both with and without dynamic programming. In this post we’ll look at another example where dynamic programming is applicable. The example is borrowed from ‘Introduction to Algorithms’ by CLRS and implemented in Python. By the end of this post we’ll try to develop an intuition for when dynamic programming applies.
Rod Cutting
The problem we are presented with is the following: given a steel rod, we’d like to find the optimal way to cut it into smaller rods. More formally, we’re presented with a rod of size $n$ inches and a table of prices $p_i$ for $i = 1, 2, \ldots, n$. We’d like to determine the maximum revenue $r_n$ that we can obtain by cutting the rod and selling it. If the price $p_n$ of the rod of length $n$ is large enough, we may sell the rod without making any cuts.
The table of prices that we’ll work with is given below.
| length $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| price $p_i$ | 1 | 5 | 8 | 9 | 10 | 17 | 17 | 20 | 24 | 30 |
Consider a rod of length 4 inches. The maximum revenue we can obtain is 10, by cutting the rod into two parts of length 2 inches each.
Given a rod of $n$ inches, we may sell it uncut or we may sell it by cutting it into smaller pieces. Since we do not know the size of the cuts to make, we will have to consider all possible sizes. Once we make a cut of size $i$ from the left end of the rod, we can view the remaining length of the rod of size $n - i$ as an independent instance of the rod cutting problem. In other words, we are solving a smaller instance of the same problem. The equation below shows how we can mathematically formulate the problem.

$$r_n = \max_{1 \le i \le n} (p_i + r_{n - i}), \qquad r_0 = 0$$

It states that the revenue $r_n$ is the maximum, over all first cuts of size $i$, of the price $p_i$ plus the revenue $r_{n - i}$ obtained by cutting the remaining rod of size $n - i$. We can write a recursive function to obtain this value as follows:
def rod_cut(p: list[int], n: int) -> int:
    # A rod of length 0 earns nothing.
    if n == 0:
        return 0

    q = float("-inf")

    # Try every size i for the first piece and recurse on the remainder.
    for i in range(1, n + 1):
        q = max(q, p[i] + rod_cut(p, n - i))

    return q
We can verify the results by calling the function for a rod of size 4 and passing the table of prices.
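A minimal sketch of that check is given below. It assumes the price table is stored as a list with a placeholder 0 at index 0, so that p[i] is the price of a rod of length i; this is the same layout used later in the post.

```python
def rod_cut(p: list[int], n: int) -> int:
    """Naive recursive rod cutting; p[i] is the price of a rod of length i."""
    if n == 0:
        return 0

    q = float("-inf")

    for i in range(1, n + 1):
        q = max(q, p[i] + rod_cut(p, n - i))

    return q


# Index 0 is a placeholder so that p[i] lines up with length i.
p = [0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30]

# Two pieces of length 2 earn 5 + 5 = 10.
assert rod_cut(p, 4) == 10
```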
The recursive version, however, does redundant work. Consider the rod of size $n = 4$. We will have to consider cuts of size 1, 2, 3, and 4. When considering the remaining rod of size 3, we’d consider cuts of size 1, 2, and 3. In both of these cases we recompute, for example, the revenue obtained when the remainder of the rod is of size 2.
We can use dynamic programming to solve this problem by modifying the rod_cut function as follows.
def rod_cut(p: list[int], t: list[int | None], n: int) -> int:
    if n == 0:
        return 0

    # Reuse the stored answer if this size has already been solved.
    if t[n] is not None:
        return t[n]

    q = float("-inf")

    for i in range(1, n + 1):
        q = max(q, p[i] + rod_cut(p, t, n - i))

    # Remember the best revenue for a rod of size n.
    t[n] = q

    return q
Notice how we’ve introduced a table t which stores the maximum revenue obtained by cutting a rod of size n. This allows us to reuse previous computations. We can run this function and verify the result.
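One way to run it is sketched below. The table t is assumed to start out filled with None, one slot per rod length, and the price list again carries a placeholder at index 0.

```python
def rod_cut(p, t, n):
    """Memoized rod cutting; t[n] caches the best revenue for length n."""
    if n == 0:
        return 0

    if t[n] is not None:
        return t[n]

    q = float("-inf")

    for i in range(1, n + 1):
        q = max(q, p[i] + rod_cut(p, t, n - i))

    t[n] = q

    return q


p = [0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30]
t = [None] * len(p)

assert rod_cut(p, t, 4) == 10

# The table now holds the best revenue for every length up to 4.
assert t[1:5] == [1, 5, 8, 10]
```

After the call, the entries for lengths 1 through 4 are filled in, so any later call for those lengths returns immediately.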
The question that we’re left with is the following: how did we decide that the problem could be solved with dynamic programming? There are two key factors that help us in deciding if dynamic programming can be used. The first is overlapping subproblems and the second is optimal substructure.
For a dynamic programming algorithm to work, the number of subproblems must be small. This means that the recursive algorithm which solves the problem encounters the same subproblems over and over again. When this happens, we say that the problem we’re trying to solve has overlapping subproblems. In the rod cutting problem, when we try to cut a rod of size 4, we consider cuts of size 1, 2, 3, and 4. When we then consider the remaining rod of size 3, we consider cuts of size 1, 2, and 3. The smaller problem of optimally cutting the rod of size 2 is encountered again. In other words, it is an overlapping subproblem. Dynamic programming algorithms solve an overlapping subproblem once and store its result in a table so that it can be reused.
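To make the overlap visible, here is a small sketch that counts how often the naive recursion is invoked for each rod size. The helper count_calls is hypothetical and exists only for this illustration; it mirrors the naive rod_cut with a counter added.

```python
from collections import Counter


def count_calls(p: list[int], n: int, calls: Counter) -> int:
    """Naive rod cutting that also records how often each size is solved."""
    calls[n] += 1

    if n == 0:
        return 0

    return max(p[i] + count_calls(p, n - i, calls) for i in range(1, n + 1))


p = [0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30]
calls = Counter()
revenue = count_calls(p, 4, calls)

assert revenue == 10
assert calls[2] == 2               # the size-2 subproblem is solved twice
assert sum(calls.values()) == 16   # 2^4 calls in total for n = 4
```

With memoization, each size is solved only once, which is where the savings come from.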
The second factor is optimal substructure. When a problem exhibits optimal substructure, it means that the solution to the problem contains within it the optimal solutions to the subproblems; we build an optimal solution to the problem from optimal solutions to the subproblems. The rod-cutting problem exhibits optimal substructure because the optimal solution to cutting a rod of length $n$ involves finding the optimal solution to cutting the remaining rod, if a cut has been made.
Moving on. So far our algorithm has returned the optimal value of the solution. In other words, it returned the maximum revenue that can be obtained by optimally cutting the rod. We often need to store the choice that led to the optimal solution. In the context of the rod cutting problem, this would be the lengths of the cuts made. We can do this by keeping additional information in a separate table. The following code listing is a modification of the above function with an additional table s to store the value of the optimal cut.
def rod_cut(p: list[int], s: list[int | None], t: list[int | None], n: int) -> int:
    if n == 0:
        return 0

    if t[n] is not None:
        return t[n]

    q = float("-inf")
    j = None

    for i in range(1, n + 1):
        r = p[i] + rod_cut(p, s, t, n - i)

        # Track the size of the first cut that gives the best revenue so far.
        if r > q:
            q = r
            j = i

    s[n] = j
    t[n] = q

    return q
In this version of the code, we store the size of the cut being made for a rod of length n in the table s. Once we have this information, we can reconstruct the optimal solution. The function that follows shows how to do that.
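The original listing is not reproduced here, so what follows is a minimal sketch of such a function, assuming that s[n] holds the size of the first piece in an optimal solution for a rod of length n. It repeatedly takes that piece off and moves on to the remainder. The hand-filled table in the example corresponds to what rod_cut would store for lengths 1 through 4 with the price table above.

```python
def optimal_cuts(s: list, n: int) -> list[int]:
    """Walk the table of first cuts to list the pieces of an optimal solution."""
    cuts = []

    while n > 0:
        cuts.append(s[n])   # size of the first piece for a rod of length n
        n = n - s[n]        # continue with what is left of the rod

    return cuts


# First-cut sizes for lengths 1..4 (index 0 is unused).
s = [None, 1, 2, 3, 2]

assert optimal_cuts(s, 4) == [2, 2]
assert optimal_cuts(s, 3) == [3]
```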
Finally, we call the function to see the optimal solution. Since we know that the optimal solution for a rod of size 4 is to cut it into two equal halves, we’ll use $n = 4$.
n = 4
t = [None] * 11
s = [None] * 11
p = [0, 1, 5, 8, 9, 10, 17, 17, 20, 24, 30]
q = rod_cut(p, s, t, n)

assert [2, 2] == optimal_cuts(s, n)
That’s it. That’s how we can use dynamic programming to optimally cut a rod of size $n$ inches.
In this post, and hopefully in the next few posts, I’d like to delve into the topic of dynamic programming. The aim is to develop an intuition for when it is applicable by solving a few puzzles. I’ll be referring to the chapter on dynamic programming in ‘Introduction to Algorithms’ by CLRS, and elucidating it in my own words.
Dynamic Programming
The chapter opens with the definition of dynamic programming: it is a technique for solving problems by combining solutions to subproblems. The subproblems may have subsubproblems that are common between them. A dynamic programming algorithm solves these subsubproblems only once and saves the result, thereby avoiding unnecessary work. The term “programming” refers to a tabular method in which the results of subsubproblems are saved in a table and reused when the same subsubproblem is encountered again.
All of this is abstract so let’s look at a concrete example of computing the Fibonacci series.
def fibonacci(n: int) -> int:
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)
The call graph for fibonacci(4) is given below.
As we can see, we’re computing fibonacci(2) twice. In other words, the subsubproblem of computing fibonacci(2) is shared between fibonacci(4) and fibonacci(3). A dynamic programming algorithm would solve this subsubproblem only once and save the result in a table and reuse it. Let’s see what that looks like.
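The original listing is missing from this copy of the post, so the following is a minimal sketch of a memoized version. It assumes the table T is a module-level dictionary keyed by n; the original may have organised it differently.

```python
# Table of Fibonacci numbers computed so far, keyed by n.
T = {}


def fibonacci(n: int) -> int:
    # Return the stored result if this subproblem was solved before.
    if n in T:
        return T[n]

    # Otherwise compute it, store it, and return it.
    T[n] = n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)

    return T[n]


assert fibonacci(4) == 3
assert fibonacci(10) == 55
```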
In the code above, we create a table T which stores the Fibonacci numbers. If an entry exists in the table, we return it immediately. Otherwise, we compute and store it. This recursive approach, with results of subsubproblems stored in a table, is called “top-down with memoization”; the table is called the “memo”. We begin with the original problem and then proceed to solve it by finding solutions to smaller subproblems. The procedure which computes the solution is said to be “memoized” as it remembers the previous computations.
Another approach is called “bottom-up” in which the solutions to the smaller subproblems are computed first. This depends on the notion that subproblems have “size”. In this approach, a solution to the subproblem is found only when the solutions to its smaller subsubproblems have been found. We can apply this approach when computing the Fibonacci series.
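A minimal sketch of the bottom-up version is given below, assuming the table is a list indexed by n that is filled from the smallest subproblem upwards.

```python
def fibonacci(n: int) -> int:
    if n < 2:
        return n

    # T[i] will hold the i-th Fibonacci number.
    T = [0] * (n + 1)
    T[1] = 1

    # Each entry only needs the two entries before it, which are already filled.
    for i in range(2, n + 1):
        T[i] = T[i - 1] + T[i - 2]

    return T[n]


assert [fibonacci(i) for i in range(8)] == [0, 1, 1, 2, 3, 5, 8, 13]
```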
As we can see, the larger numbers in the Fibonacci series are computed only when the smaller numbers have been computed.
This was a small example of how dynamic programming algorithms work. They are applied to problems where subproblems share subsubproblems. The solutions to these subsubproblems are stored in a table and reused when they are encountered again. This enables the algorithm to work more efficiently as it avoids the rework of solving the subsubproblems.
In the next post we’ll look at another example of dynamic programming that’s presented in the book and implement it in Python to further our understanding of the subject.
I recently came across the “machine coding” round, also called “low level design” round, of the tech interview process. In this round of the interview, you’re presented with a real-life problem statement, and are expected to create a functioning system in under 90 minutes. The intent is to see how you structure your code, follow best practices, apply design patterns, and so on. In the spirit of practising for this round, this blog post looks at how to create a simple todo list application in Python.
Problem Statement
Let’s start with the problem statement: create a todo list. The very first requirement that comes to mind when creating a todo list is the ability to add tasks to the list. For example, getting groceries. It’s also possible to add subtasks to the list. For example, getting apples and oranges are subtasks when getting groceries. The todo list may be shared with family and friends. They receive notifications when items are added or removed from the list, and have the ability to add or remove items. Only the one who created the list and the ones with whom the list has been shared may modify the list. The todo list may be displayed in the console or in the webpage. Finally, the todo list may be persisted in the database.
With these requirements in mind, we’ll begin by looking at the classes and methods that comprise our system, the design patterns we can apply, and finally go ahead and create the system. Please note that since the implementation is going to be in Python, it may seem like I’m trying to do Java in Python by creating too many classes. For example, Python has the concept of “factory function” rather than “factory method” since functions are first-class citizens of the language. You’ve been warned.
Design Patterns
Let’s begin with a quick refresher on design patterns. In a nutshell, design patterns allow us to write reusable, maintainable, and testable code by providing patterns for solving common software engineering problems. These patterns can be categorized into three types: behavioral, creational, and structural. Behavioral patterns describe how objects interact with each other. Creational patterns describe how objects can be created. Finally, structural patterns describe how objects can be composed to create larger systems.
In the context of the todo list, we’ll see the following patterns being used. We’ll take a quick look at these patterns and then map them onto the classes we’ll create.
Composite pattern allows us to represent hierarchical, tree-like, data structures. The intention of this pattern is to ensure that composite objects (the non-leaf nodes), and individual objects (the leaf nodes) are treated the same. The UML diagram is given below.
As we can see, the non-leaf Composite class is itself a Component and has a collection of Components. These may be Composites or Leafs. In our todo list example, we can model the tasks and subtasks using this pattern. We begin by creating a top-level abstract class TodoItem which corresponds to the top-level Component. The subtask can be represented by a SubTask class and is the Leaf. Finally, we’ll create a Task class which is the Composite.
The adapter pattern allows disparate components to work together seamlessly by reconciling differences between interfaces. It does so by providing a familiar interface that the client can talk to, and in turn using the interface of the underlying component. In other words, adapter pattern translates from one interface to another; the adapter calls specific methods on the adaptee when it is invoked. The UML diagram is given below.
In our todo list example, we can model connection to database using a DatabaseAdapter. This allows us to expose a familiar set of methods to perform operations like saving the todo list to the database while abstracting the nitty-gritties of the actual database. This class can then be subclassed to provide functionality for specific databases. For example, a MySQLDatabaseAdapter for using MySQL as the database.
The advantage of using this approach is that it allows seamless migration between databases; we may start small with SQLiteDatabaseAdapter and then switch to MySQLDatabaseAdapter by simply changing which class we use. Since the interface exposed is the same, there will be minimal refactoring in the client code to make this transition.
The strategy pattern allows defining a family of algorithms, encapsulating each one, and making them interchangeable. This means we can define and pass algorithms around. There is a base class Strategy that defines the interface for the algorithm, and one or more concretions that provide the actual implementation. The UML diagram is given below.
In our todo list example, we can design the rendering of the list as a strategy. For example, we can have a markdown strategy, an HTML strategy, etc. These different strategies produce different representations of the todo list.
The proxy pattern encapsulates an object and allows itself to be used in place of the original object. Since the proxy is used in place of the original object, it enables use cases such as controlling access. There can be methods on the proxy which require authentication before the operation is performed on the original object.
In our todo list example, we can create a proxy which ensures that only the owner of the list and those with whom it is shared can make changes to it. To do this, we’ll create an interface called TodoListProtocol which defines methods which the todo list must implement. This will be used by the TodoList class, which represents the actual todo list, and by the TodoListProxy which provides access control for the todo list. The proxy will require that when a method call is made, the user making the call also be passed as an argument. Only if the user is the owner of the list or is one of the collaborators will the operation be performed.
The observer pattern allows objects, the observers, to be notified when the state of another object, the observable, changes.
In our todolist example, we’ll create observers which get notified when the state of the todo list changes. We’ll create an abstract class called Observable which will be inherited by the TodoList class. This makes the class “observable”. We’ll create another class TodoListObserver which inherits the Observer class. This makes it the “observer” and it’ll be notified when changes happen to the todo list.
Now that we’ve looked at the design patterns we’ll use, let’s look at the code.
Code
User
We’ll begin with the simplest class first, the User class which represents a user of the system. The owners and collaborators of the list will be represented by this class.
import dataclasses as dc

@dc.dataclass(frozen=True)
class User:
    email: str
Observer Pattern
Next, let’s add classes to create the observer pattern. We’ll begin by creating the abstract observable class. It’ll store the list of observers it has to notify, and require the subclasses to provide an implementation to return their state. Since the observers hold a reference to the observable they are observing, the method which returns the state will be used to find what’s changed.
The implementation for the Observable class is shown below.
Similarly, we’ll implement the Observer class. It requires its subclasses to provide an implementation for the update method which will be called by the Observable it’s observing.
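The listings for these two base classes are missing from this copy, so here is a minimal sketch of what they could look like. The method name attach is an assumption; notify, update, state, and the observable attribute follow the names used elsewhere in the post. Counter and RecordingObserver are hypothetical stand-ins that only exercise the base classes.

```python
import abc


class Observer(abc.ABC):
    # Set by the caller after construction, e.g. `observer.observable = todolist`.
    observable = None

    @abc.abstractmethod
    def update(self) -> None:
        """Called by the observable whenever its state changes."""


class Observable(abc.ABC):
    def __init__(self) -> None:
        self.observers = []

    def attach(self, observer: Observer) -> None:  # hypothetical method name
        self.observers.append(observer)

    def notify(self) -> None:
        for observer in self.observers:
            observer.update()

    @property
    @abc.abstractmethod
    def state(self):
        """Return whatever the subclass considers its current state."""


class Counter(Observable):  # hypothetical observable
    def __init__(self) -> None:
        super().__init__()
        self.value = 0

    @property
    def state(self):
        return self.value

    def increment(self) -> None:
        self.value += 1
        self.notify()


class RecordingObserver(Observer):  # hypothetical observer
    def __init__(self) -> None:
        self.seen = []

    def update(self) -> None:
        self.seen.append(self.observable.state)


counter = Counter()
observer = RecordingObserver()
counter.attach(observer)
observer.observable = counter
counter.increment()
counter.increment()

assert observer.seen == [1, 2]
```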
Finally, we’ll create the concrete observer TodoListObserver. It notifies the user by sending them an email when the list is updated. For simplicity, however, we’ll just log a message to the console.
class TodoListObserver(Observer):

    def __init__(self, user: "User"):
        self.user = user

    def update(self):
        state = self.observable.state
        print(f"Notify {self.user.email} about changed state")
Composite Pattern
Next we’ll create the composite pattern. We’ll create a base class TodoItem which will be inherited by both the Task and SubTask class. It has basic fields that are required to define a task like an ID, a title, etc. and a couple of base methods to mark the item as complete or check if it’s complete.
The first of the subclasses is the Task class which we’ll implement next. A task is considered complete when it has been marked complete and so have all of its subtasks. Similarly, marking a task complete marks all of its subtasks as complete.
If we look at the TodoItem, Task, SubTask classes, we can see the hierarchical structure where each Task, a non-leaf component, may have zero or more SubTasks, leaf components. Since both the Task and SubTask are instances of TodoItem, they have the same interface and can be used interchangeably.
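Since the listings for these three classes are not reproduced here, the following is a minimal sketch of the hierarchy. The field and method names (id, title, mark_complete, is_complete, add) are assumptions chosen to match the description above, not necessarily the original code.

```python
import abc
import uuid


class TodoItem(abc.ABC):
    """Component: what tasks and subtasks have in common."""

    def __init__(self, title: str) -> None:
        self.id = uuid.uuid4().hex
        self.title = title
        self._complete = False

    def mark_complete(self) -> None:
        self._complete = True

    def is_complete(self) -> bool:
        return self._complete


class SubTask(TodoItem):
    """Leaf: has no children of its own."""


class Task(TodoItem):
    """Composite: holds zero or more subtasks."""

    def __init__(self, title: str) -> None:
        super().__init__(title)
        self.subtasks = []

    def add(self, subtask: SubTask) -> None:
        self.subtasks.append(subtask)

    def mark_complete(self) -> None:
        # Completing a task completes every subtask as well.
        super().mark_complete()
        for subtask in self.subtasks:
            subtask.mark_complete()

    def is_complete(self) -> bool:
        # Complete only if the task and all of its subtasks are complete.
        return super().is_complete() and all(s.is_complete() for s in self.subtasks)


groceries = Task(title="Get groceries")
apples = SubTask(title="Get apples")
oranges = SubTask(title="Get oranges")
groceries.add(apples)
groceries.add(oranges)

apples.mark_complete()
partially_done = groceries.is_complete()

groceries.mark_complete()
```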
Adapter Pattern
Next we’ll create a simple database adapter with a single method to save the todo list. There is only one method for the sake of simplicity but it’s easy to see how there can be more of these methods. We’ll start with a protocol called DatabaseAdapter. Think of a Python protocol to be similar to a Java interface.
from typing import Protocol

class DatabaseAdapter(Protocol):

    def save_list(self, todolist: "TodoList"): ...
Next we’ll create two concrete classes which implement this protocol. The first class creates an adapter for MySQL and another for SQLite. Both of these classes take an instance of their specific database and use it to persist the todo list. Since each database instance may have its own set of methods to save data, an adapter provides a familiar interface that can be used elsewhere in the code.
class MySQLAdapter(DatabaseAdapter):

    def __init__(self, db):
        self.db = db

    def save_list(self, todolist: "TodoList"):
        ...

class SQLiteAdapter(DatabaseAdapter):

    def __init__(self, db):
        self.db = db

    def save_list(self, todolist: "TodoList"):
        ...
Strategy Pattern
Next, we’ll create strategies to render the todo list. We’ll start by creating a protocol called RenderingStrategy with a single method called render which returns a string representation of the todo list.
class RenderingStrategy(Protocol):

    def render(self, todolist: "TodoList") -> str: ...
We’ll add a concrete strategy called TableRenderingStrategy which displays the tasks and subtasks in tabular format.
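The original implementation is not shown in this copy, so below is a rough sketch of what a table renderer could look like. The SubTask, Task, and TodoList dataclasses are simplified, hypothetical stand-ins used only to keep the example self-contained, and the column layout is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:  # hypothetical stand-in
    title: str
    complete: bool = False


@dataclass
class Task:  # hypothetical stand-in
    title: str
    complete: bool = False
    subtasks: list = field(default_factory=list)


@dataclass
class TodoList:  # hypothetical stand-in
    tasks: list = field(default_factory=list)


class TableRenderingStrategy:
    def render(self, todolist: TodoList) -> str:
        # One row per task, followed by one row per subtask.
        rows = [("Task", "Subtask", "Done")]
        for task in todolist.tasks:
            rows.append((task.title, "", "yes" if task.complete else "no"))
            for subtask in task.subtasks:
                rows.append(("", subtask.title, "yes" if subtask.complete else "no"))

        # Pad every cell to the width of the widest cell in its column.
        widths = [max(len(row[col]) for row in rows) for col in range(3)]
        return "\n".join(
            " | ".join(cell.ljust(width) for cell, width in zip(row, widths))
            for row in rows
        )


todolist = TodoList(tasks=[
    Task("Get groceries", subtasks=[SubTask("Get apples", complete=True), SubTask("Get oranges")]),
])
rendered = TableRenderingStrategy().render(todolist)
print(rendered)
```

Because the renderer only depends on the render method, a Markdown or HTML strategy could be swapped in without touching the todo list itself.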
To create the proxy pattern, we’ll create the protocol, TodoListProtocol, for both the todo list and the proxy. Let’s start with the protocol which defines the methods that are common to both the todo list and the proxy. As we can see, there are methods to mark the list as complete, to search for tasks, to render the list, and so on.
Notice how it implements both TodoListProtocol and Observable. In the add method, we call notify which updates all the observers for this list.
Finally, we’ll add the proxy for TodoList. The proxy authenticates each call to the underlying TodoList by checking whether the user trying to access the list is the owner or a collaborator.
def raise_if_not_authenticated(self, user: User):
    if not self.is_authenticated_user(user=user):
        raise Exception(f"User {user.email} is not authenticated.")

def is_authenticated_user(self, user: User):
    return (self.todolist.owner == user) or (user in self.todolist.collaborators)
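To show how these two methods fit into the proxy, here is a minimal, self-contained sketch. The TodoList below is a stripped-down, hypothetical stand-in with only an owner, collaborators, and an add method, and the email addresses are made up for the example.

```python
import dataclasses as dc


@dc.dataclass(frozen=True)
class User:
    email: str


class TodoList:
    # Hypothetical, stripped-down stand-in for the real TodoList.
    def __init__(self, owner: User) -> None:
        self.owner = owner
        self.collaborators = []
        self.tasks = []

    def add(self, task) -> None:
        self.tasks.append(task)


class TodoListProxy:
    def __init__(self, todolist: TodoList) -> None:
        self.todolist = todolist

    def add(self, task, user: User) -> None:
        # Check the caller before delegating to the real list.
        self.raise_if_not_authenticated(user=user)
        self.todolist.add(task)

    def raise_if_not_authenticated(self, user: User) -> None:
        if not self.is_authenticated_user(user=user):
            raise Exception(f"User {user.email} is not authenticated.")

    def is_authenticated_user(self, user: User) -> bool:
        return (self.todolist.owner == user) or (user in self.todolist.collaborators)


owner = User("jane.doe@gmail.com")
proxy = TodoListProxy(todolist=TodoList(owner=owner))
proxy.add(task="Get groceries", user=owner)

try:
    proxy.add(task="Walk the dog", user=User("mallory@example.com"))
    rejected = False
except Exception:
    rejected = True
```

The owner’s call goes through, while the call from a user who is neither the owner nor a collaborator is rejected before it reaches the list.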
Running the Code
Let’s wire the pieces together and run them. We’ll create an adapter, a rendering strategy, and an observer. Since we’ll use dependency injection, we’ll pass these objects to the appropriate methods.
Notice we’ve added tasks and subtasks to the list. We’ve also added a collaborator and an observer to the list. We’ll finish adding the observer by updating its observable property. This will allow it to fetch the state from the list when it gets updated.
observer.observable = todolist
Next, we’ll create a proxy for the list so that we can authenticate the calls.
proxy = TodoListProxy(todolist=todolist)
We can now make method calls to see the code in action. Let’s begin by adding a task.
proxy.add(
    task=Task(title="Walk the dog"),
    user=User("john.doe@gmail.com"),
)
This produces the following output in the console. Since John Doe added an item to the list, Jane Doe will be notified of the change.
I’d recently written about an experimental library to detect PII. When discussing it with an acquaintance of mine, I was told that PII can also be disguised. For example, a corpus of text like a review or a comment can contain an email address in a form like “johndoeatgmaildotcom”. This led me to update the library so that emails like these can also be flagged. In a nutshell, I had to update the regex which was used to find the email.
Example
This is best explained with a few examples. In all of the examples, we begin with a proper email and disguise it one step at a time.
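The original examples are not reproduced in this copy, so the sketch below is only an illustration of the idea and not the regex the library actually uses. It assumes that “@” may be replaced by the word “at” and “.” by the word “dot”, with optional whitespace around either. The name looks_like_email is hypothetical.

```python
import re

# Illustrative pattern only; it will produce false positives on ordinary prose.
DISGUISED_EMAIL = re.compile(
    r"[a-z0-9._%+-]+?"     # local part, e.g. "john.doe"
    r"\s*(?:@|at)\s*"      # "@" or the word "at"
    r"[a-z0-9-]+?"         # domain name, e.g. "gmail"
    r"\s*(?:\.|dot)\s*"    # "." or the word "dot"
    r"[a-z]{2,}",          # top-level domain, e.g. "com"
    re.IGNORECASE,
)


def looks_like_email(text: str) -> bool:
    return DISGUISED_EMAIL.search(text) is not None


# A proper email, disguised one step at a time.
steps = [
    "john.doe@gmail.com",
    "john.doe at gmail.com",
    "john.doe at gmail dot com",
    "johndoeatgmaildotcom",
]

assert all(looks_like_email(step) for step in steps)
assert not looks_like_email("see you tomorrow")
```

A pattern this permissive trades precision for recall, so anything it flags is best treated as a candidate for review.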
I recently created an experimental library, detectpii, to detect PII in relational databases. In this post we’ll take a look at the rationale behind the library and its architecture.
Rationale
A common requirement in many software systems is to store PII information. For example, a food ordering system may store the user’s name, address, phone number, and email. This information may also be replicated into the data warehouse. As a matter of good data governance, you may want to restrict access to such information. Many data warehouses allow applying a masking policy that makes it easier to redact such values. However, you’d have to specify which columns to apply this policy to. detectpii makes it easier to identify such tables and columns.
My journey into creating this library began by looking for open-source projects that would help identify such information. After some research, I did find a few projects that do this. The first project is piicatcher. It allows comparing the column names of tables against regular expressions that represent common PII column names like user_name. The second project is CommonRegex which allows comparing column values against regular expression patterns like emails, IP addresses, etc.
detectpii combines these two libraries to allow finding column names and values that may potentially contain PII. Out of the box, the library allows scanning column names and a subset of its values for potentially PII information.
Architecture
At the heart of the library is the PiiDetectionPipeline. A pipeline consists of a Catalog, which represents the database we’d like to scan, and a number of Scanners, which perform the actual scan. The library ships with two scanners - the MetadataScanner and the DataScanner. The first compares column names of the tables in the catalog against known patterns for the ones which store PII information. The second compares the value of each column of the table by retrieving a subset of the rows and comparing them against patterns for PII. The result of the scan is a list of column names that may potentially be PII.
The design of the library is extensible and more scanners can be added. For example, a scanner to use a proprietary machine learning algorithm instead of regular expression match.
Usage
To perform a scan, we create a pipeline and pass it a catalog and a list of scanners. To initiate the scan, we call the scan method on the pipeline to get back a list of PII columns.
from detectpii.catalog import PostgresCatalog
from detectpii.pipeline import PiiDetectionPipeline
from detectpii.scanner import DataScanner, MetadataScanner

# -- Create a catalog to connect to a database / warehouse
pg_catalog = PostgresCatalog(
    host="localhost",
    user="postgres",
    password="my-secret-pw",
    database="postgres",
    port=5432,
    schema="public"
)

# -- Create a pipeline to detect PII in the tables
pipeline = PiiDetectionPipeline(
    catalog=pg_catalog,
    scanners=[
        MetadataScanner(),
        DataScanner(percentage=20, times=2,),
    ]
)

# -- Scan for PII columns.
pii_columns = pipeline.scan()
That’s it. That’s how to use the library to detect PII columns in tables.
In the exercises that follow the chapter on algebraic manipulations, there is a question that pertains to expressing an integer as the sum of two other integers squared. We are then asked to find expressions for the multiples of , namely and . In this post, we’ll take a look at the solution provided by the authors for the first half of the question, , and then come up with a method of our own to solve the second half, .
Question
If is an integer that can be expressed as the sum of two integer squares, show that both and can also be expressed as the sum of two integer squares.
Solution
From the question, $n = a^2 + b^2$ since it is the sum of two integer squares. This means $2n = 2a^2 + 2b^2$. We need to find two integers such that when their squares are summed, we end with $2a^2 + 2b^2$. From the solution, these are the numbers $(a + b)$ and $(a - b)$ because when they are squared and summed, we get $2a^2 + 2b^2$. This is the result of $(a^2 + 2ab + b^2) + (a^2 - 2ab + b^2)$. Go ahead and expand them to verify the result.
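Written out, the expansion looks as follows. Here $a$ and $b$ are taken to be the two integers whose squares sum to $n$; the cross terms $2ab$ and $-2ab$ cancel, which is what makes the pair work.

```latex
\begin{aligned}
(a + b)^2 + (a - b)^2 &= (a^2 + 2ab + b^2) + (a^2 - 2ab + b^2) \\
                      &= 2a^2 + 2b^2 \\
                      &= 2(a^2 + b^2) = 2n
\end{aligned}
```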
What do we deduce from this? We find that both the expressions contributed an and a . These were added together to get the final result. How do we use this to get ? Notice that we’re squaring the integers. This means that, for example, one of them would have to contribute an and the other would have to contribute a ; similar logic applies for .
This leaves us with two pairs of numbers — and . Let’s square and sum both of these numbers one-by-one.
In a previous post we’d seen how to create a realtime data platform with Pinot, Trino, Airflow, and Debezium. In this post we’ll see how to set up a data catalog using DataHub. A data catalog, as the name suggests, is an inventory of the data within the organization. Data catalogs make it easy to find the data within the organisation like tables, data sets, reports, etc.
Before we begin
My setup consists of Docker containers required to run DataHub. While DataHub provides features like data lineage, column assertions, and much more, we will look at three of the simpler features. One, we’ll look at creating a glossary of the terms that will be used frequently in the organization. Two, we’ll catalog the datasets and views that we saw in the previous post. Three, we’ll create an inventory of dashboards and reports created for various departments within the organisation.
The rationale for this is as follows. Imagine a day in the life of a business analyst. Their responsibilities include creating reports and dashboards for various departments. For example, the marketing team may want to see an “orders by day” dashboard so that they can correlate the effects of advertising campaigns with an uptick in the volume of orders. Similarly, the product team may want a report of which features are being used by the users. The requests of both of these teams will be served by the business analysts using the data that’s been brought into the data platform. While they create these reports and dashboards, it’s common for them to receive queries asking where a team member can find a certain report or how to interpret a data point within a report. They may also have to search for tables and data sets to create new reports, acquaint themselves with the vocabulary of the various departments, and so on.
A data catalog makes all of this a more efficient process. In the following sections we’ll see how we can use DataHub to do it. For example, we’ll create the definition of the term “order”, create a list of reports created for the marketing department, and bring in the views and data sets so that they become searchable.
The work of data scientists is similar, too, because they create data sets that can be reused across various models. For example, data sets representing features for various customers can be stored in the platform, made searchable, and used with various models. They, too, benefit from having a data catalog.
Finally, it helps bring people up to speed with the data that is consumed by their department or team. For example, when someone joins the marketing team, pointing them to the data catalog helps them get productive quickly by finding the relevant reports, terminology, etc.
Ingesting data sets and views
To ingest the tables and views, we’ll create a data pipeline which ingests metadata from AWS Glue and writes it to the metadata service. This is done by creating a YAML configuration in DataHub that specifies where to ingest the metadata from, and where to write it. Once this is created, we can schedule it to run periodically so that it stays updated with Glue.
The image above shows how we define a “source” and how we ingest it into a “sink”. Here we’ve specified that we’d like to read from Glue and write it to DataHub’s metadata service.
Once the source and destination are defined, we can set a schedule to run the ingestion. This will bring in the metadata about the data sets and views we’ve created in Glue.
The image above shows that a successful run of the ingestion pipeline brings in the views and data sets. These are then browsable in the UI. Similarly, they are also searchable as shown in the following image.
This makes it possible for the analysts and the data scientists to quickly locate data sets.
Defining the vocabulary
Next, we’ll create the definition of the word “order”. This can be done from the UI as shown below. The definition can be added by editing the documentation.
Once created, this is available under “Glossary” and in search results.
Data products
Finally, we’ll create a data product. This is the catalog of reports and dashboards created for various departments. For example, the image below shows a dashboard created for the marketing team.
Expanding the dashboard allows us to look at the documentation for the report. This could contain the definition of the terms used in the report, as shown on the bottom right, a link to the dashboard in Superset, definitions of data points, report owners, and so on.
That’s it. That’s how a data catalog helps streamline working with data.
As I continue working my way through the book on programming puzzles, I came across those involving permutations. In this post I’ll collect puzzles with the same theme, both from the book and from the internet.
All permutations
The first puzzle is to compute all the permutations of a given array. By extension, it can be used to compute all the permutations of a string, too, if we view it as an array of characters. To do this we’ll implement Heap’s algorithm. The following is its recursive version.
def heap(permutations: list[list], A: list, n: int):
    if n == 1:
        # A holds one complete permutation; store a copy of it.
        permutations.append(list(A))
    else:
        heap(permutations, A, n - 1)

        for i in range(n - 1):
            # The element swapped into the last position depends on the parity of n.
            if n % 2 == 0:
                A[i], A[n - 1] = A[n - 1], A[i]
            else:
                A[0], A[n - 1] = A[n - 1], A[0]

            heap(permutations, A, n - 1)
The array permutations is the accumulator which will store all the permutations of the array. The initial arguments to the function would be an empty accumulator, the list to permute, and the length of the list.
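As a quick usage sketch, here is the function called on a three-element list. The function is repeated so that the example runs on its own.

```python
def heap(permutations: list, A: list, n: int) -> None:
    if n == 1:
        permutations.append(list(A))
    else:
        heap(permutations, A, n - 1)

        for i in range(n - 1):
            if n % 2 == 0:
                A[i], A[n - 1] = A[n - 1], A[i]
            else:
                A[0], A[n - 1] = A[n - 1], A[0]

            heap(permutations, A, n - 1)


permutations = []          # the empty accumulator
heap(permutations, [1, 2, 3], 3)

# Three elements give 3! = 6 distinct permutations.
assert len(permutations) == 6
```

Note that the permutations come out in the order Heap’s algorithm generates them, which is not lexicographical order.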
Next permutation
The next puzzle we’ll look at is computing the next permutation of the array in lexicographical order. The following implementation has been taken from the book.
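The book’s listing is not reproduced in this copy. The sketch below is the standard algorithm for this puzzle and may differ in detail from the book’s version. It assumes that an empty list is returned when the input is already the last permutation.

```python
def next_permutation(perm: list[int]) -> list[int]:
    perm = list(perm)  # work on a copy

    # Find the rightmost position k whose element is smaller than its successor.
    k = len(perm) - 2
    while k >= 0 and perm[k] >= perm[k + 1]:
        k -= 1

    # The whole list is non-increasing, so there is no next permutation.
    if k < 0:
        return []

    # Find the rightmost element after k that is larger than perm[k] and swap.
    l = len(perm) - 1
    while perm[l] <= perm[k]:
        l -= 1
    perm[k], perm[l] = perm[l], perm[k]

    # The suffix is in decreasing order; reverse it to make it the smallest.
    perm[k + 1:] = reversed(perm[k + 1:])

    return perm


assert next_permutation([1, 2, 3]) == [1, 3, 2]
assert next_permutation([1, 3, 2]) == [2, 1, 3]
assert next_permutation([3, 2, 1]) == []
```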
A variation of the puzzle is to compute the previous permutation of the array in lexicographical order. The idea is to “reverse” the logic for computing the next permutation. If we look closely, we’ll find that all we’re changing are the comparison operators.
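A sketch of that idea follows. It is the mirror image of the standard next-permutation algorithm with the comparisons flipped, and it assumes that an empty list is returned when the input is already the first permutation.

```python
def previous_permutation(perm: list[int]) -> list[int]:
    perm = list(perm)  # work on a copy

    # Find the rightmost position k whose element is larger than its successor.
    k = len(perm) - 2
    while k >= 0 and perm[k] <= perm[k + 1]:
        k -= 1

    # The whole list is non-decreasing, so there is no previous permutation.
    if k < 0:
        return []

    # Find the rightmost element after k that is smaller than perm[k] and swap.
    l = len(perm) - 1
    while perm[l] >= perm[k]:
        l -= 1
    perm[k], perm[l] = perm[l], perm[k]

    # The suffix is in increasing order; reverse it to make it the largest.
    perm[k + 1:] = reversed(perm[k + 1:])

    return perm


assert previous_permutation([1, 3, 2]) == [1, 2, 3]
assert previous_permutation([2, 1, 3]) == [1, 3, 2]
assert previous_permutation([1, 2, 3]) == []
```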
The final puzzle we’ll look at is the one where we need to compute the k’th smallest permutation. The solution to this uses the previous_permutation function that we saw above. The idea is to call this function k times on the lexicographically-largest array. Sorting the array in decreasing order results in the largest permutation. This becomes the input to the previous_permutation function.
```python
def kth_smallest_permutation(perm: list[int], k: int) -> list[int]:
    # -- Arrange the numbers in decreasing order
    # -- thereby creating the lexicographically largest permutation
    perm = sorted(perm, reverse=True)

    for _ in range(k):
        perm = previous_permutation(perm)

    return perm
```
That’s it. These are puzzles involving permutations.
I am working my way through a classic book of programming puzzles. As I work through more of these puzzles, I’ll share what I discover to solidify my understanding and help others who are doing the same. If you’ve ever completed puzzles on a site like leetcode, you’ll notice that the sheer volume of puzzles is overwhelming. However, there are patterns to these puzzles, and becoming familiar with them makes it easier to solve them. In this post we’ll take a look at one such pattern - two pointers - and see how it can be used to solve puzzles involving arrays.
Two Pointers
The idea behind two pointers is that there are, as the name suggests, two pointers that traverse the array, with one pointer leading the other. Using these two pointers we update the array and solve the puzzle at hand. As an illustrative example, let us consider the puzzle where we’re given an array of even and odd numbers and we’d like to move all the even numbers to the front of the array.
Even and Odd
```python
def even_odd(A: list[int]) -> None:
    """Move even numbers to the front of the array."""
    write_idx = 0
    idx = 0

    while idx < len(A):
        if A[idx] % 2 == 0:
            A[write_idx], A[idx] = A[idx], A[write_idx]
            write_idx = write_idx + 1

        idx = idx + 1
```
The two pointers here are idx and write_idx. While idx traverses the array and indicates the current element, write_idx indicates the position where the next even number should be written. Whenever idx points to an even number, it is written at the position indicated by write_idx. With this logic, if all the numbers in the array are even, idx and write_idx point to the same element i.e. the number is swapped with itself and the pointers are moved forward.
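To make the pointer movement concrete, here is a small usage sketch on a sample list. The input values are only illustrative.

```python
def even_odd(A: list[int]) -> None:
    """Move even numbers to the front of the array."""
    write_idx = 0
    idx = 0

    while idx < len(A):
        if A[idx] % 2 == 0:
            # Swap the even number into the next free slot at the front.
            A[write_idx], A[idx] = A[idx], A[write_idx]
            write_idx += 1

        idx += 1


numbers = [1, 2, 3, 4, 5, 6]
even_odd(numbers)
print(numbers)  # [2, 4, 6, 1, 5, 3]
```

The even numbers keep their relative order, while the odd numbers may be reordered by the swaps.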
We’ll build upon this technique to remove duplicates from the array.
Remove Duplicates
Consider a sorted array containing duplicate numbers. We’d like to keep only one occurrence of each number and overwrite the rest. This can be solved using two pointers as follows.
```python
def remove_duplicates(A: list[int]) -> int:
    """Remove all duplicates in the array."""
    write_idx, idx = 1, 1

    while idx < len(A):
        if A[write_idx - 1] != A[idx]:
            A[write_idx] = A[idx]
            write_idx = write_idx + 1

        idx = idx + 1

    return write_idx
```
In this solution, idx and write_idx start at index 1 instead of 0. The reason is that we’d like to look at the number to the left of write_idx, and starting at index 1 allows us to do that. Notice also how the if condition checks for duplicates in the vicinity of write_idx; the number to the left of write_idx should be different from the one that idx is presently pointing to.
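A short usage sketch: the return value is the number of unique elements, and only that prefix of the array is meaningful afterwards. The sample input is illustrative.

```python
def remove_duplicates(A: list[int]) -> int:
    """Remove all duplicates in the sorted array; return the count of unique values."""
    write_idx, idx = 1, 1

    while idx < len(A):
        # Compare against the last value that was kept.
        if A[write_idx - 1] != A[idx]:
            A[write_idx] = A[idx]
            write_idx += 1

        idx += 1

    return write_idx


numbers = [1, 1, 2, 2, 3]
count = remove_duplicates(numbers)
print(count)            # 3
print(numbers[:count])  # [1, 2, 3]
```

As written, the function assumes a non-empty array, since write_idx starts at 1.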
As a variation, move the duplicates to the end of the array instead of overwriting them.
Move Duplicates to the End
```python
def remove_duplicates(A: list[int]) -> int:
    """Remove all duplicates in the array and move them to the end"""
    write_idx, idx = 1, 1

    while idx < len(A):
        if A[write_idx - 1] != A[idx]:
            A[write_idx], A[idx] = A[idx], A[write_idx]
            write_idx = write_idx + 1

        idx = idx + 1

    return write_idx
```
As another variation, remove a given number from the array by moving it to the end.
```python
while idx < len(A):
    if A[idx] != k:
        A[write_idx], A[idx] = A[idx], A[write_idx]
        write_idx = write_idx + 1

    idx = idx + 1
```
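Wrapped into a complete function, the loop could look like the sketch below. The name move_to_end, the starting value of 0 for both pointers, and the returned count are assumptions made to get a runnable example.

```python
def move_to_end(A: list[int], k: int) -> int:
    """Move every occurrence of k to the end; return how many other elements remain."""
    write_idx, idx = 0, 0  # assumed starting positions

    while idx < len(A):
        if A[idx] != k:
            # Keep this element by swapping it into the front region.
            A[write_idx], A[idx] = A[idx], A[write_idx]
            write_idx += 1

        idx += 1

    return write_idx


numbers = [3, 1, 3, 2, 3, 4]
kept = move_to_end(numbers, 3)
print(kept)     # 3
print(numbers)  # [1, 2, 4, 3, 3, 3]
```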
With this same pattern, we can now change the puzzle to state that we want at most two instances of the number in the sorted array.
Remove Duplicates Variation
```python
def remove_duplicates(A: list[int]) -> int:
    """Keep at most two instances of the number."""
    write_idx = 2

    for idx in range(2, len(A)):
        if A[write_idx - 2] != A[idx]:
            A[write_idx] = A[idx]
            write_idx = write_idx + 1

    return write_idx
```
Akin to the previous puzzle, we look for duplicates in the vicinity of write_idx. While in the previous puzzle the if condition checked for one number to the left, in this variation we look at two positions to the left of write_idx to keep at most two instances. The remainder of the logic is the same. As a variation, try keeping at most three instances of the number in the sorted array.
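One way to approach the three-instance variation is to turn the look-back distance into a parameter. The sketch below does that; the name keep_at_most and the limit parameter are hypothetical, and limit is assumed to be at least 1.

```python
def keep_at_most(A: list[int], limit: int) -> int:
    """Keep at most `limit` instances of each number in the sorted array."""
    # The first `limit` elements can always stay where they are.
    write_idx = min(limit, len(A))

    for idx in range(limit, len(A)):
        # Look `limit` positions to the left of write_idx.
        if A[write_idx - limit] != A[idx]:
            A[write_idx] = A[idx]
            write_idx += 1

    return write_idx


numbers = [1, 1, 1, 1, 2, 2, 3]
count = keep_at_most(numbers, 3)
print(numbers[:count])  # [1, 1, 1, 2, 2, 3]
```

With limit set to 1 or 2 this behaves like the two earlier solutions.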
Finally, we’ll use the same pattern to solve the Dutch national flag problem.
Dutch National Flag
In this problem, we sort the array by dividing it into three distinct regions. The first region contains elements less than the pivot, the second region contains elements equal to the pivot, and the third region contains elements greater than the pivot.
```python
def dutch_national_flag(A: list[int], pivot_idx: int) -> None:
    """Divide the array into three distinct regions."""
    pivot = A[pivot_idx]
    write_idx = 0
    idx = 0

    # --- Move all elements less than pivot to the front
    while idx < len(A):
        if A[idx] < pivot:
            A[write_idx], A[idx] = A[idx], A[write_idx]
            write_idx = write_idx + 1
        idx = idx + 1

    idx = write_idx

    # -- Move all elements equal to the pivot to the middle
    while idx < len(A):
        if A[idx] == pivot:
            A[write_idx], A[idx] = A[idx], A[write_idx]
            write_idx = write_idx + 1
        idx = idx + 1

    # -- All elements greater than pivot have been moved to the end.
```
This problem combines everything we’ve seen so far about two pointers and divides the array into three distinct regions. As we compute the first two regions, the third region is computed as a side-effect.
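For comparison, the same partition can also be done in a single pass with three indices. This is a different technique from the two-pass solution above, shown only as a sketch; the function name and the index names smaller, equal, and larger are assumptions.

```python
def dutch_national_flag_one_pass(A: list[int], pivot_idx: int) -> None:
    """Partition A into < pivot, == pivot, > pivot regions in a single pass."""
    pivot = A[pivot_idx]
    # A[:smaller] holds elements below the pivot, A[smaller:equal] holds elements
    # equal to it, and A[larger:] holds elements above it.
    smaller, equal, larger = 0, 0, len(A)

    while equal < larger:
        if A[equal] < pivot:
            A[smaller], A[equal] = A[equal], A[smaller]
            smaller += 1
            equal += 1
        elif A[equal] == pivot:
            equal += 1
        else:
            # Move the large element to the back; the swapped-in element
            # is unclassified, so `equal` is not advanced here.
            larger -= 1
            A[equal], A[larger] = A[larger], A[equal]


numbers = [3, 1, 2, 3, 0, 5, 3, 4]
dutch_national_flag_one_pass(numbers, 0)

# 0 for "less than", 1 for "equal to", 2 for "greater than" the pivot value 3.
regions = [(x > 3) + (x >= 3) for x in numbers]
print(regions == sorted(regions))  # True
```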
We can now solve a variation of the Dutch national flag partitioning problem by accepting a list of pivot elements. In this variation all the numbers within the list of pivots appear together i.e. all the elements equal to the first pivot element appear first, equal to second pivot element appear second, and so on.
```python
def dutch_national_flag(A: list[int], pivots: list[int]) -> None:
    """This is a variation in which all elements with same key appear together."""
    write_idx = 0
```
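A sketch of how this variation could be written, reusing the "move equal elements forward" pass from the Dutch national flag solution once per pivot. The function name group_by_pivots and the loop over pivots are assumptions, not the post's own listing.

```python
def group_by_pivots(A: list[int], pivots: list[int]) -> None:
    """Group elements equal to each pivot together, in the order the pivots are given."""
    write_idx = 0

    for pivot in pivots:
        # Everything before write_idx is already grouped, so start from there.
        idx = write_idx

        while idx < len(A):
            if A[idx] == pivot:
                A[write_idx], A[idx] = A[idx], A[write_idx]
                write_idx += 1
            idx += 1


numbers = [3, 1, 2, 3, 1, 2, 4]
group_by_pivots(numbers, [2, 3, 1])
print(numbers)  # [2, 2, 3, 3, 1, 1, 4]
```

Elements that match none of the pivots end up after all the grouped elements.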
I’d previously written about creating a realtime data warehouse with Apache Doris and Debezium. In this post we’ll see how to create a realtime data platform with Pinot, Trino, Airflow, Debezium, and Superset. In a nutshell, the idea is to bring together data from various sources into Pinot using Debezium, transform it using Airflow, use Trino for query federation, and use Superset to create reports.
Before We Begin
My setup consists of Docker containers for running Pinot, Airflow, Debezium, and Trino. Like in the post on creating a warehouse with Doris, we’ll create a person table in Postgres and replicate it into Kafka. We’ll then ingest it into Pinot using its integrated Kafka consumer. Once that’s done, we’ll use Airflow to transform the data to create a view that makes it easier to work with it. Finally, we can use Superset to create reports. The intent of this post is to create a complete data platform that makes it possible to derive insights from data with minimal latency. The overall architecture looks like the following.
Getting Started
We’ll begin by creating a schema for the person table in Pinot. This will then be used to create a realtime table. Since we want to use Pinot’s upsert capability to maintain the latest record of each row, we’ll ensure that we define the primary key correctly in the schema. In the case of the person table, it is the combination of the id and the customer_id field. The schema looks as follows.
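A minimal sketch of what such a schema could contain, written as a Python dictionary that is dumped to JSON. Only the table name, the id and customer_id primary key, and the source and op columns come from this post; the data types and the ts_ms time column are assumptions and should be adjusted to the actual Debezium payload.

```python
import json

# Hypothetical schema for the person table; the data types are assumptions.
person_schema = {
    "schemaName": "person",
    "dimensionFieldSpecs": [
        {"name": "id", "dataType": "INT"},
        {"name": "customer_id", "dataType": "INT"},
        {"name": "source", "dataType": "STRING"},
        {"name": "op", "dataType": "STRING"},
    ],
    # Realtime tables need a time column; ts_ms is an assumed name.
    "dateTimeFieldSpecs": [
        {
            "name": "ts_ms",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
    # Upserts keep the latest record per primary key.
    "primaryKeyColumns": ["id", "customer_id"],
}

print(json.dumps(person_schema, indent=2))
```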
We’ll use the schema to create the realtime table in Pinot. Using ingestionConfig we’ll extract fields out of the Debezium payload and into the columns defined above. This is defined below.
With these steps done, the change data from Debezium will be ingested into Pinot. We can view this using Pinot’s query console.
This is where we begin to integrate Airflow and Trino. While the data has been ingested into Pinot, we’ll use Trino for querying. There are two main reasons for this. One, this allows us to federate queries across multiple sources. Two, Pinot’s SQL capabilities are limited. For example, there is no support, as of writing, for creating views. To circumvent these limitations we’ll create a Hive connector in Trino and use it to query Pinot.
The first step is to connect Trino and Pinot. We’ll do this using the Pinot connector.
```sql
CREATE CATALOG pinot USING pinot
WITH (
    "pinot.controller-urls" = 'pinot-controller:9000'
);
```
Next we’ll create the Hive connector. This will allow us to create views, and more importantly materialized views which act as intermediate datasets or final reports, which can be queried by Superset. I’m using AWS Glue instead of Hive so you’ll have to change the configuration accordingly.
We’ll create a schema to store the views and point it to an S3 bucket.
```sql
CREATE SCHEMA hive.views
WITH (
    "location" = 's3://your-bucket-name-here/views/'
);
```
We can then create a view on top of the Pinot table using Hive.
```sql
CREATE OR REPLACE VIEW hive.views.person AS
SELECT id,
       customer_id,
       JSON_EXTRACT_SCALAR(source, '$.name') AS name,
       op
FROM pinot.default.person;
```
Finally, we’ll query the view.
```
trino> SELECT * FROM hive.views.person;
 id | customer_id | name  | op
----+-------------+-------+----
  1 |           1 | Fasih | r
  2 |           2 | Alice | r
  3 |           3 | Bob   | r
(3 rows)
```
While this helps us ingest and query the data, we’ll take this a step further and use Airflow to create the views instead. This allows us to create views which are time-constrained. For example, if we have an order table which contains all the orders placed by the customers, using Airflow allows us to create views which are limited to, say, the last year by adding a WHERE clause.
We’ll use the TrinoOperator that ships with Airflow and use it to create the view. To do this, we’ll create an sql folder under the dags folder and place our query there. We’ll then create the DAG and operator as follows.
```python
import pendulum
from airflow import DAG
from airflow.providers.trino.operators.trino import TrinoOperator

dag = DAG(
    dag_id="create_views",
    catchup=False,
    schedule="@daily",
    start_date=pendulum.now("GMT")
)

person = TrinoOperator(
    task_id="person",
    trino_conn_id="trino",
    sql="sql/views/person.sql",
    dag=dag
)
```
Workflow
The kind of workflow this setup enables is the one where the data engineering team is responsible for ingesting the data into Pinot and creating the base views on top of it. The business intelligence / analytics engineering, and data science teams can then use Airflow to create datasets that they need. These can be created as materialized views to speed up reporting or training of machine learning models. Another advantage of this setup is that bringing in older data, say, of the last two years instead of one, is a matter of changing the query of the base view. This avoids complicated backfills and speeds things up significantly.
As an aside, it is possible to use DBT instead of TrinoOperator. It can be used in conjunction with TrinoOperator, too. However, I preferred using the in-built operator to keep the stack simpler.
Cost
Before we conclude, we’ll quickly go over how to keep the cost of the Pinot cluster low while using this setup. The official documentation says that data can be separated by age; older data can be stored on HDDs while newer data can be stored on SSDs. This lowers the cost of the cluster.
An alternative approach is to keep all the data in HDDs and load subsets into Hive for querying. This also allows changing the date range of the views by simply updating the queries. In essence, Pinot becomes the permanent storage for data while Trino and Hive become the intermediate query and storage layer.
That’s it. That’s how we can create a realtime data platform using Pinot, Trino, Debezium, and Airflow.
In the chapter on exponents the authors mention that if a base is raised to both a power and a root, we should calculate the root first and then the power. This works perfectly well. However, reversing the order produces correct results, too. In this post we’ll see why that works using the properties of exponents.
Let’s say we have a base $b$ that is raised to power $m$ and root $n$. We could write this as $b^{m/n}$. From the properties of exponents, we could rewrite this as $\left(b^{1/n}\right)^{m}$. Alternatively, it can be written as $b^{\frac{1}{n} \cdot m}$. Since multiplication is commutative, we can switch the order of operations and rewrite it as $b^{m \cdot \frac{1}{n}} = \left(b^{m}\right)^{1/n}$. This means we can calculate the power first and then take the root.
Let’s take a look at a numerical example. Consider $8^{2/3}$. We know that the answer should be equal to $4$. If we were to calculate the root first, we get $\left(\sqrt[3]{8}\right)^{2} = 2^{2} = 4$. If we were to calculate the power first and then take the root, we’d get $\sqrt[3]{8^{2}} = \sqrt[3]{64} = 4$. As we can see, we get the same result.
Therefore, we can apply the operations in any order.
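The claim can also be checked numerically. The sketch below compares both orders for a few positive bases using floating-point arithmetic; the function names and sample values are illustrative, and the comparison uses a tolerance because fractional powers are not exact in floating point.

```python
import math


def root_then_power(base: float, power: int, root: int) -> float:
    """Take the root first, then raise the result to the power."""
    return (base ** (1 / root)) ** power


def power_then_root(base: float, power: int, root: int) -> float:
    """Raise to the power first, then take the root."""
    return (base ** power) ** (1 / root)


# Positive bases only, so that every root is a real number.
samples = [(8, 2, 3), (16, 3, 4), (27, 2, 3), (5, 7, 2)]
for base, power, root in samples:
    first = root_then_power(base, power, root)
    second = power_then_root(base, power, root)
    print(base, power, root, math.isclose(first, second))
```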
While reading through the chapter on solving quadratics of a math textbook, I came across a paragraph where the authors mention that factoring quadratics takes a bit of ingenuity, experience, and dumb luck. I spent some time creating an alternative method from the one mentioned in the book which makes factoring quadratics a matter of following a simple set of steps, and removes the element of luck from it. In this post I will review some of the concepts mentioned in the book, and solve through one of the more difficult problems to illustrate my method.
Concepts
Let’s begin by looking at the quadratic $x^2 + bx + c$. This can be factored as $(x + p)(x + q)$. Multiplying the terms gives us $x^2 + (p + q)x + pq$. Comparing this to the coefficients of the original quadratic gives us $pq = c$, and $p + q = b$. For a simple quadratic, we can guess the values of $p$ and $q$. For quadratics where it is not so obvious, we need hints to guide us along the way.
We can get insights into the signs of $p$ and $q$ by looking at the product and the sum of coefficients of the quadratic. If the product is positive, then they have the same signs. If the product is negative, they have different signs. This makes intuitive sense. In case the product is positive, the sum tells us whether they are both positive or both negative.
We will use this again when we look at the alternative method to factor a quadratic. First, however, we will look at a different type of quadratic where the coefficient of $x^2$ is not 1.
Consider the quadratic $ax^2 + bx + c$. It can be factored as $(px + q)(rx + s)$. Multiplying the terms gives us $prx^2 + (ps + qr)x + qs$. As in the previously mentioned quadratic, we can get the product and the sum terms by comparing the quadratic with the general form we just derived. Here $qs = c$, and $ps + qr = b$; in addition, $pr = a$. A small nuance to keep in mind is that if the coefficient of $x^2$ were negative, we’d factor out a $-1$ to make it positive.
Now we move on to the problem and the method. You’ll find that although the method is tedious, it will remove the element of luck from the process.
Method
Consider the quadratic $49x^2 - 316x + 132$. Here, $a = 49$, $b = -316$, and $c = 132$. From the guiding hints mentioned in the previous section, we notice that the signs of $q$ and $s$ are the same; they are either both positive or both negative. We begin the method by defining a function $f$ which returns a set of pairs of all the factors of its argument. Therefore, we can write $f(a)$ and $f(c)$ as follows.
$f(49) = \{(1, 49), (7, 7)\}$
$f(132) = \{(1, 132), (2, 66), (3, 44), (4, 33), (6, 22), (11, 12)\}$
We will now get to the tedious part. We will create combinations of $(p, r, q, s)$. We pick a pair of factors of $a$ for $(p, r)$ and match it with a pair of factors of $c$ for $(q, s)$, trying both signs and both orders. For the sake of brevity, we will only show some of the combinations to illustrate the process.
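| $p$ | $r$ | $q$ | $s$ | $ps + qr$ |
| --- | --- | --- | --- | --- |
| 7 | 7 | 2 | 66 | 476 |
| 7 | 7 | -2 | -66 | -476 |
| 7 | 7 | 66 | 2 | 476 |
| 7 | 7 | -66 | -2 | -476 |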
Notice how we swap the values of $q$ and $s$ in the columns above. This is because we’re trying to find the value of $ps + qr$, and it will change depending on what the values of $q$ and $s$ are. Similarly, we’ll have to swap the values of $p$ and $r$; this is not apparent in the table above because both the values are $7$. We will have to continue on with the table since we are yet to find a combination of numbers which equals $-316$. The table is quite large so we’ll skip the rest of the entries for the sake of brevity and look at the one which gives us the value we’re looking for.
| $p$ | $r$ | $q$ | $s$ | $ps + qr$ |
| --- | --- | --- | --- | --- |
| 49 | 1 | -22 | -6 | -316 |
We can now write our factors as $(49x - 22)(x - 6)$. This gives us $x = \frac{22}{49}$ or $x = 6$.
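The search through the table can also be automated. The following is a sketch of the method under a few assumptions: the coefficients are non-zero integers, the names factor_pairs and factor_quadratic are hypothetical, and the order in which combinations are tried differs from the table above, so it may find the same factors with the brackets swapped.

```python
from itertools import product


def factor_pairs(n: int) -> list[tuple[int, int]]:
    """Return all ordered integer pairs (u, v) with u * v == n, negatives included."""
    pairs = []
    for u in range(1, abs(n) + 1):
        if n % u == 0:
            v = n // u
            pairs.append((u, v))
            pairs.append((-u, -v))
    return pairs


def factor_quadratic(a: int, b: int, c: int):
    """Find integers (p, q, r, s) with (px + q)(rx + s) == ax^2 + bx + c, or None."""
    for (p, r), (q, s) in product(factor_pairs(a), factor_pairs(c)):
        # Restricting p to positive values skips sign-flipped duplicates.
        if p > 0 and p * s + q * r == b:
            return p, q, r, s
    return None


p, q, r, s = factor_quadratic(49, -316, 132)
print(f"({p}x + {q})({r}x + {s})")  # (1x + -6)(49x + -22)
```

Because every ordered pair is generated, the swaps of $p$ with $r$ and of $q$ with $s$ are covered without any extra bookkeeping.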
That’s it. That’s how we can factor quadratics by creating combinations of the factors of their coefficients.