# Implementing TSum: An Algorithm for Table Summarization

So, I am taking a break from my JVM JIT series to write this post about a table summarization algorithm that I had implemented way back in my college days. Simply put, the algorithm finds the most descriptive patterns in the table that succinctly convey the meaning of the rows contained i.e. summarize it. This post will focus on what the algorithm is and how it’s implemented in JavaScript. Take this post with a grain of salt because the implementation was a part of my college assignment and hasn’t been thoroughly checked for correctness.

### What is TSum?

TSum is the algorithm published by Google Research [link to the pdf]. To quote the abstract:

Given a table where rows correspond to records and columns correspond to attributes, we want to find a small number of patterns that succinctly summarize the dataset. … TSum, a method that provides a sequence of patterns ordered by their “representativeness.” It can decide both which these patterns are, as well as how many are necessary to properly summarize the data.

An algorithm like this is useful in situations where the volume of data is so large that it’d be impossible for a human to deduce anything from it simply by looking at the data. For example, given patient records, a description like “most patients are middle-aged men with cholestrol, followed by child patients with chickenpox” provides a very human-readable summary of the data.

The algorithm views finding the best pattern as a compression problem — the best pattern will give the best compression.

### Definitions

#### Pattern and Pattern Size

A pattern $P$ is a tuple $(A_1 = u_1, …, A_D = u_D)$ where $u_i$ can can be either a specific value or a “don’t care” value represented by $\ast$. The number of matching attributes in a pattern is called the size of the pattern and is denoted by $size(P)$. An attribute is considered matching if and only if its value doesn’t equal “don’t care” value.

#### Pattern List and Pattern Set

A pattern list is an ordered sequence of patterns $\mathcal{P} = [P_1, P_2 …]$ while a pattern set is a set of patterns $\mathcal{P} = \{P_1, P_2 … \}$ with no order.

#### Compression Saving

Compression saving of a pattern $P$ on a table $\mathcal{T}$, denoted as $Saving(P,\mathcal{T})$ is the amount of compression it can achieve. Specifically, it is the difference of bits between he uncompressed and compressed representations of the data records covered by the pattern $P$. Let $N$ be the number of records covered by pattern $P$ and $D$ be the number of attributes in $\mathcal{T}$. Then,

where

and

$W_i$ is the average number of bits in the ith attribute.
$D$ is the number of attributes.
$log^*(N) = 2 \ast \lceil log_2(N+2) \rceil$

#### Residue Data Table

Given a table $\mathcal{T}$ and a pattern collection $\mathcal{P}$, the residue data table contains the records that are not covered by any of the patterns in $\mathcal{P}$.

#### Pattern Marhsalling

Given a set of patterns $\mathcal{S}$, the pattern marshalling algorithm picks a pattern from $\mathcal{S}$ which has the highest compression saving. After every iteration, the records in the table which have been covered by the patterns chosen so far will be removed from consideration.

### Generating Patterns - Local Expansion

Local expansion is a pattern generation strategy that tries to “grow” the patterns to increase the compression savings on the data record. In this approach, the algorithm will start with single attributes first and and find a single-condition pattern $P = \{ (A=a) \}$ that has the best compression saving, and then expand the pattern by adding other conditions until the compression cost cannot be improved. To find the next pattern, the same procedure is repeated, but only on the residue data table - the part of the table not covered by any of the patterns found so far.

### Code

We’ll start with simpler code first. Let’s start by writing a function to calculate the benefit.

The actual translation of formula to code happens from line #25. We start by finding N which is the number of records in the table that match the pattern. Im using lodash‘s filter function to avoid the boilerplate of having to find the matching records myself. W is the accumulator in which I will sum the number of bits each attribute in the pattern take. attributes are the attributes in the pattern. Then for each attribute, we use bitsFor and add it up with the value of W. Finally, we return the value according to the formula.

Simply put, benefit is $N - 1$ times the total number of bits each attribute in the pattern would take.

Next, let’s write code to find $log^\ast(N)$

This is a fairly straightforward translation of the formula to code and needs no explanation.

We now have all the code we need to calculate overhead so let’s do that.

Notice that I start by require-ing the log.js module we just saw. I am using filter to find the number of rows in the table that match the pattern. On the next line I find the number of attributes in a given record of the table. Since the algorithm assumes no null / empty values for any attribute, we can safely pick up the 0th record and see how many attributes it contains.

Now that we have benefit and overhead in place, let’s calculate saving

Now let’s write code for pattern marshalling.

Given a set of patterns $\mathcal{S}$ and a table $\mathcal{T}$, we start by making a copy of the patterns passed to us. The aim is to select a pattern which gives us non-negative compression. If we find such a pattern, we add it to our patterns list and remove the chosen pattern from the current list of patterns. We continue the next itreation on the residue table. We repeat this until we have no more patterns to consider.

The next piece of the puzzle is to write code for local expansion.

As mentioned in the definitions, local expansion grows a pattern, starting with an empty pattern. The expansion is done in expand function.

The final piece of code we need to look at is to calculate the coverage i.e. how much of the data is described by a given pattern.

Now that we have all the machinery in place, let’s write a simple test. We’ll take the patients example given in the paper and turn it into JSON objects. Then we’ll write a simple script to run our code using this data and check whether the results make sense.

Here’s the data:

Here’s the test:

Now let’s run this:

The output says that the most descriptive pattern is “adult male” which makes up 50% of the rows followed by “children with low blood pressure” which make up 30% of the rows. if we look at the sample data, out of the 10 rows, 5 are “adult male” and 3 are “children with low blood pressure”. So the output of the algorithm checks out.

Finito.