So, I am taking a break from my JVM JIT series to write this post about a table summarization algorithm that I had implemented way back in my college days. Simply put, the algorithm finds the most descriptive patterns in the table that succinctly convey the meaning of the rows contained i.e. summarize it. This post will focus on what the algorithm is and how it’s implemented in JavaScript. Take this post with a grain of salt because the implementation was a part of my college assignment and hasn’t been thoroughly checked for correctness.

### What is TSum?

TSum is the algorithm published by Google Research ^{[link to the pdf]}. To quote the abstract:

Given a table where rows correspond to records and columns correspond to attributes, we want to find a small number of patterns that succinctly summarize the dataset. … TSum, a method that provides a sequence of patterns ordered by their “representativeness.” It can decide both which these patterns are, as well as how many are necessary to properly summarize the data.

An algorithm like this is useful in situations where the volume of data is so large that it’d be impossible for a human to deduce anything from it simply by looking at the data. For example, given patient records, a description like “most patients are middle-aged men with cholestrol, followed by child patients with chickenpox” provides a very human-readable summary of the data.

The algorithm views finding the best pattern as a compression problem — the best pattern will give the best compression.

### Definitions

#### Pattern and Pattern Size

A pattern

#### Pattern List and Pattern Set

A pattern list is an ordered sequence of patterns

#### Compression Saving

Compression saving of a pattern

where

and

^{th} attribute.

#### Residue Data Table

Given a table

#### Pattern Marhsalling

Given a set of patterns

### Generating Patterns - Local Expansion

Local expansion is a pattern generation strategy that tries to “grow” the patterns to increase the compression savings on the data record. In this approach, the algorithm will start with single attributes first and and find a single-condition pattern

### Code

We’ll start with simpler code first. Let’s start by writing a function to calculate the benefit.

1 | ; |

The actual translation of formula to code happens from line #25. We start by finding `N`

which is the number of records in the table that match the pattern. Im using `lodash`

‘s `filter`

function to avoid the boilerplate of having to find the matching records myself. `W`

is the accumulator in which I will sum the number of bits each attribute in the pattern take. `attributes`

are the attributes in the pattern. Then for each attribute, we use `bitsFor`

and add it up with the value of `W`

. Finally, we return the value according to the formula.

Simply put, benefit is

Next, let’s write code to find

1 | ; |

This is a fairly straightforward translation of the formula to code and needs no explanation.

We now have all the code we need to calculate overhead so let’s do that.

1 | ; |

Notice that I start by `require`

-ing the `log.js`

module we just saw. I am using `filter`

to find the number of rows in the table that match the pattern. On the next line I find the number of attributes in a given record of the table. Since the algorithm assumes no null / empty values for any attribute, we can safely pick up the 0^{th} record and see how many attributes it contains.

Now that we have `benefit`

and `overhead`

in place, let’s calculate `saving`

1 | ; |

Now let’s write code for pattern marshalling.

1 | ; |

Given a set of patterns `patterns`

list and remove the chosen pattern from the current list of patterns. We continue the next itreation on the residue table. We repeat this until we have no more patterns to consider.

The next piece of the puzzle is to write code for local expansion.

1 | ; |

As mentioned in the definitions, local expansion grows a pattern, starting with an empty pattern. The expansion is done in `expand`

function.

1 | const _ = require("lodash"); |

The final piece of code we need to look at is to calculate the coverage i.e. how much of the data is described by a given pattern.

1 | ; |

Now that we have all the machinery in place, let’s write a simple test. We’ll take the patients example given in the paper and turn it into JSON objects. Then we’ll write a simple script to run our code using this data and check whether the results make sense.

Here’s the data:

1 | [ |

Here’s the test:

1 | ; |

Now let’s run this:

1 | $ node test/test1.js |

The output says that the most descriptive pattern is “adult male” which makes up 50% of the rows followed by “children with low blood pressure” which make up 30% of the rows. if we look at the sample data, out of the 10 rows, 5 are “adult male” and 3 are “children with low blood pressure”. So the output of the algorithm checks out.

Finito.