Implementing TSum: An Algorithm for Table Summarization
So, I am taking a break from my JVM JIT series to write this post about a table summarization algorithm that I implemented way back in my college days. Simply put, the algorithm finds the most descriptive patterns in a table, patterns that succinctly convey the meaning of the rows it contains, i.e., summarize it. This post will focus on what the algorithm is and how it's implemented in JavaScript. Take this post with a grain of salt, because the implementation was part of a college assignment and hasn't been thoroughly checked for correctness.
Given a table where rows correspond to records and columns correspond to attributes, we want to find a small number of patterns that succinctly summarize the dataset. TSum is a method that provides a sequence of patterns ordered by their "representativeness". It can decide both which patterns these are and how many are necessary to properly summarize the data.
An algorithm like this is useful in situations where the volume of data is so large that it would be impossible for a human to deduce anything simply by looking at it. For example, given patient records, a description like "most patients are middle-aged men with cholesterol, followed by child patients with chickenpox" provides a very human-readable summary of the data.
The algorithm views finding the best pattern as a compression problem — the best pattern will give the best compression.
Definitions
Pattern and Pattern Size
A pattern is a tuple $P = (p_1, p_2, \ldots, p_D)$ where each $p_i$ can be either a specific value or a "don't care" value, represented by $*$. The number of matching attributes in a pattern is called the size of the pattern and is denoted by $size(P)$. An attribute is considered matching if and only if its value doesn't equal the "don't care" value.
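In the JavaScript implementation that follows, a pattern is represented as a plain object whose keys are the matched attributes; any attribute absent from the object is implicitly "don't care". For example:

```javascript
// matches every adult male row, whatever the remaining attributes hold
const pattern = { gender: "M", age: "adult" };

// size(P) = number of matched attributes = 2
const size = Object.keys(pattern).length;
```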
Pattern List and Pattern Set
A pattern list is an ordered sequence of patterns while a pattern set is a set of patterns with no order.
Compression Saving
Compression saving of a pattern $P$ on a table $T$, denoted as $saving(P, T)$, is the amount of compression it can achieve. Specifically, it is the difference in bits between the uncompressed and compressed representations of the data records covered by the pattern $P$. Let $N$ be the number of records covered by the pattern $P$ and $D$ be the number of attributes in $T$. Then,

$$saving(P, T) = benefit(P, T) - overhead(P, T)$$

where

$$benefit(P, T) = (N - 1) \cdot \sum_{i \in P} W_i$$

and

$$overhead(P, T) = D + \log^*(N)$$

$W_i$ is the average number of bits in the $i$th attribute and $D$ is the number of attributes.
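A quick worked example with my own made-up numbers, using the $\log^*(N) \approx 2\log_2(N + 2)$ approximation the code adopts later: suppose a pattern matches two string attributes whose values cost $W = 96$ bits in total, and covers $N = 5$ records of a table with $D = 3$ attributes. Then $benefit = (5 - 1) \cdot 96 = 384$ bits, $overhead = 3 + 2\log_2(7) \approx 8.6$ bits, and so $saving \approx 375$ bits.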
Residue Data Table
Given a table $T$ and a pattern collection $S$, the residue data table $T_S$ contains the records that are not covered by any of the patterns in $S$.
Pattern Marshalling
Given a set of patterns $S$, the pattern marshalling algorithm picks the pattern from $S$ which has the highest compression saving. After every iteration, the records in the table which have been covered by the patterns chosen so far will be removed from consideration.
Generating Patterns - Local Expansion
Local expansion is a pattern generation strategy that tries to "grow" the patterns to increase the compression savings on the data records. In this approach, the algorithm starts with single attributes first, finds the single-condition pattern that has the best compression saving, and then expands the pattern by adding other conditions until the compression cost cannot be improved. To find the next pattern, the same procedure is repeated, but only on the residue data table, the part of the table not covered by any of the patterns found so far.
Code
We'll start with the simpler code first: a function to calculate the benefit.
"use strict"; const _ = require("lodash"); /** * The benefit of using this pattern to represent the table * @param T the table containing rows * @param pattern the pattern to be applied */ functionbenefit(pattern, T){ if(pattern == null) return0;
// a function for internal use. // find out the bits needed to encode the value functionbitsFor(value) { if( typeof(value) === "string" || value instanceofString){ // JS uses UTF-16 return value.length * 16; } if( typeof(value) === "number" || value instanceofNumber){ // 64-bit for each number return64; } return0; }
let N = _.filter(T, pattern).length; let W = 0; let attributes = Object.keys(pattern);
attributes.forEach(attribute => { W += bitsFor(pattern[attribute]); });
return (N-1) * W; } module.exports = benefit;
The actual translation of the formula to code happens after the bitsFor helper. We start by finding N, the number of records in the table that match the pattern. I'm using lodash's filter function to avoid the boilerplate of having to find the matching records myself. W is the accumulator in which I sum the number of bits each attribute in the pattern takes, and attributes holds the attributes in the pattern. Then, for each attribute, we use bitsFor and add the result to W. Finally, we return the value according to the formula.
Simply put, benefit is $(N - 1)$ times the total number of bits the attributes in the pattern would take. The $N - 1$ reflects that the matched values are stored once in the pattern itself instead of once per covered record.
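To make this concrete, here's a small hypothetical usage sketch (the table and pattern below are made up for illustration):

```javascript
const benefit = require("./benefit");

// a tiny made-up table of patient records
const T = [
    { gender: "M", age: "adult", blood_pressure: "normal" },
    { gender: "M", age: "adult", blood_pressure: "low" },
    { gender: "F", age: "child", blood_pressure: "low" }
];

// "adult male"; blood_pressure is left out, i.e. "don't care"
const pattern = { gender: "M", age: "adult" };

// N = 2 matching rows, W = 16 ("M") + 80 ("adult") = 96 bits
// benefit = (2 - 1) * 96 = 96
console.log(benefit(pattern, T));
```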
Next, let's write code to find $\log^*(N)$, which the implementation approximates as $2\log_2(N + 2)$.
log.js
"use strict"; /** * Calculates log*(N). * @param N number of rows in the table * @returns {number} */ functionlog(N){ return2 * Math.log2( N+2 ); } module.exports = log;
This is a fairly straightforward translation of the formula to code and needs no explanation.
We now have all the code we need to calculate overhead so let’s do that.
overhead.js
"use strict"; const _ = require("lodash"); const log = require("./log"); /** * The overhead of using this pattern to represent this table * @param T the table containing rows * @param pattern the pattern to be applied */ functionoverhead(pattern, T){ let N = _.filter(T, pattern).length; let D = Object.keys(T[0]).length; // number of attributes return D + log(N); } module.exports = overhead;
Notice that I start by require-ing the log.js module we just saw. I am using filter to find the number of rows in the table that match the pattern. On the next line I find the number of attributes in a given record of the table. Since the algorithm assumes no null / empty values for any attribute, we can safely pick up the 0th record and see how many attributes it contains.
Now that we have benefit and overhead in place, let's calculate saving.
saving.js
"use strict"; const benefit = require("./benefit"); const overhead = require("./overhead"); /** * The compression saving of a pattern P on a data table T, * denoted by saving(P,T) is the amount of compression it can achieve. * @param T the table containing rows * @param pattern the pattern to be applied */ functionsaving(pattern, T){ returnbenefit(pattern, T) - overhead(pattern, T); } module.exports = saving;
With saving in place, we can implement pattern marshalling.

```javascript
"use strict";
const _ = require("lodash");
const saving = require("./saving");
const residue = require("./residue"); // rows of T not covered by the chosen patterns

function patternMarshalling(S, T){
    // chosen patterns
    let patterns = [];
    // empty = don't care for every attribute
    let remainingPatterns = _.cloneDeep(S);

    while( remainingPatterns.length > 0 ){
        // select the pattern with the top incremental compression saving
        let bTop = Number.NEGATIVE_INFINITY;
        let pTop;
        let residualTable = residue(patterns, T);
        remainingPatterns.forEach(p => {
            let s = saving(p, residualTable);
            if( s > bTop ){ bTop = s; pTop = p; }
        });
        // stop once no remaining pattern achieves a non-negative saving
        if( bTop < 0 ) break;
        patterns.push(pTop);
        _.remove(remainingPatterns, p => _.isEqual(p, pTop));
    }
    return patterns;
}
module.exports = patternMarshalling;
```
Given a set of patterns S and a table T, we start by making a copy of the patterns passed to us. The aim is to select a pattern which gives us non-negative compression. If we find such a pattern, we add it to our patterns list and remove it from the current list of candidate patterns. We then continue the next iteration on the residue table. We repeat this until we have no more patterns to consider.
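We also need the residue helper used above. A minimal version, assuming a module named residue.js (not part of the original code shown here), could look like this:

```javascript
"use strict";
const _ = require("lodash");

/**
 * Returns the rows of T not covered by any pattern in the collection.
 * A row is covered if its attributes partially match a pattern.
 */
function residue(patterns, T){
    return T.filter(row =>
        !patterns.some(pattern => _.isMatch(row, pattern))
    );
}
module.exports = residue;
```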
The next piece of the puzzle is to write code for local expansion.
```javascript
"use strict";
const _ = require("lodash");
const saving = require("./saving");

/**
 * Local expansion pattern generation strategy that directly looks
 * for patterns that could minimize the compression cost
 * @param T an array of JSON objects
 * @returns {Array} of patterns that best summarize the data
 */
function localExpansion(T){
    let patterns = []; // final list of patterns to return

    // while we still have rows
    while( T.length > 0 ){
        // expand from an empty pattern (Algorithm 4)
        let pattern = expand(T, {});
        // stop if we cannot achieve more compression saving
        if( _.isEqual(pattern, {}) ){
            break;
        }
        // found a new pattern
        patterns.push( pattern );
        // remove all the rows that match the pattern, i.e. keep the residual table
        T = _.difference(T, _.filter(T, pattern));
    }

    return patterns;
}

module.exports = localExpansion;
```
As mentioned in the definitions, local expansion grows a pattern, starting with an empty pattern. The expansion is done in the expand function.
```javascript
/**
 * Expands a pattern to improve compression saving. See algorithm 4.
 * @param T
 * @param pattern
 * @returns {*}
 */
function expand(T, pattern){
    // find the attributes not included in the pattern
    let allAttributes = Object.keys(T[0]);
    let matchedAttributes = Object.keys(pattern);
    let attributes = _.difference(allAttributes, matchedAttributes);

    let bestPattern;
    // an expansion must beat the current pattern's saving
    // (or 0, in the case of the empty pattern)
    let highestCompressionSaving = Math.max(saving(pattern, T), 0);

    attributes.forEach( attribute => {
        // find all the unique values for the current attribute
        let values = _.map(T, attribute);
        values = _.uniq(values);

        values.forEach( value => {
            // an expanded pattern created by appending the current
            // attribute and its value to the existing pattern
            let newPattern = _.cloneDeep(pattern);
            newPattern[attribute] = value;

            let compressionSaving = saving(newPattern, T);
            if( compressionSaving > highestCompressionSaving ){
                highestCompressionSaving = compressionSaving;
                bestPattern = newPattern;
            }
        });
    });

    // no single-condition expansion improved the saving; stop growing
    if( bestPattern == null ){
        return pattern;
    }
    // otherwise keep expanding the best pattern found so far
    return expand(T, bestPattern);
}
```
The final piece of code we need to look at calculates the coverage, i.e. how much of the data is described by a given pattern.
coverage.js
"use strict";
const _ = require("lodash"); functioncoverage(pattern,T){ let matchingRows = _.filter(T,pattern); let coverage = matchingRows.length / T.length; return coverage * 100; // % of coverage }
module.exports = coverage;
Now that we have all the machinery in place, let’s write a simple test. We’ll take the patients example given in the paper and turn it into JSON objects. Then we’ll write a simple script to run our code using this data and check whether the results make sense.
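Here's a sketch of what such a test could look like. The rows below are made up to mirror the proportions in the paper's example rather than copied from it, and I'm assuming the modules are named after the functions they export:

```javascript
"use strict";
const localExpansion = require("./localExpansion");
const patternMarshalling = require("./patternMarshalling");
const coverage = require("./coverage");

// made-up patient records: 5 adult males, 3 children with low blood pressure
const patients = [
    { gender: "M", age: "adult", blood_pressure: "normal" },
    { gender: "M", age: "adult", blood_pressure: "low" },
    { gender: "M", age: "adult", blood_pressure: "normal" },
    { gender: "M", age: "adult", blood_pressure: "high" },
    { gender: "M", age: "adult", blood_pressure: "low" },
    { gender: "F", age: "child", blood_pressure: "low" },
    { gender: "M", age: "child", blood_pressure: "low" },
    { gender: "F", age: "child", blood_pressure: "low" },
    { gender: "M", age: "teen", blood_pressure: "normal" },
    { gender: "F", age: "child", blood_pressure: "normal" }
];

// generate candidate patterns, then rank them by compression saving
const candidates = localExpansion(patients);
const ranked = patternMarshalling(candidates, patients);

ranked.forEach(pattern => {
    console.log(pattern, `covers ${coverage(pattern, patients)}% of the rows`);
});
```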
The output says that the most descriptive pattern is "adult male", which makes up 50% of the rows, followed by "children with low blood pressure", which make up 30% of the rows. If we look at the sample data, out of the 10 rows, 5 are "adult male" and 3 are "children with low blood pressure", so the output of the algorithm checks out.