In a previous post I wrote about implementing TSum, a table-summarization algorithm from Google Research. That implementation was written in JavaScript and was meant for small datasets that can be summarized within the browser. I recently ported it to Dask so that it can be used on larger datasets with many rows. In a nutshell, it lets us summarize a Dask DataFrame and find representative patterns within it. In this post we’ll see how to use the algorithm to summarize a Dask DataFrame and run benchmarks to measure its performance.
Before We Begin
Although the library is designed to be used in production on data stored in a warehouse, it can also be used to summarize CSV or Parquet files. In essence, anything that can be read into a Dask DataFrame can be summarized.
Getting Started
Summarizing data
Imagine that we have customer data stored in a data warehouse that we’d like to summarize. For example, how would we best describe customers’ behavior given the data? In essence, we’d like to find patterns within this dataset, and this is the kind of scenario where TSum works well. As an example, we’ll use the patient data given in the research paper and pass it to the summarization algorithm.
We’ll begin by adding a function to generate some test data.
def data(n=1):
We’ll then add code to summarize this data.
import json
Upon running the script we get the following patterns.
[
This indicates that the patterns that best describe our data are “adult males”, which comprise 50% of the data, followed by “children with low blood pressure”, which comprise 30% of the data. We can verify this by looking at the data returned from the data function, and from the patterns mentioned in the paper.
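To make the idea of a pattern and its share of the data concrete, here is a minimal sketch that checks such percentages by hand. The ten rows are invented so that the proportions match the ones quoted above. They are not the table from the paper, and the helper functions are hypothetical, not part of the library.

```python
# Hypothetical 10-row table, invented so the proportions match the text.
# It is not the table from the paper.
rows = (
    [{"gender": "M", "age": "adult", "blood_pressure": "normal"}] * 3
    + [{"gender": "M", "age": "adult", "blood_pressure": "high"}] * 2
    + [{"gender": "F", "age": "child", "blood_pressure": "low"}] * 2
    + [{"gender": "M", "age": "child", "blood_pressure": "low"}] * 1
    + [{"gender": "F", "age": "adult", "blood_pressure": "high"}] * 2
)


def matches(row, pattern):
    """A row matches when it agrees with every attribute the pattern fixes."""
    return all(row[key] == value for key, value in pattern.items())


def coverage(rows, pattern):
    """Fraction of rows matched by the pattern."""
    return sum(matches(row, pattern) for row in rows) / len(rows)


# Attributes left out of a pattern act as wildcards.
adult_males = {"gender": "M", "age": "adult"}
low_bp_children = {"age": "child", "blood_pressure": "low"}

print(coverage(rows, adult_males))      # 0.5
print(coverage(rows, low_bp_children))  # 0.3
```

This only measures how much of the table a given pattern matches. Choosing which patterns to report is the job of the algorithm itself.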
Running benchmarks
To run the benchmarks, we’ll modify the script to create DataFrames with an increasing number of rows. I ran the benchmarks on my local machine, which has an Intel i7-8750H and 16GB of RAM. The script that runs the benchmark is given below.
if __name__ == "__main__":
This is the output. As we can see, summarizing 1e6 rows takes 17 minutes.
-------- ---------
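The general shape of a benchmark like this can be sketched with a small timing loop. This is a generic harness under stated assumptions, not the script used for the numbers above: make_rows and workload are hypothetical placeholders, and the workload simply counts identical rows instead of running the summarization.

```python
import time
from collections import Counter


def make_rows(n):
    """Build n synthetic rows; a hypothetical stand-in for a data generator."""
    base = [("M", "adult", "normal"), ("F", "child", "low"), ("F", "adult", "high")]
    return [base[i % len(base)] for i in range(n)]


def workload(rows):
    """Placeholder for the summarization call: finds the most common row."""
    return Counter(rows).most_common(1)


def benchmark(sizes):
    """Time the workload once per size and return (rows, seconds) pairs."""
    results = []
    for n in sizes:
        rows = make_rows(n)  # data generation is kept outside the timed region
        start = time.perf_counter()
        workload(rows)
        results.append((n, time.perf_counter() - start))
    return results


timings = benchmark([10**3, 10**4, 10**5])
for n, seconds in timings:
    print(f"{n:>8} rows  {seconds:.4f} s")
```

For reference, 17 minutes for 1e6 rows works out to roughly 980 rows per second (1,000,000 / 1,020 s).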
Conclusion
That’s it. That’s how we can summarize a Dask DataFrame using TSum. The library is available on PyPI and can be installed with the following command.
pip install tsum
The code is available on GitHub. Contributions welcome.