2018-11-01

Bivariate Linear Regression

In the previous post we built the intuition behind linear regression. In this post we’ll dig deeper into the simplest form of linear regression, which involves one dependent variable and one explanatory variable. Since we only have two variables involved, it’s called bivariate linear regression.

Ordinary Least Squares

We defined the sample regression function as:

$\begin{align} Y_i &= \hat{\beta}_1 + \hat{\beta}_2X_i + \hat{u}_i \\ Y_i &= \hat{Y}_i + \hat{u}_i \end{align}$

The term $\hat{u}_i$ is called the residual. If we re-write the sample regression function, we get $\hat{u}_i = Y_i - \hat{Y}_i$. The residual then represents how far off our estimated value $\hat{Y}_i$ is from the actual value $Y_i$.^[1] A reasonable assumption to make is that we should pick $\hat{\beta}_1$ and $\hat{\beta}_2$ such that the sum of the residuals, $\sum{\hat{u}_i}$, is the smallest. This would mean that our estimates are close to the actual values.

This assumption, however, does not work. Let’s assume that our values for $\hat{u}_i$ are -10, 3, -3, 10. Here the sum is zero but we can see that the first and last predictions are far from the actual values. To mitigate this, we can add up the squares of the residuals instead. We need to pick $\hat{\beta}_1$ and $\hat{\beta}_2$ such that the sum of the squares of the residuals is the minimum i.e. $\sum{\hat{u}_i^2}$ is the least.

Again, assuming the values of $\hat{u}_i$ to be -10, 3, -3, 10, the squares are 100, 9, 9, 100. This sums to 218, which shows us that the predicted values are far from the actual values. $\sum{\hat{u}_i^2}$ is called the residual sum of squares (RSS) or the squared error term.
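
To see the two criteria side by side, here’s a minimal sketch in plain Python that computes both for the residuals used above. The variable names are mine and not part of the original post.

```python
# Residuals from the example above.
residuals = [-10, 3, -3, 10]

# Plain sum: positive and negative errors cancel each other out.
plain_sum = sum(residuals)

# Residual sum of squares (RSS): every error adds a positive amount.
rss = sum(u ** 2 for u in residuals)

print(plain_sum)  # 0
print(rss)        # 218
```

The plain sum hides the two large misses, whereas the RSS doesn’t.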

The method of OLS provides us with estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ such that, for a given sample set, $\sum{\hat{u}_i^2}$ is the least. These estimators are given by the formulas:

$\begin{align} \hat{\beta}_2 &= \frac{\sum{x_iy_i}}{\sum{x_i^2}} \\ \hat{\beta}_1 &= \bar{Y} - \hat{\beta}_2\bar{X} \end{align}$

where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$. These are the differences of individual values from their corresponding sample mean $\bar{X}$ or $\bar{Y}$. The estimators thus obtained are called least-squares estimators.
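
For the curious, here’s a sketch of where these formulas come from. This is the standard textbook derivation rather than anything specific to this post. Treating the RSS as a function of the two estimators and setting both partial derivatives to zero gives:

$\begin{align} \frac{\partial}{\partial\hat{\beta}_1}\sum{\hat{u}_i^2} &= -2\sum{(Y_i - \hat{\beta}_1 - \hat{\beta}_2X_i)} = 0 \\ \frac{\partial}{\partial\hat{\beta}_2}\sum{\hat{u}_i^2} &= -2\sum{X_i(Y_i - \hat{\beta}_1 - \hat{\beta}_2X_i)} = 0 \end{align}$

Dividing the first equation by $-2n$, where $n$ is the sample size, gives $\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2\bar{X}$. Substituting this into the second equation, and writing $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ for the deviations from the sample means, we get:

$\begin{align} \sum{X_i(y_i - \hat{\beta}_2x_i)} &= 0 \\ \hat{\beta}_2 &= \frac{\sum{X_iy_i}}{\sum{X_ix_i}} = \frac{\sum{x_iy_i}}{\sum{x_i^2}} \end{align}$

The last step uses the fact that deviations from a mean sum to zero, so $\sum{X_iy_i} = \sum{(x_i + \bar{X})y_i} = \sum{x_iy_i}$ and, likewise, $\sum{X_ix_i} = \sum{x_i^2}$.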

Now that we’ve covered significant ground, let’s work through an example by hand.

Example

Let’s start by drawing a random sample from the original dataset. The sample I’ve drawn looks like this:

income (X)  expenditure (Y)
        80               65
       100               70
       120               84
       140               80
       160              118
       180              130
       200              144
       220              137
       240              191
       260              150

To see how this contrasts with the original dataset, here’s a plot of the original dataset and the random sample. The faint green points represent the original dataset whereas the red ones are the sample we’ve drawn. As always, our task is to use the sample dataset to come up with a regression line that’s as close as possible to the population dataset i.e. use the red dots to come up with a line that’d be similar to the line we’d get if we had all the faint green dots.

I’ll write some Python + Pandas code to come up with the intermediate calculations and the final $\hat{\beta}_1$ and $\hat{\beta}_2$ values. I highly recommend that you solve this by hand to get a feel for it.

regression.py

def ols(sample):
    # sample is a DataFrame with columns 'X' and 'Y'; the intermediate
    # columns are added to it in place so that they can be inspected later
    Xbar = sample['X'].mean()
    Ybar = sample['Y'].mean()

    # deviations from the sample means, their squares and products
    sample['x'] = sample['X'] - Xbar
    sample['x_sq'] = sample['x'] ** 2
    sample['y'] = sample['Y'] - Ybar
    sample['xy'] = sample['x'] * sample['y']

    # least-squares estimates of the slope and the intercept
    beta2cap = sample['xy'].sum() / sample['x_sq'].sum()
    beta1cap = Ybar - (beta2cap * Xbar)

    # fitted values and residuals
    sample['Ycap'] = beta1cap + (beta2cap * sample['X'])
    sample['ucap'] = sample['Y'] - sample['Ycap']

    return beta1cap, beta2cap

On calling the function, these are the results I get:

>>> ols(sample)
(9.696969696969703, 0.6306060606060606)

and the intermediate calculations are the following:

     X    Y     x    x_sq     y      xy        Ycap       ucap
0   80   65 -90.0  8100.0 -51.9  4671.0   60.145455   4.854545
1  100   70 -70.0  4900.0 -46.9  3283.0   72.757576  -2.757576
2  120   84 -50.0  2500.0 -32.9  1645.0   85.369697  -1.369697
3  140   80 -30.0   900.0 -36.9  1107.0   97.981818 -17.981818
4  160  118 -10.0   100.0   1.1   -11.0  110.593939   7.406061
5  180  130  10.0   100.0  13.1   131.0  123.206061   6.793939
6  200  144  30.0   900.0  27.1   813.0  135.818182   8.181818
7  220  137  50.0  2500.0  20.1  1005.0  148.430303 -11.430303
8  240  191  70.0  4900.0  74.1  5187.0  161.042424  29.957576
9  260  150  90.0  8100.0  33.1  2979.0  173.654545 -23.654545
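
If you’d like to check these numbers without Pandas, the following is a rough sketch in plain Python. It recomputes the estimates from the formulas and then confirms that nudging either coefficient makes the RSS worse, which is what “least squares” promises. The helper names `fit` and `rss` and the nudge sizes are my own choices for illustration.

```python
# Sample from the table above: income (X) and expenditure (Y).
X = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Y = [65, 70, 84, 80, 118, 130, 144, 137, 191, 150]

def fit(X, Y):
    """Return (intercept, slope) from the least-squares formulas."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
    sum_x_sq = sum((xi - x_bar) ** 2 for xi in X)
    slope = sum_xy / sum_x_sq
    return y_bar - slope * x_bar, slope

def rss(intercept, slope):
    """Residual sum of squares of a candidate line on the sample."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(X, Y))

beta1, beta2 = fit(X, Y)
best = rss(beta1, beta2)

# Nudge the coefficients in every direction; none should lower the RSS.
nudged = [rss(beta1 + d1, beta2 + d2)
          for d1 in (-1, 0, 1) for d2 in (-0.01, 0, 0.01)
          if (d1, d2) != (0, 0)]

print(round(beta1, 4), round(beta2, 4))      # 9.697 0.6306
print(all(value > best for value in nudged))  # True
```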

Let’s plot the line we obtained as a result of this.

Interpreting the Results

Our calculations gave us the results that $\hat{\beta}_1 \approx 9.697$ and $\hat{\beta}_2 \approx 0.6306$. Having a slope ($\hat{\beta}_2$) of 0.6306 means that for every unit increase in income, there’s an increase of 0.6306 in expenditure. The intercept is where the regression line meets the Y axis. This means that even without any income, a person would have an expenditure of 9.697. In a lot of cases, however, the intercept term doesn’t really matter as much as the slope, and how to interpret the intercept depends upon what the Y axis represents.
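
One way to make this interpretation concrete is to evaluate the fitted line at a few incomes. The snippet below is only a sketch: the constants are copied from the output of `ols(sample)` above, and `predicted_expenditure` is a name I made up for illustration.

```python
# Estimates reported by ols(sample) above.
BETA1_HAT = 9.696969696969703   # intercept
BETA2_HAT = 0.6306060606060606  # slope

def predicted_expenditure(income):
    """Point on the fitted regression line for a given income."""
    return BETA1_HAT + BETA2_HAT * income

# One extra unit of income moves the prediction by the slope.
step = predicted_expenditure(101) - predicted_expenditure(100)

print(round(predicted_expenditure(0), 2))    # 9.7
print(round(step, 4))                        # 0.6306
print(round(predicted_expenditure(150), 2))  # 104.29
```

Keep in mind that an income of 0 lies well outside the sampled range of 80 to 260, which is one more reason to read the intercept with care.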

Conclusion

That’s it for this post on bivariate linear regression. We saw how to calculate $\hat{\beta}_1$ and $\hat{\beta}_2$ and worked on a sample problem. We also saw how to interpret the result of the calculations. In the coming post we’ll look at how to assess whether the regression line is a good fit.

[1] It’s important to not confuse the stochastic error term $u_i$ with the residual $\hat{u}_i$. The stochastic error term represents the inherent variability of data whereas the residual represents the difference between the predicted and actual values of $Y_i$.

2018-10-28

Introduction to Linear Regression

The first topic we’ll look at is linear regression. As mentioned previously, regression is about understanding the dependence of one variable on another. We’ll start by building an intuitive understanding of what it is before we dive into the math-y details of it. We’ll cover the theory, learn to solve some problems by hand, look at the assumptions supporting it, then look at what happens if these assumptions don’t hold. Once this is done we’ll look at how we can do the same by using gradient descent.

What is Regression Analysis?

Regression analysis is the study of the dependence of one variable (dependent variable, regressand) on one or more other variables (explanatory variables, regressors). The aim is to see how the dependent variable changes, on average, if there is a change in the explanatory variables. This is better explained with an example.

The plot shown here is of daily income and daily expenditure. The dependent variable here is the expenditure which depends on the explanatory variable income. As we can see from the plot, for every income level we have a range of expenditures. However, as the regression line shows, on average as income goes up, the expenditure goes up, too.

Why do we say that the expenditure goes up on average? Let’s look at the raw data when daily income is 80 or 100.

income (X)  expenditure (Y)  average
        80               55     65.0
        80               60
        80               65
        80               70
        80               75
       100               65     77.0
       100               70
       100               74
       100               80
       100               85
       100               88

The highest expenditure when income is 80 (75) is greater than the lowest expenditure when income is 100 (65). If we look at the averages, however, the expenditure has gone up. There are a number of factors which may have caused the person with income 80 to spend 75 but, as a whole, people with that daily income tend to spend less than those who earn 100 a day. The regression line passes through these averages.
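
The averages in the table are easy to reproduce. Here’s a small sketch in plain Python that groups the rows by income level and averages the expenditure within each group. The variable names are mine.

```python
from collections import defaultdict

# (income, expenditure) pairs from the table above.
rows = [
    (80, 55), (80, 60), (80, 65), (80, 70), (80, 75),
    (100, 65), (100, 70), (100, 74), (100, 80), (100, 85), (100, 88),
]

# Group expenditures by income level.
by_income = defaultdict(list)
for income, expenditure in rows:
    by_income[income].append(expenditure)

# Average expenditure at each income level.
averages = {income: sum(values) / len(values)
            for income, values in by_income.items()}

print(averages)  # {80: 65.0, 100: 77.0}
```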

That’s the basic idea behind regression analysis.

The Math behind Regression Analysis

The regression line passes through the average expenditure for every given income level. In other words, it passes through the conditional expected value, $E(Y|X_i)$. This is read as “the expected value of expenditure (Y) given an income level ($X_i$)”. Therefore, $E(Y|X_i)$ is a function of $X_i$ i.e. $E(Y|X_i) = f(X_i)$. The question now is: what is the function $f$?

We can start by assuming $f$ to be a straight line function, which means that we’re assuming expenditure to be linearly related to income. So, $E(Y|X_i) = \beta_1 + \beta_2X_i$. This is the slope-intercept form of a line where $\beta_1$ is the intercept and $\beta_2$ is the slope of the line. Both $\beta_1$ and $\beta_2$ are called the regression coefficients (or parameters). The function $f$ is called the population regression function.

Now given that we know how to calculate the average expenditure for a given income level, how do we calculate the expenditure of each individual at that income level? We can say that the individual’s expenditure is “off the average expenditure by a certain margin”. This can be written as $Y_i = E(Y|X_i) + u_i$. The term $u_i$ denotes how far off an individual’s expenditure is from the average and is called the stochastic error term.
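
As a quick illustration, here are the error terms for the five people at income level 80 from the table earlier in this post. This sketch assumes those five rows are everyone at that income level, so that their average really is the conditional expected value.

```python
# Expenditures observed at income level 80.
expenditures = [55, 60, 65, 70, 75]

# Conditional mean of expenditure at this income level.
conditional_mean = sum(expenditures) / len(expenditures)

# How far each individual is from that mean.
errors = [y - conditional_mean for y in expenditures]

print(conditional_mean)  # 65.0
print(errors)            # [-10.0, -5.0, 0.0, 5.0, 10.0]
print(sum(errors))       # 0.0
```

The errors cancel out within an income level, which is just another way of saying that the regression line passes through the average.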

No matter how good our regression model is, there is always going to be some inherent variability. There are factors other than income which define a person’s expenditure. Factors like gender, age, etc. affect the expenditure. All of these factors are subsumed into the stochastic error term.

When it comes to applying regression analysis, we won’t have access to the population data. Rather, we have access to sample data. Our task is to come up with sample regression coefficients such that they’ll be as close as possible to the population regression coefficients. The sample regression function is defined as:

$\begin{align} Y_i &= \hat{\beta}_1 + \hat{\beta}_2X_i + \hat{u}_i \\ Y_i &= \hat{Y}_i + \hat{u}_i \end{align}$

Here $\hat{Y}_i$, $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{u}_i$ are estimators of $E(Y|X_i)$, $\beta_1$, $\beta_2$, and $u_i$ respectively. An estimator is simply a formula which tells us how to estimate the population parameter from the information provided by the sample data available to us.

What it means to be ‘Linear’

Linearity can be defined as being either linear in parameters or linear in explanatory variables. To be linear in parameters means that all the parameters in the regression function have power 1. By that definition, $E(Y|X_i) = \beta_1 + \beta_2X_i^2$ is linear. In contrast, $E(Y|X_i) = \beta_1 + \beta_2^2X_i$ is linear in explanatory variables since $X_i$ has power 1. We’ll be looking at models that are linear in parameters. Henceforth, “linear” will mean “linear in parameters”.
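
One practical consequence is that a model which is linear in parameters can still be fitted with the ordinary straight-line formulas, even when it is curved in the explanatory variable. The sketch below assumes a model of the form $Y = \beta_1 + \beta_2X^2$, uses made-up data that follows it exactly, and relies on the standard least-squares formulas for a single regressor. The `ols` helper is my own.

```python
def ols(xs, ys):
    """Least-squares (intercept, slope) for a single regressor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Made-up data that follows Y = 1 + 2 * X^2 exactly.
X = [0, 1, 2, 3, 4]
Y = [1 + 2 * x ** 2 for x in X]

# Treat Z = X^2 as the regressor: the model is a straight line in Z.
Z = [x ** 2 for x in X]
beta1, beta2 = ols(Z, Y)

print(beta1, beta2)  # 1.0 2.0
```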

That’s it. This is what sets the foundation for further studying linear regression.

2018-10-24

The Machine Learning Notebook: Introduction

A while ago I’d started writing a series on machine learning titled ML for Newbies which I never got around to finishing. This series of posts is a redoing of the older series with greater mathematical and theoretical rigor. Some of the content will be ported over from that series but most of it will be new content.

The aim of the series is to summarize everything that I know about machine learning in an approachable, intuitive manner, and also to serve as notes for myself.

What is Machine Learning?

ML is all about automated learning. It is about creating algorithms that can “learn” from the input given to them. These algorithms can then do a number of tasks such as filter out spam emails, predict the weather, etc. More formally, it’s about converting experience into expertise. The input to the algorithm becomes its “experience” and the task that it performs becomes its “expertise”.

Although ML is a sub-field of AI (Artificial Intelligence), they’re different. AI focuses on replicating human intelligence whereas ML focuses on complementing human intelligence by doing tasks that fall beyond human capabilities like looking for patterns in massive data sets.

How does a Machine Learn?

Like I mentioned earlier, a machine learns from the input data given to it. The way the machine learns can also be classified as supervised, unsupervised, active, or passive.

Supervised Learning

Let’s say your task is to build a spam filter. You have a large data set of emails, each of which is marked as either “spam” or “not spam”. You’ll feed this data set to your algorithm and it’ll learn from it. Later, you’ll input an email and it’ll tell you if it is spam or not. This type of learning is called supervised learning. It is called “supervised” because the algorithm was first taught what is spam and what isn’t by a data set generated manually. You had to supervise its learning process.

Unsupervised Learning

Let’s stay with the spam filter example. Let’s now suppose that the emails in the data set are not marked as “spam” or “not spam”. It’s the task of the algorithm to figure it out somehow. Such learning is called unsupervised learning.

Passive Learning

In passive learning, the algorithm only observes the data provided to it and learns from it, without influencing it in any way. Suppose now that the spam filter starts out blank with no notion of what is and what isn’t spam. It’s up to the user to mark the incoming emails so that the algorithm can learn from them. This is passive learning.

Active Learning

In contrast, let’s say that the spam filter now looked at the email first, and if it found something suspicious, asked the user to mark it as spam or not. This is active learning.

What can ML do?

There are a number of ML algorithms, each focusing on a different task. These algorithms can do a number of things like:

Clustering

In simple terms, clustering is grouping. It’s an unsupervised form of learning in which the algorithm looks at the data and groups similar pieces of data together based on some similarity measure. In the spam filter example above, when the algorithm, on its own, figured out which of the emails were spam, it clustered them together.

Classification

Classification is about finding out which group a piece of data belongs to. It’s a form of supervised learning because the algorithm needs to know beforehand what each of the groups looks like before it can be asked to classify a piece of data. When the spam filter was given a data set with marked emails, all it had to do was to answer the question “is this email spam?”. Basically, it was asked to classify that email into a group.

Regression

Regression is about understanding relationships between variables. If one variable changes, how would it affect the other? For example, if the amount of power needed for a phone to stay on increased, how would it affect battery life?

Association Rules

Finding out association rules in data sets is also about finding out relationships between variables. Suppose you have a large data set from an online store. Association rule mining is about finding out what items you are likely to buy based on what you’ve previously bought. For example, if you buy milk and sugar, you’re likely to buy eggs. Your rule then becomes {milk, sugar} => eggs.
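
To give a flavour of how such a rule might be scored, here’s a rough sketch in Python. The baskets are invented for this example, and support and confidence are the two standard measures I’ve assumed here. The post itself doesn’t commit to any particular measure.

```python
# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"milk", "sugar", "eggs"},
    {"milk", "sugar", "eggs", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"eggs", "bread"},
]

antecedent = {"milk", "sugar"}
consequent = {"eggs"}

# Baskets containing the antecedent, and those that also contain the consequent.
with_antecedent = [b for b in baskets if antecedent <= b]
with_both = [b for b in with_antecedent if consequent <= b]

# Support: share of all baskets in which the whole rule appears.
support = len(with_both) / len(baskets)
# Confidence: share of antecedent baskets that also contain the consequent.
confidence = len(with_both) / len(with_antecedent)

print(support)               # 0.4
print(round(confidence, 2))  # 0.67
```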

For now, don’t worry about the theoretical rigor of these. We’ll get to that in subsequent posts. That’s all for now. I hope this post was motivation enough for you to start learning ML. :)

2018-10-21

Implementing TSum: An Algorithm for Table Summarization

So, I am taking a break from my JVM JIT series to write this post about a table summarization algorithm that I had implemented way back in my college days. Simply put, the algorithm finds the most descriptive patterns in the table that succinctly convey the meaning of the rows contained i.e. summarize it. This post will focus on what the algorithm is and how it’s implemented in JavaScript. Take this post with a grain of salt because the implementation was a part of my college assignment and hasn’t been thoroughly checked for correctness.

What is TSum?

TSum is the algorithm published by Google Research ^{[link to the pdf]}. To quote the abstract:

Given a table where rows correspond to records and columns correspond to attributes, we want to find a small number of patterns that succinctly summarize the dataset. … TSum, a method that provides a sequence of patterns ordered by their “representativeness.” It can decide both which these patterns are, as well as how many are necessary to properly summarize the data.

An algorithm like this is useful in situations where the volume of data is so large that it’d be impossible for a human to deduce anything from it simply by looking at the data. For example, given patient records, a description like “most patients are middle-aged men with cholesterol, followed by child patients with chickenpox” provides a very human-readable summary of the data.

The algorithm views finding the best pattern as a compression problem — the best pattern will give the best compression.

Definitions

Pattern and Pattern Size

A pattern $P$ is a tuple $(A_1 = u_1, \ldots, A_D = u_D)$ where each $u_i$ can be either a specific value or a “don’t care” value represented by $\ast$. The number of matching attributes in a pattern is called the size of the pattern and is denoted by $size(P)$. An attribute is considered matching if and only if its value doesn’t equal the “don’t care” value.
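
Here’s a minimal sketch of these definitions in Python. Representing a pattern as a dictionary and the “don’t care” value as `None` are assumptions of this sketch. The JavaScript implementation later in this post simply leaves the “don’t care” attributes out of the pattern object.

```python
# "Don't care" marker; using None for it is an assumption of this sketch.
DONT_CARE = None

def size(pattern):
    """Number of matching attributes, i.e. those with a specific value."""
    return sum(1 for value in pattern.values() if value is not DONT_CARE)

def covers(pattern, record):
    """True if the record agrees with every matching attribute."""
    return all(record[attr] == value
               for attr, value in pattern.items()
               if value is not DONT_CARE)

# Hypothetical pattern: adult males with any blood pressure.
pattern = {"gender": "M", "age": "adult", "blood_pressure": DONT_CARE}

man = {"gender": "M", "age": "adult", "blood_pressure": "low"}
woman = {"gender": "F", "age": "adult", "blood_pressure": "low"}

print(size(pattern))           # 2
print(covers(pattern, man))    # True
print(covers(pattern, woman))  # False
```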

Pattern List and Pattern Set

A pattern list is an ordered sequence of patterns while a pattern set is a set of patterns with no order.

Compression Saving

Compression saving of a pattern $P$ on a table $\mathcal{T}$, denoted as $Saving(P, \mathcal{T})$, is the amount of compression it can achieve. Specifically, it is the difference in bits between the uncompressed and compressed representations of the data records covered by the pattern $P$. Let $N$ be the number of records covered by pattern $P$ and $D$ be the number of attributes in $\mathcal{T}$. Then,

$Saving(P, \mathcal{T}) = Benefit(P, \mathcal{T}) - Overhead(P, \mathcal{T})$

where

$Benefit(P, \mathcal{T}) = (N - 1) \ast \sum_{i, A_i \in matched(P)} W_i$

and

$Overhead(P, \mathcal{T}) = D + log^*(N)$

$W_i$ is the average number of bits in the $i^{th}$ attribute.
$D$ is the number of attributes.
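
To get a feel for the numbers, here’s a rough Python sketch of the three formulas. It borrows two simplifications from the JavaScript implementation later in this post: each string costs 16 bits per character (and each number 64 bits), and $log^*(N)$ is taken to be $2 \ast log_2(N + 2)$. The table and the function names are hypothetical.

```python
import math

def bits_for(value):
    """Bits to store a value: 16 per character, 64 per number (assumption)."""
    if isinstance(value, str):
        return len(value) * 16
    if isinstance(value, (int, float)):
        return 64
    return 0

def saving(pattern, table):
    """Benefit minus overhead of a pattern (dict of matching attributes)."""
    # N: number of records covered by the pattern
    n = sum(1 for row in table
            if all(row[a] == v for a, v in pattern.items()))
    d = len(table[0])                     # D: number of attributes
    w = sum(bits_for(v) for v in pattern.values())
    benefit = (n - 1) * w
    overhead = d + 2 * math.log2(n + 2)   # log*(N) taken as 2 * log2(N + 2)
    return benefit - overhead

# Hypothetical table: three adult males and one female child.
table = [
    {"gender": "M", "age": "adult"},
    {"gender": "M", "age": "adult"},
    {"gender": "M", "age": "adult"},
    {"gender": "F", "age": "child"},
]

# N = 3 and W = 16 + 80 = 96, so the benefit is 192 bits.
print(round(saving({"gender": "M", "age": "adult"}, table), 3))  # 185.356
```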

Residue Data Table

Given a table $\mathcal{T}$ and a pattern collection $\mathcal{P}$, the residue data table contains the records that are not covered by any of the patterns in $\mathcal{P}$.
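
In code, the residue table is just a filter. The following Python sketch assumes that a pattern is a dictionary listing only its matching attributes, and the records and patterns are made up for illustration.

```python
def covers(pattern, record):
    """True if the record matches every attribute named in the pattern."""
    return all(record.get(attr) == value for attr, value in pattern.items())

def residue(patterns, table):
    """Records of the table that are not covered by any of the patterns."""
    return [record for record in table
            if not any(covers(p, record) for p in patterns)]

# Hypothetical records and pattern collection.
table = [
    {"gender": "M", "age": "adult"},
    {"gender": "F", "age": "child"},
    {"gender": "M", "age": "teen"},
]
patterns = [{"age": "adult"}, {"gender": "F"}]

print(residue(patterns, table))  # [{'gender': 'M', 'age': 'teen'}]
```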

Pattern Marshalling

Given a set of patterns $\mathcal{S}$, the pattern marshalling algorithm picks the pattern from $\mathcal{S}$ which has the highest compression saving. After every iteration, the records in the table which have been covered by the patterns chosen so far are removed from consideration.

Generating Patterns - Local Expansion

Local expansion is a pattern generation strategy that tries to “grow” the patterns to increase the compression savings on the data records. In this approach, the algorithm starts with single attributes and finds a single-condition pattern that has the best compression saving. It then expands the pattern by adding other conditions until the compression saving cannot be improved. To find the next pattern, the same procedure is repeated, but only on the residue data table, i.e. the part of the table not covered by any of the patterns found so far.

Code

We’ll start with simpler code first. Let’s start by writing a function to calculate the benefit.

benefit.js

"use strict";
const _ = require("lodash");
/**
 * The benefit of using this pattern to represent the table
 * @param T the table containing rows
 * @param pattern the pattern to be applied
 */
function benefit(pattern, T){
    if(pattern == null) return 0;

    // a function for internal use.
    // find out the bits needed to encode the value
    function bitsFor(value) {
        if( typeof(value) === "string" || value instanceof String){
            // JS uses UTF-16
            return value.length * 16;
        }
        if( typeof(value) === "number" || value instanceof Number){
            // 64-bit for each number
            return 64;
        }
        return 0;
    }

    let N = _.filter(T, pattern).length;
    let W = 0;
    let attributes = Object.keys(pattern);

    attributes.forEach(attribute => {
        W += bitsFor(pattern[attribute]);
    });

    return (N-1) * W;
}
module.exports = benefit;

The actual translation of formula to code happens from line #25. We start by finding N, which is the number of records in the table that match the pattern. I’m using lodash’s filter function to avoid the boilerplate of having to find the matching records myself. W is the accumulator in which I sum the number of bits each attribute in the pattern takes. attributes are the attributes in the pattern. Then for each attribute, we use bitsFor and add the result to W. Finally, we return the value according to the formula.

Simply put, benefit is $(N - 1)$ times the total number of bits the attributes in the pattern would take.

Next, let’s write code to find $log^*(N)$.

log.js

"use strict";
/**
 * Calculates log*(N).
 * @param N number of rows in the table
 * @returns {number}
 */
function log(N){
    return 2 * Math.log2( N+2 );
}
module.exports = log;

This is a fairly straightforward translation of the formula to code and needs no explanation.

We now have all the code we need to calculate overhead so let’s do that.

overhead.js

"use strict";
const _ = require("lodash");
const log = require("./log");
/**
 * The overhead of using this pattern to represent this table
 * @param T the table containing rows
 * @param pattern the pattern to be applied
 */
function overhead(pattern, T){
    let N = _.filter(T, pattern).length;
    let D = Object.keys(T[0]).length; // number of attributes
    return D + log(N);
}
module.exports = overhead;

Notice that I start by require-ing the log.js module we just saw. I am using filter to find the number of rows in the table that match the pattern. On the next line I find the number of attributes in a given record of the table. Since the algorithm assumes no null / empty values for any attribute, we can safely pick up the 0^th record and see how many attributes it contains.

Now that we have benefit and overhead in place, let’s calculate saving.

saving.js

"use strict";
const benefit = require("./benefit");
const overhead = require("./overhead");
/**
 * The compression saving of a pattern P on a data table T,
 * denoted by saving(P,T) is the amount of compression it can achieve.
 * @param T the table containing rows
 * @param pattern the pattern to be applied
 */
function saving(pattern, T){
    return benefit(pattern, T) - overhead(pattern, T);
}
module.exports = saving;

Now let’s write code for pattern marshalling.

pattern_marshalling.js

"use strict";
const _ = require("lodash");
const saving = require("./saving");
const coverage = require("./coverage");

function patternMarshalling(S, T){
    // chosen patterns
    let patterns = []; // empty = don't care for every attribute
    let remainingPatterns = _.cloneDeep(S);

    while( remainingPatterns.length > 0 ){
        // select the pattern with the top incremental compression saving
        let bTop = Number.NEGATIVE_INFINITY;
        let pTop;
        let residualTable = residue(patterns, T);

        remainingPatterns.forEach(pattern => {
            let compression = saving(pattern, residualTable);
            if( compression > bTop ) {
                bTop = compression;
                pTop = pattern;
            }
        });

        if( bTop > 0 ){
            patterns.push({
                "pattern" : pTop,
                "saving" : bTop,
                "coverage" : coverage(pTop,T)
            });
        }

        remainingPatterns = _.difference(remainingPatterns, _.filter(remainingPatterns, pTop));
    }

    return patterns;
}

function residue(patterns, T){
    patterns.forEach(pattern => {
        T = _.difference(T, _.filter(T, pattern.pattern));
    });
    return T;
}

module.exports = patternMarshalling;

Given a set of patterns $S$ and a table $T$, we start by making a copy of the patterns passed to us. On every iteration, we look for the pattern with the highest compression saving on the residue table. If that saving is positive, we add the pattern to our patterns list. Either way, we remove it from the list of remaining patterns and continue with the next iteration on the residue table. We repeat this until we have no more patterns to consider.

The next piece of the puzzle is to write code for local expansion.

local_expansion.js

"use strict";
const _ = require("lodash");
const expand = require("./expand");

/**
 * Local expansion pattern generation strategy that directly looks
 * for patterns that could minimize the compression cost
 * @param T an array of JSON objects
 * @returns {Array} of patterns that best summarize the data
 */
function localExpansion(T){
    let patterns = []; // final list of patterns to return

    // while we still have rows
    while( T.length > 0 ){
        // expand from an empty pattern (Algorithm 4)
        let pattern = expand(T, {});
        // stop if we cannot achieve more compression saving
        if( _.isEqual(pattern, {}) ){
            break;
        }
        // found a new pattern
        patterns.push( pattern );
        // remove all the rows that match the pattern i.e.residual table for the pattern
        T = _.difference(T, _.filter(T, pattern));
    }

    return patterns;
}

module.exports = localExpansion;

As mentioned in the definitions, local expansion grows a pattern, starting with an empty pattern. The expansion is done in the expand function.

expand.js

const _ = require("lodash");
const saving = require("./saving");

/**
 * Expands a pattern to improve compression saving. See algorithm 4.
 * @param T
 * @param pattern
 * @returns {*}
 */
function expand(T, pattern){
    // find the attributes not included in the pattern
    let allAttributes = Object.keys(T[0]);
    let matchedAttributes = Object.keys(pattern);
    let attributes = _.difference(allAttributes, matchedAttributes);

    let bestPattern;
    let highestCompressionSaving = 0;

    attributes.forEach( attribute => {
        // find all the unique values for the current attribute
        let values = _.map(T, attribute);
        values = _.uniq(values);

        values.forEach( value => {
            // an expanded pattern created by appending the current
            // attribute and its value to the existing pattern
            let newPattern = _.cloneDeep(pattern);
            newPattern[attribute] = value;

            let compressionSaving = saving(newPattern, T);

            if(compressionSaving > highestCompressionSaving){
                highestCompressionSaving = compressionSaving;
                bestPattern = newPattern;
            }
        });
    });

    if( saving(bestPattern, T) > saving(pattern, T) ){
        return expand(T, bestPattern);
    }else{
        return pattern;
    }
}

module.exports = expand;

The final piece of code we need to look at is to calculate the coverage i.e. how much of the data is described by a given pattern.

coverage.js

"use strict";

const _ = require("lodash");
function coverage(pattern,T){
    let matchingRows = _.filter(T,pattern);
    let coverage = matchingRows.length / T.length;
    return coverage * 100; // % of coverage
}

module.exports = coverage;

Now that we have all the machinery in place, let’s write a simple test. We’ll take the patients example given in the paper and turn it into JSON objects. Then we’ll write a simple script to run our code using this data and check whether the results make sense.

Here’s the data:

patients.json

[
  {"gender":"M","age":"adult","blood_pressure":"normal"},
  {"gender":"M","age":"adult","blood_pressure":"low"},
  {"gender":"M","age":"adult","blood_pressure":"normal"},
  {"gender":"M","age":"adult","blood_pressure":"high"},
  {"gender":"M","age":"adult","blood_pressure":"low"},

  {"gender":"F","age":"child","blood_pressure":"low"},
  {"gender":"M","age":"child","blood_pressure":"low"},
  {"gender":"F","age":"child","blood_pressure":"low"},

  {"gender":"M","age":"teen","blood_pressure":"high"},
  {"gender":"F","age":"child","blood_pressure":"normal"}
]

Here’s the test:

test1.js

"use strict";

const tsum = require("../index");
const table = require("./data/patients.json");
const _ = require("lodash");

let patterns = tsum.localExpansion( table );
// shuffle before marshalling: the result should not depend on the input order
patterns = _.shuffle(patterns);
let sorted = tsum.patternMarshalling(patterns,table);

console.log( sorted );

Now let’s run this:

$ node test/test1.js
[ { pattern: { age: 'adult', gender: 'M' },
    saving: 375.3852901558848,
    coverage: 50 },
  { pattern: { age: 'child', blood_pressure: 'low' },
    saving: 248.35614381022526,
    coverage: 30 } ]

The output says that the most descriptive pattern is “adult male”, which makes up 50% of the rows, followed by “children with low blood pressure”, which make up 30% of the rows. If we look at the sample data, out of the 10 rows, 5 are “adult male” and 3 are “children with low blood pressure”. So the output of the algorithm checks out.
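
The counts and the savings can also be double-checked by hand. The following Python sketch redoes the arithmetic for the two patterns, assuming (as the JavaScript code does) 16 bits per character, $D = 3$ and $log^*(N) = 2 \ast log_2(N + 2)$. The saving of the second pattern is computed on the residue table, i.e. after the adult males have been removed. The helper name `saving` is mine.

```python
import math

# Rows of patients.json as (gender, age, blood_pressure) tuples.
rows = [
    ("M", "adult", "normal"), ("M", "adult", "low"), ("M", "adult", "normal"),
    ("M", "adult", "high"), ("M", "adult", "low"),
    ("F", "child", "low"), ("M", "child", "low"), ("F", "child", "low"),
    ("M", "teen", "high"), ("F", "child", "normal"),
]

# First pattern: adult males, counted on the full table.
adult_males = [r for r in rows if r[0] == "M" and r[1] == "adult"]

# Second pattern: children with low blood pressure, counted on the residue.
residue = [r for r in rows if r not in adult_males]
low_bp_children = [r for r in residue if r[1] == "child" and r[2] == "low"]

def saving(n, chars, d=3):
    """(N - 1) * W - (D + log*(N)), with W = 16 bits per character."""
    return (n - 1) * chars * 16 - (d + 2 * math.log2(n + 2))

print(100 * len(adult_males) / len(rows))      # 50.0
print(100 * len(low_bp_children) / len(rows))  # 30.0
print(round(saving(len(adult_males), len("adult" + "M")), 3))        # 375.385
print(round(saving(len(low_bp_children), len("child" + "low")), 3))  # 248.356
```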

Finito.

2018-05-20

JVM JIT - Loop Unrolling

In the previous post we looked at how the JVM generates assembly code and how we can take a look at it. In this post we will look at yet another optimization technique called loop unrolling.

What is loop unrolling?

To quote The Java HotSpot Performance Engine Architecture,

the Server VM features loop unrolling, a standard compiler optimization that enables faster loop execution. Loop unrolling increases the loop body size while simultaneously decreasing the number of iterations.

Okay so it is an optimization that makes our loops faster by increasing the body size, and reducing the iterations. This is much better explained by example.

A loop like this:

for(int i = 0; i < N; i++) {
    S(i);
}

can be unrolled into this:

for(int i = 0; i < N; i += 4) {
    S(i);
    S(i+1);
    S(i+2);
    S(i+3);
}

So the size of the loop has increased because instead of calling method S just once per iteration, we call it four times. The number of iterations has reduced because we now have a stride of 4 (i += 4). This is a space-time tradeoff because you gain speed by increasing the size of the program. If this for loop were to be a part of a hot method that got compiled to assembly, more code cache would be consumed because of unrolling.

Loop unrolling, however, wins on a different front: speed. By increasing the stride of the loop, we’re reducing the number of jumps that the CPU has to make. Jumps are costly and a reduction in jumps can be a significant performance boost.
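
One detail worth spelling out is that the unrolled loop shown earlier only works as written when N is a multiple of 4. For any other N, a clean-up loop has to handle the leftover iterations. The Python sketch below is only meant to show that control flow. It says nothing about speed, and the function names are made up.

```python
def fill(values, n):
    """Straightforward loop: one element per iteration."""
    for i in range(n):
        values[i] = 0x42

def fill_unrolled(values, n):
    """Same work with a stride of 4, plus a clean-up loop for leftovers."""
    main_end = n - n % 4
    for i in range(0, main_end, 4):
        values[i] = 0x42
        values[i + 1] = 0x42
        values[i + 2] = 0x42
        values[i + 3] = 0x42
    # Leftover elements when n is not a multiple of 4.
    for i in range(main_end, n):
        values[i] = 0x42

a = [0] * 10
b = [0] * 10
fill(a, 10)
fill_unrolled(b, 10)

print(a == b)  # True
```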

Loop unrolling in action

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintAssembly"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void testMethod() {
        int[] nums = new int[10];

        for (int i = 0; i < nums.length; i++) {
            nums[i] = 0x42;
        }
    }
}

The code above is a JMH benchmark which will fork 1 JVM (@Fork), warm it for 10 iterations (@Warmup), then run our benchmark for 5 iterations (@Measurement). The generated assembly looks like this:

0x00000001296ac845: cmp    %r10d,%ebp
0x00000001296ac848: jae    0x00000001296ac8b4
0x00000001296ac84a: movl   $0x42,0x10(%rbx,%rbp,4)
0x00000001296ac852: inc    %ebp               
0x00000001296ac854: cmp    %r11d,%ebp
0x00000001296ac857: jl     0x00000001296ac845

...

0x00000001296ac880: vmovdqu %xmm0,0x10(%rbx,%rbp,4)  
0x00000001296ac886: add    $0x4,%ebp          
0x00000001296ac889: cmp    %r8d,%ebp
0x00000001296ac88c: jl     0x00000001296ac880

So the first block of code is moving 0x42 to some memory location (line #3), incrementing the counter %ebp (line #4), and jumping back to the start of the loop for the next iteration (line #6). Then there is the second block which increments the counter by 4 (line #11) and uses vmovdqu (line #10).

In the second block, vmovdqu moves %xmm0 (a 128-bit register) to some memory location (line #10) i.e. we’re writing 4 32-bit numbers in one go. This is the equivalent of writing nums[0 .. 3] in one iteration.

This means that our entire loop which writes 10 array elements will not run for 10 iterations at all. It will run for 4 — 2 iterations to write 8 elements in groups of 4 each (block 2), and 2 more to write the remaining 2 elements (block 1).
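
The arithmetic behind this is a division with remainder. Here’s a tiny Python sketch of it, assuming 4 lanes per vector write (128 bits / 32 bits) and ignoring anything else a real JIT might add around the loop. The function name is hypothetical.

```python
def iteration_counts(length, lanes=4):
    """Split `length` element writes into vector and scalar iterations."""
    vector_iterations, scalar_iterations = divmod(length, lanes)
    return vector_iterations, scalar_iterations

vector, scalar = iteration_counts(10)

print(vector, scalar, vector + scalar)  # 2 2 4
```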

What we saw above was loop unrolling to enable vectorization. Since this happened without any effort on our part, it’s called auto-vectorization.

What is vectorization?

There are certain CPU instructions which are capable of operating on multiple data elements simultaneously. Such instructions are called SIMD instructions — Single Instruction Multiple Data or vectorized instructions. In the example above, all 4 elements were packed into a single register XMM0 and then moved to their memory location with vmovdqu. This results in faster processing by saving jumps and clock cycles.

+--------|--------|--------|--------+
|  0x42  |  0x42  |  0x42  |  0x42  | xmm0
+--------|--------|--------|--------+
     ↓       ↓         ↓        ↓

Auto-vectorization is when the compiler converts the scalar instructions (which operate on a single data element at a time) to vector instructions (which operate on multiple data elements at a time) without any effort on the part of the programmer.

Vectorizing the code enables superword level parallelism (SLP). SLP is a type of SIMD parallelism in which source and result operands are packed in a storage location.

An example of SLP

The example shows statement packing in which the operands have been packed into registers and the scalar addition and multiplication have been replaced by their vectorized counterparts.

How does loop unrolling enable SLP?

Consider this loop:

for (i=0; i<16; i++) {
    localdiff = ref[i] - curr[i];
    diff += abs(localdiff);
}

This is a loop containing isomorphic statements, i.e. statements that contain the same operations in the same order. Such statements are easy to unroll. So the loop above can be transformed as follows:

for (i=0; i<16; i+=4) {
    localdiff = ref[i+0] - curr[i+0];
    diff += abs(localdiff);
    
    localdiff = ref[i+1] - curr[i+1];
    diff += abs(localdiff);

    localdiff = ref[i+2] - curr[i+2];
    diff += abs(localdiff);

    localdiff = ref[i+3] - curr[i+3];
    diff += abs(localdiff);
}

This can be further transformed as:

for (i=0; i<16; i+=4) {
    localdiff0 = ref[i+0] - curr[i+0];
    localdiff1 = ref[i+1] - curr[i+1];
    localdiff2 = ref[i+2] - curr[i+2];
    localdiff3 = ref[i+3] - curr[i+3];

    diff += abs(localdiff0);
    diff += abs(localdiff1);
    diff += abs(localdiff2);
    diff += abs(localdiff3);
}

Now it becomes easy to see how SLP can be used to speed up the execution of the loop. Loop unrolling sets the stage for SLP.
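
The transformation can be checked by hand in Java. The sketch below is an illustration rather than what the JIT emits: `SadSketch`, `sadScalar` and `sadUnrolled` are made-up names, the unrolled version assumes the array length is a multiple of 4, and it goes one step beyond the listing above by giving each of the four "lanes" its own accumulator, which is the form that maps most directly onto packed registers.

```java
public class SadSketch {
    /** Straightforward loop: one subtraction and one accumulation per trip. */
    static int sadScalar(int[] ref, int[] curr) {
        int diff = 0;
        for (int i = 0; i < ref.length; i++) {
            diff += Math.abs(ref[i] - curr[i]);
        }
        return diff;
    }

    /**
     * Unrolled by four with one accumulator per lane.
     * Assumes both arrays have the same length and that it is a multiple of 4.
     */
    static int sadUnrolled(int[] ref, int[] curr) {
        int d0 = 0, d1 = 0, d2 = 0, d3 = 0;
        for (int i = 0; i < ref.length; i += 4) {
            // Four isomorphic statements: same operations, neighbouring elements.
            d0 += Math.abs(ref[i] - curr[i]);
            d1 += Math.abs(ref[i + 1] - curr[i + 1]);
            d2 += Math.abs(ref[i + 2] - curr[i + 2]);
            d3 += Math.abs(ref[i + 3] - curr[i + 3]);
        }
        // Combine the lanes once, after the loop.
        return d0 + d1 + d2 + d3;
    }

    public static void main(String[] args) {
        int[] ref = new int[16];
        int[] curr = new int[16];
        for (int i = 0; i < 16; i++) {
            ref[i] = i;
            curr[i] = 15 - i;
        }
        System.out.println(sadScalar(ref, curr) + " == " + sadUnrolled(ref, curr));
    }
}
```

Both methods return the same sum because integer addition can be regrouped freely; only the number of loop trips differs.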

Which flags control loop unrolling and SLP?

There are a couple of flags: -XX:LoopUnrollLimit and -XX:+UseSuperWord. LoopUnrollLimit limits how aggressively your loops are unrolled (it caps the size of the loop bodies the JIT is willing to unroll), and UseSuperWord controls the transformation of scalar operations into vectorized operations.

Conclusion

Loop unrolling makes your code faster by doing more per iteration and increasing the stride. It’s a space-time trade-off where you make your code larger so that the execution takes less time. Loop unrolling can be taken a step further to turn scalar instructions into vector instructions. These instructions operate on packed data and achieve more per instruction than their scalar counterparts.

The conversion from scalar instructions to vector instructions is called vectorization. Since this happens transparently to the programmer, it is called auto-vectorization.

JVM JIT - Compiling to Assembly

In the previous post we saw how JIT inlining works. We also saw how the JVM performs OSR to replace the interpreted version of the method with the compiled version on the fly. In this post we’ll dig even deeper and see the assembly code that is generated when the method gets compiled.

Prerequisites

The flag which enables us to see assembly code is -XX:+PrintAssembly. However, viewing assembly code does not work out of the box. You’ll need to have the disassembler on your path. You’ll need to get hsdis (HotSpot Disassembler) and build it for your system. There’s a prebuilt version available for Mac and that’s the one I am going to use.

git clone https://github.com/liuzhengyang/hsdis.git

Once we have that, we’ll add it to LD_LIBRARY_PATH.

export LD_LIBRARY_PATH=./hsdis/build/macosx-amd64

Now we’re all set to see how JVM generates assembly code.

Printing assembly code

We’ll reuse the same inlining code from last time:

public class Inline {
    public static void main(String[] args) {
        long upto = Long.parseLong(args[0]);

        for(int i = 0; i < upto; i++) {
            int x = inline1();
        }
    }

    public static int inline1() {
        return inline2();
    }

    public static int inline2() {
        return inline3();
    }

    public static int inline3() {
        return 4;
    }
}

-XX:+PrintAssembly is a diagnostic flag so we’ll need to unlock the JVM’s diagnostic options first. Here’s how:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Inline 100000

This will generate a lot of assembly code. We will, however, look at the assembly code generated for inline1.

Decoding compiled method 0x000000010d6095d0:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} 'inline1' '()I' in 'Inline'
  #           [sp+0x20]  (sp of caller)
  0x000000010d609700: sub    $0x18,%rsp
  0x000000010d609707: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - Inline::inline1@-1 (line 11)
  0x000000010d60970c: mov    $0x4,%eax
  0x000000010d609711: add    $0x10,%rsp
  0x000000010d609715: pop    %rbp
  0x000000010d609716: test   %eax,-0x496b71c(%rip)        # 0x0000000108c9e000
                                                ;   {poll_return}
  0x000000010d60971c: retq
  0x000000010d60971d: hlt
  0x000000010d60971e: hlt
  0x000000010d60971f: hlt
[Exception Handler]
[Stub Code]
  0x000000010d609720: jmpq   0x000000010d6050a0  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010d609725: callq  0x000000010d60972a
  0x000000010d60972a: subq   $0x5,(%rsp)
  0x000000010d60972f: jmpq   0x000000010d5deb00  ;   {runtime_call}
  0x000000010d609734: hlt
  0x000000010d609735: hlt
  0x000000010d609736: hlt
  0x000000010d609737: hlt
Decoding compiled method 0x000000010d6064d0:

So this is the assembly code that we get when we run the program. It’s a lot to grok in one go so let’s break it down.

Lines #7 and #8 are self-explanatory; they show which method we’re looking at. Lines #9 and #10 set up the stack frame, and lines #13 to #17 tear it down, poll for a safepoint (the test instruction marked poll_return), and return. The “synchronization entry” annotation does not mean that locks are being taken: it marks the method’s entry point, and since inline1 is not a synchronized method, no locking code is generated.

Our actual program is on line #12 where we are moving the value 4 to the %eax register. This is the register which holds, by convention, the return value for our methods. This shows that the JVM has optimized our code. Our call chain was inline1 ⟶ inline2 ⟶ inline3 and it was inline3 which returned 4. However, the JVM is smart enough to see that these method calls are superfluous and decided to get rid of them. Very nifty!

Lines #21 to #23 have code to handle exceptions. We know there won’t be any exceptions but the JVM doesn’t, so it has to be prepared to deal with them.

And finally, there’s code to deoptimize. In addition to static optimizations, there are some optimizations that the JVM makes which are speculative. This means that the JVM generates assembly code expecting things to go a certain way after it has profiled the interpreted code. However, if the speculation is wrong, the JVM can go back to running the interpreted version.

Which flags control compilation?

-XX:CompileThreshold is the flag which controls the number of call / branch invocations after which the JVM compiles bytecodes to assembly. You can use -XX:+PrintFlagsFinal to see the value. By default it is 10000.

Compiling a method to assembly depends on two factors: the number of times that method has been invoked (method entry counter) and the number of times a loop has been executed (back-edge counter). Once the sum of the two counters is above CompileThreshold, the method will be compiled to assembly.

Maintaining the two counters separately is very useful. If the back-edge counter alone exceeds the threshold, the JVM can compile just the loop (and not the entire method) to assembly. It will perform an OSR and start using the compiled version of the loop while the loop is executing instead of waiting for the next method invocation. When the method is invoked the next time around, it’ll use the compiled version of the code.
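
The interplay of the two counters can be sketched as a toy model. This is not HotSpot’s actual policy, which is considerably more involved once tiered compilation enters the picture. The class `HotnessSketch`, its method names and the tiny threshold of 10 are all invented for illustration; the code only encodes the two rules described above.

```java
public class HotnessSketch {
    enum Decision { INTERPRET, COMPILE_METHOD, COMPILE_LOOP_OSR }

    private final int threshold;   // stands in for CompileThreshold
    private int invocations;       // method entry counter
    private int backEdges;         // back-edge (loop iteration) counter

    HotnessSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Called every time the method is invoked. */
    Decision onMethodEntry() {
        invocations++;
        return decide();
    }

    /** Called every time a loop in the method jumps back to its start. */
    Decision onBackEdge() {
        backEdges++;
        return decide();
    }

    private Decision decide() {
        // Rule 2: the loop alone is hot, so compile it and switch over mid-execution.
        if (backEdges > threshold) {
            return Decision.COMPILE_LOOP_OSR;
        }
        // Rule 1: the sum of both counters crossed the threshold.
        if (invocations + backEdges > threshold) {
            return Decision.COMPILE_METHOD;
        }
        return Decision.INTERPRET;
    }

    public static void main(String[] args) {
        // A method that is called once but loops for a long time.
        HotnessSketch longLoop = new HotnessSketch(10);
        System.out.println("entry: " + longLoop.onMethodEntry());
        for (int i = 1; i <= 12; i++) {
            System.out.println("back-edge " + i + ": " + longLoop.onBackEdge());
        }
    }
}
```

Keeping the counters separate is what lets the model tell "called often" apart from "loops for a long time"; a single combined counter could not make that distinction.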

Since compiled code runs faster than interpreted code, and CompileThreshold controls when a method will be compiled to assembly, reducing the CompileThreshold would mean that more methods get compiled, and sooner.

There is one advantage to reducing the CompileThreshold - it will reduce the time taken for the branches / methods to be deemed hot i.e. reduce the JVM warmup time.

In older JDKs, there was another reason to reduce CompileThreshold. The method entry and back-edge counters would decay at every safepoint. This would mean that some methods would not compile to assembly since the counters kept decaying. These are the “lukewarm” methods that never became hot. With JDK 8+, the counters no longer decay at safepoints so there won’t be any lukewarm methods.

In addition, JDK 8+ comes with tiered compilation enabled and the CompileThreshold is ignored. The idea of there being a “compile threshold”, though, does not change. I’m deferring the topic of tiered compilation for the sake of simplicity.

Where is the compiled code stored?

The compiled code is stored in the JVM’s code cache. As more methods become hot, the cache starts to get filled. Once the cache is filled, the JVM can no longer compile anything to assembly and will resort to purely interpreting the bytecodes.

The size of code cache is platform dependent.

Also, JVM ensures that the access to cache is optimized. The hlt instructions in the assembly code exist for aligning the addresses. It is much more efficient for the CPU to read from even addresses than it is to read from odd addresses in memory. The hlt instructions ensure that the code is at an even address in memory.

Which flags control code cache size?

There are two flags which are important in setting the code cache size - InitialCodeCacheSize and ReservedCodeCacheSize. The first flag indicates the code cache size the JVM will start with and the latter indicates the size to which the code cache can grow. With JDK 8+, ReservedCodeCacheSize is large enough so you don’t need to set it explicitly. On my machine it is 240 MB (5x what it is for Java 7, 48 MB).
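
If you would rather inspect the code cache from inside a running program, the standard management API exposes it as one or more memory pools. The sketch below is an assumption-laden illustration: the class name `CodeCacheSketch` is made up, and matching pools by the substring "Code" relies on HotSpot’s pool naming ("Code Cache" on JDK 8, several "CodeHeap ..." pools on newer JDKs), which other JVMs need not follow.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class CodeCacheSketch {
    /** Returns one description line per memory pool whose name mentions "Code". */
    static List<String> codeCachePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (!name.contains("Code")) {
                continue; // heap and metaspace pools are not of interest here
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                lines.add(name + ": usage unavailable");
            } else {
                // getMax() returns -1 when the maximum is undefined.
                lines.add(name + ": used=" + usage.getUsed()
                        + " bytes, max=" + usage.getMax() + " bytes");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : codeCachePools()) {
            System.out.println(line);
        }
    }
}
```

Running it with a different -XX:ReservedCodeCacheSize should change the reported maximum, which is a quick way to confirm the flag took effect.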

Conclusion

The JVM compiles hot code to assembly and stores it at even addresses in its code cache for faster access. Executing assembly code is much more efficient than interpreting the bytecodes. You don’t really need to look at the generated assembly code every day but knowing what is generated as your code executes gives you an insight into what the JVM does to make your code run faster.

2018-04-29

JVM JIT - Inlining

In the previous post we looked at how interpreted and compiled languages work. To recap, an interpreter works by generating assembly code for every bytecode it encounters. This is a very simple way to execute a program and also a very slow one. It ends up redoing a lot of translation from bytecode to assembly. Also, this simplistic approach means that the interpreter cannot do optimizations as it executes the bytecodes. Then there are compilers which produce assembly ahead-of-time. This overcomes having to generate assembly again and again but once the assembly is generated it cannot be changed on the fly.

JVM comes with both an interpreter and a compiler. When the execution of the code begins, the bytecodes are interpreted. For the sake of this series, I’ll be looking at Oracle HotSpot JVM which looks for “hot spots” in the code as the bytecodes get interpreted. These are the parts of the code which are most frequently executed and the performance of the application depends on these. Once the code is identified as “hot”, JVM can go from interpreting the code to compiling it to assembly i.e. the code is compiled “just-in-time”. In addition, since the code is being profiled as it is run, the compiled code is optimized.

In this post we’ll look at one such optimization: inlining.

Inlining

Inlining is an optimization where the call to a method is replaced by the body of the called method i.e. at the call site, the caller and the callee are melded together. When a method is called, the JVM has to push a stack frame so that it can resume from where it left off after the called method has finished executing. Inlining improves performance since JVM will not have to push a stack frame.

I’ll start with a simple example to demonstrate how inlining works.

Inline.java

public class Inline {
    public static void main(String[] args) {
        long upto = Long.parseLong(args[0]);

        for(int i = 0; i < upto; i++) {
            int x = inline1();
        }
    }

    public static int inline1() {
        return inline2();
    }

    public static int inline2() {
        return inline3();
    }

    public static int inline3() {
        return 4;
    }
}

Next, let’s compile and run the code.

java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining Inline 100000

Output:

63    1             Inline::inline1 (4 bytes)
63    2             Inline::inline2 (4 bytes)
                       @ 0   Inline::inline3 (2 bytes)   inline (hot)
                       @ 0   Inline::inline2 (4 bytes)   inline (hot)
                         @ 0   Inline::inline3 (2 bytes)   inline (hot)
66    3             Inline::inline3 (2 bytes)
66    4 %           Inline::main @ 9 (28 bytes)
                       @ 16   Inline::inline1 (4 bytes)   inline (hot)
                         @ 0   Inline::inline2 (4 bytes)   inline (hot)
                           @ 0   Inline::inline3 (2 bytes)   inline (hot)
66    4 %           Inline::main @ -2 (28 bytes)   made not entrant

Line #1 shows that inline1 was compiled to assembly. Line #2 and line #6 show that inline2 and inline3 were also compiled to assembly. Line #3 to line #5 show inlining. We can see that inline3 was merged into inline2. Similarly, line #8 and #9 show that inline2 was merged into inline1. So basically, all the methods were inlined into inline1. This means that once a certain threshold is crossed, we’ll no longer be making method calls at all. This gives a significant performance boost.
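
Written out by hand, the effect of this inlining on the loop looks roughly like the sketch below. It is a source-level illustration only: the JIT works on bytecode and machine code rather than Java source, and `InlineSketch`, `sumWithCalls` and `sumInlined` are made-up names. The loop also accumulates a sum, unlike the original, so that the two versions can be compared.

```java
public class InlineSketch {
    static int inline1() { return inline2(); }
    static int inline2() { return inline3(); }
    static int inline3() { return 4; }

    /** As written: every iteration goes through three nested calls. */
    static long sumWithCalls(int upto) {
        long sum = 0;
        for (int i = 0; i < upto; i++) {
            sum += inline1();
        }
        return sum;
    }

    /** After inlining by hand: the call chain collapses to the constant 4. */
    static long sumInlined(int upto) {
        long sum = 0;
        for (int i = 0; i < upto; i++) {
            sum += 4;
        }
        return sum;
    }

    public static void main(String[] args) {
        int upto = 100000;
        System.out.println(sumWithCalls(upto) == sumInlined(upto));
    }
}
```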

Which flags control inlining?

When you run a Java program, you can view the flags with which it ran using -XX:+PrintFlagsFinal. Let’s do that and look at a few flags of interest.

java -XX:+PrintFlagsFinal Inline 10000

You’ll see a bunch of flags and their default values. The ones we are interested in are CompileThreshold, MaxInlineLevel, MaxInlineSize, and FreqInlineSize.

CompileThreshold is the number of invocations before compiling a method to native.
MaxInlineLevel is a limit on how deep you’d go before you stop inlining. The default value is 9. This means if we had method calls like inline1 ⟶ inline2 … ⟶ inline20, we’d only inline up to inline10. Thereafter, we’d invoke inline11.
MaxInlineSize decides the maximum size of a method, in bytecodes, to be inlined. The default value is 35. This means that if the method to be inlined has more than 35 bytecodes, it will not be inlined.
FreqInlineSize, in contrast, decides the maximum size of a hot method, in bytecodes, to be inlined. This is a platform-dependent value and on my machine it is 325.

You can tweak these flags to change how inlining behaves for your program.
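
The same values can also be read from inside a running program. The sketch below uses HotSpot’s diagnostic MXBean, so it assumes a HotSpot-based JVM; `InlineFlagsSketch` and `readFlags` are made-up names, and a flag that a particular JVM build does not know is reported as "unavailable" instead of failing.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class InlineFlagsSketch {
    /** Looks up each named VM flag and returns its current value as a string. */
    static Map<String, String> readFlags(String... names) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : names) {
            try {
                HotSpotDiagnosticMXBean bean =
                        ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
                values.put(name, bean.getVMOption(name).getValue());
            } catch (RuntimeException e) {
                // Unknown flag, or the bean is not available on this JVM.
                values.put(name, "unavailable");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> flags = readFlags(
                "CompileThreshold", "MaxInlineLevel", "MaxInlineSize", "FreqInlineSize");
        for (Map.Entry<String, String> entry : flags.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Starting the program with, say, -XX:MaxInlineSize=50 and checking the printed value is an easy way to confirm that a tweak was picked up.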

What is On Stack Replacement (OSR)?

When we make a method call, the JVM pushes a stack frame. When a method is deemed hot, the JVM replaces the interpreted version with the compiled version by replacing the old stack frame with a new one. This is done while the method is running. We saw OSR being indicated in our example. The % indicates that an OSR was made.

66 4 % Inline::main @ -2 (28 bytes) made not entrant

Let’s write some code to see OSR in action once again.

import java.lang.ref.WeakReference;

public class OSR {
    public static void main(String[] args) {
        Object unused = new Object();
        WeakReference<Object> ref = new WeakReference<>(unused);

        int x = 0;

        while( ref.get() != null ) {
            x += 1;
            System.out.println(x);
        }

        System.out.println("Finished!");
    }
}

So this is a loop that will never terminate, right? Let’s run the program and see.

...
434062
  16828   59 %           OSR::main @ -2 (48 bytes)   made not entrant
Finished!

What just happened? When the JVM decided to perform an OSR, it saw that there was no use for the unused object and decided to set it to null, causing the WeakReference to return null and thus breaking the loop. When an OSR is performed, the method that is invoked doesn’t restart execution from the start. Rather, it continues from the “back-edge”. In our case, it would be the loop. Since the JVM saw that there was no use for the unused object after this back-edge, it was removed and the loop could terminate.

Being able to resume execution from the back-edge is very efficient. This means that once a method has been compiled to native code it can be used right away rather than at the next invocation of the method.

Conclusion

To recap, we saw how JVM inlines code. Fusing the caller and the callee provides for improved performance since the overhead of method dispatch is avoided. We saw the flags which control inlining and we saw how JVM performs OSR.

Inlining is a very useful optimization because it forms the basis for other optimizations like escape analysis and dead code elimination.

2018-04-17

JVM JIT - Introduction

Motivation

My day job requires me to write code in Clojure. This means the code is eventually compiled to bytecode and run on the JVM. Intrigued by how the JVM does what it does, I decided to dig a little deeper and look at how it optimizes the code on the fly. In this series of posts I will be looking at JVM JIT (Just-In-Time) compiler.

Myriad ways to run a program

Before I go into how JVM JIT works, I want to take a quick look at how interpreted and compiled languages work. For this post, I’ll take a look at the working of Python (an interpreted language) and C (a compiled language).

Python

Python, by default, ships with CPython - the original Python interpreter that runs C code for every bytecode. There are other implementations like IronPython and PyPy. IronPython turns Python into a fully compiled language running on top of Microsoft’s .NET Common Language Runtime (CLR) whereas PyPy turns Python into a JIT compiled language. For the sake of this post, however, I will look at CPython and how it works.

I’ll start with some code which will print the bytecodes for another Python file that is passed to it.

bytecode.py

from sys import argv
from dis import dis

script, path = argv
source_file = open(path)
source_code = source_file.read()
compiled = compile(source_code, "<string>", "exec")
bytecodes = dis(compiled)

print(bytecodes)
source_file.close()

Next, here’s some code that’ll print numbers.

print_numbers.py

for n in [101, 102, 103]:
    print(n)

Now, let’s run the code and see the bytecodes we get.

python3 bytecode.py print_numbers.py

Output:

  1           0 SETUP_LOOP              20 (to 22)
              2 LOAD_CONST               4 ((101, 102, 103))
              4 GET_ITER
        >>    6 FOR_ITER                12 (to 20)
              8 STORE_NAME               0 (n)

  2          10 LOAD_NAME                1 (print)
             12 LOAD_NAME                0 (n)
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 JUMP_ABSOLUTE            6
        >>   20 POP_BLOCK
        >>   22 LOAD_CONST               3 (None)
             24 RETURN_VALUE
None

The loop starts on line #4. For every element in the list, we’re pushing print and n onto the stack, calling the function, popping the stack, and repeating the loop. For each of the bytecodes, there’s associated C code i.e. FOR_ITER, STORE_NAME, etc. have associated C code.

This is a very simple way to run a program and also a very inefficient one. We’re repeating the stack operations and jumps over and over again. There’s no scope for optimizations like loop unrolling.

C

In contrast to Python is C. All the C code is compiled to assembly ahead-of-time. Here’s a simple C program which will print “EVEN” if a number is even.

numbers.c

#include<stdio.h>

int main() {
  for(int i = 1; i < 10000; i += 2) {
    if( i % 2 == 0 ) {
      printf("EVEN!");
    } else {
      printf("");
    }
  }
  return 0;
}

Next, let’s compile this code.

gcc -S numbers.c

This will generate numbers.s. The assembly is fairly long so I’ll just cover the relevant parts.

numbers.s

LBB0_1:
        cmpl    $10000, -8(%rbp) 
        jge     LBB0_7
        ...
        idivl   %ecx
        cmpl    $0, %edx
        jne     LBB0_4
        leaq    L_.str(%rip), %rdi
        ...
        callq   _printf
        movl    %eax, -16(%rbp)
        jmp     LBB0_5
LBB0_4:
        leaq    L_.str.1(%rip), %rdi
        movb    $0, %al
        callq   _printf
        ...
LBB0_5:
        jmp     LBB0_6
LBB0_6:
        ...
        addl    $2, %eax
        ...
        jmp     LBB0_1
        ...
LBB0_7:
        ...
        retq
        ...
L_.str:
        .asciz  "EVEN!"
L_.str.1:
        .space  1

Lines #2 - #3 show that if we’ve reached the limit of 10k, we’ll jump to LBB0_7 and the program ends.
If not, on line #5 we perform a signed division (idivl) and check whether the remainder is zero. If it is not zero, we jump to LBB0_4 and print L_.str.1 which is just an empty string.

We will always end up making this jump because we’ll never reach the condition where we have an even number. This is the problem with ahead-of-time compilation where you cannot speculate what the data is going to be and therefore you have to be ready to handle all the possibilities.

JVM JIT

JVM JIT combines the best of both the worlds. When you execute your program the first time, the bytecodes are interpreted. As the code continues to execute, JVM collects statistics about it and the more frequently used code (“hot” code) is compiled to assembly. In addition, there are optimizations like loop unrolling. Loop unrolling looks like this:

// Plain ol' loop
for(i = 0; i < 3; i++) {
    System.out.println(arr[i]);
}

// Unrolled
System.out.println(arr[0]);
System.out.println(arr[1]);
System.out.println(arr[2]);

Unrolling a loop helps avoid jumps and thus makes execution faster.

Also, since JVM collects statistics about code, it can make optimizations on the fly. For example, in the case where an even number is never reached, JVM can generate assembly code that’ll only have the else part of the branch.

Conclusion

The JVM does some fairly interesting optimizations under the hood. The aim of this series of posts is to cover as much of this as possible. We’ll start simple and build upon this as we go.

2018-01-06

Transducers

One of the nice things that you’ll come across in Clojure is the transducer. In this post I’ll go over what transducers are, how you can use them, how you can make one, and what transducible contexts are.

What are transducers?

In simple terms, transducers are composable transformation pipelines. A transducer does not care about where the input comes from or where the output will go; it simply cares about the transformation of the data that flows through the pipeline. Let’s look at an example:

(let [xf (comp 
           (map inc)
           (filter even?))]
  (sequence xf (range 10)))

Here xf (short for xform, i.e. transformation) is our transducer which will increment every number and then will keep only the even numbers. Calling sequence functions like map, filter, etc. with single arity returns a transducer which you can then compose. The transducer doesn’t know where it will be used - will it be used with a collection or with a channel? So, a transducer captures the essence of your transformation. sequence is responsible for providing the input to the transducer. This is the context in which the transducer will run.

Here’s how the same thing can be done using threading macro:

(->> (range 10)
     (map inc)
     (filter even?))

The difference here is that the 2-arity versions of map and filter will create intermediate collections while the 1-arity versions won’t. Transducers are much more efficient than threading together sequence functions.

Chaining
Transducing

Source for images

Inside a transducer

Let’s look at the 1-arity version of map and see what makes a transducer.

(defn map
  ([f]
    (fn [rf]
      (fn
        ([] (rf))
        ([result] (rf result))
        ([result input]
           (rf result (f input)))
        ([result input & inputs]
           (rf result (apply f input inputs))))))
   ...)

When you call the 1-arity version of map you get back a transducer which, as shown above, is a function. Functions like map, filter, etc. take a collection and return a collection. Transducers, on the other hand, take one reducing function and return another. The function returned is expected to have three arities:

0-arity (init): This kickstarts the transformation pipeline. The only thing you do here is call the reducing function `rf`.

2-arity (step): This is where you'll perform the transformation. You get the result so far and the next input. In case of `map`, you call the reducing function `rf` by applying the function `f` to the input. How the value is going to be added to the result is the job of `rf`. If you don't want to add anything to the result, just return the result as-is. You may call `rf` once, multiple times, or not at all.

1-arity (end): This is called when the transducer is terminating. Here you must call `rf` exactly once and call the 1-arity version. This results in the production of the final value.

So, the general form of a transducer is this:

(fn [rf]
  (fn 
    ([] (rf))
    ([result] (rf result))
    ([result input] ... )))
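
For readers more at home in Java, the same shape can be sketched with two small functional interfaces. This is only an analogy under stated assumptions: `Rf`, `Xf`, `map`, `filter` and `reduce` below are made-up names rather than Clojure’s implementation, and only the 2-arity (step) case is modelled; the init and completion arities are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class TransducerSketch {
    /** A reducing function: (result so far, next input) -> new result. */
    interface Rf<R, A> {
        R step(R result, A input);
    }

    /** A transducer: takes a reducing function over B and returns one over A. */
    interface Xf<R, A, B> {
        Rf<R, A> apply(Rf<R, B> rf);
    }

    /** Transforms each input with f before handing it to the wrapped rf. */
    static <R, A, B> Xf<R, A, B> map(Function<A, B> f) {
        return rf -> (result, input) -> rf.step(result, f.apply(input));
    }

    /** Passes an input on only if it satisfies pred; otherwise returns the result as-is. */
    static <R, A> Xf<R, A, A> filter(Predicate<A> pred) {
        return rf -> (result, input) -> pred.test(input) ? rf.step(result, input) : result;
    }

    /** The context: feeds every element of coll through the reducing function. */
    static <R, A> R reduce(Rf<R, A> rf, R init, Iterable<A> coll) {
        R result = init;
        for (A item : coll) {
            result = rf.step(result, item);
        }
        return result;
    }

    /** Increments every number, then keeps only the even ones. */
    static List<Integer> incThenKeepEven(List<Integer> input) {
        Xf<List<Integer>, Integer, Integer> inc = map(x -> x + 1);
        Xf<List<Integer>, Integer, Integer> even = filter(x -> x % 2 == 0);
        // The final reducing function decides how values are added to the result.
        Rf<List<Integer>, Integer> conj = (acc, x) -> { acc.add(x); return acc; };
        Rf<List<Integer>, Integer> step = inc.apply(even.apply(conj));
        return reduce(step, new ArrayList<>(), input);
    }

    public static void main(String[] args) {
        System.out.println(incThenKeepEven(Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)));
    }
}
```

Here `inc.apply(even.apply(conj))` plays the role of composing the two transducers and applying them to a reducing function: each element is incremented, tested, and appended in a single pass, with no intermediate list in between.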

Using transducers

You can use a transducer in a context. There are four contexts which come out of the box: into, transduce, sequence, and eduction.

into

The simplest way to use a transducer is to pass it to into. This will add your transformed elements to an already-existing collection after applying the transducer. In this example, we’re simply adding a range into a vector.

(let [xf (comp (map inc)
               (map dec))]
  (into [] xf (range 10)))

Internally, into calls transduce.

transduce

transduce is similar to the standard reduce function but it also takes an additional xform as an argument.

(let [xf (comp (map inc)
               (map dec))]
  (transduce xf + (range 10)))

sequence

sequence lets you create a lazy sequence after applying a transducer. In contrast, into and transduce are eager.

(let [xf (comp (map inc)
               (map dec))]
  (sequence xf (range 10)))

eduction

eduction lets you capture applying a transducer to a collection. The value returned is an iterable application of the transducer on the collection items which can then be passed to, say, reduce.

(let [xf (comp (map inc)
               (map dec))
      coll (eduction xf (range 10))]
  (reduce + 0 coll))

Inside a transducible context

As mentioned before, transducers run in transducible contexts. The ones that come as a part of clojure.core would suffice for most real-world needs and you’ll rarely see yourself writing new ones. Let’s look at transduce.

(defn transduce
  ([xform f coll] (transduce xform f (f) coll))
  ([xform f init coll]
     (let [f (xform f)
           ret (if (instance? clojure.lang.IReduceInit coll)
                 (.reduce ^clojure.lang.IReduceInit coll f init)
                 (clojure.core.protocols/coll-reduce coll f init))]
       (f ret))))

transduce is just like reduce. The 3-arity version expects an initial value to be supplied by calling the 0-arity version of the supplied function. The 4-arity version is slightly more involved. IReduceInit is an interface implemented by collections to let them provide an initial value. It lets a collection reduce itself. If not, the call goes to coll-reduce which is a faster way to reduce a collection than using first/next recursion.

Stateful transducers

It’s possible for transducers to maintain reduction state.

(defn multiply-xf
  []
  (fn [rf]
    (let [product (volatile! 1)]
      (fn
        ([] (rf))
        ([result] (rf result))
        ([result input]
         (let [new-product (vswap! product * input)]
           (rf result new-product)))))))

Here’s a transducer which will multiply all the incoming numbers. We maintain state by using a Volatile. Whenever we get a new input we multiply it with the product and update the state of Volatile using vswap!. Let’s see this in action:

(into [] (multiply-xf) [1 2 3])
=> [1 2 6]

Early Termination

The way the above transducer is written, it’ll process all the inputs even if one of the inputs is zero. We know that once we encounter a zero, we can safely end the reduction process and return a zero. reduced lets you return a reduced value and end the reduction. Let’s make a minor change to the above transducer and add in early termination.

(defn multiply-xf
  []
  (fn [rf]
    (let [product (volatile! 1)]
      (fn
        ([] (rf))
        ([result] (rf result))
        ([result input]
         (let [new-product (vswap! product * input)]
           (if (zero? new-product)
             (reduced result)
             (rf result new-product))))))))

In the 2-arity function, we check if the new-product is zero. If it is, we know we have a reduced value. We end the reduction by returning the result we have so far. Let’s see this in action:

(into [] (multiply-xf) [1 2 3])
=> [1 2 6]
(into [] (multiply-xf) [1 2 0 3])
=> [1 2]

Conclusion

Transducers can be a very useful tool in your Clojure toolkit that let you process large collections, channels, etc. effectively by letting you make composable transformation pipelines that process one element at a time. They require a little getting used to but once you’re past the learning curve, performance galore!

2017-12-06

One Year as a Software Engineer - a Retrospective

December 2017 marks my completing one year as a software engineer. This post is a retrospective where I list the lessons I’ve learned, in no specific order of priority.

Personal Life

Learn to invest

We’re having too good a time today. We ain’t thinking about tomorrow.
— John Dillinger, Public Enemy

It’s easy to get carried away because of the fat pay cheque your fancy developer job brings you at the end of the month. Think of all the bling you can buy. This hedonistic attitude, however, does not help you hedge against the volatility of the job market. In his book Out of our Minds: Learning to be Creative, Ken Robinson states that secure life-long employment in a single job is a thing of the past. This applies even more if you work in a startup environment where things change in the blink of an eye.

Making proper investments and establishing a second flow of cash can help you get a grip on the situation when it gets rough. The simplest way to invest is to put your money in the stock market and let the dividends add to your monthly flow of cash. You do not have to be an active day trader managing your positions. You can very easily invest in mutual funds, index funds, etc. and they are not that difficult to begin with.

One of the best books I’ve been recommended by a friend of mine is Trading and Exchanges: Market Microstructure for Practitioners by Larry Harris. This will give you a very good overview of the market and all that you need to become a confident investor.

Be ready to interview

Love your job but don’t love your company, because you may not know when your company stops loving you.
— A. P. J. Abdul Kalam, 11^th President of India

The key takeaway of working in a startup environment is this: things change very rapidly, and when push comes to shove, you will be thrown overboard. Having seen this happen to people close to me, I’ve learned that you need to be ready to interview with other startups and/or companies as soon as you are fired or have resigned. This includes staying in touch with the fundamentals you learned in your CS classes, like data structures and algorithms, and knowing the technologies you’ve used in reasonable depth. Having a good network of developers also goes a long way in easing your search for a new job. And don’t forget to sharpen the saw: do things that are not programming that make you a better programmer.

Learn to negotiate

Dude, it’s five minutes. Let’s un-suck your negotiation.
— Patrick McKenzie

Learning how to negotiate is one of the most important skills, and it is often overlooked. It is overlooked either because negotiating for salary is perceived to be cheap, or simply to avoid a difficult conversation. Whatever the case, get over it and learn the skill. As Patrick McKenzie mentions in his brilliant blog post Salary Negotiation: Make More Money, Be More Valued, all it takes is five minutes to finalize your salary. These five minutes have a lasting impact for at least a year to come.

Read a lot

Read, read, read.
— William Faulkner

I’ll admit that it is hard to take time out to read daily, but keeping a dedicated slot of 30 minutes just for reading goes a long way. I make sure it’s a distraction-free slot with no dinging notifications and I try not to multitask. Another exercise that I do in conjunction with reading is trying to improve my retention of the material by improving my recall. The best way to do this is to use the Feynman technique, where you elucidate what you’ve learned and pretend to teach it to a student.

I prefer keeping one programming and one non-programming book as a part of my daily reading. In addition, there’s a list of blogs, and papers that I read from time to time. I’ll probably post a list as a separate post.

Engineering

Understand Peter principle

The key to management is to get rid of the managers.
— Ricardo Semler

Laurence Peter came up with the concept that a person in a role keeps getting promoted based on their performance in their current role and not on the abilities that the role demands. It is quite possible to have a manager who doesn’t know how to do his job well i.e. he’s risen to his level of incompetence. This principle is important to understand as an engineer and is something that should make one reflect on one’s current skillset - do you possess the abilities to be in the role you are currently in or do you need to skill up? Again, this goes back to my point on sharpening the saw and doing non-programming things that make you a better developer.

Tech debt is evil

Simplicity is hard work. But, there’s a huge payoff. The person who has a genuinely simpler system - a system made out of genuinely simple parts, is going to be able to affect the greatest change with the least work. He’s going to kick your ass. He’s gonna spend more time simplifying things up front and in the long haul he’s gonna wipe the plate with you because he’ll have that ability to change things when you’re struggling to push elephants around.
— Rich Hickey

When Ward Cunningham came up with the tech debt metaphor, he was referring to writing code despite having a poor understanding of the requirements. As time passes, whatever little understanding there was of the requirement fades away and the code is taken for granted. The definition of tech debt has since then come to represent poorly written code that later nobody understands and it’s taken for granted - something Ward Cunningham disagrees with.

The lethal combination is poorly written code for badly understood requirements, and you’ll come across it very often in startup environments. It comes with some pretty nasty ramifications like team in-fighting and politics. In the worst cases, it can bring the development of new features to a grinding halt.

Avoiding tech debt has to be a top priority for any startup that wants to grow. A lot of it revolves around establishing some processes to convey requirements among teams and ensuring that the resulting system design is simple. Like the quote by Rich Hickey shows, it is hard work but it will pay off in the longer run.

Centralize your Logs

However, logging doesn’t seem to receive the same level of attention; consequently, developers find it hard to know the ‘what, when, and how’ of logging.
— Colin Eberhardt

Please stop asking developers to SSH into machines to read the logs. Centralize your logs by using ELK or if you want to avoid the hassle of setting up ELK, use a third party service like Fluentd or similar. A good centralized logging strategy will not only save you the pain of SSH-ing into multiple servers and grep-ing, it will also let you search through them easily. In addition, aggregating logs from various servers helps you identify patterns that may emerge by checking what’s happening on multiple servers in a specific time range.

2017-11-16

Unit Testing with Specs

clojure.spec is a standard, expressive, powerful, and integrated system for specification and testing. It lets you define the shape of your data and place constraints on it. Once the shape and constraints are defined, clojure.spec can generate sample data which you can use to test your functions. In this post I’ll walk you through how you can use clojure.spec in conjunction with other libraries to write unit tests.

Motivation

As developers, we are accustomed to writing example-based tests - we provide a known input, look at the resulting output, and assert that it matches our expectations. Although there is nothing wrong with this approach, there are a few drawbacks:

It is expensive as it takes longer to complete.
It is easier to miss out on the corner cases.
It is more prone to pesticide paradox.^[1]

In contrast, clojure.spec allows you to do generative, property-based testing. Generative testing allows you to specify what kind of data you are looking for. This is done by using generators. A generator is a declarative description of possible inputs to a function.^[2] Property-based testing allows you to specify how your program is supposed to behave, given an input. A property is a high-level specification of behavior that should hold for a range of inputs.^[3]

Setup

Creating an App

We’ll begin by creating an app using lein and defining the dependencies. So go ahead and execute the following to create your project:

lein new app clj-spec

Adding Dependencies

Next we’ll add a few dependencies. cd into clj-spec and open project.clj. Add the following to your :dependencies:

[org.clojure/clojure "1.8.0"]
[clojure-future-spec "1.9.0-beta4"]  
[org.clojure/test.check "0.9.0"]  
[circleci/bond "0.3.0"] 
[cloverage "1.0.9"]

clojure.spec comes as a part of Clojure 1.9 which, as of writing, isn’t out yet. If you’re on Clojure 1.8, as I am, you can use clojure-future-spec which will give you the same APIs. circleci/bond is a stubbing library which we’ll use to stub IO, network calls, database calls, etc. cloverage is the tool we’ll use to see the coverage of our tests.

Using `clojure.spec`

Simple Specs

Fire up a REPL by executing lein repl and require the required namespaces ;)

clj-spec.core=> (require '[clojure.spec.alpha :as s])
nil
clj-spec.core=> (require '[clojure.spec.gen.alpha :as gen])
nil
clj-spec.core=> (require '[clojure.future :refer :all])
nil

spec will let us define the shape of our data, and constraints on it. gen will let us generate the sample data.

Let’s write a simple spec which we can use to generate integers.

clj-spec.core=> (s/def ::n int?)
:clj-spec.core/n

We’ve defined a spec ::n which will constrain the sample data to only be integers. Notice the use of double colons to create a namespace-qualified keyword; the spec library requires spec names to be namespace-qualified. Now let’s generate some sample data.

clj-spec.core=> (gen/generate (s/gen ::n))
-29454

s/gen takes a spec as an input and returns a generator which will produce conforming data. gen/generate exercises this generator to return a single sample value. You can produce multiple values by using gen/sample:

clj-spec.core=> (gen/sample (s/gen ::n) 5)
(0 0 1 -1 0)

We could have done the same thing more succinctly by using the built-in generators as follows:

clj-spec.core=> (gen/generate (gen/vector (gen/int) 5))
[25 -29 -13 26 -8]

Spec-ing Maps

Let’s say we have a map which represents a person and looks like this:

{:name "John Doe"
 :age 32}

Let’s spec this.

(s/def ::name string?)
(s/def ::age int?)
(s/def ::person (s/keys :req-un [::name ::age]))

We’ve defined ::name to be a string, and ::age to be an integer (positive or negative). You can make your specs as strict or as lenient as you choose. Finally, we define ::person to be a map which requires the keys ::name and ::age, albeit without namespace qualification (that is what :req-un means, so the map uses plain :name and :age). Let’s see this in action:

clj-spec.core=> (gen/generate (s/gen ::person))
{:name "KXYXcbk", :age 107706}
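
To illustrate making a spec stricter, here is a minimal sketch that narrows the person spec. The age bounds and the non-empty name rule are assumptions chosen for the example, not part of the original spec. s/int-in and s/valid? come from clojure.spec itself.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

;; Assumed constraints, chosen only for illustration.
(s/def ::name (s/and string? seq))   ; a string with at least one character
(s/def ::age (s/int-in 0 131))       ; an integer from 0 to 130 inclusive
(s/def ::person (s/keys :req-un [::name ::age]))

(s/valid? ::person {:name "John Doe" :age 32})   ; => true
(s/valid? ::person {:name "John Doe" :age -3})   ; => false, age out of range
(s/valid? ::person {:name "" :age 32})           ; => false, empty name
(s/valid? ::person {:name "John Doe"})           ; => false, :age is missing

;; Generated samples respect the tighter constraints as well.
(gen/generate (s/gen ::person))
```

Evaluating this in the same namespace replaces the looser definitions of ::name, ::age and ::person, so re-evaluate the originals if you want to follow the rest of the post with them.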

By now you must have a fair idea of how you can spec your data and have sample values generated that match those specs. Next we’ll look at how we can do property-based testing with specs.

Using `test.check`

test.check allows us to do property-based testing. Property-based tests make statements about the output of your code based on the input, and these statements are verified for many different possible inputs.^[4]

A Simple Function

(defn even-or-odd
  [coll]
  (map #(cond
          (even? %) :even
          :else :odd) coll))

We’ll begin by testing the simple function even-or-odd. We know that for all even numbers we should get :even and for all odd numbers we should get :odd. Let’s express this as a property of the function. Begin by require-ing a couple more namespaces.

clj-spec.core=> (require '[clojure.test.check :as tc])
nil
clj-spec.core=> (require '[clojure.test.check.properties :as prop])
nil

Now for the actual property.

(def property
  (prop/for-all [v (gen/vector (gen/choose 0 1))]
    (let [result (even-or-odd v)
          zeroes-and-ones (group-by zero? v)
          zeroes (get zeroes-and-ones true)
          ones (get zeroes-and-ones false)
          evens-and-odds (group-by #(= :even %) result)
          ;; read from the function's output, not from the input grouping
          evens (get evens-and-odds true)
          odds (get evens-and-odds false)]
      (and (= (count zeroes) (count evens))
           (= (count ones) (count odds))))))

We have a generator which will create a vector of 0s and 1s only. We pass that vector as an input to our function. Additionally, we know that the number of 0s should equal the number of :evens returned and that the number of 1s should equal the number of :odds returned.

Next, let’s test this property.

clj-spec.core=> (tc/quick-check 100 property)
{:result true, :num-tests 100, :seed 1510754879429}

Awesome! We ran the test 100 times and it passed every time. The added benefit is that the generated input will be different every time you run the test.
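
The property above only ever sees 0s and 1s. As a sketch of a broader check, the property below feeds arbitrary integers to the function and asserts facts that should hold for any input. The name parity-property is hypothetical, and even-or-odd is repeated so that the snippet stands alone.

```clojure
(require '[clojure.spec.gen.alpha :as gen]
         '[clojure.test.check :as tc]
         '[clojure.test.check.properties :as prop])

;; Repeated from the post so that the snippet is self-contained.
(defn even-or-odd
  [coll]
  (map #(cond
          (even? %) :even
          :else :odd) coll))

;; parity-property is a hypothetical name for this sketch.
(def parity-property
  (prop/for-all [v (gen/vector (gen/int))]
    (let [result (even-or-odd v)]
      (and
        ;; one output per input
        (= (count v) (count result))
        ;; nothing other than :even or :odd is ever returned
        (every? #{:even :odd} result)
        ;; as many :even results as there are even inputs
        (= (count (filter even? v))
           (count (filter #{:even} result)))))))

(tc/quick-check 100 parity-property)
```

If a property like this fails, the map returned by quick-check has :result false, the failing input under :fail, and a smaller input that still fails under :shrunk.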

Using `bond`

bond is a library which will let you stub side-effecting functions like database calls. We’ll require the namespace and modify our code to save even numbers to a database.

First, the namespace.

clj-spec.core=> (require '[bond.james :as bond])
nil

Next, the code.

(defn save
  [n]
  ;; let's assume there's a database call here
  nil)

(defn even-or-odd
  [coll]
  (map #(cond
          (even? %) (do
                      (save %) ;; save to database
                      :even)
          :else :odd) coll))

Now let’s update the property and stub save.

(def property
  (prop/for-all [v (gen/vector (gen/choose 0 1))]
    (bond/with-stub [save]
      (let [result (even-or-odd v)
            zeroes-and-ones (group-by zero? v)
            zeroes (get zeroes-and-ones true)
            ones (get zeroes-and-ones false)
            evens-and-odds (group-by #(= :even %) result)
            ;; read from the function's output, not from the input grouping
            evens (get evens-and-odds true)
            odds (get evens-and-odds false)]
        (and (= (count zeroes) (count evens))
             (= (count ones) (count odds))
             (= (count zeroes) (-> save bond/calls count)))))))

Notice how we’re using bond/with-stub and telling it to stub the save function, which calls the database. Later, we assert that the number of times the database was called is equal to the number of evens in the vector. Let’s verify the property.

clj-spec.core=> (tc/quick-check 10000 property)
{:result true, :num-tests 10000, :seed 1510834022725}

Voilà! It works!
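
Counting calls is one check; bond also records the arguments of each call. The sketch below assumes the :args key that bond’s README uses for recorded calls, and it uses doall to force the lazy sequence returned by map while the stub is still in place. save and even-or-odd are repeated from above so that the snippet stands alone.

```clojure
(require '[bond.james :as bond])

(defn save
  [n]
  ;; stands in for a database call
  nil)

(defn even-or-odd
  [coll]
  (map #(cond
          (even? %) (do
                      (save %) ;; save to database
                      :even)
          :else :odd) coll))

(def recorded-args
  (bond/with-stub [save]
    ;; map is lazy, so realize the result before inspecting the calls
    (doall (even-or-odd [1 2 3 4]))
    ;; collect the argument list of every recorded call to save
    (mapv :args (bond/calls save))))

;; Two calls are recorded: one with the argument 2 and one with the argument 4.
(= [[2] [4]] recorded-args)
```

Without the doall, nothing would force the sequence inside with-stub, so the stub could be gone before save is ever called.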

The last part of this post is about measuring test coverage using cloverage. For that, we’ll move our code to core.clj and write the test under the test directory.

Using `cloverage`

To see cloverage in action, we’ll need to add our functions to core.clj. Here’s what it’ll look like:

(ns clj-spec.core
  (:gen-class))

(defn save
  [n]
  ;; let's assume there's a database call here
  nil)

(defn even-or-odd
  [coll]
  (map #(cond
          (even? %) (do
                      (save %) ;; save to database
                      :even)
          :else :odd) coll))

(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "Hello, World!"))

Update your clj-spec.core-test to the following:

(ns clj-spec.core-test
  (:require [bond.james :as bond]
            [clojure.test.check.properties :as prop]
            [clojure.spec.gen.alpha :as gen]
            [clojure.test.check.clojure-test :refer :all]
            [clj-spec.core :refer :all]))

(defspec even-odd-test
 100
 (prop/for-all [v (gen/vector (gen/choose 0 1))]
   (bond/with-stub [save]
     (let [result (even-or-odd v)
           zeroes-and-ones (group-by zero? v)
           zeroes (get zeroes-and-ones true)
           ones (get zeroes-and-ones false)
           evens-and-odds (group-by #(= :even %) result)
           ;; read from the function's output, not from the input grouping
           evens (get evens-and-odds true)
           odds (get evens-and-odds false)]
       (and (= (count zeroes) (count evens))
            (= (count ones) (count odds))
            (= (count zeroes) (-> save bond/calls count)))))))

Here we are using the defspec macro to run the same property-based test 100 times, only this time we’ll run the test from the command line using lein. Execute the following command to run the test and see the coverage.

lein run -m cloverage.coverage -t 'clj-spec.core-test' -n 'clj-spec.core'

This will make use of cloverage to run the tests. -t denotes our test namespace and -n denotes the namespace for which the tests are written. You’ll get output like this:

...
Produced output in /Users/fasih/Personal/clj-spec/target/coverage .
HTML: file:///Users/fasih/Personal/clj-spec/target/coverage/index.html

|---------------+---------+---------|
|     Namespace | % Forms | % Lines |
|---------------+---------+---------|
| clj-spec.core |   82.61 |   88.89 |
|---------------+---------+---------|
|     ALL FILES |   82.61 |   88.89 |
|---------------+---------+---------|

Perfect!! Now we know how much coverage we have. The HTML file has a nice graphical representation of which lines we’ve covered with our tests.

Conclusion

This brings us to the end of the post on using clojure.spec to write generative, property-based tests in both the REPL and source files. Generative testing automates the task of having to come up with examples for your tests. Where to go from here? Each of these libraries is pretty powerful in itself and will provide you with the necessary tools to write powerful, robust, and expressive tests that require minimal effort. So, head over to the official docs to learn more.

2017-10-13

Monad

So far we’ve looked at monoids and functors. The next algebraic data structure we’ll cover is a monad. If you’ve wondered what a monad is but never really understood it, this is the post for you. I am sure that you’ve used it without realizing it. So let’s get to it.

Definition

A monad has more structure than a functor. This means that you can call map on it and that it obeys all the functor laws. In addition, a monad has a flatMap function which you can use to chain monads together. In essence, monads represent units of computation that you can chain together and the result of this chaining is also a monad.

Let’s look at a few examples.

Example

val first = List(1, 2)
val next = List(8, 9)

for {
  i <- first
  j <- next
}
yield(i * j)

The above code^[1] uses a for comprehension to multiply the elements of the two lists together. Under the hood, this gets translated to:

first flatMap {
  f => next map {
    n => f * n
  }
}

The compiler is making use of the List monad to chain operations together. Let’s break this down.

next map {
  n => f * n
}

This part of the code will return a List since that is what calling map on a List does. Since we have two elements in the first list, the mapping will generate two lists of two elements each. This isn’t what we want. We want a single list that combines the results together.

first flatMap {
  ...
}

The flattening of results is what flatMap does - it takes the two lists and squishes them into one.

Monad Laws

For something to be a monad, it has to obey the monad laws. There are three monad laws:

Left identity
Right identity
Associativity

Left Identity

This law means that if we take a value, put it into a monad, and then flatMap it with a function f, that’s the same as simply applying the function f to the original value. Let’s see this in code:

scala> def f(x: Int): List[Int] = { List(x * 2) }
f: (x: Int)List[Int]

// left identity
scala> List(2).flatMap(f) == f(2)
res5: Boolean = true

Right Identity

This law means that if we take a monad, flatMap it, and within that flatMap we try to create a monad out of it, then that’s the same as original monad. Let’s see this in code:

// right identity
scala> List(1, 2, 3).flatMap({ x => List(x) }) == List(1, 2, 3)
res6: Boolean = true

Let’s walk through this. The function passed to flatMap gets the elements of the original list, List(1, 2, 3), one by one. The intermediate result is List(List(1), List(2), List(3)). This is then flattened to create List(1, 2, 3), which is the original list.

Associativity

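This law states that when we chain flatMap calls, the way the calls are nested does not matter: applying f and then g one after the other gives the same result as applying a single function that runs f and then flatMaps its result with g. Let’s see this in code: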

scala> def f(x: Int): List[Int] = { List(x + 1) }
f: (x: Int)List[Int]

scala> def g(x: Int): List[Int] = { List(x + 1) }
g: (x: Int)List[Int]

scala> List(1, 2, 3).flatMap(f).flatMap(g) == List(1, 2, 3).flatMap(x => f(x).flatMap(g))
res8: Boolean = true

Conclusion

This brings us to the end of the post on monads and their laws. List isn’t the only monad in your arsenal. Options and Futures are monads, too. I suggest going ahead and constructing examples for monadic laws for them.
