Updating Experiments

LensKit 3.0 overhauls the way that we configure and run experiments. This page describes how to update your LensKit 2.x experiments to work in LensKit 3.

You can find an example using the new structure in the lk3 branch of eval-quickstart.

Note: This page pertains to an upcoming version of LensKit. If you want to use the features described on this page, use LensKit 3.0-SNAPSHOT.

Overview and Motivation

In LensKit 2, evaluations were configured and run via eval.groovy scripts written in a custom domain-specific language built on top of Groovy. These scripts handled data preparation and ran recommenders, and they had some rudimentary build-system capabilities for managing different tasks.

There were several problems with this:

We were implementing a build tool, and not doing nearly so good a job of it as Gradle or Ant.
The ins and outs of the script language were not documented and were somewhat prone to breakage.
It was difficult to extend, and even harder to document how to extend it.

For LensKit 3, we have rebuilt the evaluator’s various capabilities into command-line tools and written a more complete Gradle plugin to allow them to be configured and run. Now, rather than having a single LenskitEval task in your build.gradle, you define separate Crossfold and TrainTest tasks, and can mix them up with whatever other Gradle tasks you can imagine (including doing other things with the output of the crossfolder).

The Gradle tasks, and the Spec objects that they use to communicate with the evaluator, are documented in the Gradle plugin API docs.

Setting Up

LensKit 3 evaluations are driven by build.gradle rather than eval.groovy. If you were using Gradle to run evaluations before, you can edit your existing Gradle file; if you were using the old Maven plugin or some other tool, you’ll need to convert to Gradle.

You first need to tell Gradle where to get the LensKit plugin. We don’t yet publish it to the Gradle plugin repository (although we plan to at release time), so add the following to the top of your build.gradle:

buildscript {
    repositories {
        maven {
            url 'https://oss.sonatype.org/content/repositories/snapshots/'
        }
        mavenCentral()
    }
    dependencies {
        classpath 'org.grouplens.lenskit:lenskit-gradle:3.0-SNAPSHOT'
    }
}

Next, we need to activate the plugin and import its tasks:

apply plugin: 'java' // if you use Groovy or Scala, add those plugins
apply plugin: 'lenskit'

import org.lenskit.gradle.*

And then set up the project dependencies to pull in LensKit:

repositories {
    maven {
        url 'https://oss.sonatype.org/content/repositories/snapshots/'
    }
    mavenCentral()
}

dependencies {
    compile "org.grouplens.lenskit:lenskit-all:3.0-SNAPSHOT"
    runtime "org.grouplens.lenskit:lenskit-cli:3.0-SNAPSHOT"
}

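There’s a little redundancy here: the repositories inside the buildscript block tell Gradle where to find the plugin itself, while the top-level repositories block tells it where to find the project’s dependencies. Currently, Gradle doesn’t let us remove that duplication.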

Converting Crossfolding

Now that we have the main infrastructure ready, we can set up the data. We do this via a Crossfold task; this is like the crossfold task in the old LensKit evaluator.

task crossfold(type: Crossfold, group: 'evaluate') {
    input textFile {
        file "data/ml-100k/u.data"
        delimiter "\t"
        // ratings are on a 1-5 scale
        domain {
            minimum 1.0
            maximum 5.0
            precision 1.0
        }
    }
    // test on random 1/5 of each user's ratings
    userPartitionMethod holdoutFraction(0.2, 'random')
    // use 5-fold cross-validation
    partitionCount 5
    // pack data for efficiency
    outputFormat 'PACK'
}

The Gradle tasks configure spec objects that describe the evaluation to run. The tasks provide some helper methods, such as input, textFile, and holdoutFraction, to make it easier to build many kinds of specs; other methods, such as partitionCount, delegate directly to the spec’s JavaBean property methods.

Anything in one of the spec classes that cannot be successfully configured in a Gradle task is a bug.

Converting Train-Test Evaluation

Once you have created a crossfold task, you can use it to run a train-test experiment:

task evaluate(type: TrainTest, group: 'evaluate') {
    // we add our crossfold task as evaluation input
    dataSet crossfold

    // send the output to appropriate files
    outputFile "$buildDir/eval-results.csv"
    userOutputFile "$buildDir/eval-users.csv"

    // configure our algorithms
    algorithm 'PersMean', 'algorithms/pers-mean.groovy'
    algorithm 'ItemItem', 'algorithms/item-item.groovy'
    algorithm 'Custom', 'algorithms/custom.groovy'

    // and some evaluation tasks and metrics
    predict {
        metric 'rmse'
        metric 'ndcg'
    }
    recommend {
        metric 'mrr'
    }
}

Some things are very much like the old code, such as adding the crossfold task as a dataSet (except that the task cannot be nested in Gradle). The output file is also there.

There are two important changes to be aware of.

Algorithm Configuration

Algorithms are now configured in independent Groovy files, using the configuration syntax. In addition, you can have algorithm blocks to define multiple algorithms in a single Groovy file, just like the old algorithm blocks:

algorithm('FunkSVD') {
    attributes['FeatureCount'] = 100
    // configure your FunkSVD algorithm here
}

If an algorithm configuration file has no algorithm blocks, then the entire configuration is treated as a single algorithm. If there are one or more algorithm blocks, then the algorithms they define are used and a separate top-level algorithm is not created.
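As a rough illustration of the multi-block case, a single configuration file could define two variants of the same algorithm. This is only a sketch: the file name, the algorithm names, and the choice of 50 and 100 features are hypothetical, and the bindings themselves are left out.

```groovy
// algorithms/funksvd.groovy (hypothetical file name)
// Because this file contains algorithm blocks, only the algorithms they
// define are used; no separate top-level algorithm is created from it.
algorithm('FunkSVD-50') {
    attributes['FeatureCount'] = 50
    // configure the 50-feature variant here
}

algorithm('FunkSVD-100') {
    attributes['FeatureCount'] = 100
    // configure the 100-feature variant here
}
```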

Metrics

Metrics have also changed. In LensKit 2, you specified metrics by class or by builder-based blocks, and attached them to the evaluation as a whole.

LensKit 3 introduces the notion of evaluation tasks, each of which is a thing to do with a recommender. For prediction accuracy (RMSE, etc.), this doesn’t really make a practical difference. For top-N evaluations, however, it is a major improvement. Previously, each top-N metric needed to know the list size and candidate/exclude sets and request recommendations with them; aggressive caching prevented this from being very slow. Now, computing the list of recommendations is the job of the task, and the metric just measures the recommendation list that it is given. The upshot is that top-N metrics are much easier to write.

Enough blabbing. What does it look like? Well, for predictions, you write a predict block describing a predict task with its metrics and (optionally) output file:

predict {
    outputFile "$buildDir/predictions.csv.gz"
    metric 'rmse'
    metric 'ndcg'
}

This will write all the test predictions to a compressed CSV file, and compute RMSE and Predict nDCG over those predictions.

Recommendations operate similarly, but have some additional configuration options:

recommend {
    listSize 25
    candidateItems "allItems"
    excludeItems "user.trainItems"
    outputFile "$buildDir/recommendations.csv.gz"
    metric 'ndcg'
    metric('mrr') {
        goodItems "user.testItems"
    }
}

This does a few things:

Recommend 25 items per user
Consider all items except those in the user’s training set (their past history) to be candidates
Compute top-N nDCG
Write all recommendations to a compressed CSV file
Compute mean reciprocal rank, considering all items in the user’s test set to be relevant

The item selectors (candidateItems, excludeItems, and goodItems) are actually Groovy expressions, evaluated in the context of an ItemSelectScript, so they have access to the set of all items (allItems) and the user being tested (user), as well as a few helpful utility functions. For example, if you want the candidate set to consist of the user’s test items plus 100 random decoys, you can use the following:

user.testItems + pickRandom(allItems - user.trainItems, 100)

This is a little complicated, because we want to remove the training items from the universe before picking decoys, so we have a full set of 100 decoys after applying the exclude set.
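Placed in a recommend block, that expression might look like the following sketch. It only recombines options already shown on this page (listSize, candidateItems, excludeItems, metric, goodItems); the particular combination and the list size of 25 are illustrative assumptions, not a recommended setup.

```groovy
// Sketch only: this block belongs inside a TrainTest task such as `evaluate`.
recommend {
    listSize 25
    // candidates: the user's test items plus 100 random decoys,
    // drawn from the items that are not in the user's training set
    candidateItems "user.testItems + pickRandom(allItems - user.trainItems, 100)"
    // training items are still excluded; the decoys were drawn without them,
    // so the exclusion does not shrink the decoy set
    excludeItems "user.trainItems"
    metric('mrr') {
        goodItems "user.testItems"
    }
}
```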

These changes do mean that any metrics you wrote for LensKit 2 will need to be modified to work with the new metric interfaces for LensKit 3. There are two base classes, PredictMetric and TopNMetric. Consult the source code for LensKit’s metric implementations, such as TopNLengthMetric, for an example of what a new metric should look like.

Finishing Up

If you have custom Java code, just put it in the usual src/main/java directory, and it will be compiled before the evaluation is run. It will also be treated as an input file to the evaluation, so the evaluation will rerun if your custom code changes.

A few things, such as subsampling, have gone away. Because the new evaluation model is built from smaller pieces that you can recombine at will, such custom data processing can be implemented in Python or R scripts that are run by the Gradle build file (using an Exec task). LensKit development will focus on fundamental and commonly-used recommendation tasks, but if you have a task you’d like to see us directly support, please raise it on the mailing list or our issue tracker.
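For instance, a subsampling step could be wired in with Gradle's built-in Exec task type, roughly as sketched below. The script scripts/subsample.py, its arguments, and the output path are all hypothetical; only Exec, commandLine, inputs.file, and outputs.file are standard Gradle.

```groovy
// Hypothetical example: the script name and file paths are illustrative only.
task subsample(type: Exec, group: 'evaluate') {
    // declare inputs and outputs so Gradle can skip the task when nothing changed
    inputs.file 'scripts/subsample.py'
    inputs.file 'data/ml-100k/u.data'
    outputs.file "$buildDir/subsample.data"

    // run the external script: <input file> <output file>
    commandLine 'python', 'scripts/subsample.py',
                'data/ml-100k/u.data', "$buildDir/subsample.data"
}
```

To feed the result into an evaluation, you would point the crossfold task's input file at the script's output and make the crossfold task depend on this one (for example with Gradle's dependsOn).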