Evaluator Metrics
Prediction Accuracy Metrics
LensKit provides several metrics for measuring prediction accuracy. They are implemented by the classes in the o.g.l.eval.metrics.predict
package, and include:
- Coverage (CoveragePredictMetric)
- RMSE (RMSEPredictMetric)
- MAE (MAEPredictMetric)
- nDCG (NDCGPredictMetric) — this applies nDCG as a rank-weighted measure of prediction accuracy
- Half-life utility (HLUtilityPredictMetric) — like nDCG, but using Breese's half-life discounting
- Entropy (EntropyPredictMetric) — measures the mutual information between ratings and predictions
To use one of these metrics, just mention its class by name in your trainTest
block:
metric RMSEPredictMetric
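For context, here is a minimal sketch of a complete trainTest block combining several prediction metrics. The crossfold data source and the item-item algorithm are illustrative assumptions (and the metric class imports are omitted), not part of the metric API:

trainTest {
    // illustrative data source: 5-fold crossfold of a MovieLens-style ratings file
    dataset crossfold("ml-100k") {
        source csvfile("ml-100k/u.data") {
            delimiter "\t"
            domain {
                minimum 1.0
                maximum 5.0
                precision 1.0
            }
        }
        partitions 5
    }
    // illustrative algorithm: item-item CF from the lenskit-knn module
    algorithm("ItemItem") {
        bind ItemScorer to ItemItemScorer
    }
    // several prediction accuracy metrics can be attached at once
    metric CoveragePredictMetric
    metric RMSEPredictMetric
    metric MAEPredictMetric
    output "eval-results.csv"
}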
Top-N metrics
The metrics discussed above are all prediction accuracy metrics, evaluating the accuracy of the rating predictor either for ranking items or for predicting the user’s rating for individual items. LensKit also supports metrics over recommendation lists; these are called Top-N metrics, though the recommendation list may be generated by some other means.
Configuring a top-N metric is a bit more involved than a prediction accuracy metric. It requires you to specify a few things:
- The length of recommendation list to consider
- The items to consider as candidates for recommendation
- The items to exclude from recommendation
- For some metrics, the items considered ‘good’ or ‘bad’
For this reason, you cannot just add a top-N metric by its class. To compute top-N nDCG of 10-item lists over all items the user has not rated in the training set, you instead do this:
metric topNnDCG {
    listSize 10
    candidates ItemSelectors.allItems()
    exclude ItemSelectors.trainingItems()
}
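With this configuration, every item the user did not rate in the training set competes for a place in the 10-item list; ranking the whole catalog is realistic, but it can be expensive on large data sets, which is one reason to sample a smaller candidate set as in the next example.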
More complex configurations are also possible. The following will compute the mean reciprocal rank
in 10-item recommendation lists, where the recommendations are selected from the test items plus 100
random decoys, and consider an item relevant if it was rated at least 3.5 stars. The Matchers.greaterThanOrEqualTo
method comes from Hamcrest.
metric topNMRR {
    listSize 10
    candidates ItemSelectors.addNRandom(ItemSelectors.testItems(), 100)
    exclude ItemSelectors.trainingItems()
    goodItems ItemSelectors.testRatingMatches(Matchers.greaterThanOrEqualTo(3.5d))
}
Note: it is possible for a training item to appear among the 100 random decoys. It will be excluded by the exclude set, but the resulting recommendation run will then have fewer than 100 decoys. This is probably not desired, and is tracked by #759.
As of LensKit 2.2, the following Top-N metrics are available:
- topNnDCG — normalized discounted cumulative gain, applied here to top-N lists (its more typical application)
- topNLength — the actual length of the top-N list (to measure lists truncated by low coverage)
- topNRecallPrecision — precision and recall at N; requires a good set (see the example after this list)
- topNPopularity — measures the popularity of recommended items
- topNMAP — mean average precision; requires a good set
- topNMRR — mean reciprocal rank; requires a good set
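As a sketch of configuring a metric that needs a good set, the following mirrors the topNMRR example above, assuming topNRecallPrecision accepts the same goodItems selector; the 4.0-star relevance threshold is only an illustrative choice:

metric topNRecallPrecision {
    listSize 10
    candidates ItemSelectors.allItems()
    exclude ItemSelectors.trainingItems()
    goodItems ItemSelectors.testRatingMatches(Matchers.greaterThanOrEqualTo(4.0d))
}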
Each of these is defined by a class in the o.g.l.eval.metrics.topn package. The available item selectors are static methods on the ItemSelectors class.
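For quick reference, the selectors used in this section are allItems(), trainingItems(), testItems(), addNRandom(base, n) (which adds n random decoys to a base selector), and testRatingMatches(matcher) (which selects test items whose ratings satisfy a Hamcrest matcher); consult the ItemSelectors documentation for the full set.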