First things first: building a recommender system requires access to a data set. In rrecsys, the data are loaded as a user-item rating matrix in which rows represent users and columns represent items. Note that missing values in the user-item rating matrix are treated as NA (not available).
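To make the expected layout concrete, here is a minimal base R sketch that turns long-format ratings into a user-item matrix with NA for unrated cells. The data frame, its column names, and the user and item labels are hypothetical and are not part of rrecsys.

```r
# Hypothetical long-format ratings: one row per (user, item, rating) triple.
ratings <- data.frame(
  user   = c("u1", "u1", "u2", "u3", "u3"),
  item   = c("i1", "i3", "i2", "i1", "i2"),
  rating = c(4, 2.5, 5, 3, 1)
)

users <- sort(unique(ratings$user))
items <- sort(unique(ratings$item))

# Start from an all-NA matrix so that unrated cells stay missing.
m <- matrix(NA_real_, nrow = length(users), ncol = length(items),
            dimnames = list(users, items))

# Fill in the observed ratings using (row, column) index pairs.
m[cbind(match(ratings$user, users), match(ratings$item, items))] <- ratings$rating

m
```

A matrix of this shape (users in rows, items in columns, NA for missing ratings) matches the description above.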
rrecsys currently ships with two data sets: MovieLens 100k and MovieLens Latest.
We will use these data sets for demonstration.
To load the MovieLens 100k data set:
To load the MovieLens Latest data set:
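As a hedged sketch of the loading step, bundled data sets in R packages are usually loaded with `data()`. The object name `mlLatest100k` appears in the examples of this section; `ml100k` is an assumed name for the MovieLens 100k data set and should be checked against the package documentation.

```r
# Assumes the rrecsys package is installed; otherwise the block does nothing.
if (requireNamespace("rrecsys", quietly = TRUE)) {
  library(rrecsys)
  data("ml100k")        # assumed object name for MovieLens 100k
  data("mlLatest100k")  # object name used in the examples of this section
}
```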
rrecsys includes a method, defineData, to read, define, or change the characteristics of a data set. For example, to define a rating scale (e.g., 0.5 to 5) that also admits non-integer values, we can write the following:
ML <- defineData(mlLatest100k, minimum = .5, maximum = 5, intScale = TRUE)
ML
## Dataset containing 718 users and 8927 items and a total of 100234 scores.
A data set can also be binarized with the following command:
binML <- defineData(mlLatest100k, binary = TRUE, positiveThreshold = 3)
binML
## Binary dataset containing 718 users and 8927 items and a total of 84326 scores.
In this case, all ratings in mlLatest100k with a value greater than or equal to positiveThreshold are converted to 1, and the rest are converted to NA.
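The thresholding rule described above can be illustrated on a plain matrix in base R. This is only a sketch of the rule, not the implementation inside defineData; the toy matrix and the variable `positive_threshold` are hypothetical.

```r
# Toy rating matrix; NA marks a missing rating.
m <- matrix(c(4, NA, 2.5,
              NA, 5, NA,
              3,  1, NA),
            nrow = 3, byrow = TRUE)

positive_threshold <- 3  # hypothetical stand-in for positiveThreshold

# Ratings at or above the threshold become 1; all other cells become NA.
bin <- ifelse(!is.na(m) & m >= positive_threshold, 1, NA)

bin
```

Because ratings below the threshold become NA rather than 0, the binarized data set holds fewer scores than the original, which is consistent with the drop from 100234 to 84326 scores shown above.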
A dataSet object can be analysed with the following methods:
# Number of times an item was rated.
colRatings(ML)
# Number of items each user has rated.
rowRatings(ML)
# Total number of ratings in the rating matrix.
numRatings(ML)
# Sparsity.
sparsity(ML)
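For intuition, the same quantities can be computed on a plain matrix in base R. The sketch below assumes that sparsity means the fraction of missing cells; the definition used by rrecsys may differ in detail. All names here are hypothetical.

```r
# Toy rating matrix; NA marks a missing rating.
m <- matrix(c(4, NA, 2.5,
              NA, 5, NA,
              3,  1, NA),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("u1", "u2", "u3"), c("i1", "i2", "i3")))

rated <- !is.na(m)

item_counts <- colSums(rated)  # ratings received by each item
user_counts <- rowSums(rated)  # ratings given by each user
n_ratings   <- sum(rated)      # total number of ratings

# Assumed definition: share of cells that hold no rating.
sparsity_value <- 1 - n_ratings / length(m)
```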
A dataSet object can be cropped so that it keeps only users and items with a minimum number of ratings:
# Remove users who rated fewer than 40 items and items that were rated fewer than 30 times.
subML <- ML[rowRatings(ML) >= 40, colRatings(ML) >= 30]
sparsity(ML)
sparsity(subML)
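The cropping step relies on logical indexing of rows and columns. A minimal base R sketch of the same idea on a plain matrix follows. The thresholds are hypothetical and much smaller than the 40 and 30 used above, and sparsity is assumed to be the fraction of missing cells.

```r
# Toy rating matrix; NA marks a missing rating.
m <- matrix(c(5,  4, NA,  3,
              NA, 2, NA, NA,
              4,  5, NA,  1,
              3, NA,  2,  4),
            nrow = 4, byrow = TRUE)

rated <- !is.na(m)

# Both filters are computed on the original matrix.
keep_users <- rowSums(rated) >= 3
keep_items <- colSums(rated) >= 3

# drop = FALSE keeps the result a matrix even if one row or column remains.
sub <- m[keep_users, keep_items, drop = FALSE]

sparsity_before <- 1 - sum(!is.na(m)) / length(m)
sparsity_after  <- 1 - sum(!is.na(sub)) / length(sub)
```

Since both filters are evaluated before anything is removed, a user in the cropped matrix can end up with fewer ratings than the threshold once sparse items are dropped. In this toy example the last user keeps only two ratings.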
Note that defineData returns an S4 object of class dataSet. This object is the main input to the rrecsys dispatcher for training a model.