In which I implement a Recommender System for a sample data set from Andrew Ng's Machine Learning Course.¶
Week 9 of Andrew Ng's ML course on Coursera discusses two very common applied ML algorithms: anomaly detection (think fraud detection or manufacturing quality control) and recommender systems (think Amazon or Netflix). Here I focus on the recommender system portion and use the homework data set to learn about the relevant python tools (the anomaly detection portion was covered here).
Tools Covered:¶
BaseEstimator
for rolling your own estimator in sklearn- Implementing stochastic versus batch gradient descent
Recommender Systems¶
According to Dr. Ng, Recommender systems are rated as some of the most important problems faced by Silicon Valley companies, but they get comparatively little attention in ML academic research.
Model Underlying Recommender Systems¶
A prototypical scenario is a data set which indicates the 0-to-5 star rating given to a set of movies, by a set of users (some of these will be missing if a user hasn't rated a specific movie). The fundamental assumption of recommendation systems is that a users rating of a movie is determined by a paired set of factors that reflect the characteristics intrinsic to that user and the characteristics intrinsic to that movie. For example, a user-intrinsic feature capturing how much a user likes romance, and the corresponding movie-intrinsic feature capturing the degree of romance in a movie. Specifically, we assume a simple linear regression so that
\begin{align*} y(i, j) = (\theta^{(j)})^Tx^{(i)}, \end{align*}where $y(i, j)$ is the rating given by the $j^{th}$ user to the $i^{th}$ movie, and $\theta^{(j)}$ represents factors intrinsic to user $j$ while $x^{(i)}$ represents factors intrinsic to movie $i$. You can envision that user-factors and movie-factors live in the same space, where ratings are given by the inner product between them. This model can be vectorized as
\begin{align*} \mathbf{Y} \sim \mathbf{Y}' = \mathbf{X}\mathbf{\Theta}^T, \end{align*}where $\mathbf{Y}$ are the ground truth ratings, $\mathbf{Y}'$ are the predicted ratings, $\mathbf{X}$ is a matrix where each row contains factors of a specific movie, and $\mathbf{\Theta}$ is a matrix where each row contains factors of a specific user. The problem then amounts to finding $\mathbf{X}$ and $\mathbf{\Theta}$ that makes $\mathbf{Y}$ as close as possible to $\mathbf{Y'}$, typically in the sense of element-wise squared error.
This is why you hear the term Low Rank Matrix Factorization thrown around - you are literally trying to factor a matrix into a product of lower-rank matrices, just like you would factor $21 = 3\;x\;7$. If $\mathbf{Y}$ had no missing values and all the values were consistent with our model then we could use SVD to analytically find $\mathbf{X}$ and $\mathbf{\Theta}$. But imputing missing values tends to be tricky and, at any rate it would create a GINORMOUS non-sparse matrix that would be computationally infeasible to work with. So we're stuck using gradient descent to minimize the sum of squared errors between observed and predicted ratings, together with any regularization we want to include. In this way we are finding the best approximate factorization of $\mathbf{Y}$ by looking only at the existing ratings.
Implicit vs. Explicit Interaction Data¶
I'm going to keep talking in terms of the movie ratings example, but we can generalize recommender systems to refer to any set of users and products, together with some measure of "interaction" between them. The most useful measures of interaction are when the user explicitly tells us how much the like or use the product, such as with a thumbs up or down. But you can also gather implicit interaction metrics, like number of page views, mouse movements, or length of viewing time. This data is generally not as sparse, so the actual implementation of factorization may need to be different.
Content-Based Approach vs. Collaborative Filtering¶
In a content-based approach to recommendation we take explicit, rather than latent factors, for the movie characteristics, and then we learn the latent factors for user characteristics. Specifically, we populate the columns of $\mathbf{X}$ with features which reflect some known characteristics of movies lead actor, run-time, or genre. Then we learn the corresponding user factors in $\Theta$, by using squared error with regularization on all the $\theta$'s. Note that you could instead use explicit user factors, or a mixture!
In contrast to a content-based approach, in collaborative filtering (CF) we treat both the user characteristics and the movie characteristics as latent features that can be mined just by looking at the existing ratings. The users are collaborating, by giving their ratings, to help the system uncover information about users and or movies that will let it to make better predictions. For an extraordinarily clear intro to CF check out this article by Netflix Prize winners on the subject. I am following much of their content in this post.
CF generally performs better and is more powerful than content-based approaches, but is not always possible or appropriate. Chris Clark's great post on recommender systems explains the benefits of each approach. Content-based approaches are useful for a "cold-start" when you don't have existing information about a user or product, or when you know the user is interested in a particular product as with a click-through from a google search for that product. On the other hand, the strength of CF is casting a broader and more flexible net in identifying "similar" products:
What we really want is a recommendation system that drives incremental sales (e.g. sales that would not have happened otherwise). If a customer is looking at the product details page for Harry Potter and the Chamber of Secrets, and your recommender shows Prisoner of Azkaban, and the customer buys it, the data scientists back at Random House HQ should not be high-fiving. It's a safe bet that that customer already knew there were more than two books in the series and would have bought Prisoner of Azkaban anyway. It was not an incremental sale.
Flavors of Collaborative Filtering¶
Collaborative filtering (CF) treats both the user characteristics and the movie characteristics as latent features that can be mined just by looking at the existing ratings. Neighborhood methods of CF mine this information by letting the ratings reveal neighborhoods of similar movies or similar users. For instance, the neighbors of a particular movie $m_j$ will be other movies to which users tended to give a similar rating as they gave to $m_j$. If most users who rated both Need for Speed and Tokyo Drift tended to rate them similarly, then those movies are close neighbors and someone who liked one will probably like the other. Likewise, two users are neighbors if, for movies that they have both rated, their ratings tend to agree. If your close neighbor liked a movie, you probably would too. Specifically, with neighborhood methods the two common approaches to predicting a user $u_j$'s response to an unrated movie $m_i$ is:
- finding the group of $k$ users most simimilar to $u_j$ and averaging their ratings of movie $m_j$
- finding the group of $k$ movies most similar to $m_j$ and averaging user $u_i$'s ratings of those
Latent Factor methods of CF use the existing ratings to actually learn values for both the latent user characteristics and the latent movie characteristics. This is generally a more powerful approach than neighborhood methods. The $x$'s and $\theta$'s are learned simultaneously by minimizing a total regularized cost function:
\begin{align*} J(x^{(1)},..., x^{(n_M)}, \theta^{(1)}, ..., \theta^{(n_u)}) = \textrm{all squared errors} \; + \; \textrm{penalty on all $x_k$} \; + \; \textrm{penalty on all $\theta_k$}, \end{align*}Once all the latent factors are learned, predictions for user $u_j$'s response to an unrated movie $m_i$ is just the inner product of the learned $\theta^{(j)}$ with $x^{(i)}$. The hyperparameters of the full decision procedure are the number of pairs of latent factors and the regularization strength in the objective function.
Full Specification of Latent Factor Collaborative Filtering¶
Here is the complete mathematical specification of the model and algorithm for latent-factors collaborative filtering. We define:
- $q_i, p_i \in R^f$ are the latent movie and user factors, respectively
- $r_{ui}$ is the rating given to the $i^{th}$ movie by the $u^{th}$ user
- $\kappa$ is the set of all pairs $(u, i)$ for which we have an existing rating
Our estimator for user ratings is
\begin{align*} \hat{r}_{ui} = q_i^Tp_u,\\ \textrm{or in other words} \; \mathbf{R} \sim \mathbf{R}' = \mathbf{Q}\mathbf{P}^T. \end{align*}while our objective is:
\begin{align*} \qquad \min_{\textrm{all} \, q, p} J, \qquad J = \bigg[ \sum_{(u,i)\in\kappa} (r_{ui} - q_i^Tp_u)^2 + \lambda\big(||q_i||^2 + ||p_u||^2\big)\bigg]. \end{align*}Numerical Optimization¶
The objective function is minimized over the product space $q\times p$ typically by either Stochastic Gradient Descent (SGD) or Alternating Least Squares (ALS).
For implementing SGD the following framework is helfpul. First define the error between a prediction and the actual rating as:
\begin{align*} e_{ui} \equiv r_{ui} - q_i^Tp_u,\\ \mathbf{E} = \mathbf{R} - \mathbf{P}\mathbf{Q}^T \end{align*}Consider the example below of the partial derivative of the loss with respect to a single latent factor: the $k^{th}$ component of the $i^{th}$ movie feature vector.
\begin{align*} \frac{\partial J}{(q_i)_k} = \sum_{(u,i)\in\kappa} 2(r_{ui} - q_i^Tp_u)^2(-(p_u)_k) + 2\lambda(q_i)_k \end{align*}Inspecting the above, we can see that the coordinate descent update rule for SGD for a single observation should be
\begin{align*} q_i \leftarrow q_i + \gamma(e_{ui}\cdot p_u - \lambda\cdot q_i)\\ p_u \leftarrow p_u + \gamma(e_{ui}\cdot q_i - \lambda\cdot p_u), \end{align*}where $\gamma$ is the learning rate.
Alternatively, the update rule for batch gradient descent can be heavily vectorized to yield
\begin{align*} \mathbf{P} \leftarrow \mathbf{P} + \gamma\big(\mathbf{E}\mathbf{Q} - \lambda\mathbf{P}\big)\\ \mathbf{Q} \leftarrow \mathbf{Q} + \gamma\big(\mathbf{E}^T\mathbf{Q} - \lambda\mathbf{Q}\big). \end{align*}Including Biases¶
If I presented you with the raw matrix of ratings data and asked to you fill in a specific missing rating, you might make your guess by lookimg at several pieces of prior information such as:
- what is the overall average rating from the data set, $\mu$.
- How much higher or lower than $\mu$ does this user tend to rate movies, $b_u$. Is he a harsh critic or particularly easy to please compared to the average?
- How much higher or lower than $\mu$ does this movie tend to get rated, $b_i$. Is it generally considered a good or bad film compared to the average?
These pieces of prior information can be combined together into the bias, $b_{ui} = \mu + b_i + b_u$ of rating $r_{ui}$. So far our model says that a rating is entirely due to the interaction between a particular movie and a particular user, $q_u^Tp_i$, but in light of this discussion we want to also include the effect of this bias (which is independent of the interaction):
\begin{align*} \hat{r_{ui}} = q_i^Tp_u + b_{ui} \end{align*}I won't be doing anything with bias for this homework, but I thought I'd mention it because it is understood to explain a huge amount of the signal in recommender systems.
Further Reading¶
The following links are also helpful and high quality:
- Paper surveying various recommender system approaches (math heavy)
- Paper evaluating various approaches to recommendation
- Ethan Rosenthal's excellent Intro to CF with a simple Python Implementation of Neighborhood Methods
- Quuxlabs Tutorial on Matrix Factorization in Python
- LazyProgrammer Tutorial on Matrix Factorization and CF in Python
- PyData PPT on Low Rank Matrix Approximation Methods Python
- Simple math of Matrix Factorization using gradient descent
- Simple PPT of Matrix Factorization for CF
- Netflix Prize Paper on Recommender Systems
- Cambridge Coding Academy Tutorial on Implenting SGD and ALS for latent factors CF
- Blog post by Simon Funk where he first discusses the Matrix Factorization Implementation for latent factors CF
Recommender Systems in Sklearn... Not Actually a Thing¶
There is a neat, short discussion on github about why scikit-learn doesn't implement any recommender system classes: the upshot is that these problems don't generally follow the neat-and-tidy fit
+ predict
paradigm. I somewhat disagree with this, especially when it comes to latent factors collaborative filtering methods, and so after building the meat of the CF algorithm I'm going to try rolling my own Recommender System estimator and using it with GridSearchCV
.
Quick Look at the Data¶
import scipy.io
import scipy.stats as stats
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
from numpy import random
import pickle
import timeit
from sklearn.model_selection import GridSearchCV
import snips as snp # my snippets
snp.prettyplot(matplotlib) # my aesthetic preferences for plotting
%matplotlib inline
cd hw-wk9
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=header)
df.head()
We expect most users haven't rated most movies, so let's get a sense of exactly how sparse the data is. The number of missing ratings should be the difference between the number of rows here, and the total number of possible ratings $n_{users} \times n_{movies}$.
n_u = len(df["user_id"].unique())
n_m = len(df["item_id"].unique())
sparsity = len(df)/(n_u*n_m)
print("sparsity of ratings is %.2f%%" %(sparsity*100))
# Split the dataframe into a train and test set
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df,test_size=0.25)
train_data = pd.DataFrame(train_data)
test_data = pd.DataFrame(test_data)
# Create training and test matrix
R = np.zeros((n_u, n_m))
for line in train_data.itertuples():
R[line[1]-1, line[2]-1] = line[3]
T = np.zeros((n_u, n_m))
for line in test_data.itertuples():
T[line[1]-1, line[2]-1] = line[3]
Implementation with Stochastic Gradient Descent¶
Cambridge Coding Academy Online has a neat tutorial where they implement this exact algorithm with this exact Toy Data Set. They chose to use Stochastic Gradient Descent for their optimzation, which loops through individual ratings (making several full passes through the training set) and updates the relevant latent factors each time. This approch can't be matricized as effectively as normal gradient descent, but it tends to converge more quickly so in terms of full computational cost it's not clear which is better (might depend on the size of the data set). For more on the difference between these two approaches, here is a neat and brief Quora answer. I'm going to try implementing it both ways.
start_time = timeit.default_timer()
# Scoring Function: Root Mean Squared Error
def rmse_score(R, Q, P):
I = R != 0 # Indicator function which is zero for missing data
ME = I * (R - np.dot(P, Q.T)) # Errors between real and predicted ratings
MSE = ME**2
return np.sqrt(np.sum(MSE)/np.sum(I)) # sum of squared errors
# Set parameters and initialize latent factors
f = 20 # Number of latent factor pairs
lmbda = 0.5 # Regularisation strength
gamma=0.01 # Learning rate
n_epochs = 50 # Number of loops through training data
P = 3 * np.random.rand(n_u, f) # Latent factors for users
Q = 3 * np.random.rand(n_m, f) # Latent factors for movies
# Stochastic GD
train_errors = []
test_errors = []
users,items = R.nonzero()
for epoch in range(n_epochs):
for u, i in zip(users,items):
e = R[u, i] - np.dot(P[u, :], Q[i, :].T) # Error for this observation
P[u, :] += gamma * ( e * Q[i, :] - lmbda * P[u, :]) # Update this user's features
Q[i, :] += gamma * ( e * P[u, :] - lmbda * Q[i, :]) # Update this movie's features
train_errors.append(rmse_score(R,Q,P)) # Training RMSE for this pass
test_errors.append(rmse_score(T,Q,P)) # Test RMSE for this pass
# Print how long it took
print("Run took %.2f seconds" % (timeit.default_timer() - start_time))
# Check performance by plotting train and test errors
fig, ax = plt.subplots()
ax.plot(train_errors, color="g", label='Training RMSE')
ax.plot(test_errors, color="r", label='Test RMSE')
snp.labs("Number of Epochs", "RMSE", "Error During Stochastic GD")
ax.legend()
# See how well we did on Test Set Predictions
Rhat = np.dot(P, Q.T)
fig, axs = plt.subplots(figsize=[5, 10], nrows=5, ncols=1, sharex=True)
fig.suptitle("Stochastic GD Test Performance")
for idx, ax in enumerate(axs.ravel()):
vals = Rhat[T == idx+1]
ax.hist(vals, bins=20, normed=True, label="Ground Truth Rating = %i" %(idx+1))
ax.legend()
ax.set_xlim([0, 6])
Implementation with Batch Gradient Descent¶
start_time = timeit.default_timer()
# Scoring Function: Root Mean Squared Error
def rmse_score(R, Q, P):
I = R != 0 # Indicator function which is zero for missing data
ME = I * (R - np.dot(P, Q.T)) # Errors between real and predicted ratings
MSE = ME**2
return np.sqrt(np.sum(MSE)/np.sum(I)) # sum of squared errors
# Set parameters and initialize latent factors
f = 20 # Number of latent factor pairs
lmbda = 50 # Regularisation strength
gamma = 9e-5 # Learning rate
n_epochs = 220 # Number of loops through training data
P = 3 * np.random.rand(n_u, f) # Latent factors for users
Q = 3 * np.random.rand(n_m, f) # Latent factors for movies
# Batch GD
train_errors = []
test_errors = []
for epoch in range(n_epochs):
ERR = np.multiply(R != 0, R - np.dot(P, Q.T)) # compute error with present values of Q, P, ZERO if no rating
# P += gamma*(np.dot(Q.T, ERR.T).T - lmbda*P) # update rule
# Q += gamma*(np.dot(P.T, ERR).T - lmbda*Q) # update rule
P += gamma*(np.dot(ERR, Q) - lmbda*P) # update rule
Q += gamma*(np.dot(ERR.T, P) - lmbda*Q) # update rule
train_errors.append(rmse_score(R,Q,P)) # Training RMSE for this pass
test_errors.append(rmse_score(T,Q,P)) # Test RMSE for this pass
# Print how long it took
print("Run took %.2f seconds" % (timeit.default_timer() - start_time))
# Check performance by plotting train and test errors
fig, ax = plt.subplots()
ax.plot(train_errors, color="g", label='Training RMSE')
ax.plot(test_errors, color="r", label='Test RMSE')
snp.labs("Number of Epochs", "RMSE", "Error During Batch GD")
ax.legend()
# See how well we did on Test Set Predictions
Rhat = np.dot(P, Q.T)
fig, axs = plt.subplots(figsize=[5, 10], nrows=5, ncols=1, sharex=True)
fig.suptitle("Batch GD Test Performance")
for idx, ax in enumerate(axs.ravel()):
vals = Rhat[T == idx+1]
ax.hist(vals, bins=20, normed=True, label="Ground Truth Rating = %i" %(idx+1))
ax.legend()
ax.set_xlim([0, 6])
Rolling a Custom Estimator for Sklearn¶
I love GridSearchCV
and Pipeline
so I'd really like to wrap the above algorithm in sklearn-API-compatible way. I'm referring to the official dev docs on this topic. A great paper from 2013 on the general structure and philosophy of sklearn can he had on ArXiv. I'm going to leave both batch and stochastic GD as options by giving my class a parameter solver
where we can specify which.
So without further ado, let me grab the relevant dev helper objects, like the base class BaseEstimator
that I will inherit from.
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import euclidean_distances
class MC(BaseEstimator):
""" An estimator for latent factor collaborative filtering models in Recommender Systems.
"""
def __init__(self, n_u, n_m, n_factors=10, n_epochs=250, lmbda=10, gamma=9e-5, solver="sgd"):
self.n_u = n_u
self.n_m = n_m
self.n_factors = n_factors
self.n_epochs = n_epochs
self.lmbda = lmbda
self.gamma = gamma
self.solver = solver
def fit(self, X, y):
"""Fits all the latent factors for users and items and saves the resulting matrix representations.
"""
X, y = check_X_y(X, y)
# Create training matrix
R = np.zeros((self.n_u, self.n_m))
for idx, row in enumerate(X):
R[row[0]-1, row[1]-1] = y[idx]
# Initialize latent factors
P = 3 * np.random.rand(self.n_u, self.n_factors) # Latent factors for users
Q = 3 * np.random.rand(self.n_m, self.n_factors) # Latent factors for movies
def rmse_score(R, Q, P):
I = R != 0 # Indicator function which is zero for missing data
ME = I * (R - np.dot(P, Q.T)) # Errors between real and predicted ratings
MSE = ME**2
return np.sqrt(np.sum(MSE)/np.sum(I)) # sum of squared errors
# Fit with stochastic or batch gradient descent
train_errors = []
if self.solver == "sgd":
# Stochastic GD
users,items = R.nonzero()
for epoch in range(self.n_epochs):
for u, i in zip(users,items):
e = R[u, i] - np.dot(P[u, :], Q[i, :].T) # Error for this observation
P[u, :] += self.gamma * ( e * Q[i, :] - self.lmbda * P[u, :]) # Update this user's features
Q[i, :] += self.gamma * ( e * P[u, :] - self.lmbda * Q[i, :]) # Update this movie's features
train_errors.append(rmse_score(R,Q,P)) # Training RMSE for this pass
elif self.solver == "batch_gd":
# Batch GD
for epoch in range(self.n_epochs):
ERR = np.multiply(R != 0, R - np.dot(P, Q.T)) # compute error with present values of Q, P, ZERO if no rating
P += self.gamma*(np.dot(Q.T, ERR.T).T - self.lmbda*P) # update rule
Q += self.gamma*(np.dot(P.T, ERR).T - self.lmbda*Q) # update rule
train_errors.append(rmse_score(R,Q,P)) # Training RMSE for this pass
else:
print("I'm sorry, we don't recognize that solver.")
# print("Completed %i epochs, final RMSE = %.2f" %(self.n_epochs, train_errors[-1]))
self.Q = Q
self.P = P
self.train_errors = train_errors
# Return the estimator
return self
def predict(self, X):
""" Predicts a vector of ratings from a matrix of user and item ids.
"""
X = check_array(X)
y = np.zeros(len(X))
PRED = np.dot(self.P, self.Q.T)
for idx, row in enumerate(X):
y[idx] = PRED[row[0]-1, row[1]-1]
return y
def score(self, X, y):
""" Element-wise root mean squared error.
"""
yp = self.predict(X)
err = y - yp
mse = np.sum(np.multiply(err, err))/len(err)
return np.sqrt(mse)
OK now that the class is defined let's use the full data set to get $X$ and $y$ in the proper format. GridSearchCV
will do it's own train/test splitting with K-Folds. I also designed the class to require the total number of users and movies as parameters in the constructor, that was just the simplest thing to do, so we need to count those and then pass them in.
X = df[["user_id", "item_id"]].as_matrix()
y = df["rating"].values
n_u = len(df["user_id"].unique())
n_m = len(df["item_id"].unique())
Grid Search with Batch GD¶
rcmdr = MC(n_u=n_u, n_m=n_m, gamma=6e-5, n_epochs=400, solver="batch_gd")
params = {"lmbda": (45, 50, 55),
"n_factors": (15, 18, 21)}
search = GridSearchCV(rcmdr, param_grid=params, cv=4)
search.fit(X, y)
best_est = search.best_estimator_
results = pd.DataFrame(search.cv_results_)
results[["mean_test_score", "std_test_score", "params"]].sort_values(by=["mean_test_score"], ascending=True).head()
Grid Search with Stochastic GD¶
rcmdr = MC(n_u=n_u, n_m=n_m, gamma=0.01, n_epochs=50, solver="sgd")
params = {"lmbda": (0.25, 0.5, 0.75),
"n_factors": (18, 20, 22)}
search = GridSearchCV(rcmdr, param_grid=params, cv=4)
search.fit(X, y)
best_est = search.best_estimator_
results = pd.DataFrame(search.cv_results_)
results[["mean_test_score", "std_test_score", "params"]].sort_values(by=["mean_test_score"], ascending=True).head()