Alternating Least Squares Collaborative Filtering

5 minute read

Overview:

Recommender Systems are among the most widely used Machine Learning systems deployed in the real world that actually make money and improve user experience. As such, there is perhaps some merit to explore the topic.

In this post I will cover the paper on Collaborative Filtering for Implicit Feedback Datasets. Looks to be a fundamental paper for the subject as it tackles a supposedly common setting - often don’t have explicit feedback from customers, but only their usage logs (implicit preferences).

Why Alternating Least Squares:

In typical settings, people get explicit feedback, and matrix factorisation seems to work well for predicting how customers would rate other products. If \(r_{ui}\) is the rating given to the \(i^{th}\) item by the \(u^{th}\) user, one can try estimating user and item embedding vectors \(\{x_u\}_{u=1}^m\) and \(\{y_i\}_{i=1}^n\) that minimise the objective:

\[\begin{equation} J = \sum_{u,i}(r_{ui}-x_u^Ty_i)^2. \end{equation}\]

The above is similar to the matrix factorisation framing of, “find \(X\) and \(Y\) such that \(R=XY^T\)”, and we know that we can do a SVD to do exactly that. E.g., \(R=U_f\Sigma_f V_f^T\) and \(X=U_f\Sigma_f^{1/2}\) and \(Y=V_f\Sigma_f^{1/2}\) giving \(X\in \mathbb{R}^{m\times f}\) and \(Y\in \mathbb{R}^{n\times f}\), where \(f\) is the rank of \(R\). In practical settings, however, \(m\) and \(n\) can be large - many customers and many items - making the SVD computationally infeasible.

To address settings where there are many customers and items (common setting), we augment the objective to be:

\[\begin{equation} J = \sum_{u,i}(r_{ui}-x_u^Ty_i)^2 + \lambda \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right), \end{equation}\]

and perform alternating updates by fixing the item vectors, \(Y\), and optimising w.r.t. the user vectors, \(X\), and vice versa. The above objective also includes a regularisation term so we don’t overfit the data. I will give the updates in the following section where we deal with the fact that instead of ratings (explicit customer feedback), we only have user interaction logs.

Dealing with implicit data

Due to the lack of explicit feedback, we engineer our preference variables:

\[\begin{equation} p_{ui}=1_{\{r_{ui}>\theta\}}, \end{equation}\]

where \(p_{ui}\) is the binary preference proxy for the \(u^{th}\) user towards the \(i^{th}\) item, \(r_{ui}\) is the interaction log/observation of the \(u^{th}\) user interacting with the \(i^{th}\) item (e.g., proportion of TV programme finished by user). Here we have made the assumption that the customer prefers the item if they have interacted more than \(\theta\) units with it.

Since in this setting we don’t have explicit user feedback like ratings e.g., 1 to 5; we don’t know if a customer liked a product or not. The user logs only tell us how much the customer has explored/interacted with a product and is some proxy for confidence in our assumptions about \(p_{ui}\). To this end, we also engineer confidence variables:

\[\begin{equation} c_{ui}=1 + \alpha r_{ui}, \end{equation}\]

where \(\alpha\) is determined by cross validation.

Having introduced our preference variables, \(p_{ui}\), and confidence variables, \(c_{ui}\), the objective function for implicit feedback is:

\[\begin{equation} J = \sum_{u,i}c_{ui}(p_{ui}-x_u^Ty_i)^2 + \lambda \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right), \end{equation}\]

where we weigh the impact of the squared errors based on the confidence variables, \(c_{ui}\).

The update steps

Since the functional form of \(J\) treats \(x\) and \(y\) similarly, I will only derive the updates for the user vectors, \(x_u\), and will use a symmetry argument to give the updates for the item vectors, \(y_i\).

\[\begin{gather} \frac{\partial J}{\partial x_u}=0\notag\\ \iff \sum_{i:r_{ui}>0} -2c_{ui}(p_{ui} - x_u^Ty_i)y_i + 2\lambda x_u=0\notag\\ \iff \lambda x_u=\sum_i c_{ui}p_{ui}y_i - x_u^Ty_iy_ic_{ui}\notag\\ \iff (\lambda I + Y^Tdiag(C[u, :])Y)x_u=Y^Tdiag(C[u, :])P[u, :]\notag\\ x_u=\left(\lambda I + Y^Tdiag(C[u, :])Y\right)^{-1}Y^Tdiag(C[u, :])P[u, :]. \end{gather}\]

Currently the above update costs time \(\mathcal{O}(f^2n + f^3)\) due to the \(Y^Tdiag(C[u, :]Y)\) product and the matrix inverse respectively. We have to do this for each user which gives time \(\mathcal{O}(mnf^2 + mf^3)\). We can do better, due to the algebraic structure of the confidence variables, \(c_{ui}=1+\alpha r_{ui}\).

Notice:

\[\begin{align} Y^Tdiag(C[u, :])Y &= Y^Tdiag(C[u, :])Y + Y^TY - Y^TY\notag\\ &=Y^TY + Y^T(diag(C[u, :]) - I)Y\notag\\ &=Y^TY + \sum_{i: c_{ui}>1}(c_{ui} - 1)y_iy_i^T, \end{align}\]

where usually \(\mid\{i: c_{ui}>1\}\mid=:n_u<<n\). If we use the above algebraic trick and precompute \(Y^TY\) in \(\mathcal{O}(f^2n)\) time before starting the user-vector update sweep, the time cost to update user \(u\)’s vector is \(\mathcal{O}(f^2n_u + f^3)\).

Let

\[\begin{equation} N:=\sum_{u=1}^m n_u = \sum_{i=1}^n m_i, \end{equation}\]

where \(m_i=\mid\{u: c_{ui}>1\}\mid\).

Then updating all user vectors costs \(\mathcal{O}(f^2n + Nf^2 + mf^3)\) (precomputation of \(Y^TY\), the trick \(m\) times, and inverting \(m\) times) which is \(\mathcal{O}(Nf^2 + mf^3)\) (assuming each item was interacted with at least one user - \(m_i\ge 1\)).

In practice the update can be written:

\[\begin{equation} x_u=\left(\lambda I + Y^TY + \sum_{i: c_{ui}>1}(c_{ui} - 1)y_iy_i^T\right)^{-1}Y^Tdiag(C[u, :])P[u, :], \end{equation}\]

so we can precompute \(\lambda I + Y^TY\) in \(\mathcal{O}(f^2n)\) time.

By symmetry the update for \(y_i\) is:

\[\begin{equation} y_i=\left(\lambda I + X^TX + \sum_{u: c_{ui}>1}(c_{ui} - 1)x_ux_u^T\right)^{-1}X^Tdiag(C[:, i])P[:, i] \end{equation}\]

which leads to time complexity of \(\mathcal{O}(f^2N + f^3n)\) to update all, \(n\), item vectors, precomputing \(\lambda I + X^TX\) at the beginning of the update phase.

For memory efficiency, we can also represent \(C\) and \(P\) as adjacency lists rather than matrices. Also the above updates can be implemented as solving a linear system:

\[\begin{align} Ax_u&=b\notag\\ A&=\left(\lambda I + Y^TY + \sum_{i: c_{ui}>1}(c_{ui} - 1)y_iy_i^T\right)\notag\\ b&=Y^Tdiag(C[u, :])P[u, :] \end{align}\]

and solve with x = numpy.linalg.solve(A, b).

Predictions

Once the user and item vectors/embeddings are estimated, we can recommend the item whose vector had the greatest inner product with a user’s vector. Or we can return the top-\(k\) recommendations by giving the \(k\) items whose vectors had the greatest inner products with the user vector.

\[\begin{equation} \text{preds}_u=\text{topk}(Yx_u). \end{equation}\]

And the batch version:

\[\begin{equation} \text{preds}=\text{topk}(XY^T\text{, axis}=-1). \end{equation}\]

Interpreting inner products

If you watch some streaming services, you often see “because you watched MOVIE” when they recommend you new shows. That’s arguably interesting for the user to see, so they build an idea of what influences what recommendations they get. It is also useful for developers of recommender systems when debugging. If you recommend Spiderman you may expect the user to have watched other superhero movies.

\[\begin{align} y_i^Tx_u&=y_i^T\left(\lambda I + Y^TY + \sum_{i: c_{ui}>1}(c_{ui} - 1)y_iy_i^T\right)^{-1}Y^Tdiag(C[u, :])P[u, :]\notag\\ &=y_i^TW_uY^Tdiag(C[u, :])P[u, :]\notag\\ &=s_i^Tdiag(C[u, :])P[u, :]\notag\\ &=\sum_{j:r_{uj}>\theta}s_{ij}c_{uj} \end{align}\]

where \(W_u\) is a symmetric posititve definite matrix if \(\lambda > 0\) and \(s_{ij}=y_i^TW_uy_j\) is known as the similarity of item \(i\) with item \(j\). We can thus extract the top \(k\) products \(s_{ij}c_{uj}\) and claim that the user’s interaction with the corresponding items led to the recommendation of item \(i\). For the kernels machine fans, \(y_i^TW_uy_j\) is the \((i, j)^{th}\) entry of a Gram matrix.

Mario Vasilev

Alternating Least Squares Collaborative Filtering

Overview:

Why Alternating Least Squares:

Dealing with implicit data

The update steps

Predictions

Interpreting inner products

You may also enjoy

Multiarmed Bandits

Flask deployment of Gaussian Mixture Model

Gaussian Mixture Model

Expectation Maximisation