Факторизационные модели в рекомендательных системах

  • Published on

  • View

  • Download

Embed Size (px)


<ul><li><p>15 2013 .</p><p>1</p></li><li><p> SVD Factorization Machines</p><p>2</p></li><li><p>Outline</p><p> SVD Factorization Machines</p><p>3</p></li><li><p> : {(u, i, rui)}(u,i)R,I u ,I i ,I rui , (feedback) u </p><p> i,I R /, </p><p> ,I R(u) = {i : (u, i) R}</p><p>:I rui rui,I i? = arg maxi rui,I : sim(i1, i2),I /</p><p>4</p></li><li><p> SVD</p><p>rui = + bu + bi biases</p><p>+ pu, qi personal</p><p>I , bu, bi ;</p><p>I pu Rd () u;I qi Rd () i;I d , .</p><p> : = {pu, qi, bu, bi}</p><p>5</p></li><li><p> SVD</p><p>J() =</p><p>(u,i)Rl(rui(ui)) + (ui) min</p><p>I :J() =</p><p>(u,i)R</p><p>(rui rui)2 + ()</p><p>I :</p><p>J() =</p><p>(u,i)R(rui (rui))2 + ()</p><p>I , ordinal regression, pairwise-...</p><p>:</p><p>() = bias</p><p>(u</p><p>b2u +i</p><p>b2i</p><p>)+user</p><p>u</p><p>pu2 +itemi</p><p>qi2</p><p>6</p></li><li><p> SVD 2D / SVD Yahoo! Music dataset</p><p>Linkin Park </p><p>Green Day </p><p>Nelly </p><p>Red Hot Chili Peppers </p><p>Missy Elliott Beyonc </p><p>50 Cent </p><p>Mariah Carey </p><p>Aerosmith </p><p>Snoop Dogg </p><p>Jay-Z </p><p>U2 </p><p>Metallica </p><p>Mary J. Blige </p><p>Coldplay </p><p>Janet Jackson </p><p>AC/DC </p><p>Madonna </p><p>Nirvana </p><p>Led Zeppelin </p><p>The Doors </p><p>Avril Lavigne </p><p>Bob Marley </p><p>Nine Inch Nails </p><p>Busta Rhymes </p><p>Pop </p><p>R&amp;B </p><p>Rock </p><p>Soul </p><p>Adult Alternative </p><p>Classic Rock </p><p>Soft Pop </p><p>Rock Moderno </p><p>Latin </p><p>Electronic/Dance Mainstream </p><p>Rap </p><p>Mainstream Pop </p><p>R&amp;B Moderno </p><p>Hip-Hop </p><p>Jazz </p><p>Rap </p><p>Disco </p><p>-1.5</p><p>-0.5</p><p>0.5</p><p>1.5</p><p>2.5</p><p>3.5</p><p>4.5</p><p>5.5</p><p>-4.2 -3.2 -2.2 -1.2 -0.2 0.8</p><p>Artists</p><p>Genres</p><p>Figure 1: The most popular musical tracks and gen-res in the Yahoo! Music dataset are embedded intoa 2-dimensional. The open cone suggests the regionof highly preferred items for the user (her logo isat the center of the cone). Note how the learnedembedding separates Rock and similar items fromHip-Hop and similar items. The low dimensional-ity (which is required for a visualization), causes asmall number of items to be wrongly folded near lessrelated items (e.g., Bob Marley).</p><p>loss of accuracy. If the item vectors were normalized, theRoR problem would have been reduced to the well studiednearest neighbor search (as explained in section 3). How-ever, such a normalization will introduce a distortion on thebalance between the original item trait vector and the itembias which constitute the concatenated vector qi. This wouldevidently result in an incorrect solution.Denoting the concatenated user and item vectors as pu</p><p>and qi respectively, and the effective rating for the task ofretrieval as rui, RoR reduces to the following task: Given auser query pu, we want to find an item qi S such that:</p><p>p&gt;u qi = maxqS</p><p>p&gt;u q (6)</p><p>Hence the RoR task is equivalent to the problem of findingthe best-match for a query in a set of points with respect tothe dot-product (described in equation 1). A very simplisticvisualization of this task is depicted in Figure 1. For thegiven user, the best recommendations (in this case songs) liewithin the open cone around the user vector (maximizing thecos (pu,qi) term) and are as far as possible from the origin(maximizing the qi term).</p><p>3. ALGORITHMS FORFINDING BEST-MATCHES</p><p>Efficiently finding the best match using the dot-product(equation 6) appears to be very similar to much existingwork in the literature. Finding the best match with re-spect to the Euclidean (or more generally Lp) distance isthe widely studied problem of nearest-neighbor search inmetric spaces [4]. The nearest-neighbor search problem (inmetric space) can be solved approximately with the popular</p><p>Locality-sensitive hashing (LSH) method [9]. LSH has beenextended to other forms of similarity functions (as opposedto the distance as a dissimilarity function) like the cosinesimilarity [3]. In this section, we show that the problemstated in equation 6 is different from these existing prob-lems.</p><p>Nearest-neighbor Search in Metric Space.The problem of finding the nearest-neighbor in this setting</p><p>is to find a point qi S for a query pu such that:qi = argmin</p><p>qSpu q2 = argmax</p><p>qS</p><p>(p&gt;u q q2 /2</p><p>)6= argmax</p><p>qSp&gt;u q (unless q2 = const q S).</p><p>If all the points in S are normalized to the same length, thenthe problem of finding the best match with respect to thedot-product is equivalent to the problem of nearest-neighborsearch in any metric space. However, without this restric-tion, the two problems can yield very different answers.</p><p>Cosine similarity.Finding the best match with respect to the cosine simi-</p><p>larity is to find a point qi S for a query pu such thatqi = argmax</p><p>qSp&gt;u q/(pu q) = argmax</p><p>qSp&gt;u q/ q</p><p>6= argmaxqS</p><p>p&gt;u q (unless q = const q S).</p><p>As in the previous case, the best match with cosine similar-ity is the best match with dot-products if all the points inthe set S are normalized to the same length. Under gen-eral conditions, the best matches with these two similarityfunctions can be very different.</p><p>Locality-sensitive Hashing.LSH involves constructing hashing functions h which sat-</p><p>isfy the following for any pair of points q, p RD:Pr[h(q) = h(p)] = sim(q, p), (7)</p><p>where sim(q, p) [0, 1] is the similarity function of inter-est. For our situation, we can scale our dataset such that q S, q 1 and assume that the data is in the firstquadrant (such as in non-negative matrix factorization mod-els [19]). In that case, sim(q, p) = q&gt;p [0, 1] is our simi-larity function of interest.For any similarity function to admit a locality sensitive</p><p>hash function family (as defined in equation 7), the distancefunction d(q, p) = 1 sim(q, p) must satisfy the triangleinequality (Lemma 1 in [3]). However, the distance functiond(q, p) = 1 q&gt;p does not satisfy the triangle inequality.Hence LSH cannot be applied to the dot-product similarityfunction even in restricted domains (the first quadrant).</p><p>3.1 Why is finding themaximum dot-productsharder?</p><p>Unlike the distance functions in metric space, dot-productsdo not induce any form of triangle inequality (even undersome assumptions as mentioned in the previous section).Moreover, this lack of any induced triangle inequality causesthe similarity function induced by the dot-products to haveno admissible family of locality sensitive hashing functions.Any modification to the similarity function to conform towidely used similarity functions (like Euclidean distance orCosine-similarity) will create inaccurate results.</p><p>7</p></li><li><p> R Rnm,I rank(R) = k, X Rnk,Y Rmk, </p><p>R = XY T</p><p>I rank(R) &gt; k, :</p><p>R XY T</p><p>I RXY T F minX,Y SVD </p><p>I R :</p><p>i,j</p><p>wij(rij xTi yj)2 = W (RXY T ) minX,Y</p><p>8</p></li><li><p> R Rnm,I rank(R) = k, X Rnk,Y Rmk, </p><p>R = XY T</p><p>I rank(R) &gt; k, :</p><p>R XY T</p><p>I RXY T F minX,Y SVD I R :</p><p>i,j</p><p>wij(rij xTi yj)2 = W (RXY T ) minX,Y</p><p>8</p></li><li><p>I A Rkk:</p><p>R = XY T = XAA1Y T = (XA)(Y AT )T</p><p>I SVD1I rank(W ) = 1 </p><p>I rank(W ) &gt; 1 </p><p>1srebro2003weighted.9</p></li><li><p>I (matrix completion)</p><p>I M = (mij), {mij}(i,j), rankM = r.</p><p>Theorem ( ) ,</p><p>m Cn5/4r log n m = ||, 1 cn3 log n M .2</p><p>2candes2009exact.10</p></li><li><p> : SGD</p><p>J(u,i)(ui) = l(rui(ui)) + ui(ui)</p><p>:I () </p><p>I (u, i) RI rui uiI :</p><p>pnewu = pu (l(rui) ruipu (ui) +</p><p>uipu</p><p>)qnewi = qi </p><p>(l(rui) ruiqi (ui) +</p><p>uiqi</p><p>) </p><p>I , , ....</p><p>11</p></li><li><p> : SGD</p><p>J(u,i)(bu, bi,pu,pi) = (ruirui)+bias(b2u+b2i )+userpu2+itemqi2</p><p>rui(bu, bi,pu,pi) = + bu + bi + pu, qi</p><p>:I () </p><p>I (u, i) RI rui uiI :eui = rui ruipnewu = pu (euiqi + userpu + biasbu)qnewi = qi (euipu + itemqi + biasbi) </p><p>I , , ....</p><p>11</p></li><li><p> : SGD</p><p>I I I :</p><p> rui = f(+ bu + bi + pu, qi)I ( </p><p> )I 3</p><p>I I </p><p> I Vowpal Wabbit4 SGD SVD</p><p>3bottou-91c.4https://github.com/JohnLangford/vowpal_wabbit</p><p>12</p></li><li><p> : ALS</p><p> (biases) .</p><p>(u,i)R</p><p>(rui pTuqi)2 + (</p><p>u</p><p>pu2 +i</p><p>qi2) min</p><p> {qi}, :</p><p>u</p><p> iR(u)</p><p>(rui pTuqi)2 + pu2 min{pu}</p><p>I {pu} {qi}</p><p>I , MapReduce</p><p>13</p></li><li><p> pu</p><p> {qi}. pu :</p><p>p?u = (QTuQu + I)</p><p>1 W u</p><p>QTuru du</p><p>= W udu =</p><p>jR(u)W uqjruj</p><p>Qu qi i R(u), ru .</p><p>I / .</p><p>I .</p><p>14</p></li><li><p> SVD</p><p>rui = qTi p</p><p>?u =</p><p>jR(u)</p><p>qTi W uqjruj =</p><p>jR(u)rujqi, qjW u</p><p> W u , :W u = V</p><p>TuV u. :</p><p>rui =</p><p>jR(u)rujV uqi,V uqj</p><p>I simu(i, j) = qi, qjW u ;</p><p>I V u .</p><p>15</p></li><li><p> rui = 1 . , i: ri = 1. :</p><p>p? = (qTi qi + I)</p><p>1qi</p><p> j:</p><p>rj = qTj (qiq</p><p>Ti + I)</p><p>1qi = {ShermanMorrison formula} =</p><p>=1</p><p>(1 qi</p><p>2</p><p>+ qi2)qj , qi</p><p>I sim(i, j) qj , qiI sim(i, j): </p><p> , .</p><p>16</p></li><li><p>Implicit SVD</p><p>u</p><p>i</p><p>cui(rui rui)2 + () min</p><p>:I cui = 1 rui = 0, (u, i) 6 R</p><p> ALS:I p?u = (Q</p><p>TCuQ+ I)1QTCuruI QTCuQ = QTQ+QT (Cu I)Q</p><p>I QTQ I (Cu I) |R(u)| I Curu |R(u)| </p><p>:I , :</p><p>I rui = 0 dislike, rui = 1 like;I (u, i) 6 R, dislike </p><p>17</p></li><li><p> Max Inner-product</p><p>fixed u : i? arg maxiqi,pu</p><p>I ;</p><p>I I arg minrS p r22 = arg maxrS</p><p>(p, r 12r22)I arg maxrS</p><p>q,rqr = arg maxrS</p><p>q,rr</p><p>I LSH ;</p><p>18</p></li><li><p> Max Inner-product</p><p>fixed u : i? arg maxiqi,pu</p><p> : cone-trees5</p><p>I KD-tree max-inner;I branch-and-bound , ;</p><p>Replacing p and q with p and q by using the aforementioned equalities (similar to the techniques inproof for theorem 3.1), we have:</p><p>q, p = q0, p0+ rprq cos( (p + q)) + rp q0 cos( p) + rq p0 cos( q) max</p><p>rp,rq,p,qq0, p0+ rprq cos( (p + q)) + rp q0 cos( p) + rq p0 cos( q)</p><p> maxrp,rq</p><p>q0, p0+ rprq + rq p0+ rp q0 (since cos() 1), (11) q0, p0+RpRq +Rq p0+Rp q0 , (12)</p><p>where the first inequality comes from the definition of max and the final inequality comes from the fact thatrp Rp, rq Rq.</p><p>For the dual-tree search algorithm (Alg. 6), the maximum-possible inner-product between two tree nodesQ and T is set as</p><p>MIP(Q, T ) = q0, p0+RpRq +Rq p0+Rp q0 .It is interesting to note that this upper bound bound reduces to the bound in theorem 3.1 when the ballcontaining the queries is reduced to a single point, implying Rq = 0.</p><p>Ox</p><p>y</p><p>Figure 8: Cone-tree: These cones are open cones and only the angle made at the origin with the axis ofthe cone is bounded for every point in the cone. The norms of the queries are not bounded at all.</p><p>4.3 Cone-trees for Queries</p><p>An interesting fact is that in equation 1, the point p, where the maximum is achieved, is independent of thenorm ||q|| of the query q. Let q,r be the angle between the q and r at the origin, then the task of searchingfor the maximum inner-product is equivalent to search for a point p S such that:</p><p>p = argmaxrS</p><p>r cos q,r. (13)</p><p>This implies that we only care about the direction of the queries irrespective of their norms. For this reason,we propose the indexing of the queries on the basis of their direction (from the origin) to form a cone-tree (figure 8). The queries are hierarchically indexed as (possibly overlapping) open cones. Each cone isrepresented by a vector, which corresponds to its axis, and an angle, which corresponds to the maximumangle made by any point within the cone with the axis at the origin.</p><p>12</p><p>5ram2012maximum.18</p></li><li><p> SVDI Probabilistic Matrix Factorization ( )I Bayesian PMF ( )</p><p>I MCMC , ALSI Matchbox</p><p>I EP, VB , Infer.NET</p><p>I : </p><p>UVj i</p><p>Rijj=1,...,M i=1,...,N</p><p>V U</p><p>iY</p><p>Vj</p><p>Rijj=1,...,M</p><p>U i iI</p><p>i=1,...,N</p><p>VU</p><p>W</p><p>k=1,...,M</p><p>k</p><p>W</p><p>Figure 1: The left panel shows the graphical model for Probabilistic Matrix Factorization (PMF). The rightpanel shows the graphical model for constrained PMF.</p><p>Many of the collaborative filtering algorithms mentioned above have been applied to modellinguser ratings on the Netflix Prize dataset that contains 480,189 users, 17,770 movies, and over 100million observations (user/movie/rating triples). However, none of these methods have proved tobe particularly successful for two reasons. First, none of the above-mentioned approaches, exceptfor the matrix-factorization-based ones, scale well to large datasets. Second, most of the existingalgorithms have troublemaking accurate predictions for users who have very few ratings. A commonpractice in the collaborative filtering community is to remove all users with fewer than someminimalnumber of ratings. Consequently, the results reported on the standard datasets, such as MovieLensand EachMovie, then seem impressive because the most difficult cases have been removed. Forexample, the Netflix dataset is very imbalanced, with infrequent users rating less than 5 movies,while frequent users rating over 10,000 movies. However, since the standardized test set includesthe complete range of users, the Netflix dataset provides a much more realistic and useful benchmarkfor collaborative filtering algorithms.The goal of this paper is to present probabilistic algorithms that scale linearly with the number ofobservations and perform well on very sparse and imbalanced datasets, such as the Netflix dataset.In Section 2 we present the Probabilistic Matrix Factorization (PMF) model that models the userpreference matrix as a product of two lower-rank user and movie matrices. In Section 3, we extendthe PMF model to include adaptive priors over the movie and user feature vectors and show howthese priors can be used to control model complexity automatically. In Section 4 we introduce aconstrained version of the PMF model that is based on the assumption that users who rate similarsets of movies have similar preferences. In Section 5 we report the experimental results that showthat PMF considerably outperforms standard SVD models. We also show that constrained PMF andPMF with learnable priors improve model performance significantly. Our results demonstrate thatconstrained PMF is especially effective at making better predictions for users with few ratings.</p><p>2 Probabilistic Matrix Factorization (PMF)Suppose we have M movies, N users, and integer rating values from 1 to K1. Let Rij representthe rating of user i for movie j, U RDN and V RDM be latent user and movie featurematrices, with column vectorsUi and Vj representing user-specific and movie-specific latent featurevectors respectively. Since model performance is measured by computing the root mean squarederror (RMSE) on the test set we first adopt a probabilistic linear model with Gaussian observationnoise (see fig. 1, left panel). We define the conditional distribution over the observed ratings as</p><p>p(R|U, V,2) =Ni=1</p><p>Mj=1</p><p>[N (Rij |UTi Vj ,2)</p><p>]Iij, (1)</p><p>where N (x|,2) is the probability density function of the Gaussian distribution with mean andvariance 2, and Iij is the indicator function that is equal to 1 if user i rated movie j and equal to</p><p>1Real-valued ratings can be handled just as easily by the models described in this paper.</p><p>2</p><p>Bayesian Probabilistic Matrix Factorization using MCMC</p><p>UVj i</p><p>Rijj=1,...,M i=1,...,N</p><p>V U</p><p>j</p><p>Rijj=1,...,M i=1,...,N</p><p>V Ui</p><p>U</p><p>U</p><p>0 , W0</p><p>0V0</p><p>V</p><p>, W00</p><p>Figure 1. The left panel shows the graphical model for Probabilistic Matrix Factorization (PMF). The right panel showsthe graphical model for Bayesian PMF.</p><p>tors are assumed to be Gaussian:</p><p>p(U |U ,U ) =Ni=1</p><p>N (Ui|U ,1U ), (5)</p><p>p(V |V ,V ) =Mi=1</p><p>N (Vi|V ,1V ). (6)</p><p>We further place Gaussian-Wishart priors on the userand movie hyperparameters U = {U ,U} andV = {V ,V }:</p><p>p(U |0) = p(U |U )p(U )= N (U |0, (0U )1)W(U |W0, 0), (7)</p><p>p(V |0) = p(V |V )p(V )= N (V |0, (0V )1)W(V |W0, 0). (8)</p><p>Here W is the Wishart distribution with 0 degrees offreedom and a D D scale matrix W0:</p><p>W(|W0, 0) = 1C||(0D1)/2 exp (1</p><p>2Tr(W10 )),</p><p>where C is the normalizing constant. For conveniencewe also define 0 = {0, 0,W0}. In our experimentswe also set 0 = D and W0 to the identity matrixfor both user and movie hyperparameters and choose0 = 0 by symmetry.</p><p>3.2. Predictions</p><p>The predictive distribution of the rating value Rij foruser i and query movie j is obtained by marginalizing</p><p>over model parameters and hyperparameters:</p><p>p(Rij |R,0)...</p></li></ul>


View more >