<h1>An Overview of Collaborative Filtering Algorithms for Implicit Feedback Data</h1>
<p>This blogpost gives an overview of today’s most predominant types of
recommender systems for collaborative filtering from implicit feedback data.
The overview is by no means exhaustive, but it should give the reader
a solid grasp of the topic.</p>
<p>First, some background about recommender systems is given. Also, the setting
of <em>collaborative filtering from implicit feedback data</em> is defined and
the typical application structure of a recommender service is explained.
Then, several collaborative filtering models are described in more detail
together with a discussion on their advantages and disadvantages. Finally, a summary
of the different approaches is given. A slide deck accompanying the article
can be found here:</p>
<p><a href="/files/collaborativefilteringalgorithmsforimplicitfeedbackdata.pdf" target="_blank">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/slidedeck.png?v=1" alt="Slide Deck" width="60%" />
</a></p>
<h2 id="backgroundonrecommendersystems">Background on Recommender Systems</h2>
<p>Recommender systems are at the heart of any of today’s large-scale e-commerce,
news, content-streaming, dating or search platforms. A critical success factor
of such platforms is the ability to reduce the overwhelming amount of options to
a few relevant recommendations matching the users’ individual and trending
interests.</p>
<p>According to McKinsey & Company, 35% of the consumer purchases on Amazon and
75% of the views on Netflix in 2012 came from product recommendations based on
recommendation engines <a class="citation" href="#mckinsey">[1]</a>.
Goodwater Capital <a class="citation" href="#goodwatercap">[2]</a>
also reports that in 2017, 31% of the tracks listened to on Spotify stemmed from
personalized playlists generated by Spotify’s recommender system. These numbers
clearly demonstrate the significance of algorithmic recommendations in online
services.</p>
<p>Connecting customers to products that they love is critical to both the
customers and the companies: if users fail to find the products that
interest and engage them, they tend to abandon the platforms
<a class="citation" href="#bennett2007netflix">[3]</a>. In a report about the
business value of their recommender algorithms, Netflix describes how the reduction
of the monthly churn both increases the lifetime value of existing subscribers,
and reduces the number of new subscribers that they need to acquire to replace
cancelled members. They estimate the combined effect of personalization and
recommendations to save them more than the exorbitant amount of $1B per year
<a class="citation" href="#gomez2016netflix">[4]</a>.</p>
<p>The importance of being able to provide relevant and personalized
recommendations can also be seen in the prize money of $1M that Netflix
advertised in its famous “Netflix Prize” competition of 2006 for the first
team that could improve the Netflix algorithm’s RMSE by 10%
<a class="citation" href="#koren2009matrix">[5]</a>. The competition inspired a multitude of researchers to
participate and to contribute to the development of next-generation recommender
systems. In the end, the grand prize was won by Yehuda Koren, Robert Bell and
Chris Volinsky in 2009 who had developed a model that blended the predictions of
hundreds of predictors, including a plethora of matrix factorization models,
neighbourhood models and restricted Boltzmann machines <a class="citation" href="#koren2009bellkor">[6]</a>.</p>
<p>Ironically, in 2012 Netflix revealed in a blogpost <a class="citation" href="#netflixblog">[7]</a>
that they never used Koren et al.’s algorithm in production due to its
engineering costs: “the additional accuracy gains that we measured did not seem
to justify the engineering effort needed to bring them into a production
environment.” Nonetheless, the field of recommender systems has certainly
profited from the inventions that were sparked by virtue of the competition.</p>
<h2 id="thecollaborativefilteringsetting">The Collaborative Filtering Setting</h2>
<p>Many different types of recommender systems have evolved over the years.
One way to characterize recommender systems is through the information that
they consider to produce their rankings:</p>
<ul>
<li>
<p><strong>Collaborative Filtering:</strong> In collaborative filtering the
recommender system purely learns from the interaction patterns between users
and items <a class="citation" href="#hu2008collaborative">[8]</a>.
The contents and features of the items and users are completely ignored.
Users and items are just treated as enumerated nodes of an undirected
(weighted or unweighted) bipartite graph $G=(U\cup I, E)$ where the
items $I$ are indexed as $i_1,…,i_{\card{I}}$ and the users $U$ are indexed as
$u_1,…,u_{\card{U}}$. Nothing more than the vertices, the edges $\set{u_j,i_k}$ and
possibly some edge weights $w_{u_j,i_k}$ are known. Hence collaborative filtering
corresponds to predicting promising links from user nodes to item nodes based
on the observed common connection patterns. It is called <em>collaborative</em> filtering
because it is commonly assumed that learning the
interaction patterns of one user (e.g., the items a user has interacted with)
helps to predict relevant items for another user with a similar
interaction pattern (in terms of interacted items).
Hence it is as if users were <em>collaborating</em> to produce the rankings of items
for each other.</p>
</li>
<li>
<p><strong>Content-Based:</strong> In content-based recommender systems, the
recommender system additionally learns from the content and features of the items
(e.g., image, location or text data) and sometimes also from the features
of the users, to produce a list of rankings as done for example in
<a class="citation" href="#cml">[9]</a>.</p>
</li>
<li>
<p><strong>Context-Aware:</strong> These recommender systems include additional information
about a user’s context <a class="citation" href="#adomavicius2011context">[10]</a>,
e.g., whether the user is accessing the service from a mobile or desktop client,
their current geolocation, the time of day, whether the user is stationary or
travelling, or whether the user is in a quiet or noisy place.</p>
</li>
</ul>
<p>This blogpost is purely concerned with the <em>collaborative filtering</em> setting.
However, most of the collaborative filtering approaches can usually be extended
to include the other aforementioned information sources, as for example done in the
approach of <a class="citation" href="#cml">[9]</a>.</p>
<h2 id="explicitvsimplicitfeedbackdata">Explicit vs. Implicit Feedback Data</h2>
<p>The training of recommender systems relies on data that is gathered from the
feedback that users gave to items. This feedback can be divided into two
categories as defined in <a class="citation" href="#oard1998implicit">[11]</a>:</p>
<ul>
<li>
<p><strong>Explicit Feedback:</strong> Here a user explicitly gives a rating to an
item on a certain scale.</p>
</li>
<li>
<p><strong>Implicit Feedback:</strong> Here a user implicitly provides the
information about the relevance of an item by interacting with it according
to a certain notion of intensity, e.g., a certain number of times,
a total amount of time, or the percentage of a movie watched.</p>
</li>
</ul>
<p>The relevance of algorithms capable of dealing with implicit feedback data can
be motivated by the natural abundance of implicit feedback data versus the
usual scarcity of explicit feedback data <a class="citation" href="#davidson2010youtube">[12]</a>.
This blogpost is purely concerned with <em>binary implicit feedback</em> data,
as done in most of the academic literature.</p>
<h2 id="recommenderserviceaninterplayofaretrievalandarankingsystem">Recommender Service: An Interplay of a Retrieval and a Ranking System</h2>
<p>Before looking at our first concrete collaborative filtering model, let’s
first have a look at how a recommender service is usually structured.
A common practice <a class="citation" href="#cheng2016wide">[13]</a> is to
structure a <em>recommendation service</em> into two components as follows:</p>
<ol>
<li>
<p>A <strong>retrieval system</strong> that retrieves potentially relevant items in
a very efficient manner (e.g., KD-trees <a class="citation" href="#kdtrees">[14]</a>,
maximum inner product search (MIPS) <a class="citation" href="#mips">[15]</a>,
locality-sensitive hashing (LSH) <a class="citation" href="#lsh1">[16, 17, 18]</a>,
or neural-network based methods <a class="citation" href="#davidson2010youtube">[12]</a>).</p>
</li>
<li>
<p>A <strong>ranking system</strong>, which we also refer to as the <em>recommender
system</em>, that usually runs a more expensive and sophisticated algorithm
to precisely rank the retrieved candidates such that
they can be presented to a user in their final estimated relevance order
<a class="citation" href="#davidson2010youtube">[12]</a>.</p>
</li>
</ol>
<p>The figure below illustrates the interplay of a retrieval and a ranking system.
Whether to use just a recommender system, or a combination of a retrieval and
recommender system depends on the size of the data (in terms of number of users
and items) and the computational cost of the recommender system. If the ranking
has a high computational cost, and there are lots of items, then it makes sense
to have a retrieval system that retrieves a subset of potential candidates in a
cheap way. An important thing to note here is that, besides the ranking accuracy
of the ranking system, the quality of the subsample retrieved by
the retrieval system also strongly affects the performance metrics of the entire
recommender service.</p>
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/retrievalandrankingsystem.png?v=1" alt="Interplay of Retrieval and Ranking System" width="100%" />
<div class="figurecaption">
Typical structuring of a recommender service into a retrieval system and a
ranking system. Illustration by <a class="citation" href="#cheng2016wide">[13]</a>.
</div>
</div>
<p>In the collaborative filtering setting, the query is just a user index and the
result is a ranked list of indices of recommended items. As depicted, the
retrieval system usually returns a subset in the order of $N=100$ items to
the ranking system.</p>
<p>Now that we have enough background on recommender systems we can have a look at
the first collaborative filtering model.</p>
<h2 id="itempopularity">Item Popularity</h2>
<p>The Item Popularity model is a very simple and efficient ranking model that
simply recommends items based on their popularities. The more popular an item
is, the higher it appears in the recommendation list. Thus, the ranking of an item
$i$ for user $u$ is simply computed through</p>
<script type="math/tex; mode=display">\hat{x}_{ui}=\frac{\#\text{interactions with item }i}{\# \text{interactions in total}}.</script>
<p>Hence, it completely ignores the individual user and produces the same
ranking for everyone, which can be
disadvantageous if one wants to create recommendations tailored to a user’s
preferences.</p>
<p>Nevertheless, it can be a very powerful prediction model.
Its strengths lie in its simplicity and its efficiency. The Item Popularity
model can be particularly useful in situations where there is little information
known about a user’s preferences. Therefore, it is often used to overcome
the cold-start problem.</p>
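<p>A minimal sketch of the Item Popularity model, assuming the interaction data arrives as a list of (user, item) pairs (the data format and function names are illustrative):</p>

```python
from collections import Counter

def popularity_scores(interactions):
    """Score each item by its share of all observed interactions.

    `interactions` is a list of (user, item) pairs.
    Returns a dict mapping item -> popularity score in [0, 1].
    """
    counts = Counter(item for _, item in interactions)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

def recommend(interactions, n=10):
    """Return the n most popular items; identical for every user."""
    scores = popularity_scores(interactions)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

<p>Note that <code>recommend</code> takes no user argument at all, which makes the non-personalized nature of the model explicit.</p>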
<p>A reason why the Item Popularity model works quite well in practice is
that, usually, the popularity of items is distributed according to a
power-law: most of the interactions happen with a few popular items, and the
rest of the items only have a few interactions. The following plot
illustrates how the number of interactions per item is distributed according to a
power-law for the <a href="https://grouplens.org/datasets/movielens/20m/">MovieLens-20M</a>
dataset.</p>
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/movielens20Mitempopularity.png?v=1" alt="MovieLens20M Item Popularity" width="100%" />
<div class="figurecaption">
The plot shows the number of ratings per movie in descending order for
the MovieLens-20M dataset. One can clearly observe the power-law nature of the
popularities of the movies in terms of their number of ratings.
</div>
</div>
<h2 id="matrixfactorizationmf">Matrix Factorization (MF)</h2>
<p><em>Matrix factorization (MF)</em> predicts the relevance $\hat{x}_{ui}$ of an
item $i\in I$ for a user $u\in U$ through the dot product</p>
<script type="math/tex; mode=display">\hat{x}_{ui}=\scprod{\vx_u,\vx_i},</script>
<p>where $\vx_u$ and $\vx_i$ are $d$-dimensional representations of the user $u$
and item $i$ in a <em>latent factor space</em>. These latent factor representations
of users and items are the parameters $\vtheta$ that are aimed to be learned:</p>
<div class="eqdesktop">
$$
\vtheta=\set{\MX_U,\MX_I},
\qquad
\MX_U\in\R^{\card{U}\times d},\quad
\MX_I\in\R^{\card{I}\times d}.
$$
</div>
<div class="eqmobile">
$$
\vtheta=\set{\MX_U,\MX_I},
$$
$$
\MX_U\in\R^{\card{U}\times d},\quad
\MX_I\in\R^{\card{I}\times d}.
$$
</div>
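<p>A minimal numeric sketch of the matrix factorization prediction step; the factor matrices below are illustrative placeholders, not learned parameters:</p>

```python
import numpy as np

# Toy latent factor matrices X_U (|U| x d) and X_I (|I| x d); in practice
# these are learned, here they are fixed for illustration.
X_U = np.array([[1.0, 0.0, 2.0],
                [0.5, 1.0, 0.0]])  # 2 users, d = 3
X_I = np.array([[1.0, 1.0, 1.0],
                [0.0, 2.0, 0.5]])  # 2 items, d = 3

# The predicted relevance of item i for user u is the dot product <x_u, x_i>;
# the full |U| x |I| prediction matrix is a single matrix product:
X_hat = X_U @ X_I.T

# Entry-wise this matches the per-pair dot product:
assert np.isclose(X_hat[0, 1], X_U[0] @ X_I[1])
```

<p>This one-line prediction step is what makes matrix factorization computationally so attractive at serving time.</p>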
<p>Several works showed how these latent factor space dimensions tend to capture
concepts of users and items, e.g., “male” or “female” for users, or “serious” vs
“escapist” for movies as illustrated below from the work of <a class="citation" href="#koren2009matrix">[5]</a>.</p>
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/mflatentfactorsinterpretation.png?v=1" alt="Concepts captured in latent factor space" width="100%" />
<div class="figurecaption">
Concepts captured by latent factor dimensions.
Illustration by <a class="citation" href="#koren2009matrix">[5]</a>
</div>
</div>
<p>For <em>explicit</em> feedback data the parameters are trained by minimizing the
squared loss over the observed ratings $x_{ui}$ of the interaction matrix
$\MX\in\R^{\card{U}\times\card{I}}$, collected as training instances $(u,i)\in\cD$:</p>
<div class="eqdesktop">
$$
\cL(\vtheta)
=
\sum_{(u,i)\in\cD}\left(x_{ui}-\hat{x}_{ui}\right)^2
=
\sum_{(u,i)\in\cD}\left(x_{ui}-\scprod{\vx_u,\vx_i}\right)^2.
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
\cL(\vtheta)
&=
\sum_{(u,i)\in\cD}\left(x_{ui}-\hat{x}_{ui}\right)^2
\\
&=
\sum_{(u,i)\in\cD}\left(x_{ui}-\scprod{\vx_u,\vx_i}\right)^2.
\end{align*}
$$
</div>
<p>Some approaches for <em>implicit</em> feedback data, such as
<a class="citation" href="#sarwar2000application">[19, 20]</a>, rely on
binarization and imputation of the unobserved entries of $\MX$ as 0, turning the
optimization problem into</p>
<script type="math/tex; mode=display">\cL(\vtheta)=\norm{\MX - \MX_U\MX_I^\T}_2^2.</script>
<p>The latter loss clearly shows why the recommender system approach has its name
<em>matrix factorization</em>. Since by construction
$\rank(\MX_U\MX_I^\T)\leq d$, the problem is tightly
related to Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).</p>
<p>Other approaches for <em>implicit</em> feedback data argue that one should still
impute the unobserved interactions as $0$, but weigh the
prediction errors for observed and unobserved interactions differently. Such
approaches fall under the category of <em>weighted regularized matrix
factorization (WRMF)</em>, having a training loss of the form</p>
<script type="math/tex; mode=display">\cL(\vtheta)
=
\sum_{(u,i)\in\cD} c_{ui}(x_{ui}-\hat{x}_{ui})^2,</script>
<p>where the weights $c_{ui}$ are chosen according to a <em>weighting strategy</em>.
Hu et al. <a class="citation" href="#hu2008collaborative">[8]</a>, Pan et al. <a class="citation" href="#pan2008one">[21]</a> and He et al.
<a class="citation" href="#eals">[22]</a> proposed various weighting schemes that all assign a fixed weight
$c_{ui}=1$ to the observed interactions, while the weights $c_{ui}$ for the
unobserved interactions are chosen according to one of the following strategies:</p>
<ul>
<li><strong>Uniform Weighting:</strong> Chooses some fixed weight $c_{ui}\in[0,1)$,
meaning that all unobserved interactions share the same probability of
being negative examples.</li>
<li><strong>User-Activity Based:</strong> Chooses the weight based on the number of
ratings that a user $u$ gave: $c_{ui}\propto\norm{\vx_u}_1$. The argument
is that the more a user has interacted with the system, the more confident one can
be about the inferred irrelevance of the user’s left-out items.</li>
<li><strong>Item-Popularity Based:</strong> Assigns lower weights to popular items.
The rationale behind this is: the more popular an item is, the more likely it
is to be known. Hence, a non-interaction with a popular item is more likely to
be due to its true irrelevance to a user.</li>
</ul>
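<p>The WRMF objective with the uniform weighting strategy can be sketched as follows; the matrix sizes and the weight <code>c0</code> are illustrative assumptions:</p>

```python
import numpy as np

def wrmf_loss(X, X_U, X_I, c0=0.05):
    """Weighted regularized MF loss with uniform weighting (sketch).

    X is the binary interaction matrix (|U| x |I|); observed entries get
    weight c_ui = 1, unobserved entries the fixed weight c0 in [0, 1).
    """
    X_hat = X_U @ X_I.T                 # predicted relevances <x_u, x_i>
    C = np.where(X > 0, 1.0, c0)        # confidence weights c_ui
    return np.sum(C * (X - X_hat) ** 2)
```

<p>Swapping the <code>np.where</code> line for a row- or column-dependent weight matrix yields the user-activity and item-popularity strategies, respectively.</p>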
<p>Other more advanced models also train a global bias $\mu$, and user- and
item-specific biases $\mu_u$ and $\mu_i$ to predict the rankings as</p>
<script type="math/tex; mode=display">\hat{x}_{ui}=\scprod{\vx_u,\vx_i}+\mu_u+\mu_i+\mu.</script>
<p>This aims to compensate for the systematic tendencies that some users tend to
give higher ratings than others, and some items tend to receive higher ratings
than others. The rationale behind this is that the latent concepts (the
$d$ dimensions of the latent factor space) should not be used to explain these
systematic tendencies <a class="citation" href="#koren2009matrix">[5]</a>.</p>
<p>It is also a common practice to regularize the user- and item-embeddings and the
bias terms with L2-regularization</p>
<div class="eqdesktop">
$$
\Omega(\theta)
=
\norm{\MX_U}_2^2
+\norm{\MX_I}_2^2
+\norm{\vmu_U}_2^2
+\norm{\vmu_I}_2^2
+\mu^2.
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
\Omega(\theta)
&=
\norm{\MX_U}_2^2
+\norm{\MX_I}_2^2
\\
&\quad+\norm{\vmu_U}_2^2
+\norm{\vmu_I}_2^2
+\mu^2.
\end{align*}
$$
</div>
<p>This form of regularization can also be motivated from a probabilistic
perspective where the user- and item-embeddings and the bias terms are assumed to be
distributed according to multivariate Gaussian distributions in the latent factor
space. For a derivation see <a class="citation" href="#mnih2008probabilistic">[23]</a>.</p>
<p>In the famous Netflix Prize, launched in 2006, the majority of the successful
recommender systems were using matrix-factorization approaches
<a class="citation" href="#bennett2007netflix">[3]</a>. For many years, matrix factorization
has been the ranking model of
first choice, and a lot of improvements and extensions have been proposed, including:</p>
<ul>
<li>
<p><strong>Alternating Least-Squares (ALS):</strong> Various <em>alternating
least-squares (ALS)</em> optimization approaches, such as eALS <a class="citation" href="#eals">[22]</a>
and ALS-WR <a class="citation" href="#alswr">[24]</a>, have been developed. These approaches aim to
speed up the convergence of the non-convex optimization problem through
the surrogate of two convex optimization problems. ALS alternates as follows:
at each iteration, the user embeddings are fixed and the solutions for the
items are obtained in closed form, and vice versa.</p>
</li>
<li>
<p><strong>Including Temporal Dimensions:</strong> Another direction of work, e.g.
<a class="citation" href="#koren2009collaborative">[25, 26]</a>, has been
concerned with incorporating temporal dimensions into matrix factorization. These
approaches model the trends of items and the changes of users’ tastes by expressing
the user and item embeddings, and also the biases, as functions of time.</p>
</li>
<li>
<p><strong>Non-Negative Matrix Factorization:</strong> For some ranking applications it might
be desirable to only have predictions that are positive.
To this end, several works have been concerned with applying non-negative matrix
factorization to collaborative filtering, including
<a class="citation" href="#luo2014efficient">[27, 28]</a>.</p>
</li>
<li>
<p><strong>Online Learning and Regret Bounds:</strong> A lot of effort has
also been invested in the development of online learning algorithms (e.g.
<a class="citation" href="#eals">[22, 29]</a>) and the derivation of regret bounds (e.g.
<a class="citation" href="#dadkhahi2018alternating">[30, 31]</a>), making it
possible to scale matrix factorization to big-data settings with online learning
and convenient regret bounds.</p>
</li>
</ul>
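<p>The ALS alternation described above can be sketched for the plain, unweighted L2-regularized squared loss (a simplification of the cited eALS and ALS-WR variants); each half-step solves a ridge regression in closed form:</p>

```python
import numpy as np

def als_step(X, X_U, X_I, lam=0.1):
    """One alternating least-squares sweep (sketch, unweighted variant).

    With the item factors fixed, every user row has the closed-form ridge
    solution x_u = (X_I^T X_I + lam*I)^(-1) X_I^T X[u, :], and vice versa
    for the items with the user factors fixed.
    """
    d = X_U.shape[1]
    A = X_I.T @ X_I + lam * np.eye(d)
    X_U = np.linalg.solve(A, X_I.T @ X.T).T   # update users, items fixed
    B = X_U.T @ X_U + lam * np.eye(d)
    X_I = np.linalg.solve(B, X_U.T @ X).T     # update items, users fixed
    return X_U, X_I
```

<p>Because each half-step exactly minimizes a convex subproblem, the regularized objective decreases monotonically over sweeps.</p>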
<p>Note that while all examples here have been using the squared loss,
<em>matrix factorization</em> can also be trained using the pairwise BPR loss as done in
<a class="citation" href="#rendle2009bpr">[32]</a>.</p>
<p>All in all, matrix factorization has proven to be a powerful and
successful recommender model. Indeed, it had been successfully used for
YouTube video recommendations, until it was recently replaced by neural network
approaches <a class="citation" href="#davidson2010youtube">[12]</a>. The fact that, at its heart,
matrix factorization only uses a bilinear form to predict the rankings makes
it computationally very attractive. However, as we will see in what follows,
recent state-of-the-art approaches critique the inner product for failing at
propagating similarities <a class="citation" href="#cml">[9]</a> and for being too rigid, in the sense that
it is only a bilinear form as opposed to a more powerful non-linear prediction function
<a class="citation" href="#ncf">[33]</a>.</p>
<h2 id="collaborativemetriclearningcml">Collaborative Metric Learning (CML)</h2>
<p>Metric learning approaches aim to learn a distance metric that assigns smaller
distances between similar pairs, and larger distances between dissimilar pairs.
Collaborative Metric Learning (CML) <a class="citation" href="#cml">[9]</a> advocates the embedding of users
and items for recommender systems in <em>metric spaces</em> in order to exploit a
phenomenon called <em>similarity propagation</em>. In their work, Hsieh et al.
<a class="citation" href="#cml">[9]</a> explain how <em>similarity propagation</em> is achieved
due to the fact that a distance metric $d$ must respect, amongst several other conditions,
the crucial triangle inequality:</p>
<script type="math/tex; mode=display">\forall x,y,z\colon
\quad
d(y,z)\leq d(x,y) + d(x,z).</script>
<p>This implies that, given the information that “$x$ is similar to both $y$ and
$z$”, the learned metric $d$ will not only pull $y$ and
$z$ close to $x$, but <em>also</em> pull $y$ and $z$ relatively close to one another. Thus, the
similarity of $(x,y)$ and $(x,z)$ is <em>propagated</em> to $(y,z)$.</p>
<p>The authors critique that, since matrix factorization is using the
inner product, and the inner product does not necessarily respect the
triangle inequality (e.g., violated for $x=-1$, $y=z=1$), such a <em>similarity
propagation</em> does not happen in matrix factorization approaches.
An illustration of how the inner product fails at propagating similarities
even for a simple interaction matrix is given below:</p>
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/cmlsimilaritypropagationillustration.png?v=1" alt="CML Similarity Propagation Illustration" width="100%" />
<div class="figurecaption">
Illustration by <a class="citation" href="#cml">[9]</a> showing how the inner product
fails at propagating similarities. In the example $U_3$ likes both $v_1$ and
$v_2$. Since $U_1$ likes $v_1$ and $U_2$ likes $v_2$, the items $v_1$ and
$v_2$ are placed in between the users in the metric-learning approach. With
matrix factorization the dot-product is 2 if a user liked an item and 0
otherwise, representing a stable setting. However, the similarity between
$(U_3,v_1)$ and $(U_3,v_2)$ is not propagated to $(v_1,v_2)$ because we have that
$\scprod{v_1,v_2}=0$. Even though MF may yield the same recommendation
performance, the similarities between the items $v_1$ and $v_2$ aren't
captured as well as with the metric-learning approach.
</div>
</div>
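<p>A quick numeric check that the inner product, read as a "distance", can violate the triangle inequality while a true metric such as the Euclidean distance cannot (a one-dimensional toy example):</p>

```python
import numpy as np

def euclid(a, b):
    """Euclidean distance, which always satisfies the triangle inequality."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

# Treating the inner product as a "distance" breaks the triangle inequality
# d(y,z) <= d(x,y) + d(x,z): with scalars x = -1, y = z = 1 we get 1 <= -2.
x, y, z = np.array([-1.0]), np.array([1.0]), np.array([1.0])
assert not (y @ z <= x @ y + x @ z)

# The Euclidean distance on the same points satisfies it:
assert euclid(y, z) <= euclid(x, y) + euclid(x, z)
```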
<p>The great convenience of encoding users and items in a metric space $(\cM,d)$ is
that the joint metric space not only encodes the similarity between users
and items, but can also be used to determine user-user and item-item
similarities. This improves the interpretability of the model, as opposed to a
model that relies on the inner product to compute similarities. What matrix
factorization approaches usually do to compensate for this lack is to compute
user-user or item-item similarities using the cosine distance.
However, as illustrated in the figure above, this doesn’t yield optimal results.</p>
<p>Hopefully, by now the reader is convinced that <em>similarity propagation</em>
is a desirable property to have in order to generalize from the observed user-item
interactions to unseen user-item pairs and to user-user and item-item similarities.
Next, we’ll look at how the training of the embeddings is performed in CML.</p>
<p>The embedding training approach of CML is to pull positive user-item
pairs close together and to push negative user-item pairs far apart according
to some margin. This process will then cluster users who co-like the same items
together, and also cluster the items that are co-liked by the same users together.
Eventually, a situation is reached where the nearest neighbours of any user $u$ are:</p>
<ul>
<li>the items the user $u$ liked, and</li>
<li>the items liked by other users who share a similar taste with user $u$.</li>
</ul>
<p>Therefore, learning from the observed positive interactions
<em>propagates</em> these relationships also to user-user and item-item pairs for
which there are <em>no</em> observed relationships.</p>
<p>In CML the relevances are then simply predicted as the negative distance</p>
<script type="math/tex; mode=display">\hat{x}_{ui}=-d(\vx_u,\vx_i),</script>
<p>meaning that a close-by item has a higher ranking than an item that is farther away.
The optimization objective trained to achieve the aforementioned desiderata
is the following:</p>
<script type="math/tex; mode=display">\cL(\vtheta)
=
\cL_m(\vtheta)
+
\lambda\Omega(\vtheta)
\quad \text{s.t. }
\norm{\vx_*}\leq 1,</script>
<p>where the various loss terms have the following meanings:</p>
<ul>
<li>
<p>The term $\cL_m(\vtheta)$ is the WARP loss of the predicted rankings, given
by the negative metric space distances $\hat{x}_{ui}=-d(\vx_u,\vx_i)$ and
$\hat{x}_{uj}=-d(\vx_u,\vx_j)$:</p>
<div class="eqdesktop">
$$
\cL_m(\vtheta)=\sum_{(u,i)\in\cS}\sum_{(u,j)\nin\cS}
w_{ui}\left[m+d(\vx_u,\vx_i)^2-d(\vx_u,\vx_j)^2\right]_+.
$$
</div>
<div class="eqmobile" style="fontsize:75%">
$$
\cL_m(\vtheta)=\sum_{(u,i)\in\cS}\sum_{(u,j)\nin\cS}
w_{ui}\left[m+d(\vx_u,\vx_i)^2-d(\vx_u,\vx_j)^2\right]_+.
$$
</div>
<p>The set $\cS$ is the set of observed positive interactions. The gradients caused
by the WARP loss for a user $u$ and its positive and negative items are
illustrated in the figure below.</p>
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/cmlpositiveandnegativeitemgradients.png?v=1" alt="CML Gradients" width="100%" />
<div class="figurecaption">
The figure by <a class="citation" href="#cml">[9]</a> shows the gradients created by the WARP loss
in CML. For the positive items of a user, gradients are created to pull them closer
until they lie within a certain margin $m$ to the user. For the negative items of a
user, gradients are created to push them away until they lie far away from the user,
outside a ball of radius $m$, where $m$ is the margin, usually chosen as $1$.
</div>
</div>
</li>
<li>
<p>The regularization term $\Omega(\vtheta)$ uses <em>covariance
regularization</em> as proposed by Cogswell et al. <a class="citation" href="#cogswell2015reducing">[34]</a>.</p>
<script type="math/tex; mode=display">\Omega(\vtheta)=\norm{\MSigma - \diag(\MSigma)}_2^2,</script>
<p>where $\MSigma$ is the covariance matrix of the concatenation of all the user
and item embeddings.
This decorrelates the dimensions of the metric space. Since covariances can
be seen as a measure of linear redundancy between dimensions, this loss
essentially tries to prevent each dimension from being redundant by penalizing
offdiagonal entries in the covariance matrix and thus encouraging the
embeddings to efficiently utilize the given space.
The covariance matrix is computed as follows: Let $\MY$ be the concatenation
of the $d$-dimensional user- and item-embeddings $\MX_U$ and $\MX_I$:</p>
<script type="math/tex; mode=display">\MY=\begin{bmatrix}
\MX_U\\
\MX_I
\end{bmatrix}
\in\R^{(\card{U}+\card{I})\times d}.</script>
<p>The mean embedding vector is then computed as</p>
<div class="eqdesktop">
$$
\vmu
:=
\frac{1}{\card{U}+\card{I}}\sum_{i=1}^{\card{U}+\card{I}}
\MY_{i,:}\in\R^{1\times d},
$$
</div>
<div class="eqmobile">
$$
\vmu
:=
\tfrac{1}{\card{U}+\card{I}}\sum_{i=1}^{\card{U}+\card{I}}
\MY_{i,:},
$$
</div>
<p>and the covariance matrix $\MSigma$ is obtained via</p>
<div class="eqdesktop">
$$
\MSigma
=
\frac{1}{\card{U}+\card{I}}
\sum_{i=1}^{\card{U}+\card{I}}
\left(\MY_{i,:}-\vmu\right)^\T
\left(\MY_{i,:}-\vmu\right)\in\R^{d\times d}.
$$
</div>
<div class="eqmobile">
$$
\MSigma
=
\tfrac{1}{\card{U}+\card{I}}
\sum_{i=1}^{\card{U}+\card{I}}
\left(\MY_{i,:}-\vmu\right)^\T
\left(\MY_{i,:}-\vmu\right).
$$
</div>
</li>
<li>
<p>The optimization constraint $\norm{\vx_{*}}\leq 1$ forces all user and
item embeddings $\vx_{*}$ to stay within the unit sphere in order to easily
apply locality-sensitive hashing (LSH) <a class="citation" href="#lsh1">[16]</a> later. L2-regularization is
intentionally avoided, as it would just pull the embeddings towards the
origin. The authors argue that in the metric space the origin does not have
any specific meaning.</p>
</li>
</ul>
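<p>The covariance regularization term above can be sketched as a direct transcription of the formulas (not the authors' implementation):</p>

```python
import numpy as np

def covariance_regularizer(X_U, X_I):
    """Covariance regularization term Omega(theta) (sketch).

    Stacks the user and item embeddings, computes their covariance matrix,
    and penalizes the squared off-diagonal entries, which encourages the
    dimensions of the metric space to be decorrelated.
    """
    Y = np.vstack([X_U, X_I])                      # (|U|+|I|) x d
    mu = Y.mean(axis=0, keepdims=True)             # mean embedding vector
    Sigma = (Y - mu).T @ (Y - mu) / Y.shape[0]     # d x d covariance matrix
    off_diag = Sigma - np.diag(np.diag(Sigma))
    return np.sum(off_diag ** 2)
```

<p>Embeddings whose dimensions are already decorrelated incur (near-)zero penalty, while linearly redundant dimensions are penalized.</p>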
<p>One great advantage of CML is that recommendations can be easily
computed on massive datasets. Since CML uses the Euclidean distance to
represent the relevances, it can be used with off-the-shelf LSH. In contrast, matrix
factorization approaches would have to use approximate Maximum Inner Product Search
(MIPS), which is considered to be a much harder problem than LSH <a class="citation" href="#mips">[15]</a>.
A disadvantage of CML might be that the distance function itself is not
expressive enough to represent the complex user-item relevance relationships,
which might be better modeled through arbitrary non-linear interaction functions as
suggested in some of the approaches that follow.</p>
<p>Still, the example of CML clearly illustrated the benefits obtained through
<em>similarity propagation</em> when learning a distance metric to predict the
relevances. This also motivates the next recommender system approaches, which
learn distance metrics in hyperbolic space to predict rankings.</p>
<h2 id="hyperbolicrecommendersystems">Hyperbolic Recommender Systems</h2>
<p>So far, three approaches <a class="citation" href="#hrs">[35, 36, 37]</a> harnessing
hyperbolic geometry for recommender systems have been published. All of them represent users and
items in hyperbolic geometry and predict the relevance between users and items as</p>
<script type="math/tex; mode=display">% <![CDATA[
\hat{x}_{ui}=\alpha d(\vx_u,\vx_i),\quad\text{with }\alpha<0, %]]></script>
<p>where $\vx_u$ and $\vx_i$ are the trained user- and item-embeddings, lying in
hyperbolic space, and $d$ is the geodesic distance function in hyperbolic
geometry.</p>
<p>Since all approaches use distance
metrics to represent the relevance relationships between users and items, they all
fall under the category of <em>metric learning approaches</em>, just like the aforementioned
CML. Therefore, they also benefit from the <em>similarity propagation</em> phenomenon, as
the hyperbolic geodesic distance also has to respect the triangle inequality.</p>
<p>The approaches of <a class="citation" href="#hrs">[35, 36, 37]</a> train their embeddings via
the BPR loss <a class="citation" href="#rendle2009bpr">[32]</a>, yielding the optimization objective</p>
<div class="eqdesktop">
$$
\cL(\vtheta)
=
-
\sum_{(u,i,j)}
\log\left(\sigma\left(
\alpha\left(
d(\vx_u,\vx_i)-d(\vx_u,\vx_j)
\right)\right)\right),
$$
</div>
<div class="eqmobile">
$$
-
\sum_{(u,i,j)}
\log\left(\sigma\left(
\alpha\left(
d(\vx_u,\vx_i)-d(\vx_u,\vx_j)
\right)\right)\right),
$$
</div>
<p>where the parameters $\vtheta$ consist of the user embeddings $\MX_U$, the item
embeddings $\MX_I$ and the scalar $\alpha$. In some of their experiments
<a class="citation" href="#shrs">[36]</a> also use the WMRB <a class="citation" href="#liu2017wmrb">[38]</a>
loss. The embeddings are trained via Riemannian optimization methods for hyperbolic spaces.
An illustration of the architecture of the <em>pairwise learning approach</em> is given in the figure
below.</p>
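The BPR objective above can be sketched for a single triplet $(u,i,j)$ as follows, assuming plain NumPy; the function names and toy distance values are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_loss(d_pos, d_neg, alpha=-1.0):
    """BPR loss on a distance triplet (u, i, j): pushes the positive item i
    closer to u than the sampled negative j. With alpha < 0, distances
    become relevance scores, so the margin is positive when d_pos < d_neg."""
    margin = alpha * (d_pos - d_neg)
    return -np.log(sigmoid(margin))

# A triplet where the positive item is already closer yields a small loss.
print(bpr_loss(d_pos=0.2, d_neg=1.5))   # small loss
print(bpr_loss(d_pos=1.5, d_neg=0.2))   # large loss
```

In the actual papers $d$ is the hyperbolic geodesic distance and the gradients are followed with Riemannian optimization; this sketch only shows the loss itself.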
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/hrs.png?v=1" alt="Pairwise Learning in Hyperbolic Recommender System" width="100%" />
<div class="figurecaption">
Pairwise learning approach with hyperbolic recommender systems.
Illustration by <a class="citation" href="#hrs">[35]</a>.
</div>
</div>
<p>Even though Vinh et al. <a class="citation" href="#hrs">[35]</a> use the Poincaré ball, and
Chamberlain et al. <a class="citation" href="#shrs">[36]</a> use the hyperboloid as their model of
hyperbolic geometry, the two approaches can be considered practically
equivalent. The only real difference is that Chamberlain et al. further apply
some L2-regularization to the user and item embeddings.</p>
<p>Chamberlain et al. <a class="citation" href="#shrs">[36]</a> further explored the
possibility of expressing users as the Einstein midpoint $\vmu$ (corresponding to a mean)
of their positively interacted items in order to reduce the number of learned parameters:</p>
<div class="eqdesktop">
$$
\vx_u:=\vmu(\dset{i\in I}{u \text{ has positively interacted with }i}).
$$
</div>
<p>In their experiments, Chamberlain et al. showed that expressing the users as the
midpoint of their interacted items can speed up the training, due to the
reduced number of parameters, without sacrificing the model’s recommendation
performance. Such an approach is particularly useful for <em>asymmetric</em> datasets that contain
many more users than items ($\card{U}\gg\card{I}$).</p>
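A small sketch of how such a midpoint could be computed, assuming the item points are given in Klein-model coordinates, where the Einstein midpoint has a closed form with Lorentz factors $\gamma_i$; the function name and toy points are illustrative:

```python
import numpy as np

def einstein_midpoint(points):
    """Einstein midpoint of points in Klein-model coordinates (norms < 1):
    a gyrovector-space analogue of the Euclidean mean, weighting each point
    by its Lorentz factor gamma_i = 1 / sqrt(1 - ||x_i||^2)."""
    sq_norms = np.sum(points ** 2, axis=1)
    gammas = 1.0 / np.sqrt(1.0 - sq_norms)   # Lorentz factors
    return (gammas[:, None] * points).sum(axis=0) / gammas.sum()

# A user is represented as the midpoint of the items they interacted with.
interacted_items = np.array([[0.1, 0.0], [0.3, 0.2], [-0.1, 0.4]])
x_u = einstein_midpoint(interacted_items)
print(x_u, np.linalg.norm(x_u) < 1.0)
```

The midpoint stays inside the ball, so it is itself a valid hyperbolic point and can be plugged into the distance-based relevance score without learning a separate user embedding.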
<p>The third hyperbolic recommender system approach, by Schmeier et al. <a class="citation" href="#bhrs">[37]</a>, embeds its entities using a different loss</p>
<script type="math/tex; mode=display">\cL(\vtheta)
=
-
\sum_{(u,i)\in\cD}
\log\left(
\frac{
e^{\hat{x}_{ui}}
}{
e^{\hat{x}_{ui}}
+
\sum_{(u,j)\in\cD'}
e^{\hat{x}_{uj}}
}
\right),</script>
<p>where $\cD$ is the dataset of positive interactions and $\cD'$ is a set of $K$
negative interactions for user $u$, obtained through negative sampling. This loss
also encourages that relevant user-item pairs are close by, and irrelevant
user-item pairs are farther apart. It is also exactly the same loss that was used
by Nickel & Kiela <a class="citation" href="#nickel2017poincare">[39]</a> to train a
graph embedding for the representation of hypernymy relations in hyperbolic space.</p>
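This sampled-softmax loss can be sketched with a numerically stable log-sum-exp; the function name and the toy relevance scores (each of the form $\alpha\,d$, so higher means more relevant) are illustrative:

```python
import numpy as np

def sampled_softmax_loss(score_pos, scores_neg):
    """Negative-sampling loss of the Nickel & Kiela form:
    -log( e^{s(u,i)} / (e^{s(u,i)} + sum_j e^{s(u,j)}) )."""
    logits = np.concatenate([[score_pos], scores_neg])
    # log-sum-exp trick for numerical stability
    m = logits.max()
    log_denom = m + np.log(np.exp(logits - m).sum())
    return -(score_pos - log_denom)

# A positive pair scored higher than the K sampled negatives gives a small loss.
negs = np.array([-2.0, -3.0, -1.5])
print(sampled_softmax_loss(-0.2, negs))   # small loss
print(sampled_softmax_loss(-5.0, negs))   # large loss
```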
<p>To conclude, let’s discuss the advantages and disadvantages of these hyperbolic
recommender models. The advantages of these metric learning approaches are the same as
those of CML: the most important one being that they all profit from similarity
propagation and thus simultaneously learn user-user and item-item similarity.
Furthermore, as revealed in the experiments of all three approaches, the choice of
hyperbolic geometry turned out to provide a good inductive bias for the representations.
One can also imagine performing fast nearest-neighbour search on massive
datasets through a generalization of LSH to hyperbolic space. Similarly to CML,
these approaches have the disadvantage that the distance function might not be
powerful enough to express the user-item relevance relationships entirely. It might be
that this relationship is even better modeled through a nonlinear interaction function, as
suggested in the next approach.</p>
<h2 id="neuralcollaborativefilteringncf">Neural Collaborative Filtering (NCF)</h2>
<p>In contrast to the bilinear prediction function used with matrix factorization, or
a rigid distance function as used with the metric learning approaches,
<em>Neural Collaborative Filtering (NCF)</em>, introduced by He et al.
<a class="citation" href="#ncf">[33]</a>, aims to learn a <em>nonlinear interaction function</em>
acting on trained user and item embeddings to predict the relevances. The nonlinear
interaction function is implemented through two models that are trained jointly:</p>
<ul>
<li>
<p><strong>Generalized Matrix Factorization (GMF):</strong>
representing a generalization of the inner product, where each term in
the inner product is further scaled by an individual factor.</p>
</li>
<li>
<p><strong>Multi-Layer Perceptron (MLP):</strong> a pyramidal 3-layer perceptron
with ReLU activations.</p>
</li>
</ul>
<p>The ranking $\hat{x}_{ui}$, representing the relevance of item $i\in I$ for
a user $u\in U$, is computed as follows:</p>
<ul>
<li>
<p>First, the user and item embeddings, which are also learned, are retrieved
for each of the two joint models:</p>
<script type="math/tex; mode=display">\vx_u^{GMF}, \vx_i^{GMF},\qquad \vx_u^{MLP}, \vx_i^{MLP}.</script>
</li>
<li>
<p>Then, the interaction for GMF is computed by building the
element-wise product of the corresponding user and item embedding vectors.
A weighted sum of the product’s coefficients is then computed through a
parameter vector $\vh$, and a bias is added:</p>
<script type="math/tex; mode=display">\hat{x}_{ui}^{GMF}=\vh^T\left(\vx_u^{GMF}\odot \vx_i^{GMF}\right) +
b^{GMF}.</script>
<p>Note how for $\vh=(1,\ldots,1)^\T$ and $b^{GMF}=0$ the GMF model
recovers the usual dot product that is used in matrix factorization.</p>
</li>
<li>
<p>Then, the interaction for the MLP is computed by concatenating the
corresponding user and item embeddings and passing them
through the pyramidal 3-layer perceptron with ReLU activations. Also,
a bias is added:</p>
<script type="math/tex; mode=display">\hat{x}_{ui}^{MLP}=\text{MLP}(\vx_{u}^{MLP}, \vx_{i}^{MLP}) + b^{MLP}.</script>
</li>
<li>
<p>In order to weigh the importance of both models, the results of GMF
and the MLP may be scaled by factors $\alpha$ and $(1-\alpha)$, where
$\alpha\in(0,1)$. In their proposed architecture He et al. fixed
$\alpha=0.5$. Finally, the ranking $\hat{x}_{ui}$ is obtained by building a
convex combination of the activations $\hat{x}_{ui}^{GMF}$ and
$\hat{x}_{ui}^{MLP}$ and then passing the result through the sigmoid
function:</p>
<script type="math/tex; mode=display">\hat{x}_{ui}=\sigma\left(
\alpha \hat{x}_{ui}^{GMF} + (1-\alpha)\hat{x}_{ui}^{MLP}
\right).</script>
</li>
</ul>
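The steps above can be sketched as a single forward pass. The parameter shapes and random initialization below are illustrative stand-ins for trained weights, not He et al.'s implementation:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ncf_forward(xu_gmf, xi_gmf, xu_mlp, xi_mlp, h, b_gmf, Ws, bs, b_mlp, alpha=0.5):
    """NCF forward pass: GMF is a weighted element-wise product, the MLP acts
    on the concatenated embeddings, and the two outputs are combined by a
    convex combination followed by a sigmoid."""
    x_gmf = h @ (xu_gmf * xi_gmf) + b_gmf
    z = np.concatenate([xu_mlp, xi_mlp])
    for W, c in zip(Ws, bs):          # pyramidal ReLU layers
        z = relu(W @ z + c)
    x_mlp = z[0] + b_mlp              # final layer maps to a scalar
    return sigmoid(alpha * x_gmf + (1 - alpha) * x_mlp)

rng = np.random.default_rng(0)
d = 4
xu_gmf, xi_gmf = rng.normal(size=d), rng.normal(size=d)
xu_mlp, xi_mlp = rng.normal(size=d), rng.normal(size=d)
h = np.ones(d)   # h = (1, ..., 1) with zero bias recovers the plain dot product
Ws = [rng.normal(size=(4, 2 * d)), rng.normal(size=(2, 4)), rng.normal(size=(1, 2))]
bs = [np.zeros(4), np.zeros(2), np.zeros(1)]
score = ncf_forward(xu_gmf, xi_gmf, xu_mlp, xi_mlp, h, 0.0, Ws, bs, 0.0)
print(score)   # a relevance score in (0, 1)
```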
<p>An illustration of the architecture of NCF is given in the figure below. The
models GMF and MLP can also be instantiated on their own, while the sigmoid
output function is maintained.</p>
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/ncf.png?v=1" alt="Neural Collaborative Filtering (NCF)" width="100%" />
<div class="figurecaption" style="textalign:center">
NCF architecture illustrated by <a class="citation" href="#ncf">[33]</a>
</div>
</div>
<p>In their experiments He et al. trained NCF using the binary cross-entropy loss
with negative sampling. One could also use the pairwise BPR loss to train the
model’s parameters; however, in their experiments He et al. observed better
performance metrics with the binary cross-entropy loss and negative sampling:</p>
<div class="eqdesktop">
$$
\cL(\vtheta)=-\sum_{(u,i)\in\cD}
\left[x_{ui}\log(\hat{x}_{ui}) + (1-x_{ui})\log(1-\hat{x}_{ui})\right],
$$
</div>
<div class="eqmobile">
$$
-\sum_{\llap{(u,i)}\rlap{\in\cD}}
\left[x_{ui}\log(\hat{x}_{ui}) + (1-x_{ui})\log(1-\hat{x}_{ui})\right],
$$
</div>
<p>where the training instances $(u,i)\in\cD$ consist of positive and
negative interactions. The negative interactions are oversampled according to a
negative sampling factor of $K$, where $K=5$ turned out to work well for both
datasets (MovieLens-1M and Pinterest) used in the experiments of He et al.</p>
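A sketch of how such training instances might be assembled and scored, assuming NumPy; <code>sample_training_pairs</code> and <code>bce_loss</code> are hypothetical helper names, not part of He et al.'s code:

```python
import numpy as np

def bce_loss(preds, labels, eps=1e-9):
    """Binary cross-entropy over positive and sampled negative pairs."""
    preds = np.clip(preds, eps, 1 - eps)
    return -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))

def sample_training_pairs(positives, n_items, K=5, seed=0):
    """For each observed pair (u, i), draw K unobserved items j as negatives."""
    rng = np.random.default_rng(seed)
    observed = set(positives)
    pairs, labels = [], []
    for (u, i) in positives:
        pairs.append((u, i)); labels.append(1.0)
        drawn = 0
        while drawn < K:
            j = int(rng.integers(n_items))
            if (u, j) not in observed:       # only unobserved items are negatives
                pairs.append((u, j)); labels.append(0.0)
                drawn += 1
    return pairs, np.array(labels)

pairs, labels = sample_training_pairs([(0, 1), (0, 3), (2, 7)], n_items=10, K=5)
preds = np.full(len(labels), 0.5)            # an untrained model's predictions
print(len(pairs), bce_loss(preds, labels))   # 3 positives + 15 negatives
```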
<p>While the models GMF and MLP can each be instantiated on their own, the
experiments of He et al. <a class="citation" href="#ncf">[33]</a> revealed that the joint model
outperformed the individual models in every case. For the datasets
MovieLens-1M and Pinterest, their proposed architecture outperformed a state-of-the-art
matrix factorization baseline, eALS <a class="citation" href="#eals">[22]</a> (alternating
least-squares MF). Thus, one can argue that modeling the interaction as a nonlinear
function, instead of a bilinear function, is advantageous for the accurate prediction
of rankings.</p>
<p>The disadvantages of NCF are that, in contrast to the metric learning
approaches, NCF lacks interpretability and does not automatically learn
user-user and item-item similarities via similarity propagation.
Also, the rank prediction is rather computationally expensive and does not scale
well to predictions over the full set of items, should one desire to do so.
Moreover, techniques like LSH and MIPS cannot be applied to its embeddings or latent
representations to get fast nearest-neighbour search. However, for massive datasets it
may still be used in conjunction with a retrieval system.</p>
<h2 id="autoencodersforcollaborativefiltering">Autoencoders for Collaborative Filtering</h2>
<p>Recently, autoencoders have gained momentum in the field of recommender systems.
The important connection to notice here is that a 1-layer autoencoder with
linear activation functions reduces to the problem of <em>matrix factorization</em></p>
<script type="math/tex; mode=display">\cL(\vtheta)=\norm{\MX-\MD\MC\MX}_2^2,</script>
<p>where the original interaction matrix $\MX$ is approximated through the
low-rank product $\MD\MC\MX$ of the linear encoder $\MC$ and decoder $\MD$.</p>
<p>So, in some sense, general autoencoder approaches can be seen as
<em>nonlinear matrix factorization</em>, where the interaction matrix is
approximated through nonlinear encoder and decoder functions, $C$ and $D$,
that are learned through optimizing the objective</p>
<script type="math/tex; mode=display">\cL(\vtheta)=\norm{\MX-D(C(\MX))}_2^2.</script>
<p>One important advantage of these autoencoder approaches, assuming that they are
trained on user-interaction vectors, is that they can achieve <em>strong</em>
generalization (as explained in the work of <a class="citation" href="#liang2018variational">[40]</a>),
since they can make predictions for a user or item interaction vector that was not
observed at training time, whereas all the approaches that we have seen so far
relied on having a pretrained embedding vector for each user and item.</p>
<p>Similarly as with matrix factorization, the major challenge in these autoencoder
approaches is that for typical recommender system datasets the input vectors are
<em>extremely sparse</em>, which inhibits obtaining informative gradients
<a class="citation" href="#strub2015collaborative">[41]</a>. One of the first papers
applying autoencoders to collaborative filtering was the one by Sedhain et al.,
proposing the architecture Autorec <a class="citation" href="#sedhain2015autorec">[42]</a>.
Autorec computes the reconstruction of an input
$\vx\in\R^d$, which can be either a user- or an item-interaction vector, via a
shallow autoencoder</p>
<script type="math/tex; mode=display">h(\vx;\vtheta)
=
f(\MW g(\MV\vx +\vmu)+\vb),</script>
<p>where $f$ and $g$ are elementwise activation functions. The objective trained
to optimize Autorec’s parameters $\vtheta$ is</p>
<div class="eqdesktop">
$$
\min_{\vtheta}\sum_{\vx\in\cS}\norm{\vx-h(\vx;\vtheta)}_{\cO}^2
+\frac{\lambda}{2}
(\norm{\MW}_F^2+\norm{\MV}_F^2),
\quad
\lambda >0,
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
\min_{\vtheta}
&\sum_{\vx\in\cS}\norm{\vx-h(\vx;\vtheta)}_{\cO}^2
\\
&+\frac{\lambda}{2}
(\norm{\MW}_F^2+\norm{\MV}_F^2),
\quad
\lambda >0,
\end{align*}
$$
</div>
<p>where $\norm{\argdot}_{\cO}$ means that gradient updates are only computed
for parameters that are connected to the <em>observed</em> entries of the interaction
vector $\vx$. So, there actually exist two versions of Autorec: $U$-Autorec
and $I$-Autorec. They differ by the training examples that they consider:
$U$-Autorec uses user-interaction vectors $\cS_U=\set{\vx_u}_{u\in U}$ and
$I$-Autorec uses item-interaction vectors $\cS_I=\set{\vx_i}_{i\in I}$. The
training/validation/test split was done via a random 80%/10%/10% split
of the interactions. In the case of $I$-Autorec, the prediction of the
relevance of item $i$ for a user $u$ is done through</p>
<script type="math/tex; mode=display">\hat{x}_{ui}=(h(\vx_i))_u.</script>
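A minimal sketch of the Autorec forward pass and the masked objective, using tiny random (untrained) weights purely for illustration; the helper names are not from the paper:

```python
import numpy as np

def autorec_forward(x, V, mu, W, b, g=np.tanh, f=lambda z: z):
    """Shallow Autorec reconstruction h(x) = f(W g(V x + mu) + b)."""
    return f(W @ g(V @ x + mu) + b)

def masked_loss(x, x_hat, observed, lam, params):
    """Squared error restricted to *observed* entries, plus L2 regularization
    on the weight matrices (the ||.||_O semantics of the objective above)."""
    err = np.sum(((x - x_hat) * observed) ** 2)
    reg = 0.5 * lam * sum(np.sum(p ** 2) for p in params)
    return err + reg

rng = np.random.default_rng(0)
d, k = 6, 3                                       # 6 users, latent size 3 (I-Autorec)
V, W = rng.normal(scale=0.1, size=(k, d)), rng.normal(scale=0.1, size=(d, k))
mu, b = np.zeros(k), np.zeros(d)
x_i = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 1.0])    # item-interaction vector
observed = (x_i > 0).astype(float)                # mask of observed ratings
x_hat = autorec_forward(x_i, V, mu, W, b)
print(masked_loss(x_i, x_hat, observed, lam=0.01, params=[V, W]))
```

The mask is what keeps the unobserved (zero) entries from dragging the reconstruction toward zero, which is precisely the sparsity issue discussed above.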
<p>The architecture and gradient updates of $I$-Autorec are illustrated in
the following picture.</p>
<div class="figurewithcaption">
<img src="/img/20191216AnOverviewofCollaborativeFilteringAlgorithms/autorec.png?v=1" alt="Architecture of Autorec" width="100%" />
<div class="figurecaption">
Architecture of $I$-Autorec: The edges (parameters)
that are connected to unobserved interactions are not
updated due to the masking of the loss
(masked edges marked in gray).
</div>
</div>
<p>In their experiments, Sedhain et al. noticed that $I$-Autorec performed better
than $U$-Autorec and argued that this had to do with the fact that, in their
considered datasets, the item-interaction vectors were denser than the
user-interaction vectors, leading to more reliable predictions. While they do
not state this explicitly, one may also suspect that using the denser
item-interaction vectors led to more and better gradients, since more
inputs are non-zero. In Sedhain et al.’s experiments Autorec outperformed
classical matrix factorization methods, motivating the use of autoencoders.
Preliminary experiments also revealed that deeper autoencoders perform better.</p>
<p>The paper by Strub & Mary <a class="citation" href="#strub2015collaborative">[41]</a> was then one of the
first autoencoder approaches for collaborative filtering to explicitly state the
sparsity problem and to tackle it concretely. Strub & Mary
also trained a masked squared reconstruction loss similar to that of Sedhain
et al., with two important differences:</p>
<ul>
<li>
<p>They did not just consider the reconstruction error, but the
<em>prediction</em> error for a heldout item, therefore changing the
reconstruction target by the additional target item entry.</p>
</li>
<li>
<p>They applied masking noise to the input interaction vectors to
have the autoencoder learn to reconstruct interactions.</p>
</li>
</ul>
<p>In their experiments, Strub & Mary also showed that autoencoders can
outperform state-of-the-art recommender baselines.</p>
<p>Inspired by $I$-Autorec <a class="citation" href="#sedhain2015autorec">[42]</a> and the approach of Strub &
Mary <a class="citation" href="#strub2015collaborative">[41]</a>, Kuchaiev & Ginsburg
<a class="citation" href="#kuchaiev2017training">[43]</a> from NVIDIA came along with further improvements and
another solution to tackle the sparsity problem. Kuchaiev & Ginsburg used a
technique called <em>dense refeeding</em> to deal with the natural sparsity
of the interaction vectors.
With dense refeeding, the output $h(\vx)$, which is considered to be denser since
it is a probability distribution, is re-fed to the autoencoder as an input and
reconsidered as a new training example with the same original target. Their
argument for using $h(\vx)$ again
as an input is that $h(\vx)$ should represent a <em>fixed point</em> of the
autoencoder: $h(h(\vx))=h(\vx)$. Kuchaiev & Ginsburg mention that dense
refeeding could be repeated several times, but they only applied it once.
A further improvement that Kuchaiev & Ginsburg made is to use a time-based
training/validation/test split, motivated by the fact that the recommender
system should predict future ratings from past ones, rather than randomly
missing ratings (as opposed to the splits used in
<a class="citation" href="#sedhain2015autorec">[42, 41]</a>).
In their experiments, Kuchaiev & Ginsburg observed that deeper autoencoders
with SELU activation functions and a high dropout rate (e.g., 0.8), where dropout
is only applied to the latent representation, trained together with dense
refeeding performed the best. An important remark here is that one should not
apply dropout to the initial layer, as happens with the masking noise of Strub
& Mary, if one does not want the recommender system to learn to predict
<em>any</em> random missing rating (e.g., correlations between items), but rather
a <em>future</em> rating.</p>
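A toy sketch of one dense-refeeding step with an untrained stand-in autoencoder; this only illustrates the data flow (the dense output is re-fed as an input), not the paper's actual deep SELU architecture or training loop:

```python
import numpy as np

def h(x, W_enc, W_dec):
    """Toy autoencoder forward pass; a stand-in for the deep SELU
    autoencoder of Kuchaiev & Ginsburg."""
    return W_dec @ np.tanh(W_enc @ x)

rng = np.random.default_rng(0)
d, k = 8, 3
W_enc = rng.normal(scale=0.3, size=(k, d))
W_dec = rng.normal(scale=0.3, size=(d, k))

x = np.zeros(d); x[[1, 4]] = 1.0   # sparse interaction vector
y1 = h(x, W_enc, W_dec)            # dense output
# Dense refeeding: reuse the dense output as an additional training input.
y2 = h(y1, W_enc, W_dec)
# The fixed-point motivation h(h(x)) = h(x) says this gap should shrink
# as training progresses (here the weights are untrained, so it is nonzero).
fixed_point_gap = np.linalg.norm(y2 - y1)
print(fixed_point_gap)
```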
<p>Recently, Liang et al. from Netflix <a class="citation" href="#liang2018variational">[40]</a> came along and
extended variational autoencoders (VAEs) to collaborative filtering. Their
generative model uses a multinomial likelihood, with the motivation that it is
better suited for modeling implicit feedback data, since it is a closer proxy to
the ranking loss than the more popular likelihood functions, such as the
multivariate Gaussian (induced by the squared loss) and the logistic likelihood.</p>
<p>The entire VAE architecture is then given by the encoder $g_{\phi}$ and decoder $f_{\theta}$ as</p>
<div class="eqdesktop">
$$
\vx_u
\stackrel{g_{\phi}}{\mapsto}
(\vmu_{\phi}(\vx_u),\vsigma_{\phi}(\vx_u))
\stackrel{\epsilon\sim\cN(\vo,\MI)}{\mapsto}
\vz_u=\vmu_{\phi}(\vx_u)+\epsilon\odot\vsigma_{\phi}(\vx_u)
\stackrel{f_{\theta}}{\mapsto}
\hat{\vx}_u,
$$
with
$$
\pi(\vz_u)\propto\exp(f_{\theta}(\vz_u)),
$$
$$
\hat{\vx}_u\sim Mult(N_{u},\pi(\vz_{u})),
$$
$$
\log\left(\cDist{p_{\theta}}{\hat{\vx}_{u}}{\vz_u}\right)
=
\sum_{i=1}^{\card{I}}x_{ui}\log(\pi_i(\vz_u)).
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
\vx_u
&\stackrel{g_{\phi}}{\mapsto}
(\vmu_{\phi}(\vx_u),\vsigma_{\phi}(\vx_u))
\\
&\stackrel{\llap{\epsilon\sim}\rlap{\cN(\vo,\MI)}}{\mapsto}
\vz_u=\vmu_{\phi}(\vx_u)+\epsilon\odot\vsigma_{\phi}(\vx_u)
\\
&\stackrel{f_{\theta}}{\mapsto}
\hat{\vx}_u,
\end{align*}
$$
with
$$
\pi(\vz_u)\propto\exp(f_{\theta}(\vz_u)),
$$
$$
\hat{\vx}_u\sim Mult(N_{u},\pi(\vz_{u})),
$$
$$
\log\left(\cDist{p_{\theta}}{\hat{\vx}_{u}}{\vz_u}\right)
=
\sum_{i=1}^{\card{I}}x_{ui}\log(\pi_i(\vz_u)).
$$
</div>
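The multinomial log-likelihood above can be sketched directly, using a numerically stable log-softmax; the toy logits stand in for the decoder output $f_{\theta}(\vz_u)$:

```python
import numpy as np

def multinomial_log_likelihood(x, f_z):
    """log p(x | z) = sum_i x_ui * log(pi_i(z)), with pi = softmax(f_theta(z))."""
    m = f_z.max()
    log_pi = f_z - (m + np.log(np.exp(f_z - m).sum()))   # stable log-softmax
    return float(np.sum(x * log_pi))

x_u = np.array([1.0, 0.0, 1.0, 0.0, 1.0])             # implicit feedback, 5 items
good_logits = np.array([2.0, -2.0, 2.0, -2.0, 2.0])   # mass on interacted items
bad_logits = -good_logits                             # mass on the wrong items
print(multinomial_log_likelihood(x_u, good_logits) >
      multinomial_log_likelihood(x_u, bad_logits))    # prints True
```

Because the probabilities $\pi_i(\vz_u)$ must sum to one, the items effectively compete for probability mass, which is why this likelihood behaves as a closer proxy to a ranking loss than independent per-item losses.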
<p>The VAE is trained by maximizing an annealed version of the evidence lower bound (ELBO)</p>
<div class="eqdesktop">
$$
\cL_{\beta}(\theta,\phi)
=
\sum_{\vx_u}\left[
\Exp[
\cDist{q_{\phi}}{\vz_u}{\vx_u}
]{\log\left(
\cDist{p_{\theta}}{\hat{\vx}_u}{\vz_u}
\right)}
-\beta
KL\left(
\cDist{q_{\phi}}{\vz_u}{\vx_u}
\,\middle\|\,
p(\vz_u)
\right)
\right],
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
\cL_{\beta}(\theta,\phi)
&=
\sum_{\vx_u}\Big[
\Exp[
\cDist{q_{\phi}}{\vz_u}{\vx_u}
]{\log\left(
\cDist{p_{\theta}}{\hat{\vx}_u}{\vz_u}
\right)}
\\
&\;\quad-\beta
KL\left(
\cDist{q_{\phi}}{\vz_u}{\vx_u}
\,\middle\|\,
p(\vz_u)
\right)
\Big],
\end{align*}
$$
</div>
<p>where $\beta$ is chosen to be $\beta<1$, relaxing the prior constraint
that</p>
<script type="math/tex; mode=display">\frac{1}{\card{U}}\sum_{u\in
U} \cDist{q_{\phi}}{\vz_u}{\vx_u} \approx p(\vz)=\cN(\vz;\vo,\MI)</script>
<p>and sacrificing the ability to perform good ancestral sampling. However, Liang
et al. argue that ancestral sampling is not needed in collaborative filtering.
Therefore, the prediction is simply performed by using the mean
$\vmu_{\phi}(\vx_u)$ in the forward propagation, without any sampling of
$\epsilon$, giving</p>
<script type="math/tex; mode=display">\hat{\vx}_u=f_{\theta}(\vmu_{\phi}(\vx_u)).</script>
<p>In their experiments Liang et al. showed that the multinomial likelihood yields
slightly better results than the Gaussian and the logistic likelihoods. Further,
they showed that their architecture outperforms several state-of-the-art
baselines.</p>
<p>Let us conclude the presentation of autoencoders for collaborative filtering
with two properties that all autoencoder approaches have in common: One
advantage of all autoencoder approaches is that they do not need any negative
sampling. One major disadvantage is that,
inherently, depending on whether they are item-based or user-based, they either
predict the relevance of all $\card{I}$ items for a user, or the relevance
to all $\card{U}$ users of an item. This means that the input/output dimensions of
the autoencoders are always at least $\min(\card{U},\card{I})$. Thus, they do
not scale to datasets with a massive amount of users and items. Also,
their latent representations are hard to interpret, and it’s not clear how
to perform efficient nearest-neighbour searches as one could with LSH.</p>
<h2 id="summaryofpresentedcollaborativefilteringapproaches">Summary of Presented Collaborative Filtering Approaches</h2>
<p>Let’s recap on a high level what the various recommender system approaches do in
order to generalize to recommendations on unseen interactions:</p>
<ul>
<li>
<p><strong>Item Popularity:</strong> recommends items based on their popularity. The hope is
that prediction through item popularity generalizes to the relevances of unseen
user-item pairs.</p>
</li>
<li>
<p><strong>Matrix Factorization:</strong> learns embeddings that are fed through a <em>bilinear</em> interaction
function, the dot product, to predict the rankings. The hope is that the dot product of
embedding vectors of unseen user-item pairs generalizes to the true relevances.</p>
</li>
<li><strong>Metric Learning Approaches:</strong> learn user and item embeddings in a metric space such
that the distance is correlated with the relevance between users and items. Using a distance
metric allows one to exploit the phenomenon of <em>similarity propagation</em>. The hope is that
the distances between the learned embeddings generalize to the true distances between unseen
user-item pairs.
<ul style="margintop:5px; marginbottom: 5px">
<li><b>Collaborative Metric Learning:</b> uses Euclidean metric space</li>
<li><b>Hyperbolic Recommender Systems:</b> use hyperbolic metric spaces</li>
</ul>
</li>
<li>
<p><strong>Neural Collaborative Filtering:</strong> learns embeddings and the parameters of a <em>nonlinear</em>
interaction function, which is a joint model of a shallow generalized matrix factorization and a
pyramidal MLP, to predict the rankings. The hope is that the rankings predicted by this
nonlinear interaction function with the learned embeddings generalize well to
unseen user-item pairs.</p>
</li>
<li><strong>Autoencoder Approaches:</strong> can be seen as <em>nonlinear</em> matrix factorization. They learn a
<em>nonlinear</em> mapping from user (or item) interaction vectors to a latent representation, or
distribution, from which other relevant items are predicted through a <em>nonlinear</em>
decoding function that gives the relevances of, or a relevance distribution
over, the items (or users for an item). The hope is that the encoding and decoding of (unseen)
interaction vectors gives the right relevance predictions for the yet unobserved entries.</li>
</ul>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="mckinsey">“How retailers can keep up with consumers.” https://www.mckinsey.com/industries/retail/ourinsights/howretailerscankeepupwithconsumers.</span></li>
<li><span id="goodwatercap">“Understanding Spotify: Making Music Through Innovation.” https://www.goodwatercap.com/thesis/understandingspotify, 2018.</span></li>
<li><span id="bennett2007netflix">J. Bennett, S. Lanning, and others, “The netflix prize,” in <i>Proceedings of KDD cup and workshop</i>, 2007, vol. 2007, p. 35.</span></li>
<li><span id="gomez2016netflix">C. A. GomezUribe and N. Hunt, “The netflix recommender system: Algorithms, business value, and innovation,” <i>ACM Transactions on Management Information Systems (TMIS)</i>, vol. 6, no. 4, p. 13, 2016.</span></li>
<li><span id="koren2009matrix">Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” <i>Computer</i>, no. 8, pp. 30–37, 2009.</span></li>
<li><span id="koren2009bellkor">Y. Koren, “The bellkor solution to the netflix grand prize,” <i>Netflix prize documentation</i>, vol. 81, no. 2009, pp. 1–10, 2009.</span></li>
<li><span id="netflixblog">“Netflix Recommendations: Beyond the 5 stars (Part 1).” https://medium.com/netflixtechblog/netflixrecommendationsbeyondthe5starspart155838468f429, 2012.</span></li>
<li><span id="hu2008collaborative">Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in <i>2008 Eighth IEEE International Conference on Data Mining</i>, 2008, pp. 263–272.</span></li>
<li><span id="cml">C.K. Hsieh, L. Yang, Y. Cui, T.Y. Lin, S. Belongie, and D. Estrin, “Collaborative metric learning,” in <i>Proceedings of the 26th international conference on world wide web</i>, 2017, pp. 193–201.</span></li>
<li><span id="adomavicius2011context">G. Adomavicius and A. Tuzhilin, “Contextaware recommender systems,” in <i>Recommender systems handbook</i>, Springer, 2011, pp. 217–253.</span></li>
<li><span id="oard1998implicit">D. W. Oard, J. Kim, and others, “Implicit feedback for recommender systems,” in <i>Proceedings of the AAAI workshop on recommender systems</i>, 1998, vol. 83.</span></li>
<li><span id="davidson2010youtube">J. Davidson <i>et al.</i>, “The YouTube video recommendation system,” in <i>Proceedings of the fourth ACM conference on Recommender systems</i>, 2010, pp. 293–296.</span></li>
<li><span id="cheng2016wide">H.T. Cheng <i>et al.</i>, “Wide & deep learning for recommender systems,” in <i>Proceedings of the 1st workshop on deep learning for recommender systems</i>, 2016, pp. 7–10.</span></li>
<li><span id="kdtrees">J. L. Bentley, “Kd trees for semidynamic point sets,” in <i>Proceedings of the sixth annual symposium on Computational geometry</i>, 1990, pp. 187–197.</span></li>
<li><span id="mips">A. Shrivastava and P. Li, “Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS),” in <i>Advances in Neural Information Processing Systems</i>, 2014, pp. 2321–2329.</span></li>
<li><span id="lsh1">A. Gionis, P. Indyk, R. Motwani, and others, “Similarity search in high dimensions via hashing,” in <i>Vldb</i>, 1999, vol. 99, no. 6, pp. 518–529.</span></li>
<li><span id="lsh2">S. HarPeled, P. Indyk, and R. Motwani, “Approximate nearest neighbor: Towards removing the curse of dimensionality,” <i>Theory of computing</i>, vol. 8, no. 1, pp. 321–350, 2012.</span></li>
<li><span id="lsh3">M. Bawa, T. Condie, and P. Ganesan, “LSH forest: selftuning indexes for similarity search,” in <i>Proceedings of the 14th international conference on World Wide Web</i>, 2005, pp. 651–660.</span></li>
<li><span id="sarwar2000application">B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Application of dimensionality reduction in recommender systema case study,” Minnesota Univ Minneapolis Dept of Computer Science, 2000.</span></li>
<li><span id="koren2008factorization">Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in <i>Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining</i>, 2008, pp. 426–434.</span></li>
<li><span id="pan2008one">R. Pan <i>et al.</i>, “Oneclass collaborative filtering,” in <i>2008 Eighth IEEE International Conference on Data Mining</i>, 2008, pp. 502–511.</span></li>
<li><span id="eals">X. He, H. Zhang, M.Y. Kan, and T.S. Chua, “Fast matrix factorization for online recommendation with implicit feedback,” in <i>Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</i>, 2016, pp. 549–558.</span></li>
<li><span id="mnih2008probabilistic">A. Mnih and R. R. Salakhutdinov, “Probabilistic matrix factorization,” in <i>Advances in neural information processing systems</i>, 2008, pp. 1257–1264.</span></li>
<li><span id="alswr">Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Largescale parallel collaborative filtering for the netflix prize,” in <i>International conference on algorithmic applications in management</i>, 2008, pp. 337–348.</span></li>
<li><span id="koren2009collaborative">Y. Koren, “Collaborative filtering with temporal dynamics,” in <i>Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</i>, 2009, pp. 447–456.</span></li>
<li><span id="xiong2010temporal">L. Xiong, X. Chen, T.K. Huang, J. Schneider, and J. G. Carbonell, “Temporal collaborative filtering with bayesian probabilistic tensor factorization,” in <i>Proceedings of the 2010 SIAM international conference on data mining</i>, 2010, pp. 211–222.</span></li>
<li><span id="luo2014efficient">X. Luo, M. Zhou, Y. Xia, and Q. Zhu, “An efficient nonnegative matrixfactorizationbased approach to collaborative filtering for recommender systems,” <i>IEEE Transactions on Industrial Informatics</i>, vol. 10, no. 2, pp. 1273–1284, 2014.</span></li>
<li><span id="gu2010collaborative">Q. Gu, J. Zhou, and C. Ding, “Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs,” in <i>Proceedings of the 2010 SIAM international conference on data mining</i>, 2010, pp. 199–210.</span></li>
<li><span id="mairal2010online">J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” <i>Journal of Machine Learning Research</i>, vol. 11, no. Jan, pp. 19–60, 2010.</span></li>
<li><span id="dadkhahi2018alternating">H. Dadkhahi and S. Negahban, “Alternating Linear Bandits for Online MatrixFactorization Recommendation,” <i>arXiv preprint arXiv:1810.09401</i>, 2018.</span></li>
<li><span id="wang2017factorization">H. Wang, Q. Wu, and H. Wang, “Factorization bandits for interactive recommendation,” in <i>ThirtyFirst AAAI Conference on Artificial Intelligence</i>, 2017.</span></li>
</ol>
<h1 id="auniversalmodelforhyperboliceuclideanandsphericalgeometries">A Universal Model for Hyperbolic, Euclidean and Spherical Geometries</h1>
<p>The Euclidean space is the default choice for representations in many of
today’s machine learning tasks. Recent advances have also shown how a variety of
tasks can further benefit from representations in (products of) spaces of constant
curvature. The paper by Gu et al. <a class="citation" href="#gu2018learning">[1]</a>
presents an approach that estimates and commits to the curvature of spaces in advance
and then learns embeddings in the chosen (products of) spaces of constant curvature.</p>
<p>In this blogpost we present a geometric model, called the <strong>$\kappa$-Stereographic
Model</strong>, that harnesses the formalism of gyrovector spaces in order to capture all three
geometries of constant curvature (hyperbolic, Euclidean and spherical) at once.
Furthermore, the presented model allows one to smoothly interpolate between the
geometries of constant curvature and thus provides a way to learn the curvature of
spaces jointly with the embeddings.</p>
<p>The $\kappa$-stereographic model has been elaborated within the scope of the B.Sc. and M.Sc.
theses of Andreas Bloch, Gregor Bachmann and Ondrej Skopek at
the Data Analytics Lab of ETH Zürich under the assistance of Octavian Ganea and Gary Bécigneul.
All of the theses were related to learning representations in products of spaces of
constant curvature. Two papers that apply the $\kappa$-stereographic model are:</p>
<ul>
<li><a href="https://arxiv.org/abs/1911.08411">
<strong>Mixed-curvature Variational Autoencoders</strong>
</a>
<a class="citation" href="#skopek2019mixed">[2]</a> (ICLR 2020),<br />
Ondrej Skopek, Octavian-Eugen Ganea, Gary Bécigneul
</li>
<li><a href="https://arxiv.org/abs/1911.05076">
<strong>Constant Curvature Graph Convolutional Networks</strong>
</a>
<a class="citation" href="#bachmann2019constant">[3]</a>,<br />
Gregor Bachmann, Gary Bécigneul, Octavian-Eugen Ganea
</li>
</ul>
<p>This blogpost aims to explain and illustrate the $\kappa$-stereographic
model in more detail. It also accompanies the
<a href="https://github.com/andbloch/geoopt/tree/universalmanifold/geoopt/manifolds/stereographic">PyTorch implementation of the $\kappa$-stereographic model</a> by
Andreas Bloch. Many thanks go to Maxim Kochurov, whose Poincaré ball implementation
provided an excellent starting point.</p>
<p>We’ll start by first looking at the underlying algebraic structure of gyrovector
spaces. Then the formulas for the $\kappa$-stereographic model are given and the
interpolation between spaces of constant curvature is illustrated for a variety
of concepts. Finally, a code example using the $\kappa$-stereographic
model is provided.</p>
<!-- TODO: Introduction to Riemannian Manifolds for Machine Learning Blogpost -->
<!-- The interested reader who is not familiar with Riemannian geometry may have -->
<!-- a look at my introductory blogpost on Riemannian manifolds for machine learning. -->
<h2 id="modelsforsphericalandhyperbolicgeometries">Models for Spherical and Hyperbolic Geometries</h2>
<p>Two common models for hyperbolic and spherical geometries are the hyperboloid
and the sphere. The formulas for these two model classes exhibit a strong duality, which
can be observed, for example, in the formulas of the paper “Spherical and hyperbolic
embeddings of data” <a class="citation" href="#wilson2014spherical">[4]</a>.
One just has to switch a few signs and exchange the trigonometric
functions with their hyperbolic variants in order to get the formulas for the
hyperboloid from the ones of the sphere, and vice versa. This explains why the
hyperboloid is sometimes also referred to as the “<em>pseudosphere</em>.”</p>
<p>Two alternative models for hyperbolic and spherical geometries are the Poincaré
ball and the stereographic projection of the sphere. They result from
the stereographic projection of the hyperboloid and the sphere, respectively, as
illustrated in the figures below:</p>
<div class="contentdesktop">
<table style="margin-bottom:0px; border: 0px;">
<tr style="border: 0px;">
<th style="text-align:center; border: 0px;">Hyperboloid & Poincaré Ball </th>
<th style="text-align:center; border: 0px;">Sphere & Stereographic Projection of Sphere</th>
</tr>
<tr>
<td style="text-align:center; border: 0px;">
<img src="/img/20191121KStereographicModel/hyperboloidsproj.png?v=1" width="70%" />
</td>
<td style="text-align:center; border: 0px;">
<img src="/img/20191121KStereographicModel/spheresproj.png?v=1" width="90%" />
</td>
</tr>
</table>
</div>
<div class="contentmobile">
<div class="figurewithcaption">
<div class="figuretitle" style="margin-bottom: 15px">
Hyperboloid & Poincaré Ball
</div>
<img src="/img/20191121KStereographicModel/hyperboloidsproj.png?v=1" width="100%" />
</div>
<div class="figurewithcaption">
<div class="figuretitle" style="margin-bottom: 15px">
Sphere & Stereographic Projection of Sphere
</div>
<img src="/img/20191121KStereographicModel/spheresproj.png?v=1" width="100%" />
</div>
</div>
<p>In this blogpost we show how the duality of the Poincaré ball and the stereographic
projection of the sphere can be captured within Ungar’s formalism of gyrovector
spaces <a class="citation" href="#ungar2005analytic">[5]</a>. The reason
that the <em>stereographic projections</em> were chosen for the model presented
in what follows is that they allow a smooth interpolation between spaces of positive
and spaces of negative curvature – this can be very useful for learning the curvatures of factors of
product spaces used to train embeddings, but more on that in a future blogpost on
curvature learning in product spaces. For now, we just want to familiarize ourselves with
gyrovector spaces and our concrete instantiation of a gyrovector space: the
<em>$\kappa$-stereographic model</em> for spherical, Euclidean and hyperbolic geometries.</p>
<h2 id="gyrovectorspaces">Gyrovector Spaces</h2>
<p>An important property of hyperbolic space is that it isn’t a vector space. To this
end, Ungar introduced the algebraic structure of <em>gyrovector spaces</em>
<a class="citation" href="#ungar1991thomas">[6]</a>,
which have operations and properties
reminiscent of the ones of vector spaces. Indeed, gyrogroups and gyrovector spaces
are a generalization of groups and vector spaces. One great advantage of gyrovector spaces
is that with Ungar’s gyrovector space approach to hyperbolic geometry we get much
more intuitive and concise formulas for things like geodesics, distances or
the Pythagorean theorem in hyperbolic geometry.</p>
<!--
So far, gyrovector spaces have been mostly used to represent
hyperbolic geometry for the study of special relativity theory. In a recent
work of theirs, Ganea et al. <a class="citation" href="#hnn">[7]</a>
showed how one can harness the formalism of gyrovector spaces to implement the
essential operations for deep neural networks that operate on hyperbolic representations.
Throughout the development of our theses related to curvature learning, we discovered
that the algebraic structure of a gyrovector space can also be instantiated to represent
spherical geometries - more on that in the next section. First, we want to familiarize
ourselves with the abstract notion of gyrovector spaces.
-->
<p>In what follows we present the most important definitions that give rise to
gyrovector spaces. The definitions presented here are taken from
Ungar’s work <a class="citation" href="#ungar2001hyperbolic">[8]</a>, where
the algebra’s axioms and some of its derivable properties are introduced
jointly for convenience. For a presentation that restricts the definitions to a
minimal set of axioms and then separately derives the resulting properties, the
interested reader is referred to Ungar’s other work <a class="citation" href="#ungar2005analytic">[5]</a>.</p>
<div class="mathblock">
<span class="def"><strong>D. (Groupoid)</strong></span> A <em>groupoid</em> $(S, +)$ is a pair of a
nonempty set $S$ and a binary operation $+\colon S\times S\to S$.
</div>
<div class="mathblock">
<span class="def"><strong>D. (Groupoid Automorphism)</strong></span> An
<em>automorphism</em> $\phi$ of a groupoid $(S,+)$ is a <em>bijective</em>
selfmap of $S$ that respects its binary operation,
<div class="eqdesktop">
$$
\forall s_1,s_2\in S
\colon
\quad
\phi(s_1 + s_2)
=
\phi(s_1) + \phi(s_2).
$$
</div>
<div class="eqmobile">
$$
\forall s_1,s_2\in S
\colon
\,
\phi(s_1 + s_2)
=
\phi(s_1) + \phi(s_2).
$$
</div>
</div>
<p>In other words, the groupoid automorphism $\phi$ preserves the structure of the groupoid.
Furthermore, the set of all automorphisms of a groupoid form a group:</p>
<div class="mathblock">
<span class="def"><strong>D. (Automorphism Group)</strong></span> The set of all automorphisms of a
groupoid $(S,+)$ forms a group, denoted as $\Aut(S,+)$.
</div>
<p>Now, we define a gyrogroup, which is an essential component of a gyrovector space.</p>
<div class="mathblock">
<span class="def"><strong>D. (Gyrogroup)</strong></span> A groupoid $(G,\oplus)$ is a
<em>gyrogroup</em> if its binary operation satisfies the following axioms and properties. In $G$, there
exists a unique element, $0$, called the identity, satisfying
<div class="eqdesktop">
$$
0\oplus a= a\oplus 0= a,
\qquad\text{additive identity},
$$
</div>
<div class="eqmobile">
$$
0\oplus a= a\oplus 0= a,
$$
</div>
for all $a\in G$. For each $a$ in $G$, there exists a unique inverse $\ominus a$ in $G$,
satisfying
<div class="eqdesktop">
$$
\ominus a \oplus a = a\ominus a = 0,\qquad\text{inverse},
$$
</div>
<div class="eqmobile">
$$
\ominus a \oplus a = a\ominus a = 0,
$$
</div>
where we use the notation $a\ominus b=a\oplus (\ominus b)$ for $a,b\in G$.
For any $a,b\in G$, the selfmap $\gyr[a,b]$ of $G$ is given by the equation
$$
\gyr[a,b]z =
\ominus(a\oplus b) \oplus (a \oplus (b\oplus z)),
$$
for all $z\in G$. Furthermore, the following conditions hold for all $a,b,c\in
G$:
<div class="eqdesktop">
$$
\begin{align*}
\gyr[a,b]&\in \Aut(G,\oplus),
&& \text{gyroautomorphism property},
\\
a\oplus (b\oplus c) &= (a \oplus b) \oplus \gyr[a,b]c,
&& \text{left gyroassociative law},
\\
(a\oplus b) \oplus c &= a \oplus (b \oplus \gyr[b,a]c),
&& \text{right gyroassociative law},
\\
\gyr[a,b] &= \gyr[a\oplus b, b],
&& \text{left loop property},
\\
\gyr[a,b] &= \gyr[a, b\oplus a],
&& \text{right loop property},
\\
\ominus(a\oplus b) &= \gyr[a,b](\ominus b\ominus a),
&& \text{gyrosum inversion law},
\\
\gyr^{1}[a,b] &= \gyr[b,a],
&& \text{gyroautomorphism inversion}.
\end{align*}
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
\gyr[a,b]&\in \Aut(G,\oplus),
\\
a\oplus (b\oplus c) &= (a \oplus b) \oplus \gyr[a,b]c,
\\
(a\oplus b) \oplus c &= a \oplus (b \oplus \gyr[b,a]c),
\\
\gyr[a,b] &= \gyr[a\oplus b, b],
\\
\gyr[a,b] &= \gyr[a, b\oplus a],
\\
\ominus(a\oplus b) &= \gyr[a,b](\ominus b\ominus a),
\\
\gyr^{1}[a,b] &= \gyr[b,a],
\end{align*}
$$
</div>
The operation $\gyr\colon G\times G\to\Aut(G,\oplus)$ is called the <em>gyrator</em> of $G$ and
the automorphism $\gyr[a,b]\colon G\to G$ is called the gyroautomorphism of $G$, generated by $a,
b\in G$.
</div>
<p>An important thing to note in the definition of the gyrogroup is that if one were
to leave out the gyrations, one would recover the axioms of a
group. Indeed, if one instantiates the gyrogroup with vectors and vector
addition, then the gyration becomes trivial and we end up with an algebra that
has the properties of a group. Thus, the gyrogroup can be seen as a generalization of
the group structure.</p>
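<p>As a small sanity check (an illustration added here, not part of the original post), one can instantiate the gyrator $\gyr[a,b]z=\ominus(a\oplus b)\oplus(a\oplus(b\oplus z))$ with plain vector addition and verify numerically that it reduces to the identity:</p>

```python
# Instantiate the gyrator gyr[a,b]z = -(a + b) + (a + (b + z)) with ordinary
# vector addition: associativity makes it collapse to the identity map.
def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def neg(u):
    return [-ui for ui in u]

def gyr(a, b, z):
    return add(neg(add(a, b)), add(a, add(b, z)))

a, b, z = [0.3, -1.2], [2.0, 0.5], [-0.7, 0.1]
# gyr[a,b]z equals z (up to floating-point error)
assert all(abs(gi - zi) < 1e-12 for gi, zi in zip(gyr(a, b, z), z))
```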
<p>Analogously to the commutativity of groups, the gyrocommutativity of gyrogroups is defined as follows:</p>
<div class="mathblock">
<span class="def"><strong>D. (Gyrocommutative Gyrogroup)</strong></span>
A gyrogroup is <em>gyrocommutative</em> if it satisfies
<div class="eqdesktop">
$$
a \oplus b
=
\gyr[a,b](b\oplus a),
\qquad
\text{gyrocommutative law.}
$$
</div>
<div class="eqmobile">
$$
a \oplus b
=
\gyr[a,b](b\oplus a).
$$
</div>
</div>
<p>Some gyrocommutative gyrogroups admit a scalar multiplication, turning them into
gyrovector spaces, just as commutative groups with a scalar multiplication give rise
to vector spaces.</p>
<!--
make sure that "subseteq" (subset including) and not real subset (excluding) carrier
is meant. The book uses the real subset notation. But I think they mean "subseteq" with it.
This concerns the properties 1. and 3.
-->
<div class="mathblock">
<span class="def"><strong>D. (Gyrovector Spaces)</strong></span> A <em>real inner
product gyrovector space</em> $(G,\oplus,\otimes)$ (gyrovector space, in short) is a
gyrocommutative gyrogroup $(G,\oplus)$ that obeys the following axioms and properties:
<ol>
<li>
$G$ is a subset of a real inner product vector space $\bbV$ called the carrier of
$G$, $G\subseteq \bbV$, from which it inherits its inner product, $\scprod{\argdot,\argdot}$,
and norm, $\norm{\argdot}$, which are invariant under gyroautomorphisms, that is,
<div class="eqdesktop">
$$
\scprod{\gyr[\vu,\vv]\va,\gyr[\vu,\vv]\vb}=\scprod{\va,\vb},
\qquad\text{conformality}.
$$
</div>
<div class="eqmobile">
$$
\scprod{\gyr[\vu,\vv]\va,\gyr[\vu,\vv]\vb}=\scprod{\va,\vb}.
$$
</div>
</li>
<li>
$G$ admits a scalar multiplication, $\otimes$, possessing the following properties. For all
real numbers $r,r_1,r_2\in\R$, natural numbers $n\in\N$ and all points $\va,\vu,\vv\in G$:
<div class="eqdesktop">
$$
\begin{align*}
1\otimes \va
&=
\va,
&& \text{multiplicative identity},
\\
n\otimes \va
&=
\va\oplus\cdots\oplus \va,
&& \text{gyroaddition }n\text{ times},
\\
(-r)\otimes\va
&=
r \otimes (\ominus \va),
&& \text{sign distributivity},
\\
(r_1+r_2)\otimes\va
&=
r_1\otimes \va\oplus r_2\otimes\va,
&&\text{scalar distributive Law},
\\
(r_1r_2)\otimes\va
&=
r_1\otimes(r_2\otimes\va),
&&\text{scalar associative law},
\\
r\otimes(r_1\otimes\va\oplus r_2\otimes\va)
&=
r\otimes(r_1\otimes\va)\oplus r\otimes(r_2\otimes\va),
&&\text{monodistributive law},
\\
\frac{\abs{r}\otimes\va}{\norm{r\otimes\va}}
&=
\frac{\va}{\norm{\va}},
&&\text{scaling property},
\\
\gyr[\vu,\vv](r\otimes\va)
&=
r\otimes\gyr[\vu,\vv]\va
&&\text{gyroautomorphism property},
\\
\gyr[r_1\otimes\va,r_2\otimes\va]
&=
\id
&&\text{identity automorphism},
\\
\end{align*}
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
1\otimes \va
&=
\va,
\\
n\otimes \va
&=
\va\oplus\cdots\oplus \va,
\\
(-r)\otimes\va
&=
r \otimes (\ominus \va),
\\
(r_1+r_2)\otimes\va
&=
r_1\otimes \va\oplus r_2\otimes\va,
\\
(r_1r_2)\otimes\va
&=
r_1\otimes(r_2\otimes\va),
\\
r\otimes(r_1\otimes\va\oplus r_2\otimes\va)
&=
r\otimes(r_1\otimes\va)\oplus r\otimes(r_2\otimes\va),
\\
\frac{\abs{r}\otimes\va}{\norm{r\otimes\va}}
&=
\frac{\va}{\norm{\va}},
\\
\gyr[\vu,\vv](r\otimes\va)
&=
r\otimes\gyr[\vu,\vv]\va
\\
\gyr[r_1\otimes\va,r_2\otimes\va]
&=
\id
\end{align*}
$$
</div>
where we use the notation $r\otimes\va=\va\otimes r$.
</li>
<li>
The algebra $(\norm{G},\oplus,\otimes)$ for the set $\norm{G}$ of one-dimensional "vectors",
$$
\norm{G}=\dset{\norm{\va}}{\va\in G}\subseteq\R,
$$
has a real vector space structure with vector addition $\oplus$ and scalar
multiplication $\otimes$, such that for all $r\in\R$ and $\va,\vb\in G$,
<div class="eqdesktop">
$$
\begin{align*}
\norm{r\otimes\va}
&=
\abs{r}\otimes\norm{\va},
&&\text{homogeneity property},
\\
\norm{\va\oplus\vb}
&\leq
\norm{\va}\oplus\norm{\vb},
&&\text{gyrotriangle property},
\end{align*}
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
\norm{r\otimes\va}
&=
\abs{r}\otimes\norm{\va},
\\
\norm{\va\oplus\vb}
&\leq
\norm{\va}\oplus\norm{\vb},
\end{align*}
$$
</div>
connecting the addition and scalar multiplication of $G$ and $\norm{G}$.
</li>
</ol>
</div>
<p>From the definition above, one can easily verify that $(-1)\otimes\va=\ominus\va$ and
$\norm{\ominus\va}=\norm{\va}$. One should also note that the ambiguous use of $\oplus$ and
$\otimes$ as interrelated operations in the gyrovector space $(G,\oplus,\otimes)$ and its
associated one-dimensional “vector” space $(\norm{G},\oplus,\otimes)$ should not raise any
confusion, since the sets on which these operations operate are always clear from the context.
E.g., in vector spaces we also use the same notation, $+$, for the addition operation between
vectors and between their magnitudes, and the same notation for the scalar multiplication between
two scalars and between a scalar and a vector.</p>
<p>However, it’s important to note that the operations in the gyrovector space
$(G,\oplus,\otimes)$ are nonassociative and nondistributive gyrovector space operations, whereas the
operations in $(\norm{G},\oplus,\otimes)$ are associative and distributive vector space
operations. Additionally, the gyroaddition $\oplus$ is gyrocommutative in the former, and
commutative in the latter. Also note that in the vector space $(\norm{G},\oplus,\otimes)$ the
gyrations are trivial.</p>
<p>Next, we’ll look at how we can use the algebraic structure of a gyrovector space to
implement one universal model, able to capture hyperbolic, spherical and Euclidean
geometries.</p>
<h2 id="thekappastereographicmodelforhyperbolicsphericalandeuclideangeometry">The $\kappa$-Stereographic Model for Hyperbolic, Spherical and Euclidean Geometry</h2>
<p>In their paper about “Hyperbolic neural networks” <a class="citation" href="#hnn">[7]</a>,
Ganea et al. already showed how one can harness the gyrovector space formalism
presented by Ungar in <a class="citation" href="#ungar2005analytic">[5]</a> to
implement all the necessary tools for deep neural networks that operate on
hyperbolic representations. Later, throughout the development of our theses related
to curvature learning, we discovered and verified that one can use the same gyrovector
space formalism also with <em>positive</em> sectional
curvature in order to implement all of the necessary operations for the model of the
stereographic projection of the sphere.</p>
<p>For some $n$-dimensional manifold with constant sectional curvature
$\kappa\in\R$, we can instantiate a corresponding gyrovector space algebra
$(\cM_\kappa^n,\oplus_\kappa,\otimes_\kappa)$, to intuitively express
and compute important operations on the manifold in closed form. The concrete definitions
of the carrier set and operations for the gyrovector space algebra
$(\cM_\kappa^n,\oplus_\kappa,\otimes_\kappa)$ are given in what follows.</p>
<div class="mathblock">
<span class="def"><strong>D. (Carrier Set)</strong></span> The <em>carrier set</em>
$\cM_\kappa^n$ of an $n$-dimensional gyrovector space, corresponding to a manifold of constant
sectional curvature $\kappa$, is defined as:
$$
\cM_\kappa^n
=
\dset{\vx\in\R^n}{-\kappa\norm{\vx}_2^2<1}.
$$
</div>
<p>One may easily verify that the carrier set $\cM_\kappa^n$ simplifies to the entire $\R^n$
for spherical and Euclidean geometries and to the open ball of radius
$R=1/\sqrt{-\kappa}$ for hyperbolic geometries. Thus, another way to express
$\cM_\kappa^n$ is:</p>
<div class="eqdesktop">
$$
\cM_\kappa^n
=
\begin{cases}
\dset{\vx\in\R^n}{\norm{\vx}_2<R},
& \text{for }\kappa<0,
& \text{hyperbolic geometry},
\\
\R^n,
& \text{for }\kappa=0,
& \text{Euclidean geometry},
\\
\R^n,
& \text{for }\kappa>0,
& \text{spherical geometry}.
\end{cases}
$$
</div>
<div class="eqmobile">
$$
\cM_\kappa^n
=
\begin{cases}
\dset{\vx}{\norm{\vx}_2<R},
& \kappa<0,
\text{ (hyp.)},
\\
\R^n,
& \kappa=0, \text{ (Eucl.)},
\\
\R^n,
& \kappa>0,
\text{ (sph.)}.
\end{cases}
$$
</div>
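<p>The case distinction above is easy to check in code. The following minimal sketch (an illustration, not from the original post) tests membership in the carrier set $\cM_\kappa^n$ for the three geometries:</p>

```python
import math

def in_carrier(x, kappa):
    # x ∈ M_κ^n  iff  -κ‖x‖² < 1
    sq = sum(xi * xi for xi in x)
    return -kappa * sq < 1.0

# κ < 0: open ball of radius R = 1/sqrt(-κ)
kappa = -4.0
R = 1.0 / math.sqrt(-kappa)                # R = 0.5
assert in_carrier([0.49, 0.0], kappa)      # ‖x‖ < R  -> inside
assert not in_carrier([0.51, 0.0], kappa)  # ‖x‖ > R  -> outside

# κ = 0 and κ > 0: the whole of R^n
assert in_carrier([100.0, -100.0], 0.0)
assert in_carrier([100.0, -100.0], 4.0)
```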
<p>For the addition in our gyrovector space algebra we use the plain-vanilla Möbius
addition:</p>
<div class="mathblock">
<span class="def"><strong>D. (Möbius Addition)</strong></span> The <em>Möbius addition</em>
of $\vx,\vy\in\cM_\kappa^n$ is defined as
<div class="eqdesktop">
$$
\vx\oplus_\kappa\vy
=
\frac{
\left(1-2\kappa\scprod{\vx,\vy}-\kappa\norm{\vy}_2^2\right)\vx
+\left(1+\kappa\norm{\vx}_2^2\right)\vy
}{
1 - 2\kappa\scprod{\vx,\vy}+\kappa^2\norm{\vx}_2^2\norm{\vy}_2^2
}.
$$
</div>
<div class="eqmobile" style="zoom:72%">
$$
\vx\oplus_\kappa\vy
=
\frac{
\left(1-2\kappa\scprod{\vx,\vy}-\kappa\norm{\vy}_2^2\right)\vx
+\left(1+\kappa\norm{\vx}_2^2\right)\vy
}{
1 - 2\kappa\scprod{\vx,\vy}+\kappa^2\norm{\vx}_2^2\norm{\vy}_2^2
}.
$$
</div>
</div>
<p>An important property of the Möbius addition is that it recovers the Euclidean
vector space addition when $\kappa\to 0$:</p>
<script type="math/tex; mode=display">\lim_{\kappa\to 0}\vx \oplus_\kappa \vy = \vx + \vy.</script>
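<p>A minimal numerical sketch of the Möbius addition and its vector-addition limit (an illustration added here, not the geoopt implementation):</p>

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mobius_add(x, y, kappa):
    # x ⊕_κ y, exactly as in the definition above
    x2, y2, xy = dot(x, x), dot(y, y), dot(x, y)
    num_x = 1.0 - 2.0 * kappa * xy - kappa * y2
    num_y = 1.0 + kappa * x2
    den = 1.0 - 2.0 * kappa * xy + kappa**2 * x2 * y2
    return [(num_x * xi + num_y * yi) / den for xi, yi in zip(x, y)]

x, y = [0.1, -0.2], [0.3, 0.4]
# As κ → 0 the Möbius addition approaches ordinary vector addition
for kappa in (-1e-3, -1e-6, 1e-6):
    out = mobius_add(x, y, kappa)
    assert all(abs(oi - (xi + yi)) < 1e-2 for oi, xi, yi in zip(out, x, y))
```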
<p>Having defined the Möbius addition, the Möbius subtraction is then simply
defined as:</p>
<div class="mathblock">
<span class="def"><strong>D. (Möbius Subtraction)</strong></span> The <em>Möbius subtraction</em>
of $\vx,\vy\in\cM_\kappa^n$ is defined as
$$
\vx\ominus_\kappa\vy = \vx\oplus_\kappa(\ominus\vy).
$$
</div>
<p>For $\kappa\leq 0$ it has been shown that $(\cM_\kappa^n,\oplus_\kappa)$ forms a
<em>gyrocommutative gyrogroup</em>, where additive inverses are simply given as $\ominus \vx=-\vx$
<a class="citation" href="#ungar2001hyperbolic">[8]</a>. Furthermore, one can easily verify
that for $\kappa=0$ the gyration becomes trivial, making the algebra $(\cM_0^n,\oplus_0)$
simply a <em>commutative group</em>. However, it turns out that for $\kappa>0$ there are exceptions
where the Möbius addition is indefinite:</p>
<div class="mathblock">
<span class="thm"><strong>T. (Definiteness of Möbius Addition)</strong></span> The Möbius addition
is indefinite, meaning that the denominator
$1-2\kappa\vx^\T\vy+\kappa^2\norm{\vx}_2^2\norm{\vy}_2^2=0$,
if and only if $\kappa>0$ and $\vx=\vy/(\kappa\norm{\vy}_2^2)\neq 0$.
</div>
<p>The theorem can be proven using the CauchySchwarz inequality. Now, what this means
is that for positive curvature $\kappa>0$, we have the situation that for every point
$\vx$, $\vx\neq \vzero$, there exists <em>exactly one</em> other collinear point</p>
<script type="math/tex; mode=display">\vy=\frac{1}{\kappa}\frac{\vx}{\norm{\vx}_2^2},</script>
<p>for which the Möbius addition is indefinite. Therefore, strictly speaking, the gyrogroup
structure is broken for $\kappa>0$ due to the indefiniteness of the Möbius addition in these
cases. Hence, for $\kappa>0$ we only get a <em>pseudo</em> gyrogroup, which behaves like a
gyrogroup, as long as the Möbius addition is definite.</p>
<p>However, since for every point $\vx$, $\vx\neq \vzero$, there is <em>only</em> exactly <em>one</em>
other point <em>out of the many other possible points</em> in $\cM_\kappa^n$ for which the Möbius
addition is indefinite, this is not too much of an issue for practical
applications. One may circumvent these rare cases of an indefinite Möbius addition by
min-clamping the denominator to a small numerical value $\epsilon>0$. This way,
the Möbius addition can be extended with desirable approximation behaviour in these indefinite
cases.</p>
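<p>The indefinite case and the suggested min-clamping can be illustrated numerically; the sketch below (illustrative only) uses the Möbius addition denominator as defined above:</p>

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mobius_denominator(x, y, kappa):
    # denominator of x ⊕_κ y: 1 - 2κ⟨x,y⟩ + κ²‖x‖²‖y‖²
    return 1.0 - 2.0 * kappa * dot(x, y) + kappa**2 * dot(x, x) * dot(y, y)

kappa = 1.0                # positive curvature
y = [0.5, 0.5]
# the unique problematic point x = y / (κ‖y‖²)
x = [yi / (kappa * dot(y, y)) for yi in y]
assert abs(mobius_denominator(x, y, kappa)) < 1e-12

# min-clamping keeps the denominator away from zero
eps = 1e-15
den = max(mobius_denominator(x, y, kappa), eps)
assert den >= eps
```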
<p>Before defining the scalar multiplication we first want to introduce the
following trigonometric functions that are parametrized by the sectional
curvature $\kappa$. These trigonometric functions will help us to unify and
simplify the notations for spherical and hyperbolic geometry within the elegant
formalism of gyrovector spaces.</p>
<!-- TODO: add definition ranges of these functions at some point -->
<div class="mathblock">
<span class="def"><strong>D. (Curvature-Dependent Trigonometric Functions)</strong></span>
<div class="eqdesktop">
$$
\tan_\kappa(x)
=
\begin{cases}
\frac{1}{\sqrt{-\kappa}}\tanh(\sqrt{-\kappa}x), & \kappa<0,\\
x, & \kappa=0,\\
\frac{1}{\sqrt{\kappa}}\tan(\sqrt{\kappa}x), & \kappa>0.
\end{cases}
$$
$$
\arctan_\kappa(x)
=
\begin{cases}
\frac{1}{\sqrt{-\kappa}}\arctanh(\sqrt{-\kappa}x), & \kappa<0,\\
x, & \kappa=0,\\
\frac{1}{\sqrt{\kappa}}\arctan(\sqrt{\kappa}x), & \kappa>0.\\
\end{cases}
$$
$$
\arcsin_\kappa(x)
=
\begin{cases}
\frac{1}{\sqrt{-\kappa}}\arcsinh(\sqrt{-\kappa}x), & \kappa<0,\\
x, & \kappa=0,\\
\frac{1}{\sqrt{\kappa}}\arcsin(\sqrt{\kappa}x), & \kappa>0.\\
\end{cases}
$$
</div>
<div class="eqmobile">
$$
\begin{align*}
&\tan_\kappa(x)
=
\\
&\qquad
\begin{cases}
\frac{1}{\sqrt{-\kappa}}\tanh(\sqrt{-\kappa}x), & \kappa<0,\\
x, & \kappa=0,\\
\frac{1}{\sqrt{\kappa}}\tan(\sqrt{\kappa}x), & \kappa>0.
\end{cases}
\end{align*}
$$
$$
\begin{align*}
&\arctan_\kappa(x)
=
\\
&\qquad\begin{cases}
\frac{1}{\sqrt{-\kappa}}\arctanh(\sqrt{-\kappa}x), & \kappa<0,\\
x, & \kappa=0,\\
\frac{1}{\sqrt{\kappa}}\arctan(\sqrt{\kappa}x), & \kappa>0.\\
\end{cases}
\end{align*}
$$
$$
\begin{align*}
&\arcsin_\kappa(x)
=
\\
&\qquad\begin{cases}
\frac{1}{\sqrt{-\kappa}}\arcsinh(\sqrt{-\kappa}x), & \kappa<0,\\
x, & \kappa=0,\\
\frac{1}{\sqrt{\kappa}}\arcsin(\sqrt{\kappa}x), & \kappa>0.\\
\end{cases}
\end{align*}
$$
</div>
</div>
<p>If one ignores the definitions of the above functions for the case
$\kappa=0$, one may verify that for any
$f_\kappa\in\set{\tan_\kappa,\arctan_\kappa,\arcsin_\kappa}$
the identity map is approached as $\kappa\to 0$:</p>
<script type="math/tex; mode=display">\lim_{\kappa\to 0} f_\kappa(x)=\id(x)=x.</script>
<p>This is exactly the motivation behind the definitions for the case
$\kappa=0$. Also, these limits will become useful later when we show how
the $\kappa$-stereographic model approaches Euclidean geometry as $\kappa\to 0$.</p>
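<p>These curvature-dependent functions are straightforward to implement; the sketch below (illustrative only, not from the original post) also checks the inverse relation $\arctan_\kappa(\tan_\kappa(x))=x$ and the limit behaviour:</p>

```python
import math

def tan_k(x, kappa):
    # tan_κ(x): tanh-based for κ<0, identity for κ=0, tan-based for κ>0
    if kappa < 0:
        s = math.sqrt(-kappa)
        return math.tanh(s * x) / s
    if kappa == 0:
        return x
    s = math.sqrt(kappa)
    return math.tan(s * x) / s

def arctan_k(x, kappa):
    # arctan_κ(x), the inverse of tan_κ
    if kappa < 0:
        s = math.sqrt(-kappa)
        return math.atanh(s * x) / s
    if kappa == 0:
        return x
    s = math.sqrt(kappa)
    return math.atan(s * x) / s

# arctan_κ inverts tan_κ for all signs of the curvature
for kappa in (-1.0, -1e-6, 0.0, 1e-6, 1.0):
    assert abs(arctan_k(tan_k(0.3, kappa), kappa) - 0.3) < 1e-9
# both approach the identity as κ → 0
for kappa in (-1e-8, 1e-8):
    assert abs(tan_k(0.3, kappa) - 0.3) < 1e-6
```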
<!-- TODO: put an illustrative example here at some point -->
<p>Having defined these curvaturedependent trigonometric functions we can now
use them to define an augmented version of the Möbius scalar multiplication as
follows:</p>
<div class="mathblock">
<span class="def"><strong>D. (Augmented Möbius Scalar Multiplication)</strong></span> The
<em>augmented Möbius scalar multiplication</em> of $\vx\in\cM_\kappa^n\setminus\set{\vo}$
by $\alpha\in\R$ is defined as
$$
\alpha\otimes_\kappa \vx
=
\tan_\kappa
\left(
\alpha
\arctan_\kappa
\left(
\norm{\vx}_2
\right)\right)
\frac{\vx}{\norm{\vx}_2}.
$$
</div>
<p>Hence, the augmented Möbius scalar multiplication only distinguishes itself
from the conventional Möbius scalar multiplication through the usage of our parametrized
trigonometric functions, which also support non-negative curvature $\kappa\geq0$.
However, there’s one subtlety for $\kappa > 0$ that has to be considered:</p>
<div class="mathblock">
<span class="thm"><strong>T. (Definiteness of Augmented Möbius Scalar Multiplication)</strong></span>
The augmented Möbius scalar multiplication is indefinite, if and only if
$\kappa>0$ and $\sqrt{\kappa}\,\alpha\arctan_{\kappa}(\norm{\vx}_2)=\frac{\pi}{2}+k\pi$
for some $k\in\Z$.
</div>
<p>However, this indefiniteness is not too tragic. One may redefine the augmented
scalar multiplication for an indefinite case $\alpha\otimes_{\kappa}\vx$ recursively as</p>
<script type="math/tex; mode=display">\alpha\otimes_{\kappa}\vx
:=
\frac{\alpha}{2}\otimes_{\kappa}
\left(\frac{\alpha}{2}\otimes_{\kappa}\vx\right),</script>
<p>giving an augmented gyrovector scalar multiplication that is definite for
all real scalars.</p>
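<p>As an illustrative check of the property $n\otimes_\kappa\va=\va\oplus_\kappa\cdots\oplus_\kappa\va$, the following sketch (an illustration added here, not the geoopt implementation) compares $2\otimes_\kappa\vx$ against $\vx\oplus_\kappa\vx$ for a negative and a positive curvature:</p>

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def tan_k(x, k):
    if k < 0:
        s = math.sqrt(-k)
        return math.tanh(s * x) / s
    if k > 0:
        s = math.sqrt(k)
        return math.tan(s * x) / s
    return x

def arctan_k(x, k):
    if k < 0:
        s = math.sqrt(-k)
        return math.atanh(s * x) / s
    if k > 0:
        s = math.sqrt(k)
        return math.atan(s * x) / s
    return x

def mobius_add(x, y, k):
    x2, y2, xy = dot(x, x), dot(y, y), dot(x, y)
    den = 1 - 2 * k * xy + k**2 * x2 * y2
    return [((1 - 2 * k * xy - k * y2) * xi + (1 + k * x2) * yi) / den
            for xi, yi in zip(x, y)]

def mobius_scalar_mul(alpha, x, k):
    # α ⊗_κ x = tan_κ(α · arctan_κ(‖x‖)) · x/‖x‖ (defined for x ≠ 0)
    n = math.sqrt(dot(x, x))
    t = tan_k(alpha * arctan_k(n, k), k)
    return [t * xi / n for xi in x]

# "gyroaddition n times": 2 ⊗_κ x must equal x ⊕_κ x
for k in (-1.0, 0.7):
    x = [0.2, -0.1]
    lhs = mobius_scalar_mul(2.0, x, k)
    rhs = mobius_add(x, x, k)
    assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```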
<p>Having defined the carrier set, the gyrovector space addition and the scalar
multiplication we now have everything for our (pseudo) gyrovector space algebra
$(\cM_\kappa^n,\oplus_\kappa,\otimes_\kappa)$. Now, let’s have a look at how we
can use this (pseudo) gyrovector space algebra to express all the necessary
operations for the Poincaré ball ($\kappa<0$) and the stereographic projection
of the sphere ($\kappa>0$):</p>
<table style="background-color:#f5f5f5;" class="formulatable">
<tr>
<th style="textalign:left;">Description</th>
<th style="textalign:left;">ClosedForm Formula on $\cM_\kappa^n$, $\kappa\neq 0$</th>
</tr>
<tr>
<td style="textalign:left;">
Radius
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$R:=\frac{1}{\sqrt{\abs{\kappa}}}\in(0,\infty)$
</td>
</tr>
<tr>
<td style="textalign:left;">
Tangent Space
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$\cT_{\vx}\cM_\kappa^n=\bbR^n$
</td>
</tr>
<tr>
<td style="textalign:left;">
Conformal Factor
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\lambda_{\vx}^\kappa
=
\frac{2}{1+\kappa\norm{\vx}_2^2}
\in
(0,\infty)
$
</td>
</tr>
<!--
<tr>
<td style="text-align:left;">
Lorentz Factor
</td>
<td style="text-align:left;vertical-align:middle; padding-left:0px; padding-right:10px;">
$
\gamma_{\vx}^\kappa
=
\frac{1}{\sqrt{1+\kappa\norm{\vx}_2^2}}
\in
(0,\infty)
$
</td>
</tr>
-->
<tr>
<td style="textalign:left;">
Metric Tensor
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vg^{\kappa}_{\vx}
=
\left(\lambda_{\vx}^\kappa\right)^2\MI
\in
\R^{n\times n}
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Tangent Space Inner Product<br />
$\scprod{\argdot,\argdot}_{\vx}^\kappa\colon\cT_{\vx}\cM_\kappa^n\times\cT_{\vx}\cM_\kappa^n\to\R$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\scprod{\vu,\vv}_{\vx}^\kappa
=
\vu^T\vg_{\vx}^\kappa\vv
=
(\lambda_{\vx}^\kappa)^2\scprod{\vu,\vv}$
</td>
</tr>
<tr>
<td style="textalign:left;">
Tangent Space Norm
<br />
$\norm{\argdot}_{\vx}^\kappa\colon\cT_{\vx}\cM_\kappa^n\to\R^+_0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$\norm{\vu}_{\vx}^\kappa=\lambda_{\vx}^\kappa\norm{\vu}_2$
</td>
</tr>
<!--
TODO: find out what this norm is used for
<tr>
<td style="text-align:left;">
Manifold Norm
<br/>
$\norm{\argdot}_{M}^\kappa\colon\cM_\kappa^n\to\R^+_0$
</td>
<td style="text-align:left;vertical-align:middle; padding-left:0px; padding-right:10px;">
$
\norm{\vx}_{M}^\kappa
=
(\gamma_{\vx}^\kappa)^2\norm{\vx}_2
$
</td>
</tr>
-->
<tr>
<td style="textalign:left;">
Angle $\theta_{\vx}(\vu,\vv)$ between two Tangent Vectors
$\vu,\vv\in\cT_{\vx}\cM_\kappa^n\setminus\set{\vo}$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\theta_{\vx}(\vu,\vv)
=\arccos\left(
\frac{\scprod{\vu,\vv}}{\norm{\vu}_2\norm{\vv}_2}
\right)
\quad\text{(conformal)}
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Distance
<br />
$d_{\kappa}\colon \cM_\kappa^n\times \cM_\kappa^n\to\R^+_0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
d_{\kappa}(\vx,\vy)
=
2\arctan_\kappa\left(\norm{(-\vx)\oplus_\kappa\vy}_2\right)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Distance of $\vx$ to Hyperplane $H_{\vp,\vw}$ Described by Point $\vp$ and Normal Vector $\vw$
<br />
$d_{\kappa}^{H_{\vp,\vw}}\colon \cM_\kappa^n\to\R^+_0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
d_\kappa^{H_{\vp,\vw}}(\vx)
=
\arcsin_\kappa\left(
\frac{
2 \abs{\scprod{(-\vp)\oplus_\kappa \vx, \vw}}
}{
\left(1+\kappa\norm{(-\vp)\oplus_\kappa \vx}_2^2\right)\norm{\vw}_2
}
\right)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Exponential Map
<br />
$\exp_{\vx}^{\kappa}\colon\cT_{\vx}\cM_\kappa^n\to\cM_\kappa^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\exp_{\vx}^{\kappa}(\vu)
=
\vx\oplus_\kappa
\left(
\tan_\kappa\left(
\frac{1}{2}
\norm{\vu}_{\vx}^\kappa
\right)
\frac{\vu}{\norm{\vu}_2}
\right)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Log Map
<br />
$\log_{\vx}^\kappa\colon\cM_\kappa^n\to\cT_{\vx}\cM_\kappa^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
<div class="eqdesktop">
$
\log_{\vx}^\kappa(\vy)=
2\arctan_\kappa\left(\norm{(-\vx)\oplus_\kappa\vy}_2\right)
\frac{(-\vx)\oplus_\kappa \vy}{\norm{(-\vx)\oplus_\kappa \vy}_{\vx}^\kappa}
$
</div>
<div class="eqmobile">
$
\begin{align*}
&\log_{\vx}^\kappa(\vy)=
\\
&\quad 2\arctan_\kappa\left(\norm{(-\vx)\oplus_\kappa\vy}_2\right)
\frac{(-\vx)\oplus_\kappa \vy}{\norm{(-\vx)\oplus_\kappa \vy}_{\vx}^\kappa}
\end{align*}
$
</div>
</td>
</tr>
<tr>
<td style="textalign:left;">
Geodesic from $\vx\in\cM_\kappa^n \text{ to } \vy\in\cM_\kappa^n$
<br />
$\vgamma_{\vx\to\vy}^\kappa\colon [0,1]\to\cM_\kappa^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vgamma_{\vx\to\vy}^{\kappa}(t)=\vx\oplus_\kappa\left(t\otimes_\kappa\left((-\vx)
\oplus_\kappa\vy\right)
\right)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
UnitSpeed Geodesic at Time $t\in\R$
Starting from $\vx\in\cM_\kappa^n$ in Direction of $\vu\in\cT_{\vx}\cM_\kappa^n$
<br />
$\vgamma_{\vx,\vu}^{\kappa}\colon \R\to\cM_\kappa^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vgamma_{\vx,\vu}^{\kappa}(t)
=
\vx \oplus_\kappa
\left(
\tan_\kappa\left(
\frac{1}{2}t
\right)
\frac{\vu}{\norm{\vu}_2}
\right)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Antipode $\vx^a$ of $\vx$ for $\kappa>0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vx^a
=
\frac{1}{\lambda_{\vx}^{\kappa}\kappa\norm{\vx}_2^2}(-\vx)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Weighted Midpoint
<br />
$\vm_{\kappa}\colon (\cM_\kappa^d)^n\times\R^n\to\cM_\kappa^d$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
<div class="eqdesktop">
$
\vm_\kappa(\vx_{1:n},\alpha_{1:n})
=
\frac{1}{2}
\otimes_\kappa
\left(
\sum_{i=1}^n
\frac{
\alpha_i\lambda_{\vx_i}^\kappa
}{
\sum_{j=1}^n\alpha_j(\lambda_{\vx_j}^\kappa - 1)
}
\vx_i
\right)
$
</div>
<div class="eqmobile">
$
\begin{align*}
&\vm_\kappa(\vx_{1:n},\alpha_{1:n})
=\\
&\quad
\frac{1}{2}
\otimes_\kappa
\left(
\sum_{i=1}^n
\frac{
\alpha_i\lambda_{\vx_i}^\kappa
}{
\sum_{j=1}^n\alpha_j(\lambda_{\vx_j}^\kappa - 1)
}
\vx_i
\right)
\end{align*}
$
</div>
For $\kappa>0$, this also requires determining which of
$\vm_{\kappa}$ and its antipode $\vm_{\kappa}^a$ minimizes the sum of distances.
<br />
</td>
</tr>
</table>
<p>The formulas in the table above clearly illustrate how dual the Poincaré ball
and the stereographic projection of the sphere are. In hindsight, this duality is
not so surprising, since the two models are the stereographic projections of the
hyperboloid and the sphere, respectively, which are known to be dual. Next, let’s
see how the dual models connect even more closely as $\kappa\to 0$.</p>
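<p>As a sanity check on the formulas above, here is a minimal NumPy sketch of the
core $\kappa$-stereographic operations: Möbius addition, the curvature-dependent
tangent, and the induced distance. The helper names below are hypothetical and not
part of any library; this is an illustration of the table, not the geoopt implementation:</p>

```python
import numpy as np

def tan_k(u, k):
    # curvature-dependent tangent: tan for k>0, tanh for k<0, identity at k=0
    if k > 0:
        return np.tan(np.sqrt(k) * u) / np.sqrt(k)
    if k == 0:
        return u
    return np.tanh(np.sqrt(-k) * u) / np.sqrt(-k)

def arctan_k(u, k):
    # inverse of tan_k
    if k > 0:
        return np.arctan(np.sqrt(k) * u) / np.sqrt(k)
    if k == 0:
        return u
    return np.arctanh(np.sqrt(-k) * u) / np.sqrt(-k)

def mobius_add(x, y, k):
    # Moebius addition in the kappa-stereographic model
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 - 2.0 * k * xy - k * y2) * x + (1.0 + k * x2) * y
    return num / (1.0 - 2.0 * k * xy + k ** 2 * x2 * y2)

def dist_k(x, y, k):
    # geodesic distance d_k(x, y) = 2 arctan_k(||(-x) (+)_k y||_2)
    return 2.0 * arctan_k(np.linalg.norm(mobius_add(-x, y, k)), k)

# on the Poincare ball (k = -1), the distance from the origin is 2 artanh(||y||_2)
y = np.array([0.5, 0.0])
print(dist_k(np.zeros(2), y, -1.0), 2.0 * np.arctanh(0.5))  # the two values agree

# the gyrodistance is symmetric in its arguments for either sign of the curvature
x = np.array([0.1, -0.2])
z = np.array([0.3, 0.25])
for k in (-1.0, 1.0):
    print(abs(dist_k(x, z, k) - dist_k(z, x, k)))  # ~0
```

<p>For $\kappa=-1$ this reproduces the familiar Poincaré-ball distance, and the
same code covers the spherical case simply by flipping the sign of $\kappa$.</p>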
<h2 id="recoveryofeuclideangeometryaskappato0">Recovery of Euclidean Geometry as $\kappa\to 0$</h2>
<p>The following formulas show how, in the limit $\kappa\to 0$, a Euclidean geometry
with a Cartesian coordinate system at intervals of $2$ units is recovered:</p>
<table style="backgroundcolor:#f5f5f5;" class="formulatable">
<tr>
<th style="textalign:left;">Description</th>
<th style="textalign:left;">ClosedForm Formula on $\cM_\kappa^n$ for $\kappa\to 0$</th>
</tr>
<tr>
<td style="textalign:left;">
Radius
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$R\to\infty$
</td>
</tr>
<tr>
<td style="textalign:left;">
Tangent Space
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$\cT_{\vx}\cM_0^n=\R^n$
</td>
</tr>
<tr>
<td style="textalign:left;">
Conformal Factor
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\lambda_{\vx}^0
=
2
$
</td>
</tr>
<!--
<tr>
<td style="textalign:left;">
Lorentz Factor
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\gamma_{\vx}^0
=
1
$
</td>
</tr>
-->
<tr>
<td style="textalign:left;">
Metric Tensor
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vg^{0}_{\vx}
=
(\lambda_{\vx}^0)^2\MI
=
4\MI
\in
\R^{n\times n}
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Tangent Space Inner Product<br />
$\scprod{\argdot,\argdot}_{\vx}^0\colon\cT_{\vx}\cM_0^n\times\cT_{\vx}\cM_0^n\to\R$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\scprod{\vu,\vv}_{\vx}^0
=
\vu^T\vg_{\vx}^0\vv
=
(\lambda_{\vx}^0)^2\scprod{\vu,\vv}
=
4\scprod{\vu,\vv}
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Tangent Space Norm
<br />
$\norm{\argdot}_{\vx}^0\colon\cT_{\vx}\cM_0^n\to\R^+_0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$\norm{\vu}_{\vx}^0=\lambda_{\vx}^0\norm{\vu}_2=2\norm{\vu}_2$
</td>
</tr>
<!-- TODO: find out what this norm is used for
<tr>
<td style="textalign:left;">
Manifold Norm
<br/>
$\norm{\argdot}_{M}^0\colon\cM_0^n\to\R^+_0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\norm{\vx}_{M}^0
=
(\gamma_{\vx}^0)^2\norm{\vx}_2
=
\norm{\vx}_2
$
</td>
</tr>
-->
<tr>
<td style="textalign:left;">
Angle $\theta_{\vx}(\vu,\vv)$ between two Tangent Vectors
$\vu,\vv\in\cT_{\vx}\cM_0^n\setminus\set{\vo}$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\theta_{\vx}(\vu,\vv)
=\arccos\left(
\frac{\scprod{\vu,\vv}}{\norm{\vu}_2\norm{\vv}_2}
\right)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Distance
<br />
$d_{0}\colon \cM_0^n\times \cM_0^n\to\R^+_0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
d_{0}(\vx,\vy)
=
2\norm{\vx-\vy}_2
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Distance of $\vx$ to Hyperplane $H_{\vp,\vw}$ Described by Point $\vp$ and Normal Vector $\vw$
<br />
$d_{0}^{H_{\vp,\vw}}\colon \cM_0^n\to\R^+_0$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
d_0^{H_{\vp,\vw}}(\vx)
=
2 \abs{\scprod{\vx-\vp, \frac{\vw}{\norm{\vw}_2} }}
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Exponential Map
<br />
$\exp_{\vx}^{0}\colon\cT_{\vx}\cM_0^n\to\cM_0^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\exp_{\vx}^{0}(\vu)
=
\vx+\vu
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Log Map
<br />
$\log_{\vx}^0\colon\cM_0^n\to\cT_{\vx}\cM_0^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\log_{\vx}^0(\vy)=
\vy-\vx
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Geodesic from $\vx\in\cM_0^n \text{ to } \vy\in\cM_0^n$
<br />
$\vgamma_{\vx\to\vy}^0\colon [0,1]\to\cM_0^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vgamma_{\vx\to\vy}^{0}(t)
=
\vx+t(\vy-\vx)
$
</td>
</tr>
<tr>
<td style="textalign:left;">
UnitSpeed Geodesic at Time $t\in\R$
Starting from $\vx\in\cM_0^n$ in Direction of $\vu\in\cT_{\vx}\cM_0^n$
<br />
$\vgamma_{\vx,\vu}^{0}\colon \R\to\cM_0^n$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vgamma_{\vx,\vu}^{0}(t)
=
\vx +
\frac{1}{2}t
\frac{\vu}{\norm{\vu}_2}
$
</td>
</tr>
<tr>
<td style="textalign:left;">
Weighted Midpoint
<br />
$\vm_{0}\colon (\cM_0^d)^n\times\R^n\to\cM_0^d$
</td>
<td style="textalign:left;verticalalign:middle; paddingleft:0px; paddingright:10px;">
$
\vm_0(\vx_{1:n},\alpha_{1:n})
=
\sum_{i=1}^n
\frac{
\alpha_i
}{
\sum_{j=1}^n\alpha_j
}
\vx_i
$
</td>
</tr>
</table>
<p>The formulas for the Euclidean geometry contain a "2" or a "4" here and there,
which comes from the fact that the points $\vx,\vy$ are expressed in a
Cartesian coordinate system with intervals of size 2. This coordinate system
emerges in the limit $\kappa\to 0$ from the Poincaré ball as well as from the
stereographic projection of the sphere, both of which can be seen as coordinate
systems for the corresponding hyperbolic and spherical geometries, respectively.</p>
<p>A few things to note in the formulas above: the exponential and log maps reduce
to plain vector addition and subtraction, the distance is simply upscaled by a
factor of 2 due to the choice of coordinates, and the weighted midpoint becomes
the familiar weighted average.</p>
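<p>The recovery of Euclidean geometry can also be checked numerically: evaluating
the $\kappa$-stereographic distance for curvatures approaching $0$ should reproduce
$2\norm{\vx-\vy}_2$. The following sketch uses hand-rolled NumPy helpers
(hypothetical names, not the geoopt API):</p>

```python
import numpy as np

def arctan_k(u, k):
    # inverse curvature-dependent tangent (becomes the identity as k -> 0)
    if k > 0:
        return np.arctan(np.sqrt(k) * u) / np.sqrt(k)
    return np.arctanh(np.sqrt(-k) * u) / np.sqrt(-k)

def mobius_add(x, y, k):
    # Moebius addition in the kappa-stereographic model
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 - 2.0 * k * xy - k * y2) * x + (1.0 + k * x2) * y
    return num / (1.0 - 2.0 * k * xy + k ** 2 * x2 * y2)

def dist_k(x, y, k):
    # d_k(x, y) = 2 arctan_k(||(-x) (+)_k y||_2)
    return 2.0 * arctan_k(np.linalg.norm(mobius_add(-x, y, k)), k)

x = np.array([0.3, -0.1])
y = np.array([-0.2, 0.4])
euclidean = 2.0 * np.linalg.norm(x - y)
for k in (-1e-2, -1e-6, 1e-6, 1e-2):
    # the gap to the Euclidean distance shrinks as |k| -> 0
    print(k, abs(dist_k(x, y, k) - euclidean))
```

<p>Both from the hyperbolic side ($\kappa$ small and negative) and from the
spherical side ($\kappa$ small and positive), the distance converges to the same
Euclidean value, including the factor of 2 discussed above.</p>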
<h2 id="gridsofgeodesicsatequidistantintervals">Grids of Geodesics at Equidistant Intervals</h2>
<p>Now that we’ve fully described the models mathematically, let’s also
build up our intuition through some illustrations of the geometries in 2D.
First, we want to see what a grid of equidistant geodesics in two orthogonal
directions looks like in each geometry of constant curvature.</p>
<h3 id="gridofgeodesicsoneuclideanmanifold">Grid of Geodesics on Euclidean Manifold</h3>
<p>Here’s what a grid of geodesics looks like on the $x/y$-plane,
a.k.a. the 2D Euclidean manifold. Admittedly, that’s nothing special: the grid
just consists of straight lines, reminding us of the Cartesian coordinate
system. Note that this 2D grid could be a cross-section of a 3D Euclidean
geometry.</p>
<div class="figurewithcaption">
<div class="figuretitle">
Grid of Geodesics at Equidistant Intervals<br /> on Euclidean Manifold<br /> ($\kappa=0$)
</div>
<img src="/img/20191121KStereographicModel/gridofgeodesicsK0.0.svg?v=4" alt="Grid of Geodesics at Equidistant Intervals on Euclidean Manifold" width="100%" />
</div>
<h3 id="gridofgeodesicsonpoincaréball">Grid of Geodesics on Poincaré Ball</h3>
<p>Now, let’s get a feeling for what a grid of geodesics at equidistant
intervals along the $x/y$-axes looks like on the Poincaré disk:</p>
<div class="figurewithcaption">
<div class="figuretitle">
Grid of Geodesics at Equidistant Intervals<br /> on Poincaré Ball<br /> ($\kappa=1$)
</div>
<img src="/img/20191121KStereographicModel/gridofgeodesicsK1.0.svg?v=3" alt="Grid of Geodesics at Equidistant Intervals on Poincaré Ball" width="100%" />
</div>
<p>Recall that the points on the Poincaré disk result from the stereographic
projection of the hyperboloid. Here are a few things to note about the grid
of geodesics on the Poincaré ball:</p>
<ul>
<li>The center of the Poincaré ball, which we also refer to as the origin, represents the lower
tip of the upper sheet of the twosheeted hyperboloid.</li>
<li>The border of the Poincaré ball represents points at infinity, or alternatively,
points on the hyperboloid that are infinitely far away from the origin.</li>
<li>The geodesics intersect the border of the Poincaré ball at a right angle.</li>
<li>To the naked eye, the interval between the geodesics becomes tighter and
tighter towards the border of the Poincaré ball, but according
to the hyperbolic metric they are equidistant. This tightening happens because
volumes and distances grow exponentially towards the border of the Poincaré ball:
close to the border, even small dislocations on the Poincaré ball
correspond to large dislocations on the hyperboloid.</li>
</ul>
<p>One can also imagine what a 3D hyperbolic geometry looks like by thinking of
this 2D grid of geodesics as a cross-section of the inside of a 3D Poincaré ball.</p>
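<p>The exponential blow-up near the border can also be made concrete. On the
Poincaré disk ($\kappa=-1$), the distance from the origin to a point at Euclidean
radius $r$ is $2\,\mathrm{arctanh}(r)$, which diverges as $r\to 1$. A quick sketch
in plain NumPy:</p>

```python
import numpy as np

# distance from the origin on the Poincare disk (kappa = -1):
# d(0, x) = 2 artanh(||x||_2), which diverges as ||x||_2 -> 1
for r in (0.5, 0.9, 0.99, 0.999, 0.9999):
    print(r, 2.0 * np.arctanh(r))

# since 2 artanh(r) = log((1 + r) / (1 - r)), each additional digit of
# precision towards the border adds roughly log(10) ~ 2.3 to the distance
```
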
<h3 id="gridofgeodesicsonstereographicprojectionofsphere">Grid of Geodesics on Stereographic Projection of Sphere</h3>
<p>Here’s what the equivalent grid of geodesics looks like for the stereographic
projection of the 2D sphere:</p>
<div class="figurewithcaption">
<div class="figuretitle">
Grid of Geodesics at Equidistant Intervals<br /> on Stereographic Projection of Sphere
<br />
($\kappa=1$)
</div>
<img src="/img/20191121KStereographicModel/gridofgeodesicsK1.0.svg?v=2" alt="Grid of Geodesics at Equidistant Intervals on Stereographic Projection of Sphere" width="100%" />
</div>
<p>Some things to note in this grid are:</p>
<ul>
<li>The center inside the ball is the stereographic projection of the south pole.</li>
<li>The circle marked in bold corresponds to the equator of the 2D sphere.</li>
<li>Each geodesic actually represents a great circle of the 2D sphere.</li>
<li>The geodesic length of the part of a great circle that lies inside the bold
circle (the equator) is equal to the geodesic length of the part that lies
outside of it.</li>
<li>Each pair of geodesics meets exactly twice.</li>
</ul>
<p>Similarly, one can imagine what a 3D spherical geometry would look like by
thinking of this grid as a cross-section of a 3D spherical geometry.</p>
<h2 id="illustrationsofsmoothinterpolationsbetweengeometriesofpositiveandnegativecurvature">Illustrations of Smooth Interpolations between Geometries of Positive and Negative Curvature</h2>
<p>In the following, we want to illustrate how the $\kappa$-stereographic model
allows us to compute useful notions on the manifolds of constant curvature,
and how these notions interpolate smoothly as the curvature $\kappa$ changes.</p>
<h3 id="paralleltransportofunitgyrovectors">Parallel Transport of Unit Gyrovectors</h3>
<p>The following animation shows how the parallel transport of unit gyrovectors
smoothly adapts to changing values of the curvature $\kappa$:</p>
<div class="figurewithcaption">
<div class="figuretitle">
Parallel Transport of Gyrovectors
</div>
<img src="/img/20191121KStereographicModel/gyrovectorparalleltransport.gif?v=2" alt="Parallel Transport of Gyrovectors" width="100%" />
</div>
<h3 id="midpoints">Midpoints</h3>
<p>The following animation shows how the equally-weighted geodesic midpoint
$\vm_{\kappa}$ of $\vx_1,\ldots,\vx_4$ changes smoothly with $\kappa$.
The corresponding shortest paths from the points $\vx_1,\ldots,\vx_4$ to their
midpoint $\vm_{\kappa}$ are also illustrated. For positive curvature, the antipode of
the midpoint $\vm_{\kappa}$ is shown as well.</p>
<div class="figurewithcaption">
<div class="figuretitle">
Midpoint
</div>
<img src="/img/20191121KStereographicModel/midpoint.gif?v=4" alt="Midpoint" width="100%" />
</div>
<p>Observe how, for positive curvature, the midpoint moves to the northern hemisphere
once the embeddings of $\vx_1,\ldots,\vx_4$ start to reside there.</p>
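<p>For two points with equal weights, the gyromidpoint coincides with the geodesic
midpoint, i.e. it is equidistant from both points. The sketch below checks this
numerically for $\kappa=-1$ with hand-rolled NumPy helpers implementing the
weighted-midpoint formula from the table above (the function names are
hypothetical, not the geoopt API):</p>

```python
import numpy as np

def mobius_add(x, y, k):
    # Moebius addition in the kappa-stereographic model
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 - 2.0 * k * xy - k * y2) * x + (1.0 + k * x2) * y
    return num / (1.0 - 2.0 * k * xy + k ** 2 * x2 * y2)

def mobius_scalar(r, x):
    # Moebius scalar multiplication r (x) x, specialized to kappa = -1
    n = np.linalg.norm(x)
    return np.tanh(r * np.arctanh(n)) * x / n

def dist(x, y):
    # geodesic distance on the Poincare ball (kappa = -1)
    return 2.0 * np.arctanh(np.linalg.norm(mobius_add(-x, y, -1.0)))

def lam(x, k):
    # conformal factor lambda_x^kappa
    return 2.0 / (1.0 + k * np.dot(x, x))

def midpoint(xs, alphas, k):
    # weighted gyromidpoint (kappa = -1 via mobius_scalar)
    lams = np.array([lam(x, k) for x in xs])
    den = np.sum(alphas * (lams - 1.0))
    s = np.sum([(a * l / den) * x
                for a, l, x in zip(alphas, lams, xs)], axis=0)
    return mobius_scalar(0.5, s)

x = np.array([0.3, 0.1])
y = np.array([-0.2, 0.4])
m = midpoint([x, y], np.ones(2), -1.0)
print(dist(x, m), dist(y, m))  # the two distances agree
```

<p>With unequal weights, the point is no longer equidistant but still lies on the
geodesic connecting the two points.</p>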
<h3 id="geodesicdistance">Geodesic Distance</h3>
<p>The following animation shows how the scalar field of distances to a
point $\vx$ transforms smoothly across curvatures. One thing
to note is how, for spherical geometries with large curvature,
the maximal representable distance becomes very small due to the
spherical structure of the manifold. Another thing to notice for
negative curvature is that most of the space resides near the border
of the Poincaré ball; it is therefore crucial to work with <code class="highlighterrouge">float64</code> in
order to represent distances accurately.</p>
<div class="figurewithcaption">
<div class="figuretitle">
Geodesic Distance
</div>
<img src="/img/20191121KStereographicModel/distance.gif?v=3" alt="Distance" width="100%" />
<div class="figurecaption" style="textalign:center">
Heatmap of square root of distance $\sqrt{d_{\kappa}(\vx,\argdot)}$ to $\vx$
</div>
</div>
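<p>To see why <code class="highlighterrouge">float64</code> matters, consider a point whose
Euclidean distance to the border of the Poincaré ball ($\kappa=-1$) is $10^{-8}$.
In <code class="highlighterrouge">float32</code>, its norm rounds to exactly $1$, so the
distance to the origin, $2\,\mathrm{arctanh}(\norm{\vx}_2)$, overflows to infinity,
whereas <code class="highlighterrouge">float64</code> still resolves it. A plain NumPy
sketch, not geoopt code:</p>

```python
import numpy as np

# a point at Euclidean distance 1e-8 from the border of the Poincare ball
r64 = np.float64(1.0 - 1e-8)
r32 = np.float32(r64)  # rounds to exactly 1.0 (float32 spacing below 1 is ~6e-8)

d64 = 2.0 * np.arctanh(r64)  # finite hyperbolic distance from the origin
with np.errstate(divide="ignore"):
    d32 = 2.0 * np.arctanh(r32)  # arctanh(1.0) overflows to inf

print(d64, d32)  # finite value vs. inf
```
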
<h3 id="hyperplanedistance">Hyperplane Distance</h3>
<p>The following animation shows how the scalar field of distances to a
hyperplane transforms smoothly across curvatures. The hyperplane
is described by a point $\vp$ on the hyperplane and a normal vector $\vw$.
Note how the hyperplane becomes a straight line for $\kappa\to 0$
and a great circle for $\kappa>0$:</p>
<div class="figurewithcaption">
<div class="figuretitle">
Hyperplane Distance
</div>
<img src="/img/20191121KStereographicModel/distance2plane.gif?v=3" alt="Distance to Hyperplane" width="100%" />
<div class="figurecaption" style="textalign:center">
Heatmap of square root of hyperplane distance $\sqrt{d_{\kappa}^{H_{\vp,\vw}}(\argdot)}$
</div>
</div>
<h2 id="implementationofkappastereographicmodel">Implementation of $\kappa$-Stereographic Model</h2>
<p>A PyTorch implementation of the $\kappa$-stereographic model was contributed by
Andreas Bloch to the open-source geometric optimization library geoopt. The code for
the $\kappa$-stereographic model can be found here:</p>
<p style="textalign:center">
<a class="btn btndefault btnsm" href="https://github.com/andbloch/geoopt/tree/universalmanifold/geoopt/manifolds/stereographic" target="_blank">
<i class="fa fagithub fa2" ariahidden="true"></i> $\kappa$-Stereographic Model Source</a>
</p>
<p>Note that the implementation doesn’t support Euclidean geometry at exactly $\kappa=0$.
Instead, only curvatures with $\abs{\kappa}\geq 0.001$ are supported, so that the formulas
above always provide gradients with respect to $\kappa$, which makes it possible to learn
the curvatures of embedding spaces.</p>
<h2 id="codeexampleforkappastereographicmodel">Code Example for $\kappa$-Stereographic Model</h2>
<p>Here’s a quick code example that shows how the $\kappa$-stereographic model can be
instantiated and used to perform some of the aforementioned operations. It also shows
how to train the curvature $\kappa$ to achieve certain target distances.</p>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">geoopt.manifolds.stereographic</span> <span class="kn">import</span> <span class="n">StereographicExact</span>
<span class="kn">from</span> <span class="nn">geoopt.optim</span> <span class="kn">import</span> <span class="n">RiemannianAdam</span>
<span class="kn">from</span> <span class="nn">geoopt</span> <span class="kn">import</span> <span class="n">ManifoldTensor</span>
<span class="kn">from</span> <span class="nn">geoopt</span> <span class="kn">import</span> <span class="n">ManifoldParameter</span>
<span class="c1"># MANIFOLD INSTANTIATION AND COMPUTATION OF MANIFOLD QUANTITIES ################
</span>
<span class="c1"># create manifold with initial K=1.0 (Poincaré Ball)
</span>
<span class="n">manifold</span> <span class="o">=</span> <span class="n">StereographicExact</span><span class="p">(</span><span class="n">K</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
<span class="n">float_precision</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float64</span><span class="p">,</span>
<span class="n">keep_sign_fixed</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">min_abs_K</span><span class="o">=</span><span class="mf">0.001</span><span class="p">)</span>
<span class="c1"># get manifold properties
</span>
<span class="n">K</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">get_K</span><span class="p">()</span><span class="o">.</span><span class="n">item</span><span class="p">()</span>
<span class="n">R</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">get_R</span><span class="p">()</span><span class="o">.</span><span class="n">item</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"K={K}"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"R={R}"</span><span class="p">)</span>
<span class="c1"># define dimensionality of space
</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">10</span>
<span class="k">def</span> <span class="nf">create_random_point</span><span class="p">(</span><span class="n">manifold</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">x_norm</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">/</span><span class="n">x_norm</span> <span class="o">*</span> <span class="n">manifold</span><span class="o">.</span><span class="n">get_R</span><span class="p">()</span> <span class="o">*</span> <span class="mf">0.9</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
<span class="c1"># create two random points on manifold
</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">create_random_point</span><span class="p">(</span><span class="n">manifold</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">create_random_point</span><span class="p">(</span><span class="n">manifold</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="c1"># compute their initial distances
</span>
<span class="n">initial_dist</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">dist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"initial_dist={initial_dist.item():.3f}"</span><span class="p">)</span>
<span class="c1"># compute the log map of y at x
</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">logmap</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="c1"># compute tangent space norm of v at x (should be equal to initial distance)
</span>
<span class="n">v_norm</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"v_norm={v_norm.item():.3f}"</span><span class="p">)</span>
<span class="c1"># compute the exponential map of v at x (=y)
</span>
<span class="n">y2</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">expmap</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
<span class="n">diff</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">y2</span><span class="p">)</span><span class="o">.</span><span class="nb">abs</span><span class="p">()</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"diff={diff.item():.3f}"</span><span class="p">)</span>
<span class="c1"># CURVATURE OPTIMIZATION #######################################################
</span>
<span class="c1"># define embedding_optimizer for curvature
</span>
<span class="n">curvature_optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">optim</span><span class="o">.</span><span class="n">SGD</span><span class="p">([</span><span class="n">manifold</span><span class="o">.</span><span class="n">get_trainable_K</span><span class="p">()],</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">)</span>
<span class="c1"># set curvature to trainable
</span>
<span class="n">manifold</span><span class="o">.</span><span class="n">set_K_trainable</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># define training loop to optimize curvature until the points have a
</span>
<span class="c1"># certain target distance
</span>
<span class="k">def</span> <span class="nf">train_curvature</span><span class="p">(</span><span class="n">target_dist</span><span class="p">):</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">curvature_optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">dist_now</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">dist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">dist_now</span> <span class="o">-</span> <span class="n">target_dist</span><span class="p">)</span><span class="o">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">curvature_optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># keep the points x and y fixed,
</span>
<span class="c1"># train the curvature until the distance is 1.0 more than the initial distance
</span>
<span class="c1"># -> curvature smaller than initial curvature
</span>
<span class="n">train_curvature</span><span class="p">(</span><span class="n">initial_dist</span> <span class="o">+</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"K_smaller={manifold.get_K().item():.3f}"</span><span class="p">)</span>
<span class="c1"># keep the points x and y fixed,
</span>
<span class="c1"># train the curvature until the distance is 1.0 less than the initial distance
</span>
<span class="c1"># -> curvature greater than initial curvature
</span>
<span class="n">train_curvature</span><span class="p">(</span><span class="n">initial_dist</span> <span class="o">-</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"K_larger={manifold.get_K().item():.3f}"</span><span class="p">)</span>
<span class="c1"># EMBEDDING OPTIMIZATION #######################################################
</span>
<span class="c1"># redefine x and y as manifold parameters and assign them to manifold such
</span>
<span class="c1"># that the embedding_optimizer knows according to which manifold the gradient
</span>
<span class="c1"># steps have to be performed
</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">ManifoldTensor</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">manifold</span><span class="o">=</span><span class="n">manifold</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">ManifoldParameter</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">ManifoldTensor</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">manifold</span><span class="o">=</span><span class="n">manifold</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">ManifoldParameter</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c1"># define embedding optimizer and pass embedding parameters
</span>
<span class="n">embedding_optimizer</span> <span class="o">=</span> <span class="n">RiemannianAdam</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">],</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-1</span><span class="p">)</span>
<span class="c1"># define a training loop to optimize the embeddings of x and y
</span>
<span class="c1"># until they have a certain distance
</span>
<span class="k">def</span> <span class="nf">train_embeddings</span><span class="p">(</span><span class="n">target_dist</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
<span class="n">embedding_optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">dist_now</span> <span class="o">=</span> <span class="n">manifold</span><span class="o">.</span><span class="n">dist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">dist_now</span> <span class="o">-</span> <span class="n">target_dist</span><span class="p">)</span><span class="o">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">embedding_optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># print current distance
</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"dist(x,y)={manifold.dist(x,y).item():.3f}"</span><span class="p">)</span>
<span class="c1"># optimize until points have target distance of 4.0
</span>
<span class="n">train_embeddings</span><span class="p">(</span><span class="mf">4.0</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"dist(x,y)={manifold.dist(x,y).item():.3f} target:4.0"</span><span class="p">)</span>
<span class="c1"># optimize until points have target distance of 2.0
</span>
<span class="n">train_embeddings</span><span class="p">(</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">"dist(x,y)={manifold.dist(x,y).item():.3f} target:2.0"</span><span class="p">)</span>
</code></pre></div></div>
<p>That’s all for the $\kappa$-stereographic model. We hope that this blogpost
and the provided implementation spark new ideas for applications of the
$\kappa$-stereographic model.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="gu2018learning">A. Gu, F. Sala, B. Gunel, and C. Ré, “Learning MixedCurvature Representations in Product Spaces,” 2018.</span></li>
<li><span id="skopek2019mixed">O. Skopek, O.E. Ganea, and G. Bécigneul, “Mixedcurvature Variational Autoencoders,” <i>arXiv preprint arXiv:1911.08411</i>, 2019.</span></li>
<li><span id="bachmann2019constant">G. Bachmann, G. Bécigneul, and O.E. Ganea, “Constant Curvature Graph Convolutional Networks,” <i>arXiv preprint arXiv:1911.05076</i>, 2019.</span></li>
<li><span id="wilson2014spherical">R. C. Wilson, E. R. Hancock, E. Pekalska, and R. P. W. Duin, “Spherical and hyperbolic embeddings of data,” <i>IEEE transactions on pattern analysis and machine intelligence</i>, vol. 36, no. 11, pp. 2255–2269, 2014.</span></li>
<li><span id="ungar2005analytic">A. A. Ungar, <i>Analytic hyperbolic geometry: Mathematical foundations and applications</i>. World Scientific, 2005.</span></li>
<li><span id="ungar1991thomas">A. A. Ungar, “Thomas precession and its associated group-like structure,” <i>American Journal of Physics</i>, vol. 59, no. 9, pp. 824–834, 1991.</span></li>
<li><span id="hnn">O. Ganea, G. Bécigneul, and T. Hofmann, “Hyperbolic neural networks,” in <i>Advances in neural information processing systems</i>, 2018, pp. 5345–5355.</span></li>
<li><span id="ungar2001hyperbolic">A. A. Ungar, “Hyperbolic trigonometry and its application in the Poincaré ball model of hyperbolic geometry,” <i>Computers & Mathematics with Applications</i>, vol. 41, no. 1–2, pp. 135–147, 2001.</span></li></ol>
<p>Andreas Bloch, in collaboration with Ondrej Skopek and Gregor Bachmann, under the assistance of Octavian Ganea and Gary Bécigneul.</p>
<h2 id="stochastic-gradient-descent-on-riemannian-manifolds">Stochastic Gradient Descent on Riemannian Manifolds (2019-10-15)</h2>
<p>In this blogpost I’ll explain how Stochastic Gradient Descent (SGD) is generalized to the
optimization of loss functions on Riemannian manifolds.
First, I’ll give an overview of the kind of problems that are suited for Riemannian
optimization. Then, I’ll explain how <em>Riemannian Stochastic Gradient Descent (RSGD)</em> works
in detail and I’ll also show how RSGD is performed in the case where the Riemannian manifold of
interest is a product space of several Riemannian manifolds. If you’re already experienced
with SGD and are just getting started with Riemannian optimization, this blogpost is
exactly what you’re looking for.</p>
<h2 id="typicalriemannianoptimizationproblems">Typical Riemannian Optimization Problems</h2>
<p>Let’s first consider the properties of optimization problems that we’d
typically want to solve through Riemannian optimization. Riemannian optimization
is particularly well-suited for problems where we want to optimize a loss function</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\cL\colon\cM&\to\R
\\
\vtheta&\mapsto\cL(\vtheta)
\end{align*} %]]></script>
<p>that is defined on a <em>Riemannian manifold</em> $(\cM,g)$. This means that
the optimization problem <em>requires</em> that the optimized parameters $\vtheta\in\cM$
lie on the “smooth surface” of a Riemannian manifold $(\cM,g)$. One can easily
think of constrained optimization problems where the constraint can be described through
points lying on a Riemannian manifold (e.g., the parameters must lie on a sphere, the parameters
must form a rotation matrix, …). Riemannian optimization then gives us the possibility
of turning such a constrained optimization problem into an unconstrained one
on the manifold.</p>
<p>So, in Riemannian optimization we’re interested in finding an optimal solution $\vtheta^*$ for our
parameters</p>
<script type="math/tex; mode=display">\vtheta^*\in\argmin_{\vtheta} \cL(\vtheta)</script>
<p>that lie on a Riemannian manifold. The following two figures illustrate the heatmaps
of some non-convex loss functions that are defined on a Euclidean and on a spherical
Riemannian manifold.</p>
<div class="figurewithcaption">
<img src="/img/20191015StochasticGradientDescentonRiemannianManifolds/heatmapplane.png?v=1" alt="Heatmap of loss function defined on the Euclidean plane" width="70%" />
<div class="figurecaption" style="textalign:center">
Heatmap of a loss function $\cL$ defined on the Euclidean plane
</div>
</div>
<div class="figurewithcaption">
<img src="/img/20191015StochasticGradientDescentonRiemannianManifolds/heatmapsphere.png?v=1" alt="Heatmap of loss function defined on a spherical manifold" width="40%" />
<div class="figurecaption" style="textalign:center">
Heatmap of a loss function $\cL$ defined on a spherical manifold
</div>
</div>
<p>Similarly as with SGD in Euclidean vector spaces, in Riemannian optimization we want
to perform a gradient-based descent on the surface of the manifold. The gradient steps should
also be based on the gradients of the loss function $\cL$, such that we finally find some
parameters $\vtheta^*$ that hopefully lie at a global minimum of the loss.</p>
<h2 id="whatsdifferentwithsgdonriemannianmanifolds">What’s Different with SGD on Riemannian Manifolds?</h2>
<p>Let’s first look at what makes RSGD different from the usual SGD in the
Euclidean vector spaces. Actually, RSGD works just like SGD when applied to our well-known
Euclidean vector spaces, because RSGD is a generalization of SGD to arbitrary
Riemannian manifolds.</p>
<p>Indeed, the Euclidean vector space $\R^n$ can be interpreted as a Riemannian
manifold $(\R^n, g_{ij})$, known as the <em>Euclidean manifold</em>, with the
metric $g_{ij}=\delta_{ij}$. When using our usual SGD to optimize a loss function defined over
the Euclidean manifold $\R^n$, we iteratively compute the following gradients and
gradient updates on minibatches in order to hopefully converge to an optimal solution:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla_{\vtheta}\cL(\vtheta)&=\left(\fpartial{\cL(\vtheta)}{\theta_i}\right)_{i=1}^d,
\\
\vtheta^{(t+1)}&\gets\vtheta^{(t)}-\eta_t\nabla_{\vtheta}\cL(\vtheta^{(t)}).
\end{align*} %]]></script>
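<p>In the Euclidean case these two formulas amount to only a few lines of code. Here is a minimal sketch (the quadratic loss and all numbers are my own illustration, not from the text):</p>

```python
import numpy as np

# illustrative quadratic loss L(theta) = 0.5 * ||theta - target||^2
target = np.array([1.0, -2.0])

def grad(theta):
    # partial derivatives of the loss w.r.t. each coordinate of theta
    return theta - target

eta = 0.1                # learning rate eta_t (kept constant here)
theta = np.zeros(2)      # initial parameters
for t in range(200):
    theta = theta - eta * grad(theta)  # Euclidean gradient update

print(theta)  # converges toward [1.0, -2.0]
```

<p>Since the metric is the identity here, no correction of the partial derivatives is needed; this is exactly the simplification that no longer holds on general Riemannian manifolds.</p>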
<p>Now, in some optimization problems the solution space, or solution manifold $\cM$,
might have a structure that is different from the Euclidean manifold. Let’s consider
two examples of optimization problems that can be captured as Riemannian optimization
problems and let’s have a look at the challenges that arise in the gradient updates:</p>
<ol>
<li>
<p><strong>Points on a Sphere:</strong> The optimization problem may require that the
parameters $\vtheta=(x,y,z)$ lie on a 2D spherical manifold of radius 1 that
is embedded in 3D ambient space. The corresponding Riemannian manifold $(\cM,g)$ would
then be</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\cM&=\sdset{\vtheta\in\R^{3}}{\norm{\vtheta}_2=1},
\\
g&=\MI.
\end{align*} %]]></script>
<p>In this case we have a loss function $\cL\colon\cM\to\R$ that gives
us the loss for any point (or parameter) $\vtheta$ on the sphere $\cM$.
Our machine learning framework might then automatically provide us with the
derivatives of the loss</p>
<div class="eqdesktop">
$$
\vh(\vtheta^{(t)})
=
\left(
\fpartial{\cL(\vtheta^{(t)})}{x},
\fpartial{\cL(\vtheta^{(t)})}{y},
\fpartial{\cL(\vtheta^{(t)})}{z}
\right)
$$
</div>
<div class="eqmobile">
$$
\vh(\vtheta^{(t)})
=
\left(
\tfrac{\partial\cL(\vtheta^{(t)})}{\partial x},
\tfrac{\partial\cL(\vtheta^{(t)})}{\partial y},
\tfrac{\partial\cL(\vtheta^{(t)})}{\partial z}
\right)
$$
</div>
<p>evaluated at our current parameters $\vtheta^{(t)}$. But now, how are we going about
updating the parameters $\vtheta^{(t)}$ with $\vh(\vtheta^{(t)})$? We can’t
just use our well-known update rule of SGD for Euclidean vector spaces, since
we’re not guaranteed that the update rule</p>
<script type="math/tex; mode=display">\vtheta^{(t+1)}\gets\vtheta^{(t)}-\eta_t\vh(\vtheta^{(t)})</script>
<p>yields a valid update $\vtheta^{(t+1)}$ that lies on the surface of the spherical
manifold $\cM$. Before seeing how this can be solved through Riemannian optimization,
let’s first consider another example where we encounter a similar issue.</p>
</li>
<li>
<p><strong>Doubly-Stochastic Matrices:</strong> One may also think of a more complicated
optimization problem, where the parameters $\vtheta$ must be a square matrix with positive
coefficients such that the coefficients of every row and column sum up to 1.
It turns out that this solution space $\cM$ also represents a Riemannian manifold,
having thus a smooth “surface” and a metric $g_{\MX}$ that smoothly varies with $\MX$.
This Riemannian manifold $(\cM,g_{\MX})$ is the manifold of so-called <em>doubly-stochastic
matrices</em> <a class="citation" href="#douik2018manifold">[1]</a>, given by:</p>
<div class="eqdesktop">
$$
\begin{align*}
\cM&=\dset{\MX\in\R^{d\times d}}{
\begin{array}{rl}
\forall i,j\in\set{1,\ldots,d}\colon
&X_{ij}\geq 0,
\\
\forall i\in\set{1,\ldots,d}\colon
&\sum_{k=1}^d X_{ik}=\sum_{k=1}^d X_{ki} = 1.
\end{array}
},
\\
g_{\MX}&=\Tr{(\MA\oslash\MX)\MB^\T}.
\end{align*}
$$
</div>
<div class="eqmobile">
$$
\cM=\dset{\MX\in\R^{d\times d}}{
\begin{array}{c}
\forall i,j\in\set{1,\ldots,d}\colon
\\
X_{ij}\geq 0,
\\
\sum_{k=1}^d X_{ik}=1,
\\
\sum_{k=1}^d X_{ki}=1.
\end{array}
},
$$
$$
g_{\MX}=\Tr{(\MA\oslash\MX)\MB^\T}.
$$
</div>
<p>Again, our machine learning framework may give us the derivatives of the
loss w.r.t. each of the parameter matrix’ coefficients and evaluate it at our
current parameters $\vtheta^{(t)}:$</p>
<script type="math/tex; mode=display">\MH(\vtheta^{(t)})
=
\left(
\fpartial{\cL(\vtheta)}{X_{ij}}
\right)_{i,j=1}^{d}.</script>
<p>Again, the simple gradient-update rule of SGD for parameters in Euclidean
vector spaces</p>
<script type="math/tex; mode=display">\vtheta^{(t+1)}\gets\vtheta^{(t)}-\eta_t\MH(\vtheta^{(t)})</script>
<p>would not guarantee us that the update always yields a matrix $\vtheta^{(t+1)}$ with
nonnegative coefficients whose rows and columns sum up to 1.</p>
</li>
</ol>
<p>In both examples we saw that the simple SGD update rule for Euclidean vector spaces is
insufficient to guarantee the validity of the updated parameters. So now the question is:
how can we perform valid gradient updates to parameters that are defined on arbitrary
Riemannian manifolds? That’s exactly where Riemannian optimization comes in, which we’ll
look at next!</p>
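<p>The failure of the Euclidean update rule is easy to demonstrate for the first example: a plain gradient step pushes a point off the unit sphere. A small sketch (the derivative vector and step size are illustrative):</p>

```python
import numpy as np

theta = np.array([0.0, 0.0, 1.0])  # a point on the unit sphere
h = np.array([0.5, -0.3, 0.2])     # some ambient-space derivative vector
eta = 0.1                          # learning rate

naive = theta - eta * h            # plain Euclidean SGD update
print(np.linalg.norm(naive))       # no longer 1: the point left the sphere
```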
<h2 id="performinggradientstepsonriemannianmanifolds">Performing Gradient Steps on Riemannian Manifolds</h2>
<p>In the “curved” spaces of Riemannian manifolds the gradient updates
should ideally follow the “curved” geodesics instead of just following straight
lines, as done in SGD for parameters on our familiar Euclidean manifold $\R^n$.
To this end, the seminal work of Bonnabel <a class="citation" href="#bonnabel">[2]</a> introduced
<em>Riemannian Stochastic Gradient Descent (RSGD)</em> that generalizes SGD to Riemannian manifolds.
In what follows I’ll explain and illustrate how this technique works.</p>
<p>Let’s now assume that we have a typical Riemannian optimization problem, e.g., one of the
two mentioned previously, where the solution space is given by an arbitrary
$d$-dimensional Riemannian manifold $(\cM, g)$, and we’re interested in finding an optimal solution
of a loss function $\cL\colon\cM\to\R$ that is defined for any parameters $\vtheta$ on the
Riemannian manifold. Let $\vtheta^{(t)}\in\cM$ denote our current set of parameters at
timestep $t$. A gradient step is then performed through the application of the following three
steps:</p>
<ol>
<li>
<p>Evaluate the gradient of $\cL$ w.r.t. the parameters $\vtheta$ at $\vtheta^{(t)}$.</p>
</li>
<li>
<p>Orthogonally project the gradient onto the tangent space $\cT_{\vtheta^{(t)}}\cM$ to get
the tangent vector $\vv$, pointing in the direction of steepest ascent of $\cL$.</p>
</li>
<li>
<p>Perform a gradient step on the surface of the manifold in the negative direction of the tangent
vector $\vv$, to get the updated parameters.</p>
</li>
</ol>
<p>We’ll now look at these steps in more detail in what follows.</p>
<h3 id="computationofgradientwrtcurrentparameters">Computation of Gradient w.r.t. Current Parameters</h3>
<p>In order to minimize our loss function $\cL$, we first have to determine
the gradient. The gradient w.r.t. our parameters $\vtheta$, evaluated
at our current parameters $\vtheta^{(t)}$, is computed as follows:</p>
<script type="math/tex; mode=display">\vh:=
\nabla_{\vtheta}\cL(\vtheta^{(t)})
=
\vg^{-1}_{\vtheta^{(t)}}
\fpartial{\cL(\vtheta^{(t)})}{\vtheta}.</script>
<p>The computation and evaluation of the derivatives $\fpartial{\cL(\vtheta^{(t)})}{\vtheta}$ is
usually performed automatically through the auto-differentiation functionality of our
machine learning framework of choice. However, the multiplication with the
inverse metric $\vg^{-1}_{\vtheta^{(t)}}$ usually has to be done manually in order to
obtain the correct quantity for the gradient $\nabla_{\vtheta}\cL(\vtheta^{(t)})$.</p>
<p>If it’s new to you that the partial derivatives have to be multiplied
by the inverse of the metric tensor $\vg^{-1}_{\vtheta^{(t)}}$ to obtain the
gradient, then don’t worry too much about that now. Let me just tell you
that this is, in fact, how the gradient is defined in general. The reason you
might have never come across this multiplication by the inverse metric tensor
is that for the usual Euclidean vector space, with the usual Cartesian coordinate
system, the inverse metric tensor $\vg^{-1}$ just simplifies to the identity matrix $\MI$,
and is therefore usually omitted for convenience.</p>
<p>The reason behind the multiplication by the inverse metric tensor is that
we want the gradient to be a vector that is <em>invariant</em> under the choice
of a specific coordinate system. Furthermore, it should satisfy
the following two properties that we already know from the gradient in Euclidean
vector spaces:</p>
<ul>
<li>
<p>The gradient evaluated at $\vtheta^{(t)}$ points into the direction of
steepest ascent of $\cL$ at $\vtheta^{(t)}$.</p>
</li>
<li>
<p>The norm of the gradient at $\vtheta^{(t)}$ is equal to the
value of the directional derivative along a unit vector in the gradient’s direction.</p>
</li>
</ul>
<p>Explaining the reasons behind the multiplication with $\vg^{-1}_{\vtheta^{(t)}}$ in more detail
would go beyond the scope of this blogpost. However, in case you want to learn more about
this, I highly recommend that you have a look at
<a href="https://www.youtube.com/watch?v=e0eJXttPRZI&list=PLlXfTHzgMRULkodlIEqfgTSH1AY_bNtq&index=1">
Pavel Grinfeld’s valuable lectures on tensor calculus</a> (until
Lesson 5a for the gradient), where you can learn the reasons behind the
multiplication with the inverse metric tensor in a matter of a few hours.</p>
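<p>As a small toy illustration of this multiplication (my own example, not from the referenced lectures): in polar coordinates $(r,\varphi)$ on the plane the metric tensor is $\mathrm{diag}(1, r^2)$, so the Riemannian gradient rescales the angular partial derivative by $1/r^2$:</p>

```python
import numpy as np

def riemannian_grad(partials, r):
    # multiply the partial derivatives by the inverse metric tensor
    g = np.diag([1.0, r**2])      # polar-coordinate metric tensor
    return np.linalg.inv(g) @ partials

partials = np.array([2.0, 4.0])   # (dL/dr, dL/dphi), illustrative values
h = riemannian_grad(partials, r=2.0)
print(h)                          # [2.0, 1.0]: angular component scaled by 1/r^2
```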
<h3 id="orthogonalprojectionofgradientontotangentspace">Orthogonal Projection of Gradient onto Tangent Space</h3>
<p>Since the previously computed gradient $\vh=\nabla_{\vtheta}\cL(\vtheta^{(t)})$ may be lying
just somewhere in the ambient space, we first have to determine the component of $\vh$ that lies in
the tangent space at $\vtheta^{(t)}$. The situation is illustrated in the figure below, where
we can see our manifold $\cM$, the gradient $\vh$ lying in the ambient space, and the tangent
space $\cT_{\vtheta^{(t)}}\cM$, which represents a first-order approximation of the manifold’s
surface around our current parameters $\vtheta^{(t)}$.</p>
<div class="figurewithcaption">
<img src="/img/20191015StochasticGradientDescentonRiemannianManifolds/rsgdsteps/orthogonalprojectionofgradientontotangentspace.png?v=1" alt="Orthogonal Projection of Gradient onto Tangent Space" width="70%" datadesktopwidth="70%" datamobilewidth="100%" />
<div class="figurecaption" style="textalign:center">
Orthogonal Projection of Gradient $\vh=\nabla_{\vtheta}\cL(\vtheta^{(t)})$
onto Tangent Space $\cT_{\vtheta^{(t)}}\cM$
</div>
</div>
<p>The component of $\vh$ that lies in $\cT_{\vtheta^{(t)}}\cM$ is determined through
the orthogonal projection of $\vh$ from the ambient space onto the tangent space
$\cT_{\vtheta^{(t)}}\cM$:</p>
<script type="math/tex; mode=display">\vv=\proj_{\cT_{\vtheta^{(t)}}\cM}(\vh).</script>
<p>Depending on the chosen representation for the manifold $\cM$, the orthogonal
projection might not even be necessary, e.g., in the case where the tangent
space is always equal to the ambient space.</p>
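<p>For the unit sphere the orthogonal projection has a simple closed form: subtract from $\vh$ its component normal to the sphere at $\vtheta^{(t)}$, i.e. $\vv=\vh-\langle\vh,\vtheta^{(t)}\rangle\,\vtheta^{(t)}$. A short sketch (the numbers are illustrative):</p>

```python
import numpy as np

def proj_tangent_sphere(theta, h):
    # subtract the normal component <h, theta> * theta
    return h - np.dot(h, theta) * theta

theta = np.array([0.0, 0.0, 1.0])  # current point on the unit sphere
h = np.array([0.5, -0.3, 0.2])     # gradient lying in the ambient space
v = proj_tangent_sphere(theta, h)
print(v)                           # [0.5, -0.3, 0.0], orthogonal to theta
```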
<h3 id="gradientstepfromtangentvector">Gradient Step from Tangent Vector</h3>
<p>Having determined the direction of steepest increase $\vv$ of $\cL$ in the tangent space
$\cT_{\vtheta^{(t)}}\cM$ we can now use it to perform a gradient step. As with
the usual SGD, we want to take a step in the <em>negative</em> gradient direction in order
to hopefully <em>decrease</em> the loss. Thus, in the tangent space, we take a step in the
direction of $-\eta_t\vv$, where $\eta_t$ is our learning rate, and obtain
the point $-\eta_t\vv$ in the tangent space $\cT_{\vtheta^{(t)}}$.</p>
<p>Recall that the tangent space $\cT_{\vtheta^{(t)}}$ represents a first-order
approximation of the manifold’s smooth surface at the point $\vtheta^{(t)}$. Hence,
the vector $-\eta_t\vv\in\cT_{\vtheta^{(t)}}$ is in a direct correspondence with the
point $\vtheta^{(t+1)}\in\cM$ that we’d like to reach through our gradient update. The mapping
which maps tangent vectors to their corresponding points on the manifold is exactly the
exponential map. Thus, we may just map $-\eta_t\vv$ to $\vtheta^{(t+1)}$ via the exponential map
to perform our gradient update:</p>
<script type="math/tex; mode=display">\vtheta^{(t+1)}
=
\exp_{\vtheta^{(t)}}(-\eta_t\vv).</script>
<p>The gradient step is illustrated in the figure below. As one can observe, the exponential
map is exactly what makes the parameters stay on the surface and also what forces
gradient updates to follow the curved geodesics of the manifold.</p>
<div class="figurewithcaption">
<img src="/img/20191015StochasticGradientDescentonRiemannianManifolds/rsgdsteps/gradientstepviaexpmap.png?v=1" alt="Gradient Step via Exponential Map" width="70%" datadesktopwidth="70%" datamobilewidth="100%" />
<div class="figurecaption" style="textalign:center">
Gradient Step via Exponential Map
</div>
</div>
<p>Another equivalent way of seeing the gradient update is the following: the mapping
which moves the point $\vtheta^{(t)}\in\cM$ in the initial direction $-\eta_t\vv$
along a geodesic of length $\norm{\eta_t\vv}_{\vtheta^{(t)}}$ is exactly the
exponential map $\exp_{\vtheta^{(t)}}(\argdot)$.</p>
<p>Sometimes, as mentioned by Bonnabel in <a class="citation" href="#bonnabel">[2]</a>, for computational
efficiency reasons or when it’s hard to solve the differential equations to obtain the
exponential map, the gradient step is also approximated through the retraction $\cR_{\vx}(\vv)$:</p>
<script type="math/tex; mode=display">\cR_{\vx}(\vv):=\proj_{\cM}(\vx+\vv),</script>
<p>where the function $\proj_\cM$ is the orthogonal projection from the ambient space (that
includes the tangent space) onto the manifold $\cM$. Hence, the retraction represents a
first-order approximation of the exponential map. The possible differences between parameter
updates through the exponential map and the retraction are illustrated in the figure below:</p>
<div class="figurewithcaption">
<img src="/img/20191015StochasticGradientDescentonRiemannianManifolds/rsgdsteps/expmapvsretraction.png?v=2" alt="Exponential Map VS Retraction" width="70%" datadesktopwidth="70%" datamobilewidth="100%" />
<div class="figurecaption" style="textalign:center">
Exponential Map VS Retraction
</div>
</div>
<p>As one can see, the retraction first follows a straight line in the tangent space and then
orthogonally projects the point in the tangent space onto the manifold. The exponential map
instead performs exact updates along the manifold’s curved geodesics, with a geodesic length
that corresponds to the tangent-space norm of the tangent vector $\eta_t\vv$. Therefore, the
two update methods may lead to different parameter updates. Which one is better to use depends on the
specific manifold, the computational cost of the exponential map, the size of the gradient steps
and the behaviour of the loss function.</p>
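<p>For the unit sphere both maps have closed forms, which makes the difference easy to inspect in code: the exponential map is $\exp_{\vx}(\vv)=\cos(\norm{\vv})\,\vx+\sin(\norm{\vv})\,\vv/\norm{\vv}$, while the retraction just renormalizes $\vx+\vv$. A sketch (the tangent vector is chosen purely for illustration):</p>

```python
import numpy as np

def exp_map_sphere(x, v):
    # exact geodesic step on the unit sphere
    n = np.linalg.norm(v)
    return np.cos(n) * x + np.sin(n) * v / n

def retraction_sphere(x, v):
    # straight step in the tangent space, then project back onto the sphere
    y = x + v
    return y / np.linalg.norm(y)

x = np.array([0.0, 0.0, 1.0])  # current point on the unit sphere
v = np.array([0.8, 0.0, 0.0])  # tangent vector at x
a = exp_map_sphere(x, v)
b = retraction_sphere(x, v)
print(a, b)  # both have unit norm, but the resulting points differ slightly
```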
<p>To summarize, all we need to perform RSGD is 1) the inverse of the metric tensor, 2)
the formula for the orthogonal projection onto tangent spaces, and 3) the exponential map
or the retraction to map tangent vectors to corresponding points on the manifold.
The formulas for steps 1)–3) vary from manifold to manifold and can usually
be found in papers or other online resources. Here are a few resources that give the concrete
formulas for some useful manifolds:</p>
<ul>
<li>
<p><strong>Poincaré Ball:</strong>
<a href="https://papers.nips.cc/paper/7213poincareembeddingsforlearninghierarchicalrepresentations.pdf">
Nickel, Maximillian, and Douwe Kiela. “Poincaré embeddings for learning hierarchical
representations.” Advances in neural information processing systems. 2017.
</a></p>
</li>
<li>
<p><strong>Sphere & Hyperboloid:</strong>
<a href="http://eprints.whiterose.ac.uk/78407/1/SphericalFinal.pdf">
Wilson, Richard C., et al. “Spherical and hyperbolic embeddings of data.” IEEE transactions on
pattern analysis and machine intelligence 36.11 (2014): 2255–2269.
</a></p>
</li>
<li>
<p><strong>Birkhoff Polytope:</strong>
<a href="http://openaccess.thecvf.com/content_CVPR_2019/papers/Birdal_Probabilistic_Permutation_Synchronization_Using_the_Riemannian_Structure_of_the_Birkhoff_CVPR_2019_paper.pdf">
Birdal, Tolga, and Umut Simsekli. “Probabilistic Permutation Synchronization using the
Riemannian Structure of the Birkhoff Polytope.” Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2019.
</a></p>
</li>
<li>
<p><strong>Grassmannian Manifold:</strong>
<a href="https://arxiv.org/pdf/1808.02229.pdf">
Zhang, Jiayao, et al. “Grassmannian learning: Embedding geometry awareness in shallow and deep
learning.” arXiv preprint arXiv:1808.02229 (2018).
</a></p>
</li>
<li>
<p><strong>Several other Matrix Manifolds:</strong>
<a href="https://www.researchgate.net/profile/Rodolphe_Sepulchre/publication/220693013_Optimization_Algorithms_on_Matrix_Manifolds/links/09e4150b8678c0da06000000/OptimizationAlgorithmsonMatrixManifolds.pdf">
Absil, P.-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds.
Princeton University Press, 2009.
</a></p>
</li>
</ul>
<h2 id="riemanniansgdonproductsofriemannianmanifolds">Riemannian SGD on Products of Riemannian Manifolds</h2>
<p>Since Cartesian products of Riemannian manifolds are again Riemannian manifolds, RSGD can also
be applied in these product spaces. In this case, let $(\cP,\vg)$ be a product of $n$ Riemannian
manifolds $(\cM_i,\vg_i)_{i=1}^n$, and let $\vg$ be the induced product metric:</p>
<div class="eqdesktop">
$$
\cP:=\cM_1\times\cdots\times\cM_n,
\qquad
\vg:=\begin{pmatrix}
\vg_1 & & \\
& \ddots & \\
& & \vg_n
\end{pmatrix}.
$$
</div>
<div class="eqmobile">
$$
\cP:=\cM_1\times\cdots\times\cM_n,
$$
$$
\vg:=\begin{pmatrix}
\vg_1 & & \\
& \ddots & \\
& & \vg_n
\end{pmatrix}.
$$
</div>
<p>Furthermore, let the optimization problem on $\cP$ be</p>
<script type="math/tex; mode=display">\vtheta^*=\argmin_{\vtheta\in\cP}\cL(\vtheta).</script>
<p>Then, the nice thing with product spaces is that the exponential map in the product
space $\cP$ simply decomposes into the concatenation of the exponential maps of the
individual factors $\cM_i$. Similarly, the orthogonal projection and the gradient
computations also decompose into the corresponding individual operations on the
product’s factors. Hence, RSGD on products of Riemannian manifolds is simply achieved
by performing the aforementioned gradient-step procedure separately for each of the manifold’s
factors. A concrete algorithm for product spaces is given below:</p>
<p><img src="/img/20191015StochasticGradientDescentonRiemannianManifolds/rsgdalgo.svg?v=5" alt="RSGD Algorithm" width="100%" style="margintop:40px;marginbottom:40px;" /></p>
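<p>The per-factor decomposition can be sketched in a few lines for a product of two unit circles, each embedded in $\R^2$ (all helper functions and numbers below are my own illustration, not the geoopt API):</p>

```python
import numpy as np

def proj_tangent(x, h):
    # orthogonal projection onto the tangent space of a unit circle/sphere
    return h - np.dot(h, x) * x

def exp_map(x, v):
    # exponential map on a unit circle/sphere
    n = np.linalg.norm(v)
    if n == 0.0:
        return x
    return np.cos(n) * x + np.sin(n) * v / n

def rsgd_step_product(theta, h, eta, factor_dims):
    """One RSGD step, applied factor by factor on a product of circles."""
    out, i = [], 0
    for d in factor_dims:
        x, hx = theta[i:i+d], h[i:i+d]
        v = proj_tangent(x, hx)            # project onto the tangent space
        out.append(exp_map(x, -eta * v))   # gradient step via exponential map
        i += d
    return np.concatenate(out)

theta = np.array([1.0, 0.0, 0.0, 1.0])  # (circle 1, circle 2), each in R^2
h = np.array([0.1, 0.4, -0.2, 0.3])     # ambient derivatives, illustrative
theta = rsgd_step_product(theta, h, eta=0.5, factor_dims=[2, 2])
print(np.linalg.norm(theta[:2]), np.linalg.norm(theta[2:]))  # both remain 1
```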
<h2 id="riemannianadaptiveoptimizationmethods">Riemannian Adaptive Optimization Methods</h2>
<p>The successful applications of Riemannian manifolds in machine learning impelled Gary
Bécigneul and Octavian Ganea to further generalize adaptive optimization algorithms such
as ADAM, ADAGRAD and AMSGRAD to products of Riemannian manifolds. For the details of the adaptive
optimization methods I refer you to their paper <a class="citation" href="#riemannianadaptive">[3]</a>. A ready-to-use PyTorch implementation of their proposed optimization algorithms, along with
the implementation of several manifolds, has been published on GitHub by Maxim Kochurov in his
geometric optimization library called geoopt:
<ul>
<li>
<p><a href="https://github.com/geoopt/geoopt/blob/master/geoopt/optim/rsgd.py">Riemannian SGD</a></p>
</li>
<li>
<p><a href="https://github.com/geoopt/geoopt/blob/master/geoopt/optim/radam.py">Riemannian ADAM</a></p>
</li>
</ul>
<p>We’ll use the Riemannian ADAM (RADAM) optimizer in the code example that follows in order to see
how to perform Riemannian optimization in product spaces with geoopt.</p>
<h2 id="codeexampleforriemannianoptimizationinproductspaces">Code Example for Riemannian Optimization in Product Spaces</h2>
<p>Here’s a simple code example that shows how to perform Riemannian optimization in
a product space. In this example, we’ll optimize the embedding of a graph $G$
that is a <em>cycle</em> of $n=20$ nodes, such that the original graph distances $d_G(x_i,x_j)$ are
preserved as well as possible in the geodesic distances $d_{\cP}(x_i,x_j)$ of the arrangement of
the embeddings in the product space $\cP$. The product space that we’ll choose in our example is
a torus (a product of two circles), and the loss that we’ll optimize is just the squared loss
between the graph and product-space distances:</p>
<script type="math/tex; mode=display">\cL(\vtheta)=\sum_{i,j} \left(d_G(x_i,x_j)d_{\cP}(x_i,x_j)\right)^2.</script>
<p>The following plot shows how the positions of the embeddings evolve over time and finally
arrange in a setting that approximates the original graph distances of the cycle graph.</p>
<div class="figurewithcaption">
<img src="/img/20191015StochasticGradientDescentonRiemannianManifolds/graphembedding.gif?v=2" alt="Evolution of Graph Embedding in Product Space" width="70%" datadesktopwidth="70%" datamobilewidth="100%" />
<div class="figurecaption" style="textalign:center">
Evolution of Graph Embedding in Product Space
</div>
</div>
<p>Here’s the code that shows how the optimization of this graph embedding is performed:</p>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">geoopt</span> <span class="kn">import</span> <span class="n">ManifoldTensor</span><span class="p">,</span> <span class="n">ManifoldParameter</span>
<span class="kn">from</span> <span class="nn">geoopt.manifolds</span> <span class="kn">import</span> <span class="n">SphereExact</span><span class="p">,</span> <span class="n">Scaled</span><span class="p">,</span> <span class="n">ProductManifold</span>
<span class="kn">from</span> <span class="nn">geoopt.optim</span> <span class="kn">import</span> <span class="n">RiemannianAdam</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">pi</span><span class="p">,</span> <span class="n">cos</span><span class="p">,</span> <span class="n">sin</span>
<span class="kn">from</span> <span class="nn">mayavi</span> <span class="kn">import</span> <span class="n">mlab</span>
<span class="kn">import</span> <span class="nn">imageio</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="c1"># CREATE CYCLE GRAPH ###########################################################
</span>
<span class="c1"># Here we prepare a graph that is a cycle of n nodes. We then compute all
</span>
<span class="c1"># pairwise graph distances because we'll want to learn an embedding that embeds the
</span>
<span class="c1"># vertices of the graph on the surface of a torus, such that the distances of
</span>
<span class="c1"># the induced discrete metric space of the graph are preserved as well as
</span>
<span class="c1"># possible through the positioning of the embeddings on the torus.
</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">training_examples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="c1"># only consider pairwise distances below diagonal of distance matrix
</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
<span class="c1"># determine the distance between vertices i and j
</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">i</span><span class="o">-</span><span class="n">j</span>
<span class="k">if</span> <span class="n">d</span> <span class="o">></span> <span class="n">n</span><span class="o">//</span><span class="mi">2</span><span class="p">:</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">n</span><span class="o">-</span><span class="n">d</span>
<span class="c1"># scale down distance
</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">d</span> <span class="o">*</span> <span class="p">((</span><span class="mi">2</span> <span class="o">*</span> <span class="n">pi</span> <span class="o">*</span> <span class="mf">0.3</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span>
<span class="c1"># add edge and weight to training examples
</span>
<span class="n">training_examples</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">,</span><span class="n">d</span><span class="p">))</span>
<span class="c1"># the training_examples now consist of a list of triplets (v1, v2, d)
</span>
<span class="c1"># where v1, v2 are vertices, and d is their (scaled) graph distance
</span>
<span class="c1"># CREATION OF PRODUCT SPACE (TORUS) ############################################
</span>
<span class="c1"># create first sphere manifold of radius 1 (default)
</span>
<span class="c1"># (the Exact version uses the exponential map instead of the retraction)
</span>
<span class="n">r1</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="n">sphere1</span> <span class="o">=</span> <span class="n">SphereExact</span><span class="p">()</span>
<span class="c1"># create second sphere manifold of radius 0.3
</span>
<span class="n">r2</span> <span class="o">=</span> <span class="mf">0.3</span>
<span class="n">sphere2</span> <span class="o">=</span> <span class="n">Scaled</span><span class="p">(</span><span class="n">SphereExact</span><span class="p">(),</span> <span class="n">scale</span><span class="o">=</span><span class="n">r2</span><span class="p">)</span>
<span class="c1"># create torus manifold through product of two 1-dimensional spheres (actually
</span>
<span class="c1"># circles) which are each embedded in a 2D ambient space
</span>
<span class="n">torus</span> <span class="o">=</span> <span class="n">ProductManifold</span><span class="p">((</span><span class="n">sphere1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="n">sphere2</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="c1"># INITIALIZATION OF EMBEDDINGS #################################################
</span>
<span class="c1"># init embeddings. sidenote: this initialization was mostly chosen for
</span>
<span class="c1"># illustration purposes. you may want to use better initialization
</span>
<span class="c1"># strategies for the particular product space that you work with.
</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="nb">abs</span><span class="p">()</span><span class="o">*</span><span class="mf">0.5</span>
<span class="c1"># augment embeddings tensor to a manifold tensor with a reference to the product
</span>
<span class="c1"># manifold that they belong to such that the optimizer can determine how to
</span>
<span class="c1"># convert the derivatives of pytorch to the correct Riemannian gradients
</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">ManifoldTensor</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">manifold</span><span class="o">=</span><span class="n">torus</span><span class="p">)</span>
<span class="c1"># project the embeddings onto the spheres' surfaces (in-place) according to the
</span>
<span class="c1"># orthogonal projection from ambient space onto the sphere's surface for each
</span>
<span class="c1"># spherical factor
</span>
<span class="n">X</span><span class="o">.</span><span class="n">proj_</span><span class="p">()</span>
<span class="c1"># declare the embeddings as trainable manifold parameters
</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">ManifoldParameter</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c1"># PLOTTING FUNCTIONALITY #######################################################
</span>
<span class="c1"># array storing screenshots
</span>
<span class="n">screenshots</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># torus surface
</span>
<span class="n">phi</span><span class="p">,</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mgrid</span><span class="p">[</span><span class="mf">0.0</span><span class="p">:</span><span class="mf">2.0</span> <span class="o">*</span> <span class="n">pi</span><span class="p">:</span><span class="mf">100j</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">:</span><span class="mf">2.0</span> <span class="o">*</span> <span class="n">pi</span><span class="p">:</span><span class="mf">100j</span><span class="p">]</span>
<span class="n">torus_x</span> <span class="o">=</span> <span class="n">cos</span><span class="p">(</span><span class="n">phi</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">r1</span> <span class="o">+</span> <span class="n">r2</span> <span class="o">*</span> <span class="n">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span>
<span class="n">torus_y</span> <span class="o">=</span> <span class="n">sin</span><span class="p">(</span><span class="n">phi</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">r1</span> <span class="o">+</span> <span class="n">r2</span> <span class="o">*</span> <span class="n">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span>
<span class="n">torus_z</span> <span class="o">=</span> <span class="n">r2</span> <span class="o">*</span> <span class="n">sin</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
<span class="c1"># embedding point surface
</span>
<span class="n">ball_size</span> <span class="o">=</span> <span class="mf">0.035</span>
<span class="n">u</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">pi</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">ball_x</span> <span class="o">=</span> <span class="n">ball_size</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">outer</span><span class="p">(</span><span class="n">cos</span><span class="p">(</span><span class="n">u</span><span class="p">),</span> <span class="n">sin</span><span class="p">(</span><span class="n">v</span><span class="p">))</span>
<span class="n">ball_y</span> <span class="o">=</span> <span class="n">ball_size</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">outer</span><span class="p">(</span><span class="n">sin</span><span class="p">(</span><span class="n">u</span><span class="p">),</span> <span class="n">sin</span><span class="p">(</span><span class="n">v</span><span class="p">))</span>
<span class="n">ball_z</span> <span class="o">=</span> <span class="n">ball_size</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">outer</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="n">u</span><span class="p">)),</span> <span class="n">cos</span><span class="p">(</span><span class="n">v</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">plot_point</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
    <span class="n">point_color</span> <span class="o">=</span> <span class="p">(</span><span class="mi">255</span><span class="o">/</span><span class="mi">255</span><span class="p">,</span> <span class="mi">62</span><span class="o">/</span><span class="mi">255</span><span class="p">,</span> <span class="mi">160</span><span class="o">/</span><span class="mi">255</span><span class="p">)</span>
    <span class="n">mlab</span><span class="o">.</span><span class="n">mesh</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">ball_x</span><span class="p">,</span> <span class="n">y</span> <span class="o">+</span> <span class="n">ball_y</span><span class="p">,</span> <span class="n">z</span> <span class="o">+</span> <span class="n">ball_z</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">point_color</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">update_plot</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="c1"># transform embedding (2D x 2D)-coordinates to 3D coordinates on the torus
</span>
    <span class="n">cos_phi</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">r1</span>
    <span class="n">sin_phi</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">r1</span>
    <span class="n">xx</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">cos_phi</span> <span class="o">*</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">r2</span>
    <span class="n">yy</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">sin_phi</span> <span class="o">*</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">r2</span>
    <span class="n">zz</span> <span class="o">=</span> <span class="n">r2</span> <span class="o">*</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">3</span><span class="p">]</span>
    <span class="c1"># create figure
</span>
    <span class="n">mlab</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">700</span><span class="p">,</span> <span class="mi">500</span><span class="p">),</span> <span class="n">bgcolor</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="c1"># plot torus surface
</span>
    <span class="n">torus_color</span> <span class="o">=</span> <span class="p">(</span><span class="mi">0</span><span class="o">/</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="o">/</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="o">/</span><span class="mi">255</span><span class="p">)</span>
    <span class="n">mlab</span><span class="o">.</span><span class="n">mesh</span><span class="p">(</span><span class="n">torus_x</span><span class="p">,</span> <span class="n">torus_y</span><span class="p">,</span> <span class="n">torus_z</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">torus_color</span><span class="p">,</span> <span class="n">opacity</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
    <span class="c1"># plot embedding points on torus surface
</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">plot_point</span><span class="p">(</span><span class="n">xx</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">yy</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">zz</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="c1"># save screenshot
</span>
    <span class="n">mlab</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">azimuth</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">elevation</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">distance</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">focalpoint</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.2</span><span class="p">))</span>
    <span class="n">mlab</span><span class="o">.</span><span class="n">gcf</span><span class="p">()</span><span class="o">.</span><span class="n">scene</span><span class="o">.</span><span class="n">_lift</span><span class="p">()</span>
    <span class="n">screenshots</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">mlab</span><span class="o">.</span><span class="n">screenshot</span><span class="p">(</span><span class="n">antialiased</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="n">mlab</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="c1"># TRAINING OF EMBEDDINGS IN PRODUCT SPACE ######################################
</span>
<span class="c1"># build RADAM optimizer and specify the embeddings as parameters.
</span>
<span class="c1"># note that the RADAM can also optimize parameters which are not attached to a
</span>
<span class="c1"># manifold. then it just behaves like the usual ADAM for the Euclidean vector
</span>
<span class="c1"># space. we stabilize the embedding every step, which orthogonally projects
</span>
<span class="c1"># the embedding points onto the manifold's surface after the gradient updates to
</span>
<span class="c1"># ensure that they lie precisely on the surface of the manifold. this is needed
</span>
<span class="c1"># because the parameters may get slightly off the manifold's surface for
</span>
<span class="c1"># numerical reasons. Not stabilizing may introduce small errors that accumulate
</span>
<span class="c1"># over time.
</span>
<span class="n">riemannian_adam</span> <span class="o">=</span> <span class="n">RiemannianAdam</span><span class="p">(</span><span class="n">params</span><span class="o">=</span><span class="p">[</span><span class="n">X</span><span class="p">],</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">stabilize</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># we'll just use this as a random example sampler to get some stochasticity
</span>
<span class="c1"># in our gradient descent
</span>
<span class="n">num_training_examples</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">training_examples</span><span class="p">)</span>
<span class="n">training_example_indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">num_training_examples</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">get_subset_of_examples</span><span class="p">():</span>
    <span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">training_example_indices</span><span class="p">,</span>
                                 <span class="n">size</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">num_training_examples</span><span class="o">/</span><span class="mi">4</span><span class="p">),</span>
                                 <span class="n">replace</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="c1"># training loop to optimize the positions of embeddings such that the
</span>
<span class="c1"># distances between them become as close as possible to the true graph distances
</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">)):</span>
    <span class="c1"># zero out the gradients
</span>
    <span class="n">riemannian_adam</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="c1"># compute loss for next batch
</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
    <span class="n">indices_batch</span> <span class="o">=</span> <span class="n">get_subset_of_examples</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">indices_batch</span><span class="p">:</span>
        <span class="n">v_i</span><span class="p">,</span> <span class="n">v_j</span><span class="p">,</span> <span class="n">target_distance</span> <span class="o">=</span> <span class="n">training_examples</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="c1"># compute the current distance between the embeddings in the product
</span>
        <span class="c1"># space (torus)
</span>
        <span class="n">current_distance</span> <span class="o">=</span> <span class="n">torus</span><span class="o">.</span><span class="n">dist</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">v_i</span><span class="p">,:],</span> <span class="n">X</span><span class="p">[</span><span class="n">v_j</span><span class="p">,:])</span>
        <span class="c1"># add squared difference of current and target distance to the loss
</span>
        <span class="n">loss</span> <span class="o">+=</span> <span class="p">(</span><span class="n">current_distance</span> <span class="o">-</span> <span class="n">target_distance</span><span class="p">)</span><span class="o">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="c1"># compute derivative of loss w.r.t. parameters
</span>
    <span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="c1"># let RADAM convert the derivatives to Riemannian gradients and take the step
</span>
    <span class="n">riemannian_adam</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
    <span class="c1"># plot current embeddings
</span>
    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">update_plot</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">detach</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">())</span>
<span class="c1"># CREATE ANIMATED GIF ##########################################################
</span>
<span class="n">imageio</span><span class="o">.</span><span class="n">mimsave</span><span class="p">(</span><span class="n">f</span><span class="s">'training.gif'</span><span class="p">,</span> <span class="n">screenshots</span><span class="p">,</span> <span class="n">duration</span><span class="o">=</span><span class="mi">1</span><span class="o">/</span><span class="mi">24</span><span class="p">)</span>
</code></pre></div></div>
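<p>For intuition about what <code class="language-plaintext highlighter-rouge">torus.dist</code> computes: on a product manifold, the squared geodesic distance is the sum of the squared geodesic distances of the factors. The following is a minimal numpy sketch of this decomposition for the torus above, independent of geoopt, with each circle parametrized by its angle (the function names are mine, not the library's):</p>

```python
import numpy as np

def circle_dist(a, b, r=1.0):
    """Geodesic (arc-length) distance between angles a and b on a circle of radius r."""
    d = np.abs(a - b) % (2 * np.pi)
    return r * np.minimum(d, 2 * np.pi - d)

def torus_dist(p, q, r1=1.0, r2=0.3):
    """Distance on S^1(r1) x S^1(r2): the l2-combination of the factor distances."""
    d1 = circle_dist(p[0], q[0], r1)
    d2 = circle_dist(p[1], q[1], r2)
    return np.sqrt(d1 ** 2 + d2 ** 2)

# moving only along one factor reduces to that circle's own distance
print(torus_dist((0.0, 0.0), (np.pi / 2, 0.0)))  # pi/2 on the large circle
print(torus_dist((0.0, 0.0), (0.0, np.pi)))      # 0.3*pi on the small circle
```

This also explains why the scaled factor matters: halving the angle difference on the small circle changes the distance far less than the same angle difference on the large one.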
<p>Of course, one might choose better-suited geometries to embed a cycle graph,
and a better embedding could have been achieved if the points had wrapped
around the curved tube of the torus. This example was chosen mainly as an
illustrative minimal working example to get you started with Riemannian
optimization in product spaces. The work of Gu et al.
<a class="citation" href="#productspaces">[4]</a> extensively studies how well
products of spaces of constant curvature are suited to learning
distance-preserving embeddings of real-world graphs.</p>
<p>That’s all for now. I hope that my motivation and explanation of RSGD were helpful
and that you are now ready to get started with Riemannian optimization.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="douik2018manifold">A. Douik and B. Hassibi, “Manifold Optimization Over the Set of Doubly Stochastic Matrices: A Second-Order Geometry,” <i>arXiv preprint arXiv:1802.02628</i>, 2018.</span></li>
<li><span id="bonnabel">S. Bonnabel, “Stochastic gradient descent on Riemannian manifolds,” <i>IEEE Transactions on Automatic Control</i>, vol. 58, no. 9, pp. 2217–2229, 2013.</span></li>
<li><span id="riemannianadaptive">G. Bécigneul and O.-E. Ganea, “Riemannian adaptive optimization methods,” <i>arXiv preprint arXiv:1810.00760</i>, 2018.</span></li>
<li><span id="productspaces">A. Gu, F. Sala, B. Gunel, and C. Ré, “Learning Mixed-Curvature Representations in Product Spaces,” 2018.</span></li></ol>
<p>Andreas Bloch, with the assistance of Octavian Ganea and Gary Bécigneul</p>