Review of "The geometry of non-linear stochastic gradient descent" a paper by K.A. Krakowski, R.E. Mahony, R.C. Williamson, M.K. Warmuth --------- Reviewer 1 recommends accepting the paper after major revisions Reviewer 2 recommends accepting the paper as is. Reviewer 3 recommends rejecting the paper but is willing to reconsider it after major revisions. As JMLR does not allow for "Reconsider after major revision" but only for "Reconsider after major revisions" I am rejecting the paper. I encourage the authors to rewrite the paper from scratch, taking into account the comments of the reviewers. If you decide to submit the paper to JMLR, please include in the cover letter a reference to this submission. ------------------------------------------------------------ review 1 This paper studies generalization of the famous "gradient descent" online-learning algorithm where the domain, or "decision set" of the online learner is a general Riemannian manifold. Geometry is intimately linked to online learning, and in particular the geometry of the decision set. As such, the paper has potential to be a superb research paper, but at its current state I think it is half-baked. I appreciate the theoretical contribution presented, and think that the introduction of Riemannian geometry could prove to be extremely useful in the future, thus would recommend acceptance subject to the comments and recommendations below. 1. There is no mention of computational efficiency in the paper, and this is a great caveat. When is the geometric variant of GD efficient ? If the manifold is not a convex set I doubt this, or any algorithm for that matter, can be made efficient. Please address and say when it's efficient and when lower bounds can be shown. 2. Page 4 references some "sampling process". In the online setting this is irrelevant. I suggest using game-theoretic notation, and regret as the performance measure. You can say it generalizes the PAC and other statistical learning models. 3. presentation - I consider the major contribution of this paper to be the introduction of the geometric concepts. As such this is not well presented. A general rule - do not define concepts till the place where you use them. Is there any use for "Gauss's great theorem" ? If yes, you can explain it where it is used. If not, you can put this and other irrelevant (though interesting) facts in separate section, warning the reader that these concepts are not used. (if they do provide crucial insight, then such facts should be stated, but again with a clear separation of what is used and what not). On the same issue, the basic concepts are not well presented. Section 2 is extremely dense and at first reading very hard to make out almost anything from it. Give plenty of examples, this is perhaps the most important section of the paper, given that it's contribution is conceptual. Give examples of everything starting from manifolds to "exponential maps". 4. Results - It is not clear if the paper gives algorithms with better bounds than the state of the art. I imagine that this should be the case at least for some manifolds. If so, state it prominently. Define the parameters of manifolds which are important for performance bounds, and state previous and current algorithms bounds in such terms. I suggest given a short formal statement of results in the beginning. 5. Related work - The geometry of the decision set was recently considered in a related paper of Abernethy, Hazan and Rakhlin al in the latest COLT 2008. Their notion of geometry was to use the characterization of self-concordant barriers, and in their application this gave substantial benefit in performance (which indicates maybe that the paper's geometric framework can be beneficial, after all, the motivation should be better performance). At the very least this paper should be referenced, but it would be great to draw connections (I realize this may not be time-feasible for the current paper). ----------------------------------------------------------------- review 2 Short Summary: The paper provides a study of gradient descent algorithms which respect the Riemannian geometry of the manifold. Diffeomorphisms (isomorphisms for smooth manifolds) between manifolds naturally suggest an equivalence relation on algorithms. Regret bounds are proved for explicit and implicit updates which closely follow the corresponding results in R^n dating back to the 90's. These algorithms originate from minimization of the loss plus a divergence to the previous point. Finally, the authors show that link functions (studied previously) provide a diffeomorphism between the manifold and R^n if the manifold has zero curvature. Conclusion: I think the paper provides a nice generalization of known results to manifolds and is suitable for publication. Informally, the ideas can be summarized as follows. Look at a manifold with its Riemannian metric. This induces a notion of distances on the manifold. Take the divergence to be this distance squared (which leads to GD in R^n). An previously studied idea of minimizing a "loss plus divergence" leads to an algorithm with a bound on its regret; the proof, on a high level, is very similar to the proof for the Euclidean metric. Here, the geometric interpretation of the explicit and implicit updates is very nice. The divergence terms in the regret bound need, at the end of the day, to be bounded using properties of the specific manifold. Link functions turn out to provide a diffeomorphism with R^n, thus giving a handle on the regret terms. Not surprisingly, the authors recover the known results for EG and GD (gradient descent). Novelty: In some sense, the results are straightforward, yet require some basic knowledge of differential geometry. I would characterize the main contribution of the paper as "generalizing the known results to manifold distances". The connection between isometries and link functions is nice. Overall, the paper serves as a starting point for obtaining interesting results in specific settings. One interesting algorithm is introduced, which arises from looking at the simplex as diffeomorphic to the positive orthant of a sphere. In addition to my concerns about the validity of this construction (see below), it is not clear why this leads to a better algorithm. Significance: I believe this work provides an important generalization of known results to the manifold domain. Clarity: The paper is well-written. Specific comments: 1. p. 9, before Example 3: It would be good to provide more information about Fisher Information metric of the multinomial distribution. 2. please, define the pullback map 3. In Example 3, the simplex is mapped to only a positive orthant of the sphere. This fact is missing from the notation. Furthermore, this means that the mapping is not a diffeomorphism with the whole sphere. What happens if the algorithm for the sphere creates a point w_{t+1} outside of the positive orthant? How is it mapped back to the simplex? Is Corollary 5 valid? 4. The assumption in Theorem 7 that the image of \gamma_t is \lambda_t-convex essentially means that it must hold for every geodesic, as w^* is unknown. 5. The implicit update algorithm is essentially what has been termed as "to be the leader" algorithm, as it "knows the future". It has been shown many times before that this algorithm suffers only a constant regret (independent of T), which is similar to the statement of Theorem 9. It would be nice to mention these connections to the known results. 10. Please, define E_l and E_m on page 20. Typos: 1. p.2, 3rd paragraph: "the view that the geometric algorithm are" -> "algorithms" 2. p. 6, Remark 2: on-linear -> on-line 3. p. 6, 2nd paragraph in 3.2: "for given a sample" -> "for a given sample" 4. p. 13, last line: "it make sense" -> "it makes sense" 5. p. 22, last line: "it follows the the distance between two consecutive point" ---------------------------------------------------------- review 3 Review of "The geometry of non-linear stochastic gradient descent" a paper by K.A. Krakowski, R.E. Mahony, R.C. Williamson, M.K. Warmuth The paper studies stochastic gradient descent, and a variant based on an implicit update, in the case when the weights belong to a Riemannian manifold. Stochastic gradient descent (with implicit or explicit updates) are well studied in the case of weights belonging to the Euclidean space or to some RKHS. A different generalization is provided by link functions (introduced by one of the authors), by which we can reparametrize the weights of stochastic gradient descent obtaining variants as p-norm gradient descent or exponentiated gradient. In a previous paper by two of the authors the framework of stochastic gradient descent on manifold was introduced, and link functions were shown to be special cases. This paper continues the investigation, proving relative loss bounds for stochastic gradient descent with both types of updates. Moreover, the relationship with link functions is futher explicitated, and the known relative loss bounds are recovered as special cases of the manifold framework. My overall impression on this paper is negative for three main reasons. First, the algorithmic framework described in the paper fails to provide concrete interesting instances of new algorithms. The generality of the approach is not exploited to design new updates that are shown to be effective in specific domains. Also, no really new bounds are presented, and no new proof techniques are introduced. Second, although one of the paper's main contributions is the explanation of the connection between Riemannian manifolds and learning, the writing is not particularly friendly to the profane reader. In my opinion this is a paper written for people who are already quite familiar with the subject of Riemannian manifolds (not the majority of JMLR readers perhaps). Third, although this paper has been submitted seven years after the initial work on preferential structures for online learning, much of it appears to reiterate concepts and notions that already appeared in that earlier paper. This might be a false impression, but in all cases I think the presentation should be improved in order to better distinguish the conceptual novelties that mark the progress and the insights gained in this paper. In conclusion, even if the last two comments essentially involve presentational issue, the first one is meant to be more substantial. My recommendation is thus rejection, although I am willing to reconsider the paper after a major revision addressing all three comments.