# Immersions, submersions, embeddings

Some tldr-excerpts from Lee and Spivak formalizing embeddings and stuff.

Topological embedding -- an injective continuous map $f: A\to X$ that is also a homeomorphism onto its image $f(A)$. We can think of $f(A)$ "as a homeomorphic copy of $A$ in $X$" (Lee, 2011).

A smooth map $F: M\to N$ is said to have rank $k$ at $p \in M$ if the linear map ${F_*}_{T_p M}$ ( the pushforward) has rank $k$. $F$ is of constant rank $k$ if it is of rank $k$ at every point.

Immersion -- smooth map $F:M \to N$ whose pushforward $F_*$ is injective at every point, that is $\operatorname{rank} F = \operatorname{dim} M$.

Submersion -- smooth map $F:M \to N$ whose pushforward is surjective at every point, that is $\operatorname{rank} F = \operatorname{dim} N$.

(Smooth) Embedding (of a manifold) -- an injective immersion $F: M\to N$ that is also a topological embedding.

So, a map $F: M\to N$ is an embedding, if

1. $\operatorname{rank}F = \operatorname{dim} M$,
2. $F$ is injective,
3. $F$ is a homeomorphism onto $F(M)$ with subspace topology.

# Cone

A cone over topological space $X$ is the quotient

A point $a$ of that cone can be identified with a point $x\in X$ and the distance $\lvert Ox \rvert$ to origin (the apex fiber) $O = X{\times}\{0\}$.

The reason I care about cones is the notion of the tangent cone of a metric space at a point.

# Further procrastination

• Just learned Sussman (author of SICP) also authored a monograph on differential geometry

• And on classical mechanics

• Moreover, the former followes the concept of Turtle Geomtry and states in its Prologue the approach I admired most since my early childhood: learning things by programming them, thus forcing oneself to be precise and exact in judgements and claims. I'm recalling right now again that first "lecture" on elementary notions of set theory the summer before admission to VSU... Constructing function as a set so it becomes more "tangible" an object. The Katharsis that followed. I didn't realize back then that it's same as in programming. Five years I've been living with guilt and shame that I started as a coder and not a Mathematician. Five years I felt programming is disgusting and despisable thing to do. And only now I truly realize that the thing I loved about it in those first years is the same thing I've fallen in love with Mathematics for that summer of 2014.

• Also stumbled upon a tweet mentioning the following interpretation of Laplace operator as measuring average sign of a function around the point. Sort of trivial, and resembles how we derive sufficient min/max conditions, yet I did not notice.

• Majority of these I found in: JAX cookbook

• Update! Accidentally found these slides by Absil giving some historical propspect on the subject

• For instance, the slides mention Luenberger (1973) stating that "we'd perform line search along geodesics... if'twere feasible". Now we're closer to the roots of the whole thing

# MD is not RSGD, but RSGD also does M from MD

The whole idea of trying to parallel mirror descent with following geodesics as in RSGD has come to naught. And not the way one would expect, because MD still seems "type-correct" and RSGD doesn't yet. Long story short: in RSGD we're pulling back COTANGENTS but updating along a TANGENT.

Update! Before updating, we're raising an index of cotangent by applying inverse metric tensor, thus making it a tangent! Thanks to @ferrine for the idea.

Following $\mathbb{R}^m\to\mathbb{R}^n$ analogy of previous posts:

\begin{equation*} F:M\to N, \end{equation*}
\begin{equation*} \xi = F^*\eta\in\mathcal{T}^*M,~\text{for}~\eta\in\mathcal{T}^*N, \end{equation*}
\begin{equation*} X = \xi^\sharp = g^{-1}(\xi) = g^{-1} \xi^\top~\text{so it becomes a column}. \end{equation*}

# Pullbacks

Going on with trivialities. I still haven't finished my story with autodiff, and since I've got to code it for DL homework... Well.

So, we consider $F:M{=}\mathbb{R}^m\to N{=}\mathbb{R}^n$. Tangent spaces look exactly like original spaces except for where they go in multiplication: for $g:N\to\mathbb{R}$ a tangent vector $Y\in\mathcal{T}N$ is basically $\mathbb{R}^n$ and acts by $Y(g) = \langle \nabla_{F(p)} g, Y\rangle$.

Pushforward $F_*:\mathcal{T}M\to \mathcal{T}N$ is defined by $(F_* X)(g) = X(g\circ F)$ for $X\in\mathcal{T} M \sim \mathbb{R}^m$. Thus

\begin{equation*} \begin{split} (F_* X)(g) &= X(g\circ F) = \langle \nabla_{p} g\circ F, x\rangle \\ &= \left\langle ( \left. DF \right|_p \left.Dg\right|_{F(p)} )^\top, x\right\rangle \\ &= \langle {\underbrace{J_p(F)}_{\left. DF \right|_p}}^\top \underbrace{\nabla_{F(p)}}_{\left.Dg\right|_{F(p)}^\top} g, x \rangle \\ &= \left\langle \nabla_{F(p)}g, J_p(F)x \right\rangle . \end{split} \end{equation*}

Here we use $D$ to denote Fr'echet derivatives (a linear map) and nabla to denote gradient (a vector -- the Riescz representation of that linear map) and we also identify linear map $DF$ with Jacobian matrix $J(F)$. Also I denote $X$ casted to $\mathbb{R}^m$ as just $x$. I don't like how I'm already using too many different notations (after all, that's what I scorn differential geometers at for) but at the moment it seems fit.

So, basically the equation above means that in Euclidean case pushforward $F_*$ acts on tangent $X$ merely by multiplying with Jacobian $J_p(F)$. In terms of matrix multiplication, $F_* X$ is just $J_p(F) x\in\mathbb{R}^n$

Further, pullback $F^*$ by definition maps right cotangent $\xi\in\mathcal{T}_{F(p)}^*N$ into left cotangent $F^*\xi\in\mathcal{T}_p M$ which acts as: $(F^*\xi) X = \xi(F_*X)$.

\begin{equation*} \begin{split} (F^*\xi) X &= \xi(F_*X)\\ &= \xi^\top J_p(F)x . \end{split} \end{equation*}

That is, pullbacked cotangent is just $\xi^\top J_p(F)\in (\mathbb{R}^m)^*$ (acting on $x$ from the left) and pullback $F^*$ itself is still the same $J_p(F)$ except acting on cotangents from the left. It is equivalent to say that pullback acts on $\operatorname{column}(\xi)$ as transposed Jacobian $J_p(F)^\top$:

\begin{equation*} \operatorname{column}(F^*\xi) = J_p(F)^\top \operatorname{column}(\xi). \end{equation*}

Now, why pulling back in gradient descent? Take $F:\mathbb{R}^n \to\mathbb{R}$. Right cotangent is just a number $\alpha$. Left cotangent would be $\alpha J_p(F)$. It is such that

\begin{equation*} F^*\alpha X = \alpha(F_* X) = \alpha J_p(F)X \end{equation*}

What happens when we pull, transpose, and then push?

# Dual to dual

Just realized how double dual isn't really the original space even in simple Euclidean case (which I was aware of, but some how didn't feel I was understanding): with vector being a column $x\in\mathbb{R}^n$, dual a row $x^\top\in(\mathbb{R}^n)^*$, the double dual $x^{\top\top}$ is indeed a column, except when it acts on the rows in multiplication it goes TO THE RIGHT and not to the left:

\begin{equation*} \begin{split} x^\top(y) &= x^\top y,\\ x^{\top\top}(y^{\top}) &= y^\top x^{\top\top} = y^\top x. \end{split} \end{equation*}

This contrasts with rows acting on columns, where the dual (the row) acts from the left.

So exciting.

Update! It's much more than that! If we treat $\mathbb{R}^n$ as a manifold, it turns out then that its tangent space looks more like double dual $(\mathbb{R}^n)^{**}$ rather than $\mathbb{R}^n$ or $(\mathbb{R}^n)^*$, because when we consider a tangent vector acting on scalar functions $\mathbb{R}^n\to\mathbb{R}$ in the special -- linear -- case, the tangent goes on the right and it does so as a column. Take a scalar function $(\mathbb{R}^n)^* \ni a^\top : \mathbb{R}^n \to \mathbb{R}$. Then a tangent $X\in\mathcal{T}\mathbb{R}^n$ should act on $a$ from the right:

\begin{equation*} X(a) = \left.\partial_t(t\mapsto a(x) + a^\top t X)\right|_{t=0} = a^\top X. \end{equation*}

Here $x\in\mathbb{R}^n$ denotes $X\in\mathcal{T}\mathbb{R}^n$ casted to $\mathbb{R}^n$

# SKDL19L2 take-outs

• Dropout is similar to ensembling

• Conjugate gradients are akin to momentum, mixing new direction with previous ones. That is quite different from my previous intuition that we "always move in new direction" which was correct... up to the metric.

• Nesterov is about two things:

• It is better to use gradient at the destination point than the gradient at the origin

• With large momentum, we got a meaningful enough estimate of where we end up to compute that gradient

• Mentioned this blog post again

• Again this bullshit about SGD being "incorrect". It's not that SGD is not correct, it's that you define it the wrong way (being aware that it should be done in different way) and omit the explicit isomorphism.

# Betancourt: higher-order autodiff

Just stumbled upon an open tab with Betancourt's "Geometric theory of higher order automatic differentiation" which I started reading the new year night but quickly got distracted from. I remember my first feeling was that it's slightly more verbose than actually needed and that I'd have used some different wordings. While this might be true, I'm finding the survey in its intro very clear and explaining. I must remind my imaginary interlocutor that I have not as of yet went through any course of differential geometry, only skimmed some textbooks and wikis. I've been really struggling. Mainly because of terrible notation and language established by physicists, I believe, and employed in most classic texts. There are exceptions of course. Some of seemingly good texts include e.g. works of Lee. I think I would've solved my problems if I read carefully Lee's monographs and walked through exercises therein. I'm not yet ready to make this effort (not ready to do anything at all).

As for higher-order differential geometric structures, I have only encountered jets when reading Arnold's "lectures on PDEs", where they were in fact treated (if my memory doesn't deceive me) as "things" that appear in Taylor expansions, without actually specifying their "datatypes".

Now, here's what I actually wanted to remember when I started typing this note: a good survey rapidly introducing principal concepts before verbose and detailed body of a text makes a lot of difference. Lee, say, pours on you quite an amount of information that makes use of terms that haven't been yet made concrete. And that information might give you a good intuition if you already got some very basic framework of concepts and notations to add new nodes and connections to. Betancourt on the other hand throws "pullbacks" and "pushforwards" at you in the very beginning. He throws them pretty concretized, almost tangible, in the sense that he defines domains and ranges of the mappings, and essential properties of their actions. He doesn't spend too much time on it, doesn't overload the reader with questions like existence. These questions are important later for rigorous analysis, but not for sketching the initial map of the field, not for initial understanding of connections to other concepts.

Brain is a terribly lazy thing. At least mine is. The art of writing (when the purpose of writing is to explain a subject and convey a message) is in hacking reader's brain so as to leave it no chance to object comprehending the message. That's an obvious thing that I always knew too. Yet I tend to forget this when actually writing.

# Eric Moulines' mini-course

I was a fool enough to miss most of Eric's course. Well, just one lecture out of two, technically. But still. Though it wasn't much of my fault either: that Saturday had been reserved long time ago.

So, I listened to his talk last Wednesday which wasn't exactly impressive, then missed the first lecture on MC, and today finally arrived for his last lecture which was really cool. He's a great reader!

Now, I'm to remind you that I don't know much about measures and probabilities, it's not my field. So most of the stuff that I found cool might in fact be quite trivial. But here are things that were new to me and seemed cool:

• Hamiltonian paths are closed. For this reason when doing HMC we got to be taking random lengths lest our iterations end up just where they begin
• Total Variation is actually a norm on measures which is dual to $\infty$-norm on functions! Really lovely. And quite useful, because I couldn't persuade myself to try and really read the definition of TV for along time now. You know, analytic definitions always make you feel disrespected -- it is as if the author didn't care enough to pre-process his idea into a form convenient enough to be taken as an apparent geometrical truth.
• There exists a coupling formulation for TV: For measures $\xi, \xi' \in \mathbb{M}_1(\mathbb{X}, \mathcal{X})$ there exist $\mathbb{X}$-valued random variables (on some prob. space) such that
• We've also covered Dobrushin coefficient.
• We've even applied Banach's fixed-point theorem! Felt like I was home. Not all this stochastic-differential-rubbish that they don't care to define signatures of!