Posts about main


Immersions, submersions, embeddings

Some tldr-excerpts from Lee and Spivak formalizing embeddings and stuff.

Topological embedding -- an injective continuous map \(f:A\to X\) that is also a homeomorphism onto its image \(f(A)\subset X\). We can think of \(f(A)\) "as a homeomorphic copy of \(A\) in \(X\)" (Lee, 2011).

A smooth map \(F:M\to N\) is said to have rank \(k\) at \(p\in M\) if the linear map \(F_*:\mathcal{T}_pM\to\mathcal{T}_{F(p)}N\) (the pushforward) has rank \(k\). \(F\) is of constant rank if it is of rank \(k\) at every point.

Immersion -- smooth map \(F:M\to N\) whose pushforward is injective at every point, that is \(\operatorname{rank} F = \dim M\).

Submersion -- smooth map \(F:M\to N\) whose pushforward is surjective at every point, that is \(\operatorname{rank} F = \dim N\).

(Smooth) Embedding (of a manifold) -- an injective immersion that is also a topological embedding.

So, a map \(F:M\to N\) is an embedding, if

  1. \(\operatorname{rank} F = \dim M\) at every point,
  2. \(F\) is injective,
  3. \(F\) is a homeomorphism onto \(F(M)\subset N\) with subspace topology.
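The rank condition is easy to probe numerically. Below is a minimal sketch, assuming curves \(\mathbb{R}\to\mathbb{R}^n\), for which "pushforward is injective" just means the velocity is nonzero. The curves `circle` and `cusp` are my own illustrative choices, not taken from Lee or Spivak.

```python
import math

def velocity(curve, t, h=1e-6):
    """Central-difference approximation of curve'(t), the pushforward of d/dt."""
    forward, backward = curve(t + h), curve(t - h)
    return [(a - b) / (2 * h) for a, b in zip(forward, backward)]

def has_full_rank(curve, t, tol=1e-4):
    """For a curve R -> R^n the pushforward has rank 1 iff the velocity is nonzero."""
    return math.hypot(*velocity(curve, t)) > tol

def circle(t):
    # Hypothetical example: an immersion of R into R^2.
    return (math.cos(t), math.sin(t))

def cusp(t):
    # Hypothetical example: smooth and injective, but the velocity vanishes at t = 0.
    return (t ** 2, t ** 3)

print(has_full_rank(circle, 0.0))
print(has_full_rank(cusp, 0.0))
```

Note that `circle` on all of \(\mathbb{R}\) passes the rank check but is not injective, so it is an immersion and not an embedding, while `cusp` is injective but fails the rank check at zero.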


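A cone over a topological space \(X\) is the quotient \(CX = (X\times[0,\infty))/(X\times\{0\})\).

A point of that cone can be identified with a point \(x\in X\) and the distance \(t\in[0,\infty)\) to the origin (the apex, i.e. the collapsed fiber \(X\times\{0\}\)).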

The reason I care about cones is the notion of the tangent cone of a metric space at a point.

Must needs

Must needs be

Whenever I encounter constructs like "It_1 must needs be X" I tend to decompose them into

"it must be that it_1 needs to be X",

rather than into

"it must be X" intensified by an adverb "needs",

which seems to be the consensus.

Needs must

There's also the archaic "needs must", in which "needs" seems to be a noun and they (the "needs") actually "must":

If needs must, I'll do it.

Finally, there's the other "must needs", in which "needs" acts as an amplifier and some knowledgeable people identify it as an adverb:

Shall have to

I also have just encountered a construct "shall have to":

Double negations

It might seem English rather discourages the use of double negations, so that the following sentences, if at all parseable, are likely to be taken as a sign of lack of education:

  1. I haven't got no money.

  2. I never don't do that.

The reason these sentences feel smelly is that they contain double negations which technically cancel each other, so that the sentences above might read:

  1. I'm not in the state of lack of money

  2. It never happens that I don't do that (e.g. never happens that I forget to do that)

But because of the low likelihood of the original constructs, one would rather assume the message contains a mistake.

The second example contrasts with the situation we have in French and Russian, where we use what might seem like double negatives:

  1. Я (I) никогда (NEVER) не (NOT) делаю (do) этого (that)

  2. Я (I) никогда (NEVER) не (NOT) курю (smoke).

  3. Je (I) ne fais (not do) jamais (NEVER) ça (that)

Those aren't really double negatives; rather, the scopes of verbs and negations are propagated differently, and omitting the "не" or "ne" would actually lead to a contradiction in the message. For instance, the sentence

Я (I) никогда (NEVER) курю (smoke)

might be interpreted as comprised of claims:

  1. I do smoke ("я курю").
  2. The modality of this event, i.e. the answer to the question "how often that happens?" is: "never" ("никогда").

The two are in conflict with each other, and while one could try to use this construct to convey the idea of not smoking, its likelihood is negligible.

Now it seems that in French (though I don't really understand French yet) the situation is the same, as we say:

Jamais (NEVER) plus (more) je ne (NOT) te dirai (will say)

while the sentence without "ne":

Jamais plus je te dirai

does sound contradictory, just as it would in Russian.

To emphasize the difference with English, let's note that we'd rather encode the message "Je ne fais jamais ça" as

I don't ever do that

Which can be decomposed to

  1. I don't do that

  2. My behaviour is consistent, i.e. I always ("ever") choose the policy "not do that"

My friend has given me a hint that this might be coming from Latin, in which French has its roots (Russian, being Slavic, does not descend from Latin, so at best this explains the French half).

Double negatives in English

It wouldn't be true, however, to say that two negating terms cannot occur in one sentence. First, and trivially, there are colloquial constructions like

"I ain't got no money",

which sound rather natural.

However, the case that got me curious is the use of "neither", which I consider a "negating term". So, a perfectly valid example of two negating terms going in a row in English can be seen in:

-- I'm not a linguist.

-- Me neither!

Moreover, one can notice that it takes an effort to put a non-negative term in place of "neither", and the following exchange sounds off:

-- I'm not a linguist.

-- Me too.

Further procrastination

  • Just learned that Sussman (co-author of SICP) also co-authored a monograph on differential geometry

  • And on classical mechanics

  • Moreover, the former follows the concept of Turtle Geometry and states in its Prologue the approach I admired most since my early childhood: learning things by programming them, thus forcing oneself to be precise and exact in judgements and claims. I'm recalling right now again that first "lecture" on elementary notions of set theory the summer before admission to VSU... Constructing a function as a set so it becomes a more "tangible" object. The catharsis that followed. I didn't realize back then that it's the same as in programming. For five years I've been living with guilt and shame that I started as a coder and not a Mathematician. For five years I felt programming was a disgusting and despicable thing to do. And only now I truly realize that the thing I loved about it in those first years is the same thing I fell in love with Mathematics for in that summer of 2014.

  • Also stumbled upon a tweet mentioning the following interpretation of the Laplace operator: it measures how the average value of a function around a point differs from its value at the point. Sort of trivial, and it resembles how we derive sufficient min/max conditions, yet I hadn't noticed.

  • The majority of these I found in: JAX cookbook

  • Update! Accidentally found these slides by Absil giving some historical perspective on the subject

  • For instance, the slides mention Luenberger (1973) stating that "we'd perform line search along geodesics... if 'twere feasible". Now we're closer to the roots of the whole thing

MD is not RSGD, but RSGD also does M from MD

The whole idea of trying to parallel mirror descent with following geodesics as in RSGD has come to naught. And not the way one would expect, because MD still seems "type-correct" and RSGD doesn't yet. Long story short: in RSGD we're pulling back COTANGENTS but updating along a TANGENT.

Update! Before updating, we're raising the index of the cotangent by applying the inverse metric tensor, thus making it a tangent! Thanks to @ferrine for the idea.

Following \(\mathbb{R}^m\to\mathbb{R}^n\) analogy of previous posts:

\begin{equation*} F:M\to N, \end{equation*}
\begin{equation*} \xi = F^*\eta\in\mathcal{T}^*M,~\text{for}~\eta\in\mathcal{T}^*N, \end{equation*}
\begin{equation*} X = \xi^\sharp = g^{-1}(\xi) = g^{-1} \xi^\top~\text{so it becomes a column}. \end{equation*}
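The three formulas above can be sketched in a flat chart. This is only an illustration under loud assumptions: the loss `F`, the constant diagonal metric `METRIC_DIAG`, and the straight-line update (standing in for an exponential map or retraction) are all hypothetical choices of mine, not what any RSGD library does.

```python
# Hypothetical constant metric g in the chart, stored by its diagonal entries.
METRIC_DIAG = [2.0, 6.0]

def jacobian_row(p):
    """Jacobian of the hypothetical loss F(p) = p0^2 + 3 * p1^2, a 1x2 row."""
    return [2.0 * p[0], 6.0 * p[1]]

def pull_back(p, eta):
    """xi = F^* eta: components of the pulled-back cotangent at p (a row)."""
    return [eta * j for j in jacobian_row(p)]

def sharp(xi):
    """Raise the index: X = g^{-1} xi, which is now a tangent (a column)."""
    return [c / g for c, g in zip(xi, METRIC_DIAG)]

def step(p, lr=0.5):
    """One update along the tangent; a straight line in the chart (an assumption)."""
    xi = pull_back(p, eta=1.0)   # cotangent
    x = sharp(xi)                # tangent
    return [pi - lr * xi_up for pi, xi_up in zip(p, x)]

print(step([4.0, -2.0]))
```

Without `sharp` the update would subtract a cotangent from a point, which is exactly the type mismatch complained about above.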


Going on with trivialities. I still haven't finished my story with autodiff, and since I've got to code it for DL homework... Well.

So, we consider \(F:M{=}\mathbb{R}^m\to N{=}\mathbb{R}^n\). Tangent spaces look exactly like the original spaces except for where they go in multiplication: for \(g:N\to\mathbb{R}\) a tangent vector \(Y\in\mathcal{T}N\) is basically an element of \(\mathbb{R}^n\) and acts by \(Y(g) = \langle \nabla_{F(p)} g, Y\rangle\).

Pushforward \(F_*:\mathcal{T}M\to \mathcal{T}N\) is defined by \((F_* X)(g) = X(g\circ F)\) for \(X\in\mathcal{T} M \sim \mathbb{R}^m\). Thus

\begin{equation*} \begin{split} (F_* X)(g) &= X(g\circ F) = \langle \nabla_{p} (g\circ F), x\rangle \\ &= \left\langle ( \left.Dg\right|_{F(p)} \left. DF \right|_p )^\top, x\right\rangle \\ &= \langle {\underbrace{J_p(F)}_{\left. DF \right|_p}}^\top \underbrace{\nabla_{F(p)} g}_{\left.Dg\right|_{F(p)}^\top}, x \rangle \\ &= \left\langle \nabla_{F(p)}g, J_p(F)x \right\rangle . \end{split} \end{equation*}

Here we use \(D\) to denote Fréchet derivatives (a linear map) and nabla to denote the gradient (a vector -- the Riesz representation of that linear map), and we also identify the linear map \(DF\) with the Jacobian matrix \(J(F)\). Also I denote \(X\) cast to \(\mathbb{R}^m\) as just \(x\). I don't like how I'm already using too many different notations (after all, that's what I scorn differential geometers for) but at the moment it seems fit.

So, basically the equation above means that in Euclidean case pushforward \(F_*\) acts on tangent \(X\) merely by multiplying with Jacobian \(J_p(F)\). In terms of matrix multiplication, \(F_* X\) is just \(J_p(F) x\in\mathbb{R}^n\)
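A quick numerical sanity check of that identity, as a sketch: the map `F`, the function `g`, and the point and vector below are hypothetical examples, and \(X(g\circ F)\) is approximated by central differences.

```python
def F(p):
    """Hypothetical smooth map R^2 -> R^3."""
    x, y = p
    return [x * y, x + y, x * x]

def jacobian(p):
    """Hand-written Jacobian J_p(F), a 3x2 matrix."""
    x, y = p
    return [[y, x], [1.0, 1.0], [2.0 * x, 0.0]]

def g(q):
    """Hypothetical scalar function on N = R^3."""
    return q[0] + 2.0 * q[1] * q[2]

def grad_g(q):
    return [1.0, 2.0 * q[2], 2.0 * q[1]]

def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def directional_derivative(f, p, v, h=1e-6):
    """X(f): derivative of f at p along v, by central differences."""
    plus = [pi + h * vi for pi, vi in zip(p, v)]
    minus = [pi - h * vi for pi, vi in zip(p, v)]
    return (f(plus) - f(minus)) / (2 * h)

p, x = [1.0, 2.0], [0.5, -1.0]
lhs = directional_derivative(lambda q: g(F(q)), p, x)  # X(g o F)
rhs = dot(grad_g(F(p)), matvec(jacobian(p), x))        # <grad g, J x>
print(abs(lhs - rhs) < 1e-6)
```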

Further, the pullback \(F^*\) by definition maps a right cotangent \(\xi\in\mathcal{T}_{F(p)}^*N\) into a left cotangent \(F^*\xi\in\mathcal{T}_p^* M\), which acts as \((F^*\xi) X = \xi(F_*X)\):

\begin{equation*} \begin{split} (F^*\xi) X &= \xi(F_*X)\\ &= \xi^\top J_p(F)x . \end{split} \end{equation*}

That is, the pulled-back cotangent is just \(\xi^\top J_p(F)\in (\mathbb{R}^m)^*\) (acting on \(x\) from the left), and the pullback \(F^*\) itself is still the same \(J_p(F)\) except acting on cotangents from the left. It is equivalent to say that the pullback acts on \(\operatorname{column}(\xi)\) as the transposed Jacobian \(J_p(F)^\top\):

\begin{equation*} \operatorname{column}(F^*\xi) = J_p(F)^\top \operatorname{column}(\xi). \end{equation*}
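The duality between the two can be checked on a small example. This is a sketch with a hypothetical hand-written Jacobian; `push_forward` and `pull_back` are illustrative names of mine and not the API of any autodiff library.

```python
def jacobian(p):
    """Jacobian of a hypothetical map F(x, y) = (x*y, x + y, x^2), a 3x2 matrix."""
    x, y = p
    return [[y, x], [1.0, 1.0], [2.0 * x, 0.0]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def push_forward(p, x):
    """F_* X as a column: J x."""
    return [dot(row, x) for row in jacobian(p)]

def pull_back(p, xi):
    """column(F^* xi): J^T xi."""
    J = jacobian(p)
    return [sum(J[i][j] * xi[i] for i in range(len(J))) for j in range(len(J[0]))]

p = [1.0, 2.0]
x = [0.5, -1.0]        # tangent at p
xi = [1.0, 2.0, 6.0]   # cotangent at F(p)

# (F^* xi) X and xi (F_* X) are the same number.
print(dot(pull_back(p, xi), x), dot(xi, push_forward(p, x)))
```

Reverse-mode autodiff only ever needs `pull_back`, and forward mode only `push_forward`; neither has to materialize the full Jacobian in practice, although this toy version does.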

Now, why pulling back in gradient descent? Take \(F:\mathbb{R}^n \to\mathbb{R}\). Right cotangent is just a number \(\alpha\). Left cotangent would be \(\alpha J_p(F)\). It is such that

\begin{equation*} (F^*\alpha) X = \alpha(F_* X) = \alpha J_p(F)x \end{equation*}

What happens when we pull, transpose, and then push?

Dual to dual

Just realized how the double dual isn't really the original space even in the simple Euclidean case (which I was aware of, but somehow didn't feel I was understanding): with a vector being a column \(x\in\mathbb{R}^n\), its dual a row \(x^\top\in(\mathbb{R}^n)^*\), the double dual \(x^{\top\top}\) is indeed a column, except when it acts on the rows in multiplication it goes TO THE RIGHT and not to the left:

\begin{equation*} \begin{split} x^\top(y) &= x^\top y,\\ x^{\top\top}(y^{\top}) &= y^\top x^{\top\top} = y^\top x. \end{split} \end{equation*}

This contrasts with rows acting on columns, where the dual (the row) acts from the left.

So exciting.

Update! It's much more than that! If we treat \(\mathbb{R}^n\) as a manifold, it turns out then that its tangent space looks more like double dual \((\mathbb{R}^n)^{**}\) rather than \(\mathbb{R}^n\) or \((\mathbb{R}^n)^*\), because when we consider a tangent vector acting on scalar functions \(\mathbb{R}^n\to\mathbb{R}\) in the special -- linear -- case, the tangent goes on the right and it does so as a column. Take a scalar function \((\mathbb{R}^n)^* \ni a^\top : \mathbb{R}^n \to \mathbb{R}\). Then a tangent \(X\in\mathcal{T}\mathbb{R}^n\) should act on \(a\) from the right:

\begin{equation*} X(a) = \left.\partial_t(t\mapsto a(p + t x))\right|_{t=0} = \left.\partial_t(t\mapsto a(p) + t a^\top x)\right|_{t=0} = a^\top x. \end{equation*}

Here \(x\in\mathbb{R}^n\) denotes \(X\in\mathcal{T}\mathbb{R}^n\) cast to \(\mathbb{R}^n\).
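The same statement as a tiny numerical check: differentiate a linear function along a direction and compare with the row-times-column product. The vectors are arbitrary hypothetical values, and the base point turns out not to matter for a linear function.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def tangent_action(f, point, direction, h=1e-6):
    """X(f): derivative of t -> f(point + t * direction) at t = 0 (central differences)."""
    plus = [pi + h * di for pi, di in zip(point, direction)]
    minus = [pi - h * di for pi, di in zip(point, direction)]
    return (f(plus) - f(minus)) / (2 * h)

a = [2.0, -1.0, 0.5]        # the row a^T, i.e. a linear function on R^3
x = [3.0, 4.0, -2.0]        # the tangent, cast to a column
base = [1.0, 1.0, 1.0]      # arbitrary base point

def linear(y):
    return dot(a, y)

print(tangent_action(linear, base, x), dot(a, x))
```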

SKDL19L2 take-outs

  • Dropout is similar to ensembling

  • Conjugate gradients are akin to momentum, mixing new direction with previous ones. That is quite different from my previous intuition that we "always move in new direction" which was correct... up to the metric.

  • Nesterov is about two things:

    • It is better to use gradient at the destination point than the gradient at the origin

    • With large momentum, we get a meaningful enough estimate of where we end up to compute that gradient

  • Mentioned this blog post again

  • Again this bullshit about SGD being "incorrect". It's not that SGD is not correct, it's that you define it the wrong way (being aware that it should be done in a different way) and omit the explicit isomorphism.
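The two Nesterov points from the list above fit in a few lines. A minimal sketch, assuming a hypothetical ill-conditioned quadratic and hand-picked `lr` and `momentum`; this is one common formulation of the update, not necessarily the one given in the lecture.

```python
def grad(w):
    """Gradient of the hypothetical quadratic f(w) = 0.5 * (w0^2 + 10 * w1^2)."""
    return [w[0], 10.0 * w[1]]

def nesterov(w, steps=200, lr=0.05, momentum=0.9):
    v = [0.0] * len(w)
    for _ in range(steps):
        # Estimate the destination using the momentum term alone...
        lookahead = [wi + momentum * vi for wi, vi in zip(w, v)]
        # ...and take the gradient there instead of at the current point.
        g = grad(lookahead)
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

w_final = nesterov([5.0, 5.0])
print(max(abs(c) for c in w_final) < 1e-3)
```

Replacing `grad(lookahead)` with `grad(w)` turns this into plain heavy-ball momentum, which is the contrast the first sub-bullet draws.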