# Posts about main (old posts, page 2)

Subcategories:# Eric Moulines' mini-course

I was a fool enough to miss most of Eric's course. Well, just one lecture out of two, technically. But still. Though it wasn't much of my fault either: that Saturday had been reserved long time ago.

So, I listened to his talk last Wednesday which wasn't exactly impressive, then missed the first lecture on MC, and today finally arrived for his last lecture which was really cool. He's a great reader!

Now, I'm to remind you that I don't know much about measures and probabilities, it's not my field. So most of the stuff that I found cool might in fact be quite trivial. But here are things that were new to me and seemed cool:

- Hamiltonian paths are closed. For this reason when doing HMC we got to be taking random lengths lest our iterations end up just where they begin
- Total Variation is actually a norm on measures
which is dual to -norm on functions!
Really lovely.
And quite useful, because I couldn't persuade myself
to try and
*really read*the definition of TV for along time now. You know, analytic definitions always make you feel disrespected -- it is as if the author didn't care enough to pre-process his idea into a form convenient enough to be taken as an apparent geometrical truth. - There exists a coupling formulation for TV: For measures there exist -valued random variables (on some prob. space) such that
- We've also covered Dobrushin coefficient.
- We've even applied Banach's fixed-point theorem! Felt like I was home. Not all this stochastic-differential-rubbish that they don't care to define signatures of!

# Getting started with GANs: Part 1, getting access to GPUs

Here's how the things are now: I finally talked to Thibaut and he agreed to supervise me; probably we'll be working on something related to barycentre or curvature in geodesic spaces, or whatever (we've discussed some possible problems but it's not that we've agreed on something very specific yet); that's all very fascinating mathematical stuff but sooner or later I got to catch up with the real world; so one of the things I hear most often of and related to optimal transport distances are WGANs and it seems reasonable to begin my acquiatance with what's outside with them.

The nuance about this my endeavour is that I've never in my life actually trained even a CNN. Always felt like it's something trivial and not worth knowing the details of...

As `@ferres`

suggested, before getting all into WGANs
it's worth preparing a baseline.
So I'll first get through Goodfellows old
GANs.

I'll be using the cheapest GPU-powered AWS instance out there:

It's nothing too fancy, just better than a laptop with integrated graphics:

I'm covering it with the studentpack promo worth $100.

...and it won't launch:

Spot instance wouldn't launch either -- "no capacity". So requesting a limit increase.

# Notation for DEs in Hilbert spaces

Look guys, I appreciate and understand
that Hilbert are self-dual,
but if you use *two different notations*
for differentiation in a single equation
when you could've used just one -- you got to really mean it.

Look at this:

Wherefore do you write so many symbols
if could've written less and thus avoided being a liar?
Take for a Hilbert space in discussion,
the left hand site is the derivative -- a row-vector ,
and the RHS is a column -- same derivative *encoded* as a vector.

# \( \mu(\mathrm{d}x) \) and \( \mathrm{d}\mu(x) \)

't is known that if you begin from Lebesgue point of view
you naturally end up with notation
where can be thought of as an infitesimal *set*;
Riemann on the other hand leads you to
which stands for "change of measure".
This is all of course a birdish language which doesn't
define anything for real and does not help computation.

Now, I really hate all these limit-centric notations.
In smooth and bounded non-random finite-dimensional Euclidean world
(in other words in any naive calculus)
we most elegantly get rid of it, saying that
is a differential form, that is it maps each
into a multilinear function -- a tangent,
a multilinear approximation of the integral curve.
Then even though the action of the linear operator
may involve limits in its definition -- which is perfectly fine --
our is a pretty understandable object
that has specific *type*. The integral curve is naturally
characterized as the one with given initial value and tangents at each point.
This in a rigorous way captures the original intuition
of operating with "infinitesimal increments".
The bird-language phrasing
"**for every** small enough a "
actually means that we **define a** linear **map** that turns **every**
into approximation of .

But why -- it feels like I fail to construct even for myself an analogous interpretation for all these or .

The naive way to describe the latter is to say simply
that it is a *random linear operator* -- the derivative of
for any fixed outcome .

Yet with the *set*... isn't very linear a space
and it isn't about linearity neither.
Also this notation isn't going in line with the main idea --
that we're actually splitting the image space instead of the domain.

# Porto and curvature

To-day (or more precisely the day before this night) was: a bottle of Chateaux (Merlot, Caubernet-Sauvignon), one of Don't-remember-what, two shots of Sandeman vintage, and a glass of Sandeman tawny.

I found all of them sweet to taste rather than dry which they in fact are. Portos were too sweet to me (even tawny one), most like Cohors. But when limited to a single glass methinks they make a very noble drink for our cold times.

I also learned about some really cool cake: P\`avlova (thanks to they-will-understand-I-am-reffering-to-them)

With three good Portos Villani's lectures start looking especially reasonable and charming -- feels like you understand the world! I'm now watching this one: https://youtu.be/zo46TEp6FB8. It got quite some praise in some medium post that I found in some tweet. Now what I mean by "charming" is for example these stories about "Kantorovich's duality theorem being considered a very capitalistic theorem". TLDR: "capitalistic" is actually the strong duality which tells in the case considered that trying to minimize over constraint set by yourself gives the same result as letting a hired man maximize his profit. But him mentioning "(about Monge and Kantorovich) both were extremely precautious students, and both of them were proffesors at the age of 22" drives me back into depression. I'm two years older than Galois was when he got shot, and I've done thus far nothing.

Oh, I also finally had some good runs: night, forest, and speed-mad doge pulling forward. Long gone feeling. Running e-track at dorm can't even be compared.

# Quick fix for mathjax SVG renderer in jupyterlab from conda

$ sudo pacman -Sy mathjax $ cd ~/.conda $ find -iname 'MathJax' -type d -exec rsync /usr/share/mathjax/jax/output/SVG/ \{\}/jax/output/SVG/ -rvuP \;

# Variational derivative

Zadorozhny's now reading lectures on what he calls "variational methods and
random processes". Despite of some common abuse of notation the course seems to
be qute useful. I'll post some summaries of what he's teaching us. Yet I'll use
slightly different notation and thus as always, all the good is his and all the
wrong is mine. **This is the part about variational derivatives**.

### Spaces

Let be one of the following Banach spaces:

- the space of bounded continuous functions defined on the interval with the norm:

- the space of functions possessing bounded continuous derivatives:

- the space of classes of equivalent functions with the norm:

### Definition

We consider a functional and we consider it the context
of variational problems or optimal control problems. The common approach to
investigate these problems is linearization. Now we could quite naturally begin
with consideration of Frechet derivative of (assuming it has one):
where is a bounded linear operator and is the Frechet derivative of
at the function . Yet it turns out there's something even more
convenient. As we saw earlier^{1}^{2} the functional is often
an integral one. This and some further exploration leads to the following
definition:

The function is called a *functional*
or *variational derivative* of if the Frechet derivative at any
function is a *bounded integral operator* with the *kernel*

In simpler words, the variational derivative is a kernel of the Frechet derivative.

**Nota Bene:** while the Frechet derivative is unique it may have infinitely
many equivalent *kernels*.

### Notation

There's an established notation for the variational derivative. As you could've
guessed it's a bit messy. Although aware of the distinction, in his works
Zadorozhny usually doesn't distinguish a function from its value a derivative
from the increment of the value associated with specific increment of the
argument --- that is the derivative applied to that increment. The wikipedia
page^{3} is prone to a similar abuse of notation.

Thus so far we will use the following notation for *the value of the functional
derivative at specific function and time :*

Sadly, Zadorozhny often calls this thing a function which of course it isn't.
He also uses this same symbol for the value of the derivative and for all of the
functions
,
,
It's common though. The real problem comes from the fact that he also calls both
a *derivative* and a *differential* (and not just their value) the following
expression:

I suppose that's because he's talking too much with physics folks. The common
notion of a derivative is that of a *linear mapping* which approximates the
function under consideration as good as possible. The integral above isn't a
derivative of but the function
is.
The notion of a "differential" is then a non-sense. As
opposed to the notion of a "differential associated with specific change ".
Which is still excessive. That's a long and old story though which probably
deserves its own writeup.

To denote the function itself I'd rather propose the notation

Yet obviously things are going to get more complicated than that when it comes to multivaried functionals and the change of variables. In these cases I believe multiindex notation could be used just as with usual derivatives.

### Examples

### Inclusions

It is important to note that spaces are included in each other as normed spaces so that if is a bounded linear operator in then it is just as well bounded in all . And so are .

Suppose we have two norms which lie in the following relation:

Then

Thus whenever

it's also

In other words if as then as .

Now it's easily seen that

and^{4}

which proves our assertion about inclusions.

### Properties or "The rules of variational differentiation"

Here one needs to be precise in what spaces the derivative is bounded and where it's not. Despite of that I'll omit proofs for some trivial cases.

#### Derivative of a constant functional

#### Homogeneity

If has a variational derivative, then so does and:

#### Additivity

If the functionals and have variational derivatives, then so does their sum and the following equality holds:

#### Derivative of a multiple

If the functionals and have variational derivatives then so does and:

That is .

#### Chain rule 1

Suppose has a variational derivative and is a function. Consider a functional

Then

##### Proof

#### Chain rule 2

Suppose has a variational derivative and is a function. Consider a functional

If is , then has a variational derivative and:

Just in case you're not familiar with this notation, let's rewrite this definition using an auxilliary function defined by when we fix some . Then .

Then the statement above can be rewritten as:

##### Proof

Let's consider an increment:

In the second line we used a linear approximation of the value of a differentiable function .

In the third line we've rewritten the preceding expression so that we get the expression of the form , where and are two functions.

This allows us to use a linear approximation in the fourth line.

Note the in the end. It's actually .

It follows then that in (and in by inclusion) the variational derivative of does exist and:

### Bibliography

# Formal series

I intend to write a series of posts on some topics I found often to be vaguely or mis-represented in baccalaureate's math courses. This one's about something called "formal series" in discrete mathematics. It's written in a rough and inaccurate manner because the true reason I'm publishing it now is that I'm afraid to forget about it again. I'll (re)write this whole thing later when I have time. Yet here's the gist (a bit more under the cut): don't try to fool students and call it "application of complex analysis to performing some operations on sequences".

These dudes introduce some magical "formal series" without specifying what they really are and what sequences are and why all of this really works. The way it oughts to be taught is as follows:

- Since the very first semester (or even better in school) the folks should learn some (very basic) basics of functional analysis. Sets and functions. And the difference between the a funciton and its value at some point
- Then you teach them that a sequence in the set \( X \) is a map from \( \mathbb{N} \) to \( X \).
- You teach them that a series in the v.s. \( X \) is essentialy a
*sequence*of partial sums and the series and and sequences in v.s. are interchangeable - You teach them about differentiability and power series
- If your for your sequence \( a \) the value \( \limsup_k\sqrt[k]{|a_k|} \) is finite then the corresponding power series \( \sum_k a_k (x - x_0)^k \) is convergent in some disc. That's right, it's Hadamard's bit, works well in Banach spaces
- If you add one series to another (or multiple by a constant or do you name what to it) then it's going to just change the radius of convergence somehow. But the series will still converge at least somewhere. Well actually I'm not sure right now what happens if they're centered around different points I'll look through it after I finally get some sleep.
- It's an (absolutely) convergent powerseries thus it defines a differentiable function!
- You could try and mix and split your series into some parts resembling Taylor series of some well-known functions
- How now, you've got combinations of some trivia? I reckon you know something about adding \( \frac{1}{1-ax} \) to \( \frac{2}{1-bx} \).
- Mix and expand!

# Projections

There are some things about projections that, it turned out, I was getting all wrong. Specifically, it concerns the norms of projections. All right, I shall confess: I never realized the norm could be greater than one.

## Recap

A linear operator \( P: X\to X\) is called a projection if \( P^2 = P \). Suppose \( X \) is a normed v.s. Then we can introduce the "operator norm": \[ \|A\|_{\mathrm{op}} = \inf \{C\in\mathbb{R};\ \|Ax\|\leq C\|x\|\ \text{for all}\ x\in X\}\] for those operators \( A \) for which it's finite. Here \( \|\cdot\|:X\to\mathbb{R} \) is the norm in \( X \). That value happens to equal \[ \|A\|_{\mathrm{op}} = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|} = \sup_{\|x\|=1} \|Ax\|. \] Such operators form a linear subspace of the space of all operators on \( X \). Moreover they form a Banach Algebra with the usual multiplication (composition) of operators and the submultiplicative norm \( \|\cdot\|_{\mathrm{op}} \): \[ \|AB\|_{\mathrm{op}} \leq \|A\|_{\mathrm{op}}\|B\|_{\mathrm{op}} \]

## Wrong intuition

The (wrong) first intuition is that for a non-zero projection there will be \(
\|P\|=1 \). Here's where this deception comes from. I'd erroneously
think of orthogonal projections and expect
\[
Px = \left\{\begin{aligned}
& x,\ \text{for}\ x\in X_1,\\
& 0,\ \text{otherwise}.\end{aligned}\right.
\]
Then obviously \( \|P\| = \sup_{\|x\|=1} \|x\| = 1. \)
Yet the projection needs not be of the form above --- it may be *oblique*.

Here's an example AG Baskakov used to demonstrate this phenomenon to me --- his ignorant student. Consider \( X = \mathbb{R}^n \) and an operator \( P \) defined by the formula: \[ Px = (x, a) b, \] for some fixed \( a, b \in X \). Then \( P^2 x = Py = (y, a) b = (x, a) (b, a) b, \) where \( y = Px = (x, a) b \). Thus chose we such \( a, b \) that \( (a, b) = 1 \) the operator \( p \) would be a projection. Now to its norm: \[ \sup_{\|x\|=1} \|Px\| = \sup_{\|x\|=1} |(x, a)|\|b\| = (\frac{1}{\sqrt{a}}a, a)\|b\| = \|a\|\|b\|. \] Could we pick \( a \) and \( b \) to satisfy \( (a, b) = \|a\|\|b\|\cos\phi = 1 \) and \( \|a\|\|b\| > 1 \)? Most certainly, sir.

## Fix

But let's just see what can be told from the definition. Since \( P^2=P \) and since the norm is submultiplicative it's clear that: \[ \|P\|_{\mathrm{op}} = \|P^2\|_{\mathrm{op}} \leq \|P\|_{\mathrm{op}}^2. \] This in turn implies \[ 1\leq \|P\|_{\mathrm{op}} \]