# Getting started with GANs: Part 1, getting access to GPUs

Here's how things are now: I finally talked to Thibaut and he agreed to supervise me. We'll probably be working on something related to barycentres or curvature in geodesic spaces, or whatever (we've discussed some possible problems, but it's not like we've agreed on anything very specific yet). That's all very fascinating mathematical stuff, but sooner or later I've got to catch up with the real world. One of the things I hear about most often that is related to optimal transport distances is WGANs, and it seems reasonable to begin my acquaintance with the outside world with them.

The nuance about this endeavour of mine is that I've never in my life actually trained even a CNN. It always felt like something trivial and not worth knowing the details of...

As @ferres suggested, before getting all into WGANs it's worth preparing a baseline. So I'll first work through Goodfellow's old GANs.

I'll be using the cheapest GPU-powered AWS instance out there:

It's nothing too fancy, just better than a laptop with integrated graphics:

I'm covering it with the student-pack promo worth $100. ...and it won't launch. A spot instance wouldn't launch either -- "no capacity". So I'm requesting a limit increase.

# Notation for DEs in Hilbert spaces

Look guys, I appreciate and understand that Hilbert spaces are self-dual, but if you use two different notations for differentiation in a single equation when you could've used just one -- you've got to really mean it. Look at this:

Wherefore write so many symbols when you could've written fewer and thus avoided being a liar? Take $\mathbb{R}^n$ for the Hilbert space under discussion: the left-hand side is the derivative -- a row vector in ${^n}\mathbb{R}$ -- and the RHS is a column -- the same derivative encoded as a vector.

# $\mu(\mathrm{d}x)$ and $\mathrm{d}\mu(x)$

'T is known that if you begin from Lebesgue's point of view you naturally end up with the notation $\mu(\mathrm{d}x)$, where $\mathrm{d}x$ can be thought of as an infinitesimal set; Riemann, on the other hand, leads you to $\mathrm{d}\mu(x)$, which stands for "change of measure". This is all, of course, a birdish language which doesn't define anything for real and doesn't help computation.

Now, I really hate all these limit-centric notations. In the smooth, bounded, non-random, finite-dimensional Euclidean world (in other words, in any naive calculus) we most elegantly get rid of them by saying that $\mathrm{d}x$ is a differential form, that is, it maps each $x$ into a multilinear function -- a tangent, a multilinear approximation of the integral curve. Then, even though the action of the linear operator $\int_A$ may involve limits in its definition -- which is perfectly fine -- our $\mathrm{d}x$ is a pretty understandable object with a specific type. The integral curve is naturally characterized as the one with a given initial value and given tangents at each point. This captures, in a rigorous way, the original intuition of operating with "infinitesimal increments".
The bird-language phrasing "$\mathrm{d}f \approx \frac{\mathrm{d}f}{\mathrm{d}x}\mathrm{d}x$ for every small enough $\mathrm{d}x$" actually means that we define a linear map that turns every $\Delta x$ into an approximation of $\Delta f$. But somehow I fail to construct, even for myself, an analogous interpretation for all these $\mu(\mathrm{d}x)$ or $\mathrm{d}W_t$. The naive way to describe the latter is to say simply that it is a random linear operator -- the derivative of $t\mapsto W_t(\omega)$ for each fixed outcome $\omega$. Yet with $\mathrm{d}x$ the set... $2^X$ isn't very linear a space, and it isn't about linearity either. This notation also doesn't go in line with the main idea -- that we're actually splitting the image space instead of the domain.

# Porto and curvature

To-day (or, more precisely, the day before this night) there was: a bottle of Chateaux (Merlot, Cabernet Sauvignon), one of don't-remember-what, two shots of Sandeman vintage, and a glass of Sandeman tawny. I found all of them sweet to the taste rather than dry, which they in fact are. The Portos were too sweet for me (even the tawny one), much like Cahors. But when limited to a single glass, methinks they make a very noble drink for our cold times. I also learned about a really cool cake: Pavlova (thanks to they-will-understand-I-am-referring-to-them).

With three good Portos, Villani's lectures start looking especially reasonable and charming -- feels like you understand the world! I'm now watching this one: https://youtu.be/zo46TEp6FB8. It got quite some praise in some Medium post that I found in some tweet. Now, what I mean by "charming" is, for example, these stories about "Kantorovich's duality theorem being considered a very capitalistic theorem". TLDR: the "capitalistic" part is actually the strong duality, which tells us, in the case considered, that minimizing over the constraint set by yourself gives the same result as letting a hired man maximize his profit.
But him mentioning that "(about Monge and Kantorovich) both were extremely precocious students, and both of them were professors at the age of 22" drives me back into depression. I'm two years older than Galois was when he got shot, and I've done nothing thus far.

Oh, I also finally had some good runs: night, forest, and a speed-mad doge pulling forward. A long-gone feeling. Running the e-track at the dorm can't even be compared.

# Quick fix for the MathJax SVG renderer in JupyterLab from conda

```
$ sudo pacman -Sy mathjax
$ cd ~/.conda
$ find -iname 'MathJax' -type d -exec rsync /usr/share/mathjax/jax/output/SVG/ \{\}/jax/output/SVG/ -rvuP \;
```

# Variational derivative

Zadorozhny is now giving lectures on what he calls "variational methods and random processes". Despite some common abuse of notation, the course seems to be quite useful. I'll post some summaries of what he's teaching us. Yet I'll use slightly different notation, and thus, as always, all the good is his and all the wrong is mine. This part is about variational derivatives.

### Spaces

Let $(X, \|\cdot\|)$ be one of the following Banach spaces:

• the space $C_b(T)$ of bounded continuous functions defined on an interval $T\subset\mathbb{R}$, with the norm $\|x\|_{C_b} = \sup_{t\in T} |x(t)|$;

• the space $C_b^k(T)$ of functions possessing $k$ bounded continuous derivatives, with the norm $\|x\|_{C_b^k} = \sum_{j=0}^{k} \sup_{t\in T} |x^{(j)}(t)|$;

• the space $L^p(T),\ p\in\mathbb{N}$, of classes of equivalent functions with the norm $\|x\|_{L^p} = \left(\int_T |x(t)|^p\,\mathrm{d}t\right)^{1/p}$.

### Definition

We consider a functional $u: X\to \mathbb{R}$ in the context of variational problems or optimal control problems. The common approach to investigating these problems is linearization. Now, we could quite naturally begin by considering the Frechet derivative of $u$ (assuming it has one): $u(x+h) = u(x) + u'(x)(h) + o(\|h\|),$ where $u'(x)$, a bounded linear operator, is the Frechet derivative of $u$ at the function $x$. Yet it turns out there's something even more convenient. As we saw earlier [1], [2], the functional $u$ is often an integral one. This and some further exploration lead to the following definition:

The function $\phi: T\times X\to \mathbb{R}$ is called a functional, or variational, derivative of $u$ if the Frechet derivative $u'(x)$ at every function $x\in X$ is a bounded integral operator with the kernel $t\mapsto \phi(t, x)$: $u'(x)(h) = \int_T \phi(t, x)\,h(t)\,\mathrm{d}t.$

In simpler words, the variational derivative is a kernel of the Frechet derivative.

Nota Bene: while the Frechet derivative is unique, it may have infinitely many equivalent kernels (in $L^p$, for instance, two kernels that differ only on a null set define the same integral operator).
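Here's a small numerical sanity check of the definition (my own sketch, not from the lectures, with my own choice of functional and discretization): for $u(x) = \int_0^1 x(t)^2\,\mathrm{d}t$ the variational derivative is the kernel $\phi(t, x) = 2x(t)$, so the directional increment of $u$ should match the integral of the kernel against the direction.

```python
# Sketch: check that (u(x + eps*h) - u(x))/eps ≈ ∫ phi(t, x) h(t) dt
# for u(x) = ∫_0^1 x(t)^2 dt, whose kernel is phi(t, x) = 2 x(t).
N = 10_000
dt = 1.0 / N
ts = [(i + 0.5) * dt for i in range(N)]    # midpoint grid on T = [0, 1]

def integral(g):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    return sum(g(t) for t in ts) * dt

def u(x):
    return integral(lambda t: x(t) ** 2)

x = lambda t: t          # base point
h = lambda t: t ** 2     # direction of the increment
eps = 1e-6

# Directional derivative of u at x in direction h, computed two ways:
numeric = (u(lambda t: x(t) + eps * h(t)) - u(x)) / eps
via_kernel = integral(lambda t: 2 * x(t) * h(t))   # ∫ phi(t, x) h(t) dt
print(numeric, via_kernel)
```

Here `via_kernel` equals $\int_0^1 2t\cdot t^2\,\mathrm{d}t = 1/2$, and the finite-difference quotient agrees up to $O(\varepsilon)$.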

### Notation

There's an established notation for the variational derivative. As you could've guessed, it's a bit messy. Although aware of the distinction, in his works Zadorozhny usually doesn't distinguish a function from its value, or a derivative from the increment of the value associated with a specific increment of the argument --- that is, the derivative applied to that increment. The Wikipedia page [3] is prone to a similar abuse of notation.

Thus, so far we will use the following notation for the value of the functional derivative at a specific function $x$ and time $t$: $\frac{\delta u(x)}{\delta x(t)} := \phi(t, x).$

Sadly, Zadorozhny often calls this thing a function, which of course it isn't. He also uses this same symbol both for the value of the derivative and for all of the functions $t\mapsto \frac{\delta u(x)}{\delta x(t)}$, $x\mapsto \frac{\delta u(x)}{\delta x(t)}$, $(t, x) \mapsto \frac{\delta u(x)}{\delta x(t)}.$ It's common, though. The real problem comes from the fact that he also calls the following expression (and not just its value) both a derivative and a differential: $\int_T \frac{\delta u(x)}{\delta x(t)}\,h(t)\,\mathrm{d}t.$

I suppose that's because he talks too much with physics folks. The common notion of a derivative is that of a linear mapping which approximates the function under consideration as well as possible. The integral above isn't a derivative of $u$, but the function $h\mapsto \int_T \phi(t, x)h(t)\mathrm{d}t$ is. The notion of a "differential" is then nonsense, as opposed to the notion of a "differential associated with a specific change $h$" --- which is still excessive. That's a long and old story, though, which probably deserves its own writeup.

To denote the function $\phi$ itself I'd rather propose the notation $(\delta u)(t, x) := \phi(t, x).$

Yet obviously things are going to get more complicated when it comes to multivariate functionals and changes of variables. In those cases I believe multi-index notation could be used, just as with ordinary derivatives.

### Inclusions

It is important to note that the $C_b^k$ spaces are included in one another as normed spaces, so that if $A$ is a bounded linear operator in $C_b^1$ then it is just as well bounded in all $C_b^k$. And so are the $L^p,\ p\in\mathbb{N}$, spaces.

Suppose we have two norms $\rho_1, \rho_2$ which lie in the following relation: there is a constant $C > 0$ such that

$\rho_1(h) \leq C\,\rho_2(h)\quad\text{for all}\ h.$

Then

$\frac{\omega(h)}{\rho_2(h)} \leq C\,\frac{\omega(h)}{\rho_1(h)}.$

Thus whenever

$\frac{\omega(h)}{\rho_1(h)} \to 0\ \ \text{as}\ \rho_1(h)\to 0,$

it's also

$\frac{\omega(h)}{\rho_2(h)} \to 0\ \ \text{as}\ \rho_2(h)\to 0.$

In other words, if $\omega(h) = o(\rho_1(h))$ as $\rho_1(h)\to 0$, then $\omega(h) = o(\rho_2(h))$ as $\rho_2(h)\to 0$.
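A quick numeric illustration of such a norm relation (my own, not from the lectures): on $T=[0,1]$ the $L^1$ norm is dominated by every $L^p$ norm, so an $o(\|h\|_{L^1})$ remainder is automatically an $o(\|h\|_{L^p})$ remainder.

```python
# Sketch: on T = [0, 1] we have ||h||_1 <= ||h||_p (a special case of
# Holder's inequality), checked on a midpoint grid for a few functions.
import math

N = 10_000
dt = 1.0 / N
ts = [(i + 0.5) * dt for i in range(N)]

def lp_norm(h, p):
    """Midpoint-rule approximation of the L^p norm on [0, 1]."""
    return (sum(abs(h(t)) ** p for t in ts) * dt) ** (1.0 / p)

funcs = [lambda t: t, lambda t: math.sin(6 * t), lambda t: t ** 3 - 0.2]
checks = [(lp_norm(h, 1), lp_norm(h, p)) for h in funcs for p in (2, 3)]
ok = all(l1 <= lp + 1e-12 for l1, lp in checks)
print(ok)
```

On the grid this is exactly the power-mean inequality, so the check passes for every sampled function.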

Now it's easily seen that

$\|h\|_{C_b^j} \leq \|h\|_{C_b^k}\quad\text{for}\ j\leq k,$

and, by Hölder's inequality [4] (for a bounded interval $T$),

$\|h\|_{L^1} \leq |T|^{1-1/p}\,\|h\|_{L^p},$

which proves our assertion about the inclusions.

### Properties or "The rules of variational differentiation"

Here one needs to be precise about in which spaces the derivative is bounded and in which it's not. Despite that, I'll omit proofs for some trivial cases.

#### Linearity

If $u$ has a variational derivative, then so does $cu$ for any $c\in\mathbb{R}$, and:

$(\delta(cu))(t, x) = c\,(\delta u)(t, x).$

If the functionals $u$ and $v$ have variational derivatives, then so does their sum, and the following equality holds:

$(\delta(u + v))(t, x) = (\delta u)(t, x) + (\delta v)(t, x).$

#### Derivative of a product

If the functionals $u$ and $v$ have variational derivatives, then so does $uv:x\mapsto u(x)v(x)$, and:

$\frac{\delta (uv)(x)}{\delta x(t)} = v(x)\frac{\delta u(x)}{\delta x(t)} + u(x)\frac{\delta v(x)}{\delta x(t)}.$

That is, $(\delta(uv))(t, x) = v(x)(\delta u)(t, x) + u(x)(\delta v)(t, x)$.
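A hedged numeric check of the product rule (my own sketch, with my own choice of functionals): take $u(x)=\int_0^1 x^2$ and $v(x)=\int_0^1 x$, so $(\delta u)(t,x)=2x(t)$ and $(\delta v)(t,x)=1$, and compare the directional increment of $uv$ with the kernel integral.

```python
# Sketch: directional derivative of (uv) vs ∫ [v(x) 2x(t) + u(x)] h(t) dt.
N = 10_000
dt = 1.0 / N
ts = [(i + 0.5) * dt for i in range(N)]

def integral(g):
    return sum(g(t) for t in ts) * dt

u = lambda x: integral(lambda t: x(t) ** 2)   # (delta u)(t, x) = 2 x(t)
v = lambda x: integral(x)                     # (delta v)(t, x) = 1

x = lambda t: t
h = lambda t: 1.0
eps = 1e-6

xe = lambda t: x(t) + eps * h(t)
numeric = (u(xe) * v(xe) - u(x) * v(x)) / eps

ux, vx = u(x), v(x)   # hoisted so the integrand below stays cheap
via_kernel = integral(lambda t: (vx * 2 * x(t) + ux) * h(t))
print(numeric, via_kernel)
```

Both quantities come out as $v(x)\cdot 1 + u(x)\cdot 1 = 1/2 + 1/3 = 5/6$ up to discretization error.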

#### Chain rule 1

Suppose $u:X\to\mathbb{R}$ has a variational derivative and $f:\mathbb{R}\to\mathbb{R}$ is a $C_b^1(\mathbb{R})$ function. Consider the functional

$v: x\mapsto f(u(x)).$

Then

$(\delta v)(t, x) = f'(u(x))\,(\delta u)(t, x).$
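Assuming chain rule 1 indeed concerns the outer composition $v(x) = f(u(x))$ with kernel $f'(u(x))\,(\delta u)(t,x)$ (my reading of the statement), here's a throwaway numeric check with my own choices $u(x)=\int_0^1 x^2$ and $f=\sin$:

```python
# Sketch: directional derivative of f(u(x)) vs ∫ f'(u(x)) 2x(t) h(t) dt.
import math

N = 10_000
dt = 1.0 / N
ts = [(i + 0.5) * dt for i in range(N)]

def integral(g):
    return sum(g(t) for t in ts) * dt

u = lambda x: integral(lambda t: x(t) ** 2)   # (delta u)(t, x) = 2 x(t)
f, fprime = math.sin, math.cos
vfun = lambda x: f(u(x))                      # v = f o u

x = lambda t: t
h = lambda t: 1.0
eps = 1e-6

numeric = (vfun(lambda t: x(t) + eps * h(t)) - vfun(x)) / eps
ux = u(x)
via_kernel = integral(lambda t: fprime(ux) * 2 * x(t) * h(t))
print(numeric, via_kernel)
```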

#### Chain rule 2

Suppose $u:X\to\mathbb{R}$ has a variational derivative and $f:\mathbb{R}\to\mathbb{R}$ is a $C_b^1(\mathbb{R})$ function. Consider the functional

$w: x\mapsto u(f\circ x),\quad\text{where}\ (f\circ x)(t) = f(x(t)).$

If $(X, \|\cdot\|)$ is $L^p,\ p\in\mathbb{N}$, then $w$ has a variational derivative, and:

$(\delta w)(t, x) = f'(x(t))\,(\delta u)(t, f\circ x).$

Just in case you're not familiar with this notation, let's rewrite this statement using an auxiliary function $y=f\circ x$ defined by $y(t) = f(x(t))$ once we fix some $x$. Then $w(x) = u(y)$.

Then the statement above can be rewritten as:

$(\delta w)(t, x) = f'(x(t))\,(\delta u)(t, y).$

##### Proof

Let's consider an increment:

$\begin{aligned} w(x + h) - w(x) &= u(f\circ(x+h)) - u(f\circ x) \\ &= u\bigl(t\mapsto f(x(t)) + f'(x(t))\,h(t) + o(h(t))\bigr) - u(f\circ x) \\ &= u(x_0 + \Delta x) - u(x_0),\quad x_0 = f\circ x,\ \ \Delta x(t) = f'(x(t))\,h(t) + o(h(t)) \\ &= \int_T (\delta u)(t, x_0)\,\Delta x(t)\,\mathrm{d}t + o(\|\Delta x\|) \\ &= \int_T (\delta u)(t, x_0)\,f'(x(t))\,h(t)\,\mathrm{d}t + o(\|h\|_1). \end{aligned}$

In the second line we used a linear approximation of the value $f(x(t) + h(t))$ of a differentiable function $f$.

In the third line we've rewritten the preceding expression so that we get an expression of the form $u(x_0 + \Delta x)$, where $x_0$ and $\Delta x$ are two functions.

This allows us to use a linear approximation $u(x_0 + \Delta x) = u(x_0) + \int (\delta u)(t, x_0)\Delta x(t)\mathrm{d}t + o(\|\Delta x\|)$ in the fourth line.

Note the $o(\|h\|_1)$ in the end. It's actually $\int_T o(h(t))\mathrm{d}t$.

It follows then that in $L^1$ (and in $L^p$ by inclusion) the variational derivative of $w$ does exist, and:

$(\delta w)(t, x) = f'(x(t))\,(\delta u)(t, f\circ x).$
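To complement the proof, a numerical sketch (mine, with my own choices): take $u(y)=\int_0^1 y^2$ and $f=\sin$, so $w(x)=\int_0^1 \sin^2 x(t)\,\mathrm{d}t$ and the claimed kernel is $f'(x(t))\,(\delta u)(t, f\circ x) = 2\cos(x(t))\sin(x(t))$.

```python
# Sketch: directional derivative of w(x) = ∫ sin(x(t))^2 dt
# vs ∫ 2 cos(x(t)) sin(x(t)) h(t) dt, the kernel from chain rule 2.
import math

N = 10_000
dt = 1.0 / N
ts = [(i + 0.5) * dt for i in range(N)]

def integral(g):
    return sum(g(t) for t in ts) * dt

w = lambda x: integral(lambda t: math.sin(x(t)) ** 2)  # w(x) = u(f o x)

x = lambda t: t
h = lambda t: t
eps = 1e-6

numeric = (w(lambda t: x(t) + eps * h(t)) - w(x)) / eps
via_kernel = integral(lambda t: 2 * math.cos(x(t)) * math.sin(x(t)) * h(t))
print(numeric, via_kernel)
```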

### Bibliography

1. {{ site.url }}link://slug/var-calc-criteria

2. {{ site.url }}link://slug/opt-criterions

3. https://en.wikipedia.org/wiki/Functional_derivative

4. https://en.wikipedia.org/wiki/H%C3%B6lder%27s_inequality

# Formal series

I intend to write a series of posts on some topics I've often found to be vaguely represented or misrepresented in bachelor-level math courses. This one's about the thing called "formal series" in discrete mathematics. It's written in a rough and inaccurate manner, because the true reason I'm publishing it now is that I'm afraid to forget about it again. I'll (re)write this whole thing later when I have time. Yet here's the gist (a bit more under the cut): don't try to fool students by calling it "an application of complex analysis to performing some operations on sequences".

These dudes introduce some magical "formal series" without specifying what they really are, what sequences are, and why all of this really works. The way it ought to be taught is as follows:

1. Since the very first semester (or, even better, in school) the folks should learn some (very basic) basics of functional analysis. Sets and functions. And the difference between a function and its value at some point.
2. Then you teach them that a sequence in the set $X$ is a map from $\mathbb{N}$ to $X$.
3. You teach them that a series in the v.s. $X$ is essentially a sequence of partial sums, and that series and sequences in a v.s. are interchangeable.
4. You teach them about differentiability and power series
5. If for your sequence $a$ the value $\limsup_k\sqrt[k]{|a_k|}$ is finite, then the corresponding power series $\sum_k a_k (x - x_0)^k$ converges in some disc. That's right, it's Hadamard's bit; works well in Banach spaces too.
6. If you add one series to another (or multiply it by a constant, or do you-name-what to it), then it's going to just change the radius of convergence somehow. But the series will still converge at least somewhere. Well, actually, I'm not sure right now what happens if they're centered around different points; I'll look through it after I finally get some sleep.
7. It's an (absolutely) convergent power series, thus it defines a differentiable function!
8. You could try and mix and split your series into some parts resembling Taylor series of some well-known functions
9. How now, you've got combinations of some trivia? I reckon you know something about adding $\frac{1}{1-ax}$ to $\frac{2}{1-bx}$.
10. Mix and expand!
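Here's a small illustration of the pipeline above (my own example, not from any course): expanding the rational function $\frac{1}{1-x-x^2}$ as a power series yields the Fibonacci numbers, and by Hadamard's formula the $k$-th roots of the coefficients approach $\varphi = \frac{1+\sqrt5}{2}$, the reciprocal of the radius of convergence.

```python
# Sketch: power-series long division num(x)/den(x) mod x^n, using the
# relation sum_j den[j] * c[k-j] = num[k] (requires den[0] != 0).
from fractions import Fraction

def series_coeffs(num, den, n):
    c = []
    for k in range(n):
        s = Fraction(num[k]) if k < len(num) else Fraction(0)
        for j in range(1, min(k, len(den) - 1) + 1):
            s -= den[j] * c[k - j]
        c.append(s / den[0])
    return c

# 1 / (1 - x - x^2): coefficients are the Fibonacci numbers 1, 1, 2, 3, ...
coeffs = series_coeffs([1], [1, -1, -1], 25)
print(coeffs[:8])

# Hadamard: |c_k|^(1/k) tends to the golden ratio, so the radius is 1/phi.
phi = (1 + 5 ** 0.5) / 2
approx = float(coeffs[24]) ** (1 / 24)
print(approx, phi)
```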

# Projections

There are some things about projections that, it turned out, I was getting all wrong. Specifically, it concerns the norms of projections. All right, I shall confess: I never realized the norm could be greater than one.

## Recap

A linear operator $P: X\to X$ is called a projection if $P^2 = P$. Suppose $X$ is a normed v.s. Then we can introduce the "operator norm": $\|A\|_{\mathrm{op}} = \inf \{C\in\mathbb{R};\ \|Ax\|\leq C\|x\|\ \text{for all}\ x\in X\}$ for those operators $A$ for which it's finite. Here $\|\cdot\|:X\to\mathbb{R}$ is the norm in $X$. That value happens to equal $\|A\|_{\mathrm{op}} = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|} = \sup_{\|x\|=1} \|Ax\|.$ Such operators form a linear subspace of the space of all operators on $X$. Moreover, they form a Banach algebra with the usual multiplication (composition) of operators and the submultiplicative norm $\|\cdot\|_{\mathrm{op}}$: $\|AB\|_{\mathrm{op}} \leq \|A\|_{\mathrm{op}}\|B\|_{\mathrm{op}}.$
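A quick numeric sketch (my own, with hypothetical matrices) of $\|A\|_{\mathrm{op}} = \sup_{\|x\|=1}\|Ax\|$ and of the submultiplicativity claim in $\mathbb{R}^2$, approximating the supremum by sampling the unit circle:

```python
# Sketch: approximate operator norms of 2x2 matrices by sampling the
# unit circle, then check ||AB|| <= ||A|| ||B||.
import math

def matvec(A, x):
    return tuple(sum(A[i][j] * x[j] for j in range(2)) for i in range(2))

def matmul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def op_norm(A, samples=4000):
    """Approximate sup over the sampled unit circle of |Ax|."""
    best = 0.0
    for k in range(samples):
        th = 2 * math.pi * k / samples
        Ax = matvec(A, (math.cos(th), math.sin(th)))
        best = max(best, math.hypot(*Ax))
    return best

A = ((2.0, 0.0), (0.0, 1.0))   # ||A|| = 2
B = ((1.0, 1.0), (0.0, 1.0))   # ||B|| = golden ratio
nA, nB, nAB = op_norm(A), op_norm(B), op_norm(matmul(A, B))
print(nA, nB, nAB)             # submultiplicativity: nAB <= nA * nB
```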

## Wrong intuition

The (wrong) first intuition is that for a non-zero projection we'll have $\|P\|=1$. Here's where this deception comes from. I'd erroneously think of orthogonal projections and expect $Px = \left\{\begin{aligned} & x,\ \text{for}\ x\in X_1,\\ & 0,\ \text{otherwise}.\end{aligned}\right.$ Then obviously $\|P\| = \sup_{\|x\|=1} \|x\| = 1.$ Yet the projection need not be of the form above --- it may be oblique.

Here's an example A. G. Baskakov used to demonstrate this phenomenon to me --- his ignorant student. Consider $X = \mathbb{R}^n$ and an operator $P$ defined by the formula $Px = (x, a)\, b$ for some fixed $a, b \in X$. Then $P^2 x = Py = (y, a)\, b = (x, a)(b, a)\, b,$ where $y = Px = (x, a)\, b$. Thus, had we chosen $a, b$ such that $(a, b) = 1$, the operator $P$ would be a projection. Now to its norm: $\sup_{\|x\|=1} \|Px\| = \sup_{\|x\|=1} |(x, a)|\,\|b\| = \left(\tfrac{a}{\|a\|}, a\right)\|b\| = \|a\|\|b\|.$ Could we pick $a$ and $b$ satisfying $(a, b) = \|a\|\|b\|\cos\phi = 1$ and $\|a\|\|b\| > 1$? Most certainly, sir.
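Baskakov's example is easy to check numerically. A minimal sketch (mine, with the hypothetical choice $a=(2,0)$, $b=(1/2,3)$ so that $(a,b)=1$):

```python
# Sketch: oblique projection Px = (x, a) b with (a, b) = 1.
# Check idempotence and that ||P|| = ||a|| ||b|| > 1.
import math

a = (2.0, 0.0)
b = (0.5, 3.0)          # (a, b) = 1, so P is a projection

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def P(x):
    s = dot(x, a)
    return tuple(s * bi for bi in b)

# P^2 = P on a sample vector
x = (0.3, -1.7)
Px, PPx = P(x), P(P(x))
idempotent_err = max(abs(p - q) for p, q in zip(Px, PPx))

# ||P|| via sampling the unit circle; should equal ||a|| ||b||
samples = 4000
norm_P = max(
    math.hypot(*P((math.cos(2 * math.pi * k / samples),
                   math.sin(2 * math.pi * k / samples))))
    for k in range(samples)
)
expected = math.hypot(*a) * math.hypot(*b)
print(idempotent_err, norm_P, expected)
```

With these vectors $\|a\|\|b\| = 2\sqrt{9.25} \approx 6.08$, comfortably greater than one.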

## Fix

But let's just see what can be told from the definition. Since $P^2=P$ and the norm is submultiplicative, it's clear that: $\|P\|_{\mathrm{op}} = \|P^2\|_{\mathrm{op}} \leq \|P\|_{\mathrm{op}}^2.$ For a non-zero projection this in turn implies $1\leq \|P\|_{\mathrm{op}}$, since we may divide both sides by $\|P\|_{\mathrm{op}} > 0$.