Notes on main (page 2, with older entries)


A truth

Back to watching TV in the background, throwing my time away. Yeah, it's reprehensible and all that, but... just take a look at what cool things you sometimes find:

House MD, S05E03: Stop saying "a truth". There's only one truth

Eric Moulines' mini-course

I was fool enough to miss most of Eric's course. Well, just one lecture out of two, technically. But still. Though it wasn't really my fault either: that Saturday had been reserved a long time ago.

So, I listened to his talk last Wednesday, which wasn't exactly impressive, then missed the first lecture on MC, and today finally arrived for his last lecture, which was really cool. He's a great lecturer!

Now, I should remind you that I don't know much about measures and probabilities; it's not my field. So most of the stuff I found cool might in fact be quite trivial. But here are the things that were new to me and seemed cool:

  • Hamiltonian paths are closed. For this reason, when doing HMC, we have to take random trajectory lengths, lest our iterations end up right where they began
  • Total variation is actually a norm on measures which is dual to the sup-norm on functions! Really lovely. And quite useful, because I couldn't persuade myself to actually read the definition of TV for a long time. You know, analytic definitions always make you feel disrespected -- as if the author didn't care enough to pre-process his idea into a form convenient enough to be taken as an apparent geometrical truth.
  • There exists a coupling formulation for TV: for measures \( \mu \) and \( \nu \) there exist random variables \( X \sim \mu \) and \( Y \sim \nu \) (on some prob. space) such that \( \mathbb{P}(X \neq Y) = \operatorname{TV}(\mu, \nu) \)
  • We've also covered the Dobrushin coefficient.
  • We've even applied Banach's fixed-point theorem! Felt like I was home. None of that stochastic-differential rubbish whose signatures nobody bothers to define!
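To see the two faces of TV side by side, here's a toy numerical check (my own example, not from the lectures; I use the normalization \( \operatorname{TV}(\mu,\nu) = \sup_A |\mu(A) - \nu(A)| \), i.e. half the \( L^1 \) distance between mass functions):

```python
import numpy as np

# Two discrete distributions on {0, 1, 2} (arbitrary toy choice).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# TV distance: half the L1 distance between the mass functions,
# equivalently sup over events A of |p(A) - q(A)|.
tv = 0.5 * np.abs(p - q).sum()

# Maximal coupling: put as much joint mass as possible on the diagonal
# {X = Y}; the leftover mass P(X != Y) then equals the TV distance.
diag = np.minimum(p, q)            # mass placed on {X = Y}
prob_not_equal = 1.0 - diag.sum()

print(tv, prob_not_equal)          # both equal 0.3
```

The maximal coupling is exactly the one the coupling formulation asserts to exist: you cannot make \( X \) and \( Y \) agree with probability higher than \( 1 - \operatorname{TV} \).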

Getting started with GANs: Part 1, getting access to GPUs

Here's how things stand now. I finally talked to Thibaut and he agreed to supervise me; we'll probably be working on something related to barycentres or curvature in geodesic spaces, or whatever (we've discussed some possible problems, but we haven't agreed on anything very specific yet). That's all very fascinating mathematical stuff, but sooner or later I have to catch up with the real world. One of the things I hear about most often in relation to optimal transport distances is WGANs, so it seems reasonable to begin my acquaintance with the outside world with them.

The nuance about this endeavour of mine is that I've never in my life actually trained even a CNN. It always felt like something trivial and not worth knowing the details of...

As @ferres suggested, before getting all into WGANs it's worth preparing a baseline. So I'll first work through Goodfellow's old GANs.
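Before touching real images, the original GAN objective can be sketched on a toy 1-D problem. Everything below (the affine generator, the logistic discriminator, all hyperparameters) is my own toy assumption, not Goodfellow's actual setup; the point is only the alternating ascent on \( \log D(x) + \log(1 - D(g(z))) \) and the non-saturating generator objective \( \log D(g(z)) \):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real data ~ N(4, 1); generator g(z) = a*z + b; discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr = 0.05

for step in range(2000):
    x = rng.normal(4.0, 1.0, 64)           # real batch
    z = rng.normal(0.0, 1.0, 64)
    g = a * z + b                           # fake batch

    # Discriminator: gradient ascent on log D(x) + log(1 - D(g)).
    dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
    w += lr * ((1 - dx) * x - dg * g).mean()
    c += lr * ((1 - dx) - dg).mean()

    # Generator: gradient ascent on the non-saturating objective log D(g).
    dg = sigmoid(w * (a * z + b) + c)
    a += lr * ((1 - dg) * w * z).mean()
    b += lr * ((1 - dg) * w).mean()

fake_mean = (a * rng.normal(0.0, 1.0, 10000) + b).mean()
print(fake_mean)   # drifts from the initial 0 towards the real mean 4
```

Nothing here guarantees convergence (GAN training oscillates), but on this toy problem the generated mean reliably moves towards the data mean.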

I'll be using the cheapest GPU-powered AWS instance out there:

It's nothing too fancy, just better than a laptop with integrated graphics:

I'm covering it with the studentpack promo worth $100.

...and it won't launch:

A spot instance wouldn't launch either -- "no capacity". So I'm requesting a limit increase.

Notation for DEs in Hilbert spaces

Look, guys, I appreciate and understand that Hilbert spaces are self-dual, but if you use two different notations for differentiation in a single equation when you could've used just one -- you'd better really mean it.

Look at this:

Wherefore do you write so many symbols if you could've written fewer and thus avoided being a liar? Take the Hilbert space under discussion: the left-hand side is the derivative -- a row vector -- and the RHS is a column: the same derivative encoded as a vector.

\( \mu(\mathrm{d}x) \) and \( \mathrm{d}\mu(x) \)

'Tis known that if you begin from the Lebesgue point of view you naturally end up with notation where \( \mu(\mathrm{d}x) \) can be thought of as the measure of an infinitesimal set; Riemann, on the other hand, leads you to \( \mathrm{d}\mu(x) \), which stands for "change of measure". This is all of course bird language which doesn't define anything for real and doesn't help computation.

Now, I really hate all these limit-centric notations. In the smooth, bounded, non-random, finite-dimensional Euclidean world (in other words, in any naive calculus) we most elegantly get rid of them by saying that the differential is a differential form, that is, it maps each point into a multilinear function -- a tangent, a multilinear approximation of the integral curve. Then, even though the action of the linear operator may involve limits in its definition -- which is perfectly fine -- our differential is a pretty understandable object with a specific type. The integral curve is naturally characterized as the one with a given initial value and given tangents at each point. This captures, in a rigorous way, the original intuition of operating with "infinitesimal increments". The bird-language phrasing "for every small enough increment..." actually means that we define a linear map that turns every increment into an approximation of the corresponding change of value.
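To pin down the type-checked picture, here's the simplest instance (my example, not from any particular book):

```latex
% The differential of f(x) = \|x\|^2 on \mathbb{R}^n:
% df_x is the linear map h \mapsto 2\langle x, h\rangle, so
\[
  f(x + h) = f(x) + \underbrace{2\langle x, h\rangle}_{df_x(h)} + \|h\|^2,
\]
% and the remainder \|h\|^2 is o(\|h\|): the linear map df_x is exactly
% the "approximation of the increment" discussed above, an object with a
% specific type, even though verifying the o(.) involves a limit.
```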

But why -- it feels like I fail to construct, even for myself, an analogous interpretation for all these \( \mu(\mathrm{d}x) \) or \( \mathrm{d}\mu(x) \).

The naive way to describe the latter is to say simply that it is a random linear operator -- the derivative of for any fixed outcome .

Yet with the set... a set isn't much of a linear space, and it isn't about linearity either. Also, this notation doesn't go in line with the main idea -- that we're actually splitting the image space instead of the domain.

Porto and curvature

To-day (or more precisely the day before this night) there was: a bottle of Chateaux (Merlot, Cabernet Sauvignon), one of don't-remember-what, two shots of Sandeman vintage, and a glass of Sandeman tawny.

I found all of them sweet to the taste rather than dry, which they in fact are. The portos were too sweet for me (even the tawny one), much like Cahors. But when limited to a single glass, methinks they make a very noble drink for our cold times.

I also learned about a really cool cake: Pàvlova (thanks to they-will-understand-I-am-referring-to-them)

With three good portos Villani's lectures start looking especially reasonable and charming -- it feels like you understand the world! I'm now watching this one: https://youtu.be/zo46TEp6FB8. It got quite some praise in some Medium post that I found in some tweet. Now, what I mean by "charming" is, for example, these stories about "Kantorovich's duality theorem being considered a very capitalistic theorem". TL;DR: the "capitalistic" part is actually strong duality, which tells us, in the case considered, that trying to minimize over the constraint set by yourself gives the same result as letting a hired man maximize his profit. But his mentioning that "(about Monge and Kantorovich) both were extremely precocious students, and both of them were professors at the age of 22" drives me back into depression. I'm two years older than Galois was when he got shot, and I've done nothing thus far.

Oh, I also finally had some good runs: night, forest, and a speed-mad doge pulling me forward. A long-gone feeling. Running the e-track at the dorm can't even compare.

Variational derivative

Zadorozhny is now giving lectures on what he calls "variational methods and random processes". Despite some common abuse of notation, the course seems to be quite useful. I'll post some summaries of what he's teaching us. I'll use slightly different notation, though, and thus, as always, all the good is his and all the wrong is mine. This part is about variational derivatives.


Let be one of the following Banach spaces:

  • the space of bounded continuous functions defined on the interval with the norm:

  • the space of functions possessing bounded continuous derivatives:

  • the space of classes of equivalent functions with the norm:


We consider a functional, and we consider it in the context of variational problems or optimal control problems. The common approach to investigating these problems is linearization. Now, we could quite naturally begin by considering the Fréchet derivative of the functional (assuming it has one): a bounded linear operator -- the Fréchet derivative at the given function. Yet it turns out there's something even more convenient. As we saw earlier1,2 the functional is often an integral one. This and some further exploration lead to the following definition:

The function is called the functional, or variational, derivative of the functional if the Fréchet derivative at any function is a bounded integral operator with the kernel

In simpler words, the variational derivative is the kernel of the Fréchet derivative.

Nota bene: while the Fréchet derivative is unique, it may have infinitely many equivalent kernels.
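As a sanity check on the definition, here's a numerical illustration with a functional I chose myself: for \( F[x] = \int_0^1 x(t)^2\,\mathrm{d}t \) the variational derivative should be the kernel \( 2x(t) \), so the Gateaux derivative in a direction \( h \) should equal \( \int_0^1 2x(t)h(t)\,\mathrm{d}t \):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 2001)
# Trapezoid rule on the uniform grid (hand-rolled to stay self-contained).
integrate = lambda y: np.sum((y[:-1] + y[1:]) / 2) * (t[1] - t[0])

x = np.sin(2 * np.pi * t)        # a base "point" in function space
h = t.copy()                     # direction of the increment

F = lambda y: integrate(y**2)    # F[x] = \int_0^1 x(t)^2 dt

# The Gateaux derivative (F[x + eps*h] - F[x]) / eps for small eps...
eps = 1e-6
gateaux = (F(x + eps * h) - F(x)) / eps

# ...should match the pairing of the kernel 2*x(t) with h(t).
pairing = integrate(2 * x * h)

print(gateaux, pairing)          # agree to about 1e-6
```

For this particular \( x \) and \( h \) the pairing is \( 2\int_0^1 t\sin(2\pi t)\,\mathrm{d}t = -1/\pi \), which the numbers reproduce.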


There's an established notation for the variational derivative. As you could've guessed, it's a bit messy. Although aware of the distinction, in his works Zadorozhny usually doesn't distinguish a function from its value, or a derivative from the increment of the value associated with a specific increment of the argument --- that is, the derivative applied to that increment. The wikipedia page3 is prone to a similar abuse of notation.

Thus, so far we will use the following notation for the value of the functional derivative at a specific function and time :

Sadly, Zadorozhny often calls this thing a function, which of course it isn't. He also uses this same symbol for the value of the derivative and for all of the associated functions. It's common, though. The real problem comes from the fact that he also calls the following expression both a derivative and a differential (and not just their value):

I suppose that's because he talks too much with physics folks. The common notion of a derivative is that of a linear mapping which approximates the function under consideration as well as possible. The integral above isn't a derivative of the functional, but the underlying mapping is. The notion of a "differential" is then nonsense -- as opposed to the notion of a "differential associated with a specific change", which is still excessive. That's a long and old story, though, which probably deserves its own writeup.

To denote the function itself I'd rather propose the notation

Yet obviously things are going to get more complicated than that when it comes to functionals of several variables and changes of variables. In those cases I believe multi-index notation could be used, just as with usual derivatives.



It is important to note that these spaces are included in each other as normed spaces, so that if a linear operator is bounded in one of them, then it is just as well bounded in the others. And so are the functionals.

Suppose we have two norms which lie in the following relation:


Thus whenever

it's also

In other words if as then as .

Now it's easily seen that


which proves our assertion about inclusions.
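Footnote 4 points at Hölder's inequality, so presumably the norm comparison being used for the \( L^p \) spaces on a finite interval is the standard estimate (my reconstruction, since the formulas above didn't survive):

```latex
% For 1 <= q <= p and x in L^p[a, b], Hölder's inequality applied to
% |x|^q * 1 gives
\[
  \|x\|_{L^q}
  = \Big(\int_a^b |x(t)|^q \,\mathrm{d}t\Big)^{1/q}
  \leq (b - a)^{\frac{1}{q} - \frac{1}{p}}\, \|x\|_{L^p},
\]
% so convergence in the stronger L^p norm implies convergence in L^q, and a
% functional bounded w.r.t. the weaker norm is bounded w.r.t. the stronger one.
```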

Properties or "The rules of variational differentiation"

Here one needs to be precise about the spaces in which the derivative is bounded and those in which it's not. Despite that, I'll omit the proofs for some trivial cases.

Derivative of a constant functional


If has a variational derivative, then so does and:


If the functionals and have variational derivatives, then so does their sum and the following equality holds:

Derivative of a multiple

If the functionals and have variational derivatives then so does and:

That is .
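Spelled out in full (my reconstruction of the standard statement, since the formula above didn't survive extraction), the product rule reads:

```latex
\[
  \frac{\delta (F G)}{\delta x(t)}
  = \frac{\delta F}{\delta x(t)}\, G[x] + F[x]\, \frac{\delta G}{\delta x(t)},
\]
% because the Fréchet derivative of the product is
%   D(FG)[x]h = (DF[x]h)\, G[x] + F[x]\, (DG[x]h),
% and reading off the integral kernels on both sides gives the formula above.
```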

Chain rule 1

Suppose has a variational derivative and is a function. Consider a functional



Chain rule 2

Suppose has a variational derivative and is a function. Consider a functional

If is , then has a variational derivative and:

Just in case you're not familiar with this notation, let's rewrite this definition using an auxiliary function defined by when we fix some . Then .

Then the statement above can be rewritten as:


Let's consider an increment:

In the second line we used a linear approximation of the value of a differentiable function .

In the third line we've rewritten the preceding expression so that we get an expression of the form , where and are two functions.

This allows us to use a linear approximation in the fourth line.

Note the in the end. It's actually .

It follows then that in (and in by inclusion) the variational derivative of does exist and:


  1. {{ site.url }}link://slug/var-calc-criteria 

  2. {{ site.url }}link://slug/opt-criterions 

  3. https://en.wikipedia.org/wiki/Functional_derivative 

  4. https://en.wikipedia.org/wiki/H%C3%B6lder%27s_inequality 

Formal series

I intend to write a series of posts on some topics I've often found to be vaguely presented or misrepresented in bachelor's math courses. This one's about something called "formal series" in discrete mathematics. It's written in a rough and inaccurate manner, because the true reason I'm publishing it now is that I'm afraid of forgetting about it again. I'll (re)write this whole thing later when I have time. Yet here's the gist (a bit more under the cut): don't try to fool students by calling it "an application of complex analysis to performing some operations on sequences".

These dudes introduce some magical "formal series" without specifying what they really are, what sequences are, and why all of this actually works. The way it ought to be taught is as follows:

  1. Since the very first semester (or even better, in school) the folks should learn some (very basic) basics of functional analysis. Sets and functions. And the difference between a function and its value at some point
  2. Then you teach them that a sequence in the set \( X \) is a map from \( \mathbb{N} \) to \( X \).
  3. You teach them that a series in the v.s. \( X \) is essentially a sequence of partial sums, and that series and sequences in a v.s. are interchangeable
  4. You teach them about differentiability and power series
  5. If for your sequence \( a \) the value \( \limsup_k\sqrt[k]{|a_k|} \) is finite, then the corresponding power series \( \sum_k a_k (x - x_0)^k \) converges in some disc. That's right, it's Hadamard's bit; works well in Banach spaces
  6. If you add one series to another (or multiply it by a constant, or do you-name-what to it), then it's just going to change the radius of convergence somehow. But the series will still converge at least somewhere. Well, actually, I'm not sure right now what happens if they're centered around different points; I'll look through it after I finally get some sleep.
  7. It's an (absolutely) convergent power series, thus it defines a differentiable function!
  8. You could try to mix and split your series into parts resembling Taylor series of some well-known functions
  9. How now, you've got combinations of some trivia? I reckon you know something about adding \( \frac{1}{1-ax} \) to \( \frac{2}{1-bx} \).
  10. Mix and expand!


There are some things about projections that, it turned out, I was getting all wrong. Specifically, it concerns the norms of projections. All right, I shall confess: I never realized the norm could be greater than one.


A linear operator \( P: X\to X\) is called a projection if \( P^2 = P \). Suppose \( X \) is a normed v.s. Then we can introduce the "operator norm": \[ \|A\|_{\mathrm{op}} = \inf \{C\in\mathbb{R};\ \|Ax\|\leq C\|x\|\ \text{for all}\ x\in X\}\] for those operators \( A \) for which it's finite. Here \( \|\cdot\|:X\to\mathbb{R} \) is the norm in \( X \). That value happens to equal \[ \|A\|_{\mathrm{op}} = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|} = \sup_{\|x\|=1} \|Ax\|. \] Such operators form a linear subspace of the space of all operators on \( X \). Moreover they form a Banach Algebra with the usual multiplication (composition) of operators and the submultiplicative norm \( \|\cdot\|_{\mathrm{op}} \): \[ \|AB\|_{\mathrm{op}} \leq \|A\|_{\mathrm{op}}\|B\|_{\mathrm{op}} \]
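The submultiplicativity is easy to poke at numerically; in finite dimension the operator norm induced by the Euclidean norm is the spectral norm, which numpy computes directly (random matrices here are just my stand-in for operators on a finite-dimensional \( X \)):

```python
import numpy as np

rng = np.random.default_rng(1)

# np.linalg.norm(A, 2) is the spectral norm: the operator norm induced
# by the Euclidean vector norm.
for _ in range(100):
    A = rng.normal(size=(5, 5))
    B = rng.normal(size=(5, 5))
    assert np.linalg.norm(A @ B, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(B, 2) + 1e-9

print("submultiplicativity holds on 100 random pairs")
```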

Wrong intuition

The (wrong) first intuition is that for a non-zero projection there will be \( \|P\|=1 \). Here's where this deception comes from. I'd erroneously think of orthogonal projections and expect \[ Px = \left\{\begin{aligned} & x,\ \text{for}\ x\in X_1,\\ & 0,\ \text{otherwise}.\end{aligned}\right. \] Then obviously \( \|P\| = \sup_{\|x\|=1} \|x\| = 1. \) Yet the projection need not be of the form above --- it may be oblique.

Here's an example AG Baskakov used to demonstrate this phenomenon to me --- his ignorant student. Consider \( X = \mathbb{R}^n \) and an operator \( P \) defined by the formula: \[ Px = (x, a) b, \] for some fixed \( a, b \in X \). Then \( P^2 x = Py = (y, a) b = (x, a) (b, a) b, \) where \( y = Px = (x, a) b \). Thus, if we choose \( a, b \) such that \( (a, b) = 1 \), the operator \( P \) is a projection. Now to its norm: \[ \sup_{\|x\|=1} \|Px\| = \sup_{\|x\|=1} |(x, a)|\|b\| = \left(\tfrac{1}{\|a\|}a, a\right)\|b\| = \|a\|\|b\|. \] Could we pick \( a \) and \( b \) to satisfy \( (a, b) = \|a\|\|b\|\cos\phi = 1 \) and \( \|a\|\|b\| > 1 \)? Most certainly, sir.
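Baskakov's example is easy to check numerically. With \( a = (2, 0) \) and \( b = (1/2, 1) \) (my choice of vectors) we get \( (a, b) = 1 \) while \( \|a\|\|b\| = \sqrt{5} > 1 \):

```python
import numpy as np

a = np.array([2.0, 0.0])
b = np.array([0.5, 1.0])
assert np.dot(a, b) == 1.0          # the condition making P a projection

P = np.outer(b, a)                  # the matrix of P x = (x, a) b

# Idempotence: P^2 = P.
assert np.allclose(P @ P, P)

# The operator (spectral) norm of the rank-one P equals ||a|| * ||b|| > 1.
norm_P = np.linalg.norm(P, 2)
print(norm_P, np.linalg.norm(a) * np.linalg.norm(b))   # both sqrt(5) ≈ 2.236
```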


But let's just see what can be told from the definition. Since \( P^2=P \) and since the norm is submultiplicative it's clear that: \[ \|P\|_{\mathrm{op}} = \|P^2\|_{\mathrm{op}} \leq \|P\|_{\mathrm{op}}^2. \] Dividing by \( \|P\|_{\mathrm{op}} \) (which is non-zero for a non-zero projection), this in turn implies \[ 1\leq \|P\|_{\mathrm{op}} \]

Features of orthogonal projections