#mathematics

Fractals

src: talk at Microsoft Research

Nature vs Culture

Scale-free, or scale-invariance. If you’re given a picture of some rock, then you can’t tell what scale it is at, since you have self-similarity across scaling. Hence you need that reference scale on the bottom right. I wonder if this relates to scale-invariance (à la [[effectiveness-of-normalized-quantities]]).

Roughness. Self-similarity. Now there’s a mathematical framework for thinking about these very human, imprecise notions.1 Curiously, this reminds me of how I think about social networks and how a lot of this messiness is due to social agents. Somehow with this framework it almost seems to abstract away the messiness, and focus on some curious scale-invariance.

Graphs of stock prices have a similar issue (definitely felt this first-hand), whereby if you’re not given time scales, you really can’t distinguish between day or month (besides the upward trend for, say, S&P500).

Independence vs Uncorrelated: if you look at stocks, and if there is a spike one day, then it’s pretty likely that in the next day, there’s also going to be spikes (volatility is high). The only thing is that you don’t know the sign of the activity. So you have expectation zero, but these two events not independent.

Classical take on stock markets is to treat it like brownian motion (continuous time gaussian process). But this doesn’t capture fat-tail-ness and long-range dependence. But with the fractal view, using one parameter (with two real-valued parts), you can basically describe the whole gamut of stock market fluctuations. You can elicit things like anti-persistence and persistence. Start with a trend (upward), add a generator, and then just keep repeating (i.e. fractal).2 Definitely feels like the way you can get brownian motion as a limit to finite random walks, except the key thing is that you’re not taking step sizes to zero, but something a little more self-similar.

  • keep moments finite, because then you can take central limit theorems. so things can average out.

The fractal view of markets is descriptive, not prescriptive.

It’s sort of interesting that, with the advent of ML, what you want is over-parameterization. Like that’s sought after now. Whereas in the old days, it’s all about finding the simple thing that then produces complexity. So here Mandelbrot talks about how using multi-fractals, he ends up with only two parameters, and by varying only those two, is able to produce remarkably different scenarios. He contrasts this to other methods, that would have to imbue their model with various extensions in order to be able to capture all the various facets of the characteristics of the price movement.

In some sense, this is good: there seems to be some scale-free nature to price movements. So there is some pattern going on. I’m not too familiar with the math, but I feel like specifying the roughness is half the battle. Or you just say that the rest is just noise or perturbations.

Pseudo-inverses and SGD

Power Tower

source: YouTube, Graph

Power Tower for $b = 1.52$ Figure 1: Power Tower for \(b = 1.52\)

Power Tower for $b = 2^{1/2}$ Figure 2: Power Tower for \(b = 2^{1/2}\)

New Linear Algebra

source: videos, slides

New way to teach linear algebra.

\[ \begin{align*} Ax = \begin{bmatrix} 1 & 4 & 5 \\ 3 & 2 & 5 \\ 2 & 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} x_1 + \begin{bmatrix} 4 \\ 2 \\ 1 \end{bmatrix} x_2 + \begin{bmatrix} 5 \\ 5 \\ 3 \end{bmatrix} x_3 \end{align*} \]

\(A = CR\)

\[ \begin{align*} A = CR = \begin{bmatrix} 1 & 4 \\ 3 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \end{align*} \]

  • getting independent columns, which is the basis for the column space.
    • clearly, you can always do this
    • it’s just that sometimes it’s not interesting (i.e. if it’s already a basis)
  • \(R\) is the basis for the row space, also rank 2
  • number of independent columns = \(r\) = number of independent rows
  • \(\operatorname{rank}(C(A)) = \operatorname{rank}(R(A))\)
  • \(Ax = 0\) has one solution, and there are \(n-r\) independent solutions to \(Ax = 0\)
  • \(R\) turns out to be the row reduced echolon form of \(A\)

\(Ax = 0\)

\[ \begin{align*} \begin{bmatrix} \text{row } 1 \\ \vdots \\ \text{row } m \end{bmatrix} \begin{bmatrix} \\ x \\ \, \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} \end{align*} \]

  • 0 on the right means that the rows must all be orthogonal to \(x\)
  • i.e. every \(x\) in the nullspace \(N(A)\) must be orthogonal to the row space.
  • if you take transposes you get the equivalent statement for column space

Figure 1: Big Picture Linear Algebra

Big Picture Linear Algebra

\(A = BC\)

\[ \begin{align*} A = BC = b_1 c_1^* + b_2 c_2^* + \ldots \end{align*} \]

  • here, \(b_i\) are the column vectors of \(B\), while \(c_i^*\) are the row vectors of \(C\).
  • you can think of matrix also as column by row multiplication
    • it’s more like building blocks there, adding up rank 1 matrices

\(A = LU\)

  • you can rewrite \(A\) as a product of lower-triangular and upper-triangular matrices.
  • that’s what you’re doing when you’re solving simultaneous equations by elimination

\(Q^\top Q\)

  • \(Q\) with orthogonal(normal) column vectors, such that \(Q^\top Q = I\)
  • not necessarily true that \(QQ^\top = I\), if not square
  • \(QQ^\top\) is like projection:
    • \(QQ^\top QQ^\top = QQ^\top\)
  • if square, then \(QQ^\top = I\) and \(Q^\top = Q^{-1}\)
  • length is preserved by \(Q\)

\(A=QR\)

  • Gram-Schmidt: orthogonalize the columns of \(A\).
  • assuming full rank of \(C(A)\)
  • \(Q^\top A = Q^\top Q R = R\)
  • Major application: Least Squares
    • want to minimize \(\norm{b - Ax}^2 = \norm{e}^2\)
    • normal equation for minimizing solution (best \(\hat{x}\)): \(A^\top e = 0\)
      • if \(A = QR\), then you get \(R \hat{x} = Q^\top b\)

Figure 2: Least Square

Least Square

\(S = Q \Lambda Q^\top\)

  • symmetric matrix \(S=S^\top\) have orthogonal eigenvectors
  • how to show \(x^\top y = 0\)
    • \(Sx=\lambda x \implies x^\top S^\top = \lambda x^\top \implies x^\top S y = \lambda x^\top y\)
    • \(Sy=\alpha y \implies x^\top S y = \alpha x^\top y\)
    • hence, \(\lambda x^\top y = \alpha x^\top y\). but \(\lambda \neq \alpha\), since eigenvalues are distinct
      • hence, \(x^\top y = 0\)
  • things are not nice when things aren’t square
    • but \(A^\top A\) is always square, symmetric, PSD
    • same holds for \(AA^\top\)
      • same eigenvalues
      • \(x \to Ax\) to convert eigenvectors
  • from definition of eigenvalues: \(AX = X \Lambda\)
    • if you don’t have symmetry, then you can’t just hit it with \(X^\top\) and get things to cancel

\(A = U \Sigma V^\top\)

  • \(U^\top U = I, V^\top V = I\)
  • \(AV = U \Sigma\) (like the eigen-decomposition, but not quite)
    • \(Av_i = \sigma_i u_i\)
  • \(U, V\) are rotations, \(\Sigma\) stretch.
  • how to choose orthonormal \(v_i\) in the row space of \(A\)
    • \(v_i\) are eigenvectors of \(A^\top A\) (which are orthonormal, from above)
    • \(A^\top A v_i = \lambda_i v_i = \sigma_i^2 v_i\)
    • so then \(A^\top A V = \Sigma^2 V\)
  • how to choose \(u_i\)
    • \(u_i = \frac{1}{\sigma_i} A v_i\)
    • check orthonormal: \(u_i^\top u_j\)
  • combining, \(A^\top A v_i = \sigma_i^2 v_i \implies A^\top u_i = \sigma_i v_i\)

Figure 3: Singular Value Decomposition

Singular Value Decomposition