Posts tagged with linear algebra

## Test Coverage as a Spanning Basis

• People make mistakes.
• Computers misinterpret their programmers.
• Requirements change, and the plumber forgets whether he attached the hoopschwangler to the snarbgrabbler or went through the spare bildibungs. Drat! Forgot that the snarbgrabbler manual, page 157, says explicitly never to instantiate a hoopschwangler via snobgratchits, unless you first demephistiate with a gorkypipe.
• Ah, well, now something’s broke and we’re not quite sure why.

For all these reasons and more, modern software developers advocate writing tests. Just like your maths teacher showed you ways to verify that the answer you got was the correct one (just don’t mess up the verification as well!), software architects want to know that things work as expected, or if not, why.

So writing good software tests is a worthy modern skill, one that may never have existed, or even been conceivable, before now.

"Think of ways things can fail" doesn’t quite capture it.

It’s clearly not necessary to test for every way that a program can fail. Sometimes one test can cover a range of failures: for example, if the program thinks 1+90==100, I can derive other errors it’s likely to make, such as getting 2+90 wrong.

• In some sense test1 which asserts that 1+90≠100 covers test1.1 which asserts that 1+1+90≠101.
• Also it could conceivably be the case that test1 and test2 together cover what test3 is looking for.


From these examples I could speak of test coverage as a linear-algebraic basis.

Let’s say that λ•test1 covers all of the bad behaviour caught by test1. And λ•test1 + μ•test2 covers all of the bad cases caught by the combination of running both test1 and test2. Then it would be redundant to write tests that are already spanned by your basis of tests. And in this sense part of the problem of good test coverage is a way of coming up with independent (non-collinear) tests.
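To make the “spanning basis” idea concrete, here is a minimal sketch in Python (the failure modes and the three tests are invented for illustration): encode each test as a 0/1 vector over the failure modes it can catch, and let matrix rank count how many tests are genuinely independent.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) via Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        piv = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue                       # no pivot in this column
        m[r], m[piv] = m[piv], m[r]        # swap the pivot row up
        for i in range(len(m)):
            if i != r and m[i][col] != 0:  # eliminate this column elsewhere
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# Hypothetical failure modes: [off-by-one, bad-carry, sign-flip]
test1 = [1, 1, 0]   # catches off-by-one and bad-carry failures
test2 = [0, 0, 1]   # catches sign-flip failures
test3 = [1, 1, 1]   # catches nothing the other two don't
print(rank([test1, test2]))          # 2 -- two independent tests
print(rank([test1, test2, test3]))   # still 2: test3 is spanned, i.e. redundant
```

Adding test3 to the suite leaves the rank unchanged, which is the linear-algebra way of saying it was redundant to write it.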

Robustness would also be desirable, as would covering the more important types of failure.

## Rank-Nullity Theorem

The rank-nullity theorem in linear algebra says that dimensions either get

• thrown in the trash
• or show up

after the mapping.

By “the trash” I mean the origin—that black hole of linear algebra, the /dev/null, the ultimate crisscross paper shredder, the ashpile, the wormhole to void and cancelled oblivion; that country from whose bourn no traveller ever returns.

The way I think about rank-nullity is this. I start out with all my dimensions lined up—separated, independent, not touching each other, not mixing with each other. ||||||||||||| like columns in an Excel table. I can think of the dimensions as separable, countable entities like this whenever it’s possible to rejigger the basis to make the dimensions linearly independent.

I prefer to always think about the linear stuff in its nicely rejiggered state, and treat how to get it there as a separate issue.

So you’ve got your 172 row × 81 column matrix mapping 172→ separate dimensions into →81 dimensions. I’ll also forget about the fact that some of the resultant →81 dimensions might end up as linear combinations of the input dimensions. Just pretend that each input dimension is getting its own linear λ stretch. Now linear just means multiplication.

Linear stretches λ affect the entire dimension the same way. They turn a list like [1 2 3 4 5] into [3 6 9 12 15] (λ=3). They couldn’t turn it into [10 20 30 −42856712 50] (λ=10, except not the same stretch=multiplication everywhere).

Also remember – everything has to stay centred on 0. (That’s why you always know there will be a zero subspace.) This is linear, not affine. Things stay in place and basically just stretch (or rotate).

So if my entire 18th input dimension [… −2 −1 0 1 2 3 4 5 …] has to get transformed the same, to [… −2λ −λ 0 λ 2λ 3λ 4λ 5λ …], then linearity has simplified this large thing full of possibility and data, into something so simple I can basically treat it as a stick |.

If that’s the case—if I can’t put dimensions together but just have to λ stretch them or nothing, and if what happens to an element of the dimension happens to everybody in that dimension exactly equal—then of course I can’t stick all the 172→ input dimensions into the →81 dimension output space. 172−81 of them have to go in the trash. (effectively, λ=0 on those inputs)

So then the rank-nullity theorem, at least in the linear context, has turned the huge concept of dimension (try to picture 11-D space again would you mind?) into something as simple as counting to 11 |||||||||||.
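A quick Python sketch of that bookkeeping, with a made-up 3→2 map: rank counts the dimensions that show up after the mapping, nullity counts the ones in the trash, and together they always total the number of input dimensions.

```python
from fractions import Fraction

def rank(rows):
    """Rank via Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        piv = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# A map from 3 input dimensions down to 2: input dimension 3 gets lambda = 0,
# so it goes in the trash. (Columns play the input dimensions here.)
A = [[1, 0, 0],
     [0, 2, 0]]
n_inputs = 3
r = rank(A)
nullity = n_inputs - r
print(r, nullity)   # 2 1 -- rank + nullity = 3, the number of input dimensions
```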

In many showers there are two taps: Hot and Cold. You use them to control both the pressure and the temperature of the water, but you do so indirectly: the pressure is controlled by the sum of the (position of the) two taps, while the temperature is controlled by their difference. Thus, the basis you are given:

Hot = (1,0)
Cold = (0,1)

Isn’t the basis you want:

Pressure = (1,1)
Temperature = (1,−1)

($\begin{bmatrix}+1 & +1 \\ +1 & -1 \end{bmatrix}$ would be the basis isomorphism)
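Here is the change of basis spelled out numerically, as a Python sketch with hypothetical tap settings:

```python
# Hypothetical tap settings (how far open each tap is, in arbitrary units).
hot, cold = 3, 1

# The basis isomorphism [[1, 1], [1, -1]] converts tap coordinates
# into the coordinates you actually want:
pressure    = 1*hot + 1*cold    # sum of the taps controls pressure
temperature = 1*hot - 1*cold    # difference of the taps controls temperature
print(pressure, temperature)    # 4 2

# Its inverse (which happens to be itself, scaled by 1/2) recovers the taps:
hot2  = (pressure + temperature) / 2
cold2 = (pressure - temperature) / 2
print(hot2, cold2)              # 3.0 1.0
```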



(Images of other example bases omitted. Source: qr.ae)

## Tangent Space & getting Punched in the Face

I used to think vectors were the same as points. They’re lists of numbers, right? Tuples with each item ∈ the same domain (or same data type). But … they’re not.

The way vector mathematics describes the forces of the world is to instantiate one linear-algebra at each point on the surface where a force is imparted.

When I strike a ball with my foot, look at the exact contact point (and pretend the ball doesn’t squish across my foot—even though it does, because my legs are so powerful). There’s a tangent space at that contact point, and the exact particulars of my posture, the shape of my foot|shoe, and so on determine the force vector that I use to bend the ball like Beckham.

Newton’s f=ma corrected Aristotle’s theory of motion, which held that a vacuum is impossible: vacuum doesn’t occur, said Aristotle, but hypothetically, terrestrial motion in a vacuum would be indefinitely fast.

So we know we only need that one strike and then inertia minus drag convolved with lay-of-the-land will determine a full path for the ball: till my teammate’s head contacts it—another tangent space, another vector—and we score.

That’s football for simple tangent spaces—the Aristotle-v-Newton scenario. Now onto the most poetic sport, where two holonomic dynamical systems dance within each other’s gyre, moving in and out of each other’s movements and—occasionally—osculating a tangent that links one part of me with one part of you.

All of the logic about connecting vectors head-to-tail with parallelograms is only there to reason about one single strike. The whole linear algebra on force vectors is a complete examination of the possible variations on, and ways of delivering, a strike at that exact same point.

To talk about striking different points requires a connection (parallel transport). [And remember, if you hook my cheek versus jab my forehead, those are not parallel, so even more maths is involved. Also hard tissue (bone) versus soft tissue (cheek).]

Striking the ball on the side (to spin it) or under (to chip it) would be a connection on the S² manifold of the ball — moving the point of tangency — which is a different algebra’s worth of logic. Landing a punch on a different part of me is also a connection (that would be parallel transport of the strike vector on the surface manifold (with boundary) of my face/torso/armpit/whatever).

Keeping my torso over the ball is yet another calculus—one that I suspect is much, much more complicated. Even though Julian James Faraway and others work on these questions of whole-body mechanics, I don’t think ∃ a complete mathematical theory of how all the fixed holonomic parts of bodies with very similar morphogenetic shapes (similar ratios of forearm to humerus length, etc.) interact—how soft tissue and hard tissue in the usual places interact, and how specific angles can make a judoist with excellent technique and little strength able to whirl the body of a weight-lifter around her fulcrum.

Or how this guy can deliver a lot of force with proper dynamical posture (“shape”) when he’s clearly weak and fat. I can start to imagine the beginnings of something like that, but it doesn’t obviously fit into the tangent-space points & vectors story, except in a very complicated way of vectors connected to vectors connected to vectors, with each connection (not the same as the parallel-transport sense of the word I used above! Back to the base namespace) holonomically constrained, or even “soft-constrained” in the case of soft tissue.

Same with a blow landing parallel-transported to different places on the surface of my body. The transport takes care of the initial strike vector but not how those forces travel and twist through my skeleton, soft tissues, down to the floor (either through my feet or through my *rse if the blow knocks me down). And sure, you could FEM-approximate my interior with a 3-D mesh and a deep geometric logic—and even make the ribs more rigid than the liver—but that’s just going to make you want higher maths even more as you grapple with the army of ε’s you’ll convoke with all the approximations, dx size, parameters, and so on.

Or why it’s better to swing the bat this way, or teach your body to do its swimming strokes in that motion, or various other dynamical or static particulars of human body shape and internal anatomical facts.



BTW, an amateur ultimate fighter once told me that the order of importance for winning a fight is:

1. technique (Brazilian jiu-jitsu technique)
2. balance
3. agility
4. strength

He didn’t mention tolerance for getting punched in the head but he seemed to put up with it remarkably. PS: judo guys have sexy bodies.

## How do I Create the Identity Matrix in R? Also a bit of group theory.

I googled for this once upon a time and nothing came up. Hopefully this saves someone ten minutes of digging about in the documentation.

You make identity matrices with the function diag, passing the number of dimensions in parentheses.

> diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

That’s it.

> diag(11)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,]    1    0    0    0    0    0    0    0    0     0     0
 [2,]    0    1    0    0    0    0    0    0    0     0     0
 [3,]    0    0    1    0    0    0    0    0    0     0     0
 [4,]    0    0    0    1    0    0    0    0    0     0     0
 [5,]    0    0    0    0    1    0    0    0    0     0     0
 [6,]    0    0    0    0    0    1    0    0    0     0     0
 [7,]    0    0    0    0    0    0    1    0    0     0     0
 [8,]    0    0    0    0    0    0    0    1    0     0     0
 [9,]    0    0    0    0    0    0    0    0    1     0     0
[10,]    0    0    0    0    0    0    0    0    0     1     0
[11,]    0    0    0    0    0    0    0    0    0     0     1

But while I have your attention, let’s do a couple mathematically interesting things with identity matrices.

First of all you may have heard of Tikhonov regularisation, or ridge regression. That’s a form of penalty to rule out overly complex statistical models. @benoithamelin explains on @johndcook’s blog that

• Tikhonov regularisation is also a way of puffing air on a singular matrix (det M = 0) so as to make the matrix invertible without altering the eigenvalues too much.
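A toy Python illustration of the “puff of air”, on a made-up singular 2×2 matrix:

```python
# A singular matrix: its second row is half its first, so det = 0.
M = [[2.0, 4.0],
     [1.0, 2.0]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

print(det2(M))          # 0.0 -- not invertible

# Tikhonov: add lambda on the diagonal. Since M + lambda*I shifts every
# eigenvalue of M by exactly lambda, a small lambda barely alters them.
lam = 0.1
M_ridge = [[M[i][j] + (lam if i == j else 0.0) for j in range(2)]
           for i in range(2)]
print(det2(M_ridge) != 0.0)   # True -- invertible now
```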

Now how about a connection to group theory?

First take a 7-dimensional identity matrix, then rotate one of the rows off the top to the bottom row.

> diag(7)[ c(2:7,1), ]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    1    0    0    0    0    0
[2,]    0    0    1    0    0    0    0
[3,]    0    0    0    1    0    0    0
[4,]    0    0    0    0    1    0    0
[5,]    0    0    0    0    0    1    0
[6,]    0    0    0    0    0    0    1
[7,]    1    0    0    0    0    0    0

Inside the brackets it’s [row, column]. So the concatenation c(2,3,4,5,6,7,1) gives the new row order: the old rows 2 through 7 move up one slot, and old row 1 wraps around to the bottom.

Let’s call this matrix M.7 (a valid name in R) and look at the multiples of it. Matrix multiplication in R is the %*% symbol, not the * symbol. (* does entry-by-entry multiplication, which is good for convolution but not for this.)

Look what happens when you multiply M.7 by itself: it starts to cascade.

> M.7   %*%   M.7
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    0    1    0    0    0    0
[2,]    0    0    0    1    0    0    0
[3,]    0    0    0    0    1    0    0
[4,]    0    0    0    0    0    1    0
[5,]    0    0    0    0    0    0    1
[6,]    1    0    0    0    0    0    0
[7,]    0    1    0    0    0    0    0
> M.7   %*%   M.7   %*%   M.7
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    0    0    1    0    0    0
[2,]    0    0    0    0    1    0    0
[3,]    0    0    0    0    0    1    0
[4,]    0    0    0    0    0    0    1
[5,]    1    0    0    0    0    0    0
[6,]    0    1    0    0    0    0    0
[7,]    0    0    1    0    0    0    0

If I wanted to do straight-up matrix powers rather than typing M.7 %*% M.7 %*% M.7 %*% … %*% M.7 131 times, I would need the expm package (require(expm)) and then its %^% operator for the power.

Here are some more powers of M.7:

> M.7   %^%   4
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    0    0    0    1    0    0
[2,]    0    0    0    0    0    1    0
[3,]    0    0    0    0    0    0    1
[4,]    1    0    0    0    0    0    0
[5,]    0    1    0    0    0    0    0
[6,]    0    0    1    0    0    0    0
[7,]    0    0    0    1    0    0    0
> M.7   %^%   5
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    0    0    0    0    1    0
[2,]    0    0    0    0    0    0    1
[3,]    1    0    0    0    0    0    0
[4,]    0    1    0    0    0    0    0
[5,]    0    0    1    0    0    0    0
[6,]    0    0    0    1    0    0    0
[7,]    0    0    0    0    1    0    0
> M.7   %^%   6
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    0    0    0    0    0    1
[2,]    1    0    0    0    0    0    0
[3,]    0    1    0    0    0    0    0
[4,]    0    0    1    0    0    0    0
[5,]    0    0    0    1    0    0    0
[6,]    0    0    0    0    1    0    0
[7,]    0    0    0    0    0    1    0
> M.7   %^%   7
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    0    0    0    0    0
[2,]    0    1    0    0    0    0    0
[3,]    0    0    1    0    0    0    0
[4,]    0    0    0    1    0    0    0
[5,]    0    0    0    0    1    0    0
[6,]    0    0    0    0    0    1    0
[7,]    0    0    0    0    0    0    1


Look at the last one! It’s the identity matrix! Back to square one!

Or should I say square zero. If you multiplied again you would go through the cycle again. Likewise, if you multiplied intermediate matrices from midway through, you would still travel around within the cycle. It’s the exponent rule thing^x × thing^y = thing^(x+y), with the exponents working modulo 7.

What you’ve just discovered is the cyclic group P₇ (also sometimes called Z₇ or C₇). The pair (M.7, %*%) is one way of presenting the only consistent multiplication table for 7 things. Another way of presenting the group is the pair ({0,1,2,3,4,5,6}, + mod 7) (that’s where it gets the name Z₇, because ℤ = the integers). A third way of presenting the cyclic 7-group, which we can also do in R:

> w <- complex( modulus=1, argument=2*pi/7 )
> w
[1] 0.6234898+0.7818315i
> w^2
[1] -0.2225209+0.9749279i
> w^3
[1] -0.9009689+0.4338837i
> w^4
[1] -0.9009689-0.4338837i
> w^5
[1] -0.2225209-0.9749279i
> w^6
[1] 0.6234898-0.7818315i
> w^7
[1] 1-0i

Whoa! All of a sudden at the 7th step we’re back to “1” again. (A different one, but “the unit element” nonetheless.)

So three different number systems

• counting numbers;
• matrix-blocks; and
• a ring of complex numbers

— are all demonstrating the same underlying logic.
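The “same underlying logic” can be checked mechanically. Here is a Python sketch (the post uses R, but the idea is language-agnostic) of all three presentations, each cycling back to its unit element after exactly 7 steps:

```python
import cmath

# 1) Counting numbers: {0..6} under + mod 7. Adding 1 seven times returns to 0.
x = 0
for _ in range(7):
    x = (x + 1) % 7
print(x)                        # 0

# 2) Matrix-blocks: M.7 is the permutation i -> i+1 (mod 7);
#    composing it with itself 7 times gives the identity permutation.
perm = [(i + 1) % 7 for i in range(7)]
p = list(range(7))              # start from the identity
for _ in range(7):
    p = [perm[i] for i in p]
print(p == list(range(7)))      # True

# 3) Complex numbers: the 7th root of unity from the R session above.
w = cmath.exp(2j * cmath.pi / 7)
print(abs(w**7 - 1) < 1e-12)    # True -- w^7 is 1, up to float error
```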

Although each is merely an idea with only a spiritual existence, these are the kinds of “logical atoms” that build up the theories we use to describe the actual world scientifically. (Counting = money, or demography, or forestry; matrix = classical mechanics, or video game visuals; imaginary numbers = electrical engineering, or quantum mechanics.)

Three different number systems but they’re all essentially the same thing, which is this idea of a “cycle-of-7”. The cycle-of-7, when combined with other simple groups (also in matrix format), might model a biological system like a metabolic pathway.

Philosophically, P₇ is interesting because numbers—these existential things that seem to be around whether we think about them or not—have naturally formed into this “circular” shape. When a concept comes out of mathematics it feels more authoritative, a deep fact about the logical structure of the universe, perhaps closer to the root of all the mysteries.

In the real world I’d expect various other processes to hook into P₇—like a noise matrix, or some other groups. Other fundamental units should combine with it; I’d expect to see P₇ instantiated by itself rarely.

Mathematically, P₇ is interesting because three totally different number systems (imaginary, counting, square-matrix) are shown to have one “root cause” which is the group concept.

John Rhodes got famous for arguing that everything, but EVERYTHING, is built up from a logical structure made from simple groups (SNAGs plus the prime cyclics, of which P₇=C₇=Z₇ is one). viz, algebraic engineering

Or, in the words of Olaf Sporns:

[S]imple elements organize into dynamic patterns … Very different systems can generate strikingly similar patterns—for example, the motions of particles in a fluid or gas and the coordinated movements of bacterial colonies, swarms of fish, flocks of birds, or crowds of commuters returning home from work. … While looking for ways to compute voltage and current flow in electrical networks, the physicist Gustav Kirchhoff represented these networks as graphs…. [His] contemporary, Arthur Cayley, applied graph theoretical concepts to … enumerating chemical isomers….

Graphs, then, can be converted into adjacency matrices by putting a 0 in the [row=a, column=b] slot where there is no connection between a and b, or a (±)1 where there is a (directed) link between the two nodes. The sparse [0's, 1's] matrix M.7 above is a transition matrix of the cyclic C₇ picture: 1 → 2 → 3 → 4 → 5 …. A noun (C₇) converted into a verb (%*% M.7).
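Building that matrix from the graph is a one-liner. A Python sketch of the directed 7-cycle’s adjacency matrix:

```python
# Adjacency matrix of the directed cycle 1 -> 2 -> ... -> 7 -> 1:
# put a 1 in slot [row=a, column=b] iff there is a link a -> b.
n = 7
A = [[1 if (i + 1) % n == j else 0 for j in range(n)] for i in range(n)]

print(A[0])       # [0, 1, 0, 0, 0, 0, 0] -- node 1 links only to node 2
print(A[6][0])    # 1 -- node 7 links back around to node 1
# Every row sums to 1 (one outgoing edge per node), which is exactly
# what makes this adjacency matrix a permutation matrix like M.7.
print(all(sum(row) == 1 for row in A))   # True
```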

In short, groups are one of those things that make people think: Hey, man, maybe EVERYTHING is a matrix. I’m going to go meditate on that.

I feel vindicated in several ways by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms what I’ve been saying about recommendations as well as my experience designing recommendation engines for clients, in several ways:

1. Fancy ML techniques don’t matter so much. The winning BellKor’s Pragmatic Chaos team implemented ensemble methods with something like 112 techniques smushed together. You know how many of those the Netflix team implemented? Exactly two: RBMs and SVD.

If you’re a would-be internet entrepreneur and your idea relies on some ML but you can’t afford a quant to do the stuff for you, this is good news. Forget learning every cranny of research like Pseudo-Markovian Multibagged Quantile Dark Latent Forests! You can watch an hour-long video on OCW by Gilbert Strang which explains SVD and two hour-long Google Tech Talks by Geoff Hinton on RBMs. RBMs are basically a variety of neural network, with a theoretical account of why they work well. SVD is a dimension-reduction technique from linear algebra. (There are many Science / Nature papers on dimension reduction in biology; if you don’t have a licence there are paper-request fora on Reddit.)

Not that I don’t love reading about awesome techniques, or that something other than SVD isn’t sometimes appropriate. (In fact using the right technique on the right portion of the problem is valuable.) What Netflix people are telling us is that, in terms of a Kaggleistic one-shot on the monolithic data set, the diminishing marginal improvements to accuracy from a mega-ensemble algo don’t count as useful knowledge.

2. Domain knowledge trumps statistical sophistication. This has always been the case in the recommendation engines I’ve done for clients. We spend most of our time trying to understand the space of your customers’ preferences — the cells, the topology, the metric, common-sense bounds, and so on. You can OO program these characteristics. And (see bottom) doing so seems to improve the ML result a lot.

Another reason you’re probably safe ignoring the bleeding edge of ML research is that most papers develop general techniques, test them on famous data sets, and don’t make use of domain-specific knowledge. You want a specific technique that’s going to work with your customers, not a no-free-lunch-but-optimal-according-to-X academic algorithm. Some Googlers did a sentiment-analysis paper on exactly this topic: all of the text analysis papers they had looked at chose not to optimise on specific characteristics (like keywords or text patterns) known to anyone familiar with restaurant-review data. They were able to achieve a superior solution to that particular problem without fancy new maths, only using common sense and exploration specific to their chosen domain (restaurant reviews).

3. What you measure matters more than what you squeeze out of the data. The reason I don’t like* Kaggle is that it’s all about squeezing more juice out of existing data. What Netflix has come to understand is that it’s more important to phrase the question differently. The one-to-five-star paradigm is not going to accurately assess their customers’ attitudes toward movies. The similarity space is more like Dr Hinton’s reference to a ten-dimensional library where neighbourhood relationships don’t just go along a Dewey Decimal line but also style, mood, season, director, actors, cinematography, and yes the “People like you” metric (“collaborative filtering”, a spangled bit of jargon).

For Netflix, preferences evolve fairly quickly over time, which has to make the problem harder. If your users’ preferences also evolve over time: good luck.

John Wilder Tukey: "To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again." via John D Cook

Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation of the Kaggle winners is that they usually win with gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration; that gives them a significant boost in performance (1−AUC).

* I admire @antgoldbloom for following through on his idea and I do think they have a positive impact on the world. Which is much better than the typical “Someone should make X, that would be a great business” or, even worse but still typical, “I’ve been saying they should have that!” Still, I do hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.

## Linear Transformations will take you on a Trip Comparable to that of Magical Mushroom Sauce, And Perhaps cause More Lasting Damage

Long after I was supposed to “get it”, I finally came to understand matrices by looking at the above pictures. Staring and contemplating. I would come back to them week after week. This one is a stretch; this one is a shear; this one is a rotation. What’s the big F?

The thing is that mathematicians think about transforming an entire space at once. Any particular instance or experience must be of a point, but in order to conceive and prove statements about all varieties and possibilities, mathematicians think about “mappings of the entire possible space of objects”. (This is true in group theory as much as in linear algebra.)

So the change felt by individual ink-spots going from the original-F to the F-image would be the experience of an actual orbit in a dynamical system, of an actual feather blown by a bit of wind, an actual bullet passing through an actual heart, an actual droplet in the Mbezi River pulsing forward with the flow of time. But mathematicians consider the totality of possibilities all at once. That’s what “transforming the space” means.

$\begin{pmatrix} a \rightsquigarrow a & | & a \rightsquigarrow b & | & a \rightsquigarrow c \\ \hline b \rightsquigarrow a & | & b \rightsquigarrow b & | & b \rightsquigarrow c \\ \hline c \rightsquigarrow a & | & c \rightsquigarrow b & | & c \rightsquigarrow c \end{pmatrix}$

What do the slots in the matrix mean? Combing from left to right across the rows of numbers often means “from”. Going from top to bottom along the columns often means “to”. This is true in Markov transition matrices for example, and those combing motions correspond with basic matrix multiplication.

So there’s a hint of causation to this matrix business. Rows are the “causes” and columns are the “effects”. Second row, fifth column is the causal contribution of input B to the resulting output E and so on. But that’s not 100% correct, it’s just a whiff of a hint of a suggestion of a truth.
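Here is that combing motion as a Python sketch, using a made-up two-state Markov chain (rows are “from”, columns are “to”):

```python
# A 2-state Markov chain. Row i lists the probabilities of going
# FROM state i TO each state (read across the row).
P = [[0.9, 0.1],   # from Sunny: 90% stay Sunny, 10% to Rainy
     [0.5, 0.5]]   # from Rainy: 50% to Sunny, 50% stay Rainy

def step(v, P):
    """Comb a row vector of state probabilities through P: basic matrix multiplication."""
    return [sum(v[i] * P[i][j] for i in range(len(v))) for j in range(len(P[0]))]

v = [1.0, 0.0]       # certainly Sunny today
v = step(v, P)
print(v)             # [0.9, 0.1] -- tomorrow's distribution
```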

The “domain and image” viewpoint in the pictures above (which come from Flanigan & Kazdan about halfway through) is a truer expression of the matrix concept.

• [ [1, 0], [0, 1] ] maps the Mona Lisa to itself,
• [ [.799, −.602], [.602, .799] ] has a determinant of 1 — does not change the amount of paint — and rotates the Mona Lisa by 37° counterclockwise,
• [ [1, 0], [0, 2] ] stretches the image northward;
• and so on.
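A quick Python spot-check of two of those bullets: the rotation has determinant 1, so the amount of paint is conserved, while the northward stretch doubles it.

```python
import math

# The 37-degree counterclockwise rotation (entries ~ [.799, -.602; .602, .799]).
theta = math.radians(37)
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# det = cos^2 + sin^2 = 1: no paint created or destroyed.
det = R[0][0]*R[1][1] - R[0][1]*R[1][0]
print(abs(det - 1.0) < 1e-12)              # True

# The northward stretch [[1, 0], [0, 2]] has det = 2: twice the paint.
S = [[1, 0], [0, 2]]
print(S[0][0]*S[1][1] - S[0][1]*S[1][0])   # 2
```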

MATRICES IN WORDS

Matrices aren’t* just 2-D blocks of numbers — that’s a 2-array. Matrices are linear transformations. Because “matrix” comes with rules about how the numbers combine (inner product, outer product), a matrix is a verb whereas a 2-array, which can hold any kind of data with any or no rules attached to it, is a noun.

* (NB: Computer languages like R, Java, and SAGE/Python have their own definitions. They usually treat vector == list && matrix == 2-array.)

Linear transformations in 1-D are incredibly restricted. They’re just proportional relationships, like “Buy 1 more carton of eggs and it will cost an extra $2.17. Buy 2 more cartons of eggs and it will cost an extra $4.34. Buy 3 more cartons of eggs and it will cost an extra $6.51….” Bo-ring.

In scary mathematical runes one writes:

$\begin{matrix} y \propto x \\ \textit{---or---} \\ y = \mathrm{const} \cdot x \end{matrix}$

And the property of linearity itself is written:

$\begin{matrix} a \cdot f(\cdots, \; \blacksquare , \; \cdots) + b \cdot f( \cdots, \; \blacksquare,\; \cdots) \\ = \\ f( \cdots, \; a \cdot \blacksquare + b \cdot \blacksquare, \; \cdots) \end{matrix}$

• f is the linear mapping
• a, b ∈ the underlying number corpus 𝕂
• the above holds for any term ■

Or say: rescaling or adding first, it doesn’t matter which order.
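That “rescaling or adding first, it doesn’t matter which order” property can be verified directly. A Python sketch with an arbitrary made-up 2×2 matrix standing in for the linear map f:

```python
# f is the linear map given by a fixed matrix M.
M = [[2, 1],
     [0, 3]]

def f(v):
    """Apply M to a 2-vector."""
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

# Check the linearity property: a*f(x) + b*f(y) == f(a*x + b*y).
a, b = 3, -2
x, y = [1, 4], [5, 6]
lhs = [a*p + b*q for p, q in zip(f(x), f(y))]     # rescale-then-add after mapping
rhs = f([a*xi + b*yi for xi, yi in zip(x, y)])    # rescale-and-add before mapping
print(lhs == rhs)   # True -- the order doesn't matter
```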



The matrix revolution does so much generalisation of this simple concept that it’s hard to believe you’re still talking about the same thing. First of all, there’s the insight that mathematically abstract vectors, including vectors of generalised numbers, can represent just about anything. Anything that can be “added” together.

And I put the word “added” in quotes because, as long as you define an operation that’s commutative, associative, and distributes over multiplication-by-a-scalar, you get to call it “addition”! See the mathematical definition of a module (or vector space).

• The blues scale has a different notion of “addition” than the diatonic scale.
• Something different happens when you add a spiteful remark to a pleased emotional state than when you add it to an angry emotional state.
• Modular and noncommutative things can be “added”. Clock time, food recipes, chemicals in a reaction, and all kinds of freaky mathematical fauna fall under these categories.
• Polynomials, knots, braids, semigroup elements, lattices, dynamical systems, networks, can be “added”. Or was that “multiplied”? Like, whatever.
• Quantum states (in physics) can be “added”.
• So “adding” is perhaps too specific a word—all we mean is “a two-place input, one-place output satisfying X, Y, Z”, where X,Y,Z are the properties from your elementary school textbook like identity, associativity, commutativity.

But that’s just vectors. Matrices also add dimensionality. Linear transformations can be from and to any number of dimensions:

• 1→7
• 4→3
• 1671 → 5
• 18 → 188
• and X→1 is a special case, the functional. Functionals comprise performance metrics, size measurements, your final grade in a class, statistical moments (kurtosis, skew, variance, mean) and other statistical metrics (Value-at-Risk, median), divergence (not gradient nor curl), risk metrics, the temperature at any point in the room, EBITDA, not function(x) { c( count(x), mean(x), median(x) ) }, and … I’ll do another article on functionals.

In contemplating these maps from dimensionality to dimensionality, it’s a blessing that the underlying equation is as simple as linear (proportional). When thinking about information leakage, multi-parameter cause & effect, sources & sinks in a many-equation dynamical system, images and preimages and dual spaces; when the objects being linearly transformed are systems of partial differential equations—being able to reduce the issue to mere multi-proportionalities is what makes the problems tractable at all.

So that’s why so much painstaking care is taken in abstract linear algebra to be absolutely precise — so that the applications which rely on compositions or repetitions or atlases or inversions of linear mappings will definitely go through.



Why would anyone care to learn matrices?

Understanding of matrices is the key difference between those who “get” higher maths and those who don’t. I’ve seen many grad students and professors reading up on linear algebra because they need it to understand some deep papers in their field.

• Linear transformations can be stitched together to create manifolds.
• If you add Fourier | harmonic | spectral techniques + linear algebra, you get really trippy — yet informative — views on things. Like spectral mesh compressions of ponies.
• The “linear basis” and “linear combination” metaphors extend far. For example, to eigenfaces or When Doves Cry Inside a Convex Hull.
• You can’t understand slack vectors or optimisation without matrices.
• JPEG, discrete wavelet transform, and video compression rely on linear algebra.
• A 2-matrix characterises graphs or flows on graphs. So that’s Facebook friends, water networks, internet traffic, ecosystems, Ising magnetism, Wassily Leontief’s vision of the economy, herd behaviour, network-effects in sales (“going viral”), and much, much more that you can understand — after you get over the matrix bar.
• The expectation operator of statistics (“average”) is linear.
• Dropping a variable from your statistical analysis is linear. Mathematicians call it “projection onto a lower-dimensional space” (second-to-last example at top).
• Taking-the-derivative is linear. (The differential, a linear approximation of a could-be-nonlinear function, is the noun that results from doing the take-the-derivative verb.)
• The composition of two linear functions is linear. The sum of two linear functions is linear. From these it follows that long differential equations—consisting of chains of “zoom-in-to-infinity” (via “take-the-derivative”) and “do-a-proportional-transformation-there” then “zoom-back-out” … long, long chains of this—can amount in total to no more than a linear transformation.

• If you line up several linear transformations with the proper homes and targets, you can make hard problems easy and impossible problems tractable. The more “advanced-mathematics” the space you’re considering, the more things become linear transformations.
• That’s why linear operators are used in both quantum mechanical theory and practical things like building helicopters.
• You can understand dynamical systems, attractors, and thereby understand love better through matrices.
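One of the bullets above, “dropping a variable is projection onto a lower-dimensional space”, can be sketched in Python:

```python
# "Dropping a variable" as a linear map: project (x, y, z) onto the x-y plane.
P = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 0]]

def apply(M, v):
    """Apply a square matrix M to a vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

v = [3, 5, 7]
print(apply(P, v))                             # [3, 5, 0] -- z is gone
# Projections are idempotent: projecting twice changes nothing more.
print(apply(P, apply(P, v)) == apply(P, v))    # True
```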

## Angle = Volume

This is trippy, and profound.

The determinant — which tells you the change in size after a matrix transformation 𝓜 — is just an Instance of the Alternating Multilinear Map.

(Alternating meaning it goes + − + − + − + − ……. Multilinear meaning linear in every term, ceteris paribus:

$\large \begin{matrix} a \; f(\cdots \blacksquare \cdots) + b \; f( \cdots \blacksquare \cdots) \\ = \\ f( \cdots a \ \blacksquare + b \ \blacksquare \cdots) \end{matrix} \\ \\ \qquad \footnotesize{\bullet f \text{ is the multilinear mapping}} \\ \qquad \bullet a, b \in \text{the underlying number corpus } \mathbb{K} \\ \qquad \bullet \text{above holds for any term } \blacksquare \text{ (if done one-at-a-time)}$)
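Both properties can be checked by hand in the 2×2 case, where the determinant of the matrix with columns u and v is u₁v₂ − v₁u₂. A small sketch (function name `det2` is mine):

```python
def det2(col1, col2):
    """Determinant of a 2×2 matrix, given as its two column vectors."""
    (a, c), (b, d) = col1, col2
    return a * d - b * c

u, v, w = (3, 1), (2, 5), (1, 2)

# Alternating: swapping two inputs flips the sign (+ − + − …).
assert det2(u, v) == -det2(v, u)

# Multilinear: linear in one column at a time, the other held fixed.
a, b = 4, -7
combo = (a * u[0] + b * v[0], a * u[1] + b * v[1])   # a·u + b·v
assert det2(combo, w) == a * det2(u, w) + b * det2(v, w)

print("det is alternating and multilinear (2×2 check passed)")
```

The same two identities, taken as axioms and normalised by det(I) = 1, pin the determinant down uniquely in any dimension.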



Now we trip. The inner product — which tells you the “angle” between 2 things, in a super abstract sense — is also an instantiation of the Multilinear Map (a symmetric one this time, rather than alternating).

In conclusion, mathematics proves that Size is the same kind of thing as Angle.

Say whaaaaaat? I’m going to go get high now and watch Koyaanisqatsi.

## Positive Semi-Definite

Mathematical matrices are blocks of numbers. So they don’t necessarily have definite characteristics the way plain vanilla, singleton numbers do. They’re freer. For example 131 ∈ positive and −9 ∈ negative. But is

$\begin{bmatrix} -5 & 0 \\ -3 & 8 \end{bmatrix}$

positive, negative, or zero? Well, it has all three within it. How can a matrix be “just” +, just −, or just 0?

Likewise, is the function sin(x): ℝ→ℝ positive, negative, or zero? It maps to uncountably many (ℶ₁) positives, uncountably many (ℶ₁) negatives, and a lesser-but-still-infinite number (ℵ₀) of zeroes. So how could it be just one of the three?

But James Mercer realised in 1909 that some functions “act like” regular singleton numbers, being positive, negative, or zero. And if it walks like a duck, quacks like a duck, smells like a duck — it is a duck. For example, sandwich a kernel k between two copies of any nice test function f: if ∫∫ f(x) k(x,y) f(y) dx dy ≥ 0 no matter which f you try, then k “acts positive”. Hmm, that’s just like a positive number a — a·x·x ≥ 0 no matter which x you try.
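The discrete shadow of Mercer’s condition is easy to poke at numerically: sample the kernel at finitely many points, weight the samples arbitrarily, and the double sum Σᵢⱼ cᵢcⱼ k(xᵢ, xⱼ) should never go negative. A sketch with the Gaussian kernel, a classic Mercer kernel (the sample points and weights are arbitrary):

```python
import math
import random

def k(x, y):
    """Gaussian (RBF) kernel — a standard positive semi-definite kernel."""
    return math.exp(-(x - y) ** 2)

random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(8)]   # arbitrary sample points
cs = [random.uniform(-1, 1) for _ in range(8)]   # arbitrary weights

# Discrete analogue of the double integral: it plays the role of a·x·x.
q = sum(ci * cj * k(xi, xj)
        for ci, xi in zip(cs, xs)
        for cj, xj in zip(cs, xs))
print(q >= 0)  # → True: the kernel "acts like" a non-negative number
```

Change the seed, the weights, or the points all you like; for a Mercer kernel the quadratic form stays ≥ 0, which is exactly the duck test.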



### Matrices

Similarly, some matrices “act like” single solitary numbers, being positive, negative, or zero.

• Matrices that act like a single positive number are called positive definite.
• Matrices that act like a single negative number are called negative definite.
• Matrices that act like a single non-negative (≥0) number are called positive semidefinite.
• Matrices that act like a single non-positive (≤0) number are called negative semidefinite.
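One concrete meaning of “acts like a single positive number”: the quadratic form xᵀMx is positive for every nonzero x — equivalently, every eigenvalue is positive. For a symmetric 2×2 matrix the eigenvalues come from a closed formula, so the whole classification fits in a few lines of Python (the function name is my own):

```python
import math

def classify_sym2(a, b, c):
    """Classify the symmetric 2×2 matrix [[a, b], [b, c]] by its eigenvalues."""
    # Eigenvalues of a symmetric 2×2: mean of the diagonal ± a radius term.
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lo, hi = mean - radius, mean + radius
    if lo > 0:
        return "positive definite"
    if hi < 0:
        return "negative definite"
    if lo >= 0:
        return "positive semidefinite"
    if hi <= 0:
        return "negative semidefinite"
    return "indefinite"

print(classify_sym2(2, 0, 3))   # → positive definite
print(classify_sym2(1, 1, 1))   # eigenvalues 0 and 2 → positive semidefinite
print(classify_sym2(1, 2, 1))   # eigenvalues 3 and −1 → indefinite
```

The “indefinite” case is the one with no singleton-number analogue: some directions x make xᵀMx positive and others make it negative, just like sin(x) up above.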

Being able to identify these simple subtypes of matrices makes Théorie much easier. Instead of talking about linear operators in general and not getting many facts to reason from, the person coming up with the theory can build with smaller, easier-to-handle blocks, and later on prove that the small blocks can build up anything.

I first saw semi-definite matrices in economic theory, where a positive definite Hessian at a critical point indicates a local minimum of a value function (semi-definiteness is the necessary condition). (A Hessian is like a Jacobian but filled with second derivatives instead of first. And a value function maps from {all the complicated choices of life} → utility ∈ ℝ. So value functions have a Holy Grail status.)
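That Hessian test can be run numerically on a toy function. Below, `hessian_2d` builds the 2×2 Hessian by central differences, and Sylvester’s criterion checks positive definiteness (the function and names are mine, not from any economics text):

```python
def hessian_2d(f, x, y, h=1e-4):
    """Central-difference Hessian of f: R² → R at the point (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return [[fxx, fxy], [fxy, fyy]]

# Toy "value function" with a minimum at the origin (my example):
f = lambda x, y: x**2 + 3 * y**2

H = hessian_2d(f, 0.0, 0.0)        # analytically [[2, 0], [0, 6]]
# Sylvester's criterion for 2×2: positive definite iff the top-left entry
# and the determinant are both positive — so (0, 0) is a local minimum.
det_H = H[0][0] * H[1][1] - H[0][1] * H[1][0]
print(H[0][0] > 0 and det_H > 0)   # → True
```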

But semi-definite & definite functions are used in functional data analysis as well.



### Functions

$\large \int 3x - 5x^2 + 13 x^3 \ \rm{d}x$

Positive semi-definite functions are used as kernels in

• landmark regression
• data smoothing, especially of high-frequency time series
• audio transformations
• photoshop transformations
• BS stock price prediction

At some point in the 20th century we learned to expand the definition of number to include any corpus of things that behaves like numbers. Nowadays you can say corpora of functions and well-behaved corpora of matrices effectively are numbers. (I mean “commutative rings with identity and multiplicative inverses”.) That is a story for another time.
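One famous corpus of matrices that effectively are numbers: matrices of the shape [[a, −b], [b, a]] add and multiply exactly like the complex numbers a + bi. A quick sanity check in plain Python (tuples stand in for 2×2 matrices; the helper names are mine):

```python
def as_matrix(a, b):
    """Represent the complex number a + bi as a 2×2 real matrix."""
    return ((a, -b), (b, a))

def matmul(X, Y):
    """2×2 matrix product."""
    return tuple(
        tuple(sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

# (1 + 2i)(3 + 4i) = 3 + 4i + 6i + 8i² = -5 + 10i
product = matmul(as_matrix(1, 2), as_matrix(3, 4))
print(product == as_matrix(-5, 10))  # → True
```

Within this corpus multiplication even commutes and every nonzero element has an inverse — it is a field hiding inside the generally ill-behaved world of matrices.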



### Uses

Besides being philosophically interesting as an example of a generalised number, positive semi-definite functions and matrices have practical uses. Because they are simple building blocks of more complicated things, you can decompose a hairy operator into semi-definite pieces and reason about the pieces separately.

## Jacobian

In the world of linear approximations of multiple parameters and multiple outputs, the Jacobian is a matrix that tells you: if I twist this knob, how does that part of the output change?

$\large \begin{bmatrix} \frac{\partial F^1}{\partial a} & \frac{\partial F^1}{\partial b} & \frac{\partial F^1}{\partial c} & \cdots & \frac{\partial F^1}{\partial z} \\ \frac{\partial F^2}{\partial a} & \frac{\partial F^2}{\partial b} & \frac{\partial F^2}{\partial c} & \cdots & \frac{\partial F^2}{\partial z} \\ \vdots & & & \ddots & \\ \frac{\partial F^N}{\partial a} & \frac{\partial F^N}{\partial b} & \frac{\partial F^N}{\partial c} & \cdots & \frac{\partial F^N}{\partial z} \end{bmatrix}$

(The Jacobian is defined at a point. If the space is not flat, but instead only approximated by flat things that are joined together, then you would stitch together different Jacobians as you stitch together different flats.)

Pretend that a through z are parameters, or knobs you can twist. Let’s not say whether you have control over them (endogenous variables) or whether the environment / your customers / your competitors / nature / external factors have control over them (exogenous variables).

And pretend that F¹ through Fᴺ are the separate kinds of output. You can think in terms of a real number or something else, but as far as I know the outputs cannot be linked in a lattice or anything other than a matrix rectangle.

In other words this matrix is just an organised list of “how parameter c affects output F”.
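That “twist a knob, read off the response” picture is exactly how a Jacobian gets computed numerically: bump each parameter by a tiny h and record how each output moves. A sketch with forward differences (the two-knob, two-output F is my own toy example):

```python
def jacobian(F, x, h=1e-6):
    """Finite-difference Jacobian of F: R^n → R^m at the point x.
    Entry [i][j] answers: twist knob j — how does output i respond?"""
    Fx = F(x)
    J = []
    for i in range(len(Fx)):
        row = []
        for j in range(len(x)):
            bumped = list(x)
            bumped[j] += h                      # twist knob j a tiny bit
            row.append((F(bumped)[i] - Fx[i]) / h)
        J.append(row)
    return J

# Two outputs driven by two knobs:
F = lambda x: [x[0] * x[1], x[0] + 2 * x[1]]
J = jacobian(F, [3.0, 4.0])
# Analytically the Jacobian here is [[x₁, x₀], [1, 2]] = [[4, 3], [1, 2]].
```

Each column of J is one knob’s effect on everything; each row is one output’s sensitivity to everything — the organised list the paragraph above describes.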

Nota bene — the Jacobian is just a linear approximation. It doesn’t carry any of the info about mutual influence, connections between variables, curvature, wiggle, womp, kurtosis, cyclicity, or even interaction effects.

A Jacobian tensor would tell you how twisting knob a knocks on through parameters h, l, and p. Still linear but you could work out the outcome better in a difficult system — or figure out what happens if you twist two knobs at once.

In maths jargon: the Jacobian is a matrix filled with partial derivatives.