Why A^{T}A is invertible? (4) Linear Algebra

Appendix A: null space and column space
We use the null space in the proof, so let me explain it a bit. If you already know about the null space, you can of course skip this entry.

The null space of a matrix A is the set of all vectors x that satisfy Ax = 0.
Let me show you an example: a square matrix A that has a nontrivial null space.
When x \neq 0, the following x is a solution.
Therefore, this x is in the null space of A. For any scalar a with a \neq 0, ax is also a solution, so these vectors are in the null space as well. In this example, the matrix is singular (and square). If the matrix is not singular, the only solution is x = 0, because a nonsingular square matrix has an inverse, so
x = A^{-1}Ax = A^{-1}0 = 0. In this case, we say the null space is trivial (it contains only 0).
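As a concrete illustration, here is a small Python sketch. The original example matrix did not survive, so the matrix and vector below are my own hypothetical example: a singular 2x2 matrix whose second column is twice the first.

```python
# Hypothetical singular matrix: the second column is twice the first,
# so the columns are dependent and the null space is nontrivial.
A = [[1.0, 2.0],
     [2.0, 4.0]]

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# x = (2, -1) satisfies Ax = 0, so x (and every scalar multiple ax, a != 0)
# lies in the null space of A.
x = [2.0, -1.0]
assert matvec(A, x) == [0.0, 0.0]
assert matvec(A, [4.0, -2.0]) == [0.0, 0.0]   # a scalar multiple also works
```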

Let me show you another example, this time a rectangular matrix A that has a nontrivial null space.
The solution is the same as the last example.

By the way, this is all there is to the definition of the null space. I could finish the story here; however, I would like to think about the meaning of the null space a bit more. Since I want to talk about the existence of the inverse matrix, let's restrict ourselves to square matrices: a rectangular matrix can be interpreted as a system with too many or too few equations, so there is no inverse for such matrices.

The null space is related to whether a matrix is singular or not, that is, whether the matrix has independent columns or not. To see this, write A in terms of its column vectors and multiply by x:

Therefore, Ax is a linear combination of the column vectors a_i with coefficients x_i. If this result is 0 for some x \neq 0, that x is in the null space. This is one way to look at the null space.

Looking at Ax = b column-wise, b is a linear combination of the column vectors of A. The space spanned by the column vectors is called the column space. If we have n independent columns in n-dimensional space, we can express any point of that space. This means there is a solution for any b, and the inverse exists. Conversely, if A has an inverse, the linear combination of the columns is 0 iff x = 0; that is, the null space is trivial (it contains only 0).

Now we have seen the null space, the column space, and the existence of the inverse. I hope the relationship among them is a bit clearer now.


Thanks to Marc D., who listened to my explanation and gave me comments.


Why A^{T}A is invertible? (3) Linear Algebra

By the way, one of my friends told me my blog is completely incomprehensible. If you share my friend's opinion and see something I could improve, please let me know. Thanks!

Explanation from Linear combination and basis transformation

Let me explain this from another point of view. This explanation doesn't need the null space, but it lacks a rigorous proof. However, we already have the proof via the null space, so I believe the conclusion is correct. For me, this is the more intuitive explanation.

Because the columns of A are independent, the product A^{T}A can be organized column by column as follows; you can also see how the resulting columns are computed.
Since A^{T} is a linear operator,
the i-th column of A^{T}A is A^{T}a_i. This becomes 0 only when A^{T}a_i degenerates. But this has the same form as a basis transformation, where each column is a basis vector: the rows of A^{T} are independent (since the columns of A are independent, and transposing A exchanges columns with rows), and each independent column a_i is mapped to the column A^{T}a_i. The independence of the columns is preserved unless some A^{T}a_i degenerates.
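The column-by-column view can be checked numerically. A minimal sketch, assuming numpy is available; the matrix A is an arbitrary example of mine, not the one from the post:

```python
import numpy as np

# Illustrative matrix with independent columns (my example, not the post's).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

AtA = A.T @ A
# Column i of A^T A is A^T applied to the i-th column a_i of A.
for i in range(A.shape[1]):
    assert np.allclose(AtA[:, i], A.T @ A[:, i])
```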

Unfortunately, I have no proof that no degeneration occurs. But what happens here is that an independent basis is transformed into another independent basis, and I don't see how the independence could break under this condition.

By the way, this matrix is called the metric tensor. My favorite explanation of it is in the book ``Mathematical Sciences of Graphics'' by Koukichi Sugihara, Chapter 1. (I have the first edition, third printing, which has a typo in Equation 1.20 on page 19; the equations leading up to it are correct, and only the final equation has the typo.)

Next, I would like to add an appendix explaining the null space, the column space, and the existence of the inverse. It might not be strictly necessary, so the main story about A^{T}A ends here.


Why A^{T}A is invertible? (2) Linear Algebra

Why A^{T}A has the inverse

Let me explain why A^{T}A has an inverse if the columns of A are independent. First, if a matrix is n by n and all its columns are independent, it is a square full-rank matrix, and therefore the inverse exists. So the interesting case is when A is an m by n rectangular matrix. Strang's explanation is based on the null space. The null space and the column space are fundamentals of linear algebra, and his explanation is simple and clear. However, I don't recall the null space being explained in my linear algebra class at university. Maybe I wasn't paying attention. I regret that...

Explanation based on null space

This explanation is based on Strang's book. The column space and the null space are the main characters. Let's start.

Assume A^{T}Ax = 0, where

x is in the null space of A^{T}A. The matrices A^{T}A and A share the null space. One direction is easy:
if Ax = 0, then A^{T}Ax = A^{T}0 = 0. This means that if x is in the null space of A, x is also in the null space of A^{T}A. The other direction is the interesting one. If x = 0 is the only possibility, A^{T}A has the inverse, since its columns then span the whole n-dimensional space.

We would like to multiply by the inverse of A^{T} from the left, but A is a rectangular matrix, so there is no such inverse. Instead, we can multiply by x^{T} from the left:

x^{T}A^{T}Ax = 0.

x^{T}A^{T} is a row vector and Ax is a column vector, so their product is an inner product. If we write Ax = b, then x^{T}A^{T} = (Ax)^{T} = b^{T}, so b^{T}b = 0. (Note: the last 0 is a scalar, not a vector.) The inner product of a vector with itself is \sum_i b_i^2, a sum of squares, which is 0 if and only if b = 0. Therefore Ax = 0.

This is a nice explanation. Now we use the independence of A's columns: it implies that the null space of A contains only 0. Since A^{T}A shares its null space with A, the columns of A^{T}A are also independent. Moreover, A^{T}A is square, so it is a full-rank square matrix. The columns of A are independent, but they cannot span the whole m-dimensional space since A is rectangular; the columns of A^{T}A, on the other hand, span the n-dimensional space. Therefore the inverse exists.
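A quick numerical check of this argument, assuming numpy is available. The 5x3 matrix below is random, so its columns are independent with probability 1:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # rectangular, independent columns
AtA = A.T @ A

# x^T A^T A x equals ||Ax||^2, so A^T A x = 0 forces Ax = 0.
v = rng.standard_normal(3)
assert np.isclose(v @ AtA @ v, np.dot(A @ v, A @ v))

# A^T A is square and full rank, hence invertible.
assert AtA.shape == (3, 3)
assert np.linalg.matrix_rank(AtA) == 3
AtA_inv = np.linalg.inv(AtA)
assert np.allclose(AtA_inv @ AtA, np.eye(3))
```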

I would like to add one point. Assume a matrix B in place of A^{T}, with B \neq A^{T}, so that

x^{T}BAx = 0.

At first, I thought this shows B and A share the null space. That is wrong, because
it only means that the two vectors a = (x^{T}B)^{T} and b = (Ax) are perpendicular. It does not mean (x^{T}B) = 0. The transpose of this reads as follows.

We actually don't know whether x^{T}B is 0, so we don't know whether x is in the left null space of B. A and A^{T}A share the null space, but given an arbitrary B, B and BA usually don't share the null space. Strang's book doesn't mention this; maybe it is too obvious, but I misunderstood it at first.
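Here is a tiny made-up counterexample sketch, assuming numpy: with an arbitrary singular B in place of A^{T}, the null space of BA can be strictly larger than that of A.

```python
import numpy as np

# Hypothetical example: with an arbitrary B instead of A^T,
# the null spaces need not coincide.
A = np.eye(2)                     # Ax = 0 only for x = 0
B = np.array([[1.0, 0.0],
              [0.0, 0.0]])        # B is singular

x = np.array([0.0, 1.0])
# BAx = 0 although Ax != 0: x is in the null space of BA but not of A.
assert np.allclose(B @ A @ x, 0)
assert not np.allclose(A @ x, 0)
```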

Next time, I will explain this from another point of view.


Why A^{T}A is invertible? (1) Linear Algebra

We see the matrix product A^{T}A frequently, for example in the least squares method. If the matrix A has independent columns, this product has the following great properties: it is square, symmetric, and invertible. I am interested in why A^{T}A has these properties, and I would like to explain it. The explanation is based on Gilbert Strang's Introduction to Linear Algebra, 4th edition. It uses the null space and the quadratic form, and it is elegant. But if I just followed Strang's explanation, I would only need to write: see Strang's book. So I would like to add a geometric explanation. I don't have a rigorous proof; it is just a bit more intuitive for me. I hope you enjoy it, and even if my explanation is bad, you can still read Strang's.

Properties of A^{T}A

We see the product A^{T}A frequently, for example in a projection operation or a least squares computation (actually, those two are the same).

In general, we can make the following assumptions.

  • A is not a square matrix, and we would like to find the least squares solution. In this case, the matrix is m by n with m > n, meaning there are more equations than unknowns. When does this happen? For example, when you observe signals or take statistics, like taking photographs to analyze something. You may repeat exactly the same measurement, yet the results differ because of observation error, so you end up with more samples than parameters. In the m < n case, you can simply take more samples. So m by n with m > n is the general case.
  • We can make the columns of A independent. If they are not, we can discard the dependent columns. (If we don't know how many parameters the model has and remove too many columns, we might get a wrong answer. This is a technical issue; the point is that a best-effort step is possible.)

If the second assumption, independent columns, holds, A^{T}A has the following nice properties:

  • Square
  • Symmetric
  • Existence of inverse

Therefore, this matrix is useful.

I would like to stress that the column independence of A is necessary; otherwise, A^{T}A has no inverse. You can easily see this: if Ax = 0 for some x \neq 0, then A^{T}Ax = 0 as well, so A^{T}A is singular.

It is easy to see why A^{T}A is a square matrix: [n by m][m by n] is [n by n] by the matrix multiplication rule.

The symmetry follows from

 (A^{T}A)^{T} = A^{T}(A^{T})^{T} = A^{T}A,

where (A^{T})^{T} = A: the transpose of the transpose is the original matrix. The remaining question is the existence of the inverse.
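The three properties can be verified on a small example. A sketch assuming numpy; the 4x2 matrix is my own choice, with m > n and independent columns:

```python
import numpy as np

# Illustrative 4x2 matrix with independent columns (m > n).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
AtA = A.T @ A

assert AtA.shape == (2, 2)                 # square: [n by m][m by n] = [n by n]
assert np.allclose(AtA, AtA.T)             # symmetric: (A^T A)^T = A^T A
assert np.linalg.matrix_rank(AtA) == 2     # invertible, since A's columns are independent
```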


Filter design (2)

Input actual data

Let's feed some signals into the filter we designed last time. Table 1 shows a constant input signal. A constant signal is boring, but we start with a simple one.

Table 1 Constant input

To make sure, let me show how the red output in Table 1 is computed. Here is y_n for n = 1:
In this case, the filter sees the first three inputs. The inputs are all the same (= 1), so all the outputs are also the same. Let's compute the transfer function; in digital filtering, we compute the transfer function all the time.
What, Jojo, you...! Be surprised. (This is a bit old; does anybody know JoJo's Bizarre Adventure by Hirohiko Araki?) The ratio of output to input is equal to the transfer function!
One small detail: in the table, y_n has a value at n = 0 through 8, since we can compute the cos values there. But there is usually no n = -1 sample in real data acquisition (if we start at n = 0, there is no data at n = -1). Therefore, one could also say there is no output value y_0.

I think this example is a bit boring, since the input is constant and thus the output is also constant. The following are more dynamic examples, in Tables 2 and 3.

Table 2 cos π/3 input

Table 3 cos 2π/3 input

Let me show you how to compute the red output in Table 2.
The transfer function is the following.
The transfer function describes how much of the input is transferred to the output. Here the value is 1, which means the input and output are the same: a signal of this frequency goes through the filter unchanged. You can clearly see the correspondence between the input and the output; they are identical. Transfer functions are great. In the case of Table 3,

the transfer function value is:

Needless to say, this frequency cannot pass the filter.
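We can replay both tables in a few lines of Python. This assumes the filter from the previous post has the symmetric three-sample form with the solved coefficients a = b = 1/2, i.e. y_n = (x_{n-1} + x_n + x_{n+1})/2; if your coefficients differ, adjust accordingly:

```python
import math

# Assumed filter from the previous post: y_n = (x_{n-1} + x_n + x_{n+1}) / 2.
def filt(x, n):
    return 0.5 * (x(n - 1) + x(n) + x(n + 1))

pass_band = lambda n: math.cos(n * math.pi / 3)      # omega = pi/3: gain 1
stop_band = lambda n: math.cos(2 * n * math.pi / 3)  # omega = 2*pi/3: gain 0

for n in range(1, 8):
    # The pi/3 signal passes unchanged; the 2*pi/3 signal is cancelled.
    assert math.isclose(filt(pass_band, n), pass_band(n), abs_tol=1e-12)
    assert math.isclose(filt(stop_band, n), 0.0, abs_tol=1e-12)
```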


The mathematics used here is not so complex. If you accept Euler's formula, high school math can handle the rest. Euler's formula itself can be explained via the (infinite) series expansions of e^x, sin(x), and cos(x): comparing the series gives you a hint of the relationship between them. At least one can see that some relationship exists, I presume.

The transfer function is an eigenvalue of the sampling filter function. These are ideal cases, but it is fun to see the theory work so well.

Filter design (1)

This is about filter design from Hamming's book on digital filters. Chapter 3.9 is outstanding, so I would like to make a note of it.

A while ago there was a soccer tournament, and how to suppress the vuvuzela sound (while letting other sounds pass) was a popular topic. In such a case, you analyze the frequency of the vuvuzela sound and design a filter that does not pass that frequency. Digital filter design is about deciding which frequencies of the signal pass and which do not.

See the following article for instance, http://blogs.mathworks.com/loren/2010/06/30/vuvuzela-denoising-with-parametric-equalizers/

In chapter 3.9, a simple non-recursive filter design is described as an example.

The filter looks like the following (a symmetric three-sample form):

 y_n = a x_{n-1} + b x_n + a x_{n+1}

This filter uses three samples. Designing the filter means determining a and b to fit your needs. First, we calculate the transfer function of this filter. As I explained in a former blog post, a transfer function is an eigenvalue of the filter function: it shows the output that the filter F produces for a given input. If the input signal is x (a function of time t, x(t)) and the filter function is F, the eigenvalue λ satisfies F x = λ x. If you are familiar with linear algebra, think of F as a matrix A and λ as a scalar; then this is the same as A x = λ x. Now we compute the transfer function: for the input x_n = e^{inω},

 y_n = a e^{i(n-1)ω} + b e^{inω} + a e^{i(n+1)ω} = (b + 2a cos ω) e^{inω}.

Therefore, the transfer function is

 H(ω) = b + 2a cos ω.

If you want the filter to pass the frequency π/3 (H = 1) and to cut the frequency 2π/3 (H = 0), set the transfer function to these values:

 b + 2a cos(π/3) = b + a = 1
 b + 2a cos(2π/3) = b - a = 0

By solving the system, we get a = b = 1/2.

The filter is:

 y_n = (x_{n-1} + x_n + x_{n+1})/2
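The design conditions can also be solved mechanically. A sketch assuming the transfer function has the symmetric form H(ω) = b + 2a·cos(ω), which gives two linear conditions in the two unknowns a and b:

```python
import numpy as np

# Conditions: H(pi/3) = 1 (pass), H(2*pi/3) = 0 (stop),
# where H(omega) = 2*a*cos(omega) + b.
M = np.array([[2 * np.cos(np.pi / 3),     1.0],
              [2 * np.cos(2 * np.pi / 3), 1.0]])
rhs = np.array([1.0, 0.0])
a, b = np.linalg.solve(M, rhs)

# Under this assumed filter form, the solution is a = b = 1/2.
assert np.isclose(a, 0.5) and np.isclose(b, 0.5)
```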

We have designed a digital filter, but it is still not clear what it does. So, next time, I will feed actual signals into this filter and see how it behaves.


Introduction to Linear Algebra: Projection and a Simple Kalman Filter (3)

Simple Kalman filter

In a sense, a Kalman filter can predict the near future from past and current conditions. (Maybe that is claiming a bit too much; of course there are limitations.) Let's write it as an equation:

 x_{new} = x_{old} + f(x_{current})

This means the future is an extension of the previous state with a modification, and that modification is hidden in the function f. I am not an expert on Kalman filters, so I will only write down a simple one that I can handle.

In a simple Kalman filter, x_{new} is predicted from the past x_i's. The past data span a subspace, and the new value is the best projection onto that subspace. That is how I understand it. (This might be wrong.) In Strang's book, an interesting computation method is explained, but only as a sketch. Here I will write an overview of the idea from Strang's book.

First, I need a small preparation about 1/k, namely the identity 1/n = (1/(n-1)) · ((n-1)/n).
We saw in the last two blog entries that the best estimate for a sequence of data, in the least squares sense, is the average. Using the identity above, I will rewrite the average equation:

 (1/n) Σ_{i=1}^{n} x_i = ((n-1)/n) · (1/(n-1)) Σ_{i=1}^{n-1} x_i + (1/n) x_n.

You may ask why; I will explain the reason shortly.
Now I have enough material to explain what I want to do. We have a sequence of data from i = 1 to n, and more data n+1, n+2, ... will arrive later. If we had to process the whole history of the data to predict the near future, that would be bad: as time passes, we would need linearly more computation. What we want is a summary of the history plus the current state, from which we compute the best next state (best in the least squares sense). Then the computation cost stays constant, which is nice for a realtime system. This brings us back to x_{new} = x_{old} + f(x_{current}): we predict the near future from the old and current states. That is why we wrote the equation with n and n-1 terms (look at the equation again: it splits into i = 1 to n-1, plus the n-th term). Here, (1/(n-1)) Σ_{i=1}^{n-1} x_i is the average of the past, so I rewrite it as x_{old}:

 x_{new} = x_{old} + (1/n)(x_n - x_{old}).
Now you see: this is the best prediction in the least squares sense, computed from only a summary of the history (the last step) and the current state. Wow.
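A minimal sketch of this constant-cost update in plain Python; the data series is the [70, 80, 120] example from the earlier posts in this series:

```python
# Recursive (running) average: keep only the old average and the new sample.
def update(avg_old, n, x_n):
    """Best least-squares estimate after n samples, from the estimate
    after n-1 samples plus the current sample."""
    return avg_old + (x_n - avg_old) / n

data = [70.0, 80.0, 120.0]
avg = 0.0
for n, x in enumerate(data, start=1):
    avg = update(avg, n, x)

# Same result as the batch average, without storing the whole history.
assert avg == sum(data) / len(data)
```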

In these three articles, we saw least squares from two points of view: calculus (analysis) and linear algebra. They agree. We also saw an application, the Kalman filter.


Introduction to Linear Algebra: Projection and a Simple Kalman Filter (2)

Linear algebra way

In the linear algebra way, the best x is a projection of the right-hand side onto the column space of the matrix. Since any Ax lies in A's column space (it is a linear combination of A's column vectors), the best solution is the projection of b onto that column space. Figure 1 shows the geometric interpretation. The matrix A and the vector b are:

Here A is a single column vector, so the projected best x is:

 x = (A^{T}A)^{-1} A^{T} b = (A^{T}b)/(A^{T}A) = (70 + 80 + 120)/3 = 90.

The result is the same as in the calculus way; it is again the average. (If you are not familiar with the projection matrix, see Figure 1. In the figure, e is perpendicular to A, i.e., A^{T}e = 0, and you can derive the formula from there.)
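The projection can be checked numerically, assuming numpy; A and b are the values from this example:

```python
import numpy as np

# Project b onto the column space of A (here A is a single column of ones).
A = np.array([[1.0], [1.0], [1.0]])
b = np.array([70.0, 80.0, 120.0])

# x_hat solves (A^T A) x = A^T b; with A = [1 1 1]^T this is just the average.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
assert np.isclose(x_hat[0], 90.0)

# The error e = b - A x_hat is perpendicular to the column space: A^T e = 0.
e = b - A @ x_hat
assert np.allclose(A.T @ e, 0)
```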
Figure 1. Project b to A's column space

My interest is why these are the same. Minimizing a quadratic is a calculus (analysis) problem, while projection is more a geometry problem. Is this just a coincidence? Of course not: they have the same purpose.

As Figure 1 shows, the projection realizes the minimal distance between b and A's column space. In Euclidean space, that distance is the square root of a sum of squares (Pythagoras's theorem). Minimizing it is the same as the calculus way, even though the idea and the computation look different.

I once wrote that I like Keigo Higashino's books. One of them, ``Yougisha X no Kenshin,'' is about how disguising a problem leads to a wrong conclusion. The example is mathematics: if someone deceives you into treating a geometry problem as a calculus problem, you cannot reach the correct answer. A clever character deceives the police by planting the wrong framing; it's a fun detective story. Higashino usually describes science quite well, though I have a question on this particular point: I find it interesting that, when the problem setting is the same, even different approaches reach the same result in mathematics. I feel this more and more as I study mathematics. I am surprised when different-looking problems turn out to originate from the same idea. It is like climbing a mountain. At sea level, I can only see certain views from specific directions, and they look different; once I see the panorama from the mountain, they are just the same view from different positions. Studying mathematics is like having a map: some roads seem disconnected, but with a map I can see they are actually connected. I think all the roads are connected somehow; if I cannot see it, it means I need to study more.

Here we found the closest vector in two ways:

  • calculus: minimize the error
  • linear algebra: projection.

Next time, I would like to talk about the Kalman filter. A Kalman filter estimates the near future based on observed data that contain noise. The basic idea of the prediction is the same as in this post, but I recently learned it has one interesting aspect that I would like to explain.


Introduction to Linear Algebra: Projection and a Simple Kalman Filter (1)

Introduction about linear algebra, calculus, and a simple Kalman filter

I found the relationship among linear algebra, calculus, and a simple Kalman filter interesting, so I will make a memo about it.


Given a simple observed data sequence (x_i), we can assume the observations are equally plausible and apply the least squares method. In this simple setting, the outcome is just the average. You can find this topic in Gilbert Strang's Introduction to Linear Algebra, chapter 4.2.

For example, if the observed data are [70 80 120]^T and they are equally plausible, then

 x = 70
 x = 80
 x = 120.

These equations are strange. They say x is 70, 80, and 120 simultaneously. But these are observed data, so they contain errors, even though we observed the same quantity. The motivation is to find the best possible value from the observations.

Therefore, the system is:

 [1 1 1]^T x = [70 80 120]^T.

There is no x that satisfies the system. But let us think about what the best x would be.

Calculus way

We can use an idea from calculus to find the best x. The idea: the observations have errors, so which x minimizes the error? This is Gauss's least squares method. First, we compute the squared error E. This is also in Strang's book.

 E = (x-70)^2 + (x-80)^2 + (x-120)^2

E is a quadratic in x, so its graph is a parabola and it has a minimal point. At that point the tangent is 0, so we can find it by setting dE/dx = 0. Let's compute this:

 dE/dx = 2(x-70) + 2(x-80) + 2(x-120) = 0
 x = (70 + 80 + 120)/3 = 90.

You can see this is just the average. The best x being the average fits my intuition. However, I had not thought about in what sense the average is best. When the variance is relatively small, my intuition works; when the variance becomes larger, it stops working. (I once wrote about this in my blog post ``A 6σ Woman.'') The average is best here in the least squares sense.

Next, let's find the best x the linear algebra way.


ifgi tracer: report 2010-11-12

I am slowly working on my small hobby project: ifgi tracer.
Currently, I can load an obj file and rotate it in the window with a trackball operation. So far it is still OpenGL only: no lights, no zoom, no translation, no shading... But at least I don't have gimbal lock. The source code is publicly available, though at this point it has no value at all. Still, I would like to have some fun with this project.


Eigenvalue and transfer function (8)

Last time I used sin and cos, but the relationship becomes simpler if we use Euler's formula:

 e^{iωt} = cos ωt + i sin ωt.

Let's apply the same shift operator T:

 T e^{iωt} = e^{iω(t+h)} = e^{iωh} e^{iωt}.

Wow again. This is also an eigenfunction of the operator T, with eigenvalue e^{iωh}.

This function is built from trigonometric functions; that is why we use the trigonometric functions as the basis of frequency domain analysis. Eigenvalues give us a brief overview of an operation and its function.

Assume we have an input x and an output y, where the operator T applied to the input x produces y. If an eigenvalue exists, we can write:

 y = T x = λ x.

This means the input is transferred to the output, and λ tells how much is transferred. Therefore, signal processing people call this λ a transfer function. Why is it called a function when λ looks like a constant? λ is usually not a function of the input x, but it has some parameter, which makes it a function: in the equation above, λ is a function of ω, not of x. In signal processing, x is usually a function of time t, and ω is the frequency.
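A small numerical check that the shift operator has e^{iωt} as an eigenfunction, with an eigenvalue depending on ω (and the shift h) rather than on t. The particular ω and h values below are arbitrary choices of mine:

```python
import cmath

# The shift operator T: x(t) -> x(t + h) applied to e^{i*omega*t}.
omega, h = 2.0, 0.7              # arbitrary illustrative values

def f(t):
    return cmath.exp(1j * omega * t)

lam = cmath.exp(1j * omega * h)  # the eigenvalue ("transfer function"):
                                 # a function of omega (and h), not of t
for t in [0.0, 0.3, 1.5]:
    assert cmath.isclose(f(t + h), lam * f(t))
```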

In other words, a transfer function in signal processing is an eigenvalue in the sense of linear algebra. How great is that!

This is an overview of what I recently came to understand about eigenvalues, eigenvectors, eigenfunctions, and transfer functions. I hope it helps someone.


I wrote this article, but for the first month I didn't understand this at all. I thank Alexander B., who patiently answered my stupid and repetitive questions every time.


Eigenvalue and transfer function (7)

Eigenvalue and transfer function

In Hamming's book, he repeatedly mentions the merit of using trigonometric functions as a basis in signal processing. That comparison is not the main point of this blog, so I will not compare this basis with others; I will just stick to it, assuming it is a good one. Let's look at the eigenvalues of trigonometric functions, following the example in Hamming's book. The first example in his book is
 A sin x + B cos x.
We apply a transformation and see whether something stays unchanged. If something doesn't change, it will be an eigenvector, and we will also find its eigenvalue.

The transform T is a shift of the origin of the coordinate, T: x → x + h. Why would someone want to shift the coordinate? In signal processing, it usually doesn't matter when you start measuring the signal: the moment you start is the origin, and there is usually no absolute origin of time. Being able to reset the time origin is also convenient. The question is what kind of property is unchanged by this transform. Let's apply it to the trigonometric function:

 A sin(x + h) + B cos(x + h) = (A cos h - B sin h) sin x + (A sin h + B cos h) cos x = A' sin x + B' cos x.

A' and B' are constants independent of x. Wow: after the transformation, the function has almost the same form. It is again a linear combination of sin x and cos x.
In the end, the operation returns a function of the same form: A' and B' look like eigenvalues, and sin and cos look like eigenvectors (eigenfunctions).
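The identity behind A' and B' is easy to verify numerically. The constants A, B, and h below are arbitrary choices of mine:

```python
import math

# Shifting x -> x + h turns A sin x + B cos x into A' sin x + B' cos x with
#   A' = A cos h - B sin h,   B' = A sin h + B cos h.
A, B, h = 2.0, -1.5, 0.6       # arbitrary illustrative constants

A2 = A * math.cos(h) - B * math.sin(h)   # A'
B2 = A * math.sin(h) + B * math.cos(h)   # B'

for x in [0.0, 0.4, 1.1, 2.7]:
    lhs = A * math.sin(x + h) + B * math.cos(x + h)
    rhs = A2 * math.sin(x) + B2 * math.cos(x)
    assert math.isclose(lhs, rhs, abs_tol=1e-12)
```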

I find this point of view fantastic. What do you think?

I was once scared when my friends talked about eigenfunctions, since I didn't know what they were. Now you don't need to be scared of them anymore!