Thursday, February 16, 2012

Axioms of Probability Theory and Some Definitions and Theorems

I was thinking about non-trivial probabilities in quantum mechanics, such as the probability of getting a degenerate eigenvalue of an observable, the joint probabilities of commuting observables, or when to use expectation values of projection operators, etc.

Like all thinking processes, I started by asking the wrong questions. Then, by reading about different concepts of probability theory from random sources, a picture started to emerge in my mind. Thanks to my tendency to go deeper, I started by pondering why the joint probability is calculated by the formula $p(A=a,B=b)=\langle\psi|P(A=a)P(B=b)|\psi\rangle$ and found myself going all the way down to the axioms of probability theory.

Axioms of Probability Theory

The axioms of probability theory are a set of mathematical definitions and identities from which one can derive all of the relations involving probabilities. These mathematical objects can be used wherever the context makes probabilistic ideas appropriate.

The axioms are not unique; one can derive the same theorems by starting from different compatible sets of axioms. I read two versions. The first is Kolmogorov's axioms, which I came across while reading the Conditional probability article on Wikipedia, where I saw the idea of drawing Venn diagrams to visualize the probability relations. The second is Chapter 1.5 of Ballentine's book, where he bases the form of the axioms on R. T. Cox's work The Algebra of Probable Inference.

We start with the empty concepts "event" and "probability", and we'll give them specific meaning according to the interpretation of probability we use. (Just as, while writing the axioms of geometry, we use the concept of "point" without defining it, but we have an intuition about its role.)

As far as I understand, the basic difference between the perspectives of Kolmogorov and Cox is that Cox always talks about conditional probabilities, while in Kolmogorov's version probabilities without any conditions can also be used.
Left: Cox, Right: Kolmogorov
Think of E as a region of area 1 which contains all possible events. The probability of the event A's occurrence, $p(A)$, is the area of A, $a(A)$. We can also talk about conditional probabilities such as "given the occurrence of A, what is the probability of the occurrence of B?", which is written $p(B|A)$. Cox's version is not explicitly based on the existence of the set E and deals with conditional probabilities only; therefore, probabilities are ratios of areas. I think one can reconcile the two by saying that $p(A)$ in Kolmogorov's version is $p(A|E)$ in Cox's.

Let us first fix the symbols. $A$ is the occurrence of the event A. $\sim A$, the negation, is the non-occurrence of the event A. $A\& B$, the conjunction, is the occurrence of both A and B. $A\vee B$, the disjunction, is the occurrence of at least one of A and B.

$$ 0 \leq p(A|B) \leq 1 \tag{Axiom (1)}$$
This gives the limits of a probability. No negative numbers, hence nothing less probable than absolute impossibility. No numbers bigger than 1, hence no probability higher than absolute certainty.

$$p(A|A) = 1 \tag{Axiom (2)}$$
The probability of certainty is one. "Given that $A$ happened, what is the probability that $A$ happened?"

$$p(\sim A|B) = 1 - p(A|B) \tag{Axiom (3)}$$
How negation is expressed in terms of certainty and conditional probability. This is an expression of the intuition that the probability of occurrence decreases as the probability of non-occurrence increases.

$$p(A\& B|C) = p(A|C)p(B|A\& C) \tag{Axiom (4)}$$
The most complicated one, which defines joint probabilities in terms of a product of conditional probabilities. "Given $C$, the probability that two events occur is the probability of one of them (given $C$) times the probability of the other one given that the first one has already occurred." Let us think about it using the diagram on the left, which I think is the most general one.

$p(A|C)$ is related to the area of the intersection $A \cap C$, where our sample space is the set $C$: divide the area of the intersection by the area of $C$, $p(A|C) = \frac{a(A \cap C)}{a(C)}$. $p(B|A \& C)$ means that we know both $A$ and $C$ have happened, hence our sample space is the intersection $A \cap C$; we ask what portion of that region is also covered by $B$, so $p(B|A \& C)$ $= \frac{a(A \cap B \cap C)}{a(A \cap C)}$. $p(A \& B|C)$ is, given C, the joint probability that both $A$ and $B$ occur. If we express this as a ratio of two areas as usual, in the numerator we again have the area of the triple intersection, but this time we have the area of $C$ in the denominator: $p(A \& B|C)$ $=\frac{a(A \cap B \cap C)}{a(C)}$.

$$\frac{a(A \cap B \cap C)}{a(C)} = \frac{a(A \cap C)}{a(C)} \, \frac{a(A \cap B \cap C)}{a(A \cap C)}$$

Hence the fourth axiom makes sense if you visualize the probability ideas with Venn diagrams.

Kolmogorov's version of axiom (4) is written as:
$$p(A \& B) = p(A|B) p(B) = p(B|A) p(A)$$
which looks much more understandable to me without thinking about diagrams.

Let me write its Cox version by assuming that everything lives inside the set $E$ (so that $B \& E = B$): $p(A \& B|E) = p(A|B) p(B|E)$. Check the equality using ratios of areas: $\frac{a(A \cap B)}{a(E)}$ $ =\frac{a(A \cap B)}{a(B)} \frac{a(B)}{a(E)}$. Yes!
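If it helps, here is a small numerical sanity check of axiom (4) and of Kolmogorov's form of it. This is just my own sketch: events are modeled as subsets of a finite sample space, so set sizes play the role of areas, and the particular sets $A$, $B$, $C$ below are arbitrary choices.

```python
# Toy model: events are subsets of a finite sample space E, and
# p(X|Y) = |X ∩ Y| / |Y| plays the role of the ratio of areas.
E = set(range(100))                      # the "certain event" E
A = {n for n in E if n % 2 == 0}         # even numbers (arbitrary choice)
B = {n for n in E if n % 3 == 0}         # multiples of 3 (arbitrary choice)
C = {n for n in E if n < 60}             # numbers below 60 (arbitrary choice)

def p(X, given=E):
    """Conditional probability p(X|given) as a ratio of set sizes."""
    return len(X & given) / len(given)

# Axiom (4): p(A&B|C) = p(A|C) * p(B|A&C)
print(p(A & B, C), p(A, C) * p(B, A & C))        # both ~0.1667

# Kolmogorov's form (conditioning on E): p(A&B) = p(A|B) p(B) = p(B|A) p(A)
print(p(A & B), p(A, B) * p(B), p(B, A) * p(A))  # all 0.17
```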

Some Definitions, Applications and Theorems

Now we have everything needed to derive the rest of probability theory. Let's check the plausibility of the axioms and extend our set of tools.

Use (3): $p(\sim A|A) = 1 - p(A|A)$, then use (2): $=1-1=0$. Given that A occurred, the probability of its non-occurrence is zero. :-)

$\sim$ and $\&$ are defined by the axioms. From them we can derive $\vee$. According to logic, $A \vee B$ $=\sim (\sim A \& \sim B)$: "$A$ or $B$" means "not (neither $A$ nor $B$)." Hence,
$$ \begin{align}
p(A \vee B|C) & = p(\sim (\sim A \& \sim B)|C) \quad \text{Use (3)} \\
& = 1 - p(\sim A \& \sim B|C) \quad \text{Use (4)} \\
& = 1 - p(\sim A|C)p(\sim B|\sim A \& C) \quad \text{Use (3)} \\
& = 1-\left[1 - p(A|C) \right]\left[1 - p(B|\sim A \& C) \right] \\
& = 1-1+p(A|C)+p(B|\sim A \& C)-p(A|C)p(B|\sim A \& C) \\
& = p(A|C) + p(B|\sim A \& C)\left[1 - p(A|C)\right] \\
& = p(A|C) + p(\sim A|C) p(B|\sim A \& C)\\
& = p(A|C) + p(\sim A \& B|C) = p(A|C) + p(B\&{}\sim{}A |C) \\
& = p(A|C) + p(B|C)p(\sim A|B \& C) \\
& = p(A|C) + p(B|C)\left[ 1 - p(A|B \& C) \right] \\
& = p(A|C) + p(B|C) - p(B|C)p(A|B\& C) \\
& = p(A|C) + p(B|C) - p(B\& A|C)
\end{align}$$

Now we have derived the disjunction operation from the axioms. Its Kolmogorov version is $p(A \vee B) = p(A) + p(B) - p(B\& A)$, which really looks like the addition of the areas of two sets.
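Here is the same toy sanity check as before, this time for the disjunction rule we just derived (again my own sketch, with arbitrarily chosen sets).

```python
# Check of p(A ∨ B|C) = p(A|C) + p(B|C) - p(A&B|C) in the toy set model.
E = set(range(100))
A = {n for n in E if n % 2 == 0}
B = {n for n in E if n % 3 == 0}
C = {n for n in E if n < 60}

def p(X, given=E):
    return len(X & given) / len(given)

lhs = p(A | B, C)                        # p(A or B | C); '|' is set union here
rhs = p(A, C) + p(B, C) - p(A & B, C)
print(lhs, rhs)                          # both ~0.6667
```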

Ballentine derives the same thing in another way. He uses an important lemma; let me just show that lemma.
$$ \begin{align}
p(X\& Y|Z) + p(X\&{}\sim{}Y|Z) & = p(X|Z)p(Y|X\&{}Z)+p(X|Z)p(\sim{}Y|X\&{}Z) \\
& = p(X|Z)\left[p(Y|X\&{}Z) + 1 - p(Y|X\&{}Z) \right] \\
& = p(X|Z)
\end{align}$$

This really looks like the calculation of a marginal distribution from a joint distribution by summing or integrating over all values of one of the variables: $p(x) = \sum_y p(x, y)$.
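For concreteness, a tiny illustration of that marginalization idea (the joint distribution below is completely made up):

```python
# Marginalization of a joint distribution: p(x) = sum over y of p(x, y).
joint = {                                # made-up joint probabilities p(x, y)
    ("rain", "cold"): 0.30, ("rain", "warm"): 0.10,
    ("dry",  "cold"): 0.20, ("dry",  "warm"): 0.40,
}

marginal_x = {}
for (x, y), prob in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + prob   # sum over all values of y

print(marginal_x)                        # {'rain': 0.4, 'dry': 0.6}
```

In the lemma the second variable effectively takes only the two values $Y$ and $\sim Y$, so the sum has just two terms.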

Using the expression for $p(A \vee B|C)$ one can define mutual exclusiveness. Given $C$, $A$ and $B$ are mutually exclusive if $p(A\&{}B|C) = 0$: inside the set $C$, $A$ and $B$ do not intersect.

If $A$ and $B$ are mutually exclusive then $p(A \vee B|C)$ $= p(A|C) + p(B|C)$. This is called "addition of probabilities for exclusive events".

By writing the joint probability down in two ways one gets Bayes' theorem:

$$\begin{align}
p(A\&{}B|C)& = p(A|C)p(B|A\&{}C) \\
p(B\&{}A|C)& = p(B|C)p(A|B\&{}C) \\
\Rightarrow p( B|A\&{}C) &= p(A|B\&{}C) \frac{p(B|C)}{p(A|C)} \\
\end{align}
$$

In the other notation: $p(B|A) = p(A|B)\frac{p(B)}{p(A)}$. This is also called the principle of inverse probability.
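As a quick illustration of the inverse-probability idea, here is the standard textbook-style calculation with a test for some condition. All numbers are made up for illustration; $B$ is "has the condition" and $A$ is "the test is positive".

```python
# Bayes' theorem: p(B|A) = p(A|B) * p(B) / p(A), with made-up numbers.
p_B = 0.01               # prior p(B): having the condition
p_A_given_B = 0.95       # p(A|B): test positive given the condition
p_A_given_notB = 0.05    # p(A|~B): false-positive rate

# p(A) via the lemma above: p(A) = p(A&B) + p(A&~B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)       # ~0.16, much smaller than p(A|B) = 0.95
```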

The last concept that Ballentine talks about in that chapter is statistical independence. $B$ is statistically independent of $A$ if $p(B|A\&{}C)=p(B|C)$: the presence of $A$ in the conditions does not affect the probability of $B$.

Apply this to the fourth axiom:
$$\begin{align}
p(A\&{}B|C) & = p(A|C)p(B|A\& C) = p(A|C)p(B|C) \\
p(B\&{}A|C) & = p(B|C)p(A|B\&{}C)
\end{align}$$
Since $p(A\&{}B|C) = p(B\&{}A|C)$, comparing the two lines gives $p(A|B\&{}C) = p(A|C)$, and therefore $p(B\&{}A|C) = p(B|C)p(A|C)$.

This means that independence is mutual and that the joint probability becomes the product of the marginal distributions when $A$ and $B$ are independent. (In the other notation: $p(A\&{}B)=p(A)p(B)$.)

The definition of more than two independent events is this. Let our events be the set $\left\{ A_n \right\}$. Then $p(A_i\&{}A_j\&{}\cdots\&{}A_k|C)=p(A_i|C)p(A_j|C)\cdots p(A_k|C)$, where this equation must hold for every subset of $\left\{ A_n \right\}$ (the all-elements case, all pairs, and every other case). This looks interesting!
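The "every subset" requirement is not redundant. The classic two-coin example below (my own illustration, not from Ballentine) has three events that are independent in pairs but fail the product rule as a triple.

```python
# Two fair coin flips. A1 = "first coin heads", A2 = "second coin heads",
# A3 = "the two coins agree". Pairwise independent, but not as a triple.
from itertools import product

E = set(product([0, 1], repeat=2))       # four equally likely outcomes
A1 = {w for w in E if w[0] == 1}
A2 = {w for w in E if w[1] == 1}
A3 = {w for w in E if w[0] == w[1]}

def p(X, given=E):
    return len(X & given) / len(given)

# Every pair satisfies p(Ai & Aj) = p(Ai) p(Aj) ...
print(p(A1 & A2), p(A1) * p(A2))         # 0.25  0.25
print(p(A1 & A3), p(A1) * p(A3))         # 0.25  0.25
print(p(A2 & A3), p(A2) * p(A3))         # 0.25  0.25

# ... but the triple does not factorize:
print(p(A1 & A2 & A3), p(A1) * p(A2) * p(A3))   # 0.25  0.125
```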

