Information
To let you know whether I am hungry or not, I could use the code “0 = not hungry” and “1 = hungry”. This is “1 bit” of information.
If a given day has a 50% chance of being sunny and a 50% chance of being rainy, then when you learn that today is sunny, you have learned “1 bit” of information. Suppose instead the weather could be in one of 4 states with equal probability (equal for now; we will discuss later what happens in the other case), say rainy and cold, rainy and hot, sunny and cold, sunny and hot, each with 25% probability. Then when you learn what the weather actually is (eg sunny and hot), you have learned “2 bits” of information.
Slightly more formally, if a random variable X can take 2^k values x1, x2, …, each with equal probability 1/2^k, then by learning what value X takes (say X takes the value x2), we learn k bits of information, since the original outcomes could be coded with k bits (k “0s or 1s”, which give you 2^k possibilities).
In general, if an outcome with probability p occurs, you learn log2(1/p) bits of information. This makes intuitive sense given the example above, where the distribution was broken into events each of probability 1/2^k, but here we generalise to events of any probability.
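As a quick sanity check, here is a minimal Python sketch (the function name surprisal_bits is just illustrative) computing log2(1/p) for a few probabilities:

```python
import math

def surprisal_bits(p: float) -> float:
    """Information (in bits) learned when an outcome of probability p occurs."""
    return math.log2(1 / p)

print(surprisal_bits(0.5))   # 1.0 bit  (sunny vs rainy, 50/50)
print(surprisal_bits(0.25))  # 2.0 bits (one of 4 equally likely weather states)
print(surprisal_bits(0.01))  # ~6.64 bits (rare outcomes are very informative)
```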
Entropy
The average amount of information you learn by observing a random variable X is called the entropy, denoted H(X):
$latex H(X) := \sum_{i} p_i \log_2\left(\frac{1}{p_i}\right)$
For a continuous random variable with density function f(x) this would just be
$latex H(X) := \int f(x) \log_2\left(\frac{1}{f(x)}\right) dx$
Note: many definitions pull the minus sign out in front of the sum and use natural logs, but I prefer writing it this way since each term retains its meaning as the information from that outcome.
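To make the discrete formula concrete, here is a minimal Python sketch (numpy-based; the function name is just illustrative):

```python
import numpy as np

def entropy_bits(probs) -> float:
    """Entropy H(X) = sum_i p_i * log2(1/p_i) of a discrete distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # outcomes with zero probability contribute nothing
    return float(np.sum(p * np.log2(1 / p)))

print(entropy_bits([0.5, 0.5]))                # 1.0 bit  (fair coin / sunny vs rainy)
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (4 equally likely weather states)
print(entropy_bits([0.9, 0.1]))                # ~0.47 bits (a lopsided coin surprises you less on average)
```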
Mutual information
Suppose you have an additional random variable Y and consider the joint distribution. If X and Y were completely independent, then the joint density factorises as f(x,y) = f_X(x) f_Y(y) and it is easy to see that the joint entropy satisfies H(X,Y) = H(X) + H(Y).
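Since this additivity gets used below, here is the one-line check in the discrete case (using p(x,y) = p(x)p(y) under independence):

$latex H(X,Y) = \sum_{x,y} p(x)p(y) \log_2\left(\frac{1}{p(x)p(y)}\right) = \sum_{x,y} p(x)p(y) \left[\log_2\left(\frac{1}{p(x)}\right) + \log_2\left(\frac{1}{p(y)}\right)\right]$

and summing out the other variable in each term gives

$latex H(X,Y) = \sum_{x} p(x) \log_2\left(\frac{1}{p(x)}\right) + \sum_{y} p(y) \log_2\left(\frac{1}{p(y)}\right) = H(X) + H(Y)$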
When the variables are not independent, we might ask: how much does knowing Y help in knowing X? If Y were very highly related to X (in the extreme, just equal to X), then conditional on knowing Y, there would be no new information obtained by observing X.
In these settings it is often helpful to think of the random variables as discrete, with each value taken with probability of the form 1/2^k. This way one can imagine an efficient coding system and then associate the “information” [log2(1/p)] with the number of bits required to code that outcome.
So, suppose you knew Y=y; then the conditional distribution of X would be
$latex P(X=x|Y=y) = \frac{P(X=x \cap Y=y)} {P(Y=y)}$
Suppose someone has full knowledge of the relationship between X and Y. They could devise the most efficient coding scheme (for coding the joint outcomes [x,y]).
We can think of H(X|Y) (conditional entropy of X given Y) as the information that is “purely” from X. Ie after knowing Y, what additional information does X tell us? Then we can decompose the total information contained in X, H(X), as the information purely from X plus the “mutual information” between X and Y. Ie we can write
H(X) = H(X|Y) + I(X,Y)
where I(X,Y) denotes the mutual information. Rewriting, we have
I(X,Y) = H(X) – H(X|Y)
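For reference, the conditional entropy used here is just the entropy of X under the conditional distribution, averaged over the values of Y (this is the standard definition, nothing new):

$latex H(X|Y) := \sum_{y} P(Y=y) \sum_{x} P(X=x|Y=y) \log_2\left(\frac{1}{P(X=x|Y=y)}\right)$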
You can expand the above into the sums (or integrals) as above and will arrive at the following expression

$latex I(X,Y) = \sum_{x,y} P(X=x, Y=y) \log_2\left(\frac{P(X=x, Y=y)}{P(X=x)P(Y=y)}\right)$
Wikipedia shows the different ways to write this quantity (my favourite being the third, which is very much akin to thinking of this in terms of a Venn diagram).

Use cases
Stock returns
One can use the last 10y of history to get the return distributions of two stocks and calculate the mutual information. The higher the mutual information, the more the two stocks are “related”. The advantage here over other methods (eg just regressing the returns of one stock vs the other) is that you are working directly with the probability distributions and hence will pick up non-linear relationships too.
The disadvantage is that while it can tell you if two stocks are related, it doesn’t give you any insight into what that relationship actually is.
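As a rough sketch of how one might estimate this from data (the bin count and the made-up return series below are assumptions, not a recommendation), you can discretise the two return series and plug the empirical joint distribution into the expression above:

```python
import numpy as np

def mutual_information_bits(x, y, bins: int = 20) -> float:
    """Estimate I(X,Y) in bits by discretising two samples into a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()               # empirical joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of Y
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Toy "returns": y is a noisy nonlinear function of x, so a linear regression would
# understate the relationship, but mutual information still picks it up.
rng = np.random.default_rng(0)
x = rng.normal(size=2500)                    # roughly 10y of daily returns for stock 1
y = x**2 + 0.5 * rng.normal(size=2500)       # nonlinearly related "returns" for stock 2
print(mutual_information_bits(x, y))         # clearly positive
print(mutual_information_bits(x, rng.normal(size=2500)))  # near 0 (up to estimation bias)
```

Note that binned estimates like this are biased upwards in small samples, so in practice you would want to be careful with the bin choice (or use a purpose-built estimator).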
Dimensionality reduction
If you have a large number of variables (x1, …, xp) that you are using to predict something (y), you could check which of the xs have high mutual information with each other and hence might be telling you the same thing. You could then perhaps use other techniques (pca, clustering etc) to reduce the dimensionality of the independent variables before applying further stats (eg regression), to help reduce overfitting.
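One possible way to run the pairwise check (scikit-learn's mutual_info_regression is just one estimator choice, and the data below is made up purely for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Toy data: x2 is almost a rescaled copy of x0, while x1 is independent noise, so the
# x0/x2 pair should show up as largely redundant.
rng = np.random.default_rng(1)
x0 = rng.normal(size=1000)
x1 = rng.normal(size=1000)
x2 = 2 * x0 + 0.1 * rng.normal(size=1000)
X = np.column_stack([x0, x1, x2])

p = X.shape[1]
mi = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        if i != j:
            # estimated mutual information (in nats) between feature i and feature j
            mi[i, j] = mutual_info_regression(X[:, [i]], X[:, j], random_state=0)[0]

print(np.round(mi, 2))  # large off-diagonal entries flag candidate redundant pairs
```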
Normalised information
If you were ranking stock pairs by how related they are, then you might want to normalise the mutual information first, since it is not like a Z score where the value is directly comparable across different sets of variables. Eg you could use I(X,Y) / [H(X) + H(Y)].
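One reason this particular normalisation is convenient (a standard fact about discrete variables, not anything specific to stocks) is that

$latex 0 \le I(X,Y) \le \min\left(H(X), H(Y)\right) \le \frac{H(X) + H(Y)}{2}$

so the normalised quantity always lies between 0 and 1/2, hitting 1/2 only when the two variables completely determine each other.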