Date Taken: Fall 2025
Status: Work in Progress
Reference: LSU Professor Dong Lao, ChatGPT
Simplest way to remember:
Probability = Predicting the future
Statistics = Analyzing the past
A probability space is a mathematical construct that provides a formal model for randomness and uncertainty. It consists of three main components: the sample space Ω (the set of all possible outcomes), the event space (the collection of events, i.e., subsets of Ω), and the probability measure P, which assigns a probability to each event.
A probability space can describe either discrete or continuous random variables. P(Ω) = 1 → the total probability of all possible outcomes is 1 (something must happen).
A probability distribution is a function that gives the probabilities of occurrence of possible events. It tells us how likely the different outcomes are; it is like a map that shows where the probability mass or weight is placed among the possible outcomes.
Ex. Roll a 6-sided die (Ω = {1, 2, 3, 4, 5, 6}). Each outcome has a probability of 1/6. This distribution is uniform because all outcomes are equally likely.
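A minimal sketch of this die example in Python (names like `omega` and `P` are just illustrative), showing the components of the probability space and that simulated frequencies approach 1/6:

```python
import random
from collections import Counter

# Sample space Ω: the six faces of the die.
omega = [1, 2, 3, 4, 5, 6]

# Probability measure P: uniform, each outcome gets 1/6.
P = {outcome: 1 / 6 for outcome in omega}

# P(Ω) = 1: the total probability of all outcomes is 1.
assert abs(sum(P.values()) - 1.0) < 1e-12

# Simulate many rolls; empirical frequencies should approach 1/6 ≈ 0.167.
rolls = [random.choice(omega) for _ in range(100_000)]
counts = Counter(rolls)
for outcome in omega:
    print(outcome, counts[outcome] / len(rolls))
```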
A Probability Density Function (PDF) describes the relative likelihood of a continuous random variable taking values near a given point; the probability of landing in an interval is the area under the PDF over that interval. The PDF provides a way to understand how probability is distributed over the range of possible values for the random variable.
Suppose the waiting time for a bus follows a uniform distribution over the next 10 minutes. The PDF for this uniform distribution is constant over the interval [0, 10], meaning the bus is equally likely to arrive at any moment within that time frame. The waiting time is a Random Variable.
What is the probability that the waiting time is between 2 and 3 minutes? That will be 0.1 because the total length of the interval [2, 3] is 1 minute, and the PDF is constant at 0.1 over the interval [0, 10]. What is the probability that the waiting time is exactly t minutes? (e.g. t = 1.5). This will be 0 because the probability of a continuous random variable taking on any specific value is always 0.
PDF equation for this example (uniform on [0, 10]):
\[ f(x) = \begin{cases} \frac{1}{10} & 0 \le x \le 10 \\ 0 & \text{otherwise} \end{cases} \qquad\text{and}\qquad P(a \le X \le b) = \int_a^b f(x)\, dx \]
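A small check of the bus example, assuming SciPy is available; `uniform(loc=0, scale=10)` is the uniform distribution on [0, 10]:

```python
from scipy.stats import uniform

# Waiting time T ~ Uniform(0, 10): loc is the lower bound, scale is the width.
T = uniform(loc=0, scale=10)

# The PDF is constant at 1/10 = 0.1 inside [0, 10].
print(T.pdf(5.0))            # 0.1

# P(2 <= T <= 3) = F(3) - F(2) = 0.1, the area under the PDF over [2, 3].
print(T.cdf(3) - T.cdf(2))   # 0.1

# P(T = 1.5) is 0 for a continuous variable; T.pdf(1.5) = 0.1 is a density,
# not a probability, so this does not contradict that.
```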
The normal distribution is a continuous probability distribution (infinitely many possible values) characterized by its bell-shaped curve. It is defined by two parameters: the mean (μ) and the standard deviation (σ). The mean determines the center of the distribution, while the standard deviation controls the spread.
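For reference, the normal PDF with mean μ and standard deviation σ is:
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]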
In a normal distribution, the curve is symmetric about the mean and the mean, median, and mode coincide; roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three (the 68-95-99.7 rule).
The normal distribution is important in statistics because of the Central Limit Theorem, which states that the sum (or average) of a large number of independent, identically distributed random variables tends toward a normal distribution, regardless of the original distribution of the variables.
Central Limit Theorem: If you add up lots of small, independent effects, the result tends to be normally distributed. Example, your height is affected by genes, nutrition, environment, random factors. Each factor is small, random, and independent. Add them together → heights roughly follow a normal distribution.
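A quick simulation of the Central Limit Theorem (assuming NumPy is available): sums of uniform random variables, which are far from normal individually, come out approximately normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each sample is the sum of 30 independent Uniform(0, 1) variables.
n_terms, n_samples = 30, 100_000
sums = rng.uniform(0, 1, size=(n_samples, n_terms)).sum(axis=1)

# Theory: mean = 30 * 0.5 = 15, std = sqrt(30 / 12) ≈ 1.58.
print("empirical mean:", sums.mean())   # ≈ 15
print("empirical std: ", sums.std())    # ≈ 1.58

# For a normal distribution, about 68% of values fall within one std of the mean.
within_one_std = np.mean(np.abs(sums - sums.mean()) < sums.std())
print("fraction within 1 std:", within_one_std)   # ≈ 0.68
```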
Why does AI/ML care? When data is roughly normal, many statistical methods and machine learning algorithms perform better: they rely on the assumption of normality for inference, which makes it easier to model and predict outcomes.
Marginal probability refers to the probability of an event occurring, irrespective of the outcomes of other variables. In the context of joint distributions, it is obtained by summing or integrating the joint probability distribution over the other variables.
For example, consider a joint distribution of two random variables X and Y. The marginal probability of X is found by summing (or integrating) the joint probabilities over all possible values of Y:
\[ P(X = x) = \sum_{y} P(X = x,\, Y = y) \]
Marginal probabilities are useful for understanding the behavior of individual variables within a larger system.
Example: Temperature and weather → let T and W be the random variables representing temperature and weather conditions, respectively. The marginal P(T) is found by summing the joint P(T, W) over all weather conditions (see the sketch below).
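A sketch with a hypothetical joint table for T ∈ {hot, cold} and W ∈ {sunny, rainy} (numbers invented for illustration, chosen so that P(hot, sunny) = 0.4 as in the joint-probability example below); the marginals come from summing out the other variable:

```python
# Hypothetical joint distribution P(T, W); the entries must sum to 1.
joint = {
    ("hot",  "sunny"): 0.4,
    ("hot",  "rainy"): 0.1,
    ("cold", "sunny"): 0.2,
    ("cold", "rainy"): 0.3,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Marginal P(T): sum the joint over all weather conditions W.
P_T = {}
for (t, w), p in joint.items():
    P_T[t] = P_T.get(t, 0.0) + p
print(P_T)   # hot: 0.5, cold: 0.5

# Marginal P(W): sum the joint over all temperatures T.
P_W = {}
for (t, w), p in joint.items():
    P_W[w] = P_W.get(w, 0.0) + p
print(P_W)   # sunny: ~0.6, rainy: ~0.4 (up to floating-point rounding)
```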
Joint probability refers to the probability of two (or more) events occurring simultaneously. For two random variables X and Y, the joint probability is denoted as P(X, Y) and can be visualized using a joint probability distribution.
Joint probabilities are useful for understanding the relationships between variables and for making predictions based on the values of multiple variables.
Example: What is the chance that it is both hot and sunny? Using the hypothetical table above, P(T = hot, W = sunny) = 0.4. The order of the arguments does not matter: P(A, B) = P(B, A).
Conditional probability refers to the probability of an event occurring given that another event has already occurred. It is denoted as P(A | B), which reads "the probability of A given B."
For example, consider the random variables T (temperature) and W (weather condition). The conditional probability P(W | T) represents the likelihood of a specific weather condition occurring given a specific temperature.
Conditional probabilities are useful for updating our beliefs about the world based on new evidence.
Joint probability considers the likelihood of two events happening together, while conditional probability focuses on the likelihood of one event given the occurrence of another. P(A,B) = P(A | B) * P(B) → The chance that A happens when B happens, times the chance that B happens.
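Continuing the hypothetical weather table from above: the conditional P(W | T = hot) divides the joint by the marginal, and the product rule P(A, B) = P(A | B) * P(B) can be checked directly.

```python
# The same hypothetical joint table used in the marginal-probability sketch.
joint = {
    ("hot",  "sunny"): 0.4,
    ("hot",  "rainy"): 0.1,
    ("cold", "sunny"): 0.2,
    ("cold", "rainy"): 0.3,
}

# Marginal P(T = hot).
p_hot = sum(p for (t, w), p in joint.items() if t == "hot")   # 0.5

# Conditional P(W | T = hot) = P(T = hot, W) / P(T = hot).
p_sunny_given_hot = joint[("hot", "sunny")] / p_hot   # 0.8
p_rainy_given_hot = joint[("hot", "rainy")] / p_hot   # 0.2
print(p_sunny_given_hot, p_rainy_given_hot)

# Product rule: P(hot, sunny) = P(sunny | hot) * P(hot).
assert abs(joint[("hot", "sunny")] - p_sunny_given_hot * p_hot) < 1e-12
```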
Chain rule of joint probability: P(A, B, C, ...) = P(A | B, C, ...) * P(B | C, ...) * P(C | ...) * ... For example, with three events: P(A, B, C) = P(A | B, C) * P(B | C) * P(C).
Two events A and B are independent if the occurrence of one does not affect the probability of the other. Mathematically, this is expressed as P(A | B) = P(A) and P(B | A) = P(B).
For independent events, P(A, B) = P(A) * P(B) → the chance that A and B happen together is the chance that A happens times the chance that B happens.
Independence is a key assumption in many statistical models and simplifies the analysis of complex systems.
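A quick numerical check using the same hypothetical table: T and W are not independent there, since P(hot, sunny) differs from P(hot) * P(sunny).

```python
# Marginals computed earlier from the hypothetical table.
p_hot, p_sunny = 0.5, 0.6
p_hot_and_sunny = 0.4   # joint entry from the hypothetical table

# Independence would require P(hot, sunny) == P(hot) * P(sunny) = 0.3.
print(p_hot * p_sunny)    # 0.3
print(p_hot_and_sunny)    # 0.4 -> T and W are not independent

# By contrast, two fair coin flips are independent:
# P(heads on flip 1 and heads on flip 2) = 0.5 * 0.5 = 0.25.
```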
Correlation measures the strength and direction of a linear relationship between two variables. It is quantified by the correlation coefficient, which ranges from -1 to 1. A correlation coefficient of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
Correlation is useful for identifying relationships between variables and for making predictions based on those relationships. However, it is important to note that correlation does not imply causation (causation means that one event is the result of the occurrence of another event).
Example: Ice cream sales and drowning incidents are correlated because both increase during the summer. However, buying ice cream does not cause drowning; the underlying factor is the hot weather.
Example: Car insurance and zip code are correlated because certain areas may have higher rates of accidents or theft. However, living in a particular zip code does not cause someone to need car insurance; the underlying factor is the risk associated with that area.
Example: Heart risk and age are correlated because the risk of heart disease tends to increase with age. However, being older does not directly cause heart disease; other factors such as lifestyle and genetics play a role.
The correlation coefficient quantifies the extent to which two variables change together; again, correlation does not imply causation. It is defined as the covariance between X and Y normalized by their standard deviations:
\[ \rho_{X,Y} = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\mathbb{E} \big[ (X - \mathbb{E}[X]) (Y - \mathbb{E}[Y]) \big]}{\sigma_X \, \sigma_Y} \]
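A small NumPy sketch: `np.corrcoef` computes the Pearson correlation coefficient, which comes out near +1 for a strongly increasing linear relationship and near 0 for unrelated noise.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=1000)
y_related = 2 * x + rng.normal(scale=0.5, size=1000)   # roughly linear in x
y_noise = rng.normal(size=1000)                         # unrelated to x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r(x, y).
print(np.corrcoef(x, y_related)[0, 1])   # close to +1
print(np.corrcoef(x, y_noise)[0, 1])     # close to 0
```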
The law of total probability states that if you have a set of mutually exclusive events that cover all possible outcomes, the total probability of any event can be found by considering all the different ways that event can occur.
\[ P(A) = \sum_n P(A, B_n) = \sum_n P(A \mid B_n) \, P(B_n) \]
This adds up the probabilities that A happens under each possible circumstance B_n.
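A short check with the hypothetical weather table: the marginal P(sunny) can be rebuilt from the conditionals given hot and cold.

```python
# Partition of outcomes: B_1 = hot, B_2 = cold (mutually exclusive, exhaustive).
p_hot, p_cold = 0.5, 0.5

# Conditionals from the hypothetical table.
p_sunny_given_hot = 0.8    # 0.4 / 0.5
p_sunny_given_cold = 0.4   # 0.2 / 0.5

# Law of total probability: P(sunny) = sum over n of P(sunny | B_n) * P(B_n).
p_sunny = p_sunny_given_hot * p_hot + p_sunny_given_cold * p_cold
print(p_sunny)   # 0.6, matching the marginal computed earlier
```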
Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It is expressed mathematically as:
\[ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \]
P(A | B) is the posterior probability of A given B, P(B | A) is the likelihood of B given A, P(A) is the prior probability of A, and P(B) is the marginal probability of B (the evidence).
Bayes' theorem is widely used in various fields, including statistics, machine learning, and artificial intelligence, for updating probabilities based on new evidence.
Example: Medical diagnosis. Let A be the event that a patient has a disease, and B be the event that the patient tests positive for the disease. We want to find P(A|B), the probability that the patient has the disease given a positive test result. Using Bayes' theorem, we can calculate this probability based on the sensitivity and specificity of the test, as well as the prevalence of the disease in the population.
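A sketch of this medical-diagnosis calculation with invented numbers (prevalence 1%, sensitivity 95%, false-positive rate 5%, all hypothetical), using the law of total probability for the denominator:

```python
# Hypothetical numbers, for illustration only.
p_disease = 0.01               # prevalence P(A)
p_pos_given_disease = 0.95     # sensitivity P(B | A)
p_pos_given_healthy = 0.05     # false-positive rate P(B | not A)

# Denominator P(B) via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)   # ≈ 0.16: even with a positive test, the disease
                             # stays unlikely because it is rare to begin with
```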
Example: Spam filtering. Let A be the event that an email is spam, and B be the event that the email contains certain keywords. We want to find P(A|B), the probability that an email is spam given that it contains those keywords. Using Bayes' theorem, we can update our belief about whether an email is spam based on the presence of specific keywords and the overall frequency of spam emails.
Example: Assume that in general there is a 50% chance of rain, that when it rains there is always cloud cover, and that in general there is an 80% chance of seeing clouds. We want to find P(rain | cloud) using Bayes' theorem, which in this case gives P(rain|cloud) = \( \frac{P(cloud|rain) \, P(rain)}{P(cloud)} = \frac{(1) \cdot (0.5)}{0.8} = 0.625 \), because P(cloud|rain) = 1 (when it rains, we always see clouds), P(rain) = 0.5, and P(cloud) = 0.8.