import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic for inline plots; omit this line outside a notebook
%matplotlib inline
# Sample data: Randomly generated for demonstration
data = np.random.normal(0, 1, 1000) # 1000 data points with a normal distribution
# Step 2: Decide on the number of bins
num_bins = int(np.ceil(1 + 3.322 * np.log10(len(data)))) # Sturges' formula
# Step 3: Determine range and bin width -- handled internally by matplotlib
# Steps 4 & 5: Sort data into bins and draw the histogram
fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=num_bins, density=True, alpha=0.75, edgecolor='black')
# Calculate probabilities (relative frequencies) manually, if needed
bin_width = np.diff(bins) # np.diff finds the difference between adjacent bin boundaries
probabilities = n * bin_width # n is already normalized to form a probability density if `density=True`
# Adding labels and title for clarity
ax.set_xlabel('Data Value')
ax.set_ylabel('Probability Density')
ax.set_title('Histogram with Probability Density')
16 Data-Driven Modeling and Optimization
16.1 StatQuest Videos
16.1.1 June, 11th 2024
16.1.1.1 Histograms
Exercise 16.1 (Histograms) Problems with histograms?
16.1.1.2 Probability Distributions
Exercise 16.2 (Smaller Bins) What happens when we use smaller bins in a histogram?
Exercise 16.3 (Density Curve) Why plot a curve to approximate a histogram?
16.1.1.3 Normal Distribution
Exercise 16.4 (TwoSDQuestion) How many samples are plus/minus two SD around the mean?
Exercise 16.5 (OneSDQuestion) How many samples are plus/minus one SD around the mean?
Exercise 16.6 (ThreeSDQuestion) How many samples are plus/minus three SD around the mean?
Exercise 16.7 (DataRangeQuestion) You have a mean at 100 and a SD of 10. Where are 95% of the data?
Exercise 16.8 (PeakHeightQuestion) If the peak is very high, is the SD low or high?
16.1.1.4 The mean, the median, and the mode
16.1.1.5 The exponential distribution
16.1.1.6 Population and Estimated Parameters
Exercise 16.9 (ProbabilityQuestion) Given a density curve with a mean of 20, what is the probability of observing a value exactly equal to 20?
16.1.1.7 Mean, Variance, and Standard Deviation
Exercise 16.10 (MeanDifferenceQuestion) The difference between \(\mu\) and x-bar?
Exercise 16.11 (EstimateMeanQuestion) How do you calculate the sample mean?
Exercise 16.12 (SigmaSquaredQuestion) What is sigma squared?
Exercise 16.13 (EstimatedSDQuestion) What is the formula for the estimated standard deviation?
Exercise 16.14 (VarianceDifferenceQuestion) Difference between the variance and the estimated variance?
16.1.2 Mathematical Models
Exercise 16.15 (ModelBenefitsQuestion) What are the benefits of using models?
16.1.2.1 Sampling from a Distribution
Exercise 16.16 (SampleDefinitionQuestion) What is a sample in statistics?
16.1.3 Hypothesis Testing and the Null-Hypothesis
Exercise 16.17 (RejectHypothesisQuestion) What does it mean to reject a hypothesis?
Exercise 16.18 (NullHypothesisQuestion) What is a null hypothesis?
Exercise 16.19 (BetterDrugQuestion) How can you show that you have found a better drug?
16.1.3.1 Alternative Hypotheses, Main Ideas
16.1.3.2 p-values: What they are and how to interpret them
Exercise 16.20 (PValueIntroductionQuestion) What is the reason for introducing the p-value?
Exercise 16.21 (PValueRangeQuestion) Is there any range for p-values? Can it be negative?
Exercise 16.22 (PValueRangeQuestion) Is there any range for p-values? Can it be negative?
Exercise 16.23 (TypicalPValueQuestion) What are typical values of the p-value and what does it mean? 5%?
Exercise 16.24 (FalsePositiveQuestion) What is a false-positive?
16.1.3.3 How to calculate p-values
Exercise 16.25 (CalculatePValueQuestion) How to calculate p-value?
Exercise 16.26 (SDCalculationQuestion) What is the SD if the mean is 155 and in the range from 142 - 169 there are 95% of the data?
Exercise 16.27 (SidedPValueQuestion) When do we need the two-sided p-value and when the one-sided?
Exercise 16.28 (CoinTestQuestion) Test a coin with Tail-Head-Head. What is the p-value?
Exercise 16.29 (BorderPValueQuestion) If you get exactly the 0.05 border value, can you reject?
Exercise 16.30 (OneSidedPValueCautionQuestion) Why should you be careful with a one-sided p-test?
Exercise 16.31 (BinomialDistributionQuestion) What is the binomial distribution?
16.1.3.4 p-hacking: What it is and how to avoid it
Exercise 16.32 (PHackingWaysQuestion) Name two typical ways of p-hacking.
Exercise 16.33 (AvoidPHackingQuestion) How can p-hacking be avoided?
Exercise 16.34 (MultipleTestingProblemQuestion) What is the multiple testing problem?
16.1.3.5 Covariance
Exercise 16.35 (CovarianceDefinitionQuestion) What is covariance?
Exercise 16.36 (CovarianceMeaningQuestion) What is the meaning of covariance?
Exercise 16.37 (CovarianceVarianceRelationshipQuestion) What is the relationship between covariance and variance?
Exercise 16.38 (HighCovarianceQuestion) If covariance is high, is there a strong relationship?
Exercise 16.39 (ZeroCovarianceQuestion) What if the covariance is zero?
Exercise 16.40 (NegativeCovarianceQuestion) Can covariance be negative?
Exercise 16.41 (NegativeVarianceQuestion) Can variance be negative?
16.1.3.6 Pearson’s Correlation
- Video: Pearson’s Correlation, Clearly Explained
Exercise 16.42 (CorrelationValueQuestion) What do you do if the correlation value is 10?
Exercise 16.43 (CorrelationRangeQuestion) What is the possible range of correlation values?
Exercise 16.44 (CorrelationFormulaQuestion) What is the formula for correlation?
16.1.3.7 Boxplots
16.1.4 June, 18th 2024
16.1.4.1 Statistical Power
Exercise 16.45 (UnderstandingStatisticalPower) What is the definition of power in a statistical test?
Exercise 16.46 (DistributionEffectOnPower) What is the implication for power analysis if the samples come from the same distribution?
Exercise 16.47 (IncreasingPower) How can you increase the power if the distributions are very similar?
Exercise 16.48 (PreventingPHacking) What should be done to avoid p-hacking when the distributions are close to each other?
Exercise 16.49 (SampleSizeAndPower) If there is overlap and the sample size is small, will the power be high or low?
16.1.4.2 Power Analysis
Exercise 16.50 (FactorsAffectingPower) Which are the two main factors that affect power?
Exercise 16.51 (PurposeOfPowerAnalysis) What does power analysis tell us?
Exercise 16.52 (ExperimentRisks) What are the two risks faced when performing an experiment?
Exercise 16.53 (PerformingPowerAnalysis) How do you perform a power analysis?
16.1.4.3 The Central Limit Theorem
Exercise 16.54 (CentralLimitTheoremExplanation) What does the Central Limit Theorem state?
16.1.4.4 Boxplots
- Video: Boxplots are Awesome
Exercise 16.55 (MedianInBoxplot) What is represented by the middle line in a boxplot?
Exercise 16.56 (BoxContentInBoxplot) What does the box in a boxplot represent?
16.1.4.5 R-squared
- Video: R-squared, Clearly Explained
Exercise 16.57 (RSquaredDefinition) What is R-squared? Show the formula.
Exercise 16.58 (NegativeRSquared) Can the R-squared value be negative?
Exercise 16.59 (RSquaredCalculation) Perform a calculation involving R-squared.
16.1.4.6 The main ideas of fitting a line to data (The main ideas of least squares and linear regression.)
- Video: The main ideas of fitting a line to data (The main ideas of least squares and linear regression.)
Exercise 16.60 (LeastSquaresMeaning) What is the meaning of the least squares method?
16.1.4.7 Linear Regression
16.1.4.8 Multiple Regression
- Video: Multiple Regression, Clearly Explained
16.1.4.9 A Gentle Introduction to Machine Learning
Exercise 16.61 (RegressionVsClassification) What is the difference between regression and classification?
16.1.4.10 Maximum Likelihood
Exercise 16.62 (LikelihoodConcept) What is the idea of likelihood?
Exercise 16.63 (ProbabilityVsLikelihood) What is the difference between probability and likelihood?
16.1.4.11 Cross-Validation
Exercise 16.64 (TrainVsTestData) What is the difference between training and testing data?
Exercise 16.65 (SingleValidationIssue) What is the problem if you validate the model only once?
Exercise 16.66 (FoldDefinition) What is a fold in cross-validation?
Exercise 16.67 (LeaveOneOutValidation) What is leave-one-out cross-validation?
16.1.4.12 The Confusion Matrix
Exercise 16.68 (DrawingConfusionMatrix) Draw the confusion matrix.
16.1.4.13 Sensitivity and Specificity
Exercise 16.69 (SensitivitySpecificityCalculation1) Calculate the sensitivity and specificity for a given confusion matrix.
Exercise 16.70 (SensitivitySpecificityCalculation2) Calculate the sensitivity and specificity for a given confusion matrix.
16.1.4.14 Bias and Variance
Exercise 16.71 (BiasAndVariance) What are bias and variance?
16.1.4.15 Mutual Information
Exercise 16.72 (MutualInformationExample) Provide an example and calculate if mutual information is high or low.
16.1.5 June, 25th 2024
16.1.5.1 Principal Component Analysis (PCA)
Exercise 16.73 (WhatIsPCA) What is PCA?
Exercise 16.74 (ScreePlotExplanation) What is a scree plot?
- Video: PCA - Practical Tips
Exercise 16.75 (LeastSquaresInPCA) Does PCA use least squares?
Exercise 16.76 (PCASteps) Which steps are performed by PCA?
Exercise 16.77 (EigenvaluePC1) What is the eigenvalue of the first principal component?
Exercise 16.78 (DifferencesBetweenPoints) Are the differences between red and yellow the same as the differences between red and blue points?
- Video: PCA in Python
Exercise 16.79 (ScalingInPCA) How to scale data in PCA?
Exercise 16.80 (DetermineNumberOfComponents) How to determine the number of principal components?
Exercise 16.81 (LimitingNumberOfComponents) How is the number of principal components limited?
16.1.6 t-SNE
- Video: t-SNE, Clearly Explained
Exercise 16.82 (WhyUseTSNE) Why use t-SNE?
Exercise 16.83 (MainIdeaOfTSNE) What is the main idea of t-SNE?
Exercise 16.84 (BasicConceptOfTSNE) What is the basic concept of t-SNE?
Exercise 16.85 (TSNESteps) What are the steps in t-SNE?
16.1.7 K-means clustering
- Video: K-means clustering
Exercise 16.86 (HowKMeansWorks) How does K-means clustering work?
Exercise 16.87 (QualityOfClusters) How can the quality of the resulting clusters be calculated?
Exercise 16.88 (IncreasingK) Why is it not a good idea to increase k too much?
16.1.8 DBSCAN
Exercise 16.89 (CorePointInDBSCAN) What is a core point in DBSCAN?
Exercise 16.90 (AddingVsExtending) What is the difference between adding and extending in DBSCAN?
Exercise 16.91 (OutliersInDBSCAN) What are outliers in DBSCAN?
16.1.9 K-nearest neighbors
Exercise 16.92 (AdvantagesAndDisadvantagesOfK) What are the advantages and disadvantages of k = 1 and k = 100 in K-nearest neighbors?
16.1.10 Naive Bayes
Exercise 16.93 (NaiveBayesFormula) What is the formula for Naive Bayes?
Exercise 16.94 (CalculateProbabilities) Calculate the probabilities for a given example using Naive Bayes.
16.1.11 Gaussian Naive Bayes
Exercise 16.95 (UnderflowProblem) Why is underflow a problem in Gaussian Naive Bayes?
16.1.12 July, 2nd 2024
16.1.12.1 Decision and Classification Trees, Clearly Explained
16.1.12.2 StatQuest: Decision Trees, Part 2 - Feature Selection and Missing Data
16.1.12.3 Regression Trees, Clearly Explained!!!
16.1.12.4 How to Prune Regression Trees, Clearly Explained!!!
16.1.12.5 Trees
Exercise 16.96 (Tree Usage) For what can we use trees?
16.1.12.6 Decision Trees
Exercise 16.97 (Tree Usage) Based on a shown tree graph:
- How can you use this tree?
- What is the root node?
- What are branches and internal nodes?
- What are the leaves?
- Are the leaves pure or impure?
- Which of the leaves is more impure?
Exercise 16.98 (Tree Feature Importance) Is the most or least important feature on top?
Exercise 16.99 (Tree Feature Imputation) How can you fill a gap/missing data?
Solution 16.1 (Tree Feature Imputation).
- Mean
- Median
- Comparing to column with high correlation
16.1.12.7 Regression Trees
Exercise 16.100 (Regression Tree Limitations) What are limitations?
Exercise 16.101 (Regression Tree Score) How is the tree score calculated?
Exercise 16.102 (Regression Tree Alpha Value Small) What can we say about the tree if the alpha value is small?
Exercise 16.103 (Regression Tree Increase Alpha Value) What happens if you increase alpha?
Exercise 16.104 (Regression Tree Pruning) What is the meaning of pruning?
16.1.13 Additional Videos
- Odds and Log(Odds), Clearly Explained!!!
- One-Hot, Label, Target and K-Fold Target Encoding, Clearly Explained!!!
- Maximum Likelihood for the Exponential Distribution, Clearly Explained!!!
- ROC and AUC, Clearly Explained!
- Entropy (for data science) Clearly Explained!!!
- Classification Trees in Python from Start to Finish: Long live video!
16.2 Introduction to Statistical Learning
Parts of this course are based on the book An Introduction to Statistical Learning, James et al. (2014). Some of the figures in this presentation are taken from An Introduction to Statistical Learning (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
16.2.1 Opening Remarks and Examples
- Artificial Intelligence (AI)
- Machine learning (ML)
- Deep Learning (DL)
- 1980’s neural networks.
- Statistical learning.
- IBM Watson supercomputer.
Statistical learning problems include:
- Identification of prostate cancer through PSA and other measurements such as age, Gleason score, etc. Scatter plots help reveal the nature of the data and its correlations. Using transformed data (log scale) can highlight typos in the data; for example, a patient with a 449-gram prostate. Recommendation: Always examine the data before conducting any sophisticated analysis.
- Classification of phonemes, specifically between “aa” and “ao.”
- Prediction of heart attacks, which can be visualized through colored scatter plots.
- Detection of email spam, based on the frequency of words within the messages, using 57 features.
- Identification of numbers in handwritten zip codes, which involves pattern recognition.
- Classification of tissue samples into cancer classes based on gene expression profiles, utilizing heat maps for visualization.
- Establishing the relationship between salary and demographic variables like income (wage) versus age, year, and education level, employing regression models.
- Classification of pixels in LANDSAT images by their usage, using nearest neighbor methods.
16.2.1.1 Supervised and Unsupervised Learning
Two important types are supervised and unsupervised learning. There are further types as well, e.g., semi-supervised learning.
16.2.1.1.1 Starting point
- Outcome measurement \(Y\) (dependent variable, response, target).
- Vector of \(p\) predictor measurements \(X\) (inputs, regressors, covariates, features, independent variables).
- Training data \((x_1, y_1), \ldots ,(x_N, y_N)\). These are observations (examples, instances) of these measurements.
In the regression problem, \(Y\) is quantitative (e.g., price, blood pressure). In the classification problem, \(Y\) takes values in a finite, unordered set (e.g., survived/died, digit 0-9, cancer class of tissue sample).
16.2.1.1.2 Philosophy
- It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
- One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
- It is important to accurately assess the performance of a method, to know how well or how badly it is working (simpler methods often perform as well as fancier ones!).
- This is an exciting research area, having important applications in science, industry and finance.
- Statistical learning is a fundamental ingredient in the training of a modern data scientist.
16.3 Basics
16.3.1 Histograms
Creating a histogram and calculating the probabilities from a dataset can be approached systematically in the following steps:
1. Data Collection: Obtain the dataset you wish to analyze. This dataset can represent any quantitative measure whose distribution you want to examine.
2. Decide on the Number of Bins: The number of bins influences the histogram’s granularity. There are several statistical rules to determine a suitable number of bins:
   - Square-root rule: suggests using the square root of the number of data points as the number of bins.
   - Sturges’ formula: \(k = 1 + 3.322 \log_{10}(n)\), where \(n\) is the number of data points and \(k\) is the suggested number of bins.
   - Freedman-Diaconis rule: uses the interquartile range (IQR) and the cube root of the number of data points \(n\) to calculate the bin width as \(2 \dfrac{IQR}{n^{1/3}}\).
3. Determine Range and Bin Width: Calculate the range of the data by subtracting the minimum data point value from the maximum. Divide this range by the number of bins to determine the width of each bin.
4. Allocate Data Points to Bins: Iterate through the data, sorting each data point into the appropriate bin based on its value.
5. Draw the Histogram: Use a histogram to visualize the frequency or relative frequency (probability) of data points within each bin.
6. Calculate Probabilities: The relative frequency of data within each bin represents the probability of a randomly selected data point falling within that bin’s range.
Below is a Python script that demonstrates how to generate a histogram and compute probabilities using the matplotlib library for visualization and numpy for data manipulation.
for i, prob in enumerate(probabilities):
    print(f"Bin {i+1} Probability: {prob:.4f}")
# Ensure probabilities sum to 1 (or very close, due to floating-point arithmetic)
print(f"Sum of probabilities: {np.sum(probabilities)}")
Bin 1 Probability: 0.0020
Bin 2 Probability: 0.0150
Bin 3 Probability: 0.0350
Bin 4 Probability: 0.1100
Bin 5 Probability: 0.1960
Bin 6 Probability: 0.2280
Bin 7 Probability: 0.2240
Bin 8 Probability: 0.1100
Bin 9 Probability: 0.0570
Bin 10 Probability: 0.0170
Bin 11 Probability: 0.0060
Sum of probabilities: 1.0
This code segment goes through the necessary steps to generate a histogram and calculate probabilities for a synthetic dataset. It demonstrates important scientific and computational practices including binning, visualization, and probability calculation in Python.
Key Points:
- The histogram represents the distribution of data, with the histogram’s bins outlining the data’s spread and density.
- The option density=True in ax.hist() normalizes the histogram so that the total area under the histogram equals 1, thereby converting frequencies to probability densities.
- The choice of bin number and width has a significant influence on the histogram’s shape and the insights that can be drawn from it, highlighting the importance of selecting appropriate binning strategies based on the dataset’s characteristics and the analysis objectives.
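The three bin-count rules from this section can be compared on the same kind of synthetic data. The following is a minimal sketch, not the script used in these notes: the seed and the variable names (`k_sqrt`, `k_sturges`, `k_fd`) are assumptions made for illustration, and `np.histogram` stands in for the manual allocation of data points to bins.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, chosen only for reproducibility
data = rng.normal(0, 1, 1000)
n = len(data)

# Square-root rule: number of bins is the square root of the sample size
k_sqrt = int(np.ceil(np.sqrt(n)))

# Sturges' formula: k = 1 + 3.322 * log10(n)
k_sturges = int(np.ceil(1 + 3.322 * np.log10(n)))

# Freedman-Diaconis rule: compute the bin width first, then derive the bin count
q75, q25 = np.percentile(data, [75, 25])
fd_width = 2 * (q75 - q25) / n ** (1 / 3)
k_fd = int(np.ceil((data.max() - data.min()) / fd_width))

for name, k in [("square-root", k_sqrt), ("Sturges", k_sturges),
                ("Freedman-Diaconis", k_fd)]:
    counts, edges = np.histogram(data, bins=k)
    rel_freq = counts / n  # relative frequency = estimated probability per bin
    print(f"{name:18s} bins={k:3d}  sum of probabilities={rel_freq.sum():.4f}")
```

For 1000 data points, Sturges' formula gives 11 bins, which matches the 11 bins in the printed output above, while the square-root rule gives 32; the Freedman-Diaconis count depends on the sampled data.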
16.3.2 Probability Distributions
What happens when we use smaller bins in a histogram? The histogram becomes more detailed, revealing the distribution of data points with greater precision. However, as the bin size decreases, the number of data points within each bin may decrease, leading to sparse or empty bins. This sparsity can make it challenging to estimate probabilities accurately, especially for data points that fall within these empty bins.
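The effect of shrinking the bins can be checked by counting empty bins for a small sample. This is a rough sketch under assumed settings (200 normally distributed points, a fixed seed, and an arbitrary selection of bin counts); the exact numbers will differ for other data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed, chosen only for reproducibility
data = rng.normal(0, 1, 200)         # a deliberately small sample

empty_bins = {}
for k in (10, 50, 200, 1000):
    counts, _ = np.histogram(data, bins=k)
    empty_bins[k] = int(np.sum(counts == 0))  # bins that received no data point
    print(f"{k:5d} bins -> {empty_bins[k]:4d} empty bins")
```

With 1000 bins and only 200 data points, at least 800 bins must be empty, so the histogram assigns an estimated probability of zero to most of the value range.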
Advantages, when using a probability distribution, include:
- Blanks can be filled
- Probabilities can be calculated
- Parameters are sufficient to describe the distribution, e.g., mean and variance for the normal distribution
Probability distributions offer a powerful solution to the challenges posed by limited data in estimating probabilities. When data is scarce, constructing a histogram to determine the probability of certain outcomes can lead to inaccurate or unreliable results due to the lack of detail in the dataset. However, collecting vast amounts of data to populate a histogram for more precise estimates can often be impractical, time-consuming, and expensive.
A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes for an experiment. It is a more efficient approach to understanding the likelihood of various outcomes than relying solely on extensive data collection. For continuous data, this is often represented graphically by a smooth curve.
16.3.2.1 The Normal Distribution: A Common Example
A commonly encountered probability distribution is the normal distribution, known for its characteristic bell-shaped curve. This curve represents how the values of a variable are distributed: most of the observations cluster around the mean (or center) of the distribution, with frequencies gradually decreasing as values move away from the mean.
The normal distribution is particularly useful because of its defined mathematical properties. It is determined entirely by its mean (mu, \(\mu\)) and its standard deviation (sigma, \(\sigma\)). The area under the curve represents probability, making it possible to calculate the likelihood of a random variable falling within a specific range.
16.3.2.2 Practical Example: Estimating Probabilities
Suppose we are interested in the heights of adults in a population. Instead of measuring the height of every adult (which would be impractical), we can use the normal distribution to estimate the probability of adults’ heights falling within certain intervals, assuming we know the mean and standard deviation of the heights.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mu = 170 # e.g., mean height of adults in cm
sd = 10 # e.g., standard deviation of heights in cm
heights = np.linspace(mu - 3*sd, mu + 3*sd, 1000)
# Calculate the probability density function for the normal distribution
pdf = norm.pdf(heights, mu, sd)
# Plot the normal distribution curve
plt.plot(heights, pdf, color='blue', linewidth=2)
# Shade the area between mu - 2*sd and mu + 2*sd
plt.fill_between(heights, pdf, where=(heights >= mu - 2 * sd) & (heights <= mu + 2*sd), color='grey', alpha=0.5)
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.show()
This Python code snippet generates a plot of the normal distribution for adult heights, with a mean of 170 cm and a standard deviation of 10 cm. It visually approximates a histogram with a blue bell-shaped curve, and highlights (in grey) the area under the curve between \(\mu \pm 2 \times \sigma\). This area corresponds to the probability of randomly selecting an individual whose height falls within this range.
By using the area under the curve, we can efficiently estimate probabilities without needing to collect and analyze a vast amount of data. This method not only saves time and resources but also provides a clear and intuitive way to understand and communicate statistical probabilities.
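The shaded area can also be computed numerically as a difference of two values of the cumulative distribution function. The sketch below reuses the example values from the text (mean 170 cm, standard deviation 10 cm); the variable name `p_two_sd` is an assumption for illustration.

```python
from scipy.stats import norm

mu, sd = 170, 10  # example values from the text

# Area under the density between mu - 2*sd and mu + 2*sd
p_two_sd = norm.cdf(mu + 2 * sd, mu, sd) - norm.cdf(mu - 2 * sd, mu, sd)
print(f"P({mu - 2*sd} <= height <= {mu + 2*sd}) = {p_two_sd:.4f}")
```

The result is about 0.9545, i.e., roughly 95% of the heights lie within two standard deviations of the mean under this model.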
16.3.3 Discrete Distributions
Discrete probability distributions are essential tools in statistics, providing a mathematical foundation to model and analyze situations with discrete outcomes. Histograms, which can be seen as discrete distributions with data organized into bins, offer a way to visualize and estimate probabilities based on the collected data. However, they come with limitations, especially when data is scarce or when we encounter gaps in the data (blank spaces in histograms). These gaps can make it challenging to accurately estimate probabilities.
A more efficient approach, especially for discrete data, is to use mathematical equations—particularly those defining discrete probability distributions—to calculate probabilities directly, thus bypassing the intricacies of data collection and histogram interpretation.
16.3.3.1 Bernoulli Distribution
The Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is a discrete probability distribution, which takes value \(1\) with success probability \(p\) and value \(0\) with failure probability \(q = 1-p\). So if \(X\) is a random variable with this distribution, we have: \[ P(X=1) = 1-P(X=0) = p = 1-q. \]
16.3.3.2 Binomial Distribution
The Binomial Distribution is a prime example of a discrete probability distribution that is particularly useful for binary outcomes (e.g., success/failure, yes/no, pumpkin pie/blueberry pie). It leverages simple mathematical principles to calculate the probability of observing a specific number of successes (preferred outcomes) in a fixed number of trials, given the probability of success in each trial.
16.3.3.3 An Illustrative Example: Pie Preference
Consider a scenario from “StatLand” where 70% of people prefer pumpkin pie over blueberry pie. The question is: What is the probability that, out of three people asked, the first two prefer pumpkin pie and the third prefers blueberry pie?
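The probability of this specific sequence is \(0.7 \times 0.7 \times 0.3 = 0.147\). A closely related question is the probability that exactly two of the three people prefer pumpkin pie, in any order. Using the Binomial Distribution, this probability can be calculated without the need to lay out every possible combination by hand: there are three orderings with two pumpkin pie preferences, each with probability 0.147, so the result is \(3 \times 0.147 = 0.441\). This process not only simplifies calculations but also provides a clear and precise method to determine probabilities in scenarios involving discrete choices. We will use Python to calculate the probability that exactly two out of three people prefer pumpkin pie, given the 70% preference rate: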
from scipy.stats import binom
n = 3 # Number of trials (people asked)
p = 0.7 # Probability of success (preferring pumpkin pie)
x = 2 # Number of successes (people preferring pumpkin pie)
# Probability calculation using Binomial Distribution
prob = binom.pmf(x, n, p)
print(f"The probability that exactly 2 out of 3 people prefer pumpkin pie is: {prob:.3f}")
The probability that exactly 2 out of 3 people prefer pumpkin pie is: 0.441
This code uses the binom.pmf() function from scipy.stats to calculate the probability mass function (PMF) of observing exactly x successes in n trials, where each trial has a success probability of p.
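To see where the value 0.441 comes from, one can enumerate all sequences of three answers by hand. The following sketch is an illustration only (the dictionary `prefs` and the variable names are assumptions): it computes the probability of the specific sequence pumpkin, pumpkin, blueberry and then sums over all sequences with exactly two pumpkin pie preferences.

```python
from itertools import product

p = 0.7  # probability that a person prefers pumpkin pie
prefs = {"pumpkin": p, "blueberry": 1 - p}

# Probability of the one specific ordered sequence: pumpkin, pumpkin, blueberry
p_sequence = prefs["pumpkin"] * prefs["pumpkin"] * prefs["blueberry"]

# Sum the probabilities of all ordered sequences with exactly two pumpkin preferences
p_exactly_two = 0.0
for seq in product(prefs, repeat=3):
    if seq.count("pumpkin") == 2:
        prob_seq = 1.0
        for choice in seq:
            prob_seq *= prefs[choice]
        p_exactly_two += prob_seq

print(f"Specific sequence:          {p_sequence:.3f}")
print(f"Exactly two, in any order:  {p_exactly_two:.3f}")
```

The specific sequence has probability 0.147, and the three possible orderings together give 0.441, which is the value returned by binom.pmf().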
A Binomial random variable is the sum of \(n\) independent, identically distributed Bernoulli random variables, each with probability \(p\) of success. We may indicate a random variable \(X\) with Bernoulli distribution using the notation \(X \sim \mathrm{Bi}(1,\theta)\). Then, the notation for the Binomial is \(X \sim \mathrm{Bi}(n,\theta)\). Its probability and distribution functions are, respectively, \[ p_X(x) = {n\choose x}\theta^x(1-\theta)^{n-x}, \qquad F_X(x) = \Pr\{X \le x\} = \sum_{i=0}^{x} {n\choose i}\theta^i(1-\theta)^{n-i}. \]
The mean of the binomial distribution is \(\text{E}[X] = n\theta\). The variance of the distribution is \(\text{Var}[X] = n\theta(1-\theta)\) (see next section).
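The statement that a Binomial random variable is a sum of Bernoulli random variables can be checked by simulation. This is a minimal sketch with assumed values (\(n = 10\), \(\theta = 0.3\), 100,000 repetitions, fixed seed); it compares the sample mean and variance with \(n\theta\) and \(n\theta(1-\theta)\).

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed, chosen only for reproducibility
n, theta = 10, 0.3                    # assumed example parameters
n_rep = 100_000                       # number of simulated Binomial draws

# Each row holds n Bernoulli(theta) draws; the row sum is one Binomial(n, theta) draw
bernoulli_draws = rng.random((n_rep, n)) < theta
x = bernoulli_draws.sum(axis=1)

print(f"sample mean     = {x.mean():.3f}  (theory: {n * theta:.3f})")
print(f"sample variance = {x.var():.3f}  (theory: {n * theta * (1 - theta):.3f})")
```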
A process consists of a sequence of \(n\) independent trials, i.e., the outcome of each trial does not depend on the outcome of previous trials. The outcome of each trial is either a success or a failure. The probability of success is denoted as \(p\), and \(p\) is constant for each trial. Coin tossing is a classical example for this setting.
The binomial distribution is a statistical distribution giving the probability of obtaining a specified number of successes in a binomial experiment; written Binomial(n, p), where \(n\) is the number of trials, and \(p\) the probability of success in each.
Definition 16.1 (Binomial Distribution) The binomial distribution with parameters \(n\) and \(p\), where \(n\) is the number of trials, and \(p\) the probability of success in each, is \[\begin{equation} p(x) = { n \choose x } p^x(1-p)^{n-x} \qquad x = 0,1, \ldots, n. \end{equation}\] The mean \(\mu\) and the variance \(\sigma^2\) of the binomial distribution are \[\begin{equation} \mu = np \end{equation}\] and \[\begin{equation} \sigma^2 = np(1-p). \end{equation}\]
Note, the Bernoulli distribution is simply Binomial(1,p).
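The formula in Definition 16.1 can be written out directly with the binomial coefficient from the standard library. The function name `binomial_pmf` is a hypothetical helper introduced here for illustration; the sketch checks that the probabilities sum to one and that mean and variance agree with \(np\) and \(np(1-p)\) for the pie example (\(n = 3\), \(p = 0.7\)).

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability p(x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 3, 0.7  # values from the pie example
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print("pmf      =", [round(px, 3) for px in pmf])
print(f"total    = {total:.3f}")
print(f"mean     = {mean:.3f}  (np = {n * p:.3f})")
print(f"variance = {variance:.3f}  (np(1-p) = {n * p * (1 - p):.3f})")

# Bernoulli as the special case Binomial(1, p)
print("P(X=1) for n=1:", binomial_pmf(1, 1, p))
```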
16.4 Continuous Distributions
Our considerations regarding probability distributions, expectations, and standard deviations will be extended from discrete distributions to continuous distributions. One simple example of a continuous distribution is the uniform distribution. Continuous distributions are defined by probability density functions.
16.4.1 Distribution functions: PDFs and CDFs
The density for a continuous distribution is a measure of the relative probability of “getting a value close to \(x\).” The probability density function \(f\) and the cumulative distribution function \(F\) are related as follows: \[\begin{equation} f(x) = \frac{d}{dx} F(x) \end{equation}\]
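The relation between density and cumulative distribution function can be illustrated numerically. The sketch below assumes the standard normal distribution as an example and approximates the derivative of the CDF with a central difference; the step size `h` and the grid are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-4, 4, 801)
h = 1e-5  # step size of the central difference (assumed value)

# Central difference approximation of d/dx F(x)
numerical_pdf = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)

max_error = np.max(np.abs(numerical_pdf - norm.pdf(x)))
print(f"Largest deviation from the exact density: {max_error:.2e}")
```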
16.4.2 Expectation (Continuous)
Definition 16.2 (Expectation (Continuous)) \[\begin{equation} \text{E}(X) = \int_{-\infty}^\infty x f(x) \, dx \end{equation}\]
16.4.3 Variance and Standard Deviation (Continuous)
Definition 16.3 (Variance (Continuous)) Variance can be calculated with \(\text{E}(X)\) and \[\begin{equation} \text{E}(X^2) = \int_{-\infty}^\infty x^2 f(x) \, dx \end{equation}\] as \[\begin{equation*} \text{Var}(X) = \text{E}(X^2) - [\text{E}(X)]^2. \end{equation*}\] \(\Box\)
Definition 16.4 (Standard Deviation (Continuous)) Standard deviation can be calculated as \[\begin{equation*} \text{sd}(X) = \sqrt{\text{Var}(X)}. \end{equation*}\] \(\Box\)
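Definitions 16.2 to 16.4 can be evaluated by numerical integration. The following sketch assumes a normal density with mean 170 and standard deviation 10 as an example and integrates over a finite interval of ten standard deviations on each side, outside of which the density is negligible; the integration limits are a simplification, not part of the definitions.

```python
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 170.0, 10.0  # assumed example parameters (heights in cm)
lower, upper = mu - 10 * sigma, mu + 10 * sigma  # the tails beyond are negligible

def f(x):
    """Density of the assumed normal distribution."""
    return norm.pdf(x, mu, sigma)

e_x, _ = quad(lambda x: x * f(x), lower, upper)      # E(X)
e_x2, _ = quad(lambda x: x**2 * f(x), lower, upper)  # E(X^2)
var_x = e_x2 - e_x**2                                # Var(X) = E(X^2) - [E(X)]^2
sd_x = var_x ** 0.5                                  # sd(X) = sqrt(Var(X))

print(f"E(X)   = {e_x:.4f}")
print(f"Var(X) = {var_x:.4f}")
print(f"sd(X)  = {sd_x:.4f}")
```

The integrals recover the parameters of the assumed density: a mean of about 170, a variance of about 100, and a standard deviation of about 10.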
16.4.4 Uniform Distribution
This variable is defined in the interval \([a,b]\). We write it as \(X \sim U[a,b]\). Its density and cumulative distribution functions are, respectively, \[ f_X(x) = \frac{I_{[a,b]}(x)}{b-a}, \quad\quad F_X(x) = \frac{1}{b-a}\int\limits_{-\infty}\limits^x I_{[a,b]}(t) \mathrm{d}t = \frac{x-a}{b-a}, \] where \(I_{[a,b]}(\cdot)\) is the indicator function of the interval \([a,b]\). Note that, if we set \(a=0\) and \(b=1\), we obtain \(F_X(x) = x\), \(x\) \(\in\) \([0,1]\).
A typical example is the following: if \(X\) is a continuous r.v. with cdf \(F_X\), then the transformed variable \(F_X(X)\) is uniformly distributed in \([0,1]\). The proof of this statement is as follows (assuming that \(F_X\) is strictly increasing, so that the inverse \(F_X^{-1}\) exists): For \(u \in [0,1]\), we have \[\begin{eqnarray*} \Pr\{F_X(X) \leq u\} &=& \Pr\{F_X^{-1}(F_X(X)) \leq F_X^{-1}(u)\} = \Pr\{X \leq F_X^{-1}(u)\} \\ &=& F_X(F_X^{-1}(u)) = u. \end{eqnarray*}\] This means that, when \(X\) is continuous, there is a one-to-one relationship (given by the cdf) between \(x \in D_X\) and \(u \in [0,1]\).
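The one-to-one relationship between \(x\) and \(u\) can be illustrated in both directions. The sketch below assumes a normal distribution with mean 170 and standard deviation 10 (example values) and a fixed seed: applying the cdf to normal samples should give approximately uniform values, and applying the quantile function `norm.ppf` (the inverse cdf) to uniform samples should give approximately normal values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=7)  # fixed seed, chosen only for reproducibility
mu, sigma = 170, 10                  # assumed example parameters

# Direction 1: u = F_X(X) should be approximately uniform on [0, 1]
x = rng.normal(loc=mu, scale=sigma, size=10_000)
u = norm.cdf(x, loc=mu, scale=sigma)
print(f"mean of u = {u.mean():.3f} (uniform: 0.500)")
print(f"var of u  = {u.var():.4f} (uniform: {1 / 12:.4f})")

# Direction 2: x = F_X^{-1}(u) turns uniform samples into normal samples
u_new = rng.uniform(0, 1, 10_000)
x_new = norm.ppf(u_new, loc=mu, scale=sigma)
print(f"mean of x_new = {x_new.mean():.2f}, sd of x_new = {x_new.std():.2f}")
```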
The uniform distribution has a constant density over a specified interval, say \([a,b]\). The uniform \(U(a,b)\) distribution has density \[\begin{equation} f(x) = \left\{ \begin{array}{ll} 1/(b-a) & \textrm{ if } a < x < b,\\ 0 & \textrm{ otherwise} \end{array} \right. \end{equation}\]
16.4.5 Normal Distribution
Definition 16.5 (Normal Distribution) This variable is defined on the support \(D_X = \mathbb{R}\) and its density function is given by \[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left \{-\frac{1}{2\sigma^2}(x-\mu)^2 \right \}. \] The density function is identified by the pair of parameters \((\mu,\sigma^2)\), where \(\mu\) \(\in\) \(\mathbb{R}\) is the mean (or location parameter) and \(\sigma^2 > 0\) is the variance (or dispersion parameter) of \(X\). \(\Box\)
The density function is symmetric around \(\mu\). The normal distribution belongs to the location-scale family of distributions. This means that, if \(Z \sim N(0,1)\) (read, \(Z\) has a standard normal distribution; i.e., with \(\mu=0\) and \(\sigma^2=1\)), and we consider the linear transformation \(X = \mu + \sigma Z\), then \(X \sim N(\mu,\sigma^2)\) (read, \(X\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\)). This means that one can obtain the probability of any interval \((-\infty,x]\), \(x \in \mathbb{R}\), for any normal distribution (i.e., for any pair of the parameters \(\mu\) and \(\sigma\)) once the quantiles of the standard normal distribution are known. Indeed \[\begin{eqnarray*} F_X(x) &=& \Pr\left\{X \leq x \right\} = \Pr\left\{\frac{X-\mu}{\sigma} \leq \frac{x-\mu}{\sigma} \right\} \\ &=& \Pr\left\{Z \leq \frac{x-\mu}{\sigma}\right\} = F_Z\left(\frac{x-\mu}{\sigma}\right) \qquad x \in \mathbb{R}. \end{eqnarray*}\] The quantiles of the standard normal distribution are available in any statistical program. The density and cumulative distribution function of the standard normal r.v. at point \(x\) are usually denoted by the symbols \(\phi(x)\) and \(\Phi(x)\).
The standard normal distribution is based on the probability density function \[ \varphi(z) = \frac{1}{\sqrt{2\pi}} \exp \left(- \frac{z^2}{2} \right). \tag{16.1}\]
An important application of the standardization introduced in Equation 16.1 reads as follows. If the distribution of \(X\) is approximately normal, the distribution of the standardized variable \(X^{*} = (X-\mu)/\sigma\) is approximately standard normal. That is, \[\begin{equation*} P(X\leq b) = P\left( \frac{X-\mu}{\sigma} \leq \frac{b-\mu}{\sigma}\right) = P\left(X^{*} \leq \frac{b-\mu}{\sigma}\right). \end{equation*}\] The probability \(P(X\leq b)\) can therefore be approximated by \(\Phi\left(\frac{b-\mu}{\sigma}\right)\), where \(\Phi\) is the standard normal cumulative distribution function.
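A short numerical check of the standardization, under assumed example values (\(\mu = 100\), \(\sigma = 10\), \(b = 120\)): the probability obtained from the standard normal cdf at the standardized value should agree with the probability computed directly from the normal distribution with the original parameters.

```python
from scipy.stats import norm

mu, sigma, b = 100, 10, 120  # assumed example values

z = (b - mu) / sigma                         # standardized bound
p_standardized = norm.cdf(z)                 # Phi((b - mu) / sigma)
p_direct = norm.cdf(b, loc=mu, scale=sigma)  # P(X <= b) without standardizing

print(f"z = {z:.1f}")
print(f"Phi(z)    = {p_standardized:.4f}")
print(f"P(X <= b) = {p_direct:.4f}")
```

Here \(z = 2\), and both calculations give a probability of about 0.9772.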
If \(X\) is a normal random variable with mean \(\mu\) and variance \(\sigma^2\), i.e., \(X \sim \cal{N} (\mu, \sigma^2)\), then \[\begin{equation} X = \mu + \sigma Z \textrm{ where } Z \sim \cal{N}(0,1). \end{equation}\] Conversely, \(Z = (X-\mu)/\sigma\) has a standard normal distribution.
The probability of getting a value in a particular interval is the area under the corresponding part of the curve. Consider the density function of the normal distribution. It can be plotted using the following commands. The result is shown in Figure 16.3.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x = np.arange(-4, 4, 0.1)
# Calculating the normal distribution's density function values for each point in x
y = norm.pdf(x, 0, 1)
plt.plot(x, y, linestyle='-', linewidth=2)
plt.title('Normal Distribution')
plt.xlabel('X')
plt.ylabel('Density')
plt.grid(True)
plt.show()
The cumulative distribution function (CDF) describes the probability of “hitting” \(x\) or less in a given distribution. We consider the CDF of the normal distribution. It can be plotted using the following commands. The result is shown in Figure 16.4.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generating a sequence of numbers from -4 to 4 with 0.1 intervals
x = np.arange(-4, 4, 0.1)

# Calculating the cumulative distribution function value of the normal distribution for each point in x
y = norm.cdf(x, 0, 1) # mean=0, stddev=1

# Plotting the results. The equivalent of 'type="l"' in R (line plot) becomes the default plot type in matplotlib.
plt.plot(x, y, linestyle='-', linewidth=2)
plt.title('Normal Distribution CDF')
plt.xlabel('X')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.show()
16.4.6 The Mean, the Median, and the Mode
16.4.7 The Exponential Distribution
The exponential distribution is a continuous probability distribution that describes the time between events in a Poisson process, where events occur continuously and independently at a constant average rate. It is characterized by a single parameter, the rate parameter \(\lambda\), which represents the average number of events per unit time.
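The role of the rate parameter can be illustrated by simulation. The sketch below draws waiting times with `random.expovariate` from the Python standard library; the rate `lam = 2.0` is an assumed value chosen for illustration. For an exponential distribution with rate \(\lambda\), the mean waiting time is \(1/\lambda\), so the sample mean should be close to 0.5.

```python
import random

random.seed(0)

# Assumed rate: on average two events per unit time.
lam = 2.0

# Simulate waiting times between consecutive events.
waits = [random.expovariate(lam) for _ in range(100_000)]

# The sample mean should be close to the theoretical mean 1 / lam.
sample_mean = sum(waits) / len(waits)
print(sample_mean, 1.0 / lam)
```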
16.4.8 Population and Estimated Parameters
16.4.9 Calculating the Mean, Variance, and Standard Deviation
16.4.10 What is a Mathematical Model?
16.4.11 Sampling from a Distribution
16.4.12 Hypothesis Testing and the Null Hypothesis
16.4.13 Alternative Hypotheses
16.4.14 p-values: What They Are and How to Interpret Them
16.4.15 How to Calculate p-values
16.4.16 p-hacking: What It Is and How to Avoid It
16.4.17 Covariance
16.4.18 Pearson’s Correlation
16.4.19 Boxplots
16.4.20 R-squared
16.4.21 The Main Ideas of Fitting a Line to Data
16.4.22 Linear Regression
16.4.23 Multiple Regression
16.5 Supervised Learning
Objectives of supervised learning: On the basis of the training data we would like to:
- Accurately predict unseen test cases.
- Understand which inputs affect the outcome, and how.
- Assess the quality of our predictions and inferences.
Note: Supervised means \(Y\) is known.
Exercise 16.105
- Do children learn supervised?
- When do you learn supervised?
- Can learning be unsupervised?
16.5.0.0.1 Unsupervised Learning
No outcome variable, just a set of predictors (features) measured on a set of samples. The objective is more fuzzy: find groups of samples that behave similarly, find features that behave similarly, or find linear combinations of features with the most variation. It is difficult to know how well you are doing. Unsupervised learning is different from supervised learning, but can be useful as a pre-processing step for supervised learning. Clustering and principal component analysis are important techniques.
Unsupervised: \(Y\) is unknown; there is no \(Y\), no trainer, no teacher. What remains are the distances between the input values (features), so a distance (or similarity) measure is necessary.
16.5.0.0.2 Statistical Learning
We consider supervised learning first.
Figure 16.5 shows sales figures from a marketing campaign. The trend for each predictor is shown using regression. The relationship with the first predictor seems to be stronger than the relationship with the third.
Can we predict \(Y\) = Sales using these three predictors? Perhaps we can do better using a model \[ Y = Sales \approx f(X_1 = TV, X_2 = Radio, X_3= Newspaper) \] modeling the joint relationship.
Here Sales is a response or target that we wish to predict. We generically refer to the response as \(Y\). TV is a feature, or input, or predictor; we name it \(X_1\). Likewise name Radio as \(X_2\), and so on. We can refer to the input vector collectively as \[ X = \begin{pmatrix} X_1\\ X_2\\ X_3 \end{pmatrix} \]
Now we write our model as \[ Y = f(X) + \epsilon \] where \(\epsilon\) captures measurement errors and other discrepancies.
What is \(f\) good for? With a good \(f\) we can make predictions of \(Y\) at new points \(X = x\). We can understand which components of \(X = (X_1, X_2, \ldots X_p)\) are important in explaining \(Y\), and which are irrelevant.
For example, Seniority and Years of Education have a big impact on Income, but Marital Status typically does not. Depending on the complexity of \(f\), we may be able to understand how each component \(X_j\) of \(X\) affects \(Y\).
16.5.1 Statistical Learning and Regression
16.5.1.1 Regression Function
Consider Figure 16.6. Is there an ideal \(f(X)\)? In particular, what is a good value for \(f(X)\) at any selected value of \(X\), say \(X = 4\)? There can be many \(Y\) values at \(X=4\). A good value is \[ f(4) = E(Y |X = 4). \]
\(E(Y |X = 4)\) means expected value (average) of \(Y\) given \(X = 4\).
The ideal \(f(x) = E(Y |X = x)\) is called the regression function. Read: The regression function gives the conditional expectation of \(Y\) given \(X\).
The regression function \(f(x)\) is also defined for the vector \(X\); e.g., \(f(x) = f(x_1, x_2, x_3) = E(Y | X_1 =x_1, X_2 =x_2, X_3 =x_3).\)
16.5.2 Optimal Predictor
The regression function is the ideal or optimal predictor of \(Y\) with regard to mean-squared prediction error: It means that \(f(x) = E(Y | X = x)\) is the function that minimizes \[ E[(Y - g(X))^2|X = x] \] over all functions \(g\) at all points \(X = x\).
16.5.2.1 Residuals, Reducible and Irreducible Error
At each point \(X\) we make mistakes: \[ \epsilon = Y-f(x) \] is the residual. Even if we knew \(f(x)\), we would still make errors in prediction, since at each \(X=x\) there is typically a distribution of possible \(Y\) values as is illustrated in Figure 16.6.
For any estimate \(\hat{f}(x)\) of \(f(x)\), we have \[ E\left[ ( Y - \hat{f}(X))^2 | X = x\right] = \left[ f(x) - \hat{f}(x) \right]^2 + \text{var}(\epsilon), \] where \(\left[ f(x) - \hat{f}(x) \right]^2\) is the reducible error, because it depends on the model (changing the model \(\hat{f}\) might reduce this error), and \(\text{var}(\epsilon)\) is the irreducible error.
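The decomposition can be verified in a few lines. The following derivation is a sketch under two assumptions: the estimate \(\hat{f}\) is treated as fixed, and the variance of \(\epsilon\) does not depend on \(x\). Writing \(Y = f(x) + \epsilon\) at \(X = x\) gives \[\begin{eqnarray*} E\left[ ( Y - \hat{f}(X))^2 | X = x\right] &=& E\left[ \left( f(x) - \hat{f}(x) + \epsilon \right)^2 | X = x\right] \\ &=& \left[ f(x) - \hat{f}(x) \right]^2 + 2 \left[ f(x) - \hat{f}(x) \right] E\left[\epsilon | X = x\right] + E\left[\epsilon^2 | X = x\right] \\ &=& \left[ f(x) - \hat{f}(x) \right]^2 + \text{var}(\epsilon). \end{eqnarray*}\] The cross term vanishes because \(f(x) = E(Y | X = x)\) implies \(E[\epsilon | X = x] = 0\), and for the same reason \(E[\epsilon^2 | X = x] = \text{var}(\epsilon)\).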
16.5.2.2 Local Regression (Smoothing)
Typically we have few if any data points with \(X = 4\) exactly. So we cannot compute \(E(Y |X = x)\)! Idea: Relax the definition and let \[ \hat{f}(x)= Ave(Y|X \in \cal{N}(x)), \] where \(\cal{N} (x)\) is some neighborhood of \(x\), see Figure 16.7.
Nearest neighbor averaging can be pretty good for small \(p\), i.e., \(p \leq 4\) and large-ish \(N\). We will discuss smoother versions, such as kernel and spline smoothing later in the course.
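Nearest neighbor averaging can be sketched in a few lines of NumPy. In the example below, the true function \(f(x) = \sin(x)\), the noise level, and the helper name `knn_average` are assumptions made for illustration only; the neighborhood \(\cal{N}(x)\) is taken to be the \(k\) training points closest to \(x\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data; f(x) = sin(x) is an assumed true function.
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)


def knn_average(x0, x, y, k=20):
    """Estimate E(Y | X = x0) by averaging y over the k nearest x-values."""
    idx = np.argsort(np.abs(x - x0))[:k]  # indices of the k closest points
    return y[idx].mean()


# Estimate of f(4), compared with the assumed true value sin(4).
f_hat_4 = knn_average(4.0, x, y)
print(f_hat_4, np.sin(4.0))
```

A larger `k` averages over more points (lower variance) but uses a wider, less local neighborhood (higher bias).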
16.5.3 Curse of Dimensionality and Parametric Models
Local methods, e.g., nearest neighbor methods, can be lousy when \(p\) is large. Reason: the curse of dimensionality, i.e., nearest neighbors tend to be far away in high dimensions. We need to average a reasonable fraction of the \(N\) values of \(y_i\), e.g., 10%, to bring the variance down. A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating \(E(Y |X = x)\) by local averaging, see Figure 16.8. If the curse of dimensionality did not exist, nearest neighbor models would be perfect prediction models.
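A small calculation makes the loss of locality concrete. The sketch below assumes that the inputs are uniformly distributed in the unit hypercube \([0,1]^p\) (an assumption made for illustration); a sub-cube that contains a given fraction of the data then needs an edge length of fraction\(^{1/p}\). The function name `edge_length` is hypothetical.

```python
def edge_length(fraction, p):
    """Edge length of a sub-cube holding `fraction` of uniform data in [0, 1]^p."""
    return fraction ** (1.0 / p)


# Edge length needed to capture 10% of the data in p dimensions.
for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.10, p), 3))
```

For \(p = 10\) the edge length is already about 0.79, i.e., the "neighborhood" covers almost 80% of the range of each input.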
We will use structured (parametric) models to deal with the curse of dimensionality. The linear model is an important example of a parametric model: \[ f_L(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p. \] A linear model is specified in terms of \(p + 1\) parameters \(\beta_0, \beta_1, \ldots, \beta_p\). We estimate the parameters by fitting the model to training data. Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function \(f(X)\).
The linear model avoids the curse of dimensionality, because it does not rely on any local properties. Linear models belong to the class of parametric approaches: they replace the problem of estimating \(f\) with estimating a fixed set of coefficients \(\beta_i\), with \(i=0,1, \ldots, p\).
A linear model \[ \hat{f}_L(X) = \hat{\beta}_0 + \hat{\beta}_1 X \] gives a reasonable fit, see Figure 16.9. A quadratic model \[ \hat{f}_Q(X) = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 X^2 \] gives a slightly improved fit, see Figure 16.10.
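Fitting both models can be sketched with `np.polyfit`, which estimates the coefficients by least squares. The data below are simulated; the true coefficients and the noise level are assumptions chosen for illustration and do not reproduce the data in the figures.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with a mildly curved (assumed) true relationship.
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + 0.1 * x**2 + rng.normal(0, 1.0, x.size)

# Least-squares fits; np.polyfit returns the highest-degree coefficient first.
beta_lin = np.polyfit(x, y, deg=1)   # [beta1_hat, beta0_hat]
beta_quad = np.polyfit(x, y, deg=2)  # [beta2_hat, beta1_hat, beta0_hat]

# Training mean squared error of both fits.
mse_lin = np.mean((y - np.polyval(beta_lin, x)) ** 2)
mse_quad = np.mean((y - np.polyval(beta_quad, x)) ** 2)
print(mse_lin, mse_quad)
```

Because the linear model is a special case of the quadratic model, the training error of the quadratic fit can never be larger than that of the linear fit.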
Figure 16.11 shows a simulated example. Red points are simulated values for income from the model \[ income = f(education, seniority) + \epsilon \] \(f\) is the blue surface.
The linear regression model \[ \hat{f}(education, seniority) = \hat{\beta}_0 + \hat{\beta}_1 \times education + \hat{\beta}_2 \times seniority \] captures the important information, but it does not capture everything. A more flexible regression model \[ \hat{f}_S (education, seniority) \] can be fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. An even more flexible spline regression model \[ \hat{f}_S (education, seniority) \] can also be fit to the simulated data. Here the fitted model makes no errors on the training data! This is known as overfitting.
16.5.3.1 Trade-offs
- Prediction accuracy versus interpretability: Linear models are easy to interpret; thin-plate splines are not.
- Good fit versus over-fit or under-fit: How do we know when the fit is just right?
- Parsimony (Occam’s razor) versus black-box: We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
The trade-offs are visualized in Figure 16.15.
16.5.4 Assessing Model Accuracy and Bias-Variance Trade-off
Suppose we fit a model \(\hat{f}(x)\) to some training data \(Tr = \{x_i, y_i \}^N_1\), and we wish to see how well it performs. We could compute the average squared prediction error over \(Tr\): \[ MSE_{Tr} = Ave_{i \in Tr}[y_i - \hat{f}(x_i)]^2. \] This may be biased toward more overfit models. Instead we should, if possible, compute it using fresh test data \(Te = \{x_i, y_i \}^M_1\): \[ MSE_{Te} = Ave_{i \in Te}[y_i - \hat{f}(x_i)]^2. \] The red curve, which illustrates the test error, can be estimated by holding out some data to get the test-data set.
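The difference between training and test error can be illustrated with simulated data. In the sketch below, the true function \(\sin(2\pi x)\), the noise level, the sample sizes, and the polynomial degrees are all assumptions chosen for illustration; flexibility is controlled by the polynomial degree.

```python
import numpy as np

rng = np.random.default_rng(2)


def f(x):
    """Assumed true regression function (illustration only)."""
    return np.sin(2 * np.pi * x)


def sample(n):
    """Draw n observations (x_i, y_i) from the assumed model."""
    x = rng.uniform(0, 1, n)
    return x, f(x) + rng.normal(0, 0.3, n)


x_tr, y_tr = sample(30)    # training data Tr
x_te, y_te = sample(1000)  # fresh test data Te

results = {}
for degree in (1, 3, 6):
    coef = np.polyfit(x_tr, y_tr, degree)  # fit on training data only
    mse_tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    results[degree] = (mse_tr, mse_te)
    print(degree, mse_tr, mse_te)
```

The training error can only decrease as the degree grows, whereas the test error has no such guarantee.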
16.5.4.1 Bias-Variance Trade-off
Suppose we have fit a model \(\hat{f}(x)\) to some training data \(Tr\), and let \((x_0, y_0)\) be a test observation drawn from the population. If the true model is \[ Y = f(X) + \epsilon \qquad \text{ with } f(x) = E(Y|X=x), \] then \[ E \left( y_0 - \hat{f}(x_0) \right)^2 = \text{var} (\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + \text{var}(\epsilon). \tag{16.2}\]
Here, \(\text{var}(\epsilon)\) is the irreducible error. The reducible error consists of two components:
- \(\text{var} (\hat{f}(x_0))\) is the variance that comes from different training sets. Different training sets result in different functions \(\hat{f}\).
- \(Bias(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)\).
The expectation averages over the variability of \(y_0\) as well as the variability in \(Tr\). Typically, as the flexibility of \(\hat{f}\) increases, its variance increases (because the fits differ from training set to training set), and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off, see Figure 16.19.
If we add the two components (reducible and irreducible error), we get the MSE in Figure 16.19 as can be seen in Equation 16.2.
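The terms of Equation 16.2 can be estimated by simulation. The sketch below is an illustration under stated assumptions: the true function is \(\sin(2\pi x)\), the noise standard deviation is 0.3, the inputs of the training set are kept fixed so that only the noise differs between training sets, and a polynomial of degree 1 is compared with one of degree 5 at the point \(x_0 = 0.25\).

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3  # assumed standard deviation of epsilon


def f(x):
    """Assumed true regression function (illustration only)."""
    return np.sin(2 * np.pi * x)


x_tr = np.linspace(0, 1, 30)  # fixed inputs; only the noise is redrawn
x0 = 0.25                     # test point
n_sets = 2000                 # number of simulated training sets

summary = {}
for degree in (1, 5):
    preds = np.empty(n_sets)
    for s in range(n_sets):
        y_tr = f(x_tr) + rng.normal(0, sigma, x_tr.size)
        preds[s] = np.polyval(np.polyfit(x_tr, y_tr, degree), x0)
    variance = preds.var()                  # var(f_hat(x0))
    bias_sq = (preds.mean() - f(x0)) ** 2   # [Bias(f_hat(x0))]^2
    expected_error = variance + bias_sq + sigma**2
    summary[degree] = (variance, bias_sq, expected_error)
    print(degree, variance, bias_sq, expected_error)
```

In this setting the more flexible degree-5 fit shows the larger variance and the smaller squared bias.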
16.5.5 Classification Problems and K-Nearest Neighbors
In classification we have a qualitative response variable.
Here the response variable \(Y\) is qualitative, e.g., email is one of \(\cal{C} = \{spam, ham\}\), where ham is good email, or digit class is one of \(\cal{C} = \{ 0, 1, \ldots, 9 \}\). Our goals are to:
- Build a classifier \(C(X)\) that assigns a class label from \(\cal{C}\) to a future unlabeled observation \(X\).
- Assess the uncertainty in each classification.
- Understand the roles of the different predictors among \(X = (X_1,X_2, \ldots, X_p)\).
Simulation example depicted in @fig-0218a. \(Y\) takes two values, zero and one, and \(X\) is a single predictor. Big sample: each single vertical bar indicates an occurrence of a zero (orange) or one (blue) as a function of \(X\). The black curve generated the data: it is the probability of generating a one. For high values of \(X\), the probability of a one increases. What is an ideal classifier \(C(X)\)?
Suppose the \(K\) elements in \(\cal{C}\) are numbered \(1,2,\ldots, K\). Let \[ p_k(x) = Pr(Y = k|X = x), k = 1,2,\ldots,K. \]
These are the conditional class probabilities at \(x\); e.g. see little barplot at \(x = 5\). Then the Bayes optimal classifier at \(x\) is \[ C(x) = j \qquad \text{ if } p_j(x) = \max \{p_1(x),p_2(x),\ldots, p_K(x)\}. \] At \(x=5\) there is an 80% probability of a one, and a 20% probability of a zero. So, we classify this point to the class with the highest probability, the majority class.
Nearest-neighbor averaging can be used as before. This is illustrated in Fig. \(\ref{fig:0219a}\). Here, we consider 100 points only. Nearest-neighbor averaging also breaks down as dimension grows. However, the impact on \(\hat{C}(x)\) is less than on \(\hat{p}_k (x)\), \(k = 1, \ldots, K\).
16.5.5.1 Classification: Some Details
The average number of errors made is used to measure the performance. Typically we measure the performance of \(\hat{C}(x)\) using the misclassification error rate: \[ Err_{Te} = Ave_{i\in Te} I[y_i \neq \hat{C} (x_i) ]. \] The Bayes classifier (using the true \(p_k(x)\)) has the smallest error (in the population).
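Both the classification rule and the error rate are short computations. The sketch below uses hypothetical conditional class probabilities and made-up test labels for three points; the classes are coded as 0 and 1. The rule picks the class with the largest probability, and the error rate is the average of the indicator \(I[y_i \neq \hat{C}(x_i)]\).

```python
import numpy as np

# Hypothetical conditional class probabilities p_k(x) at three points x.
# Rows are points, columns are the classes 0 and 1.
p = np.array([[0.2, 0.8],
              [0.7, 0.3],
              [0.4, 0.6]])

# Classify each point to the class with the largest probability.
c_hat = p.argmax(axis=1)

# Made-up true labels of the three test points.
y_true = np.array([1, 0, 0])

# Misclassification error rate: share of points with y_i != C_hat(x_i).
err = np.mean(y_true != c_hat)
print(c_hat, err)
```

Here the third point is misclassified, so the error rate is one third.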
16.5.6 k-Nearest Neighbor Classification
Consider k-nearest neighbors in two dimensions. Orange and blue dots label the true class memberships of the underlying points in the 2-dim plane. Dotted line is the decision boundary, that is the contour with equal probability for both classes.
Nearest-neighbor averaging in 2-dim. At any given point we want to classify, we spread out a little neighborhood, say the \(K=10\) nearest points, and calculate the percentage of blue and orange points. We assign the color with the highest percentage to this point. If this is done for every point in the plane, we obtain the solid black curve as the estimated decision boundary.
We can use \(K=1\). This is the nearest-neighbor classifier. The decision boundary is piecewise linear. Islands occur. Approximation is rather noisy.
\(K=100\) leads to a smooth decision boundary, but the result gets uninteresting.
A large \(K\) means higher bias, so \(1/K\) is plotted on the \(x\)-axis: this way we go from low to high complexity along the \(x\)-axis, see Figure 16.25. The horizontal dotted line is the Bayes error rate.
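The classification rule described above can be sketched with NumPy. The two Gaussian classes, their centers, and the function name `knn_predict` are assumptions made for illustration; distances are Euclidean, and a tie in the vote is resolved in favor of class 0.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two assumed Gaussian classes in the plane (0 = orange, 1 = blue).
n = 100
X = np.vstack([rng.normal([0, 0], 1.0, (n, 2)),
               rng.normal([2, 2], 1.0, (n, 2))])
y = np.repeat([0, 1], n)


def knn_predict(x0, X, y, k=10):
    """Majority vote among the k training points closest to x0."""
    dist = np.linalg.norm(X - x0, axis=1)  # Euclidean distances to x0
    nearest = np.argsort(dist)[:k]         # indices of the k nearest points
    return int(y[nearest].mean() > 0.5)    # share of class 1 above one half?


print(knn_predict(np.array([-1.0, -1.0]), X, y))
print(knn_predict(np.array([3.0, 3.0]), X, y))
```

Evaluating `knn_predict` on a fine grid of points and drawing the contour where the predicted class changes would give the estimated decision boundary.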
16.5.7 Minkowski Distance
The Minkowski distance of order \(p\) (where \(p \geq 1\) is an integer) between two points \(X=(x_1,x_2,\ldots,x_n)\text{ and }Y=(y_1,y_2,\ldots,y_n) \in \mathbb{R}^n\) is defined as: \[ D \left( X,Y \right) = \left( \sum_{i=1}^n |x_i-y_i|^p \right)^\frac{1}{p}. \]
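The definition translates directly into code. The following is a minimal sketch in plain Python; the function name `minkowski` and the example points are chosen for illustration. The order \(p = 1\) gives the Manhattan distance and \(p = 2\) the Euclidean distance.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, p=1))  # Manhattan distance: 7.0
print(minkowski(a, b, p=2))  # Euclidean distance: 5.0
```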
16.5.8 Unsupervised Learning: Classification
16.5.8.1 k-Means Algorithm
The \(k\)-means algorithm is an unsupervised learning algorithm that has a loose relationship to the \(k\)-nearest neighbor classifier. The \(k\)-means algorithm works as follows:
- Step 1: Randomly choose \(k\) centers. Assign each point to a cluster.
- Step 2: Determine the distances of each data point to the centroids and re-assign each point to the closest cluster centroid based upon minimum distance.
- Step 3: Calculate the cluster centroids again.
- Step 4: Repeat steps 2 and 3 until convergence, i.e., until no improvements are possible and no data points switch from one cluster to another. The result is in general a local optimum, not necessarily the global one.
The basic principle of the \(k\)-means algorithm is illustrated in Figure 16.26, Figure 16.27, Figure 16.28, and Figure 16.29.
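The four steps can be sketched with NumPy as follows. This is an illustrative implementation, not a reference one: the function name `k_means`, the choice of \(k\) randomly selected data points as initial centers, the Euclidean distance, and the simulated two-cluster data are all assumptions.

```python
import numpy as np


def k_means(X, k, n_iter=100, seed=0):
    """Basic k-means; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centers (assumption).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each point to the closest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        # Step 4: stop when no point switches its cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute centroids (an empty cluster keeps its old center).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


# Simulated data: two well-separated groups of 50 points each.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([5, 5], 0.5, (50, 2))])
labels, centers = k_means(X, k=2)
print(centers)
```

Because the result depends on the initial centers, it is common to run the algorithm with several seeds and keep the solution with the smallest within-cluster sum of squared distances.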