16  Data-Driven Modeling and Optimization

16.1 StatQuest Videos

16.1.1 June, 11th 2024

16.1.1.1 Histograms

Exercise 16.1 (Histograms) Problems with histograms?

16.1.1.2 Probability Distributions

Exercise 16.2 (Smaller Bins) What happens when we use smaller bins in a histogram?

Exercise 16.3 (Density Curve) Why plot a curve to approximate a histogram?

16.1.1.3 Normal Distribution

Exercise 16.4 (TwoSDQuestion) How many samples are plus/minus two SD around the mean?

Exercise 16.5 (OneSDQuestion) How many samples are plus/minus one SD around the mean?

Exercise 16.6 (ThreeSDQuestion) How many samples are plus/minus three SD around the mean?

Exercise 16.7 (DataRangeQuestion) You have a mean at 100 and a SD of 10. Where are 95% of the data?

Exercise 16.8 (PeakHeightQuestion) If the peak is very high, is the SD low or high?

16.1.1.4 The mean, the media, and the mode

16.1.1.5 The exponential distribution

16.1.1.6 Population and Estimated Parameters

Exercise 16.9 (ProbabilityQuestion) If we have a certain curve and want to calculate the probability of values equal to 20 if the mean is 20.

16.1.1.7 Mean, Variance, and Standard Deviation

Exercise 16.10 (MeanDifferenceQuestion) The difference between \(\mu\) and x-bar?

Exercise 16.11 (EstimateMeanQuestion) How do you calculate the sample mean?

Exercise 16.12 (SigmaSquaredQuestion) What is sigma squared?

Exercise 16.13 (EstimatedSDQuestion) What is the formula for the estimated standard deviation?

Exercise 16.14 (VarianceDifferenceQuestion) Difference between the variance and the estimated variance?

16.1.2 Mathematical Models

Exercise 16.15 (ModelBenefitsQuestion) What are the benefits of using models?

16.1.2.1 Sampling from a Distribution

Exercise 16.16 (SampleDefinitionQuestion) What is a sample in statistics?

16.1.3 Hypothesis Testing and the Null-Hypothesis

Exercise 16.17 (RejectHypothesisQuestion) What does it mean to reject a hypothesis?

Exercise 16.18 (NullHypothesisQuestion) What is a null hypothesis?

Exercise 16.19 (BetterDrugQuestion) How can you show that you have found a better drug?

16.1.3.1 Alternative Hypotheses, Main Ideas

16.1.3.2 p-values: What they are and how to interpret them

Exercise 16.20 (PValueIntroductionQuestion) What is the reason for introducing the p-value?

Exercise 16.21 (PValueRangeQuestion) Is there any range for p-values? Can it be negative?

Exercise 16.22 (PValueRangeQuestion) Is there any range for p-values? Can it be negative?

Exercise 16.23 (TypicalPValueQuestion) What are typical values of the p-value and what does it mean? 5%?

Exercise 16.24 (FalsePositiveQuestion) What is a false-positive?

16.1.3.3 How to calculate p-values

Exercise 16.25 (CalculatePValueQuestion) How to calculate p-value?

Exercise 16.26 (SDCalculationQuestion) What is the SD if the mean is 155 and in the range from 142 - 169 there are 95% of the data?

Exercise 16.27 (SidedPValueQuestion) When do we need the two-sided p-value and when the one-sided?

Exercise 16.28 (CoinTestQuestion) Test a coin with Tail-Head-Head. What is the p-value?

Exercise 16.29 (BorderPValueQuestion) If you get exactly the 0.05 border value, can you reject?

Exercise 16.30 (OneSidedPValueCautionQuestion) Why should you be careful with a one-sided p-test?

Exercise 16.31 (BinomialDistributionQuestion) What is the binomial distribution?

16.1.3.4 p-hacking: What it is and how to avoid it

Exercise 16.32 (PHackingWaysQuestion) Name two typical ways of p-hacking.

Exercise 16.33 (AvoidPHackingQuestion) How can p-hacking be avoided?

Exercise 16.34 (MultipleTestingProblemQuestion) What is the multiple testing problem?

16.1.3.5 Covariance

Exercise 16.35 (CovarianceDefinitionQuestion) What is covariance?

Exercise 16.36 (CovarianceMeaningQuestion) What is the meaning of covariance?

Exercise 16.37 (CovarianceVarianceRelationshipQuestion) What is the relationship between covariance and variance?

Exercise 16.38 (HighCovarianceQuestion) If covariance is high, is there a strong relationship?

Exercise 16.39 (ZeroCovarianceQuestion) What if the covariance is zero?

Exercise 16.40 (NegativeCovarianceQuestion) Can covariance be negative?

Exercise 16.41 (NegativeVarianceQuestion) Can variance be negative?

16.1.3.6 Pearson’s Correlation

Video: [Pearson’s Correlation, Clearly Explained]

Exercise 16.42 (CorrelationValueQuestion) What do you do if the correlation value is 10?

Exercise 16.43 (CorrelationRangeQuestion) What is the possible range of correlation values?

Exercise 16.44 (CorrelationFormulaQuestion) What is the formula for correlation?

16.1.3.7 Boxplots

16.1.4 June, 18th 2024

16.1.4.1 Statistical Power

Exercise 16.45 (UnderstandingStatisticalPower) What is the definition of power in a statistical test?

Exercise 16.46 (DistributionEffectOnPower) What is the implication for power analysis if the samples come from the same distribution?

Exercise 16.47 (IncreasingPower) How can you increase the power if the distributions are very similar?

Exercise 16.48 (PreventingPHacking) What should be done to avoid p-hacking when the distributions are close to each other?

Exercise 16.49 (SampleSizeAndPower) If there is overlap and the sample size is small, will the power be high or low?

16.1.4.2 Power Analysis

Exercise 16.50 (FactorsAffectingPower) Which are the two main factors that affect power?

Exercise 16.51 (PurposeOfPowerAnalysis) What does power analysis tell us?

Exercise 16.52 (ExperimentRisks) What are the two risks faced when performing an experiment?

Exercise 16.53 (PerformingPowerAnalysis) How do you perform a power analysis?

16.1.4.3 The Central Limit Theorem

Exercise 16.54 (CentralLimitTheoremExplanation) What does the Central Limit Theorem state?

16.1.4.4 Boxplots

Exercise 16.55 (MedianInBoxplot) What is represented by the middle line in a boxplot?

Exercise 16.56 (BoxContentInBoxplot) What does the box in a boxplot represent?

16.1.4.5 R-squared

Exercise 16.57 (RSquaredDefinition) What is R-squared? Show the formula.

Exercise 16.58 (NegativeRSquared) Can the R-squared value be negative?

Exercise 16.59 (RSquaredCalculation) Perform a calculation involving R-squared.

16.1.4.6 The main ideas of fitting a line to data (The main ideas of least squares and linear regression.)

Exercise 16.60 (LeastSquaresMeaning) What is the meaning of the least squares method?

16.1.4.7 Linear Regression

16.1.4.8 Multiple Regression

  • Video: Multiple Regression, Clearly Explained

16.1.4.9 A Gentle Introduction to Machine Learning

Exercise 16.61 (RegressionVsClassification) What is the difference between regression and classification?

16.1.4.10 Maximum Likelihood

Exercise 16.62 (LikelihoodConcept) What is the idea of likelihood?

Exercise 16.63 (ProbabilityVsLikelihood) What is the difference between probability and likelihood?

16.1.4.11 Cross-Validation

Exercise 16.64 (TrainVsTestData) What is the difference between training and testing data?

Exercise 16.65 (SingleValidationIssue) What is the problem if you validate the model only once?

Exercise 16.66 (FoldDefinition) What is a fold in cross-validation?

Exercise 16.67 (LeaveOneOutValidation) What is leave-one-out cross-validation?

16.1.4.12 The Confusion Matrix

Exercise 16.68 (DrawingConfusionMatrix) Draw the confusion matrix.

16.1.4.13 Sensitivity and Specificity

Exercise 16.69 (SensitivitySpecificityCalculation1) Calculate the sensitivity and specificity for a given confusion matrix.

Exercise 16.70 (SensitivitySpecificityCalculation2) Calculate the sensitivity and specificity for a given confusion matrix.

16.1.4.14 Bias and Variance

Exercise 16.71 (BiasAndVariance) What are bias and variance?

16.1.4.15 Mutual Information

Exercise 16.72 (MutualInformationExample) Provide an example and calculate if mutual information is high or low.

16.1.5 June, 25th 2024

16.1.5.1 Principal Component Analysis (PCA)

Exercise 16.73 (WhatIsPCA) What is PCA?

Exercise 16.74 (ScreePlotExplanation) What is a scree plot?

Exercise 16.75 (LeastSquaresInPCA) Does PCA use least squares?

Exercise 16.76 (PCASteps) Which steps are performed by PCA?

Exercise 16.77 (EigenvaluePC1) What is the eigenvalue of the first principal component?

Exercise 16.78 (DifferencesBetweenPoints) Are the differences between red and yellow the same as the differences between red and blue points?

Exercise 16.79 (ScalingInPCA) How to scale data in PCA?

Exercise 16.80 (DetermineNumberOfComponents) How to determine the number of principal components?

Exercise 16.81 (LimitingNumberOfComponents) How is the number of principal components limited?

16.1.6 t-SNE

Exercise 16.82 (WhyUseTSNE) Why use t-SNE?

Exercise 16.83 (MainIdeaOfTSNE) What is the main idea of t-SNE?

Exercise 16.84 (BasicConceptOfTSNE) What is the basic concept of t-SNE?

Exercise 16.85 (TSNESteps) What are the steps in t-SNE?

16.1.7 K-means clustering

Exercise 16.86 (HowKMeansWorks) How does K-means clustering work?

Exercise 16.87 (QualityOfClusters) How can the quality of the resulting clusters be calculated?

Exercise 16.88 (IncreasingK) Why is it not a good idea to increase k too much?

16.1.8 DBSCAN

Exercise 16.89 (CorePointInDBSCAN) What is a core point in DBSCAN?

Exercise 16.90 (AddingVsExtending) What is the difference between adding and extending in DBSCAN?

Exercise 16.91 (OutliersInDBSCAN) What are outliers in DBSCAN?

16.1.9 K-nearest neighbors

Exercise 16.92 (AdvantagesAndDisadvantagesOfK) What are the advantages and disadvantages of k = 1 and k = 100 in K-nearest neighbors?

16.1.10 Naive Bayes

Exercise 16.93 (NaiveBayesFormula) What is the formula for Naive Bayes?

Exercise 16.94 (CalculateProbabilities) Calculate the probabilities for a given example using Naive Bayes.

16.1.11 Gaussian Naive Bayes

Exercise 16.95 (UnderflowProblem) Why is underflow a problem in Gaussian Naive Bayes?

16.1.12 July, 2nd 2024

16.1.12.1 Decision and Classification Trees, Clearly Explained

16.1.12.2 StatQuest: Decision Trees, Part 2 - Feature Selection and Missing Data

16.1.12.3 Regression Trees, Clearly Explained!!!

16.1.12.4 How to Prune Regression Trees, Clearly Explained!!!

16.1.12.5 Trees

Exercise 16.96 (Tree Usage) For what can we use trees?

16.1.12.6 Decision Trees

Exercise 16.97 (Tree Usage) Based on a shown tree graph:

  • How can you use this tree?
  • What is the root node?
  • What are branches and internal nodes?
  • What are the leafs?
  • Are the leafs pure or impure?
  • Which of the leafs is more impure?

Exercise 16.98 (Tree Feature Importance) Is the most or least important feature on top?

Exercise 16.99 (Tree Feature Imputation) How can you fill a gap/missing data?

Solution 16.1 (Tree Feature Imputation).

  • Mean
  • Median
  • Comparing to column with high correlation

16.1.12.7 Regression Trees

Exercise 16.100 (Regression Tree Limitations) What are limitations?

Exercise 16.101 (Regression Tree Score) How is the tree score calculated?

Exercise 16.102 (Regression Tree Alpha Value Small) What can we say about the tree if the alpha value is small?

Exercise 16.103 (Regression Tree Increase Alpha Value) What happens if you increase alpha?

Exercise 16.104 (Regression Tree Pruning) What is the meaning of pruning?

16.1.13 Additional Videos

16.2 Introduction to Statistical Learning

Note

Parts of this course are based on the book An Introduction to Statistical Learning, James et al. (2014). Some of the figures in this presentation are taken from An Introduction to Statistical Learning (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

16.2.1 Opening Remarks and Examples

  • Artificial Intelligence (AI)
  • Machine learning (ML)
  • Deep Learning (DL)

AI, ML, and DL. Taken fron Chollet and Allaire (2018)
  • 1980’s neural networks.
  • Statistical learning.
  • IBM Watson supercomputer.

Statistical learning problems include:

  1. Identification of prostate cancer through PSA and other measurements such as age, Gleason score, etc. Scatter plots help reveal the nature of the data and its correlations. Using transformed data (log scale) can highlight typos in the data; for example, a patient with a 449-gram prostate. Recommendation: Always examine the data before conducting any sophisticated analysis.
  2. Classification of phonemes, specifically between “aa” and “ao.”
  3. Prediction of heart attacks, which can be visualized through colored scatter plots.
  4. Detection of email spam, based on the frequency of words within the messages, using 57 features.
  5. Identification of numbers in handwritten zip codes, which involves pattern recognition.
  6. Classification of tissue samples into cancer classes based on gene expression profiles, utilizing heat maps for visualization.
  7. Establishing the relationship between salary and demographic variables like income (wage) versus age, year, and education level, employing regression models.
  8. Classification of pixels in LANDSAT images by their usage, using nearest neighbor methods.

16.2.1.1 Supervised and Unsupervised Learning

Two important types: supervised and unsupervised learning. There is even more, e.g., semi-supervised learning.

16.2.1.1.1 Starting point
  • Outcome measurement \(Y\) (dependent variable, response, target).
  • Vector of \(p\) predictor measurements \(X\) (inputs, regressors, covariates, features, independent variables).
  • Training data \((x_1, y1), \ldots ,(x_N, y_N)\). These are observations (examples, instances) of these measurements.

In the regression problem, \(Y\) is quantitative (e.g., price, blood pressure). In the classification problem, \(Y\) takes values in a finite, unordered set (e.g., survived/died, digit 0-9, cancer class of tissue sample).

16.2.1.1.2 Philosophy

It is important to understand the ideas behind the various techniques, in order to know how and when to use them. One has to understand the simpler methods first, in order to grasp the more sophisticated ones. It is important to accurately assess the performance of a method, to know how well or how badly it is working (simpler methods often perform as well as fancier ones!) This is an exciting research area, having important applications in science, industry and finance. Statistical learning is a fundamental ingredient in the training of a modern data scientist.

16.3 Basics

16.3.1 Histograms

Creating a histogram and calculating the probabilities from a dataset can be approached with scientific precision

  1. Data Collection: Obtain the dataset you wish to analyze. This dataset could represent any quantitative measure, such to examine its distribution.

  2. Decide on the Number of Bins: The number of bins influences the histogram’s granularity. There are several statistical rules to determine an optimal number of bins:

    • Square-root rule: suggests using the square root of the number of data points as the number of bins.
    • Sturges’ formula: \(k = 1 + 3.322 \log_{10}(n)\), where \(n\) is the number of data points and \(k\) is the suggested number of bins.
    • Freedman-Diaconis rule: uses the interquartile range (IQR) and the cube root of the number of data points \(n\) to calculate bin width as \(2 \dfrac{IQR}{n^{1/3}}\).
  3. Determine Range and Bin Width: Calculate the range of data by subtracting the minimum data point value from the maximum. Divide this range by the number of bins to determine the width of each bin.

  4. Allocate Data Points to Bins: Iterate through the data, sorting each data point into the appropriate bin based on its value.

  5. Draw the Histogram: Use a histogram to visualize the frequency or relative frequency (probability) of data points within each bin.

  6. Calculate Probabilities: The relative frequency of data within each bin represents the probability of a randomly selected data point falling within that bin’s range.

Below is a Python script that demonstrates how to generate a histogram and compute probabilities using the matplotlib library for visualization and numpy for data manipulation.

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# Sample data: Randomly generated for demonstration
data = np.random.normal(0, 1, 1000)  # 1000 data points with a normal distribution

# Step 2: Decide on the number of bins
num_bins = int(np.ceil(1 + 3.322 * np.log10(len(data))))  # Sturges' formula

# Step 3: Determine range and bin width -- handled internally by matplotlib

# Steps 4 & 5: Sort data into bins and draw the histogram
fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=num_bins, density=True, alpha=0.75, edgecolor='black')

# Calculate probabilities (relative frequencies) manually, if needed
bin_width = np.diff(bins)  # np.diff finds the difference between adjacent bin boundaries
probabilities = n * bin_width  # n is already normalized to form a probability density if `density=True`

# Adding labels and title for clarity
ax.set_xlabel('Data Value')
ax.set_ylabel('Probability Density')
ax.set_title('Histogram with Probability Density')
Text(0.5, 1.0, 'Histogram with Probability Density')
(a) Histogram with Probability Density
(b)
Figure 16.1
for i, prob in enumerate(probabilities):
    print(f"Bin {i+1} Probability: {prob:.4f}")

# Ensure probabilities sum to 1 (or very close, due to floating-point arithmetic)
print(f"Sum of probabilities: {np.sum(probabilities)}")
Bin 1 Probability: 0.0050
Bin 2 Probability: 0.0250
Bin 3 Probability: 0.0690
Bin 4 Probability: 0.1400
Bin 5 Probability: 0.2330
Bin 6 Probability: 0.2270
Bin 7 Probability: 0.1780
Bin 8 Probability: 0.0850
Bin 9 Probability: 0.0320
Bin 10 Probability: 0.0040
Bin 11 Probability: 0.0020
Sum of probabilities: 1.0

This code segment goes through the necessary steps to generate a histogram and calculate probabilities for a synthetic dataset. It demonstrates important scientific and computational practices including binning, visualization, and probability calculation in Python.

Key Points: - The histogram represents the distribution of data, with the histogram’s bins outlining the data’s spread and density. - The option density=True in ax.hist() normalizes the histogram so that the total area under the histogram sums to 1, thereby converting frequencies to probability densities. - The choice of bin number and width has a significant influence on the histogram’s shape and the insights that can be drawn from it, highlighting the importance of selecting appropriate binning strategies based on the dataset’s characteristics and the analysis objectives.

16.3.2 Probability Distributions

What happens when we use smaller bins in a histogram? The histogram becomes more detailed, revealing the distribution of data points with greater precision. However, as the bin size decreases, the number of data points within each bin may decrease, leading to sparse or empty bins. This sparsity can make it challenging to estimate probabilities accurately, especially for data points that fall within these empty bins.

Advantages, when using a probability distribution, include:

  • Blanks can be filled
  • Probabilities can be calculated
  • Parameters are sufficiemnt to describe the distribution, e.g., mean and variance for the normal distribution

Probability distributions offer a powerful solution to the challenges posed by limited data in estimating probabilities. When data is scarce, constructing a histogram to determine the probability of certain outcomes can lead to inaccurate or unreliable results due to the lack of detail in the dataset. However, collecting vast amounts of data to populate a histogram for more precise estimates can often be impractical, time-consuming, and expensive.

A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes for an experiment. It is a more efficient approach to understanding the likelihood of various outcomes than relying solely on extensive data collection. For continuous data, this is often represented graphically by a smooth curve.

16.3.2.1 The Normal Distribution: A Common Example

A commonly encountered probability distribution is the normal distribution, known for its characteristic bell-shaped curve. This curve represents how the values of a variable are distributed: most of the observations cluster around the mean (or center) of the distribution, with frequencies gradually decreasing as values move away from the mean.

The normal distribution is particularly useful because of its defined mathematical properties. It is determined entirely by its mean (mu, \(\mu\)) and its standard deviation (sigma, \(\sigma\)). The area under the curve represents probability, making it possible to calculate the likelihood of a random variable falling within a specific range.

16.3.2.2 Practical Example: Estimating Probabilities

Consider we are interested in the heights of adults in a population. Instead of measuring the height of every adult (which would be impractical), we can use the normal distribution to estimate the probability of adults’ heights falling within certain intervals, assuming we know the mean and standard deviation of the heights.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mu = 170  # e.g., mu height of adults in cm
sd = 10  # e.g., standard deviation of heights in cm
heights = np.linspace(mu - 3*sd, mu + 3*sd, 1000)
# Calculate the probability density function for the normal distribution
pdf = norm.pdf(heights, mu, sd)
# Plot the normal distribution curve
plt.plot(heights, pdf, color='blue', linewidth=2)
plt.fill_between(heights, pdf, where=(heights >= mu - 2 * sd) & (heights <= mu + 2*sd), color='grey', alpha=0.5)
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.show()
Figure 16.2: Normal Distribution Curve with Highlighted Probability Area. 95 percent of the data falls within two standard deviations of the mean.

This Python code snippet generates a plot of the normal distribution for adult heights, with a mean of 170 cm and a standard deviation of 10 cm. It visually approximates a histogram with a blue bell-shaped curve, and highlights (in grey) the area under the curve between \(\mu \pm 2 \times \sigma\). This area corresponds to the probability of randomly selecting an individual whose height falls within this range.

By using the area under the curve, we can efficiently estimate probabilities without needing to collect and analyze a vast amount of data. This method not only saves time and resources but also provides a clear and intuitive way to understand and communicate statistical probabilities.

16.3.3 Discrete Distributions

Discrete probability distributions are essential tools in statistics, providing a mathematical foundation to model and analyze situations with discrete outcomes. Histograms, which can be seen as discrete distributions with data organized into bins, offer a way to visualize and estimate probabilities based on the collected data. However, they come with limitations, especially when data is scarce or when we encounter gaps in the data (blank spaces in histograms). These gaps can make it challenging to accurately estimate probabilities.

A more efficient approach, especially for discrete data, is to use mathematical equations—particularly those defining discrete probability distributions—to calculate probabilities directly, thus bypassing the intricacies of data collection and histogram interpretation.

16.3.3.1 Bernoulli Distribution

The Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is a discrete probability distribution, which takes value \(1\) with success probability \(p\) and value \(0\) with failure probability \(q = 1-p\). So if \(X\) is a random variable with this distribution, we have: \[ P(X=1) = 1-P(X=0) = p = 1-q. \]

16.3.3.2 Binomial Distribution

The Binomial Distribution is a prime example of a discrete probability distribution that is particularly useful for binary outcomes (e.g., success/failure, yes/no, pumpkin pie/blueberry pie). It leverages simple mathematical principles to calculate the probability of observing a specific number of successes (preferred outcomes) in a fixed number of trials, given the probability of success in each trial.

16.3.3.3 An Illustrative Example: Pie Preference

Consider a scenario from “StatLand” where 70% of people prefer pumpkin pie over blueberry pie. The question is: What is the probability that, out of three people asked, the first two prefer pumpkin pie and the third prefers blueberry pie?

Using the concept of the Binomial Distribution, the probability of such an outcome can be calculated without the need to layout every possible combination by hand. This process not only simplifies calculations but also provides a clear and precise method to determine probabilities in scenarios involving discrete choices. We will use Python to calculate the probability of observing exactly two out of three people prefer pumpkin pie, given the 70% preference rate:

from scipy.stats import binom
n = 3  # Number of trials (people asked)
p = 0.7  # Probability of success (preferring pumpkin pie)
x = 2  # Number of successes (people preferring pumpkin pie)
# Probability calculation using Binomial Distribution
prob = binom.pmf(x, n, p)
print(f"The probability that exactly 2 out of 3 people prefer pumpkin pie is: {prob:.3f}")
The probability that exactly 2 out of 3 people prefer pumpkin pie is: 0.441

This code uses the binom.pmf() function from scipy.stats to calculate the probability mass function (PMF) of observing exactly x successes in n trials, where each trial has a success probability of p.

A Binomial random variable is the sum of \(n\) independent, identically distributed Bernoulli random variables, each with probability \(p\) of success. We may indicate a random variable \(X\) with Bernoulli distribution using the notation \(X \sim \mathrm{Bi}(1,\theta)\). Then, the notation for the Binomial is \(X \sim \mathrm{Bi}(n,\theta)\). Its probability and distribution functions are, respectively, \[ p_X(x) = {n\choose x}\theta^x(1-\theta)^{n-x}, \qquad F_X(x) = \Pr\{X \le x\} = \sum_{i=0}^{x} {n\choose i}\theta^i(1-\theta)^{n-i}. \]

The mean of the binomial distribution is \(\text{E}[X] = n\theta\). The variance of the distribution is \(\text{Var}[X] = n\theta(1-\theta)\) (see next section).

A process consists of a sequence of \(n\) independent trials, i.e., the outcome of each trial does not depend on the outcome of previous trials. The outcome of each trial is either a success or a failure. The probability of success is denoted as \(p\), and \(p\) is constant for each trial. Coin tossing is a classical example for this setting.

The binomial distribution is a statistical distribution giving the probability of obtaining a specified number of successes in a binomial experiment; written Binomial(n, p), where \(n\) is the number of trials, and \(p\) the probability of success in each.

Definition 16.1 (Binomial Distribution) The binomial distribution with parameters \(n\) and \(p\), where \(n\) is the number of trials, and \(p\) the probability of success in each, is \[\begin{equation} p(x) = { n \choose k } p^x(1-p)^{n-x} \qquad x = 0,1, \ldots, n. \end{equation}\] The mean \(\mu\) and the variance \(\sigma^2\) of the binomial distribution are \[\begin{equation} \mu = np \end{equation}\] and \[\begin{equation} \sigma^2 = np(1-p). \end{equation}\]

Note, the Bernoulli distribution is simply Binomial(1,p).

16.4 Continuous Distributions

Our considerations regarding probability distributions, expectations, and standard deviations will be extended from discrete distributions to continuous distributions. One simple example of a continuous distribution is the uniform distribution. Continuous distributions are defined by probability density functions.

16.4.1 Distribution functions: PDFs and CDFs

The density for a continuous distribution is a measure of the relative probability of “getting a value close to \(x\).” Probability density functions \(f\) and cumulative distribution function \(F\) are related as follows. \[\begin{equation} f(x) = \frac{d}{dx} F(x) \end{equation}\]

16.4.2 Expectation (Continuous)

Definition 16.2 (Expectation (Continuous)) \[\begin{equation} \text{E}(X) = \int_{-\infty}^\infty x f(x) \, dx \end{equation}\]

16.4.3 Variance and Standard Deviation (Continuous)

Definition 16.3 (Variance (Continuous)) Variance can be calculated with \(\text{E}(X)\) and \[\begin{equation} \text{E}(X^2) = \int_{-\infty}^\infty x^2 f(x) \, dx \end{equation}\] as \[\begin{equation*} \text{Var}(X) = \text{E}(X^2) - [ E(X)]^2. \end{equation*}\] \(\Box\)

Definition 16.4 (Standard Deviation (Continuous)) Standard deviation can be calculated as \[\begin{equation*} \text{sd}(X) = \sqrt{\text{Var}(X)}. \end{equation*}\] \(\Box\)

16.4.4 Uniform Distribution

This variable is defined in the interval \([a,b]\). We write it as \(X \sim U[a,b]\). Its density and cumulative distribution functions are, respectively, \[ f_X(x) = \frac{I_{[a,b]}(x)}{b-a}, \quad\quad F_X(x) = \frac{1}{b-a}\int\limits_{-\infty}\limits^x I_{[a,b]}(t) \mathrm{d}t = \frac{x-a}{b-a}, \] where \(I_{[a,b]}(\cdot)\) is the indicator function of the interval \([a,b]\). Note that, if we set \(a=0\) and \(b=1\), we obtain \(F_X(x) = x\), \(x\) \(\in\) \([0,1]\).

A typical example is the following: the cdf of a continuous r.v. is uniformly distributed in \([0,1]\). The proof of this statement is as follows: For \(u\) \(\in\) \([0,1]\), we have \[\begin{eqnarray*} \Pr\{F_X(X) \leq u\} &=& \Pr\{F_X^{-1}(F_X(X)) \leq F_X^{-1}(u)\} = \Pr\{X \leq F_X^{-1}(u)\} \\ &=& F_X(F_X^{-1}(u)) = u. \end{eqnarray*}\] This means that, when \(X\) is continuous, there is a one-to-one relationship (given by the cdf) between \(x\) \(\in\) \(D_X\) and \(u\) \(\in\) \([0,1]\).

The has a constant density over a specified interval, say \([a,b]\). The uniform \(U(a,b)\) distribution has density \[\begin{equation} f(x) = \left\{ \begin{array}{ll} 1/(b-a) & \textrm{ if } a < x < b,\\ 0 & \textrm{ otherwise} \end{array} \right. \end{equation}\]

16.4.5 Normal Distribution

Definition 16.5 (Normal Distribution) This variable is defined on the support \(D_X = \mathbb{R}\) and its density function is given by \[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left \{-\frac{1}{2\sigma^2}(x-\mu)^2 \right \}. \] The density function is identified by the pair of parameters \((\mu,\sigma^2)\), where \(\mu\) \(\in\) \(\mathbb{R}\) is the mean (or location parameter) and \(\sigma^2 > 0\) is the variance (or dispersion parameter) of \(X\). \(\Box\)

The density function is symmetric around \(\mu\). The normal distribution belongs to the location-scale family distributions. This means that, if \(Z \sim N(0,1)\) (read, \(Z\) has a standard normal distribution; i.e., with \(\mu=0\) and \(\sigma^2=1\)), and we consider the linear transformation \(X = \mu + \sigma Z\), then \(X \sim N(\mu,\sigma^2)\) (read, \(X\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\)). This means that one can obtain the probability of any interval \((-\infty,x]\), \(x\) \(\in\) \(R\) for any normal distribution (i.e., for any pair of the parameters \(\mu\) and \(\sigma\)) once the quantiles of the standard normal distribution are known. Indeed \[\begin{eqnarray*} F_X(x) &=& \Pr\left\{X \leq x \right\} = \Pr\left\{\frac{X-\mu}{\sigma} \leq \frac{x-\mu}{\sigma} \right\} \\ &=& \Pr\left\{Z \leq \frac{x-\mu}{\sigma}\right\} = F_Z\left(\frac{x-\mu}{\sigma}\right) \qquad x \in \mathbb{R}. \end{eqnarray*}\] The quantiles of the standard normal distribution are available in any statistical program. The density and cumulative distribution function of the standard normal r.v.~at point \(x\) are usually denoted by the symbols \(\phi(x)\) and \(\Phi(x)\).

The standard normal distribution is based on the \[ \varphi(z) = \frac{1}{\sqrt{2\pi}} \exp \left(- \frac{z^2}{2} \right). \tag{16.1}\]

An important application of the standardization introduced in Equation 16.1 reads as follows. In case the distribution of \(X\) is approximately normal, the distribution of X^{*} is approximately standard normal. That is \[\begin{equation*} P(X\leq b) = P( \frac{X-\mu}{\sigma} \leq \frac{b-\mu}{\sigma}) = P(X^{*} \leq \frac{b-\mu}{\sigma}) \end{equation*}\] The probability \(P(X\leq b)\) can be approximated by \(\Phi(\frac{b-\mu}{\sigma})\), where \(\Phi\) is the standard normal cumulative distribution function.

If \(X\) is a normal random variable with mean \(\mu\) and variance \(\sigma^2\), i.e., \(X \sim \cal{N} (\mu, \sigma^2)\), then \[\begin{equation} X = \mu + \sigma Z \textrm{ where } Z \sim \cal{N}(0,1). \end{equation}\]

If \(Z \sim \cal{N}(0,1)\) and \(X\sim \cal{N}(\mu, \sigma^2)\), then \[\begin{equation*} X = \mu + \sigma Z. \end{equation*}\]

The probability of getting a value in a particular interval is the area under the corresponding part of the curve. Consider the density function of the normal distribution. It can be plotted using the following commands. The result is shown in Figure 16.3.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x = np.arange(-4, 4, 0.1)
# Calculating the normal distribution's density function values for each point in x
y = norm.pdf(x, 0, 1)
plt.plot(x, y, linestyle='-', linewidth=2)
plt.title('Normal Distribution')
plt.xlabel('X')
plt.ylabel('Density')
plt.grid(True)
plt.show()
Figure 16.3: Normal Distribution Density Function

The (CDF) describes the probability of “hitting” \(x\) or less in a given distribution. We consider the CDF function of the normal distribution. It can be plotted using the following commands. The result is shown in Figure 16.4.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generating a sequence of numbers from -4 to 4 with 0.1 intervals
x = np.arange(-4, 4, 0.1)

# Calculating the cumulative distribution function value of the normal distribution for each point in x
y = norm.cdf(x, 0, 1)  # mean=0, stddev=1

# Plotting the results. The equivalent of 'type="l"' in R (line plot) becomes the default plot type in matplotlib.
plt.plot(x, y, linestyle='-', linewidth=2)
plt.title('Normal Distribution CDF')
plt.xlabel('X')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.show()
Figure 16.4: Normal Distribution Cumulative Distribution Function

16.4.6 The Mean, the Median, and the Mode

16.4.7 The Exponential Distribution

The exponential distribution is a continuous probability distribution that describes the time between events in a Poisson process, where events occur continuously and independently at a constant average rate. It is characterized by a single parameter, the rate parameter \(\lambda\), which represents the average number of events per unit time.

16.4.8 Population and Estimated Parameters

16.4.9 Calculating the Mean, Variance, and Standard Deviation

16.4.10 What is a Mathematical Model?

16.4.11 Sampling from a Distribution

16.4.12 Hypothesis Testing and the Null Hypothesis

16.4.13 Alternative Hypotheses

16.4.14 p-values: What They Are and How to Interpret Them

16.4.15 How to Calculate p-values

16.4.16 p-hacking: What It Is and How to Avoid It

16.4.17 Covariance

16.4.18 Pearson’s Correlation

16.4.19 Boxplots

16.4.20 R-squared

16.4.21 The Main Ideas of Fitting a Line to Data

16.4.22 Linear Regression

16.4.23 Multiple Regression

16.5 Supervised Learning

Objectives of supervised learning: On the basis of the training data we would like to:

  • Accurately predict unseen test cases.
  • Understand which inputs affect the outcome, and how.
  • Assess the quality of our predictions and inferences.

Note: Supervised means \(Y\) is known.

Exercise 16.105  

  • Do children learn supervised?
  • When do you learn supervised?
  • Can learning be unsupervised?
16.5.0.0.1 Unsupervised Learning

No outcome variable, just a set of predictors (features) measured on a set of samples. The objective is more fuzzy—find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation. It is difficult to know how well your are doing. Unsupervised learning different from supervised learning, but can be useful as a pre-processing step for supervised learning. Clustering and principle component analysis are important techniques.

Unsupervised: \(Y\) is unknown, there is no \(Y\), no trainer, no teacher, but: distances between the inputs values (features). A distance (or similarity) measure is necessary.

16.5.0.0.2 Statistical Learning

We consider supervised learning first.

Figure 16.5: Sales as a function of TV, radio and newspaper. Taken from James et al. (2014)

Sales figures from a marketing campaign, see Figure 16.5. Trend shown using regression. First seems to be stronger than the third.

Can we predict \(Y\) = Sales using these three? Perhaps we can do better using a model \[ Y = Sales \approx f(X_1 = TV, X_2 = Radio, X_3= Newspaper) \] modeling the joint relationsship.

Here Sales is a response or target that we wish to predict. We generically refer to the response as \(Y\). TV is a feature, or input, or predictor; we name it \(X_1\). Likewise name Radio as \(X_2\), and so on. We can refer to the input vector collectively as \[ X = \begin{pmatrix} X_1\\ X_2\\ X_3 \end{pmatrix} \]

Now we write our model as \[ Y = f(X) + \epsilon \] where \(\epsilon\) captures measurement errors and other discrepancies.

What is \(f\) good for? With a good \(f\) we can make predictions of \(Y\) at new points \(X = x\). We can understand which components of \(X = (X_1, X_2, \ldots X_p)\) are important in explaining \(Y\), and which are irrelevant.

For example, Seniority and Years of Education have a big impact on Income, but Marital Status typically does not. Depending on the complexity of \(f\), we may be able to understand how each component \(X_j\) of \(X\) affects \(Y\).

16.5.1 Statistical Learning and Regression

16.5.1.1 Regression Function

Figure 16.6: Scatter plot of 2000 points (population). What is a good function \(f\)? There are many function values at \(X=4\). A function can return only one value. We can take the mean from these values as a return value. Taken from James et al. (2014)

Consider Figure 16.6. Is there an ideal \(f(X)\)? In particular, what is a good value for \(f(X)\) at any selected value of \(X\), say \(X = 4\)? There can be many \(Y\) values at \(X=4\). A good value is \[ f(4) = E(Y |X = 4). \]

\(E(Y |X = 4)\) means expected value (average) of \(Y\) given \(X = 4\).

The ideal \(f(x) = E(Y |X = x)\) is called the regression function. Read: The regression function gives the conditional expectation of \(Y\) given \(X\).

The regression function \(f(x)\) is also defined for the vector \(X\); e.g., \(f(x) = f(x_1, x_2, x_3) = E(Y | X_1 =x_1, X_2 =x_2, X_3 =x_3).\)

16.5.2 Optimal Predictor

The regression function is the ideal or optimal predictor of \(Y\) with regard to mean-squared prediction error: It means that \(f(x) = E(Y | X = x)\) is the function that minimizes \[ E[(Y - g(X))^2|X = x] \] over all functions \(g\) at all points \(X = x\).

16.5.2.1 Residuals, Reducible and Irreducible Error

At each point \(X\) we make mistakes: \[ \epsilon = Y-f(x) \] is the residual. Even if we knew \(f(x)\), we would still make errors in prediction, since at each \(X=x\) there is typically a distribution of possible \(Y\) values as is illustrated in Figure 16.6.

For any estimate \(\hat{f}(x)\) of \(f(x)\), we have \[ E\left[ ( Y - \hat{f}(X))^2 | X = x\right] = \left[ f(x) - \hat{f}(x) \right]^2 + \text{var}(\epsilon), \] and \(\left[ f(x) - \hat{f}(x) \right]^2\) is the reducible error, because it depends on the model (changing the model \(f\) might reduce this error), and \(\text{var}(\epsilon)\) is the irreducible error.

16.5.2.2 Local Regression (Smoothing)

Typically we have few if any data points with \(X = 4\) exactly. So we cannot compute \(E(Y |X = x)\)! Idea: Relax the definition and let \[ \hat{f}(x)= Ave(Y|X \in \cal{N}(x)), \] where \(\cal{N} (x)\) is some neighborhood of \(x\), see Figure 16.7.

Figure 16.7: Relaxing the definition. There is no \(Y\) value at \(X=4\). Taken from James et al. (2014)

Nearest neighbor averaging can be pretty good for small \(p\), i.e., \(p \leq 4\) and large-ish \(N\). We will discuss smoother versions, such as kernel and spline smoothing later in the course.

16.5.3 Curse of Dimensionality and Parametric Models

Figure 16.8: A 10% neighborhood in high dimensions need no longer be local. Left: Values of two variables \(x_1\) and \(x_2\), uniformly distributed. Form two 10% neighborhoods: (a) the first is just involving \(x_1\) ignoring \(x_2\). (b) is the neighborhood in two dimension. Notice that the radius of the circle is much larger than the lenght of the interval in one dimension. Right: radius plotted against fraction of the volume. In 10 dim, you have to break out the interval \([-1;+1]\) to get 10% of the data. Taken from James et al. (2014)

Local, e.g., nearest neighbor, methods can be lousy when \(p\) is large. Reason: the curse of dimensionality, i.e., nearest neighbors tend to be far away in high dimensions. We need to get a reasonable fraction of the \(N\) values of \(y_i\) to average to bring the variance down—e.g., 10%. A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating \(E(Y |X = x)\) by local averaging, see Figure 16.8. If the curse of dimensionality does not exist, nearest neighbor models would be perfect prediction models.

We will use structured (parametric) models to deal with the curse of dimensionality. The linear model is an important example of a parametric model: \[ f_L(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p. \] A linear model is specified in terms of \(p + 1\) parameters $ _1, _2, , _p$. We estimate the parameters by fitting the model to . Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function \(f(X)\).

The linear model is avoiding the curse of dimensionality, because it is not relying on any local properties. Linear models belong to the class of approaches: they replace the problem of estimating \(f\) with estimating a fixed set of coefficients \(\beta_i\), with \(i=1,2, \ldots, p\).

Figure 16.9: A linear model \(\hat{f}_L\) gives a reasonable fit. Taken from James et al. (2014)
Figure 16.10: A quadratic model \(\hat{f}_Q\) fits slightly better. Taken from James et al. (2014)

A linear model \[ \hat{f}_L(X) = \hat{\beta}_0 + \hat{\beta}_1 X \] gives a reasonable fit, see Figure 16.9. A quadratic model \[ \hat{f}_Q(X) = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 X^2 \] gives a slightly improved fit, see Figure 16.10.

Figure 16.11 shows a simulated example. Red points are simulated values for income from the model \[ income = f(education, seniority) + \epsilon \] \(f\) is the blue surface.

Figure 16.11: The true model. Red points are simulated values for income from the model, \(f\) is the blue surface. Taken from James et al. (2014)
Figure 16.12: Linear regression fit to the simulated data (red points). Taken from James et al. (2014)

The linear regression model \[ \hat{f}(education, seniority) = \hat{\beta}_0 + \hat{\beta}_1 \times education + \hat{\beta}_2 \times seniority \] captures the important information. But it does not capture everything. More flexible regression model \[ \hat{f}_S (education, seniority) \] fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. Even more flexible spline regression model \[ \hat{f}_S (education, seniority) \] fit to the simulated data. Here the fitted model makes no errors on the training data! Also known as overfitting.

Figure 16.13: Thin-plate spline models \(\hat{f}_S (education, seniority)\) fitted to the model from Figure 16.11. Taken from James et al. (2014)
Figure 16.14: Thin-plate spline models \(\hat{f}_S (education, seniority)\) fitted to the model from Figure 16.11. The model makes no errors on the training data (overfitting). Taken from James et al. (2014)

16.5.3.1 Trade-offs

  • Prediction accuracy versus interpretability: Linear models are easy to interpret; thin-plate splines are not.
  • Good fit versus over-fit or under-fit: How do we know when the fit is just right?
  • Parsimony (Occam’s razor) versus black-box: We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

The trad-offs are visualized in Figure 16.15.

Figure 16.15: Interpretability versus flexibility. Flexibility corresponds with the number of model parameters. Taken from James et al. (2014)

16.5.4 Assessing Model Accuracy and Bias-Variance Trade-off

Figure 16.16: Black curve is truth. Red curve on right is \(MSETe\), grey curve is \(MSETr\). Orange, blue and green curves/squares correspond to fits of different flexibility. The dotted line represents the irreducible error, i.e., \(var(\epsilon)\). Taken from James et al. (2014)
Figure 16.17: Here, the truth is smoother. Black curve is truth. Red curve on right is \(MSETe\), grey curve is \(MSETr\). Orange, blue and green curves/squares correspond to fits of different flexibility. The dotted line represents the irreducible error, i.e., \(var(\epsilon)\). Taken from James et al. (2014)
Figure 16.18: Here the truth is wiggly and the noise is low, so the more flexible fits do the best. Black curve is truth. Red curve on right is \(MSETe\), grey curve is \(MSETr\). Orange, blue and green curves/squares correspond to fits of different flexibility. The dotted line represents the irreducible error, i.e., \(var(\epsilon)\). Taken from James et al. (2014)

Suppose we fit a model \(f(x)\) to some training data \(Tr = \{x_i, y_i \}^N_1\), and we wish to see how well it performs. We could compute the average squared prediction error over \(Tr\): \[ MSE_{Tr} = Ave_{i \in Tr}[y_i - \hat{f}(x_i)]^2. \] This may be biased toward more overfit models. Instead we should, if possible, compute it using fresh test data \(Te== \{x_i, y_i \}^N_1\): \[ MSE_{Te} = Ave_{i \in Te}[y_i - \hat{f}(x_i)]^2. \] The red curve, which illustrated the test error, can be estimated by holding out some data to get the test-data set.

16.5.4.1 Bias-Variance Trade-off

Suppose we have fit a model \(f(x)\) to some training data \(Tr\), and let \((x_0, y_0)\) be a test observation drawn from the population. If the true model is \[ Y = f(X) + \epsilon \qquad \text{ with } f(x) = E(Y|X=x), \] then \[ E \left( y_0 - \hat{f}(x_0) \right)^2 = \text{var} (\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + \text{var}(\epsilon). \tag{16.2}\]

Here, \(\text{var}(\epsilon)\) is the irreducible error. The reducible error consists of two components:

  • \(\text{var} (\hat{f}(x_0))\) is the variance that comes from different training sets. Different training sets result in different functions \(\hat{f}\).
  • \(Bias(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)\).

The expectation averages over the variability of \(y_0\) as well as the variability in \(Tr\). Note that \[ Bias(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0). \] Typically as the flexibility of \(\hat{f}\) increases, its variance increases (because the fits differ from training set to trainig set), and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off, see Figure 16.19.

Figure 16.19: Bias-variance trade-off for the three examples. Taken from James et al. (2014)

If we add the two components (reducible and irreducible error), we get the MSE in Figure 16.19 as can be seen in Equation 16.2.

16.5.5 Classification Problems and K-Nearest Neighbors

In classification we have a qualitative response variable.

Figure 16.20: Classification. Taken from James et al. (2014)

Here the response variable \(Y\) is qualitative, e.g., email is one of \(\cal{C} = (spam, ham)\), where ham is good email, digit class is one of \(\cal{C} = \{ 0, 1, \ldots, 9 \}\). Our goals are to:

  • Build a classifier \(C(X)\) that assigns a class label from \(\cal{C}\) to a future unlabeled observation \(X\).
  • Assess the uncertainty in each classification
  • Understand the roles of the different predictors among \(X = (X_1,X_2, \ldots, X_p)\).

Simulation example depicted in@fig-0218a. \(Y\) takes two values, zero and one, and \(X\) has only one value. Big sample: each single vertical bar indicates an occurrance of a zero (orange) or one (blue) as a function of the \(X\)s. Black curve generated the data: it is the probability of generating a one. For high values of \(X\), the probability of ones is increasing. What is an ideal classifier \(C(X)\)?

Suppose the \(K\) elements in \(\cal{C}\) are numbered \(1,2,\ldots, K\). Let \[ p_k(x) = Pr(Y = k|X = x), k = 1,2,\ldots,K. \]

These are the conditional class probabilities at \(x\); e.g. see little barplot at \(x = 5\). Then the Bayes optimal classifier at \(x\) is \[ C(x) = j \qquad \text{ if } p_j(x) = \max \{p_1(x),p_2(x),\ldots, p_K(x)\}. \] At \(x=5\) there is an 80% probability of one, and an 20% probability of a zero. So, we classify this point to the class with the highest probability, the majority class.

Nearest-neighbor averaging can be used as before. This is illustrated in Fig.~\(\ref{fig:0219a}\). Here, we consider 100 points only. Nearest-neighbor averaging also breaks down as dimension grows. However, the impact on \(\hat{C}(x)\) is less than on \(\hat{p}_k (x)\), \(k = 1, \ldots, K\).

Figure 16.21: Classification. Taken from James et al. (2014)

16.5.5.1 Classification: Some Details

Average number of errors made to measure the performance. Typically we measure the performance of \(\hat{C}(x)\) using the misclassification error rate: \[ Err_{Te} = Ave_{i\in Te} I[y_i \neq \hat{C} (x_i) ]. \] The Bayes classifier (using the true \(p_k(x)\)) has smallest error (in the population).

16.5.6 k-Nearest Neighbor Classification

Consider k-nearest neighbors in two dimensions. Orange and blue dots label the true class memberships of the underlying points in the 2-dim plane. Dotted line is the decision boundary, that is the contour with equal probability for both classes.

Nearest-neighbor averaging in 2-dim. At any given point we want to classify, we spread out a little neighborhood, say \(K=10\) points from the neighborhood and calulated the percentage of blue and orange. We assign the color with the highest probability to this point. If this is done for every point in the plane, we obtain the solid black curve as the esitmated decsion boundary.

We can use \(K=1\). This is the nearest-neighbor classifier. The decision boundary is piecewise linear. Islands occur. Approximation is rather noisy.

\(K=100\) leads to a smooth decision boundary. But gets uninteresting.

Figure 16.22: K-nearest neighbors in two dimensions. Taken from James et al. (2014)
Figure 16.23: K-nearest neighbors in two dimensions. Taken from James et al. (2014)
Figure 16.24: K-nearest neighbors in two dimensions. Taken from James et al. (2014)

\(K\) large means higher bias, so \(1/K\) is chosen, because we go from low to high complexity on the \(x\)-error, see Figure 16.25. Horizontal dotted line is the base error.

Figure 16.25: K-nearest neighbors classification error. Taken from James et al. (2014)

16.5.7 Minkowski Distance

The Minkowski distance of order \(p\) (where \(p\) is an integer) between two points \(X=(x_1,x_2,\ldots,x_n)\text{ and }Y=(y_1,y_2,\ldots,y_n) \in \mathbb{R}^n\) is defined as: \[ D \left( X,Y \right) = \left( \sum_{i=1}^n |x_i-y_i|^p \right)^\frac{1}{p}. \]

16.5.8 Unsuperivsed Learning: Classification

16.5.8.1 k-Means Algorithm

The \(k\)-means algorithm is an unsupervised learning algorithm that has a loose relationship to the \(k\)-nearest neighbor classifier. The \(k\)-means algorithm works as follows:

  • Step 1: Randomly choose \(k\) centers. Assign points to cluster.
  • Step 2: Determine the distances of each data point to the centroids and re-assign each point to the closest cluster centroid based upon minimum distance
  • Step 3: Calculate cluster centroids again
  • Step 4: Repeat steps 2 and 3 until we reach global optima where no improvements are possible and no switching of data points from one cluster to other.

The basic principle of the \(k\)-means algorithm is illustrated in Figure 16.26, Figure 16.27, Figure 16.28, and Figure 16.29.

Figure 16.26: k-means algorithm. Step 1. Randomly choose \(k\) centers. Assign points to cluster. \(k\) initial means(in this case \(k=3\)) are randomly generated within the data domain (shown in color). Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons
Figure 16.27: k-means algorithm. Step 2. \(k\) clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means. Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons
Figure 16.28: k-means algorithm. Step 3. The centroid of each of the \(k\) clusters becomes the new mean. Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons
Figure 16.29: k-means algorithm. Step 4. Steps 2 and 3 are repeated until convergence has been reached. Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons