[28]{.chapter-number}  [Clustering]{.chapter-title}

doi:10.48550/arXiv.2307.10262

28.1 DBSCAN

Video: Clustering with DBSCAN, Clearly Explained!!!

28.2 k-Means Clustering

The \(k\)-means algorithm is an unsupervised learning algorithm that is only loosely related to the \(k\)-nearest neighbor classifier. It works as follows:

Step 1: Randomly choose \(k\) initial cluster centers (centroids) and assign the points to clusters.
Step 2: Compute the distance from each data point to every centroid and re-assign each point to the cluster with the closest centroid.
Step 3: Recompute the centroid of each cluster as the mean of the points assigned to it.
Step 4: Repeat steps 2 and 3 until convergence, that is, until no data point switches from one cluster to another and no further improvement is possible. The result is in general a local optimum, not necessarily the global one.
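The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `k_means` and the parameters `max_iter` and `seed` are hypothetical, and the initial centers are drawn from the data points themselves, which is one common choice and an assumption here (the figures instead generate the initial means randomly within the data domain).

```python
import numpy as np


def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch (hypothetical helper, for illustration only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Step 1: choose k initial centers.
    # Assumption: pick k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Step 2: distance of every point to every center,
        # then assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)

        # Step 4: stop as soon as no point switches clusters.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)

    return centers, labels


# Toy data (made up for this sketch): two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = k_means(X, k=2)
print(centers)
print(labels)
```

Because the result depends on the random initialization, different values of `seed` can lead to different local optima; in practice the algorithm is often run several times and the best clustering is kept.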

The basic principle of the \(k\)-means algorithm is illustrated in Figure 28.1, Figure 28.2, Figure 28.3, and Figure 28.4.

Figure 28.1: k-means algorithm. Step 1. Randomly choose \(k\) centers. Assign points to cluster. \(k\) initial means (in this case \(k=3\)) are randomly generated within the data domain (shown in color). Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons

Figure 28.2: k-means algorithm. Step 2. \(k\) clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means. Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons

Figure 28.3: k-means algorithm. Step 3. The centroid of each of the \(k\) clusters becomes the new mean. Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons

Figure 28.4: k-means algorithm. Step 4. Steps 2 and 3 are repeated until convergence has been reached. Attribution: I, Weston.pace, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons

Video: K-means clustering

28.3 DDMO-Additional Videos

28.4 DDMO-Exercises

Exercise 28.1 (Smaller Bins) What happens when we use smaller bins in a histogram?

Exercise 28.2 (Density Curve) Why plot a curve to approximate a histogram?

Exercise 28.3 (TwoSDQuestion) What percentage of the samples lies within plus/minus two SD of the mean?

Exercise 28.4 (OneSDQuestion) What percentage of the samples lies within plus/minus one SD of the mean?

Exercise 28.5 (ThreeSDQuestion) What percentage of the samples lies within plus/minus three SD of the mean?

Exercise 28.6 (DataRangeQuestion) You have a mean at 100 and a SD of 10. Where are 95% of the data?

Exercise 28.7 (PeakHeightQuestion) If the peak is very high, is the SD low or high?

Exercise 28.8 (ProbabilityQuestion) Given a certain curve with a mean of 20, what is the probability of observing a value exactly equal to 20?

Exercise 28.9 (MeanDifferenceQuestion) What is the difference between \(\mu\) and \(\bar{x}\)?

Exercise 28.10 (EstimateMeanQuestion) How do you calculate the sample mean?

Exercise 28.11 (SigmaSquaredQuestion) What is sigma squared?

Exercise 28.12 (EstimatedSDQuestion) What is the formula for the estimated standard deviation?

Exercise 28.13 (VarianceDifferenceQuestion) Difference between the variance and the estimated variance?

Exercise 28.14 (ModelBenefitsQuestion) What are the benefits of using models?

Exercise 28.15 (SampleDefinitionQuestion) What is a sample in statistics?

Exercise 28.16 (RejectHypothesisQuestion) What does it mean to reject a hypothesis?

Exercise 28.17 (NullHypothesisQuestion) What is a null hypothesis?

Exercise 28.18 (BetterDrugQuestion) How can you show that you have found a better drug?

Exercise 28.19 (PValueIntroductionQuestion) What is the reason for introducing the p-value?

Exercise 28.20 (PValueRangeQuestion) Is there any range for p-values? Can it be negative?

Exercise 28.21 (PValueRangeQuestion) Is there any range for p-values? Can it be negative?

Exercise 28.22 (TypicalPValueQuestion) What are typical values of the p-value and what does it mean? 5%?

Exercise 28.23 (FalsePositiveQuestion) What is a false-positive?

Exercise 28.24 (CalculatePValueQuestion) How is the p-value calculated?

Exercise 28.25 (SDCalculationQuestion) What is the SD if the mean is 155 and 95% of the data lie in the range from 142 to 169?

Exercise 28.26 (SidedPValueQuestion) When do we need the two-sided p-value and when the one-sided?

Exercise 28.27 (CoinTestQuestion) Test a coin with Tail-Head-Head. What is the p-value?

Exercise 28.28 (BorderPValueQuestion) If you get exactly the 0.05 border value, can you reject?

Exercise 28.29 (OneSidedPValueCautionQuestion) Why should you be careful with a one-sided p-test?

Exercise 28.30 (BinomialDistributionQuestion) What is the binomial distribution?

Exercise 28.31 (PHackingWaysQuestion) Name two typical ways of p-hacking.

Exercise 28.32 (AvoidPHackingQuestion) How can p-hacking be avoided?

Exercise 28.33 (MultipleTestingProblemQuestion) What is the multiple testing problem?

28.4.0.1 Covariance

Exercise 28.34 (CovarianceDefinitionQuestion) What is covariance?

Exercise 28.35 (CovarianceMeaningQuestion) What is the meaning of covariance?

Exercise 28.36 (CovarianceVarianceRelationshipQuestion) What is the relationship between covariance and variance?

Exercise 28.37 (HighCovarianceQuestion) If covariance is high, is there a strong relationship?

Exercise 28.38 (ZeroCovarianceQuestion) What if the covariance is zero?

Exercise 28.39 (NegativeCovarianceQuestion) Can covariance be negative?

Exercise 28.40 (NegativeVarianceQuestion) Can variance be negative?

Exercise 28.41 (CorrelationValueQuestion) What do you do if the correlation value is 10?

Exercise 28.42 (CorrelationRangeQuestion) What is the possible range of correlation values?

Exercise 28.43 (CorrelationFormulaQuestion) What is the formula for correlation?

Exercise 28.44 (UnderstandingStatisticalPower) What is the definition of power in a statistical test?

Exercise 28.45 (DistributionEffectOnPower) What is the implication for power analysis if the samples come from the same distribution?

Exercise 28.46 (IncreasingPower) How can you increase the power if the distributions are very similar?

Exercise 28.47 (PreventingPHacking) What should be done to avoid p-hacking when the distributions are close to each other?

Exercise 28.48 (SampleSizeAndPower) If there is overlap and the sample size is small, will the power be high or low?

Exercise 28.49 (FactorsAffectingPower) Which are the two main factors that affect power?

Exercise 28.50 (PurposeOfPowerAnalysis) What does power analysis tell us?

Exercise 28.51 (ExperimentRisks) What are the two risks faced when performing an experiment?

Exercise 28.52 (PerformingPowerAnalysis) How do you perform a power analysis?

Exercise 28.53 (CentralLimitTheoremExplanation) What does the Central Limit Theorem state?

Exercise 28.54 (MedianInBoxplot) What is represented by the middle line in a boxplot?

Exercise 28.55 (BoxContentInBoxplot) What does the box in a boxplot represent?

Exercise 28.56 (RSquaredDefinition) What is R-squared? Show the formula.

Exercise 28.57 (NegativeRSquared) Can the R-squared value be negative?

Exercise 28.58 (RSquaredCalculation) Perform a calculation involving R-squared.

Exercise 28.59 (LeastSquaresMeaning) What is the meaning of the least squares method?

Exercise 28.60 (RegressionVsClassification) What is the difference between regression and classification?

Exercise 28.61 (LikelihoodConcept) What is the idea of likelihood?

Exercise 28.62 (ProbabilityVsLikelihood) What is the difference between probability and likelihood?

Exercise 28.63 (TrainVsTestData) What is the difference between training and testing data?

Exercise 28.64 (SingleValidationIssue) What is the problem if you validate the model only once?

Exercise 28.65 (FoldDefinition) What is a fold in cross-validation?

Exercise 28.66 (LeaveOneOutValidation) What is leave-one-out cross-validation?

Exercise 28.67 (DrawingConfusionMatrix) Draw the confusion matrix.

Exercise 28.68 (SensitivitySpecificityCalculation1) Calculate the sensitivity and specificity for a given confusion matrix.

Exercise 28.69 (SensitivitySpecificityCalculation2) Calculate the sensitivity and specificity for a given confusion matrix.

Exercise 28.70 (BiasAndVariance) What are bias and variance?

Exercise 28.71 (MutualInformationExample) Provide an example and calculate if mutual information is high or low.

Exercise 28.72 (WhatIsPCA) What is PCA?

Exercise 28.73 (ScreePlotExplanation) What is a scree plot?

Exercise 28.74 (LeastSquaresInPCA) Does PCA use least squares?

Exercise 28.75 (PCASteps) Which steps are performed by PCA?

Exercise 28.76 (EigenvaluePC1) What is the eigenvalue of the first principal component?

Exercise 28.77 (DifferencesBetweenPoints) Are the differences between red and yellow the same as the differences between red and blue points?

Exercise 28.78 (ScalingInPCA) How to scale data in PCA?

Exercise 28.79 (DetermineNumberOfComponents) How to determine the number of principal components?

Exercise 28.80 (LimitingNumberOfComponents) How is the number of principal components limited?

Exercise 28.81 (WhyUseTSNE) Why use t-SNE?

Exercise 28.82 (MainIdeaOfTSNE) What is the main idea of t-SNE?

Exercise 28.83 (BasicConceptOfTSNE) What is the basic concept of t-SNE?

Exercise 28.84 (TSNESteps) What are the steps in t-SNE?

Exercise 28.85 (HowKMeansWorks) How does K-means clustering work?

Exercise 28.86 (QualityOfClusters) How can the quality of the resulting clusters be calculated?

Exercise 28.87 (IncreasingK) Why is it not a good idea to increase k too much?

Exercise 28.88 (CorePointInDBSCAN) What is a core point in DBSCAN?

Exercise 28.89 (AddingVsExtending) What is the difference between adding and extending in DBSCAN?

Exercise 28.90 (OutliersInDBSCAN) What are outliers in DBSCAN?

Exercise 28.91 (AdvantagesAndDisadvantagesOfK) What are the advantages and disadvantages of k = 1 and k = 100 in K-nearest neighbors?

Exercise 28.92 (NaiveBayesFormula) What is the formula for Naive Bayes?

Exercise 28.93 (CalculateProbabilities) Calculate the probabilities for a given example using Naive Bayes.

Exercise 28.94 (UnderflowProblem) Why is underflow a problem in Gaussian Naive Bayes?

Exercise 28.95 (Tree Usage) For what can we use trees?

Exercise 28.96 (Tree Usage) Based on a shown tree graph:

How can you use this tree?
What is the root node?
What are branches and internal nodes?
What are the leaves?
Are the leaves pure or impure?
Which of the leaves is more impure?

Exercise 28.97 (Tree Feature Importance) Is the most or the least important feature at the top of the tree?

Exercise 28.98 (Tree Feature Imputation) How can you fill a gap/missing data?

Solution 28.1 (Tree Feature Imputation).

Mean
Median
Comparing to column with high correlation
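
The three options can be illustrated with a small NumPy sketch. The data below are made up for this example, the column names `height` and `weight` are hypothetical, and a straight-line fit via `np.polyfit` is only one possible way to exploit a highly correlated column.

```python
import numpy as np

# Hypothetical toy data: 'weight' has one missing value (NaN),
# 'height' is complete and strongly correlated with 'weight'.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([52.0, 58.0, 71.0, 85.0, np.nan])

missing = np.isnan(weight)

# Option 1: fill the gap with the mean of the observed values.
mean_filled = np.where(missing, np.nanmean(weight), weight)

# Option 2: fill the gap with the median of the observed values.
median_filled = np.where(missing, np.nanmedian(weight), weight)

# Option 3: use a highly correlated column. Here a straight line is
# fitted on the complete rows and used to predict the missing entry.
slope, intercept = np.polyfit(height[~missing], weight[~missing], deg=1)
regression_filled = np.where(missing, slope * height + intercept, weight)

print(mean_filled[missing], median_filled[missing], regression_filled[missing])
```

In this toy example the mean and the median ignore the fact that the missing row has the largest height, whereas the fit on the correlated column takes it into account.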

Exercise 28.99 (Regression Tree Limitations) What are the limitations of regression trees?

Exercise 28.100 (Regression Tree Score) How is the tree score calculated?

Exercise 28.101 (Regression Tree Alpha Value Small) What can we say about the tree if the alpha value is small?

Exercise 28.102 (Regression Tree Increase Alpha Value) What happens if you increase alpha?

Exercise 28.103 (Regression Tree Pruning) What is the meaning of pruning?