Appendix G — Solutions to Selected Exercises

Warning
  • Solutions are incomplete and need to be corrected!
  • They serve as a starting point for the final solution.

G.1 Data-Driven Modeling and Optimization

G.1.1 Histograms

Solution G.1 (Density Curve).

  • We can calculate probabilities.
  • We only need two parameters (the mean and the standard deviation) to describe the curve -> we can store the data more efficiently.
  • Gaps (regions without observations) can be filled in.

G.1.2 The Normal Distribution

Solution G.2 (TwoSDAnswer). 95%

Solution G.3 (OneSDAnswer). 68%

Solution G.4 (ThreeSDAnswer). 99.7%

Solution G.5 (DataRangeAnswer). 80 - 120

Solution G.6 (PeakHeightAnswer). low
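
The 68/95/99.7 figures above can be checked numerically; a minimal sketch, assuming SciPy is available:

```python
from scipy.stats import norm

# Probability mass within 1, 2, and 3 standard deviations of the mean
# of a normal distribution (approx. 68%, 95%, and 99.7%).
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {p:.3f}")
```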

G.1.3 The mean, the median, and the mode

G.1.4 The exponential distribution

G.1.5 Population and Estimated Parameters

Solution G.7 (ProbabilityAnswer). 50%

G.1.6 Calculating the Mean, Variance and Standard Deviation

Solution G.8 (MeanDifferenceAnswer). \(\mu\) is the population mean, calculated from all the data; \(\bar{x}\) is the sample mean, calculated from a sample because we do not have the full information.

Solution G.9 (EstimateMeanAnswer). Sum of the values divided by n.

Solution G.10 (SigmaSquaredAnswer). Variance

Solution G.11 (EstimatedSDAnswer). The same formula as the population standard deviation, but dividing by \(n-1\) instead of \(n\).

Solution G.12 (VarianceDifferenceAnswer). The population variance divides by \(n\), the estimated (sample) variance divides by \(n-1\).
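
A minimal NumPy sketch of these estimates on hypothetical data; ddof=1 switches from the \(n\) to the \(n-1\) denominator:

```python
import numpy as np

x = np.array([4.0, 7.0, 9.0, 10.0, 15.0])   # hypothetical sample

mean = x.sum() / len(x)            # estimated mean: sum of the values divided by n
var_population = np.var(x)         # divides by n     (population formula)
var_sample = np.var(x, ddof=1)     # divides by n - 1 (sample estimate)
sd_sample = np.sqrt(var_sample)    # estimated standard deviation

print(mean, var_population, var_sample, sd_sample)
```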

Solution G.13 (ModelBenefitsAnswer).

  • Approximation
  • Prediction
  • Understanding

Solution G.14 (SampleDefinitionAnswer). It is a subset of the population, i.e. the part of the data we actually collect.

G.1.7 Hypothesis Testing and the Null-Hypothesis

Solution G.15 (RejectHypothesisAnswer). It means the evidence supports the alternative hypothesis, indicating that the null hypothesis is unlikely to be true.

Solution G.16 (NullHypothesisAnswer). It’s a statement that there is no effect or no difference, and it serves as the default or starting assumption in hypothesis testing.

Solution G.17 (BetterDrugAnswer). By conducting experiments and statistical tests to compare the new drug’s effectiveness against the current standard and demonstrating a significant improvement.

G.1.8 Alternative Hypotheses, Main Ideas

G.1.9 p-values: What they are and how to interpret them

Solution G.18 (PValueIntroductionAnswer). We can reject the null hypothesis. We can make a decision.

Solution G.19 (PValueRangeAnswer). It can only be between 0 and 1.

Solution G.20 (PValueRangeAnswer). It can only be between 0 and 1.

Solution G.21 (TypicalPValueAnswer). The chance that we wrongly reject the null hypothesis.

Solution G.22 (FalsePositiveAnswer). A false positive occurs when we reject the null hypothesis even though it is actually true.

G.1.10 How to calculate p-values

Solution G.23 (CalculatePValueAnswer). The probability of the specific observed result, plus the probability of outcomes that are equally likely, plus the probability of outcomes that are even rarer.

Solution G.24 (SDCalculationAnswer). 7 is the SD.

Solution G.25 (SidedPValueAnswer). If we are not interested in the direction of the change, we use the two-sided p-value; if the direction matters, we use the one-sided p-value.
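
As an illustration, a hypothetical coin experiment (9 heads in 10 flips) evaluated with a binomial test; a minimal sketch, assuming SciPy is installed:

```python
from scipy.stats import binomtest

# Hypothetical experiment: 9 heads in 10 flips of a supposedly fair coin.
two_sided = binomtest(k=9, n=10, p=0.5, alternative="two-sided").pvalue
one_sided = binomtest(k=9, n=10, p=0.5, alternative="greater").pvalue

print(two_sided)  # ~0.021: the direction of the deviation does not matter
print(one_sided)  # ~0.011: only "more heads than expected" counts
```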

Solution G.26 (CoinTestAnswer). TBD

Solution G.27 (BorderPValueAnswer). TBD

Solution G.28 (OneSidedPValueCautionAnswer). If the effect goes in the direction you did not test for, the one-sided p-value will not detect the change.

Solution G.29 (BinomialDistributionAnswer). TBD

G.1.11 p-hacking: What it is and how to avoid it

Solution G.30 (PHackingWaysAnswer).

  • Repeating the experiment until one repeat happens to produce a small p-value -> false-positive result.
  • Increasing the sample size within one experiment when the p-value is close to the threshold.

Solution G.31 (AvoidPHackingAnswer). Specify the number of repeats and the sample sizes at the beginning.

Solution G.32 (MultipleTestingProblemAnswer). TBD

G.1.12 Covariance

Solution G.33 (CovarianceDefinitionAnswer). \(\operatorname{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)

Solution G.34 (CovarianceMeaningAnswer). A positive covariance means that large values of the first variable tend to occur together with large values of the second variable.

Solution G.35 (CovarianceVarianceRelationshipAnswer). \(\operatorname{Cov}(X, X) = \operatorname{Var}(X)\): the covariance of a variable with itself is its variance.
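
A small NumPy sketch of both points, using hypothetical paired measurements:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])           # hypothetical paired measurements

# np.cov uses the n-1 (sample) convention; the off-diagonal entry is Cov(x, y),
# and the diagonal entries are the variances, i.e. Cov(x, x) = Var(x).
cov_matrix = np.cov(x, y)
print(cov_matrix[0, 1])                      # Cov(x, y)
print(cov_matrix[0, 0], np.var(x, ddof=1))   # both are Var(x)
```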

Solution G.36 (HighCovarianceAnswer). No; the size of the covariance is not meaningful on its own, because it depends on the scale of the data.

Solution G.37 (ZeroCovarianceAnswer). There is no linear relationship between the variables.

Solution G.38 (NegativeCovarianceAnswer). Yes

Solution G.39 (NegativeVarianceAnswer). No

G.1.13 Pearson’s Correlation

Solution G.40 (CorrelationValueAnswer). Recalculate

Solution G.41 (CorrelationRangeAnswer). From -1 to 1

Solution G.42 (CorrelationFormulaAnswer). \(r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}\), the covariance divided by the product of the standard deviations.
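
A minimal sketch on hypothetical data; np.corrcoef and the formula above give the same value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])   # hypothetical paired measurements

r = np.corrcoef(x, y)[0, 1]          # Pearson's correlation
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(r, r_manual)                   # both lie between -1 and 1 and agree
```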

G.1.14 Statistical Power

Solution G.43 (UnderstandingStatisticalPower). It is the probability of correctly rejecting the null hypothesis.

Solution G.44 (DistributionEffectOnPower). Power analysis is not applicable.

Solution G.45 (IncreasingPower). By taking more samples.

Solution G.46 (PreventingPHacking). TBD

Solution G.47 (SampleSizeAndPower). The power will be low.

G.1.15 Power Analysis

Solution G.48 (MainFactorsAffectingPower). The overlap (distance of the two means) and sample sizes.

Solution G.49 (PowerAnalysisOutcome). The sample size needed.
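
A minimal sketch of such a computation, assuming the statsmodels package is installed; the effect size, alpha, and power values are illustrative:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,   # estimated overlap between the two groups (Cohen's d)
    alpha=0.05,        # significance threshold
    power=0.8,         # desired probability of correctly rejecting H0
)
print(n_per_group)     # required sample size per group (about 64)
```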

Solution G.50 (RisksInExperiments). Few experiments lead to very low power, and many experiments might result in p-hacking.

Solution G.51 (StepsToPerformPowerAnalysis).

  1. Select power
  2. Select threshold for significance (alpha)
  3. Estimate the overlap (done by the effect size)

G.1.16 The Central Limit Theorem

Solution G.52 (CentralLimitTheoremAnswer). TBD

G.1.17 Boxplots

Solution G.53 (MedianAnswer). The median.

Solution G.54 (BoxContentAnswer). 50% of the data.

G.1.18 R-squared

Solution G.55 (RSquaredFormulaAnswer). TBD

Solution G.56 (NegativeRSquaredAnswer). If you fit a line by least squares, no. For other models R-squared can become negative, but such fits are worse than simply predicting the mean and are usually considered useless.

Solution G.57 (RSquaredCalculationAnswer). TBD

G.1.18.1 The main ideas of fitting a line to data (The main ideas of least squares and linear regression.)

Solution G.58 (LeastSquaresAnswer). It is fitting a model to data by minimizing the sum of squared residuals.
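
A minimal NumPy sketch on hypothetical data: fit a line by least squares and compute R-squared from the residuals:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # hypothetical measurements

slope, intercept = np.polyfit(x, y, deg=1)   # minimizes the sum of squared residuals
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)            # sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)         # total sum of squares around the mean
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```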

G.1.19 Linear Regression

G.1.20 Multiple Regression

G.1.21 A Gentle Introduction to Machine Learning

Solution G.59 (RegressionVsClassificationAnswer). Regression involves predicting continuous values (e.g., temperature, size), while classification involves predicting discrete values (e.g., categories like cat, dog).

G.1.22 Maximum Likelihood

Solution G.60 (LikelihoodConceptAnswer). The distribution that fits the data best.

G.1.23 Probability is not Likelihood

Solution G.61 (ProbabilityVsLikelihoodAnswer). Likelihood: Finding the curve that best fits the data. Probability: Calculating the probability of an event given a specific curve.

G.1.24 Cross Validation

Solution G.62 (TrainVsTestDataAnswer). Training data is used to fit the model, while testing data is used to evaluate how well the model fits.

Solution G.63 (SingleValidationIssueAnswer). The performance might not be representative because the data may not be equally distributed between training and testing sets.

Solution G.64 (FoldDefinitionAnswer). TBD

Solution G.65 (LeaveOneOutValidationAnswer). Only one data point is used as the test set, and the rest are used as the training set.
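
A minimal sketch of k-fold and leave-one-out cross-validation, assuming scikit-learn is installed; the data set and the model are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is used exactly once as the test set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# Leave-one-out: a single data point is the test set in each round.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(loo_scores.mean())
```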

G.1.25 The Confusion Matrix

Solution G.66 (ConfusionMatrixAnswer). TBD

G.1.26 Sensitivity and Specificity

Solution G.67 (SensitivitySpecificityAnswer1). TBD

Solution G.68 (SensitivitySpecificityAnswer2). TBD

G.1.27 Bias and Variance

Solution G.69 (BiasAndVarianceAnswer). TBD

G.1.28 Mutual Information

Solution G.70 (MutualInformationExampleAnswer). TBD

G.1.29 Principal Component Analysis (PCA)

Solution G.71 (WhatIsPCAAnswer). A dimension reduction technique that helps discover important variables.

Solution G.72 (screePlotAnswer). It shows how much of the variation in the data is explained by each principal component.

Solution G.73 (LeastSquaresInPCAAnswer). No; PCA fits PC1 by maximizing the sum of squared distances of the projected points from the origin (which is equivalent to minimizing the perpendicular distances to the line).

Solution G.74 (PCAStepsAnswer).

  1. Calculate mean
  2. Shift the data to the center of the coordinate system
  3. Fit a line by maximizing the distances
  4. Calculate the sum of squared distances
  5. Calculate the slope
  6. Rotate

Solution G.75 (EigenvaluePC1Answer). Formula (to be specified).

Solution G.76 (DifferencesBetweenPointsAnswer). No, because the first difference is measured along PC1, which explains more of the variation and is therefore more important.

Solution G.77 (ScalingInPCAAnswer). Scaling by dividing by the standard deviation (SD).
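
A minimal scikit-learn sketch that combines scaling with PCA (assuming scikit-learn is installed; the data is random and purely illustrative). The explained_variance_ratio_ values correspond to the information shown in a scree plot:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                   # 20 samples, 4 variables

X_scaled = StandardScaler().fit_transform(X)   # center and divide by the SD
pca = PCA(n_components=2).fit(X_scaled)

print(pca.explained_variance_ratio_)           # variation per component (scree plot)
X_reduced = pca.transform(X_scaled)            # data projected onto PC1 and PC2
```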

Solution G.78 (DetermineNumberOfComponentsAnswer). TBD

Solution G.79 (LimitingNumberOfComponentsAnswer).

  1. The dimension of the problem
  2. Number of samples

G.1.30 t-SNE

Solution G.80 (WhyUseTSNEAnswer). For dimension reduction and picking out the relevant clusters.

Solution G.81 (MainIdeaOfTSNEAnswer). To reduce the dimensions of the data by reconstructing the relationships in a lower-dimensional space.
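
A minimal usage sketch with scikit-learn's TSNE on random, purely illustrative data (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # 100 samples in a 20-dimensional space

# Reduce to 2 dimensions while trying to preserve neighborhood relationships.
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_embedded.shape)          # (100, 2)
```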

Solution G.82 (BasicConceptOfTSNEAnswer).

  1. First, randomly arrange the points in a lower dimension
  2. Decide whether to move points left or right, depending on distances in the original dimension
  3. Finally, arrange points in the lower dimension similarly to the original dimension

Solution G.83 (TSNEStepsAnswer).

  1. Project data to get random points
  2. Set up a matrix of distances
  3. Calculate the inner variances of the clusters and the Gaussian distribution
  4. Do the same with the projected points
  5. Move projected points so the second matrix gets more similar to the first matrix

G.1.31 K-means clustering

Solution G.84 (HowKMeansWorksAnswer).

  1. Select the number of clusters
  2. Randomly select distinct data points as initial cluster centers
  3. Measure the distance between each point and the cluster centers
  4. Assign each point to the nearest cluster
  5. Repeat the process

Solution G.85 (QualityOfClustersAnswer). Calculate the within-cluster variation.

Solution G.86 (IncreasingKAnswer). If k is too high, each point would be its own cluster. If k is too low, you cannot see the structures.
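
A minimal k-means sketch on illustrative data, assuming scikit-learn is installed; inertia_ is the within-cluster variation mentioned above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # two illustrative point clouds
               rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])   # cluster assignment of each point
print(kmeans.inertia_)      # within-cluster variation (sum of squared distances)
```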

G.1.32 DBSCAN

Solution G.87 (CorePointInDBSCANAnswer). A point that is close to at least k other points.

Solution G.88 (AddingVsExtendingAnswer). Adding means we add a point and then stop. Extending means we add a point and then look for other neighbors from that point.

Solution G.89 (OutliersInDBSCANAnswer). Points that are not core points and do not belong to existing clusters.
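
A minimal DBSCAN sketch on illustrative data, assuming scikit-learn is installed; the isolated point is labeled as an outlier (-1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # two dense, illustrative clusters
               rng.normal(5, 0.3, (50, 2)),
               [[10.0, 10.0]]])               # one isolated point

# eps: neighborhood radius; min_samples: points (including the point itself)
# required within eps for a core point.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(sorted(set(db.labels_)))                # e.g. [-1, 0, 1]; -1 marks outliers
```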

G.1.33 K-nearest neighbors

Solution G.90 (AdvantagesAndDisadvantagesOfKAnswer). (A usage sketch follows the list.)

  • k = 1: sensitive to noise, because a single mislabeled or poorly measured neighbor determines the prediction.
  • k = 100: the decision is smoother, but small groups can be outvoted by the majority class, so finer structure in the data may be missed.
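
A minimal sketch comparing a small and a large k, assuming scikit-learn is installed; the data set and the values of k are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15):   # a very small and a fairly large neighborhood
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
```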

G.1.34 Naive Bayes

Solution G.91 (NaiveBayesFormulaAnswer). TBD

Solution G.92 (CalculateProbabilitiesAnswer). TBD

G.1.35 Gaussian Naive Bayes

Solution G.93 (UnderflowProblemAnswer). Small values multiplied together can become smaller than the smallest number the computer's floating-point format can represent, so the result underflows to zero. Using logarithms and adding them (e.g., with base-2 logarithms, log(1/2) -> -1, log(1/4) -> -2) prevents underflow.
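
A small numerical illustration of the underflow problem and the logarithm trick; the probability values are arbitrary:

```python
import math

probs = [1e-200, 1e-180, 1e-120]    # hypothetical per-feature probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                      # underflows to 0.0 in floating-point arithmetic

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                      # about -1151.3; sums of logs stay representable
```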

G.1.36 Trees

Solution G.94 (Tree Usage). Classification, Regression, Clustering

Solution G.95 (Tree Usage). TBD

Solution G.96 (Tree Feature Importance). The most important feature.

Solution G.97 (Regression Tree Limitations). High dimensions

Solution G.98 (Regression Tree Score). \(\mathrm{SSR} + \alpha \cdot T\), where \(T\) is the number of leaves in the tree.

Solution G.99 (Regression Tree Alpha Value Small). The tree is more complex.

Solution G.100 (Regression Tree Increase Alpha Value). We get smaller trees

Solution G.101 (Regression Tree Pruning). Decreases the complexity of the tree to enhance performance and reduce overfitting

G.2 Machine Learning and Artificial Intelligence

G.2.1 Backpropagation

Solution G.102 (ChainRuleAndGradientDescentAnswer). Combination of the chain rule and gradient descent.

Solution G.103 (BackpropagationNamingAnswer). Because you start at the end and go backwards.

G.2.2 Gradient Descent

Solution G.104 (GradDescStepSize). learning rate x slope

Solution G.105 (GradDescIntercept). Old intercept - step size

Solution G.106 (GradDescIntercept). When the step size is small or after a certain number of steps
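
A minimal gradient-descent sketch for these three answers, fitting only the intercept of a line with a fixed slope; the data and the learning rate are illustrative:

```python
x = [1.0, 2.0, 3.0]
y = [2.0, 3.0, 4.0]        # hypothetical data lying on y = x + 1
slope_fixed = 1.0          # slope assumed known; only the intercept is learned

intercept = 0.0
learning_rate = 0.1
for step in range(1000):
    residuals = [yi - (intercept + slope_fixed * xi) for xi, yi in zip(x, y)]
    gradient = -2 * sum(residuals)          # d(SSR)/d(intercept)
    step_size = learning_rate * gradient    # step size = learning rate x slope
    intercept = intercept - step_size       # new intercept = old intercept - step size
    if abs(step_size) < 1e-6:               # stop when the step size is very small
        break

print(intercept)                            # converges towards 1.0
```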

G.2.3 ReLU

Solution G.107 (Graph ReLU). Graph of ReLU function: f(x) = max(0, x)
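
A direct translation of the definition into code (the plot itself is omitted):

```python
def relu(x):
    # ReLU activation: returns x for positive inputs and 0 otherwise.
    return max(0.0, x)

print(relu(-2.0), relu(3.5))   # 0.0 3.5
```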

G.2.4 CNNs

Solution G.108 (CNNImageRecognitionAnswer).

  • too many features for the input layer -> high memory consumption
  • there are always shifts in the data (the same object can appear at different positions)
  • it learns local information and local correlations

Solution G.109 (CNNFiltersInitializationAnswer). The filter values in CNNs are randomly initialized and then trained and optimized through the process of backpropagation.

Solution G.110 (CNNFilterInitializationAnswer). The filter values in CNNs are initially set by random initialization. These filters undergo training via backpropagation, where gradients are computed and used to adjust the filter values to optimize performance.

Solution G.111 (GenNNStockPredictionAnswer). A limitation of using classical neural networks for stock market prediction is their reliance on fixed inputs. Stock market data is dynamic and requires models that can adapt to changing conditions over time.

G.2.5 RNN

Solution G.112 (RNNUnrollingAnswer). In the unrolling process of RNNs, the network is copied once per time step, and the output of the feedback loop is fed as input into the next copy of the network.

Solution G.113 (RNNReliabilityAnswer). RNNs sometimes fail to work reliably because of the vanishing gradient problem (repeatedly multiplying values smaller than 1) and the exploding gradient problem (repeatedly multiplying values larger than 1). These problems are amplified because the same weights are reused in every copy of the network during unrolling.

G.2.6 LSTM

Solution G.114 (LSTMSigmoidTanhAnswer). The sigmoid activation function outputs values between 0 and 1, making it suitable for probability determination, whereas the tanh activation function outputs values between -1 and 1.

Solution G.115 (LSTMSigmoidTanhAnswer). It states how much of the long-term memory should be used.

Solution G.116 (LSTMGatesAnswer). An LSTM network has three types of gates: the forget gate, the input gate, and the output gate. The forget gate decides what information to discard from the cell state, the input gate updates the cell state with new information, and the output gate determines what part of the cell state should be output.

Solution G.117 (LSTMLongTermInfoAnswer). Long-term information is used in the output gate of an LSTM network.

Solution G.118 (LSTMUpdateGatesAnswer). In the input and forget gates.

G.2.7 Pytorch/Lightning

Solution G.119 (PyTorchRequiresGradAnswer). In PyTorch, requires_grad indicates whether a tensor should be trained. If set to False, the tensor will not be trained.
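
A minimal PyTorch sketch, assuming PyTorch is installed; the tensors and the loss are illustrative:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)    # trainable: gradients are tracked
b = torch.tensor([0.5], requires_grad=False)   # fixed: no gradient is computed

loss = (w * 2.0 + b).sum()
loss.backward()

print(w.grad)   # tensor([2.])
print(b.grad)   # None - the tensor is excluded from training
```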

G.2.8 Embeddings

Solution G.120 (NN STrings). No, they process numerical values.

Solution G.121 (Embedding Definition). Representation of a word as a vector.

Solution G.122 (Embedding Dimensions). We can model similarities.
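
A minimal sketch, assuming PyTorch is available; the vocabulary and the embedding size are illustrative:

```python
import torch
import torch.nn as nn

vocab = {"cat": 0, "dog": 1, "car": 2}               # hypothetical vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

ids = torch.tensor([vocab["cat"], vocab["dog"]])     # words mapped to integer ids
vectors = embedding(ids)                             # each word becomes a 4-dim vector
print(vectors.shape)                                 # torch.Size([2, 4])

# Similarities between words can now be measured on the vectors.
print(torch.cosine_similarity(vectors[0], vectors[1], dim=0))
```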

G.2.9 Sequence to Sequence Models

Solution G.123 (LSTM). Because they are able to consider “far away” information.

Solution G.124 (Teacher Forcing). During training, the correct target word is fed in as the next input instead of the model’s own (possibly wrong) prediction.

Solution G.125 (Attention). Attention scores compute similarities for one input to the others.

G.2.10 Transformers

Solution G.126 (ChatGPT). Decoder only.

Solution G.127 (Translation). Encoder-Decoder structure.

Solution G.128 (Difference Encoder-Decoder and Decoder Only.).

  • Encoder-Decoder: self-attention.
  • Decoder only: masked self-attention.

Solution G.129 (Weights).

  • a: Randomly
  • b: Backpropagation

Solution G.130 (Order of Words). Positional Encoding

Solution G.131 (Relationship Between Words). Masked self-attention which looks at the previous tokens.

Solution G.132 (Masked Self Attention). It works by investigating how similar each word is to itself and to all of the preceding words in the sentence.

Solution G.133 (Softmax). Transformation to values between 0 and 1.
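
A minimal sketch of the softmax transformation applied to illustrative attention scores:

```python
import numpy as np

def softmax(scores):
    # Subtracting the maximum improves numerical stability without changing the result.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical similarity scores
print(softmax(scores))               # values between 0 and 1 that sum to 1
```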

Solution G.134 (Softmax Output). We create new numbers, the Values (V), computed like K and Q but with different weights. We then scale these Values by the softmax percentages -> we get the scaled V’s.

Solution G.135 (V’s). Lastly, we sum the scaled Values together; this combines the separate encodings of the words, weighted by their similarities to “is”, and gives the masked self-attention values for “is”.

Solution G.136 (Residual Connections). They are bypasses that combine the position-encoded values with the masked self-attention values.

Solution G.137 (Generate Known Word in Sequence).

  • During training.
  • Because a decoder-only transformer predicts the sequence token by token, so the calculations for the known words are still needed.

Solution G.138 (Masked-Self-Attention Values and Bypass). We use a simple neural network with two inputs and five outputs for the vocabulary.