Lecture 4: Regularization and Bayesian Statistics
… poorly to the trend of the data. Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. Feng Li (SDU), Regularization. Parameter Estimation in Probabilistic Models: Assume data are generated via a probabilistic model $d \sim p(d; \theta)$, where $p(d; \theta)$ is the probability distribution underlying the data and $\theta$ is a fixed but unknown distribution parameter. Given: $m$ independent and identically distributed (i.i.d.) samples of the data, $D = \{d^{(i)}\}_{i=1,\dots,m}$. Independent and Identically Distributed: given $\theta$, each sample is independent of all other samples. All …
0 points | 25 pages | 185.30 KB | 1 year ago
《Efficient Deep Learning Book》[EDL] Chapter 2 - Compression Techniques
Overview of Compression. One of the simplest approaches towards efficiency is compression to reduce data size. For the longest time in the history of computing, scientists have worked tirelessly towards … popular example of a lossless data compression algorithm is Huffman Coding, where we assign unique strings of bits (codes) to the symbols based on their frequency in the data. More frequent symbols are assigned … and the path to that symbol is the bit-string assigned to it. This allows us to encode the given data in as few bits as possible, since the most frequent symbols will take the least number of bits to …
0 points | 33 pages | 1.96 MB | 1 year ago
Lecture 6: Support Vector Machine
… rule: $y = \operatorname{sign}(\omega^T x + b)$. Given: training data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$. Goal: learn $\omega$ and $b$ that achieve the maximum margin. For now, assume that the entire training data are correctly classified by $(\omega, b)$. Zero … labels from negative labels. We make a more confident decision if a larger margin is given, i.e., the data sample is further away from the hyperplane. There exist an infinite number of hyperplanes, but which … SVM: The Solution. Once we have the $\alpha^*$, $\omega^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}$. Given $\omega^*$, how to calculate the optimal value of $b$? Feng Li (SDU), SVM, December 28, 2021, 35 / 82. SVM: The Solution. Since $\alpha_i^*$ …
0 points | 82 pages | 773.97 KB | 1 year ago
Lecture 5: Gaussian Discriminant Analysis, Naive Bayes
… $p_X(x)$, $\forall y$. We calculate $p_{X|Y}(x \mid y)$ for $\forall x, y$ and $p_Y(y)$ for $\forall y$ according to the given training data. Fortunately, we do not have to calculate $p_X(x)$, because $\arg\max_y p_{Y|X}(y \mid x) = \arg\max_y p_{X|Y}$ … learning from training data, but how? Feng Li (SDU), GDA, NB and EM, September 27, 2023, 33 / 122. Warm Up (Contd.): Given a set of training data $D = \{x^{(i)}, y^{(i)}\}_{i=1,\dots,m}$. The training data are sampled in an i.i.d. manner. The probability of the $i$-th training data $(x^{(i)}, y^{(i)})$: $P(X = x^{(i)}, Y = y^{(i)}) = P(X = x^{(i)} \mid Y = y^{(i)})\,P(Y = y^{(i)}) = p_{X|Y}(x^{(i)} \mid y^{(i)})\,p_Y(y^{(i)})$. The …
0 points | 122 pages | 1.35 MB | 1 year ago
Lecture Notes on Support Vector Machine
… so-called margin of $x_0$ (with respect to the hyperplane $\omega^T x + b = 0$). Now, given a set of $m$ training data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$, we first assume that they are linearly separable. Specifically, there exists … hyperplane actually serves as a decision boundary to differentiate positive data samples from negative data samples. Given a test data sample, we will make a more confident decision if its margin (with respect … across all $b^*$'s: $b^* = \frac{\sum_{i:\,\alpha_i^* > 0} \left(y^{(i)} - \omega^{*T} x^{(i)}\right)}{\sum_{i=1}^{m} \mathbf{1}(\alpha_i^* > 0)}$. In fact, most $\alpha_i$'s in the solution are zeros. According to complementary slackness (see Theorem 2), $\alpha_i^* \left[1 - y^{(i)}(\omega^{*T} x^{(i)} + b^*)\right]$ …
0 points | 18 pages | 509.37 KB | 1 year ago
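Lecture Notes on Gaussian Discriminant Analysis, Naive …
… is given by $p_{X|Y}(x \mid 1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$ (7). Given $m$ sample data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$, the log-likelihood is defined as $\ell(\psi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p_{X,Y}(x^{(i)}$ … optimal values for $\psi$, $\mu_0$, $\mu_1$, and $\Sigma$, such that the resulting GDA model can best fit the given training data. In particular, we let $\nabla_{\mu_0} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$, $\nabla_{\mu_1} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$, $\nabla_{\Sigma} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$. A careful … (5)∼(7), and make predictions according to Bayes' theorem (see Eq. (2)). Specifically, given a test data featured by $\tilde{x}$, we compare $P(Y = \tilde{y} \mid X = \tilde{x}) = p_{Y|X}(\tilde{y} \mid \tilde{x}) = \frac{p(\tilde{x} \mid \tilde{y})\,p(\tilde{y})}{p(\tilde{x})}$, where $\tilde{y}$ …
0 points | 19 pages | 238.80 KB | 1 year ago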
《Efficient Deep Learning Book》[EDL] Chapter 5 - Advanced Compression Techniques
… simplest questions cleverly, thereby rendering them unusually complex. One should seek the simple solution.” (Anton Pavlovich Chekhov) In this chapter, we will discuss two advanced compression techniques … for weight sharing. However, quantization falls behind in case the data that we are quantizing is not uniformly distributed, i.e. the data is more likely to take values in a certain range than another equally … ranges (bins), regardless of the frequency of data. Clustering helps solve that problem by adapting the allocation of precision to match the distribution of the data, which ensures the decoded value deviates …
0 points | 34 pages | 3.18 MB | 1 year ago
PyTorch Release Notes
… functionality. PyTorch also includes standard defined neural network layers, deep learning optimizers, data loading utilities, and multi-GPU and multi-node support. Functions are executed immediately instead … nvcr.io/nvidia/ pytorch:-py3 Note: If you use multiprocessing for multi-threaded data loaders, the default shared memory segment size with which the container runs might not be enough … To pull data and model descriptions from locations outside the container for use by PyTorch or save results to locations outside the container, mount one or more host directories as Docker® data volumes …
0 points | 365 pages | 2.94 MB | 1 year ago
Lecture Notes on Linear Regression
… by $\theta$. Since our goal is to make predictions according to the hypothesis function given a new test data, we need to find the optimal value of $\theta$ such that the resulting prediction is as accurate as possible … based on a given set of $m$ training data $\{x^{(i)}, y^{(i)}\}_{i=1,\dots,m}$. In particular, we are supposed to find a hypothesis function (parameterized by $\theta$) which fits the training data as closely as possible. To measure the error between $h_\theta$ and the training data, we define a cost function (also called error function) $J(\theta): \mathbb{R}^{n+1} \to \mathbb{R}$ as follows: $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Our linear regression problem …
0 points | 6 pages | 455.98 KB | 1 year ago
QCon Beijing 2018: From Keyboard Input to Neural Networks: Applications of Deep Learning at Bloomberg (Biye Li)
… rights reserved. QCon Beijing, April 21, 2018. Biye Li, Team Manager, Data Technologies Automation; Xiangqian Yu, Team Lead, Derivatives Data. From Keyboards to Neural Networks. © 2018 Bloomberg Finance … facilitate financial decision-making. © 2018 Bloomberg Finance L.P. All rights reserved. What is Data Technologies Automation? Challenges – Scale of Financial Information: Companies, Market Types, Speed vs. … Federal Reserve will raise rate to 2%. Solution – Evolution Over Time: 1990s pattern matching, 2000s, 2010, 2016, 2017. Modified from https://commons…
0 points | 64 pages | 13.45 MB | 1 year ago
75 results in total