VMware Data Solution - IT文库 (IT Library): free downloads of e-books and documents on programming and IT


Lecture 4: Regularization and Bayesian Statistics

poorly to the trend of the data. Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. (Feng Li (SDU), Regularization) ... Parameter Estimation in Probabilistic Models: Assume data are generated via a probabilistic model $d \sim p(d; \theta)$, where $p(d; \theta)$ is the probability distribution underlying the data and $\theta$ is a fixed but unknown distribution parameter. Given: $m$ independent and identically distributed (i.i.d.) samples of the data, $D = \{d^{(i)}\}_{i=1,\dots,m}$. Independent and identically distributed: given $\theta$, each sample is independent of all other samples. All ...

0 points | 25 pages | 185.30 KB | 1 year ago
《Efficient Deep Learning Book》[EDL] Chapter 2 - Compression Techniques

Overview of Compression: One of the simplest approaches towards efficiency is compression to reduce data size. For the longest time in the history of computing, scientists have worked tirelessly towards ... popular example of a lossless data compression algorithm is Huffman Coding, where we assign unique strings of bits (codes) to the symbols based on their frequency in the data. More frequent symbols are assigned ... and the path to that symbol is the bit-string assigned to it. This allows us to encode the given data in as few bits as possible, since the most frequent symbols will take the least number of bits to ...

0 points | 33 pages | 1.96 MB | 1 year ago
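The Huffman Coding idea in the excerpt above (frequent symbols get short codes, and the path from the root to a symbol is its bit-string) can be sketched as follows. This is a minimal illustration, not code from the book: the function names `huffman_codes` and `encode` are hypothetical, and the sketch assumes the input is a string whose symbols are single characters.

```python
import heapq
from collections import Counter


def huffman_codes(data):
    """Map each symbol in `data` to a bit-string (hypothetical helper).

    Assumes symbols are single characters, so a tuple always denotes
    an internal tree node and never a symbol.
    """
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # Only one distinct symbol: give it a one-bit code.
        return {next(iter(freq)): "0"}

    # Heap entries are (weight, tie_breaker, tree). The unique tie_breaker
    # keeps Python from ever comparing two trees directly.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)

    # Repeatedly merge the two lightest subtrees into one.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, prefix):
        # Left edge appends "0", right edge appends "1".
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


def encode(data, codes):
    """Concatenate the code of every symbol in `data`."""
    return "".join(codes[s] for s in data)
```

For the input `"aaaabbc"`, the most frequent symbol `a` receives a one-bit code while `b` and `c` receive two-bit codes, so the seven symbols fit in ten bits.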
Lecture 6: Support Vector Machine

rule: $y = \mathrm{sign}(\omega^T x + b)$. Given: training data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$. Goal: learn $\omega$ and $b$ that achieve the maximum margin. For now, assume that the entire training data are correctly classified by $(\omega, b)$. Zero ... labels from negative labels. We make a more confident decision if a larger margin is given, i.e., the data sample is further away from the hyperplane. There exist an infinite number of hyperplanes, but which ... SVM: The Solution. Once we have the $\alpha^*$, $\omega^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}$. Given $\omega^*$, how to calculate the optimal value of $b$? (Feng Li (SDU), SVM, December 28, 2021, 35 / 82) SVM: The Solution. Since $\alpha_i^*$ ...

0 points | 82 pages | 773.97 KB | 1 year ago
Lecture 5: Gaussian Discriminant Analysis, Naive Bayes

$p_X(x)$, $\forall y$. We calculate $p_{X \mid Y}(x \mid y)$ for $\forall x, y$ and $p_Y(y)$ for $\forall y$ according to the given training data. Fortunately, we do not have to calculate $p_X(x)$, because $\arg\max_y p_{Y \mid X}(y \mid x) = \arg\max_y p_{X \mid Y}$ ... learning from training data, but how? (Feng Li (SDU), GDA, NB and EM, September 27, 2023, 33 / 122) Warm Up (Contd.): Given a set of training data $D = \{x^{(i)}, y^{(i)}\}_{i=1,\dots,m}$. The training data are sampled in an i.i.d. manner. The probability of the $i$-th training data $(x^{(i)}, y^{(i)})$: $P(X = x^{(i)}, Y = y^{(i)}) = P(X = x^{(i)} \mid Y = y^{(i)}) P(Y = y^{(i)}) = p_{X \mid Y}(x^{(i)} \mid y^{(i)}) \, p_Y(y^{(i)})$. The ...

0 points | 122 pages | 1.35 MB | 1 year ago
Lecture Notes on Support Vector Machine

so-called margin of $x_0$ (with respect to the hyperplane $\omega^T x + b = 0$). Now, given a set of $m$ training data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$, we first assume that they are linearly separable. Specifically, there exists ... hyperplane actually serves as a decision boundary to differentiate positive data samples from negative data samples. Given a test data sample, we will make a more confident decision if its margin (with respect ... across all $b^*$'s: $b^* = \frac{\sum_{i:\alpha_i^* > 0} \left( y^{(i)} - \omega^{*T} x^{(i)} \right)}{\sum_{i=1}^{m} \mathbf{1}(\alpha_i^* > 0)}$. In fact, most $\alpha_i$'s in the solution are zeros. According to complementary slackness (see Theorem 2), $\alpha_i^* [1 - y^{(i)}(\omega^{*T} x^{(i)} + b^*)]$ ...

0 points | 18 pages | 509.37 KB | 1 year ago
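The averaging formula for $b^*$ in the excerpt above, together with the standard dual recovery $\omega^* = \sum_i \alpha_i^* y^{(i)} x^{(i)}$, can be sketched in plain Python. This is an illustrative sketch only: the function name `recover_w_b` and the tolerance `tol` used to decide whether $\alpha_i^* > 0$ are assumptions, and the dual variables are taken as given rather than solved for.

```python
def recover_w_b(alphas, xs, ys, tol=1e-9):
    """Recover (w, b) from optimal dual variables (hypothetical helper).

    alphas: list of alpha_i^*, assumed already computed by some dual solver
    xs:     list of feature vectors (lists of floats)
    ys:     list of labels in {-1, +1}
    tol:    assumed numerical threshold for treating alpha_i^* as positive
    """
    n = len(xs[0])

    # w^* = sum_i alpha_i^* y^(i) x^(i), computed coordinate by coordinate.
    w = [sum(a * y * x[j] for a, x, y in zip(alphas, xs, ys)) for j in range(n)]

    # Indices of support vectors, i.e. samples with alpha_i^* > 0.
    support = [i for i, a in enumerate(alphas) if a > tol]
    if not support:
        raise ValueError("no support vectors: all alphas are (near) zero")

    # b^* = average over support vectors of y^(i) - w^{*T} x^(i).
    residuals = [
        ys[i] - sum(wj * xj for wj, xj in zip(w, xs[i])) for i in support
    ]
    b = sum(residuals) / len(support)
    return w, b
```

As a small check, for the two points $(1, 0)$ with label $+1$ and $(-1, 0)$ with label $-1$, the dual solution $\alpha^* = (0.5, 0.5)$ gives $\omega^* = (1, 0)$ and $b^* = 0$.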
Lecture Notes on Gaussian Discriminant Analysis, Naive

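is given by $p_{X \mid Y}(x \mid 1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right)$ (7). Given $m$ sample data $\{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$, the log-likelihood is defined as $\ell(\psi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p_{X,Y}(x^{(i)}$ ... optimal values for $\psi$, $\mu_0$, and $\Sigma$, such that the resulting GDA model can best fit the given training data. In particular, we let $\nabla_{\mu_0} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$, $\nabla_{\mu_1} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$, $\nabla_{\Sigma} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$. A careful ... (5)∼(7), and make predictions according to Bayes' theorem (see Eq. (2)). Specifically, given a test data featured by $\tilde{x}$, we compare $P(Y = \tilde{y} \mid X = \tilde{x}) = p_{Y \mid X}(\tilde{y} \mid \tilde{x}) = \frac{p(\tilde{x} \mid \tilde{y}) \, p(\tilde{y})}{p(\tilde{x})}$ where $\tilde{y}$ ...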

0 points | 19 pages | 238.80 KB | 1 year ago
《Efficient Deep Learning Book》[EDL] Chapter 5 - Advanced Compression Techniques

simplest questions cleverly, thereby rendering them unusually complex. One should seek the simple solution.” - Anton Pavlovich Chekhov. In this chapter, we will discuss two advanced compression techniques ... for weight sharing. However, quantization falls behind in case the data that we are quantizing is not uniformly distributed, i.e. the data is more likely to take values in a certain range than another equally ... ranges (bins), regardless of the frequency of data. Clustering helps solve that problem by adapting the allocation of precision to match the distribution of the data, which ensures the decoded value deviates ...

0 points | 34 pages | 3.18 MB | 1 year ago
PyTorch Release Notes

functionality. PyTorch also includes standard defined neural network layers, deep learning optimizers, data loading utilities, and multi-GPU and multi-node support. Functions are executed immediately instead ... nvcr.io/nvidia/pytorch:-py3 Note: If you use multiprocessing for multi-threaded data loaders, the default shared memory segment size with which the container runs might not be enough ... To pull data and model descriptions from locations outside the container for use by PyTorch or save results to locations outside the container, mount one or more host directories as Docker® data volumes ...

0 points | 365 pages | 2.94 MB | 1 year ago
Lecture Notes on Linear Regression

by $\theta$. Since our goal is to make predictions according to the hypothesis function given a new test data, we need to find the optimal value of $\theta$ such that the resulting prediction is as accurate as possible based on a given set of $m$ training data $\{x^{(i)}, y^{(i)}\}_{i=1,\dots,m}$. In particular, we are supposed to find a hypothesis function (parameterized by $\theta$) which fits the training data as closely as possible. To measure the error between $h_\theta$ and the training data, we define a cost function (also called error function) $J(\theta) : \mathbb{R}^{n+1} \to \mathbb{R}$ as follows: $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$. Our linear regression problem ...

0 points | 6 pages | 455.98 KB | 1 year ago
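The cost function $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$ from the excerpt above can be evaluated with a few lines of Python. This is a sketch under one assumption not visible in the excerpt: the hypothesis is taken to be linear with an intercept, $h_\theta(x) = \theta_0 + \sum_{j=1}^{n} \theta_j x_j$, which matches $\theta \in \mathbb{R}^{n+1}$. The function names `hypothesis` and `cost` are illustrative.

```python
def hypothesis(theta, x):
    """Assumed linear hypothesis: theta[0] is the intercept term and
    theta[1:] are the weights for the n features in x."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def cost(theta, xs, ys):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2 over all m samples."""
    return 0.5 * sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys))


# Toy data generated by y = 2x: a perfect fit has zero cost.
xs = [[1.0], [2.0], [3.0]]
ys = [2.0, 4.0, 6.0]
perfect = cost([0.0, 2.0], xs, ys)
worse = cost([0.0, 1.0], xs, ys)
```

With this toy data, `perfect` is zero and `worse` is $\frac{1}{2}(1 + 4 + 9) = 7$, so a lower cost corresponds to a closer fit.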
QCon Beijing 2018 - "From Keyboard Input to Neural Networks: Applications of Deep Learning at Bloomberg" - Biye Li

rights reserved. QCon Beijing, April 21, 2018. Biye Li, Team Manager, Data Technologies Automation; Xiangqian Yu, Team Lead, Derivatives Data. From Keyboards to Neural Networks. © 2018 Bloomberg Finance ... facilitate financial decision-making. © 2018 Bloomberg Finance L.P. All rights reserved. What is Data Technologies Automation? Challenges – Scale of Financial Information: Companies, Market Types, Speed vs. ... Federal Reserve will raise rate to 2% ... Solution – Evolution Over Time: 1990s patt[ern] matc[hin]g, 2000s, 2010, 2016, 2017. Modified from https://commons

0 points | 64 pages | 13.45 MB | 1 year ago

75 results in total