Lecture 5: Gaussian Discriminant Analysis, Naive Bayes

Conditional Probability

Conditional probability: the probability that event A is true given that event B is true

    P(A | B) = P(A, B) / P(B),    equivalently    P(A, B) = P(A | B) P(B)

Corollary (the chain rule):

    P(A1, A2, ..., An) = ∏_{k=1}^{n} P(Ak | A1, A2, ..., A_{k-1})

Example: P(A4, A3, A2, A1) = P(A4 | A3, A2, A1) P(A3 | A2, A1) P(A2 | A1) P(A1)

Bayes' Theorem

Bayes' theorem (or Bayes' rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event:

    P(A | B) = P(B | A) P(A) / P(B)

Gradient at a Constrained Maximum

Let r(t) = (x(t), y(t), z(t)) be a curve with r(0) = (x(0), y(0), z(0)) = q. Suppose h(t) = f(x(t), y(t), z(t)) is such that h(t) has a maximum at t = 0. By the chain rule,

    h'(t) = ∇f|_{r(t)} · r'(t)

Since t = 0 is a local maximum, we have

    h'(0) = ∇f|_q · r'(0) = 0,

so ∇f|_q is orthogonal to the tangent vector r'(0).
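As a quick numerical sanity check of the probability identities above, the following minimal Python sketch verifies P(A, B) = P(A | B) P(B) and Bayes' rule on a small discrete joint distribution. The snippet is not part of the slides; the joint probability table and variable names are illustrative assumptions.

import numpy as np

# Joint distribution P(A, B) over two binary events; rows index A, columns index B.
# Values are an arbitrary illustrative table that sums to 1.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_a = joint.sum(axis=1)          # marginal P(A)
p_b = joint.sum(axis=0)          # marginal P(B)

# Conditional probability: P(A | B) = P(A, B) / P(B)
p_a_given_b = joint / p_b        # each column divided by the corresponding P(B = b)

# Chain rule for two events: P(A, B) = P(A | B) P(B)
assert np.allclose(p_a_given_b * p_b, joint)

# Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)
p_b_given_a = joint / p_a[:, None]
bayes = p_b_given_a * p_a[:, None] / p_b
assert np.allclose(bayes, p_a_given_b)

print(p_a_given_b)               # columns sum to 1, as a conditional distribution should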
 Experiment 1: Linear Regressionperformed iteratively, and in each iteration, we update parameter θ according to the the following rule θj := θj − α 1 m m � i=1 (hθ(x(i)) − y(i))x(i) j (3) where α is so-called “learning rate” based But since in this example we have only one feature, being able to plot this gives a nice sanity-check on our result. (3) Finally, we’d like to make some predictions using the learned hypothesis. Use columns(z) and y = 1 : rows(z). Therefore, z(i, j) is actually calculated based on x(j) and y(i). This rule is also applicable to the contour function. We can specify the number and the distribution of contours0 码力 | 7 页 | 428.11 KB | 1 年前3 Experiment 1: Linear Regressionperformed iteratively, and in each iteration, we update parameter θ according to the the following rule θj := θj − α 1 m m � i=1 (hθ(x(i)) − y(i))x(i) j (3) where α is so-called “learning rate” based But since in this example we have only one feature, being able to plot this gives a nice sanity-check on our result. (3) Finally, we’d like to make some predictions using the learned hypothesis. Use columns(z) and y = 1 : rows(z). Therefore, z(i, j) is actually calculated based on x(j) and y(i). This rule is also applicable to the contour function. We can specify the number and the distribution of contours0 码力 | 7 页 | 428.11 KB | 1 年前3
 《Efficient Deep Learning Book》[EDL] Chapter 6 - Advanced Learning Techniques - Technical Reviewpre-trained model. We will use this pre-processing layer to tokenize our training and test datasets. # Check out the TF hub website for more preprocessors preprocessor = hub.KerasLayer( 'https://tfhub.dev of the i-th layer, , which is the gradient for that layer’s weight. Let’s start by using the chain rule, to compute the partial derivative of the loss function with respect to as follows: And from the can calculate which is simply . More generally, we can calculate , and from that using the chain rule again. As you can see, if the network has a large number of layers and the weights25 have small0 码力 | 31 页 | 4.03 MB | 1 年前3 《Efficient Deep Learning Book》[EDL] Chapter 6 - Advanced Learning Techniques - Technical Reviewpre-trained model. We will use this pre-processing layer to tokenize our training and test datasets. # Check out the TF hub website for more preprocessors preprocessor = hub.KerasLayer( 'https://tfhub.dev of the i-th layer, , which is the gradient for that layer’s weight. Let’s start by using the chain rule, to compute the partial derivative of the loss function with respect to as follows: And from the can calculate which is simply . More generally, we can calculate , and from that using the chain rule again. As you can see, if the network has a large number of layers and the weights25 have small0 码力 | 31 页 | 4.03 MB | 1 年前3
 PyTorch Tutorialwhatever device (cuda or cpu) • Fallback to cpu if gpu is unavailable: • torch.cuda.is_available() • Check cpu/gpu tensor OR numpy array ? • type(t) or t.type() • returns • numpy.ndarray • torch.Tensor • Autograd • Automatic Differentiation Package • Don’t need to worry about partial differentiation, chain rule etc.. • backward() does that • loss.backward() • Gradients are accumulated for each step by default:0 码力 | 38 页 | 4.09 MB | 1 年前3 PyTorch Tutorialwhatever device (cuda or cpu) • Fallback to cpu if gpu is unavailable: • torch.cuda.is_available() • Check cpu/gpu tensor OR numpy array ? • type(t) or t.type() • returns • numpy.ndarray • torch.Tensor • Autograd • Automatic Differentiation Package • Don’t need to worry about partial differentiation, chain rule etc.. • backward() does that • loss.backward() • Gradients are accumulated for each step by default:0 码力 | 38 页 | 4.09 MB | 1 年前3
 深度学习与PyTorch入门实战 - 20. 链式法则Derivative Rules Basic Rule ▪ ? + ? ▪ ? − ? Product rule ▪ ?? ′ = ?′? + ??′ ▪ ?4′ = ?2 ∗ ?2 ′ = 2? ∗ ?2 + ?2 ∗ 2? = 4?3 Quotient Rule ▪ ? ? = ?′?+??′ ?2 ▪ e.g. Softmax Chain rule ▪ ?? ?? = ?? 1 ▪ ??2 ??1 = ??(?1) ??1 = ??(?1) ?y1 ??1 ??1 = ?2 ∗ ? ▪ ?2 = (??1 + ?1) ∗ w2 + b2 Chain rule ▪ ?? ???? ? = ?? ??? 1 ??? 1 ?? = ?? ??? 2 ??? 2 ??? 1 ??? 1 ?? ∑ E ?? ∑ ???0 码力 | 10 页 | 610.60 KB | 1 年前3 深度学习与PyTorch入门实战 - 20. 链式法则Derivative Rules Basic Rule ▪ ? + ? ▪ ? − ? Product rule ▪ ?? ′ = ?′? + ??′ ▪ ?4′ = ?2 ∗ ?2 ′ = 2? ∗ ?2 + ?2 ∗ 2? = 4?3 Quotient Rule ▪ ? ? = ?′?+??′ ?2 ▪ e.g. Softmax Chain rule ▪ ?? ?? = ?? 1 ▪ ??2 ??1 = ??(?1) ??1 = ??(?1) ?y1 ??1 ??1 = ?2 ∗ ? ▪ ?2 = (??1 + ?1) ∗ w2 + b2 Chain rule ▪ ?? ???? ? = ?? ??? 1 ??? 1 ?? = ?? ??? 2 ??? 2 ??? 1 ??? 1 ?? ∑ E ?? ∑ ???0 码力 | 10 页 | 610.60 KB | 1 年前3
 Experiment 2: Logistic Regression and Newton's Methodobjective function is gradient descent algorithm, where we update θ iteratively according to the following rule θ ← θ − α∇θL(θ) (6) until the difference between the objective function values in successive iterations Newton’s Method Our goal is to use Newton’s method to minimize this function. Recall that the update rule for Newton’s method is θ(t+1) = θ(t) − H−1∇θL In logistic regression, the Hessian is H = 1 m0 码力 | 4 页 | 196.41 KB | 1 年前3 Experiment 2: Logistic Regression and Newton's Methodobjective function is gradient descent algorithm, where we update θ iteratively according to the following rule θ ← θ − α∇θL(θ) (6) until the difference between the objective function values in successive iterations Newton’s Method Our goal is to use Newton’s method to minimize this function. Recall that the update rule for Newton’s method is θ(t+1) = θ(t) − H−1∇θL In logistic regression, the Hessian is H = 1 m0 码力 | 4 页 | 196.41 KB | 1 年前3
 Lecture Notes on Linear Regression@✓n ]T (2) denote the gradient of J(✓). In each iteration, we update ✓ according to the following rule: ✓ ✓ � ↵rJ(✓) (3) where ↵ is a step size. In more details, ✓j ✓j � ↵@J(✓) @✓j (4) The update model, rJ(✓; x(i), y(i)) is defined as rJ(✓; x(i), y(i)) = (✓T x(i) � y(i))x(i) (6) and the update rule is ✓j ✓j � ↵(✓T x(i) � y(i))x(i) j (7) Algorithm 2: Stochastic Gradient Descent for Linear Regression0 码力 | 6 页 | 455.98 KB | 1 年前3 Lecture Notes on Linear Regression@✓n ]T (2) denote the gradient of J(✓). In each iteration, we update ✓ according to the following rule: ✓ ✓ � ↵rJ(✓) (3) where ↵ is a step size. In more details, ✓j ✓j � ↵@J(✓) @✓j (4) The update model, rJ(✓; x(i), y(i)) is defined as rJ(✓; x(i), y(i)) = (✓T x(i) � y(i))x(i) (6) and the update rule is ✓j ✓j � ↵(✓T x(i) � y(i))x(i) j (7) Algorithm 2: Stochastic Gradient Descent for Linear Regression0 码力 | 6 页 | 455.98 KB | 1 年前3
 Lecture 4: Regularization and Bayesian StatisticsBayes Rule p(θ | D) = p(θ)p(D | θ) p(D) p(θ): Prior probability of θ (without having seen any data) p(D): Probability of the data (independent of θ) p(D) = � θ p(θ)p(D | θ)dθ The Bayes Rule lets0 码力 | 25 页 | 185.30 KB | 1 年前3 Lecture 4: Regularization and Bayesian StatisticsBayes Rule p(θ | D) = p(θ)p(D | θ) p(D) p(θ): Prior probability of θ (without having seen any data) p(D): Probability of the data (independent of θ) p(D) = � θ p(θ)p(D | θ)dθ The Bayes Rule lets0 码力 | 25 页 | 185.30 KB | 1 年前3
 Lecture 2: Linear Regressionlim h→0 g(h) − g(0) h = lim h→0 f (x + hu) − g(0) h = ∇uf (x) (1) On the other hand, by the chain rule, g′(h) = n � i=1 f ′ i (x) d dh(xi + hui) = n � i=1 f ′ i (x)ui (2) Let h = 0, then g′(0) = GD Algorithm (Contd.) In more details, we update each component of θ according to the fol- lowing rule θj ← θj − α∂J(θ) ∂θj , ∀j = 0, 1, · · · , n Calculating the gradient for linear regression ∂J(θ)0 码力 | 31 页 | 608.38 KB | 1 年前3 Lecture 2: Linear Regressionlim h→0 g(h) − g(0) h = lim h→0 f (x + hu) − g(0) h = ∇uf (x) (1) On the other hand, by the chain rule, g′(h) = n � i=1 f ′ i (x) d dh(xi + hui) = n � i=1 f ′ i (x)ui (2) Let h = 0, then g′(0) = GD Algorithm (Contd.) In more details, we update each component of θ according to the fol- lowing rule θj ← θj − α∂J(θ) ∂θj , ∀j = 0, 1, · · · , n Calculating the gradient for linear regression ∂J(θ)0 码力 | 31 页 | 608.38 KB | 1 年前3
 机器学习课程-温州大学-时间序列总结重采样方法(resample) Pandas中的resample()是一个对常规时间序 列数据重新采样和频率转换的便捷的方法。 resample(rule, how=None, axis=0, fill_method=None, clo sed=None, label=None, ...) ➢ rule -- 表示重采样频率的字符串或DateOffset。 ➢ fill_method -- 表示升采样时如何插值。0 码力 | 67 页 | 1.30 MB | 1 年前3 机器学习课程-温州大学-时间序列总结重采样方法(resample) Pandas中的resample()是一个对常规时间序 列数据重新采样和频率转换的便捷的方法。 resample(rule, how=None, axis=0, fill_method=None, clo sed=None, label=None, ...) ➢ rule -- 表示重采样频率的字符串或DateOffset。 ➢ fill_method -- 表示升采样时如何插值。0 码力 | 67 页 | 1.30 MB | 1 年前3
共 25 条
- 1
- 2
- 3













