《Efficient Deep Learning Book》[EDL] Chapter 7 - Automation (33 pages | 2.48 MB | 1 year ago)
…these choices are boolean, others have discrete parameters, and still others have continuous parameters. Some choices even have multiple parameters. For example, horizontal flip is a boolean choice, … augment requires multiple parameters. … Figure 7-1: The plethora of choices that we face when training a deep learning model in the computer vision domain. … A Search Space for n parameters is an n-dimensional region such that a point in such a region is a set of well-defined values for each of those parameters. The parameters can take discrete or continuous values. It is called a "search" space because we are searching…
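As a rough sketch of the search-space idea described in the snippet above, the code below defines a small space mixing boolean, discrete, and continuous parameters and samples one point from it. The parameter names and ranges are invented for illustration and are not taken from the book.

```python
import random

# Hypothetical search space: each entry maps a parameter name to its domain.
# Boolean, discrete, and continuous parameters can live in the same space.
search_space = {
    "horizontal_flip": [True, False],   # boolean choice
    "num_layers": [2, 3, 4, 5],         # discrete parameter
    "learning_rate": (1e-4, 1e-1),      # continuous parameter, given as (low, high)
}

def sample_point(space):
    """Draw one point, i.e. a well-defined value for every parameter."""
    point = {}
    for name, domain in space.items():
        if isinstance(domain, tuple):            # continuous range
            low, high = domain
            point[name] = random.uniform(low, high)
        else:                                    # boolean or discrete set
            point[name] = random.choice(domain)
    return point

print(sample_point(search_space))
```
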
《Efficient Deep Learning Book》[EDL] Chapter 1 - Introduction (21 pages | 3.17 MB | 1 year ago)
…model scaled well with the number of labeled examples, since the network had a large number of parameters. Thus, to extract the most out of the setup, the model needed a large number of labeled examples. … trailblazing work, there has been a race to create deeper networks with an ever larger number of parameters and increased complexity. In Computer Vision, several model architectures such as VGGNet, Inception … … intelligence and statistics. JMLR Workshop and Conference Proceedings, 2011. … Figure 1-2: Growth of parameters in Computer Vision and NLP models over time. (Data Source) We have seen a similar effect in the…
Lecture 1: Overview (57 pages | 2.41 MB | 1 year ago)
…to estimate parameters of it. Use these parameters to make predictions for the test data. Such approaches save computation when we make predictions for test data: that is, estimate parameters once, use them … remember all the training data. Linear regression, after getting its parameters, can forget the training data and just use the parameters. They are also opposites w.r.t. statistical properties. NN makes … …getting into trouble. Optimization and Integration: these usually involve finding the best values for some parameters (an optimization problem), or averaging over many plausible values (an integration problem). How…
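The snippet contrasts parametric methods, which estimate parameters once and then discard the training data, with memory-based methods such as nearest neighbours. A minimal numpy sketch of the parametric side, using made-up data, is shown below.

```python
import numpy as np

# Made-up training data: 100 examples with 3 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Estimate the parameters once, via ordinary least squares.
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# The training data can now be forgotten; predictions only need theta.
X_test = rng.normal(size=(5, 3))
y_pred = X_test @ theta
print(theta, y_pred)
```
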
《Efficient Deep Learning Book》[EDL] Chapter 4 - Efficient Architectures (53 pages | 3.92 MB | 1 year ago)
…often straightforward to scale the model quality up or down by increasing or decreasing these two parameters, respectively. The exact sweet spot of embedding table size and model quality needs to be determined … vocabulary size, embedding dimension size, the initializing tensor for the embeddings, and several other parameters. It crucially also supports fine-tuning the table to the task by setting the layer as trainable. … on disk: we can use a smaller vocabulary and see if the resulting quality is within the acceptable parameters. For on-device models, TFLite offers post-training quantization as described in Chapter 2. We could…
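The snippet describes an embedding layer configured by vocabulary size, embedding dimension, an initializing tensor, and a trainable flag. Below is a minimal tf.keras sketch of that configuration; the sizes and the random "pretrained" matrix are placeholders, not values from the book.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes; the book determines the real sweet spot empirically.
vocab_size = 10000      # number of rows in the embedding table
embedding_dim = 64      # number of columns (embedding size)

# Stand-in for a pretrained initializing tensor (random numbers here).
pretrained = np.random.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=True,      # True fine-tunes the table to the task; False freezes it
)

token_ids = tf.constant([[1, 42, 7, 0]])   # a batch of token ids
vectors = embedding(token_ids)             # shape: (1, 4, embedding_dim)
print(vectors.shape)
```
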
Lecture Notes on Gaussian Discriminant Analysis, Naive… (19 pages | 238.80 KB | 1 year ago)
…them share the same denominator P(X = x). Therefore, to perform Bayesian inference, the parameters we have to compute are only P(X = x | Y = y) and P(Y = y). Recalling that, in linear regression … vector x and label y, while we now rely on Bayes' theorem to characterize the relationship through the parameters θ = {P(X = x | Y = y), P(Y = y)}_{x,y}. … 2 Gaussian Discriminant Analysis: In Gaussian Discriminant Analysis … the log-likelihood in Eq. (8) reads
ℓ(ψ, µ0, µ1, Σ) = ∑_{i=1}^{m} log p_{X|Y}(x^{(i)} | y^{(i)}; µ0, µ1, Σ) + ∑_{i=1}^{m} log p_Y(y^{(i)}; ψ),    (8)
where ψ, µ0, µ1, and Σ are the parameters. Substituting Eq. (5)–(7) into Eq. (8) gives us a full expression of ℓ(ψ, µ0, µ1, Σ)…
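The notes are cut off just before the maximization step. For reference only (these are the standard textbook results for GDA with a shared covariance matrix, not a quote from the truncated notes), maximizing Eq. (8) yields the closed-form estimates:

```latex
% Closed-form maximum-likelihood estimates for GDA with a shared covariance matrix.
\hat{\psi} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = 1\}, \qquad
\hat{\mu}_k = \frac{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = k\}\, x^{(i)}}
                   {\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = k\}}, \quad k \in \{0, 1\},
\qquad
\hat{\Sigma} = \frac{1}{m}\sum_{i=1}^{m}
    \bigl(x^{(i)} - \hat{\mu}_{y^{(i)}}\bigr)\bigl(x^{(i)} - \hat{\mu}_{y^{(i)}}\bigr)^{\top}
```
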
《Efficient Deep Learning Book》[EDL] Chapter 2 - Compression Techniques (33 pages | 1.96 MB | 1 year ago)
…model footprint by reducing the number of trainable parameters. However, this approach has two drawbacks. First, it is hard to determine the parameters or layers that can be removed without significantly … layers, and the number of parameters (assuming that the models are well-tuned). If we naively reduce the footprint, we can reduce the number of layers and the number of parameters, but this could hurt the quality. … a function [·] with an input [·] and parameters [·] such that [·]. In the case of a fully-connected layer, [·] is a 2-D matrix. Further, assume that we can train another network with far fewer parameters ([·]) such that the outputs…
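The inline notation in the last fragment was lost in extraction, so the numpy sketch below shows just one common instance of approximating a layer with far fewer parameters, namely a low-rank factorization of a fully-connected weight matrix. The sizes and rank are made up, and the chapter itself may use a different technique at this point.

```python
import numpy as np

# Hypothetical fully-connected weight matrix; the shape is made up for illustration.
W = np.random.randn(1024, 512)                 # 1024 * 512 = 524,288 parameters

# Approximate W with a rank-r factorization A @ B, which has far fewer parameters.
r = 32
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]                           # shape (1024, r)
B = Vt[:r, :]                                  # shape (r, 512)
W_approx = A @ B                               # same shape as W

params_original = W.size                       # 524,288
params_factored = A.size + B.size              # 1024*32 + 32*512 = 49,152
rel_error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(params_original, params_factored, rel_error)
# Note: a random matrix is not compressible, so rel_error is large here; trained
# weight matrices are often approximately low rank, which is what makes this useful.
```
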
PyTorch Tutorial (38 pages | 4.09 MB | 1 year ago)
…weights. • Imagine updating 100k parameters! • An optimizer takes the parameters we want to update, the learning rate we want to use (and possibly many other hyper-parameters as well!) and performs the updates. … Two components: • __init__(self): defines the parts that make up the model; in our case, two parameters, a and b. • forward(self, x): performs the actual computation, that is, it outputs a prediction. … • model.state_dict(): returns a dictionary of trainable parameters with their current values. • model.parameters(): returns a list of all trainable parameters in the model. • model.train() or model.eval(). Putting…
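A minimal PyTorch sketch of what this snippet describes: a model with two parameters a and b, a forward() that outputs a prediction, an optimizer built from model.parameters() and a learning rate, and the state_dict()/parameters()/train()/eval() calls. The class name and learning rate are illustrative, not necessarily the tutorial's own.

```python
import torch
import torch.nn as nn

class TwoParameterModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Two trainable parameters, a and b, as in the slide.
        self.a = nn.Parameter(torch.randn(1))
        self.b = nn.Parameter(torch.randn(1))

    def forward(self, x):
        # The actual computation: output a prediction for x.
        return self.a + self.b * x

model = TwoParameterModel()
# The optimizer receives the parameters to update plus the learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

print(model.state_dict())        # dictionary of trainable parameters and their values
print(list(model.parameters()))  # list of all trainable parameters
model.train()                    # training mode; use model.eval() for inference
```
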
Machine Learning Pytorch Tutorial (48 pages | 584.86 KB | 1 year ago)
[Diagram: a fully-connected layer computing y = Wx + b, with x of size 32, y of size 64, and W of shape 64×32.] torch.nn – Network Parameters ● Linear Layer (Fully-connected Layer): >>> layer = torch.nn.Linear(32, 64) >>> layer.weight … algorithms that adjust network parameters to reduce error (see the Adaptive Learning Rate lecture video). ● E.g. Stochastic Gradient Descent (SGD): torch.optim.SGD(model.parameters(), lr, momentum=0) … optimizer = torch.optim.SGD(model.parameters(), lr, momentum=0) ● For every batch of data: 1. Call optimizer.zero_grad() to reset the gradients of the model parameters. 2. Call loss.backward() to backpropagate…
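Putting the pieces from this snippet together, a minimal training loop might look as follows; the data, loss function, and batch size are invented for illustration.

```python
import torch
import torch.nn as nn

# Illustrative model, loss, and optimizer; dimensions follow the slide (32 -> 64).
model = nn.Linear(32, 64)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0)

# Made-up "dataset": a list of (x, y) batches.
dataset = [(torch.randn(8, 32), torch.randn(8, 64)) for _ in range(10)]

for x, y in dataset:
    optimizer.zero_grad()        # 1. reset gradients of the model parameters
    pred = model(x)              # 2. forward pass
    loss = criterion(pred, y)    # 3. compute the loss
    loss.backward()              # 4. backpropagate to compute gradients
    optimizer.step()             # 5. update the parameters with the optimizer
```
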
《Efficient Deep Learning Book》[EDL] Chapter 3 - Learning Techniques (56 pages | 18.93 MB | 1 year ago)
…to the model performance. They are also likely to boost the performance of smaller models (fewer parameters, layers, etc.). Concretely, we want to find the smallest model which, when trained with the learning … training process. The train() function is simple. It takes the model, training set, and validation set as parameters. It also has two hyperparameters: batch_size and epochs. We use a small batch size because our … hard labels, and [·] denotes the distillation loss function which uses the soft labels. [·] and [·] are hyper-parameters that weigh the two loss functions appropriately. When [·] and [·], the student model is trained with…
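The weighting symbols in the last fragment were lost in extraction. As a hedged sketch of the general idea, a weighted sum of a hard-label loss and a soft-label distillation loss, here is one common formulation in PyTorch; the names alpha and temperature are assumptions, and the chapter's own (TensorFlow-based) code may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # alpha weighs the two terms; alpha = 1 recovers plain hard-label training.
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```
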
【PyTorch深度学习-龙龙老师】-测试版202112 (439 pages | 29.91 MB | 1 year ago)
…# create the optimizer and pass it the list of parameters to optimize: [w1, b1, w2, b2, w3, b3] # set the learning rate optimizer = optim.SGD(model.parameters(), lr=0.01) train_loss = [] for epoch in range(5): # train for 5 epochs for batch_idx, (x, … …the class's parameters function returns the list of parameters to be optimized; the code is as follows: In [5]: for p in fc.parameters(): print(p.shape) Out[5]: # the list of parameters to be optimized torch.Size([512, 784]) torch.Size([512]) In fact, besides keeping the list of trainable tensors (parameters), some layers also contain tensors that do not take part in gradient … … the named_buffers function returns the list of all parameters that do not need to be optimized. Besides obtaining the anonymous list of trainable tensors through the parameters function, you can also obtain the names and objects of the trainable tensors through the member function named_parameters, for example: In [6]: # list all named parameters for name, p in fc.named_parameters(): print(name, p.shape) Out[6]:…
共 36 条













