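The Ultimate Summary: Nine Common Optimization Techniques in Neural Networks (Part 1): Feature Scaling, Batch Normalization, Gradient Descent, and Momentum-Based Gradient Descent (Very Detailed)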

Published: 2024-11-23 11:45

Backpropagation computes the gradients that gradient descent uses to update the network weights.

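Optimization techniques in neural networks are essential for improving model performance, reducing error, and making training more efficient.

The goal of optimization is to adjust the model parameters so that the loss function is minimized, which gives the model its best achievable performance.

Below are some of the optimization techniques commonly used in neural networks.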

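1. Feature Scaling

Feature scaling rescales the input features so that they share a comparable range (for example, 0 to 1). This avoids unbalanced gradients during training caused by features whose scales differ widely. It matters especially for gradient descent and related optimizers, because it speeds up convergence.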

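Common methods:

Standardization: transforms each feature to have a mean of 0 and a standard deviation of 1. This shifts and rescales the data; it does not by itself make the distribution normal.

Normalization: compresses each feature into a specified range (such as [0, 1]).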
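To make the two methods concrete, here is a minimal NumPy sketch. The function names `standardize` and `min_max_normalize` and the toy height/weight values are illustrative assumptions, not part of any library.

```python
import numpy as np

# Toy data (illustrative values): height in cm, weight in kg.
X = np.array([[170, 70], [160, 65], [180, 80], [175, 75], [165, 68]], dtype=float)

def standardize(X):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_normalize(X):
    """Rescale each column linearly so its minimum maps to 0 and its maximum to 1."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X_std = standardize(X)
X_norm = min_max_normalize(X)

print("Standardized mean / std per column:", X_std.mean(axis=0), X_std.std(axis=0))
print("Normalized min / max per column:", X_norm.min(axis=0), X_norm.max(axis=0))
```

Both functions compute their statistics per column, so each feature is scaled independently of the others. In practice the statistics would be computed on the training set only and reused for validation and test data.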

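Advantages

Speeds up the convergence of gradient-descent-based optimizers.

Prevents some features from having a disproportionate influence on the model.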

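Disadvantages

Possible loss of information
If the absolute magnitude of a feature is meaningful in itself, scaling may discard part of that information.

Not suitable for categorical features
Feature scaling targets continuous features and does not apply to categorical data.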

Code example

    import numpy as np
    import tensorflow as tf
    from sklearn.preprocessing import StandardScaler

    # Dataset: height (cm) and weight (kg)
    data = np.array([[170, 70],   # Person 1
                     [160, 65],   # Person 2
                     [180, 80],   # Person 3
                     [175, 75],   # Person 4
                     [165, 68]])  # Person 5

    # Separate features: height and weight
    height = data[:, 0].reshape(-1, 1)  # Height in cm
    weight = data[:, 1].reshape(-1, 1)  # Weight in kg

    # Combine both features
    X = np.hstack((height, weight))

    # Feature scaling using standardization
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    print("Original Features (Height in cm, Weight in kg):\n", X)
    print("\nScaled Features (Standardized):\n", X_scaled)

    # TensorFlow model that takes the two scaled features as input
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(2, input_shape=(2,), activation='relu'),
        tf.keras.layers.Dense(1, activation='linear')
    ])

    # Model summary
    model.summary()


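2. Batch Normalization

Batch normalization is a technique commonly used when training neural networks to speed up training and make the model more stable.

It normalizes the outputs of a layer over each mini-batch, so that the inputs to the next layer have a more stable distribution.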
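The normalization step can be sketched by hand in NumPy. This is a simplified training-mode forward pass under stated assumptions: the function name `batch_norm_forward` and the value `eps=1e-5` are illustrative choices, and the scale `gamma` and shift `beta` are fixed here although they would be learned during training.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for an array of shape (batch, features)."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize to roughly zero mean, unit variance
    return gamma * x_hat + beta               # rescale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))  # activations far from zero mean / unit variance
gamma = np.ones(10)    # scale, initialized to 1
beta = np.zeros(10)    # shift, initialized to 0

out = batch_norm_forward(x, gamma, beta)
print("Per-feature mean:", out.mean(axis=0).round(3))
print("Per-feature std: ", out.std(axis=0).round(3))
```

At inference time, real implementations replace the batch statistics with running averages collected during training; that part is left out of this sketch.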

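Advantages

Faster training
Batch normalization stabilizes gradient updates and reduces oscillation during training, so the model converges faster.

It also helps prevent vanishing and exploding gradients in deep networks.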

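Disadvantages

Normalizing and rescaling at every layer adds computational cost during training.

It depends on the batch size: with very small batches the batch statistics are noisy, and performance may suffer.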

Code example

    import tensorflow as tf

    # Input data
    X = tf.random.normal(shape=(4, 10))  # 4 samples, 10 features each

    # Batch normalization layer
    bn = tf.keras.layers.BatchNormalization()

    # Apply batch normalization (training=True uses the statistics of this batch)
    X_bn = bn(X, training=True)

    print("Batch Normalized Data:\n", X_bn)

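3. Gradient Descent

Gradient descent is the most basic optimization algorithm. It minimizes the loss function by repeatedly adjusting the parameters.

The core idea is to compute the gradient of the loss with respect to each parameter and update the parameters in the direction of the negative gradient until the loss converges.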

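Formula (the standard update rule):

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

where $\theta$ denotes the parameters, $\eta$ the learning rate, and $J$ the loss function.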
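As a minimal sketch of the update rule (subtract the learning rate times the gradient at every step), the loop below minimizes a simple quadratic loss in NumPy. The loss function, its minimum at `[3, -2]`, the learning rate, and the step count are all illustrative assumptions.

```python
import numpy as np

TARGET = np.array([3.0, -2.0])  # location of the minimum (illustrative)

def loss(theta):
    """Quadratic bowl: J(theta) = ||theta - TARGET||^2."""
    return np.sum((theta - TARGET) ** 2)

def grad(theta):
    """Gradient of the loss above with respect to theta."""
    return 2.0 * (theta - TARGET)

theta = np.zeros(2)     # initial parameters
learning_rate = 0.1

for step in range(100):
    theta = theta - learning_rate * grad(theta)   # move against the gradient

print("theta:", theta)
print("loss:", loss(theta))
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so `theta` ends up very close to `[3, -2]`. A learning rate that is too large would instead make the iterates overshoot and diverge.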

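Variants of gradient descent

Batch gradient descent
Uses the entire training set to compute the gradient for each parameter update. Suitable for small datasets.

Stochastic gradient descent (SGD)
Uses a single sample to compute the gradient for each update. Suitable for large datasets, but convergence is noisier.

Mini-batch gradient descent
Uses a subset of samples (a mini-batch) in each iteration to compute the gradient. It is a compromise between the batch and stochastic variants and trains more efficiently.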
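The three variants differ only in how many samples feed each gradient estimate. The NumPy sketch below makes that explicit with a single `batch_size` argument; the function `gradient_descent`, the noiseless linear-regression data, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = X @ w with a mean-squared-error loss.

    batch_size == len(X) -> batch gradient descent (one update per epoch)
    batch_size == 1      -> stochastic gradient descent (one update per sample)
    anything in between  -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)            # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # MSE gradient on this batch
            w = w - learning_rate * grad
    return w

# Noiseless toy regression problem with known weights.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

results = {
    "batch": gradient_descent(X, y, batch_size=len(X)),
    "stochastic": gradient_descent(X, y, batch_size=1),
    "mini-batch": gradient_descent(X, y, batch_size=20),
}
for name, w in results.items():
    print(f"{name:>10}: {w}")
```

Because the targets here contain no noise, all three variants should recover weights close to `true_w`. On real data, the stochastic variant typically keeps fluctuating around the minimum unless the learning rate is decayed.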

Code example

The following is a code example of mini-batch gradient descent.

    import numpy as np
    import tensorflow as tf

    # Generate a random dataset of 1000 samples and 2 features
    np.random.seed(42)
    X = np.random.randn(1000, 2)  # 1000 samples, 2 features
    y = np.random.randn(1000, 1)  # 1000 targets

    # Define a simple model
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, input_shape=(2,), activation='relu'),
        tf.keras.layers.Dense(1, activation='linear')
    ])

    # Loss function and optimizer
    loss_fn = tf.keras.losses.MeanSquaredError()
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    # Training step using mini-batch gradient descent
    batch_size = 50

    def train_step(X_batch, y_batch):
        with tf.GradientTape() as tape:
            predictions = model(X_batch)
            loss = loss_fn(y_batch, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        return loss

    # Training loop
    epochs = 1000
    num_batches = X.shape[0] // batch_size  # 1000 samples, batch size = 50 -> 20 batches

    for epoch in range(epochs):
        for i in range(num_batches):
            X_batch = X[i*batch_size:(i+1)*batch_size]
            y_batch = y[i*batch_size:(i+1)*batch_size]
            loss = train_step(X_batch, y_batch)
        if (epoch + 1) % 100 == 0:
            # Reports the loss of the last mini-batch in this epoch
            print(f'Epoch {epoch+1}, Loss: {loss.numpy()}')

    # Evaluate the model
    print("\nModel weights after training:", model.get_weights())


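4. Momentum-Based Gradient Descent

Momentum-based gradient descent improves on standard gradient descent by taking into account both the current gradient and the previous update direction.

It accumulates past gradients and gives more weight to consistent directions, which speeds up convergence and reduces oscillation.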

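Formula (one common form, and the one `tf.keras.optimizers.SGD` applies when `momentum` is greater than 0):

$$v_{t+1} = \beta \, v_t - \eta \, \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

where $v$ is the velocity (the accumulated update), $\beta$ the momentum coefficient, $\eta$ the learning rate, $\theta$ the parameters, and $J$ the loss function.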
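To illustrate the effect of the accumulated velocity, the NumPy sketch below runs the same number of steps with and without momentum on an ill-conditioned quadratic loss. The update `velocity = momentum * velocity - learning_rate * grad` is one common formulation; the loss, its curvatures, the starting point, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# Ill-conditioned quadratic: f(w) = 0.5 * sum(CURVATURE * w**2).
# The second direction is 50 times steeper than the first.
CURVATURE = np.array([1.0, 50.0])

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def grad(w):
    return CURVATURE * w

def run(momentum, learning_rate=0.01, steps=100):
    """Run gradient descent with the given momentum coefficient; return the final loss."""
    w = np.array([1.0, 1.0])
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # Blend the previous update direction with the current gradient step.
        velocity = momentum * velocity - learning_rate * grad(w)
        w = w + velocity
    return loss(w)

loss_plain = run(momentum=0.0)      # momentum = 0 reduces to plain gradient descent
loss_momentum = run(momentum=0.9)

print("Final loss without momentum:", loss_plain)
print("Final loss with momentum:   ", loss_momentum)
```

On this toy problem the learning rate has to stay small because of the steep direction, which makes plain gradient descent slow along the shallow one. The momentum run builds up speed along the shallow direction and should end at a lower loss after the same number of steps.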

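Advantages

Momentum-based gradient descent speeds up convergence, and the effect is especially noticeable in deep neural networks.

In high-dimensional spaces, momentum reduces oscillation of the parameters across directions, particularly in regions where the gradient changes sharply.

Disadvantages

It requires tuning: the momentum coefficient (typically 0.9) has to be set by experiment. Too large a value makes optimization unstable, and too small a value weakens the benefit.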

Code example

    import numpy as np
    import tensorflow as tf

    # Example data
    X = np.array([[1], [2], [3], [4]], dtype=np.float32)
    y = np.array([[2], [4], [6], [8]], dtype=np.float32)

    # Model and optimizer with momentum
    model = tf.keras.Sequential([tf.keras.layers.Dense(units=1)])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
    loss_fn = tf.keras.losses.MeanSquaredError()

    # Training with momentum
    def train_step(X_batch, y_batch):
        with tf.GradientTape() as tape:
            predictions = model(X_batch)
            loss = loss_fn(y_batch, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        return loss

    # Train the model
    for epoch in range(1000):
        train_step(X, y)

    print("Weights after training:", model.weights)
