DeepLearning.ai Assignment (2-1): Practical Aspects of Deep Learning


title: 'DeepLearning.ai Assignment (2-1): Practical Aspects of Deep Learning'
tags:
  - homework
categories:
  - AI
  - Deep Learning
date: 2018-09-15 15:58:33
id: 2018091515
  1. Don't copy this as your homework!
  2. I am only writing down my approach here, for my own study.
  3. Don't copy this as your homework!

First published on my personal blog:, welcome to visit.

This week's assignment has three parts:

  • Parameter initialization
  • Regularization (L2 and dropout)
  • Gradient checking

Part 1: Initialization

This part shows how different parameter initializations affect training.

First, the model function is as follows:

```python
import numpy as np
import matplotlib.pyplot as plt
# Helper functions such as forward_propagation, compute_loss, backward_propagation
# and update_parameters are provided by the course notebook; the three
# initialize_parameters_* functions are defined below.

def model(X, Y, learning_rate=0.01, num_iterations=15000, print_cost=True, initialization="he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros", "random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """

    grads = {}
    costs = []                              # to keep track of the loss
    m = X.shape[1]                          # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]

    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)

        # Loss
        cost = compute_loss(a3, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
```

1. Zero Initialization

Setting all the parameters to zero gives the obvious result: nothing changes during training, because every unit computes the same output and the symmetry between them is never broken. The initializer is sketched below.
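For completeness, here is a minimal sketch of the zeros initializer, written with the same signature as the two initializers shown later; the assignment's exact code may differ slightly.

```python
import numpy as np

def initialize_parameters_zeros(layers_dims):
    """Initialize every W and b to zeros -- included only to show why this fails."""
    parameters = {}
    L = len(layers_dims)    # number of layers, counting the input layer
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```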

2. Random initialization

Here W is initialized randomly but scaled up by a factor of 10. The initial weights are therefore too large, the output activations saturate, and convergence is slow.

```python
def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)    # This seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims) # integer representing the number of layers

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters
```

The result is mediocre.

3. He initialization

W is initialized randomly and then scaled by the factor sqrt(2./layers_dims[l-1]).

```python
def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1    # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters
```

The result is very good.
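For reference, the comparison can be driven by something like the sketch below; train_X, train_Y and predict come from the course's helper code, so the exact names are assumptions here.

```python
# Hypothetical comparison loop over the three initialization schemes,
# assuming train_X, train_Y and predict() are loaded from the course helpers.
for init in ["zeros", "random", "he"]:
    print("Initialization:", init)
    parameters = model(train_X, train_Y, initialization=init)
    predictions_train = predict(train_X, train_Y, parameters)
```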

Part 2: Regularization

The dataset:

The model function:

```python
def model(X, Y, learning_rate=0.3, num_iterations=30000, print_cost=True, lambd=0, keep_prob=1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.

    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """

    grads = {}
    costs = []                              # to keep track of the cost
    m = X.shape[1]                          # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]

    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # Backward propagation.
        assert(lambd == 0 or keep_prob == 1)    # it is possible to use both L2 regularization and dropout,
                                                # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
```

Without any regularization, the result looks like this:

L2 regularization

Computing the cost function

$$
J_{regularized} = \underbrace{-\frac{1}{m} \sum\limits_{i=1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1-a^{[L](i)}\right) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m}\frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}}_{\text{L2 regularization cost}}
$$

The formula is given above; we only need to add the second term.

Use np.sum(np.square(Wl)) to compute $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$.

```python
# GRADED FUNCTION: compute_cost_with_regularization

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)    # This gives you the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = lambd / (m * 2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost
```

Computing the backward propagation

Each $dW$ gets the extra regularization term $\frac{\lambda}{m} W$.

```python
# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y

    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + lambd / m * W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + lambd / m * W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + lambd / m * W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients
```

With the L2 regularization term added, the improvement is obvious:
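A rough sketch of how the regularized model is trained and evaluated; lambd=0.7 is the value I recall the notebook using, and train_X, test_X and predict come from its helper code, so treat the specifics as assumptions.

```python
# Hypothetical training call with L2 regularization enabled.
parameters = model(train_X, train_Y, lambd=0.7)
predictions_train = predict(train_X, train_Y, parameters)
predictions_test = predict(test_X, test_Y, parameters)
```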

Dropout

In every training iteration, each neuron is dropped with a certain probability.

1. Forward propagation with dropout

There are four steps:

  1. For each layer, create a mask $d^{[l]}$ matching $a^{[l]}$. With m examples this becomes a matrix $D^{[1]} = [d^{[1](1)} \, d^{[1](2)} \, \dots \, d^{[1](m)}]$ of the same dimension as $A^{[1]}$; build it with np.random.rand(n, m).
  2. Turn $D^{[l]}$ into 0s and 1s by thresholding: entries $< keep\_prob$ become 1, the others 0.
  3. Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$ (shut down the dropped neurons).
  4. Divide $A^{[1]}$ by keep_prob, so the expected value of the activations stays the same (inverted dropout).

Remember to store each layer's D in the cache.

```python
# GRADED FUNCTION: forward_propagation_with_dropout

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)       # Steps 1-4 below correspond to the Steps 1-4 described above.
    D1 = np.random.rand(A1.shape[0], A1.shape[1])   # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = D1 < keep_prob                             # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                    # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                             # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])   # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = D2 < keep_prob                             # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                    # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                             # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache
```

2. Backward propagation with dropout

  1. Reapply the same mask $D^{[1]}$ to dA1.
  2. Divide dA1 by keep_prob.

During backpropagation, the neurons that were dropped in the forward pass must stay at zero, and dA must again be divided by keep_prob: dA = np.dot(W.T, dZ) has not yet been scaled by that factor on the backward side.

```python
# GRADED FUNCTION: backward_propagation_with_dropout

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2          # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1          # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients
```

The final result is also quite good:

Note:

  • Dropout is itself a form of regularization.
  • Use it only during training, never at test time (see the sketch after this list).
  • Apply it in both forward and backward propagation.
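A minimal sketch of the train/test split, assuming the notebook's train_X, train_Y, test_X and the dropout-free forward_propagation are in scope; keep_prob=0.86 is the value I remember the notebook using, so treat it as an assumption.

```python
# Training goes through the dropout forward/backward passes inside model(...).
parameters = model(train_X, train_Y, keep_prob=0.86, learning_rate=0.3)

# At test time no neurons are dropped: predictions use the plain forward pass,
# and no extra scaling is needed because inverted dropout already rescaled
# the activations during training.
a3_test, _ = forward_propagation(test_X, parameters)
predictions_test = (a3_test > 0.5)
```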

Part 3: Gradient Checking

First, gradient checking in one dimension:

```python
# GRADED FUNCTION: forward_propagation

def forward_propagation(x, theta):
    """
    Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    J -- the value of function J, computed using the formula J(theta) = theta * x
    """

    ### START CODE HERE ### (approx. 1 line)
    J = np.dot(theta, x)
    ### END CODE HERE ###

    return J
```
```python
# GRADED FUNCTION: backward_propagation

def backward_propagation(x, theta):
    """
    Computes the derivative of J with respect to theta (see Figure 1).

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    dtheta -- the gradient of the cost with respect to theta
    """

    ### START CODE HERE ### (approx. 1 line)
    dtheta = x
    ### END CODE HERE ###

    return dtheta
```

According to the formula:

$$
difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2}
$$

The steps are:

  1. $\theta^{+} = \theta + \varepsilon$
  2. $\theta^{-} = \theta - \varepsilon$
  3. $J^{+} = J(\theta^{+})$
  4. $J^{-} = J(\theta^{-})$
  5. $gradapprox = \frac{J^{+} - J^{-}}{2\varepsilon}$
```python
def gradient_check(x, theta, epsilon=1e-7):
    """
    Implement the backward propagation presented in Figure 1.

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                        # Step 1
    thetaminus = theta - epsilon                       # Step 2
    J_plus = forward_propagation(x, thetaplus)         # Step 3
    J_minus = forward_propagation(x, thetaminus)       # Step 4
    gradapprox = (J_plus - J_minus) / (2 * epsilon)    # Step 5
    ### END CODE HERE ###

    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###

    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                       # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)     # Step 2'
    difference = numerator / denominator                                # Step 3'
    ### END CODE HERE ###

    if difference < 1e-7:
        print("The gradient is correct!")
    else:
        print("The gradient is wrong!")

    return difference
```

In the N-dimensional case:

```python
def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation (and computes the cost) presented in Figure 3.

    Arguments:
    X -- training set for m examples
    Y -- labels for m examples
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (5, 4)
                    b1 -- bias vector of shape (5, 1)
                    W2 -- weight matrix of shape (3, 5)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)

    Returns:
    cost -- the cost function (logistic cost for one example)
    """

    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cost
    logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1./m * np.sum(logprobs)

    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)

    return cost, cache
```
```python
def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.

    Arguments:
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    cache -- cache output from forward_propagation_n()

    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T) * 2    # the extra "* 2" looks like one of the errors the assignment plants on purpose
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 4./m * np.sum(dZ1, axis=1, keepdims=True)    # "4./m" instead of "1./m" -- the other planted error, for gradient checking to catch

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients
```

At this point, two helper functions are provided for converting between the parameter dictionary and a flat vector; to compute $\theta^{+}$ it is much easier to work with the vector form.

dictionary_to_vector() and vector_to_dictionary()
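The assignment supplies these converters ready-made; a rough sketch of the idea is below. It is not the course's exact code, and the hard-coded shapes are taken from the docstring of forward_propagation_n above, so treat the details as assumptions.

```python
import numpy as np

# Rough sketch of the two converters for the fixed 3-layer network used here.
SHAPES = [("W1", (5, 4)), ("b1", (5, 1)),
          ("W2", (3, 5)), ("b2", (3, 1)),
          ("W3", (1, 3)), ("b3", (1, 1))]

def dictionary_to_vector(parameters):
    """Stack every parameter into one (n, 1) column vector, plus one key per entry."""
    theta = np.concatenate([parameters[name].reshape(-1, 1) for name, _ in SHAPES])
    keys = [name for name, shape in SHAPES for _ in range(int(np.prod(shape)))]
    return theta, keys

def vector_to_dictionary(theta):
    """Inverse operation: cut the (n, 1) vector back into the original W/b matrices."""
    parameters, start = {}, 0
    for name, shape in SHAPES:
        size = int(np.prod(shape))
        parameters[name] = theta[start:start + size].reshape(shape)
        start += size
    return parameters
```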

J_plus[i] corresponds to one element of that vector, i.e. one entry of the flattened W and b parameters.

  • To compute J_plus[i]:
    1. Set $\theta^{+}$ to np.copy(parameters_values)
    2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    3. Calculate $J^{+}_i$ using forward_propagation_n(x, y, vector_to_dictionary($\theta^{+}$)).
  • To compute J_minus[i]: do the same thing with $\theta^{-}$.
  • Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2\varepsilon}$.

The code is below. Remember that thetaplus is an (n, 1) vector; we loop over every parameter to compute its gradapprox, then compare the whole thing with the original grad:

```python
# GRADED FUNCTION: gradient_check_n

def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to call outputs two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                       # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                  # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))  # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                        # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                                  # Step 2
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))  # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                       # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)     # Step 2'
    difference = numerator / denominator                                # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference
```

Note:

Gradient checking is slow, so don't run it during training; run it once, only to confirm that your backpropagation is correct.
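A typical one-off check looks roughly like this; the surrounding X, Y and parameters come from the assignment's test cell, so the exact call is an assumption.

```python
# Run the check once, outside the training loop.
cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
difference = gradient_check_n(parameters, gradients, X, Y)
```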

