A well-chosen initialization can:
- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error
    for l in range(1, L + 1):
        # He initialization: scale standard-normal weights by sqrt(2 / size of previous layer)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don’t initialize to values that are too large
- He initialization works well for networks with ReLU activations.
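The loop shown earlier is only a fragment, so here is a minimal, self-contained sketch of He initialization as a full function. The function name `initialize_parameters_he`, the `seed` argument, and the use of a local NumPy generator are assumptions made for illustration; `layers_dims` is taken to be a list of layer sizes starting with the input size.

```python
import numpy as np

def initialize_parameters_he(layers_dims, seed=0):
    """He initialization sketch. layers_dims = [n_x, n_h1, ..., n_y]."""
    rng = np.random.default_rng(seed)  # seed argument is an assumption, for reproducibility
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, not counting the input layer
    for l in range(1, L + 1):
        fan_in = layers_dims[l - 1]
        # Standard-normal weights scaled by sqrt(2 / fan_in)
        parameters['W' + str(l)] = (
            rng.standard_normal((layers_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        )
        # Biases can start at zero; the random weights already break symmetry
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

params = initialize_parameters_he([500, 400, 1])
print(params['W1'].shape, params['W1'].std(), np.sqrt(2.0 / 500))
```

The printed standard deviation of `W1` should land close to `sqrt(2 / 500)`, which is the scale He initialization aims for in a layer with 500 inputs.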
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. By penalizing the squared values of the weights in the cost function, you drive all the weights toward smaller values, because large weights make the cost too high. This leads to a smoother model in which the output changes more slowly as the input changes.
The implications of L2-regularization on:
- The cost computation: A regularization term is added to the cost
- The backpropagation function: There are extra terms in the gradients with respect to weight matrices
- Weights end up smaller (“weight decay”): Weights are pushed to smaller values.
cost = cross_entropy_cost + L2_regularization_cost
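In backpropagation, add the regularization term’s gradient (the derivative of the penalty with respect to each weight matrix) to the corresponding weight gradient.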
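To make the two changes concrete, here is a hedged sketch of the penalty and its gradient. It assumes the common scaling in which the penalty is `lambd / (2 * m)` times the sum of squared weights, with `m` the number of training examples; the notes above do not state the scaling, and the helper names and the placeholder cross-entropy value are hypothetical.

```python
import numpy as np

def l2_regularization_cost(weights, lambd, m):
    """Assumed penalty: (lambd / (2 * m)) * sum of squared entries of all weight matrices."""
    return (lambd / (2.0 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_gradient_term(W, lambd, m):
    """Derivative of the assumed penalty with respect to W: (lambd / m) * W."""
    return (lambd / m) * W

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))
W2 = rng.standard_normal((1, 3))
lambd, m = 0.7, 5

cross_entropy_cost = 0.5  # placeholder value, for illustration only
L2_regularization_cost = l2_regularization_cost([W1, W2], lambd, m)
cost = cross_entropy_cost + L2_regularization_cost

# In backprop, each dW from the unregularized network gets this extra term added
dW1_extra = l2_gradient_term(W1, lambd, m)
print(cost, dW1_extra.shape)
```

Because the extra gradient term is proportional to `W` itself, every gradient descent step shrinks the weights slightly, which is where the name "weight decay" comes from.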
Dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Forward propagation with dropout
- Initialize matrix D1 = np.random.rand(…, …)
- Convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
- Shut down some neurons of A1
- Scale the value of neurons that haven’t been shut down
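    D1 = np.random.rand(A1.shape[0], A1.shape[1])  # Step 1: random matrix with the same shape as A1
    D1 = D1 < keep_prob                            # Step 2: True (keep) with probability keep_prob
    A1 = A1 * D1                                   # Step 3: shut down the dropped neurons
    A1 = A1 / keep_prob                            # Step 4: scale the neurons that were kept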
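The four steps can be wrapped in a small helper so they apply to any layer. This is a sketch under assumptions: the name `dropout_forward` and the explicit `rng` argument are not from the notes, and the all-ones activations are only a stand-in to make the effect easy to see.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout on activations A. Returns the masked, rescaled A and the mask."""
    D = rng.random(A.shape) < keep_prob  # Steps 1-2: boolean mask, True with prob. keep_prob
    A = A * D                            # Step 3: zero out the dropped neurons
    A = A / keep_prob                    # Step 4: rescale the neurons that were kept
    return A, D

rng = np.random.default_rng(1)
A1 = np.ones((4, 1000))  # stand-in activations
A1_drop, D1 = dropout_forward(A1, keep_prob=0.8, rng=rng)
print(D1.mean(), np.unique(A1_drop))
```

The mask is returned alongside the activations because backward propagation needs the same mask to shut down the same neurons.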
Backward propagation with dropout
- Apply mask D2 to shut down the same neurons as during the forward propagation.
- Scale the value of neurons that haven’t been shut down.
    dA2 = dA2 * D2         # Step 1: shut down the same neurons as in forward propagation
    dA2 = dA2 / keep_prob  # Step 2: scale the neurons that were kept
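A minimal sketch of the backward step on a tiny hand-written example follows. The helper name `dropout_backward` and the example values are assumptions; in a real network `D2` would be the mask cached during the forward pass.

```python
import numpy as np

def dropout_backward(dA, D, keep_prob):
    """Apply the cached forward mask D to the gradient dA, then rescale by keep_prob."""
    dA = dA * D          # Step 1: no gradient flows through dropped neurons
    dA = dA / keep_prob  # Step 2: same scaling as in the forward pass
    return dA

# Hand-written mask and gradient, for illustration only
D2 = np.array([[True, False, True],
               [False, True, True]])
dA2 = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
dA2_drop = dropout_backward(dA2, D2, keep_prob=0.5)
print(dA2_drop)
```

Entries where the mask is `False` become zero, and the remaining entries are doubled because `keep_prob` is 0.5.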
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then on average half the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, which restores the expected value. You can check that this also works when keep_prob takes values other than 0.5.
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.
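The claim that dividing by keep_prob preserves the expected activation can be checked numerically. The sketch below uses random stand-in activations and a few arbitrary keep_prob values, all of which are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((50, 2000))  # stand-in activations in [0, 1)

means = {}
for keep_prob in (0.5, 0.8, 0.9):
    D = rng.random(A.shape) < keep_prob  # dropout mask
    A_drop = (A * D) / keep_prob         # inverted dropout
    means[keep_prob] = A_drop.mean()
    print(keep_prob, A.mean(), means[keep_prob])
```

For each keep_prob the mean after dropout should stay close to the mean before dropout, up to sampling noise. This is why no rescaling is needed at test time, when dropout is turned off.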
N-dimensional gradient checking
[Figure: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID model]
[Figure: dictionary_to_vector() and vector_to_dictionary()]
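The notes name these two helpers without showing them. One plausible reading is that they flatten the parameter dictionary into a single column vector and restore it again, so that every parameter can be perturbed by index. The sketch below is an assumption throughout: the signatures, the sorted key order, and the returned `shapes` dictionary are hypothetical and may differ from the helpers the notes refer to.

```python
import numpy as np

def dictionary_to_vector(parameters):
    """Hypothetical helper: stack all parameter arrays into one column vector."""
    keys = sorted(parameters)  # fixed order so the inverse can undo it
    shapes = {k: parameters[k].shape for k in keys}
    theta = np.concatenate([parameters[k].reshape(-1, 1) for k in keys], axis=0)
    return theta, shapes

def vector_to_dictionary(theta, shapes):
    """Hypothetical inverse: cut the vector back into arrays of the recorded shapes."""
    parameters, start = {}, 0
    for k in sorted(shapes):
        size = int(np.prod(shapes[k]))
        parameters[k] = theta[start:start + size].reshape(shapes[k])
        start += size
    return parameters

params = {
    'W1': np.arange(6.0).reshape(2, 3), 'b1': np.zeros((2, 1)),
    'W2': np.ones((1, 2)), 'b2': np.zeros((1, 1)),
}
theta, shapes = dictionary_to_vector(params)
restored = vector_to_dictionary(theta, shapes)
print(theta.shape, sorted(restored))
```

A round trip through both functions should return arrays identical to the originals, which is the property gradient checking relies on.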
About Gradient Checking
- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
- Gradient checking is slow, so don’t run it in every iteration of training. Run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
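A minimal sketch of the comparison follows, assuming a centered finite difference for the numerical gradient and a relative difference based on Euclidean norms. The function names, the toy cost `J`, and the default `epsilon` are assumptions for illustration, not the notes' own code.

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-7):
    """Centered-difference approximation of dJ/dtheta, one component at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus.flat[i] += epsilon
        minus.flat[i] -= epsilon
        grad.flat[i] = (J(plus) - J(minus)) / (2 * epsilon)  # two cost evaluations
    return grad

def gradient_check(J, grad, theta, epsilon=1e-7):
    """Relative difference between a claimed gradient and the numerical one."""
    gradapprox = numerical_gradient(J, theta, epsilon)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

# Toy cost with a known gradient: J(theta) = 0.5 * sum(theta^2), so dJ/dtheta = theta
J = lambda theta: 0.5 * np.sum(theta ** 2)
theta = np.array([[1.0], [-2.0], [3.0]])

correct = gradient_check(J, theta.copy(), theta)  # correct analytic gradient
buggy = gradient_check(J, 2 * theta, theta)       # deliberately wrong gradient
print(correct, buggy)
```

The correct gradient should give a very small relative difference, while the deliberately wrong one gives a large one. The loop needs two cost evaluations per parameter, which illustrates why the check is too slow to run on every training iteration.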