## Initialization

A well-chosen initialization can:

1. Speed up the convergence of gradient descent
2. Increase the odds of gradient descent converging to a lower training (and generalization) error

### He initialization

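```python
import numpy as np

def initialize_parameters_he(layers_dims):
    # The function wrapper and L are assumed here; layers_dims[0] is the input size.
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, excluding the input layer
    for l in range(1, L + 1):
        # Standard-normal weights scaled by sqrt(2 / number of units in the previous layer)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```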
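To see why the `sqrt(2 / layers_dims[l-1])` factor matters, the following sketch pushes random inputs through a stack of ReLU layers and compares the size of the final activations under He scaling and under a small fixed scale. The helper name `relu_forward_rms`, the layer width, the depth, and the 0.01 scale are all assumptions chosen for illustration, not values from these notes.

```python
import numpy as np

def relu_forward_rms(scale_fn, n_units=256, n_layers=10, n_samples=512, seed=0):
    """Return the root-mean-square activation after a stack of ReLU layers.

    scale_fn maps the number of input units to the factor applied to
    standard-normal weights (hypothetical helper, for illustration only).
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_units, n_samples))
    for _ in range(n_layers):
        W = rng.standard_normal((n_units, n_units)) * scale_fn(n_units)
        A = np.maximum(0, W @ A)  # ReLU activation, biases left at zero
    return float(np.sqrt(np.mean(A ** 2)))

he_rms = relu_forward_rms(lambda n_in: np.sqrt(2.0 / n_in))
small_rms = relu_forward_rms(lambda n_in: 0.01)

print(f"He scaling:   {he_rms:.3e}")
print(f"0.01 scaling: {small_rms:.3e}")
```

Under these assumptions the He-scaled activations should stay at roughly the same magnitude as the input, while the fixed small scale shrinks them toward zero layer by layer.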


### Conclusions

- Different initializations lead to different results.
- Random initialization is used to break symmetry and make sure different hidden units can learn different things.
- Don’t initialize the weights to values that are too large.
- He initialization works well for networks with ReLU activations.
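The symmetry-breaking point can be checked on a toy problem. The sketch below trains a tiny 2-3-1 network (tanh hidden layer, sigmoid output) twice, once with `W1` set to zeros and once with small random values. The network size, data, learning rate, and the function name `train_hidden_weights` are assumptions made for illustration.

```python
import numpy as np

def train_hidden_weights(W1_init, steps=50, lr=0.1, seed=1):
    """Run plain gradient descent on a toy 2-3-1 network and return the final W1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((2, 20))
    Y = (X[0:1] * X[1:2] > 0).astype(float)  # made-up binary labels
    m = X.shape[1]
    W1, b1 = W1_init.copy(), np.zeros((3, 1))
    W2, b2 = np.full((1, 3), 0.5), np.zeros((1, 1))  # identical outgoing weights
    for _ in range(steps):
        # Forward pass
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
        # Backward pass for the cross-entropy cost
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.mean(axis=1, keepdims=True)
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.mean(axis=1, keepdims=True)
        # Gradient descent update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1

W1_from_zeros = train_hidden_weights(np.zeros((3, 2)))
W1_from_random = train_hidden_weights(0.1 * np.random.default_rng(0).standard_normal((3, 2)))

print("zeros init, rows of W1:\n", W1_from_zeros)
print("random init, rows of W1:\n", W1_from_random)
```

With the zero start, every hidden unit receives the same gradient at every step, so the rows of `W1` stay identical and the three units compute the same function. The random start gives each unit its own trajectory.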

## Regularization

### L2 Regularization

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. By penalizing the squared values of the weights in the cost function, you drive all the weights toward smaller values, because large weights make the cost too high. This leads to a smoother model in which the output changes more slowly as the input changes.

#### The implications of L2 regularization

1. Cost computation: a regularization term is added to the cost.
2. Backpropagation: the gradients with respect to the weight matrices gain an extra term.
3. Weights (“weight decay”): the weights are pushed toward smaller values.
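The name “weight decay” follows from the gradient descent update. As a short derivation, assume a learning rate $\alpha$ (a symbol not used elsewhere in these notes) and write $J$ for the unregularized cost. The penalty adds $\frac{\lambda}{m} W$ to the gradient, so the update becomes:

$$
W := W - \alpha \left( \frac{\partial J}{\partial W} + \frac{\lambda}{m} W \right) = \left( 1 - \frac{\alpha \lambda}{m} \right) W - \alpha \frac{\partial J}{\partial W}
$$

When $\frac{\alpha \lambda}{m}$ is between 0 and 1, each step first shrinks $W$ by the factor $1 - \frac{\alpha \lambda}{m}$ and then applies the usual gradient step.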

`compute_cost_with_regularization`:

```python
cost = cross_entropy_cost + L2_regularization_cost
```
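A minimal sketch of how the two terms might be computed, assuming a binary cross-entropy cost and a penalty of $\frac{1}{2}\frac{\lambda}{m}$ times the sum of squared weights. The signature below (a list of weight matrices instead of a parameter dictionary) and the helper `l2_regularization_cost` are assumptions for illustration, not the original function.

```python
import numpy as np

def l2_regularization_cost(weights, lambd, m):
    """Return (lambd / (2 * m)) * sum of squared entries over all weight matrices."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def compute_cost_with_regularization(A_out, Y, weights, lambd):
    """Binary cross-entropy averaged over the m examples, plus the L2 penalty."""
    m = Y.shape[1]
    cross_entropy_cost = -np.mean(Y * np.log(A_out) + (1 - Y) * np.log(1 - A_out))
    return cross_entropy_cost + l2_regularization_cost(weights, lambd, m)

# Made-up example: 4 training examples, two small weight matrices.
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
A_out = np.full((1, 4), 0.5)
weights = [np.ones((2, 3)), 2 * np.ones((1, 2))]
cost = compute_cost_with_regularization(A_out, Y, weights, lambd=0.5)
print(cost)
```

Setting `lambd=0` recovers the plain cross-entropy cost, which is a convenient way to check the penalty term in isolation.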


`backward_propagation_with_regularization`:

Add the gradient of the regularization term to the gradient of each weight matrix: $\frac{d}{dW} \left( \frac{1}{2}\frac{\lambda}{m} W^2 \right) = \frac{\lambda}{m} W$
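The extra gradient term can be verified numerically. The sketch below compares the analytic gradient $\frac{\lambda}{m} W$ with a central finite-difference estimate. The function names and the values of `lambd` and `m` are hypothetical.

```python
import numpy as np

def l2_penalty(W, lambd, m):
    """Regularization term for a single weight matrix."""
    return (lambd / (2 * m)) * np.sum(np.square(W))

def l2_penalty_grad(W, lambd, m):
    """Analytic gradient of the penalty; in backprop this is added to the unregularized dW."""
    return (lambd / m) * W

def numerical_grad(f, W, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
lambd, m = 0.7, 5  # made-up values

analytic = l2_penalty_grad(W, lambd, m)
numeric = numerical_grad(lambda V: l2_penalty(V, lambd, m), W)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```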

### Dropout

Dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.

Reference: Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” (2014).

#### Forward propagation with dropout

Steps:

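1. Initialize matrix `D1 = np.random.rand(…, …)`
2. Convert entries of `D1` to 0 or 1 (using `keep_prob` as the threshold)
3. Shut down some neurons of `A1`
4. Scale the value of neurons that haven’t been shut down

```python
D1 = np.random.rand(A1.shape[0], A1.shape[1])  # Step 1: uniform random values, same shape as A1
D1 = D1 < keep_prob                            # Step 2: True (keep) with probability keep_prob
A1 = A1 * D1                                   # Step 3: zero out the dropped neurons
A1 = A1 / keep_prob                            # Step 4: rescale the neurons that were kept
```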
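The four steps can be wrapped in a small helper that also returns the mask, since the backward pass needs it. This is a sketch under assumptions: the name `dropout_forward` is hypothetical, and it uses NumPy's `Generator` API (`np.random.default_rng`) in place of `np.random.rand` so the example is reproducible.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to activations A and return (dropped activations, mask)."""
    D = rng.random(A.shape) < keep_prob  # Steps 1-2: boolean mask, True means keep
    A_dropped = A * D / keep_prob        # Steps 3-4: zero dropped units, rescale the rest
    return A_dropped, D

rng = np.random.default_rng(0)
A1 = np.ones((4, 5))  # made-up activations
A1_dropped, D1 = dropout_forward(A1, keep_prob=0.8, rng=rng)
print(D1.astype(int))
print(A1_dropped)
```

Every entry of `A1_dropped` is either 0 (dropped) or the original value divided by `keep_prob` (kept).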


#### Backward propagation with dropout

Steps:

1. Apply mask `D2` to shut down the same neurons as during the forward propagation.
2. Scale the value of neurons that haven’t been shut down.

```python
dA2 = dA2 * D2         # Step 1: reuse the forward-pass mask
dA2 = dA2 / keep_prob  # Step 2: apply the same scaling as in the forward pass
```
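To check that these two steps are the correct gradient of the forward dropout operation for a fixed mask, the sketch below compares them with a finite-difference estimate. The helper name `dropout_backward`, the toy loss, and all array values are assumptions for illustration.

```python
import numpy as np

def dropout_backward(dA, D, keep_prob):
    """Pass gradients only through the kept units, with the same 1 / keep_prob scaling."""
    return dA * D / keep_prob

rng = np.random.default_rng(0)
keep_prob = 0.8
A2 = rng.standard_normal((3, 4))
D2 = rng.random(A2.shape) < keep_prob     # mask saved from the forward pass
upstream = rng.standard_normal(A2.shape)  # made-up gradient arriving from the next layer

def loss(A):
    """Toy scalar loss whose gradient with respect to the dropped activations is `upstream`."""
    return np.sum(upstream * (A * D2 / keep_prob))

dA2 = dropout_backward(upstream, D2, keep_prob)

# Central finite differences with the mask held fixed
eps = 1e-6
numeric = np.zeros_like(A2)
for idx in np.ndindex(A2.shape):
    step = np.zeros_like(A2)
    step[idx] = eps
    numeric[idx] = (loss(A2 + step) - loss(A2 - step)) / (2 * eps)

print("max abs difference:", np.max(np.abs(dA2 - numeric)))
```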


Notes on dropout:

1. Dropout is a regularization technique.
2. Use dropout only during training. Don’t use dropout (randomly eliminating nodes) at test time.
3. Apply dropout during both forward and backward propagation.
4. During training, divide the output of each dropout layer by `keep_prob` to keep the same expected value for the activations. For example, if `keep_prob` is 0.5, then on average half the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, so the output again has the same expected value. The same argument holds for any other value of `keep_prob`.
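The expected-value argument in the last point can be checked empirically. The following sketch compares the mean activation before dropout, after masking only, and after masking plus rescaling, for a few values of `keep_prob`. The activation values and array size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 2000)) + 0.5  # made-up positive activations with mean near 1.0

results = {}
for keep_prob in (0.5, 0.8, 0.9):
    D = rng.random(A.shape) < keep_prob         # dropout mask
    masked_mean = (A * D).mean()                # shrinks by roughly keep_prob
    rescaled_mean = (A * D / keep_prob).mean()  # back to roughly the original mean
    results[keep_prob] = (masked_mean, rescaled_mean)
    print(f"keep_prob={keep_prob}: original={A.mean():.4f}, "
          f"masked={masked_mean:.4f}, rescaled={rescaled_mean:.4f}")
```

Because the mask is random, the rescaled mean matches the original only approximately, and the match tightens as the number of activations grows.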

### Conclusions

1. Regularization will drive your weights to lower values.
2. L2 regularization and dropout are two very effective regularization techniques.