## Outline

- **Initialize** the parameters for a two-layer network and for an $L$-layer neural network.
- Implement the **forward propagation** module (shown in purple in the figure below).
  - Complete the LINEAR part of a layer's forward propagation step (resulting in $Z^{[l]}$).
  - The ACTIVATION function (relu/sigmoid) is given.
  - Combine the previous two steps into a new [LINEAR->ACTIVATION] forward function.
  - Stack the [LINEAR->RELU] forward function $L-1$ times (for layers 1 through $L-1$) and add a [LINEAR->SIGMOID] at the end (for the final layer $L$). This builds a new `L_model_forward` function: the earlier layers are LINEAR->RELU and the last layer is LINEAR->SIGMOID.
- **Compute the loss**.
- Implement the **backward propagation** module (denoted in red in the figure below).
  - Complete the LINEAR part of a layer's backward propagation step.
  - The gradient of the ACTIVATION function (`relu_backward`/`sigmoid_backward`) is given.
  - Combine the previous two steps into a new [LINEAR->ACTIVATION] backward function.
  - Stack [LINEAR->RELU] backward $L-1$ times and add [LINEAR->SIGMOID] backward in a new `L_model_backward` function.
- Finally, **update** the parameters.

## Initialization

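The initialization for a deeper $L$-layer neural network is more complicated because there are many more weight matrices and bias vectors. When completing `initialize_parameters_deep`, make sure that the dimensions match between each layer. Recall that $n^{[l]}$ is the number of units in layer $l$. For example, if the input $X$ has shape $(12288, 209)$ (with $m=209$ examples), then $W^{[1]}$ has shape $(n^{[1]}, 12288)$, $b^{[1]}$ has shape $(n^{[1]}, 1)$, and $Z^{[1]}$ and $A^{[1]}$ have shape $(n^{[1]}, 209)$. In general, $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$ and $b^{[l]}$ has shape $(n^{[l]}, 1)$.

Remember that when we compute $WX + b$ in Python, it carries out broadcasting: the column vector $b$ is added to every column of $WX$. For example, if:

$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}, \quad X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$

Then $WX + b$ will be:

$$WX + b = \begin{bmatrix} (w_{11}x_{11} + w_{12}x_{21}) + b_1 & (w_{11}x_{12} + w_{12}x_{22}) + b_1 & (w_{11}x_{13} + w_{12}x_{23}) + b_1 \\ (w_{21}x_{11} + w_{22}x_{21}) + b_2 & (w_{21}x_{12} + w_{22}x_{22}) + b_2 & (w_{21}x_{13} + w_{22}x_{23}) + b_2 \end{bmatrix}$$

With matching dimensions, the parameters of all layers can be initialized in a loop: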

```
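parameters = {}
L = len(layer_dims)  # number of layers in the network, counting the input layer
for l in range(1, L):
    # small random weights break symmetry; biases can start at zero
    parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
    parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))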
```
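To make the shape rules concrete, here is a minimal self-contained sketch of what `initialize_parameters_deep` might look like. The `seed` argument is a hypothetical convenience added for reproducibility; it is not part of the original notes.

```python
import numpy as np

def initialize_parameters_deep(layer_dims, seed=None):
    """Return {'W1', 'b1', ..., 'WL', 'bL'} for the given layer sizes.

    layer_dims -- list such as [n_x, n_h1, ..., n_y]
    seed -- hypothetical argument (an assumption of this sketch) for reproducible weights
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    L = len(layer_dims)  # counts the input layer as well
    for l in range(1, L):
        # W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1)
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

parameters = initialize_parameters_deep([5, 4, 3], seed=1)
shapes = {name: value.shape for name, value in parameters.items()}
print(shapes)  # {'W1': (4, 5), 'b1': (4, 1), 'W2': (3, 4), 'b2': (3, 1)}
```

Printing the shapes right after initialization is a cheap way to catch a transposed weight matrix before it causes a confusing error in the forward pass.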

## Forward propagation module

### Linear Forward

The linear forward module (vectorized over all the examples) computes the following equation:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

where $A^{[0]} = X$.
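The linear step can be sketched as follows. This is one plausible `linear_forward`, assuming the cache is the tuple `(A_prev, W, b)` that `linear_backward` unpacks later in these notes; the small input values are made up for illustration.

```python
import numpy as np

def linear_forward(A_prev, W, b):
    """Compute Z = W A_prev + b and keep the inputs for the backward pass."""
    Z = np.dot(W, A_prev) + b   # b of shape (n_l, 1) is broadcast over the m columns
    cache = (A_prev, W, b)
    return Z, cache

A_prev = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])   # 3 units in the previous layer, m = 2 examples
W = np.array([[1.0, 0.0, -1.0]])  # 1 unit in the current layer
b = np.array([[0.5]])

Z, linear_cache = linear_forward(A_prev, W, b)
print(Z)  # [[-3.5 -3.5]]
```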

### Linear-Activation Forward

**Sigmoid**: $\sigma(Z) = \sigma(WA + b) = \frac{1}{1 + e^{-(WA + b)}}$. We have provided you with the `sigmoid` function. This function returns **two** items: the activation value `A` and a `cache` that contains `Z` (it's what we will feed in to the corresponding backward function). To use it you could just call: `A, activation_cache = sigmoid(Z)`

**ReLU**: The mathematical formula for ReLU is $A = \mathrm{RELU}(Z) = \max(0, Z)$. We have provided you with the `relu` function. This function returns **two** items: the activation value `A` and a `cache` that contains `Z` (it's what we will feed in to the corresponding backward function). To use it you could just call: `A, activation_cache = relu(Z)`
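The notes only describe the provided helpers, so the following is a sketch under assumptions: `sigmoid` and `relu` are written here the way the description suggests (returning the activation plus `Z` as the cache), and `linear_activation_forward` combines the linear step with one of them.

```python
import numpy as np

def sigmoid(Z):
    """Assumed implementation: returns the activation and Z as the cache."""
    A = 1 / (1 + np.exp(-Z))
    return A, Z

def relu(Z):
    """Assumed implementation: returns the activation and Z as the cache."""
    A = np.maximum(0, Z)
    return A, Z

def linear_activation_forward(A_prev, W, b, activation):
    """LINEAR -> ACTIVATION for one layer; cache = (linear_cache, activation_cache)."""
    Z = np.dot(W, A_prev) + b
    linear_cache = (A_prev, W, b)
    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    return A, (linear_cache, activation_cache)

A_prev = np.array([[1.0, -2.0]])   # 1 input unit, m = 2 examples
W = np.array([[1.0], [-1.0]])      # 2 units in the current layer
b = np.zeros((2, 1))

A_relu, cache = linear_activation_forward(A_prev, W, b, "relu")
A_sig, _ = linear_activation_forward(A_prev, W, b, "sigmoid")
print(A_relu)  # negative pre-activations are clipped to 0
```

Here $Z$ is `[[1, -2], [-1, 2]]`, so the ReLU output keeps the positive entries and zeroes the rest, while every sigmoid output lies strictly between 0 and 1.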

### L-Layer Model

For even more convenience when implementing the $L$-layer neural network, you will need a function that replicates the previous one (`linear_activation_forward` with RELU) $L-1$ times, then follows that with one `linear_activation_forward` with SIGMOID.

```
def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()

    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_relu_forward() (there are L-1 of them, indexed from 0 to L-2)
                the cache of linear_sigmoid_forward() (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)],
                                             parameters['b' + str(l)],
                                             activation="relu")
        caches.append(cache)

    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)],
                                          parameters['b' + str(L)],
                                          activation="sigmoid")
    caches.append(cache)

    assert(AL.shape == (1, X.shape[1]))
    return AL, caches
```

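**Note:** each element of `caches` is the cache returned by `linear_activation_forward`, that is, a `(linear_cache, activation_cache)` pair.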

## Cost function

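Compute the cross-entropy cost $J$, using the following formula:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - a^{[L](i)}\right) \right)$$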

```
m = Y.shape[1]
cost = (-1/m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply((1-Y), np.log(1-AL)))
cost = np.squeeze(cost) # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
assert(cost.shape == ())
```
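As a quick sanity check, the cost can be wrapped in a function and compared against the same sum written out term by term. The function name `compute_cost` and the three-example input are assumptions made for this illustration.

```python
import numpy as np

def compute_cost(AL, Y):
    """Cross-entropy cost for predictions AL and labels Y, both of shape (1, m)."""
    m = Y.shape[1]
    cost = (-1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return float(np.squeeze(cost))

Y = np.array([[1, 1, 0]])
AL = np.array([[0.8, 0.9, 0.4]])
cost = compute_cost(AL, Y)

# Same value written out by hand: the third example has y = 0,
# so it contributes log(1 - 0.4) = log(0.6).
expected = -(np.log(0.8) + np.log(0.9) + np.log(0.6)) / 3
print(cost, expected)  # both are about 0.28
```

Note that the formula is undefined when an entry of `AL` is exactly 0 or 1; the sketch does not guard against that case.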

## Backward propagation module

### Linear backward

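The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$. Here are the formulas you need:

$$dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1]T}$$

$$db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$$

$$dA^{[l-1]} = W^{[l]T} dZ^{[l]}$$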

```
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    return dA_prev, dW, db
```
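One way to gain confidence in these formulas is a finite-difference check. The sketch below uses a hypothetical toy cost $J = \frac{1}{m}\sum C \odot Z$, chosen only because its gradient with respect to $Z$ for each example is the fixed matrix `C`; it is not the network's real cost.

```python
import numpy as np

def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))   # 3 units in layer l-1, m = 4 examples
W = rng.standard_normal((2, 3))        # 2 units in layer l
b = rng.standard_normal((2, 1))
C = rng.standard_normal((2, 4))        # fixed coefficients defining the toy cost

def toy_cost(W, b):
    """Hypothetical cost J = (1/m) * sum(C * Z), so that dZ = C."""
    Z = np.dot(W, A_prev) + b
    return np.sum(C * Z) / A_prev.shape[1]

dA_prev, dW, db = linear_backward(C, (A_prev, W, b))

# Central finite differences with respect to one weight and one bias entry.
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
numeric_dW = (toy_cost(W + E, b) - toy_cost(W - E, b)) / (2 * eps)

e = np.zeros_like(b)
e[0, 0] = eps
numeric_db = (toy_cost(W, b + e) - toy_cost(W, b - e)) / (2 * eps)

print(abs(numeric_dW - dW[1, 2]), abs(numeric_db - db[0, 0]))  # both should be tiny
```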

### Linear-Activation backward

`sigmoid_backward`: Implements the backward propagation for a SIGMOID unit. You can call it as follows:

```
dZ = sigmoid_backward(dA, activation_cache)
```

`relu_backward`: Implements the backward propagation for a RELU unit. You can call it as follows:

```
dZ = relu_backward(dA, activation_cache)
```

If $g(\cdot)$ is the activation function, `sigmoid_backward` and `relu_backward` compute $dZ^{[l]} = dA^{[l]} * g'(Z^{[l]})$, where $*$ denotes the element-wise product.

```
if activation == "relu":
    dZ = relu_backward(dA, activation_cache)
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
elif activation == "sigmoid":
    dZ = sigmoid_backward(dA, activation_cache)
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
```
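The two provided helpers are not shown in these notes. A minimal sketch of what they might look like, assuming the activation cache is simply `Z`, is:

```python
import numpy as np

def relu_backward(dA, activation_cache):
    """Assumed implementation: g'(Z) is 1 where Z > 0 and 0 elsewhere."""
    Z = activation_cache
    return dA * (Z > 0)

def sigmoid_backward(dA, activation_cache):
    """Assumed implementation: g'(Z) = s * (1 - s) with s = sigmoid(Z)."""
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

dA = np.array([[1.0, 2.0, -3.0]])
Z = np.array([[-1.0, 0.0, 4.0]])

dZ_relu = relu_backward(dA, Z)        # gradient passes only where Z > 0
dZ_sigmoid = sigmoid_backward(dA, Z)  # at Z = 0 the sigmoid slope is 0.25
print(dZ_relu)
```

This sketch treats the ReLU derivative at exactly $Z = 0$ as 0, which is a common convention rather than something the notes specify.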

### L-Model Backward

**Initializing backpropagation**:
To backpropagate through this network, we know that the output is $A^{[L]} = \sigma(Z^{[L]})$. Your code thus needs to compute `dAL` $= \frac{\partial \mathcal{L}}{\partial A^{[L]}}$.
To do so, use this formula (derived using calculus which you don't need in-depth knowledge of):

```
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
```
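A useful consequence of this formula: multiplying `dAL` by the sigmoid derivative $A^{[L]}(1 - A^{[L]})$ simplifies to $A^{[L]} - Y$. The short check below confirms this numerically on made-up values.

```python
import numpy as np

Z = np.array([[-2.0, 0.5, 3.0]])
Y = np.array([[0.0, 1.0, 1.0]])
AL = 1 / (1 + np.exp(-Z))

dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
dZ = dAL * AL * (1 - AL)   # sigmoid'(Z) = AL * (1 - AL)
simplified = AL - Y

print(np.allclose(dZ, simplified))  # True
```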

```
def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)  # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: AL, Y, caches.
    # Outputs: grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)]
    # Note: with this indexing, grads["dA" + str(k)] stores the dA_prev returned
    # for layer k, i.e. the gradient with respect to the activation of layer k-1.
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, "sigmoid")

    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: grads["dA" + str(l + 2)], caches.
        # Outputs: grads["dA" + str(l + 1)], grads["dW" + str(l + 1)], grads["db" + str(l + 1)]
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l+2)], current_cache, "relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads
```
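To check that the forward and backward passes agree, one option is to compare one backpropagated gradient entry against a central finite difference of the cost. The sketch below is a compact re-implementation written only for this check; the names `forward`, `backward`, and `cost`, the layer sizes, and the random data are all assumptions of the example, and its caches hold `(A_prev, W, Z)` rather than the nested caches used above.

```python
import numpy as np

def forward(X, params):
    """Compact [LINEAR->RELU]*(L-1) -> LINEAR->SIGMOID pass."""
    L = len(params) // 2
    A, caches = X, []
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b
        caches.append((A, W, Z))   # (A_prev, W, Z) for layer l
        A = np.maximum(0, Z) if l < L else 1 / (1 + np.exp(-Z))
    return A, caches

def cost(AL, Y):
    """Cross-entropy cost averaged over the m examples."""
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))

def backward(AL, Y, caches):
    """Backpropagate through every layer, returning dW and db per layer."""
    m, L, grads = Y.shape[1], len(caches), {}
    dA = -(Y / AL - (1 - Y) / (1 - AL))
    for l in range(L, 0, -1):
        A_prev, W, Z = caches[l - 1]
        if l == L:
            s = 1 / (1 + np.exp(-Z))
            dZ = dA * s * (1 - s)          # sigmoid layer
        else:
            dZ = dA * (Z > 0)              # relu layer
        grads["dW" + str(l)] = dZ @ A_prev.T / m
        grads["db" + str(l)] = dZ.sum(axis=1, keepdims=True) / m
        dA = W.T @ dZ
    return grads

rng = np.random.default_rng(0)
layer_dims = [4, 3, 1]
params = {}
for l in range(1, len(layer_dims)):
    params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.5
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))
X = rng.standard_normal((4, 5))
Y = np.array([[1, 0, 1, 1, 0]])

AL, caches = forward(X, params)
grads = backward(AL, Y, caches)

# Central finite difference on a single entry of W1.
eps = 1e-6
original = params["W1"][0, 0]
params["W1"][0, 0] = original + eps
J_plus = cost(forward(X, params)[0], Y)
params["W1"][0, 0] = original - eps
J_minus = cost(forward(X, params)[0], Y)
params["W1"][0, 0] = original

numeric = (J_plus - J_minus) / (2 * eps)
error = abs(numeric - grads["dW1"][0, 0])
print(error)  # should be very small if the two passes are consistent
```

Checking a single entry keeps the example short; a fuller check would loop over every parameter.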

## Update Parameters

Update the parameters of the model, using gradient descent:

$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$$

$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$$

where $\alpha$ is the learning rate. After computing the updated parameters, store them in the parameters dictionary.

```
L = len(parameters) // 2  # number of layers in the neural network
for l in range(L):
    parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
    parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
```
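The update rule can be tried on a one-layer toy example. This sketch wraps the loop in a function named `update_parameters` (an assumed name) and, unlike the in-place loop above, returns a new dictionary so the inputs stay untouched; the numbers are made up.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate):
    """Apply one gradient descent step to every W and b; returns a new dictionary."""
    L = len(parameters) // 2
    updated = {}
    for l in range(1, L + 1):
        updated["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        updated["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return updated

parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.5, 1.0]]), "db1": np.array([[-1.0]])}

new_parameters = update_parameters(parameters, grads, learning_rate=0.1)
print(new_parameters["W1"], new_parameters["b1"])  # close to [[0.95 -2.1]] and [[0.6]]
```

Each parameter moves against its gradient by `learning_rate` times the gradient, so a negative `db1` increases `b1`.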