{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ungraded Lab - Gradient Descent for Linear Regression\n", "\n", "In the previous labs, we determined the optimal values of $w_0$ and $w_1$ manually. In this lab we will automate this process using gradient descent with one variable, as described in lecture." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Outline\n", "\n", "- [Exercise 01- Compute Gradient](#ex01)\n", "- [Exercise 02- Checking the Gradient](#ex02)\n", "- [Exercise 03- Learning Parameters with Batch Gradient Descent](#ex-03)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import math \n", "import copy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Statement\n", "\n", "Let's use the same two data points as before - a house with 1000 square feet sold for \\\\$200,000 and a house with 2000 square feet sold for \\\\$400,000.\n", "\n", "That is, our dataset contains the following two points - \n", "\n", "| Size (feet$^2$) | Price (1000s of dollars) |\n", "| -------------------| ------------------------ |\n", "| 1000 | 200 |\n", "| 2000 | 400 |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load our data set\n", "X_train = [1000, 2000] #feature \n", "y_train = [200, 400] #actual value" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## routine to plot the data points\n", "def plt_house(X, y, f_w=None):\n", "    plt.scatter(X, y, marker='x', c='r', label=\"Actual Value\")\n", "\n", "    # Set the title\n", "    plt.title(\"Housing Prices\")\n", "    # Set the y-axis label\n", "    plt.ylabel('Price (in 1000s of dollars)')\n", "    # Set the x-axis label\n", "    plt.xlabel('Size (feet^2)')\n", "    # plot the predictions, if provided\n", "    if f_w is not None:\n", "        plt.plot(X, f_w, c='b', label=\"Our Prediction\")\n", "    plt.legend()\n", "    plt.show()\n", "\n", "plt_house(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute_Cost\n", "You produced this in the last lab, so it is supplied here for later use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Function to calculate the cost\n", "def compute_cost(X, y, w):\n", "    \n", "    m = len(X)\n", "    cost = 0\n", "    \n", "    for i in range(m):\n", "        \n", "        # Calculate the model prediction\n", "        f_w = w[0] + w[1]*X[i]\n", "        \n", "        # Calculate the cost\n", "        cost = cost + (f_w - y[i])**2\n", "\n", "    total_cost = 1/(2*m) * cost\n", "\n", "    return total_cost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient descent summary\n", "So far in this course we have developed a linear model that predicts $f_{\\mathbf{w}}(x)$ based on a single input $x$ using trained parameters $w_0$,$w_1$.\n", "$$f_\\mathbf{w}(x)= w_0 + w_1x \\tag{1}$$\n", "In machine learning, we utilize input data to train the parameters $w_0$,$w_1$ by minimizing a measure of the error between our predictions $f_{\\mathbf{w}}(x)$ and the actual data $y$. The measure is called the $cost$, $J(\\mathbf{w})$.\n",
"In training we measure the cost over all of our training samples $x^{(i)},y^{(i)}$\n", "$$J(\\mathbf{w}) = \\frac{1}{2m} \\sum\\limits_{i = 0}^{m-1} (f_{\\mathbf{w}}(x^{(i)}) - y^{(i)})^2\\tag{2}$$ \n", "From calculus we know that the partial derivative of the cost with respect to one of the parameters tells us how a small change in that parameter $w_j$, or $\\Delta{w_j}$, causes a small change in $J(\\mathbf{w})$, or $\\Delta{J(\\mathbf{w})}$.\n", "\n", "$$ \\frac{\\partial J(\\mathbf{w})}{\\partial w_j} \\approx \\frac{\\Delta{J(\\mathbf{w})}}{\\Delta{w_j}}$$\n", "Using that information, we can iteratively make small adjustments to $w_j$ that reduce the value of $J(\\mathbf{w})$. This iterative process is called gradient descent. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In lecture, *gradient descent* was described as:\n", "\n", "$$\\begin{align*}& \\text{repeat until convergence:} \\; \\lbrace \\newline \\; & w_j := w_j - \\alpha \\frac{\\partial J(\\mathbf{w})}{\\partial w_j} \\tag{3} \\; & \\text{for j := 0,1}\\newline & \\rbrace\\end{align*}$$\n", "where the parameters $w_0$, $w_1$ are updated simultaneously. \n", "As in lecture:\n", "$$\n", "\\begin{align}\n", " \\frac{\\partial J(\\mathbf{w})}{\\partial w_0} &:= \\frac{1}{m} \\sum\\limits_{i = 0}^{m-1} (f_{w}(x^{(i)}) - y^{(i)}) \\tag{4}\\\\\n", " \\frac{\\partial J(\\mathbf{w})}{\\partial w_1} &:= \\frac{1}{m} \\sum\\limits_{i = 0}^{m-1} (f_{w}(x^{(i)}) - y^{(i)})x^{(i)} \\tag{5}\\\\\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise 01- Compute Gradient\n", "We will implement a batch gradient descent algorithm for one variable. We'll need three functions:\n", "- compute_gradient, implementing equations (4) and (5) above\n", "- compute_cost, implementing equation (2) above (code from the previous lab)\n", "- gradient_descent, utilizing compute_gradient and compute_cost, which runs the iterative algorithm to find the parameters with the lowest cost." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## compute_gradient\n", "\n", "Implement `compute_gradient`, which will return $\\frac{\\partial J(\\mathbf{w})}{\\partial w}$. The naming convention we will use in code for gradients is to name the variable after the parameter, with the $J(\\mathbf{w})$ implied. For example, $\\frac{\\partial J(\\mathbf{w})}{\\partial w_0}$ will be `dw0`.\n", "\n", "Please complete the `compute_gradient` function to:\n", "\n", "- Create a list to store the gradient `dw`. \n", "- Loop over all examples in the training set `m`. \n", "  - Inside the loop, calculate the gradient update from each training example:\n", "    - Calculate the model prediction `f`\n", "      $$ f_\\mathbf{w}(x^{(i)}) = w_0+ w_1x^{(i)} $$\n", "    - Calculate the gradient for $w_0$ and $w_1$\n", "      $$ \\begin{align} \\frac{\\partial{J(w)}}{\\partial{w_0}} &= f_\\mathbf{w}(x^{(i)}) - y^{(i)} \\\\ \\frac{\\partial{J(w)}}{\\partial{w_1}} &= (f_\\mathbf{w}(x^{(i)}) - y^{(i)})x^{(i)} \\end{align} $$\n", "    - Add these gradients to the total gradients `dw`\n", "- Compute the total gradient by dividing the sums by the number of examples `m` (a short worked example for our dataset follows below).\n", "\n", "**Note:** this assignment continues to use python lists rather than the NumPy data structures that will be described in upcoming lectures. This will require writing some expressions 'per element' where later these could be a single operation. Also note that these routines are specifically for one variable. Later labs and the weekly assignment will use more general cases."
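] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check of equations (4) and (5), here is the gradient worked out by hand for our two-point dataset at $\\mathbf{w} = [0,0]$, where $f_{\\mathbf{w}}(x^{(i)}) = 0$ for both examples:\n", "$$\\frac{\\partial J(\\mathbf{w})}{\\partial w_0} = \\frac{1}{2}\\left[(0 - 200) + (0 - 400)\\right] = -300$$\n", "$$\\frac{\\partial J(\\mathbf{w})}{\\partial w_1} = \\frac{1}{2}\\left[(0 - 200)(1000) + (0 - 400)(2000)\\right] = -500000$$\n", "Your `compute_gradient` implementation should reproduce these values in the first test cell below."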
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
**Hints**\n", "\n", "```\n", "def compute_gradient(X, y, w): \n", "    \"\"\"\n", "    Computes the gradient for linear regression \n", "    \n", "    Args:\n", "      X : (array_like Shape (m,)) variable such as house size \n", "      y : (array_like Shape (m,)) actual value \n", "      w : (array_like Shape (2,)) Initial values of parameters of the model \n", "    Returns\n", "      dw: (array_like Shape (2,)) The gradient of the cost w.r.t. the parameters w. \n", "          Note that dw has the same dimensions as w.\n", "    \"\"\"\n", "    m = len(X)\n", "    \n", "    dw = [0,0]\n", "    for i in range(m): \n", "        f = w[0] + w[1]*X[i]\n", "        dw0 = f-y[i]\n", "        dw1 = (f-y[i])*X[i] \n", "        dw[0] = dw[0] + dw0\n", "        dw[1] = dw[1] + dw1\n", "    dw[0] = (1/m) * dw[0]\n", "    dw[1] = (1/m) * dw[1] \n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_gradient(X, y, w): \n", "    \"\"\"\n", "    Computes the gradient for linear regression \n", "    \n", "    Args:\n", "      X : (array_like Shape (m,)) variable such as house size \n", "      y : (array_like Shape (m,)) actual value \n", "      w : (array_like Shape (2,)) Initial values of parameters of the model \n", "    Returns\n", "      dw: (array_like Shape (2,)) The gradient of the cost w.r.t. the parameters w. \n", "          Note that dw has the same dimensions as w.\n", "    \"\"\"\n", "    m = len(X)\n", "    dw = [0,0] \n", "    ### START CODE HERE ### \n", "\n", "    ### END CODE HERE ### \n", "    \n", "    return dw" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Compute and display gradient with w initialized to zeroes\n", "initial_w = [0,0]\n", "grad = compute_gradient(X_train, y_train, initial_w)\n", "print('Gradient at initial w (zeros):', grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output**:\n", "```Gradient at initial w (zeros): [-300.0, -500000.0]```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now, let's try setting w to a value we know, from previous labs, is the optimal value\n", "initial_w = [0,0.2]\n", "grad = compute_gradient(X_train, y_train, initial_w)\n", "print('Gradient when w is set to optimal values:', grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output**:\n", "```Gradient when w is set to optimal values: [0.0, 0.0]``` \n", "As we expected, the gradient is zero at the \"bottom of the bowl\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one more test case to ensure we are using all the w values.\n", "initial_w = [0.1,0.1]\n", "grad = compute_gradient(X_train, y_train, initial_w)\n", "print('Gradient:', grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output**:\n", "```Gradient: [-149.9, -249850.0]``` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking the gradient\n", "What do these gradient values mean? \n", "If you have taken calculus, you may recall an early lecture describing a derivative as:\n", "$$\\frac{df(x)}{dx} = \\lim_{\\Delta{x} \\to 0} \\frac{f(x+\\Delta{x}) - f(x)}{\\Delta{x}}$$\n", "The derivative, then, is just a measure of how a small change in $x$, the $\\Delta{x}$ above, changes $f(x)$.\n", "\n", "Above, we calculated `dw1`, or $\\frac{\\partial J(\\mathbf{w})}{\\partial w_1}$, to be -249850.0. That says that when $\\mathbf{w} = [0.1,0.1]$, a small change in $w_1$ will result in a change in the **cost**, $J(\\mathbf{w})$, that is -249850.0 times that change.\n",
"Note the change in notation from $d$ to $\\partial{}$: it just indicates that $J$ has multiple dependencies and that this is a derivative with respect to one of them - a partial derivative.\n", "\n", "We can use this knowledge to perform a simple check of our implementation of the gradient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise 02 - Checking the Gradient\n", "Let's check our gradient implementation by calculating an approximation to the partial derivative with respect to $w_1$. We can't take $\\Delta{w_1}$ to zero as in the equation above, but we can use a small value: \n", "$$ \\frac{\\partial J(\\mathbf{w})}{\\partial w_1} \\approx \\frac{Cost(w_0,w_1+\\Delta{w_1})-Cost(w_0,w_1)}{\\Delta{w_1}}$$\n", "Of course, the same method can be applied to any of the parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"**Hints**\n", "\n", "```\n", "# calculate a derivative and compare with our implementation.\n", "delta = 0.00001\n", "w_check = [0.1,0.1]\n", "\n", "# compute the gradient using our derivation and implementation\n", "grad = compute_gradient(X_train, y_train, w_check)\n", "\n", "# compute point 1\n", "c1 = compute_cost(X_train,y_train,w_check)\n", "\n", "# increment parameter w_check[1] by delta, leave w_check[0] the same\n", "w_check[0] = w_check[0] # leave the same\n", "w_check[1] = w_check[1] + delta\n", "\n", "# compute point 2\n", "c2 = compute_cost(X_train,y_train,w_check)\n", "calculated_dw1 = (c2 - c1)/delta\n", "print(f\"calculated_dw1 {calculated_dw1:0.1f}, expected dw1 {grad[1]}\" )\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate a derivative and compare with our implementation.\n", "delta = 0.00001\n", "w_check = [0.1,0.1]\n", "\n", "# compute the gradient using our derivation and implementation\n", "### START CODE HERE ### \n", "\n", "\n", "# compute point 1\n", "\n", "\n", "# increment parameter w_check[1] by delta, leave w_check[0] the same\n", "\n", "\n", "# compute point 2\n", "\n", "### END CODE HERE ### \n", "\n", "print(f\"calculated_dw1 {calculated_dw1:0.1f}, expected dw1 {grad[1]}\" )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output**: \n", "```calculated_dw1 -249837.5, expected dw1 -249850.0``` \n", "Not *exactly* the same, but close. The real derivative would take delta to zero. Try changing the value of delta." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise 03 - Learning Parameters with Batch Gradient Descent \n", "\n", "You will now find the optimal parameters of a linear regression model by using batch gradient descent. Recall that *batch* means each update step uses all of the training examples. \n", "- You don't need to implement anything for this part. Simply run the cells below. \n", "- A good way to verify that gradient descent is working correctly is to look\n", "at the value of $J(\\mathbf{w})$ and check that it is decreasing with each step. \n", "- Assuming you have implemented the gradient and computed the cost correctly, your value of $J(\\mathbf{w})$ should never increase and should converge to a steady value by the end of the algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gradient_descent(X, y, w_in, cost_function, gradient_function, alpha, num_iters): \n", "    \"\"\"\n", "    Performs batch gradient descent to learn w. Updates w by taking\n",
"    num_iters gradient steps with learning rate alpha\n", "    \n", "    Args:\n", "      X : (array_like Shape (m,)) input data, such as house size\n", "      y : (array_like Shape (m,)) actual values\n", "      w_in : (array_like Shape (2,)) Initial values of parameters of the model\n", "      cost_function : function to compute the cost\n", "      gradient_function : function to compute the gradient\n", "      alpha : (float) Learning rate\n", "      num_iters : (int) number of iterations to run gradient descent\n", "    Returns\n", "      w : (array_like Shape (2,)) Updated values of parameters of the model after\n", "          running gradient descent\n", "      J_history : (list) Cost at each saved iteration\n", "      w_history : (list) Parameters at selected iterations\n", "    \"\"\"\n", "    \n", "    # number of training examples\n", "    m = len(X)\n", "    w = copy.deepcopy(w_in) # avoid modifying global w\n", "    # Lists to store cost J and w at each iteration, primarily for graphing later\n", "    J_history = []\n", "    w_history = []\n", "    \n", "    for i in range(num_iters):\n", "\n", "        # Calculate the gradient and update the parameters\n", "        gradient = gradient_function(X, y, w)\n", "\n", "        # Update Parameters \n", "        w[0] = w[0] - alpha * gradient[0]\n", "        w[1] = w[1] - alpha * gradient[1]\n", "\n", "        # Save cost J at each iteration\n", "        if i < 100000:  # prevent resource exhaustion \n", "            J_history.append(cost_function(X, y, w))\n", "\n", "        # Print cost and save w at 10 evenly spaced intervals (or every iteration if num_iters < 10)\n", "        if i % math.ceil(num_iters/10) == 0:\n", "            w_history.append([w[0],w[1]])\n", "            print(f\"Iteration {i:4}: Cost {J_history[-1]:8.2f} \",\n", "                  f\"gradient: {gradient[0]:9.4f},{gradient[1]:14.4f}\")\n", "\n", "    return w, J_history, w_history # return w and J,w history for graphing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# initialize parameters\n", "w_init = [0,0]\n", "# some gradient descent settings\n", "iterations = 1000\n", "alpha = 1.0e-8\n", "# run gradient descent\n", "w_final, J_hist, w_hist = gradient_descent(X_train, y_train, w_init, compute_cost, compute_gradient, alpha, iterations)\n", "print(f\"w found by gradient descent: ({w_final[0]:8.4f},{w_final[1]:8.4f})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output**: \n", "```w found by gradient descent: (0.0001,0.2000)``` \n", "As we expected, the calculated parameter values are very close to (0,0.2) from previous labs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"1000 sqft house estimate {w_final[0] + w_final[1]*1000:0.2f} Thousand dollars\")\n", "print(f\"1200 sqft house estimate {w_final[0] + w_final[1]*1200:0.2f} Thousand dollars\")\n", "print(f\"2000 sqft house estimate {w_final[0] + w_final[1]*2000:0.2f} Thousand dollars\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot cost vs iteration \n", "plt.plot(J_hist)\n", "plt.title(\"Cost vs iteration\")\n", "plt.ylabel('Cost')\n", "plt.xlabel('iteration step')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot shows that we rapidly reduced the cost early. Recall from lecture that the gradient tends to be larger when further from the optimum, creating larger step sizes. As you approach the final value, the gradient is smaller, resulting in smaller step sizes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting\n", "Let's produce some of the fancy graphs that are popular for showing gradient descent. First we'll create some extra test cases."
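] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(Optional)* Before generating the extra test cases, you can verify the claim from Exercise 03 that the cost never increases. A minimal sketch, assuming `J_hist` returned by the run above is still in scope:\n", "```\n", "# True if the recorded cost never increased between consecutive iterations\n", "print(all(J_hist[i+1] <= J_hist[i] for i in range(len(J_hist) - 1)))\n", "```\n", "For the run above this should print `True`; if it ever prints `False`, the learning rate is too large."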
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# generate some more paths\n", "w_init = [400,0.6]\n", "# some gradient descent settings\n", "iterations = 1000\n", "alpha = 1.0e-7\n", "# run gradient descent\n", "w2_final, J2_hist, w2_hist = gradient_descent(X_train, y_train, w_init, compute_cost, compute_gradient, alpha, iterations)\n", "print(f\"w found by gradient descent: ({w2_final[0]:0.4f},{w2_final[1]:0.4f})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the cost seems to have **plateaued**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# generate some more paths\n", "w_init = [100,0.1]\n", "# some gradient descent settings\n", "iterations = 5\n", "alpha = 1.0e-6 # larger alpha\n", "# run gradient descent\n", "w3_final, J3_hist, w3_hist = gradient_descent(X_train, y_train, w_init, compute_cost, compute_gradient, alpha, iterations)\n", "print(f\"w found by gradient descent: ({w3_final[0]:0.4f},{w3_final[1]:0.4f})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the cost is **increasing**!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mpl_toolkits.mplot3d import axes3d\n", "import matplotlib.pyplot as plt\n", "from matplotlib import cm\n", "import numpy as np\n", "\n", "w0 = np.arange(-500, 500, 5)\n", "w1 = np.arange(-0.2, 0.8, 0.005)\n", "w0,w1 = np.meshgrid(w0,w1)\n", "z = np.zeros_like(w0)\n", "n,_ = w0.shape\n", "for i in range(n):\n", "    for j in range(n):\n", "        z[i][j] = compute_cost(X_train, y_train, [w0[i][j],w1[i][j]] )\n", "\n", "\n", "fig = plt.figure(figsize=(24,6))\n", "\n", "ax = fig.add_subplot(1, 2, 2)\n", "CS = ax.contour(w1, w0, z,[0,50,1000,5000,10000,25000,50000])\n", "plt.clabel(CS, inline=1, fmt='%1.0f', fontsize=10)\n", "plt.title('Contour plot of cost J(w) vs w0,w1 with path of gradient descent')\n", "\n", "w_sub = [ (i[1],i[0]) for i in w_hist]\n", "for i in range(len(w_sub)-1):\n", "    plt.annotate('', xy=w_sub[i + 1], xytext=w_sub[i],\n", "                 arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1},\n", "                 va='center', ha='center')\n", "\n", "w_sub = [ (i[1],i[0]) for i in w2_hist]\n", "for i in range(len(w_sub)-1):\n", "    plt.annotate('', xy=w_sub[i + 1], xytext=w_sub[i],\n", "                 arrowprops={'arrowstyle': '->', 'color': 'b', 'lw': 1},\n", "                 va='center', ha='center')\n", "\n", "w_sub = [ (i[1],i[0]) for i in w3_hist]\n", "for i in range(len(w_sub)-1):\n", "    plt.annotate('', xy=w_sub[i + 1], xytext=w_sub[i],\n", "                 arrowprops={'arrowstyle': '->', 'color': 'g', 'lw': 1},\n", "                 va='center', ha='center')\n", "\n", "ax.set_xlabel('w_1')\n", "ax.set_ylabel('w_0')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"**Expected graph**: a contour plot of the cost $J(\\mathbf{w})$ with the red, blue, and green gradient descent paths drawn on it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is this graph showing? The ellipses are contours of the cost $J(\\mathbf{w})$. The lines are the paths taken from the initial values of $(w_0,w_1)$ to their final values. \n", "The **red line** is our first run with w_init = (0,0). Gradient descent successfully moves the parameters to (0,0.2), where the cost is a minimum. But what about the blue and green lines? \n", "The **blue** line has w_init = (400,0.6) and alpha = 1.0e-7. Notice that while `w1` moves, `w0` doesn't seem to move. Why? \n", "The **green** line has w_init = (100,0.1) and alpha = 1.0e-6. It never fully converges. Why?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"**Hints**\n", "\n", "In next week's lectures we will cover some fine tuning of gradient descent that is required to get it to run well. The **blue line** is one of these cases. It does not seem that `w0` is being updated, but it is, just slowly. The gradient for `w1` is multiplied by $x^{(i)}$, the square footage of the houses in the dataset, a value in the thousands. This makes `w1` update much more quickly than `w0`. Review the update equations (4) and (5) above. With alpha = 1.0e-7, it will take many iterations to update `w0` to the right value. \n", "\n", "Why not just increase the value of alpha? The **green** line demonstrates the problem with doing this. We use a larger value for alpha in that run and the solution _diverges_. The update for `w1` is so large that the cost is larger on each iteration rather than smaller. If you run it long enough, you will generate a numerical overflow (try it). The lecture described this scenario. \n", "\n", "So, we have a situation where alpha is too big for `w1` but too small for `w0`. A means of dealing with this will be described next week. It involves _scaling_ or _normalizing_ the features in the data set so they fall within the same range. Once the data is normalized, alpha will impact all features evenly.\n", "\n", "Another way to handle this is to select the largest value of alpha you can that doesn't cause the solution to diverge, and then run it a long time. Try this in the next section _if you have the time!_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TAKES A LONG TIME, 10 minutes or so.\n", "w_init = [400,0.1]\n", "# some gradient descent settings\n", "iterations = 40000000\n", "alpha = 7.0e-7\n", "# run gradient descent\n", "w4_final, J4_hist, w4_hist = gradient_descent(X_train, y_train, w_init, compute_cost, compute_gradient, alpha, iterations)\n", "print(f\"w found by gradient descent: ({w4_final[0]:0.4f},{w4_final[1]:0.4f})\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w0 = np.arange(-500, 500, 5)\n", "w1 = np.arange(-0.2, 0.8, 0.005)\n", "w0,w1 = np.meshgrid(w0,w1)\n", "z = np.zeros_like(w0)\n", "n,_ = w0.shape\n", "for i in range(n):\n", "    for j in range(n):\n", "        z[i][j] = compute_cost(X_train, y_train, [w0[i][j],w1[i][j]] )\n", "\n", "\n", "fig = plt.figure(figsize=(24,6))\n", "\n", "ax = fig.add_subplot(1, 2, 2)\n", "CS = ax.contour(w1, w0, z,[0,50,1000,5000,10000,25000,50000])\n", "plt.clabel(CS, inline=1, fmt='%1.0f', fontsize=10)\n", "plt.title('Contour plot of cost, w0 vs w1')\n", "\n", "w_sub = [ (i[1],i[0]) for i in w_hist]\n", "for i in range(len(w_sub)-1):\n", "    plt.annotate('', xy=w_sub[i + 1], xytext=w_sub[i],\n", "                 arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1},\n", "                 va='center', ha='center')\n", "\n", "w_sub = [ (i[1],i[0]) for i in w4_hist]\n", "for i in range(len(w_sub)-1):\n", "    plt.annotate('', xy=w_sub[i + 1], xytext=w_sub[i],\n", "                 arrowprops={'arrowstyle': '->', 'color': 'c', 'lw': 1},\n", "                 va='center', ha='center')\n", "\n", "ax.set_xlabel('w_1')\n", "ax.set_ylabel('w_0')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cyan line is our long-running solution. Scaling or normalizing the features will get us to the right solution faster. We will cover this next week."
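] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are curious what _scaling_ buys us, here is a minimal sketch - not the normalization method from lecture, just dividing the sizes by 1000 so the feature is of order one (the names `X_scaled` and `w_s` are used only in this sketch):\n", "```\n", "# rescale the feature so its magnitude is comparable to the intercept\n", "X_scaled = [x/1000 for x in X_train]   # sizes become 1.0 and 2.0\n", "w_s, J_s, _ = gradient_descent(X_scaled, y_train, [0, 0], compute_cost, compute_gradient, 1.0e-1, 1000)\n", "print(f\"1000 sqft house estimate {w_s[0] + w_s[1]*1.0:0.2f} Thousand dollars\")\n", "```\n", "With the feature scaled, a much larger alpha is stable and both parameters move together; the estimate should come out close to the 200 we expect."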
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 5 }