Jekyll2022-08-10T11:06:53-07:00https://justinmath.com/feed.xmlJustin SkycakJustin Skycakjpskycak@gmail.com[INCOMPLETE DRAFT] Reimplementing Blondie24: Part 22022-03-07T00:00:00-08:002022-03-07T00:00:00-08:00https://justinmath.com/reimplementing-blondie24-part-2<p>(in progress)</p> <!-- "Evolving an expert checkers playing program without using human expertise" by K. Chellapilla and D.B. Fogel I think this is similar to above but uses a CNN? -->Reimplementing Blondie24: Part 12022-03-06T00:00:00-08:002022-03-06T00:00:00-08:00https://justinmath.com/reimplementing-blondie24-part-1<p>Fogel and Chellapilla’s Blondie24 was published over the course of two papers. Here we shall address the first paper, <i>Evolving Neural Networks to Play Checkers without Relying on Expert Knowledge</i>, published in 1999.</p> <p>This first version of Blondie24 operated on principles similar to those of Fogel’s tic-tac-toe player, described <a class="body" target="_blank" href="https://justinmath.com/reimplementing-fogels-tic-tac-toe-player">previously</a>. However, there are a number of important differences that are detailed below.</p> <h2>Neural Network Architecture</h2> <p>The neural net consists of the following layers:</p> <ul> <li><i>Input Layer:</i> $32$ linearly-activated nodes and $1$ bias node (a checkers board has $64$ squares but only half of them are used)</li> <li><i>First Hidden Layer:</i> $40$ tanh-activated nodes and $1$ bias node</li> <li><i>Second Hidden Layer:</i> $10$ tanh-activated nodes and $1$ bias node</li> <li><i>Output Layer:</i> $1$ tanh-activated node</li> </ul> <p>Additionally, there is a special node called a <i>piece difference node</i> whose activity is the sum of the $32$ input nodes. The piece difference node connects directly to the output node, bypassing all the layers.
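</p>

<p>To make the architecture concrete, here is a minimal sketch of the forward pass in plain Python. This is an illustration, not code from the paper: all names (such as <code>evaluate_board</code>) and the weight storage as lists of lists are our own assumptions.</p>

```python
import math, random

random.seed(0)

def rand_matrix(rows, cols):
    # Initial weights drawn from [-0.2, 0.2], as in the evolution procedure below.
    return [[random.uniform(-0.2, 0.2) for _ in range(cols)] for _ in range(rows)]

W1 = rand_matrix(40, 33)   # 32 inputs + 1 bias -> first hidden layer
W2 = rand_matrix(10, 41)   # 40 hidden + 1 bias -> second hidden layer
W3 = rand_matrix(1, 11)    # 10 hidden + 1 bias -> output node
w_piece_diff = random.uniform(-0.2, 0.2)   # weight on the piece-difference connection

def layer(W, x):
    x = x + [1.0]   # append the bias node's constant activity
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def evaluate_board(board):
    """`board` is a length-32 list over {1, 0, -1, K, -K}."""
    h1 = layer(W1, board)
    h2 = layer(W2, h1)
    # The piece-difference node sums the 32 inputs and feeds the output directly.
    pre = sum(w * xi for w, xi in zip(W3[0], h2 + [1.0])) + w_piece_diff * sum(board)
    return math.tanh(pre)
```

<p>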
The connection, of course, has a variable weight that is learned by the network.</p> <p><br /> <b>Converting Board to Input</b></p> <p>This is similar to tic-tac-toe in that the player’s own regular pieces are labeled with $1,$ empty squares with $0,$ and opponent regular pieces with $-1.$ However, the player’s own king pieces are labeled with $K,$ and the opponent’s with $-K,$ where $K$ is a variable that is learned by the network.</p> <p><br /> <b>Converting Output to Action</b></p> <p>An action is chosen via the minimax algorithm using the following <a class="body" target="_blank" href="https://justinmath.com/reduced-search-depth-and-heuristic-evaluation-for-connect-four">heuristic evaluation function</a>. As the network learns, this heuristic evaluation function will become more accurate.</p> <ol> <li>If a board state is a win or a loss, return $1$ or $-1$ respectively.</li> <li>Otherwise, pass the board state as input to the neural network and return the activity of the output node.</li> </ol> <p>The search depth is set to $d=4$ to allow for reasonable execution times.</p> <h2>Evolution Procedure</h2> <p>The initial population consists of $15$ networks with initial weights randomly chosen from the range $[-0.2, 0.2],$ mutation rates set to $\alpha = 0.05,$ and $K=2.$</p> <p><br /> <b>Replication</b></p> <p>The evolution procedure follows the same rules as those described <a class="body" target="_blank" href="https://justinmath.com/introduction-to-blondie24-and-neuroevolution">previously</a> when evolving neural network regressors.</p> <p>However, in addition to updating the mutation rate and weights, $K$ is also updated through the following rule:</p> <center> \begin{align*} K^\text{child} = K^\text{parent} e^{ N(0,1) / \sqrt{2} } \end{align*} </center> <p><br /> Note that $K$ is constrained to the range $[1,3],$ meaning that</p> <ul> <li>if $K$ falls below $1$ then it is immediately set to $1,$ and</li> <li>if $K$ rises above $3$ then it is immediately set to
$3.$</li> </ul> <p><br /> <b>Evaluation</b></p> <p>Each of the $30$ networks in a generation plays one game of checkers against each of $5$ other networks randomly selected (with replacement) from the generation. The network is allowed to move first during each game, and it receives a payoff of $1$ for a win, $0$ for a tie, and $-2$ for a loss. (A tie is declared after $100$ moves by each player with no winner.)</p> <p>The $15$ networks with the highest total payoffs are selected as the parents of the next generation.</p> <h2>Performance Curve</h2> <p>In their paper, Fogel and Chellapilla did not create a curve to demonstrate performance as a function of the number of generations. Instead, they played their final network against human players online and demonstrated that it achieved an impressive performance rating.</p> <p>Here, we will create a performance curve by playing the evolving networks against an external algorithmic strategy and measuring their performance. This can be accomplished as follows:</p> <ol> <li>Develop a heuristic checkers player by hand that plays slightly intelligently. It should capitalize on obvious opportunities to move its pieces forward and jump opponent pieces, but it should not attempt to plan into the future.</li> <li>During each generation of the evolutionary procedure, before replication, play each of the $15$ parent networks against your heuristic player and compute the average payoff.</li> <li>Keep evolving new generations until the average payoff levels off.</li> </ol> <p>The resulting plot should show that the average payoff increases with the number of generations (up to some point), demonstrating that the evolving networks are learning to play checkers intelligently.</p> <p>Keep in mind that your hand-crafted heuristic checkers player is not actually used during the evolution procedure – it is only used to measure how the evolved networks perform against an external opponent.
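</p>

<p>One generation of the evaluate-and-select loop described above might be sketched as follows. This is a rough outline under our own naming assumptions: <code>play_game</code> and <code>mutate</code> are placeholders for your game logic and replication rule, not functions from the paper.</p>

```python
import random

def run_generation(networks, play_game, mutate, n_opponents=5):
    """One generation: tournament payoffs, then keep the top half.

    `networks` holds 30 nets (15 parents + 15 children); `play_game(a, b)`
    is assumed to return +1 (win for `a`), 0 (tie), or -1 (loss for `a`).
    """
    payoffs = {id(net): 0 for net in networks}
    for net in networks:
        # Each net moves first against 5 opponents drawn with replacement.
        for opponent in random.choices(networks, k=n_opponents):
            result = play_game(net, opponent)
            payoffs[id(net)] += {1: 1, 0: 0, -1: -2}[result]  # win / tie / loss payoffs
    # The 15 highest-payoff nets become the parents of the next generation.
    parents = sorted(networks, key=lambda n: payoffs[id(n)], reverse=True)[:15]
    return parents + [mutate(p) for p in parents]
```

<p>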
So, anything that the evolving networks “learn” is organic and self-taught, not tailored to the specifics of your hand-crafted player.</p>Reimplementing Fogel’s Tic-Tac-Toe Paper2022-03-05T00:00:00-08:002022-03-05T00:00:00-08:00https://justinmath.com/reimplementing-fogels-tic-tac-toe-paper<p>The goal of this section is to reimplement the paper <i>Using Evolutionary Programming to Create Neural Networks that are Capable of Playing Tic-Tac-Toe</i> by David Fogel in 1993. This paper preceded Blondie24, and many of the principles introduced in this paper were extended in Blondie24. As such, reimplementing this paper provides good scaffolding as we work our way up to reimplementing Blondie24.</p> <p>The information needed to reimplement this paper is outlined below.</p> <h2>Neural Network Architecture</h2> <p>The neural net consists of the following layers:</p> <ul> <li><i>Input Layer:</i> $9$ linearly-activated nodes and $1$ bias node</li> <li><i>Hidden Layer:</i> $H$ sigmoidally-activated nodes and $1$ bias node ($H$ is variable, as will be described later)</li> <li><i>Output Layer:</i> $9$ sigmoidally-activated nodes</li> </ul> <p><br /> <b>Converting Board to Input</b></p> <p>A tic-tac-toe board is converted to input by flattening it into a vector and replacing $\textrm X$ with $1,$ empty squares with $0,$ and $\textrm O$ with $-1.$</p> <p>For example, given a board</p> <center> \begin{align*} \begin{bmatrix} \textrm{X} &amp; \textrm{O} &amp; \square \\ \square &amp; \textrm{X} &amp; \textrm{O} \\ \square &amp; \square &amp; \square \end{bmatrix}, \end{align*} </center> <p><br /> we first concatenate consecutive rows to flatten the board into the following $9$-element vector:</p> <center> \begin{align*} \left&lt; \textrm{X},
\textrm{O}, \square, \square, \textrm{X}, \textrm{O}, \square, \square, \square \right&gt;. \end{align*} </center> <p><br /> Then, we replace $\textrm X$ with $1,$ empty squares with $0,$ and $\textrm O$ with $-1$ to get the final input vector:</p> <center> \begin{align*} \left&lt; 1, -1, 0, 0, 1, -1, 0, 0, 0 \right&gt; \end{align*} </center> <p><br /></p> <p><b>Converting Output to Action</b></p> <p>The output layer consists of $9$ nodes, one for each board square. To convert the output values into an action, we do the following:</p> <ol> <li>Discard any values that correspond to a board square that has already been filled. (This will prevent illegal moves.)</li> <li>Identify the empty board square with the maximum value. We move into this square.</li> </ol> <h2>Evolution Procedure</h2> <p>The initial population consists of $50$ networks. In each network, the number of hidden nodes $H$ is randomly chosen from the set $\{ 1, 2, \ldots, 10 \}$ and the initial weights are randomly chosen from the range $[-0.5, 0.5].$</p> <p><br /> <b>Replication</b></p> <p>A network replicates by making a copy of itself and then modifying the copy as follows:</p> <ul> <li>Each weight is incremented by $N(0,0.05),$ a value drawn from the normal distribution with mean $0$ and variance $0.05.$</li> <li>With $0.5$ probability, we modify the network architecture by randomly choosing between adding or deleting a hidden node.
If we add a node, then we initialize its associated weights with values of $0.$</li> </ul> <p>Note that when modifying the architecture, we abort any decision that would lead the number of hidden nodes to exit the range $\{ 1, 2, \ldots, 10 \}.$ More specifically:</p> <ul> <li>We abort the decision to delete a hidden node if the number of hidden nodes is $H=1.$</li> <li>We abort the decision to add a hidden node if the number of hidden nodes is $H=10.$</li> </ul> <p><br /> <b>Evaluation</b></p> <p>In each generation, each network plays $32$ games against a near-perfect but beatable opponent. The evolving network is always allowed to move first, and receives a payoff of $1$ for a win, $0$ for a tie, and $-10$ for a loss.</p> <p>The near-perfect opponent follows the strategy below:</p> <pre><code>With 10% chance:
- Randomly choose an open square to move into.
Otherwise:
- If the next move can be chosen to win the game, do so.
- Otherwise, if the next move can be chosen to block the opponent's win, do so.
- Otherwise, if there are 2 open squares in a line with the opponent's marker,
  randomly move into one of those squares.
- Otherwise, randomly choose an open square to move into.</code></pre> <p>Once the total payoff (over $32$ games against the near-perfect opponent) has been computed for each of the $50$ networks, a second round of evaluation occurs to select the networks that will proceed to the next generation and replicate.</p> <p>In the second round of evaluation, each network is given a score that represents how its total payoff (from matchups with the near-perfect opponent) compares to the total payoffs of some other networks in the same generation.
Specifically, each network is compared to $10$ other networks randomly chosen from the generation, and its score is incremented for each other network that has a lower total payoff.</p> <p>The network(s) with the maximum score proceed to the next generation and replicate.</p> <h2>Performance Curve</h2> <p>Generate a performance curve as follows:</p> <ol> <li>Run the above procedure for $800$ generations, keeping track of the maximum total payoff (i.e. the best player's total payoff) at each generation.</li> <li>Then repeat $19$ more times, for a total of $20$ trials of $800$ generations each.</li> <li>Finally, plot the mean maximum total payoff (averaged over the $20$ trials) as a function of the number of generations.</li> </ol> <p>The resulting curve should resemble the following shape:</p> <center><img src="https://justinmath.com/files/blog/fogel-tic-tac-toe-performance-curve.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p>Introduction to Blondie24 and Neuroevolution2022-03-04T00:00:00-08:002022-03-04T00:00:00-08:00https://justinmath.com/introduction-to-blondie24-and-neuroevolution<p><a class="body" target="_blank" href="https://justinmath.com/pruned-game-trees-and-heuristics-for-connect-four">Previously</a>, we built strategic connect four players by constructing a pruned game tree, using heuristics to rate terminal states, and then applying the minimax algorithm.
This was a combination of game-specific human intelligence (heuristics) and generalizable artificial intelligence (minimax on a game tree).</p> <h2>Blondie24</h2> <p>In the 1990s, a researcher named David Fogel managed to automate the process of rating states in pruned game trees without relying on heuristics or any other human input. In particular, he and his colleague Kumar Chellapilla created a computer program that achieved expert-level checkers-playing ability by learning from scratch. They played it against other humans online under the username Blondie24, pretending to be a 24-year-old blonde female college student.</p> <p>Blondie24 was particularly noteworthy because other successful game-playing agents had been hand-tuned and/or trained on human-expert strategies. Unlike these agents, Blondie24 learned <i>without</i> having any access to information regarding human-expert strategies.</p> <p>To automate the process of rating states in pruned game trees, Fogel turned it into a regression problem: given a game state, predict its value. Of course, the regression function is pretty complicated (e.g. changing one piece on a chess board can totally change the outcome of the game), so the natural choice was to use a neural network.</p> <p>However, the usual method of training a neural network, backpropagation, does not work in this setting. Backpropagation relies on a dataset of pairs of inputs and outputs – which means that the model would need a dataset of game states along with their correct ratings, totally defeating the purpose of getting the model to learn this information from scratch. In this setting, the only feedback the computer gets is at the very end of the game, whether it won or lost (or tied).</p> <h2>Neuroevolution</h2> <p>To get around this issue, Fogel trained neural networks via <b>evolution</b>, which is often referred to as <b>neuroevolution</b> in the context of neural networks.
Starting with a population of many neural networks with random weights, he repeatedly</p> <ol> <li>played the networks against each other,</li> <li>discarded the networks that performed worse than average,</li> <li>duplicated the remaining networks, and then</li> <li>randomly perturbed the weights of the duplicate networks.</li> </ol> <p>This is analogous to the concept of evolution in biology in which weak organisms die and fit organisms survive to produce similar but slightly mutated offspring. By repeatedly running the evolutionary procedure, Fogel was able to evolve a neural network whose internal mapping from input state to output rating caused it to play the game of checkers in an intelligent way, without any sort of human input.</p> <h2>Exercise: Evolving a Neural Network Regressor</h2> <p>Before we reimplement Fogel’s papers leading up to Blondie24, let’s first gain some experience with neuroevolution in a simpler case. As a toy problem, consider the following data set:</p> <pre><code>[ (0.0, 1.0),    (0.04, 0.81),  (0.08, 0.52),  (0.12, 0.2),   (0.17, -0.12),
  (0.21, -0.38), (0.25, -0.54), (0.29, -0.58), (0.33, -0.51), (0.38, -0.34),
  (0.42, -0.1),  (0.46, 0.16),  (0.5, 0.39),   (0.54, 0.55),  (0.58, 0.61),
  (0.62, 0.55),  (0.67, 0.38),  (0.71, 0.12),  (0.75, -0.19), (0.79, -0.51),
  (0.83, -0.77), (0.88, -0.95), (0.92, -1.0),  (0.96, -0.91), (1.0, -0.7) ]</code></pre> <p>We will fit the above data using a neural network regressor with the following architecture:</p> <ul> <li><i>Input Layer:</i> 1 linearly-activated node + 1 bias node</li> <li><i>First Hidden Layer:</i> 10 tanh-activated nodes + 1 bias node</li> <li><i>Second Hidden Layer:</i> 6 tanh-activated nodes + 1 bias node</li> <li><i>Third Hidden Layer:</i> 3 tanh-activated nodes + 1 bias node</li> <li><i>Output Layer:</i> 1 tanh-activated node</li> </ul> <p>Remember that the hyperbolic tangent function is defined as</p> <center> \begin{align*} \tanh x = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}.
\end{align*} </center> <p><br /> To train the neural network, use the following evolutionary algorithm (which is based on the Blondie24 approach):</p> <ol> <li>Create a population of $15$ neural networks with weights randomly drawn from $[-0.2, 0.2].$ Additionally, assign a mutation rate to each net, initially equal to $\alpha = 0.05.$</li> <li>Each of the $15$ parents replicates to produce a single child. The child is given mutation rate <center> \begin{align*} \alpha^\text{child} = \alpha^\text{parent} e^{ N(0,1) / \sqrt{ 2 \sqrt{|W|}} } \end{align*} </center> <br /> and weights <br /><br /> <center> \begin{align*} w_{ij}^\text{child} = w_{ij}^\text{parent} + \alpha^\text{child} N(0,1), \end{align*} </center> <br /> where $N(0,1)$ is a random number drawn from the standard normal distribution and $|W|$ is the number of weights in the network. Be sure to draw a different random number for each instance of $N(0,1).$</li> <li>Compute the RSS for each of the $30$ nets (the $15$ parents and their $15$ children) and select the $15$ nets with the lowest RSS. These will be the parents in the next generation.</li> <li>Go back to step 2.</li> </ol> <p>Make a plot of the average RSS at each generation, run the algorithm until the graph levels off to nearly $0,$ and then plot the regression curves corresponding to the first and last generations of neural networks on a graph along with the data.</p> <p>(The regression curve plot will contain $60$ different curves drawn on the same plot: one curve from each of the $30$ nets in the first generation, and one curve from each of the $30$ nets in the last generation.)</p> <p>The first generation curves will not fit the data at all (they will appear flat), but the final generation of regression curves should fit the data remarkably well.
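</p>

<p>The replication rule in step 2 can be sketched as follows. This is a minimal illustration under our own assumptions (the network's weights stored as a flat list, and the function name <code>replicate</code> is ours):</p>

```python
import math, random

def replicate(weights, alpha):
    """Produce a child's (weights, alpha) via the self-adaptive mutation rule.

    `weights` is a flat list of floats, so |W| is just its length.
    """
    n_weights = len(weights)
    # alpha_child = alpha_parent * exp( N(0,1) / sqrt(2 * sqrt(|W|)) )
    alpha_child = alpha * math.exp(random.gauss(0, 1) / math.sqrt(2 * math.sqrt(n_weights)))
    # w_child = w_parent + alpha_child * N(0,1), with a fresh draw per weight
    child_weights = [w + alpha_child * random.gauss(0, 1) for w in weights]
    return child_weights, alpha_child
```

<p>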
Note that the training process may require on the order of a thousand generations.</p> <!-- <center><img src="https://justinmath.com/files/blog/neuroevolution-results.png" style="border: none; height: 25em;" alt="icon"></center><br> --> <h2>Exercise: Hyperparameter Tuning</h2> <p>Once you’ve got this working, try tuning hyperparameters to get the RSS to converge to nearly $0$ in as few generations as possible. You can tweak the initial mutation rate, initial weight distribution, number of neural networks, and neural network architecture (i.e. number of hidden layers and their sizes).</p>Reduced Search Depth and Heuristic Evaluation for Connect Four2022-03-03T00:00:00-08:002022-03-03T00:00:00-08:00https://justinmath.com/reduced-search-depth-and-heuristic-evaluation-for-connect-four<p>In theory, we could solve any game by building a big game tree, labeling the terminal states as wins, losses, or ties, and then working backwards from that information to identify the minimax strategy. But in practice, game trees get so big so quickly that for all but the most simple games, game trees are too expensive to store in memory and take too long to traverse.</p> <p>For example, consider the game of connect four, which is normally played on a $7 \times 6$ board. Whereas tic-tac-toe had $5\,478$ valid board states, connect four has $4\,531\,985\,219\,092$ (<a class="body" target="_blank" href="https://oeis.org/A212693">source</a>).
Implementing a game tree of this size is infeasible – if you’re not convinced, try running a simple “for” loop that loops over the numbers from $1$ to $4\,531\,985\,219\,092.$ If looping over a million numbers takes a couple seconds, then looping over a trillion numbers will take weeks.</p> <h2>Reduced Search Depth</h2> <p>We can’t build a full game tree for connect four. But what we can do instead is</p> <ol> <li><b>reduce the search depth</b> (i.e. build a shortened game tree up to some maximum depth), and</li> <li>come up with some kind of <b>heuristic evaluation function</b> to rate how good or bad each leaf node state is.</li> </ol> <p>Then we can apply the minimax strategy in an attempt to move in the direction of the best leaf node state.</p> <p>This procedure for selecting a move can be outlined more explicitly as follows:</p> <ol> <li>Build a game tree that is $N$ layers deep from the current game state. (This is referred to as an <b>$N$-ply</b> game tree.)</li> <li>Use the heuristic evaluation function to assign minimax values to the terminal nodes in the game tree.</li> <li>Repeatedly propagate those values up to parent nodes using the minimax algorithm.</li> <li>Choose your action in accordance with the standard minimax strategy (i.e. choose the move that takes you to the child state with the highest minimax value).</li> </ol> <p>Note that this time, you’ll have to relabel the game tree on every move because the terminal nodes of the tree will change (thereby changing the minimax values of the rest of the tree). But you don’t need to rebuild the full game tree on every move – you can take the existing game tree, prune off nodes that are no longer relevant, and grow the additional nodes needed to bring you back to a search depth of $N.$</p> <h2>Heuristic Evaluation Function</h2> <p>Now, let’s talk about the “secret sauce” in this recipe: the heuristic evaluation function.
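</p>

<p>Before digging into it, the reduced-depth move-selection procedure above might be sketched as follows, with <code>game</code> standing in for your own game logic and <code>heuristic_eval</code> as a placeholder for the heuristic evaluation function discussed next (both names are ours, not a fixed API):</p>

```python
def minimax_value(state, depth, maximizing, game, heuristic_eval):
    """Depth-limited minimax. `game` is assumed to provide
    game.is_terminal(state) and game.children(state); `heuristic_eval(state)`
    is assumed to return a value in [-1, 1]."""
    if game.is_terminal(state) or depth == 0:
        return heuristic_eval(state)
    values = [minimax_value(c, depth - 1, not maximizing, game, heuristic_eval)
              for c in game.children(state)]
    return max(values) if maximizing else min(values)

def best_move(state, depth, game, heuristic_eval):
    # Pick the child state with the highest minimax value (our turn maximizes).
    return max(game.children(state),
               key=lambda c: minimax_value(c, depth - 1, False, game, heuristic_eval))
```

<p>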
This function takes a game state as input, and returns a number between $-1$ and $1$ that represents how strongly we want or do not want to be in any given game state. To write this function we use our human intuition about the game. Here are some rough guidelines:</p> <ul> <li>If we're 100% confident that a game state is a win or will result in a win, then the function should return $1.$</li> <li>If we think that a win is more likely than a loss, then the function should return a decimal between $0$ and $1,$ with higher win probabilities corresponding to higher decimals.</li> <li>If we have no idea whether a game state will result in a win or loss, or we think it will result in a tie, then the function should return $0.$</li> <li>If we think that a win is less likely than a loss, then the function should return a decimal between $-1$ and $0,$ with lower win probabilities corresponding to lower decimals.</li> <li>If we're 100% confident that a game state is a loss or will result in a loss, then the function should return $-1.$</li> </ul> <p>For example, here is a simple heuristic function for tic-tac-toe:</p> <ol> <li>If the game state is a definite win, tie, or loss, then return $1,$ $0,$ or $-1$ respectively.</li> <li>Otherwise, count up the number of rows, columns, and diagonals where you occupy two spaces and the third space is empty. Then, subtract the number of rows, columns, and diagonals where your opponent occupies two spaces and the third space is empty. Finally, divide the result by $8$ (which is the total number of rows, columns, and diagonals).</li> </ol> <h2>Exercises</h2> <p>Remember to alternate who goes first in the matchups described below.</p> <ol> <li>Implement a 9-ply heuristic minimax tic-tac-toe player, i.e. it uses the heuristic evaluation function described above and a search depth of $N=9$ (which happens to be the full tree). Then, run it against the perfect minimax player that you created previously.
Every game should result in a tie.</li> <li>Implement a 5-ply heuristic minimax tic-tac-toe player. Then, run it against your 9-ply heuristic player, as well as a purely random player. The 5-ply player should do better than the random player but worse than the 9-ply player.</li> <li>Develop a heuristic minimax connect four player that uses as many ply as can be computed quickly, and verify that it performs better than a random player.</li> <li>Develop another heuristic minimax connect four player that uses a better heuristic evaluation function. Then, verify that this second heuristic player is better than your first player. (If you can't figure out how to improve your function, then make the second player's function intentionally worse than that of the first player and verify that it indeed performs worse.)</li> <li>Play against your heuristic minimax connect four player yourself.</li> </ol>Minimax Strategy2022-03-02T00:00:00-08:002022-03-02T00:00:00-08:00https://justinmath.com/minimax-strategy<p>The <b>minimax strategy</b> is a powerful game-playing strategy that operates on game trees.
It works by maximizing the worst-case scenario that could potentially arise from every move.</p> <h2>Minimax Algorithm</h2> <p>The minimax strategy chooses actions according to the following algorithm:</p> <ol> <li>Create a game tree with all the states of the game.</li> <li>Identify each node that represents a terminal state and assign it a minimax value of $1,$ $-1,$ or $0$ depending on whether it corresponds to a win, loss, or tie for you.</li> <li>Repeatedly propagate those values up the tree to parent nodes, assuming that you will try to win (i.e. move into states that maximize your value) and your opponent will try to make you lose (i.e. move into states that minimize your value).</li> <ul> <li>If the edge from the parent node corresponds to your turn, then the parent node's minimax value is the maximum of the child values (since you want to maximize your value).</li> <li>Otherwise, if the edge from the parent node corresponds to your <i>opponent's</i> turn, then the parent node's minimax value is the <i>minimum</i> of the child values (since your opponent wants to <i>minimize</i> your value).</li> </ul> <li>Always choose the move that takes you to the next possible state with the highest minimax value. (You can break ties via random choice.)</li> </ol> <h2>Worked Example</h2> <p>To illustrate the minimax algorithm in action, let’s label part of a tic-tac-toe game tree with minimax values, from the perspective of player X (i.e. supposing we are player X). We always start by labeling the terminal states.</p> <center><img src="https://justinmath.com/files/blog/tic-tac-toe-minimax-values-1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Then, we propagate these values up to parent nodes. But we can only compute the minimax value of a node once we’ve assigned minimax values to all its children.</p> <p>Here, there are $3$ parents who do not have minimax values but whose children all do.
The edges between these parents and their children all correspond to moves by X, which is us, and we want to maximize the minimax value. So, to each of these parents, we assign the maximum child value. (In this case, each of these parents has only one child, so it’s trivial.)</p> <center><img src="https://justinmath.com/files/blog/tic-tac-toe-minimax-values-2.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Now we repeat the process. This time, there are $2$ parents who do not have minimax values but whose children all do. The edges between these parents and their children all correspond to moves by O, which is our opponent, and our opponent wants to minimize the minimax value. So, to each of these parents, we assign the minimum child value.</p> <center><img src="https://justinmath.com/files/blog/tic-tac-toe-minimax-values-3.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Again, we repeat the process. There is a single parent (the top node) who does not have a minimax value but whose children all do. The edges between this parent and its children all correspond to moves by X, which is us, and we want to maximize the minimax value. So we assign the maximum child value.</p> <center><img src="https://justinmath.com/files/blog/tic-tac-toe-minimax-values-4.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>We’ve assigned minimax values to all the nodes in this part of the tree. The minimax value of the highest node is $1,$ which tells us that there is a guaranteed way to win from that game state (all we need to do is place an X in the bottom-left corner). Indeed, this action is accomplished by choosing the child node with the maximum value.</p> <h2>Exercises</h2> <ol> <li>Implement a <b>minimax player</b> for your tic-tac-toe game that automatically chooses actions based on the minimax strategy. (It goes without saying: don't rebuild and relabel the game tree on every move.
That would be very inefficient and slow. Build and label it once at the beginning, and then use the same tree throughout the rest of the game.)</li> <li>Run your minimax player against a deterministic "top-left strategy" that always moves into the leftmost open space in the topmost row. At each of the minimax player's turns, print out the possible moves that the minimax player could make as well as the associated minimax values of the states. Check the following:</li> <ul> <li>Every minimax value should be either $1,$ $-1,$ or $0.$</li> <li>Each of the minimax player's chosen moves should be associated with a maximum-value state.</li> <li>Towards the end of the game, you should be able to inspect game states, manually sketch out the section of the game tree containing their progeny, and then manually compute and verify the minimax value of each state.</li> <li>Make sure these checks still hold when the minimax player goes second.</li> </ul> <li>Then, run your minimax player against a random player for many games (alternating who goes first). The minimax player should <i>never</i> lose. If you encounter any game where the minimax player loses, then you'll need to store the sequence of moves and step through the game to debug what went wrong.</li> <li>Play two minimax players against each other for many games, alternating who goes first. Every game should result in a tie.</li> <li>Play against the minimax player yourself. You shouldn't be able to win.</li> </ol>Canonical and Reduced Game Trees for Tic-Tac-Toe2022-03-01T00:00:00-08:002022-03-01T00:00:00-08:00https://justinmath.com/canonical-and-reduced-game-trees-for-tic-tac-toe<p>A <b>game tree</b> is a data structure that represents all the possible outcomes of a game.
It is a graph where the nodes correspond to the states of the game, and the edges correspond to actions that cause the game to transition from one state to another. Game trees are commonly used when coding up strategies for autonomous game-playing agents.</p> <h2>Exercise: Tic-Tac-Toe Tree</h2> <p>Create a class <code>TicTacToeTree</code> that constructs a game tree for tic-tac-toe. Each node in the game tree corresponds to a state of the game. The root node represents an empty board. It has 9 children, one for each move that the first player can make. Each of those 9 children has 8 children (after the first player has moved, there are 8 moves remaining for the second player). And so on.</p> <center><img src="https://justinmath.com/files/blog/tic-tac-toe-game-tree.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>There are $255 \, 168$ unique ways that a game of tic-tac-toe can play out, so you can check your tree by verifying that there are $255 \, 168$ <i>leaf nodes</i>.</p> <p>Here are some tips regarding the implementation:</p> <ul> <li>Each node should have a state attribute that holds the state of the tic-tac-toe game, a player attribute that says whose turn it is, and a winner attribute that says if someone has won.</li> <li>Instead of passing edges into the tree at initialization, you'll need to build up your tree algorithmically: start with a tree with a single node, and then repeatedly create child nodes until they reach a terminal state (i.e. a state where the game is finished).</li> <li>Ultimately this just comes down to a graph traversal (breadth-first or depth-first, doesn't matter which). Whenever a node's game state is not terminal, create a child node for each possible next state.</li> </ul> <h2>Exercise: Reduced Tic-Tac-Toe Tree</h2> <p>Once you’ve built your <code>TicTacToeTree</code> and verified that it has the correct number of leaf nodes, the next step is to make it more efficient.
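</p>

<p>The traversal described in the tips above can be sketched as follows. This is a minimal outline, not a full implementation: <code>next_states</code>, <code>winner_of</code>, and <code>other</code> are placeholders standing in for your own game logic.</p>

```python
from collections import deque

class Node:
    def __init__(self, state, player):
        self.state, self.player = state, player
        self.winner = None        # set when the state is terminal
        self.children = []

def build_tree(initial_state, next_states, winner_of, other):
    """Breadth-first construction. `next_states(state, player)` yields the
    states reachable in one move, `winner_of(state)` returns a winner or None,
    and `other(player)` swaps whose turn it is."""
    root = Node(initial_state, player="X")
    queue = deque([root])
    while queue:
        node = queue.popleft()
        node.winner = winner_of(node.state)
        if node.winner is not None:
            continue                       # terminal state: no children
        for state in next_states(node.state, node.player):
            child = Node(state, other(node.player))
            node.children.append(child)
            queue.append(child)
    return root
```

<p>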
Notice that there are many redundancies where separate nodes represent the same state:</p> <center><img src="https://justinmath.com/files/blog/tic-tac-toe-game-tree-before-reduction.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Although redundancies are included in the canonical conception of a game tree, we can greatly speed up the construction and reduce the size of our game tree if we use only one node per game state. To do this, you’ll need to make a slight tweak to your traversal so that whenever a node with the desired state already exists, you connect up that existing node as a child (instead of creating a new node).</p> <center><img src="https://justinmath.com/files/blog/tic-tac-toe-game-tree-after-reduction.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p><i>Do not loop over the tree every time to check if a node with the desired child state already exists.</i> That would be really inefficient! Instead, store nodes in a dictionary where the key represents the game state. That way, to check if a node with a particular game state already exists, you just need to look up that state in the dictionary.</p> <p>There are $5\,478$ distinct possible game states in the game of tic-tac-toe, so you can check your reduced tree by verifying that there are $5\,478$ nodes <i>in total</i>.</p>Justin Skycakjpskycak@gmail.comA game tree is a data structure that represents all the possible outcomes of a game. It is a graph where the nodes correspond to the states of the game, and the edges correspond to actions that cause the game to transition from one state to another. Game trees are commonly used when coding up strategies for autonomous game-playing agents.Backpropagation2022-02-06T00:00:00-08:002022-02-06T00:00:00-08:00https://justinmath.com/backpropagation<p>The most common method used to fit neural networks to data is gradient descent, just as we have done previously for simpler models.
The computations are significantly more involved for neural networks, but an algorithm called <b>backpropagation</b> provides a convenient framework for computing gradients.</p> <h2>Core Idea</h2> <p>The backpropagation algorithm leverages two key facts:</p> <ol> <li>If you know $\dfrac{\partial \textrm{RSS}}{\partial \sigma(\Sigma)}$ for the output $\sigma(\Sigma)$ of a neuron, then you can easily compute $\dfrac{\partial \textrm{RSS}}{\partial w}$ for any weight $w$ that the neuron receives from a neuron in the previous layer.</li> <li>If you know $\dfrac{\partial \textrm{RSS}}{\partial \sigma(\Sigma)}$ for all neurons in a layer, then you can piggy-back off the result to compute $\dfrac{\partial \textrm{RSS}}{\partial \sigma(\Sigma)}$ for all neurons in the previous layer.</li> </ol> <p>With these two facts in mind, the backpropagation algorithm consists of the following three steps:</p> <ol> <li><i>Forward propagate neuron activities.</i> Compute $\Sigma$ and $\sigma(\Sigma)$ for all neurons in the network, starting at the input layer and repeatedly piggy-backing off the results to compute $\Sigma$ and $\sigma(\Sigma)$ for all neurons in the next layer.</li> <li><i>Backpropagate neuron output gradients.</i> Compute $\dfrac{\partial \textrm{RSS}}{\partial \sigma(\Sigma)}$ for all neurons, starting with the output layer and then repeatedly piggy-backing off the results to compute $\dfrac{\partial \textrm{RSS}}{\partial \sigma(\Sigma)}$ in the previous layer.</li> <li><i>Expand neuron output gradients to weight gradients.</i> Compute $\dfrac{\partial \textrm{RSS}}{\partial w}$ for all weights in the neural network by piggy-backing off of $\dfrac{\partial \textrm{RSS}}{\partial \sigma(\Sigma)}$ for the neuron that receives the weight.</li> </ol> <h2>Forward Propagation of Neuron Activities</h2> <p>Let’s formalize these steps mathematically. 
First, we denote the following quantities:</p> <ul> <li>$\vec x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \end{bmatrix}$ are the inputs to the neural network, and $f(\vec x)$ is the output of the neural network.</li> <li>$\vec \Sigma_\ell = \begin{bmatrix} \Sigma_{\ell 1} \\ \Sigma_{\ell 2} \\ \vdots \end{bmatrix}$ are the inputs to the neurons in the $\ell$th layer, and $\vec h_\ell = \begin{bmatrix} h_{\ell 1} \\ h_{\ell 2} \\ \vdots \end{bmatrix}$ are the outputs of the neurons in the $\ell$th layer. If the activation function of these neurons is $\sigma,$ then $\vec h_\ell = \sigma \left( \vec \Sigma_\ell \right).$</li> <li>The input layer is the $0$th layer, there are $L$ hidden layers between the input and output layers, and the output layer is the $(L+1)$th layer. Note that this means $\vec h_0 = \vec x$ and $h_{L+1} = f(\vec x).$</li> <li>$A_\ell = \begin{bmatrix} a_{\ell 11} &amp; a_{\ell 12} &amp; \cdots \\ a_{\ell 21} &amp; a_{\ell 22} &amp; \cdots \\ \vdots &amp; \vdots &amp; \ddots \end{bmatrix}$ is the matrix of connection weights between the non-bias neurons in the $(\ell-1)$th layer and the neurons in the $\ell$th layer, and $\vec b_\ell = \begin{bmatrix} b_{\ell 1} \\ b_{\ell 2} \\ \vdots \end{bmatrix}$ are the connection weights between the bias neuron in the $(\ell-1)$th layer and the neurons in the $\ell$th layer. (In other words, $A_\ell$ and $\vec b_\ell$ are the weights used to compute $\vec \Sigma_\ell.$)</li> </ul> <p><br /></p> <p>The following diagram may aid in remembering what each symbol represents.</p> <center><img src="https://justinmath.com/files/blog/neural-network-regressor-labeled-symbols.png" style="border: none; height: 30em;" alt="icon" /></center> <p><br /></p> <p>Using the terminology introduced above, we can state the forward propagation step as follows:</p> <center> \begin{align*} \vec \Sigma_1 &amp;= A_1 \vec x + \vec b_1 \\[3pt] \vec h_1 &amp;= \sigma \left( \vec \Sigma_1 \right) \\[10pt] \vec \Sigma_2 &amp;= A_2 \vec h_1 + \vec b_2 \\[3pt] \vec h_2 &amp;= \sigma \left( \vec \Sigma_2 \right) \\[10pt] \vec \Sigma_3 &amp;= A_3 \vec h_2 +
\vec b_3 \\[3pt] \vec h_3 &amp;= \sigma \left( \vec \Sigma_3 \right) \\[3pt] &amp;\vdots \\[3pt] \vec \Sigma_L &amp;= A_{L} \vec h_{L-1} + \vec b_{L} \\[3pt] \vec h_L &amp;= \sigma \left( \vec \Sigma_L \right) \\[10pt] \Sigma_{L+1} &amp;= a_{(L+1)11} h_{L1} + a_{(L+1)12} h_{L2} + \cdots + b_{(L+1)1} \\[3pt] f \left( \vec x \right) &amp;= \sigma \left( \Sigma_{L+1} \right) \end{align*} </center> <p><br /> Note that the last two lines are written as scalars since the output layer contains only a single neuron, i.e. <i>the</i> output neuron.</p> <h2>Backpropagation of Neuron Output Gradients</h2> <p>Now, let’s formalize the backpropagation step for a point $(\vec x, y).$ First, we compute the gradient with respect to the output neuron. Remember that the output of the output neuron is $f(\vec x),$ which can also be denoted as $h_{(L+1)1}$ since the output layer is the $(L+1)$th layer.</p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)i}} &amp;= \dfrac{\partial}{\partial h_{(L+1)i}} \left[ \left( f \left( \vec x \right) - y \right)^2 \right] \\[3pt] &amp;= \dfrac{\partial}{\partial h_{(L+1)i}} \left[ \left( h_{(L+1)i} - y \right)^2 \right] \\[3pt] &amp;= 2 \left( h_{(L+1)i} - y \right) \end{align*} </center> <p><br /> Then, we backpropagate to the previous layer.</p> <center><img src="https://justinmath.com/files/blog/neural-network-regressor-labeled-symbols-backprop.png" style="border: none; height: 30em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial h_{Li}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)1}} \cdot \dfrac{\partial h_{(L+1)1}}{\partial h_{Li}} \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)1}} \cdot \dfrac{\partial}{\partial h_{Li}} \left[ \sigma \left( \Sigma_{(L+1)1} \right) \right] \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)1}} \sigma' \left( \Sigma_{(L+1)1} \right) \cdot \dfrac{\partial}{\partial h_{Li}} \left[ \Sigma_{(L+1)1} 
\right] \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)1}} \sigma' \left( \Sigma_{(L+1)1} \right) \cdot \dfrac{\partial}{\partial h_{Li}} \left[ a_{(L+1)11} h_{L1} + a_{(L+1)12} h_{L2} + \cdots + b_{(L+1)1} \right] \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)1}} \sigma' \left( \Sigma_{(L+1)1} \right) a_{(L+1)1i} \end{align*} </center> <p><br /> Note that the quantity $\dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)1}}$ was already computed, so we do not have to expand it out.</p> <p>We continue backpropagating using the same approach. Note that hidden layers contain multiple nodes (unlike the output layer), so we need a term for each node.</p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial h_{(L-1)i}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{L1}} \cdot \dfrac{\partial h_{L1}}{\partial h_{(L-1)i}} + \dfrac{\partial \textrm{RSS}}{\partial h_{L2}} \cdot \dfrac{\partial h_{L2}}{\partial h_{(L-1)i}} + \cdots \\[5pt] &amp;= \cdots \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{L1}} \sigma' \left( \Sigma_{L1} \right) a_{L1i} + \dfrac{\partial \textrm{RSS}}{\partial h_{L2}} \sigma' \left( \Sigma_{L2} \right) a_{L2i} + \cdots \end{align*} </center> <p><br /> Again, note that the quantities $\dfrac{\partial \textrm{RSS}}{\partial h_{L1}},$ $\dfrac{\partial \textrm{RSS}}{\partial h_{L2}},$ $\ldots$ were already computed, so we do not have to expand them out.</p> <p>Also note that we can consolidate into vector form:</p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial \vec h_{L-1}} &amp;= \begin{bmatrix} \dfrac{\partial \textrm{RSS}}{\partial h_{(L-1)1}} \\[5pt] \dfrac{\partial \textrm{RSS}}{\partial h_{(L-1)2}} \\ \vdots \end{bmatrix} \\[5pt] &amp;= \begin{bmatrix} \dfrac{\partial \textrm{RSS}}{\partial h_{L1}} \sigma' \left( \Sigma_{L1} \right) a_{L11} + \dfrac{\partial \textrm{RSS}}{\partial h_{L2}} \sigma' \left( \Sigma_{L2} \right) a_{L21} + \cdots \\[5pt] \dfrac{\partial 
\textrm{RSS}}{\partial h_{L1}} \sigma' \left( \Sigma_{L1} \right) a_{L12} + \dfrac{\partial \textrm{RSS}}{\partial h_{L2}} \sigma' \left( \Sigma_{L2} \right) a_{L22} + \cdots \\ \vdots \end{bmatrix} \\[5pt] &amp;= \begin{bmatrix} a_{L11} &amp; a_{L21} &amp; \cdots \\ a_{L12} &amp; a_{L22} &amp; \cdots \\ \vdots &amp; \vdots &amp; \ddots \end{bmatrix} \begin{bmatrix} \dfrac{\partial \textrm{RSS}}{\partial h_{L1}} \sigma' \left( \Sigma_{L1} \right) \\[5pt] \dfrac{\partial \textrm{RSS}}{\partial h_{L2}} \sigma' \left( \Sigma_{L2} \right) \\ \vdots \end{bmatrix} \\[5pt] &amp;= A_{L}^T \left( \dfrac{\partial \textrm{RSS}}{\partial \vec h_{L}} \circ \sigma' \left( \vec \Sigma_{L} \right) \right), \end{align*} </center> <p><br /> where $\circ$ denotes the element-wise product.</p> <p>We keep backpropagating using the same approach until we reach the input layer. At that point, we will have computed $\dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}}$ for every neuron in the network.</p> <h2>Expansion of Neuron Output Gradients to Weight Gradients</h2> <p>Finally, we expand the neuron output gradients into weight gradients, i.e. 
coefficient gradients $\dfrac{\partial \textrm{RSS}}{\partial a_{\ell i j}}$ and bias gradients $\dfrac{\partial \textrm{RSS}}{\partial b_{\ell i}}.$</p> <center><img src="https://justinmath.com/files/blog/neural-network-regressor-labeled-symbols-expansion.png" style="border: none; height: 12em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial a_{\ell i j}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}} \cdot \dfrac{\partial h_{\ell i}}{\partial a_{\ell i j}} \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}} \cdot \dfrac{\partial}{\partial a_{\ell i j}} \left[ \sigma \left( \Sigma_{\ell i} \right) \right] \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}} \sigma' \left( \Sigma_{\ell i} \right) \cdot \dfrac{\partial}{\partial a_{\ell i j}} \left[ \Sigma_{\ell i} \right] \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}} \sigma' \left( \Sigma_{\ell i} \right) \cdot \dfrac{\partial}{\partial a_{\ell i j}} \left[ a_{\ell i 1} h_{(\ell-1) 1} + a_{\ell i 2} h_{(\ell-1) 2} + \cdots + b_{\ell i} \right] \\[5pt] &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}} \sigma' \left( \Sigma_{\ell i} \right) h_{(\ell-1) j} \end{align*} </center> <p><br /> By the same computation, we get</p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial b_{\ell i}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}} \sigma' \left( \Sigma_{\ell i} \right).
\end{align*} </center> <p><br /> Notice that the expression for $\dfrac{\partial \textrm{RSS}}{\partial b_{\ell i}}$ appears in the expression for $\dfrac{\partial \textrm{RSS}}{\partial a_{\ell i j}},$ so we can simplify:</p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial b_{\ell i}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial h_{\ell i}} \sigma' \left( \Sigma_{\ell i} \right) \\[5pt] \dfrac{\partial \textrm{RSS}}{\partial a_{\ell i j}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial b_{\ell i}} h_{(\ell-1) j} \end{align*} </center> <p><br /> Again, we can consolidate into vector form:</p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial \vec h_{\ell}} \circ \sigma' \left( \vec \Sigma_{\ell} \right) \\[5pt] \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} &amp;= \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} \otimes \vec h_{\ell-1}, \end{align*} </center> <p><br /> where $\otimes$ is the outer product.</p> <h2>Gradient Descent Update</h2> <p>Once we know all the weight gradients, we can update the weights using the usual gradient descent update:</p> <center> \begin{align*} A_{\ell} &amp;\to A_{\ell} - \alpha \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} \quad \textrm{where} \quad \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} = \sum_{(\vec x, y)} \left. \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} \right|_{(\vec x, y)} \\[5pt] \vec b_{\ell} &amp;\to \vec b_{\ell} - \alpha \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} \quad \textrm{where} \quad \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} = \sum_{(\vec x, y)} \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} \right|_{(\vec x, y)} \end{align*} </center> <p><br /></p> <h2>Pseudocode</h2> <p>The following pseudocode summarizes the backpropagation algorithm that was derived above.</p> <center> \begin{align*} &amp;\textrm{1. 
Reset all gradient placeholders} \\[5pt] &amp; \forall \ell \in \lbrace 1, 2, \ldots, L+1 \rbrace: \\[5pt] &amp; \qquad \dfrac{\partial \textrm{RSS}}{\partial A_\ell} = \begin{bmatrix} 0 &amp; 0 &amp; \cdots \\ 0 &amp; 0 &amp; \cdots \\ \vdots &amp; \vdots &amp; \ddots \end{bmatrix} \\[5pt] &amp; \qquad \dfrac{\partial \textrm{RSS}}{\partial \vec b_\ell} = \vec 0 \\ \\ &amp; \textrm{2. Loop over all data points} \\[5pt] &amp; \forall (\vec x, y): \\ \\ &amp; \qquad \textrm{2.1 Forward propagate neuron activities} \\[5pt] &amp; \qquad \vec \Sigma_0 = \vec x \\[5pt] &amp; \qquad \vec h_0 = \vec x \\[5pt] &amp; \qquad \forall \ell \in \lbrace 0, 1, \ldots, L \rbrace: \\[5pt] &amp; \qquad \qquad \vec \Sigma_{\ell + 1} = A_{\ell+1} \vec h_\ell + \vec b_{\ell+1} \\[5pt] &amp; \qquad \qquad \vec h_{\ell+1} = \sigma \left( \vec \Sigma_{\ell+1} \right) \\ \\ &amp; \qquad \textrm{2.2 Backpropagate neuron output gradients} \\[5pt] &amp; \qquad \dfrac{\partial \textrm{RSS}}{\partial h_{(L+1)1}} = 2 \left( h_{(L+1)1} - y \right) \\[5pt] &amp; \qquad \forall \ell \in \lbrace L, L-1, \ldots, 1 \rbrace: \\[5pt] &amp; \qquad \qquad \dfrac{\partial \textrm{RSS}}{\partial \vec h_{\ell}} = A_{\ell+1}^T \left( \dfrac{\partial \textrm{RSS}}{\partial \vec h_{\ell+1}} \circ \sigma' \left( \vec \Sigma_{\ell+1} \right) \right) \\ \\ &amp; \qquad \textrm{2.3 Expand to weight gradients} \\[5pt] &amp; \qquad \forall \ell \in \lbrace L+1, L, \ldots, 1 \rbrace: \\[5pt] &amp; \qquad \qquad \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} \right|_{(\vec x, y)} = \dfrac{\partial \textrm{RSS}}{\partial \vec h_{\ell}} \circ \sigma' \left( \vec \Sigma_{\ell} \right) \\[5pt] &amp; \qquad \qquad \left. \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} \right|_{(\vec x, y)} = \left.
\dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} \right|_{(\vec x, y)} \otimes \vec h_{\ell-1} \\[5pt] &amp; \qquad \qquad \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} \to \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} + \left. \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}} \right|_{(\vec x, y)} \\[5pt] &amp; \qquad \qquad \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} \to \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} + \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}} \right|_{(\vec x, y)} \\ \\ &amp; \textrm{3. Update weights via gradient descent} \\[5pt] &amp; \forall \ell \in \lbrace 1, 2, \ldots, L+1 \rbrace: \\[5pt] &amp; \qquad A_\ell \to A_\ell - \alpha \dfrac{\partial \textrm{RSS}}{\partial A_\ell} \\[5pt] &amp; \qquad \vec b_\ell \to \vec b_\ell - \alpha \dfrac{\partial \textrm{RSS}}{\partial \vec b_\ell} \end{align*} </center> <p><br /></p> <p>You might notice that steps 2.2 and 2.3 above can be combined more efficiently into a single step since $\dfrac{\partial \textrm{RSS}}{\partial \vec h_{\ell}} = A_{\ell+1}^T \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell+1}}.$ However, we will keep these steps separate for the sake of intuitive clarity. You are welcome to combine these steps in your own implementation.</p> <h2>Worked Example of a Single Iteration</h2> <p>Now, let’s walk through a concrete example of fitting a neural network to a data set using the backpropagation algorithm.
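Before doing so, here is how the pseudocode above might be sketched in Python with NumPy. This is a minimal sketch rather than a reference implementation: it assumes a sigmoid activation function, and it stores the weights in zero-based lists so that `A[l]` and `b[l]` hold the weights coming into layer $l+1.$ The data set and initial weights in the driver code at the bottom are the ones used in the rest of this section.

```python
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1 - s)

def backprop_update(A, b, data, alpha):
    """Perform one gradient descent step on the weight matrices A and the
    bias vectors b, following steps 1-3 of the pseudocode."""
    # 1. reset all gradient placeholders
    grad_A = [np.zeros_like(M) for M in A]
    grad_b = [np.zeros_like(v) for v in b]
    # 2. loop over all data points
    for x, y in data:
        # 2.1 forward propagate neuron activities
        h = [np.asarray(x, dtype=float)]
        Sigma = [h[0]]
        for M, v in zip(A, b):
            Sigma.append(M @ h[-1] + v)
            h.append(sigma(Sigma[-1]))
        # 2.2 backpropagate neuron output gradients dRSS/dh
        dh = [None] * len(h)
        dh[-1] = 2 * (h[-1] - y)
        for l in range(len(A) - 1, 0, -1):
            dh[l] = A[l].T @ (dh[l + 1] * sigma_prime(Sigma[l + 1]))
        # 2.3 expand to weight gradients and accumulate over data points
        for l in range(len(A)):
            db = dh[l + 1] * sigma_prime(Sigma[l + 1])
            grad_b[l] += db
            grad_A[l] += np.outer(db, h[l])
    # 3. update weights via gradient descent
    for l in range(len(A)):
        A[l] -= alpha * grad_A[l]
        b[l] -= alpha * grad_b[l]

# the data set and initial weights used in the worked example
A = [np.array([[5.0], [-5.0], [5.0], [-5.0]]),
     np.array([[10.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 10.0]]),
     np.array([[10.0, 10.0]])]
b = [np.array([-0.75, 1.75, -3.25, 4.25]),
     np.array([-12.5, -12.5]),
     np.array([-2.5])]
data = [([0.0], 0.0), ([0.25], 1.0), ([0.5], 0.5), ([0.75], 1.0), ([1.0], 0.0)]
backprop_update(A, b, data, alpha=0.01)
print(np.round(b[2], 3))  # about -2.505, matching the hand computation
```

Running the sketch for one iteration reproduces the updated weights computed by hand below.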
We will use the same data set and neural network architecture as the <a class="body" target="_blank" href="https://justinmath.com/introduction-to-neural-network-regressors">previous section</a>:</p> <center> \begin{align*} \left[ (0,0), (0.25,1), (0.5,0.5), (0.75,1), (1,0) \right] \end{align*} </center> <p><br /></p> <center><img src="https://justinmath.com/files/blog/neural-network-regressor-backpropagation-example-architecture.png" style="border: none; height: 25em;" alt="icon" /></center> <p><br /></p> <p>Because neural networks are hierarchical and high-dimensional (i.e. they have many parameters that are tightly coupled), they are vastly more difficult to train as compared to simpler non-hierarchical low-dimensional models like linear, logistic, and polynomial regressions. Various tricks are often required to prevent the neural network from getting “stuck” in suboptimal local minima, which we will not cover here.</p> <p>To provide a simple example that illustrates the training of a neural network to a high degree of accuracy while avoiding the need for more advanced tricks, we will intentionally choose the initial weights of our network to be similar to the weights that we arrived at when manually constructing a neural network in the previous section. 
(More specifically, they will be those weights scaled by a factor of $0.5.$) This will place us near a deep valley on the surface of RSS as a function of parameters of the neural network, and the proximity will allow elementary gradient descent to lead us down into the valley.</p> <p>So, we will use the following initial weights:</p> <center> \begin{align*} &amp;A_1 = \begin{bmatrix} 5 \\ -5 \\ 5 \\ -5 \end{bmatrix} &amp;&amp;\vec b_1 = \begin{bmatrix} -0.75 \\ 1.75 \\ -3.25 \\ 4.25 \end{bmatrix} \\ \\ &amp;A_2 = \begin{bmatrix} 10 &amp; 10 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 10 &amp; 10 \end{bmatrix} &amp;&amp;\vec b_2 = \begin{bmatrix} -12.5 \\ -12.5 \end{bmatrix} \\ \\ &amp;A_3 = \begin{bmatrix} 10 &amp; 10 \end{bmatrix} &amp;&amp; \vec b_3 = \begin{bmatrix} -2.5 \end{bmatrix} \end{align*} </center> <p><br /></p> <p>Let’s work out the first iteration of backpropagation by hand, using learning rate $\alpha = 0.01.$ Note that the values shown are rounded to $6$ decimal places, but intermediate values are not actually rounded in the implementation.</p> <p><b>Point: $(\vec x, y) = ([0],0)$</b></p> <p><i>Forward propagation</i></p> <center> \begin{align*} \vec \Sigma_0 &amp;= \vec x = \begin{bmatrix} 0 \end{bmatrix} \\[3pt] \vec h_0 &amp;= \vec x = \begin{bmatrix} 0 \end{bmatrix} \\[10pt] \vec \Sigma_1 &amp;= A_1 \vec h_0 + \vec b_1 = \begin{bmatrix} 5 \\ -5 \\ 5 \\ -5 \end{bmatrix} \begin{bmatrix} 0 \end{bmatrix}+ \begin{bmatrix} -0.75 \\ 1.75 \\ -3.25 \\ 4.25 \end{bmatrix} = \begin{bmatrix} -0.75 \\ 1.75 \\ -3.25 \\ 4.25 \end{bmatrix} \\[3pt] \vec h_1 &amp;= \sigma \left( \vec \Sigma_1 \right) = \sigma \left( \begin{bmatrix} -0.75 \\ 1.75 \\ -3.25 \\ 4.25 \end{bmatrix} \right) = \begin{bmatrix} 0.320821 \\ 0.851953 \\ 0.037327 \\ 0.985936 \end{bmatrix} \\[10pt] \vec \Sigma_2 &amp;= A_2 \vec h_1 + \vec b_2 = \begin{bmatrix} 10 &amp; 10 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 10 &amp; 10 \end{bmatrix} \begin{bmatrix} 0.320821 \\ 0.851953 \\ 0.037327 \\ 0.985936 \end{bmatrix} +
\begin{bmatrix} -12.5 \\ -12.5 \end{bmatrix} = \begin{bmatrix} -0.772259 \\ -2.267367 \end{bmatrix} \\[3pt] \vec h_2 &amp;= \sigma \left( \vec \Sigma_2 \right) = \sigma \left( \begin{bmatrix} -0.772259 \\ -2.267367 \end{bmatrix} \right) = \begin{bmatrix} 0.315991 \\ 0.093862 \end{bmatrix} \\[10pt] \vec \Sigma_3 &amp;= A_3 \vec h_2 + \vec b_3 = \begin{bmatrix} 10 &amp; 10 \end{bmatrix} \begin{bmatrix} 0.315991 \\ 0.093862 \end{bmatrix} + \begin{bmatrix} -2.5 \end{bmatrix} = \begin{bmatrix} 1.598525 \end{bmatrix}\\[3pt] \vec h_3 &amp;= \sigma \left( \vec \Sigma_3 \right) = \sigma \left( \begin{bmatrix} 1.598525 \end{bmatrix} \right) = \begin{bmatrix} 0.831812 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>Backpropagation</i></p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial h_{31}} &amp;= 2 \left( h_{31} - y \right) = 2 \left( 0.831812 - 0 \right) = 1.663624 \\[5pt] \dfrac{\partial \textrm{RSS}}{\partial \vec h_{2}} &amp;= A_{3}^T \left( \dfrac{\partial \textrm{RSS}}{\partial \vec h_{3}} \circ \sigma' \left( \vec \Sigma_{3} \right) \right) = \begin{bmatrix} 10 \\ 10 \end{bmatrix} \left( \begin{bmatrix} 1.663624 \end{bmatrix} \circ \sigma' \left( \begin{bmatrix} 1.598525 \end{bmatrix} \right) \right) = \begin{bmatrix} 2.327422 \\ 2.327422 \end{bmatrix} \\[5pt] \dfrac{\partial \textrm{RSS}}{\partial \vec h_{1}} &amp;= A_{2}^T \left( \dfrac{\partial \textrm{RSS}}{\partial \vec h_{2}} \circ \sigma' \left( \vec \Sigma_{2} \right) \right) = \begin{bmatrix} 10 &amp; 0 \\ 10 &amp; 0 \\ 0 &amp; 10 \\ 0 &amp; 10 \end{bmatrix} \left( \begin{bmatrix} 2.327422 \\ 2.327422 \end{bmatrix} \circ \sigma' \left( \begin{bmatrix} -0.772259 \\ -2.267367 \end{bmatrix} \right) \right) = \begin{bmatrix} 5.030502 \\ 5.030502 \\ 1.979515 \\ 1.979515 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>Expansion</i></p> <center> \begin{align*} \left.
\dfrac{\partial \textrm{RSS}}{\partial \vec b_{3}} \right|_{([0], 0)} &amp;= \dfrac{\partial \textrm{RSS}}{\partial \vec h_{3}} \circ \sigma' \left( \vec \Sigma_{3} \right) = \begin{bmatrix} 1.663624 \end{bmatrix} \circ \sigma' \left( \begin{bmatrix} 1.598525 \end{bmatrix} \right) = \begin{bmatrix} 0.232742 \end{bmatrix} \\[3pt] \left. \dfrac{\partial \textrm{RSS}}{\partial A_{3}} \right|_{([0], 0)} &amp;= \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{3}} \right|_{([0], 0)} \otimes \vec h_{2} = \begin{bmatrix} 0.232742 \end{bmatrix} \otimes \begin{bmatrix} 0.315991 \\ 0.093862 \end{bmatrix} = \begin{bmatrix} 0.073544 &amp; 0.021846 \end{bmatrix} \\[10pt] \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{2}} \right|_{([0], 0)} &amp;= \dfrac{\partial \textrm{RSS}}{\partial \vec h_{2}} \circ \sigma' \left( \vec \Sigma_{2} \right) = \begin{bmatrix} 2.327422 \\ 2.327422 \end{bmatrix} \circ \sigma' \left( \begin{bmatrix} -0.772259 \\ -2.267367 \end{bmatrix} \right) = \begin{bmatrix} 0.503050 \\ 0.197951 \end{bmatrix} \\[3pt] \left. \dfrac{\partial \textrm{RSS}}{\partial A_{2}} \right|_{([0], 0)} &amp;= \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{2}} \right|_{([0], 0)} \otimes \vec h_{1} = \begin{bmatrix} 0.503050 \\ 0.197951 \end{bmatrix} \otimes \begin{bmatrix} 0.320821 \\ 0.851953 \\ 0.037327 \\ 0.985936 \end{bmatrix} = \begin{bmatrix} 0.161389 &amp; 0.428575 &amp; 0.018777 &amp; 0.495976 \\ 0.063507 &amp; 0.168645 &amp; 0.007389 &amp; 0.195168 \end{bmatrix} \\[10pt] \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{1}} \right|_{([0], 0)} &amp;= \dfrac{\partial \textrm{RSS}}{\partial \vec h_{1}} \circ \sigma' \left( \vec \Sigma_{1} \right) = \begin{bmatrix} 5.030502 \\ 5.030502 \\ 1.979515 \\ 1.979515 \end{bmatrix} \circ \sigma' \left( \begin{bmatrix} -0.75 \\ 1.75 \\ -3.25 \\ 4.25 \end{bmatrix} \right) = \begin{bmatrix} 1.096121 \\ 0.634493 \\ 0.071131 \\ 0.027448 \end{bmatrix} \\[3pt] \left.
\dfrac{\partial \textrm{RSS}}{\partial A_{1}} \right|_{([0], 0)} &amp;= \left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{1}} \right|_{([0], 0)} \otimes \vec h_{0} = \begin{bmatrix} 1.096121 \\ 0.634493 \\ 0.071131 \\ 0.027448 \end{bmatrix} \otimes \begin{bmatrix} 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><b>Point: $(\vec x, y) = ([0.25],1)$</b></p> <center> \begin{align*} &amp;\vec \Sigma_0 = \begin{bmatrix} 0.25 \end{bmatrix} &amp;&amp;\vec h_0 = \begin{bmatrix} 0.25 \end{bmatrix} \\ \\ &amp;\vec \Sigma_1 = \begin{bmatrix} 0.5 \\ 0.5 \\ -2 \\ 3 \end{bmatrix} &amp;&amp;\vec h_1 = \begin{bmatrix} 0.622459 \\ 0.622459 \\ 0.119203 \\ 0.952574 \end{bmatrix} \\ \\ &amp;\vec \Sigma_2 = \begin{bmatrix} -0.050813 \\ -1.782230 \end{bmatrix} &amp;&amp;\vec h_2 = \begin{bmatrix} 0.487299 \\ 0.144028 \end{bmatrix} \\ \\ &amp;\vec \Sigma_3 = \begin{bmatrix} 3.813274 \end{bmatrix} &amp;&amp;\vec h_3 = \begin{bmatrix} 0.978401 \end{bmatrix} \end{align*} </center> <p><br /></p> <center> \begin{align*} \dfrac{\partial \textrm{RSS}}{\partial \vec h_{3}} = \begin{bmatrix} -0.043198 \end{bmatrix}, \quad \dfrac{\partial \textrm{RSS}}{\partial \vec h_{2}} = \begin{bmatrix} -0.009129 \\ -0.009129 \end{bmatrix}, \quad \dfrac{\partial \textrm{RSS}}{\partial \vec h_{1}} = \begin{bmatrix} -0.022807 \\ -0.022807 \\ -0.011254 \\ -0.011254 \end{bmatrix} \end{align*} </center> <p><br /></p> <center> \begin{align*} &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{3}} \right|_{([0.25],1)} = \begin{bmatrix} -0.000913 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{3}} \right|_{([0.25],1)} = \begin{bmatrix} -0.000445 &amp; -0.000131 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{2}} \right|_{([0.25],1)} = \begin{bmatrix} -0.002281 \\ -0.001125\end{bmatrix} &amp;&amp;\left.
\dfrac{\partial \textrm{RSS}}{\partial A_{2}} \right|_{([0.25],1)} = \begin{bmatrix} -0.001420 &amp; -0.001420 &amp; -0.000272 &amp; -0.002173 \\ -0.000701 &amp; -0.000701 &amp; -0.000134 &amp; -0.001072 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{1}} \right|_{([0.25],1)} = \begin{bmatrix} -0.005360 \\ -0.005360 \\ -0.001182 \\ -0.000508 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{1}} \right|_{([0.25],1)} = \begin{bmatrix} -0.001340 \\ -0.001340 \\ -0.000295 \\ -0.000127 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><b>Point: $(\vec x, y) = ([0.5],0.5)$</b></p> <center> \begin{align*} &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{3}} \right|_{([0.5],0.5)} = \begin{bmatrix} 0.020099 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{3}} \right|_{([0.5],0.5)} = \begin{bmatrix} 0.006351 &amp; 0.006351 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{2}} \right|_{([0.5],0.5)} = \begin{bmatrix} 0.043443 \\ 0.043443 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{2}} \right|_{([0.5],0.5)} = \begin{bmatrix} 0.037011 &amp; 0.013937 &amp; 0.013937 &amp; 0.037011 \\ 0.037011 &amp; 0.013937 &amp; 0.013937 &amp; 0.037011 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{1}} \right|_{([0.5],0.5)} = \begin{bmatrix} 0.054794 \\ 0.094659 \\ 0.094659 \\ 0.054794 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{1}} \right|_{([0.5],0.5)} = \begin{bmatrix} 0.027397 \\ 0.047330 \\ 0.047330 \\ 0.027397 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><b>Point: $(\vec x, y) = ([0.75],1)$</b></p> <center> \begin{align*} &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{3}} \right|_{([0.75],1)} = \begin{bmatrix} -0.000913 \end{bmatrix} &amp;&amp;\left. 
\dfrac{\partial \textrm{RSS}}{\partial A_{3}} \right|_{([0.75],1)} = \begin{bmatrix} -0.000131 &amp; -0.000445 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{2}} \right|_{([0.75],1)} = \begin{bmatrix} -0.001125 \\ -0.002281 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{2}} \right|_{([0.75],1)} = \begin{bmatrix} -0.001072 &amp; -0.000134 &amp; -0.000701 &amp; -0.000701 \\ -0.002173 &amp; -0.000272 &amp; -0.001420 &amp; -0.001420 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{1}} \right|_{([0.75],1)} = \begin{bmatrix} -0.000508 \\ -0.001182 \\ -0.005360 \\ -0.005360 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{1}} \right|_{([0.75],1)} = \begin{bmatrix} -0.000381 \\ -0.000886 \\ -0.004020 \\ -0.004020 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><b>Point: $(\vec x, y) = ([1],0)$</b></p> <center> \begin{align*} &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{3}} \right|_{([1],0)} = \begin{bmatrix} 0.232742 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{3}} \right|_{([1],0)} = \begin{bmatrix} 0.021846 &amp; 0.073544 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{2}} \right|_{([1],0)} = \begin{bmatrix} 0.197951 \\ 0.503050 \end{bmatrix} &amp;&amp;\left. \dfrac{\partial \textrm{RSS}}{\partial A_{2}} \right|_{([1],0)} = \begin{bmatrix} 0.195168 &amp; 0.007389 &amp; 0.168645 &amp; 0.063507 \\ 0.495976 &amp; 0.018777 &amp; 0.428575 &amp; 0.161389 \end{bmatrix} \\ \\ &amp;\left. \dfrac{\partial \textrm{RSS}}{\partial \vec b_{1}} \right|_{([1],0)} = \begin{bmatrix} 0.027448 \\ 0.071131 \\ 0.634493 \\ 1.096121 \end{bmatrix} &amp;&amp;\left.
\dfrac{\partial \textrm{RSS}}{\partial A_{1}} \right|_{([1],0)} = \begin{bmatrix} 0.027448 \\ 0.071131 \\ 0.634493 \\ 1.096121 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><b>Weight Updates</b></p> <p>Summing up all the gradients we computed, we get the following:</p> <center> \begin{align*} &amp;\dfrac{\partial \textrm{RSS}}{\partial \vec b_{3}} = \begin{bmatrix} 0.483758 \end{bmatrix} &amp;&amp;\dfrac{\partial \textrm{RSS}}{\partial A_{3}} = \begin{bmatrix} 0.101165 &amp; 0.101165 \end{bmatrix} \\ \\ &amp;\dfrac{\partial \textrm{RSS}}{\partial \vec b_{2}} = \begin{bmatrix} 0.741038 \\ 0.741038 \end{bmatrix} &amp;&amp;\dfrac{\partial \textrm{RSS}}{\partial A_{2}} = \begin{bmatrix} 0.391076 &amp; 0.448347 &amp; 0.200388 &amp; 0.593621 \\ 0.593621 &amp; 0.200388 &amp; 0.448347 &amp; 0.391076 \end{bmatrix} \\ \\ &amp;\dfrac{\partial \textrm{RSS}}{\partial \vec b_{1}} = \begin{bmatrix} 1.172495 \\ 0.793742 \\ 0.793742 \\ 1.172495 \end{bmatrix} &amp;&amp;\dfrac{\partial \textrm{RSS}}{\partial A_{1}} = \begin{bmatrix} 0.053123 \\ 0.116235 \\ 0.677508 \\ 1.119371 \end{bmatrix} \end{align*} </center> <p><br /></p> <p>Finally, applying the gradient descent updates $A_\ell \to A_\ell - \alpha \dfrac{\partial \textrm{RSS}}{\partial A_{\ell}}$ and $\vec b_\ell \to \vec b_\ell - \alpha \dfrac{\partial \textrm{RSS}}{\partial \vec b_{\ell}}$ with $\alpha = 0.01,$ we get the following updated weights:</p> <center> \begin{align*} &amp;A_1 = \begin{bmatrix} 4.999469 \\ -5.001162 \\ 4.993225 \\ -5.011194 \end{bmatrix} &amp;&amp;\vec b_1 = \begin{bmatrix} -0.761725 \\ 1.742063 \\ -3.257937 \\ 4.238275 \end{bmatrix} \\ \\ &amp;A_2 = \begin{bmatrix} 9.996089 &amp; 9.995517 &amp; -0.002004 &amp; -0.005936 \\ -0.005936 &amp; -0.002004 &amp; 9.995517 &amp; 9.996089 \end{bmatrix} &amp;&amp;\vec b_2 = \begin{bmatrix} -12.507410 \\ -12.507410 \end{bmatrix} \\ \\ &amp;A_3 = \begin{bmatrix} 9.998988 &amp; 9.998988 \end{bmatrix} &amp;&amp; \vec b_3 = \begin{bmatrix} -2.504838 \end{bmatrix}
\end{align*} </center> <p><br /></p> <h2>Demonstration of Many Iterations</h2> <p>Repeating this procedure over and over, we get the following results. Note that the values shown are rounded to $3$ decimal places, but intermediate values are not actually rounded in the implementation.</p> <p><i>Initial</i></p> <ul> <li>$\textrm{RSS} \approx 1.614$</li> <li> $\textrm{Predictions} \approx 0.832, 0.978, 0.979, 0.978, 0.832$<br /> $(\textrm{Compare to} \phantom{\approx} 0,\phantom{.000} 1,\phantom{.000} 0.5,\phantom{00} 1,\phantom{.000} 0)\phantom{.000}$ </li> </ul> <center> \begin{align*} &amp;A_1 = \begin{bmatrix} 5 \\ -5 \\ 5 \\ -5 \end{bmatrix} &amp;&amp;\vec b_1 = \begin{bmatrix} -0.75 \\ 1.75 \\ -3.25 \\ 4.25 \end{bmatrix} \\ \\ &amp;A_2 = \begin{bmatrix} 10 &amp; 10 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 10 &amp; 10 \end{bmatrix} &amp;&amp;\vec b_2 = \begin{bmatrix} -12.5 \\ -12.5 \end{bmatrix} \\ \\ &amp;A_3 = \begin{bmatrix} 10 &amp; 10 \end{bmatrix} &amp;&amp; \vec b_3 = \begin{bmatrix} -2.5 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $1$ iteration</i></p> <ul> <li>$\textrm{RSS} \approx 1.525$</li> <li>$\textrm{Predictions} \approx 0.812, 0.973, 0.973, 0.972, 0.801$</li> </ul> <center> \begin{align*} &amp;A_1 = \begin{bmatrix} 4.999 \\ -5.001 \\ 4.993 \\ -5.011 \end{bmatrix} &amp;&amp;\vec b_1 = \begin{bmatrix} -0.762 \\ 1.742 \\ -3.258 \\ 4.238 \end{bmatrix} \\ \\ &amp;A_2 = \begin{bmatrix} 9.996 &amp; 9.996 &amp; -0.002 &amp; -0.006 \\ -0.006 &amp; -0.002 &amp; 9.996 &amp; 9.996 \end{bmatrix} &amp;&amp;\vec b_2 = \begin{bmatrix} -12.507 \\ -12.507 \end{bmatrix} \\ \\ &amp;A_3 = \begin{bmatrix} 9.999 &amp; 9.999 \end{bmatrix} &amp;&amp; \vec b_3 = \begin{bmatrix} -2.505 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $2$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 1.426$</li> <li>$\textrm{Predictions} \approx 0.789, 0.967, 0.965, 0.962, 0.765$</li> </ul> <center> \begin{align*} &amp;A_1 \approx
\begin{bmatrix} 4.999 \\ -5.002 \\ 4.986 \\ -5.023 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.774 \\ 1.734 \\ -3.267 \\ 4.226 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix} 9.992 &amp; 9.991 &amp; -0.004 &amp; -0.012 \\ -0.012 &amp; -0.004 &amp; 9.991 &amp; 9.992 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -12.515 \\ -12.515 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 9.998 &amp; 9.998 \end{bmatrix} &amp;&amp; \vec b_3 \approx \begin{bmatrix} -2.510 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $3$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 1.320$</li> <li>$\textrm{Predictions} \approx 0.764, 0.959, 0.954, 0.949, 0.725$</li> </ul> <center> \begin{align*} &amp;A_1 \approx \begin{bmatrix} 4.998 \\ -5.004 \\ 4.978 \\ -5.035 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.787 \\ 1.725 \\ -3.276 \\ 4.213 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix} 9.987 &amp; 9.986 &amp; -0.006 &amp; -0.019 \\ -0.019 &amp; -0.006 &amp; 9.986 &amp; 9.988 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -12.524 \\ -12.524 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 9.997 &amp; 9.997 \end{bmatrix} &amp;&amp; \vec b_3 \approx \begin{bmatrix} -2.516 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $10$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 0.730$</li> <li>$\textrm{Predictions} \approx 0.571, 0.844, 0.816, 0.774, 0.478$</li> </ul> <center> \begin{align*} &amp;A_1 \approx \begin{bmatrix} 4.992 \\ -5.015 \\ 4.933 \\ -5.101 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.873 \\ 1.66 \\ -3.331 \\ 4.142 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix} 9.957 &amp; 9.953 &amp; -0.022 &amp; -0.064 \\ -0.058 &amp; -0.022 &amp; 9.958 &amp; 9.959 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -12.58 \\ -12.575 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 9.99 &amp; 9.991 \end{bmatrix} &amp;&amp; \vec b_3 \approx
\begin{bmatrix} -2.557 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $100$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 0.496$</li> <li>$\textrm{Predictions} \approx 0.362, 0.694, 0.730, 0.698, 0.356$</li> </ul> <center> \begin{align*} &amp;A_1 \approx \begin{bmatrix} 5.068 \\ -4.962 \\ 4.965 \\ -5.158 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.936 \\ 1.696 \\ -3.249 \\ 4.164 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix} 9.920 &amp; 9.862 &amp; -0.064 &amp; -0.15 \\ -0.116 &amp; -0.074 &amp; 9.894 &amp; 9.921 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -12.708 \\ -12.683 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 9.978 &amp; 9.981 \end{bmatrix} &amp;&amp; \vec b_3 \approx \begin{bmatrix} -2.735 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $1\,000$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 0.198$</li> <li>$\textrm{Predictions} \approx 0.199, 0.788, 0.668, 0.788, 0.202$</li> </ul> <center> \begin{align*} &amp;A_1 \approx \begin{bmatrix} 5.491 \\ -5.101 \\ 5.343 \\ -5.152 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.518 \\ 2.008 \\ -3.081 \\ 4.546 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix} 9.744 &amp; 9.666 &amp; -0.301 &amp; -0.425 \\ -0.442 &amp; -0.301 &amp; 9.668 &amp; 9.721 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -13.155 \\ -13.170 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 9.982 &amp; 9.98 \end{bmatrix} &amp;&amp; \vec b_3 \approx \begin{bmatrix} -3.596 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $10\,000$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 0.0239$</li> <li>$\textrm{Predictions} \approx 0.068, 0.922, 0.517, 0.915, 0.075$</li> </ul> <center> \begin{align*} &amp;A_1 \approx \begin{bmatrix} 6.96 \\ -6.033 \\ 6.632 \\ -5.806 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.279 \\ 2.766 \\ -3.307 \\ 5.478 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix}
9.88 &amp; 9.555 &amp; -0.835 &amp; -0.865 \\ -0.932 &amp; -0.806 &amp; 9.647 &amp; 9.781 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -13.763 \\ -13.804 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 10.515 &amp; 10.542 \end{bmatrix} &amp;&amp; \vec b_3 \approx \begin{bmatrix} -4.759 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $100\,000$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 0.0020$</li> <li>$\textrm{Predictions} \approx 0.020, 0.979, 0.501, 0.976, 0.023$</li> </ul> <center> \begin{align*} &amp;A_1 \approx \begin{bmatrix} 8.407 \\ -7.103 \\ 7.786 \\ -6.971 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.523 \\ 3.289 \\ -3.906 \\ 6.412 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix} 10.437 &amp; 9.708 &amp; -1.367 &amp; -1.063 \\ -1.179 &amp; -1.311 &amp; 9.922 &amp; 10.416 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -14.075 \\ -14.139 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 11.607 &amp; 11.800 \end{bmatrix} &amp;&amp; \vec b_3 \approx \begin{bmatrix} -5.433 \end{bmatrix} \end{align*} </center> <p><br /></p> <p><i>After $1\,000\,000$ iterations</i></p> <ul> <li>$\textrm{RSS} \approx 0.0002$</li> <li>$\textrm{Predictions} \approx 0.006, 0.993, 0.500, 0.993, 0.007$</li> </ul> <center> \begin{align*} &amp;A_1 \approx \begin{bmatrix} 9.527 \\ -7.895 \\ 8.602 \\ -7.956 \end{bmatrix} &amp;&amp;\vec b_1 \approx \begin{bmatrix} -0.738 \\ 3.651 \\ -4.384 \\ 7.219 \end{bmatrix} \\ \\ &amp;A_2 \approx \begin{bmatrix} 11.006 &amp; 9.855 &amp; -1.801 &amp; -1.214 \\ -1.405 &amp; -1.730 &amp; 10.129 &amp; 11.100 \end{bmatrix} &amp;&amp;\vec b_2 \approx \begin{bmatrix} -14.323 \\ -14.440 \end{bmatrix} \\ \\ &amp;A_3 \approx \begin{bmatrix} 12.902 &amp; 13.223 \end{bmatrix} &amp;&amp; \vec b_3 \approx \begin{bmatrix} -6.178 \end{bmatrix} \end{align*} </center> <p><br /></p> <p>Below is a graph of the regression curves before and after training (i.e.
using the initial weights and using the weights after $1\,000\,000$ iterations of backpropagation). The trained network does an even better job of passing through the leftmost and rightmost points than the network we constructed manually in the previous section!</p> <center><img src="https://justinmath.com/files/blog/neural-network-regressor-before-after-backpropagation.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <h2>Exercises</h2> <ol> <li>Implement the example that was worked out above.</li> <li>Re-run the example that was worked out above, this time using initial weights drawn randomly from the normal distribution. Note that your RSS should gradually decrease but it may get "stuck" in a suboptimal local minimum, resulting in a regression curve that is a decent but not perfect fit.</li> </ol>Justin Skycakjpskycak@gmail.comThe most common method used to fit neural networks to data is gradient descent, just like we have done previously for simpler models. The computations are significantly more involved for neural networks, but an algorithm called backpropagation provides a convenient framework for computing gradients.Introduction to Neural Network Regressors2022-02-05T00:00:00-08:002022-02-05T00:00:00-08:00https://justinmath.com/introduction-to-neural-network-regressors<p>It’s common to represent models via <i>computational graphs</i>.
For example, consider the following multiple logistic regression model:</p> <center> \begin{align*} f(x) &amp;= \dfrac{1}{1 + e^{-(a_1 x_1 + a_2 x_2 + b)}} \end{align*} </center> <p><br /> This model can be represented by the following computation graph, where</p> <ul> <li>$\Sigma = a_1 x_1 + a_2 x_2 + b$ is the sum of products of lower-node values and the edge weights, and</li> <li>$\sigma(\Sigma) = \dfrac{1}{1+e^{-\Sigma}}$ is the sigmoid function.</li> </ul> <center><img src="https://justinmath.com/files/blog/computational-graph-logistic-2-layer.png" style="border: none; height: 10em;" alt="icon" /></center> <p><br /></p> <h2>Hierarchy and Complexity</h2> <p>Loosely speaking, the deeper or more “hierarchical” a computational graph is, the more complex the model that it represents. For example, consider the computational graph below, which contains an extra “layer” of nodes.</p> <center><img src="https://justinmath.com/files/blog/computational-graph-logistic-3-layer.png" style="border: none; height: 18em;" alt="icon" /></center> <p><br /></p> <p>Whereas the first computational graph represented a simple model $f(x_1, x_2) = \sigma(a_1 x_1 + a_2 x_2 + b),$ this second computational graph represents a far more complex model:</p> <center> \begin{align*} f(x_1, x_2) &amp;= \sigma \left( \begin{matrix} \phantom{+} a_{211} \sigma \left( a_{111} x_1 + a_{112} x_2 + b_{11} \right) \\ + a_{212} \sigma \left( a_{121} x_1 + a_{122} x_2 + b_{12} \right) \\ + a_{213} \sigma \left( a_{131} x_1 + a_{132} x_2 + b_{13} \right) \\ + b_{21\phantom{0}} \phantom{\sigma \left( a_{131} x_1 + a_{132} x_2 + b_{13} \right)} \end{matrix} \right) \end{align*} </center> <p><br /> The subscripts in the coefficients may look a little crazy, but there is a consistent naming pattern:</p> <ul> <li>$a_{\ell i j}$ is the weight of the connection from the $j$th node in the $\ell$th layer to the $i$th node in the next layer.</li> <li>$b_{\ell i}$ is the weight of the connection from the bias node in 
the $\ell$th layer to the $i$th node in the next layer. (A <b>bias node</b> is a node whose output is always $1.$)</li> </ul> <h2>Neural Networks</h2> <p>A <b>neural network</b> is a type of computational graph that is loosely inspired by the human brain. Each neuron in the brain receives input electrical signals from other neurons that connect to it, and the amount of signal that a neuron sends outward to the neurons it connects to depends on the total amount of electrical signal it receives as input. Each connection has a different strength, meaning that neurons influence each other by different amounts. Additionally, neurons in key information-processing parts of the brain are sometimes arranged in layers.</p> <center><img src="https://justinmath.com/files/blog/neural-network-3-layer-labeled.png" style="border: none; height: 18em;" alt="icon" /></center> <p><br /></p> <p>Using neural network terminology, the computational graph above can be described as a neural network with $3$ layers:</p> <ol> <li>an <b>input layer</b> containing $2$ linearly-activated neurons and a bias neuron,</li> <li>a <b>hidden layer</b> containing $3$ sigmoidally-activated neurons and a bias neuron, and</li> <li>an <b>output layer</b> containing a single sigmoidally-activated neuron.</li> </ol> <p>To say that a neuron is <i>sigmoidally-activated</i> means that to get the neuron’s output we apply a sigmoidal <b>activation function</b> $\sigma$ to the neuron’s input. Remember that the input $\Sigma$ is the sum of products of lower-node values and the edge weights. By convention, a linear activation function is the identity function (i.e. the output is the same as the input).</p> <p>Neural networks are extremely powerful models.
In fact, the <i>universal approximation theorem</i> states that given a continuous function $f: [0,1]^n \to [0,1]$ and an acceptable error threshold $\epsilon &gt; 0,$ there exists a sigmoidally-activated neural network with one hidden layer and a finite number of neurons such that the error between $f$ and the neural network’s output is less than $\epsilon.$</p> <h2>Example: Manually Constructing a Neural Network</h2> <p>To demonstrate, let’s set up a neural network that models the following data set:</p> <center> \begin{align*} \left[ (0,0), (0.25,1), (0.5,0.5), (0.75,1), (1,0) \right] \end{align*} </center> <p><br /> First, we’ll draw a curve that approximates the data set. Then, we’ll work backwards to combine sigmoid functions in a way that resembles the curve that we drew.</p> <center><img src="https://justinmath.com/files/blog/intro-neural-network-regressors-reconstruction-goal.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Loosely speaking, it appears that our curve can be modeled as the sum of two humps.</p> <center><img src="https://justinmath.com/files/blog/intro-neural-network-regressors-reconstruction-goal-humps.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Notice that we can create a hump by adding two opposite-facing sigmoids (and shifting the result down to lie flat against the $x$-axis):</p> <center> \begin{align*} h(x) = \sigma(x+1) + \sigma(-x+1)-1 \end{align*} </center> <p><br /></p> <center><img src="https://justinmath.com/files/blog/intro-neural-network-regressors-sigmoid-hump.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Remember that our neural network repeatedly applies sigmoid functions to sums of sigmoid functions, so we’ll have to apply a sigmoid to the function above.
The following composition will accomplish this while shaping our hump to be the correct width:</p> <center> \begin{align*} H(x) = \sigma( 20h(10x) - 5) \end{align*} </center> <p><br /> Then, we can represent our final curve as the sum of two horizontally-shifted humps (again shifted downward to lie flat against the $x$ axis and then wrapped in another sigmoid function).</p> <center> \begin{align*} \sigma( 20 H(x-0.25) + 20H(x-0.75)-5 ) \end{align*} </center> <p><br /></p> <center><img src="https://justinmath.com/files/blog/intro-neural-network-regressors-reconstruction.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Now, let’s work backwards from our final curve expression to figure out the architecture of the corresponding neural network.</p> <p>Our output node represents the expression</p> <center> \begin{align*} \sigma( 20 H(x-0.25) + 20H(x-0.75)-5 ), \end{align*} </center> <p><br /> so the previous layer should have nodes whose outputs are $H(x-0.25),$ $H(x-0.75),$ and $1$ (the corresponding weights being $20,$ $20,$ and $-5$ respectively).</p> <center><img src="https://justinmath.com/files/blog/intro-neural-network-regressors-reconstruction-network-prev.png" style="border: none; height: 12em;" alt="icon" /></center> <p><br /></p> <p>Expanding further, we have</p> <center> \begin{align*} H(x-0.25) &amp;= \sigma(20h(10x-2.5)-5) \\[3pt] &amp;= \sigma(20(\sigma(10x-1.5)+\sigma(-10x+3.5)-1)-5) \\[3pt] &amp;= \sigma( 20 \sigma(10x-1.5)+20\sigma(-10x+3.5)-25 ) \\ \\ H(x-0.75) &amp;= \sigma(20h(10x-7.5)-5) \\[3pt] &amp;= \sigma(20(\sigma(10x-6.5)+\sigma(-10x+8.5)-1)-5) \\[3pt] &amp;= \sigma( 20 \sigma(10x-6.5)+20\sigma(-10x+8.5)-25 ), \end{align*} </center> <p><br /> so the second-previous layer should have nodes whose outputs are $\sigma(10x-1.5),$ $\sigma(10x-6.5),$ $\sigma(-10x+3.5),$ $\sigma(-10x+8.5),$ and $1.$ (In the diagram below, edges with weight $0$ are represented by soft dashed segments.)</p> <center><img 
src="https://justinmath.com/files/blog/intro-neural-network-regressors-reconstruction-network-prev-prev.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>We can now sketch our full neural network as follows:</p> <center><img src="https://justinmath.com/files/blog/intro-neural-network-regressors-reconstruction-network.png" style="border: none; height: 25em;" alt="icon" /></center> <p><br /></p> <h2>Hierarchical Representation</h2> <p>There is a clear hierarchical structure to the network. The first hidden layer transforms the linear input into sigmoidal functions. The second hidden layer combines those sigmoids to generate humps. The output layer combines humps into the desired output.</p> <center><img src="https://justinmath.com/files/blog/intro-neural-network-regressors-reconstruction-hierarchy.png" style="border: none; height: 25em;" alt="icon" /></center> <p><br /></p> <p>Hierarchical structure is ultimately the reason why neural networks can fit arbitrary functions to such high degrees of accuracy.
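As a quick numerical sanity check on the construction above, the composed functions can be evaluated directly. The following short sketch (Python; not part of the original post) evaluates $\sigma(20H(x-0.25) + 20H(x-0.75) - 5)$ at the five data inputs:

```python
import math

def sigma(z):
    # logistic sigmoid
    return 1 / (1 + math.exp(-z))

def h(x):
    # hump: two opposite-facing sigmoids, shifted down to lie on the x-axis
    return sigma(x + 1) + sigma(-x + 1) - 1

def H(x):
    # sharpened, width-adjusted hump
    return sigma(20 * h(10 * x) - 5)

def f(x):
    # final curve: two shifted humps, summed and wrapped in a sigmoid
    return sigma(20 * H(x - 0.25) + 20 * H(x - 0.75) - 5)

for x in (0, 0.25, 0.5, 0.75, 1):
    print(x, round(f(x), 3))
```

The outputs come out close to $1$ at $x = 0.25$ and $x = 0.75,$ close to $0.5$ at $x = 0.5,$ and small (roughly $0.09$) at the endpoints, matching the plotted curve.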
Loosely speaking, each neuron in the network recognizes a different feature in the data, and deeper layers in the network synthesize elementary features into more complex features.</p> <h2>Exercises</h2> <ol> <li>Tweak the neural network constructed in the discussion above so that the output resembles the following curve: <br /><br /><center><img src="https://justinmath.com/files/blog/neural-network-regressor-reconstruction-challenge-1.png" style="border: none; height: 20em;" alt="icon" /></center><br /> </li> <li>Tweak the neural network constructed in the discussion above so that the output resembles the following curve: <br /><br /><center><img src="https://justinmath.com/files/blog/neural-network-regressor-reconstruction-challenge-2.png" style="border: none; height: 20em;" alt="icon" /></center><br /> </li> <li>Tweak the neural network constructed in the discussion above so that the output resembles the following curve: <br /><br /><center><img src="https://justinmath.com/files/blog/neural-network-regressor-reconstruction-challenge-3.png" style="border: none; height: 20em;" alt="icon" /></center><br /> </li> </ol>Justin Skycakjpskycak@gmail.comIt’s common to represent models via computational graphs. For example, consider the following multiple logistic regression model:Decision Trees2022-02-04T00:00:00-08:002022-02-04T00:00:00-08:00https://justinmath.com/decision-trees<p>A <b>decision tree</b> is a graphical flowchart that represents a sequence of nested “if-then” decision rules. 
To illustrate, first recall the following cookie data set that was introduced during the discussion of k-nearest neighbors:</p> <center> \begin{align*} \begin{matrix} \textrm{Cookie Type} &amp; \textrm{Portion Butter} &amp; \textrm{Portion Sugar} \\ \hline \textrm{Shortbread} &amp; 0.15 &amp; 0.2 \\ \textrm{Shortbread} &amp; 0.15 &amp; 0.3 \\ \textrm{Shortbread} &amp; 0.2 &amp; 0.25 \\ \textrm{Shortbread} &amp; 0.25 &amp; 0.4 \\ \textrm{Shortbread} &amp; 0.3 &amp; 0.35 \\ \textrm{Sugar} &amp; 0.05 &amp; 0.25 \\ \textrm{Sugar} &amp; 0.05 &amp; 0.35 \\ \textrm{Sugar} &amp; 0.1 &amp; 0.3 \\ \textrm{Sugar} &amp; 0.15 &amp; 0.4 \\ \textrm{Sugar} &amp; 0.25 &amp; 0.35 \end{matrix} \end{align*} </center> <p><br /></p> <p>The following decision tree was algorithmically constructed to classify an unknown cookie as a shortbread cookie or sugar cookie based on its portions of butter and sugar.</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree.png" style="border: none; height: 30em;" alt="icon" /></center> <p><br /></p> <h2>Using a Decision Tree</h2> <p>To use the decision tree to classify an unknown cookie, we start at the top of the tree and then repeatedly go downwards and left or right depending on the values of $x$ and $y.$</p> <p>For example, suppose we have a cookie with $0.25$ portion butter and $0.35$ portion sugar. 
To classify this cookie, we start at the top of the tree and then go</p> <ol> <li>right $(\textrm{butter} &gt; 0.125),$</li> <li>right $(\textrm{sugar} &gt; 0.325),$</li> <li>right $(\textrm{butter} &gt; 0.2),$</li> <li>left $(\textrm{butter} \leq 0.275),$</li> <li>left $(\textrm{sugar} \leq 0.375),$</li> </ol> <p>reaching the prediction that the cookie is a sugar cookie.</p> <h2>Classification Boundary</h2> <p>Let’s take a look at how the decision tree classifies the points in our data set:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-with-data.png" style="border: none; height: 30em;" alt="icon" /></center> <p><br /></p> <p>We can visualize this in the plane by drawing the <b>classification boundary</b>, shading the regions whose points would be classified as shortbread cookies and keeping unshaded the regions whose points would be classified as sugar cookies. Each dotted line corresponds to a <b>split</b> in the tree.</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-classification-boundary.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <h2>Building a Decision Tree: Reducing Impurity</h2> <p>The algorithm for building a decision tree is conceptually simple. The goal is to make the simplest tree such that the leaf nodes are <i>pure</i> in the sense that they only contain data points from one class. So, we repeatedly split <i>impure</i> leaf nodes in the way that most quickly reduces the impurity.</p> <p>Intuitively, a node has $0$ impurity when all of its data points come from one class.
On the other hand, a node has maximum impurity when an equal amount of its data points come from each class.</p> <p>To quantify a node’s impurity, all we have to do is count up the proportion $p$ of the node’s data points that are from one particular class and then apply a function that transforms $p$ into a measure of impurity.</p> <ul> <li>If $p=0$ or $p=1,$ then the node has no impurity since its data points are entirely from one class.</li> <li>If $p=0.5,$ then the node has maximum impurity since half of its data points come from one class and the other half comes from the other class.</li> </ul> <p>Graphically, our function should look like this:</p> <center><img src="https://justinmath.com/files/blog/decision-tree-split-impurity.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Two commonly used functions that yield the above graph are <b>Gini impurity</b>, defined as</p> <center> \begin{align*} G(p) = 1 - p^2 - (1-p)^2, \end{align*} </center> <p><br /> and <b>information entropy</b>, defined as</p> <center> \begin{align*} H(p) = -p \log_2 p - (1-p) \log_2 (1-p). \end{align*} </center> <p><br /> Although these functions may initially look a little complicated, note that their forms permit them to be easily generalized to situations where we have more than two classes:</p> <center> \begin{align*} G &amp;= 1 - \sum_i p_i^2 \\[5pt] H &amp;= - \sum_i p_i \log_2 p_i, \end{align*} </center> <p><br /> where $p_i$ is the proportion of the $i$th class. 
(In our situation we only have two classes with proportions $p_1 = p$ and $p_2 = 1 - p.$)</p> <h2>Worked Example: Split 0</h2> <p>As we walk through the algorithm for building our decision tree, we’ll use Gini impurity since it simplifies nicely in the case of two classes, making it more amenable to manual computation:</p> <center> \begin{align*} G(p) &amp;= 1 - p^2 - (1-p)^2 \\[3pt] &amp;= 2p(1-p) \end{align*} </center> <p><br /></p> <p>Initially, our decision tree is just a single root node, i.e. a “stump” with no splits. It contains our full data set, shown below.</p> <center><img src="https://justinmath.com/files/blog/knn-dataset.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <h2>Worked Example: Split 1</h2> <p>Remember that our goal is to repeatedly split <i>impure</i> leaf nodes in the way that most quickly reduces the impurity. To find the split that most quickly reduces the impurity, we loop over all possible splits and compare the impurity before the split to the impurity after the split.</p> <p>The impurity before the split is the same for all possible splits, so we will calculate it first. In the graph above there are $5$ points that represent shortbread cookies and $5$ points that represent sugar cookies, so $p=\dfrac{5}{5+5}=\dfrac{1}{2}$ and the impurity is computed as</p> <center> \begin{align*} G_\textrm{before} &amp;= G\left( \dfrac{1}{2} \right) \\[5pt] &amp;= 2\left( \dfrac{1}{2} \right)\left( 1 - \dfrac{1}{2} \right) \\[5pt] &amp;= 0.5. \end{align*} </center> <p><br /> Now, let’s find all the possible splits. 
To find the values of $x$ that could be chosen for splits, we first find all the distinct values of $x$ that are hit by points and put them in order:</p> <center> \begin{align*} x &amp;= 0.05, 0.1, 0.15, 0.2, 0.25, 0.3 \end{align*} </center> <p><br /> The possible splits along the $x$-axis are the midpoints between consecutive entries in the list above:</p> <center> \begin{align*} x_\textrm{split} &amp;= 0.075, 0.125, 0.175, 0.225, 0.275 \end{align*} </center> <p><br /> Performing the same process for $y$-coordinates, we get the following:</p> <center> \begin{align*} y &amp;= 0.2, 0.25, 0.3, 0.35, 0.4 \\[3pt] y_\textrm{split} &amp;= 0.225, 0.275, 0.325, 0.375 \end{align*} </center> <p><br /> Let’s go through each possible split and measure the impurity after the split. In general, the impurity after the split is measured as a weighted average of the new leaf nodes resulting from the split:</p> <center> \begin{align*} \textrm{impurity after} &amp;= (\textrm{portion data points in} \leq \textrm{node}) \times (\textrm{impurity of} \leq \textrm{node}) \\[3pt] &amp;\phantom{=} + (\textrm{portion data points in} &gt; \textrm{node}) \times (\textrm{impurity of} &gt; \textrm{node}) \end{align*} </center> <p><br /> The formula above can be represented more concisely as</p> <center> \begin{align*} G_\textrm{after} &amp;= p_\leq G_\leq + p_&gt; G_&gt;. \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.075$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-x1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>The $x \leq 0.05$ node would be pure with $2$ sugar cookies, giving an impurity of</p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{0}{2} \right) \left( \dfrac{2}{2} \right) = 0. 
\end{align*} </center> <p><br /> On the other hand, the $x &gt; 0.05$ node would contain $5$ shortbread cookies and $3$ sugar cookies, giving an impurity of</p> <center> \begin{align*} G_&gt; &amp;= 2 \left( \dfrac{5}{8} \right) \left( \dfrac{3}{8} \right)= \dfrac{30}{64}. \end{align*} </center> <p><br /></p> <p>The $\leq$ node would contain $2$ points while the $&gt;$ node would contain $8$ points, giving proportions $p_\leq = \dfrac{2}{10}$ and $p_&gt; = \dfrac{8}{10}.$</p> <p>Finally, the impurity after the split would be</p> <center> \begin{align*} G_\textrm{after} &amp;= p_\leq G_\leq + p_&gt; G_&gt; \\[5pt] &amp;= \left( \dfrac{2}{10} \right) \left( 0 \right) + \left( \dfrac{8}{10} \right) \left( \dfrac{30}{64} \right) \\[5pt] &amp;= 0.375. \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.125$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-x2.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Repeating the same process, we have $p_\leq = \dfrac{3}{10}$ and $p_&gt; = \dfrac{7}{10},$ and we get the following impurities:</p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{0}{3} \right) \left( \dfrac{3}{3} \right) = 0 \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{5}{7} \right) \left( \dfrac{2}{7} \right) = \dfrac{20}{49} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{3}{10} \right) \left( 0 \right) + \left( \dfrac{7}{10} \right) \left( \dfrac{20}{49} \right) \approx 0.286.
\end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.175$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-x3.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{2}{6} \right) \left( \dfrac{4}{6} \right) = \dfrac{16}{36} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{3}{4} \right) \left( \dfrac{1}{4} \right) = \dfrac{6}{16} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{6}{10} \right) \left( \dfrac{16}{36} \right) + \left( \dfrac{4}{10} \right) \left( \dfrac{6}{16} \right) \approx 0.417 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.225$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-x4.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{3}{7} \right) \left( \dfrac{4}{7} \right) = \dfrac{24}{49} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{2}{3} \right) \left( \dfrac{1}{3} \right) = \dfrac{4}{9} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{7}{10} \right) \left( \dfrac{24}{49} \right) + \left( \dfrac{3}{10} \right) \left( \dfrac{4}{9} \right) \approx 0.476 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.275$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-x5.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{4}{9} \right) \left( \dfrac{5}{9} \right) = \dfrac{40}{81} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{9}{10} \right) \left( \dfrac{40}{81} \right) + \left( \dfrac{1}{10} \right) \left( 0 \right) \approx 0.444 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.225$</i></p> 
<center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-y1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{4}{9} \right) \left( \dfrac{5}{9} \right) = \dfrac{40}{81} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{1}{10} \right) \left( 0 \right) + \left( \dfrac{9}{10} \right) \left( \dfrac{40}{81} \right) \approx 0.444 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.275$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-y2.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{2}{3} \right) \left( \dfrac{1}{3} \right) = \dfrac{4}{9} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{3}{7} \right) \left( \dfrac{4}{7} \right) = \dfrac{24}{49} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{3}{10} \right) \left( \dfrac{4}{9} \right) + \left( \dfrac{7}{10} \right) \left( \dfrac{24}{49} \right) \approx 0.476 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.325$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-y3.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{3}{5} \right) \left( \dfrac{2}{5} \right) = \dfrac{12}{25} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{2}{5} \right) \left( \dfrac{3}{5} \right) = \dfrac{12}{25} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{5}{10} \right) \left( \dfrac{12}{25} \right) + \left( \dfrac{5}{10} \right) \left( \dfrac{12}{25} \right) = 0.48 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.375$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split1-possibility-y4.png" 
style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{4}{8} \right) \left( \dfrac{4}{8} \right) = \dfrac{32}{64} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{2} \right) \left( \dfrac{1}{2} \right) = \dfrac{2}{4} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{8}{10} \right) \left( \dfrac{32}{64} \right) + \left( \dfrac{2}{10} \right) \left( \dfrac{2}{4} \right) = 0.5 \end{align*} </center> <p><br /></p> <p><i>Best Split</i></p> <p>Remember that the initial impurity before splitting was $G_\textrm{before} = 0.5.$ Let’s compute how much each potential split would decrease the impurity:</p> <center> \begin{align*} \begin{matrix} \textrm{Split} &amp; \big\vert &amp; G_\textrm{before} - G_\textrm{after} \\ \hline x_\textrm{split} = 0.075 &amp; \big\vert &amp; 0.5 - 0.375 &amp; = 0.125 \\ x_\textrm{split} = 0.125 &amp; \big\vert &amp; 0.5 - 0.289 &amp; = \mathbf{0.211} \\ x_\textrm{split} = 0.175 &amp; \big\vert &amp; 0.5 - 0.417 &amp; = 0.083 \\ x_\textrm{split} = 0.225 &amp; \big\vert &amp; 0.5 - 0.476 &amp; = 0.024 \\ x_\textrm{split} = 0.275 &amp; \big\vert &amp; 0.5 - 0.444 &amp; = 0.056 \\ y_\textrm{split} = 0.225 &amp; \big\vert &amp; 0.5 - 0.444 &amp; = 0.056 \\ y_\textrm{split} = 0.275 &amp; \big\vert &amp; 0.5 - 0.476 &amp; = 0.024 \\ y_\textrm{split} = 0.325 &amp; \big\vert &amp; 0.5 - 0.48\phantom{0} &amp; = 0.02\phantom{0} \\ y_\textrm{split} = 0.375 &amp; \big\vert &amp; 0.5 - 0.5\phantom{00} &amp; = 0\phantom{.000} \end{matrix} \end{align*} </center> <p><br /></p> <p>According to the table above, the best split is $x_\textrm{split} = 0.125$ since it decreases the impurity the most. 
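Every $G_\textrm{after}$ value in the table above comes from the same recipe: compute the Gini impurity of each child node, then average the two weighted by the fraction of points each child receives. As a minimal Python sketch (the function names are my own; the class tallies are read off the $x_\textrm{split} = 0.175$ candidate worked out above):

```python
def gini(p):
    # Gini impurity of a two-class node, where p is the fraction of
    # points in one of the classes: G(p) = 2 p (1 - p)
    return 2 * p * (1 - p)

def split_impurity(left, right):
    # left and right are (shortbread, sugar) tallies for the two child
    # nodes; returns the size-weighted average impurity G_after
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left[0] / n_left) + (n_right / n) * gini(right[0] / n_right)

# candidate x_split = 0.175: 6 points on the left (2 shortbread, 4 sugar),
# 4 points on the right (3 shortbread, 1 sugar)
print(round(split_impurity((2, 4), (3, 1)), 3))  # 0.417
```

The same two functions reproduce every other row of the table.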
We integrate this split into our decision tree:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-with-data-split1.jpg" style="border: none; height: 15em;" alt="icon" /></center> <p><br /></p> <p>This decision tree can be visualized in the plane as follows:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-classification-boundary-split1.png" style="border: none; height: 15em;" alt="icon" /></center> <p><br /></p> <h2>Worked Example: Split 2</h2> <p>Again, we repeat the process and split any impure leaf nodes in the tree. There is exactly one impure leaf node $(x &gt; 0.125)$ and it contains $5$ shortbread and $2$ sugar cookies, giving an impurity of</p> <center> \begin{align*} G_\textrm{before} &amp;= G\left( \dfrac{5}{7} \right) \\[5pt] &amp;= 2\left( \dfrac{5}{7} \right)\left( \dfrac{2}{7} \right) \\[5pt] &amp;\approx 0.408 \end{align*} </center> <p><br /> To find the possible splits, we first find the distinct values of $x$ and $y$ that are hit by points in this node and put them in order:</p> <center> \begin{align*} x &amp;= 0.15, 0.2, 0.25, 0.3 \\[3pt] y &amp;= 0.2, 0.25, 0.3, 0.35, 0.4 \end{align*} </center> <p><br /> The possible splits are the midpoints between consecutive entries in the list above:</p> <center> \begin{align*} x_\textrm{split} &amp;= 0.175, 0.225, 0.275 \\[3pt] y_\textrm{split} &amp;= 0.225, 0.275, 0.325, 0.375 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.175$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split2-possibility-x1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Remember that we are only splitting the region covered by the $x &gt; 0.125$ node, which contains $7$ data points. 
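Enumerating the candidate thresholds listed above, the midpoints between consecutive distinct coordinate values, is straightforward to automate. A quick sketch using the $x$-values from this node (rounding is only to tidy up floating-point noise):

```python
def candidate_splits(values):
    # sort the distinct values and return the midpoint between
    # each pair of consecutive entries
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# distinct x-values of the 7 points in the impure node
print([round(t, 3) for t in candidate_splits([0.15, 0.2, 0.25, 0.3])])  # [0.175, 0.225, 0.275]
```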
We can ignore the $3$ data points left of the dark dotted line, since they are not contained within the node that we are splitting.</p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{2}{3} \right) \left( \dfrac{1}{3} \right) = \dfrac{4}{9} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{3}{4} \right) \left( \dfrac{1}{4} \right) = \dfrac{6}{16} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{3}{7} \right) \left( \dfrac{4}{9} \right) + \left( \dfrac{4}{7} \right) \left( \dfrac{6}{16} \right) \approx 0.405 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.225$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split2-possibility-x2.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{3}{4} \right) \left( \dfrac{1}{4} \right) = \dfrac{6}{16} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{2}{3} \right) \left( \dfrac{1}{3} \right) = \dfrac{4}{9} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{4}{7} \right) \left( \dfrac{6}{16} \right) + \left( \dfrac{3}{7} \right) \left( \dfrac{4}{9} \right) \approx 0.405 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.275$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split2-possibility-x3.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{4}{6} \right) \left( \dfrac{2}{6} \right) = \dfrac{16}{36} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{6}{7} \right) \left( \dfrac{16}{36} \right) + \left( \dfrac{1}{7} \right) \left( 0 \right) \approx 0.381 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.225$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split2-possibility-y1.png" style="border: none; height: 20em;" alt="icon"
/></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{4}{6} \right) \left( \dfrac{2}{6} \right) = \dfrac{16}{36} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{1}{7} \right) \left( 0 \right) + \left( \dfrac{6}{7} \right) \left( \dfrac{16}{36} \right) \approx 0.381 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.275$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split2-possibility-y2.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{2}{2} \right) \left( \dfrac{0}{2} \right) = 0 \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{3}{5} \right) \left( \dfrac{2}{5} \right) = \dfrac{12}{25} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{2}{7} \right) \left( 0 \right) + \left( \dfrac{5}{7} \right) \left( \dfrac{12}{25} \right) \approx 0.343 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.325$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split2-possibility-y3.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{3}{3} \right) \left( \dfrac{0}{3} \right) = 0 \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{2}{4} \right) \left( \dfrac{2}{4} \right) = \dfrac{8}{16} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{3}{7} \right) \left( 0 \right) + \left( \dfrac{4}{7} \right) \left( \dfrac{8}{16} \right) \approx 0.286 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.375$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split2-possibility-y4.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{4}{5} \right) \left( \dfrac{1}{5} \right) = \dfrac{8}{25} \\[5pt] G_&gt; 
&amp;= 2 \left( \dfrac{1}{2} \right) \left( \dfrac{1}{2} \right) = \dfrac{1}{2} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{5}{7} \right) \left( \dfrac{8}{25} \right) + \left( \dfrac{2}{7} \right) \left( \dfrac{1}{2} \right) \approx 0.371 \end{align*} </center> <p><br /></p> <p><i>Best Split</i></p> <p>The best split is $y_\textrm{split} = 0.325$ since it decreases the impurity the most.</p> <center> \begin{align*} \begin{matrix} \textrm{Split} &amp; \big\vert &amp; G_\textrm{before} - G_\textrm{after} \\ \hline x_\textrm{split} = 0.175 &amp; \big\vert &amp; 0.408 - 0.405 &amp; = 0.003 \\ x_\textrm{split} = 0.225 &amp; \big\vert &amp; 0.408 - 0.405 &amp; = 0.003 \\ x_\textrm{split} = 0.275 &amp; \big\vert &amp; 0.408 - 0.381 &amp; = 0.027 \\ y_\textrm{split} = 0.225 &amp; \big\vert &amp; 0.408 - 0.381 &amp; = 0.027 \\ y_\textrm{split} = 0.275 &amp; \big\vert &amp; 0.408 - 0.343 &amp; = 0.065 \\ y_\textrm{split} = 0.325 &amp; \big\vert &amp; 0.408 - 0.286 &amp; = \mathbf{0.122} \\ y_\textrm{split} = 0.375 &amp; \big\vert &amp; 0.408 - 0.371 &amp; = 0.037 \end{matrix} \end{align*} </center> <p><br /> We integrate this split into our decision tree:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-with-data-split2.jpg" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>This decision tree can be visualized in the plane as follows:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-classification-boundary-split2.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <h2>Worked Example: Split 3</h2> <p>Again, we repeat the process and split any impure leaf nodes in the tree. 
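Each round of this repeated process — list the candidate thresholds along each coordinate, score each one by its impurity decrease, and keep the best — can be collected into one routine. Below is a rough self-contained sketch (the point representation and the toy data at the bottom are hypothetical, not the cookie data set):

```python
def gini(labels):
    # two-class Gini impurity computed from a list of class labels
    p = labels.count(labels[0]) / len(labels)
    return 2 * p * (1 - p)

def best_split(points):
    # points: list of ((x, y), label) pairs; returns the
    # (feature, threshold, decrease) with the largest impurity decrease,
    # or None if no candidate split decreases the impurity
    g_before = gini([label for _, label in points])
    best = None
    for feature in range(2):
        values = sorted({coords[feature] for coords, _ in points})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2  # candidate threshold: midpoint of consecutive values
            left = [label for coords, label in points if coords[feature] <= t]
            right = [label for coords, label in points if coords[feature] > t]
            g_after = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
            decrease = g_before - g_after
            if decrease > 0 and (best is None or decrease > best[2]):
                best = (feature, t, decrease)
    return best

# toy data: two classes perfectly separable at x = 0.25
points = [((0.1, 0.5), "shortbread"), ((0.2, 0.4), "shortbread"),
          ((0.3, 0.5), "sugar"), ((0.4, 0.4), "sugar")]
print(best_split(points))  # (0, 0.25, 0.5)
```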
There is exactly one impure leaf node $(x &gt; 0.125$ $\to$ $y &gt; 0.325)$ and it contains $2$ shortbread and $2$ sugar cookies, giving an impurity of</p> <center> \begin{align*} G_\textrm{before} &amp;= G\left( \dfrac{2}{4} \right) \\[5pt] &amp;= 2\left( \dfrac{2}{4} \right)\left( \dfrac{2}{4} \right) \\[5pt] &amp;= 0.5. \end{align*} </center> <p><br /> To find the possible splits, we first find the distinct values of $x$ and $y$ that are hit by points in this node and put them in order:</p> <center> \begin{align*} x &amp;= 0.15, 0.25, 0.3 \\[3pt] y &amp;= 0.35, 0.4 \end{align*} </center> <p><br /> The possible splits are the midpoints between consecutive entries in the list above:</p> <center> \begin{align*} x_\textrm{split} &amp;= 0.2, 0.275 \\[3pt] y_\textrm{split} &amp;= 0.375 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.2$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split3-possibility-x1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>Remember that we are only splitting the region covered by the $x &gt; 0.125$ $\to$ $y &gt; 0.325$ node, which contains $4$ data points. 
We can ignore the $6$ data points outside of this region, since they are not contained within the node that we are splitting.</p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{0}{1} \right) \left( \dfrac{1}{1} \right) = 0 \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{2}{3} \right) \left( \dfrac{1}{3} \right) = \dfrac{4}{9} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{1}{4} \right) \left( 0 \right) + \left( \dfrac{3}{4} \right) \left( \dfrac{4}{9} \right) \approx 0.333 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.275$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split3-possibility-x2.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{1}{3} \right) \left( \dfrac{2}{3} \right) = \dfrac{4}{9} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{3}{4} \right) \left( \dfrac{4}{9} \right) + \left( \dfrac{1}{4} \right) \left( 0 \right) \approx 0.333 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.375$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split3-possibility-y1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{1}{2} \right) \left( \dfrac{1}{2} \right) = \dfrac{2}{4} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{2} \right) \left( \dfrac{1}{2} \right) = \dfrac{2}{4} \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{2}{4} \right) \left( \dfrac{2}{4} \right) + \left( \dfrac{2}{4} \right) \left( \dfrac{2}{4} \right) = 0.5 \end{align*} </center> <p><br /></p> <p><i>Best Split</i></p> <p>This time, there is a tie for the best split: $x_\textrm{split} = 0.2$ and $x_\textrm{split} = 0.275$ both decrease impurity the most.</p> <center> \begin{align*} \begin{matrix} \textrm{Split} &amp; \big\vert &amp; 
G_\textrm{before} - G_\textrm{after} \\ \hline x_\textrm{split} = 0.2 &amp; \big\vert &amp; 0.5 - 0.333 &amp; = \mathbf{0.167} \\ x_\textrm{split} = 0.275 &amp; \big\vert &amp; 0.5 - 0.333 &amp; = \mathbf{0.167} \\ y_\textrm{split} = 0.375 &amp; \big\vert &amp; 0.5 - 0.5\phantom{00} &amp; = 0\phantom{000} \end{matrix} \end{align*} </center> <p><br /> When ties like this occur, it does not matter which split we choose. We will arbitrarily choose the split that we encountered first, $x_\textrm{split} = 0.2,$ and integrate this split into our decision tree:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-with-data-split3.jpg" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>This decision tree can be visualized in the plane as follows:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-classification-boundary-split3.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <h2>Worked Example: Split 4</h2> <p>Again, we repeat the process and split any impure leaf nodes in the tree. There is exactly one impure leaf node $(x &gt; 0.125$ $\to$ $y &gt; 0.325$ $\to$ $x &gt; 0.2)$ and it contains $2$ shortbread and $1$ sugar cookie, giving an impurity of</p> <center> \begin{align*} G_\textrm{before} &amp;= G\left( \dfrac{2}{3} \right) \\[5pt] &amp;= 2\left( \dfrac{2}{3} \right)\left( \dfrac{1}{3} \right) \\[5pt] &amp;\approx 0.444. 
\end{align*} </center> <p><br /> To find the possible splits, we first find the distinct values of $x$ and $y$ that are hit by points in this node and put them in order:</p> <center> \begin{align*} x &amp;= 0.25, 0.3 \\[3pt] y &amp;= 0.35, 0.4 \end{align*} </center> <p><br /> The possible splits are the midpoints between consecutive entries in the list above:</p> <center> \begin{align*} x_\textrm{split} &amp;= 0.275 \\[3pt] y_\textrm{split} &amp;= 0.375 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $x_\textrm{split} = 0.275$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split4-possibility-x1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{1}{2} \right) \left( \dfrac{1}{2} \right) = \dfrac{2}{4} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{2}{3} \right) \left( \dfrac{2}{4} \right) + \left( \dfrac{1}{3} \right) \left( 0 \right) \approx 0.333 \end{align*} </center> <p><br /></p> <p><i>Possible Split: $y_\textrm{split} = 0.375$</i></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split4-possibility-y1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{1}{2} \right) \left( \dfrac{1}{2} \right) = \dfrac{2}{4} \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{2}{3} \right) \left( \dfrac{2}{4} \right) + \left( \dfrac{1}{3} \right) \left( 0 \right) \approx 0.333 \end{align*} </center> <p><br /></p> <p><i>Best Split</i></p> <p>Again, there is a tie for the best split: $x_\textrm{split} = 0.275$ and $y_\textrm{split} = 0.375$ both decrease impurity the most.</p> <center> \begin{align*} \begin{matrix} \textrm{Split} &amp; \big\vert &amp; G_\textrm{before} - G_\textrm{after} \\ 
\hline x_\textrm{split} = 0.275 &amp; \big\vert &amp; 0.444 - 0.333 &amp; = \mathbf{0.111} \\ y_\textrm{split} = 0.375 &amp; \big\vert &amp; 0.444 - 0.333 &amp; = \mathbf{0.111} \end{matrix} \end{align*} </center> <p><br /> We will arbitrarily choose the split that we encountered first, $x_\textrm{split} = 0.275,$ and integrate this split into our decision tree:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-with-data-split4.jpg" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>This decision tree can be visualized in the plane as follows:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-classification-boundary-split4.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <h2>Worked Example: Split 5</h2> <p>There is only one possibility for the next split, $y_\textrm{split} = 0.375,$ so it may be tempting to select it outright. But remember that we only want splits that lead to a decrease in impurity. So, it’s still necessary to compute the decrease in impurity before selecting this split.</p> <center> \begin{align*} G_\textrm{before} &amp;= G\left( \dfrac{1}{2} \right) \\[5pt] &amp;= 2\left( \dfrac{1}{2} \right)\left( \dfrac{1}{2} \right) \\[5pt] &amp;= 0.5.
\end{align*} </center> <p><br /></p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-split5-possibility-y1.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <center> \begin{align*} G_\leq &amp;= 2 \left( \dfrac{0}{1} \right) \left( \dfrac{1}{1} \right) = 0 \\[5pt] G_&gt; &amp;= 2 \left( \dfrac{1}{1} \right) \left( \dfrac{0}{1} \right) = 0 \\[5pt] G_\textrm{after} &amp;= \left( \dfrac{1}{2} \right) \left( 0 \right) + \left( \dfrac{1}{2} \right) \left( 0 \right) = 0 \end{align*} </center> <p><br /> Indeed, the impurity decreases by a positive amount</p> <center> \begin{align*} G_\textrm{before} - G_\textrm{after} &amp;= 0.5 - 0 \\ &amp;= 0.5 \\ &amp;&gt; 0, \end{align*} </center> <p><br /> so we select the split and integrate it into our decision tree:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-with-data.png" style="border: none; height: 30em;" alt="icon" /></center> <p><br /></p> <p>This decision tree can be visualized in the plane as follows:</p> <center><img src="https://justinmath.com/files/blog/cookie-decision-tree-classification-boundary.png" style="border: none; height: 20em;" alt="icon" /></center> <p><br /></p> <p>No more splits are possible, so we’re done.</p> <h2>Early Stopping</h2> <p>Note that when fitting decision trees, it’s common to stop splitting early so that the tree doesn’t overfit the data. This is often achieved by enforcing</p> <ul> <li>a <i>maximum depth</i> constraint (i.e. skip over any potential splits that would cause the tree to become deeper than some number of levels), or</li> <li>a <i>minimum split size</i> constraint (i.e.
do not split any leaf node that contains fewer than some number of data points).</li> </ul> <p>These parameters constrain how far the decision tree can read into the data, similar to how the degree parameter constrains a polynomial regression model and how $k$ constrains a k-nearest neighbors model.</p> <p>Also note that if we stop splitting early (or if the data set has duplicate points with different classes), we end up with impure leaf nodes. In such cases, impure leaf nodes are considered to predict the majority class of the data points they contain. If there is a tie, then we can go up a level and use the majority class of the parent node.</p> <h2>Random Forests</h2> <p>A common way to improve the performance of decision trees is to fit a bunch of decision trees on many different random subsets of the data, and then aggregate them together into a hive mind called a <b>random forest</b>. The random forest makes its predictions by</p> <ol> <li>allowing each individual decision tree to vote (i.e. make its own prediction), and then</li> <li>choosing whichever prediction received the most votes.</li> </ol> <p>This general approach is called <b>bootstrap aggregating</b> or <b>bagging</b> for short (because a random subset of the data is known as a <i>bootstrap sample</i>). Bootstrap aggregating can be applied to any model, though random forest is the most famous application.</p> <h2>Exercises</h2> <ol> <li>Implement the example that was worked out above.</li> <li>Construct a leave-one-out cross-validation curve where a maximum depth constraint is varied.</li> <li>Construct a leave-one-out cross-validation curve where a minimum split size constraint is varied.</li> <li>Construct a leave-one-out cross-validation curve for a random forest, where the number of trees in the forest is varied and each tree is trained on a random sample of $50\%$ of the data. 
You should see the performance increase and asymptote off with the number of trees.</li> <li> Construct a data set that leads to a decision tree that looks like the diagram shown below. Be sure to run your decision tree construction algorithm on the data set to verify the result. <br /><center><img src="https://justinmath.com/files/blog/binary-tree-depth-2.png" style="border: none; height: 10em;" alt="icon" /></center><br /> </li> <li> Construct a data set that leads to a decision tree that looks like the diagram shown below. Be sure to run your decision tree construction algorithm on the data set to verify the result. <br /><center><img src="https://justinmath.com/files/blog/binary-tree-depth-3.png" style="border: none; height: 15em;" alt="icon" /></center><br /> </li> </ol>