[In Progress] Notes: The Principles of Deep Learning Theory
The Principles of Deep Learning Theory - Roberts, Yaida, Hanin Read more...
The Principles of Deep Learning Theory - Roberts, Yaida, Hanin Read more...
(in progress) Read more...
Until now, we have been focused on regression problems, in which we predict an output quantity (that is often continuous). However, in the real world, it’s even more common to encounter classification problems, in which we predict an output class (i.e. category). For example, predicting how much money a person will spend at a store is a regression problem, whereas predicting which items the person will buy is a classification problem. Read more...
In many real-life situations, there is more than one factor that controls the quantity we’re trying to predict. That is to say, there is more than one input variable that controls the output variable. Read more...
Gradient descent can be applied to fit regression models. In particular, it can help us avoid the pitfalls that we’ve experienced when we attempt to fit nonlinear models using the pseudoinverse. Read more...
Multivariable gradient descent is similar to single-variable gradient descent, except that we replace the derivative $f’(x)$ with the gradient $\nabla f(\vec x)$ as follows: Read more...
Another method for fitting models to data, called gradient descent, involves minimizing the error between the model and the data by using the gradient vector from multivariable calculus. The idea is that we construct a function (like the RSS) that represents the error between the model and the data set and then minimize that function using the gradient descent procedure. Read more...
We have previously described a model as “accurate” when it appears to match closely with points in the data set. However, there are issues with this definition that we will need to remedy. In this section, we will expose these issues and develop a more nuanced understanding of model accuracy by way of a concrete example. Read more...
Previously, we learned that we can use the pseudoinverse to fit any regression model that can be expressed as a linear combination of functions. Unfortunately, there are a handful of useful models that cannot be expressed as a linear combination of functions. Here, we will explore $3$ of these models in particular. Read more...
Previously, we learned how to use the pseudoinverse to fit linear and polynomial models to data sets consisting of one or more input variables. However, the pseudoinverse is even more general than that. For example, let’s fit the model Read more...
Supervised learning consists of fitting functions to data sets. The idea is that we have a sample of inputs and outputs from some process, and we want to come up with a mathematical model that predicts what the outputs would be for other inputs. Supervised machine learning involves programming computers to compute such models. Read more...
A game tree is a data structure that represents all the possible outcomes of a game. It is a graph where the nodes correspond to the states of the game, and the edges correspond to actions that cause the game to transition from one state to another. Game trees are commonly used when coding up strategies for autonomous game-playing agents. Read more...
Create a WeightedGraph
class that is similar to your existing Graph
class, except that now, each edge is assigned a value called a weight. Read more...
Now, let’s extend your Graph
class with the methods calc_distance
and calc_shortest_path
. Below is a demonstration of these methods on a particular graph, followed by instructions regarding how to implement them. Read more...
Graphs show up all the time in computer science, so it’s important to know how to work with them. For example, consider the following graph: Read more...
The simplex method is a technique for maximizing linear expressions subject to linear constraints. Many applied problems can be framed like this – for example, a business may have a variety of raw materials and need to determine the most profitable inventory of products that they can produce from those materials. This would ultimately amount to maximizing a linear expression for profit, subject to a linear inequality for each type of raw material (the total amount of raw material used in the products produced must be less than or equal to the amount of raw material available). Read more...
Clustering is the act of finding groups of similar records within data. Some examples of clusters in real-life data include users who buy similar items, songs in a similar genre, and patients with similar health conditions. Read more...
Arrays can be used to implement more than just matrices. We can also implement other mathematical procedures like Euler estimation. Read more...
Recall the following procedure for converting a matrix to reduced row echelon form (RREF): Read more...
Let’s use arrays to implement matrices and their associated mathematical operations. (It is assumed that you are already familiar with matrix arithmetic and reduced row echelon form.) Read more...
Merge sort is a sorting algorithm that works by breaking an array in half, recursively calling merge sort on each half separately, and then merging the two sorted halves. The base case is that empty arrays and arrays of 1 item are already sorted (so if you call merge sort on such an array, it leaves the array as-is). Read more...
There are many different methods for sorting items in arrays. Let’s go over some of the simplest methods. Read more...
(Cross-post from eurisko.us) Read more...
Two of the simplest methods for estimating roots of functions are bisection search and the Newton-Rhapson method. We will cover both of these here. Read more...
An encoding function maps a string to a sequence of numbers. For example, you have already implemented the trivial encoding function, defined as follows: Read more...
A recursive sequence is a sequence where one or more terms are given and each following term is defined as a function of the previous terms. For example, consider the following rule for generating a recursive sequence: starting with $3,$ generate each term by doubling the previous term and adding $1.$ The terms of this sequence are as follows: Read more...
We are used to representing numbers using ten characters: $0,$ $1,$ $2,$ $3,$ $4,$ $5,$ $6,$ $7,$ $8,$ $9.$ This is called the decimal number system. However, there are other number systems that use more or fewer characters. For example, the binary or base-$2$ number system uses two characters ($0$ and $1$), and the hexadecimal or base-$16$ number system uses sixteen characters ($0,$ $1,$ $2,$ $3,$ $4,$ $5,$ $6,$ $7,$ $8,$ $9,$ $A,$ $B,$ $C,$ $D,$ $E$ where $A=11,$ $B=12,$ $C=13,$ $D=14,$ $E=15$). Read more...
Let’s get started on some introductory computation exercises. We will work in Python, but the general ideas are applicable to any programming language. The goal of this guide is to provide enough scaffolding to give you direction, but not to hand-hold you through the process. Read more...
See below for the video version of this post: Read more...
See below for the video version of this post: Read more...
See below for the video version of this post: Read more...
In this post, we will learn how to solve systems of linear differential equations. These systems take the form shown below. Read more...
As we saw previously, not every matrix is diagonalizable, even when allowing complex eigenvalues/eigenvectors. The matrix below was given as an example of a non-diagonalizable matrix. Read more...
In this post, we introduce an interesting application of matrix diagonalization: constructing closed-form expressions for recursive series. Read more...
Suppose we want to compute a matrix raised to a large power, i.e. multiplied by itself many times. Read more...
In this post, we introduce the idea of the inverse of a matrix, which undoes the transformation of that matrix. For example, it’s straightforward that the inverse of the rescaling matrix Read more...
The key insight in this post is that every square matrix can be decomposed into a product of rescalings and shears. Before we elaborate on that, though, let’s discuss what rescalings and shears are, in terms of matrices. Read more...
We have seen how to multiply a vector by a matrix. Now, we will see how to multiply a matrix by another matrix. Whereas multiplying a vector by a matrix corresponds to a linear transformation of that vector, multiplying a matrix by another matrix corresponds to a composition of linear transformations. Read more...
Let’s create a compact notation for expressing systems of linear equations like the one shown below. Read more...
Until this point we’ve been working exclusively with linear systems. However, solving linear systems can sometimes be a necessary component of solving nonlinear systems. For example, recall the variation of parameters method for solving a second-order differential equation of the form Read more...
Not only can a nonzero determinant tell us that a linear system has exactly one solution – the nonzero determinant can also help us quickly find that solution through a process known as Cramer’s rule. Read more...
We have seen that the space of linear equations is actually a vector space, and that the linear equations in any particular system span a subspace of this vector space. However, there is also another way to interpret linear systems in terms of vectors: a linear system can be interpreted as a single vector equation stating that some multiples of particular vectors add up to another particular vector. Read more...
N-dimensional volume generalizes the idea of the space occupied by an object:
Recall that systems of linear equations can be solved through elimination, multiplying equations by constants and adding equations to each other to cancel variables. For example, to solve the following linear system Read more...
The span of a set of vectors consists of all vectors that can be made by adding multiples of vectors in the set. For example, the span of the set $\left\lbrace \left< 1,0 \right>, \left< 0,1 \right> \right\rbrace$ is just the entirety of the 2-dimensional plane: any vector $\left< x,y \right>$ in this plane can be made by adding $x \left< 1,0 \right> + y \left< 0,1 \right>$. For example, $\left< 5,-2 \right>$ can be written as $5\left< 1,0 \right> - 2 \left< 0,1 \right>$. Read more...
A line starts at an initial point and proceeds straight in a constant direction. Thus, we can write the equation of a line as Read more...
We know how to multiply a vector by a scalar, but what does it mean to multiply a vector by another vector? The two most common interpretations of vector multiplication are the dot product, and for vectors in 3 dimensions, the cross product. Read more...
N-dimensional space consists of points that have N components. For example, $(0,0)$ is the origin in 2-dimensional space, $(0,0,0)$ is the origin in 3-dimensional space, and $(0,0,0,0)$ is the origin in 4-dimensional space. Similarly, the points $(0,0)$, $(1,0)$, $(0,1)$ are the corners of a triangle in 2-dimensional space, the points $(0,0,0)$, $(1,0,0)$, $(0,1,0)$, $(0,0,1)$ are the corners of a tetrahedron in 3-dimensional space, and the points $(0,0,0,0)$, $(1,0,0,0)$, $(0,1,0,0)$, $(0,0,1,0)$, $(0,0,0,1)$ are the corners of a “hypertetrahedron” in 4-dimensional space. Read more...
Many differential equations don’t have solutions which can be expressed in terms of finite combinations of familiar functions. However, we can often solve for the Taylor series of the solution. Read more...
To find the Taylor series of complicated functions, it’s often easiest to manipulate the Taylor series of simpler functions, such as those given below. Read more...
The sum formula for a geometric series is an example of a representation of a non-polynomial function as an infinite polynomial, within a particular range of inputs. Read more...
Previously, we saw that sum formulas are only valid for series which converge. But how can we tell whether a series converges or diverges, in the first place? Read more...
A geometric series is a sum of the form $r+r^2+r^3+\cdots$ for some number $r$. For example, when $r=\frac{1}{2}$, the corresponding geometric series is $\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots$. This series might look like it grows bigger and bigger as you add more terms, but there is actually a limit to how big it can get. Read more...
When we know the zero solutions $y_0$ of a differential equation $y’^\prime+a_1(x)y’+a_0(x)y=f(x)$, we can use a method called variation of parameters to find the particular solution. This method is especially useful in cases where we are unable to guess the particular solution through undetermined coefficients. Read more...
We know how to solve differential equations of the form Read more...
In the previous post, we learned how to solve differential equations of the form Read more...
In this post, we learn a technique for solving differential equations of the form Read more...
Sometimes, non-separable differential equations can be converted into separable differential equations by way of substitution. For example, $y’+y=x$ is a non-separable differential equation as-is. However, we can make a variable substitution $u=x-y$ to turn it into a separable differential equation. Differentiating both sides of $u=x-y$ with respect to $x$, and interpreting $y$ as a function of $x$, we have $u’=1-y’$, so $y’=1-u’$. Substituting, the equation becomes separable and thus solvable in terms of $u$. Read more...
When faced with a differential equation that we don’t know how to solve, we can sometimes still approximate the solution by simpler methods. If we just want to get an idea of what the solutions of the differential equation look like on a graph, we can construct a slope field. Read more...
In differential equations, we are given an equation in terms of the derivative(s) of some function, and we need to solve for the function which makes the equation true. For example, a simple differential equation is $y’=2x$, and its solution is just the antiderivative $y=x^2+C$. Read more...
Improper integrals are integrals which have bounds or function values which extend to positive or negative infinity. For example, $\int_1^\infty \frac{1}{x^2} \, dx$ is an improper integral because its upper bound is at infinity. Likewise, $\int_0^1 \frac{1}{\sqrt{x}} \, dx$ is an improper integral because $\frac{1}{\sqrt{x}}$ approaches infinity as approaches the lower bound of integration, $0$. Read more...
Integration by parts is another technique for simplifying integrals. We can apply integration by parts whenever an integral would be made simpler by differentiating some expression within the integral, at the cost of anti-differentiating another expression within the integral. The formula for integration by parts is given below: Read more...
Complicated integrals can sometimes be made simpler through the method of substitution. Substitution involves condensing an expression of $x$ into a single variable, say $u$, and then expressing the integral in terms of $u$ instead of $x$. Read more...
In the last post, we learned how to evaluate integrals of the form $\int f(x) \, dx$, which are also known as indefinite integrals. In this section, we shall be concerned with definite integrals, which have bounds of integration. The definite integral $\int_a^b f(x) \, dx$ is evaluated by first finding the antiderivative $F(x) = \int f(x) \, dx$, and then computing the difference between the values of the antiderivative at the indicated bounds. Read more...
An antiderivative of a function $f(x)$ is a function $F(x)$ whose derivative is $f(x)$, i.e. $F’(x)=f(x)$. For example, an antiderivative of $2x$ is $x^2$, because $(x^2)’=2x$. Another antiderivative of $2x$ is $x^2+1$, because $(x^2+1)’ = 2x$. To encapsulate all possibilities, we say that the antiderivative of $2x$ is $x^2+C$ where $C$ is a constant. Read more...
L’Hôpital’s rule provides a way to evaluate limits that take the indeterminate forms of $\frac{0}{0}$ or $\frac{\infty}{\infty}$. It says that, for such limits, we can differentiate the numerator and denominator separately, without changing the actual value of the limit. Read more...
The chain rule tells us that we can treat the derivative $\frac{df}{dx}$ like a fraction when multiplying by other derivatives. In this section, we continue the idea of interpreting the derivative as a fraction, extending it to an even more literal sense. Read more...
Derivatives can be used to find a function’s local extreme values, its peaks and valleys. At its peaks and valleys, a function’s derivative is either $0$ (a smooth, rounded peak/valley) or undefined (a sharp, pointy peak/valley). Read more...
In this section, we introduce rules for the derivatives of exponential, logarithmic, trigonometric, and inverse trigonometric functions. Although it’s possible to compute each derivative using the difference quotient, it will take a long time to compute derivatives during calculus problems if we have to start from scratch with the difference quotient process every time – so it’s advantageous to remember the derivative rules. The derivative rules are to calculus, what the multiplication table is to arithmetic. Read more...
We know that when differentiating polynomials, we can differentiate each term individually. But why are we able to do this? Does multiplication work the same way? What about division? We answer these questions in this section. Read more...
The chain rule tells us how to take derivatives of compositions of functions. Informally, it says that we can “forget” about the inside of a function when we take the derivative, as long as we multiply by the derivative of the inside afterwards. For example, to differentiate $(x^2+1)^{100}$, we can use the power rule, as long as we multiply by the derivative of the inside $(x^2+1)$ afterwards. Read more...
It can be a pain to evaluate the difference quotient every time we want to take the derivative of a function. Luckily, there are some patterns in derivatives that allow us to compute derivatives without having to go through all the steps of computing the limit of the difference quotient. One such pattern is the power rule, which tells us that the derivative of a function $f(x)=x^n$, where $n$ is some constant number, is given by $f’(x)=nx^{n-1}$. Several examples are shown below. Read more...
The derivative of a function is the function’s slope at a particular point. We can approximate the derivative at a point $(x,f(x))$ by choosing another nearby point on the function, and computing the slope. If we increase the input $x$ by a small amount $\Delta x$, then we reach an x-coordinate of $x+\Delta x$, and the corresponding point on the function is $(x+\Delta x, f(x+\Delta x))$. Read more...
A useful property of limits is that they can be brought inside continuous functions, i.e. the limit of a continuous function is the function of the limit. For example, $\sqrt{x}$ is a continuous function, so to take the limit of the square root of some expression, we can first find the limit of the expression and then take the square root. Read more...
The limit of a function $f(x)$, as $x$ approaches some value $a$, is the value we would expect for $f(a)$ if we saw only the portion of the graph around (but not including) $x=a$. If the resulting value is $L$, then we denote the limit as follows: Read more...
Compositions of functions consist of multiple functions linked together, where the output of one function becomes the input of another function. For example, the function $2x^2$ can be thought of as the composition of two functions: the first function squares the input, and then the second function doubles the input. Read more...
Inverting a function entails reversing the outputs and inputs of the function. For example, if inputting $x=1$ into a function $f$ produces an output $f(1)=3$, then inputting $x=3$ into the inverse function $f^{-1}$ results in the output $f^{-1}(3)=1$. Read more...
When a function is reflected, it flips across one of the axes to become its mirror image. Read more...
When a function is rescaled, it is stretched or compressed along one of the axes, like a slinky. The function’s general shape is preserved, but it might look a bit thinner or fatter afterwards. Read more...
When a function is shifted, all of its points move vertically and/or horizontally by the same amount. The function’s size and shape are preserved – it is just slid in some direction, like sliding a book across a table. Read more...
A piecewise function is a function which is pieced together from multiple different functions. For example, the absolute value function is a piecewise function because it consists of the line $y=-x$ for negative $x$, and $y=x$ for positive $x$. Read more...
Trigonometric functions are functions which represent the relationship between sides and angles in right triangles. There are three main “trig” functions: sine, cosine, and tangent, and a mnemonic often used to remember what they represent is SohCahToa:
An absolute value function represents the magnitude of a number, i.e. its distance from $0$. For example, the absolute value of $-3$ is $3$, and the absolute value of $4$ is $4$. We write this as $|-3|=3$, and $|4|=4$. In effect, absolute value just removes the negative sign from a number, if there is a negative sign to begin with. Read more...
Exponential functions have variables as exponents, e.g. $f(x)=2^x$. Their end behavior consists of growing without bound to infinity in one direction, and decaying to a horizontal asymptote of $y=0$ in the other direction. The size of the number which is exponentiated, called the base, governs which direction corresponds to which end behavior. Read more...
A radical function is a function that involves roots: square roots, cube roots, or any kind of fractional exponent in general. We can often infer what their graphs look like by sandwiching them between polynomial functions. For example, the radical function $f(x)=\sqrt{x^3}$ can be written as $f(x)=x^{3/2}$, and its exponent $\frac{3}{2}$ is between $1$ and $2$, so the graph of $f$ lies between the graphs of $y=x$ and $y=x^2$. Read more...
Horizontal asymptotes are horizontal lines which are indicated by a constant whole number term in the proper form of a rational function. Likewise, slant asymptotes are slanted lines which are indicated by a linear term in the proper form of a rational function. For example, the proper form of $r(x)=\frac{2x^2-3x}{x-1}$ is given by $r(x)=2x-1-\frac{1}{x-1}$, which has $2x-1$ as its whole number term. As a result, $r$ has a slant asymptote at $y=2x-1$, which appears in the graph of $r$ below. Read more...
The horizontal and vertical asymptotes of a rational function can give us insight into the shape of its graph. For example, consider the function $r(x)=\frac{3x}{x-1}$, which has a horizontal asymptote $y=3$ and a vertical asymptote $x=1$. Read more...
Unlike polynomials, rational functions can “blow up” to positive or negative infinity even for relatively small input values. Such input values are called vertical asymptotes, because they represent vertical lines that the function approaches but never quite reaches. Read more...
Like polynomials, rational functions can have end behavior that goes to positive or negative infinity. However, rational functions can also have another form of end behavior in which they become flat, approaching (but never quite reaching) a horizontal line known as a horizontal asymptote. Read more...
A rational function is a fraction whose numerator and denominator are both polynomials. Rational functions are usually written in proper form, where the numerator is of a smaller degree than the denominator. (The degree of a polynomial is its highest exponent.) Read more...
In the previous posts, we learned how to find end behavior, zeros, and factored forms of polynomials. In this post, we will put all this information together to sketch graphs of polynomials. Read more...
In the previous post, we learned how to find the remaining zeros of a polynomial if we are given some zeros to start with. But how do we get those initial zeros in the first place, if they’re not given to us and aren’t obvious from the equation? Read more...
The zeros of a polynomial are the inputs which cause it to evaluate to zero. For example, a zero of the polynomial $x^3-2x^2-5x+6$ is $x=1$ because $(1)^3-2(1)^2-5(1)+6=0$. Another zero is $x=-2$ because $(-2)^3-2(-2)^2-5(-2)=0$. Can you find the rest? Read more...
Polynomials include linear expressions and quadratic expressions, as well as expressions adding multiples of higher exponents of the variable. Rational functions are usually written in proper form, where the numerator is of a smaller degree than the denominator. (The degree of a polynomial is its highest exponent.) Read more...
To solve a system of inequalities, we need to solve each individual inequality and find where all their solutions overlap. For example, to solve the system Read more...
Quadratic inequalities are best visualized in the plane. For example, to solve a quadratic inequality $-x^2+x+2>0$, we can find the values of $x$ where the parabola $y=-x^2+x+2$ is positive. Since $y=-x^2+x+2$ is a downward parabola, the solution consists of the values of $x$ in its midsection which arches over the x-axis. That is, the solution consists of all x-values between the solutions to $-x^2+x+2=0$. This quadratic equation factors to $-(x-2)(x+1)=0$, so the parabola’s midsection is given by $-1<x<2$, or $(-1,2)$ in interval notation. Read more...
When a linear equation has one variable, the solution covers a section of the number line: if our solution is $x > \text{some number}$, then the solution covers the section of the number line that lies right of that number; if our solution is $x<\text{some number}$, then the solution covers the section of the number line that lies left of that number. If equality is allowed (i.e. $\geq$ or $\leq$), then we use a closed circle to indicate that the circled number is itself a solution; otherwise, if equality is not allowed (i.e. $>$ or $<$), then we use an open circle. Read more...
An inequality is similar to an equation, but instead of saying two quantities are equal, it says that one quantity is greater than or less than another. For example, since $1$ is greater than $0$, we write $1>0$. Likewise, since $0$ is less than $1$, we write $0<1$. If we write $x>0$, then we mean that $x$ can be $1$, $2$, $\pi$, $0.000001$, or any other positive number. If we write $x<0$, then we mean that $x$ can be $-1$, $-2$, $-\pi$, $-0.000001$, or any other negative number. Read more...
Systems of quadratic equations can be solved via substitution. After substituting, the resulting equation can itself be reduced down to a quadratic equation and solved by techniques covered in this chapter. Read more...
To easily graph a quadratic equation, we can convert it to vertex form: Read more...
Completing the square is another method for solving quadratic equations. Although we can use the quadratic formula to solve any quadratic equation, completing the square helps us gain a better intuition for quadratic equations and understand where the quadratic formula comes from. As we will see in the next section, completing the square will also help us rearrange quadratic equations into forms that are easy to graph. Read more...
Some quadratic equations cannot be factored easily. For example, in the equation $x^2+3x+1=0$, we need to find two factors of $1$ that add to $3$. But the only integer factors of $1$ are $1$ and $1$, and they definitely don’t add to $3$! To solve these hard-to-factor quadratic equations, it’s easiest to use the quadratic formula given below, which tells us explicitly how to compute the solutions of a quadratic equation $ax^2+bx+c=0$. Read more...
Factoring is a method for solving quadratic equations. It involves converting the quadratic equation to standard form, then factoring it into a product of two linear terms (which are called factors), and finally solving for the variable values that make either factor equal to $0$. Read more...
Quadratic equations are similar to linear equations, except that they contain squares of a single variable. Read more...
A linear system consists of multiple linear equations, and the solution of a linear system consists of the pairs which satisfy all of the equations. For example, the solution to the linear system Read more...
The standard form of a linear equation is $ax+by=c$, where $a$, $b$, and $c$ are all integers and $a$ is nonnegative. For example, we can convert the equation $y=\frac{3}{5}x+\frac{10}{3}$ to standard form by moving $x$ and $y$ to the same side and multiplying to cancel out any fractions. Read more...
Suppose we want to write the equation of a line with a given slope $m=2$, through a particular point $(3,5)$. In the previous post, we substituted the given information into a slope-intercept equation form $y=mx+b$, solved for $b$, and rewrote the slope-intercept form with $m$ and $b$ substituted so that $x$ and $y$ were the only variables. Read more...
In the previous post, we were solving linear equations in one variable. Now, let’s consider linear equations in two variables. A few examples are shown below: Read more...
Loosely speaking, a linear equation is an equality statement containing only addition, subtraction, multiplication, and division. It does not need to include all of these operations, but it cannot include operations beyond them, such as exponentiation. Read more...