Based on David Silver's reinforcement learning lecture 1 and lecture 2.

In reinforcement learning, the policy determines which action to take given a state; learning a good policy is the broad goal of an RL algorithm. A policy can be deterministic or stochastic:

$$ \begin{aligned} a &= \pi(s) && \text{deterministic policy} \\ \pi(a|s) &= P[A=a|S=s] && \text{stochastic policy} \end{aligned} $$
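
As a minimal sketch of the two cases, assume a toy problem with two states and two actions. The state and action names below are hypothetical and do not come from the lecture. A deterministic policy can then be a plain lookup, and a stochastic policy a table of probabilities to sample from:

```python
import random

# Hypothetical toy problem: two states, two actions (names are illustrative).
ACTIONS = ["left", "right"]

# Deterministic policy: a = pi(s), one fixed action per state.
deterministic_pi = {"s0": "right", "s1": "left"}

# Stochastic policy: pi(a|s) = P[A=a | S=s], one distribution per state.
stochastic_pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_pi[state]


def act_stochastic(state, rng=random):
    """Sample an action from pi(.|state)."""
    probs = stochastic_pi[state]
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]
```

Calling `act_deterministic("s0")` always returns the same action, while repeated calls to `act_stochastic("s0")` return `"right"` roughly 80% of the time.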

The state-value function evaluates the "worth" of a state, based on the present value of future rewards. We use the policy to get an expectation of rewards, which gives us the value of state $s$ based on policy $\pi$. Different policies would give us different state values.

$$
v_{\pi}(s) = E_{\pi} [R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots | S_t = s]
$$
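
To make the expectation concrete, here is a small sketch that computes the discounted return of one reward sequence and then averages the returns of several sampled episodes, which gives a Monte Carlo estimate of the state value. The function names and the example rewards are assumptions for illustration, not taken from the lecture.

```python
def discounted_return(rewards, gamma):
    """R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: g = r + gamma * g at each step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma):
    """Average the discounted returns of episodes that all start in state s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Two hypothetical episodes starting from the same state under the same policy.
episodes = [[1.0, 0.0, 2.0], [0.0, 0.0, 4.0]]
estimate = monte_carlo_value(episodes, gamma=0.5)
```

With `gamma=0.5` the two episodes have returns 1.5 and 1.0, so the estimate is 1.25. A different policy would generate different episodes and therefore a different value for the same state.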

If we know the value of each state under different policies, then we know which action to take at each step, and hence the trajectory.

The model is the agent's representation of the environment. Knowing the model allows us to...

On Human-level control through deep reinforcement learning by Mnih et al, 2015.

Although a few years old, this is a seminal paper on deep reinforcement learning, and I encourage you to read the original paper, in which the authors developed the well-known deep Q-network (DQN) artificial agent.

DQN is an end-to-end method for training an agent, avoiding handcrafted rules for a specific domain. In a nutshell, the agent sees what you see, has access to the actions you have access to (joystick movements) and is told to optimize a specific score. Amazingly, the trained agent reaches a level comparable to a professional human games tester across a set of 49 Atari games using not only the same architecture, but the **same hyperparameters**!

Reinforcement learning is concerned with an agent interacting with the world to maximize a reward. To maximize this reward it makes a series of observations and actions. Here the game is observed via pixels (as a human would see it), so unsurprisingly it uses a convolutional neural network to extract interesting...

On Categorical Reparameterization with Gumbel-Softmax by Jang et al, 2017.

The reparameterization trick is a way to formulate a distribution so as to efficiently sample. You may be aware of the trick for the Gaussian case. When sampling is written as $z \sim \mathcal{N}(\mu, \sigma^2)$ the whole sampling process is random. By rewriting the sampling to:

$$ z = \mu + \sigma \epsilon \text{, where } \epsilon \sim \mathcal{N}(0,1) $$

we have shifted the randomness onto $\epsilon$ alone, whose distribution is the same whatever the values of $\mu$ and $\sigma$. Now $\mu$ and $\sigma$ can be learned in a neural network through backpropagation. In the Variational AutoEncoder case, the parameters $\mu$ and $\sigma$ of the latent distribution are functions of the input $x$, modeled as hidden layers in a neural network.
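
A minimal sketch of the Gaussian case, using only the Python standard library. The values of `mu` and `sigma` and the helper name are hypothetical; in a VAE they would be produced by the encoder network rather than fixed by hand.

```python
import random
import statistics


def sample_reparameterized(mu, sigma, eps):
    """z = mu + sigma * eps: deterministic in mu and sigma once eps is drawn."""
    return mu + sigma * eps


rng = random.Random(0)
# All the randomness lives in eps ~ N(0, 1), independent of mu and sigma.
eps_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

mu, sigma = 2.0, 0.5  # hypothetical values chosen for illustration
z_samples = [sample_reparameterized(mu, sigma, e) for e in eps_samples]

# The samples follow N(mu, sigma^2) up to Monte Carlo error.
z_mean = statistics.fmean(z_samples)
z_std = statistics.pstdev(z_samples)

# For a fixed eps the derivatives are simple: dz/dmu = 1 and dz/dsigma = eps.
# This is what lets gradients flow back to mu and sigma.
```

Because `eps_samples` never depends on `mu` or `sigma`, the same noise can be reused while the two parameters are updated by gradient descent.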

For the discrete case such as the categorical distribution, we have the additional challenge of the distribution being non-continuous by definition. We can formulate the distribution as $p =...

On Best Practices for Applying Deep Learning to Novel Applications by Leslie N. Smith, 2017.

This is a handy report on where to start if you want to apply deep learning for your specific application. I encourage you to read the short report. I will only summarize some key points.

**1. Getting prepared**

- Be familiar with the literature for your application
- Do you have the computational resources for DL?
- How will you measure your results? How do humans perform? What is your objective?

**2. Preparing your data**

- The number of parameters is correlated with the amount of training data
- For limited data, consider transfer learning and domain adaptation
- Make the job easier for the network: preprocess, normalize, leverage previous known heuristics

**3. Find an analogy between your application and the closest deep learning applications**

- Don't start from scratch, look at research, find similar applications
- Look for code, reproduce the results

**4. Simple baseline model**

- Start simple, small, and...

On Bayesian Recurrent Neural Networks by Fortunato et al, 2017.

**TL;DR**: improve your current RNN with variational Bayes and posterior sharpening

Variational inference techniques for neural networks have been quite popular in the last few years, at least since Kingma and Welling introduced variational Bayes for autoencoders. Here Fortunato et al explore variational Bayes for Recurrent Neural Networks (RNNs).

The motivation for applying Bayesian methods is twofold: explicit representations of uncertainty and regularization (from the KL divergence).

Variational Bayes is a little exotic at first (I'm not sure I fully get it myself). If you need a refresher, check out this derivation. From that derivation we know that maximizing the evidence lower bound minimizes the KL divergence between the approximate and the true posterior. Equivalently, we train the network by minimizing the variational free energy $\mathcal{L}(\theta)$, which is the negative of that lower bound:

$$ \mathcal{L}(\theta) = \mathbb{E}_{q(\theta)} \left[ \log \frac{q(\theta)}{p(y \mid \theta, x)p(\theta)}...

On Hybrid computing using a neural network with dynamic external memory, by Graves et al, 2016. Published in Nature.

Baking "memory" straight into neural networks has been popular since the late 90s with the Long Short-Term Memory (LSTM) variation of Recurrent Neural Networks (RNNs). LSTMs have a specialized hidden state, called the memory cell, which remembers information from the past and is updated based on new information. Memory Networks (MemNets), which I discussed recently, focus heavily on the idea of memory. Let's say you are trying to predict step 7 in a sequence. An LSTM can only peek at step 6's information, such as its prediction, hidden state and memory cell. MemNets can look at all of the steps from 1 to 6 and "attend" to what's important to predict step 7.

The Differentiable Neural Computer (DNC) is like a conventional computer's complex memory, but inside a neural network. You may wonder why we want all these versions of memory inside our algorithm as opposed to just...

On Learning to Generate Reviews and Discovering Sentiment by Radford et al, 2017, of OpenAI.

Some machine learning algorithms can work on many types of data, such as words, numbers, and so on. Neural networks, however, can only deal with numbers, since at their core a neuron is a non-linear transformation of a matrix multiplication. Images are represented by continuous values (pixels) in a color space. A natural way to represent an image as input to a neural network is simply to feed in the raw RGB values. The interpretation of these values is also straightforward. If, say, a pixel's RGB value is [124, 67, 99], increasing the first dimension by one, to [125, 67, 99], is simply interpreted as slightly shifting the pixel's value toward red. We are not so fortunate with text.

Text has a representation problem. What number should represent the word "hello"? What is an appropriate number of dimensions for each word? And how do we learn that...

Reviewing Learning End-to-End Goal-Oriented Dialog, 2017, by Bordes et al of Facebook AI Research. Accepted for oral presentation at upcoming ICLR 2017. The dataset is available on a fb research page.

There's been much ado about personal assistants and chatbots in the last few years. Many of these systems are fine-tuned to specific tasks using *slot-filling*, which simply fills in blanks in predefined structures.
This reminds me of the good ol' days of machine translation, where grammar rules were hard-coded into the system.
The problem, as you can guess, is that you need to handcraft all the rules for each language, and possibly each language pair. This approach has almost universally been abandoned in the last year and given way to Neural Machine Translation. Similarly, the current authors want to shift away from structured models with handcrafted features and toward end-to-end dialog systems with no assumptions about the domain.
A dialog system, unlike Q&A, must respond with appropriate...

Reviewing End-To-End Memory Networks by Sukhbaatar et al of NYU and Facebook. Original Torch code, and an implementation in TensorFlow by Taehoon Kim.

**TL;DR**: on a specific Q&A dataset, Memory Networks hugely outperform the baselines. Training tips: encode word position, toggle the softmax to avoid local minima, add noise, stack memory layers.

This paper builds on the work by Weston et al on Memory Networks. Memory Networks are an interesting form of "attention": instead of attending to specific elements of a single input, as in Neural Machine Translation attention, the model attends over a "memory" of a huge number of inputs. And unlike previous work on Memory Networks, this model is trained end-to-end, requiring minimal supervision.

Conceptually, the model is simple. Inputs $x$ are stored to the memory and a query $q$ retrieves the relevant parts of memory to output an answer $a$. Information flows through the model through smooth differentiable functions allowing for backpropagation. The...

A review of the paper Exploring the Limits of Language Modeling, 2016, by Jozefowicz et al of Google Brain. The architecture and weights have been released by the authors and made available on the official TensorFlow git repo.

**TL;DR** for large scale modeling: character CNN inputs, LSTMs with huge hidden states, use importance sampling, ensemble your best models.

If you've done any neural net work in Natural Language Processing (NLP) you've probably hit a few walls. For example, you may have built a generative model where you had to deal with a massive output layer, where your model predicts the probability of tens of thousands of words. A smallish penultimate layer of 128 hidden units predicting a vocabulary of 40k words costs you a staggering 5.12 million parameters. Just one layer. This paper explores recent advances in Recurrent Neural Networks (RNNs) for large scale tasks and how to best deal with such issues.
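
The arithmetic behind that figure is just the size of the output weight matrix. Here is a quick check, with variable names chosen for illustration. It assumes a single dense output layer; bias terms are counted separately since the 5.12 million figure covers the weights only.

```python
hidden_units = 128   # size of the penultimate layer
vocab_size = 40_000  # number of words the model predicts over

# The output layer is a dense hidden_units x vocab_size weight matrix.
weight_params = hidden_units * vocab_size

# Optional bias terms add one parameter per output word.
bias_params = vocab_size

print(f"weights: {weight_params:,}")
print(f"weights + biases: {weight_params + bias_params:,}")
```

The cost grows linearly in both the hidden size and the vocabulary, so a larger hidden layer or a vocabulary in the hundreds of thousands quickly makes this one layer dominate the model.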

In the good old days of Statistical...

Welcome to this complete guide on setting up your machine for deep learning. Specifically, we will install Ubuntu, TensorFlow, Theano and Torch. These require CUDA for GPU computation. I have found that installing CUDA through `apt-get` was always problematic, so this guide goes through a manual installation.

If your OS is already good to go, skip to the CUDA and cuDNN section.

There are two options to consider: single or dual-boot. If you simply want a single-boot machine, i.e. only Ubuntu, skip to the next section. If you have Windows, you may want to keep it just in case. In Windows, proceed to add a new partition. Format the new partition as NTFS (it doesn't really matter though, since Ubuntu will reformat the drive later on).

You may possibly get a warning that you cannot shrink the volume beyond movable files. This can help

Download the Ubuntu 14.04 Desktop (64-bit) edition. I'm running Ubuntu 14.04 and it works fine. Later Ubuntu "might" work as...