### What is the distribution of a dataset?

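• For supervised learning, the distribution means the joint distribution $F(X,Y)$ of the features $X$ and the outcome $Y$, or the conditional distribution $F(Y|X)$.
Saying that the training set and the test set are identically distributed means that both consist of random samples drawn from the same distribution, that is:

$$(X_{\text{train}}, Y_{\text{train}}),\ (X_{\text{test}}, Y_{\text{test}}) \overset{\text{i.i.d.}}{\sim} F(X, Y)$$

• For unsupervised learning, the distribution means the distribution $F(X)$ of the features $X$, that is:

$$X_{\text{train}},\ X_{\text{test}} \overset{\text{i.i.d.}}{\sim} F(X)$$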

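• In practice, however, this is hard to guarantee, especially when the training set consists of past data and the test set of current data: because of the time gap, the two are quite likely not exactly identically distributed, which makes prediction harder. This is also why the cross-validation error is usually smaller than the actual test error: every fold in cross-validation comes from the training set, so the folds share one distribution. If the distributions of the training set and the test set have nothing to do with each other, a model learned from the training set is of almost no use on the test set. An important premise when we train and apply a model is therefore that the training set and the test set are identically distributed.
• The other aspect is overfitting. Even when the training set and the test set are identically distributed, a limited amount of data means the training set may not fully reflect the true distribution. If we fit the training distribution too closely, we move away from the true distribution (and from that of the test set), predictions become inaccurate, and the result is overfitting.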
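The effect of a distribution mismatch can be illustrated with a toy experiment. The sketch below is an assumption-laden example, not something from the text: the target function $y = x^2$, the noise level, the sampling ranges and the helper names `fit_line`, `mse` and `sample` are all made up for illustration. A straight line is fitted on $x \in [0, 1]$ and then evaluated on fresh data from the same range and on data from a shifted range.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def mse(a, b, xs, ys):
    """Mean squared error of the line a * x + b on (xs, ys)."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(n, low, high, rng):
    """Draw x uniformly from [low, high] and set y = x^2 plus small noise."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = sample(200, 0.0, 1.0, rng)
x_same, y_same = sample(200, 0.0, 1.0, rng)    # same distribution as training
x_shift, y_shift = sample(200, 2.0, 3.0, rng)  # shifted feature distribution

a, b = fit_line(x_train, y_train)
mse_same = mse(a, b, x_same, y_same)
mse_shift = mse(a, b, x_shift, y_shift)
print(f"same-distribution MSE:    {mse_same:.4f}")
print(f"shifted-distribution MSE: {mse_shift:.4f}")
```

The line is a reasonable local approximation on the training range, so the error on same-distribution data stays small, while the error on the shifted range is far larger.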

## Distribution transformation

The idea behind GAN is blunt and direct: since no suitable metric is available, we might as well train that metric with a neural network too.

## VAE

### Reparameterization trick

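If $z \sim N(\mu, \sigma^2)$, the change of variables $\varepsilon = (z-\mu)/\sigma$ gives

$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)dz = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\varepsilon^2}{2}\right)d\varepsilon$$

so $(z-\mu)/\sigma=\varepsilon$ follows the standard normal distribution with mean 0 and variance 1. The $dz$ has to be carried along because the density multiplied by $dz$ is a probability; without $dz$ it is only a probability density.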
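Read in the other direction, this says that sampling $z$ from $N(\mu, \sigma^2)$ is the same as sampling $\varepsilon$ from $N(0, 1)$ and computing $z = \mu + \sigma\varepsilon$. A minimal numerical check using only the standard library is sketched below; the values of `mu` and `sigma` and the helper name `sample_z` are arbitrary choices for illustration. In an autograd framework the same expression keeps the randomness in $\varepsilon$, so gradients can flow to $\mu$ and $\sigma$.

```python
import random
import statistics

def sample_z(mu, sigma, rng):
    """Reparameterized sample: draw eps ~ N(0, 1), then shift and scale."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

rng = random.Random(0)
mu, sigma = 2.0, 0.5  # illustrative values, not from the text
zs = [sample_z(mu, sigma, rng) for _ in range(20000)]

# Undo the transform: (z - mu) / sigma should look standard normal again.
eps_back = [(z - mu) / sigma for z in zs]

print("z:   mean", statistics.fmean(zs), "std", statistics.stdev(zs))
print("eps: mean", statistics.fmean(eps_back), "std", statistics.stdev(eps_back))
```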

## Data

For data, let’s use the MNIST dataset. PyTorch’s torchvision module provides an easy way to create training and test datasets for MNIST.

## Visualization

Before proceeding, let’s visualize some data. For that I am using `torchvision.utils.make_grid`, which creates a grid from multiple images.

## Network Architecture

Similar to a denoising autoencoder, a VAE has an encoder and a decoder.

### Encoder

The encoder encodes an image to a variable z with a normal distribution. For a normal distribution we just need to approximate the mean m and the standard deviation s. Therefore, the role of the neural network is to learn a function from the image to m and s. This implicitly means we are learning a function from the image to a probability distribution for z. We implement that function approximator using linear layers and ReLU nonlinearities.
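A minimal sketch of such an encoder is given below. The layer sizes (`hidden_dim=400`, `latent_dim=20`), the attribute names, and the choice to predict $\log s$ and exponentiate it so that s stays positive are all assumptions for illustration; the text only specifies linear layers, ReLU, and the outputs m and s.

```python
import torch
from torch import nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a batch of images to the mean m and standard deviation s of z."""

    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.to_m = nn.Linear(hidden_dim, latent_dim)
        self.to_log_s = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        # Flatten (batch, 1, 28, 28) images to (batch, 784) vectors.
        h = F.relu(self.hidden(x.flatten(start_dim=1)))
        m = self.to_m(h)
        # Predict log(s) and exponentiate so that s is strictly positive.
        s = torch.exp(self.to_log_s(h))
        return m, s

encoder = Encoder()
m, s = encoder(torch.rand(4, 1, 28, 28))
print(m.shape, s.shape)
```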

### Decoder

The decoder gets the encoded value z, which in theory is referred to as the latent variable, and decodes that value to an image. Therefore, the role of the decoder is to learn a function that maps a value of z to a vector of 784 real values (28 × 28 pixels). Note that z is in fact a random variable, but here we just work with a realization (i.e. a sampled value) of that random variable.
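A matching decoder could be sketched as follows. As before, the layer sizes and names are assumptions, and the final sigmoid (which keeps every output pixel in $[0, 1]$) is a common choice rather than something the text prescribes.

```python
import torch
from torch import nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Maps a sampled latent value z to a flattened 28 x 28 image."""

    def __init__(self, latent_dim=20, hidden_dim=400, output_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, z):
        h = F.relu(self.hidden(z))
        # Sigmoid squashes each of the 784 outputs into [0, 1], like pixel intensities.
        return torch.sigmoid(self.out(h))

decoder = Decoder()
z = torch.randn(4, 20)  # a realization of the latent variable for 4 images
decoded = decoder(z)
print(decoded.shape)
```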

## Loss

To train the model we need a loss function. A VAE combines two types of loss:

• Reconstruction loss: the Cross Entropy (CE) or Mean Squared Error (MSE) between the decoded image and the original image.
• KL divergence: this loss acts on the latent variable $z$. We want $P(z \mid \text{input})$ to be as close as possible to the standard normal distribution (mean 0, variance 1). Since $z$ is normally distributed with mean $m$ and standard deviation $s$, $z \sim N(m, s^2)$, the KL divergence to the standard normal has a simple closed form: $\frac{1}{2}\sum_j \left(m_j^2 + s_j^2 - \log s_j^2 - 1\right)$, summed over the latent dimensions $j$.
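The two terms can be combined as in the sketch below. It assumes the cross-entropy variant of the reconstruction loss, a decoder output already in $[0, 1]$, and a diagonal Gaussian for z; the function name `vae_loss` and the equal weighting of the two terms are illustrative choices, not taken from the text.

```python
import torch
import torch.nn.functional as F

def vae_loss(reconstruction, original, m, s):
    """Sum of the reconstruction loss and the KL term for z ~ N(m, s^2).

    reconstruction: (batch, 784) values in [0, 1] from the decoder
    original:       (batch, 1, 28, 28) images with pixels in [0, 1]
    m, s:           mean and standard deviation from the encoder
    """
    # Pixel-wise binary cross entropy, summed over pixels and batch.
    recon = F.binary_cross_entropy(
        reconstruction, original.flatten(start_dim=1), reduction="sum"
    )
    # Closed-form KL(N(m, s^2) || N(0, 1)), summed over latent dims and batch.
    kl = 0.5 * torch.sum(m.pow(2) + s.pow(2) - torch.log(s.pow(2)) - 1.0)
    return recon + kl, recon, kl

original = torch.rand(4, 1, 28, 28)
reconstruction = torch.rand(4, 784).clamp(1e-3, 1 - 1e-3)
total, recon, kl = vae_loss(reconstruction, original, torch.zeros(4, 20), torch.ones(4, 20))
print(total.item(), recon.item(), kl.item())
```

With `m = 0` and `s = 1` the encoder distribution already equals the standard normal, so the KL term vanishes and only the reconstruction loss remains.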

## Train

• blue boxes: these correspond to the tensors we use as parameters, the ones we’re asking PyTorch to compute gradients for;
• gray box: a Python operation that involves a gradient-computing tensor or its dependencies;
• green box: the same as the gray box, except that it is the starting point for the computation of gradients (assuming the `backward()` method is called from the variable used to visualize the graph). Gradients are computed from the bottom up in the graph.
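The roles described by the boxes can be reproduced on a tiny computation. The sketch below is a made-up example (a one-variable linear fit with hypothetical tensors `a`, `b`, `x`, `y`): `a` and `b` play the part of the blue boxes, the arithmetic operations are the gray boxes, and `loss` is where `backward()` starts.

```python
import torch

# Parameters: tensors PyTorch is asked to compute gradients for (blue boxes).
a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Plain data tensors: no gradients are tracked for these.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([3.0, 5.0, 7.0])

# Each operation involving a or b becomes a node in the graph (gray boxes).
yhat = a + b * x
loss = ((yhat - y) ** 2).mean()

# Starting point of the gradient computation (green box).
loss.backward()

print("grad of a:", a.grad)
print("grad of b:", b.grad)
print("grad of x:", x.grad)  # None, since x does not require gradients
```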

