Python机器学习快速入门#

有必要入门一下机器学习

快速过一下尽可能多的概念

尽量把概念的来龙去脉记下来。尽量推导。

较深的数学不做过多纠缠，以后再深入。
基于`Machine Learning with PyTorch and Scikit-Learn <https://sebastianraschka.com/books/#machine-learning-with-pytorch-and-scikit-learn>`__这本书。

优点是较新，有现成代码。学起来比较高效。

包含了机器学习较多的内容。跟着过一遍没错。

这书不会深入讲所有原理，得配合其他经典权威的书籍资料一起学习。
需要一些概率/统计/线性代数/微积分基础

参考#

[Machine Learning With PyTorch and Scikit-Learn]

[Artificial Intelligence A Modern Approach]

[Machine Learning with Python Cookbook]

https://developers.google.com/machine-learning

https://mlu-explain.github.io/

1. 赋予机器学习数据的能力#

三种机器学习

Supervised learning

标签化的数据直接反馈预测
Unsupervised learning

无标签无反馈找出数据中隐含的结构
Reinforcement learning

决策过程奖励系统学习一系列动作

Supervised learning#

监督学习，建立输入数据和标签之间的联系。用标签化的数据进行学习。可以称为标签学习。

图1.2为一个典型的流程。

1. 定义标签 2. 训练数据结合ml算法做出一个预测标签的模型 3. 新数据进模型产出预测的标签

监督学习的子类: 1. 例如识别垃圾邮件，是一种Classification。

2. 另一种监督学习是Regression。输出是连续值。

Classification#

分类算法。标签是离散的。垃圾邮件的例子是一个二元分类，只有是和否两种标签。

一个多类别分类的例子是识别手写的英文字母。

Regression#

回归算法用来预测连续的输出。我们拿到一些predictor variable和一个连续的输出。然后找出输入和输出的关系，再用这个关系来预测新的输入。

predictor variable一般被称作feature也就是特征。

图1.4展示了回归的思想。大概就是根据已知的输入和结果，拟出函数。预测就是根据函数直接出结果。

例如我们要预测学生的考试分数，如果能找到分数和学习时间的某种关系就好了。可以用回归算法来得到他们的关系曲线。

Reinforcement learning#

另一种机器学习，所谓强化学习。目的是做出一个系统(agent)，其与环境(environment)的交互可以对自身进行改进。通常有一个所谓的奖励信号(reward signal)。

象棋是一个较好的例子。agent根据当前局面(environment)做出几步决策。奖励信号是输或赢。

强化学习下面有诸多分支。本质上都是最大化奖励信号。

Unsupervised learning#

无监督学习，处理无标签的或未知结构的数据。可以在没有引导(已知的输出或奖励信号)的情况下，从输入中找出有用的信息。

用clustering进行分组#

所谓聚类。是一种pattern discovery技术。

每一个数据有一些特征，聚类算法根据特征，把数据分组(cluster)。

又称为unsupervised classification无监督分类。

比如图1.6。按坐标分组。

降维#

无监督学习的另一个子类。数据的原始特征可能太多，需要降低维度，除掉相对次要的信息。

术语和一些约定等#

sample/feature/label

https://scikit-learn.org/stable/glossary.html

代码环境#

windows/python3.11

书的代码在 rasbt/machine-learning-book

pip3.11 install numpy==1.25.0 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip3.11 install pandas==2.0.3 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip3.11 install matplotlib==3.7.2 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip3.11 install scikit-learn==1.3.0 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip3.11 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip3.11 install xgboost==1.7.6 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

安装pytorch要看下版本和显卡相关信息
https://pytorch.org/get-started/locally/
https://pytorch.org/get-started/pytorch-2.0/

2 简单分类算法#

这一章看classification的两种算法
* perceptron/感知机
* adaptive linear neurons/自适应线性神经元

2.1 人工神经元-历史#

1943年提出简化的脑细胞概念。脑中的神经细胞组成神经元，传输和处理化学信号和电信号。把神经细胞看作简单的逻辑门，只输出二值。多个信号输入到树突(dendrite)，被整合到细胞体。

如果累计的信号达到某阈值，产生输出信号到轴突(axon)。

1957年提出感知学习规律。提出一个算法，可以习得一个最优的因子，该因子乘上输入的特征后，能决定触发哪个神经元。把这运用到监督学习和分类上，可以预测数据所属的类别。

2.1.1 artificial neuron人工神经元的正式定义#

有一串输入\(x = \begin{bmatrix} x_1\\ ...\\ x_m \end{bmatrix}\)

有一串权重\(w = \begin{bmatrix} w_1\\ ...\\ w_m \end{bmatrix}\)

把权重乘到每个输入上，并引入一个阈值参数(bias)b，得\(z=w^Tx+b\)

可以定义决策函数\(\sigma(z) = \begin{cases} 1 & \text{ if } z \ge 0 \\ 0 & \text{ if } z < 0 \end{cases}\)

图2.2展示了用z值进行分类

2.1.2 感知器学习规则#

原理就是模拟神经元的工作方式：是否触发。其实非常简单 1. 权重和bias初始化为0或较小的随机数 2. 对于每组训练数据\(x^i\)，有给定的标签值\(y^i\)。 1. 算出输出值(预测值)\(\hat y^i=\sigma(z)\) 2. 更新权重和bias

更新方法:

\(w_j += \Delta w_j\)

\(b+= \Delta b\)

其中

\(\Delta w_j=\eta(y^i - \hat y^i)x^i_j\)

\(\Delta b=\eta(y^i - \hat y^i)\)

\(\eta\)为learning rate，eta值/学习力度，在0到1之间。

例如对于一个二维特征值的数据，有

\(\Delta w_1=\eta(y^i - \hat y^i)x^i_1\)

\(\Delta w_2=\eta(y^i - \hat y^i)x^i_2\)

\(\Delta b=\eta(y^i - \hat y^i)\)

根据这些\(\Delta\)的算法可以看出一旦输出值(预测值)小于实际标签值，\(\Delta\)值就会变大，权重就会变大。

下次预测就会把预测值推高。反之亦然。

这个更新就是学习的过程。权重就是神经元，神经元不断学习进化，达到一个聪明的状态。

图2.3 这种算法，要点是数据集必须能被线性分割，感知器算法才能收敛。

如果无法线性分割，需要设置一个最大pass数，否则算法永远停不了。

接下来要学习的Adaline(adaptive linear neurons)算法，不需要线性分割也可收敛。

第三章学习可以输出非线性决策边界的算法。

2.2 实现一个感知器学习算法#

看代码

输入数据为鸢尾花信息，经过处理。

最终的输入X是一串二维数据，对应的y为0或1。

Perceptron类
eta初始值0.01
bias初始值0
权重值取了正态分布的随机值，mean为0，sd为0.01。值很小。

net_input函数即上述\(z\)函数

predict即上述\(\sigma(z)\)

fit即上述计算和更新权重与bias的流程
每一次predict都会记录是否错误，记录错误总数。
会对输入数据做n_iter次同样的流程。同时保持权重和bias不断更新。
运行代码。可以看到头几次会有较大的错误次数，后来不会出错，所谓收敛了。
即训练几次就达到了很好的效果。

这么看来训练出来的有价值的数据就是权重和bias了。

plot_decision_regions画出了二维数据的预测边界
meshgrid生成网格
把所有网格的xy作为数据走predict。就得到了图像上所有点的预测值。
然后contourf画出等高线。

对着图2.4总结一下算法
有一堆sample，每个包含一个正确的y值和一堆特征x。
b和w是需要训练的关键数据，有初始值。
有一个总的pass，把全体sample训练一定pass数。
每个pass中，依次喂所有sample进来，每次处理一个sample。
sample与关键数据计算得出预计值
预计值与正确值比较，如果错了，可记录下来。
他们的差，与eta相乘得到一个update值。
用update值更新b和w。这个是精髓，错了，就往正确值靠拢，也就是学习。
完成一个pass。

多次训练之后预测的正确率会提高。说明有用。

2.3 Adaptive linear neurons和学习的收敛#

我们看一种单层神经网络(single-layer neural network(NN))
所谓自适应线性神经元(ADAptive LInear NEuron(Adaline))
1960年提出

Adaline算法非常有意思，因为它展现了定义和最小化连续损失函数的核心概念。

这是一些其他分类算法的基础工作。比如逻辑回归，svm，多层神经网络，线性回归模型。

Adaline和之前感知算法的区别在于，权重的更新时基于一个linear activation function，所谓激活函数。

与之前的unit step function \(\sigma(z)\)不同。

在Adaline里，linear activation function \(\sigma(z)=z\)。

而感知算法里\(\sigma(z)\)会被归类。

图2.9图示了区别。就是把\(\sigma(z)\)做归类这个动作放到了反馈之后。
感知算法比较的是\(\sigma(z)\)归类后的标签，而Adaline比较的是\(\sigma(z)\)。
感知器的error先经过了threshold函数，输出要么0要么1。
而现在的error可以是一个正常的差值，用这个插值再去更新b和w。

2.3.1 利用梯度下降来最小化损失函数#

监督学习的一个核心组件是一个目标函数(objective function)，它在学习过程中不断被优化。这个函数一般是一个loss function或cost function，我们希望它最小化。

在Adaline里，定义loss funtion为L。

\({\Large L(w, b)=\frac{1}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i))^2}\)

是一个平均方差mean squared error(MSE)。

n是数据总数。

这个连续的linear activation function相比unit step function的好处是，损失函数可微分。

另一个性质是这个函数是凸的(convex)。可以用梯度下降方法来最小化损失函数。

\({\large \Delta w = -\eta \nabla_w L(w, b)}\)

\({\large \Delta b = -\eta \nabla_b L(w, b)}\)

梯度就是算偏微分。也就是走势。

加上负号，就是做一个和走势相反的反馈。

一个重点是公式里的n是sample数。本质是每个sample都要算出error，再算均方差，再更新。

之前的感知器是每次只算一个sample就更新。

对于\(\Delta w\)中的某个\(\Delta w_j\)有

\({\Large \Delta w_j=-\eta \frac{\partial L}{\partial w_j}}\)

\({\Large = \frac{-\eta}{n} \frac{\partial}{\partial w_j} \sum_{i=1}^{n}(y^i-\sigma(z^i))^2}\)

\({\Large = \frac{-2\eta}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i)) \frac{\partial}{\partial w_j} (y^i-\sigma(z^i))}\)

\({\Large = \frac{-2\eta}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i)) \frac{\partial}{\partial w_j} (y^i - w^Tx^i - b)}\)

最后点乘出来只有\({\Large -w_jx^i_j}\)这一项是有用的。所以最终

\({\Large \Delta w_j= \frac{2\eta}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i)) x^i_j}\)

同理

\({\Large \Delta b= \frac{2\eta}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i))}\)

我们要求所有的\({\Large \Delta w_x}\)，也就是整个\({\Large \Delta w}\)。

而整个\({\Large \Delta w}\)其实是个点积的形式。一下子有点绕，推一下就清楚了。

设所有输入的5组数据为 \(X = \begin{bmatrix} a1&a2&a3&a4&a5 \\ b1&b2&b3&b4&b5 \end{bmatrix}\)

每组包含ab两个数据。

先可以一步算出总的的\(Errors = \begin{bmatrix} e1\\ e2\\ e3\\ e4\\ e5\end{bmatrix}\)

每组数据一个error，即代码中的y - np.dot(X, self.w_) + self.b_

目标是要算\(\Delta w_a\)和\(\Delta w_b\)

\(\Delta w_a\)根据上面的公式

\({\Large \Delta w_j= \frac{2\eta}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i)) x^i_j}\)

其中n=5。括号中其实就是Error值，那么

\({\Large \Delta w_j= \frac{2\eta}{n} \sum_{i=1}^{n}e^i x^i_j}\)

发现这个e乘x再求和，就是一个e点乘x的j列。

\({\Large \Delta w_j= \frac{2\eta}{n} Errors \cdot x_j}\)

w的每个分量就有了。推广到整个w，就是一个矩阵相乘。

\({\Large \Delta w= \frac{2\eta}{n} Errors \cdot X}\)

b更简单，直接求errors平均

\({\Large \Delta b= 2\eta Errors.mean()}\)

非常简洁

2.3.2 Adaline的实现#

AdalineGD类

net_input处理整个数据集，得到所有数据组的z值。存output。

errors存每组数据的z值和正确答案y的差。

按上述方法算\(\Delta w\)和\(\Delta b\)并更新。

记录每次的loss。

问题#

\(\eta\)值分别用0.1和0.0001做对比。

可以看到0.1太大，误差反而不断变大。0.0001太小，收敛很慢。

图2.12展示了\(\eta\)太小和太大的效果。太小速度慢。太大了来回波动，一直无法收敛。

还有局部最优和全局最优的问题。有可能陷入局部最优解，跳不出来。

后续的个人理解：

首先根据统计学等定下一个分数标准，也就是loss函数，值越小越好。

每一种权重和bias，对应一个训练过的模型。

每个模型对所有数据，都可算出一个对应loss值。

那我们训练的目的就是要得到让loss值最小的那个模型。

然后这个loss函数我们可以把它看作一个曲面/普通的函数，跟求函数最小值那样，用基本的微积分来求最小值。
训练数据X看成是常数，w和b是变量，求loss最小时w和b的值。

然后这个loss函数的图形一般是较为混乱的，比如杂乱的波浪。并不能轻易获得一个全局最小值(很难收敛)。需要采取各种手段来优化。通常只能概率性获取最小值。

梯度下降是一个求最小值的方法。有很多变种。
loss函数对w和b分别算偏微分，就得到了梯度。也就是两个方向的斜率。

然后本质是从一个初始点(初始w和b)，不断地向下走，下下山一样，希望下到山底/山谷。
学习率eta就是步子大小，会乘上斜率，得到新的w和b。
每走一步记录loss值，看看是不是更小了。
n_iter就是走的步数，一般必须进行限制。否则可能永远停不下来。

为什么是求导/偏微分。
因为要追求快速。
对于两个变量/3d图形，沿梯度方向下降的速度最大。
比如下山，跨出同样的距离，一定是沿着梯度方向下降的高度最大。
那照这样说，如果是1个变量，就无所谓梯度下降？因为只能沿一个方向走。(或者称为梯度下降的一个特例？)

又有一个问题。直接算斜率为0时的变量是不是就行了？貌似是不好算，因为w本身又包含多个个变量哈哈。
算不出解析解导数为0的解析解，就可以用梯度下降，是一种数值算法，

2.3.3 用feature scaling优化梯度下降#

这本书中的很多算法都需要做feature scaling来优化。

很多算法包括梯度下降都能被feature scaling优化。

考察一种feature scaling方法，叫做标准化(standardization)。

它能让梯度下降收敛更快。但不会使原始数据更加正态分布。

标准化处理，把feature的平均值移动到0，并且使该feature的标准差为1。

\({\LARGE newX_j=\frac{x_j-该组平均值}{该组标准差} }\)

标准化能帮助我们快速找到一个好的\(\eta\)值。

对于波动较大的数据，某个\(\eta\)值可能对某个权重效果好，但对其他权重效果差。

运行代码可以看到\(\eta\)取0.5，10次收敛到一定程度，但是没法非常接近0。

2.3.4 大规模机器学习和随机梯度下降(stochastic gradient descent)#

之前的梯度下降，每次用的是所有的训练数据。称作full batch梯度下降。

那么如果数据量超大的话，是没法算的。得用stochastic gradient descent(SGD)随机梯度下降。也称作iterative梯度下降或在线梯度下降。

之前的\(\Delta\)

\({\Large \Delta w_j= \frac{2\eta}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i)) x^i_j}\)

\({\Large \Delta b= \frac{2\eta}{n} \sum_{i=1}^{n}(y^i-\sigma(z^i))}\)

把这里面的均值部分去掉，简单替成当前的数据即可。

\({\Large \Delta w_j= \eta (y^i-\sigma(z^i)) x^i_j}\)

\({\Large \Delta b= \eta (y^i-\sigma(z^i))}\)

因为每次只用一组数据，所以误差会更大。但也有个好处是有概率跳出local最小区域。

要保证SGD有好的结果，训练数据必须随机化。每组训练还要shuffle一下防止出现规律性。

SGD可以动态更新\(\eta\)。

SGD可以用来在线学习。非常方便，老的训练数据可以扔掉。

有一种折中，mini-batch梯度下降。就是每次取小部分数据进行训练。

SGD的实现#

初始训练时每次训练shuffle一下数据。

按上述\(\Delta\)算法处理每组数据。

后续走在线训练partial_fit。

3 Scikit-Learn分类器概览#

这一章考察scikit-learn的一些监督分类算法

3.1 选择一个分类算法#

选算法需要经验。不同算法可有不同的性质，基于不同的预设。
不可能有一种算法通吃所有场景。
通常需要做对比，选出一个最好的。

监督学习的五个主要步骤
挑选特征，收集标签化的训练数据。
选一个性能指标(performance metric)
选算法和训练模型
评估模型的表现
调参数以达到较好的表现

3.2 训练一个perceptron#

之前我们自己实现了两个分类算法perceptron和Adaline。

现在看看用scikit-learn能做什么。

打开文档方便查找 https://scikit-learn.org/stable/modules/classes.html

直接看代码

sklearn.datasets.load_iris可读取内置的iris数据集合。
sklearn.model_selection.train_test_split这个可以把一个数据集分成测试组和训练组。

stratify指定y，是要使两组数据中的y的数据占比和原始y一样。
sklearn.preprocessing.StandardScaler

即上述的标准化。默认居中，标准差为1。
sklearn.linear_model.Perceptron

属于线性模型。
Perceptron.fit()

进行训练
Perceptron.predict()

进行预测

可以看到准确率为0.978。

同样可沿用之前的plot_decision_regions画出决策的分布图。

第二章讲过如果数据集不能被完美地线性分割，那么感知器算法一定无法收敛。

这也是为什么一般不推荐感知器算法。

3.3 使用逻辑回归#

为什么会出现上述的无法收敛？

因为无法线性分割意味着总是至少有一次错误的分类，导致权重永远在更新。

看另一个简单而强大的线性二元分类算法，逻辑回归(logistic regression)。

3.3.1 逻辑回归和条件概率#

逻辑回归实现简单，在线性分类上表现良好。工业上应用广泛。

可以进行三种或更多种类的分类。目前我们只看二元分类。

从概率入手

\({\Large odds=\frac{p}{1-p}}\)

p是一个事件发生的概率，那么1-p就是不发生的概率。两个一除就是\(odds\)。

对于分类可以把p写成\({\Large p(y=1|x)}\)，当给定特征值x时，y分类为1的概率。

又有logit函数(log-odds)

\({\Large logit(p) = ln \frac{p}{1-p} }\)

参考讨论logit

logit可以把0-1的概率输入映射到整个实数域。

在逻辑模型下，我们假设logit和第二章中的net_input即z值相等。

\({\Large logit(p)=ln \frac{p}{1-p}= w^Tx + b = z}\)

那么从z值反过来能求得一个概率p

这个反求称为logistic sigmoid function，简称sigmoid函数。

\({\Large p= \sigma(z) = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}} }\)

它的图形是一个拉长的S曲线。值域在[0,1]，定义域为整个实数范围。

图3.3，总的来说把sigmoid函数插在Adaline的求z值之后，就变成了逻辑回归。

预测时如果从z值算sigmoid得到概率达到0.5，就认为y为1，否则y为0。

下面看具体如何更新权重和bias。

3.3.2 通过logistic loss function学习权重#

定义一个likelihood函数

\({\LARGE \mathcal{L}(w, b|x) = p(y|x;w,b) = \prod_{i=1}^{n}p(y^i|x^i;w,b)}\)

在数据x，权重w和bias b条件下，所有组的预测值都为y的概率。是一个连乘。

希望让它最大化。

p可以用sigmoid代替，然后现在y只考虑0/1两种值，可以变成：

\({\LARGE \prod_{i=1}^{n} (\sigma(z^i))^{y^i} (1-\sigma(z^i))^{1-y^i} }\)

这个可能有点懵，意思是y只有0/1两种值，因为我们的问题是二元分类。

\({\Large \sigma(z)}\)是y为1的概率，那么\({\Large 1-\sigma(z)}\)是y为0的概率。

那么对于每组sample，如果y是1，那么只需看前一项。否则只需看后一项。

他是故意构造了y的乘方，来选择需要的项。

另外每组当作独立事件，所以可以写成连乘。

实战中取其log值，更容易找最大值。

\({\LARGE l(w, b|x) = log\mathcal{L}(w, b|x) }\)

\({\LARGE = log\prod_{i=1}^{n} (\sigma(z^i))^{y^i} (1-\sigma(z^i))^{1-y^i}}\)

log乘变成加

\({\LARGE = \sum_{i}^{n}log[(\sigma(z^i))^{y^i} (1-\sigma(z^i))^{1-y^i}]}\)

log乘变成加

\({\LARGE = \sum_{i}^{n}[log((\sigma(z^i))^{y^i}) + log ((1-\sigma(z^i))^{1-y^i})]}\)

\({\LARGE = \sum_{i}^{n}[y^i log(\sigma(z^i)) + (1-y^i)log(1-\sigma(z^i))]}\)

为什么取log？
取log不影响最大最小值对应的变量。
可以防止\(\mathcal{L}\)很小时可能发生的数据向下溢出。
把乘法变成加法。求导更容易。

加上负号变成loss函数

\({\LARGE L(w,b)= \sum_i^n[-y^i log(\sigma(z^i)) - (1-y^i)log(1-\sigma(z^i))]}\)

上面的loss函数是求所有sample的loss。现在只拿一个sample看看具体情况。

取某个sample，[]中的值为

\({\LARGE -y log(\sigma(z)) - (1-y)log(1-\sigma(z))}\)

根据y值有

\({\Large 单个sample的loss值 = \begin{cases} -log(\sigma(z)) & \text{ if } y = 1 \\ -log(1 - \sigma(z)) & \text{ if } y = 0 \end{cases}}\)

可以画出单个sample的loss和\(\sigma(z)\)也就是概率的曲线。

图3.4可以看到。当y值为1，算出的\(\sigma(z)\)也就是p值越接近1，loss就越小，越接近0，loss呈指数级增长。
当y值为0时也是一样。
这样构造出一个指数级的loss，离正确答案越远，loss越离谱。

求L的偏微分\({\LARGE \frac{\partial L(w,b)}{\partial w_j}}\)，用链式法则chain rule。

第一层。设\(\sigma(z_i)=B\)，[]里的内容为A，对B微分为：

\({\LARGE \frac{\partial A}{\partial B} = \frac{-y}{B} - \frac{y-1}{1-B} = \frac{B-y}{B(1-B)} }\)

第二层

\({\LARGE \frac{\partial B}{\partial z} = \frac{\partial \frac{1}{1+e^{-z}}}{\partial z} = (1+e^{-z})^{-2}e^{-z}}\)

同时本身有\({\Large B=\frac{1}{1+e^{-z}}}\)，可求得\({\Large e^{-z}=\frac{1-B}{B}}\)。代入上式

\({\LARGE \frac{\partial B}{\partial z} = B(1-B)}\)

第三层 >\({\LARGE \frac{\partial z}{\partial w_j} = \frac{\partial (w^T+b)}{\partial w_j} =x_j}\)

最后三层相乘得\({\Large (B-y)x_j}\)

所以

\({\LARGE \frac{\partial L(w,b)}{\partial w_j} = \sum_i^n (\sigma(z^i) - y^i)x_j}\)

同理

\({\LARGE \frac{\partial L(w,b)}{\partial b} = \sum_i^n (\sigma(z^i) - y^i)}\)

可以看到和Adaline的结果是类似的。

3.3.3 把Adaline转换为逻辑回归#

\({\LARGE L(w,b)= \frac{1}{n}\sum_i^n[-y^i log(\sigma(z^i)) - (1-y^i)log(1-\sigma(z^i))]}\)

看代码

LogisticRegressionGD是手搓的逻辑回归类

按照上面的推导，net_input之后插入sigmoid。即activation()中算sigmoid。

打印一下每次的loss，可以看到越来越小。

总结一下手搓的逻辑回归
对一个sample，可正常算z值。
通过sigmoid函数，让z值对应到一个0-1的概率。与正确y值的差作为error。
根据loss函数，用梯度下降去更新w和b。

3.3.4 使用scikit-learn训练逻辑回归模型#

看代码

sklearn.linear_model.LogisticRegression 参数很多支持多个class。之前手搓的只支持二元。

3.3.5 用regularization解决过度拟合#

过拟合是极其学习中的常见问题。训练数据表现良好，但新数据表现不好。原因是数据波动大(high variance)。参数太多可能导致这个问题。

过拟合的反面是欠拟合(high bias)。原因是我们的模型不够复杂，不足以识别出数据的模式。图3.8展示了例子。

大致上过拟合就代表high variance，欠拟合就代表high bias。形成一个variance-bias的取舍。

regularization可以用来找一个合适的平衡点。善于处理collinearity(high correlation among features)。我理解是数据整体较为一致，只是有个别noise。它能把noise除掉。

原理是引入一个惩罚值，来处理极端的权重。最常用的叫L2 regularization(L2 shrinkage or weight decay)。

\({\LARGE \frac{\lambda}{2n}||w||^2 = \frac{\lambda}{2n} \sum_{j=1}^{m}w_j^2 }\)

\(\lambda\)是正规化参数。

把它直接加到loss函数最后得到新的loss函数。

那么偏微分的结果就要加上一项\({\LARGE \frac{\lambda}{n}w_j}\)了。

通过控制\(\lambda\)来控制regularization的力度。\(\lambda\)越大力度越大。

sklearn.linear_model.LogisticRegression的C参数就是\({\Large \frac{1}{\lambda}}\)，默认为1。

3.4 使用svm最大化margin classification#

另一个强大的学习算法是svm(support vector
machine)。可认为是感知器的扩展。
感知器算法中，我们最小化分类错误。
svm中，我们要最大化margin。
margin是决策边界对应的超平面和最近的数据的距离的两倍。所谓的support
vector。
见图3.10。

margin越大说明模型越泛化，能覆盖更多的数据，通常就是越好。

越小的话越容易过拟合，数据稍有波动就容易出错。

直觉上较为简单，但数学要求较高。

3.4.1用slack variables处理无法线性分割的情况#

sklearn.svm.SVC的C参数。可以看作一个hyperparameter，来控制出现错误分类时的惩罚。
C值越大，出错时惩罚越大。
然后可以用C来控制margin大小，也就是regularization力度。

C和regularization力度是反比

C越小力度越大、惩罚越小、margin越大，

效果见图3.11。可以看出惩罚越大，就分得越精细，但是同时margin越小，越容易过度拟合。
惩罚越小，margin越大。更容忍一些错误，把这些错误当作noise。
需要取舍。

逻辑回归vs svm#

对于分类任务，线性逻辑回归和线性svm经常有类似的性能。

逻辑回归最大化训练数据的likelihoods。比svm更适用于离群值(outlier)。
svm更注重靠近决策边界的数据。
逻辑回归在数学和实现上更简单。
逻辑回归更容易更新。非常适用于streaming data。

3.5 kernel svm处理非线性问题#

svm的一个流行原因是能被kernelized，以处理非线性分类问题。

3.5.1 Kernel methods for线性不可分数据#

例如生成一堆XOR形态的数据。见图3.13。

很明显无法用一个线性超平面把数据分成两组。

kernel methods的idea是创建原始特征的非线性组合，然后用一个映射函数\(\phi\)，把它们映射到更高维度的空间。使得数据最终能被线性分割。

图3.14是一个例子

\({\Large \phi(x_1, x_2)=(z_1, z_2,z_3)=(x_1, x_2, x_1^2+x_2^2)}\)

内圈和外圈的数据映射到三维，分出了高低。

3.5.2 用kernel技巧在高维度找超平面#

数据映射到高维度后再进行训练。之后新的数据也要先映射再预测。

一个问题是计算高维度的数据开销很大。需要用kernel trick。

常用的radial basis function（Gaussian kernel)

\({\Large K(x^i, x^j)=exp(-\gamma||x^i-x^j||^2)}\)

其中gamma是可调的参数

sklearn.svm.SVC使用rbf kernel，和不同的gamma/C参数处理数据，观察结果。

3.6 决策树学习#

考虑易读性，决策树比较有吸引力。

按顺序以问答的形式往下走就行了。

从root开始，从某个feature开始分裂，挑选依据是能得到最大的information gain。

不断往下分，直到无法再分。见图3.18。

往往深度会太大，需要剪枝。

3.6.1 IG最大化#

需要定义一个objective function

\({\Large IG(D_p, f) = I(D_p)-\sum_{j=1}^m\frac{N_j}{N_p}I(D_j)}\)

f是当前要分裂的feature
\(D_p\)是父节点
\(D_j\)是第j个子节点
\(N_p\)是父节点中sample的数量
\(N_j\)是\(D_j\)节点中sample的数量
I是impurity measure

可以看出IG是父节点的impurity measure减去所有子节点的impurity measure的和。

子节点的和越少，IG越多。

然而为了简洁，多数库实现都会采用二叉决策树，只会分裂成左右两个子节点。

\({\Large IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p}I(D_{left}) - \frac{N_{right}}{N_p}I(D_{right})}\)

三种impurity measure
- \(I_G\) Gini impurity
- \(I_H\) entropy
- \(I_E\) classification error

entropy#

\({\Large I_H(t) = - \sum_{i = 1}^{c}p(i|t)log_2 p(i|t)}\)

p(i|t)意思是某个节点t的数据中，属于i类的百分比。c是类的数量。

如果类是平均分布的话，熵能达到最大。

例如一个二叉分类，如果p(i=1|t)=1或p(i=0|t)=0，熵会得到0。

如果都是0.5，也就是平均分布，熵会得到1。

可以做出曲线，图3.19。看到0.5时熵最大。

Gini#

\({\Large I_G(t) = \sum_{i = 1}^{c}p(i|t)(1-p(i|t)) = \sum_{i = 1}^{c}p(i|t)-\sum_{i = 1}^c p(i|t)^2}\)

\({\Large = 1-\sum_{i = 1}^c p(i|t)^2}\)

同样是平均分布时达到最大

Gini和entropy效果非常相似

classification error#

\({\Large I_E(t) = 1 - max\{p(i|t)\}}\)

对于剪枝比较有效，但不推荐在树生长时使用。

看懂图3.20的数据用三种方法的计算

图3.21是三种方法的比较

3.6.2 构建决策树#

决策树可以通过把feature空间分割成长方形，来构建复杂的决策边界。

然而需要注意，如果创建的越深，就越复杂，页越容易过度拟合。

用sklearn.tree.DecisionTreeClassifier创建决策树。

criterion选gini，最大深度选4。

sklearn.tree.plot_tree可以画出决策树，非常nice。

3.6.3多个决策树构成随机森林#

ensemble方法逐渐流行，因为分类表现较好，且不容易过度拟合。

一个随机森林可以看作一个决策树的ensemble，也就是多棵树组成了一个森林。

算法思想是，单棵树可能波动较大，那么拿多棵树取平均，得到一个更泛化、不易过度拟合的模型。

随机森林生成步骤
随机挑选n个sample作为root
从root开始生长，递归往下。
1 随机挑选d个feature(无replacement)
2 分裂当前节点，根据分裂的IG值，选择IG值最大的分裂方法。
重复1-2 k次，即做k棵树。
通过majority vote把k棵树整合起来

与生成单棵决策树不同，随机森林里每棵树是随机挑选d个feature而不是所有。

无replacement意思是取完以后不再放回，比如随机取整数，一旦取到了2，后续不可能再取到2。

反之就是会把每次取到的数放回，后续可能再次取到。

随机森林在可读性(interpretability)上肯定不如单棵决策树了，但是一个优势是不用操心hyperparameter也就是C值得选择。

一般也不需要剪枝，因为ensemble性质，求平均，比较能抗noise。

需要关心的参数是k。一般树的数量越多结果越好，当然计算量更大。

n越小，即每棵树选的初始sample越少，结果随机性越大，越不容易过度拟合。
但是整体越差。
n越大，越容易过度拟合。因为选的越多，所有树越相似。

通常，n就取sample的总数，d取\(\sqrt{m}\)。

sklearn.ensemble.RandomForestClassifier
max_features参数默认sqrt。
但代码里X_train只有两个feature，所以实际每个决策树只用1个feature？

3.7 K-nearest(KNN)#

与之前的算法都不同。

lazy。并不是说看上去简单。而是说它不是去算各种函数，而是去记忆训练数据。

knn步骤
选一个k，选一个距离算法。
找k个最近的数据
通过majority vote设标签

图3.25。很简明，来了一个数据，找它最近的k个数据，其中三角最多，那么认为这个数据就是三角。

如果出现平手，可选先出现的。

虽然很简单，但是随着数据增多，速度越来越慢。

k的选择很重要，距离算法也要符合feature。

一般对于数字数据，就用欧拉距离。

见minkowski距离/欧拉距离/曼哈顿距离

4 创建好的训练数据-数据预处理#

这一章讲3中预处理的方法

4.1 处理missing数据#

sample中缺几个数据是常事。

会碰到Nan/None/Null之类。

必须进行处理。可以删除一些数据或者填入缺失的数据。

文档https://pandas.pydata.org/docs/reference/index.html#api

pandas.read_csv读csv

isnull().sum()获取缺失数据数量

dropna(axis=0)删除含有null的行
dropna(axis=1)删除含有null的列
how参数可选any，只要有null就删。all，都是null才删。
thresh参数，低于该数量就删除。
subset参数，只删除指定列中含有的null的行。

有时不能删，会丢失太多数据。那么就得造数据。

mean imputation取现有数据的平均值填进去。

sklearn.impute.SimpleImputer可以造数据。

strategy=’mean’按feature来算平均数

pandas的fillna也可以造

还有其他的方法比如sklearn.impute.KNNImputer

scikit-learn的estimator API#

fit和transform，两个常见的api。

fit一般用于学习，得到模型参数。transform用参数进行数据的转换。

见图4.2

4.2 处理分类数据#

之前处理的数据是数字，现在要处理非数字。

ordinal，序数，非数字，但是可排序，比如衣服的尺码，XL/L/M这种。

nominal，无法排序，比如衣服的颜色。

pandas.DataFrame初始化数据。

可用map把ordinal数据映射到数字。

对于nominal，一般库会在内部映射为数字。但最好是显式做一下映射。

sklearn.preprocessing.LabelEncoder

sklearn.preprocessing.OneHotEncoder sklearn的mapping api。

把ordinal转成数字没问题，本来就是有序的。

但是把nominal转成数字，fit会把这些无序的数据当成有序的数字来处理，产生问题。

one-hot encoding解决这个问题。
思想是把这个nominal的feature转成多个feature。
比如颜色这个feature有3个值红黄蓝，那么就生成红/黄/蓝3个feature。填好对应的值0或1。

可以删除一个，比如删除红，那么根据黄/蓝2个feature也可以推出红的值。

这样可减少一些冗余数据。

4.3 把数据分割为训练数据和测试数据#

之前已经看过

4.4 把feature调到相同的scale#

Feature scaling很关键，但很容易被忽略。
决策树和随机森林不需要Feature scaling。
绝大多数算法可用Feature scaling提升表现。

normalizable/standardization经常混用，需要根据上下文确定含义。

normalization一般指把feature进行rescale，到[0,1]。

\({\Large x^i_{norm} = \frac{x^i-x_{min}}{x_{max} - x_{min}}}\)

sklearn.preprocessing.MinMaxScaler干这个事

standardization可能应用更广泛。因为很多线性模型会把权重初始化到接近0。

用standardization把数据居中，标准差置1，能更好地学习。

\({\Large x^i_{std} = \frac{x^i-\mu_{x}}{\sigma_x}}\)

\(\mu_x\)是feature的平均值。

\(\sigma_x\)是feature的标准差。

sklearn.preprocessing.StandardScaler干这个事

之前我没有详细写代码处理最基本的数据，这里大概做一些测试。

必须得搞清楚数据的基本形态和常用指标。

例如

\(\begin{bmatrix} 2 & 666 & 7\\ 22 & 18 & 10001 \end{bmatrix}\)

a = np.array([[2, 666, 7], [22, 18, 10001]])
print(a)
# [[    2   666     7]
#  [   22    18 10001]]

print(f"a.shape {a.shape}")
# a.shape (2, 3)
# 2行3列。应该是有个广泛的默认约定。每行是一个sample。每列对应某个feature。

print(f"a[0].mean() {a[0].mean()}")
# 某行的mean
# a[0].mean() 225.0

print(f"a[1].mean() {a[1].mean()}")
# a[1].mean() 3347.0

print(f"a.mean() {a.mean()}")
# a.mean() 1786.0
# 默认算所有的mean，也可以选axis按行算或按列算。

print(f"a.mean(axis = 0) {a.mean(axis = 0)}")
# 某列的mean
# a.mean(axis = 0) [  12.  342. 5004.]
# 针对每个feature，算所有sample的mean。
# 很多例子数据都是全数字，所以开始容易混乱。
# 一定是针对feature算，否则结果没有意义。
# 比如强行针对sample算，算身高/体重/性别的mean？没有意义。

print(f"a[0].std() {a[0].std()}")
# 第1行的sd
# a[0].std() 311.84077133477376

print(f"a[1].std() {a[1].std()}")
# 第2行的sd
# a[1].std() 4705.088805396415

print(f"a.std() {a.std()}")
# 所有的sd
# a.std() 3681.612916462928

print(f"a.std(axis = 0) {a.std(axis = 0)}")
# 针对feature计算标准差
# a.std(axis = 0) [  10.  324. 4997.]

print(f"a StandardScaler {StandardScaler().fit_transform(a)}")
# 针对每个feature进行标准化
# a StandardScaler [[-1.  1. -1.]
#                   [ 1. -1.  1.]]

# 因为是2个sample，所以每个feature必然出现1和-1。
# 标准化以后，每个feature的mean必然为0。

#####################################################

# 把a转一下看看

print(f"a.T {a.T}")
# a.T [[    2    22]
#      [  666    18]
#      [    7 10001]]

print(f"a.T.mean(axis = 0) {a.T.mean(axis = 0)}")
# a.T.mean(axis = 0) [ 225. 3347.]

print(f"a.T.std(axis = 0) {a.T.std(axis = 0)}")
# a.T.std(axis = 0) [ 311.84077133 4705.0888054 ]

at_std = StandardScaler().fit_transform(a.T)
print(f"a.T StandardScaler {at_std}")
# a.T StandardScaler [[-0.71510854 -0.70668167]
#                     [ 1.41418326 -0.70753181]
#                     [-0.69907472  1.41421348]]

# 可以看到[2, 666, 7]标准化之后差距并没有想象的那么大。
# 标准化以后，每个feature的mean必然为0。

4.5 选择有意义的数据#

如果发现模型在训练数据上的表现比测试数据好很多，很有可能是过度拟合。
这样的模型是high variance。
本质是模型过于复杂，泛化不好。

一些改善手段

取更多的训练数据
通过regularization，给复杂度加上惩罚。
减少模型参数，变简单点。
减少数据维度。

4.5.1 L1/L2 regularization#

\({\Large L2: ||w||_2 = \sqrt {\sum_{j=1}^{m} w_j^2}}\)

\({\Large L1: ||w||_1 = \sum_{j=1}^{m} |w_j|}\)

	优势	劣势
	挑选feature/使有些feature为0	数据变得稀疏/字面意义更模糊
L1	容易处理高纬度数据	feature之间相关性高的场合效果差
	处理相关性低的或不重要的数据比较在行	计算开销比L2大

	优势	劣势
	改善过度拟合/变得更泛化	无法像L1那样选出feature
L2	feature之间相关性高的场合效果好	可能产生很多很小的非0参数
	稳定/计算开销小	可能不适合高纬度

4.5.2 从图形来理解#

见图4.5，各种资料中常用的例子。feature限定为2个，loss function用最小方差(MSE)。

最小方差能形成圆形，2个feature可做成2d图形。便于理解。

圆环乍一看不好理解，其实他们是loss function产生的等高线(contour)。
本质是3d图形(2个feature和一个loss值)。圈是3d图形的loss值投影到了2d产生。
每个圈上的2个feature值产生的loss值相等。而中心点的loss值最小。

参考
https://towardsdatascience.com/regularization-in-machine-learning-connecting-the-dots-c6e030bfaddd
看梯度下降的图，秒懂。

做regularization本质是把sample限制到一个范围。

图4.6。再看L2的图形，是一个圆。把数据限制在圆中，此时仍希望获得最小的loss。

那么就能得到交点。也就是regularization的结果。

图4.7。L1是方形。那么由图形的性质决定了L1更sharp。结果很容易靠近某个轴。

靠近某个轴，就是另个feature变成接近0，几乎将其消除，也就是上面总结的，挑选feature，变得稀疏。

接下来代码用了https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data的数据

有三种酒，有13个feature(物理化学性质)。

这里有更多数据

https://archive.ics.uci.edu/datasets

图4.8。通过代码演示了L1下，C值对于结果的影响。

C值越小，力度越大，所有feature趋近0。

力度变小，feature开始分化，容易区分。

4.5.3 顺序挑选feature#

另一个减少复杂度/避免的过度拟合的方法，是通过挑选feature来减少维度。对于未做regularization的数据尤其有用。

两种减少维度的方法

feature selection，挑选feature子集。

feature extraction，从现有feature榨取某些信息，形成新的feature空间。

顺序feature挑选算法，是一种贪婪搜索算法，把feature数从d降到k。

选出与问题更相关的feature，提高效率。

经典的算法sequential backward selection(SBS)

sbs不断删除feature，直到剩下期望的feature数。

定义一个函数，可给所有现有feature算一个分数(criterion)。那么可算出删除某feature前后的差值，那么每次挑选差值最大的那个feature删除即可。

从d个feature删到k个结束。

看代码

criterion用sklearn.neighbors.KNeighborsClassifier这个estimator得到。

用它的fit训练数据，predict得到预测值，再用sklearn.metrics.accuracy_score得到分数。

按上述每次挑出一个最差的feature删除即可。

4.6 用随机森林得到feature的重要性#

之前学习了用L1 reg去除了无关feat。用sbs/knn筛选有用的feat。

随机森林也可以选feat，且不必关心数据是否线性可分。

图4.1。sklearn.ensemble.RandomForestClassifier可以在fit之后直接获取feature_importances_。

然而如果有多个feat相关性较高，可能出现它们排名靠前，其他有用的feat低分。

可以把这个结果当成一个中间数据，继续用其他手段再处理。

sklearn.feature_selection.SelectFromModel

5 通过`维度消减`来压缩数据#

这一章看feature extraction

5.1 principal component analysis(`PCA`)实现无监督的维度消减#

feature selection主打一个挑。

feature extraction主要是做转换，把feat映射到另一个空间。

可压缩数据量，也可以提高预测表现。

PCA是一种应用广泛的无监督线性转换。其他用途包括实验数据分析、证券市场降噪、生物相关应用等等。

pca基于feat之间的联系来找出pattern。

本质是找出高维度数据中最大variance的方向，然后把数据投射到新的子空间，维度不高于原空间。

使用pca时，创建一个dxk的转换矩阵W。原始nxd的数据乘上W，变成nxk的数据。

pca之前必须standardize。

PCA步骤
对d维数据standardize
创建covariance矩阵
解出covariance矩阵的特征向量和特征值
按特征值倒排序
取头部k个特征向量和特征值
创建W矩阵
用W矩阵转换d维数据到k维

5.1.2 抽取principal components#

还是使用酒的数据。train_test_split分割数据。

按公式创建covariance矩阵

\({\Large \sigma_{jk} = \frac{1}{n-1} \sum_{i = 1}^{n}(x_j^i - \mu_j)(x_k^i - \mu_k) }\)

\(\mu_j\)和\(\mu_k\)是feat j/k的平均值。
这里注意如果standardize过，那么平均值为0。见之前4.4章做的数据测试。
本质是创建了feat之间的两两关系。
正值表示两个feat会一起增加或减少，负值表示他们会反方向变化。

比如2x3的数据。n=2，d=3。

\(\begin{bmatrix} 2 & 666 & 7\\ 22 & 18 & 10001 \end{bmatrix}\)

平均值 \(\mu_1\)=12 \(\mu_2\)=342 \(\mu_3\)=5004

先创建一个3x3的covariance矩阵

手算感受一下数据的含义

((2-12)(2-12)+(22-12)(22-12)),        ((2-12)(666-342)+(22-12)(18-342)),        ((7-5004)(2-12)+(10001-5004)(22-12))

((666-342)(2-12)+(18-342)(22-12)),    ((666-342)(666-342)+(18-342)(18-342)),    ((7-5004)(666-342)+(10001-5004)(18-342))

((7-5004)(2-12)+(10001-5004)(22-12)), ((7-5004)(666-342)+(10001-5004)(18-342)), ((7-5004)(7-5004)+(10001-5004)(10001-5004))

得到

\(\begin{bmatrix} 200 & -6480 & 99940 \\ -6480 & 209952 & -3238056 \\ 99940 & -3238056 & 49940018 \end{bmatrix}\)

用api算，得到一样的结果。

a = np.array([[2, 666, 7], [22, 18, 10001]])
print(f"a.T {a.T}")

a.T [[    2    22]
    [   666    18]
    [     7 10001]]

cov_mat = np.cov(a.T)
print(f"cov_mat {cov_mat}")

cov_mat [[ 2.0000000e+02 -6.4800000e+03  9.9940000e+04]
        [-6.4800000e+03  2.0995200e+05 -3.2380560e+06]
        [ 9.9940000e+04 -3.2380560e+06  4.9940018e+07]]

这个covariance矩阵，表达了principal components(maximum
variance的方向)。
其中的值是矢量的长度。
它一定是个方阵，且斜向对称。

求这个矩阵的特征向量和特征值，直接调api。

numpy.linalg.eig

特征值和特征向量比较重要，需要参考各种资料搞清楚其意义。

某个特征值的variance explained ratio，就是它在所有特征值和中的占比。

\({\Large Explained\ variance\ ratio = \frac{\lambda_j}{\sum_{j=1}^{d} \lambda_j} }\)

图5.2演示了ratio的累积。

接着看代码

eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

print(f"eigen_vals\n{eigen_vals}")

'''
酒的13个特征值

eigen_vals
[4.84274532 2.41602459 1.54845825 0.96120438 0.84166161 0.6620634
 0.51828472 0.34650377 0.3131368  0.10754642 0.21357215 0.15362835
 0.1808613 ]
'''

print(f"eigen_vecs\n{eigen_vecs}")

'''
对应的特征向量

eigen_vecs
[[-1.37242175e-01  5.03034778e-01 -1.37748734e-01 -3.29610003e-03
  -2.90625226e-01  2.99096847e-01  7.90529293e-02 -3.68176414e-01
  -3.98377017e-01 -9.44869777e-02  3.74638877e-01 -1.27834515e-01
   2.62834263e-01]
 [ 2.47243265e-01  1.64871190e-01  9.61503863e-02  5.62646692e-01
   8.95378697e-02  6.27036396e-01 -2.74002014e-01 -1.25775752e-02
   1.10458230e-01  2.63652406e-02 -1.37405597e-01  8.06401578e-02
  -2.66769211e-01]
 [-2.54515927e-02  2.44564761e-01  6.77775667e-01 -1.08977111e-01
  -1.60834991e-01  3.89128239e-04  1.32328045e-01  1.77578177e-01
   3.82496856e-01  1.42747511e-01  4.61583035e-01  1.67924873e-02
  -1.15542548e-01]
 [ 2.06945084e-01 -1.13529045e-01  6.25040550e-01  3.38187002e-02
   5.15873402e-02 -4.05836452e-02  2.23999097e-01 -4.40592110e-01
  -2.43373853e-01 -1.30485780e-01 -4.18953989e-01 -1.10845657e-01
   1.99483410e-01]
 [-1.54365821e-01  2.89745182e-01  1.96135481e-01 -3.67511070e-01
   6.76487073e-01  6.57772614e-02 -4.05268966e-01  1.16617503e-01
  -2.58982359e-01 -6.76080782e-02  1.00470630e-02  7.93879562e-02
   2.89018810e-02]
 [-3.93769523e-01  5.08010391e-02  1.40310572e-01  2.40245127e-01
  -1.18511144e-01 -5.89776247e-02 -3.47419412e-02  3.50192127e-01
  -3.42312860e-01  4.59917661e-01 -2.21254241e-01 -4.91459313e-01
  -6.63868598e-02]
 [-4.17351064e-01 -2.28733792e-02  1.17053859e-01  1.87053299e-01
  -1.07100349e-01 -3.01103180e-02  4.17835724e-02  2.18718183e-01
  -3.61231642e-02 -8.14583947e-01 -4.17513600e-02 -5.03074004e-02
  -2.13349079e-01]
 [ 3.05728961e-01  9.04888470e-02  1.31217777e-01 -2.29262234e-02
  -5.07581610e-01 -2.71728086e-01 -6.31145686e-01  1.97129425e-01
  -1.71436883e-01 -9.57480885e-02 -8.87569452e-02  1.75328030e-01
   1.86391279e-01]
 [-3.06683469e-01  8.35232677e-03  3.04309008e-02  4.96262330e-01
   2.01634619e-01 -4.39997519e-01 -3.23122775e-01 -4.33055871e-01
   2.44370210e-01  6.72468934e-02  1.99921861e-01 -3.67595797e-03
   1.68082985e-01]
 [ 7.55406578e-02  5.49775805e-01 -7.99299713e-02  1.06482939e-01
   5.73607091e-03 -4.11743459e-01  2.69082623e-01 -6.68411823e-02
  -1.55514919e-01  8.73336218e-02 -2.21668868e-01  3.59756535e-01
  -4.66369031e-01]
 [-3.26132628e-01 -2.07164328e-01  5.30591506e-02 -3.69053747e-01
  -2.76914216e-01  1.41673377e-01 -3.02640661e-01 -4.59762295e-01
   2.11961247e-02  1.29061125e-01 -9.84694573e-02  4.04669797e-02
  -5.32483880e-01]
 [-3.68610222e-01 -2.49025357e-01  1.32391030e-01  1.42016088e-01
  -6.66275572e-02  1.75842384e-01  1.30540143e-01  1.10827548e-01
  -2.38089559e-01  1.87646268e-01  1.91205783e-02  7.42229543e-01
   2.37835283e-01]
 [-2.96696514e-01  3.80229423e-01 -7.06502178e-02 -1.67682173e-01
  -1.28029045e-01  1.38018388e-01  8.11335043e-04  5.60817288e-03
   5.17278463e-01  1.21112574e-02 -5.42532072e-01  3.87395209e-02
   3.67763359e-01]]
'''

# 按特征值排序。找出2个最大的feature。
# 所有sample的这2个feature构成最后的W矩阵。

'''
[[-0.13724218  0.50303478]
 [ 0.24724326  0.16487119]
 [-0.02545159  0.24456476]
 [ 0.20694508 -0.11352904]
 [-0.15436582  0.28974518]
 [-0.39376952  0.05080104]
 [-0.41735106 -0.02287338]
 [ 0.30572896  0.09048885]
 [-0.30668347  0.00835233]
 [ 0.07554066  0.54977581]
 [-0.32613263 -0.20716433]
 [-0.36861022 -0.24902536]
 [-0.29669651  0.38022942]]
 '''

X_train_pca = X_train_std.dot(w)
# 原始的124x13数据点乘13x2的w，得到最终的124x2数据。

图5.3画出了最终降了维度的124x2数据。

可以看出，数据从13维降到2维，看上去能很容易区分。

5.1.5 scikit-learn中的PCA#

sklearn.decomposition.PCA

pca = PCA(n_components=2) # 选2个feature

# 转换训练和测试数据
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# 进行逻辑回归
lr = LogisticRegression(multi_class='ovr', random_state=1, solver='lbfgs')
lr = lr.fit(X_train_pca, y_train)

# 画出训练数据的决策区域
plot_decision_regions(X_train_pca, y_train, classifier=lr)

# 画出测试数据的决策区域
plot_decision_regions(X_test_pca, y_test, classifier=lr)


pca = PCA(n_components=None)
# n_components为None时。n_components == min(n_samples, n_features)
# 即保留所有
X_train_pca = pca.fit_transform(X_train_std)

# 可直接输出explained_variance_ratio_
print(pca.explained_variance_ratio_)

5.1.6 评估feature的贡献#

没懂

原始feature在principal components中的贡献。叫做loadings。

5.2 用linear discriminant analysis进行有监督数据的压缩#

LDA可以用来给non-regularized的模型降维。

LDA本质和PCA相似
PCA主要是找maximum variance的orthogonal component。
LDA主要找适于优化区分class的feature子空间。

5.2.1 PCA vs LDA#

pca和lda都是用来降维的线性转换技术。

pca是无监督。lda是有监督。

可以认为lda是更高一级

但是有研究称pca在某些特定场景分类表现更好。

5.2.2 LDA步骤#

对d维数据标准化
对每个class，计算d维mean向量。
计算between-class scatter矩阵\(S_B\)，within-class scatter矩阵\(S_W\)。
计算矩阵\(S_W^{-1}S_B\)的特征向量和特征值。
按特征值排序。
取头部k个，生成W矩阵。
用W转换原始sample。

5.2.3 scatter矩阵的计算#

有一个mean向量\(m_i\)，存class i的所有feature的平均值。

\({\Large m_i = [class_i.feature_1.mean, c_i.f_2.mean, \ ... \ , c_i.f_n.mean]}\)

# 这个代码依次挑出class 1，2，3的sample，算出他们的feature的mean。

mean_vecs = []
for label in range(1, 4):
    mean_vecs.append(np.mean(X_train_std[y_train == label], axis=0))
    print(f'MV {label}: {mean_vecs[label - 1]}\n')

'''
MV 1: [ 0.9066 -0.3497  0.3201 -0.7189  0.5056  0.8807  0.9589 -0.5516  0.5416
  0.2338  0.5897  0.6563  1.2075]

MV 2: [-0.8749 -0.2848 -0.3735  0.3157 -0.3848 -0.0433  0.0635 -0.0946  0.0703
 -0.8286  0.3144  0.3608 -0.7253]

MV 3: [ 0.1992  0.866   0.1682  0.4148 -0.0451 -1.0286 -1.2876  0.8287 -0.7795
  0.9649 -1.209  -1.3622 -0.4013]
'''

算一个\(S_i\)。class i中的每个sample，与\(m_i\)做矩阵计算，再加起来。

\({\Large S_i = \sum_{x \in class_i} (x-m_i)(x-m_i)^T }\)

其中x-m是一个1xd的向量，需要转一下变成dx1，然后乘起来变成一个dxd的矩阵。

每个sample算一个这种矩阵，全部加起来得到\(S_i\)。

最后把所有\(S_i\)加起来得到\(S_W\)。within-class scatter矩阵

\({\Large S_W = \sum_{i = 1}^c S_i }\)

\(S_W\)这里面有2个循环，里面算每个sample，外面算每个类。

# 计算训练数据每个feature的mean。并旋转，变成dx1。
mean_overall = np.mean(X_train_std, axis=0)
mean_overall = mean_overall.reshape(d, 1)

计算\(S_B\)，between-class scatter矩阵

\({\Large S_B = \sum_{i = 1}^c n (m_i - m_{overall})(m_i - m_{overall})^T }\)

\(S_W\)这里面只有1个循环，每个类算一下就ok。

来到第4步。接下来跟PCA差不多。

计算矩阵\(S_W^{-1}S_B\)的特征向量和特征值。

取头部几个生成W矩阵。

转换原始数据。

图5.10展示LDA处理后的数据。

5.2.6 cikit-learn的LDA#

学习sklearn.discriminant_analysis.LinearDiscriminantAnalysis相关操作。

对比训练数据和测试数据的决策区域图。

5.3 非线性降维以及可视化#

PCA/LDA都是线性转换。

一个常用的非线性技术。t-distributed stochastic neighbor embedding(t-SNE)。

常用来把高维度数据转成2，3维进行可视化。

5.3.1 为什么要非线性#

很多机器学习算法默认数据是线性可分。

但是现实中很多数据是非线性的，PCA/LDA不是最佳选择。

图5.13展示了线性和非线性数据的区别。

非线性的降维一般会指向Manifold learning。超出了本书的范围，需要另行学习。

https://scikit-learn.org/stable/modules/manifold.html

虽然强大，但是难用。如果缺乏理想的hyperparameter，可能帮倒忙。

另外如果不是降到简单的2维3维之类，很难或者几乎不可能判定结果的质量。还是可能帮倒忙。

所以很多人仍然依靠更简单的PCA/LDA来降维。

5.3.2 t-SNE#

本质上基于高纬度中feature之间的两两距离。

这个两两距离有一个概率分布。然后它找出一个低纬度的空间，其概率分布与高纬度最相似。

换句话，它找一个低纬度空间，条件是其两两距离数据能被保留(preserved)。

论文https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

最后看一个识别数字的例子。

数据见https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html

每个sample是一个8x8黑白像素的图片。

那么就是64个feature。feature的值是深浅程度。

到这里对于我们初学者，我感觉某种程度上已经结束了，因为拿到了数据，并且定下了特征。那么再套各种算法就完事儿了。里面的数学原理就得去长期啃论文了。

这里得有一个初级的思想转变，这里的图片与其他数据比如酒的成分，身高体重这些并没有本质区别，并没有特别高深莫测。

不过是把图片分割成多个区域，每个区域给个值当作特征值而已。

代码中降维到2d。图5.16可以看到效果还挺好的。

6 模型评估和超参数调教的最佳实践#

这一章学习
评估机器学习模型的表现
常见问题的诊断
调教模型
学习不同的表现评估标准

6.1 pipeline#

数据为乳腺癌数据。

https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic 32个feature是乳房肿块的半径，面积，周长等。结果是benign/malignant(良性/恶性)。

read_csv加载数据
LabelEncoder标签转数字
train_test_split创建训练和测试数据
sklearn.pipeline.make_pipeline生成一个流水线
fit进行训练
predict进行测试
score获取准确度

6.2 使用k-fold交叉验证来评估模型表现#

holdout是经典且流行的评估算法整体表现的方法。
把原始数据分成训练数据和测试数据。
之前我们几乎都是在这样做。

然而对于某个具体应用，我们需要调整各种参数来改进整体表现。

这个叫做model selection。本质是要挑选最佳的hyperparameters。

如果我们每次都用同一组训练数据，容易过度拟合。

一个更好的方案是把数据分成训练/验证/测试三组。
训练数据进行fit。
验证数据用来不断调参数，并evaluate。
测试数据用来产生最终结果。

图6.2展示了流程。

k-fold cross-validation(k-fold cv)中，把训练数据分成k份(k
fold)。
其中的k-1份作为正常的训练数据，和之前一样。
1份作为验证数据，也就是1/k的验证数据。

然后第1次操作选第1份作为验证数据，其余作为训练数据。
第2此选第2份作为验证数据，其余作为训练数据。
这样可以进行k次操作，每次选不同各种参数，选不同的fold作为验证数据。同时产生k个模型，以及k个验证结果。
那么就可以对比这k个结果，看看哪一批参数最好使。

scikit-learn的文档非常好

https://scikit-learn.org/stable/modules/cross_validation.html

sklearn.model_selection.KFold(n_splits=5)
n_splits即fold数。默认为5，即4/5训练数据，1/5验证数据。
k选大选小各有利弊。

sklearn.model_selection.StratifiedKFold(n_splits=10)
Stratified意思是每一组都包含差不多数量的class，或者说把每个class的sample均匀分在各个组里。
一般表现更好。

sklearn.model_selection.cross_val_score可以多cpu并行训练。

6.3 使用学习/验证曲线来debug#

两个诊断工具可以提高算法的表现
- 学习曲线
- 验证曲线

6.3.1 用学习曲线来诊断bias和variance#

之前学过了bias和variance是对立关系。
倾向bias表示欠拟合，模型过于简单。
倾向variance表示过拟合，模型过于复杂。

采用更多的训练数据经常可以缓解过度拟合。

但是实战中，获取更多数据是有代价的，不一定现实。

画出数据量与准确度之间的关系，我们可以较为容易地判断是否受bias/variance拖累，是否需要更多的数据来解决问题。

图6.4展示了一些场景。

Training accuracy，有些陌生，回看第4章。有代码

lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))

# Training accuracy就是用训练数据fit之后的score。
# validation accuracy就是用验证数据fit之后的score。
# 同样test accuracy就是用测试数据predict之后的score。

左上图是high bias。训练准确度和cv准确度都低。欠拟合。
通常需要增加参数，比如需要更多的feature，或者降低regularization力度。
(regularization用来解决过度拟合)

右上是high variance。训练准确度和cv准确度差距很大。过度拟合。

需要增加训练数据，简化模型，增加reg力度。

右下是一个好的结果。

# 先做一个pipeline
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', max_iter=10000))

# 把数据喂给learning_curve函数即可
train_sizes, train_scores, test_scores = learning_curve(estimator=pipe_lr,
               X=X_train,
               y=y_train,
               train_sizes=np.linspace(0.1, 1.0, 10),
               cv=10,
               n_jobs=1)

# 用来生成曲线
# train_sizes规定训练数据的数量。一般作为横坐标上的数据点。
# np.linspace(0.1, 1.0, 10)意思从0.1到1区间产生10个值。曲线上最终会有10个点。

# cv即fold数。对于上面10种比例的数据，每次再做k-fold cv，共做10次。

# 最后得出每种比列下的cv准确度

# 打印一下
print(f"train_sizes {train_sizes}")
print(f"train_scores {train_scores}")
print(f"train_scores.shape {train_scores.shape}")
print(f"test_scores {test_scores}")
print(f"test_scores.shape {test_scores.shape}")

# 我区分一下，看得更清楚。把train_sizes设为5。则曲线有5个点。
# train_sizes为[ 40 132 224 316 409]

# 对于每个train_size，做10-fold的cv。有10个score。
# 所以最后的train_scores和test_scores都是5行10列。
# 可看作train_sizes是外面大循环。里面是k-fold小循环。

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# 5行10列的cv结果，对每行10个score取平均值，作为cv的最终结果。对应一个train_size。形成最后的曲线。
# 他还算了标准差，填充了颜色。

6.3.2 用验证曲线来判断欠拟合/过拟合#

learning_curve是测试不同的测试数据量。

validation_curve是测试不同的参数，比如逻辑回归的C参数。

代码大同小异

sklearn.model_selection.validation_curve
param_name指定要修改的参数
param_range指定参数值

6.4 使用grid search来调教模型#

机器学习中有2种参数
- 从数据中学习得来的参数。比如一开始就接触到的权重。
- 模型调教参数(hyperparameters)。比如C参数，比如决策树的深度。

上一节用validation_curve指定不同的C参数。

下面学grid search，可以用来找出最佳的超参数组合。

6.4.1 grid search调教超参数#

原理很简单，就是暴力测试多个超参数的所有组合。

sklearn.model_selection.GridSearchCV

# 做pipeline
pipe_svc = make_pipeline(StandardScaler(),  SVC(random_state=1))

# 设置超参数
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range,
'svc__kernel': ['linear']}, {'svc__C': param_range,
'svc__gamma': param_range,
'svc__kernel': ['rbf']}]

# 喂给GridSearchCV
gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  refit=True,
                  cv=10)

gs = gs.fit(X_train, y_train)

# 打印最佳结果
print(gs.best_score_)
# 0.9846859903381642

print(gs.best_params_)
# {'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}

# 获取模型
clf = gs.best_estimator_

# 打印分数
print(f'Test accuracy: {clf.score(X_test, y_test):.3f}')
# Test accuracy: 0.974

6.4.2 用随机搜索获取更广泛的超参数#

GridSearchCV是在用户指定的参数范围中搜索。

如果范围很大，非常耗时。

另一种方法是用随机搜索。从某种分布中随机抽取超参数来测试。

sklearn.model_selection.RandomizedSearchCV

图6.7展示了一种场景。非常清晰，grid search是死搜所有组合，如果很稀疏，很有可能错过最佳参数。

随机化以后皆有可能。

我猜对于有的场景，如果分布取得好，可以获得奇效。

这里运行代码报了一大堆错。

原来是把svc的kernel参数的rbf写成了rbg。

6.4.3 二分随机#

顺着随机搜索，又有一种二分随机方法。

sklearn.model_selection.HalvingRandomSearchCV

做少量训练，丢弃一半不看好的数据。不断重复直到剩下最后一个胜者。

具体暂不研究。

6.4.4 嵌套的交叉验证#

见图6.8。

没有仔细看。大概是GridSearchCV外面再套一个cross_val_score？

也就是2层cv。

有研究显示结果会更好。

6.5 其他的评估标准#

这一节很繁琐。以后得持续学习。

6.5.1 confusion matrix#

图6.9。四种预测状态。预测0实际1，0/0，1/0，1/1。

FN/TN/FP/TP

预测对了就是Ture，预测错了就是False。

预测值是0就是Negative，预测值是1就是Positive。

把算法结果y_true_, y_pred_输入到sklearn.metrics.confusion_matrix可以直接统计出来。

与图中对应。

4个值加起来一定等于总预测数。

6.5.2 优化模型的precision和recall#

预测的err和acc(错误率和准确率)展示了总体预测正确数。相加为1。

基于上面4种状态，有各种指标。

\({\Large ERR = \frac{FP+FN}{FP+FN+TP+TN}}\)

\({\Large ACC = \frac{TP+TN}{FP+FN+TP+TN} = 1 - ERR}\)

假真率。把实际假，误报为真的比例。

\({\Large FPR = \frac{FP}{N} = \frac{FP}{FP+TN}}\)

真真率。把实际真，正确报了真的比例。

\({\Large TPR = \frac{TP}{P} = \frac{TP}{FN+TP}}\)

例如在肿瘤诊断中，非常重要的是恶性被正确预测。

同样重要的是，减少良性被误报为恶性。

准确率PRE(precision)。预测为真中，实际是真的比例。

\({\Large PRE = \frac{TP}{FP+TP}}\)

REC(recall)。同TPR。实际为真的数据中，报了真的比例。

\({\Large REC = TPR = \frac{TP}{FN+TP}}\)

这些东西真的很绕。。。需要多查别的资料综合来看。

https://www.iguazio.com/glossary/recall

https://en.wikipedia.org/wiki/Precision_and_recall

https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall

https://mlu-explain.github.io/precision-recall/

这些指标在不同场景，有重要性的区分。

recall

比如在看病场景中，recall就是把实际有病预测为有病的概率，非常重要，越高越好。

也就是TP(有病预测为有病)要高，FN(有病预测为没病)要少。

FP(没病预测为有病)和TN(没病预测为没病)相对不重要。

代入现实想一下就是，如果真有病，那么最好预测有病，如果预测没病那就出大问题。而如果实际没病，预测有病，基本是虚惊一场(因为会持续治疗，大概率会发现其实没问题)(当然有极小概率没病但被治死)。

也就是更侧重实际有病的情况。这里更看重recall指标。

适用场景倾向于不希望放过任何一个坏人。比较严厉。
precision

垃圾邮件的检测中。准确性更重要。也就是TP和FP更重要。

就是说你预测是垃圾，那最好真的是垃圾。

漏判垃圾没关系，反正最后我会打开自己看。

但是如果把好的邮件错判为垃圾，丢掉或者挪走，那就出大问题。

倾向于不要错判任何一个好人。比较宽容。

不能只看ACC指标。存在ACC高但是REC和PRE低的情况。

最好REC和PRE都高。但是REC和PRE实际上经常是对立关系，需要tradeoff。
不是一定对立，是通常对立。
这一点找了几本书都没有说的很明白，都是直接说结果。

这里是一个很好的图示，展示了他们对立的例子。
对决策的threshhold进行shift，一个指标变大另一个变小。
https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

REC和PRE我们都想要，但他们是对立。那么发明出一个整合。

\({\Large F1 = 2 \frac{PRE \times REC}{PRE + REC}}\)

6.5.3 Receiver operating characteristic#

https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

ROC曲线有助于分类模型的挑选。基于FPR和TPR，对决策点做shift。

FPR和TPR作为两个轴。

不同的决策threshhold(也就是上面的对决策点做shift)，对应不同FPR和TPR值，这样可画出一条曲线。

见https://mlu-explain.github.io/roc-auc/的空袭报警例子。

从左边开始，左边是倾向于任何情况都报警，把假都报成了真，FPR趋近1。
所有都报真，自然是把真也报了真，TPR也是趋近1。
靠近右边时。倾向于少报警，都不报警了FPR和TPR自然是0。和从左开始大致是对称的形状。
这样扫过去就会形成一个从(1,1)到(0,0)的roc曲线。

我们希望TPR越大越好，FPR越小越好。
完美的情况是FPR永远为0，TPR永远为1。也就是永远正确。
也就是左边竖线连着顶端横线。
ROC越靠近左上角越好。

现实情况中，就像空袭的例子，飞机和云是穿插的，经常会形成常见的ROC曲线。

从(0,0)到(1,1)的直线，被认为是一个随机分类器。我没完全懂。查了很多资料都没有直接说明。先不纠结。问题可能在于这个随机该如何理解。

姑且先这样理解。直线意思是，任何情况下TPR=FPR。
也就是把假报成真和把真报成真的概率永远相等。
比如我给你10个假的你报3个真，我给10个真的你也报3个真。我给任何数据你都报30%真，感觉你其实啥也没干。
比如我定一个0.3的报真概率，然后输入随机的数据进行预测，就能跟你打平。
编不下去了。

AUC是area under the curve。也就是roc曲线下面的面积。

auc有多种解读。

sklearn.metrics.roc_curve计算roc曲线。

sklearn.metrics.auc可算auc值。

7 Ensemble Learning(集成学习)#

基于上一章的评估技术，这一章学习分类器组，这些分类器组通常比其中任何一个分类器效果更好。

7.1 集成学习#

ensemble methods创建一个元分类器，由不同的分类器组成。通常比其中任何一个分类器效果更好。

例如对于一个问题，找10个专家给出意见，再综合起来。一般要比只找一个专家更靠谱。

有多种方法来进行集成。这一节学基本原理。

majority voting原理，多个分类器对同个sample预测，哪个class出的多，就采用哪个class。

图7.2展示整体思路。

先看二项分布

有某种操作，每次操作出现结果A的概率为p，操作相互独立。

那么n次操作后恰好出现k次A的概率为

\({\Large P(k) = \binom{n}{k} p^k (1-p)^{n-k}}\)

二项分布的图形跟正态分布一样也像钟形。
如果p=0.5，就是两边对称，中间最高。
比如扔一个对称的硬币若干次，正面和反面的次数相等的概率最大。

把p换成错误率\(\epsilon\)。再做成概率的累积。有

\({\Large P(y \ge k) = \sum_k^n \binom{n}{k} \epsilon^k (1-\epsilon)^{n-k} = \epsilon_{ensemble}}\)

例如我们有11个分类器(n = 11)，每个的\({\epsilon = 0.25}\)。

那么6个或以上分类器出现错误的概率为

\({\Large P(y \ge 6) = \sum_{k = 6}^{11} \binom{11}{k} 0.25^k (1-0.25)^{11-k} = \epsilon_{ensemble} \approx 0.034}\)

也就是半数以上出错的概率，即最终结果出错的概率，为0.034。

可以看到整体的0.034错误率，比单个0.25的错误率好了很多。

但是单个错误率超过0.5时，整体错误率超过0.5，快速上升。

曲线是一个拉伸的S型，有点像sigmoid。

7.2 majority vote#

7.2.1 实现一个简单的majority voting分类器#

\({\Large y = argmax_i \sum_{j = 1}^m w_j X_A(C_j(x) = i)}\)

y是预测的最终结果

argmax是获取list中最大的值的index

i是class i

j是分类器下标

\(C_j(x)\)是分类器j对于一条数据x得到的预测值

w是某个分类器的权重

X可以看成一个函数，返回括号中的概率。\(X(C_j(x) = i)\)就是分类器j预测结果为i的概率。

可简单写为

\({\Large y = mode\{ C_1(x), C_2(x), ... , C_m(x) \}}\)

mode意思是选取list中得票最多的那个。

假设有3个分类器\(C_1\)\(C_2\)\(C_3\)。有2个y值0和1。即

\({\Large C_j(x) \in \{0,1\}}\)

假设权重分别为0.2，0.2，0.6，预测值分别为0，0，1。

\(C_3\)的权重为3倍，实际效果是

\({\Large y = mode\{0, 0, 1, 1, 1\}}\)

用argmax表示

\({\Large y = argmax[0.2 \times 1 + 0.2 \times 1 + 0.6 \times 0, \ \ 0.2 \times 0 + 0.2 \times 0 + 0.6 \times 1 ]}\)

\({\Large \ \ = argmax[0.4, \ 0.6 ] = 1}\)

0.6大于0.4，也就是第二项得票多。最终结果就是1。

总结一下就是
- y是要求整个分类器对于一个sample x的预测 - 对每个class i都要算一个和A
- A是每个分类器对于x的预测值为i的概率乘上各自权重再全部加起来
- 所有class的和用argmax挑出最大的作为最终结果

看代码，展示了上述公式的计算。

np.argmax(np.bincount([0, 0, 1],  weights=[0.2, 0.2, 0.6]))
# bincount一步算出每个class的和
# argmax获取index
# 一行代码直接取得上面的y
# X函数要么是0要么是1。更符合简写的mode公式。



ex = np.array([[0.9, 0.1],
               [0.8, 0.2],
               [0.4, 0.6]])

p = np.average(ex,
               axis=0,
               weights=[0.2, 0.2, 0.6])

print(np.argmax(p))
# 这样写是显示指定概率。更符合完整的y=argmax公式。

接着写MajorityVoteClassifier类。
这里开始需要进一步了解scikit-learn的代码结构。
需要了解如何开发自己的estimator。

sklearn.utils.estimator_checks.check_estimator可以检查自己的estimator是否和scikit-learn兼容。

见文档

https://scikit-learn.org/stable/developers/develop.html

class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, classifiers, vote='classlabel', weights=None):

        # 传入一批classifier，形成集成。
        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):

        # X_train数量为50。10-fold。所以每次fit数量是45。
        print(f"X.shape {X.shape}") # 45x2
        print(f"X {X}")

        # 处理label，转为数字。
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            # 每个clf做fit
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote  默认
            # 得到每个clf的预测
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            # 对predictions执行y公式，得到最好的clf的index。
            maj_vote = np.apply_along_axis(
                                      lambda x:
                                      np.argmax(np.bincount(x,
                                                weights=self.weights)),
                                      axis=1,
                                      arr=predictions)

        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    # 当前场景下必须提供此函数作为decision_function。否则报错。
    # 可能是由下面的score方法roc_auc指定。具体待研究。
    def predict_proba(self, X):
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])

        # X_train数量50。10-fold。每次用5个sample验证。5x2
        print(f"X.shape {X.shape}") # 5x2

        # 因为是2个class？每个predict_proba出来是5x2。
        # 3个clf。就是3x5x2。
        print(f"probas.shape {probas.shape}")
        print(f"probas {probas}")

        # 加权计算。得5个sample的每个class的概率。5x2
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        print(f"avg_proba {avg_proba}")

        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super().get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in self.named_classifiers.items():
                for key, value in step.get_params(deep=True).items():
                    out[f'{name}__{key}'] = value
            return out


# 准备数据
iris = datasets.load_iris()

# 选下标1和2两个feature
# 50之后只有2个class
# 最后的数就是100x2。2个class。
X, y = iris.data[50:, [1, 2]], iris.target[50:]
le = LabelEncoder()
y = le.fit_transform(y)

# 按0.5分割。最后X_train数量为50。
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.5,
    random_state=1,
    stratify=y)

# 创建3个estimator
clf1 = LogisticRegression(penalty='l2',
                        C=0.001,
                        solver='lbfgs',
                        random_state=1)

clf2 = DecisionTreeClassifier(max_depth=1,
                        criterion='entropy',
                        random_state=0)

clf3 = KNeighborsClassifier(n_neighbors=1,
                        p=2,
                        metric='minkowski')

# 逻辑回归和knn之前进行标准化
pipe1 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf1]])
pipe3 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf3]])

# 3个estimator进行集成
mv_clf = MajorityVoteClassifier(classifiers=[pipe1, clf2, pipe3])

# 4个estimator进行对比
clf_labels = ['Logistic regression', 'Decision tree', 'KNN', 'Majority voting']
all_clf = [pipe1, clf2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    # 4个estimator都跑一次交叉验证
    # 10-fold
    # 算分采用roc_auc
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')

    print(f'ROC AUC: {scores.mean():.2f} '
          f'(+/- {scores.std():.2f}) [{label}]')


# 可以看到集成后。roc auc指标变好。
# ROC AUC: 0.92 (+/- 0.15) [Logistic regression]
# ROC AUC: 0.87 (+/- 0.18) [Decision tree]
# ROC AUC: 0.85 (+/- 0.13) [KNN]
# ROC AUC: 0.98 (+/- 0.05) [Majority voting]


# 到此predict函数是没跑的。

cross_val_score的算分见

https://scikit-learn.org/stable/modules/model_evaluation.html# 非常多的内容。毕竟评估是非常重要的工作。

然后代码用sklearn.metrics.roc_curve得到roc数据。画出4个clf的曲线。

然后像前几章那样画出决策边界。观察4个clf的决策边界。

见图7.5

7.2.3 参数调教#

get_params返回estimator的参数

见https://scikit-learn.org/stable/developers/develop.html#get-params-and-set-params

# 查看参数
mv_clf.get_params()
print(f"mv_clf.get_params() {mv_clf.get_params()}")

# 设定让GridSearchCV搜索的参数
# 2x3=6种组合
params = {'decisiontreeclassifier__max_depth': [1, 2],
          'pipeline-1__clf__C': [0.001, 0.1, 100.0]}

grid = GridSearchCV(estimator=mv_clf,
                    param_grid=params,
                    cv=10,
                    scoring='roc_auc')

# fit执行训练
grid.fit(X_train, y_train)

print(f"grid.cv_results_ {grid.cv_results_}")
# 因为有6种组合，所以很多结果都有6个值。

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    mean_score = grid.cv_results_['mean_test_score'][r]
    std_dev = grid.cv_results_['std_test_score'][r]
    params = grid.cv_results_['params'][r]
    print(f'{mean_score:.3f} +/- {std_dev:.2f} {params}')
    # 这里打印出所有组合的分数

'''
0.983 +/- 0.05 {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.001}
0.983 +/- 0.05 {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.1}
0.967 +/- 0.10 {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 100.0}
0.983 +/- 0.05 {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 0.001}
0.983 +/- 0.05 {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 0.1}
0.967 +/- 0.10 {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 100.0}
'''

# 直接打印出最好的参数组合
print(f'Best parameters: {grid.best_params_}')
print(f'ROC AUC: {grid.best_score_:.2f}')

'''
Best parameters: {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.001}
ROC AUC: 0.98
'''

7.3 Bagging#

Bagging和上一节的MajorityVoteClassifier相似。

区别在于，Bagging不用同一组训练数据，而是随机从初始训练数据中取bootstrap samples。
图7.6展示了流程。
就是很简单地把原始训练数据分成x份，每轮每个clf随机抽x份，带replacement。

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

# 去掉一个class。剩2个class。
df_wine = df_wine[df_wine['Class label'] != 1]

print(f"df_wine.shape {df_wine.shape}")
print(f"df_wine {df_wine}")

# 取2个feature
# X为119x2
y = df_wine['Class label'].values
X = df_wine[['Alcohol', 'OD280/OD315 of diluted wines']].values

print(f"X.shape {X.shape}")
print(f"X {X}")


le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2,
    random_state=1,
    stratify=y)

# X_train为80%原始数据
# 95x2

# 做一个决策树
tree = DecisionTreeClassifier(criterion='entropy',
    max_depth=None,
    random_state=1)

# 做一个bag
# base_estimator为上面的tree
bag = BaggingClassifier(base_estimator=tree,
    n_estimators=500,
    max_samples=1.0,
    max_features=1.0,
    bootstrap=True,
    bootstrap_features=False,
    n_jobs=1,
    random_state=1)

# 直接用决策树训练
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f'Decision tree train/test accuracies '
      f'{tree_train:.3f}/{tree_test:.3f}')

# Decision tree train/test accuracies 1.000/0.833

# 用bag训练
bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)

bag_train = accuracy_score(y_train, y_train_pred)
bag_test = accuracy_score(y_test, y_test_pred)
print(f'Bagging train/test accuracies '
      f'{bag_train:.3f}/{bag_test:.3f}')

# Bagging train/test accuracies 1.000/0.917

可以看到，用bag处理后，准确度提高。

图7.8画出了决策边界。

bagging可用于减少过度拟合。

7.4 adaptive boosting#

7.5 Gradient boosting#

8 机器学习用于情感分析#

先进的互联网和多媒体时代，人们的意见/观点/推荐都是政治/经济操作的资源。

这一章看natural language processing(NLP)的一个子领域，情感分析。

学习如何用机器学习把文本分类，基于文本的情感(也就是作者的态度)。

使用IMDb的电影数据，对观众的点评进行分类。

8.1 数据预处理#

http://ai.stanford.edu/~amaas/data/sentiment/

数据包含50000条影评，正面/反面两种class。

超过6星算正面。小于5星算负面。

进行预处理后，用这些数据训练出模型，预测一段文字是正面还是反面。

pip3.11 install pyprind --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip3.11 install nltk==3.8.1 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

# 加载数据
# 数据比较多，使用进度条更友好。

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),  'r', encoding='utf-8') as infile:
                txt = infile.read()

            # 这里pandas最新版本api变了
            # append要换成concat
            #df = df.append([[txt, labels[l]]],  ignore_index=True)

            new_df = pd.DataFrame([[txt, labels[l]]])
            #new_df = pd.DataFrame([txt, labels[l]])

            df = pd.concat([df, new_df], ignore_index=True)

            pbar.update()

# concat中ignore_index。即不指定列的title。自动从0开始编号。
# 这里指定列的title，也就是feature的title。
df.columns = ['review', 'sentiment']

# 打乱顺序
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

# 存csv
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

# 读csv
df = pd.read_csv('movie_data.csv', encoding='utf-8')


# 最终数据的shape 50000x2

pandas.DataFrame是比较基本的一个数据结构，内容也比较多。得专门看一下。

最终的df数据结构见图8.1。

# 要这样初始化一个df。一条数据做成1行2列。
new_df = pd.DataFrame([[txt, labels[l]]])
'''
                                                   0  1
0  Battleship Potemkin is a celluloid masterpiece...  1
'''

# 不能这样。2行1列。
new_df = pd.DataFrame([txt, labels[l]])
'''
                                                   0
0  I really enjoyed Fierce People. I discovered t...
1                                                  1
'''

8.2 bag-of-words模型#

bag-of-words(词袋)模型可以把一个文本转成feature vector。

创建唯一的词汇token。例如所有文本中的单词。可看作是所有feature。
每个文本创建自己的feature vector。即文本中每个单词出现了几次？

单个文本中的单词一般只是整个词汇表的较小子集，所以很可能包含很多0。即sparse。

sklearn.feature_extraction.text.CountVectorizer可以做出整个词汇表和feature vector。

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])

# fit_transform生成词袋
bag = count.fit_transform(docs)

# 所有单词和对应的index
print(f"count.vocabulary_\n{count.vocabulary_}")
# {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}

# 每个文本中各个单词的数量
print(f"bag.toarray()\n{bag.toarray()}")
'''
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
'''

这些数量叫raw term frequencies

term t在文本d中出现的次数

\({\Large tf(t,d)}\)

8.2.2 term frequency-inverse document frequency#

分析文本时，经常会遇到某个单词在多个sample中出现，这些词可能不能提供太有用的信息。

引入term frequency-inverse document frequency(tf-idf)，可以降低这种词的权重。

\({\Large tf \mbox{-} idf(t,d) = tf(t,d) \times idf(t,d)}\)

其中

\({\Large idf(t,d) = log \frac{n_d}{1+df(d,t)}}\)

\(n_d\)是文本总数。
\(df(d,t)\)是包含单词t的文本数量。
下面的1是可选的，主要为避免出现0。

例如共有100个文本，如果每个都出现单词t，idf接近log(1)=0。那么tf-idf接近0，说明这个词哪都有，没大用。

如果只出现在3个文本，那么idf为log(100/4)接近3.2。

如果总文本数巨大，但某个词出现很少，idf也不会太大。

log就是防止这种低频得分太高。例如log(10000)也只有9.2，log(1000000)也只有13.8。一般不会太离谱。

sklearn.feature_extraction.text.TfidfTransformer可计算tfidf。

tfidf = TfidfTransformer(use_idf=True,
    norm='l2',
    smooth_idf=True)

# 传入词袋得到tfidf
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

'''
[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
'''

# 可见考虑权重后，第2列单词is降到0.4。

scikit-learn的实现不一定完全按照论文。可有细微的变化。

norm参数可指定对结果进行normalize。

8.2.3 清理数据#

上一节学了bag-of-words模型，tf，tf-idf。

在这之前，需要清洗一下数据。

文本中可能包含html标签，标点符号等。html标签总体用处不大，标点符号在某些场景是有用的。

简单起见，只保留:)这种表情，删除其他符号。

删除非文字。大写字母统一转小写。

8.2.4 把文本处理成token#

可以直接按空格分割。

另一个技巧是word stemming。可以按词源来处理(Porter stemmer算法)。

https://tartarus.org/martin/PorterStemmer/

可直接用nltk的api。nltk.stem.porter.PorterStemmer。
单词会转成词根，比如running会变成run，thus会变成thu。
Porter stemmer最简单，还有其他算法。

stop word removal
比如is/and/has/like，及其常用的单词，对分析可以说基本没用。
如果没用tf-idf降权重的话，可以直接删除这些词。
nltk可以直接获取这些词。

8.3 使用逻辑回归对文本分类#

起一个pipeline，先做tfidf，再逻辑回归。然后GridSearchCV选出最好的参数。

# 数据对半分
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

# TfidfVectorizer
# 根据文档可视为联合使用CountVectorizer和TfidfTransformer。
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

# 设置要对比的参数
# 可以自己改改玩玩
small_param_grid = [{'vect__ngram_range': [(1, 1), (1, 3)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'vect__token_pattern':[None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'vect__token_pattern':[None],
                     'vect__use_idf':[False, True],
                     'vect__norm':[None, True],
                     'clf__penalty': ['l2', 'l1'],
                     'clf__C': [1.0, 5, 10.0]},
              ]

# 做pipeline
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])

# 做cv
gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

# 训练
gs_lr_tfidf.fit(X_train, y_train)

# 输出结果
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

8.4 在线算法和out of core学习#

运行之前的代码可以看到处理50000条影评，开销可以非常大。

现实中数据量常常更大。

https://machinelearning.wtf/terms/out-of-core/
https://scikit-learn.org/stable/computing/scaling_strategies.html
https://en.wikipedia.org/wiki/Out-of-core_algorithm

out of core，意思数据量太大，不可能完全放入ram一次处理完。那么需要一点点来。

并且需要针对此场景做io等等优化。

3个要点
需要一种stream方法加载数据
从数据中提取feature
需要一个增量算法(incremental algorithm)

增量意思就是可以一点点来，不用看到所有数据也能工作。

之前我们的代码都是一次传入所有训练数据。这样就不是增量。

第二章学了随机梯度下降。每次使用一个sample更新权重参数。
这一节使用SGDClassifier的partial_fit做stream操作。
partial_fit不是所有estimator都支持。

sklearn.feature_extraction.text.HashingVectorizer可增量。

相比sklearn.feature_extraction.text.CountVectorizer有优劣。见文档。

feature提取和hash相关信息

https://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing

https://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick

n_features参数可限定hash的空间。如果内存是个问题，可以调小n_features。

stop = stopwords.words('english')

# 保留表情。大写转小写。删除非文字。按空格打碎成token。删除stopsords。
def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub('[\W]+', ' ', text.lower()) + ''.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

# stream方法。从磁盘文本中获取一条新的sample
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

# 从stream中拿出size个sample一起返回。
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# 起HashingVectorizer
vect = HashingVectorizer(decode_error='ignore',
                        n_features=2**21,
                        preprocessor=None,
                        tokenizer=tokenizer)

# SGDClassifier。api有变化。log需要写成log_loss
clf = SGDClassifier(loss='log_loss', random_state=1)

doc_stream = stream_docs(path='movie_data.csv')

# 进度条
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])

# 可调小range和size观察数据
for _ in range(45): # 分45次batch
    # 取数据
    X_train, y_train = get_minibatch(doc_stream, size=1000)

    # 各种打印。观察数据。感受数据的流转。
    # 此时X_train是文本的list。1000个
    print(f"X_train\n{X_train}")
    print(f"len(X_train.shape) {len(X_train)}")
    print(f"y_train\n{y_train}")
    print(f"len(y_train.shape) {len(y_train)}")
    if not X_train:
        break
    X_train = vect.transform(X_train)
    print(f"type(X_train) {type(X_train)}")
    print(f"X_train.shape {X_train.shape}")
    # 如果n_features=2**21。X_train.shape是(1000, 2097152)

    # X_train此时是一个稀疏矩阵。如果[0, 0]没hash到，那值就是0。
    print(f"X_train[0，0] {X_train[0, 0]}")
    print(f"X_train\n{X_train}")

    # 增量训练
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()


# 最后用测试数据算分
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

内容非常多

8.5 使用latent Dirichlet allocation进行topic建模#

topic建模是个每个文本赋予各种topic。

就是打标签。

可以看作是无监督学习，clustering。

latent Dirichlet allocation(LDA)是一个流行的topic建模技术。
非常大的课题，数学很难，不是一般人一时半会能搞定的。
具体内容我们直接跳过，以后有缘再相见。

LDA是一个生成性的概率模型，试图找出跨文本频繁出现的词组。

这些词组相当于一个个topic。

输入词袋，LDA将它分解为
1. document-to-topic矩阵
2. word-to-topic矩阵

如果把这两个矩阵相乘，可以得到原词袋(以最小的错误)。

需要提供topic数量给LDA。

# LDA代码流程

# 读数据
df = pd.read_csv('movie_data.csv', encoding='utf-8')

df = df.rename(columns={"0": "review", "1": "sentiment"})

# 50000x2
print(f"df.shape {df.shape}")

# 起CountVectorizer
# max_features指定只保留最多的5000个单词
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)

# 文本list生成词袋
X = count.fit_transform(df['review'].values)

# 50000x5000的词袋
print(f"X {X}")
print(f"X.shape {X.shape}")

# LDA
# n_components指定topic数
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')

# 训练
X_topics = lda.fit_transform(X)


# 文本与所有topic的关系
# X_topics.shape (50000, 10)
# samples x topic
# 每个文本有个10个topic得分，值越大，越与该topic相关。
# 比如可以针对某个topic排序，就可以列出该topic的电影。
print(f"X_topics {X_topics}")
print(f"X_topics.shape {X_topics.shape}")


# components_ 10x5000
# topic与所有单词的关系
# components_[i, j]可解读为单词j被赋予topic i的次数
# 也可以解读为topic i在每个单词上的分布
# 次数越多，说明这个单词对这个topic越重要，越关键，越能代表这个topic。
# 比如如果'烂'这个字排在一个topic的前边，那么跟该topic比较相关的文本都跟'烂'沾边。
print(f"lda.components_.shape {lda.components_.shape}")
print(f"lda.components_ {lda.components_}")

n_top_words = 5
feature_names = count.get_feature_names_out()

# 5000个单词
print(f"feature_names {feature_names}")
print(f"feature_names.shape {feature_names.shape}")

# 每个topic的5000个单词按出现次数排序。
# 取最大的5个(关键词)代表这个topic。
# 我试了7个词。感觉更好一点。
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))

'''
Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool
'''

# 根据每个topic的关键词给这个topic起一个恰当的真实类别
# 1. Generally bad movies (not really a topic category)
# 2. Movies about families
# 3. War movies
# 4. Art movies
# 5. Crime movies
# 6. Horror movies
# 7. Comedies
# 8. Movies somehow related to TV shows
# 9. Movies based on books
# 10. Action movies

# 最后可以直接快速获取任何文本的topic
# 或者按topic找文本
# 例如horror是第6个topic。那么取X_topics第6列排序，可得到horror成分最高的文本。
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')

9 用回归分析预测连续变量#

这一章学习有监督学习的另一个子类:回归分析。

回归分析以连续的形式进行预测，在很多领域特别有用，比如预测股市，预测销量。

9.1 线性回归#

线性回归目的是对feature和连续目标变量之间的关系建模。

和分类不同，线性回归给出连续的预测值，而不是分类标签。

9.1.1 简单线性回归#

简单(单变量)线性回归对单个特征x，和一个连续的目标y的关系建模。

\({\Large y = w_1 x + b}\)

和直线方程一样。
w和b和最开始学的也一样，就是权重和bias。
目的就是要得到最终的权重(直线的斜率)。然后就可以预测后续的值了。

图9.1展示了1-feature线性回归。就是要找一条和数据最吻合的直线。

这条线叫做回归线。每个点的预测值与实际值的差就是offsets/residuals/errors。

9.1.2 多重线性回归#

简单线性回归推广到一般就是multiple linear regression。

\({\Large y = w_1 x_1 + \ ... \ + w_m x_m + b = \sum_{i = 1}^{m}w_i x_i + b = w^T x + b}\)

图9.2。对于2个feature，线性回归就是找一个与数据最吻合的平面。

简单和多重，本质原理和技术一致。

9.2 Ames Housing数据#

https://jse.amstat.org/v19n3/decock.pdf

数据是房屋的各种属性，状态/面积/房型/车库/位置/浴室/壁炉/价格等等。

有几十个feature。

9.2.1 加载数据#

整个数据是2930x80。简单起见，只挑选6个feature。

Overall Qual 整体质量数值
Overall Cond 整体状态数值
Gr Liv Area 地面以上面积数值
Central Air 是否有中央空调(Y/N)
Total Bsmt SF 地下室面积数值
SalePrice 价格数值

价格作为目标变量y，也就是要预测的值。

9.2.2 可视化#

拿到数据后直接做一些可视化是非常好的，有助于掌握数据。

scatterplot matrix是feature之间的两两关系。

pip3.11 install mlxtend==0.22.0 --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

mlxtend.plotting.scatterplotmatrix可画出所有feature的两两关系。

columns = ['Overall Qual', 'Overall Cond', 'Gr Liv Area',
           'Central Air', 'Total Bsmt SF', 'SalePrice']

# 读取csv。指定列。
df = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt',
    sep='\t',
    usecols=columns)

print(f"df.shape {df.shape}")
# df.shape (2930, 6)

# YN换成10
df['Central Air'] = df['Central Air'].map({'N': 0, 'Y': 1})

# 删除有na的行
df = df.dropna(axis=0)

# 画图
scatterplotmatrix(df.values, figsize=(12, 10), names=df.columns, alpha=0.5)
plt.tight_layout()

plt.show()

9.2.3 correlation matrix#

correlation matrix可以看作第五章中covariance matrix的rescale版本。

correlation matrix是一个方阵。
包含Pearson product-moment correlation coefficient(r值)。描述两个feature的相关性。
值的范围[-1,1]。r为1是完全的正相关。r为0是无关。r为-1是完全的负相关。

\({\LARGE r(x, y) = \frac{\sum_{i=1}^{n} [(x^i - \mu_x)(y^i - \mu_y)]}{\sqrt{\sum_{i=1}^{n} (x^i - \mu_x)^2} \sqrt{\sum_{i=1}^{n} (y^i - \mu_y)^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}}\)

\(\mu\)是feature的平均值

n是sample总数

例如前6个数据

     整体质量 整体状态  地下室面积  中央空调  地上面积  价格
0       6       5      1080.0      1       1656     215000
1       5       6      882.0       1       896      105000
2       6       6      1329.0      1       1329     172000
3       7       5      2110.0      1       2110     244000
4       5       5      928.0       1       1629     189900
5       6       6      926.0       1       1604     195500

平均值  5.83    5.5    1209        1       1537     186900

手算一个r(2, 5)=0.6279

(1080-1209)*(215000-186900) + \
(882-1209)*(105000-186900) + \
(1329-1209)*(172000-186900) + \
(2110-1209)*(244000-186900) + \
(928-1209)*(189900-186900) + \
(926-1209)*(195500-186900)
= 69538700

(1080-1209)**2 + \
(882-1209)**2 + \
(1329-1209)**2 + \
(2110-1209)**2 + \
(928-1209)**2 + \
(926-1209)**2
= 1108821

(215000-186900)**2 + \
(105000-186900)**2 + \
(172000-186900)**2 + \
(244000-186900)**2 + \
(189900-186900)**2 + \
(195500-186900)**2
= 11062600000

69538700/ sqrt(1108821) / sqrt(11062600000) = 0.6279

即地下室和房价有0.6279的正相关性。

mlxtend.plotting.heatmap可以画出这个数据的heatmap。

cm = np.corrcoef(df.values.T)

# 如果这样取前6个数据。可以看到对应的值确实是0.63和手算吻合。
# df = df.head(6)

hm = heatmap(cm, row_names=df.columns, column_names=df.columns)

plt.tight_layout()
plt.show()

9.3 现实一个ordinary least squares线性回归模型#

上面说了回归分析是要找一个最吻合的直线，或者超平面。

怎么算是吻合？方差的和最小就是最吻合。

ordinary least squares(OLS)方法用来找这个最小方差。

9.3.1 梯度下降#

回忆第二章的loss函数

\({\Large L(w, b)=\frac{1}{2n} \sum_{i=1}^{n}(y^i-\hat y^i))^2}\)

本质上OLS可以看作一个Adline，但没有threshold函数，这样可以得到连续的目标值而不是0/1分类。

基本和第二章的AdalineGD一样。

然后画出loss的迭代情况，和预测曲线。见图9.6和9.7。

9.3.2 scikit-learn的线性回归#

sklearn.linear_model.LinearRegression

直接调api即可。

9.4 RANSAC#

线性回归受离群值的影响较大，有时一小部分数据就能造成较大偏差。

很多统计学手段能检测离群值，不在本书讨论范围。

RANdom SAmple Consensus(RANSAC)用于改善离群值的问题。

随机选一些sample(称为inliers)，做fit。
用其余的数据进行测试，定一个容错范围/阈值，把测试结果在这个范围中的sample加入inliers。
再次用所有inliers进行fit。
评估error
到此结束一轮。按需要结束流程。

sklearn.linear_model.RANSACRegressor
max_trials=100 默认迭代100次
min_samples指定至少选取的sample数
residual_threshold默认None，median absolute deviation方法确定阈值。
loss可选 absolute_error/squared_error

median absolute deviation的算法

np.mean(np.abs(data - np.mean(data)))

ransac = RANSACRegressor(LinearRegression(),
                    max_trials=100, # default
                    min_samples=0.95,
                    loss='absolute_error', # default
                    residual_threshold=None, # default
                    random_state=123)

# X.shape (2929, 1)
# 1维的面积数据
print(f"X.shape {X.shape}")

# y.shape (2929,)
print(f"y.shape {y.shape}")

# 直接数据扔进去
ransac.fit(X, y)

# 训完后inlier_mask_存放了inlier标志位
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

# line_X = np.arange(3, 10, 1)
# 这里原代码应该是想画一个均匀的预测线？但范围太小，我改为0-6000。
line_X = np.arange(0, 6000, 1)

# 预测从0到6000
line_y_ransac = ransac.predict(line_X[:, np.newaxis])

# 画出inlier数据。蓝色
plt.scatter(X[inlier_mask], y[inlier_mask],
            c='steelblue', edgecolor='white',
            marker='o', label='Inliers')

# 画出outlier数据。绿色
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='limegreen', edgecolor='white',
            marker='s', label='Outliers')

# 画出预测曲线。黑色
plt.plot(line_X, line_y_ransac, color='black', lw=2)

plt.xlabel('Living area above ground in square feet')
plt.ylabel('Sale price in U.S. dollars')
plt.legend(loc='upper left')

plt.tight_layout()
plt.show()

# 打印相关参数
print(f'Slope: {ransac.estimator_.coef_[0]:.3f}')
print(f'Intercept: {ransac.estimator_.intercept_:.3f}')

调大residual_threshold参数，可包容更多的离群值。见图9.10。

9.5 评估线性回归模型的表现#

上一节只是用了1个feature。现在使用5个feature做LinearRegression。

target = 'SalePrice'
features = df.columns[df.columns != target]

print(f"features {features}")
'''
features Index(['Overall Qual', 'Overall Cond', 'Total Bsmt SF', 'Central Air', 'Gr Liv Area'], dtype='object')
'''

# X.shape (2929, 1)
X = df[features].values
y = df[target].values

# 分割训练数据
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# X_train.shape (2050, 5)
print(f"X_train {X_train}")
print(f"X_train.shape {X_train.shape}")

# y_train.shape (2050,)
print(f"y_train {y_train}")
print(f"y_train.shape {y_train.shape}")

slr = LinearRegression()

# 训练
slr.fit(X_train, y_train)

# 预测X_train和X_test数据
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)

# y_train_pred.shape (2050,)
print(f"y_train_pred {y_train_pred}")
print(f"y_train_pred.shape {y_train_pred.shape}")

# y_test_pred.shape (879,)
print(f"y_test_pred {y_test_pred}")
print(f"y_test_pred.shape {y_test_pred.shape}")

# 获取预测最大最小值。为了画图像
x_max = np.max([np.max(y_train_pred), np.max(y_test_pred)])
x_min = np.min([np.min(y_train_pred), np.min(y_test_pred)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3), sharey=True)

# 画test的预测值，和对应的test预测值与实际值的差。residuals
ax1.scatter(y_test_pred, y_test_pred - y_test,
            c='limegreen', marker='s', edgecolor='white',
            label='Test data')

# 画train的预测值，和对应的train预测值与实际值的差。residuals
ax2.scatter(y_train_pred, y_train_pred - y_train,
            c='steelblue', marker='o', edgecolor='white',
            label='Training data')
ax1.set_ylabel('Residuals')

for ax in (ax1, ax2):
    ax.set_xlabel('Predicted values')
    ax.legend(loc='upper left')
    ax.hlines(y=0, xmin=x_min-100, xmax=x_max+100, color='black', lw=2)

plt.tight_layout()

plt.show()

图9.11画出了预测值与其对应的residual值的关系。

在预测值与实际值的差的基础上，有多种指标来反映误差。

sklearn.metrics.mean_squared_error

sklearn.metrics.mean_absolute_error

coefficient of determination

sklearn.metrics.r2_score(\(R^2\))

9.6 线性回归中使用regularized方法#

最流行的对线性回归的正规化的方法
- ridge regression(L2) - least absolute shrinkage and selection
operator (LASSO)(L1)
- elastic net

ridge回归是mse loss函数加上L2正规化

\({\Large L(w)_{ridge} = \sum_{i=1}^{n} (y^i - \hat y^i)^2 + \lambda||w||_2^2}\)

\({\Large \lambda||w||_2^2 = \lambda \sum_{j=1}^{m} w_j^2}\)

n是sample数

m是feature数

bias并不会进行正规化

LASSO针对稀疏模型

\({\Large L(w)_{Lasso} = \sum_{i=1}^{n} (y^i - \hat y^i)^2 + \lambda||w||_1}\)

\({\Large \lambda||w||_1 = \lambda \sum_{j=1}^{m} |w_j|}\)

lasso的一个限制是当m>n时，只能选n个feature。
但有的场景下是个好处，能避免saturation。
saturation在m=n时发生，可能仅在训练时效果好，但泛化不好。

ridge回归和LASSO的折中的是elastic net。

既有L1生成稀疏性，又有L2以选择多余n个feature(当m>n时)。

\({\Large L(w)_{Elastic Net} = \sum_{i=1}^{n} (y^i - \hat y^i)^2 + \lambda||w||_2^2 + \lambda||w||_1}\)

sklearn.linear_model.Ridge

sklearn.linear_model.Lasso

sklearn.linear_model.ElasticNet

9.7 线性回归转为多项式回归#

\({\Large y = w_1 x + w_2x^2 + \ ... \ + w_dx^d + b}\)

sklearn.preprocessing.PolynomialFeatures用来把数据升到更高维度。

X = np.array([258.0, 270.0, 294.0,
            320.0, 342.0, 368.0,
            396.0, 446.0, 480.0, 586.0])\
             [:, np.newaxis]
# X.shape (10, 1)

y = np.array([236.4, 234.4, 252.8,
            298.6, 314.2, 342.2,
            360.8, 368.0, 391.2,
            390.8])
# y.shape (10,)

# X y还是一堆2d的数据

lr = LinearRegression()
pr = LinearRegression()
quadratic = PolynomialFeatures(degree=2)

# 转X为二阶
# a 会变成 [1, a, a^2]
X_quad = quadratic.fit_transform(X)

# 训练10x1
lr.fit(X, y)

X_fit = np.arange(250, 600, 10)[:, np.newaxis]
# X_fit.shape (35, 1)

# 预测35个
y_lin_fit = lr.predict(X_fit)
# y_lin_fit.shape (35,)

# X_quad.shape (10, 3)
# 训练二阶
pr.fit(X_quad, y)

# 预测二阶
y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))

# 画训练数据
plt.scatter(X, y, label='Training points')

# 画线性回归结果
plt.plot(X_fit, y_lin_fit, label='Linear fit', linestyle='--')

# 画多项式回归的结果
plt.plot(X_fit, y_quad_fit, label='Quadratic fit')
plt.xlabel('Explanatory variable')
plt.ylabel('Predicted or known target values')
plt.legend(loc='upper left')

plt.tight_layout()
plt.show()

# 打印mse和r2分数
y_lin_pred = lr.predict(X)
y_quad_pred = pr.predict(X_quad)

mse_lin = mean_squared_error(y, y_lin_pred)
mse_quad = mean_squared_error(y, y_quad_pred)
print(f'Training MSE linear: {mse_lin:.3f}'
      f', quadratic: {mse_quad:.3f}')

r2_lin = r2_score(y, y_lin_pred)
r2_quad = r2_score(y, y_quad_pred)
print(f'Training R^2 linear: {r2_lin:.3f}'
      f', quadratic: {r2_quad:.3f}')
'''
Training MSE linear: 569.780, quadratic: 61.330
Training R^2 linear: 0.832, quadratic: 0.982
'''

可以看到二次曲线与数据更吻合。

从error和r2分数来看，二次多项式回归明显提升了表现。

9.7.2 应用到房产数据#

对比线性/二次/三次。

9.8 使用随机森立处理非线性关系#

随机森林是多个决策树的集成，可以看作多个线性方程的集合。之前讨论的是整体的一个线性或非线性方程。

另一个角度，随机森林可把问题分成多个容易处理的问题。

9.8.1 决策树回归#

9.8.2 随机森林回归#

10 处理无标签数据-聚类分析#

前面的章节，我们学习了监督学习，正确结果是已知的。

现在要学一种无监督学习，在没有正确答案的情况下，从数据中找出隐藏的信息/结构。

clustering聚类的目的是从一堆数据中找出若干分组(cluster)，同个组中的数据相似，不同组中的数据区别大。

10.1 用k-means以相似度来分组#

10.1.1 scikit-learn的k-means#

k-means实现简单，而且效率高。属于prototype-based clustering。

另两种 hierarchical-based clustering

density-based clustering

prototype-based意思是每个组对应一个原型，通常这个原型就是某种中点。
例如一堆相似的feature值的平均值。
例如某种空间中的点，组中每个点都有一个所有点到这个点的距离的和，那么该和最小的那个点可以作为原型。

k-means非常适用于球形的数据。
一个缺点是我们必须指定组的数量k。k选得不好，效果就不好。
elbow和silhouette plots可评估效果，帮助选择k。

用sklearn.datasets.make_blobs可以生成一堆数据用以测试聚类。

见图10.1

k-means流程
1. 随机选k个中心点作为起点
2.
把所有sample关联到最近的中点。这里是所有点，即使数据超大，也是要全部关联一遍。
3. 重新计算当前中点关联的所有点的中点，更新中点。 4.
重复2和3直到中点不再变化，或者误差达到某种预定的阈值，或者达到最大迭代次数。

如何定义物体之间的相似程度？
可以是squared Euclidean distance
即差方和

\({\Large d(x, y)^2 = \sum_{j=1}^{m}(x_j - y_j)^2 = ||x-y||_2^2}\)

m是feature数

基于欧氏距离，kmeans可以看作一个优化问题，让差方和(sum of squared errors/SSE)最小。这个差方和也叫做cluster inertia。

\({\Large SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{i,j} ||x^i-\mu^j||_2^2}\)

n是sample数
k是组数
\(\mu^j\)是组j的中点
\(w^{i,j}\)意思是sample
i是否在组j中。合起来就是只算组内的值，如果不在组内，w就是0，被略去。

# 生成数据
X, y = make_blobs(n_samples=150,
    n_features=2,
    centers=3,
    cluster_std=0.5,
    shuffle=True,
    random_state=1111)

print(f"X {X}")
print(f"X.shape {X.shape}")
print(f"y {X}")
print(f"y.shape {y.shape}")

# 生成150个sample。3个组。也就是0/1/2，3个类。
# X.shape (150, 2)
# y.shape (150,)

# init选random是随机从sample中选初始中点
# n_init是kmeans算法的运行次数，每次重新选中点跑一次，取结果(SSE)最好的一次。
# tol是决定收敛的阈值。
# 如果该次结果与上次的差小于阈值，提前结束该次算法(不用算满max_iter次)。直到完成n_init次kmeans。
km = KMeans(n_clusters=3,
            init='random',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)

# 训练并分组
y_km = km.fit_predict(X)

print(f"y_km {y_km}")
print(f"y_km.shape {y_km.shape}")
# y_km.shape (150,)

# 画出三个分组的数据
plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            marker='s', edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            marker='o', edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km == 2, 0],
            X[y_km == 2, 1],
            s=50, c='lightblue',
            marker='v', edgecolor='black',
            label='Cluster 3')

# 画出三个质心
plt.scatter(km.cluster_centers_[:, 0],
            km.cluster_centers_[:, 1],
            s=250, marker='*',
            c='red', edgecolor='black',
            label='Centroids')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
plt.show()

10.1.2 用kmeans++更好地初始化质点#

随机选初始质点可能效果不佳。

k-means++算法让初始质点相互远离。收敛效果会更好。

10.1.3 软硬聚类#

硬意思是一个sample只能属于一个cluster。

软(fuzzy聚类)意思是一个sample可以属于多个cluster。

FCM算法

\({\Large J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} {w^{i,j}}^m ||x^i-\mu^j||_2^2}\)

和之前的SSE很类似，但是w变为i在组j中的概率，而不是0/1二值。而且加上了一个指数。具体不再深入。

10.1.4 elbow方法找出较好的k值#

无监督的一大挑战就是不知道正确答案。

对聚类来说，需要对比不同的模型/参数，比如SSE等，来选一个较好的配置。

sklearn.cluster.KMeans结束后，其inertia_就是SSE值，可以进行多次分组，比较每次的inertia_值。

distortions = []
for i in range(1, 11):
    # 每次选不同的k值
    km = KMeans(n_clusters=i,
        init='k-means++',
        n_init=10,
        max_iter=300,
        random_state=0)

    km.fit(X)

    # 记录SSE
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
plt.show()

见图10.3。k值从3开始，SSE明显变小。

所谓elbow就是找一个明显的拐点(曲线中像胳膊肘一样)，即3，作为最后的k值。

10.1.5 silhouette plots方法量化聚类的效果#

另一个有效的指标是silhouette analysis，不仅聚类，其他算法也适用。

是一个图形工具，

计算单个sample的silhouette coefficient的流程
1.
计算cluster cohesion，\(a^{(i)}\)，该sample到所有本组其他点的平均距离。
2.
计算cluster separation，\(b^{(i)}\)，找除了本组外最近的一个组，算该sample到此组所有点的平均距离。
3. 计算silhouette，\(s^{(i)}\)
4.
\({\LARGE s^{(i)} = \frac{b^{(i)} - a^{(i)}}{ max\{b^{(i)}, a^{(i)} \}}}\)

silhouette coefficient范围限定在[-1, 1]。如果ab相等，值为0。

b越大与a，效果越好。因为b代表了sample与其他组的不相似程度。越大那么区分度越高。

km = KMeans(n_clusters=3,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)

# X.shape (150, 2)
# 聚类
y_km = km.fit_predict(X)
# y_km.shape (150,)

# 去掉重复
cluster_labels = np.unique(y_km)
# cluster_labels [0 1 2]
# cluster_labels.shape (3,)

# 拿到3
n_clusters = cluster_labels.shape[0]

# 给定原始数据，和分号的组标签。计算每个sample的Silhouette Coefficient
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
# silhouette_vals.shape (150,)
# 得到150个sample的Silhouette Coefficient值

y_ax_lower, y_ax_upper = 0, 0
yticks = []

# 对于每个label。0,1,2
for i, c in enumerate(cluster_labels):
    # 抽取该label的Silhouette Coefficient值
    c_silhouette_vals = silhouette_vals[y_km == c]

    # 排序
    c_silhouette_vals.sort()

    # 高点+=数据数量
    y_ax_upper += len(c_silhouette_vals)

    # color map
    color = cm.jet(float(i) / n_clusters)

    # 画水平的bar图
    # 从下往上画每个sample的值。上面排过序，所以是刀一样的形状。
    plt.barh(range(y_ax_lower, y_ax_upper),
        c_silhouette_vals, height=1.0,
        edgecolor='none', color=color)

    yticks.append((y_ax_lower + y_ax_upper) / 2.)

    # 更新底部高度
    y_ax_lower += len(c_silhouette_vals)

# 计算平均值
silhouette_avg = np.mean(silhouette_vals)

# 画出平均值的竖线
plt.axvline(silhouette_avg, color="red", linestyle="--")

plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')

plt.tight_layout()
plt.show()

matplotlib.cm.jet
https://matplotlib.org/stable/tutorials/colors/colormaps.html
没有太懂，总之是画一种color map。暂时忽略。

这样可以看到每个cluster的图像。平均值越高越好。只要每个cluster的底部不要靠近0，就说明效果还可以。

如果把n_clusters参数改为2，画出图像，可以看到其中一个cluster明显和平均值差距很大，且最差值趋近于0。

10.2 把聚类做成树型结构#

k-means是prototype-based clustering。

现在看hierarchical-based clustering，一个好处是可以画出dendrograms(树型图)。有助于创建更有用的taxonomy(分类)。

另一个大优点是不用像k-means那样指定cluster数量。

两个主要方法
- agglomerative(聚)
- divisive(分)

divisive，从包含所有sample的1个cluster开始，不断分裂，直到每个cluster只包含一个sample？

agglomerative是反向操作。初始每个sample都是一个cluster，然后不断merge。也就是自底向上。

10.2.1 自底向上#

自底向上的两个标准算法
- single linkage
- complete linkage

single linkage，两个cluster中最接近的两个sample的距离作为这两个cluster的距离，算出所有cluster之间的两两距离，merge最近的2个cluster。

complete
linkage，最接近的两个sample的距离替换为最远的两个sample的距离。
见图10.7
我们关注complete linkage。

流程
计算sample之间的两两距离(linkage matrix)
每个sample作为一个cluster
按上述方法merge两个最远的cluster 4. 更新linkage matrix
重复2-4直到只剩一个cluster

scipy.spatial.distance.pdist计算两两之间的距离。可选各种距离比如欧氏距离/海明距离等。

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

scipy.spatial.distance.squareform把pdist的结果转成矩阵形式。
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html#scipy.spatial.distance.squareform
见图10.9

scipy.cluster.hierarchy.linkage
进行一个agglomerative聚类操作。
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
可以输入
1.
一个1-D condensed distance matrix也就是pdist的结果。
2. 原始sample，并指定method和metric。

如果传一个squareform的结果，会报警。

linkage返回一个(n-1)x4的矩阵Z。存放了树型信息。例如

[[0.         4.         3.83539555 2.        ]
 [1.         2.         4.34707339 2.        ]
 [3.         5.         5.89988504 3.        ]
 [6.         7.         8.31659367 5.        ]]

n=5。按照它的逻辑

第i=0次迭代，组0和组4merge成组5。组0和组4的距离为3.83。merge以后，组5中的sample数量为1+1=2。

第i=1次迭代，组1和组2merge成组6。组1和组2的距离为4.34。merge以后，组6中的sample数量为1+1=2。

第i=2次迭代，组3和组5merge成组7。组3和组5的距离为5.899。merge以后，组7中的sample数量为1+2=3。

第i=3次迭代，组6和组7merge成组8。组6和组7的距离为8.31。merge以后，组8中的sample数量为2+3=5。

然后scipy.cluster.hierarchy.dendrogram可以把linkage的结果画成树型图。
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
见图10.11

variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']

# 随机生成5x3的数据。[0, 10]之间
X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)

print(f"df {df}")
print(f"df.shape {df.shape}")
# df.shape (5, 3)

# pdist算出两两欧氏距离，再squareform转成矩阵。
row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels,
                        index=labels)

# row_dist.shape (5, 5)

# linkage传入pdist结果
row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')

print(f"row_clusters {row_clusters}")
print(f"row_clusters.shape {row_clusters.shape}")
# row_clusters.shape (4, 4)

# 画出树型图
row_dendr = dendrogram(row_clusters, labels=labels)
plt.tight_layout()
plt.ylabel('Euclidean distance')
plt.show()

10.2.3 树型图挂到热图上#

树型图经常搭配热图食用。

暂时忽略。

10.2.4 scikit-learn的agglomerative聚类#

sklearn.cluster.AgglomerativeClustering
n_clusters指定cluster数量。
metric指定linkage的metric。默认欧氏。
linkage。single/complete等方法。

ac = AgglomerativeClustering(n_clusters=3,
    affinity='euclidean',
    linkage='complete')

labels = ac.fit_predict(X)

print(f'Cluster labels: {labels}')
# Cluster labels: [1 0 0 2 1]

ac = AgglomerativeClustering(n_clusters=2,
    affinity='euclidean',
    linkage='complete')

labels = ac.fit_predict(X)

print(f'Cluster labels: {labels}')
# Cluster labels: [0 1 1 0 0]

10.3 用DBSCAN定位高密度区域#

再看一种density-based聚类

density-based spatial clustering of applications with noise(DBSCAN)

有的数据不适合kmeans或agglomerative。见图10.15。这种情况用DBSCAN能解决。

DBSCAN主要依数据的密度来贴标签。

密度的定义为半径为\(\varepsilon\)范围内sample的数量。

先给数据贴上一些特殊的标签：
1.
某个sample的半径\(\varepsilon\)范围内，如果至少有MinPts个sample，打上core point标签。
2.
范围内没有MinPts个sample，但处于某个core point的范围，打上border point。
3. 其他都是noise point

然后
1.
距离在\(\varepsilon\)范围内的core point连起来形成一个cluster，单独的一个core point也形成一个cluster。
2. 把所有border point归到它所属的core point的cluster。

DBSCAN的主要有点是，它可以处理非球形的数据，比如一些相互穿插或者包围的。

比如如果数据是一个圆形外面再套一个环形，kmeans处理不了。而DBSCAN能较好处理。

图10.15和图10.16可以看到对于这种月牙型交错数据。

sklearn.cluster.DBSCAN胜出。

11 从0开始实现一个人工神经网络#

深度学习可以认为是一个机器学习的子领域。专注于高效地训练多层人工neural networks(NNs)。

这一章学人工神经网络的基本知识。后续会学深度学习和深度神经网络(DNN)。

11.1 用人工神经网络为复杂函数建模#

最开始我们学了人工神经元。人工神经元是多层人工NN的基石。

人工NN本质就是模拟人脑来解决问题。最早可追溯到1940年。
不过到1950年，很多研究者开始失去了兴趣，因为没有一个好的多层人工NN的解决方案。
https://en.wikipedia.org/wiki/AI_winter
直到1986年又死火重燃，因为有人发现了有效的算法。

今天深度学习加持的复杂NN已经又各种应用。

11.1.1 单层神经网络复习#

复习第二章的神经元，梯度下降等。

#### 11.1.2 多层神经网络

多个单神经元连接到一个多层的feedforward NN。
形成一个网络，称为multilayer perceptron，MLP，多层感知器。
图11.2展示了2层MLP。

input层完全连接到一个隐藏层，隐藏层又完全连接到output层。

如果隐藏层数超过1。就叫做deep NN。

每一层都会通过一个权重因子与下一层连接。
\({\Large w_{j,k}^{(l)}}\)
表示第l层到l+1层的连接中，\(w_k\)到\(w_j\)的的权重因子。注意jk是反的，可以理解为从后往前吸收。
每个w可以连到下一层的任意w，并不是按顺序一一对应。

所以整个W是一个矩阵形式，包含两层w之间的所有两两数据。

不一定是个方阵，因为每层神经元的w(树突)数量可以不一样。

11.1.3 forward propagation激活神经网络#

这一节看forward propagation流程，计算MLP模型的输出。

MLP的学习流程
1.
从input开始，我们训练数据往前推，所谓forward propagation。通过网络，最终产生一个输出。
2. 基于这个output，算出loss。希望这个loss最小。
3.
把这个loss反推(backpropagate)，求出其对于权重和bias的偏导。更新模型。

也就是最开始的adline的流程的一形式。

考察隐藏层的第一个输入值

\({\Large z_1^h = x_1^{in}w_{1,1}^h + x_2^{in}w_{1,2}^h + \ ... \ + x_m^{in}w_{1,m}^h + b_1}\)

\({\Large a_1^h = \sigma(z_1^h)}\)

m是feature数。
\(z_1^h\)是隐藏层的输入。
\(\sigma\)是激活函数，和第2/3章一样。
前一层的所有x值乘上与下一层对应输入点的对应权重，再加起来就是下一层的第一个输入。

为了能解决复杂问题例如图像识别，我们需要一个非线性的激活函数。

例如第三章的sigmoid

\({\Large p= \sigma(z) = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}} }\)

能把z值映射到logistic分布的[0,1]。

可以把MLP中的神经元看作返回连续[0,1]值得逻辑回归单。

把公式写扩展到所有feature

\({\Large z^h = x^{in}W^{hT} + b^h}\)

\({\Large a^h = \sigma(z^h)}\)

如果in层的feature数是f1，h层的feature数是f2。

x的shape做成1xf1。W一定是f2xf1。T一下再乘出来z就是1xf2。

上面x是1xf1，即针对1个sample。

还可扩展到所有sample，即x的shape做成nxf1。

\({\Large Z^h = X^{in}W^{hT} + b^h}\)

\({\Large A^h = \sigma(Z^h)}\)

11.2 分辨手写数字#

这一节看一个NN的例子。

非常流行和重要的数据集
Mixed National Institute of Standards and Technology(MNIST)
#### 11.2.1 准备数据

训练数据为250个人的手写数字。一半人是高中生，一半人是人口普查局雇员。

测试数据也是一样的比例，但是不同的人。

每个图片为28x28=784像素。每个像素的值为[0, 255]。

11.2.2 实现多层感知器#

看代码。先看懂大致流程。backward方法具体在下一节学习。到时再会过来看代码。

# 数据集包含70000个sample
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

X = X.values
y = y.astype(int).values

print(f"X\n{X}")
print(f"X.shape {X.shape}")
# 70000个。每个784像素。
# X.shape (70000, 784)

print(f"y\n{X}")
print(f"y.shape {y.shape}")
# 70000个class。0到9。
# y.shape (70000,)

# 把像素值从[0,255]归到[-1,1]
X = ((X / 255.) - .5) * 2

# 分10000为test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=10000, random_state=123, stratify=y)
# test后续没用了

# 分5000为valid。55000为train。
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=5000, random_state=123, stratify=y_temp)

print(f"X_train\n{X_train}")
print(f"X_train.shape\n{X_train.shape}")
# X_train.shape (55000, 784)

print(f"y_train\n{y_train}")
print(f"y_train.shape\n{y_train.shape}")
# y_train [4 5 4 ... 4 3 0]
# y_train.shape (55000,)


print(f"X_valid\n{X_valid}")
print(f"X_valid.shape\n{X_valid.shape}")
# X_valid.shape (5000, 784)


print(f"y_valid\n{y_valid}")
print(f"y_valid.shape\n{y_valid.shape}")
# y_valid [5 5 1 ... 2 4 9]
# y_valid.shape (5000,)

# 最后就是55000的train和5000的valid


def sigmoid(z):
    return 1. / (1. + np.exp(-z))


# onehot处理list y。
# label转为数字，list扩展为num_labels列，label对应的列为1，其他值为0。
def int_to_onehot(y, num_labels):

    ary = np.zeros((y.shape[0], num_labels))
    for i, val in enumerate(y):
        ary[i, val] = 1

    return ary


# 神经网络类
class NeuralNetMLP:

    # num_features 784 num_hidden 50 num_classes 10
    def __init__(self, num_features, num_hidden, num_classes, random_seed=123):
        super().__init__()

        # 10个class
        self.num_classes = num_classes

        # hidden
        rng = np.random.RandomState(random_seed)

        # 取正态分布随机数。中点为0，sd为0.1。
        self.weight_h = rng.normal(
            loc=0.0, scale=0.1, size=(num_hidden, num_features))
        # self.weight_h.shape (50, 784)
        # 784个feature输入到50个隐层节点

        # 初始50个0
        self.bias_h = np.zeros(num_hidden)

        # 取正态分布随机数
        self.weight_out = rng.normal(
            loc=0.0, scale=0.1, size=(num_classes, num_hidden))
        # self.weight_out.shape (10, 50)
        # 50个隐层节点输出为10个class

        # 初始10个0
        self.bias_out = np.zeros(num_classes)

    # forward就是正向从输入x到输出class
    def forward(self, x):
        # 按公式。从input到h层。
        z_h = np.dot(x, self.weight_h.T) + self.bias_h
        a_h = sigmoid(z_h)

        # 按公式。从h层到out层。
        z_out = np.dot(a_h, self.weight_out.T) + self.bias_out
        a_out = sigmoid(z_out)
        return a_h, a_out

    def backward(self, x, a_h, a_out, y):

        #########################
        ### Output layer weights
        #########################
        # onehot encoding
        y_onehot = int_to_onehot(y, self.num_classes)

        # Part 1: dLoss/dOutWeights
        ## = dLoss/dOutAct * dOutAct/dOutNet * dOutNet/dOutWeight
        ## where DeltaOut = dLoss/dOutAct * dOutAct/dOutNet
        ## for convenient re-use
        # input/output dim: [n_examples, n_classes]
        d_loss__d_a_out = 2. *(a_out - y_onehot) / y.shape[0]

        # input/output dim: [n_examples, n_classes]
        d_a_out__d_z_out = a_out * (1. - a_out) # sigmoid derivative

        # output dim: [n_examples, n_classes]
        delta_out = d_loss__d_a_out * d_a_out__d_z_out # "delta (rule) placeholder"

        # gradient for output weights

        # [n_examples, n_hidden]
        d_z_out__dw_out = a_h

        # input dim: [n_classes, n_examples] dot [n_examples, n_hidden]
        # output dim: [n_classes, n_hidden]
        d_loss__dw_out = np.dot(delta_out.T, d_z_out__dw_out)
        d_loss__db_out = np.sum(delta_out, axis=0)


        #################################
        # Part 2: dLoss/dHiddenWeights
        ## = DeltaOut * dOutNet/dHiddenAct * dHiddenAct/dHiddenNet * dHiddenNet/dWeight
        # [n_classes, n_hidden]
        d_z_out__a_h = self.weight_out

        # output dim: [n_examples, n_hidden]
        d_loss__a_h = np.dot(delta_out, d_z_out__a_h)

        # [n_examples, n_hidden]
        d_a_h__d_z_h = a_h * (1. - a_h) # sigmoid derivative

        # [n_examples, n_features]
        d_z_h__d_w_h = x

        # output dim: [n_hidden, n_features]
        d_loss__d_w_h = np.dot((d_loss__a_h * d_a_h__d_z_h).T, d_z_h__d_w_h)
        d_loss__d_b_h = np.sum((d_loss__a_h * d_a_h__d_z_h), axis=0)

        return (d_loss__dw_out, d_loss__db_out,
                d_loss__d_w_h, d_loss__d_b_h)


model = NeuralNetMLP(num_features=28 *28,
                     num_hidden=50,
                     num_classes=10)


num_epochs = 50
minibatch_size = 100

# minibatch_generator(X_train, y_train, minibatch_size)
# 传入X_train, y_train
def minibatch_generator(X, y, minibatch_size):
    indices = np.arange(X.shape[0])
    # range(0, 55000)

    # 从0到54999打乱
    np.random.shuffle(indices)

    # range跨度为minibatch_size
    for start_idx in range(0, indices.shape[0] - minibatch_size + 1, minibatch_size):

        # 每次取minibatch_size个index
        batch_idx = indices[start_idx:start_idx + minibatch_size]

        # 返回55000sample里的minibatch_size个x和y
        yield X[batch_idx], y[batch_idx]


def train(model, X_train, y_train, X_valid, y_valid, num_epochs,
          learning_rate=0.1):

    epoch_loss = []
    epoch_train_acc = []
    epoch_valid_acc = []

    # 50次
    for e in range(num_epochs):

        # 每次拿100个数据
        minibatch_gen = minibatch_generator(X_train, y_train, minibatch_size)

        for X_train_mini, y_train_mini in minibatch_gen:

            # 正向算。从输入到输出的class。
            a_h, a_out = model.forward(X_train_mini)

            # 反向算
            # 算出针对多个sample的，loss对每个w_h/b_h/w_out/b_out的偏微分。
            d_loss__d_w_out, d_loss__d_b_out, d_loss__d_w_h, d_loss__d_b_h = model.backward(X_train_mini, a_h, a_out, y_train_mini)

            # 更新w和b
            model.weight_h -= learning_rate * d_loss__d_w_h
            model.bias_h -= learning_rate * d_loss__d_b_h
            model.weight_out -= learning_rate * d_loss__d_w_out
            model.bias_out -= learning_rate * d_loss__d_b_out

        # 打印
        train_mse, train_acc = compute_mse_and_acc(model, X_train, y_train)
        valid_mse, valid_acc = compute_mse_and_acc(model, X_valid, y_valid)
        train_acc, valid_acc = train_acc *100, valid_acc *100
        epoch_train_acc.append(train_acc)
        epoch_valid_acc.append(valid_acc)
        epoch_loss.append(train_mse)
        print(f'Epoch: { e +1:03d}/{num_epochs:03d} '
              f'| Train MSE: {train_mse:.2f} '
              f'| Train Acc: {train_acc:.2f}% '
              f'| Valid Acc: {valid_acc:.2f}%')

    return epoch_loss, epoch_train_acc, epoch_valid_acc

11.3 人工神经网络的训练#

11.3.1 loss函数的计算#

目前采用MSE loss来训练多层神经网络。后续章节学习其他的loss函数。

按图11.2，假设计算某个sample，\(a^{out}\)是输出值，y是正确答案。

\({\LARGE a^{out} = \begin{bmatrix} 0.1 \\ 0.9 \\ \vdots \\ 0.3 \end{bmatrix} , \ \ \ y = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix} }\)

假设class有t种，y是经过onehot处理过的，包含t个元素。

只有一个对应原label的元素为1，其他为0。

对于这一个sample有mse loss函数

\({\Large L(w, b) = \frac{1}{t} \sum_{j = 1}^{t} (y_j - a_j^{out})^2 }\)

这里其实和之前的loss函数有区别，之前都是一个sample只有一个输出y。
然后mse是针对所有sample取平均。
现在是一个sample有t个输出，对于这个t个输出，算mse。

扩展到所有sample，就是外面套一个sample数的循环。

\({\Large L(W, b) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{t} \sum_{j = 1}^{t} (y_j^{[i]} - a_j^{out[i]})^2 }\)

n为sample总数。

要使L最小，需要计算偏微分

最简单的神经元的权重是1维，一个list。
两层神经元之间的权重是2维，因为每个unit都会对应一个list。
多层神经元的权重就是每两层之间的权重的list，就是3维。
见图11.10

图中\(W^{h}\)和\(W^{out}\)的权重数据量相等，是方便演示。实际中不一定，除非层中的unit数量真的一样。

11.3.2 backpropagation的认识#

backpropagation虽然在30年前发明，现在仍然是一种广泛使用的高效NN算法。

本质上可以把backpropagation看作一个计算方法，可以高效地计算多层NN中的loss函数(复杂/非凸)的偏微分。

NN的难点在于多feature的空间中，参数巨多。
而且loss函数是非凸的，不平滑，有各种bump，很难找到最优点。
相比简单的adline，loss函数是convex或者smooth的。

求导的链式法则

\({\Large \frac{d}{dx} [f(g(x))] = \frac{df}{dg} \frac{dg}{dx}}\)

可以做成任意长度

\({\Large \frac{d}{dx} F(x) = \frac{d}{dx} f(g(h(u(x)))) = \frac{df}{dg} \frac{dg}{dh} \frac{dh}{du} \frac{du}{dx}}\)

有一种技术叫automatic differentiation，能高效地算这种偏微分。

有两种模式forward/reverse。

backpropagation是reverse的一种特殊形式。

用forward模式算链式法则太慢，因为要算很多矩阵相乘。
而reverse模式是从右往左算，矩阵先乘vector，得到vector，再往前乘矩阵。
这样每次都是算矩阵乘vector，而不是矩阵乘矩阵。这样大大提高效率。

https://arxiv.org/pdf/1404.7456.pdf

11.3.3 使用backpropagation训练NN#

之前推过forward propagation公式的input到h层

\({\Large Z^h = X^{in}W^{hT} + b^h}\)

\({\Large A^h = \sigma(Z^h)}\)

h层到output层类似

\({\Large Z^{out} = A^{h}W^{outT} + b^{out}}\)

\({\Large A^{out} = \sigma(Z^{out})}\)

现在要更新\({\Large w_{1,1}^{out}}\)，就要算loss函数对于\({\Large w_{1,1}^{out}}\)的偏微分。

这里个人理解是
原则上L是整个sample的loss，也就是所有y都要算上。这里\({\Large w_{1,1}^{out}}\)特殊，紧挨着\({\Large a_{1}^{out}}\)，所以其他a项不会做出贡献，直接忽略。
后面会看到如果再往前推，就需要算多个L。
再继续理解，要算某个w，要算所有能达到它的路径。
不能达到的路径，其中肯定包含了此w完全无关的项，那么偏微分就是0，那么乘出来整个就是0，就是不用算。

可以用链式法则这样凑三项

\({\Huge \frac{\partial L}{\partial w_{1,1}^{out} } = \frac{\partial L}{\partial a_{1}^{out} } \frac{\partial a_1^{out}}{\partial z_{1}^{out} } \frac{\partial z_1^{out}}{\partial w_{1,1}^{out} }}\)

右边一项一项来

\({\Huge \frac{\partial L}{\partial a_{1}^{out}} = \frac{\partial }{\partial a_{1}^{out}} (y_1 - a_1^{out})^2}\)

\({\Large \ \ \ \ \ \ \ \ = 2(a_1^{out} - y_1)}\)

这里的L是一个完整的mse均方差，但算偏微分只剩下包含\({\Large a_{1}^{out}}\)的这一项有效，其他都是0。

另外按理需要除一个class数量，这里省略，代码里会除。

\({\Huge \frac{\partial a_{1}^{out}}{\partial z_{1}^{out}} = \frac{\partial}{\partial z_1^{out}} \frac{1}{1+e^{-z_1^{out}}} }\) (下面简写成z)

\({\Large \ \ \ \ \ \ \ \ = -(1+e^{-z})^{-2} \frac {\partial }{\partial z} e^{-z} }\)

\({\Large \ \ \ \ \ \ \ \ = (1+e^{-z})^{-2} e^{-z} }\)

\({\Large \ \ \ \ \ \ \ \ = \frac{1}{1+e^{-z}} \frac{e^{-z}}{1+e^{-z}} }\)

\({\Large \ \ \ \ \ \ \ \ = \frac{1}{1+e^{-z}} ( 1- \frac{1}{1+e^{-z}}) }\)

\({\Large \ \ \ \ \ \ \ \ = a_{1}^{out}(1- a_{1}^{out}) \ \ \ \ (刚好可以用a_{1}^{out}凑出来) }\)

\({\Huge \frac{\partial z_1^{out}}{\partial w_{1,1}^{out}} = \frac{\partial}{\partial w_{1,1}^{out}} (a_1^h w_{1,1}^{out} + b_1^{out}) }\)

\({\Large \ \ \ \ \ \ \ \ = a_1^h }\)

三项乘起来，最终

\({\Huge \frac{\partial L}{\partial w_{1,1}^{out} } = 2(a_1^{out} - y_1) a_{1}^{out}(1- a_{1}^{out}) a_1^h }\)

对bias求偏导的话最后一项为1。

可以看到还是挺简洁的。那么给它乘上学习率eta，就可以更新\({\Large w_{1,1}^{out}}\)了。

\({\Large w_{1,1}^{out} = w_{1,1}^{out} - 2 \eta (a_1^{out} - y_1) a_{1}^{out}(1- a_{1}^{out}) a_1^h}\)

见图11.12。要知道这只是往前推了一步，更新\({\Large w_{1,1}^{out}}\)。

我觉得把它推广到一般能更好理解，并转换成矩阵相乘。

\({\Huge \color{Purple} \frac{\partial L}{\partial w_{j,i}^{out} } = 2(a_j^{out} - y_j) a_{j}^{out}(1- a_{j}^{out}) a_i^h }\)

h层下标是i，out层下标是j。

这时手算几个数据就能找到规律，可以把单个值推广到数组。

# a_out和y都是1xJ
delta = 2 * (a_out - y) * (a_out * (1 - a_out))

# a_h是I个。1xI
# 乘出来一个JxI的矩阵。包含了L对所有w的偏微分。
d_L_d_w_out = np.dot(delta.T, a_h)

再往前一步要更新\({\Large w_{1,1}^{h}}\)，见之前的理解，要算从y通向它的所有路径。

见图11.13

\({\Huge \frac{\partial L}{\partial w_{1,1}^{h} } = \frac{\partial L}{\partial a_{1}^{out} } \frac{\partial a_1^{out}}{\partial a_{1}^{h} } \frac{\partial a_1^{h}}{\partial w_{1,1}^{h}}}\)

\({\Huge \ \ \ \ \ \ \ \ + \frac{\partial L}{\partial a_{2}^{out} } \frac{\partial a_2^{out}}{\partial a_{1}^{h} } \frac{\partial a_1^{h}}{\partial w_{1,1}^{h}}}\)

同样链式法则把z凑进去

\({\Huge \frac{\partial L}{\partial w_{1,1}^{h} } = \frac{\partial L}{\partial a_{1}^{out}} \frac{\partial a_1^{out}}{\partial z_1^{out}} \frac{\partial z_1^{out}} {\partial a_{1}^{h}} \frac{\partial a_1^{h}} {\partial z_{1}^{h} } \frac{\partial z_1^{h}}{\partial w_{1,1}^{h}}}\)

\({\Huge \ \ \ \ \ \ \ \ \ + \frac{\partial L}{\partial a_{2}^{out}} \frac{\partial a_2^{out}}{\partial z_2^{out}} \frac{\partial z_2^{out}} {\partial a_{1}^{h}} \frac{\partial a_1^{h}} {\partial z_{1}^{h} } \frac{\partial z_1^{h}}{\partial w_{1,1}^{h}}}\)

可以发现其中的的\({\Huge \frac{\partial L}{\partial a_{1}^{out} } \frac{\partial a_1^{out}}{\partial z_{1}^{out}}}\) 部分，之前算\({\Large w_{1,1}^{out}}\)时算过。是可以复用的。

\({\Huge \delta_1^{out} = \frac{\partial L}{\partial a_{1}^{out} } \frac{\partial a_1^{out}}{\partial z_{1}^{out}} = 2(a_1^{out} - y_1) a_{1}^{out}(1- a_{1}^{out})}\)

推广到一般

\({\Huge \color{Purple} \delta_i^{out} = \frac{\partial L}{\partial a_{i}^{out} } \frac{\partial a_i^{out}}{\partial z_{i}^{out}} = 2(a_i^{out} - y_i) a_{i}^{out}(1- a_{i}^{out})}\)

算后三项

\({\Huge \frac{\partial z_1^{out}}{\partial a_{1}^{h}} = \frac{\partial}{\partial a_{1}^{h}} (a_1^h w_{1,1}^{out} + b_1^{out}) }\)

\({\Large \ \ \ \ \ \ \ \ = w_{1,1}^{out}}\)

\({\Huge \frac{\partial a_{1}^{h}}{\partial z_{1}^{h}} = \frac{\partial}{\partial z_1^{h}} (\frac{1}{1+e^{-z_1^{h}}}) }\)

\({\Large \ \ \ \ \ \ \ \ = a_{1}^{h}(1- a_{1}^{h})}\)

\({\Huge \frac{\partial z_1^{h}}{\partial w_{1,1}^{h}} = \frac{\partial}{\partial w_{1,1}^{h}} (x_1 w_{1,1}^{h} + b_1^{h}) }\)

\({\Large \ \ \ \ \ \ \ \ = x_1 }\)

第二组类似，就是把下标换一下。

最终

\({\Huge \frac{\partial L}{\partial w_{1,1}^{h} } = 2(a_1^{out} - y_1) a_{1}^{out}(1- a_{1}^{out}) w_{1,1}^{out} a_{1}^{h}(1- a_{1}^{h}) x_1 }\)

\({\Huge \ \ \ \ \ \ \ \ \ + 2(a_2^{out} - y_2) a_{2}^{out}(1- a_{2}^{out}) w_{2,1}^{out} a_{1}^{h}(1- a_{1}^{h}) x_1 }\)

进一步推广找规律

\({\Huge \color{Purple} \frac{\partial L}{\partial w_{1,1}^{h} } = \delta_1 w_{1,1}^{out} a_{1}^{h}(1- a_{1}^{h}) x_1 }\)

\({\Huge \color{Purple} \ \ \ \ \ \ \ \ \ + \delta_2 w_{2,1}^{out} a_{1}^{h}(1- a_{1}^{h}) x_1 }\)

\({\color{Purple} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ ......}\)

\({\Huge \color{Purple} \ \ \ \ \ \ \ \ \ + \delta_t w_{t,1}^{out} a_{1}^{h}(1- a_{1}^{h}) x_1 }\)

加一个\({\Large w_{2,1}}\)的例子

\({\Huge \color{Purple} \frac{\partial L}{\partial w_{2,1}^{h} } = \delta_1 w_{1,2}^{out} a_{2}^{h}(1- a_{2}^{h}) x_1 }\)

\({\Huge \color{Purple} \ \ \ \ \ \ \ \ \ + \delta_2 w_{2,2}^{out} a_{2}^{h}(1- a_{2}^{h}) x_1 }\)

\({\color{Purple} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ ......}\)

\({\Huge \color{Purple} \ \ \ \ \ \ \ \ \ + \delta_t w_{t,2}^{out} a_{2}^{h}(1- a_{2}^{h}) x_1 }\)

进一步推广

\({\Huge \color{Purple} \frac{\partial L}{\partial w_{j,i}^{h} } = \delta_1 w_{1,j}^{out} a_{j}^{h}(1- a_{j}^{h}) x_i }\)

\({\Huge \color{Purple} \ \ \ \ \ \ \ \ \ + \delta_2 w_{2,j}^{out} a_{j}^{h}(1- a_{j}^{h}) x_i }\)

\({\color{Purple} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ ......}\)

\({\Huge \color{Purple} \ \ \ \ \ \ \ \ \ + \delta_t w_{t,j}^{out} a_{j}^{h}(1- a_{j}^{h}) x_i }\)

可以看到后面3项是一样的可以提出来。
而前面非常像一个dot的形式，是一排\(\delta\)点乘一排w。
可以凑一个矩阵相乘，把所有w一下子写出来。

# N个sample
# x是NxI，a_h是1xJ，delta是1xT。
# 那么w_out是TxJ。w_h是JxI。

# delta点乘w_out的第1列，再乘后3项，结果放到w_h的1，1位置。
# delta点乘w_out的第2列，再乘后3项，结果放到w_h的2，1位置。
# 可以找出矩阵乘法

# 乘出J个
temp = np.dot(delta, w_out)

temp = temp * (a_h * (1. - a_h))

temp = temp.T * x

forward从一批sample X从前往后算。

算出一串a_h和最终的a_out。

backward用y/a_out/a_h/X反算偏微分。

12 PyTprch并行训练神经网络#

pytorch是非常流行的深度学习库。可以实现神经网络，比上一章我们手搓的高效得多。

12.1 Pytorch和训练速度#

从上一章可以看到，28x28像素，feature只有700多，class只有10，h层的unit只有50的情况下，得计算将近40000个权重。

如果增加层数，增加图片分辨率，计算量会爆炸。

轮到gpu出场。用gpu算这些东西比cpu性价比高得多。
但是给gpu写代码并不像之前写python代码那么简单。
pytorch支持cuda。

一个标量(scalar)可以看作一个rank 0的tensor。
向量rank 1
矩阵rank 2
三维rank 3

之前已经看到图11.10中有类似的形状。三维的数据大致展示了神经网络中的feature个数/unit个数/层数的结构。

12.2 PyTorch第一步#

了解pytorch的各种基本操作

各种创建tensor
multiply
mean
matmul
取随机
chunk打散
split分组
cat拼接
stack

12.3 创建输入流水线#

DataLoader加载tensor。可按批次。
TensorDataset可把tensor zip起来
DataLoader可加载TensorDataset
用PIL加载本地图片
按文件名分配class。猫或狗
torchvision.transforms可对PIL打开的图片进行缩放等转换。

torchvision.datasets.CelebA包含了名人照片数据。
我们可以自己下载1.4g版本，另有6个信息文件。
http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

torchvision.datasets.MNIST和之前一样是手写数字数据。

12.4 用Pytorch创建神经网络模型#

torch.nn用来开发nn模型，可以很容易地做出原型和复杂模型。

了解原理最重要，首先不用torch.nn，而是用最基本的tensor操作来做一个线性回归训练。

然后逐步添加torch的feature。

# 生成10组数据
X_train = np.arange(10, dtype='float32').reshape((10, 1))
y_train = np.array([1.0, 1.3, 3.1, 2.0, 5.0, 6.3, 6.6,
7.4, 8.0, 9.0], dtype='float32')

plt.plot(X_train, y_train, 'o', markersize=10)
plt.xlabel('x')
plt.ylabel('y')

# 画出10个点
plt.show()


# 做standardization
X_train_norm = (X_train - np.mean(X_train)) / np.std(X_train)
X_train_norm = torch.from_numpy(X_train_norm)

# x和y做成dataset
y_train = torch.from_numpy(y_train).float()
train_ds = TensorDataset(X_train_norm, y_train)

# 加载dataset
batch_size = 1
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

# 初始化权重和bias
# 输入数据是1维。权重只做一个。是一条直线。
torch.manual_seed(1)
weight = torch.randn(1)
weight.requires_grad_()
bias = torch.zeros(1, requires_grad=True)

# 算mse
def loss_fn(input, target):
    return (input-target).pow(2).mean()

# 算预测值
def model(xb):
    return xb @ weight + bias

learning_rate = 0.001
num_epochs = 200
log_epochs = 10


for epoch in range(num_epochs):
    for x_batch, y_batch in train_dl:
        # 算预测值
        pred = model(x_batch)

        # 算loss值
        loss = loss_fn(pred, y_batch)

        # 这里面的内容非常多，初学肯定懵。大致意思是
        # 之前我们学了backpropagation。需要往回算偏微分也就是梯度。
        # 我们要让tensor标记需要求梯度，见上面的requires_grad。
        # 标记以后它就能记录一些必要的数据，如操作路径之类。
        # 然后就能算backward了
        loss.backward()

        # 暂时关掉算梯度
        with torch.no_grad():
            # 用梯度和学习率进行更新
            weight -= weight.grad * learning_rate
            bias -= bias.grad * learning_rate

            # 每批次最后要清除老的梯度
            weight.grad.zero_()
            bias.grad.zero_()

    if epoch % log_epochs==0:
        # 打印每次的loss
        print(f'Epoch {epoch}  Loss {loss.item():.4f}')


# 最终结果
print('Final Parameters:', weight.item(), bias.item())

可以看到loss从45一直降到0.0011。

12.4.3 替换为pytorch的组件#

input_size = 1
output_size = 1
# 创建一个线性模型
# 二维的点。输入和输出的feature数都为1。
model = nn.Linear(input_size, output_size)

# 创建loss标准
loss_fn = nn.MSELoss(reduction='mean')

# 随机梯度下降
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for x_batch, y_batch in train_dl:
        # 算预测值
        pred = model(x_batch)[:, 0]

        # 算loss
        loss = loss_fn(pred, y_batch)

        # 算梯度
        loss.backward()

        # 更新参数
        optimizer.step()

        # 重置梯度
        optimizer.zero_grad()

    if epoch % log_epochs==0:
        print(f'Epoch {epoch}  Loss {loss.item():.4f}')

# 打印结果
print('Final Parameters:', model.weight.item(), model.bias.item())

可以看到把我们手搓的计算都替换成了pytorch库代码。

最后得到的结果相似。

12.4.4 创建多层感知机来对iris数据分类#

这一节用pytorch创建一个跟11章一样的多层nn。

# 加载数据
iris = load_iris()
X = iris['data']
y = iris['target']

# 分成训练组和测试组
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1./3, random_state=2)

# standardization
X_train_norm = (X_train - np.mean(X_train)) / np.std(X_train)
X_train_norm = torch.from_numpy(X_train_norm).float()
y_train = torch.from_numpy(y_train)

# 100x4
# 3个class
print(f"X_train_norm {X_train_norm}")
print(f"X_train_norm.shape {X_train_norm.shape}")

print(f"y_train {y_train}")
print(f"y_train.shape {y_train.shape}")

# 组成dataset
train_ds = TensorDataset(X_train_norm, y_train)

# 加载dataset
torch.manual_seed(1)
batch_size = 2
train_dl = DataLoader(train_ds, batch_size, shuffle=True)


class Model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        # 两层线性模型。每层包含一套权重和bias。
        # 组成包含一个h层的网络。跟第11章一样。
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.layer2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # 第一层的输出
        x = self.layer1(x)

        # 激活函数Sigmoid
        x = nn.Sigmoid()(x)

        # 第二层的输出
        x = self.layer2(x)

        # 激活函数使用了Softmax
        # 每个值范围在[0, 1]，和为1。
        # 每个值就是每个class的概率，适用于分类。
        x = nn.Softmax(dim=1)(x)

        return x

# 4个feature。h层16个unit。输出3个class。
# 多层结构即为4x16x3
input_size = X_train_norm.shape[1]
hidden_size = 16
output_size = 3

model = Model(input_size, hidden_size, output_size)

learning_rate = 0.001

# 交叉熵损失。另一种loss函数。暂不深入
loss_fn = nn.CrossEntropyLoss()

# Adam算法。一种随机优化算法。这里替代随机梯度下降。
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print(f"model.parameters() {model.parameters()}")

# 大循环100遍。存下每次的loss和准确率。
num_epochs = 100
loss_hist = [0] * num_epochs
accuracy_hist = [0] * num_epochs

# 大循环
for epoch in range(num_epochs):

    # dataset循环。100组数据每次取batch_size = 2个
    for x_batch, y_batch in train_dl:
        # 每次处理2个sample
        pred = model(x_batch)

        # 书中原始代码loss = loss_fn(pred.long(), y_batch)
        # 默认代码会报错"log_softmax_lastdim_kernel_impl" not implemented for 'Long'
        # 这里pred是
        # pred tensor([[0.2706, 0.4538, 0.2757],
        #             [0.2665, 0.4673, 0.2663]], grad_fn=<SoftmaxBackward0>)
        # torch.float32
        # 其中一行是一个sample的三个class的输出(概率)。

        # y_batch是tensor([1, 2], dtype=torch.int32)
        # 是2个sample的实际y值。

        # 最后正确的代码是loss_fn(pred, y_batch.long())

        # 和之前一样。算loss，算梯度，更新，重置。
        loss = loss_fn(pred, y_batch.long())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # 结果累计到大循环的某一次中
        loss_hist[epoch] += loss.item()*y_batch.size(0)

        # 概率最大的那个就是预测答案，如果和y值一样就算预测正确。
        is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
        accuracy_hist[epoch] += is_correct.sum()

    # 算平均值
    loss_hist[epoch] /= len(train_dl.dataset)
    accuracy_hist[epoch] /= len(train_dl.dataset)

最后可得到类似图12.9的曲线。loss不断减小，准确率不断提高。

用测试组测一下

X_test_norm = (X_test - np.mean(X_train)) / np.std(X_train)
X_test_norm = torch.from_numpy(X_test_norm).float()
y_test = torch.from_numpy(y_test)

# 直接出预测值
pred_test = model(X_test_norm)

# 跟上面一样。如果概率值最大的那个正确就记1。最后统计1的比例。
correct = (torch.argmax(pred_test, dim=1) == y_test).float()
accuracy = correct.mean()

# 得到0.98
print(f'Test Acc.: {accuracy:.4f}')

#################

# 可把模型存到文件备用

# 存
path = 'iris_classifier.pt'
torch.save(model, path)

# 取
model_new = torch.load(path)
model_new.eval()

# 用
pred_test = model_new(X_test_norm)

12.5 多层神经网络激活函数的选择#

目前为止为了方便起见我们都是用的sigmoid作为激活函数。

但在有的场景可导致学习缓慢，sigmoid值域在[0,1]，很大的负值输入可导致值趋近于0。

或者困在一个局部最优位置。

人们最终更喜欢在h层中使用hyperbolic tangent。见图12.10的对比。

rectified linear unit (ReLU)，深度nn中常用。

最后总结一下目前遇到的激活函数。见图12.11。

13 PyTorch的深层机制#

这一章更深入学习PyTorch。为之后的学习打下基础。

PyTorch底层使用了dynamic computational graphs。更高效，容易debug。

13.2 computation graphs#

基于directed acyclic graph(DAG)。

比如算一个最简单的z=2x(a -b) + c。

会做成一个图结构，每个操作对应一个节点，把中间数据存下来。

之前代码也看到了算backward，就是要依赖这种结构来算。

13.3 tensor objects#

各种初始化 xavier_normal_

13.4 自动微分#

再次演示backward

13.5 用torch.nn简化结构的创建#

nn.Sequential可以把一堆操作串起来，前一个的输出作为后一个的输入。

各种激活函数的选择

https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity

各种loss函数

https://pytorch.org/docs/stable/nn.html#loss-functions

可继承nn.Module实现更复杂的模型，可以加入自己的layer。

比如在最开始加入一些noise。

13.6 Project1 预测汽车的燃油效率#

经常会出现feature由多种种类组成。比如图13.7的汽车油耗场景。

把气缸数/排量/马力/重量/加速这个五个数据作为数字feature。
把型号年份作为ordered categorical。
把产地作为unordered categorical。
共7个feature。
油耗 miles per gallon (MPG)是要预测的值。

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',  'Acceleration', 'Model Year', 'Origin']

# 取数据
df = pd.read_csv(url, names=column_names,
                 na_values = "?", comment='\t',
                 sep=" ", skipinitialspace=True)

# 打印空数据数
print(df.isna().sum())

# 去除空数据
df = df.dropna()
df = df.reset_index(drop=True)

# 分割数据
df_train, df_test = sklearn.model_selection.train_test_split(df, train_size=0.8, random_state=1)
train_stats = df_train.describe().transpose()
print(f"train_stats {train_stats}")

# 5个数值feature
numeric_column_names = ['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration']

df_train_norm, df_test_norm = df_train.copy(), df_test.copy()

# standardization
for col_name in numeric_column_names:
    mean = train_stats.loc[col_name, 'mean']
    std  = train_stats.loc[col_name, 'std']
    df_train_norm.loc[:, col_name] = (df_train_norm.loc[:, col_name] - mean)/std
    df_test_norm.loc[:, col_name] = (df_test_norm.loc[:, col_name] - mean)/std

print(f"df_train_norm\n{df_train_norm}")
print(f"df_train_norm.shape {df_train_norm.shape}")

print(f"df_test_norm\n{df_test_norm}")
print(f"df_test_norm.shape {df_test_norm.shape}")

# 用于分隔年份
boundaries = torch.tensor([73, 76, 79])

# 年份分隔为0/1/2/3
v = torch.tensor(df_train_norm['Model Year'].values)
df_train_norm['Model Year Bucketed'] = torch.bucketize(v, boundaries, right=True)

# 年份分隔为0/1/2/3
v = torch.tensor(df_test_norm['Model Year'].values)
df_test_norm['Model Year Bucketed'] = torch.bucketize(v, boundaries, right=True)

numeric_column_names.append('Model Year Bucketed')

total_origin = len(set(df_train_norm['Origin']))

# torch.nn.functional.one_hot
# 进行one hot操作。把单个无序标签feature转成多个feature，只有对应值为1，其他为0。
# 原本7个feature，三个产地变成3个feature，最后共9个feature。
origin_encoded = one_hot(torch.from_numpy(df_train_norm['Origin'].values) % total_origin)
x_train_numeric = torch.tensor(df_train_norm[numeric_column_names].values)
x_train = torch.cat([x_train_numeric, origin_encoded], 1).float()

print(f"x_train\n{x_train}")
print(f"x_train.shape {x_train.shape}")

# 同样处理test数据
origin_encoded = one_hot(torch.from_numpy(df_test_norm['Origin'].values) % total_origin)
x_test_numeric = torch.tensor(df_test_norm[numeric_column_names].values)
x_test = torch.cat([x_test_numeric, origin_encoded], 1).float()

# mpg作为y值
y_train = torch.tensor(df_train_norm['MPG'].values).float()
y_test = torch.tensor(df_test_norm['MPG'].values).float()

# 组成dataset并加载
train_ds = TensorDataset(x_train, y_train)
batch_size = 8
torch.manual_seed(1)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

# 做2个隐藏层。一个8个unit，一个4个unit。
hidden_units = [8, 4]

# 输入是9个feature
input_size = x_train.shape[1]
print(f"input_size {input_size}")

all_layers = []
for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit

# 最后加一个输出。最后成为9->8->4->1的结构。
all_layers.append(nn.Linear(hidden_units[-1], 1))

# 做成Sequential
model = nn.Sequential(*all_layers)

# 打印看看。9->8->4->1的结构。
print(f"model {model}")
'''
model Sequential(
  (0): Linear(in_features=9, out_features=8, bias=True)
  (1): ReLU()
  (2): Linear(in_features=8, out_features=4, bias=True)
  (3): ReLU()
  (4): Linear(in_features=4, out_features=1, bias=True)
)
'''

# 起mse和sgd
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

torch.manual_seed(1)
num_epochs = 200
log_epochs = 20

# 大循环训练200次
for epoch in range(num_epochs):
    loss_hist_train = 0

    # 训练所有数据。每批8个。
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)[:, 0]
        loss = loss_fn(pred, y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_hist_train += loss.item()

    # 可看到loss不断变小
    if epoch % log_epochs==0:
        print(f'Epoch {epoch}  Loss {loss_hist_train/len(train_dl):.4f}')

# 最后打印测试结果
with torch.no_grad():
    pred = model(x_test.float())[:, 0]
    loss = loss_fn(pred, y_test)
    print(f'Test MSE: {loss.item():.4f}')
    print(f'Test MAE: {nn.L1Loss()(pred, y_test).item():.4f}')

13.7 Project2 识别手写数字#

image_path = './'

# torchvision.transforms.ToTensor把PIL图片的像素值转成[0,1]的tensor。
transform = transforms.Compose([transforms.ToTensor()])

# 读取手写数字数据
mnist_train_dataset = torchvision.datasets.MNIST(root=image_path,
    train=True,
    transform=transform,
    download=True)

mnist_test_dataset = torchvision.datasets.MNIST(root=image_path,
    train=False,
    transform=transform,
    download=False)

# 60000个
print(f"mnist_train_dataset {mnist_train_dataset}")

# 10000个
print(f"mnist_test_dataset {mnist_test_dataset}")

batch_size = 64
torch.manual_seed(1)

# 加载数据
train_dl = DataLoader(mnist_train_dataset, batch_size, shuffle=True)

# 隐藏层unit数量
hidden_units = [32, 16]
image_size = mnist_train_dataset[0][0].shape
input_size = image_size[0] * image_size[1] * image_size[2]

# 首先有一个nn.Flatten，把图片数据打成一维。
# 因为图片默认是28x28。nn输入必须是一维feature。
all_layers = [nn.Flatten()]
for hidden_unit in hidden_units:
    # 添加layer
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit

# 最后加一层10unit的输出。总的结构784->32->16->10
all_layers.append(nn.Linear(hidden_units[-1], 10))
model = nn.Sequential(*all_layers)

# 打印看看model
print(f"model {model}")
'''
model Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=32, bias=True)
  (2): ReLU()
  (3): Linear(in_features=32, out_features=16, bias=True)
  (4): ReLU()
  (5): Linear(in_features=16, out_features=10, bias=True)
)
'''

# 起loss函数和算法
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

torch.manual_seed(1)
num_epochs = 20

# 大循环训练20次
for epoch in range(num_epochs):
    accuracy_hist_train = 0

    # 处理所有数据。每批64个。
    for x_batch, y_batch in train_dl:
        pred = model(x_batch)
        loss = loss_fn(pred, y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
        accuracy_hist_train += is_correct.sum()

    accuracy_hist_train /= len(train_dl.dataset)
    print(f'Epoch {epoch}  Accuracy {accuracy_hist_train:.4f}')

# 测试
pred = model(mnist_test_dataset.data / 255.)
is_correct = (torch.argmax(pred, dim=1) == mnist_test_dataset.targets).float()
print(f'Test accuracy: {is_correct.mean():.4f}')

13.8 PyTorch-Lightning#

https://lightning.ai/

暂忽略

14 用深度卷积神经网络对图片分类#

convolutional neural networks (CNNs)

14.1 CNN的结构#

cnn起源于对大脑皮层的研究。

cnn对于图像识别表现优异，所以在计算机视觉方面获得极大关注。

14.1.1 cnn和feature层级#

成功地抓取重要feature是任何机器学习算法的关键。

传统的机器学习模型依赖的feature数据，来源于领域专家，或某些feature抓取计算。

cnn可以自动从原始数据中分析出有用的feature。

所以cnn可以当作feature抓取器。

前面靠近输入的layer抓取low-level features。

后续的layer再进行学习。

深度cnn会创建一个feature hierarchy，从low-level features到high-level features。

例如我们处理图片，图中的点/边，可作为low-level feature先被抓取。
low-level基础上组合成high-level。
再进行训练，识别出例如猫/狗/房子的轮廓。

图14.1。cnn把输入图片中的一片local patch of pixels映射到feature map上。

cnn对处理图片很在行，取决于
- Sparse connectivity稀疏连通性
feature map中的一个元素只与一小块path相关。 - 参数共享
所有patch共用一套权重。

基于这两点，用一个卷积层替换传统的mlp，大大减少了网络中的权重数量。

而且抓取feature的能力提升。

一个典型的cnn包含若干个卷积层和subsampling层，跟着一批mlp。
subsampling层，又叫pooling layers，不包含参数(权重这些东西)。
卷积层和mlp包含参数。

14.1.2 离散卷积#

discrete convolution是cnn的一个基本计算，必须搞懂。

14.1.2.1 一维离散卷积#

\({\Large y = x * w \to y[i] = \sum_{k=- \infty }^{+ \infty}x[k]w[i -k] }\)

x是输入或者叫signal
w是filter或者kernel
x/w/y都是一维数组。最终要得到一串y值。每个y[i]值就要算后面的那个和。
k的范围是无穷，造成x和w的下标无穷，但是现实中x和w一般都是有限的数据。
或者可以看作把有限数据之外都填0。这样一来有效数据之外的y[i]都是0。
只需关注有效数据即可。

卷积x和w可交换，结果一样。

一维离散卷积计算示意

x      0 1 2 3 4
w      2 3 7 1 0 0 3
反转w
w'     3 0 0 1 7 3 2

w反转后放到x前边，有接触时i为0，算点积作为y[i]。
每算一个i+1往后挪1。算到无接触为止。

x                 0 1 2 3 4
w'    3 0 0 1 7 3 2                        0x2 = 0
        3 0 0 1 7 3 2                      0x3+1x2 = 2
          3 0 0 1 7 3 2                    0x7+1x3+2x2 = 7
            3 0 0 1 7 3 2                  19
              3 0 0 1 7 3 2                32
                3 0 0 1 7 3 2              35
                  3 0 0 1 7 3 2            31
                    3 0 0 1 7 3 2          7
                      3 0 0 1 7 3 2        6
                        3 0 0 1 7 3 2      9
                          3 0 0 1 7 3 2    12

x = np.array([0, 1, 2, 3, 4])
h = np.array([2, 3, 7, 1, 0, 0, 3])
# 3 0 0 1 7 3 2

# 三种模式
# full默认。有接触开始到无接触。
y = np.convolve(x, h, mode = 'full')
print(y)

# same，输出长度为x和h中较长的长度，值一般取full模式输出的中部。
# 也有取左边和右边的
y = np.convolve(x, h, mode = 'same')
print(y)

# valid。只有x和h中较长的完全包住较短的状态下才算点积。
y = np.convolve(x, h, mode = 'valid')
print(y)

# [ 0  2  7 19 32 35 31  7  6  9 12]

#       [ 7 19 32 35 31  7  6]

#             [32 35 31]

14.1.2.4 二维离散卷积#

\({\Large Y = X * W \to Y[i,j] = \sum_{k_1 =- \infty }^{+ \infty} \sum_{k_2 =- \infty }^{+ \infty}x[k_1, k_2]w[i - k_1, j - k_2]}\)

W     1 2 3
      6 9 8
      5 4 7

反转

W'    7 4 5
      8 9 6
      3 2 1

X         8 8 6 1 0
          1 2 3 4 5
          5 4 6 9 1
          6 6 6 6 6
          1 6 7 8 9

和一维类似。W'从左上角开始，当W'的1和X的左上角8重合时开始算一个Y[]值。
挪到3和0重合时算第一行的最后一个Y[]值。然后往下算其他行。

from scipy import signal

x = np.array([[8, 8, 6, 1, 0],
              [1, 2, 3, 4, 5],
              [5, 4, 6, 9, 1],
              [6, 6, 6, 6, 6],
              [1, 6, 7, 8, 9]])
h = np.array([[1, 2, 3],
              [6, 9, 8],
              [5, 4, 7]])

# np.convolve只能一维？需要另外处理。
# signal.convolve可以直接来
y = signal.convolve(x, h, mode = 'full') # y = np.convolve(x, h, mode = 'full')
print(y)
y = signal.convolve(x, h, mode = 'same') # y = np.convolve(x, h, mode = 'same')
print(y)
y = signal.convolve(x, h, mode = 'valid') # y = np.convolve(x, h, mode = 'valid')
print(y)

'''
[[  8  24  46  37  20   3   0]
 [ 49 124 182 140  79  30  15]
 [ 51 107 191 185 173 113  43]
 [ 41 101 178 222 233 159  61]
 [ 62 138 241 275 267 211  82]
 [ 36  99 200 255 278 211 114]
 [  5  34  66 110 126  92  63]]

[[124 182 140  79  30]
 [107 191 185 173 113]
 [101 178 222 233 159]
 [138 241 275 267 211]
 [ 99 200 255 278 211]]

[[191 185 173]
 [178 222 233]
 [241 275 267]]
 '''

14.1.3 Subsampling层#

Subsampling层一般进行cnn的两个操作，max-pooling和mean-pooling。

一般写为\({\Large P_{n_1 \times n_2}}\)，n1和n2叫pooling-size。

图14.8展示了例子。

pooling的优势

max-pooling有local invariance性质。本地一些微小变化不会改变max-pooling结果。那么有助于在noise多的情况下生成feature。
pooling减少feature的数量，减少计算量，可能减少过度拟合的几率。

有的人会不用pooling，而用stride2代替。

意思是算卷积时每次挪动2格，而不是默认的1格。

14.2 实现一个CNN#

传统NN中最重要的操作是矩阵相乘。比如输入乘上权重再加bias。

CNN中用卷积替代矩阵相乘。

\({\Large Z = W * X + b}\)

X是二维矩阵，表示图片的像素信息。

激活函数还是一样

\({\Large A = \sigma(Z)}\)

14.2.1 多输入和color通道#

卷积层的输入可以是多个2d数组，或者说N1xN2的矩阵。
这些N1xN2矩阵叫做channel。
一般的卷积层希望一个rank3的tensor作为输入。比如\({\Large X_{N_1 \times N_2 \times C_{in}}}\)
\(C_{in}\)是channel数量。

比如一个图片，有rgb三通道。可看作\(C_{in}\)为3。

如果是黑白图，\(C_{in}\)为1。

分别对各个通道进行卷积操作，然后加起来。

每个通道有各自的kernel矩阵\({\Large W[:,:,c]}\)

给定
\({\Large X_{n_1 \times n_2 \times C_{in}}}\)
\({\Large W_{m_1 \times m_2 \times C_{in}}}\)
\({\Large b}\)

生成feature map的流程是

\({\Large Z^{Conv} = \sum_{c=1}^{C_{in}}W[:,:,c] * X[:,:,c]}\)

\({\Large Z = Z^{Conv} + b_c}\)

\({\Large A = \sigma(Z)}\)

比如有3个通道，那么有3个kernel，1个bias。
每个通道和对应的kernel进行卷积，加起来，再加bias。
再激活成为1个feature map。

可以扩展为\(C_{out}\)组kernel和bias。那么最终生成\(C_{in}\)个feature map。进入pooling层。

见图14.9

需要的参数数量m1xm2x3x5个kernel值，再加5个bias值。

如果是传统nn，那就是两两之间有一个权重，那就是平方级别的数量。

14.2.1 L2 regularization和dropout#

选择网络的大小，无论是nn还是cnn，总是一个难题。

需要调教才能得出要给好的结果。

过于简单的网络，比如没有隐藏层，很可能解决不了问题。
网络小，参数少，倾向于欠拟合。反之倾向于过拟合。
现实中我们无法预知一个好的尺寸。

一个解决方法是先做一个较大的网络，先在训练数据上得到一个较好的表现。通常会过拟合。

再做一次或多次regularization，来达到更好的一般化。

第3/4章已经看过L1/L2 regularization，在nn里也可以用。

可以把模型的参数用parameters()拿出来手算regularization。

loss_func = nn.BCELoss()

# 原始loss值
loss = loss_func(torch.tensor([0.9]), torch.tensor([1.0]))
l2_lambda = 0.001

# 起卷积层
conv_layer = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)

# 看看parameters()的返回
print(f"conv_layer.parameters() {conv_layer.parameters()}")
aaa = [p for p in conv_layer.parameters()]
print(f"aaa {aaa}")
print(f"aaa len {len(aaa)}")
# parameters()包含权重和bias两部分
print(aaa[0])
print(f"aaa[0].shape {aaa[0].shape}")
# torch.Size([5, 3, 5, 5])
# out_channels=5即5大组参数，3个channel，kernel_size=5即5x5的2d卷积核
# 最终的权重就是5x3x5x5

# **2所有元素平方
print(f"aaa[0]**2\n{aaa[0]**2}")

# .sum()所有元素相加
print(f"(aaa[0]**2).sum {(aaa[0]**2).sum()}")
print(f"{[(p**2).sum() for p in conv_layer.parameters()]}")

# w和b都算平方和再相加乘上l2_lambda
l2_penalty = l2_lambda * sum([(p**2).sum() for p in conv_layer.parameters()])
print(f"l2_penalty {l2_penalty}")
loss_with_penalty = loss + l2_penalty

linear_layer = nn.Linear(10, 16)
l2_penalty = l2_lambda * sum([(p**2).sum() for p in linear_layer.parameters()])
loss_with_penalty = loss + l2_penalty

另有dropout技术

https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf

概率性丢弃一些数据能提高表现。

14.2.3 分类算法的loss函数#

ReLU一般用在NN的隐藏层，增加非线性性。

sigmoid和softmax用在输出层，作为分类的结果。

14.3 用PyTorch实现深度CNN#

13章我们用torch.nn实现了数字识别。现在看看cnn的效果如何。

图14.12展示了多层cnn结构。

输入为28x28，1通道。
kernel为5x5，32组。
pooling，缩成一半，变成14x14x32。
再卷积5x5，2组，变成14x14x64。
再pooling，缩成一半，变成7x7x64。
flatten打平，交给输出层。
输出层用softmax输出10个class。

看代码

image_path = './'

# 设置直读图片
transform = transforms.Compose([transforms.ToTensor()])

# 获取手写数字数据集训练部分
mnist_dataset = torchvision.datasets.MNIST(root=image_path,
    train=True,
    transform=transform,
    download=True)

print(f"mnist_dataset {mnist_dataset}")
# 60000个数据分成10000和50000
mnist_valid_dataset = Subset(mnist_dataset, torch.arange(10000))
mnist_train_dataset = Subset(mnist_dataset, torch.arange(10000, len(mnist_dataset)))

# 获取手写数字数据集test部分
mnist_test_dataset = torchvision.datasets.MNIST(root=image_path,
    train=False,
    transform=transform,
    download=False)

# test有10000个数据
print(f"mnist_test_dataset {mnist_test_dataset}")

# 每批处理64个sample
batch_size = 64
torch.manual_seed(1)

# 加载数据
train_dl = DataLoader(mnist_train_dataset, batch_size, shuffle=True)
valid_dl = DataLoader(mnist_valid_dataset, batch_size, shuffle=False)

# 起Sequential
model = nn.Sequential()

# 起各种层
# 起卷积。kernel_size=5，padding=2做成kernel中心与第一个输入重合时开始算卷积。
model.add_module('conv1', nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5, padding=2))
model.add_module('relu1', nn.ReLU())
# max-pooling缩成一半
model.add_module('pool1', nn.MaxPool2d(kernel_size=2))
# 再卷
model.add_module('conv2', nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5, padding=2))
model.add_module('relu2', nn.ReLU())
# 再缩
model.add_module('pool2', nn.MaxPool2d(kernel_size=2))

# 摊平
model.add_module('flatten', nn.Flatten())

# 线性
model.add_module('fc1', nn.Linear(3136, 1024))
model.add_module('relu3', nn.ReLU())

# 0.5的概率丢弃
model.add_module('dropout', nn.Dropout(p=0.5))

# 输出10个数。对应10个数字。
model.add_module('fc2', nn.Linear(1024, 10))


# 获取cuda设备
device = torch.device("cuda:0")

# 数据进cuba
model = model.to(device)

# loss函数和算法
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train(model, num_epochs, train_dl, valid_dl):
    loss_hist_train = [0] * num_epochs
    accuracy_hist_train = [0] * num_epochs
    loss_hist_valid = [0] * num_epochs
    accuracy_hist_valid = [0] * num_epochs

    # 大循环训练20次
    for epoch in range(num_epochs):
        # 训练
        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train
        # https://discuss.pytorch.org/t/what-does-nn-modules-train-true-train-false-do/4004/6
        # 设置为training模式。可重置dropout等的数据。
        model.train()

        # 每批64个数据
        for x_batch, y_batch in train_dl:
            # 数据进cuda
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)

            # 预测/loss/propagation/更新/清零一条龙
            pred = model(x_batch)
            loss = loss_fn(pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            # 判断结果。累计数据。
            # argmax最大的那个等于y_batch即预测正确。
            loss_hist_train[epoch] += loss.item()*y_batch.size(0)
            is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
            accuracy_hist_train[epoch] += is_correct.sum().cpu()

        # 平均值
        loss_hist_train[epoch] /= len(train_dl.dataset)
        accuracy_hist_train[epoch] /= len(train_dl.dataset)

        # 关掉training模式
        # 进行验证，记录数据。
        model.eval()
        with torch.no_grad():
            for x_batch, y_batch in valid_dl:
                x_batch = x_batch.to(device)
                y_batch = y_batch.to(device)
                pred = model(x_batch)
                loss = loss_fn(pred, y_batch)
                loss_hist_valid[epoch] += loss.item()*y_batch.size(0)
                is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
                accuracy_hist_valid[epoch] += is_correct.sum().cpu()

        loss_hist_valid[epoch] /= len(valid_dl.dataset)
        accuracy_hist_valid[epoch] /= len(valid_dl.dataset)

        print(f'Epoch {epoch+1} accuracy: {accuracy_hist_train[epoch]:.4f} val_accuracy: {accuracy_hist_valid[epoch]:.4f}')

    return loss_hist_train, loss_hist_valid, accuracy_hist_train, accuracy_hist_valid

torch.manual_seed(1)
num_epochs = 20
# 开始训练
hist = train(model, num_epochs, train_dl, valid_dl)

# 画出结果数据
x_arr = np.arange(len(hist[0])) + 1

fig = plt.figure(figsize=(12, 4))
ax = fig.add_subplot(1, 2, 1)
ax.plot(x_arr, hist[0], '-o', label='Train loss')
ax.plot(x_arr, hist[1], '--<', label='Validation loss')
ax.set_xlabel('Epoch', size=15)
ax.set_ylabel('Loss', size=15)
ax.legend(fontsize=15)
ax = fig.add_subplot(1, 2, 2)
ax.plot(x_arr, hist[2], '-o', label='Train acc.')
ax.plot(x_arr, hist[3], '--<', label='Validation acc.')
ax.legend(fontsize=15)
ax.set_xlabel('Epoch', size=15)
ax.set_ylabel('Accuracy', size=15)

plt.show()

# 等待cuda操作结束
# 用test数据算准确率
torch.cuda.synchronize()
model_cpu = model.cpu()
pred = model(mnist_test_dataset.data.unsqueeze(1) / 255.)
is_correct = (torch.argmax(pred, dim=1) == mnist_test_dataset.targets).float()
print(f'Test accuracy: {is_correct.mean():.4f}')

# 挑12个数字来预测
fig = plt.figure(figsize=(12, 4))
for i in range(12):
    ax = fig.add_subplot(2, 6, i+1)
    ax.set_xticks([]); ax.set_yticks([])
    img = mnist_test_dataset[i][0][0, :, :]
    pred = model(img.unsqueeze(0).unsqueeze(1))
    y_pred = torch.argmax(pred)
    ax.imshow(img, cmap='gray_r')
    ax.text(0.9, 0.1, y_pred.item(),
            size=15, color='blue',
            horizontalalignment='center',
            verticalalignment='center',
            transform=ax.transAxes)

# 画出预测结果
plt.show()

# 存下当前的模型
if not os.path.exists('models'):
    os.mkdir('models')

path = 'models/mnist-cnn.ph'
torch.save(model, path)

14.4 用CNN分辨笑脸#

把12章的celeba数据复制过来。

对数据有一个data augmentation的概念。把图片做一些随机处理，比如flip，crop。

可提高训练表现。

直接看代码

# 获取CelebA原始数据
image_path = './'
celeba_train_dataset = torchvision.datasets.CelebA(image_path, split='train', target_type='attr', download=False)
celeba_valid_dataset = torchvision.datasets.CelebA(image_path, split='valid', target_type='attr', download=False)
celeba_test_dataset = torchvision.datasets.CelebA(image_path, split='test', target_type='attr', download=False)

print('Train set:', len(celeba_train_dataset))
print('Validation set:', len(celeba_valid_dataset))
print('Test set:', len(celeba_test_dataset))


# 先取5个图片进行一些基本图像操作

fig = plt.figure(figsize=(16, 8.5))

## Column 1: cropping to a bounding-box
ax = fig.add_subplot(2, 5, 1)
# 取第一个图片
img, attr = celeba_train_dataset[0]
ax.set_title('Crop to a \nbounding-box', size=15)
ax.imshow(img)
ax = fig.add_subplot(2, 5, 6)
# crop操作
img_cropped = transforms.functional.crop(img, 50, 20, 128, 128)
ax.imshow(img_cropped)

## Column 2: flipping (horizontally)
ax = fig.add_subplot(2, 5, 2)
img, attr = celeba_train_dataset[1]
ax.set_title('Flip (horizontal)', size=15)
ax.imshow(img)
ax = fig.add_subplot(2, 5, 7)
# hflip操作
img_flipped = transforms.functional.hflip(img)
ax.imshow(img_flipped)

## Column 3: adjust contrast
ax = fig.add_subplot(2, 5, 3)
img, attr = celeba_train_dataset[2]
ax.set_title('Adjust constrast', size=15)
ax.imshow(img)
ax = fig.add_subplot(2, 5, 8)
# adjust_contrast操作
img_adj_contrast = transforms.functional.adjust_contrast(img, contrast_factor=2)
ax.imshow(img_adj_contrast)

## Column 4: adjust brightness
ax = fig.add_subplot(2, 5, 4)
img, attr = celeba_train_dataset[3]
ax.set_title('Adjust brightness', size=15)
ax.imshow(img)
ax = fig.add_subplot(2, 5, 9)
# adjust_brightness操作
img_adj_brightness = transforms.functional.adjust_brightness(img, brightness_factor=1.3)
ax.imshow(img_adj_brightness)

## Column 5: cropping from image center ax = fig.add_subplot(2, 5, 5)
img, attr = celeba_train_dataset[4]
ax.set_title('Center crop\nand resize', size=15)
ax.imshow(img)
ax = fig.add_subplot(2, 5, 10)
# img_center_crop和resize
img_center_crop = transforms.functional.center_crop(img, [0.7*218, 0.7*178])
img_resized = transforms.functional.resize(img_center_crop, size=(218, 178))
ax.imshow(img_resized)

# 显示5个图片
plt.show()

torch.manual_seed(1)

fig = plt.figure(figsize=(14, 12))

# 取三个图片进行一些随机操作
for i, (img, attr) in enumerate(celeba_train_dataset):
    ax = fig.add_subplot(3, 4, i*4+1)
    ax.imshow(img)
    if i == 0:
        ax.set_title('Orig.', size=15)

    ax = fig.add_subplot(3, 4, i*4+2)
    img_transform = transforms.Compose([transforms.RandomCrop([178, 178])])
    img_cropped = img_transform(img)
    ax.imshow(img_cropped)
    if i == 0:
        ax.set_title('Step 1: Random crop', size=15)

    ax = fig.add_subplot(3, 4, i*4+3)
    img_transform = transforms.Compose([transforms.RandomHorizontalFlip()])
    img_flip = img_transform(img_cropped)
    ax.imshow(img_flip)
    if i == 0:
        ax.set_title('Step 2: Random flip', size=15)

    ax = fig.add_subplot(3, 4, i*4+4)
    img_resized = transforms.functional.resize(img_flip, size=(128, 128))
    ax.imshow(img_resized)
    if i == 0:
        ax.set_title('Step 3: Resize', size=15)

    if i == 2:
        break
        # plt.savefig('figures/14_15.png', dpi=300)

plt.show()


# 打开list_attr_celeba.txt可看到图片有哪些数据
# 比如是否光头，性别，大嘴唇，发色，是否在笑等等。
# 那么我们把在笑的图片喂给cnn，就能得到预测是否在笑的参数了。
get_smile = lambda attr: attr[18]

# 对于训练数据。随机操作，resize并转为tensor。
transform_train = transforms.Compose([
    transforms.RandomCrop([178, 178]),
    transforms.RandomHorizontalFlip(),
    transforms.Resize([64, 64]),
    transforms.ToTensor(),
])

# 对于测试和验证数据。resize并转为tensor。
transform = transforms.Compose([
    transforms.CenterCrop([178, 178]),
    transforms.Resize([64, 64]),
    transforms.ToTensor(),
])

# 取数据同时做转换。只取笑脸。
celeba_train_dataset = torchvision.datasets.CelebA(image_path,
    split='train',
    target_type='attr',
    download=False,
    transform=transform_train,
    target_transform=get_smile)

torch.manual_seed(1)
# 加载
data_loader = DataLoader(celeba_train_dataset, batch_size=2)

fig = plt.figure(figsize=(15, 6))


# 这里每次都取的第一个图片，看随机变换的效果。
# my_iter = iter(data_loader)
num_epochs = 5
for j in range(num_epochs):
    img_batch, label_batch = next(iter(data_loader)) # next(my_iter) # next(iter(data_loader))
    #print(f"img_batch {img_batch} label_batch {label_batch}")    img = img_batch[0]
    ax = fig.add_subplot(2, 5, j + 1)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f'Epoch {j}:', size=15)
    ax.imshow(img.permute(1, 2, 0))

    img = img_batch[1]
    ax = fig.add_subplot(2, 5, j + 6)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(img.permute(1, 2, 0))


plt.show()


# 获取验证和测试数据
celeba_valid_dataset = torchvision.datasets.CelebA(image_path,
    split='valid',
    target_type='attr',
    download=False,
    transform=transform,
    target_transform=get_smile)

celeba_test_dataset = torchvision.datasets.CelebA(image_path,
    split='test',
    target_type='attr',
    download=False,
    transform=transform,
    target_transform=get_smile)

# 取16000和1000
celeba_train_dataset = Subset(celeba_train_dataset, torch.arange(16000))
celeba_valid_dataset = Subset(celeba_valid_dataset, torch.arange(1000))

print('Train set:', len(celeba_train_dataset))
print('Validation set:', len(celeba_valid_dataset))

batch_size = 32

torch.manual_seed(1)
train_dl = DataLoader(celeba_train_dataset, batch_size, shuffle=True)
valid_dl = DataLoader(celeba_valid_dataset, batch_size, shuffle=False)
test_dl = DataLoader(celeba_test_dataset, batch_size, shuffle=False)


# 起Sequential
model = nn.Sequential()

# 起各种layer

# 卷积/relu/pooling/dropout
model.add_module('conv1', nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1))
model.add_module('relu1', nn.ReLU())
model.add_module('pool1', nn.MaxPool2d(kernel_size=2))
model.add_module('dropout1', nn.Dropout(p=0.5))

model.add_module('conv2', nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1))
model.add_module('relu2', nn.ReLU())
model.add_module('pool2', nn.MaxPool2d(kernel_size=2))
model.add_module('dropout2', nn.Dropout(p=0.5))

model.add_module('conv3', nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1))
model.add_module('relu3', nn.ReLU())
model.add_module('pool3', nn.MaxPool2d(kernel_size=2))

model.add_module('conv4', nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1))
model.add_module('relu4', nn.ReLU())

model.add_module('pool4', nn.AvgPool2d(kernel_size=8))

# 摊平
model.add_module('flatten', nn.Flatten())

# 线性256转到1
model.add_module('fc', nn.Linear(256, 1))

# 最后sigmoid输出一个值0-1。大于0.5即预测为笑脸。
model.add_module('sigmoid', nn.Sigmoid())

# 看看模型整体结构
print(f"model {model}")
'''
Sequential(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu1): ReLU()
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (dropout1): Dropout(p=0.5, inplace=False)
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu2): ReLU()
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (dropout2): Dropout(p=0.5, inplace=False)
  (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu3): ReLU()
  (pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv4): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu4): ReLU()
  (pool4): AvgPool2d(kernel_size=8, stride=8, padding=0)
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
'''

device = torch.device("cuda:0")
model = model.to(device)

# 起loss/算法
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train(model, num_epochs, train_dl, valid_dl):
    loss_hist_train = [0] * num_epochs
    accuracy_hist_train = [0] * num_epochs
    loss_hist_valid = [0] * num_epochs
    accuracy_hist_valid = [0] * num_epochs

    # 总体训练20次
    for epoch in range(num_epochs):
        model.train()

        # 训练所有数据。每批32个。同样的套路。
        # 预测/loss/propagation/更新/清零/记录一条龙
        for x_batch, y_batch in train_dl:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)
            pred = model(x_batch)[:, 0]
            loss = loss_fn(pred, y_batch.float())
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            loss_hist_train[epoch] += loss.item()*y_batch.size(0)
            is_correct = ((pred>=0.5).float() == y_batch).float()
            accuracy_hist_train[epoch] += is_correct.sum().cpu()

        loss_hist_train[epoch] /= len(train_dl.dataset)
        accuracy_hist_train[epoch] /= len(train_dl.dataset)

        model.eval()
        with torch.no_grad():
            # 验证
            for x_batch, y_batch in valid_dl:
                x_batch = x_batch.to(device)
                y_batch = y_batch.to(device)
                pred = model(x_batch)[:, 0]
                loss = loss_fn(pred, y_batch.float())
                loss_hist_valid[epoch] += loss.item()*y_batch.size(0)
                is_correct = ((pred>=0.5).float() == y_batch).float()
                accuracy_hist_valid[epoch] += is_correct.sum().cpu()

        loss_hist_valid[epoch] /= len(valid_dl.dataset)
        accuracy_hist_valid[epoch] /= len(valid_dl.dataset)

        print(f'Epoch {epoch+1} accuracy: {accuracy_hist_train[epoch]:.4f} val_accuracy: {accuracy_hist_valid[epoch]:.4f}')

    return loss_hist_train, loss_hist_valid, accuracy_hist_train, accuracy_hist_valid

torch.manual_seed(1)
num_epochs = 20

# 开始训练
hist = train(model, num_epochs, train_dl, valid_dl)


# 画出结果。可看到loss不断下降，acc不断上升。
x_arr = np.arange(len(hist[0])) + 1

fig = plt.figure(figsize=(12, 4))
ax = fig.add_subplot(1, 2, 1)
ax.plot(x_arr, hist[0], '-o', label='Train loss')
ax.plot(x_arr, hist[1], '--<', label='Validation loss')
ax.legend(fontsize=15)
ax.set_xlabel('Epoch', size=15)
ax.set_ylabel('Loss', size=15)

ax = fig.add_subplot(1, 2, 2)
ax.plot(x_arr, hist[2], '-o', label='Train acc.')
ax.plot(x_arr, hist[3], '--<', label='Validation acc.')
ax.legend(fontsize=15)
ax.set_xlabel('Epoch', size=15)
ax.set_ylabel('Accuracy', size=15)

plt.show()


# 用test数据来检查准确率
accuracy_test = 0

model.eval()
with torch.no_grad():
    for x_batch, y_batch in test_dl:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        pred = model(x_batch)[:, 0]
        is_correct = ((pred>=0.5).float() == y_batch).float()
        accuracy_test += is_correct.sum().cpu()

accuracy_test /= len(test_dl.dataset)
print(f'Test accuracy: {accuracy_test:.4f}')


# 再次算一下x_batch的预测值
pred = model(x_batch)[:, 0] * 100

# 挑10个图片显示相应的预测值。肉眼看效果。
fig = plt.figure(figsize=(15, 7))
for j in range(0, 10):
    ax = fig.add_subplot(2, 5, j+1)
    ax.set_xticks([]); ax.set_yticks([])
    ax.imshow(x_batch[j].cpu().permute(1, 2, 0))
    if y_batch[j] == 1:
        label = 'Smile'
    else:
        label = 'Not Smile'
    ax.text(
        0.5, -0.15,
        f'GT: {label:s}\nPr(Smile)={pred[j]:.0f}%',
        size=16,
        horizontalalignment='center',
        verticalalignment='center',
        transform=ax.transAxes)

plt.show()

# 保存模型
path = 'models/celeba-cnn.ph'
torch.save(model, path)

15 用RNN对顺序数据建模#

本章学习recurrent neural networks

15.1 sequential data#

sample之间是否有关系？比如iris数据，数据之间没关系，打乱输入顺序，结果还是一样。

对于sequences数据，顺序有影响。

比如市值曲线，是一个连续的数据，绝对不能打乱它的顺序。

time series数据是一种典型的顺序数据，每个数据都跟一个时间戳关联，顺序排列。

比如股票数据和音频数据。

不包含时间的顺序数据比如DNA数据。

顺序数据表示为

\({\Large <x^{1}, x^{1}, \ ... \ x^{T} >}\)

每个\({\Large x^t}\)对应一个\({\Large y^t}\)。

对于标准nn和cnn，数据相互无关，可以说这种模型不存在对之前训练数据的记忆。

各个数据都是分开训练，独自更新参数。

相反，rnn会记住之前的信息，并对后续数据产生影响。

15.1.4 顺序模型的种类#

顺序模型有很多应用，有语言翻译，自动图像描述，文本生成等。

顺序模型分几种
见图15.2
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

如果输入和输出都不是顺序数据，就走其他nn。如果有一个是顺序，那么

Many-to-one 输入有顺序，输出没有顺序，是定长的向量或标量。例如情感分析，输入是文本有序，输出是一串无序的label。
One-to-many 输入无序，输出有序。例如生成图像描述，输入一个图片，输出一串文本描述这个图片。
Many-to-many 输入输出都有序。还可细分输入输出是否同步(synchronized)。 synchronized的例子，视频分类，每一帧都要同步处理完，再处理下一帧。不是synchronized的例子，语言的翻译，允许存在delay。你必须读一段文字才能知道它的意思，才能准确翻译。

15.2 RNN#

15.2.1 rnn中的数据流转#

图15.3展示了标准nn和rnn的数据流。

标准nn数据从input直接到h层，直接到output。

rnn的h层同时接收input和本身前一个时序的数据。图中一般画成一个指向自己的环。

即recurrent edge，即rnn的由来。

图15.4展示了多层rnn，即上一层的输出作为下一层的输入。

15.2.2 rnn的激活函数#

如图15.5
每一条边都对应一个权重。这些权重和时间无关，即在任意时刻都相等。
按数据流向分成\({\Large W_{hh}}\)，\({\Large W_{xh}}\)和\({\Large W_{ho}}\)，每一层只有一组。

如果输入feature数为A，h层feature数为B。
那么\({\Large W_{xh}}\)就是BxA。
\({\Large W_{hh}}\)是BxB

计算方式和nn是类似的。z值为

\({\Large z_h^{t} = W_{xh}x^{t} + W_{hh}h^{t-1} + b_h}\)

每个时序都要激活一次

\({\Large h^t = \sigma_h(z_h^{t}) }\)

这个\({\Large h^{t}}\)首先是下一个时序的输入。

把它再做输出激活，成为时序t的输出

\({\Large o^t = \sigma_o(W_{ho}h^{t} + b_o)}\)

所以针对两个方向有两次激活。

15.2.3 Hidden recurrence vs output recurrence#

见图15.7。上一节的数据流向也是可以变的，o可以直接从前一个o获取数据，h也可以从前一个o获取数据。

代码演示了nn.RNN的输出和手算的输出，值是一样的。

torch.manual_seed(1)

# https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
# 起RNN
# input_size 每个sample的feature数
# hidden_size h层的feature数
# num_layers 循环层数量
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)

# 见文档。weight_ih_l[k]表示第0层的ih权重。其他类似。
w_xh = rnn_layer.weight_ih_l0
w_hh = rnn_layer.weight_hh_l0
b_xh = rnn_layer.bias_ih_l0
b_hh = rnn_layer.bias_hh_l0

print('W_xh shape:', w_xh.shape)
print('W_hh shape:', w_hh.shape)
print('b_xh shape:', b_xh.shape)
print('b_hh shape:', b_hh.shape)

print(f"w_xh {w_xh}")
print(f"w_hh {w_hh}")
print(f"b_xh {b_xh}")
print(f"b_hh {b_hh}")

x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()

print(f"x_seq {x_seq}")
print(f"x_seq.shape {x_seq.shape}")
aaa = torch.reshape(x_seq, (1, 3, 5))
print(f"aaa {aaa}")
print(f"aaa.shape {aaa.shape}")

# 输入的shape有说法，见文档。batch_first=True时是(N,L,Hin)。
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))

out_man = []
# 3个sample。循环3次。
for t in range(3):
    print(f"xt {x_seq[t]} xt.shape {x_seq[t].shape}")
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f"xt {xt} xt.shape {xt.shape}")
    print(f'Time step {t} =>')
    print('   Input           :', xt.numpy())

    # 输入z值
    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh
    print('   Hidden          :', ht.detach().numpy())

    # 前一个时序的值。初始为0。
    if t>0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros((ht.shape))

    # 加上前一个时序值乘权重再加hh的bias
    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh

    # 激活并保存
    ot = torch.tanh(ot)
    out_man.append(ot)

    # 值相等
    print('   Output (manual) :', ot.detach().numpy())
    print('   RNN output      :', output[:, t].detach().numpy())