SoftmaxRegression: Multiclass version of logistic regression

A logistic regression class for multi-class classification tasks.

from mlxtend.classifier import SoftmaxRegression

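Overview

Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive). In contrast, we use the (standard) Logistic Regression model in binary classification tasks.

Below is a schematic of a Logistic Regression model; for more details, please see the LogisticRegression manual.

In Softmax Regression (SMR), we replace the sigmoid logistic function by the so-called softmax function $\phi_{softmax}(\cdot)$:

$P(y=j \mid z^{(i)}) = \phi_{softmax}(z^{(i)}) = \frac{e^{z_{j}^{(i)}}}{\sum_{l=1}^{k} e^{z_{l}^{(i)}}},$

where we define the net input z as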

$z = w_1x_1 + ... + w_mx_m + b= \sum_{l=1}^{m} w_l x_l + b= \mathbf{w}^T\mathbf{x} + b.$

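($\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the feature vector of 1 training sample, and $b$ is the bias unit.)
Now, this softmax function computes the probability that the training sample $\mathbf{x}^{(i)}$ belongs to class $j$, given the weights and the net input $z^{(i)}$. So, we compute the probability $p(y = j \mid \mathbf{x}^{(i)}; \mathbf{w}_j)$ for each class label $j = 1, \ldots, k$. Note the normalization term in the denominator, which causes these class probabilities to sum up to one.

To illustrate the concept of softmax, let us walk through a concrete example. Let's assume we have a training set consisting of 4 samples from 3 different classes (0, 1, and 2):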

$x_0 \rightarrow \text{class }0$
$x_1 \rightarrow \text{class }1$
$x_2 \rightarrow \text{class }2$
$x_3 \rightarrow \text{class }2$

import numpy as np

y = np.array([0, 1, 2, 2])

First, we want to encode the class labels into a format that is easier to work with; we apply one-hot encoding:

y_enc = (np.arange(np.max(y) + 1) == y[:, None]).astype(float)

print('one-hot encoding:\n', y_enc)

one-hot encoding:
 [[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 0.  0.  1.]]

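A sample that belongs to class 0 (the first row) has a 1 in the first cell, a sample that belongs to class 1 has a 1 in the second cell of its row, and so forth.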
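The same encoding can also be obtained by indexing an identity matrix, and `argmax` recovers the integer labels. This is a minimal sketch; the names `one_hot` and `n_classes` are chosen here for illustration and are not part of mlxtend, and the labels are assumed to be the integers 0, 1, ..., k-1:

```python
import numpy as np


def one_hot(y, n_classes=None):
    """Return a (n_samples, n_classes) one-hot matrix for integer labels y."""
    y = np.asarray(y)
    if n_classes is None:
        n_classes = int(y.max()) + 1  # assumes labels are 0, 1, ..., k-1
    return np.eye(n_classes)[y]  # row c of the identity matrix encodes class c


y = np.array([0, 1, 2, 2])
y_enc = one_hot(y)
y_back = y_enc.argmax(axis=1)  # decode: position of the 1 in each row

print('one-hot encoding:\n', y_enc)
print('decoded labels:', y_back)
```

Passing `n_classes` explicitly is useful when a subset of the data does not contain every class label.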

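Next, we define the feature matrix of our 4 training samples. Here, we assume that our dataset consists of 2 features; thus, we create a 4x2 dimensional matrix of our samples and features. Similarly, we create a 2x3 dimensional weight matrix (one row per feature and one column per class).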

X = np.array([[0.1, 0.5],
              [1.1, 2.3],
              [-1.1, -2.3],
              [-1.5, -2.5]])

W = np.array([[0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3]])

bias = np.array([0.01, 0.1, 0.1])

print('Inputs X:\n', X)
print('\nWeights W:\n', W)
print('\nbias:\n', bias)

Inputs X:
 [[ 0.1  0.5]
 [ 1.1  2.3]
 [-1.1 -2.3]
 [-1.5 -2.5]]

Weights W:
 [[ 0.1  0.2  0.3]
 [ 0.1  0.2  0.3]]

bias:
 [ 0.01  0.1   0.1 ]

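To compute the net input, we multiply the 4x2 dimensional feature matrix X with the 2x3 dimensional (n_features x n_classes) weight matrix W, which yields a 4x3 dimensional (n_samples x n_classes) output matrix, and then we add the bias unit: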

$\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}.$

def net_input(X, W, b):
    return (X.dot(W) + b)

net_in = net_input(X, W, bias)
print('net input:\n', net_in)

net input:
 [[ 0.07  0.22  0.28]
 [ 0.35  0.78  1.12]
 [-0.33 -0.58 -0.92]
 [-0.39 -0.7  -1.1 ]]
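Adding `bias` to `X.dot(W)` relies on NumPy broadcasting: the length-3 bias vector is added to every row of the 4x3 product. The sketch below re-creates the arrays from above and checks the vectorized result against an explicit per-sample computation (`Z_loop` is a name introduced only for this comparison):

```python
import numpy as np

X = np.array([[0.1, 0.5],
              [1.1, 2.3],
              [-1.1, -2.3],
              [-1.5, -2.5]])
W = np.array([[0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3]])
bias = np.array([0.01, 0.1, 0.1])

# Vectorized: (4, 2) times (2, 3) gives (4, 3); bias of shape (3,) is broadcast over rows
Z = X.dot(W) + bias

# Explicit loop: one net-input vector z = W^T x + b per sample
Z_loop = np.array([W.T.dot(x) + bias for x in X])

print('shape:', Z.shape)
print('vectorized and looped results agree:', np.allclose(Z, Z_loop))
```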

Now, it's time to compute the softmax activation that we discussed earlier:

$P(y=j \mid z^{(i)}) = \phi_{softmax}(z^{(i)}) = \frac{e^{z_{j}^{(i)}}}{\sum_{l=1}^{k} e^{z_{l}^{(i)}}}.$

def softmax(z):
    return (np.exp(z.T) / np.sum(np.exp(z), axis=1)).T

smax = softmax(net_in)
print('softmax:\n', smax)

softmax:
 [[ 0.29450637  0.34216758  0.36332605]
 [ 0.21290077  0.32728332  0.45981591]
 [ 0.42860913  0.33380113  0.23758974]
 [ 0.44941979  0.32962558  0.22095463]]

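As we can see, the values for each sample (row) now sum up to 1. For example, we can say that the first sample
[ 0.29450637 0.34216758 0.36332605] has a 29.45% probability of belonging to class 0.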
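For large net inputs, `np.exp` can overflow. A common variation, not used in the code above, subtracts each row's maximum before exponentiating; this leaves the result unchanged because the common factor cancels between numerator and denominator. A minimal sketch (the name `softmax_stable` is made up for this illustration):

```python
import numpy as np


def softmax_stable(z):
    """Row-wise softmax that shifts each row by its maximum before exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=1, keepdims=True)  # largest entry per row becomes 0
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


net_in = np.array([[0.07, 0.22, 0.28],
                   [0.35, 0.78, 1.12],
                   [-0.33, -0.58, -0.92],
                   [-0.39, -0.7, -1.1]])

smax = softmax_stable(net_in)
print('row sums:', smax.sum(axis=1))

# Inputs this large would overflow in the unshifted formulation
big = np.array([[1000.0, 1001.0, 1002.0]])
print('large inputs:', softmax_stable(big))
```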

Now, in order to turn these probabilities back into class labels, we can simply take the argmax index position of each row:

[[ 0.29450637 0.34216758 0.36332605] -> 2
[ 0.21290077 0.32728332 0.45981591] -> 2
[ 0.42860913 0.33380113 0.23758974] -> 0
[ 0.44941979 0.32962558 0.22095463]] -> 0

def to_classlabel(z):
    return z.argmax(axis=1)

print('predicted class labels: ', to_classlabel(smax))

predicted class labels:  [2 2 0 0]
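To quantify how far off these predictions are, the fraction of matching labels can be computed directly. This small sketch hard-codes the true and predicted labels from above:

```python
import numpy as np

y = np.array([0, 1, 2, 2])        # true class labels
y_pred = np.array([2, 2, 0, 0])   # argmax predictions from the untrained weights

accuracy = np.mean(y_pred == y)   # fraction of samples predicted correctly
print('accuracy:', accuracy)
```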

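As we can see, our predictions are terribly wrong, since the correct class labels are [0, 1, 2, 2]. Now, in order to train our softmax model (e.g., via an optimization algorithm such as gradient descent), we need to define a cost function $J(\cdot)$ that we want to minimize:

$J(\mathbf{W}; \mathbf{b}) = \frac{1}{n} \sum_{i=1}^{n} H(T_i, O_i),$

which is the average of all cross-entropies over our $n$ training samples. The cross-entropy function is defined as

$H(T_i, O_i) = -\sum_{j=1}^{k} T_{i,j} \cdot \log(O_{i,j}).$

Here, $T$ stands for "target" (i.e., the true class labels) and $O$ stands for output: the probability computed via softmax, not the predicted class label. $T_{i,j}$ and $O_{i,j}$ denote the entries for class $j$ of the one-hot target and the softmax output of sample $i$.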

def cross_entropy(output, y_target):
    return - np.sum(np.log(output) * (y_target), axis=1)

xent = cross_entropy(smax, y_enc)
print('Cross Entropy:', xent)

Cross Entropy: [ 1.22245465  1.11692907  1.43720989  1.50979788]

def cost(output, y_target):
    return np.mean(cross_entropy(output, y_target))

J_cost = cost(smax, y_enc)
print('Cost: ', J_cost)

Cost:  1.32159787159
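Because each one-hot row contains a single 1, the cross-entropy of a sample reduces to the negative log of the probability assigned to its true class. The sketch below uses this shortcut with integer labels and reproduces the values above; `cross_entropy_from_labels` is a hypothetical helper name:

```python
import numpy as np


def cross_entropy_from_labels(output, y):
    """Per-sample cross-entropy, picking the true-class probability by index."""
    rows = np.arange(output.shape[0])
    return -np.log(output[rows, y])


smax = np.array([[0.29450637, 0.34216758, 0.36332605],
                 [0.21290077, 0.32728332, 0.45981591],
                 [0.42860913, 0.33380113, 0.23758974],
                 [0.44941979, 0.32962558, 0.22095463]])
y = np.array([0, 1, 2, 2])

xent = cross_entropy_from_labels(smax, y)
print('Cross Entropy:', xent)
print('Cost:', xent.mean())
```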

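In order to learn our softmax model (that is, to determine the weight coefficients) via gradient descent, we then need to compute the derivative

$\nabla_{\mathbf{w}_j} \, J(\mathbf{W}; \mathbf{b}).$

I don't want to walk through the tedious details here, but this cost derivative turns out to be simply

$\nabla_{\mathbf{w}_j} \, J(\mathbf{W}; \mathbf{b}) = \frac{1}{n} \sum^{n}_{i=1} \big[\mathbf{x}^{(i)}\ \big(O_{i,j} - T_{i,j} \big) \big],$

where the subscript $j$ selects the class-$j$ entry of the output and target of sample $i$. We can then use the cost derivative to update the weights in the opposite direction of the cost gradient with learning rate $\eta$:

$\mathbf{w}_j := \mathbf{w}_j - \eta \nabla_{\mathbf{w}_j} \, J(\mathbf{W}; \mathbf{b})$

for each class $j \in \{1, \ldots, k\}$

(note that $\mathbf{w}_j$ is the weight vector for the class $y=j$), and we update the bias units

$b_j := b_j - \eta \bigg[ \frac{1}{n} \sum^{n}_{i=1} \big(O_{i,j} - T_{i,j} \big) \bigg].$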
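The gradient expressions can be checked numerically. The sketch below computes the analytic gradients for the toy data from above in matrix form, $\frac{1}{n}\mathbf{X}^T(\mathbf{O} - \mathbf{T})$ for the weights and the column means of $\mathbf{O} - \mathbf{T}$ for the bias, and compares one weight entry against a central finite difference. The helper names and the step size `eps` are illustrative choices, not mlxtend's:

```python
import numpy as np

X = np.array([[0.1, 0.5], [1.1, 2.3], [-1.1, -2.3], [-1.5, -2.5]])
W = np.array([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]])
b = np.array([0.01, 0.1, 0.1])
T = np.eye(3)[np.array([0, 1, 2, 2])]  # one-hot targets


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cost(W, b):
    O = softmax(X.dot(W) + b)
    return np.mean(-np.sum(T * np.log(O), axis=1))


# Analytic gradients
O = softmax(X.dot(W) + b)
grad_W = X.T.dot(O - T) / X.shape[0]   # shape (n_features, n_classes)
grad_b = (O - T).mean(axis=0)          # shape (n_classes,)

# Central finite difference for the single weight entry W[0, 1]
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (cost(W_plus, b) - cost(W_minus, b)) / (2 * eps)
print('analytic: %.8f, numeric: %.8f' % (grad_W[0, 1], numeric))

# One gradient descent step with learning rate eta
eta = 0.01
W_new = W - eta * grad_W
b_new = b - eta * grad_b
print('cost before: %.6f, after: %.6f' % (cost(W, b), cost(W_new, b_new)))
```

Column $j$ of `grad_W` corresponds to $\nabla_{\mathbf{w}_j} J$, so the matrix form updates all classes at once.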

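As a penalty against complexity, an approach to reduce the variance of our model and decrease the degree of overfitting by adding additional bias, we can further add a regularization term such as the L2 term with the regularization parameter $\lambda$:

L2: $\frac{\lambda}{2} ||\mathbf{w}||_{2}^{2}$,

where

$||\mathbf{w}||_{2}^{2} = \sum^{m}_{l=1} \sum^{k}_{j=1} w_{l, j}^{2}$

so that our cost function becomes

$J(\mathbf{W}; \mathbf{b}) = \frac{1}{n} \sum_{i=1}^{n} H(T_i, O_i) + \frac{\lambda}{2} ||\mathbf{w}||_{2}^{2}$

and we define the "regularized" weight update as

$\mathbf{w}_j := \mathbf{w}_j - \eta \big[\nabla_{\mathbf{w}_j} \, J(\mathbf{W}; \mathbf{b}) + \lambda \mathbf{w}_j \big],$

where the gradient term refers to the unregularized cross-entropy cost.

(Please note that we don't regularize the bias term.)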
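Putting the pieces together, the sketch below runs 200 regularized gradient descent steps on the toy data. It is a from-scratch illustration of the update rules above under assumed settings (`eta`, `lam`, and the number of epochs are arbitrary choices), not the implementation used by `SoftmaxRegression`:

```python
import numpy as np

X = np.array([[0.1, 0.5], [1.1, 2.3], [-1.1, -2.3], [-1.5, -2.5]])
T = np.eye(3)[np.array([0, 1, 2, 2])]      # one-hot targets
W = np.array([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]])
b = np.array([0.01, 0.1, 0.1])
eta, lam = 0.1, 0.01                       # assumed learning rate and L2 strength


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def regularized_cost(W, b):
    O = softmax(X.dot(W) + b)
    cross_entropy = -np.sum(T * np.log(O), axis=1)
    return cross_entropy.mean() + 0.5 * lam * np.sum(W ** 2)  # bias not penalized


costs = []
for epoch in range(200):
    O = softmax(X.dot(W) + b)
    grad_W = X.T.dot(O - T) / X.shape[0]
    W = W - eta * (grad_W + lam * W)       # regularized weight update
    b = b - eta * (O - T).mean(axis=0)     # plain update for the bias
    costs.append(regularized_cost(W, b))

print('first cost: %.4f, last cost: %.4f' % (costs[0], costs[-1]))
```

Setting `lam = 0.0` recovers the unregularized updates.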

Example 1 - Gradient Descent

from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
from mlxtend.classifier import SoftmaxRegression
import matplotlib.pyplot as plt

# Loading Data

X, y = iris_data()
X = X[:, [0, 3]] # sepal length and petal width

# standardize
X[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()

lr = SoftmaxRegression(eta=0.01, 
                       epochs=500, 
                       minibatches=1, 
                       random_seed=1,
                       print_progress=3)
lr.fit(X, y)

plot_decision_regions(X, y, clf=lr)
plt.title('Softmax Regression - Gradient Descent')
plt.show()

plt.plot(range(len(lr.cost_)), lr.cost_)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

Iteration: 500/500 | Cost 0.06 | Elapsed: 0:00:00 | ETA: 0:00:00

Predicting Class Labels

y_pred = lr.predict(X)
print('Last 3 Class Labels: %s' % y_pred[-3:])

Last 3 Class Labels: [2 2 2]

Predicting Class Probabilities

y_pred = lr.predict_proba(X)
print('Last 3 Class Probabilities:\n %s' % y_pred[-3:])

Last 3 Class Probabilities:
 [[  9.18728149e-09   1.68894679e-02   9.83110523e-01]
 [  2.97052325e-11   7.26356627e-04   9.99273643e-01]
 [  1.57464093e-06   1.57779528e-01   8.42218897e-01]]

Example 2 - Stochastic Gradient Descent

from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
from mlxtend.classifier import SoftmaxRegression
import matplotlib.pyplot as plt

# Loading Data

X, y = iris_data()
X = X[:, [0, 3]] # sepal length and petal width

# standardize
X[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()

lr = SoftmaxRegression(eta=0.01, epochs=300, minibatches=len(y), random_seed=1)
lr.fit(X, y)

plot_decision_regions(X, y, clf=lr)
plt.title('Softmax Regression - Stochastic Gradient Descent')
plt.show()

plt.plot(range(len(lr.cost_)), lr.cost_)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

API

SoftmaxRegression(eta=0.01, epochs=50, l2=0.0, minibatches=1, n_classes=None, random_seed=None, print_progress=0)

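Softmax regression classifier.

Parameters

eta : float (default: 0.01)

Learning rate (between 0.0 and 1.0)

epochs : int (default: 50)

Passes over the training dataset. Prior to each epoch, the dataset is shuffled if minibatches > 1 to prevent cycles in stochastic gradient descent.

l2 : float (default: 0.0)

Regularization parameter for L2 regularization. No regularization if l2=0.0.

minibatches : int (default: 1)

The number of minibatches for gradient-based optimization.
If 1: Gradient Descent learning
If len(y): Stochastic Gradient Descent (SGD) online learning
If 1 < minibatches < len(y): SGD Minibatch learning

n_classes : int (default: None)

A positive integer to declare the number of class labels if not all class labels are present in a partial training set. Gets the number of class labels automatically if None.

random_seed : int (default: None)

Set random state for shuffling and initializing the weights.

print_progress : int (default: 0)

Prints progress in fitting to stderr.
0: No output
1: Epochs elapsed and cost
2: 1 plus time elapsed
3: 2 plus estimated time until completion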
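How the `minibatches` setting maps onto the three learning modes can be illustrated with plain NumPy. The following sketch shuffles sample indices and splits them with `np.array_split`; it is an assumption about the general scheme made for illustration, and the function name `iter_minibatches` is hypothetical rather than part of the mlxtend API:

```python
import numpy as np


def iter_minibatches(n_samples, minibatches, rng):
    """Yield index arrays that partition range(n_samples) into minibatches."""
    indices = np.arange(n_samples)
    if minibatches > 1:
        indices = rng.permutation(indices)  # shuffle only when there is more than one batch
    for batch in np.array_split(indices, minibatches):
        yield batch


rng = np.random.RandomState(1)
n_samples = 150

for minibatches in (1, 15, n_samples):
    sizes = [len(batch) for batch in iter_minibatches(n_samples, minibatches, rng)]
    print(minibatches, 'minibatches -> batch sizes', sorted(set(sizes)))
```

With 150 samples, `minibatches=1` gives one batch of all samples (gradient descent), `minibatches=15` gives batches of 10, and `minibatches=150` gives one sample per update (online SGD).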

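Attributes

w_ : 2d-array, shape={n_features, n_classes}

Model weights after fitting.

b_ : 1d-array, shape={n_classes,}

Bias units after fitting.

cost_ : list

List of floats; the mean cross-entropy cost of every epoch.

Examples

For usage examples, please see https://mlxtend.cn/mlxtend/user_guide/classifier/SoftmaxRegression/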

Methods

fit(X, y, init_params=True)

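Learn model from training data.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Target values.

init_params : bool (default: True)

Re-initializes model parameters prior to fitting. Set to False to continue training with the weights from a previous model fit.

Returns

self : object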

predict(X)

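Predict targets from X.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

target_values : array-like, shape = [n_samples]

Predicted target values.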

predict_proba(X)

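Predict class probabilities of X from the net input.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

Class probabilities : array-like, shape = [n_samples, n_classes]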

score(X, y)

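Compute the prediction accuracy.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Target values (true class labels).

Returns

acc : float

The prediction accuracy as a float between 0.0 and 1.0 (1.0 is a perfect score).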
