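bias_variance_decomp: Bias-variance decomposition for classification and regression losses

Bias-variance decomposition of machine learning algorithms for various loss functions.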

from mlxtend.evaluate import bias_variance_decomp

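Overview

Researchers often use the terms bias and variance, or "bias-variance tradeoff", to describe the performance of a model; that is, you may come across talks, books, or articles where people say that a model has high variance or high bias. So, what does that mean? In general, we might say that "high variance" is associated with overfitting and "high bias" with underfitting.

So why attempt this bias-variance decomposition in the first place? Decomposing the loss into bias and variance helps us understand learning algorithms, since these concepts are related to underfitting and overfitting.

To use the more formal terms for bias and variance, assume we have a point estimator $\hat{\theta}$ of some parameter or function $\theta$. Then, the bias is commonly defined as the difference between the expected value of the estimator and the parameter that we want to estimate: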

$\text{Bias} = E[\hat{\theta}] - \theta.$

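If the bias is larger than zero, we also say that the estimator is positively biased; if the bias is smaller than zero, the estimator is negatively biased; and if the bias is exactly zero, the estimator is unbiased. Similarly, we define the variance as the difference between the expected value of the squared estimator and the squared expected value of the estimator: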

$\text{Var}(\hat{\theta}) = E\big[\hat{\theta}^2\big] - \bigg(E\big[\hat{\theta}\big]\bigg)^2.$

Note that in the context of this lecture, it will be more convenient to write the variance in its alternative form:

$\text{Var}(\hat{\theta}) = E[(E[{\hat{\theta}}] - \hat{\theta})^2].$
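To make these definitions concrete, the following sketch estimates the bias and variance of an estimator by simulation. The choice of estimator (the sample variance that divides by n, applied to Gaussian data) and all names in the code are illustrative assumptions, not part of the original text.

```python
import random

rng = random.Random(0)

TRUE_SIGMA2 = 4.0   # theta: variance of the data-generating Gaussian (assumed)
N = 5               # sample size of each simulated dataset
REPEATS = 20000     # number of simulated datasets


def biased_variance(sample):
    """Variance estimate that divides by n rather than n - 1."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


# One estimate theta_hat per simulated dataset
estimates = [
    biased_variance([rng.gauss(0.0, TRUE_SIGMA2 ** 0.5) for _ in range(N)])
    for _ in range(REPEATS)
]

expected = sum(estimates) / REPEATS          # approximates E[theta_hat]
bias = expected - TRUE_SIGMA2                # E[theta_hat] - theta

# The two ways of writing the variance of the estimator
var_form_1 = sum(e ** 2 for e in estimates) / REPEATS - expected ** 2
var_form_2 = sum((expected - e) ** 2 for e in estimates) / REPEATS

print(f"bias: {bias:.3f}")
print(f"variance (form 1): {var_form_1:.3f}")
print(f"variance (form 2): {var_form_2:.3f}")
```

For this estimator the theoretical bias is $-\sigma^2/n = -0.8$, so the simulated bias should be negative and close to that value, and the two variance forms should agree up to floating-point error.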

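To illustrate the concept further in the context of machine learning ...

Suppose there is an unknown target function or "true function" that we want to approximate. Now, suppose we have different training sets drawn from an unknown distribution defined as "true function + noise." The following plot shows different linear regression models, each fit to a different training set. None of these hypotheses approximates the true function well, except at two points (around x=-10 and x=6). Here, we can say that the bias is large because the difference between the true value and the predicted value is large on average (here, "average" means the expectation over training sets, not the expectation over examples in the training set).

The next plot shows different unpruned decision tree models, each fit to a different training set. Note that these hypotheses fit the training data very closely. However, if we consider the expectation over training sets, the average hypothesis would fit the true function perfectly (given that the noise is unbiased and has an expected value of 0). As we can see, the variance is very high, since on average, a prediction differs a lot from the expected value of the prediction.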
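The contrast described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the sine target, the noise level, and the two toy models (a constant predictor standing in for an underfitting model, a 1-nearest-neighbor predictor standing in for an unpruned tree) are illustrative choices, not the models behind the plots.

```python
import math
import random

rng = random.Random(1)


def true_function(x):
    return math.sin(x)


def draw_training_set(n=20, noise_sd=0.5):
    """Sample n points from 'true function + noise'."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [true_function(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def constant_model(xs, ys, x_query):
    """Underfitting stand-in: always predicts the mean training target."""
    return sum(ys) / len(ys)


def nearest_neighbor_model(xs, ys, x_query):
    """Overfitting stand-in: copies the target of the closest training point."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_query))
    return ys[closest]


def bias_and_variance(model, x_query, n_training_sets=2000):
    # One prediction at x_query per freshly drawn training set
    preds = [model(*draw_training_set(), x_query) for _ in range(n_training_sets)]
    mean_pred = sum(preds) / len(preds)          # expectation over training sets
    bias = mean_pred - true_function(x_query)
    variance = sum((mean_pred - p) ** 2 for p in preds) / len(preds)
    return bias, variance


x_query = math.pi / 2
const_bias, const_var = bias_and_variance(constant_model, x_query)
nn_bias, nn_var = bias_and_variance(nearest_neighbor_model, x_query)

print(f"constant model:   bias={const_bias:.3f}, variance={const_var:.3f}")
print(f"nearest neighbor: bias={nn_bias:.3f}, variance={nn_var:.3f}")
```

In this setup the constant model should show a large bias with little variance, while the nearest-neighbor model should show a small bias with a variance close to the noise variance.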

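Bias-Variance Decomposition of the Squared Loss

We can decompose a loss function such as the squared loss into three terms: a variance term, a bias term, and a noise term (the same is true for the decomposition of the 0-1 loss later). However, for simplicity, we will ignore the noise term.

Before we introduce the bias-variance decomposition of the 0-1 loss for classification, let us start with the decomposition of the squared loss as an easy warm-up exercise to get familiar with the overall concept.

The previous section already listed the common formal definitions of bias and variance; however, let us define them again for convenience: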

$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta, \quad \text{Var}(\hat{\theta}) = E[(E[{\hat{\theta}}] - \hat{\theta})^2].$

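Recall that in the context of these machine learning lecture notes, we defined

the true or target function as $y = f(x)$,
the predicted target value as $\hat{y} = \hat{f}(x) = h(x)$,
and the squared loss as $S = (y - \hat{y})^2$. (I use $S$ here because it is easier to tell apart from $E$, which we use for the expectation in this lecture.)

Note that unless noted otherwise, the expectation is over training sets!

To get started with the squared error loss decomposition into bias and variance, let us do some algebraic manipulation, i.e., adding and subtracting the expected value $E[\hat{y}]$ and then expanding the expression using the binomial formula $(a+b)^2 = a^2 + b^2 + 2ab$:

$\begin{equation} \begin{split} S &= (y - \hat{y})^2 \\ (y - \hat{y})^2 &= (y - E[{\hat{y}}] + E[{\hat{y}}] - \hat{y})^2 \\ &= (y-E[{\hat{y}}])^2 + (E[{\hat{y}}] - \hat{y})^2 + 2(y - E[\hat{y}])(E[\hat{y}] - \hat{y}). \end{split} \end{equation}$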

Next, we just take the expectation on both sides, and we are already done:

$\begin{align} E[S] &= E[(y - \hat{y})^2] \\ E[(y - \hat{y})^2] &= (y-E[{\hat{y}}])^2 + E[(E[{\hat{y}}] - \hat{y})^2]\\ &= \text{[Bias]}^2 + \text{Variance}. \end{align}$

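You may wonder what happened to the "$2ab$" term ($2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})$) when we took the expectation. It turns out that it evaluates to zero and hence vanishes from the equation. This can be shown as follows, using the fact that the factor $y - E[\hat{y}]$ does not depend on the training set and can therefore be pulled out of the expectation: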

$\begin{align} E[2(y - E[{\hat{y}}])(E[{\hat{y}}] - \hat{y})] &= 2 E[(y - E[{\hat{y}}])(E[{\hat{y}}] - \hat{y})] \\ &= 2(y - E[{\hat{y}}])E[(E[{\hat{y}}] - \hat{y})] \\ &= 2(y - E[{\hat{y}}])(E[E[{\hat{y}}]] - E[\hat{y}])\\ &= 2(y - E[{\hat{y}}])(E[{\hat{y}}] - E[{\hat{y}}]) \\ &= 0. \end{align}$
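The vanishing cross term can also be checked numerically. The sketch below uses made-up predictions (drawn from a Gaussian purely for illustration) in place of the outputs of models trained on different training sets, and replaces each expectation by an average over those predictions.

```python
import random

rng = random.Random(42)

y = 3.0  # fixed true target at a single input x (assumed value)
# Stand-in for predictions from 1000 hypothetical training sets
y_hat = [2.0 + rng.gauss(0.0, 1.5) for _ in range(1000)]

n = len(y_hat)
mean_pred = sum(y_hat) / n                                # E[y_hat]

expected_loss = sum((y - p) ** 2 for p in y_hat) / n      # E[(y - y_hat)^2]
bias_sq = (y - mean_pred) ** 2                            # (y - E[y_hat])^2
variance = sum((mean_pred - p) ** 2 for p in y_hat) / n   # E[(E[y_hat] - y_hat)^2]
cross_term = sum(2 * (y - mean_pred) * (mean_pred - p) for p in y_hat) / n

print(f"expected loss:    {expected_loss:.6f}")
print(f"bias^2 + variance: {bias_sq + variance:.6f}")
print(f"cross term:       {cross_term:.2e}")
```

Because the averages are taken over the same set of predictions, the identity holds exactly here (up to floating-point error), not just approximately.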

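So, this is the canonical decomposition of the squared error loss into bias and variance. The next section will discuss some approaches for decomposing the 0-1 loss that we commonly use for classification accuracy or error.

The following figure is a sketch of variance and bias in relation to the training error and generalization error: how high variance relates to overfitting, and how large bias relates to underfitting.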

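Bias-Variance Decomposition of the 0-1 Loss

Note that decomposing the 0-1 loss into bias and variance components is not as straightforward as for the squared error loss. To quote Pedro Domingos, a well-known machine learning researcher and professor at the University of Washington:

"several authors have proposed bias-variance decompositions related to zero-one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996; Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has significant shortcomings." [1]

In fact, the paper this quote was taken from may offer the most intuitive and general formulation at this point. However, for simplicity, we will first look at Kong & Dietterich's formulation [2] of the 0-1 loss decomposition, which is the same as Domingos's but excludes the noise term.

The table below relates the terms we used for the squared loss to their counterparts for the 0-1 loss. Recall that the 0-1 loss, $L$, is 0 if a class label is predicted correctly, and 1 otherwise. The main prediction for the squared error loss is simply the average over the predictions $E[\hat{y}]$ (the expectation is over training sets); for the 0-1 loss, Kong & Dietterich and Domingos defined it as the mode. That is, if a model predicts the label 1 more than 50% of the time (considering all possible training sets), then the main prediction is 1, and 0 otherwise.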

Term	Squared loss	0-1 loss
Single loss	$(y - \hat{y})^2$	$L(y, \hat{y})$
Expected loss	$E[(y - \hat{y})^2]$	$E[L(y, \hat{y})]$
Main prediction $E[\hat{y}]$	mean (average)	mode
Bias$^2$	$(y-E[{\hat{y}}])^2$	$L(y, E[\hat{y}])$
Variance	$E[(E[{\hat{y}}] - \hat{y})^2]$	$E[L(\hat{y}, E[\hat{y}])]$

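Hence, as a result of using the mode to define the main prediction of the 0-1 loss, the bias is 1 if the main prediction does not agree with the true label $y$, and 0 otherwise: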

$Bias = \begin{cases} 1 \text{ if } y \neq E[{\hat{y}}], \\ 0 \text{ otherwise}. \end{cases}$

The variance of the 0-1 loss is defined as the probability that the predicted label does not match the main prediction:

$Variance = P(\hat{y} \neq E[\hat{{y}}]).$
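As a small illustration of these two definitions, the sketch below computes the main prediction, bias, and variance for a single test example. The function name and the list of predicted labels are hypothetical; each label stands for the prediction of a model trained on a different training set.

```python
from collections import Counter


def zero_one_bias_variance(y_true, predictions):
    """Return (main prediction, bias, variance) for one test example."""
    # Mode of the predicted labels; on a tie, Counter keeps the label seen first
    main_prediction = Counter(predictions).most_common(1)[0][0]
    bias = 1 if main_prediction != y_true else 0
    # Fraction of predictions that disagree with the main prediction
    variance = sum(p != main_prediction for p in predictions) / len(predictions)
    return main_prediction, bias, variance


# Predicted labels for one test example from 10 hypothetical training sets
predictions = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

main_prediction, bias, variance = zero_one_bias_variance(1, predictions)
print(main_prediction, bias, variance)  # prints: 1 0 0.2
```

With the same predictions but a true label of 0, the main prediction would be wrong, so the bias would be 1 while the variance stays at 0.2.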

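Next, let us take a look at what happens to the loss if the bias is 0. Given the general definition of the loss, loss = bias + variance, if the bias is 0, then we define the loss as the variance:

$Loss = P(\hat{y} \neq y) = 0 + Variance = P(\hat{y} \neq E[\hat{{y}}]).$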

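In other words, if a model has zero bias, its loss is entirely determined by the variance, which is intuitive if we think of variance in the context of overfitting.

The more surprising scenario is when the bias equals 1. As Pedro Domingos explains, if the bias equals 1, increasing the variance can decrease the loss, which is an interesting observation. This can be seen by first rewriting the 0-1 loss function as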

$Loss = P(\hat{y} \neq y) = 1 - P(\hat{y} = y).$

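(Note that we have not done anything new yet.) Now, if we look at the previous equation for the bias, if the bias is 1, we have $y \neq E[{\hat{y}}]$. If $y$ is not equal to the main prediction, but $\hat{y}$ is equal to $y$, then $\hat{y}$ cannot be equal to the main prediction. In a two-class problem the converse also holds: a prediction that differs from the main prediction must equal $y$. Using the "inverse" ("1 minus"), we can then write the loss as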

$Loss = P(\hat{y} \neq y) = 1 - P(\hat{y} = y) = 1 - P(\hat{y} \neq E[{\hat{y}}]).$

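Since the bias is 1, the loss is hence defined as "loss = bias - variance" if the bias is 1 (or "loss = 1 - variance"). This might be quite unintuitive at first, but the explanation Kong, Dietterich, and Domingos offer is that if a model has a very high bias such that its main prediction is always wrong, increasing the variance can be beneficial, since increasing the variance would push the decision boundary, which might lead to some correct predictions just by chance. In other words, for scenarios with high bias, increasing the variance can improve (decrease) the loss!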
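Both cases can be checked by enumeration for a two-class problem. The sketch below is an illustration with hypothetical names: it takes a single test example with true label 1, varies how many of 10 models (one per training set) predict it correctly, and compares the loss with "variance" when the bias is 0 and with "1 - variance" when the bias is 1.

```python
def decompose(y_true, predictions):
    """Return (loss, bias, variance) for binary 0/1 labels at one test example."""
    n = len(predictions)
    # Mode of the 0/1 labels; a 50/50 tie is resolved to 0 in this sketch
    main_prediction = 1 if sum(predictions) > n / 2 else 0
    loss = sum(p != y_true for p in predictions) / n
    bias = int(main_prediction != y_true)
    variance = sum(p != main_prediction for p in predictions) / n
    return loss, bias, variance


y_true = 1
rows = []
for n_correct in range(11):
    # n_correct models predict the true label, the rest predict the other class
    preds = [1] * n_correct + [0] * (10 - n_correct)
    loss, bias, variance = decompose(y_true, preds)
    predicted_loss = variance if bias == 0 else 1 - variance
    rows.append((n_correct, loss, bias, variance, predicted_loss))
    print(f"correct={n_correct:2d}  loss={loss:.1f}  bias={bias}  variance={variance:.1f}")
```

In the rows with bias 1, the loss falls as the variance rises, which is the effect described above. The identity relies on there being only two classes; with more classes, a prediction that differs from the main prediction is not necessarily correct.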

References

[1] Domingos, Pedro. "A unified bias-variance decomposition." Proceedings of 17th International Conference on Machine Learning. 2000.
[2] Dietterich, Thomas G., and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995.

Example 1 -- Bias-Variance Decomposition of a Decision Tree Classifier

from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split


X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)



tree = DecisionTreeClassifier(random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test, 
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 0.062
Average bias: 0.022
Average variance: 0.040

For comparison, the bias-variance decomposition of a bagging classifier, which should intuitively have a lower variance than a single decision tree:

from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier(random_state=123)
bag = BaggingClassifier(estimator=tree,  # named base_estimator before scikit-learn 1.2
                        n_estimators=100,
                        random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test, 
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 0.048
Average bias: 0.022
Average variance: 0.026

Example 2 -- Bias-Variance Decomposition of a Decision Tree Regressor

from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeRegressor
from mlxtend.data import boston_housing_data
from sklearn.model_selection import train_test_split


X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)



tree = DecisionTreeRegressor(random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test, 
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 31.536
Average bias: 14.096
Average variance: 17.440

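For comparison, the bias-variance decomposition of a bagging regressor is shown below, which should intuitively have a lower variance than a single decision tree: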

from sklearn.ensemble import BaggingRegressor

tree = DecisionTreeRegressor(random_state=123)
bag = BaggingRegressor(estimator=tree,  # named base_estimator before scikit-learn 1.2
                       n_estimators=100,
                       random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test, 
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 18.620
Average bias: 15.461
Average variance: 3.159

Example 3 -- TensorFlow/Keras Support

Since mlxtend v0.18.0, bias_variance_decomp supports Keras models. Note that the original model is reset in each round (before refitting it to the bootstrap samples).

from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import boston_housing_data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import tensorflow as tf
import numpy as np


np.random.seed(1)
tf.random.set_seed(1)


X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)


model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation=tf.nn.relu),
    tf.keras.layers.Dense(1)
  ])

optimizer = tf.keras.optimizers.Adam()
model.compile(loss='mean_squared_error', optimizer=optimizer)

model.fit(X_train, y_train, epochs=100, verbose=0)

mean_squared_error(y_test, model.predict(X_test))  # scikit-learn order: (y_true, y_pred)

32.69300595184836

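Note that it is highly recommended to use the same number of training epochs that you would use on the original training set to ensure convergence: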

np.random.seed(1)
tf.random.set_seed(1)


avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model, X_train, y_train, X_test, y_test, 
        loss='mse',
        num_rounds=100,
        random_seed=123,
        epochs=200, # fit_param
        verbose=0) # fit_param


print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 32.740
Average bias: 27.474
Average variance: 5.265

API

bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, loss='0-1_loss', num_rounds=200, random_seed=None, **fit_params)

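estimator : object
A classifier or regressor object or class implementing both a fit and a predict method, similar to the scikit-learn API.

X_train : array-like, shape=(num_examples, num_features)
A training dataset for drawing the bootstrap samples to carry out the bias-variance decomposition.

y_train : array-like, shape=(num_examples)
Targets (class labels for classification, continuous values for regression) associated with the X_train examples.

X_test : array-like, shape=(num_examples, num_features)
The test dataset for computing the average loss, bias, and variance.

y_test : array-like, shape=(num_examples)
Targets (class labels for classification, continuous values for regression) associated with the X_test examples.

loss : str (default='0-1_loss')
Loss function for performing the bias-variance decomposition. Currently allowed values are '0-1_loss' and 'mse'.

num_rounds : int (default=200)
Number of bootstrap rounds (sampling from the training set) for performing the bias-variance decomposition. Each bootstrap sample has the same size as the original training set.

random_seed : int (default=None)
Random seed for the bootstrap sampling used for the bias-variance decomposition.

fit_params : additional parameters
Additional parameters to be passed to the .fit() function of the estimator when it is fit to the bootstrap samples.

avg_expected_loss, avg_bias, avg_var : floats
Returns the average expected loss, average bias, and average variance (all floats), where the average is computed over the data points in the test set.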
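The parameter descriptions above suggest a bootstrap procedure: draw num_rounds bootstrap samples of the training set, refit the estimator on each, predict the test set, and average over the test examples. The following is a minimal pure-Python sketch of that idea for the 'mse' case only. It is an assumption about how such a procedure can be written, not mlxtend's actual implementation; MeanRegressor and mse_decomposition_sketch are hypothetical names, and the sketch reports the squared bias so that the loss equals bias plus variance.

```python
import random


class MeanRegressor:
    """Hypothetical stand-in estimator with a scikit-learn-like fit/predict API."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def mse_decomposition_sketch(estimator, X_train, y_train, X_test, y_test,
                             num_rounds=200, random_seed=None):
    rng = random.Random(random_seed)
    n_train, n_test = len(X_train), len(X_test)

    all_preds = []  # one list of test-set predictions per bootstrap round
    for _ in range(num_rounds):
        # Bootstrap sample: same size as the training set, drawn with replacement
        idx = [rng.randrange(n_train) for _ in range(n_train)]
        estimator.fit([X_train[i] for i in idx], [y_train[i] for i in idx])
        all_preds.append(estimator.predict(X_test))

    avg_loss = avg_bias = avg_var = 0.0
    for j in range(n_test):
        preds_j = [round_preds[j] for round_preds in all_preds]
        main_pred = sum(preds_j) / num_rounds   # mean prediction over rounds
        avg_loss += sum((p - y_test[j]) ** 2 for p in preds_j) / num_rounds
        avg_bias += (main_pred - y_test[j]) ** 2
        avg_var += sum((p - main_pred) ** 2 for p in preds_j) / num_rounds
    # Average the per-example quantities over the test set
    return avg_loss / n_test, avg_bias / n_test, avg_var / n_test


# Synthetic data for illustration only
data_rng = random.Random(0)
X = [[data_rng.uniform(0, 1)] for _ in range(60)]
y = [2.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]

loss, bias, var = mse_decomposition_sketch(
    MeanRegressor(), X[:40], y[:40], X[40:], y[40:],
    num_rounds=50, random_seed=123)
print(f"loss={loss:.4f}  bias={bias:.4f}  variance={var:.4f}")
```

In this sketch the returned loss equals bias plus variance by construction, which matches the pattern of the regression outputs in Example 2 (for instance, 14.096 + 17.440 = 31.536).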

Examples

For usage examples, please see https://mlxtend.cn/mlxtend/user_guide/evaluate/bias_variance_decomp/
