bootstrap_point632_score: The .632 and .632+ bootstrap for classifier evaluation

An implementation of the .632 bootstrap to evaluate supervised learning algorithms.

from mlxtend.evaluate import bootstrap_point632_score

Overview

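The bootstrap method was originally designed to determine the statistical properties of an estimator when the underlying distribution is unknown and additional samples are not available. To use this method for evaluating predictive models, such as hypotheses for classification and regression, we may prefer a slightly different approach to bootstrapping: the so-called Out-Of-Bag (OOB) or Leave-One-Out Bootstrap (LOOB) technique. Here, we use the out-of-bag samples as test sets for evaluation instead of evaluating the model on the training data. Out-of-bag samples are the unique sets of instances that are not used for model fitting, as shown in the figure below [1].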

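The figure above illustrates what three random bootstrap samples drawn from an exemplary ten-sample dataset ($X_1,X_2, ..., X_{10}$) and their out-of-bag samples for testing may look like. In practice, Bradley Efron and Robert Tibshirani recommend drawing 50 to 200 bootstrap samples as being sufficient for reliable estimates [2].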
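As a rough illustration of the sampling scheme described above, the following sketch draws three bootstrap rounds from ten sample indices and derives the out-of-bag indices for each. The helper name `bootstrap_oob_split` is hypothetical and is not part of mlxtend; it only mirrors the idea, not the library's internals.

```python
import numpy as np


def bootstrap_oob_split(n_samples, rng):
    """Return (train_idx, oob_idx) for one bootstrap round (illustrative helper)."""
    all_idx = np.arange(n_samples)
    # Bootstrap training set: n_samples indices drawn with replacement.
    train_idx = rng.choice(all_idx, size=n_samples, replace=True)
    # Out-of-bag set: every index that was never drawn in this round.
    oob_idx = np.setdiff1d(all_idx, train_idx)
    return train_idx, oob_idx


rng = np.random.default_rng(123)
for round_no in range(1, 4):
    train_idx, oob_idx = bootstrap_oob_split(10, rng)
    print(f"round {round_no}: train={np.sort(train_idx)}  oob={oob_idx}")
```

Each round trains on the (repeated) indices in `train_idx` and tests on `oob_idx`, so the two sets never overlap.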

.632 Bootstrap

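In 1983, Bradley Efron described the .632 Estimate, a further improvement that addresses the pessimistic bias of the bootstrap cross-validation approach described above [3]. The pessimistic bias in the "classic" bootstrap method can be attributed to the fact that the bootstrap samples only contain approximately 63.2% of the unique samples from the original dataset. For instance, we can compute the probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample as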

$P (\text{not chosen}) = \bigg(1 - \frac{1}{n}\bigg)^n,$

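which is asymptotically equivalent to $\frac{1}{e} \approx 0.368$ as $n \rightarrow \infty.$

Vice versa, we can compute the probability that a sample *is* chosen as $P (\text{chosen}) = 1 - \bigg(1 - \frac{1}{n}\bigg)^n \approx 0.632$ for reasonably large datasets, so that we select approximately $0.632 \times n$ unique samples as bootstrap training sets and reserve $0.368 \times n$ out-of-bag samples for testing in each iteration.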
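The limit above can be checked numerically. The sketch below evaluates $(1 - 1/n)^n$ for growing n and then measures the fraction of unique indices in a single simulated bootstrap sample; the function name `prob_not_chosen` and the seed are illustrative choices.

```python
import math
import numpy as np


def prob_not_chosen(n):
    """Probability that a given sample is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1000, 100_000):
    print(f"n={n:>6}: P(not chosen)={prob_not_chosen(n):.4f}  1/e={math.exp(-1):.4f}")

# Empirical check: fraction of unique indices in one bootstrap sample.
rng = np.random.default_rng(0)
n = 10_000
bootstrap_idx = rng.integers(0, n, size=n)
unique_fraction = np.unique(bootstrap_idx).size / n
print(f"unique fraction for n={n}: {unique_fraction:.3f}")
```

For large n the unique fraction should land close to 0.632, with the remaining share of roughly 0.368 forming the out-of-bag set.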

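Now, to address the bias that is due to this sampling with replacement, Bradley Efron proposed the .632 Estimate that we mentioned earlier, which is computed via the following equation: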

$\text{ACC}_{boot} = \frac{1}{b} \sum_{i=1}^b \big(0.632 \cdot \text{ACC}_{h, i} + 0.368 \cdot \text{ACC}_{train}\big),$

where $\text{ACC}_{train}$ is the accuracy computed on the whole training set, and $\text{ACC}_{h, i}$ is the accuracy on the out-of-bag sample of the i-th bootstrap round.
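To make the weighting concrete, here is a minimal sketch of the .632 estimate computed from already-measured accuracies. The function name `point632_estimate` and the example numbers are made up for illustration; this is not how mlxtend exposes the computation.

```python
import numpy as np


def point632_estimate(acc_oob, acc_train):
    """Average of 0.632 * ACC_oob + 0.368 * ACC_train over all bootstrap rounds.

    acc_oob: out-of-bag accuracy per round.
    acc_train: training-set accuracy, either one value or one value per round.
    """
    acc_oob = np.asarray(acc_oob, dtype=float)
    acc_train = np.asarray(acc_train, dtype=float)
    per_round = 0.632 * acc_oob + 0.368 * acc_train
    return float(per_round.mean())


# Hypothetical accuracies from three bootstrap rounds of an overfitting model.
acc_oob = [0.90, 0.94, 0.92]
acc_train = 1.0
print(point632_estimate(acc_oob, acc_train))
```

With a mean out-of-bag accuracy of 0.92 and a training accuracy of 1.0, the estimate is 0.632 × 0.92 + 0.368 × 1.0 = 0.94944, which lies between the pessimistic and the optimistic figure.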

.632+ Bootstrap

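Now, while the .632 Bootstrap attempts to address the pessimistic bias of the estimate, an optimistic bias may occur with models that tend to overfit, so Bradley Efron and Robert Tibshirani proposed the .632+ Bootstrap Method (Efron and Tibshirani, 1997) [4]. Instead of using a fixed "weight" $\omega = 0.632$ in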

$\text{ACC}_{boot} = \frac{1}{b} \sum_{i=1}^b \big(\omega \cdot \text{ACC}_{h, i} + (1-\omega) \cdot \text{ACC}_{train} \big),$

we compute the weight $\omega$ as

$\omega = \frac{0.632}{1 - 0.368 \times R},$

where R is the *relative overfitting rate*

$R = \frac{(-1) \times (\text{ACC}_{h, i} - \text{ACC}_{train})}{\gamma - (1 -\text{ACC}_{train})}.$

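(Since we plug $\omega$ into the equation for $\text{ACC}_{boot}$ that we defined above, $\text{ACC}_{h, i}$ and $\text{ACC}_{train}$ still refer to the out-of-bag accuracy in the i-th bootstrap round and the accuracy on the whole training set, respectively.)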
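The weight computation can be sketched as follows. This sketch assumes the relative overfitting rate as defined by Efron and Tibshirani (1997), that is, (out-of-bag error minus training error) divided by ($\gamma$ minus training error), with each error taken as one minus the corresponding accuracy. Restricting R to the interval [0, 1] and the zero-denominator guard are simplifying assumptions of this example, and `point632plus_weight` is a hypothetical name.

```python
def point632plus_weight(acc_oob, acc_train, gamma):
    """Weight omega for the .632+ estimate of a single bootstrap round.

    acc_oob: out-of-bag accuracy, acc_train: training-set accuracy,
    gamma: no-information rate expressed as an error rate.
    """
    err_oob = 1.0 - acc_oob
    err_train = 1.0 - acc_train
    denom = gamma - err_train
    # Assumption: treat a non-positive denominator as "no overfitting".
    r = (err_oob - err_train) / denom if denom > 0 else 0.0
    # Assumption: keep the relative overfitting rate inside [0, 1].
    r = min(max(r, 0.0), 1.0)
    return 0.632 / (1.0 - 0.368 * r)


# No overfitting (R = 0) gives the plain .632 weight.
print(point632plus_weight(acc_oob=0.95, acc_train=0.95, gamma=0.5))
# Moderate overfitting (R = 0.2) shifts weight toward the out-of-bag accuracy.
print(point632plus_weight(acc_oob=0.90, acc_train=1.00, gamma=0.5))
```

As R grows from 0 to 1, the weight moves from 0.632 toward 1, so a strongly overfitting model is scored almost entirely on its out-of-bag accuracy.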

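Further, we need to determine the *no-information rate* $\gamma$ in order to compute R. For instance, we can compute $\gamma$ by evaluating the model's predictions on all possible combinations of samples $x_{i'}$ and target class labels $y_{i}$; that is, we pretend that the observations and class labels are independent: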

$\gamma = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{i '=1}^{n} L(y_{i}, f(x_{i '})).$
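A minimal sketch of this double sum, assuming the loss L is the 0-1 loss (1 for a mismatch, 0 for a match), might look like the following. The function name and the toy labels are illustrative only.

```python
import numpy as np


def no_information_rate(y_true, y_pred):
    """Mean 0-1 loss over all n * n pairings of true labels and predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Broadcasting compares every label y_i with every prediction f(x_i').
    mismatches = y_true[:, None] != y_pred[None, :]
    return float(mismatches.mean())


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
# 8 of the 16 label/prediction pairs disagree.
print(no_information_rate(y_true, y_pred))  # 0.5
```

Note that the comparison matrix has n × n entries, so this brute-force form becomes memory-hungry for large datasets.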

Alternatively, we can estimate the no-information rate $\gamma$ as follows:

$\gamma = \sum_{k=1}^K p_k (1 - q_k),$

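where $p_k$ is the proportion of class $k$ samples observed in the dataset, and $q_k$ is the proportion of samples that the classifier predicts as class $k$ in the dataset.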
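The class-proportion form can be sketched in a few lines. For the 0-1 loss it gives the same value as averaging over all label/prediction pairs, which the example verifies on a toy input; the function name is hypothetical.

```python
import numpy as np


def no_information_rate_from_proportions(y_true, y_pred):
    """Compute gamma = sum_k p_k * (1 - q_k) from class proportions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.union1d(y_true, y_pred)
    # p_k: observed share of class k; q_k: predicted share of class k.
    p = np.array([(y_true == k).mean() for k in classes])
    q = np.array([(y_pred == k).mean() for k in classes])
    return float(np.sum(p * (1.0 - q)))


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
gamma = no_information_rate_from_proportions(y_true, y_pred)

# Brute-force cross-check over all pairs, using the 0-1 loss.
pairwise = float((np.asarray(y_true)[:, None] != np.asarray(y_pred)[None, :]).mean())
print(gamma, pairwise)
```

Here p = (0.5, 0.5) and q = (0.25, 0.75), so $\gamma$ = 0.5 × 0.75 + 0.5 × 0.25 = 0.5. This form needs only class counts rather than an n × n comparison.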

References

[1] https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
[2] Efron, Bradley, and Robert J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
[3] Efron, Bradley. 1983. “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation.” Journal of the American Statistical Association 78 (382): 316. doi:10.2307/2288636.
[4] Efron, Bradley, and Robert Tibshirani. 1997. “Improvements on Cross-Validation: The .632+ Bootstrap Method.” Journal of the American Statistical Association 92 (438): 548. doi:10.2307/2965703.

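Example 1 -- Evaluating the predictive performance of a model via the classic out-of-bag Bootstrap

The `bootstrap_point632_score` function mimics the behavior of scikit-learn's `cross_val_score`, and a typical usage example is shown below: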

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from mlxtend.evaluate import bootstrap_point632_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
tree = DecisionTreeClassifier(random_state=123)

# Model accuracy
scores = bootstrap_point632_score(tree, X, y, method='oob')
acc = np.mean(scores)
print('Accuracy: %.2f%%' % (100*acc))


# Confidence interval
lower = np.percentile(scores, 2.5)
upper = np.percentile(scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))

Accuracy: 94.45%
95% Confidence interval: [87.71, 100.00]

Example 2 -- Evaluating the predictive performance of a model via the .632 Bootstrap

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from mlxtend.evaluate import bootstrap_point632_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
tree = DecisionTreeClassifier(random_state=123)

# Model accuracy
scores = bootstrap_point632_score(tree, X, y)
acc = np.mean(scores)
print('Accuracy: %.2f%%' % (100*acc))


# Confidence interval
lower = np.percentile(scores, 2.5)
upper = np.percentile(scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))

Accuracy: 96.42%
95% Confidence interval: [92.41, 100.00]

Example 3 -- Evaluating the predictive performance of a model via the .632+ Bootstrap

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from mlxtend.evaluate import bootstrap_point632_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
tree = DecisionTreeClassifier(random_state=123)

# Model accuracy
scores = bootstrap_point632_score(tree, X, y, method='.632+')
acc = np.mean(scores)
print('Accuracy: %.2f%%' % (100*acc))


# Confidence interval
lower = np.percentile(scores, 2.5)
upper = np.percentile(scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))

Accuracy: 96.29%
95% Confidence interval: [91.86, 98.92]

API

bootstrap_point632_score(estimator, X, y, n_splits=200, method='.632', scoring_func=None, predict_proba=False, random_seed=None, clone_estimator=True)

Implementation of the .632 [1] and .632+ [2] bootstrap for supervised learning

References:

- [1] Efron, Bradley. 1983. "Estimating the Error Rate
of a Prediction Rule: Improvement on Cross-Validation."
Journal of the American Statistical Association
78 (382): 316. doi:10.2307/2288636.
- [2] Efron, Bradley, and Robert Tibshirani. 1997.
"Improvements on Cross-Validation: The .632+ Bootstrap Method."
Journal of the American Statistical Association
92 (438): 548. doi:10.2307/2965703.

Parameters

estimator : object

An estimator for classification or regression that follows the scikit-learn API and implements "fit" and "predict" methods.

X : array-like

The data to fit. Can be, for example, a list or an array that is at least 2d.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

n_splits : int (default=200)

Number of bootstrap iterations. Must be larger than 1.

method : str (default='.632')

The bootstrap method, which can be one of:

- 1) '.632' bootstrap (default)
- 2) '.632+' bootstrap
- 3) 'oob' (regular out-of-bag, no weighting) for comparison studies.

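scoring_func : callable (default=None)

Score function (or loss function) with signature scoring_func(y, y_pred, **kwargs). If None, uses classification accuracy if the estimator is a classifier and mean squared error if the estimator is a regressor.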

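predict_proba : bool (default=False)

Whether to use the `predict_proba` function of the estimator argument. This is to be used in conjunction with a scoring_func that takes in probability values instead of actual predictions. For example, if the scoring_func is `sklearn.metrics.roc_auc_score`, then use predict_proba=True. Note that this requires the estimator to implement a `predict_proba` method.
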
random_seed : int (default=None)

If int, random_seed is the seed used by the random number generator.

clone_estimator : bool (default=True)

Clones the estimator if true, otherwise fits the original.

Returns

scores : array of float, shape=(n_splits,)

Array of scores of the estimator for each bootstrap replicate.

Examples

    >>> import numpy as np
    >>> from sklearn import datasets, linear_model
    >>> from mlxtend.evaluate import bootstrap_point632_score
    >>> iris = datasets.load_iris()
    >>> X = iris.data
    >>> y = iris.target
    >>> lr = linear_model.LogisticRegression()
    >>> scores = bootstrap_point632_score(lr, X, y)
    >>> acc = np.mean(scores)
    >>> print('Accuracy:', acc)
    Accuracy: 0.953023146884
    >>> lower = np.percentile(scores, 2.5)
    >>> upper = np.percentile(scores, 97.5)
    >>> print('95%% Confidence interval: [%.2f, %.2f]' % (lower, upper))
    95% Confidence interval: [0.90, 0.98]

    For more usage examples, please see
    https://mlxtend.cn/mlxtend/user_guide/evaluate/bootstrap_point632_score/
