ExhaustiveFeatureSelector:通过考虑所有可能的特征组合来选择最优特征集

穷举特征选择器的实现,用于在指定范围内采样和评估所有可能的特征组合。

from mlxtend.feature_selection import ExhaustiveFeatureSelector

概述

这个穷举特征选择算法是一种封装方法(wrapper),用于对特征子集进行暴力评估:给定任意回归器或分类器,通过优化指定的性能指标来选择最佳子集。例如,如果分类器是逻辑回归且数据集包含 4 个特征,当 min_features=1 且 max_features=4 时,该算法将评估所有 15 种特征组合:

  • {0}
  • {1}
  • {2}
  • {3}
  • {0, 1}
  • {0, 2}
  • {0, 3}
  • {1, 2}
  • {1, 3}
  • {2, 3}
  • {0, 1, 2}
  • {0, 1, 3}
  • {0, 2, 3}
  • {1, 2, 3}
  • {0, 1, 2, 3}

并选择使逻辑回归分类器达到最佳性能(例如,分类准确率)的组合。
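作为参考,下面的小片段(仅为示意)用 Python 标准库验证了上述组合数量:当有 4 个特征且 min_features=1、max_features=4 时,候选子集共有 15 个。

from itertools import combinations

n_features = 4
# 枚举大小从 1 到 4 的所有特征索引子集
subsets = [c for k in range(1, n_features + 1)
           for c in combinations(range(n_features), k)]
print(len(subsets))  # 15,与上面列出的组合一一对应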

示例 1 - 简单的鸢尾花数据集示例

从 scikit-learn 初始化一个简单分类器

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

iris = load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3)

efs1 = EFS(knn, 
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=5)

efs1 = efs1.fit(X, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 15/15

Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('0', '2', '3')

特征名称


处理大型数据集时,特征索引可能难以解释。在这种情况下,建议使用带有明确列名的 pandas DataFrame 作为输入

import pandas as pd

df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
df_X.head()
   Sepal length  Sepal width  Petal length  Petal width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
efs1 = efs1.fit(df_X, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 15/15

Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('Sepal length', 'Petal length', 'Petal width')

详细输出

通过 subsets_ 属性,我们可以查看每个步骤中选定的特征索引

efs1.subsets_
{0: {'feature_idx': (0,),
  'cv_scores': array([0.53333333, 0.63333333, 0.7       , 0.8       , 0.56666667]),
  'avg_score': 0.6466666666666667,
  'feature_names': ('Sepal length',)},
 1: {'feature_idx': (1,),
  'cv_scores': array([0.43333333, 0.63333333, 0.53333333, 0.43333333, 0.5       ]),
  'avg_score': 0.5066666666666666,
  'feature_names': ('Sepal width',)},
 2: {'feature_idx': (2,),
  'cv_scores': array([0.93333333, 0.93333333, 0.9       , 0.93333333, 1.        ]),
  'avg_score': 0.9400000000000001,
  'feature_names': ('Petal length',)},
 3: {'feature_idx': (3,),
  'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ]),
  'avg_score': 0.96,
  'feature_names': ('Petal width',)},
 4: {'feature_idx': (0, 1),
  'cv_scores': array([0.66666667, 0.8       , 0.7       , 0.86666667, 0.66666667]),
  'avg_score': 0.74,
  'feature_names': ('Sepal length', 'Sepal width')},
 5: {'feature_idx': (0, 2),
  'cv_scores': array([0.96666667, 1.        , 0.86666667, 0.93333333, 0.96666667]),
  'avg_score': 0.9466666666666667,
  'feature_names': ('Sepal length', 'Petal length')},
 6: {'feature_idx': (0, 3),
  'cv_scores': array([0.96666667, 0.96666667, 0.9       , 0.93333333, 1.        ]),
  'avg_score': 0.9533333333333334,
  'feature_names': ('Sepal length', 'Petal width')},
 7: {'feature_idx': (1, 2),
  'cv_scores': array([0.93333333, 0.93333333, 0.9       , 0.93333333, 0.93333333]),
  'avg_score': 0.9266666666666667,
  'feature_names': ('Sepal width', 'Petal length')},
 8: {'feature_idx': (1, 3),
  'cv_scores': array([0.96666667, 0.96666667, 0.86666667, 0.93333333, 0.96666667]),
  'avg_score': 0.9400000000000001,
  'feature_names': ('Sepal width', 'Petal width')},
 9: {'feature_idx': (2, 3),
  'cv_scores': array([0.96666667, 0.96666667, 0.9       , 0.93333333, 1.        ]),
  'avg_score': 0.9533333333333334,
  'feature_names': ('Petal length', 'Petal width')},
 10: {'feature_idx': (0, 1, 2),
  'cv_scores': array([0.96666667, 0.96666667, 0.86666667, 0.93333333, 0.96666667]),
  'avg_score': 0.9400000000000001,
  'feature_names': ('Sepal length', 'Sepal width', 'Petal length')},
 11: {'feature_idx': (0, 1, 3),
  'cv_scores': array([0.93333333, 0.96666667, 0.9       , 0.93333333, 1.        ]),
  'avg_score': 0.9466666666666667,
  'feature_names': ('Sepal length', 'Sepal width', 'Petal width')},
 12: {'feature_idx': (0, 2, 3),
  'cv_scores': array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 1.        ]),
  'avg_score': 0.9733333333333334,
  'feature_names': ('Sepal length', 'Petal length', 'Petal width')},
 13: {'feature_idx': (1, 2, 3),
  'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ]),
  'avg_score': 0.96,
  'feature_names': ('Sepal width', 'Petal length', 'Petal width')},
 14: {'feature_idx': (0, 1, 2, 3),
  'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.        ]),
  'avg_score': 0.9666666666666668,
  'feature_names': ('Sepal length',
   'Sepal width',
   'Petal length',
   'Petal width')}}
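subsets_ 是一个普通的 Python 字典,因此也可以直接在其上做后处理。下面是一个小示意(沿用上面已拟合的 efs1),按平均交叉验证得分手动找出最佳条目:

# 按 avg_score 找出得分最高的条目(与 best_idx_ / best_score_ 一致)
best_key = max(efs1.subsets_, key=lambda k: efs1.subsets_[k]['avg_score'])
print(efs1.subsets_[best_key]['feature_idx'])          # (0, 2, 3)
print(efs1.subsets_[best_key]['feature_names'])        # ('Sepal length', 'Petal length', 'Petal width')
print(round(efs1.subsets_[best_key]['avg_score'], 4))  # 0.9733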

示例 2 - 可视化特征选择结果

为了方便起见,可以使用 ExhaustiveFeatureSelector 对象的 get_metric_dict 方法以 pandas DataFrame 格式可视化特征选择的输出。列 std_devstd_err 分别表示交叉验证分数的标准差和标准误。

下面,我们看到穷举特征选择器在本示例中的结果 DataFrame:

import pandas as pd

iris = load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3)

efs1 = EFS(knn, 
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=5)

feature_names = ('sepal length', 'sepal width',
                 'petal length', 'petal width')

df_X = pd.DataFrame(
    X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
efs1 = efs1.fit(df_X, y)

df = pd.DataFrame.from_dict(efs1.get_metric_dict()).T
df.sort_values('avg_score', inplace=True, ascending=False)
df
Features: 15/15
feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err
12 (0, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 (Sepal length, Petal length, Petal width) 0.017137 0.013333 0.006667
14 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 (Sepal length, Sepal width, Petal length, Peta... 0.027096 0.021082 0.010541
3 (3,) [0.9666666666666667, 0.9666666666666667, 0.933... 0.96 (Petal width,) 0.032061 0.024944 0.012472
13 (1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.96 (Sepal width, Petal length, Petal width) 0.032061 0.024944 0.012472
6 (0, 3) [0.9666666666666667, 0.9666666666666667, 0.9, ... 0.953333 (Sepal length, Petal width) 0.043691 0.033993 0.016997
9 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.9, ... 0.953333 (Petal length, Petal width) 0.043691 0.033993 0.016997
5 (0, 2) [0.9666666666666667, 1.0, 0.8666666666666667, ... 0.946667 (Sepal length, Petal length) 0.058115 0.045216 0.022608
11 (0, 1, 3) [0.9333333333333333, 0.9666666666666667, 0.9, ... 0.946667 (Sepal length, Sepal width, Petal width) 0.043691 0.033993 0.016997
2 (2,) [0.9333333333333333, 0.9333333333333333, 0.9, ... 0.94 (Petal length,) 0.041977 0.03266 0.01633
8 (1, 3) [0.9666666666666667, 0.9666666666666667, 0.866... 0.94 (Sepal width, Petal width) 0.049963 0.038873 0.019437
10 (0, 1, 2) [0.9666666666666667, 0.9666666666666667, 0.866... 0.94 (Sepal length, Sepal width, Petal length) 0.049963 0.038873 0.019437
7 (1, 2) [0.9333333333333333, 0.9333333333333333, 0.9, ... 0.926667 (Sepal width, Petal length) 0.017137 0.013333 0.006667
4 (0, 1) [0.6666666666666666, 0.8, 0.7, 0.8666666666666... 0.74 (Sepal length, Sepal width) 0.102823 0.08 0.04
0 (0,) [0.5333333333333333, 0.6333333333333333, 0.7, ... 0.646667 (Sepal length,) 0.122983 0.095685 0.047842
1 (1,) [0.43333333333333335, 0.6333333333333333, 0.53... 0.506667 (Sepal width,) 0.095416 0.074237 0.037118
import matplotlib.pyplot as plt

metric_dict = efs1.get_metric_dict()

fig = plt.figure()
k_feat = sorted(metric_dict.keys())
avg = [metric_dict[k]['avg_score'] for k in k_feat]

upper, lower = [], []
for k in k_feat:
    upper.append(metric_dict[k]['avg_score'] +
                 metric_dict[k]['std_dev'])
    lower.append(metric_dict[k]['avg_score'] -
                 metric_dict[k]['std_dev'])

plt.fill_between(k_feat,
                 upper,
                 lower,
                 alpha=0.2,
                 color='blue',
                 lw=1)

plt.plot(k_feat, avg, color='blue', marker='o')
plt.ylabel('Accuracy +/- Standard Deviation')
plt.xlabel('Number of Features')
feature_min = len(metric_dict[k_feat[0]]['feature_idx'])
feature_max = len(metric_dict[k_feat[-1]]['feature_idx'])
plt.xticks(k_feat, 
           [str(metric_dict[k]['feature_names']) for k in k_feat], 
           rotation=90)
plt.show()

[图:不同特征子集的平均准确率及 ±1 个标准差的区间带(纵轴 Accuracy +/- Standard Deviation,横轴 Number of Features)]

示例 3 - 用于回归分析的穷举特征选择

与上述分类示例类似,ExhaustiveFeatureSelector 也支持 scikit-learn 用于回归的估计器。

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston.data, boston.target

lr = LinearRegression()

efs = EFS(lr, 
          min_features=10,
          max_features=12,
          scoring='neg_mean_squared_error',
          cv=10)

efs.fit(X, y)

print('Best MSE score: %.2f' % (efs.best_score_ * (-1)))
print('Best subset:', efs.best_idx_)
/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "https://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.

  warnings.warn(msg, category=FutureWarning)
Features: 377/377


Best subset: (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)

示例 4 - 回归与调整 R^2

如示例 3 所示,穷举特征选择器可用于通过回归模型选择特征。在回归分析中存在一个常见现象:选择的特征越多,R^2 得分就会虚假地膨胀。因此,特别是在特征选择场景下,基于调整后的 R^2(而不是常规的 R^2)进行模型比较是有用的。调整后的 R^2 同时考虑了特征数量和样本数量,计算如下:

$$\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}$$

其中 n 是样本数量,p 是特征数量。
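下面的小片段(仅作示意)用具体数字检验该公式:例如取 R^2 = 0.8、n = 506、p = 10 时,调整后的值约为 0.796,略低于未调整的 R^2。

r2, n, p = 0.8, 506, 10
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adjusted_r2, 3))  # 0.796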

scikit-learn API 的优点之一是它一致、直观且易于使用。然而,这种 API 设计的一个缺点是对于某些场景可能有点限制。例如,scikit-learn 的评分函数只接受两个输入:预测值和真实目标值。因此,我们无法使用 scikit-learn 的评分 API 来计算调整后的 R^2,因为它还需要特征数量。

然而,作为一种变通方法,我们可以先计算不同特征子集的 R^2,然后再进行事后计算,得到调整后的 R^2。

步骤 1:计算 R^2:

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston.data, boston.target

lr = LinearRegression()

efs = EFS(lr, 
          min_features=10,
          max_features=12,
          scoring='r2',
          cv=10)

efs.fit(X, y)

print('Best R2 score: %.2f' % efs.best_score_)
print('Best subset:', efs.best_idx_)
/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "https://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.

  warnings.warn(msg, category=FutureWarning)
Features: 377/377


Best subset: (1, 3, 5, 6, 7, 8, 9, 10, 11, 12)

步骤 2:计算调整后的 R^2:

def adjust_r2(r2, num_examples, num_features):
    coef = (num_examples - 1) / (num_examples - num_features - 1) 
    return 1 - (1 - r2) * coef
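
# 注(示意说明):下面将 num_examples 取为 X.shape[0]/10,
# 即 10 折交叉验证中每个验证折的大致样本数;
# 如果希望按完整数据集的样本数进行调整,也可以直接使用 X.shape[0]。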
for i in efs.subsets_:
    efs.subsets_[i]['adjusted_avg_score'] = (
        adjust_r2(r2=efs.subsets_[i]['avg_score'],
                  num_examples=X.shape[0]/10,
                  num_features=len(efs.subsets_[i]['feature_idx']))
    )

步骤 3:基于调整后的 R^2 选择最佳子集:

score = -99e10

for i in efs.subsets_:
    adjusted_score = efs.subsets_[i]['adjusted_avg_score']
    # 得分更高时更新;得分持平时优先选择特征更少的子集
    if (adjusted_score == score and
        len(efs.subsets_[i]['feature_idx']) < len(efs.best_idx_)) \
      or adjusted_score > score:
        efs.best_idx_ = efs.subsets_[i]['feature_idx']
        score = adjusted_score
print('Best adjusted R2 score: %.2f' % score)
print('Best subset:', efs.best_idx_)
Best subset: (1, 3, 5, 6, 7, 8, 9, 10, 11, 12)

示例 5 - 使用选定的特征子集进行新预测

# Initialize the dataset

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.33, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3)
# Select the "best" three features via
# 5-fold cross-validation on the training set.

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

efs1 = EFS(knn, 
           min_features=1,
           max_features=4,
           scoring='accuracy',
           cv=5)
efs1 = efs1.fit(X_train, y_train)
Features: 15/15
print('Selected features:', efs1.best_idx_)
Selected features: (2, 3)
# Generate the new subsets based on the selected features
# Note that the transform call is equivalent to
# X_train[:, efs1.best_idx_]

X_train_efs = efs1.transform(X_train)
X_test_efs = efs1.transform(X_test)

# Fit the estimator using the new feature subset
# and make a prediction on the test data
knn.fit(X_train_efs, y_train)
y_pred = knn.predict(X_test_efs)

# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc*100))
Test set accuracy: 96.00 %
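作为补充,可以用一行断言(仅为示意)确认 transform 的输出与直接用 best_idx_ 做列索引的结果一致:

import numpy as np

# transform 等价于按 best_idx_ 选择对应的列
assert np.array_equal(X_train_efs, X_train[:, efs1.best_idx_])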

示例 6 - 穷举特征选择与 GridSearch

# Initialize the dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.33, random_state=1)

使用 scikit-learn 的 GridSearch 来调整 ExhaustiveFeatureSelectorLogisticRegression 估计器的超参数,并在 pipeline 中用于预测。注意,clone_estimator 属性需要设置为 False

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

lr = LogisticRegression(multi_class='multinomial', 
                        solver='newton-cg', 
                        random_state=123)

efs1 = EFS(estimator=lr, 
           min_features=2,
           max_features=3,
           scoring='accuracy',
           print_progress=False,
           clone_estimator=False,
           cv=5,
           n_jobs=1)

pipe = make_pipeline(efs1, lr)

param_grid = {'exhaustivefeatureselector__estimator__C': [0.1, 1.0, 10.0]}

gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=2, 
                  verbose=1, 
                  refit=False)

# run gridsearch
gs = gs.fit(X_train, y_train)
Fitting 2 folds for each of 3 candidates, totalling 6 fits

... 由 GridSearch 确定的“最佳”参数是 ...

print("Best parameters via GridSearch", gs.best_params_)
Best parameters via GridSearch {'exhaustivefeatureselector__estimator__C': 0.1}
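作为参考,形如 exhaustivefeatureselector__estimator__C 的嵌套参数名可以通过 pipeline 的 get_params 方法列出,下面是一个示意片段:

# 示意:列出 pipeline 中所有可供 GridSearch 调节的参数名称,
# 其中就包含 exhaustivefeatureselector__estimator__C 这样的嵌套键
print(sorted(pipe.get_params().keys()))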

在 GridSearch 后获取最佳 k 个特征索引

如果想通过 ExhaustiveFeatureSelector.best_idx_ 获取最佳特征子集的索引,必须使用 refit=True 初始化 GridSearchCV 对象。现在,GridSearch 对象将使用完整的训练数据集和它通过交叉验证找到的最佳参数来训练估计器 pipeline。

gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=2, 
                  verbose=1, 
                  refit=True)

运行 GridSearch 后,可以通过 steps 属性访问 best_estimator_ 的各个 pipeline 对象。

gs = gs.fit(X_train, y_train)
gs.best_estimator_.steps
Fitting 2 folds for each of 3 candidates, totalling 6 fits





[('exhaustivefeatureselector',
  ExhaustiveFeatureSelector(clone_estimator=False,
                            estimator=LogisticRegression(C=0.1,
                                                         multi_class='multinomial',
                                                         random_state=123,
                                                         solver='newton-cg'),
                            feature_groups=[[0], [1], [2], [3]], max_features=3,
                            min_features=2, print_progress=False)),
 ('logisticregression',
  LogisticRegression(multi_class='multinomial', random_state=123,
                     solver='newton-cg'))]

通过子索引,然后可以获取最佳选定的特征子集

print('Best features:', gs.best_estimator_.steps[0][1].best_idx_)
Best features: (2, 3)

在交叉验证期间,此特征组合的 CV 准确率为

print('Best score:', gs.best_score_)
Best score: 0.96
gs.best_params_
{'exhaustivefeatureselector__estimator__C': 0.1}

或者,如果使用 refit=False 运行 GridSearchCV,我们可以在 pipeline 中手动设置“最佳 GridSearch 参数”。这应该会产生相同的结果

pipe.set_params(**gs.best_params_).fit(X_train, y_train)
print('Best features:', pipe.steps[0][1].best_idx_)
Best features: (2, 3)

示例 7 - 带 LOOCV 的穷举特征选择

ExhaustiveFeatureSelector 不限于 k 折交叉验证。可以使用任何支持通用 scikit-learn 交叉验证 API 的交叉验证方法。

以下示例演示了 scikit-learn 的 LeaveOneOut 交叉验证方法与穷举特征选择器结合使用。

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.model_selection import LeaveOneOut


iris = load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3)

efs1 = EFS(knn, 
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=LeaveOneOut()) ### Use cross-validation generator here

efs1 = efs1.fit(X, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 15/15

Best accuracy score: 0.96
Best subset (indices): (3,)
Best subset (corresponding names): ('3',)

示例 8 - 中断长时间运行以获取中间结果

如果运行时间过长,可以通过触发 KeyboardInterrupt(例如,在 Mac 上按 ctrl+c,或在 Jupyter notebook 中中断 cell)来获取临时结果。

玩具数据集

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


X, y = make_classification(
    n_samples=200000,
    n_features=6,
    n_informative=2,
    n_redundant=1,
    n_repeated=1,
    n_clusters_per_class=2,
    flip_y=0.05,
    class_sep=0.5,
    random_state=123,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

带中断的长运行

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=10000)

efs1 = EFS(model, 
           min_features=1, 
           max_features=4,
           print_progress=True,
           scoring='accuracy')

efs1 = efs1.fit(X_train, y_train)
Features: 56/56

完成拟合

注意,特征选择运行尚未完成,因此某些属性可能不可用。为了使用 EFS 实例,建议调用 finalize_fit 方法,这将使 EFS 估计器看起来“已拟合”,并处理临时结果

efs1.finalize_fit()
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
Best accuracy score: 0.73
Best subset (indices): (1, 2)

示例 9 - 使用特征组

自 mlxtend v0.21.0 版本起,可以指定特征组。特征组允许您将某些特征组合在一起,以便它们始终作为一个组被选择。这在类似于独热编码的场景中非常有用——如果您想将独热编码的特征视为一个单一特征

在以下示例中,我们将萼片长度和萼片宽度指定为一个特征组,以便它们始终一起被选择

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
X = iris.data
y = iris.target

X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal wid', 'petal wid'])
X_df.head()
sepal len petal len sepal wid petal wid
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

knn = KNeighborsClassifier(n_neighbors=3)

efs1 = EFS(knn, 
           min_features=2,
           max_features=2,
           scoring='accuracy',
           feature_groups=[['sepal len', 'sepal wid'], ['petal len'], ['petal wid']],
           cv=3)

efs1 = efs1.fit(X_df, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 3/3

Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('sepal len', 'sepal wid', 'petal wid')

注意,返回的特征数量是 3,因为 min_featuresmax_features 对应的是特征组的数量。也就是说,我们在 ['sepal len', 'sepal wid'], ['petal wid'] 中有两个特征组,但这扩展为 3 个特征。
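换句话说,穷举搜索是在特征组的层面上进行的:3 个特征组中任选 2 个,共有 3 种组合,正好对应上面进度输出中的“Features: 3/3”。下面的小片段(仅为示意)验证了这一数量:

from math import comb

# 3 个特征组(['sepal len', 'sepal wid']、['petal len']、['petal wid'])中任选 2 个
print(comb(3, 2))  # 3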

API

ExhaustiveFeatureSelector(estimator, min_features=1, max_features=1, print_progress=True, scoring='accuracy', cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)

用于分类和回归的穷举特征选择。(v0.4.3 新增)

参数

  • estimator : scikit-learn 分类器或回归器

  • min_features : int (默认值: 1)

    要选择的最小特征数量

  • max_features : int (默认值: 1)

    要选择的最大特征数量。如果参数 feature_groups 不是 None,特征数量等于特征组的数量,即 len(feature_groups)。例如,如果 feature_groups = [[0], [1], [2, 3], [4]],则 max_features 的值不能超过 4。

  • print_progress : bool (默认值: True)

    将进度(以 epoch 数表示)打印到 stderr。

  • scoring : str (默认值: 'accuracy')

    分类器的评分指标 {accuracy, f1, precision, recall, roc_auc},回归器的评分指标 {'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'r2'},或者是一个签名形如 scorer(estimator, X, y) 的可调用对象或函数(下文“示例”部分给出了一个可调用评分函数的示意片段)。

  • cv : int (默认值: 5)

    Scikit-learn 交叉验证生成器或 int。如果估计器是分类器(或 y 由整数类别标签组成),则执行分层 k 折交叉验证;否则执行常规 k 折交叉验证。如果 cv 为 None、False 或 0,则不执行交叉验证。

  • n_jobs : int (默认值: 1)

    用于并行评估不同特征子集的 CPU 数量。-1 表示“所有 CPU”。

  • pre_dispatch : int 或 string (默认值: '2*n_jobs')

    控制当 n_jobs > 1 或 n_jobs=-1 时,并行执行期间调度的任务数量。减少此数量有助于避免在调度的任务多于 CPU 可处理的任务时内存消耗爆炸。此参数可以为:None,此时所有任务会立即创建并启动,适用于轻量且快速运行的任务,以避免因按需启动任务而产生的延迟;一个整数,给出要启动的任务总数的精确值;一个字符串,给出关于 n_jobs 的表达式,如 '2*n_jobs'。

  • clone_estimator : bool (默认值: True)

    如果为 True,则克隆估计器;如果为 False,则使用原始估计器实例。如果估计器未实现 scikit-learn 的 set_params 和 get_params 方法,请设置为 False;此外,还需要设置 cv=0 和 n_jobs=1。

  • fixed_features : tuple (默认值: None)

    如果不是 None,作为元组提供的特征索引将被特征选择器视为固定特征。例如,如果 fixed_features=(1, 3, 7),则解决方案中必定包含第 2、4 和 8 个特征。注意,如果 fixed_features 不是 None,请确保要选择的特征数量大于 len(fixed_features);换句话说,确保 k_features > len(fixed_features)。

  • feature_groups : list 或 None (默认值: None)

    用于将某些特征视为一个组的可选参数。这意味着组内的特征总是被一起选择,永不拆分。例如,feature_groups=[[1], [2], [3, 4, 5]] 指定了 3 个特征组。在这种情况下,当 k_features=2 时,可能的特征选择结果是 [[1], [2]]、[[1], [3, 4, 5]] 和 [[2], [3, 4, 5]]。特征组对于可解释性很有用,例如特征 3、4、5 是独热编码特征的情况。(更多详情,请阅读此文档字符串底部的说明。)mlxtend v. 0.21.0 新增。

属性

  • best_idx_ : array-like, 形状 = [n_predictions]

    选定特征子集的特征索引。

  • best_feature_names_ : array-like, 形状 = [n_predictions]

    选定特征子集的特征名称。如果在 fit 方法中使用 pandas DataFrame,特征名称对应于列名;否则,特征名称是特征数组索引的字符串表示。v 0.13.0 新增。

  • best_score_ : float

    选定子集的交叉验证平均分。

  • subsets_ : dict

    穷举选择过程中评估的特征子集的字典,其中字典键是候选特征子集的枚举索引(如上文 subsets_ 输出所示,从 0 开始计数)。字典值本身也是字典,包含以下键:'feature_idx'(特征子集索引的元组)、'feature_names'(特征子集的特征名称元组)、'cv_scores'(个体交叉验证得分列表)、'avg_score'(平均交叉验证得分)。注意,如果在 fit 方法中使用 pandas DataFrame,'feature_names' 对应于列名;否则,特征名称是特征数组索引的字符串表示。'feature_names' 在 v 0.13.0 新增。

说明

(1) 如果参数 feature_groups 不是 None,特征数量等于特征组的数量,即 len(feature_groups)。例如,如果 feature_groups = [[0], [1], [2, 3], [4]],则 max_features 的值不能超过 4。

(2) Although two or more individual features may be considered as one group throughout the feature-selection process, it does not mean the individual features of that group have the same impact on the outcome. For instance, in linear regression, the coefficient of the feature 2 and 3 can be different even if they are considered as one group in feature_groups.

(3) If both fixed_features and feature_groups are specified, ensure that each feature group contains the fixed_features selection. E.g., for a 3-feature set fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid; fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.

示例

有关用法示例,请参阅 https://mlxtend.cn/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/
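下面再给出一个可调用评分函数的最小示意(这里借助 scikit-learn 的 f1_score;函数名 macro_f1_scorer 仅为演示用的假设),其签名与上文 scoring 参数要求的 scorer(estimator, X, y) 一致:

from sklearn.metrics import f1_score

def macro_f1_scorer(estimator, X, y):
    # 自定义评分函数:签名为 scorer(estimator, X, y),返回一个标量分数
    y_pred = estimator.predict(X)
    return f1_score(y, y_pred, average='macro')

# 示意用法:EFS(knn, min_features=1, max_features=4, scoring=macro_f1_scorer, cv=5)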

方法

fit(X, y, groups=None, **fit_params)

执行特征选择并从训练数据中学习模型。

参数

  • X : {array-like, sparse matrix}, 形状 = [n_samples, n_features]

    训练向量,其中 n_samples 是样本数量,n_features 是特征数量。v 0.13.0 新增:现在也接受 pandas DataFrame 作为 X 的参数。

  • y : array-like, 形状 = [n_samples]

    目标值。

  • groups : array-like, 形状 (n_samples,), 可选

    用于在将数据集分割为训练/测试集时使用的样本分组标签。传递给交叉验证器的 fit 方法。

  • fit_params : dict of string -> object, 可选

    传递给分类器 fit 方法的参数。

返回值

self : object

fit_transform(X, y, groups=None, **fit_params)

拟合训练数据并返回从 X 中选出的最佳特征。

参数

  • X : {array-like, sparse matrix}, 形状 = [n_samples, n_features]

    训练向量,其中 n_samples 是样本数量,n_features 是特征数量。v 0.13.0 新增:现在也接受 pandas DataFrame 作为 X 的参数。

  • y : array-like, 形状 = [n_samples]

    目标值。

  • groups : array-like, 形状 (n_samples,), 可选

    用于在将数据集分割为训练/测试集时使用的样本分组标签。传递给交叉验证器的 fit 方法。

  • fit_params : dict of string -> object, 可选

    传递给分类器 fit 方法的参数。

返回值

X 的特征子集,形状={n_samples, k_features}

get_metric_dict(confidence_interval=0.95)

返回度量字典。

参数

  • confidence_interval : float (默认值: 0.95)

    一个介于 0.0 和 1.0 之间的正浮点数,用于计算 CV 平均得分的置信区间界限。

返回值

字典,其中每个字典值是一个列表,其长度等于迭代次数(特征子集数量)。对应于这些列表的字典键如下:'feature_idx'(特征子集索引的元组)、'cv_scores'(个体 CV 得分列表)、'avg_score'(CV 平均得分)、'std_dev'(CV 平均得分的标准差)、'std_err'(CV 平均得分的标准误)、'ci_bound'(CV 平均得分的置信区间界限)。

get_params(deep=True)

获取此估计器的参数。

参数

  • deep : bool, default=True

    如果为 True,将返回此估计器以及包含的估计器子对象的参数。

返回值

params : dict

    参数名称映射到其值。

set_params(**params)

设置此估计器的参数。

The method works on simple estimators as well as on nested objects (such as sklearn.pipeline.Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

参数

  • **params : dict

    估计器参数。

返回值

self : estimator instance

    估计器实例。

transform(X)

返回从 X 中选出的最佳特征。

参数

  • X : {array-like, sparse matrix}, 形状 = [n_samples, n_features]

    训练向量,其中 n_samples 是样本数量,n_features 是特征数量。v 0.13.0 新增:现在也接受 pandas DataFrame 作为 X 的参数。

返回值

X 的特征子集,形状={n_samples, k_features}

Copyright © 2014-2023 Sebastian Raschka