ExhaustiveFeatureSelector: Optimal feature sets by considering all possible feature combinations
Implementation of an exhaustive feature selector for sampling and evaluating all possible feature combinations in a specified range.
from mlxtend.feature_selection import ExhaustiveFeatureSelector
Overview
This exhaustive feature selection algorithm is a wrapper approach for a brute-force evaluation of feature subsets; the best subset is selected by optimizing a specified performance metric given an arbitrary regressor or classifier. For instance, if the classifier is a logistic regression and the dataset consists of 4 features, the algorithm will evaluate all 15 feature combinations (if min_features=1 and max_features=4):
- {0}
- {1}
- {2}
- {3}
- {0, 1}
- {0, 2}
- {0, 3}
- {1, 2}
- {1, 3}
- {2, 3}
- {0, 1, 2}
- {0, 1, 3}
- {0, 2, 3}
- {1, 2, 3}
- {0, 1, 2, 3}
and select the combination that results in the best performance (e.g., classification accuracy) of the logistic regression classifier.
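As a quick sanity check, here is a minimal sketch (not part of mlxtend) that enumerates these candidate subsets with Python's itertools:
from itertools import combinations
n_features, min_features, max_features = 4, 1, 4
# All subsets of size min_features..max_features over feature indices 0..3
candidates = [c for k in range(min_features, max_features + 1)
              for c in combinations(range(n_features), k)]
print(len(candidates))  # 15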
Example 1 - A simple Iris example
Initializing a simple classifier from scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)
efs1 = EFS(knn,
min_features=1,
max_features=4,
scoring='accuracy',
print_progress=True,
cv=5)
efs1 = efs1.fit(X, y)
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 15/15
Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('0', '2', '3')
Feature names
When working with large datasets, the feature indices can be hard to interpret. In this case, it is recommended to use a pandas DataFrame with distinct column names as input:
import pandas as pd
df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
df_X.head()
  | Sepal length | Sepal width | Petal length | Petal width |
--- | --- | --- | --- | --- |
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
efs1 = efs1.fit(df_X, y)
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 15/15
Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('Sepal length', 'Petal length', 'Petal width')
Detailed outputs
Via the subsets_ attribute, we can take a look at the selected feature indices at each step:
efs1.subsets_
{0: {'feature_idx': (0,),
'cv_scores': array([0.53333333, 0.63333333, 0.7 , 0.8 , 0.56666667]),
'avg_score': 0.6466666666666667,
'feature_names': ('Sepal length',)},
1: {'feature_idx': (1,),
'cv_scores': array([0.43333333, 0.63333333, 0.53333333, 0.43333333, 0.5 ]),
'avg_score': 0.5066666666666666,
'feature_names': ('Sepal width',)},
2: {'feature_idx': (2,),
'cv_scores': array([0.93333333, 0.93333333, 0.9 , 0.93333333, 1. ]),
'avg_score': 0.9400000000000001,
'feature_names': ('Petal length',)},
3: {'feature_idx': (3,),
'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1. ]),
'avg_score': 0.96,
'feature_names': ('Petal width',)},
4: {'feature_idx': (0, 1),
'cv_scores': array([0.66666667, 0.8 , 0.7 , 0.86666667, 0.66666667]),
'avg_score': 0.74,
'feature_names': ('Sepal length', 'Sepal width')},
5: {'feature_idx': (0, 2),
'cv_scores': array([0.96666667, 1. , 0.86666667, 0.93333333, 0.96666667]),
'avg_score': 0.9466666666666667,
'feature_names': ('Sepal length', 'Petal length')},
6: {'feature_idx': (0, 3),
'cv_scores': array([0.96666667, 0.96666667, 0.9 , 0.93333333, 1. ]),
'avg_score': 0.9533333333333334,
'feature_names': ('Sepal length', 'Petal width')},
7: {'feature_idx': (1, 2),
'cv_scores': array([0.93333333, 0.93333333, 0.9 , 0.93333333, 0.93333333]),
'avg_score': 0.9266666666666667,
'feature_names': ('Sepal width', 'Petal length')},
8: {'feature_idx': (1, 3),
'cv_scores': array([0.96666667, 0.96666667, 0.86666667, 0.93333333, 0.96666667]),
'avg_score': 0.9400000000000001,
'feature_names': ('Sepal width', 'Petal width')},
9: {'feature_idx': (2, 3),
'cv_scores': array([0.96666667, 0.96666667, 0.9 , 0.93333333, 1. ]),
'avg_score': 0.9533333333333334,
'feature_names': ('Petal length', 'Petal width')},
10: {'feature_idx': (0, 1, 2),
'cv_scores': array([0.96666667, 0.96666667, 0.86666667, 0.93333333, 0.96666667]),
'avg_score': 0.9400000000000001,
'feature_names': ('Sepal length', 'Sepal width', 'Petal length')},
11: {'feature_idx': (0, 1, 3),
'cv_scores': array([0.93333333, 0.96666667, 0.9 , 0.93333333, 1. ]),
'avg_score': 0.9466666666666667,
'feature_names': ('Sepal length', 'Sepal width', 'Petal width')},
12: {'feature_idx': (0, 2, 3),
'cv_scores': array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 1. ]),
'avg_score': 0.9733333333333334,
'feature_names': ('Sepal length', 'Petal length', 'Petal width')},
13: {'feature_idx': (1, 2, 3),
'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1. ]),
'avg_score': 0.96,
'feature_names': ('Sepal width', 'Petal length', 'Petal width')},
14: {'feature_idx': (0, 1, 2, 3),
'cv_scores': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1. ]),
'avg_score': 0.9666666666666668,
'feature_names': ('Sepal length',
'Sepal width',
'Petal length',
'Petal width')}}
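Note that best_idx_ simply corresponds to the subsets_ entry with the highest average score; a minimal sketch that looks it up directly:
# Find the subsets_ entry with the highest average CV score
best_key = max(efs1.subsets_, key=lambda k: efs1.subsets_[k]['avg_score'])
print(efs1.subsets_[best_key]['feature_idx'])          # (0, 2, 3)
print('%.3f' % efs1.subsets_[best_key]['avg_score'])   # 0.973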
Example 2 - Visualizing the feature selection results
For our convenience, we can visualize the output of the feature selection in a pandas DataFrame format using the get_metric_dict method of the ExhaustiveFeatureSelector object. The columns std_dev and std_err represent the standard deviation and the standard error of the cross-validation scores, respectively.
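As a quick sanity check against the numbers shown in the table below, std_err appears to equal the (population) standard deviation of the CV scores divided by the square root of n_folds - 1. A minimal sketch using the scores of subset (0, 2, 3), assuming this relationship holds:
import numpy as np
cv_scores = np.array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 1.0])
std_dev = cv_scores.std()                        # population std (ddof=0) -> 0.0133
std_err = std_dev / np.sqrt(len(cv_scores) - 1)  # -> 0.0067
print('%.6f %.6f' % (std_dev, std_err))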
Below, we see the corresponding DataFrame of the exhaustive feature selector:
import pandas as pd
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)
efs1 = EFS(knn,
min_features=1,
max_features=4,
scoring='accuracy',
print_progress=True,
cv=5)
feature_names = ('sepal length', 'sepal width',
'petal length', 'petal width')
df_X = pd.DataFrame(
X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
efs1 = efs1.fit(df_X, y)
df = pd.DataFrame.from_dict(efs1.get_metric_dict()).T
df.sort_values('avg_score', inplace=True, ascending=False)
df
Features: 15/15
  | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
--- | --- | --- | --- | --- | --- | --- | --- |
12 | (0, 2, 3) | [0.9666666666666667, 0.9666666666666667, 0.966... | 0.973333 | (Sepal length, Petal length, Petal width) | 0.017137 | 0.013333 | 0.006667 |
14 | (0, 1, 2, 3) | [0.9666666666666667, 0.9666666666666667, 0.933... | 0.966667 | (Sepal length, Sepal width, Petal length, Peta... | 0.027096 | 0.021082 | 0.010541 |
3 | (3,) | [0.9666666666666667, 0.9666666666666667, 0.933... | 0.96 | (Petal width,) | 0.032061 | 0.024944 | 0.012472 |
13 | (1, 2, 3) | [0.9666666666666667, 0.9666666666666667, 0.933... | 0.96 | (Sepal width, Petal length, Petal width) | 0.032061 | 0.024944 | 0.012472 |
6 | (0, 3) | [0.9666666666666667, 0.9666666666666667, 0.9, ... | 0.953333 | (Sepal length, Petal width) | 0.043691 | 0.033993 | 0.016997 |
9 | (2, 3) | [0.9666666666666667, 0.9666666666666667, 0.9, ... | 0.953333 | (Petal length, Petal width) | 0.043691 | 0.033993 | 0.016997 |
5 | (0, 2) | [0.9666666666666667, 1.0, 0.8666666666666667, ... | 0.946667 | (Sepal length, Petal length) | 0.058115 | 0.045216 | 0.022608 |
11 | (0, 1, 3) | [0.9333333333333333, 0.9666666666666667, 0.9, ... | 0.946667 | (Sepal length, Sepal width, Petal width) | 0.043691 | 0.033993 | 0.016997 |
2 | (2,) | [0.9333333333333333, 0.9333333333333333, 0.9, ... | 0.94 | (Petal length,) | 0.041977 | 0.03266 | 0.01633 |
8 | (1, 3) | [0.9666666666666667, 0.9666666666666667, 0.866... | 0.94 | (Sepal width, Petal width) | 0.049963 | 0.038873 | 0.019437 |
10 | (0, 1, 2) | [0.9666666666666667, 0.9666666666666667, 0.866... | 0.94 | (Sepal length, Sepal width, Petal length) | 0.049963 | 0.038873 | 0.019437 |
7 | (1, 2) | [0.9333333333333333, 0.9333333333333333, 0.9, ... | 0.926667 | (Sepal width, Petal length) | 0.017137 | 0.013333 | 0.006667 |
4 | (0, 1) | [0.6666666666666666, 0.8, 0.7, 0.8666666666666... | 0.74 | (Sepal length, Sepal width) | 0.102823 | 0.08 | 0.04 |
0 | (0,) | [0.5333333333333333, 0.6333333333333333, 0.7, ... | 0.646667 | (Sepal length,) | 0.122983 | 0.095685 | 0.047842 |
1 | (1,) | [0.43333333333333335, 0.6333333333333333, 0.53... | 0.506667 | (Sepal width,) | 0.095416 | 0.074237 | 0.037118 |
import matplotlib.pyplot as plt
metric_dict = efs1.get_metric_dict()
fig = plt.figure()
k_feat = sorted(metric_dict.keys())
avg = [metric_dict[k]['avg_score'] for k in k_feat]
upper, lower = [], []
for k in k_feat:
upper.append(metric_dict[k]['avg_score'] +
metric_dict[k]['std_dev'])
lower.append(metric_dict[k]['avg_score'] -
metric_dict[k]['std_dev'])
plt.fill_between(k_feat,
upper,
lower,
alpha=0.2,
color='blue',
lw=1)
plt.plot(k_feat, avg, color='blue', marker='o')
plt.ylabel('Accuracy +/- Standard Deviation')
plt.xlabel('Number of Features')
feature_min = len(metric_dict[k_feat[0]]['feature_idx'])
feature_max = len(metric_dict[k_feat[-1]]['feature_idx'])
plt.xticks(k_feat,
[str(metric_dict[k]['feature_names']) for k in k_feat],
rotation=90)
plt.show()
Example 3 - Exhaustive feature selection for regression analysis
Similar to the classification examples above, the ExhaustiveFeatureSelector also supports scikit-learn's estimators for regression.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
lr = LinearRegression()
efs = EFS(lr,
min_features=10,
max_features=12,
scoring='neg_mean_squared_error',
cv=10)
efs.fit(X, y)
print('Best MSE score: %.2f' % (efs.best_score_ * (-1)))
print('Best subset:', efs.best_idx_)
/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original
source::
import pandas as pd
import numpy as np
data_url = "https://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
Alternative datasets include the California housing dataset (i.e.
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows::
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
for the California housing dataset and::
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
for the Ames housing dataset.
warnings.warn(msg, category=FutureWarning)
Features: 377/377
Best subset: (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)
Example 4 - Regression and adjusted R^2
As shown in Example 3, the exhaustive feature selector can be used for selecting features via a regression model. In regression analysis, there is a common phenomenon that the $R^2$ score becomes spuriously inflated the more features we select. Therefore, and this is particularly relevant for feature selection, it is useful to base model comparisons on the adjusted $R^2$ value rather than the regular $R^2$. The adjusted $R^2$, $\bar{R}^2$, accounts for the number of features and the number of examples, and is computed as

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1},$$

where $n$ is the number of examples and $p$ is the number of features.
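For instance (hypothetical numbers), a model with $R^2 = 0.9$ fit on $n = 50$ examples using $p = 10$ features has $\bar{R}^2 = 1 - (1 - 0.9)\cdot\frac{49}{39} \approx 0.87$; the score is penalized for the number of features used.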
One of the advantages of scikit-learn's API is that it is consistent, intuitive, and simple to use. However, one downside of this API design is that it can be a bit restrictive for certain scenarios. For instance, scikit-learn metric functions only take two inputs, the predicted and the true target values. Hence, we cannot use scikit-learn's scoring API to compute the adjusted $R^2$, since it also requires the number of features.
However, as a workaround, we can compute the $R^2$ for the different feature subsets and then apply a post-hoc computation to obtain the adjusted $R^2$.
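(As an aside, the scoring parameter also accepts a callable with the signature scorer(estimator, X, y), as documented in the API section below. The following is only a hedged sketch of an adjusted-$R^2$ scorer against that signature; it assumes the scorer receives the already-reduced feature matrix for each candidate subset, which should be verified before relying on it.)
from sklearn.metrics import r2_score
def adjusted_r2_scorer(estimator, X, y):
    # n: examples in the evaluation fold, p: features in the candidate subset
    n, p = X.shape
    r2 = r2_score(y, estimator.predict(X))
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
# Hypothetical usage:
# efs = EFS(lr, min_features=10, max_features=12, scoring=adjusted_r2_scorer, cv=10)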
Step 1: Compute $R^2$:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
lr = LinearRegression()
efs = EFS(lr,
min_features=10,
max_features=12,
scoring='r2',
cv=10)
efs.fit(X, y)
print('Best R2 score: %.2f' % efs.best_score_)
print('Best subset:', efs.best_idx_)
/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original
source::
import pandas as pd
import numpy as np
data_url = "https://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
Alternative datasets include the California housing dataset (i.e.
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows::
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
for the California housing dataset and::
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
for the Ames housing dataset.
warnings.warn(msg, category=FutureWarning)
Features: 377/377
Best subset: (1, 3, 5, 6, 7, 8, 9, 10, 11, 12)
Step 2: Compute the adjusted $R^2$:
def adjust_r2(r2, num_examples, num_features):
coef = (num_examples - 1) / (num_examples - num_features - 1)
return 1 - (1 - r2) * coef
for i in efs.subsets_:
efs.subsets_[i]['adjusted_avg_score'] = (
adjust_r2(r2=efs.subsets_[i]['avg_score'],
num_examples=X.shape[0]/10,
num_features=len(efs.subsets_[i]['feature_idx']))
)
Step 3: Select the best subset based on the adjusted $R^2$:
best_score = -99e10

for i in efs.subsets_:
    score = efs.subsets_[i]['adjusted_avg_score']
    # Prefer a higher adjusted R^2; break ties in favor of fewer features
    if (score == best_score and
            len(efs.subsets_[i]['feature_idx']) < len(efs.best_idx_)) \
            or score > best_score:
        best_score = score
        efs.best_idx_ = efs.subsets_[i]['feature_idx']

print('Best adjusted R2 score: %.2f' % best_score)
print('Best subset:', efs.best_idx_)
Best subset: (1, 3, 5, 6, 7, 8, 9, 10, 11, 12)
Example 5 - Using the selected feature subset for making new predictions
# Initialize the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
knn = KNeighborsClassifier(n_neighbors=3)
# Select the "best" three features via
# 5-fold cross-validation on the training set.
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
efs1 = EFS(knn,
min_features=1,
max_features=4,
scoring='accuracy',
cv=5)
efs1 = efs1.fit(X_train, y_train)
Features: 15/15
print('Selected features:', efs1.best_idx_)
Selected features: (2, 3)
# Generate the new subsets based on the selected features
# Note that the transform call is equivalent to
# X_train[:, efs1.best_idx_]
X_train_efs = efs1.transform(X_train)
X_test_efs = efs1.transform(X_test)
# Fit the estimator using the new feature subset
# and make a prediction on the test data
knn.fit(X_train_efs, y_train)
y_pred = knn.predict(X_test_efs)
# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc*100))
Test set accuracy: 96.00 %
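Equivalently, the test accuracy could be computed with scikit-learn's accuracy_score (a small optional sketch):
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print('Test set accuracy: %.2f %%' % (acc * 100))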
Example 6 - Exhaustive feature selection and GridSearch
# Initialize the dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
Use scikit-learn's GridSearchCV to tune the hyperparameters of the LogisticRegression estimator inside the ExhaustiveFeatureSelector and use it for prediction in the pipeline. Note that the clone_estimator attribute needs to be set to False.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
lr = LogisticRegression(multi_class='multinomial',
solver='newton-cg',
random_state=123)
efs1 = EFS(estimator=lr,
min_features=2,
max_features=3,
scoring='accuracy',
print_progress=False,
clone_estimator=False,
cv=5,
n_jobs=1)
pipe = make_pipeline(efs1, lr)
param_grid = {'exhaustivefeatureselector__estimator__C': [0.1, 1.0, 10.0]}
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
n_jobs=1,
cv=2,
verbose=1,
refit=False)
# run gridsearch
gs = gs.fit(X_train, y_train)
Fitting 2 folds for each of 3 candidates, totalling 6 fits
... and the "best" parameters determined via GridSearch are ...
print("Best parameters via GridSearch", gs.best_params_)
Best parameters via GridSearch {'exhaustivefeatureselector__estimator__C': 0.1}
Obtaining the best k feature indices after GridSearch
If we are interested in the best k feature indices via ExhaustiveFeatureSelector.best_idx_, we have to initialize a GridSearchCV object with refit=True. Now, the grid search object will take the complete training dataset and the best parameters, which it found via cross-validation, to train the estimator pipeline.
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
n_jobs=1,
cv=2,
verbose=1,
refit=True)
After running the grid search, we can access the individual pipeline objects of the best_estimator_ via the steps attribute.
gs = gs.fit(X_train, y_train)
gs.best_estimator_.steps
Fitting 2 folds for each of 3 candidates, totalling 6 fits
[('exhaustivefeatureselector',
ExhaustiveFeatureSelector(clone_estimator=False,
estimator=LogisticRegression(C=0.1,
multi_class='multinomial',
random_state=123,
solver='newton-cg'),
feature_groups=[[0], [1], [2], [3]], max_features=3,
min_features=2, print_progress=False)),
('logisticregression',
LogisticRegression(multi_class='multinomial', random_state=123,
solver='newton-cg'))]
Via sub-indexing, we can then obtain the best-selected feature subset:
print('Best features:', gs.best_estimator_.steps[0][1].best_idx_)
Best features: (2, 3)
During cross-validation, this feature combination had a CV accuracy of:
print('Best score:', gs.best_score_)
Best score: 0.96
gs.best_params_
{'exhaustivefeatureselector__estimator__C': 0.1}
Alternatively, if we ran GridSearchCV with refit=False, we could manually set the "best grid search parameters" in the pipeline, which should yield the same results:
pipe.set_params(**gs.best_params_).fit(X_train, y_train)
print('Best features:', pipe.steps[0][1].best_idx_)
Best features: (2, 3)
Example 7 - Exhaustive feature selection with LOOCV
The ExhaustiveFeatureSelector is not restricted to k-fold cross-validation. You can use any type of cross-validation method that supports the general scikit-learn cross-validation API.
The following example illustrates the use of scikit-learn's LeaveOneOut cross-validation method in combination with the exhaustive feature selector.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.model_selection import LeaveOneOut
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)
efs1 = EFS(knn,
min_features=1,
max_features=4,
scoring='accuracy',
print_progress=True,
cv=LeaveOneOut()) ### Use cross-validation generator here
efs1 = efs1.fit(X, y)
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 15/15
Best accuracy score: 0.96
Best subset (indices): (3,)
Best subset (corresponding names): ('3',)
Example 8 - Interrupting long runs for intermediate results
If your run is taking too long, it is possible to trigger a KeyboardInterrupt (e.g., ctrl+c on a Mac, or interrupting the cell in a Jupyter notebook) to obtain temporary results.
Toy dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
n_samples=200000,
n_features=6,
n_informative=2,
n_redundant=1,
n_repeated=1,
n_clusters_per_class=2,
flip_y=0.05,
class_sep=0.5,
random_state=123,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123
)
Long run with interruption
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=10000)
efs1 = EFS(model,
min_features=1,
max_features=4,
print_progress=True,
scoring='accuracy')
efs1 = efs1.fit(X_train, y_train)
Features: 56/56
Finalizing the fit
Note that the feature selection run hasn't finished, so certain attributes may not be available. In order to use the EFS instance, it is recommended to call finalize_fit, which will make the EFS estimator appear as "fitted" and process the temporary results:
efs1.finalize_fit()
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
Best accuracy score: 0.73
Best subset (indices): (1, 2)
Example 9 - Working with feature groups
Since mlxtend v0.21.0, it is possible to specify feature groups. Feature groups allow you to group certain features together such that they are always selected as a group. This can be very useful in contexts similar to one-hot encoding, for example, if you want to treat one-hot encoded features as a single feature (a small sketch of this scenario is shown below).
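For illustration, a minimal sketch (with hypothetical column names, not from the example below) of how one-hot encoded columns could be kept together as a single feature group:
import pandas as pd
# Hypothetical toy data with one categorical column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green'],
                   'size': [1.0, 2.0, 1.5, 3.0]})
# One-hot encoding produces color_blue, color_green, color_red
df_ohe = pd.get_dummies(df, columns=['color'])
# Keep the one-hot columns together as a single group during selection
one_hot_cols = [c for c in df_ohe.columns if c.startswith('color_')]
feature_groups = [one_hot_cols, ['size']]
print(feature_groups)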
In the following example, we specify sepal length and sepal width as a feature group so that they are always selected together:
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
X = iris.data
y = iris.target
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal wid', 'petal wid'])
X_df.head()
  | sepal len | petal len | sepal wid | petal wid |
--- | --- | --- | --- | --- |
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
knn = KNeighborsClassifier(n_neighbors=3)
efs1 = EFS(knn,
min_features=2,
max_features=2,
scoring='accuracy',
feature_groups=[['sepal len', 'sepal wid'], ['petal len'], ['petal wid']],
cv=3)
efs1 = efs1.fit(X_df, y)
print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)
Features: 3/3
Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('sepal len', 'sepal wid', 'petal wid')
Note that the returned number of features is 3, since the values of min_features and max_features refer to the number of feature groups. That is, the selected subset consists of 2 feature groups, ['sepal len', 'sepal wid'] and ['petal wid'], which expand to 3 individual features. A small sketch that makes this expansion explicit is shown below.
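To make the counting explicit, here is a minimal sketch that enumerates the 3 candidate group pairs and the individual features they expand to (matching the Features: 3/3 progress output above):
from itertools import chain, combinations
groups = [['sepal len', 'sepal wid'], ['petal len'], ['petal wid']]
# min_features=max_features=2 refers to pairs of groups
for pair in combinations(groups, 2):
    print(tuple(chain.from_iterable(pair)))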
API
ExhaustiveFeatureSelector(estimator, min_features=1, max_features=1, print_progress=True, scoring='accuracy', cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)
Exhaustive Feature Selection for Classification and Regression. (new in v0.4.3)
Parameters

- estimator : scikit-learn classifier or regressor
- min_features : int (default: 1)
  Minimum number of features to select.
- max_features : int (default: 1)
  Maximum number of features to select. If the parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e., len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the max_features value cannot exceed 4.
- print_progress : bool (default: True)
  Prints progress as the number of epochs to stderr.
- scoring : str (default: 'accuracy')
  Scoring metric in {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'r2'} for regressors, or a callable object or function with signature scorer(estimator, X, y).
- cv : int (default: 5)
  Scikit-learn cross-validation generator or int. If the estimator is a classifier (or y consists of integer class labels), stratified k-fold cross-validation is performed, and regular k-fold cross-validation otherwise. No cross-validation is performed if cv is None, False, or 0.
- n_jobs : int (default: 1)
  The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
- pre_dispatch : int or string (default: '2*n_jobs')
  Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than the CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned (use this for lightweight and fast-running jobs to avoid delays due to on-demand spawning of the jobs); an int, giving the exact number of total jobs that are spawned; or a string, giving an expression as a function of n_jobs, as in 2*n_jobs.
- clone_estimator : bool (default: True)
  Clones the estimator if True; works with the original estimator instance if False. Set to False if the estimator does not implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0 and n_jobs=1.
- fixed_features : tuple (default: None)
  If not None, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, if fixed_features=(1, 3, 7), the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that if fixed_features is not None, make sure that the number of features to be selected is greater than len(fixed_features). In other words, ensure that k_features > len(fixed_features).
- feature_groups : list or None (default: None)
  Optional argument for treating certain features as a group. This means the features within a group are always selected together and never split. For example, feature_groups=[[1], [2], [3, 4, 5]] specifies 3 feature groups. In this case, possible feature selection results with k_features=2 are [[1], [2]], [[1], [3, 4, 5]], or [[2], [3, 4, 5]]. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features. (For more details, please read the notes at the bottom of this docstring.) New in mlxtend v. 0.21.0.

Attributes
- best_idx_ : array-like, shape = [n_predictions]
  Feature indices of the selected feature subsets.
- best_feature_names_ : array-like, shape = [n_predictions]
  Feature names of the selected feature subsets. If pandas DataFrames are used in the fit method, the feature names correspond to the column names. Otherwise, the feature names are string representations of the feature array indices. New in v 0.13.0.
- best_score_ : float
  Cross-validation average score of the selected subset.
- subsets_ : dict
  A dictionary of selected feature subsets during the exhaustive selection, where the dictionary keys are the lengths k of these feature subsets. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of indices of the feature subset), 'feature_names' (tuple of feature names of the feature subset), 'cv_scores' (list of individual cross-validation scores), and 'avg_score' (average cross-validation score). Note that if pandas DataFrames are used in the fit method, 'feature_names' correspond to the column names. Otherwise, the feature names are string representations of the feature array indices. The 'feature_names' entry is new in v 0.13.0.

Notes
(1) If the parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e., len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the max_features value cannot exceed 4.

(2) Although two or more individual features may be considered as one group throughout the feature-selection process, it does not mean the individual features of that group have the same impact on the outcome. For instance, in linear regression, the coefficients of features 2 and 3 can be different even if they are considered as one group in feature_groups.

(3) If both fixed_features and feature_groups are specified, ensure that each feature group contains the fixed_features selection. E.g., for a 3-feature set, fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid; fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.

Examples

For usage examples, please see https://mlxtend.cn/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/

Methods
fit(X, y, groups=None, **fit_params)

Perform feature selection and learn the model from the training data.

Parameters

- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
- y : array-like, shape = [n_samples]
  Target values.
- groups : array-like, shape = (n_samples,), optional
  Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
- fit_params : dict of string -> object, optional
  Parameters to pass to the fit method of the classifier.

Returns

- self : object
fit_transform(X, y, groups=None, **fit_params)

Fit to the training data and return the best selected features from X.

Parameters

- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
- y : array-like, shape = [n_samples]
  Target values.
- groups : array-like, shape = (n_samples,), optional
  Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
- fit_params : dict of string -> object, optional
  Parameters to pass to the fit method of the classifier.

Returns

Feature subset of X, shape={n_samples, k_features}
get_metric_dict(confidence_interval=0.95)

Return a metric dictionary.

Parameters

- confidence_interval : float (default: 0.95)
  A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.

Returns

Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows: 'feature_idx' (tuple of the indices of the feature subset), 'cv_scores' (list with individual CV scores), 'avg_score' (average of CV scores), 'std_dev' (standard deviation of the CV score average), 'std_err' (standard error of the CV score average), and 'ci_bound' (confidence interval bound of the CV score average).
get_params(deep=True)

Get parameters for this estimator.

Parameters

- deep : bool, default=True
  If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

- params : dict
  Parameter names mapped to their values.
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as a scikit-learn Pipeline). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Parameters

- **params : dict
  Estimator parameters.

Returns

- self : estimator instance
  Estimator instance.
transform(X)

Return the best selected features from X.

Parameters

- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

Returns

Feature subset of X, shape={n_samples, k_features}

Copyright © 2014-2023 Sebastian Raschka