SequentialFeatureSelector: The popular forward and backward feature selection approaches (including floating variants)
Implementation of sequential feature algorithms (SFAs), greedy search algorithms that have been developed as a suboptimal solution to the computationally often infeasible exhaustive search.
from mlxtend.feature_selection import SequentialFeatureSelector
Overview
Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace, where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that is most relevant to the problem. The goal of feature selection is two-fold: we want to improve computational efficiency and reduce the model's generalization error by removing irrelevant features or noise. In addition, a wrapper approach such as sequential feature selection is advantageous if embedded feature selection (e.g., a regularization penalty like LASSO) is not applicable.
In a nutshell, SFAs remove or add one feature at a time based on the classifier performance until a feature subset of the desired size k is reached. Four different flavors of SFAs are available via the SequentialFeatureSelector:
- Sequential Forward Selection (SFS)
- Sequential Backward Selection (SBS)
- Sequential Forward Floating Selection (SFFS)
- Sequential Backward Floating Selection (SBFS)
The floating variants, SFFS and SBFS, can be considered extensions of the simpler SFS and SBS algorithms. The floating algorithms have an additional exclusion or inclusion step so that features can be removed once they were included (or excluded), which allows a larger number of feature subset combinations to be sampled. It is important to emphasize that this step is conditional and only occurs if the resulting feature subset is assessed as "better" by the criterion function after the removal (or addition) of a particular feature. Furthermore, I added an optional check that skips the conditional exclusion step if the algorithm gets stuck in cycles.
How is this different from recursive feature elimination (RFE), e.g., as implemented in sklearn.feature_selection.RFE? RFE is computationally less complex: it eliminates features recursively using the feature weight coefficients (e.g., of linear models) or feature importances (of tree-based algorithms), whereas SFSs eliminate (or add) features based on a user-defined classifier or regression performance metric.
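To make the contrast concrete, the following is a minimal side-by-side sketch (assuming scikit-learn and mlxtend are installed; the dataset and parameter values are illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# RFE: ranks features by the fitted model's coefficients and
# recursively drops the weakest one; no scoring metric is involved.
rfe = RFE(model, n_features_to_select=2).fit(X, y)
print(rfe.support_)

# SFS: adds the feature that maximizes a user-defined performance
# metric (here cross-validated accuracy) at each step.
sfs = SFS(model, k_features=2, forward=True, scoring='accuracy', cv=5).fit(X, y)
print(sfs.k_feature_idx_)
```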
Visual Illustration
A visual illustration of the sequential backward selection process is provided below, from the paper:
- Joe Bemister-Buffington, Alex J. Wolf, Sebastian Raschka, and Leslie A. Kuhn (2020). "Machine Learning to Identify Flexibility Signatures of Class A GPCR Inhibition." Biomolecules 2020, 10, 454. https://www.mdpi.com/2218-273X/10/3/454#
Algorithm Details

Sequential Forward Selection (SFS)

Input: $Y = \{y_1, y_2, ..., y_d\}$

- The SFS algorithm takes the whole $d$-dimensional feature set as input.

Output: $X_k = \{x_j \,|\, j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

- SFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.

Initialization: $X_0 = \emptyset$, $k = 0$

- We initialize the algorithm with an empty set $\emptyset$ ("null set") so that $k = 0$ (where $k$ is the size of the subset).

Step 1 (Inclusion):

$x^+ = \text{arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 1

- In this step, we add an additional feature, $x^+$, to our feature subset $X_k$.
- $x^+$ is the feature that maximizes our criterion function, that is, the feature that is associated with the best classifier performance if it is added to $X_k$.
- We repeat this procedure until the termination criterion is satisfied.

Termination: $k = p$

- We add features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
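For illustration, here is a minimal from-scratch sketch of the SFS procedure above (not the mlxtend implementation); the criterion function $J$ is approximated by cross-validated accuracy, and all names are illustrative:

```python
# A minimal educational sketch of Sequential Forward Selection;
# J(X_k) is approximated by 5-fold cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_forward_selection(estimator, X, y, p):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < p:                      # Termination: k = p
        # Step 1 (Inclusion): x+ = arg max J(X_k + x), x in Y - X_k
        scores = {j: cross_val_score(estimator, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return tuple(selected)

X, y = load_iris(return_X_y=True)
print(sequential_forward_selection(KNeighborsClassifier(n_neighbors=4), X, y, p=3))
```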
Sequential Backward Selection (SBS)

Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$

- The SBS algorithm takes the whole feature set as input.

Output: $X_k = \{x_j \,|\, j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

- SBS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.

Initialization: $X_0 = Y$, $k = d$

- We initialize the algorithm with the given feature set so that $k = d$.

Step 1 (Exclusion):

$x^- = \text{arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 1

- In this step, we remove a feature, $x^-$, from our feature subset $X_k$.
- $x^-$ is the feature whose removal maximizes our criterion function, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.
- We repeat this procedure until the termination criterion is satisfied.

Termination: $k = p$

- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
Sequential Backward Floating Selection (SBFS)

Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$

- The SBFS algorithm takes the whole feature set as input.

Output: $X_k = \{x_j \,|\, j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

- SBFS returns a subset of features; the number of selected features $k$, where $k < d$, has to be specified a priori.

Initialization: $X_0 = Y$, $k = d$

- We initialize the algorithm with the given feature set so that $k = d$.

Step 1 (Exclusion):

$x^- = \text{arg max } J(X_k - x), \text{ where } x \in X_k$
$X_{k-1} = X_k - x^-$
$k = k - 1$
Go to Step 2

- In this step, we remove a feature, $x^-$, from our feature subset $X_k$.
- $x^-$ is the feature whose removal maximizes our criterion function, that is, the feature that is associated with the best classifier performance if it is removed from $X_k$.

Step 2 (Conditional Inclusion):

$x^+ = \text{arg max } J(X_k + x), \text{ where } x \in Y - X_k$
if $J(X_k + x^+) > J(X_k)$:
$\quad X_{k+1} = X_k + x^+$
$\quad k = k + 1$
Go to Step 1

- In Step 2, we search for features that would improve the classifier performance if they were added back to the feature subset. If such features exist, we add the feature $x^+$ for which the performance improvement is maximized. If $k = 2$ or an improvement cannot be made (i.e., if no such feature $x^+$ can be found), go back to Step 1; else, repeat this step.

Termination: $k = p$

- We remove features from the feature subset $X_k$ until the feature subset of size $k$ contains the number of desired features $p$ that we specified a priori.
Sequential Forward Floating Selection (SFFS)

Input: the set of all features, $Y = \{y_1, y_2, ..., y_d\}$

- The SFFS algorithm takes the whole feature set as input, e.g., the entire feature set if our feature space consists of 10 dimensions (d = 10).

Output: a subset of features, $X_k = \{x_j \,|\, j = 1, 2, ..., k; \; x_j \in Y\}$, where $k = (0, 1, 2, ..., d)$

- The returned output of the algorithm is a subset of the feature space of a specified size, e.g., a subset of 5 features from a 10-dimensional feature space (k = 5, d = 10).

Initialization: $X_0 = \emptyset$, $k = 0$

- We initialize the algorithm with an empty set $\emptyset$ ("null set") so that $k = 0$ (where $k$ is the size of the subset).

Step 1 (Inclusion):

$x^+ = \text{arg max } J(X_k + x), \text{ where } x \in Y - X_k$
$X_{k+1} = X_k + x^+$
$k = k + 1$
Go to Step 2

Step 2 (Conditional Exclusion):

$x^- = \text{arg max } J(X_k - x), \text{ where } x \in X_k$
if $J(X_k - x^-) > J(X_k)$:
$\quad X_{k-1} = X_k - x^-$
$\quad k = k - 1$
Go to Step 1

- In Step 1, we include the feature from the feature space that leads to the best performance increase for our feature subset (assessed by the criterion function). Then, we go over to Step 2.
- In Step 2, we only remove a feature if the resulting subset would gain an increase in performance. If $k = 2$ or an improvement cannot be made (i.e., if no such feature $x^-$ can be found), go back to Step 1; else, repeat this step.
- Steps 1 and 2 are repeated until the termination criterion is reached.

Termination: stop when $k$ equals the number of desired features.
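To make the conditional step concrete, here is a minimal educational sketch of SFFS's Step 2 (conditional exclusion), reusing the cross-validated-accuracy stand-in for $J$ from the SFS sketch above; all names are illustrative, and this is not the mlxtend implementation:

```python
# Educational sketch of the conditional exclusion step (Step 2 of SFFS);
# J is again approximated by 5-fold cross-validated accuracy.
from sklearn.model_selection import cross_val_score

def conditional_exclusion(estimator, X, y, selected):
    """Repeatedly remove the feature whose removal maximizes J,
    but only while the removal actually improves J and k > 2."""
    current = cross_val_score(estimator, X[:, selected], y, cv=5).mean()
    while len(selected) > 2:
        # x- = arg max J(X_k - x), x in X_k
        scores = {j: cross_val_score(estimator,
                                     X[:, [f for f in selected if f != j]],
                                     y, cv=5).mean()
                  for j in selected}
        worst = max(scores, key=scores.get)
        if scores[worst] > current:   # conditional: only exclude if J improves
            selected.remove(worst)
            current = scores[worst]
        else:                         # no improvement: back to Step 1
            break
    return selected
```

In the full SFFS loop, this function would be called after every inclusion step; SBFS mirrors it with a conditional inclusion instead.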
References
- Ferri, F. J., Pudil, P., Hatef, M., & Kittler, J. (1994). "Comparative study of techniques for large-scale feature selection." Pattern Recognition in Practice IV: 403-413.
- Pudil, P., Novovičová, J., & Kittler, J. (1994). "Floating search methods in feature selection." Pattern Recognition Letters, 15(11): 1119-1125.
Example 1 - A simple Sequential Forward Selection example
Initializing a simple classifier from scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
We start by selecting the "best" 3 features from the Iris dataset via Sequential Forward Selection (SFS). Here, we set forward=True and floating=False. By choosing cv=0, we don't perform any cross-validation; therefore, the performance (here: 'accuracy') is computed entirely on the training set.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=0)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334
Via the subsets_ attribute, we can take a look at the selected feature indices at each step:
sfs1.subsets_
{1: {'feature_idx': (3,),
'cv_scores': array([0.96]),
'avg_score': 0.96,
'feature_names': ('3',)},
2: {'feature_idx': (2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('2', '3')},
3: {'feature_idx': (1, 2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('1', '2', '3')}}
Furthermore, we can access the indices of the 3 best features directly via the k_feature_idx_ attribute:
sfs1.k_feature_idx_
(1, 2, 3)
Finally, the prediction score for these 3 features can be accessed via k_score_:
sfs1.k_score_
0.9733333333333334
Feature Names
When working with large datasets, the feature indices can be hard to interpret. In this case, we recommend using a pandas DataFrame with distinct column names as input:
import pandas as pd
df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])
df_X.head()
| | Sepal length | Sepal width | Petal length | Petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
sfs1 = sfs1.fit(df_X, y)
print('Best accuracy score: %.2f' % sfs1.k_score_)
print('Best subset (indices):', sfs1.k_feature_idx_)
print('Best subset (corresponding names):', sfs1.k_feature_names_)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best accuracy score: 0.97
Best subset (indices): (1, 2, 3)
Best subset (corresponding names): ('Sepal width', 'Petal length', 'Petal width')
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 1/3 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 2/3 -- score: 0.9733333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:17] Features: 3/3 -- score: 0.9733333333333334
Example 2 - Toggling between SFS, SBS, SFFS, and SBFS
Using the forward and floating parameters, we can toggle between SFS, SBS, SFFS, and SBFS as shown below. Note that, in contrast to Example 1, we perform (stratified) 4-fold cross-validation for more robust estimates. Via n_jobs=-1, we choose to run the cross-validation on all available CPU cores.
# Sequential Forward Selection
sfs = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=4,
n_jobs=-1)
sfs = sfs.fit(X, y)
print('\nSequential Forward Selection (k=3):')
print(sfs.k_feature_idx_)
print('CV Score:')
print(sfs.k_score_)
###################################################
# Sequential Backward Selection
sbs = SFS(knn,
k_features=3,
forward=False,
floating=False,
scoring='accuracy',
cv=4,
n_jobs=-1)
sbs = sbs.fit(X, y)
print('\nSequential Backward Selection (k=3):')
print(sbs.k_feature_idx_)
print('CV Score:')
print(sbs.k_score_)
###################################################
# Sequential Forward Floating Selection
sffs = SFS(knn,
k_features=3,
forward=True,
floating=True,
scoring='accuracy',
cv=4,
n_jobs=-1)
sffs = sffs.fit(X, y)
print('\nSequential Forward Floating Selection (k=3):')
print(sffs.k_feature_idx_)
print('CV Score:')
print(sffs.k_score_)
###################################################
# Sequential Backward Floating Selection
sbfs = SFS(knn,
k_features=3,
forward=False,
floating=True,
scoring='accuracy',
cv=4,
n_jobs=-1)
sbfs = sbfs.fit(X, y)
print('\nSequential Backward Floating Selection (k=3):')
print(sbfs.k_feature_idx_)
print('CV Score:')
print(sbfs.k_score_)
Sequential Forward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Backward Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Forward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
Sequential Backward Floating Selection (k=3):
(1, 2, 3)
CV Score:
0.9731507823613088
In this simple scenario, selecting the best 3 features out of the 4 available features in the Iris dataset, we end up with similar results regardless of which sequential selection algorithm we used.
Example 3 - Visualizing the results in DataFrames
For our convenience, we can visualize the output of the feature selection in a pandas DataFrame format using the get_metric_dict method of the SequentialFeatureSelector object. The std_dev and std_err columns represent the standard deviation and standard error of the cross-validation scores, respectively.
Below is the DataFrame of the Sequential Forward Selector from Example 2:
import pandas as pd
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
| | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 1 | (3,) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.959993 | (3,) | 0.048319 | 0.030143 | 0.017403 |
| 2 | (2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.959993 | (2, 3) | 0.048319 | 0.030143 | 0.017403 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.030639 | 0.019113 | 0.011035 |
Now, let's compare it to the Sequential Backward Selector:
pd.DataFrame.from_dict(sbs.get_metric_dict()).T
| | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 4 | (0, 1, 2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.953236 | (0, 1, 2, 3) | 0.03602 | 0.022471 | 0.012974 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.030639 | 0.019113 | 0.011035 |
We can see that both SFS and SBS found the same "best" 3 features; however, the intermediate steps are obviously different.
The ci_bound column in the DataFrames above represents the confidence interval around the computed cross-validation scores. By default, a confidence interval of 95% is used, but we can use different confidence bounds via the confidence_interval parameter. For example, the confidence bounds for a 90% confidence interval can be obtained as follows:
pd.DataFrame.from_dict(sbs.get_metric_dict(confidence_interval=0.90)).T
| | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 4 | (0, 1, 2, 3) | [0.9736842105263158, 0.9473684210526315, 0.918... | 0.953236 | (0, 1, 2, 3) | 0.027658 | 0.022471 | 0.012974 |
| 3 | (1, 2, 3) | [0.9736842105263158, 1.0, 0.9459459459459459, ... | 0.973151 | (1, 2, 3) | 0.023525 | 0.019113 | 0.011035 |
Example 4 - Plotting the results
After importing the little helper function plotting.plot_sequential_feature_selection, we can also visualize the results using matplotlib figures.
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
sfs = SFS(knn,
k_features=4,
forward=True,
floating=False,
scoring='accuracy',
verbose=2,
cv=5)
sfs = sfs.fit(X, y)
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.ylim([0.8, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 1/4 -- score: 0.96[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 2/4 -- score: 0.9666666666666668[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 3/4 -- score: 0.9533333333333334[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:36:18] Features: 4/4 -- score: 0.9733333333333334
Example 5 - Sequential Feature Selection for Regression
Similar to the classification examples above, the SequentialFeatureSelector also supports scikit-learn's estimators for regression.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X, y = data.data, data.target
lr = LinearRegression()
sfs = SFS(lr,
k_features=8,
forward=True,
floating=False,
scoring='neg_mean_squared_error',
cv=10)
sfs = sfs.fit(X, y)
fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')
plt.title('Sequential Forward Selection (w. StdErr)')
plt.grid()
plt.show()
Example 6 - Feature Selection with Fixed Train/Validation Splits
If you do not wish to use cross-validation (here: k-fold cross-validation, i.e., rotating training and validation folds), you can use the PredefinedHoldoutSplit class to specify your own, fixed training and validation split.
from sklearn.datasets import load_iris
from mlxtend.evaluate import PredefinedHoldoutSplit
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
rng = np.random.RandomState(123)
my_validation_indices = rng.permutation(np.arange(150))[:30]
print(my_validation_indices)
[ 72 112 132 88 37 138 87 42 8 90 141 33 59 116 135 104 36 13
63 45 28 133 24 127 46 20 31 121 117 4]
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
knn = KNeighborsClassifier(n_neighbors=4)
piter = PredefinedHoldoutSplit(my_validation_indices)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=piter)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 1/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 2/3 -- score: 0.9666666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:36:19] Features: 3/3 -- score: 0.9666666666666667
Example 7 - Using the Selected Feature Subset For Making New Predictions
# Initialize the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
knn = KNeighborsClassifier(n_neighbors=4)
# Select the "best" three features via
# 5-fold cross-validation on the training set.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)
sfs1 = sfs1.fit(X_train, y_train)
print('Selected features:', sfs1.k_feature_idx_)
Selected features: (1, 2, 3)
# Generate the new subsets based on the selected features
# Note that the transform call is equivalent to
# X_train[:, sfs1.k_feature_idx_]
X_train_sfs = sfs1.transform(X_train)
X_test_sfs = sfs1.transform(X_test)
# Fit the estimator using the new feature subset
# and make a prediction on the test data
knn.fit(X_train_sfs, y_train)
y_pred = knn.predict(X_test_sfs)
# Compute the accuracy of the prediction
acc = float((y_test == y_pred).sum()) / y_pred.shape[0]
print('Test set accuracy: %.2f %%' % (acc * 100))
Test set accuracy: 96.00 %
Example 8 - Sequential Feature Selection and GridSearch
In the following example, we tune the SFS's estimator using GridSearch. To avoid unwanted behavior or side effects, it is advised to use the estimator inside and outside of SFS as separate instances.
# Initialize the dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
knn1 = KNeighborsClassifier()
knn2 = KNeighborsClassifier()
sfs1 = SFS(estimator=knn1,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)
pipe = Pipeline([('sfs', sfs1),
('knn2', knn2)])
param_grid = {
'sfs__k_features': [1, 2, 3],
'sfs__estimator__n_neighbors': [3, 4, 7], # inner knn
'knn2__n_neighbors': [3, 4, 7] # outer knn
}
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
n_jobs=1,
cv=5,
refit=False)
# run gridsearch
gs = gs.fit(X_train, y_train)
Let's take a look at the suggested hyperparameters below:
for i in range(len(gs.cv_results_['params'])):
    print(gs.cv_results_['params'][i], 'test acc.:', gs.cv_results_['mean_test_score'][i])
The "best" parameters determined by GridSearch are...
print("Best parameters via GridSearch", gs.best_params_)
Best parameters via GridSearch {'knn2__n_neighbors': 7, 'sfs__estimator__n_neighbors': 3, 'sfs__k_features': 3}
pipe.set_params(**gs.best_params_).fit(X_train, y_train)
Pipeline(steps=[('sfs',
                 SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),
                                           k_features=(3, 3),
                                           scoring='accuracy')),
                ('knn2', KNeighborsClassifier(n_neighbors=7))])
Example 9 - Selecting the "best" feature combination in a k-range
If k_features is set to a tuple (min_k, max_k) (new in 0.4.2), the SFS will select the best feature combination that it discovered by iterating from k=1 to max_k (forward) or from max_k to min_k (backward). The size of the returned feature subset is then between max_k and min_k, depending on which combination scored best during cross-validation.
X.shape
(150, 4)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import wine_data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
X, y = wine_data()
X_train, X_test, y_train, y_test= train_test_split(X, y,
stratify=y,
test_size=0.3,
random_state=1)
knn = KNeighborsClassifier(n_neighbors=2)
sfs1 = SFS(estimator=knn,
k_features=(3, 10),
forward=True,
floating=False,
scoring='accuracy',
cv=5)
pipe = make_pipeline(StandardScaler(), sfs1)
pipe.fit(X_train, y_train)
print('best combination (ACC: %.3f): %s\n' % (sfs1.k_score_, sfs1.k_feature_idx_))
print('all subsets:\n', sfs1.subsets_)
plot_sfs(sfs1.get_metric_dict(), kind='std_err');
best combination (ACC: 0.992): (0, 1, 2, 3, 6, 8, 9, 10, 11, 12)
all subsets:
{1: {'feature_idx': (6,), 'cv_scores': array([0.84 , 0.64 , 0.84 , 0.8 , 0.875]), 'avg_score': 0.799, 'feature_names': ('6',)}, 2: {'feature_idx': (6, 9), 'cv_scores': array([0.92 , 0.88 , 1. , 0.96 , 0.91666667]), 'avg_score': 0.9353333333333333, 'feature_names': ('6', '9')}, 3: {'feature_idx': (6, 9, 12), 'cv_scores': array([0.92 , 0.92 , 0.96 , 1. , 0.95833333]), 'avg_score': 0.9516666666666665, 'feature_names': ('6', '9', '12')}, 4: {'feature_idx': (3, 6, 9, 12), 'cv_scores': array([0.96 , 0.96 , 0.96 , 1. , 0.95833333]), 'avg_score': 0.9676666666666666, 'feature_names': ('3', '6', '9', '12')}, 5: {'feature_idx': (3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 1. , 1. ]), 'avg_score': 0.976, 'feature_names': ('3', '6', '9', '10', '12')}, 6: {'feature_idx': (2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.96, 1. , 0.96, 1. ]), 'avg_score': 0.968, 'feature_names': ('2', '3', '6', '9', '10', '12')}, 7: {'feature_idx': (0, 2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.92, 1. , 1. , 1. ]), 'avg_score': 0.968, 'feature_names': ('0', '2', '3', '6', '9', '10', '12')}, 8: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '12')}, 9: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10', '11', '12')}, 10: {'feature_idx': (0, 1, 2, 3, 6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.96, 1. , 1. , 1. ]), 'avg_score': 0.992, 'feature_names': ('0', '1', '2', '3', '6', '8', '9', '10', '11', '12')}}
Example 10 - Using other cross-validation schemes
In addition to standard k-fold and stratified k-fold cross-validation, other cross-validation schemes can be used with the SequentialFeatureSelector, for example, GroupKFold or LeaveOneOut cross-validation from scikit-learn.
Using GroupKFold with the SequentialFeatureSelector
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import GroupKFold
import numpy as np
X, y = iris_data()
groups = np.arange(len(y)) // 10
print('groups: {}'.format(groups))
groups: [ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2
2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7
7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9
9 9 9 9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
14 14 14 14 14 14]
Calling the split() method of a scikit-learn cross-validator object will return a generator that yields train/test splits.
cv_gen = GroupKFold(4).split(X, y, groups)
cv_gen
<generator object _BaseKFold.split at 0x17c109580>
The cv parameter of the SequentialFeatureSelector must be either an int or an iterable yielding train/test splits. Such an iterable can be constructed by passing the train/test split generator to the built-in list() function:
cv = list(cv_gen)
knn = KNeighborsClassifier(n_neighbors=2)
sfs = SFS(estimator=knn,
k_features=2,
scoring='accuracy',
cv=cv)
sfs.fit(X, y)
print('best combination (ACC: %.3f): %s\n' % (sfs.k_score_, sfs.k_feature_idx_))
best combination (ACC: 0.940): (2, 3)
Example 11 - Interrupting Long Runs for Intermediate Results
If your run is taking too long, it is possible to trigger a KeyboardInterrupt (e.g., ctrl+c on a Mac, or interrupting the cell in a Jupyter notebook) to obtain temporary results.
Toy dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
n_samples=20000,
n_features=500,
n_informative=10,
n_redundant=40,
n_repeated=25,
n_clusters_per_class=5,
flip_y=0.05,
class_sep=0.5,
random_state=123,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123
)
Long run with interruption
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
sfs1 = SFS(model,
k_features=10,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=5)
sfs1 = sfs1.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed: 8.3s finished
[2023-05-17 08:36:32] Features: 1/10 -- score: 0.5965[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 499 out of 499 | elapsed: 13.8s finished
[2023-05-17 08:36:45] Features: 2/10 -- score: 0.6256875000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 498 out of 498 | elapsed: 18.1s finished
[2023-05-17 08:37:03] Features: 3/10 -- score: 0.642[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 497 out of 497 | elapsed: 20.4s finished
[2023-05-17 08:37:24] Features: 4/10 -- score: 0.6463125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 496 out of 496 | elapsed: 22.2s finished
[2023-05-17 08:37:46] Features: 5/10 -- score: 0.6495000000000001[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 495 out of 495 | elapsed: 26.1s finished
[2023-05-17 08:38:12] Features: 6/10 -- score: 0.6514374999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 494 out of 494 | elapsed: 26.1s finished
[2023-05-17 08:38:38] Features: 7/10 -- score: 0.6533749999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 493 out of 493 | elapsed: 25.3s finished
[2023-05-17 08:39:04] Features: 8/10 -- score: 0.6545624999999999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 492 out of 492 | elapsed: 26.3s finished
[2023-05-17 08:39:30] Features: 9/10 -- score: 0.6549375[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 491 out of 491 | elapsed: 27.0s finished
[2023-05-17 08:39:57] Features: 10/10 -- score: 0.6554374999999999
Finalizing the fit
Note that the feature selection run hasn't finished, so certain attributes may not be available. In order to use the SFS instance, it is recommended to call finalize_fit, which will make the SFS estimator appear as "fitted" and process the temporary results:
sfs1.finalize_fit()
print(sfs1.k_feature_idx_)
print(sfs1.k_score_)
(30, 128, 144, 160, 184, 229, 256, 356, 439, 458)
0.6554374999999999
Example 12 - Using Pandas DataFrames
Optionally, we can also use pandas DataFrames and pandas Series as input to the fit function. In this case, the column names of the pandas DataFrame will be used as feature names. Note, however, that if custom_feature_names are provided in the fit function, these custom_feature_names take precedence over the DataFrame column-based feature names.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=0)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal width', 'petal width'])
X_df.head()
| | sepal len | petal len | sepal width | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
Also, the target array y can optionally be cast as a Series:
y_series = pd.Series(y)
y_series.head()
0 0
1 0
2 0
3 0
4 0
dtype: int64
sfs1 = sfs1.fit(X_df, y_series)
Note that the only difference when passing a pandas DataFrame as input is that the sfs1.subsets_ dictionary now lists the DataFrame column names in its 'feature_names' entries:
sfs1.subsets_
{1: {'feature_idx': (3,),
'cv_scores': array([0.96]),
'avg_score': 0.96,
'feature_names': ('petal width',)},
2: {'feature_idx': (2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('sepal width', 'petal width')},
3: {'feature_idx': (1, 2, 3),
'cv_scores': array([0.97333333]),
'avg_score': 0.9733333333333334,
'feature_names': ('petal len', 'sepal width', 'petal width')}}
Support for pandas DataFrames (instead of NumPy arrays or other NumPy-like array types) as feature input to the SequentialFeatureSelector was added in mlxtend version >= 0.13.
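Regarding the custom_feature_names mentioned in the introduction to this example, a hypothetical call could look as follows; whether fit accepts this argument depends on the installed mlxtend version, so treat this purely as a sketch:

```python
# Hypothetical sketch: overriding the DataFrame column names via
# custom_feature_names (availability depends on the mlxtend version).
sfs_custom = SFS(knn, k_features=3, forward=True, floating=False,
                 scoring='accuracy', cv=0)
sfs_custom = sfs_custom.fit(X_df, y_series,
                            custom_feature_names=('sl', 'pl', 'sw', 'pw'))
print(sfs_custom.subsets_[3]['feature_names'])
```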
Example 13 - Specifying Fixed Feature Sets
Often, it can be useful to specify a fixed set of features we want to use for a given model (e.g., determined by prior knowledge or domain knowledge). Since MLxtend v 0.18.0, it is possible to specify such features via the fixed_features attribute. This means that these features are guaranteed to be included in the selected subsets.
Note that this feature works for all options regarding forward and backward selection, and whether or not floating selection is used.
The example below illustrates how we can set features 0 and 2 in the dataset as fixed:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=(0, 2),
cv=3)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9733333333333333[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs1.subsets_
{2: {'feature_idx': (0, 2),
'cv_scores': array([0.98, 0.92, 0.94]),
'avg_score': 0.9466666666666667,
'feature_names': ('0', '2')},
3: {'feature_idx': (0, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('0', '2', '3')},
4: {'feature_idx': (0, 1, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('0', '1', '2', '3')}}
If the input dataset is a pandas DataFrame, we can also use the column names directly:
import pandas as pd
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal width', 'petal width'])
X_df.head()
| | sepal len | petal len | sepal width | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
sfs2 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=('sepal len', 'petal len'),
cv=3)
sfs2 = sfs2.fit(X_df, y_series)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/4 -- score: 0.9466666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 4/4 -- score: 0.9733333333333333
sfs2.subsets_
{2: {'feature_idx': (0, 1),
'cv_scores': array([0.72, 0.74, 0.78]),
'avg_score': 0.7466666666666667,
'feature_names': ('sepal len', 'petal len')},
3: {'feature_idx': (0, 1, 2),
'cv_scores': array([0.98, 0.92, 0.94]),
'avg_score': 0.9466666666666667,
'feature_names': ('sepal len', 'petal len', 'sepal width')},
4: {'feature_idx': (0, 1, 2, 3),
'cv_scores': array([0.98, 0.96, 0.98]),
'avg_score': 0.9733333333333333,
'feature_names': ('sepal len', 'petal len', 'sepal width', 'petal width')}}
Example 14 - Working with Feature Groups
Since mlxtend v0.21.0, it is possible to specify feature groups. Feature groups allow you to group certain features together such that they are always selected as a group. This can be very useful in contexts similar to one-hot encoding, that is, if you want to treat one-hot encoded features as a single feature.
In the example below, we specify sepal length and sepal width as a feature group so that they are always selected together:
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
X = iris.data
y = iris.target
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal wid', 'petal wid'])
X_df.head()
| | sepal len | petal len | sepal wid | petal wid |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
k_features=2,
scoring='accuracy',
feature_groups=(['sepal len', 'sepal wid'], ['petal len'], ['petal wid']),
cv=3)
sfs1 = sfs1.fit(X_df, y)
sfs1 = SFS(knn, k_features=2, scoring='accuracy', feature_groups=[[0, 2], [1], [3]], cv=3)
sfs1 = sfs1.fit(X, y)
Example 15 - Multiclass Evaluation Metrics
Certain scoring metrics, such as ROC AUC, were originally designed for binary classification. However, they can also be used in multiclass settings; it is best to consult the scikit-learn metrics table for this.
For example, we can compute a ROC AUC score via the one-vs-rest approach using 'roc_auc_ovr', as shown below.
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10, centers=4, n_features=5, random_state=0)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='roc_auc_ovr',
cv=0)
sfs1 = sfs1.fit(X, y)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 1/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 2/3 -- score: 1.0[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
[2023-05-17 08:39:57] Features: 3/3 -- score: 1.0
API
SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)
Sequential Feature Selection for Classification and Regression.
Parameters
- estimator : scikit-learn classifier or regressor
- k_features : int or tuple or str (default: 1). The number of features to select, where k_features < the full feature set. New in 0.4.2: a tuple containing a min and max value can be provided, and the SFS will consider returning any feature combination between min and max that scored highest in cross-validation. For example, the tuple (1, 4) will return any combination of 1 up to 4 features instead of a fixed number of features k. New in 0.8.0: the string arguments "best" or "parsimonious". If "best" is provided, the feature selector returns the feature subset with the best cross-validation performance. If "parsimonious" is provided, the smallest feature subset that is within one standard error of the cross-validation performance is selected (see the sketch after this parameter list).
- forward : bool (default: True). Forward selection if True, backward selection otherwise.
- floating : bool (default: False). Adds a conditional exclusion/inclusion step if True.
- verbose : int (default: 0). Level of verbosity to use in logging. If 0, no output; if 1, the number of features in the current set; if 2, detailed logging including timestamp and cv scores at each step.
- scoring : str, callable, or None (default: None). If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses an sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors. If a callable object or function is provided, it has to conform with sklearn's signature scorer(estimator, X, y); see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.
- cv : int (default: 5). Integer or iterable yielding train/test splits. If cv is an integer and the estimator is a classifier (or y consists of integer class labels), stratified k-fold cross-validation is performed; otherwise, regular k-fold cross-validation is performed. No cross-validation if cv is None, False, or 0.
- n_jobs : int (default: 1). The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
- pre_dispatch : int or string (default: '2*n_jobs'). Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned (use this for lightweight and fast-running jobs to avoid delays due to on-demand spawning of the jobs); an int, giving the exact number of total jobs that are spawned; or a string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
- clone_estimator : bool (default: True). Clones the estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0 and n_jobs=1.
- fixed_features : tuple (default: None). If not None, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, if fixed_features=(1, 3, 7), the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that if fixed_features is not None, make sure that the number of features to be selected is greater than len(fixed_features); in other words, ensure that k_features > len(fixed_features). New in mlxtend v. 0.18.0.
- feature_groups : list or None (default: None). Optional argument for treating certain features as a group. This means the features within a group are always selected together, never split. For example, feature_groups=[[1], [2], [3, 4, 5]] specifies 3 feature groups. In this case, possible feature selection results with k_features=2 are [[1], [2]], [[1], [3, 4, 5]], or [[2], [3, 4, 5]]. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features (for more details, please read the notes at the bottom of this docstring). New in mlxtend v. 0.21.0.
Attributes
- k_feature_idx_ : array-like, shape = [n_predictions]. Feature indices of the selected feature subsets.
- k_feature_names_ : array-like, shape = [n_predictions]. Feature names of the selected feature subsets. If pandas DataFrames are used in the fit method, the feature names correspond to the column names; otherwise, the feature names are string representations of the feature array indices. New in v 0.13.0.
- k_score_ : float. Cross-validation average score of the selected subset.
- subsets_ : dict. A dictionary of the selected feature subsets during the sequential selection, where the dictionary keys are the lengths k of these feature subsets. If the parameter feature_groups is not None, the key values indicate the number of groups that were selected together. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of the indices of the feature subset), 'feature_names' (tuple of the feature names of the feature subset), 'cv_scores' (list of individual cross-validation scores), and 'avg_score' (average cross-validation score). Note that if pandas DataFrames are used in the fit method, 'feature_names' correspond to the column names; otherwise, the feature names are string representations of the feature array indices. 'feature_names' is new in v 0.13.0.
Notes
(1) If the parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e., len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], the max_features value cannot exceed 4.
(2) Although two or more individual features may be considered as one group
throughout the feature-selection process, it does not mean the individual
features of that group have the same impact on the outcome. For instance, in
linear regression, the coefficients of features 2 and 3 can differ even if
they are considered as one group in feature_groups.
(3) If both fixed_features and feature_groups are specified, ensure that each
feature group contains the fixed_features selection. E.g., for a 3-feature set,
fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;
fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid. A small
sketch follows after these notes.
(4) In case of a KeyboardInterrupt, the dictionary subsets_ may not be complete.
If the user is still interested in getting the best score, they can use the
method `finalize_fit`.
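To make notes (1) and (3) concrete, here is a small sketch of a valid combination of fixed_features and feature_groups (assuming the Iris data; the exact k_features semantics with groups follow the notes above):

```python
# Sketch for note (3): every fixed feature must be covered by a group
# that can be selected as a whole; [0, 1] covers fixed_features=(0, 1).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)  # 4 features
sfs = SFS(KNeighborsClassifier(),
          k_features=3,
          fixed_features=(0, 1),
          feature_groups=[[0, 1], [2], [3]],  # valid grouping per note (3)
          cv=3)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)  # guaranteed to contain features 0 and 1
```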
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/
Methods
finalize_fit()
None
fit(X, y, groups=None, **fit_params)
Perform feature selection and learn the model from training data.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]. Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as an argument for X.
- y : array-like, shape = [n_samples]. Target values. New in v 0.13.0: pandas DataFrames are now also accepted as an argument for y.
- groups : array-like, with shape (n_samples,), optional. Group labels for the samples used while splitting the dataset into train/test sets. Passed to the fit method of the cross-validator.
- fit_params : various, optional. Additional parameters that are being passed to the estimator, for example, sample_weights=weights.
Returns
self : object
fit_transform(X, y, groups=None, **fit_params)
Fit to training data, then reduce X to its most important features.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]. Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as an argument for X.
- y : array-like, shape = [n_samples]. Target values. New in v 0.13.0: pandas Series are now also accepted as an argument for y.
- groups : array-like, with shape (n_samples,), optional. Group labels for the samples used while splitting the dataset into train/test sets. Passed to the fit method of the cross-validator.
- fit_params : various, optional. Additional parameters that are being passed to the estimator, for example, sample_weights=weights.
Returns
Reduced feature subset of X, shape={n_samples, k_features}
generate_error_message_k_features(name)
None
get_metric_dict(confidence_interval=0.95)
Return a metric dictionary.
Parameters
- confidence_interval : float (default: 0.95). A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.
Returns
A dictionary in which each value is a list whose length equals the number of iterations (number of feature subsets). The dictionary keys corresponding to these lists are as follows: 'feature_idx': tuple of the indices of the feature subset; 'cv_scores': list of individual CV scores; 'avg_score': average CV score; 'std_dev': standard deviation of the CV score average; 'std_err': standard error of the CV score average; 'ci_bound': confidence interval bound of the CV score average.
get_params(deep=True)
Get the parameters of this estimator.
Parameters
- deep : bool, default=True. If True, will return the parameters of this estimator and of contained subobjects that are estimators.
Returns
- params : dict. Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator. Valid parameter keys can be listed with get_params().
Returns
self
transform(X)
Reduce X to its most important features.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]. Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as an argument for X.
Returns
Reduced feature subset of X, shape={n_samples, k_features}
Properties
named_estimators
Returns
List of named estimator tuples, like [('svc', SVC(...))]