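PrincipalComponentAnalysis: Principal component analysis (PCA) for dimensionality reduction

Implementation of principal component analysis for dimensionality reduction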

from mlxtend.feature_extraction import PrincipalComponentAnalysis

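Overview

The sheer size of modern datasets is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. The main goal of PCA is to identify patterns in data; specifically, PCA aims to detect correlations between variables. Attempting to reduce the dimensionality only makes sense if strong correlations between variables exist. In a nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional data and projecting the data onto a lower-dimensional subspace while retaining most of the information.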

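PCA and Dimensionality Reduction

Often, the desired goal is to improve computational efficiency by projecting a $d$-dimensional dataset onto a $k$-dimensional subspace (where $k < d$) while retaining most of the information. An important question is: "How large does $k$ need to be to represent the data 'well'?"

Later, we will compute the eigenvectors (the principal components) of a dataset and collect them in a projection matrix. Each of those eigenvectors is associated with an eigenvalue, which can be interpreted as the "length" or "magnitude" of the corresponding eigenvector. If some eigenvalues are significantly larger than the others, it is reasonable to reduce the dataset to a smaller subspace via PCA by dropping the "less informative" eigenpairs.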
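One common heuristic for picking $k$, which is an assumption here and not something this guide prescribes, is to keep the smallest number of components whose cumulative explained variance ratio reaches a chosen threshold. The sketch below uses the Iris eigenvalues printed in Example 2; the helper name `choose_k` and the 95% threshold are hypothetical.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose cumulative explained variance ratio reaches threshold.

    Hypothetical helper for illustration; not part of mlxtend.
    """
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending order
    ratios = vals / vals.sum()        # explained variance ratio per component
    cumulative = np.cumsum(ratios)    # running total of explained variance
    # Index of the first cumulative value >= threshold, converted to a count
    # and capped at the number of available components.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(vals))

# Eigenvalues of the standardized Iris data, as printed in Example 2.
e_vals = [2.91081808, 0.92122093, 0.14735328, 0.02060771]
k = choose_k(e_vals, threshold=0.95)
print(k)
```

The threshold is a modeling choice: a stricter value keeps more components, so it is worth checking the result against a plot like the one in Example 2.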

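A Summary of the PCA Approach

Standardize the data.
Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform a singular value decomposition.
Sort the eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the number of dimensions of the new feature subspace ($k \le d$).
Construct the projection matrix $\mathbf{W}$ from the selected $k$ eigenvectors.
Transform the original dataset $\mathbf{X}$ via $\mathbf{W}$ to obtain a $k$-dimensional feature subspace $\mathbf{Y}$.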
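The five steps above can be sketched in plain NumPy. This is a minimal illustration on synthetic data, not the mlxtend implementation; the function name `pca_project` and the use of the sample standard deviation (`ddof=1`) for standardization are assumptions made for this example.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # Step 1: standardize each feature to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: eigendecomposition of the covariance matrix
    # (eigh is appropriate because the matrix is symmetric).
    cov = np.cov(X_std, rowvar=False)
    e_vals, e_vecs = np.linalg.eigh(cov)
    # Step 3: sort eigenpairs by descending eigenvalue and keep the top k.
    order = np.argsort(e_vals)[::-1][:k]
    # Step 4: projection matrix W with shape (d, k).
    W = e_vecs[:, order]
    # Step 5: transform the data into the k-dimensional subspace.
    Y = X_std @ W
    return Y, W, e_vals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]   # introduce correlation between two features
Y, W, top_vals = pca_project(X, k=2)
print(Y.shape, W.shape)
```

The columns of `Y` are uncorrelated, and their variances equal the selected eigenvalues, which is a quick sanity check for any PCA implementation.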


Example 1 - PCA on Iris

from mlxtend.data import iris_data
from mlxtend.preprocessing import standardize
from mlxtend.feature_extraction import PrincipalComponentAnalysis

X, y = iris_data()
X = standardize(X)

pca = PrincipalComponentAnalysis(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)

import matplotlib.pyplot as plt

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    for lab, col in zip((0, 1, 2),
                        ('blue', 'red', 'green')):
        plt.scatter(X_pca[y==lab, 0],
                    X_pca[y==lab, 1],
                    label=lab,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()


Example 2 - Plotting the Explained Variance Ratio

from mlxtend.data import iris_data
from mlxtend.preprocessing import standardize
from mlxtend.feature_extraction import PrincipalComponentAnalysis
import matplotlib.pyplot as plt

X, y = iris_data()
X = standardize(X)

pca = PrincipalComponentAnalysis(n_components=None)
pca.fit(X)
X_pca = pca.transform(X)

pca.e_vals_

array([2.91081808, 0.92122093, 0.14735328, 0.02060771])

pca.e_vals_normalized_

array([0.72770452, 0.23030523, 0.03683832, 0.00515193])

import numpy as np

tot = sum(pca.e_vals_)
var_exp = [(i / tot)*100 for i in sorted(pca.e_vals_, reverse=True)]
cum_var_exp = np.cumsum(pca.e_vals_normalized_*100)

with plt.style.context('seaborn-whitegrid'):
    fig, ax = plt.subplots(figsize=(6, 4))
    plt.bar(range(4), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.step(range(4), cum_var_exp, where='mid',
             label='cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.xticks(range(4))
    ax.set_xticklabels(np.arange(1, X.shape[1] + 1))
    plt.legend(loc='best')
    plt.tight_layout()


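Example 3 - PCA via SVD

While the eigendecomposition of the covariance or correlation matrix may be more intuitive, most PCA implementations perform a singular value decomposition (SVD) to improve computational efficiency. Another advantage of SVD is that the results tend to be numerically more stable, since we can decompose the input matrix directly without the additional step of computing the covariance matrix.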
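The relationship between the two routes can be checked numerically. The sketch below, on synthetic data and assuming the usual sample covariance with an $n - 1$ denominator, shows that the squared singular values of the centered data matrix divided by $n - 1$ match the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))  # correlated toy data
Xc = X - X.mean(axis=0)                                 # center the columns
n = Xc.shape[0]

# Route 1: eigenvalues of the covariance matrix (eigvalsh returns them in
# ascending order, so reverse to get descending order).
eig_vals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Route 2: SVD of the centered data matrix itself; no covariance matrix needed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (n - 1)

print(np.allclose(eig_vals, svd_vals))
```

The rows of `Vt` are the principal directions, identical to the covariance eigenvectors up to a possible sign flip.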

from mlxtend.data import iris_data
from mlxtend.preprocessing import standardize
from mlxtend.feature_extraction import PrincipalComponentAnalysis

X, y = iris_data()
X = standardize(X)

pca = PrincipalComponentAnalysis(n_components=2,
                                 solver='svd')
pca.fit(X)
X_pca = pca.transform(X)

import matplotlib.pyplot as plt

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    for lab, col in zip((0, 1, 2),
                        ('blue', 'red', 'green')):
        plt.scatter(X_pca[y==lab, 0],
                    X_pca[y==lab, 1],
                    label=lab,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()


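If we compare this PCA projection to the previous plot in Example 1, we notice that they are mirror images of each other. Note that this is not due to an error in either implementation; depending on the eigensolver, eigenvectors can have either negative or positive signs.

For instance, if $v$ is an eigenvector of a matrix $\Sigma$, we have

$\Sigma v = \lambda v,$

where $\lambda$ is the corresponding eigenvalue.

Then $-v$ is also an eigenvector with the same eigenvalue, since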

$\Sigma(-v) = -\Sigma v = -\lambda v = \lambda(-v).$
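A small numerical check of this sign ambiguity, using an arbitrary symmetric 2x2 matrix chosen only for illustration:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # small symmetric example matrix
e_vals, e_vecs = np.linalg.eigh(Sigma)
lam, v = e_vals[-1], e_vecs[:, -1]      # largest eigenvalue and its eigenvector

# Both v and -v satisfy the eigenvalue equation with the same lambda.
ok_pos = np.allclose(Sigma @ v, lam * v)
ok_neg = np.allclose(Sigma @ (-v), lam * (-v))

# Projecting a point onto -v instead of v negates the coordinate,
# which is what produces the mirrored scatter plot.
x = np.array([1.0, 2.0])
print(ok_pos, ok_neg, x @ v, x @ (-v))
```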

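Example 4 - Factor Loadings

After calling the fit method, the factor loadings are available via the loadings_ attribute. In simple terms, the loadings are the unstandardized values of the eigenvectors. In other words, we can interpret the loadings as the covariances (or correlations, if the input features were standardized) between the input features and the principal components (or eigenvectors), which have been scaled to unit length.

Because the loadings are scaled, they become comparable in magnitude, and we can assess how much of the variance in a component is attributed to each input feature (since the components are just weighted linear combinations of the input features).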
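To make the interpretation concrete, the sketch below assumes the common definition of loadings as eigenvectors scaled by the square roots of their eigenvalues (the text above does not spell out the formula, so treat this as an assumption) and verifies on synthetic standardized data that these values equal the correlations between the input features and the component scores.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # standardized features

e_vals, e_vecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(e_vals)[::-1]                          # descending eigenvalues
e_vals, e_vecs = e_vals[order], e_vecs[:, order]

# Assumed definition: each eigenvector (column) scaled by sqrt(eigenvalue).
loadings = e_vecs * np.sqrt(e_vals)

# Cross-check: correlation between each input feature and each component score.
scores = X_std @ e_vecs
corr = np.corrcoef(X_std, scores, rowvar=False)[:3, 3:]
print(np.allclose(loadings, corr))
```

Under this definition, the squared loadings in each column sum to that component's eigenvalue, which is why they can be read as shares of the component's variance.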

from mlxtend.data import iris_data
from mlxtend.preprocessing import standardize
from mlxtend.feature_extraction import PrincipalComponentAnalysis
import matplotlib.pyplot as plt

X, y = iris_data()
X = standardize(X)

pca = PrincipalComponentAnalysis(n_components=2,
                                 solver='eigen')
pca.fit(X);

xlabels = ['sepal length', 'sepal width', 'petal length', 'petal width']

fig, ax = plt.subplots(1, 2, figsize=(8, 3))

ax[0].bar(range(4), pca.loadings_[:, 0], align='center')
ax[1].bar(range(4), pca.loadings_[:, 1], align='center')

ax[0].set_ylabel('Factor loading onto PC1')
ax[1].set_ylabel('Factor loading onto PC2')

ax[0].set_xticks(range(4))
ax[1].set_xticks(range(4))
ax[0].set_xticklabels(xlabels, rotation=45)
ax[1].set_xticklabels(xlabels, rotation=45)
plt.ylim([-1, 1])
plt.tight_layout()


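For instance, we may say that most of the variance in the first component is attributed to the petal features (although the loading of sepal length on PC1 is not much smaller in magnitude). In contrast, the remaining variance captured by PC2 is mostly attributed to the sepal width. Note that we know from Example 2 that PC1 explains most of the variance; based on the loading plots, we may therefore say that the petal features combined with sepal length explain most of the spread in the data.

Example 5 - Feature Extraction Pipeline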

from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from mlxtend.data import wine_data
from mlxtend.feature_extraction import PrincipalComponentAnalysis

X, y = wine_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.3, stratify=y)

pipe_pca = make_pipeline(StandardScaler(),
                         PrincipalComponentAnalysis(n_components=3),
                         KNeighborsClassifier(n_neighbors=5))

pipe_pca.fit(X_train, y_train)


print('Transf. training accuracy: %.2f%%' % (pipe_pca.score(X_train, y_train)*100))
print('Transf. test accuracy: %.2f%%' % (pipe_pca.score(X_test, y_test)*100))

Transf. training accuracy: 96.77%
Transf. test accuracy: 96.30%

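Example 6 - Whitening

Certain algorithms require the data to be whitened. This means that the features have unit variance and the off-diagonal entries of their covariance matrix are all zero (i.e., the features are uncorrelated). Since PCA already ensures that the features are uncorrelated, we only need to apply a simple scaling to whiten the transformed data.

For instance, for a given transformed feature $X'_i$, we divide it by the square root of the corresponding eigenvalue $\lambda_i$:

$X'_{i,\text{whitened}} = \frac{X'_i}{\sqrt{\lambda_i}}.$

Whitening via PrincipalComponentAnalysis can be enabled by setting whitening=True during initialization. Let us demonstrate this with an example.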
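Before turning to the class itself, the scaling formula can be sketched by hand in NumPy. This is an illustration on synthetic data under the assumption of a covariance-based eigendecomposition; it is not how mlxtend is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
Xc = X - X.mean(axis=0)                                   # center the columns

e_vals, e_vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(e_vals)[::-1][:2]                      # keep two components
lam, W = e_vals[order], e_vecs[:, order]

X_pca = Xc @ W                     # uncorrelated, with variances equal to lam
X_white = X_pca / np.sqrt(lam)     # divide each column by sqrt(lambda_i)

# The covariance matrix of the whitened data should be the identity matrix
# up to floating-point error.
print(np.round(np.cov(X_white, rowvar=False), 6))
```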

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
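from mlxtend.data import wine_data
from mlxtend.feature_extraction import PrincipalComponentAnalysis
import matplotlib.pyplot as plt
import numpy as np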

X, y = wine_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.3, stratify=y)

Regular PCA

sc = StandardScaler()

pca1 = PrincipalComponentAnalysis(n_components=2)

X_train_scaled = sc.fit_transform(X_train)
X_train_transf = pca1.fit(X_train_scaled).transform(X_train_scaled)


with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    for lab, col in zip((0, 1, 2),
                        ('blue', 'red', 'green')):
        plt.scatter(X_train_transf[y_train==lab, 0],
                    X_train_transf[y_train==lab, 1],
                    label=lab,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()


np.set_printoptions(precision=1, suppress=True)

print('Covariance matrix:\n')
np.cov(X_train_transf.T)

Covariance matrix:

array([[4.9, 0. ],
       [0. , 2.5]])

As we can see, the features are uncorrelated after the transformation, but they do not have unit variance.

PCA with Whitening

sc = StandardScaler()

pca1 = PrincipalComponentAnalysis(n_components=2, whitening=True)

X_train_scaled = sc.fit_transform(X_train)
X_train_transf = pca1.fit(X_train_scaled).transform(X_train_scaled)


with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    for lab, col in zip((0, 1, 2),
                        ('blue', 'red', 'green')):
        plt.scatter(X_train_transf[y_train==lab, 0],
                    X_train_transf[y_train==lab, 1],
                    label=lab,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()


np.set_printoptions(precision=1, suppress=True)

print('Covariance matrix:\n')
np.cov(X_train_transf.T)

Covariance matrix:

array([[1., 0.],
       [0., 1.]])

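As we can see above, whitening gives all features unit variance; that is, the covariance matrix of the transformed features becomes the identity matrix.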

API

PrincipalComponentAnalysis(n_components=None, solver='svd', whitening=False)

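Principal Component Analysis Class

Parameters

n_components : int (default: None)

The number of principal components to keep for the transformation. Keeps the original dimensions of the dataset if None.

solver : str (default: 'svd')

Method for performing the matrix decomposition. {'eigen', 'svd'}

whitening : bool (default: False)

Performs whitening such that the covariance matrix of the transformed data is the identity matrix.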

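Attributes

w_ : array-like, shape=[n_features, n_components]

Projection matrix.

e_vals_ : array-like, shape=[n_features]

Eigenvalues in sorted order.

e_vecs_ : array-like, shape=[n_features]

Eigenvectors in sorted order.

e_vals_normalized_ : array-like, shape=[n_features]

Normalized eigenvalues such that they sum to 1. This is equal to what is often referred to as the "explained variance ratio."

loadings_ : array-like, shape=[n_features, n_features]

The factor loadings of the original variables onto the principal components. The columns are the principal components, and the rows are the feature loadings. For instance, the first column contains the loadings onto the first principal component. Note that the signs may be flipped depending on whether the 'eigen' or 'svd' solver is used; this does not affect the interpretation of the loadings.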

Examples

For usage examples, please see https://mlxtend.cn/mlxtend/user_guide/feature_extraction/PrincipalComponentAnalysis/

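Methods

fit(X, y=None)

Learn the model from training data.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

self : object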

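get_params(deep=True)

Get the parameters of this estimator.

Parameters

deep : boolean, optional

If True, returns the parameters of this estimator and of any contained subobjects that are estimators.

Returns

params : mapping of string to any

Parameter names mapped to their values.

Adapted from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/base.py Author: Gael Varoquaux gael.varoquaux@normalesup.org License: BSD 3 clause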

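set_params(**params)

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Returns

self

Adapted from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/base.py Author: Gael Varoquaux gael.varoquaux@normalesup.org License: BSD 3 clause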

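transform(X)

Apply the linear transformation to X.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

X_projected : np.ndarray, shape = [n_samples, n_components]

Projected training vectors.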
