从头实现主成分分析算法（python）

文摘 2024-06-10 22:15 辽宁

点击上方“进修编程”，选择“星标”公众号

超级无敌干货，第一时间送达！！！

金属质感分割线

今天从头讲一下主成分分析算法。适合大家系统学习。

许多成熟的 Python 软件包（如 scikit-learn）都实现了机器学习算法，例如主成分分析 (PCA) 算法。那么，为什么要费心去了解算法的底层工作原理呢？

深入理解底层数学概念对于根据算法的输出做出更好的决策并避免将算法视为“黑匣子”至关重要。

本文展示了 PCA 算法内部工作原理的直观感受，涵盖了降维、特征向量和特征值等关键概念，然后实现了一个 Python 类来封装这些概念并对数据集执行 PCA 分析。

本文不仅适合机器学习初学者还适合对创建自定义机器学习应用程序感兴趣并需要了解算法内部工作原理的从业者。

1. 降维
2. 主成分分析如何工作？
3. 在 Python 中实现
4. 评估和解释
5. 结论和后续步骤

1.降维

机器学习中的许多实际问题涉及具有数千甚至数百万个特征的数据集。训练这样的数据集需要大量的计算，而解释结果则更具挑战性。

随着特征数量的增加，数据点变得更加稀疏，距离度量变得不那么有用，因为点之间的距离不太明显，很难区分近点和远点。这就是所谓的维数灾难。

数据越稀疏，模型就越难训练，也越容易出现过度拟合，捕捉噪音而不是底层模式。这会导致模型对新的、未见过的数据的泛化能力较差。

降维用于数据科学和机器学习，以减少数据集中的变量或特征数量，同时保留尽可能多的原始信息。此技术可用于简化复杂数据集、提高计算效率并帮助实现数据可视化。

2.主成分分析如何工作？

缓解维数灾难的最常用技术之一是主成分分析 (PCA)。PCA 通过找到解释数据集中最大方差的轴，减少数据集中的特征数量，同时保留大部分有用信息。这些轴称为主成分。

由于 PCA 的目的是找到数据集的低维表示，同时保留大部分方差而不是执行预测，因此它被认为是一种无监督学习算法。

但是为什么保留方差就意味着保留重要信息呢？

假设你正在分析一个城市犯罪的数据集。这些数据有许多特征，包括“人身犯罪 - 有伤亡”和“人身犯罪 - 无伤亡”。当然，第一个示例发生率高的地方第二个示例发生率也一定高。

换句话说，该示例的两个特征高度相关，因此可以通过减少数据中的冗余（受害者是否受伤）来减少该数据集的维度。

PCA 算法只不过是一种复杂的实现方法。

现在，让我们按照以下步骤分解 PCA 算法的工作原理：

步骤 1：将数据居中

PCA 受数据规模的影响，因此首先要做的是减去数据集每个特征的平均值，从而确保所有特征的平均值都等于 0。

居中前后的数据。

第 2 步：计算协方差矩阵

现在，我们必须计算协方差矩阵来捕捉数据中每对特征如何一起变化。如果数据集有n 个特征，则生成的协方差矩阵将具有n x n形状。

下图中，相关性越高的特征颜色越接近红色。当然，每个特征都与自身高度相关。

协方差矩阵的热图

步骤 3：特征值分解

接下来，我们必须对协方差矩阵进行特征值分解。如果您不记得了，给定协方差矩阵 Σ（方阵），特征值分解就是找到一组标量（特征值）和向量（特征向量）的过程，使得：

特征值属性

Σ是n×n协方差矩阵。
v是一个非零向量，称为特征向量。
λ是与特征向量v相关的标量，称为特征值。

特征向量表示数据中最大方差的方向（主成分），而特征值量化每个主成分捕获的方差。

如果矩阵A可以分解为特征值和特征向量，则可以表示为：

矩阵的特征分解

Q是一个矩阵，其列是A的特征向量。
Λ 是一个对角矩阵，其对角线元素是A的特征值。

这样，我们可以使用相同的步骤来找到协方差矩阵的特征值和特征向量。

绘制特征向量

在上图中，我们可以看到第一个特征向量指向数据方差最大的方向，第二个特征向量指向数据方差第二大的方向。

步骤 4：选择主成分

如前所述，特征值量化了数据在其对应特征向量方向上的方差。因此，我们按降序对特征值进行排序，并仅保留前 n 个所需的主成分。

下图说明了二维 PCA 中每个主成分捕获的方差比例。

解释 2 个主成分的方差

步骤 5：投影数据

最后，我们必须将原始数据投影到所选主成分所代表的维度上。为此，我们必须将数据集（在中心化后）乘以协方差矩阵分解中找到的特征向量矩阵。

将原始数据集投影到 n 维

3. Python 实现

现在我们已经深刻理解了主成分分析的关键概念，是时候创建一些代码了。

首先，我们必须设置环境，导入 numpy 包进行数学计算，导入 matplotlib 进行可视化：

import numpy as npimport matplotlib.pyplot as plt

接下来，我们将把上一节中介绍的所有概念封装在一个 Python 类中，方法如下：

init 方法

构造函数方法来初始化算法的参数：所需的组件数量、用于存储组件向量的矩阵以及用于存储每个选定维度的解释方差的数组。

拟合方法

在 fit 方法中，上一节中介绍的前四个步骤已通过代码实现。此外，还计算了每个成分的解释方差。

变换方法

转换方法执行上一节中介绍的最后一步：将数据投影到选定的维度上。

绘制解释方差

最后一种方法是一个辅助函数，将每个选定主成分的解释方差绘制为条形图。

完整代码如下：

class PCA:    def __init__(self, n_components):        self.n_components = n_components        self.components = None        self.mean = None        self.explained_variance = None
    def fit(self, X):        # Step 1: Standardize the data (subtract the mean)        self.mean = np.mean(X, axis=0)        X_centered = X - self.mean
        # Step 2: Compute the covariance matrix        cov_matrix = np.cov(X_centered, rowvar=False)
        # Step 3: Compute the eigenvalues and eigenvectors        eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
        # Step 4: Sort the eigenvalues and corresponding eigenvectors        sorted_indices = np.argsort(eigenvalues)[::-1]        eigenvalues = eigenvalues[sorted_indices]        eigenvectors = eigenvectors[:, sorted_indices]
        # Step 5: Select the top n_components        self.components = eigenvectors[:, :self.n_components]
        # Calculate explained variance        total_variance = np.sum(eigenvalues)        self.explained_variance = eigenvalues[:self.n_components] / total_variance
    def transform(self, X):        # Step 6: Project the data onto the selected components        X_centered = X - self.mean        return np.dot(X_centered, self.components)
    def plot_explained_variance(self):        # Create labels for each principal component        labels = [f'PCA{i+1}' for i in range(self.n_components)]
        # Create a bar plot for explained variance        plt.figure(figsize=(8, 6))        plt.bar(range(1, self.n_components + 1), self.explained_variance, alpha=0.7, align='center', color='blue', tick_label=labels)        plt.xlabel('Principal Component')        plt.ylabel('Explained Variance Ratio')        plt.title('Explained Variance by Principal Components')        plt.show()

4. 评估与解释

现在是时候在使用 numpy 包创建的模拟数据集上使用我们刚刚实现的类了。该数据集有 10 个特征和 100 个样本。

# create simulated data for analysisnp.random.seed(42)# Generate a low-dimensional signallow_dim_data = np.random.randn(100, 4)
# Create a random projection matrix to project into higher dimensionsprojection_matrix = np.random.randn(4, 10)
# Project the low-dimensional data to higher dimensionshigh_dim_data = np.dot(low_dim_data, projection_matrix)
# Add some noise to the high-dimensional datanoise = np.random.normal(loc=0, scale=0.5, size=(100, 10))data_with_noise = high_dim_data + noise
X = data_with_noise

在执行 PCA 之前，还有一个问题：我们如何选择正确或最佳的维数？通常，我们必须寻找加起来至少占数据集解释方差 95% 的成分数量。

为此，让我们看看每个主成分对数据集的总方差有何贡献：

# Apply PCApca = PCA(n_components=10)pca.fit(X)X_transformed = pca.transform(X)
print("Explained Variance:\n", pca.explained_variance)
>> Explained Variance (%): [55.406, 25.223, 11.137, 5.298, 0.641, 0.626, 0.511, 0.441, 0.401, 0.317]

我们仅使用 numpy 包创建了一个 PCA 类，成功地将数据集的维数从 10 个特征减少到 4 个，同时保留了数据方差的约 97%。

此外，我们探索了一种方法，以获得 PCA 分析的最佳主成分数量，该方法可以根据我们面临的问题进行定制

完整代码

import numpy as npimport matplotlib.pyplot as plt
class PCA:    def __init__(self, n_components):        self.n_components = n_components        self.components = None        self.mean = None        self.explained_variance = None
    def fit(self, X):        # Step 1: Standardize the data (subtract the mean)        self.mean = np.mean(X, axis=0)        X_centered = (X - self.mean)/X.std()
        # Step 2: Compute the covariance matrix        cov_matrix = np.cov(X_centered, rowvar=False)
        # Step 3: Compute the eigenvalues and eigenvectors        eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
        # Step 4: Sort the eigenvalues and corresponding eigenvectors        sorted_indices = np.argsort(eigenvalues)[::-1]        eigenvalues = eigenvalues[sorted_indices]        eigenvectors = eigenvectors[:, sorted_indices]
        # Step 5: Select the top n_components        self.components = eigenvectors[:, :self.n_components]
        # Calculate explained variance        total_variance = np.sum(eigenvalues)        self.explained_variance = eigenvalues[:self.n_components] / total_variance
    def transform(self, X):        # Step 6: Project the data onto the selected components        X_centered = X - self.mean        return np.dot(X_centered, self.components)
    def plot_explained_variance(self):        # Create labels for each principal component        labels = [f'PCA{i+1}' for i in range(self.n_components)]
        # Create a bar plot for explained variance        plt.figure(figsize=(8, 6))        plt.bar(range(1, self.n_components + 1), self.explained_variance, alpha=0.7, align='center', color='blue', tick_label=labels)        plt.xlabel('Principal Component')        plt.ylabel('Explained Variance Ratio')        plt.title('Explained Variance by Principal Components')        plt.show()
# Example usageif __name__ == "__main__":    import matplotlib.pyplot as plt    import numpy as np
    # create simulated data for analysis    np.random.seed(42)    # Generate a low-dimensional signal    low_dim_data = np.random.randn(100, 4)
    # Create a random projection matrix to project into higher dimensions    projection_matrix = np.random.randn(4, 10)
    # Project the low-dimensional data to higher dimensions    high_dim_data = np.dot(low_dim_data, projection_matrix)
    # Add some noise to the high-dimensional data    noise = np.random.normal(loc=0, scale=0.5, size=(100, 10))    data_with_noise = high_dim_data + noise
    X = data_with_noise
    # Apply PCA    pca = PCA(n_components=10)    pca.fit(X)    X_transformed = pca.transform(X)
    # Print the results    # print("Mean:\n", pca.mean)    # print("Components (Eigenvectors):\n", pca.components)    explained_variance = [round(v,3) for v in pca.explained_variance*100]    print("Explained Variance (%):\n", explained_variance)    # print("Transformed Data (first 5 samples):\n", X_transformed[:5])
    # Plot explained variance    # pca.plot_explained_variance()
    # create a array with the cumulative variance of each principal component    cum_sum = np.cumsum(pca.explained_variance*100)    print(cum_sum)    d = np.argmax(cum_sum >= 95.0) + 1    print(d)
    # create a scree plot to observe how many dimensions add up to 95% of the total variance    plt.figure(figsize=(8, 6))    plt.plot(range(1,11), cum_sum, c='blue')    plt.scatter(d, cum_sum[d-1], c="red")    plt.legend(['cummulative variance', 'optimal point'])    plt.xlabel('# Principal Components')    plt.ylabel('Cummulative Explained Variance')    plt.title('Explained Variance by Principal Components')    plt.show()

点赞关注我吧！本人接matlab、python程序设计欢迎咨询。

— 完 —

进修编程

提升编程技能，学习编程技巧