A Complete Guide to Logistic Regression in Python


In this article, I will build a logistic regression model using the SciPy library and compare it with scikit-learn's logistic regression model. The goal is to convey the basic concepts of logistic regression. By the end of this article, you should have a deeper understanding of the sigmoid function, one of the most important building blocks of logistic regression.

Logistic regression is a machine learning model for classification problems. It takes its name from the logit transformation applied to the dependent variable. The sigmoid function is the inverse of the logit transformation: for any input x, the output of the sigmoid f(x) lies between 0 and 1. To obtain the logistic regression equation, all we need to do is plug a linear regression equation into the logistic (sigmoid) function.
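To make the logit/sigmoid relationship concrete, here is a small sketch of my own (not from the original article) showing numerically that the sigmoid undoes the logit:

import numpy as np

def sigmoid(x):
    # Maps any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-x))

def logit(p):
    # Log-odds: the inverse of the sigmoid
    return np.log(p / (1 - p))

p = np.array([0.1, 0.5, 0.9])
print(sigmoid(logit(p)))  # [0.1 0.5 0.9] -- the original probabilities are recovered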

We will use the Banana Quality dataset from Kaggle (link below):

https://www.kaggle.com/datasets/l3llff/banana

We will build a logistic regression model to determine whether a banana is "Good" or "Bad", and compare it against scikit-learn's model. First, we define the linear equation and optimize its parameters with the minimize function from the scipy.optimize library. Once the parameters are found, we plug the linear equation into the sigmoid function.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 300)

## Read dataframe
df = pd.read_csv("banana_quality.csv")

## Explore dataframe
df.head()
df.info()
df.describe().T

df["Quality"] = df["Quality"].map({"Good": 1, "Bad": 0})
In [20]: df.head()
Out[20]:
       Size    Weight  Sweetness  Softness  HarvestTime  Ripeness   Acidity Quality
0 -1.924968  0.468078   3.077832 -1.472177     0.294799  2.435570  0.271290    Good
1 -2.409751  0.486870   0.346921 -2.495099    -0.892213  2.067549  0.307325    Good
2 -0.357607  1.483176   1.568452 -2.645145    -0.647267  3.090643  1.427322    Good
3 -0.868524  1.566201   1.889605 -1.273761    -1.006278  1.873001  0.477862    Good
4  0.651825  1.319199  -0.022459 -1.209709    -1.430692  1.078345  2.812442    Good
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Size         8000 non-null   float64
 1   Weight       8000 non-null   float64
 2   Sweetness    8000 non-null   float64
 3   Softness     8000 non-null   float64
 4   HarvestTime  8000 non-null   float64
 5   Ripeness     8000 non-null   float64
 6   Acidity      8000 non-null   float64
 7   Quality      8000 non-null   object
dtypes: float64(7), object(1)
memory usage: 500.1+ KB
In [22]: df.describe().T
Out[22]:
              count      mean       std       min       25%       50%       75%       max
Size         8000.0 -0.747802  2.136023 -7.998074 -2.277651 -0.897514  0.654216  7.970800
Weight       8000.0 -0.761019  2.015934 -8.283002 -2.223574 -0.868659  0.775491  5.679692
Sweetness    8000.0 -0.770224  1.948455 -6.434022 -2.107329 -1.020673  0.311048  7.539374
Softness     8000.0 -0.014441  2.065216 -6.959320 -1.590458  0.202644  1.547120  8.241555
HarvestTime  8000.0 -0.751288  1.996661 -7.570008 -2.120659 -0.934192  0.507326  6.293280
Ripeness     8000.0  0.781098  2.114289 -7.423155 -0.574226  0.964952  2.261650  7.249034
Acidity      8000.0  0.008725  2.293467 -8.226977 -1.629450  0.098735  1.682063  7.411633

Using map(), we encode the "Good" and "Bad" values of the "Quality" variable as 1 and 0. Now we can build the model. The complete function I wrote is shown below; I will then explain step by step how it works.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def my_logistic_regression(dataframe, column1, column2, target):
    x1 = dataframe[column1].astype(float)  # Ensure data types are float
    x2 = dataframe[column2].astype(float)
    y = dataframe[target].astype(float)

    def objective(params):
        intercept, coefficient1, coefficient2 = params
        z = intercept + coefficient1 * x1 + coefficient2 * x2
        y_pred_proba = sigmoid(z)
        # Cross-entropy loss (transformation of exp by using log)
        loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
        return loss

    # Initialize parameters and minimize the objective function
    initial_guess = [0, 0, 0]
    result = minimize(objective, initial_guess)

    # Extract the optimized parameters
    intercept, coefficient1, coefficient2 = result.x

    # Calculate predictions using the optimized parameters
    z = intercept + coefficient1 * x1 + coefficient2 * x2
    y_pred_proba = sigmoid(z)

    # Threshold probabilities to get binary predictions (0 or 1)
    y_pred = (y_pred_proba >= 0.5).astype(int)

    # Add predictions to the dataframe
    dataframe["Logistic_Predictions"] = y_pred

    # Calculate accuracy
    accuracy = accuracy_score(y, y_pred)

    # Print results
    print("********************* My Logistic Regression Model ****************")
    print(f"Intercept: {intercept}")
    print(f"Coefficients: {coefficient1, coefficient2}")
    print(f"Accuracy: {accuracy}")

First, we define sigmoid(x) as a function, so that after building the linear equation we can apply the sigmoid directly.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
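As a quick sanity check (my addition, not in the original article), sigmoid(0) is exactly 0.5, and large positive or negative inputs saturate toward 1 and 0:

print(sigmoid(0))                    # 0.5 -- the midpoint of the curve
print(sigmoid(np.array([-10, 10])))  # roughly [4.5e-05, 0.99995]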

Inside the my_logistic_regression function, two independent variables from the dataset are used, named x1 and x2; y is the target variable. Using these variables, the intercept and coefficient values will be optimized.

def my_logistic_regression(dataframe, column1, column2, target):
    x1 = dataframe[column1].astype(float)  # Ensure data types are float
    x2 = dataframe[column2].astype(float)
    y = dataframe[target].astype(float)

We create an inner function, objective(params), where params are the intercept, coefficient1, and coefficient2 that define the linear equation z. As mentioned earlier, I feed the linear equation z into the sigmoid function (sigmoid(z)), which gives the predicted probabilities y_pred_proba. I then define a loss to optimize these parameters: the cross-entropy loss, which uses the logarithm (the inverse of the exponential) to measure the error between y and y_pred_proba.

    def objective(params):
        intercept, coefficient1, coefficient2 = params
        z = intercept + coefficient1 * x1 + coefficient2 * x2
        y_pred_proba = sigmoid(z)
        # Cross-entropy loss (transformation of exp by using log)
        loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
        return loss
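One caveat worth noting (my note, not part of the original code): if the optimizer ever pushes y_pred_proba to exactly 0 or 1, np.log returns -inf and the loss becomes undefined. A common safeguard is to clip the probabilities before taking the log; a drop-in variant of objective might look like this:

    def objective(params):
        intercept, coefficient1, coefficient2 = params
        z = intercept + coefficient1 * x1 + coefficient2 * x2
        # Clip to [eps, 1 - eps] to avoid log(0); eps is an arbitrary small constant
        eps = 1e-15
        y_pred_proba = np.clip(sigmoid(z), eps, 1 - eps)
        loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
        return loss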

This step uses the scipy.optimize.minimize function to find the values of the intercept, coefficient1, and coefficient2 that minimize the objective function. It starts from the initial guess [0, 0, 0] for params and stores the optimized values in the result variable.

    # Initialize parameters and minimize the objective function
    initial_guess = [0, 0, 0]
    result = minimize(objective, initial_guess)
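Beyond result.x, the OptimizeResult object returned by minimize carries useful diagnostics; a minimal sketch of my own for inspecting them:

    # result.success: did the optimizer converge?
    # result.fun: the final (minimized) loss value
    # result.message: a human-readable termination reason
    if not result.success:
        print(f"Optimization did not converge: {result.message}")
    print(f"Final cross-entropy loss: {result.fun}")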

We extract the optimized intercept and coefficients from result, compute z using the optimized coefficients and the input features x1 and x2, and plug it into the sigmoid function to obtain y_pred_proba.

    # Extract the optimized parameters
    intercept, coefficient1, coefficient2 = result.x

    # Calculate predictions using the optimized parameters
    z = intercept + coefficient1 * x1 + coefficient2 * x2
    y_pred_proba = sigmoid(z)

This step is essential for converting continuous probabilities into the discrete binary predictions a classification task requires. The sigmoid outputs values between 0 and 1, so y_pred_proba is the estimated probability that y equals 1. In the code below, we set the threshold to 0.5: if y_pred_proba is greater than or equal to 0.5, y_pred is 1; otherwise it is 0.

    # Threshold probabilities to get binary predictions (0 or 1)
    y_pred = (y_pred_proba >= 0.5).astype(int)

We add a column of predictions to the dataframe. Using the accuracy_score function, we compute the fraction of predictions that match the actual values.

    # Add predictions to the dataframe
    dataframe["Logistic_Predictions"] = y_pred

    # Calculate accuracy
    accuracy = accuracy_score(y, y_pred)

    # Print results
    print("********************* My Logistic Regression Model ****************")
    print(f"Intercept: {intercept}")
    print(f"Coefficients: {coefficient1, coefficient2}")
    print(f"Accuracy: {accuracy}")

Now it is time to test the my_logistic_regression model.

In [24]: my_logistic_regression(df, "Size", "Weight", "Quality")
********************* Logistic Regression Model ****************
Intercept: 1.0061546749829708
Coefficients: (0.6225788453474588, 0.6907510395097968)
Accuracy: 0.76675

And test the scikit-learn logistic regression model as well:

In [29]: logistic_regression_sklearn(df, "Size", "Weight", "Quality")
********************* Logistic Regression Model ****************
Coefficients: [[0.62229463 0.69044133]]
Intercept: [1.00569715]
Accuracy: 0.76675

While the accuracy scores are identical, the intercepts and coefficients differ slightly. Of course, our goal is not to compete with the scikit-learn library. In summary, logistic regression is a linear model turned into a classifier by the sigmoid function. Don't be intimidated by the formulas. I hope you enjoyed this; the dataset used is from Kaggle.
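A likely explanation for the small parameter differences (my note, not from the original article): scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0), while our SciPy version minimizes the plain, unregularized cross-entropy. Turning the penalty off should bring the two fits closer together:

# penalty=None requires scikit-learn >= 1.2; older versions accept penalty='none'
model = LogisticRegression(penalty=None)
model.fit(df[["Size", "Weight"]], df["Quality"])
print(model.intercept_, model.coef_)  # should closely match the scipy estimates

The complete code used in this article is listed below.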

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 300)

## Read dataframe
df = pd.read_csv("banana_quality.csv")

## Explore dataframe
df.head()
df.info()
df.describe().T

df["Quality"] = df["Quality"].map({"Good": 1, "Bad": 0})

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def my_logistic_regression(dataframe, column1, column2, target):
    x1 = dataframe[column1].astype(float)  # Ensure data types are float
    x2 = dataframe[column2].astype(float)
    y = dataframe[target].astype(float)

    def objective(params):
        intercept, coefficient1, coefficient2 = params
        z = intercept + coefficient1 * x1 + coefficient2 * x2
        y_pred_proba = sigmoid(z)
        # Transformation of exp by using log
        loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
        return loss

    # Initialize parameters and minimize the objective function
    initial_guess = [0, 0, 0]
    result = minimize(objective, initial_guess)

    # Extract the optimized parameters
    intercept, coefficient1, coefficient2 = result.x

    # Calculate predictions using the optimized parameters
    z = intercept + coefficient1 * x1 + coefficient2 * x2
    y_pred_proba = sigmoid(z)

    # Threshold probabilities to get binary predictions (0 or 1)
    y_pred = (y_pred_proba >= 0.5).astype(int)

    # Add predictions to the dataframe
    dataframe["Logistic_Predictions"] = y_pred

    # Calculate accuracy
    accuracy = accuracy_score(y, y_pred)

    # Print results
    print("********************* Logistic Regression Model ****************")
    print(f"Intercept: {intercept}")
    print(f"Coefficients: {coefficient1, coefficient2}")
    print(f"Accuracy: {accuracy}")

my_logistic_regression(df, "Size", "Weight", "Quality")

def logistic_regression_sklearn(dataframe, column1, column2, target):
    X = dataframe[[column1, column2]]
    y = dataframe[target]

    # Initialize and fit the logistic regression model
    model = LogisticRegression()
    model.fit(X, y)

    y_pred = model.predict(X)
    dataframe["Logistic_Predictions"] = y_pred
    accuracy = accuracy_score(y, y_pred)

    # Print results
    print("********************* Logistic Regression Model ****************")
    print(f"Coefficients: {model.coef_}")
    print(f"Intercept: {model.intercept_}")
    print(f"Accuracy: {accuracy}")

logistic_regression_sklearn(df, "Size", "Weight", "Quality")
my_logistic_regression(df, "Size", "Weight", "Quality")
### Sigmoid visualization
def sigmoid_visual():
    # Generate x values
    x = np.linspace(-10, 10, 100)

    # Calculate corresponding y values using the sigmoid function
    y = sigmoid(x)

    # Plot the sigmoid function (raw string avoids invalid escape sequences in the label)
    plt.figure(figsize=(8, 6))
    plt.plot(x, y, label=r'$\sigma(x) = \frac{1}{1 + e^{-x}}$', color='blue')

    # Add a horizontal line at y = 0.5 (the decision threshold)
    plt.axhline(y=0.5, color='red', linestyle='--', label='Threshold')

    plt.title('Sigmoid Function')
    plt.xlabel('x')
    plt.ylabel('Sigmoid(x)')
    plt.grid(True)
    plt.legend()
    plt.show(block=True)

sigmoid_visual()
