In this article, I will build a logistic regression model using the SciPy library and compare it with Sklearn's logistic regression model. The goal is to get comfortable with the basic concepts of logistic regression. By the end of the article, I hope you will have a deeper understanding of the sigmoid function, one of the most important pieces of logistic regression.
Logistic regression is a machine learning model used for classification problems. It takes its name from the logistic transformation applied to the dependent variable. The sigmoid function maps any real-valued input to an output between 0 and 1, and the logit transform is its inverse. To obtain the logistic regression equation, all we need to do is plug the linear regression equation into the logistic (sigmoid) function.
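Concretely, the sigmoid and the logit are inverses of each other. Here is a minimal sketch (my own illustration, using NumPy) that shows the round trip:

import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

def logit(p):
    # Inverse of the sigmoid: maps a probability back to the real line
    return np.log(p / (1 - p))

z = np.array([-2.0, 0.0, 3.0])
p = sigmoid(z)
print(p)         # [0.1192... 0.5 0.9525...]
print(logit(p))  # recovers [-2.  0.  3.]

Whatever the linear equation produces, the sigmoid squeezes it into a valid probability.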
Using the banana dataset below,
https://www.kaggle.com/datasets/l3llff/banana
we will build a logistic regression model to determine whether a banana is "Good" or "Bad", and compare this model with Sklearn's. First, we will define the linear regression equation and optimize our model with the minimize function from the scipy.optimize library. Once the model's parameters are determined, we will feed the linear output into the sigmoid function.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 300)
## Read Dataframe
df = pd.read_csv("banana_quality.csv")
## Explore Dataframe
df.head()
df.info()
df.describe().T
df["Quality"] = df["Quality"].map({"Good":1,"Bad":0})
In [20]: df.head()
Out[20]:
Size Weight Sweetness Softness HarvestTime Ripeness Acidity Quality
0 -1.924968 0.468078 3.077832 -1.472177 0.294799 2.435570 0.271290 Good
1 -2.409751 0.486870 0.346921 -2.495099 -0.892213 2.067549 0.307325 Good
2 -0.357607 1.483176 1.568452 -2.645145 -0.647267 3.090643 1.427322 Good
3 -0.868524 1.566201 1.889605 -1.273761 -1.006278 1.873001 0.477862 Good
4 0.651825 1.319199 -0.022459 -1.209709 -1.430692 1.078345 2.812442 Good
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Size         8000 non-null   float64
 1   Weight       8000 non-null   float64
 2   Sweetness    8000 non-null   float64
 3   Softness     8000 non-null   float64
 4   HarvestTime  8000 non-null   float64
 5   Ripeness     8000 non-null   float64
 6   Acidity      8000 non-null   float64
 7   Quality      8000 non-null   object
dtypes: float64(7), object(1)
memory usage: 500.1+ KB
In [22]: df.describe().T
Out[22]:
count mean std min 25% 50% 75% max
Size 8000.0 -0.747802 2.136023 -7.998074 -2.277651 -0.897514 0.654216 7.970800
Weight 8000.0 -0.761019 2.015934 -8.283002 -2.223574 -0.868659 0.775491 5.679692
Sweetness 8000.0 -0.770224 1.948455 -6.434022 -2.107329 -1.020673 0.311048 7.539374
Softness 8000.0 -0.014441 2.065216 -6.959320 -1.590458 0.202644 1.547120 8.241555
HarvestTime 8000.0 -0.751288 1.996661 -7.570008 -2.120659 -0.934192 0.507326 6.293280
Ripeness 8000.0 0.781098 2.114289 -7.423155 -0.574226 0.964952 2.261650 7.249034
Acidity 8000.0 0.008725 2.293467 -8.226977 -1.629450 0.098735 1.682063 7.411633
Using map(), the "Good" and "Bad" values in the "Quality" column are converted to 1 and 0, so that we can build the model.
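Before building the model, a quick sanity check (my own addition, not part of the original script) confirms that the mapping produced only 1s and 0s:

# After mapping, Quality should contain only 1 (Good) and 0 (Bad);
# Kaggle describes this dataset as roughly balanced between the two classes
print(df["Quality"].value_counts())

The complete function I built is shown below. I will explain, step by step, how it works.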
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def my_logistic_regression(dataframe, column1, column2, target):
    x1 = dataframe[column1].astype(float)  # Ensure data types are float
    x2 = dataframe[column2].astype(float)
    y = dataframe[target].astype(float)

    def objective(params):
        intercept, coefficient1, coefficient2 = params
        z = intercept + coefficient1 * x1 + coefficient2 * x2
        y_pred_proba = sigmoid(z)
        # Cross-entropy loss (transformation of exp by using log)
        loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
        return loss

    # Initialize parameters and minimize the objective function
    initial_guess = [0, 0, 0]
    result = minimize(objective, initial_guess)

    # Extract the optimized parameters
    intercept, coefficient1, coefficient2 = result.x

    # Calculate predictions using the optimized parameters
    z = intercept + coefficient1 * x1 + coefficient2 * x2
    y_pred_proba = sigmoid(z)

    # Threshold probabilities to get binary predictions (0 or 1)
    y_pred = (y_pred_proba >= 0.5).astype(int)

    # Add predictions to the dataframe
    dataframe["Logistic_Predictions"] = y_pred

    # Calculate accuracy
    accuracy = accuracy_score(y, y_pred)

    # Print results
    print("********************* My Logistic Regression Model ****************")
    print(f"Intercept: {intercept}")
    print(f"Coefficients: {coefficient1, coefficient2}")
    print(f"Accuracy: {accuracy}")
First, we define sigmoid(x) as a function of its own, so that once the linear equation is built we can apply the sigmoid directly.
Next, we create an inner function, objective(params), whose params are the intercept, coefficient1, and coefficient2; from these we build the linear equation z. As described earlier, z is passed through the sigmoid function (sigmoid(z)) to obtain y_pred_proba. To optimize these parameters we need a loss: I measure the error between y and y_pred_proba with the cross-entropy loss, which uses the logarithm (undoing the exponential inside the sigmoid).
def objective(params):
    intercept, coefficient1, coefficient2 = params
    z = intercept + coefficient1 * x1 + coefficient2 * x2
    y_pred_proba = sigmoid(z)
    # Cross-entropy loss (transformation of exp by using log)
    loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
    return loss
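To get a feel for what this loss rewards and penalizes, here is a tiny worked example with made-up values (chosen purely for illustration):

import numpy as np

y = np.array([1.0, 0.0, 1.0])             # true labels
y_pred_proba = np.array([0.9, 0.2, 0.6])  # hypothetical predicted probabilities

# Confident correct predictions cost little; confident wrong ones cost a lot
loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
print(loss)  # approximately 0.28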
The next lines use the scipy.optimize.minimize function to find the values of the intercept, coefficient1, and coefficient2 that minimize the objective function. It starts from the initial guess [0, 0, 0] for params and stores the optimized values in the result variable.
# Initialize parameters and minimize the objective function
initial_guess = [0, 0, 0]
result = minimize(objective, initial_guess)
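If scipy.optimize.minimize is new to you, here is a minimal standalone sketch: it finds the x that minimizes (x - 3)^2, which is exactly analogous to finding the parameters that minimize our loss:

from scipy.optimize import minimize

result = minimize(lambda params: (params[0] - 3) ** 2, x0=[0.0])
print(result.x)  # approximately [3.]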
We extract the optimized intercept and coefficients from result. Using the optimized parameters and the input features x1 and x2, we compute z, then plug it into the sigmoid function to obtain y_pred_proba.
# Extract the optimized parameters
intercept, coefficient1, coefficient2 = result.x

# Calculate predictions using the optimized parameters
z = intercept + coefficient1 * x1 + coefficient2 * x2
y_pred_proba = sigmoid(z)
The next step is essential for turning continuous probabilities into the discrete binary predictions of a classification task. The sigmoid's output ranges from 0 to 1, so y_pred_proba is the estimated probability that y equals 1. In the code below, we set the threshold to 0.5: if y_pred_proba is greater than or equal to 0.5, we predict y_pred = 1; otherwise 0.
# Threshold probabilities to get binary predictions (0 or 1)
y_pred = (y_pred_proba >= 0.5).astype(int)
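As a quick illustration with made-up probabilities:

import numpy as np

y_pred_proba = np.array([0.1, 0.49, 0.5, 0.93])  # hypothetical probabilities
y_pred = (y_pred_proba >= 0.5).astype(int)
print(y_pred)  # [0 0 1 1] -- note that exactly 0.5 is classified as 1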
We add a column to the dataframe for the predictions. With the accuracy_score function, we compute the fraction of predictions that match the actual values.
# Add predictions to the dataframe
dataframe["Logistic_Predictions"] = y_pred
# Calculate accuracy
accuracy = accuracy_score(y, y_pred)
# Print results
print("********************* My Logistic Regression Model ****************")
print(f"Intercept: {intercept}")
print(f"Coefficients: {coefficient1, coefficient2}")
print(f"Accuracy: {accuracy}")
Now it is time to test the my_logistic_regression model.
In [24]: my_logistic_regression(df, "Size", "Weight", "Quality")
********************* My Logistic Regression Model ****************
Intercept: 1.0061546749829708
Coefficients: (0.6225788453474588, 0.6907510395097968)
Accuracy: 0.76675
Let's also test the sklearn logistic regression model:
In [29]: logistic_regression_sklearn(df, "Size", "Weight", "Quality")
********************* Sklearn Logistic Regression Model ****************
Coefficients: [[0.62229463 0.69044133]]
Intercept: [1.00569715]
Accuracy: 0.76675
Although the accuracy scores are identical, the intercept and coefficients differ slightly. Of course, the goal here is not to challenge the Sklearn library. In short, logistic regression is linear regression turned into a classification model by the sigmoid function, so don't get bogged down in the formulas. I hope you enjoyed it. A Kaggle dataset was used.
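Why don't the coefficients match exactly? By default, Sklearn's LogisticRegression applies L2 regularization (C=1.0), while our hand-rolled loss is unregularized. As a hedged sketch (the exact argument name depends on your sklearn version), turning the penalty off should bring the two models into near-exact agreement:

from sklearn.linear_model import LogisticRegression

# penalty=None requires sklearn >= 1.2; older versions use penalty='none'
model = LogisticRegression(penalty=None)
model.fit(df[["Size", "Weight"]], df["Quality"])
print(model.coef_, model.intercept_)  # should closely match the scipy coefficients

With that, here is the complete code from this article: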
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 300)
## Read Dataframe
df = pd.read_csv("banana_quality.csv")
## Explore Dataframe
df.head()
df.info()
df.describe().T
df["Quality"] = df["Quality"].map({"Good":1,"Bad":0})
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def my_logistic_regression(dataframe, column1, column2, target):
    x1 = dataframe[column1].astype(float)  # Ensure data types are float
    x2 = dataframe[column2].astype(float)
    y = dataframe[target].astype(float)

    def objective(params):
        intercept, coefficient1, coefficient2 = params
        z = intercept + coefficient1 * x1 + coefficient2 * x2
        y_pred_proba = sigmoid(z)
        # Cross-entropy loss (transformation of exp by using log)
        loss = -np.mean(y * np.log(y_pred_proba) + (1 - y) * np.log(1 - y_pred_proba))
        return loss

    # Initialize parameters and minimize the objective function
    initial_guess = [0, 0, 0]
    result = minimize(objective, initial_guess)

    # Extract the optimized parameters
    intercept, coefficient1, coefficient2 = result.x

    # Calculate predictions using the optimized parameters
    z = intercept + coefficient1 * x1 + coefficient2 * x2
    y_pred_proba = sigmoid(z)

    # Threshold probabilities to get binary predictions (0 or 1)
    y_pred = (y_pred_proba >= 0.5).astype(int)

    # Add predictions to the dataframe
    dataframe["Logistic_Predictions"] = y_pred

    # Calculate accuracy
    accuracy = accuracy_score(y, y_pred)

    # Print results
    print("********************* My Logistic Regression Model ****************")
    print(f"Intercept: {intercept}")
    print(f"Coefficients: {coefficient1, coefficient2}")
    print(f"Accuracy: {accuracy}")
my_logistic_regression(df, "Size", "Weight", "Quality")
def logistic_regression_sklearn(dataframe, column1, column2, target):
    X = dataframe[[column1, column2]]
    y = dataframe[target]

    # Initialize and fit the logistic regression model
    model = LogisticRegression()
    model.fit(X, y)

    y_pred = model.predict(X)
    dataframe["Logistic_Predictions"] = y_pred
    accuracy = accuracy_score(y, y_pred)

    # Print results
    print("********************* Sklearn Logistic Regression Model ****************")
    print(f"Coefficients: {model.coef_}")
    print(f"Intercept: {model.intercept_}")
    print(f"Accuracy: {accuracy}")
logistic_regression_sklearn(df, "Size", "Weight", "Quality")
### Sigmoid visualization
def sigmoid_visual():
    # Generate x values
    x = np.linspace(-10, 10, 100)
    # Calculate corresponding y values using the sigmoid function
    y = sigmoid(x)

    # Plot the sigmoid function
    plt.figure(figsize=(8, 6))
    plt.plot(x, y, label=r'$\sigma(x) = \frac{1}{1 + e^{-x}}$', color='blue')
    # Add a horizontal line at the 0.5 decision threshold
    plt.axhline(y=0.5, color='red', linestyle='--', label='Threshold')
    plt.title('Sigmoid Function')
    plt.xlabel('x')
    plt.ylabel('Sigmoid(x)')
    plt.grid(True)
    plt.legend()
    plt.show(block=True)

sigmoid_visual()
— END —