Machine Learning | Analyzing and Predicting User Conversion with KNN and Random Forest Models

科技科技 2024-10-29 08:11 Tianjin

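Hi everyone, I'm 欧K~

In this post we use the K-nearest neighbors (KNN) algorithm and a random forest model to analyze and predict user conversion. We compare the two models to see which one is better suited to this kind of problem, and we look at which features influence conversion the most. I hope it helps; if you have questions or spot something that could be improved, feel free to contact the author.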

Libraries used:
Pandas: data processing
Matplotlib/Seaborn: data visualization
Sklearn: machine learning

1. Import modules

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

2. Data processing with Pandas

2.1 Read the data

df = pd.read_csv('./用户转化预测数据集.csv')

2.2 Inspect the data

df.info()

2.3 Field descriptions

2.4 Drop duplicate rows

df = df.drop_duplicates()

2.5 Drop missing values

df = df.dropna()

3. Data analysis: feature analysis

3.1 Age and conversion rate

def get_Age_analyze1():
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    axis1.set_xlabel('年龄(岁)', color=color1, fontsize=size)
    axis1.set_ylabel('人数', color=color1, fontsize=size)
    axis1.tick_params('both', colors=color1, labelsize=size)
    sns.histplot(df['Age'], kde=True, bins=10)
    axis1.grid(True, which='both', linestyle='--', linewidth=0.5) 
    plt.title('年龄分布',size=16)

def get_Age_analyze2():
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    axis2 = axis1.twinx()

    # Bar chart: headcount per group
    axis1.bar(labels, y_data1, label='人数',color=color1, alpha=0.8)
    axis1.set_ylim(0, 2500)
    axis1.set_ylabel('人数', color=color1, fontsize=size)
    axis1.tick_params('both', colors=color1, labelsize=size)

    for i, (xt, yt) in enumerate(zip(labels, y_data1)): 
        axis1.text(xt, yt + 50, f'{yt:.2f}',size=size,ha='center', va='bottom', color=color1)

    # Line chart: conversion rate per group
    axis2.plot(labels, y_data2,label='转化率', color=range_color[-1], marker="o", linewidth=2)
    axis2.set_ylabel('转化率(%)', color=color2,fontsize=size)
    axis2.tick_params('y', colors=color2, labelsize=size)
    axis2.set_ylim(80, 90)
    for i, (xt, yt) in enumerate(zip(labels, y_data2)): 
        axis2.text(xt, yt + 0.3, f'{yt:.2f}',size=size,ha='center', va='bottom', color=color2) 

    axis1.legend(loc=(0.88, 0.92))
    axis2.legend(loc=(0.88, 0.87))
    plt.gca().spines["left"].set_color(range_color[0])
    plt.gca().spines["right"].set_color(range_color[-1]) 
    plt.gca().spines["left"].set_linewidth(2)
    plt.gca().spines["right"].set_linewidth(2)
    plt.title("各年龄人数及转化率分布",size=16)

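Customer ages are concentrated between 10 and 70.
Conversion rates vary little across age groups, so age appears to have little effect on whether a customer converts.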
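
The second plotting function relies on labels, y_data1 and y_data2, whose construction is not shown in the post. One way they could be derived is to bin Age and aggregate Conversion per bin. The sketch below is an assumption about that step: the bin edges and the synthetic data are made up, and the Age column name is inferred from the histogram code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (the real dataset is not reproduced here)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, size=1000),
    'Conversion': rng.integers(0, 2, size=1000),
})

# Hypothetical bin edges; right=False makes each bin [low, high)
bins = [10, 20, 30, 40, 50, 60, 70]
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]
demo['AgeGroup'] = pd.cut(demo['Age'], bins=bins, labels=labels, right=False)

grouped = demo.groupby('AgeGroup', observed=False)['Conversion']
y_data1 = grouped.count().tolist()                   # headcount per age group
y_data2 = (grouped.mean() * 100).round(2).tolist()   # conversion rate in %

print(labels)
print(y_data1)
print(y_data2)
```

The same pattern would apply to the other continuous features (AdSpend, ClickThroughRate, and so on) with different bin edges.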

3.2 Headcount and conversion rate by campaign channel

def get_CampaignChannel_analyze():
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))# (left, bottom, width, height) 
    axis2 = axis1.twinx()

    # Bar chart: headcount per channel
    axis1.bar(x_data, y_data1, label='人数',color=range_color[2], alpha=0.8)
    axis1.set_ylim(0, 2500)
    axis1.set_ylabel('人数', color=color1, fontsize=size)
    axis1.tick_params('both', colors=color1, labelsize=size)

    for i, (xt, yt) in enumerate(zip(x_data, y_data1)): 
        axis1.text(xt, yt + 50, f'{yt:.2f}',size=size,ha='center', va='bottom', color=range_color[2])

    # Line chart: conversion rate per channel
    axis2.plot(x_data, y_data2,label='转化率', color=color2, marker="o", linewidth=2)
    axis2.set_ylabel('转化率(%)', color=color2,fontsize=size)
    axis2.tick_params('y', colors=color2, labelsize=size)
    axis2.set_ylim(80, 90)
    for i, (xt, yt) in enumerate(zip(x_data, y_data2)): 
        axis2.text(xt, yt + 0.3, f'{yt:.2f}',size=size,ha='center', va='bottom', color=color2) 

    axis1.legend(loc=(0.88, 0.92))
    axis2.legend(loc=(0.88, 0.87))
    plt.gca().spines["left"].set_color(range_color[2])
    plt.gca().spines["right"].set_color(range_color[-1]) 
    plt.gca().spines["left"].set_linewidth(2)
    plt.gca().spines["right"].set_linewidth(2)
    plt.title("各渠道人数及转化率分布",size=16)

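Across the five campaign channels, conversion rates stay within 2% of one another, so the channel appears to have little effect on whether a customer converts.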
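
The size of such a gap can be checked numerically by computing the conversion rate per channel and subtracting the lowest rate from the highest. The sketch below uses a tiny made-up table; the channel names are placeholders, and the CampaignChannel column name is an assumption based on the function name above.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    'CampaignChannel': ['Email', 'Email', 'Email', 'Email', 'SEO', 'SEO', 'SEO', 'SEO'],
    'Conversion':      [1, 1, 1, 0, 1, 1, 0, 0],
})

# Headcount and conversion rate per channel
stats = demo.groupby('CampaignChannel')['Conversion'].agg(count='count', rate='mean')
stats['rate'] = stats['rate'] * 100   # conversion rate in %

x_data = stats.index.tolist()
y_data1 = stats['count'].tolist()
y_data2 = stats['rate'].tolist()

# Gap between the best and worst channel, in percentage points
spread = stats['rate'].max() - stats['rate'].min()
print(stats)
print(f'spread: {spread:.2f}')
```

On the real data, a small spread would support the reading that the channel matters little.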

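3.3 Headcount and conversion rate by campaign type

Across campaign types, conversion rates differ by up to nearly 10%, so campaign type has some effect on whether a customer converts.

3.4 Ad spend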

def get_AdSpend_analyze1():
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    axis1.set_xlabel('营销花费(美元)', color=color1, fontsize=size)
    axis1.set_ylabel('人数', color=color1, fontsize=size)
    axis1.tick_params('both', colors=color1, labelsize=size)
    sns.histplot(df['AdSpend'], kde=True, bins=10)
    axis1.grid(True, which='both', linestyle='--', linewidth=0.5) 
    plt.title('营销花费分布',size=16)

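Conversion rates vary noticeably with ad spend, by more than 10% at the widest, so ad spend has a clear effect on whether a customer converts.

3.5 Click-through rate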

def get_ClickThroughRate_analyze1():
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    axis1.set_xlabel('网站点击率(%)', color=color1, fontsize=size)
    axis1.set_ylabel('人数', color=color1, fontsize=size)
    axis1.tick_params('both', colors=color1, labelsize=size)
    sns.histplot(df['ClickThroughRate'], kde=True, bins=10)
    axis1.grid(True, which='both', linestyle='--', linewidth=0.5) 
    plt.title('网站点击率分布',size=16)

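Conversion rates also vary noticeably with click-through rate, by more than 10% at the widest, so click-through rate has a clear effect on whether a customer converts as well.

3.6 Total website visits and conversion rate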

def get_WebsiteVisits_analyze1():
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    axis1.set_xlabel('访问网站总次数', color=color1, fontsize=size)
    axis1.set_ylabel('人数', color=color1, fontsize=size)
    axis1.tick_params('both', colors=color1, labelsize=size)
    sns.histplot(df['WebsiteVisits'], kde=True, bins=10)
    axis1.grid(True, which='both', linestyle='--', linewidth=0.5) 
    plt.title('访问网站总次数分布',size=16)

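Conversion rates vary noticeably with the total number of website visits, by more than 10% at the widest, so total visits have a clear effect on whether a customer converts.

3.7 Time on site per visit and conversion rate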

def get_TimeOnSite_analyze1():
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    axis1.set_xlabel('每次访问平均时间', color=color1, fontsize=size)
    axis1.set_ylabel('人数', color=color1, fontsize=size)
    axis1.tick_params('both', colors=color1, labelsize=size)
    sns.histplot(df['TimeOnSite'], kde=True, bins=10)
    axis1.grid(True, which='both', linestyle='--', linewidth=0.5) 
    plt.title('每次访问平均时间分布',size=16)

Conversion rates vary noticeably with the time spent on the site per visit, so time on site has a clear effect on whether a customer converts.

4. Modeling

from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample
from scipy.interpolate import interp1d
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

4.1 Feature correlation heatmap

def get_model_analyze(): 
    corrdf = data.corr()
    plt.figure(figsize=(12, 12), dpi=80)
    sns.heatmap(corrdf, annot=True,cmap="rainbow", linewidths=0.05,square=True,annot_kws={"size":8}, cbar_kws={'shrink': 0.8})
    plt.title("各特征相关性热图",size=16)

The correlation heatmap shows that AdSpend, ClickThroughRate, ConversionRate, WebsiteVisits, PagesPerVisit, TimeOnSite, EmailOpens, EmailClicks, PreviousPurchases and LoyaltyPoints are more strongly related to conversion than the remaining features.
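
The heatmap function reads from a DataFrame named data, which is not defined in the post. Since LabelEncoder is imported, one plausible construction is to label-encode the text columns of df so that corr() can run on every column. The sketch below is an assumption about that step, shown on made-up data; CampaignType and its values are placeholders.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only
df_demo = pd.DataFrame({
    'CampaignType': ['Awareness', 'Retention', 'Conversion',
                     'Awareness', 'Retention', 'Conversion'],
    'AdSpend': [120.0, 340.5, 980.0, 75.2, 410.0, 860.3],
    'Conversion': [0, 1, 1, 0, 0, 1],
})

# Encode every non-numeric column as integer codes (classes are sorted alphabetically)
data = df_demo.copy()
for col in data.columns:
    if not pd.api.types.is_numeric_dtype(data[col]):
        data[col] = LabelEncoder().fit_transform(data[col])

# Correlation of each feature with the target, strongest first
target_corr = data.corr()['Conversion'].drop('Conversion').sort_values(ascending=False)
print(data)
print(target_corr)
```

Note that integer codes impose an arbitrary order on nominal categories, so correlations involving encoded columns are best read as rough hints.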

4.2 K-nearest neighbors (KNN)

4.2.1 Finding a high-accuracy k

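# x_train, x_test, y_train, y_test are assumed to come from the train_test_split shown in section 4.3.1
k_values = range(1, 21)
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test.values)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
# Keep the k with the highest test accuracy (used as best_k in 4.2.2)
best_k = k_values[int(np.argmax(accuracies))]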

Accuracy peaks at k = 15, reaching 0.88625.

4.2.2 Model accuracy

model = KNeighborsClassifier(n_neighbors=best_k)
model.fit(x_train, y_train)
train_accuracy = accuracy_score(y_train, model.predict(x_train.values))
test_accuracy = accuracy_score(y_test, model.predict(x_test.values))

y_pred = model.predict(x_test.values)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

4.2.3 Confusion matrix
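
The conf_matrix computed in 4.2.2 has the actual classes in rows and the predicted classes in columns. A minimal sketch with made-up labels shows how its four cells can be read; the labels are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = converted, 0 = not converted
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1])

cm = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels {0, 1} the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)       # share of actual converters that were found
precision = tp / (tp + fp)    # share of predicted converters that were right
print(cm)
print(recall, precision)
```

When one class dominates, as the conversion rates of 80-90% in section 3 suggest, these per-class numbers say more than overall accuracy does.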

4.2.4 ROC curve

def get_model_roc1():
    y_probs = model.predict_proba(x_test.values)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_probs)
    auc = roc_auc_score(y_test, y_probs)
    
    fig = plt.figure(figsize=(12, 6), dpi=80)
    axis1 = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    axis1.tick_params('both', colors=color1, labelsize=size)
    axis1.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {auc:.2f}')
    axis1.plot([0, 1], [0, 1], color='gray', linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('假正率',color=color1, fontsize=size)
    plt.ylabel('召回率',color=color1, fontsize=size)
    plt.title(f'ROC曲线 - {type(model).__name__}',size=16)
    plt.legend(loc="lower right",fontsize=size)

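The KNN model reaches AUC = 0.57, only slightly better than random guessing. For reference:
AUC = 1: a perfect classifier; at least one threshold yields perfect predictions.
0.5 < AUC < 1: better than random guessing; with a well-chosen threshold the classifier has predictive value.
AUC = 0.5: equivalent to random guessing (for example, flipping a coin); the model has no predictive value.
AUC < 0.5: worse than random guessing.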
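
These cases are easy to reproduce with roc_auc_score on a handful of hand-picked scores. AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, which is what the toy example below illustrates; the labels and scores are made up.

```python
from sklearn.metrics import roc_auc_score

y_true_demo = [0, 0, 1, 1]

# Every positive is ranked above every negative
perfect = roc_auc_score(y_true_demo, [0.1, 0.2, 0.8, 0.9])
# One of the four positive/negative pairs is in the wrong order
partial = roc_auc_score(y_true_demo, [0.1, 0.4, 0.35, 0.8])
# Identical scores carry no ranking information
constant = roc_auc_score(y_true_demo, [0.5, 0.5, 0.5, 0.5])
# The ranking is exactly backwards
inverted = roc_auc_score(y_true_demo, [0.9, 0.8, 0.2, 0.1])

print(perfect, partial, constant, inverted)
```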

4.3 Random forest

4.3.1 Model accuracy

x_data = data.drop(columns=['Conversion'])
y = data['Conversion']
x_train, x_test, y_train, y_test = train_test_split(x_data, y, test_size=0.2, random_state=7)
model = RandomForestClassifier(random_state=15)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

4.3.2 Confusion matrix

4.3.3 ROC curve

AUC = 0.83, clearly better than KNN's 0.57.

4.3.4 ROC curve with confidence interval

def get_model_roc3(n_bootstraps=1000):
    # n_bootstraps is not defined in the original post; 1000 is an assumed default
    # ROC curve and AUC on the original test set
    y_probs = model.predict_proba(x_test)[:, 1]
    fpr_orig, tpr_orig, thresholds_orig = roc_curve(y_test, y_probs)
    auc_value = roc_auc_score(y_test, y_probs)

    # Bootstrap the test set and compute one ROC curve per resample
    tpr_bootstraps = np.zeros((n_bootstraps, len(fpr_orig)))
    for i in range(n_bootstraps):
        x_resample, y_resample = resample(x_test, y_test)
        y_probs_resample = model.predict_proba(x_resample)[:, 1]
        fpr_resample, tpr_resample, _ = roc_curve(y_resample, y_probs_resample)
        # Linear interpolation of TPR onto the original FPR grid so all curves share one x-axis
        tpr_bootstraps[i] = np.interp(fpr_orig, fpr_resample, tpr_resample)

    # Pointwise 95% confidence band from the 2.5th and 97.5th percentiles
    tpr_ci = np.percentile(tpr_bootstraps, [2.5, 97.5], axis=0)

    # Plot the ROC curve and its confidence band
    plt.figure(figsize=(12, 6), dpi=80)
    plt.plot(fpr_orig, tpr_orig, color='blue', lw=2, label=f'AUC = {auc_value:.2f}')
    plt.fill_between(fpr_orig, tpr_ci[0], tpr_ci[1], color='blue', alpha=0.2, label='95%置信区间')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # random-guess line
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('假正率',color=color1, fontsize=size)
    plt.ylabel('召回率',color=color1, fontsize=size)
    plt.title('ROC曲线 - 95%置信区间')
    plt.legend(loc="lower right", fontsize=size)
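
The same resampling idea also gives an interval for the AUC value itself rather than for the curve. The sketch below is a self-contained illustration on synthetic data from make_classification; the dataset, the model settings and the number of bootstrap rounds are all assumptions, not the values used in the post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the conversion dataset
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

n_bootstraps = 200   # hypothetical number of rounds
aucs = []
for i in range(n_bootstraps):
    # Resample labels and scores together, with replacement
    y_bs, p_bs = resample(y_test, probs, random_state=i)
    if len(np.unique(y_bs)) < 2:
        continue   # AUC is undefined when only one class is present
    aucs.append(roc_auc_score(y_bs, p_bs))

point = roc_auc_score(y_test, probs)
ci_low, ci_high = np.percentile(aucs, [2.5, 97.5])
print(f'AUC = {point:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]')
```

Because the model is fixed, resampling the predicted probabilities is equivalent to resampling the test rows and predicting again, and it is cheaper.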

4.4 Feature importance

def get_feature_importances():
    x_data = features_rf['特征'].tolist()
    y_data = features_rf['重要度'].tolist()
    fig = plt.figure(figsize=(12, 6), dpi=80)
    ax = fig.add_axes((0.1, 0.1, 0.8, 0.8))
    ax.set_xlim(0, 0.1)
    ax.tick_params('both', colors=color1, labelsize=size)
    bars = ax.barh(x_data, y_data, color=range_color[1])
    for bar in bars: 
        w = bar.get_width()
        ax.text(w+0.001, bar.get_y()+bar.get_height()/2, '%.4f'%w, ha='left', va='center')
        
    plt.xlabel('重要度',color=color1, fontsize=size)
    plt.ylabel('特征',color=color1, fontsize=size)
    plt.title('随机森林特征重要性',size=16)
    plt.grid(True, which='both', linestyle='--', linewidth=0.5) 
    plt.tight_layout()
    plt.show()
get_feature_importances()

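Conversion rate (ConversionRate), average time on site per visit (TimeOnSite), click-through rate (ClickThroughRate), pages per visit (PagesPerVisit), ad spend (AdSpend), loyalty points (LoyaltyPoints), total website visits (WebsiteVisits) and email opens (EmailOpens) have the strongest influence on whether a user converts.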
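
The plotting function reads from a table named features_rf that is not built in the post. A plausible way to construct it is from the fitted forest's feature_importances_ attribute, as sketched below on synthetic data. The feature names are placeholders; the column names '特征' (feature) and '重要度' (importance) are the ones the plotting code expects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for x_train / y_train
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
x_demo = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(5)])  # placeholder names

rf = RandomForestClassifier(n_estimators=50, random_state=15).fit(x_demo, y)

# One row per feature, sorted ascending so barh draws the most important feature on top
features_rf = (
    pd.DataFrame({'特征': x_demo.columns, '重要度': rf.feature_importances_})
    .sort_values('重要度', ascending=True)
    .reset_index(drop=True)
)
print(features_rf)
```

The importances of a random forest are normalized to sum to 1, so each value can be read as a share of the total.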

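Summary:

The feature analysis shows that age and campaign channel have little effect on whether a user converts and campaign type only a moderate one, while ad spend, click-through rate, conversion rate, total website visits and average time on site per visit have a clear effect.
Comparing the two models, random forest outperforms KNN in both accuracy and AUC, so it can be used to predict the probability that a user converts.
The feature importance chart shows that conversion rate, average time on site per visit, click-through rate, pages per visit, ad spend, loyalty points, total website visits and email opens have the strongest influence on conversion. Later marketing strategy should focus on improving these areas to raise the conversion rate.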
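
As a closing illustration of using the model to predict conversion probability, the sketch below scores a batch of unseen rows with predict_proba and picks the most likely converters. It runs on synthetic data; the dataset, the split and the model settings are assumptions for demonstration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: X_new plays the role of users not seen during training
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.1, random_state=1)

rf = RandomForestClassifier(n_estimators=50, random_state=15).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (converted)
proba = rf.predict_proba(X_new)[:, 1]

# Indices of the five users with the highest predicted conversion probability
top = np.argsort(proba)[::-1][:5]
print(proba[top])
```

Ranking users by this probability is one way to decide where marketing effort is likely to pay off most.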