ModelCube Experiment | Online Food Delivery Customer Churn Analysis and Prediction

Digest 2024-07-26 07:37 Zhejiang

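ModelCube (modelcube.cn) is a one-stop artificial intelligence research platform developed in-house by 博雅数智. It provides one-stop research services to big-data and AI research teams at universities and research institutions across China. Building on MLOps practice and the company's core technologies, it supports management and annotation of all data types in research scenarios, fast provisioning and flexible customization of experiment environments, full-lifecycle model management, management and publication of research outputs, and AI-driven paper search and study.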

Online Food Delivery Customer Churn Analysis and Prediction

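In metropolitan cities such as Bengaluru, India, demand for online delivery has been rising. Why demand is increasing has remained an open question, so we conducted a survey and provide the resulting data. The goal is to see whether we can predict customer churn. Before preprocessing the data we will produce some visualizations, and after that we will fit classification models.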

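The dataset in this experiment has nearly 55 variables, grouped into the following categories:

Consumer demographics
Overall/general purchase decisions
Delivery time influencing the purchase decision
Restaurant rating influencing the purchase decision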

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from IPython.display import HTML
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, accuracy_score

1. Importing the data

data = pd.read_csv('../dataset/11795/onlinedeliverydata.csv')

Inspect the structure of the data.

data.head()

data.describe()

data.info()

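Even before the visualization step, we can see that some form of dimensionality reduction will be needed when fitting models to this data in order to avoid overfitting, because the number of variables is large relative to the number of rows. The following steps show how this is done.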

2. Data visualization

2.1 Continuous variables

%matplotlib inline

# Columns 0 and 6 hold the two numeric demographic variables (age and family size)
num_cols = [data.columns[i] for i in [0, 6]]

fig = plt.figure(figsize=[16, 16])
for index, col_name in enumerate(num_cols, start=1):
    ax = fig.add_subplot(2, 1, index)
    sns.countplot(x=col_name, data=data, hue='Output', palette='viridis')

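Neither of the two continuous variables (age and family size) has extreme values, so we do not need to treat any observations as outliers. We can also see that reordering is more common among customers under 25 and in households of fewer than 4 people.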
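
The absence of extreme values can also be checked numerically rather than by eye. The sketch below applies the common 1.5 x IQR rule to a small made-up sample; the column names `Age` and `Family size`, the values, and the helper `iqr_outliers` are assumptions for illustration and do not come from the survey data.

```python
import pandas as pd

# Hypothetical sample; the real values would come from onlinedeliverydata.csv
sample = pd.DataFrame({
    "Age": [18, 21, 22, 23, 24, 25, 26, 28, 30, 33],
    "Family size": [1, 2, 2, 3, 3, 3, 4, 4, 5, 6],
})


def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]


# Number of flagged values per column
outlier_counts = {col: len(iqr_outliers(sample[col])) for col in sample.columns}
print(outlier_counts)
```

A count of zero for a column is consistent with the visual impression that it has no extreme values; a non-zero count would be a prompt to look at those rows, not an automatic reason to drop them.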

2.2 Categorical variables

# Creating a class for grouping categorical variables into frequency tables
class CategoricalGrouping():
    
    def __init__(self, data, col1, col2):
        self.data = data  # Pandas dataframe
        self.col1 = col1  # Column with categories for analysis
        self.col2 = col2  # Output variable
        
    @property
    def table(self):
        return self.data.groupby([self.col1, self.col2]).size().reset_index().pivot(
            columns=self.col1, index=self.col2, values=0).fillna(0)


# Defining a function to plot a nested pie chart
def nested_piechart(data, axis, wedge_width, pie_colors, chart_title):
    """This function takes the following arguments:
        
        data: a pandas dataframe of dimension greater than 2x1 (row x column)
        axis: matplotlib.axes.Axes object for plotting
        wedge_width: float, should be <=1
        pie_colors: list, color codes in hex, should be >= maximum # of categories
        chart_title: str, chart title to display
        
    """

    # Outer wedges
    wedges_outer, texts_outer = axis.pie(data.iloc[1], radius=1, wedgeprops=dict(width=wedge_width, edgecolor='w'), 
           startangle=90, colors=pie_colors)

    # Inner wedges
    axis.pie(data.iloc[0], radius=(1-wedge_width), wedgeprops=dict(width=wedge_width, edgecolor='w', alpha=0.7), 
           startangle=90, colors=pie_colors)

    axis.set(aspect="equal", title=chart_title)

    axis.legend(wedges_outer, list(data.columns),
              title=chart_title,
              loc="lower center",
              bbox_to_anchor=(0.85, -0.1, 0.5, 1))

    # Defining properties for annotations
    bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
    kw = dict(arrowprops=dict(arrowstyle="-"),
              bbox=bbox_props, zorder=0, va="center")

    y = np.sin(np.deg2rad(120))  # Converting degrees to radians
    x = np.cos(np.deg2rad(120))  # Converting degrees to radians

    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]  # Depending on the radians of x, will give -1 or 1
    connectionstyle = "angle,angleA=0,angleB={}".format(120)
    kw["arrowprops"].update({"connectionstyle": connectionstyle})  # adding connection style args to kw dict
    axis.annotate(data.index[1], xy=(x, y), xytext=(1*np.sign(x), 1.2*y), 
                horizontalalignment=horizontalalignment, **kw)

    y = np.sin(np.deg2rad(140)) - 0.60  # Converting degrees to radians
    x = np.cos(np.deg2rad(140)) + 0.37  # Converting degrees to radians

    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]  # Depending on the radians of x, will give -1 or 1
    connectionstyle = "angle,angleA=0,angleB={}".format(140)
    kw["arrowprops"].update({"connectionstyle": connectionstyle})  # adding connection style args to kw dict
    axis.annotate(data.index[0], xy=(x, y), xytext=(0.01*np.sign(x), -2*y), 
                horizontalalignment=horizontalalignment, **kw)

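Below is a grid of nested pie charts, one per categorical variable, with the output variable distinguishing the inner and outer rings.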

%matplotlib inline

fig = plt.figure(figsize=[16,40])

size = 0.3
c2 = 'Output'
c_palette = ['#003f5c', '#58508d', '#bc5090', '#ff6361', '#ffa600']

cat_var = ['Gender', 'Marital Status', 'Occupation', 'Monthly Income', 'Educational Qualifications', 'Medium (P1)', 
           'Medium (P2)', 'Meal(P1)', 'Meal(P2)', 'Perference(P1)', 'Perference(P2)']

ax_list = []

for ind, var in enumerate(cat_var):
    ax_list.append(fig.add_subplot(6, 2, (ind+1)))
    nested_piechart(CategoricalGrouping(data, var, c2).table, ax_list[ind], size, c_palette, var)

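The variables Medium (P1), Medium (P2), Meal(P1) and Meal(P2) provide almost no information that helps predict customer churn. For example, Medium (P1) only tells us that a customer who did not reorder was using a food delivery app, which is redundant. The other variables add nothing either, because the class distribution looks the same across their categories. We will now drop these variables.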
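
Whether the class distribution really is the same across the categories of a variable can be checked with row-normalized proportions. This is a minimal sketch on made-up data: the category labels `App` and `Web` and the eight rows are hypothetical, and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical responses; labels and counts are invented for illustration
toy = pd.DataFrame({
    "Medium (P1)": ["App", "App", "App", "App", "Web", "Web", "Web", "Web"],
    "Output":      ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"],
})

# Share of each Output class within every category (each row sums to 1)
shares = pd.crosstab(toy["Medium (P1)"], toy["Output"], normalize="index")
print(shares)

# Largest gap in the "Yes" share between any two categories
yes_gap = shares["Yes"].max() - shares["Yes"].min()
print(f"Spread of 'Yes' share across categories: {yes_gap:.2f}")
```

A spread close to zero means the variable does little to separate the two classes, which is the situation described for the four dropped variables.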

data.drop(['Medium (P1)', 'Medium (P2)', 'Meal(P1)', 'Meal(P2)'], axis=1, inplace=True)

2.3 Geospatial analysis

%matplotlib inline

x = data.groupby(['latitude', 'longitude', 'Pin code']).size().reset_index()
x.columns = ['latitude', 'longitude', 'pincode', 'frequency']
x.sort_values(by=['frequency'], ascending=False, inplace=True)

latitude = 12.972442
longitude = 77.580643
delivery_map = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lon, freq, pin in zip(x['latitude'], x['longitude'], x['frequency'], x['pincode']):
    folium.CircleMarker([lat, lon], radius=freq, 
                        popup = ('Pincode: ' + str(pin) + '<br>' 
                                 '# of customers: ' + str(freq)
                                ), 
                        tooltip='Click to expand',
                        color='blue', 
                        fill_color='red', 
                        fill=True, 
                        fill_opacity=0.6).add_to(delivery_map)

delivery_map

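We have 388 rows of data but only 77 unique coordinates, which suggests either that the dataset contains only 77 customers or that the lat-lon positions are imprecise and snapped to an area. Since the dataset comes from a survey, we can assume the latter. The fact that there are also 77 unique pin codes supports this. The map above shows customer density across the areas of Bengaluru.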
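
The comparison between row count, unique coordinates and unique pin codes can be reproduced with a few pandas calls. The sketch below uses five invented rows; only the column names match the dataset, and the coordinates and pin codes are placeholders.

```python
import pandas as pd

# Hypothetical rows; coordinates and pin codes are placeholders
toy = pd.DataFrame({
    "latitude":  [12.97, 12.97, 12.93, 12.93, 12.99],
    "longitude": [77.59, 77.59, 77.61, 77.61, 77.57],
    "Pin code":  [560001, 560001, 560034, 560034, 560003],
})

n_rows = len(toy)
n_coords = len(toy[["latitude", "longitude"]].drop_duplicates())
n_pins = toy["Pin code"].nunique()
print(n_rows, n_coords, n_pins)

# If every pin code maps to exactly one coordinate pair, the coordinates
# are most likely area centroids rather than individual addresses
coords_per_pin = toy.groupby("Pin code")[["latitude", "longitude"]].nunique()
one_to_one = bool((coords_per_pin == 1).all().all())
print(one_to_one)
```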

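We will drop the latitude, longitude and Pin code variables from the dataset. The data consists of responses from different individuals, and these variables carry no information beyond where in Bengaluru those individuals are located.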

data.drop(['latitude', 'longitude', 'Pin code'], axis=1, inplace=True)

2.4 Correlation matrix

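Before plotting the correlation matrix, we need to take the variables recorded on a Likert scale and encode them as ordinal ranks. We then use Spearman's rank correlation to compute the correlation matrix of these variables and get a better understanding of the features.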
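
The two steps, mapping Likert labels to ordinal ranks and then computing Spearman's rank correlation, can be sketched on a toy frame. The label wording, the rank order in `likert_order`, and the two column names are assumptions for illustration; the real dataset has several label sets, each with its own ordering.

```python
import pandas as pd

# Assumed label order from most negative to most positive
likert_order = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
likert_map = {label: rank for rank, label in enumerate(likert_order, start=1)}

# Hypothetical answers to two Likert questions
toy = pd.DataFrame({
    "Ease and convenient": ["Agree", "Strongly agree", "Neutral", "Disagree", "Strongly agree"],
    "Good Tracking system": ["Agree", "Strongly agree", "Disagree", "Neutral", "Agree"],
})

# Replace every label with its ordinal rank (1..5)
encoded = toy.apply(lambda col: col.map(likert_map))

# Spearman correlates the ranks, so only the ordering of the labels matters
rho = encoded.corr(method="spearman")
print(encoded)
print(rho.round(2))
```

Spearman's coefficient is used here instead of Pearson's because the distance between adjacent Likert labels is not known to be equal; only their order is meaningful.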

# Finding out all unique sets of categorical variables within the features
data_list = []
for r in data.iloc[:, np.r_[1, 4, 5, 9:47]].columns:
    df_row = list(data[r].unique())
    df_row = pd.Series(dict(enumerate(sorted(df_row))))
    data_list.append(df_row)

df_1 = pd.DataFrame(data=data_list, index=data.iloc[:,np.r_[1, 4, 5, 9:47]].columns)

df_1['combined'] = df_1.apply(lambda x: ', '.join(x.dropna().values.tolist()), axis=1)
df_1.drop([0, 1, 2, 3, 4], inplace=True, axis=1)
df_1 = df_1.reset_index()

df_1['index'] = df_1.groupby(['combined'])['index'].transform(lambda x : ','.join(x))
df_1 = df_1.drop_duplicates()
df_1.columns = ['features', 'data_categories']
df_1.drop([23, 24], axis=0, inplace=True)
df_1 = df_1.reset_index()
df_1.drop('index', axis=1, inplace=True)

x1 = list(enumerate([[0, 1], [2, 3, 1, 4, 0], 
                     [2, 4, 3, 1, 0], [4, 2, 3, 5, 1], 
                     [1, 2, 3, 4, 5], [4, 2, 3, 5, 1], 
                     [4, 3, 2, 1, 5], [1, 0]]
                   ))

for i, l in x1:
    df_1.at[i, 'data_categories'] = dict(zip(df_1.iloc[i, 1].split(', '), l))

for i in df_1.index:
    for j in df_1.iloc[i, 0].split(','):
        data[j] = data[j].apply(lambda x: df_1.iloc[i, 1][x])

%matplotlib inline
data_numeric = data.select_dtypes(exclude='object')
corr = data_numeric.corr(method='spearman')  # compute the matrix once and reuse it

fig = plt.figure(figsize=[32, 18])
# Mask the upper triangle (including the diagonal) so each pair is shown once
sns.heatmap(corr, annot=False, mask=np.triu(np.ones_like(corr, dtype=bool)), cmap='Spectral',
            linewidths=0.1, linecolor='white')

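We notice that the features from "ease and convenience" through "good tracking system" are correlated not only with the output variable but also with one another. In fact, looking along the diagonal shows which features are correlated with each other, which gives an idea of how the many dimensions in the variables could be compressed. Principal component analysis, a popular dimensionality reduction method, does not help here because the variables are ordinal categorical rather than continuous. We will use other feature selection methods later, in the data preprocessing stage.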
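
Besides reading the heatmap, the strongly correlated pairs can be listed directly. The sketch below builds a toy frame in which column `b` is an exact copy of column `a`, then keeps the lower triangle of the Spearman matrix and filters on an arbitrary threshold of 0.8; the data, the column names and the threshold are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=100)  # ordinal values 1..5

toy = pd.DataFrame({
    "a": base,
    "b": base,                           # exact copy of "a"
    "c": rng.integers(1, 6, size=100),   # unrelated column
})

corr = toy.corr(method="spearman")

# Keep only the strictly lower triangle so each pair appears once
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
pairs = lower.stack()

# Pairs whose absolute rank correlation exceeds the chosen threshold
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

Pairs that show up in such a list are candidates for keeping only one of the two features.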

3. Data preprocessing

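Some preprocessing was already done during the visualization step, where we converted the Likert features to an ordinal scale. A few features are still categorical; we now convert these to dummy variables.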
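
As a small illustration of what `pd.get_dummies(..., drop_first=True)` produces, the sketch below encodes two made-up categorical columns. The column names and category labels are hypothetical; the point is that the first category of each column, in sorted order, is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical categorical columns
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Occupation": ["Student", "Employee", "Self Employed", "Student"],
})

# One indicator column per category, minus the first category of each variable
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
print(dummies)
```

Dropping the first level avoids perfectly collinear indicator columns, which matters for the logistic regression model more than for tree-based or distance-based models.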

_ = pd.get_dummies(data.iloc[:, [2, 3, 7, 8, 29, 30, 37]], drop_first=True)
data.drop([data.columns[i] for i in [2, 3, 7, 8, 29, 30, 37, 47]], axis=1, inplace=True)
data = data.join(_)
data

X = data.drop('Output', axis=1) # input categorical features
y = data['Output'] # target variable

sf = SelectKBest(chi2, k='all')
sf_fit = sf.fit(X, y)

chi2_scores = pd.DataFrame([sf_fit.scores_, X.columns], index=['feature_score', 'feature_name']).transpose()
chi2_scores = chi2_scores.sort_values(by=['feature_score'], ascending=True).reset_index().drop('index', axis=1)

%matplotlib inline

fig = plt.figure(figsize=[10, 20])
plt.barh(chi2_scores['feature_name'], chi2_scores['feature_score'], align='center', alpha=0.5)
plt.xlabel('Score')

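As a starting point, we select the 20 features with the highest scores and check later whether any variables need to be removed because of overfitting.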
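
An alternative to slicing the score table by hand is to let `SelectKBest` keep the top 20 features itself and to place it in a scikit-learn `Pipeline`, so that the chi-squared scores are computed from the training split only. This is a sketch on synthetic non-negative integer data, not the article's method: the data, the target rule and the step names `select` and `clf` are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic ordinal features in 0..4 (chi2 requires non-negative values)
rng = np.random.default_rng(0)
n_samples, n_features = 200, 30
X = rng.integers(0, 5, size=(n_samples, n_features))
# Synthetic target that depends mainly on the first two features
y = (X[:, 0] + X[:, 1] + rng.integers(0, 3, size=n_samples) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=20)),          # keep the 20 best-scoring features
    ("clf", LogisticRegression(max_iter=10000)),  # model fitted on the selected features
])
pipe.fit(X_train, y_train)

# Indices of the features that survived the selection step
kept = pipe.named_steps["select"].get_support(indices=True)
test_accuracy = pipe.score(X_test, y_test)
print(len(kept), test_accuracy)
```

Fitting the selector inside the pipeline keeps the test rows out of the feature scoring, which gives a more honest estimate of how the selected subset generalizes.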

X = X[list(chi2_scores.iloc[-20:,1])]
X

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

4. Model building

4.1 Logistic regression

log_model = LogisticRegression(max_iter=10000)
log_model.fit(X_train, y_train)

log_pred = log_model.predict(X_test)

print(classification_report(y_test, log_pred))
print(confusion_matrix(y_test, log_pred))

4.2 Random forest

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)

print(classification_report(y_test, rfc_pred))
print(confusion_matrix(y_test, rfc_pred))

4.3 K-nearest neighbors

Check the metrics for different values of K.

# Accuracy, precision and recall on the test set for K = 1..10
knn_metrics = []

for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    knn_metrics.append([accuracy_score(y_test, pred_k), precision_score(y_test, pred_k),
                        recall_score(y_test, pred_k)])

# Index the rows by K so the x-axis of the plot shows the actual K values
knn_metrics = pd.DataFrame(knn_metrics, columns=['accuracy_score', 'precision_score', 'recall_score'],
                           index=range(1, 11))
knn_metrics.plot()

We obtained the best performance at K=7, so we train the model with that value.
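
Choosing K from scores on the test set means the test set has influenced the model. A common alternative, shown here only as a sketch and not as the method used above, is to pick K by cross-validation on the training split and to touch the test split once at the end. The data comes from `make_classification` and is purely synthetic; the sample count of 388 simply mirrors the size of the survey.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 20 selected features
X, y = make_classification(n_samples=388, n_features=20, n_informative=8, random_state=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

# 5-fold cross-validation over K = 1..10, using the training split only
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # evaluated once, after K is fixed
print(best_k, round(search.best_score_, 3), round(test_accuracy, 3))
```

By default `GridSearchCV` refits the best estimator on the full training split, so `search` can be used for prediction directly.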

knn_model = KNeighborsClassifier(n_neighbors=7)
knn_model.fit(X_train, y_train)

knn_pred = knn_model.predict(X_test)

print(classification_report(y_test, knn_pred))
print(confusion_matrix(y_test, knn_pred))

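5. Conclusion

Of all the models built, random forest and K-nearest neighbors performed best. K-nearest neighbors has the added advantage of being relatively simple to explain without sacrificing accuracy.

Another consideration is the data itself. We found that most of the variables with a significant effect on customer churn relate to the broad theme of ease and convenience. In fact, the subset of features selected for the models can inform business decisions about which aspects of the service to focus on to minimize churn and which user segments marketing spend should target.

Next steps:

To extract more information from the dataset, we could explore other dimensionality reduction methods such as multiple correspondence analysis, which addresses the problem that principal component analysis is unsuitable for categorical data
We ignored the text data in this analysis, namely the reviews; analyzing it with NLP methods could add to the analysis
Could the models be tuned further by adding or removing variables?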

To run this experiment online, log in to ModelCube:
http://modelcube.cn/experiment/experiment-detail/1008080
