ModelCube Experiment | Urban Traffic Flow Analysis Based on Taxi Data

Digest 2024-08-02 08:08 Zhejiang

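ModelCube (modelcube.cn) is a one-stop AI research platform developed in-house by 博雅数智 (Boya Shuzhi). It serves big data and AI research teams at universities and research institutes across China. Built on MLOps practice and the company's core technology, it covers management and annotation of all data types used in research, fast provisioning and flexible customization of experiment environments, full-lifecycle model management, management and publication of research output, and AI-driven paper search and study.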

Urban Traffic Flow Analysis Based on Taxi Data

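How does taxi traffic change over the course of a day? To answer this, the experiment uses K-means clustering to split New York into groups by location, then analyzes the traffic into and out of each cluster as a function of the time of day. One would expect residential areas to see more traffic in the evening, business districts to attract people mostly during the day, and areas with active nightlife to see more traffic at night.

This helps with predicting trip duration, because it shows the likely destinations from each area at different times of day.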
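
The idea described above can be sketched on synthetic data before touching the real dataset. This is a minimal illustration, not the experiment's code: the two centres, the point counts, and the names `points`, `toy`, and `traffic_by_hour` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic drop-off points around two made-up centres (not real NYC data).
centres = np.array([[-73.98, 40.75], [-73.95, 40.68]])
points = np.vstack([c + rng.normal(scale=0.005, size=(500, 2)) for c in centres])
hours = rng.integers(0, 24, size=len(points))  # hypothetical hour of each drop-off

# Group the locations into clusters purely by position.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)

toy = pd.DataFrame({"cluster": kmeans.labels_, "hour": hours})

# Rows: hour of day, columns: cluster, values: number of drop-offs.
traffic_by_hour = toy.groupby(["hour", "cluster"]).size().unstack(fill_value=0)
print(traffic_by_hour.head())
```

Each column of `traffic_by_hour` is then one cluster's traffic as a function of the hour, which is the quantity the rest of the experiment looks at.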

import io
import base64
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import animation
from matplotlib import cm
from matplotlib.patches import Arrow  # used for the flow arrows in section 5
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from dateutil import parser
from IPython.display import HTML

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

1. Load the data

df = pd.read_csv('../dataset/1001442/train.csv')

df.head()

2. Remove rides to and from far-away areas

xlim = [-74.03, -73.77]
ylim = [40.63, 40.85]
df = df[(df.pickup_longitude> xlim[0]) & (df.pickup_longitude < xlim[1])]
df = df[(df.dropoff_longitude> xlim[0]) & (df.dropoff_longitude < xlim[1])]
df = df[(df.pickup_latitude> ylim[0]) & (df.pickup_latitude < ylim[1])]
df = df[(df.dropoff_latitude> ylim[0]) & (df.dropoff_latitude < ylim[1])]
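
The four filters above can also be written as a single mask. The sketch below is one possible equivalent, shown on a made-up three-row frame; the helper name `within_bounds` is hypothetical, and `inclusive="neither"` (strict bounds, as in the original comparisons) assumes pandas 1.3 or later.

```python
import pandas as pd

XLIM = (-74.03, -73.77)  # longitude bounds used in the experiment
YLIM = (40.63, 40.85)    # latitude bounds used in the experiment

def within_bounds(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rides whose pickup and drop-off both fall strictly inside the box."""
    mask = pd.Series(True, index=frame.index)
    for prefix in ("pickup", "dropoff"):
        mask &= frame[f"{prefix}_longitude"].between(*XLIM, inclusive="neither")
        mask &= frame[f"{prefix}_latitude"].between(*YLIM, inclusive="neither")
    return frame[mask]

# Made-up rides: row 1 drops off too far north, row 2 picks up too far west.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.98, -75.00],
    "pickup_latitude": [40.75, 40.75, 40.75],
    "dropoff_longitude": [-73.95, -73.95, -73.95],
    "dropoff_latitude": [40.70, 41.50, 40.70],
})
print(within_bounds(toy))
```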

3. Plot the rides

longitude = list(df.pickup_longitude) + list(df.dropoff_longitude)
latitude = list(df.pickup_latitude) + list(df.dropoff_latitude)


plt.figure(figsize = (10,10))
plt.plot(longitude,latitude,'.', alpha = 0.4, markersize = 0.05)
plt.show()

loc_df = pd.DataFrame()
loc_df['longitude'] = longitude
loc_df['latitude'] = latitude

4. Clustering

Let's group New York City by the pickup and drop-off points of each taxi ride.

kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_

loc_df = loc_df.sample(200000)
plt.figure(figsize = (10,10))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.3, markersize = 0.3)

plt.title('Clusters of New York')
plt.show()

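As we can see, the clustering produces a partition that somewhat resembles the way New York is divided into neighborhoods. The Upper East Side and Upper West Side of Central Park appear in gray and pink respectively. West Midtown is blue, Chelsea and the West Village are brown, Downtown is blue, and the East Village and SoHo are purple.

JFK and LaGuardia airports each get their own cluster, as do Queens and Harlem. Brooklyn is split into two clusters, and the Bronx has too few rides to be separated from Harlem.

Let's plot the cluster centers: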

fig,ax = plt.subplots(figsize = (10,10))
for label in loc_df.label.unique():
    ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.4, markersize = 0.1, color = 'gray')
    ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1]), color = 'b', fontsize = 20)
ax.set_title('Cluster Centers')
plt.show()
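
The cluster centers also determine how any coordinate gets its label: `KMeans.predict` assigns a point to the closest center by Euclidean distance. The sketch below checks this on two made-up blobs (the data and the names `model` and `new_points` are illustrative only). Note that the distance is measured in raw degrees, so a degree of longitude and a degree of latitude are treated as equal even though they cover different ground distances at New York's latitude.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Two synthetic blobs of (longitude, latitude) points; not real ride data.
blob_a = rng.normal(loc=[-73.98, 40.75], scale=0.005, size=(200, 2))
blob_b = rng.normal(loc=[-73.78, 40.64], scale=0.005, size=(200, 2))
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(np.vstack([blob_a, blob_b]))

new_points = np.array([[-73.97, 40.76], [-73.79, 40.65]])
labels = model.predict(new_points)

# Recompute the assignment by hand: distance from every point to every center.
distances = np.linalg.norm(
    new_points[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)

print(labels, nearest)
```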

df['pickup_cluster'] = kmeans.predict(df[['pickup_longitude','pickup_latitude']].values)
df['dropoff_cluster'] = kmeans.predict(df[['dropoff_longitude','dropoff_latitude']].values)
df['pickup_hour'] = df.pickup_datetime.apply(lambda x: parser.parse(x).hour )
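
Parsing every timestamp with `parser.parse` inside `apply` works but is slow on a large table. A vectorized alternative is sketched below, assuming the `pickup_datetime` column holds strings such as `2016-03-14 17:24:55`; the two sample values are invented for the example.

```python
import pandas as pd

# Made-up timestamps in the assumed 'YYYY-MM-DD HH:MM:SS' layout.
toy = pd.DataFrame({"pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"]})

# Parse the whole column once, then read the parts through the .dt accessor.
pickup_time = pd.to_datetime(toy["pickup_datetime"])
toy["pickup_hour"] = pickup_time.dt.hour
toy["pickup_month"] = pickup_time.dt.month

print(toy)
```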

clusters = pd.DataFrame()
clusters['x'] = kmeans.cluster_centers_[:,0]
clusters['y'] = kmeans.cluster_centers_[:,1]
clusters['label'] = range(len(clusters))

loc_df = loc_df.sample(5000)

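5. Taxi rides from one cluster to another

In the animation below, each arrow represents the rides from one cluster to another. The width of an arrow is proportional to the relative volume of trips in that hour.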

fig, ax = plt.subplots(1, 1, figsize = (10,10))

def animate(hour):
    # Redraw the single shared axes for this hour; no new figure is created per frame.
    ax.clear()
    ax.set_title('Absolute Traffic - Hour ' + str(int(hour)) + ':00')
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray')
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')

    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) & (df.dropoff_cluster == dest_label) & (df.pickup_hour == hour)])
            # Arrow start point (as scalars) and the offset to the destination center.
            x0 = clusters.x[clusters.label == label].values[0]
            y0 = clusters.y[clusters.label == label].values[0]
            dx = clusters.x[clusters.label == dest_label].values[0] - x0
            dy = clusters.y[clusters.label == dest_label].values[0] - y0
            # Share of all rides in the dataset, scaled up so the arrows are visible.
            pct = np.true_divide(num_of_rides,len(df))
            arr = Arrow(x0, y0, dx, dy, edgecolor='white', width = 15*pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')


ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
# Saving with writer='imagemagick' requires ImageMagick to be installed.
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
video = io.open(filename, 'rb').read()
encoded = base64.b64encode(video)
HTML(data='''<img src="data:image/gif;base64,{0}" type="gif" />'''.format(encoded.decode('ascii')))

fig, ax = plt.subplots(1, 1, figsize = (10,10))

def animate(hour):
    # Redraw the single shared axes for this hour; no new figure is created per frame.
    ax.clear()
    ax.set_title('Relative Traffic - Hour ' + str(int(hour)) + ':00')
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray')
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')

    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) & (df.dropoff_cluster == dest_label) & (df.pickup_hour == hour)])
            # Arrow start point (as scalars) and the offset to the destination center.
            x0 = clusters.x[clusters.label == label].values[0]
            y0 = clusters.y[clusters.label == label].values[0]
            dx = clusters.x[clusters.label == dest_label].values[0] - x0
            dy = clusters.y[clusters.label == dest_label].values[0] - y0
            # Share of the rides that started in this hour only.
            pct = np.true_divide(num_of_rides,len(df[df.pickup_hour == hour]))
            arr = Arrow(x0, y0, dx, dy, edgecolor='white', width = pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')


ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
# Saving with writer='imagemagick' requires ImageMagick to be installed.
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
video = io.open(filename, 'rb').read()
encoded = base64.b64encode(video)
HTML(data='''<img src="data:image/gif;base64,{0}" type="gif" />'''.format(encoded.decode('ascii')))

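We can see that in the morning most of the traffic is on Manhattan island.

In the evening, a much larger share of taxis head to the Brooklyn area (mainly Williamsburg). Since there is no similar movement in the opposite direction in the morning, this is unlikely to be commuting. Rather, because this traffic mostly appears after 22:00, these are probably people going out.

Because the arrows represent relative traffic within the hour, the wider arrows toward Brooklyn may also simply reflect a drop in rides elsewhere, given the commercial nature of most of Manhattan. In absolute terms, though, the arrows from Manhattan to Brooklyn are barely visible for most of the day.

In the very early hours, most of the traffic is to and from the two airports. The absolute plot shows that this is only the result of reduced traffic in the rest of the city.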
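
The difference between the two normalizations can be seen in a small worked example. The eight rides below are invented, and the column names `absolute_share` and `relative_share` are illustrative; the absolute share divides by all rides in the table, while the relative share divides by the rides that started in the same hour.

```python
import pandas as pd

# Made-up rides: six at 09:00 and only two at 04:00.
toy = pd.DataFrame({
    "pickup_cluster":  [0, 0, 0, 0, 0, 0, 1, 1],
    "dropoff_cluster": [1, 1, 1, 1, 1, 0, 0, 0],
    "pickup_hour":     [9, 9, 9, 9, 9, 9, 4, 4],
})

# One row per (hour, origin, destination) with its ride count.
flows = (
    toy.groupby(["pickup_hour", "pickup_cluster", "dropoff_cluster"])
       .size()
       .rename("rides")
       .reset_index()
)

# Absolute: share of all rides. Relative: share of the rides within that hour.
flows["absolute_share"] = flows["rides"] / len(toy)
flows["relative_share"] = flows["rides"] / flows.groupby("pickup_hour")["rides"].transform("sum")

print(flows)
```

In this toy table the single 04:00 flow has a relative share of 1.0 but an absolute share of only 0.25, the same effect that makes the early-morning airport arrows look dominant in the relative animation.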

6. Neighborhood analysis

Let's manually assign a neighborhood name to each cluster.

# Keys appear to be the longitudes of the cluster centers; values are hand-assigned neighborhood names.
neighborhood = {-74.0019368351: 'Chelsea', -73.837549761: 'Queens', -73.7854240738: 'JFK', -73.9810421975: 'Midtown-North-West', -73.9862336241: 'East Village',
                -73.971273324: 'Midtown-North-East', -73.9866739677: 'Brooklyn-parkslope', -73.8690098118: 'LaGuardia', -73.9890572967: 'Midtown', -74.0081765545: 'Downtown',
                -73.9213024854: 'Queens-Astoria', -73.9470256923: 'Harlem', -73.9555565018: 'Upper East Side',
                -73.9453487097: 'Brooklyn-Williamsburg', -73.9745967889: 'Upper West Side'}

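# Origin-destination table: one row per pickup neighborhood, one column per drop-off neighborhood.
rides_df = pd.DataFrame(columns = list(neighborhood.values()))
rides_df['name'] = list(neighborhood.values())

# A 1-nearest-neighbor classifier maps a longitude to the name attached to the closest key.
# Note that it matches on longitude alone; latitude is not used.
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(np.array(list(neighborhood.keys())).reshape(-1, 1), list(neighborhood.values()))

df['pickup_neighborhood'] = neigh.predict(df.pickup_longitude.values.reshape(-1,1))
df['dropoff_neighborhood'] = neigh.predict(df.dropoff_longitude.values.reshape(-1,1))

# Count the rides for every (pickup, drop-off) pair of neighborhoods.
for col in rides_df.columns[:-1]:
    rides_df[col] = rides_df.name.apply(lambda x: len(df[(df.pickup_neighborhood == x) & (df.dropoff_neighborhood == col)]))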
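
The nested counting loop above filters the full table once per pair of neighborhoods. The same pickup-by-drop-off table can be built in one pass with `pd.crosstab`; the sketch below uses five invented rides and a hypothetical `names` list to fix the row and column order.

```python
import pandas as pd

# Made-up rides labelled with pickup and drop-off neighborhoods.
toy = pd.DataFrame({
    "pickup_neighborhood":  ["Midtown", "Midtown", "JFK", "Harlem", "Midtown"],
    "dropoff_neighborhood": ["Harlem", "Midtown", "Midtown", "Harlem", "Harlem"],
})

names = ["Harlem", "JFK", "Midtown"]

# Rows are pickup neighborhoods ("From"), columns are drop-off neighborhoods ("To").
# reindex keeps a fixed order and fills pairs that never occur with 0.
od = (
    pd.crosstab(toy["pickup_neighborhood"], toy["dropoff_neighborhood"])
      .reindex(index=names, columns=names, fill_value=0)
)

print(od)
```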

Now let's plot a heatmap of where passengers travel from and to (these are totals over all rides):

fig,ax = plt.subplots(figsize = (12,12))
cax = ax.matshow(rides_df.drop('name',axis = 1),interpolation='nearest',cmap=cm.afmhot)
cbar = fig.colorbar(cax)
ax.grid(False)
ax.set_xticks(range(len(rides_df)))
ax.set_xticklabels(rides_df.name, rotation =90,fontsize = 15)
ax.set_yticks(range(len(rides_df)))
ax.set_yticklabels(rides_df.name,fontsize = 15)
ax.set_xlabel('To', fontsize = 25)
ax.set_ylabel('From', fontsize = 25)
ax.set_title('Neighborhoods Interaction', y=1.35, fontsize = 30)

rides_df.index = rides_df.name
rides_df = rides_df.drop('name', axis = 1)

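We can see that the neighborhoods of central Manhattan are the busiest for taxis, led by the Upper East Side. We also see that the most common rides stay within a single cluster (which cannot be seen in the animation).

The heatmap is fairly symmetric, meaning that no cluster has markedly more pickups than drop-offs or vice versa. Let's take a closer look: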

fig,ax = plt.subplots(figsize = (12,12))
# Row sums are rides leaving a neighborhood; column sums are rides arriving there.
outbound = rides_df.sum(axis = 1)
inbound = rides_df.sum(axis = 0)
for i in range(len(rides_df)):
    ax.plot(outbound.iloc[i],inbound.iloc[i],'o', color = 'b')
    ax.annotate(rides_df.index.tolist()[i], (outbound.iloc[i],inbound.iloc[i]), color = 'b', fontsize = 12)

# Reference line where inbound equals outbound.
ax.plot([0,250000],[0,250000], color = 'r', linewidth = 1)
ax.grid(False)
ax.set_xlim([0,250000])
ax.set_ylim([0,250000])
ax.set_xlabel('Outbound Taxis')
ax.set_ylabel('Inbound Taxis')
ax.set_title('Inbound and Outbound rides for each cluster')

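We can see that the inbound-to-outbound ratio of each neighborhood is fairly balanced.

The two airports have more outbound rides than inbound ones, which makes sense: drivers may head to the airport even without a passenger for the chance to bring people into the city. Residential areas (Queens, Brooklyn, and Harlem) have more inbound rides, while business and tourist areas have more outbound rides. The Upper East Side and Upper West Side, which are both commercial and residential, sit almost on the line.

It seems that people enter Manhattan by other means of transport but are more likely to leave it by taxi.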
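
The inbound and outbound totals behind this comparison are just the column and row sums of the pickup-by-drop-off table. A minimal sketch with invented counts follows; the numbers and the `balance` frame are illustrative and are not results from the dataset.

```python
import pandas as pd

names = ["Harlem", "JFK", "Midtown"]

# Made-up table: rows are pickup neighborhoods, columns are drop-off neighborhoods.
od = pd.DataFrame(
    [[10, 0, 5],
     [1, 0, 9],
     [8, 3, 20]],
    index=names,
    columns=names,
)

balance = pd.DataFrame({
    "outbound": od.sum(axis=1),  # rides that start in the neighborhood
    "inbound": od.sum(axis=0),   # rides that end in the neighborhood
})
balance["inbound_to_outbound"] = balance["inbound"] / balance["outbound"]

print(balance)
```

A ratio below 1 puts a neighborhood under the diagonal of the scatter plot (more outbound than inbound), and a ratio above 1 puts it over the diagonal. As in the experiment, rides that stay inside a neighborhood count toward both totals.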

7. Winter vs. summer

df['pickup_month'] = df.pickup_datetime.apply(lambda x: parser.parse(x).month )

fig,ax = plt.subplots(2,figsize = (12,12))

rides_df = pd.DataFrame(columns = list(neighborhood.values()))
rides_df['name'] = list(neighborhood.values())
rides_df.index = rides_df.name

# June: count rides per (pickup, drop-off) pair, then plot outbound against inbound totals.
for col in rides_df.columns[:-1]:
    rides_df[col] = rides_df.name.apply(lambda x: len(df[(df.pickup_neighborhood == x) & (df.dropoff_neighborhood == col) & (df.pickup_month == 6)]))
counts = rides_df.iloc[:,:-1]  # leave out the trailing 'name' column
outbound, inbound = counts.sum(axis = 1), counts.sum(axis = 0)
for i in range(len(rides_df)):
    ax[0].plot(outbound.iloc[i],inbound.iloc[i],'o', color = 'b')
    ax[0].annotate(rides_df.index.tolist()[i], (outbound.iloc[i],inbound.iloc[i]), color = 'b', fontsize = 12)

ax[0].grid(False)
ax[0].set_xlabel('Outbound Taxis')
ax[0].set_ylabel('Inbound Taxis')
ax[0].set_title('Inbound and Outbound rides for each cluster - June')
ax[0].set_xlim([0,40000])
ax[0].set_ylim([0,40000])
ax[0].plot([0,40000],[0,40000])

# January: same counts and plot for the winter month.
for col in rides_df.columns[:-1]:
    rides_df[col] = rides_df.name.apply(lambda x: len(df[(df.pickup_neighborhood == x) & (df.dropoff_neighborhood == col) & (df.pickup_month == 1)]))
rides_df = rides_df.drop('name', axis = 1)
outbound, inbound = rides_df.sum(axis = 1), rides_df.sum(axis = 0)
for i in range(len(rides_df)):
    ax[1].plot(outbound.iloc[i],inbound.iloc[i],'o', color = 'b')
    ax[1].annotate(rides_df.index.tolist()[i], (outbound.iloc[i],inbound.iloc[i]), color = 'b', fontsize = 12)

ax[1].grid(False)
ax[1].set_xlabel('Outbound Taxis')
ax[1].set_ylabel('Inbound Taxis')
ax[1].set_title('Inbound and Outbound rides for each cluster - January')
ax[1].set_xlim([0,40000])
ax[1].set_ylim([0,40000])
ax[1].plot([0,40000],[0,40000])

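As we can see, the pattern is almost identical regardless of the month. Snowy January and humid, touristy June produce very similar taxi patterns.

To run this experiment online, log in to ModelCube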
http://modelcube.cn/experiment/experiment-detail/1009586

http://mp.weixin.qq.com/s?__biz=MzU2NTcxODIyMg==&mid=2247515324&idx=1&sn=df2b3afb094af5dc23e4032725757bec
