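ModelCube (modelcube.cn) is a one-stop artificial intelligence research platform developed in-house by Boya Shuzhi (博雅数智). It provides one-stop research services for big data and AI research teams at universities and research institutions nationwide. Building on MLOps practice and the company's core technology, it supports management and annotation of all data types, fast provisioning and flexible customization of experiment environments, full-lifecycle model management, management and publication of research results, and AI-driven paper search and study.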
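Urban Traffic Flow Analysis Based on Taxi Data
How does taxi traffic vary over the day? To answer this question, the experiment uses K-means clustering to divide New York into groups by location, and then analyzes the traffic entering and leaving each cluster as a function of the time of day. One would expect residential areas to see more traffic in the evening, business districts to draw people mostly during the day, and areas with active nightlife to see more traffic at night.
This helps with predicting trip duration, because it tells us the likely destinations from each area at different times of day.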
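As a minimal sketch of the idea, the snippet below counts rides leaving and entering each zone per hour of day. The zone names and timestamps are made up for illustration; in the experiment itself the zones come from the K-means clusters.

```python
import pandas as pd

# Toy trips; zone labels and timestamps are hypothetical.
trips = pd.DataFrame({
    "pickup_datetime": ["2016-01-01 08:15:00", "2016-01-01 08:40:00",
                        "2016-01-01 22:05:00", "2016-01-01 22:30:00"],
    "pickup_zone":  ["residential", "residential", "business", "nightlife"],
    "dropoff_zone": ["business", "business", "nightlife", "residential"],
})

# Hour of day of each pickup.
trips["pickup_hour"] = pd.to_datetime(trips["pickup_datetime"]).dt.hour

# Rides leaving and entering each zone, one row per hour.
outbound = trips.groupby(["pickup_hour", "pickup_zone"]).size().unstack(fill_value=0)
inbound = trips.groupby(["pickup_hour", "dropoff_zone"]).size().unstack(fill_value=0)

print(outbound)
print(inbound)
```

Each row of `outbound` and `inbound` is one hour, so reading down a column shows how a zone's traffic changes through the day.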
import io
import base64
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import animation
from matplotlib import cm
from matplotlib.patches import Arrow  # arrows drawn in the traffic animations
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from dateutil import parser
from IPython.display import HTML
warnings.filterwarnings("ignore")
1. Read the data
df = pd.read_csv('../dataset/1001442/train.csv')
df.head()
2. Remove rides outside the study area
xlim = [-74.03, -73.77]
ylim = [40.63, 40.85]
df = df[(df.pickup_longitude> xlim[0]) & (df.pickup_longitude < xlim[1])]
df = df[(df.dropoff_longitude> xlim[0]) & (df.dropoff_longitude < xlim[1])]
df = df[(df.pickup_latitude> ylim[0]) & (df.pickup_latitude < ylim[1])]
df = df[(df.dropoff_latitude> ylim[0]) & (df.dropoff_latitude < ylim[1])]
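The four filters above can also be written as a single boolean mask. The sketch below is one way to do that; the helper name `filter_to_bbox` is hypothetical, and `inclusive="neither"` (available in pandas 1.3 and later) keeps the strict inequalities used above.

```python
import pandas as pd

def filter_to_bbox(df, xlim, ylim):
    """Keep rides whose pickup and dropoff both fall strictly inside the box."""
    mask = (
        df["pickup_longitude"].between(*xlim, inclusive="neither")
        & df["dropoff_longitude"].between(*xlim, inclusive="neither")
        & df["pickup_latitude"].between(*ylim, inclusive="neither")
        & df["dropoff_latitude"].between(*ylim, inclusive="neither")
    )
    return df[mask]

# Toy rides: the second drops off too far north, the third picks up too far west.
toy = pd.DataFrame({
    "pickup_longitude":  [-73.98, -73.98, -75.00],
    "pickup_latitude":   [40.75, 40.75, 40.75],
    "dropoff_longitude": [-73.95, -73.95, -73.95],
    "dropoff_latitude":  [40.78, 41.50, 40.78],
})
inside = filter_to_bbox(toy, xlim=[-74.03, -73.77], ylim=[40.63, 40.85])
print(inside)
```

Building one mask avoids creating three intermediate copies of the data frame.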
3. Plot the rides
longitude = list(df.pickup_longitude) + list(df.dropoff_longitude)
latitude = list(df.pickup_latitude) + list(df.dropoff_latitude)
plt.figure(figsize = (10,10))
plt.plot(longitude,latitude,'.', alpha = 0.4, markersize = 0.05)
plt.show()
loc_df = pd.DataFrame()
loc_df['longitude'] = longitude
loc_df['latitude'] = latitude
4. Clustering
Let's group New York City into clusters based on the pickup and drop-off points of each taxi ride.
kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_
loc_df = loc_df.sample(200000)
plt.figure(figsize = (10,10))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.3, markersize = 0.3)
plt.title('Clusters of New York')
plt.show()
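As we can see, the clustering produces a partition that somewhat resembles the way New York is divided into neighborhoods. The Upper East Side and Upper West Side, on either side of Central Park, appear in gray and pink, respectively. West Midtown is blue, Chelsea and the West Village are brown, Downtown is blue, and the East Village and SoHo are purple.
JFK and LaGuardia airports each get their own cluster, as do Queens and Harlem. Brooklyn is split into two clusters, and the Bronx has too few rides to be separated from Harlem.
Let's plot the cluster centers: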
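For readers who want to see what a K-means fit does internally, here is a minimal NumPy sketch of Lloyd's algorithm (assign each point to its nearest center, then move each center to the mean of its points). It is an illustration only, not scikit-learn's implementation: it uses a deterministic farthest-first initialization, whereas `KMeans` defaults to k-means++. The function names and the two synthetic blobs are assumptions made for the example.

```python
import numpy as np

def assign(points, centers):
    """Index of the nearest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with farthest-first initialization."""
    points = np.asarray(points, dtype=float)
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centers chosen so far.
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(points[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(points[d.min(axis=1).argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers, assign(points, centers)

# Two synthetic blobs of (longitude, latitude) points.
rng = np.random.default_rng(0)
manhattan = rng.normal([-73.98, 40.75], 0.005, size=(50, 2))
airport = rng.normal([-73.78, 40.64], 0.005, size=(50, 2))
centers, labels = lloyd_kmeans(np.vstack([manhattan, airport]), k=2)
print(centers)
```

With well-separated blobs the two centers settle on the blob means; on real ride data the result depends on the initialization, which is why the experiment fixes `random_state`.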
fig,ax = plt.subplots(figsize = (10,10))
for label in loc_df.label.unique():
    ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.4, markersize = 0.1, color = 'gray')
    ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')
    ax.annotate(label, (kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1]), color = 'b', fontsize = 20)
ax.set_title('Cluster Centers')
plt.show()
df['pickup_cluster'] = kmeans.predict(df[['pickup_longitude','pickup_latitude']].values)
df['dropoff_cluster'] = kmeans.predict(df[['dropoff_longitude','dropoff_latitude']].values)
# Vectorized datetime parsing; much faster than parsing row by row
df['pickup_hour'] = pd.to_datetime(df.pickup_datetime).dt.hour
clusters = pd.DataFrame()
clusters['x'] = kmeans.cluster_centers_[:,0]
clusters['y'] = kmeans.cluster_centers_[:,1]
clusters['label'] = range(len(clusters))
loc_df = loc_df.sample(5000)
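5. Taxi rides from one cluster to another
In the animations below, each arrow represents rides from one cluster to another. The width of an arrow is proportional to the share of rides on that route during the given hour.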
fig, ax = plt.subplots(1, 1, figsize = (10,10))
def animate(hour):
    # Redraw the sampled points and the cluster centers for this frame
    ax.clear()
    ax.set_title('Absolute Traffic - Hour ' + str(int(hour)) + ':00')
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray')
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')
    # One arrow per (origin, destination) pair of clusters
    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) & (df.dropoff_cluster == dest_label) & (df.pickup_hour == hour)])
            dist_x = clusters.x[clusters.label == label].values[0] - clusters.x[clusters.label == dest_label].values[0]
            dist_y = clusters.y[clusters.label == label].values[0] - clusters.y[clusters.label == dest_label].values[0]
            # Absolute traffic: share of all rides in the data set
            pct = np.true_divide(num_of_rides,len(df))
            arr = Arrow(clusters.x[clusters.label == label].values[0], clusters.y[clusters.label == label].values[0], -dist_x, -dist_y, edgecolor='white', width = 15*pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')
ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
# The 'imagemagick' writer requires ImageMagick to be installed
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
with open(filename, 'rb') as f:
    video = f.read()
encoded = base64.b64encode(video)
HTML(data='''<img src="data:image/gif;base64,{0}" type="gif" />'''.format(encoded.decode('ascii')))
fig, ax = plt.subplots(1, 1, figsize = (10,10))
def animate(hour):
    # Redraw the sampled points and the cluster centers for this frame
    ax.clear()
    ax.set_title('Relative Traffic - Hour ' + str(int(hour)) + ':00')
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 2, color = 'gray')
        ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'r')
    # One arrow per (origin, destination) pair of clusters
    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) & (df.dropoff_cluster == dest_label) & (df.pickup_hour == hour)])
            dist_x = clusters.x[clusters.label == label].values[0] - clusters.x[clusters.label == dest_label].values[0]
            dist_y = clusters.y[clusters.label == label].values[0] - clusters.y[clusters.label == dest_label].values[0]
            # Relative traffic: share of the rides that started in this hour
            pct = np.true_divide(num_of_rides,len(df[df.pickup_hour == hour]))
            arr = Arrow(clusters.x[clusters.label == label].values[0], clusters.y[clusters.label == label].values[0], -dist_x, -dist_y, edgecolor='white', width = pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')
ani = animation.FuncAnimation(fig,animate,sorted(df.pickup_hour.unique()), interval = 1000)
# The 'imagemagick' writer requires ImageMagick to be installed
ani.save('animation.gif', writer='imagemagick', fps=2)
filename = 'animation.gif'
with open(filename, 'rb') as f:
    video = f.read()
encoded = base64.b64encode(video)
HTML(data='''<img src="data:image/gif;base64,{0}" type="gif" />'''.format(encoded.decode('ascii')))
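We can see that in the morning most of the traffic is on the island of Manhattan.
In the evening, a much larger share of taxis heads to Brooklyn (mainly Williamsburg). Since there is no comparable movement in the opposite direction in the morning, this is unlikely to be the result of commuting. Rather, since this traffic mostly appears after 22:00, these are probably people going out.
Because the arrows show relative traffic within the given hour, the widening of the arrows to Brooklyn may also simply reflect fewer rides elsewhere, given the commercial nature of most of Manhattan. In the absolute-traffic animation, the arrow from Manhattan to Brooklyn is barely visible for most of the day.
In the very early hours, most traffic goes to and from the two airports. The absolute-traffic animation shows that this is just the result of reduced traffic in the rest of the city.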
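The difference between the two normalizations can be sketched on a toy table. Absolute traffic divides a route's ride count by all rides in the data, while relative traffic divides it by the rides that started in the same hour. The helper name `flow_shares` and the ten toy rides are assumptions made for the example.

```python
import pandas as pd

def flow_shares(trips, hour):
    """Return (absolute, relative) route shares for one pickup hour."""
    in_hour = trips[trips["pickup_hour"] == hour]
    counts = pd.crosstab(in_hour["pickup_cluster"], in_hour["dropoff_cluster"])
    absolute = counts / len(trips)    # share of all rides in the data
    relative = counts / len(in_hour)  # share of the rides in this hour
    return absolute, relative

# Nine rides at 09:00 and a single ride at 03:00.
trips = pd.DataFrame({
    "pickup_cluster":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "dropoff_cluster": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "pickup_hour":     [9, 9, 9, 9, 9, 9, 9, 9, 9, 3],
})
abs_3, rel_3 = flow_shares(trips, hour=3)
abs_9, rel_9 = flow_shares(trips, hour=9)
print(abs_3, rel_3, sep="\n")
```

The lone 03:00 ride is a small share of all traffic but the whole of that hour's traffic, which is the same effect that makes the airport arrows look wide in the early hours of the relative animation.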
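6. Neighborhood analysis
Let's manually assign a neighborhood name to each cluster. The dictionary below is keyed by longitude, and each ride is then labeled with the name whose key is nearest to its longitude.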
neighborhood = {-74.0019368351: 'Chelsea',-73.837549761: 'Queens',-73.7854240738: 'JFK',-73.9810421975:'Midtown-North-West',-73.9862336241: 'East Village',
                -73.971273324:'Midtown-North-East',-73.9866739677: 'Brooklyn-parkslope',-73.8690098118: 'LaGuardia',-73.9890572967:'Midtown',-74.0081765545: 'Downtown',
                -73.9213024854: 'Queens-Astoria',-73.9470256923: 'Harlem',-73.9555565018: 'Upper East Side',
                -73.9453487097: 'Brooklyn-Williamsburg',-73.9745967889:'Upper West Side'}
rides_df = pd.DataFrame(columns = list(neighborhood.values()))
rides_df['name'] = list(neighborhood.values())
# 1-nearest-neighbor lookup on longitude only: each ride gets the name whose key is closest
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(np.array(list(neighborhood.keys())).reshape(-1, 1), list(neighborhood.values()))
df['pickup_neighborhood'] = neigh.predict(df.pickup_longitude.values.reshape(-1,1))
df['dropoff_neighborhood'] = neigh.predict(df.dropoff_longitude.values.reshape(-1,1))
# Cell (row, col) = number of rides from neighborhood `row` to neighborhood `col`
for col in rides_df.columns[:-1]:
    rides_df[col] = rides_df.name.apply(lambda x: len(df[(df.pickup_neighborhood == x) & (df.dropoff_neighborhood == col)]))
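The lookup above matches each ride on longitude alone, so areas at almost the same longitude (the dictionary has keys -73.9862 and -73.9867) can be confused with each other. One possible alternative, sketched below, is to name each cluster once and then reuse the cluster labels already computed for every ride. This assumes the dictionary keys are the longitudes of the fitted cluster centers, which the source does not state; the helper name and the toy values are hypothetical.

```python
import numpy as np

def names_by_cluster(center_longitudes, names_by_longitude):
    """Map each cluster label to the name whose key is nearest its center's longitude."""
    keys = np.array(list(names_by_longitude.keys()))
    names = list(names_by_longitude.values())
    centers = np.asarray(center_longitudes, dtype=float)
    nearest = np.abs(centers[:, None] - keys[None, :]).argmin(axis=1)
    return {label: names[i] for label, i in enumerate(nearest)}

# Toy stand-ins for kmeans.cluster_centers_[:, 0] and the name dictionary.
toy_names = {-73.7854: "JFK", -73.8690: "LaGuardia", -73.9890: "Midtown"}
toy_center_lons = [-73.9891, -73.7850, -73.8688]
label_to_name = names_by_cluster(toy_center_lons, toy_names)
print(label_to_name)
```

With the real data, the resulting dictionary could be applied as `df.pickup_cluster.map(label_to_name)`, so that a ride's neighborhood follows from both its longitude and latitude through the cluster assignment.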
Now let's plot a heat map of where riders travel from and to (these are totals over all rides):
fig,ax = plt.subplots(figsize = (12,12))
cax = ax.matshow(rides_df.drop('name',axis = 1),interpolation='nearest',cmap=cm.afmhot)
cbar = fig.colorbar(cax)
ax.grid(False)  # the string 'off' is truthy and would turn the grid on
ax.set_xticks(range(len(rides_df)))
ax.set_xticklabels(rides_df.name, rotation =90,fontsize = 15)
ax.set_yticks(range(len(rides_df)))
ax.set_yticklabels(rides_df.name,fontsize = 15)
ax.set_xlabel('To', fontsize = 25)
ax.set_ylabel('From', fontsize = 25)
ax.set_title('Neighborhoods Interaction', y=1.35, fontsize = 30)
rides_df.index = rides_df.name
rides_df = rides_df.drop('name', axis = 1)
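We can see that the neighborhoods of central Manhattan are the busiest for taxis, led by the Upper East Side. We also see that the most common rides stay within a single cluster (which the animations do not show).
The heat map is fairly symmetric, meaning that no cluster has markedly more pickups than drop-offs or vice versa. Let's take a closer look: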
fig,ax = plt.subplots(figsize = (12,12))
outbound = rides_df.sum(axis = 1)  # rides leaving each neighborhood (row sums)
inbound = rides_df.sum(axis = 0)   # rides arriving in each neighborhood (column sums)
for i in range(len(rides_df)):
    ax.plot(outbound.iloc[i],inbound.iloc[i],'o', color = 'b')
    ax.annotate(rides_df.index.tolist()[i], (outbound.iloc[i],inbound.iloc[i]), color = 'b', fontsize = 12)
ax.plot([0,250000],[0,250000], color = 'r', linewidth = 1)
ax.grid(False)
ax.set_xlim([0,250000])
ax.set_ylim([0,250000])
ax.set_xlabel('Outbound Taxis')
ax.set_ylabel('Inbound Taxis')
ax.set_title('Inbound and Outbound rides for each cluster')
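We can see that the inbound-to-outbound ratio of each neighborhood is fairly balanced.
Both airports have more outbound than inbound rides, which makes sense: drivers may head to an airport even without a passenger for the chance to bring people into the city. Residential areas (Queens, Brooklyn, and Harlem) have more inbound rides, while commercial and tourist areas have more outbound rides. The Upper East Side and Upper West Side, which are both commercial and residential, lie almost on the diagonal.
It seems that people tend to use other means of transport to get into Manhattan but are more likely to take a taxi out.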
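The origin-destination table and the inbound and outbound totals discussed above can also be computed without an explicit loop. The sketch below uses `pd.crosstab` on five made-up rides; the column names follow the ones used in this experiment, and the data is purely illustrative.

```python
import pandas as pd

rides = pd.DataFrame({
    "pickup_neighborhood":  ["JFK", "JFK", "Midtown", "Midtown", "Harlem"],
    "dropoff_neighborhood": ["Midtown", "Harlem", "Midtown", "JFK", "Midtown"],
})

# Rows are origins, columns are destinations.
od = pd.crosstab(rides["pickup_neighborhood"], rides["dropoff_neighborhood"])

# Make the table square so that every area appears on both axes.
areas = sorted(set(od.index) | set(od.columns))
od = od.reindex(index=areas, columns=areas, fill_value=0)

outbound = od.sum(axis=1)      # rides leaving each area
inbound = od.sum(axis=0)       # rides arriving in each area
balance = outbound - inbound   # positive: more rides leave than arrive
print(od)
print(balance)
```

A positive `balance` corresponds to a point below the diagonal in the scatter plot above, and a negative one to a point above it.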
7. Winter versus summer
df['pickup_month'] = pd.to_datetime(df.pickup_datetime).dt.month
fig,ax = plt.subplots(2,figsize = (12,12))
rides_df = pd.DataFrame(columns = list(neighborhood.values()))
rides_df['name'] = list(neighborhood.values())
rides_df.index = rides_df.name
# June: count rides between every pair of neighborhoods
for col in rides_df.columns[:-1]:
    rides_df[col] = rides_df.name.apply(lambda x: len(df[(df.pickup_neighborhood == x) & (df.dropoff_neighborhood == col) & (df.pickup_month == 6)]))
outbound = rides_df.iloc[:,:-1].sum(axis = 1)
inbound = rides_df.iloc[:,:-1].sum(axis = 0)
for i in range(len(rides_df)):
    ax[0].plot(outbound.iloc[i],inbound.iloc[i],'o', color = 'b')
    ax[0].annotate(rides_df.index.tolist()[i], (outbound.iloc[i],inbound.iloc[i]), color = 'b', fontsize = 12)
ax[0].grid(False)
ax[0].set_xlabel('Outbound Taxis')
ax[0].set_ylabel('Inbound Taxis')
ax[0].set_title('Inbound and Outbound rides for each cluster - June')
ax[0].set_xlim([0,40000])
ax[0].set_ylim([0,40000])
ax[0].plot([0,40000],[0,40000])
# January: same counts for the winter month
for col in rides_df.columns[:-1]:
    rides_df[col] = rides_df.name.apply(lambda x: len(df[(df.pickup_neighborhood == x) & (df.dropoff_neighborhood == col) & (df.pickup_month == 1)]))
rides_df = rides_df.drop('name', axis = 1)
outbound = rides_df.sum(axis = 1)
inbound = rides_df.sum(axis = 0)
for i in range(len(rides_df)):
    ax[1].plot(outbound.iloc[i],inbound.iloc[i],'o', color = 'b')
    ax[1].annotate(rides_df.index.tolist()[i], (outbound.iloc[i],inbound.iloc[i]), color = 'b', fontsize = 12)
ax[1].grid(False)
ax[1].set_xlabel('Outbound Taxis')
ax[1].set_ylabel('Inbound Taxis')
ax[1].set_title('Inbound and Outbound rides for each cluster - January')
ax[1].set_xlim([0,40000])
ax[1].set_ylim([0,40000])
ax[1].plot([0,40000],[0,40000])
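As we can see, the pattern is almost identical regardless of the month. Snowy January and humid, tourist-heavy June produce very similar taxi patterns.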
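A visual comparison like this can be backed by a number. One simple option, sketched below on made-up rides, is to compare each neighborhood's share of pickups in the two months and report the largest gap. The choice of metric and the toy data are assumptions for illustration, not results from this experiment.

```python
import pandas as pd

rides = pd.DataFrame({
    "pickup_month": [1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6],
    "pickup_neighborhood": ["Midtown", "Midtown", "Harlem", "JFK",
                            "Midtown", "Midtown", "Midtown", "Midtown", "Midtown",
                            "Harlem", "Harlem", "JFK"],
})

# Share of each month's pickups per neighborhood (every row sums to 1).
shares = pd.crosstab(rides["pickup_month"], rides["pickup_neighborhood"],
                     normalize="index")

# Largest difference in pickup share between January (1) and June (6).
max_gap = (shares.loc[1] - shares.loc[6]).abs().max()
print(shares)
print(max_gap)
```

Normalizing per month removes the effect of one month simply having more rides, so a small `max_gap` indicates a similar distribution of pickups.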
To run this experiment online, log in to ModelCube:
http://modelcube.cn/experiment/experiment-detail/1009586