Author: 第八星系-李智
Email: lizhi258147369@163.com
Training the Model
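First, we set the six meteorological variables as the predictor (explanatory) variables, also called features.
Ozone (O3) is set as the response variable, also called the target variable.
We then split the dataset: 70% of the samples form the training set and the remaining 30% form the test set. Note that train_test_split shuffles the rows by default, so this is a random 70/30 split rather than a "first 70%, last 30%" split.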
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import export_graphviz

# df is the DataFrame loaded beforehand: six meteorological columns plus 'O3'.
target = 'O3'
features = df.columns[df.columns != target]

# Keep X and y as pandas objects so that column names stay available later.
X = df[features]
y = df[target]

# 70% training / 30% test; rows are shuffled before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)
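The difference between a shuffled split and an ordered split can matter for time-ordered meteorological data. The following is a minimal sketch on a made-up ten-row frame (`demo`, `X_demo`, and `y_demo` are hypothetical stand-ins, not the article's data) showing what `shuffle=False` changes:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the article's df: 10 time-ordered rows.
demo = pd.DataFrame({"t1000": np.arange(10, dtype=float),
                     "O3": np.arange(10, dtype=float) * 2.0})
X_demo = demo.drop(columns=["O3"])
y_demo = demo["O3"]

# Default behaviour: rows are shuffled before splitting.
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123)

# shuffle=False keeps the original order: first 70% train, last 30% test.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=False)

print("random test rows:       ", sorted(Xr_test.index.tolist()))
print("chronological test rows:", Xc_test.index.tolist())
```

If the goal is to judge how well the model predicts future ozone from past data, the ordered split is usually the more honest test, because a shuffled split lets neighbouring time steps land on both sides.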
Next, we check whether the ozone data follow a normal distribution.
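# Distribution of O3, to check for positive (right) skew.
# histplot replaces the deprecated sns.distplot.
sns.histplot(df['O3'], kde=True, color='green')
plt.show()
print("Skewness: %f" % df['O3'].skew())
print("Kurtosis: %f" % df['O3'].kurt())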
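Skewness and kurtosis can be complemented by a formal normality test. The sketch below uses a synthetic right-skewed sample (`o3_demo` is a hypothetical stand-in for `df['O3']`) together with `scipy.stats.normaltest`, and also shows how a log transform changes the skewness:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for df['O3'].
o3_demo = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=2000))

skewness = o3_demo.skew()   # > 0 means a longer right tail
kurtosis = o3_demo.kurt()   # excess kurtosis; about 0 for normal data
stat, p_value = stats.normaltest(o3_demo)  # D'Agostino-Pearson test

print(f"skewness: {skewness:.3f}")
print(f"kurtosis: {kurtosis:.3f}")
print(f"normality test p-value: {p_value:.3g}")

# A log transform often pulls right-skewed data closer to symmetric.
log_skewness = np.log1p(o3_demo).skew()
print(f"skewness after log1p: {log_skewness:.3f}")
```

A small p-value means the sample is unlikely to come from a normal distribution. Random forests do not require a normally distributed target, so this check is mainly descriptive here.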
We now train the model and look at its mean absolute error (MAE) and coefficient of determination (R^2).
from sklearn.ensemble import RandomForestRegressor

# Random forest with 100 trees, using all available CPU cores
forest = RandomForestRegressor(n_estimators=100,
                               criterion='squared_error',
                               random_state=1,
                               n_jobs=-1)
forest.fit(X_train, y_train)

y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)

# Mean absolute error on both sets
mae_train = mean_absolute_error(y_train, y_train_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)
print(f'MAE train: {mae_train:.2f}')
print(f'MAE test: {mae_test:.2f}')

# Coefficient of determination on both sets
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f'R^2 train: {r2_train:.2f}')
print(f'R^2 test: {r2_test:.2f}')
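To make the two metrics concrete, here is a small sketch that computes MAE and R^2 by hand with NumPy and compares them with the scikit-learn functions. The four values in `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values, for illustration only.
y_true = np.array([30.0, 45.0, 60.0, 80.0])
y_pred = np.array([28.0, 50.0, 55.0, 70.0])

# MAE: average absolute difference between truth and prediction.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MAE manual :", mae_manual)
print("MAE sklearn:", mean_absolute_error(y_true, y_pred))
print("R^2 manual :", r2_manual)
print("R^2 sklearn:", r2_score(y_true, y_pred))
```

MAE is expressed in the units of the target, while R^2 is unitless and compares the model with a baseline that always predicts the mean.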
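The results on the test set are clearly worse than those on the training set, which suggests that the model is overfitting the training data.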
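One common way to quantify such a train/test gap is k-fold cross-validation. The sketch below runs on synthetic data (`X_demo`, `y_demo`, and the `min_samples_leaf=5` setting are assumptions for illustration, not the article's configuration) and compares the training R^2 with the cross-validated R^2:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic data: 300 samples, 6 features, target driven by two of them.
X_demo = rng.normal(size=(300, 6))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=300))

# min_samples_leaf > 1 is one simple way to limit tree complexity.
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5,
                              random_state=1)

# R^2 on 5 held-out folds.
cv_r2 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# R^2 on the data the model was fitted on.
model.fit(X_demo, y_demo)
train_r2 = model.score(X_demo, y_demo)

print(f"training R^2: {train_r2:.3f}")
print(f"CV R^2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
```

A training score well above the cross-validated score points to overfitting. Parameters such as `min_samples_leaf` or `max_depth` are the usual first things to tune.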
Next, let's look at the residuals.
# Common x-range for the zero line in both panels
x_max = np.max([np.max(y_train_pred), np.max(y_test_pred)])
x_min = np.min([np.min(y_train_pred), np.min(y_test_pred)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3), sharey=True)

# Residual = predicted value - true value
ax1.scatter(y_test_pred, y_test_pred - y_test,
            c='limegreen', marker='s', edgecolor='white',
            label='Test data')
ax2.scatter(y_train_pred, y_train_pred - y_train,
            c='steelblue', marker='o', edgecolor='white',
            label='Training data')
ax1.set_ylabel('Residuals')

for ax in (ax1, ax2):
    ax.set_xlabel('Predicted values')
    ax.legend(loc='upper left')
    ax.hlines(y=0, xmin=x_min-100, xmax=x_max+100, color='black', lw=2)

plt.tight_layout()
#plt.savefig('figures/09_16.png', dpi=300)
plt.show()
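The residuals are not distributed completely at random around the zero line, which indicates that the model fails to capture all of the explanatory information.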
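Beyond inspecting the plot, two simple numbers can summarize a non-random residual pattern: the mean residual (overall bias) and the correlation between the residuals and the true values. The sketch below uses a hypothetical helper, `residual_summary`, and made-up values in which the predictions are compressed toward the middle of the range. It follows the same sign convention as the plot (predicted minus true).

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Return mean residual and residual/true-value correlation.

    Residuals follow the convention used in the plot: predicted - true.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    return {
        "mean_residual": float(residuals.mean()),
        "corr_with_true": float(np.corrcoef(residuals, y_true)[0, 1]),
    }

# Made-up example: predictions are compressed toward the middle.
y_true_demo = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
y_pred_demo = 0.6 * y_true_demo + 20.0
summary = residual_summary(y_true_demo, y_pred_demo)
print(summary)
```

A negative mean residual means the model underpredicts on average. A strongly negative correlation with the true values means high values are underpredicted and low values overpredicted, a pattern often seen with averaging models such as random forests.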
The model's score on the test set is only mediocre.
# For a regressor, score() returns R^2 on the given data
score = forest.score(X_test, y_test)
print('Random forest model score: ', score)
The predicted values still deviate noticeably from the true values, and the predictions are clearly too low.
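y_validation_pred = forest.predict(X_test)

# Plot at most the first 1000 test samples (fewer if the test set is smaller)
n_show = min(1000, len(y_test))
plt.figure()
plt.plot(np.arange(n_show), np.asarray(y_test)[:n_show], "go-", label="True value")
plt.plot(np.arange(n_show), y_validation_pred[:n_show], "ro-", label="Predict value")
plt.title("True value And Predict value")
plt.legend()
plt.show()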
Let's also look at the regression performance from other angles.
# Evaluate regression performance
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_validation_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_validation_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_validation_pred)))

# Save the test features together with the true and predicted values
feature_cols = ['t1000', 'r1000', 'u1000', 'v1000', 'blh', 'e']
df_output = X_test[feature_cols].copy()
df_output['y_true'] = y_test
df_output['y_pred'] = y_validation_pred
df_output.to_excel('result_Y_validation.xlsx')
Finally, let's look at how much each meteorological variable contributes to ozone (its feature importance).
# The pipeline itself is not fitted here; its 'regressor' step is the
# forest that was already trained above.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('reduce_dim', PCA()),
                 ('regressor', forest)])

# Export the first tree of the forest in Graphviz .dot format
with open('./wine.dot', 'w', encoding='utf-8') as f:
    export_graphviz(pipe.named_steps['regressor'].estimators_[0], out_file=f)

# Rank the features by importance, highest first
col = list(X_train.columns.values)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
list01 = []
list02 = []
for i in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (i + 1, 30, col[indices[i]], importances[indices[i]]))
    list01.append(col[indices[i]])
    list02.append(importances[indices[i]])

# Save the ranking to Excel
data_impts = pd.DataFrame({"columns": list01, "importances": list02})
data_impts.to_excel('data_importances.xlsx')

# Bar chart of the importances in the original column order
feature_list = list(X_train.columns)
x_values = list(range(len(importances)))
plt.bar(x_values, importances, orientation='vertical')
plt.xticks(x_values, feature_list, rotation=90)
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
plt.show()
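The `feature_importances_` attribute used above is impurity-based and is computed from the training data. As a cross-check, permutation importance on held-out data is a commonly used alternative. The sketch below is not part of the original workflow: it reuses the article's six column names but fills them with synthetic values, and the target is constructed so that `t1000` matters most.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
cols = ['t1000', 'r1000', 'u1000', 'v1000', 'blh', 'e']

# Synthetic features; the target depends strongly on t1000, weakly on blh.
X_demo = pd.DataFrame(rng.normal(size=(400, 6)), columns=cols)
y_demo = (5.0 * X_demo['t1000'] + 1.0 * X_demo['blh']
          + rng.normal(scale=0.3, size=400))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123)

model = RandomForestRegressor(n_estimators=50, random_state=1)
model.fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in R^2.
result = permutation_importance(model, X_te, y_te,
                                n_repeats=5, random_state=1)
ranking = pd.Series(result.importances_mean, index=cols)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

If the two rankings disagree strongly on the real data, the permutation result is usually the safer one to report, since it reflects performance on data the model has not seen.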
Edited by: CL
Reply "第八星系" to find out how to join the group. The complete code script is available to group members.