Optimizing a Random Forest Model with a Genetic Algorithm (Python)

Digest · 2024-07-26 18:57 · Shandong


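Introduction

Heart disease remains one of the leading causes of death worldwide, and accurate early diagnosis is essential for effective treatment and management. Machine learning models, random forests in particular, have shown promise in predicting heart disease. Tuning these models for the best performance is difficult, however. This article explores using a genetic algorithm (GA) to optimize a random forest model for heart disease classification.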

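What is a random forest?

A random forest is an ensemble learning method that combines many decision trees to improve classification accuracy and limit overfitting. Each tree in the forest is built from a random subset of the training data, and the final prediction is obtained by aggregating the predictions of all the trees.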

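Key random forest hyperparameters

The hyperparameters tuned in this article are:
1. Number of trees (n_estimators): the number of trees in the forest.

2. Maximum depth (max_depth): the maximum depth of each tree.

3. Minimum samples to split (min_samples_split): the minimum number of samples required to split an internal node.

4. Minimum samples per leaf (min_samples_leaf): the minimum number of samples required at a leaf node.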
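
To make the four hyperparameters concrete, here is a minimal sketch of how they are passed to scikit-learn's RandomForestClassifier. The synthetic data and the specific values are assumptions chosen for illustration only; they are not the article's dataset or its tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data used only for illustration (not the heart disease dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative values chosen arbitrarily, not tuned.
clf = RandomForestClassifier(
    n_estimators=50,       # number of trees in the forest
    max_depth=5,           # maximum depth of each tree
    min_samples_split=4,   # minimum samples needed to split an internal node
    min_samples_leaf=2,    # minimum samples required at a leaf node
    random_state=0,
)
clf.fit(X, y)

# Check that the fitted forest respects the limits that were set.
deepest = max(tree.get_depth() for tree in clf.estimators_)
print(len(clf.estimators_), deepest)
```

The fitted forest contains exactly n_estimators trees, and no tree grows deeper than max_depth; the two minimum-sample settings limit how finely each tree can partition the data.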

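What is a genetic algorithm?

A genetic algorithm (GA) is an optimization technique inspired by the principles of natural selection and genetics. It simulates evolution by selecting the best solutions over successive generations. GAs are particularly useful when the search space is large and complex and traditional optimization methods become impractical.

Core concepts of a genetic algorithm:
1. Population: a collection of candidate solutions (individuals).

2. Chromosome: the representation of an individual's solution as a sequence of parameters (genes).

3. Fitness function: scores how good each solution is according to a criterion specific to the problem at hand.

4. Selection: the process of choosing better-performing individuals to reproduce.

5. Crossover: combining two individuals to produce offspring with a mix of their traits.

6. Mutation: introducing random changes to an individual's genes to maintain genetic diversity.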
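
The six concepts above can be sketched in a few lines of plain Python. This is a toy example under stated assumptions, not the article's method: the problem (OneMax, maximizing the number of 1s in a bit string), the function names, and all parameter values are chosen here for illustration, and elitism is added so that the best solution is never lost.

```python
import random

def fitness(chromosome):
    """Fitness function: the number of 1-genes (the OneMax toy problem)."""
    return sum(chromosome)

def tournament(population, rng, k=3):
    """Selection: return the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)

def crossover(a, b, rng):
    """One-point crossover: the child takes a prefix of a and a suffix of b."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(chromosome, rng, rate=0.05):
    """Mutation: flip each gene independently with probability rate."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def run_ga(n_genes=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    # Population: a list of chromosomes, each a list of 0/1 genes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    initial_best = max(map(fitness, population))
    for _ in range(generations):
        elite = max(population, key=fitness)  # elitism: keep the current best
        children = [
            mutate(crossover(tournament(population, rng),
                             tournament(population, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return initial_best, max(population, key=fitness)

initial_best, best = run_ga()
print(initial_best, fitness(best))
```

Because the elite individual is copied unchanged into every new generation, the best fitness can only stay the same or increase from one generation to the next.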

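Implementation

Data preparation

The data used in this article is the "Heart Disease Classification Dataset", which is available on Kaggle:

https://www.kaggle.com/datasets/bharath011/heart-disease-classification-dataset

import pandas as pd

# Load the dataset from a local copy of the Kaggle CSV file
df = pd.read_csv('./Data/Heart Attack.csv')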

Data exploration

This section explores the data with visualizations such as bar charts and histograms.

Descriptive statistics
We can look at descriptive statistics such as the mean, standard deviation, minimum, maximum, and quartiles.

import matplotlib.pyplot as plt

# Descriptive statistics for every feature column (the 'class' label is excluded)
features = df.drop('class', axis=1)
stat = features.describe()

# Exclude the 'count' row from the statistics for all columns
stat = stat.drop(['count'])

# Set the figure size and a 2x4 grid of subplots
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(20, 10))
fig.suptitle('Descriptive Statistics', fontsize=16)

# Iterate over each column and create bar plots
for i, column in enumerate(stat.columns):
    row_num = i // 4
    col_num = i % 4

    # Select column statistics
    column_stats = stat[column]

    # Draw the statistics as a bar chart in the matching subplot
    axes[row_num, col_num].bar(column_stats.index, column_stats.values, color='skyblue')
    axes[row_num, col_num].set_title(f'Statistics for {column}')
    axes[row_num, col_num].set_ylabel('Values')
    axes[row_num, col_num].tick_params(axis='x', rotation=45)

# Adjust layout to prevent overlap
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

Data distribution
We can also inspect the distribution of the data with histograms. The variables plotted are age, gender, systolic pressure (pressurehight), diastolic pressure (pressurelow), pulse (impluse), glucose, kcm, and troponin.

# The columns for which we want to make a histogram
# (names are spelled as they appear in the dataset)
columns = ['age', 'gender', 'pressurehight', 'pressurelow', 'impluse', 'glucose', 'kcm', 'troponin']

# Create figure and axes for a 2x4 layout
fig, axes = plt.subplots(2, 4, figsize=(20, 10))

# Loop through columns and axes to create a histogram
for i, column in enumerate(columns):
    ax = axes[i // 4, i % 4]
    ax.hist(df[column], bins=20)
    ax.set_title(f'Histogram of {column.capitalize()}')
    ax.spines[['top', 'right']].set_visible(False)
    ax.set_xlabel(column.capitalize())
    ax.set_ylabel('Frequency')

# Adjust layout
plt.tight_layout()
plt.show()

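Class counts by gender
We can also draw a bar chart showing the number of samples in each class for each gender.

Figure: bar chart of class counts by gender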
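
The article shows this chart without the code that produced it. One way such counts could be computed is with pandas.crosstab, sketched below. The small DataFrame is a hypothetical stand-in for df, with made-up values; only the column names 'gender' and 'class' come from the article.

```python
import pandas as pd

# Hypothetical stand-in for the article's DataFrame; only the two
# columns needed for the chart are included, with made-up values.
sample = pd.DataFrame({
    "gender": [1, 1, 0, 0, 1, 0, 1, 0],
    "class": ["positive", "negative", "positive", "positive",
              "negative", "negative", "positive", "positive"],
})

# Rows are genders, columns are classes, cells are sample counts.
counts = pd.crosstab(sample["gender"], sample["class"])
print(counts)

# With matplotlib installed, a grouped bar chart follows directly:
# counts.plot(kind="bar", rot=0, title="Class counts by gender")
```

Applied to the real data, the same call would be pd.crosstab(df['gender'], df['class']).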

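Modeling

Baseline random forest

We first build a random forest model without any optimization, then optimize it with a genetic algorithm.

1. Separate x and y

# Separate the features (x) and the target (y).
# Note: scaled_df is never defined in the article; it is presumably the
# feature columns of df after a scaling step that was left out.
x = scaled_df
y = df['class']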
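
The article never shows how scaled_df is created. The sketch below is one plausible reconstruction, assuming the features were standardized with scikit-learn's StandardScaler; the author may have used a different scaler. The miniature DataFrame and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the dataset: two feature columns plus the label.
df_demo = pd.DataFrame({
    "age": [64, 21, 55, 58, 32],
    "glucose": [160.0, 296.0, 270.0, 87.0, 102.0],
    "class": ["negative", "positive", "negative", "positive", "negative"],
})

# Scale only the features; the label column is left untouched.
features = df_demo.drop("class", axis=1)
scaler = StandardScaler()
scaled_demo = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(scaled_demo.round(2))
```

Random forests split on thresholds, so they are largely insensitive to this kind of per-feature rescaling; the step matters mostly if the same features are later fed to a scale-sensitive model.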

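2. Split the data

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

3. Create the model

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train the baseline model with default hyperparameters
model = RandomForestClassifier().fit(X_train, y_train)
prediction = model.predict(X_test)

# Evaluate the model on the test set
print(confusion_matrix(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))

4. Results

This is the result for the baseline random forest model, without the genetic algorithm: the accuracy is 96%. Next, we use a genetic algorithm to try to improve on it.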

Genetic algorithm optimization

Step 1: Set up the environment

pip install numpy pandas scikit-learn deap

Step 2: Set up the DEAP framework
The DEAP library helps implement evolutionary algorithms. Here we set up the DEAP framework to perform the hyperparameter optimization.

import random

import numpy as np
from deap import algorithms, base, creator, tools
from sklearn.model_selection import cross_val_score

# Define the evaluation function
def evaluate(individual):
    # Extract the hyperparameters
    n_estimators, max_depth, min_samples_split, min_samples_leaf = individual

    # Create the classifier
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                 min_samples_split=min_samples_split,
                                 min_samples_leaf=min_samples_leaf, random_state=42)

    # Evaluate the classifier using 5-fold cross-validation on the training set
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    return np.mean(scores),  # DEAP expects the fitness as a tuple

# Set up the DEAP framework: maximize a single fitness value
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()

# Attribute generators: one random integer per hyperparameter
toolbox.register("n_estimators", random.randint, 10, 200)
toolbox.register("max_depth", random.randint, 1, 20)
toolbox.register("min_samples_split", random.randint, 2, 20)
toolbox.register("min_samples_leaf", random.randint, 1, 20)

# Structure initializers: an individual is a list of the four hyperparameters
toolbox.register("individual", tools.initCycle, creator.Individual,
                 (toolbox.n_estimators, toolbox.max_depth,
                  toolbox.min_samples_split, toolbox.min_samples_leaf), n=1)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

# Genetic operators; the mutation bounds match the generator ranges above
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=[10, 1, 2, 1], up=[200, 20, 20, 20], indpb=0.2)
toolbox.register("select", tools.selTournament, tournsize=3)
toolbox.register("evaluate", evaluate)
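
Each call to evaluate runs a full 5-fold cross-validation, and a GA can produce the same hyperparameter combination more than once. An optional refinement, not part of the original article, is to cache fitness values by hyperparameter tuple. The sketch below uses a hypothetical helper, make_cached, and a cheap stand-in for the real evaluation function so that it runs on its own.

```python
def make_cached(evaluate_fn):
    """Wrap a DEAP-style evaluation function with a dictionary cache."""
    cache = {}

    def cached_evaluate(individual):
        key = tuple(individual)  # lists are unhashable, tuples are not
        if key not in cache:
            cache[key] = evaluate_fn(individual)
        return cache[key]

    cached_evaluate.cache = cache  # exposed for inspection
    return cached_evaluate


# Hypothetical stand-in for the expensive cross-validation call.
calls = {"count": 0}

def slow_evaluate(individual):
    calls["count"] += 1
    return (sum(individual) / 1000.0,)  # fitness as a one-element tuple


cached = make_cached(slow_evaluate)
first = cached([32, 15, 13, 7])
second = cached([32, 15, 13, 7])   # served from the cache
third = cached([100, 10, 2, 1])    # new combination, evaluated once
print(calls["count"], first == second)
```

In the article's setup this would be wired in with toolbox.register("evaluate", make_cached(evaluate)). Caching is safe here only because evaluate is deterministic: it fixes random_state=42 and uses unshuffled cross-validation folds.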

Step 3: Run the genetic algorithm
With the DEAP framework set up, we can now run the genetic algorithm to search for the best hyperparameters.

def main():
    random.seed(42)  # For reproducibility
    pop = toolbox.population(n=50)  # Create a population of 50 individuals
    hof = tools.HallOfFame(1)  # Hall of Fame to store the best solution

    # Track fitness statistics for each generation
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", np.mean)
    stats.register("std", np.std)
    stats.register("min", np.min)
    stats.register("max", np.max)

    # Run the genetic algorithm
    pop, log = algorithms.eaSimple(
        pop,
        toolbox,
        cxpb=0.5,  # Crossover probability
        mutpb=0.2,  # Mutation probability
        ngen=40,  # Number of generations
        stats=stats,
        halloffame=hof,
        verbose=True
    )

    return pop, log, hof

if __name__ == "__main__":
    pop, log, hof = main()
    print("Best individual is: ", hof[0])
    print("Best accuracy: ", evaluate(hof[0])[0])

Based on these results, the best hyperparameters found by the GA are n_estimators=32, max_depth=15, min_samples_split=13, and min_samples_leaf=7.

Step 4: Train a new model with the GA's hyperparameters

# Create a new model with the hyperparameters found by the GA
model = RandomForestClassifier(n_estimators=32, max_depth=15,
                               min_samples_split=13, min_samples_leaf=7).fit(X_train, y_train)
prediction = model.predict(X_test)

# Evaluate the model on the test set
print(confusion_matrix(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))

After retraining with the GA's hyperparameters, the model reaches 97% accuracy, 1 percentage point higher than the model without the GA. The GA therefore improved the random forest's performance, though only slightly.
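
Neither model in the article is trained with a fixed random_state, so part of a one-point difference could come from run-to-run variation in the forest itself. One way to gauge this, sketched below, is to repeat training over several seeds for both configurations and compare the mean and spread. The synthetic data, the helper accuracy_over_seeds, and the choice of five seeds are assumptions made for illustration; the numbers it prints say nothing about the heart disease dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the article uses the Kaggle dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def accuracy_over_seeds(params, seeds):
    """Test accuracy of a random forest for each random_state in seeds."""
    scores = []
    for seed in seeds:
        clf = RandomForestClassifier(random_state=seed, **params)
        clf.fit(X_train, y_train)
        scores.append(clf.score(X_test, y_test))
    return np.array(scores)

seeds = range(5)
baseline = accuracy_over_seeds({"n_estimators": 100}, seeds)
tuned = accuracy_over_seeds(
    {"n_estimators": 32, "max_depth": 15,
     "min_samples_split": 13, "min_samples_leaf": 7},
    seeds,
)
print(f"baseline: {baseline.mean():.3f} +/- {baseline.std():.3f}")
print(f"tuned:    {tuned.mean():.3f} +/- {tuned.std():.3f}")
```

If the gap between the two means is smaller than the spread across seeds, the single-run comparison alone cannot separate the two configurations.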

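Conclusion

Using a genetic algorithm to optimize random forest hyperparameters is an effective way to improve the performance of a heart disease classification model. The DEAP library provides a flexible framework for implementing evolutionary algorithms, which makes it possible to search for good hyperparameters efficiently. With this technique, practitioners can make their machine learning models more accurate and reliable, which in turn supports better predictive analytics in healthcare and other fields.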

