Hands-On | A Roundup of GitHub Repositories for the Full Data Science Workflow

Digest / Other, 2024-01-27 21:28, Japan

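0 Preface

Two years ago, this account published a roundup of data science resources titled "Some Resources on Data Science (Python)". Compared with that edition, this one focuses more on probabilistic and statistical models and on the related automated machine learning (AutoML) techniques.

While compiling the list, I removed repositories that are no longer available or have gone unmaintained for more than four years. References are collected at the end of the article for readers who want to learn more about individual resources.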

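Top-level section	Subsection
Data generation
Feature engineering
	Exploratory analysis
	Missing-value imputation
	Feature selection
	Feature derivation
	Anomaly detection
	Imbalanced data handling
Model search
Hyperparameter optimization
Model interpretation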

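As usual, here are some interesting curated lists, each of which is a roundup for its own field.

Name	URL	Description
awesome-AutoML	https://github.com/windmaple/awesome-AutoML	A curated list of AutoML research, tools, projects, and other resources
awesome-AutoML-and-Lightweight-Models	https://github.com/guan-yuan/awesome-AutoML-and-Lightweight-Models	A list of high-quality (recent) AutoML work and lightweight models
awesome-imbalanced-learning	https://github.com/ZhiningLiu1998/awesome-imbalanced-learning	Everything about class-imbalanced/long-tail learning: papers, code, frameworks, and libraries
Awesome Learning with Label Noise	https://github.com/subeeshvasu/Awesome-Learning-with-Label-Noise	A curated list of resources for learning with noisy labels
Awesome Time Series Anomaly Detection	https://github.com/rob-med/awesome-TS-anomaly-detection	A list of tools and datasets for anomaly detection on time-series data
Awesome Causality	https://github.com/rguo12/awesome-causality-algorithms	A list of algorithms for learning causality from data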

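1 Data Generation

Name	URL	Description
timeseries-generator	https://github.com/Nike-Inc/timeseries-generator	A library for generating synthetic time-series data through easy-to-use factors and generators
DeepEcho	https://github.com/sdv-dev/DeepEcho	Synthetic data generation for mixed-type, multivariate time series
SDV	https://github.com/sdv-dev/SDV	Synthetic data generation for tabular data
ydata-synthetic	https://github.com/ydataai/ydata-synthetic	Synthetic data generators for tabular and time-series data
time_series_simulator	https://github.com/IPavlak/time_series_simulator	Primarily for visualizing and simulating stock prices (candlestick charts), but can be adapted to any time series
GAN-for-tabular-data	https://github.com/Diyago/GAN-for-tabular-data	Practical application of GANs to tabular data
picka	https://github.com/antlong/picka	A Python module for data generation and randomization
faker	https://github.com/joke2k/faker	A Python package that generates fake data for you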

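2 Feature Engineering

2.1 Exploratory Analysis

Name	URL	Description
dtale	https://github.com/man-group/dtale	Visualizer for pandas data structures
ydata-profiling	https://github.com/ydataai/ydata-profiling	One-line data quality profiling and exploratory data analysis for Pandas and Spark DataFrames
sweetviz	https://github.com/fbdesignpro/sweetviz	Visualize and compare datasets, target values, and associations with one line of code
AutoViz	https://github.com/AutoViML/AutoViz	Automatically visualize any dataset of any size with one line of code
Rath	https://github.com/Kanaries/Rath	Next-generation platform for automated exploratory data analysis and visualization
lux	https://github.com/lux-org/lux	Automatically visualize your pandas dataframe with a single print
dataprep	https://github.com/sfu-db/dataprep	Open-source low-code data preparation library in Python
KLib	https://github.com/akanz1/klib	Easy-to-use Python library of custom functions for cleaning and analyzing data
dabl	https://github.com/dabl/dabl	Data Analysis Baseline Library
edaviz	https://github.com/tkrabel/edaviz	Python library for exploratory data analysis and visualization in Jupyter Notebook or Jupyter Lab
PyGWalker	https://github.com/Kanaries/pygwalker	Turns a pandas dataframe into an interactive UI for visual analysis
missingno	https://github.com/ResidentMario/missingno	Visualization of missing values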

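2.2 Missing-Value Imputation

Name	URL	Description
ycimpute	https://github.com/OpenIDEA-YunanUniversity/ycimpute	Machine-learning-based missing-value imputation library. It implements missForest, a simplified version of MICE (the R package), kNN, EM, and more
MissForest	https://github.com/yuenshingyan/MissForest	Reportedly the best missing-value imputation method
pygmmis	https://github.com/pmelchior/pygmmis	Gaussian mixture models for incomplete (missing or truncated) and noisy data
impyute	https://github.com/eltonlaw/impyute	Data imputation library for preprocessing datasets with missing data
missingpy	https://github.com/epsilon-machine/missingpy	A package for missing-data imputation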

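2.3 Feature Selection

Name	URL	Description
feature-selector	https://github.com/WillKoehrsen/feature-selector	Dimensionality reduction tool for machine learning datasets
feature_selection_GAAlgorithm	https://github.com/rogeroyer/feature_selection_GAAlgorithm	Feature selection based on a genetic algorithm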

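2.4 Feature Derivation

Name	URL	Description
featuretools	https://github.com/alteryx/featuretools/tree/main	Open-source Python library for automated feature engineering
tsfresh	https://github.com/blue-yonder/tsfresh	Automatic extraction of relevant features from time series
feature_engine	https://github.com/feature-engine/feature_engine	Feature engineering package with scikit-learn-like functionality
tsfel	https://github.com/fraunhoferportugal/tsfel	An intuitive library for extracting features from time series
NitroFE	https://github.com/NITRO-AI/NitroFE	Feature engineering engine whose modules internally save past dependent values to support continuous calculation
EvolutionaryForest	https://github.com/zhenlingcn/EvolutionaryForest	Open-source Python library for automated feature engineering based on genetic programming
AutoX	https://github.com/4paradigm/AutoX	An efficient AutoML tool, mainly aimed at data mining competitions on tabular data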

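2.5 Anomaly Detection

Name	URL	Description
tslumen	https://github.com/hsbc/tslumen	Library for time-series EDA (exploratory data analysis)
pyod	https://github.com/yzhao062/pyod	A comprehensive and scalable Python library for outlier detection (anomaly detection)
darts	https://github.com/unit8co/darts	A Python library for user-friendly forecasting and anomaly detection on time series
alibi-detect	https://github.com/SeldonIO/alibi-detect	Algorithms for outlier, adversarial, and drift detection
flow-forecast	https://github.com/AIStream-Peelout/flow-forecast	Deep learning PyTorch library for time-series forecasting, classification, and anomaly detection (originally built for flood forecasting)
surpriver	https://github.com/tradytics/surpriver	Uses anomaly detection and machine learning to find high-moving stocks before they move
luminol	https://github.com/linkedin/luminol	Anomaly detection and correlation library
adtk	https://github.com/arundo/adtk	A Python toolkit for rule-based/unsupervised anomaly detection in time series
datastream	https://github.com/MentatInnovations/datastream.io	An open-source framework for real-time anomaly detection using Python, ElasticSearch, and Kibana
pygod	https://github.com/pygod-team/pygod	A Python library for graph outlier detection (anomaly detection)
Hampel	https://github.com/MichaelisTrofficus/hampel_filter	Python implementation of the Hampel filter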
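
To give a feel for what a Hampel filter does, here is a minimal standalone sketch using only the Python standard library. It is not the API of the hampel_filter repository listed above; the function name `hampel_filter`, its parameters, and the sample series are assumptions made for illustration. Each point is compared with the median of a sliding window, and it is flagged when it deviates by more than `n_sigmas` scaled median absolute deviations (MAD).

```python
from statistics import median


def hampel_filter(values, window_size=3, n_sigmas=3.0):
    """Hypothetical helper: return (cleaned_values, outlier_indices).

    A point is treated as an outlier when its distance from the window
    median exceeds n_sigmas * 1.4826 * MAD, where 1.4826 makes the MAD
    comparable to a standard deviation for Gaussian data.
    """
    scale = 1.4826
    n = len(values)
    cleaned = list(values)
    outliers = []
    for i in range(n):
        # Window of up to window_size points on each side, clipped at the edges.
        lo = max(0, i - window_size)
        hi = min(n, i + window_size + 1)
        window = values[lo:hi]
        med = median(window)
        mad = median(abs(v - med) for v in window)
        if abs(values[i] - med) > n_sigmas * scale * mad:
            cleaned[i] = med  # replace the outlier with the window median
            outliers.append(i)
    return cleaned, outliers


# Illustrative series with a single obvious spike at index 4.
series = [1.0, 1.1, 0.9, 1.0, 10.0, 1.05, 0.95, 1.0, 1.1]
cleaned, outliers = hampel_filter(series)
print(outliers)
print(cleaned)
```

A wider window makes the median more stable but slower to follow genuine level shifts, so `window_size` and `n_sigmas` usually need tuning per series.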

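2.6 Imbalanced Data Handling

Name	URL	Description
imbalanced-learn	https://github.com/scikit-learn-contrib/imbalanced-learn	Tackles the problem of imbalanced datasets in machine learning
imbalanced-algorithms	https://github.com/dialnd/imbalanced-algorithms	Algorithms for learning from imbalanced data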
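
One of the simplest ideas behind these libraries is random oversampling: duplicate minority-class samples until every class is as large as the biggest one. The sketch below shows that idea in plain Python. It is not the imbalanced-learn API; `random_oversample` and the toy data are hypothetical and exist only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Hypothetical helper: balance classes by duplicating minority rows."""
    rng = random.Random(seed)  # local generator so results are reproducible
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: four samples of class 0, one sample of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [5.0]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only; duplicating rows before splitting leaks copies of the same sample into the evaluation data.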

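3 Model Search

Name	URL	Description
model_search	https://github.com/google/model_search	A framework that implements AutoML algorithms for model architecture search at scale
blobcity	https://github.com/blobcity/autoai/tree/main	Python-based automatic AI framework for regression and classification on numerical data. Performs model search, hyperparameter tuning, and high-quality Jupyter Notebook code generation
autofeat	https://github.com/cod3licious/autofeat	Linear prediction models with automated feature engineering and selection
lazypredict	https://github.com/shankarpandala/lazypredict	Helps build many basic models with little code and shows which models work better without any parameter tuning
LightAutoML	https://github.com/sberbank-ai-lab/LightAutoML	Automatic model creation framework
AutoML_Alex	https://github.com/Alex-Lekov/AutoML_Alex	State-of-the-art automated machine learning Python library for tabular data
pycaret	https://github.com/pycaret/pycaret	Open-source low-code machine learning library in Python
Sketch	https://github.com/approximatelabs/sketch	AI code-writing assistant that understands data content
tpot	https://github.com/EpistasisLab/tpot	A Python automated machine learning tool that optimizes machine learning pipelines using genetic programming
auto_ml	https://github.com/ClimbsRocks/auto_ml	Automated machine learning for analytics and production
auto-sklearn	https://github.com/automl/auto-sklearn	Automated machine learning with scikit-learn
mljar	https://github.com/mljar/mljar-supervised	Automated machine learning
AlphaPy	https://github.com/ScottfreeLLC/AlphaPy	Automated machine learning with scikit-learn, xgboost, LightGBM, and others
MLBox	https://github.com/AxeldeRomblay/MLBox	A powerful automated machine learning Python library
FLAML	https://github.com/microsoft/FLAML	A fast library for AutoML and hyperparameter tuning
Hypernets	https://github.com/DataCanvasIO/Hypernets	A general automated machine learning framework that simplifies the development of end-to-end AutoML toolkits for specific domains
Cooka	https://github.com/DataCanvasIO/Cooka	A lightweight visual AutoML system
auptimizer	https://github.com/LGE-ARC-AdvancedAI/auptimizer	An automatic ML model optimization tool
evalml	https://github.com/alteryx/evalml	An AutoML library written in Python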

4 Hyperparameter Optimization

Name	URL	Description
sklearn	https://scikit-learn.org/stable/index.html	GridSearchCV, RandomizedSearchCV
hyperopt	https://github.com/hyperopt/hyperopt	Hyperparameter optimization
optuna	https://github.com/pfnet/optuna	Hyperparameter optimization
dragonfly	https://github.com/dragonfly/dragonfly	Scalable Bayesian optimization
katib	https://github.com/kubeflow/katib	Repository for hyperparameter tuning
scikit-optimize	https://github.com/scikit-optimize/scikit-optimize	Sequential model-based optimization with a `scipy.optimize` interface
BayesianOptimization	https://github.com/bayesian-optimization/BayesianOptimization	Python implementation of global optimization with Gaussian processes
SMAC3	https://github.com/automl/SMAC3	A versatile Bayesian optimization package for hyperparameter optimization
spearmint	https://github.com/JasperSnoek/spearmint	A package for performing Bayesian optimization
HpBandSter	https://github.com/automl/HpBandSter	Distributed hyperparameter optimization
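
The sklearn row above names `GridSearchCV` and `RandomizedSearchCV`. The following is a minimal sketch of how the two differ, assuming scikit-learn is installed; the iris dataset, the decision tree model, and the parameter grid are arbitrary choices made for illustration. Grid search evaluates every combination in the grid, while randomized search evaluates only `n_iter` sampled combinations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative search space: 3 x 2 = 6 combinations.
param_grid = {
    "max_depth": [2, 3, None],
    "min_samples_leaf": [1, 5],
}

# Exhaustive search: fits every combination with 3-fold cross-validation.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)

# Randomized search: fits only n_iter combinations sampled from the same space.
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=4,
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid search best params:", grid.best_params_)
print("randomized search best params:", rand.best_params_)
```

Randomized search becomes the more practical option as the search space grows, because its cost is fixed by `n_iter` and does not scale with the size of the grid.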

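5 Model Interpretation

Name	URL	Description
shap	https://github.com/slundberg/shap	A game-theoretic approach to explaining the output of any machine learning model
lime	https://github.com/marcotcr/lime	Explains the predictions of any machine learning classifier
eli5	https://github.com/TeamHG-Memex/eli5	Inspects machine learning classifiers and explains their predictions
lofo-importance	https://github.com/aerdem4/lofo-importance	Leave One Feature Out Importance
pdpbox	https://github.com/SauceCat/PDPbox	Partial dependence plot toolbox
anchor	https://github.com/marcotcr/anchor	High-precision model-agnostic explanations for classifiers
interpretml	https://github.com/interpretml/interpret	Fits interpretable models and explains models
shapash	https://github.com/MAIF/shapash	User-friendly explainability and interpretability for developing reliable and transparent machine learning models
imodels	https://github.com/csinva/imodels	Interpretable ML package for concise, transparent, and accurate predictive modeling (sklearn-compatible)
TrustScore	https://github.com/google/TrustScore	An uncertainty measure for any trained (possibly black-box) classifier that is more effective than the classifier's own implied confidence
dtreeviz	https://github.com/parrt/dtreeviz	A Python library for decision tree visualization and model interpretation. Currently supports scikit-learn, XGBoost, Spark MLlib, and LightGBM trees
alibi	https://github.com/SeldonIO/alibi	Algorithms for explaining machine learning models
fairlearn	https://github.com/fairlearn/fairlearn	Assesses and improves the fairness of machine learning models
yellowbrick	https://github.com/DistrictDataLabs/yellowbrick	Visual analysis and diagnostic tools to facilitate machine learning model selection
mlxtend	https://github.com/rasbt/mlxtend	A library of extension and helper modules for Python's data analysis and machine learning libraries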

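6 References

"Just a Few Lines of Python Code for Comprehensive Automated Exploratory Data Analysis!": https://mp.weixin.qq.com/s/x-BLWUmd49UoyhBiNim49Q
"Lux: A New Secret Weapon for Automated Data Exploration": https://mp.weixin.qq.com/s/lbxfaevDzWrd6g5coBOgXA
"Explore Data Visually with These Python Tools | Linux China": https://mp.weixin.qq.com/s/ryv2eHNykqfRPkVur-eKCg
"Feature-engine: A Complete Feature Engineering Python Library for End-to-End Feature Pipelines": https://mp.weixin.qq.com/s/YXR3_wgK4pr0hBy3dtRBxg
"missingno: A Python Library for Visualizing Missing Values": https://mp.weixin.qq.com/s/djyerwjGnYNAtz4-YpwRJA
"10 Excellent Open-Source Python Libraries for Anomaly Detection": https://mp.weixin.qq.com/s/XIP5sy7f_2-t-1jKsxfZTA
"The Hampel Filter: A Powerful Tool for Outlier Detection in Time Series": https://mp.weixin.qq.com/s/_BD9w59-1baR5JHEuyRKmw
"Prophet, a Powerful Time-Series Forecasting Tool: Python Implementation": https://mp.weixin.qq.com/s/dgWjk1LkszyieuQtHMaQQg
"Practical Python Skills 04: tsfel, Automatic Feature Extraction for Time-Series Data": https://mp.weixin.qq.com/s/JKLS70lYi7FamO49C5qc6w
"Practical Python Skills 02: tsfresh, Automatic Feature Extraction for Time-Series Data": https://mp.weixin.qq.com/s/MWwYhD7oqpATXpNH5hm-xQ
"Exclusive | Automated Feature Engineering with the Python Featuretools Library (with Links)": https://mp.weixin.qq.com/s/hfevCfAW1pMAhIilxdrWoA
"Top 10 Python Libraries for Handling Imbalanced Data (with Code)": https://mp.weixin.qq.com/s/iDrKTyV602OxR9lGbM0Zfg
"A Comprehensive Summary of Machine Learning Hyperparameter Tuning (with Code)": https://mp.weixin.qq.com/s/VzDDSBKT6SjhDoH4FvKkKw
"6 Python Toolkits for Machine Learning Model Interpretability: One of Them Will Suit You!": https://mp.weixin.qq.com/s/vLOC9WWEZOeZhmCLbnzrAA
"Model Interpretation: Feature Importance, LIME, and SHAP": https://mp.weixin.qq.com/s/ytyUjlkq92fQ5u5TwLTOIw
"10 Python Libraries for Explainable AI": https://mp.weixin.qq.com/s/p5uwcK-0ms7TfZHg0LrDww
"17 Top MLOps Tools You Need to Know": https://mp.weixin.qq.com/s/mojJYeyI5O_rX6i9igQmMA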

http://mp.weixin.qq.com/s?__biz=Mzg2NjcyNzg3NQ==&mid=2247486896&idx=1&sn=ab30901cc1d365d21b8843dfa16997e0
