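In data mining we use logistic regression all the time. Stepwise regression is an automated variable-selection algorithm that many university professors teach, and several third-party scorecard-building libraries even ship with it built in. My accumulated experience on machine learning projects says stepwise regression is sometimes useful: when many variables are highly correlated, it can cut model dimensionality and reduce multicollinearity in a logistic regression model. Of course, stepwise regression does not eliminate multicollinearity entirely; it only improves the situation considerably. Multicollinearity is very hard to remove completely.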
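To make the procedure concrete, here is a minimal sketch of forward stepwise selection. It is not the implementation used by any scorecard library. For brevity it fits ordinary least squares with NumPy instead of logistic regression, and it uses AIC as the entry criterion, which is my assumption (p-value and BIC criteria are also common). The function names `ols_rss`, `aic`, and `forward_stepwise` and the synthetic data are made up for illustration.

```python
import numpy as np


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def aic(X, y):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    n = len(y)
    k = X.shape[1] + 1  # slopes plus the intercept
    return n * np.log(ols_rss(X, y) / n) + 2 * k


def forward_stepwise(X, y):
    """Greedily add the column that lowers AIC most; stop when none does."""
    selected = []
    remaining = list(range(X.shape[1]))
    best = aic(np.empty((len(y), 0)), y)  # intercept-only model
    while remaining:
        score, j = min((aic(X[:, selected + [c]], y), c) for c in remaining)
        if score >= best:
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected


# Synthetic data (an assumption for illustration): x1 is a near copy of x0.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 + 0.01 * rng.normal(size=n)  # highly correlated with x0
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)              # unrelated to y
X = np.column_stack([x0, x1, x2, x3])
y = 2.0 * x0 + 1.0 * x2 + rng.normal(size=n)

selected = forward_stepwise(X, y)
print("selected columns:", selected)
```

On data like this the procedure typically keeps only one column of the highly correlated pair, which is the dimensionality reduction described above. Which of the two it keeps can depend on noise.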
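The figure below shows a stepwise regression project on the breast cancer dataset: the number of model dimensions was cut in half, yet model performance improved slightly. In this case, stepwise regression worked.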
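When the variables are not highly correlated, I think you can skip stepwise regression, because using it can actually lower model performance. The figure below is a test on the Give Me Some Credit dataset, where model performance dropped slightly after stepwise regression.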
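Today I watched a video in which a professor from a well-known Chinese university explains stepwise regression, using an analysis of Qingdao's fiscal revenue as the case study. He forced many of his own views onto the stepwise regression results. He overemphasizes the role of GDP in the economy, which I don't think is sound. I do agree with his positive view of manufacturing and industry. The economy is a very complex model with complex interactions among variables, and I think explaining it with stepwise regression alone is incomplete.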
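I see stepwise regression as one variable-selection method, but it should not be mythologized. It remains controversial. The automated selection process uses the same dataset throughout, which makes overfitting likely. Stepwise regression can also exclude valuable variables, leaving the model too simple. There are many other points of contention that I won't go through one by one.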
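The overfitting point can be illustrated with a small simulation. This is a sketch under stated assumptions, not a benchmark: the target is pure noise, unrelated to any of the 50 candidate columns, and a fixed-size forward selection (the helper names `fit`, `r2`, and `forward_select` are made up here) picks 5 of them by in-sample fit. Because selection and fitting reuse the same data, the in-sample R² looks respectable, while on fresh data it falls away.

```python
import numpy as np


def fit(X, y):
    """Least-squares coefficients with an intercept in position 0."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def r2(X, y, beta):
    """Coefficient of determination of fixed coefficients on (X, y)."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())


def forward_select(X, y, k):
    """Add, k times, the column that most improves in-sample R^2."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in chosen]

        def score(j):
            cols = chosen + [j]
            return r2(X[:, cols], y, fit(X[:, cols], y))

        chosen.append(max(rest, key=score))
    return chosen


# Pure noise: y has no relationship to any column of X (by construction).
rng = np.random.default_rng(0)
n, p, k = 100, 50, 5
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_train, y_test = rng.normal(size=n), rng.normal(size=n)

cols = forward_select(X_train, y_train, k)
beta = fit(X_train[:, cols], y_train)
r2_train = r2(X_train[:, cols], y_train, beta)
r2_test = r2(X_test[:, cols], y_test, beta)
print(f"in-sample R^2: {r2_train:.3f}, fresh-data R^2: {r2_test:.3f}")
```

The gap between the two numbers is the optimism introduced by selecting and fitting on the same dataset, which is why the selected model should be checked on data that played no part in the selection.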
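As I said: stepwise regression is a method. As long as it reduces model dimensionality, gives satisfactory model performance, and leaves variables that the business side can interpret, it is fine to use. Just don't mythologize it or overstate what it does.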
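Machine learning is a rigorous discipline. I hope you will all use it with care, and fully understand an algorithm's pros and cons and when it is appropriate to use.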
Finally, here is some English-language commentary on stepwise regression.
Criticism
Stepwise regression procedures are used in data mining, but are controversial. Several points of criticism have been made.
The tests themselves are biased, since they are based on the same data. Wilkinson and Dallal (1981) computed percentage points of the multiple correlation coefficient by simulation and showed that a final regression obtained by forward selection, said by the F-procedure to be significant at 0.1%, was in fact only significant at 5%.
When estimating the degrees of freedom, the number of the candidate independent variables from the best fit selected may be smaller than the total number of final model variables, causing the fit to appear better than it is when adjusting the r² value for the number of degrees of freedom. It is important to consider how many degrees of freedom have been used in the entire model, not just count the number of independent variables in the resulting fit.
Models that are created may be over-simplifications of the real models of the data.
Such criticisms, based upon limitations of the relationship between a model and procedure and data set used to fit it, are usually addressed by verifying the model on an independent data set, as in the PRESS procedure.
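For readers who have not met PRESS (predicted residual error sum of squares): it is leave-one-out validation for a least-squares fit, where each observation is predicted by a model fitted without it. For ordinary least squares it can be computed from a single fit, because the leave-one-out residual equals e_i / (1 - h_ii), with h_ii the leverage of observation i. Below is a minimal NumPy sketch. The function name `press` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def press(X, y):
    """PRESS for an OLS fit with intercept, via the leverage shortcut."""
    A = np.column_stack([np.ones(len(y)), X])
    Q, _ = np.linalg.qr(A)            # reduced QR; hat matrix is Q @ Q.T
    leverage = (Q ** 2).sum(axis=1)   # diagonal of the hat matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(((resid / (1.0 - leverage)) ** 2).sum())


# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Ordinary in-sample residual sum of squares, for comparison.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
rss = float(((y - A @ beta) ** 2).sum())

press_value = press(X, y)
print(f"RSS: {rss:.2f}, PRESS: {press_value:.2f}")
```

PRESS is never smaller than the in-sample residual sum of squares, so it gives a less flattering picture of the fit. Note that computing it only on the columns a stepwise procedure has already chosen does not account for the selection step itself.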
Critics regard the procedure as a paradigmatic example of data dredging, intense computation often being an inadequate substitute for subject area expertise. Additionally, the results of stepwise regression are often used incorrectly without adjusting them for the occurrence of model selection. Especially the practice of fitting the final selected model as if no model selection had taken place and reporting of estimates and confidence intervals as if least-squares theory were valid for them, has been described as a scandal. Widespread incorrect usage and the availability of alternatives such as ensemble learning, leaving all variables in the model, or using expert judgement to identify relevant variables have led to calls to totally avoid stepwise model selection.