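Advances in artificial intelligence are transforming materials science. Challenges remain, however, including erroneous entries in large-scale materials datasets and overfitting when machine learning is used to predict temperature-dependent properties.

Recently, Associate Professor Hao Li of Tohoku University, Japan, and colleagues published a research paper in Science China Materials that uses thermoelectric materials as a case study to show how the big-data challenges of AI in materials science can be handled and overcome.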
1) First, the authors applied a series of well-justified filtering steps to remove problematic data, obtaining 92,291 data points covering 7,295 compositions at different temperatures from the Starrydata2 database.
2) Next, they proposed a composition-based cross-validation method to avoid overfitting, then built a machine learning model with gradient boosting decision trees, which achieved a strong R² score.
3) Finally, they used the model to evaluate materials in the Materials Project database; Ge2Te5As2 and Ge3(Te3As)2 showed relatively high zT values.
4) Theoretical calculations give maximum zT values of 1.98 and 2.12 for n-type and p-type Ge2Te5As2, and 0.58 and 0.74 for n-type and p-type Ge3(Te3As)2, respectively, indicating that both are promising thermoelectric materials.

Figure 1. Workflow for the thorough preprocessing of the data from the Starrydata2 repository.

Figure 2. Illustration of how the dataset was split up based on composition using a 10-fold cross-validation.

Figure 3. Identification of outliers or problematic data using ML models and statistical analysis of the finally determined dataset. Performance of the ML model based on (a) 108,116 data points from 8541 different compositions and (b) 92,291 data points from 7295 compositions, using 10-fold cross validations. (c, d) Bar charts showing (c) the zT values and (d) the top 20 elements that appear most frequently in our dataset.

Figure 4. Calculated band structures. Electronic band structures for (a) Ge2Te5As2 and (b) Ge3(Te3As)2, showing the band energy levels along high symmetry paths in the Brillouin zone. The band energies are calculated at the PBE + SOC level of theory with the band gap shifted to the values obtained at the HSE06 + SOC level of theory. The Fermi energy is set to zero. The valence bands are shown in blue and the conduction bands in orange and plotted using the sumo code.

Figure 5. Calculated TE transport properties. Calculated (a, e) S, (b, f) σ, (c, g) κe, and (d, h) zT values for n- and p-type Ge2Te5As2 and Ge3(Te3As)2.

Xue Jia, Alex Aziz, Yusuke Hashimoto, Hao Li. Dealing with the big data challenges in AI for thermoelectric materials. Sci. China Mater. (2024). https://doi.org/10.1007/s40843-023-2777-2
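To make the idea of composition-based cross-validation concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function name `composition_kfold`, the `seed` parameter, and the toy composition labels are all assumptions made for illustration. The sketch only shows the key property described above, namely that all temperature points of one composition fall into the same fold, so a model is never tested on a composition it has already seen during training.

```python
import random
from collections import defaultdict


def composition_kfold(compositions, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs grouped by composition.

    Hypothetical helper for illustration: every data point sharing a
    composition label lands in the same fold, so train and test sets
    never share a composition.
    """
    # Map each composition label to the row indices that carry it.
    groups = defaultdict(list)
    for idx, comp in enumerate(compositions):
        groups[comp].append(idx)

    unique = sorted(groups)
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of compositions")

    # Shuffle compositions (not rows), then deal them round-robin into folds.
    random.Random(seed).shuffle(unique)
    folds = [unique[k::n_splits] for k in range(n_splits)]

    for held_out in folds:
        held = set(held_out)
        test_idx = sorted(i for c in held_out for i in groups[c])
        train_idx = sorted(i for c in unique if c not in held for i in groups[c])
        yield train_idx, test_idx


if __name__ == "__main__":
    # Toy labels only: each composition is repeated once per measured temperature.
    labels = ["Bi2Te3"] * 3 + ["PbTe"] * 3 + ["SnSe"] * 2 + ["GeTe"] * 2
    for train_idx, test_idx in composition_kfold(labels, n_splits=2):
        held_out = sorted({labels[i] for i in test_idx})
        print("held-out compositions:", held_out)
```

A plain random split over rows would instead scatter the temperature points of one composition across training and test sets, which tends to inflate the apparent score for temperature-dependent properties. In practice the resulting index pairs would be fed to whatever regressor is being evaluated, for example a gradient boosting model.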