Contents: how to use matminer, automatminer, pandas and scikit-learn to machine-learn materials properties.

Contents

1. Data retrieval and filtering

Manipulating and examining pandas DataFrame objects

Examining the dataset

Indexing the dataset

Filtering the dataset

Generating new columns

2. Generating descriptors for machine learning

Featurization methods and basics

Featurizing dataframes

Structure featurizers

Conversion featurizers

Advanced capabilities

Handling errors

Citing the authors

3. Machine learning models

Scikit-Learn

Loading and preparing a pre-featurized dataset

Trying a random forest model with scikit-learn

Evaluating model performance

Cross validation

Visualizing model performance

Model interpretation

4. Automated machine learning with automatminer

Fitting and predicting with Automatminer's MatPipe

Fitting the pipeline

Predicting new data

Examining the predictions

Scoring predictions

Examining the internals of MatPipe

Accessing MatPipe's internal objects directly

Persistence of pipelines


A typical machine learning workflow can be summarized as:

1. Obtain the raw inputs, such as a list of compositions and the associated target properties to learn.

2. Convert the raw inputs into descriptors or features that a machine learning algorithm can learn from.

3. Train a machine learning model on the data.

4. Plot and analyze the performance of the model.
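The four steps above can be sketched end to end. This is a minimal illustration using a toy composition table, hypothetical stand-in descriptors (feat_1, feat_2, which in a real workflow would come from matminer featurizers), and a scikit-learn random forest; it is not one of the tutorial's actual datasets.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Step 1: raw inputs -- a toy table of compositions and a target property.
df = pd.DataFrame({
    "formula": ["Fe2O3", "Al2O3", "MgO", "TiO2", "SiO2", "CaO", "ZnO", "NiO"],
    "band_gap": [2.0, 8.8, 7.8, 3.0, 9.0, 7.1, 3.4, 4.3],
})

# Step 2: descriptors -- hypothetical stand-in features; in practice these
# are generated by matminer featurizers (section 2).
df["feat_1"] = [5, 5, 2, 3, 3, 2, 2, 2]
df["feat_2"] = [0.4, 0.4, 0.5, 0.33, 0.33, 0.5, 0.5, 0.5]

# Step 3: train a model on the featurized data.
X = df[["feat_1", "feat_2"]]
y = df["band_gap"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

# Step 4: evaluate the model's performance.
mae = mean_absolute_error(y_test, model.predict(X_test))
print("MAE:", mae)
```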

1. Data retrieval and filtering

Matminer interfaces with many materials databases, including: the Materials Project, Citrine, AFLOW, the Materials Data Facility (MDF), and the Materials Platform for Data Science (MPDS). In addition, it includes datasets from the published literature. Matminer hosts a repository of 26 (and growing) datasets drawn from published, peer-reviewed machine learning studies of materials properties or from high-throughput computational publications. In this section we show how to access and manipulate datasets from the published literature. For more information on accessing the other materials databases, see the matminer_examples repository.

The list of literature-based datasets can be printed with the get_available_datasets() function.

This also prints information about each dataset, such as the number of samples, the target properties, and how the data were obtained (e.g., from theory or experiment).

from matminer.datasets import get_available_datasets
get_available_datasets()
Output:
boltztrap_mp: Effective mass and thermoelectric properties of 8924 compounds in The Materials Project database that are calculated by the BoltzTraP software package run on the GGA-PBE or GGA+U density functional theory calculation results. The properties are reported at the temperature of 300 Kelvin and the carrier concentration of 1e18 1/cm3.
brgoch_superhard_training: 2574 materials used for training regressors that predict shear and bulk modulus.
castelli_perovskites: 18,928 perovskites generated with ABX combinatorics, calculating gllbsc band gap and pbe structure, and also reporting absolute band edge positions and heat of formation.
citrine_thermal_conductivity: Thermal conductivity of 872 compounds measured experimentally and retrieved from Citrine database from various references. The reported values are measured at various temperatures of which 295 are at room temperature.
dielectric_constant: 1,056 structures with dielectric properties, calculated with DFPT-PBE.
double_perovskites_gap: Band gap of 1306 double perovskites (a_1-b_1-a_2-b_2-O6) calculated using Gritsenko, van Leeuwen, van Lenthe and Baerends potential (gllbsc) in GPAW.
double_perovskites_gap_lumo: Supplementary lumo data of 55 atoms for the double_perovskites_gap dataset.
elastic_tensor_2015: 1,181 structures with elastic properties calculated with DFT-PBE.
expt_formation_enthalpy: Experimental formation enthalpies for inorganic compounds, collected from years of calorimetric experiments. There are 1,276 entries in this dataset, mostly binary compounds. Matching mpids or oqmdids as well as the DFT-computed formation energies are also added (if any).
expt_gap: Experimental band gap of 6354 inorganic semiconductors.
flla: 3938 structures and computed formation energies from "Crystal Structure Representations for Machine Learning Models of Formation Energies."
glass_binary: Metallic glass formation data for binary alloys, collected from various experimental techniques such as melt-spinning or mechanical alloying. This dataset covers all compositions with an interval of 5 at. % in 59 binary systems, containing a total of 5959 alloys in the dataset. The target property of this dataset is the glass forming ability (GFA), i.e. whether the composition can form monolithic glass or not, which is either 1 for glass forming or 0 for non-full glass forming.
glass_binary_v2: Identical to glass_binary dataset, but with duplicate entries merged. If there was a disagreement in gfa when merging the class was defaulted to 1.
glass_ternary_hipt: Metallic glass formation dataset for ternary alloys, collected from the high-throughput sputtering experiments measuring whether it is possible to form a glass using sputtering. The hipt experimental data are of the Co-Fe-Zr, Co-Ti-Zr, Co-V-Zr and Fe-Ti-Nb ternary systems.
glass_ternary_landolt: Metallic glass formation dataset for ternary alloys, collected from the "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys," a volume of the Landolt–Börnstein collection. This dataset contains experimental measurements of whether it is possible to form a glass using a variety of processing techniques at thousands of compositions from hundreds of ternary systems. The processing techniques are designated in the "processing" column. There are originally 7191 experiments in this dataset, will be reduced to 6203 after deduplicated, and will be further reduced to 6118 if combining multiple data for one composition. There are originally 6780 melt-spinning experiments in this dataset, will be reduced to 5800 if deduplicated, and will be further reduced to 5736 if combining multiple experimental data for one composition.
heusler_magnetic: 1153 Heusler alloys with DFT-calculated magnetic and electronic properties. The 1153 alloys include 576 full, 449 half and 128 inverse Heusler alloys. The data are extracted and cleaned (including de-duplicating) from Citrine.
jarvis_dft_2d: Various properties of 636 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.
jarvis_dft_3d: Various properties of 25,923 bulk materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.
jarvis_ml_dft_training: Various properties of 24,759 bulk and 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.
m2ax: Elastic properties of 223 stable M2AX compounds from "A comprehensive survey of M2AX phase elastic properties" by Cover et al. Calculations are PAW PW91.
matbench_dielectric: Matbench v0.1 test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_expt_gap: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_expt_is_metal: Matbench v0.1 test dataset for classifying metallicity from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, ensuring no conflicting reports were entered for any compositions (i.e., no reported compositions were both metal and nonmetal). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_glass: Matbench v0.1 test dataset for predicting full bulk metallic glass formation ability from chemical formula. Retrieved from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys," a volume of the Landolt–Börnstein collection. Deduplicated according to composition, ensuring no compositions were reported as both GFA and not GFA (i.e., all reports agreed on the classification designation). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_jdft2d: Matbench v0.1 test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_log_gvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average shear modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_log_kvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_mp_e_form: Matbench v0.1 test dataset for predicting DFT formation energy from structure. Adapted from Materials Project database. Removed entries having formation energy more than 3.0eV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_mp_gap: Matbench v0.1 test dataset for predicting DFT PBE band gap from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_mp_is_metal: Matbench v0.1 test dataset for predicting DFT metallicity from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_perovskites: Matbench v0.1 test dataset for predicting formation energy from crystal structure. Adapted from an original dataset generated by Castelli et al. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_phonons: Matbench v0.1 test dataset for predicting vibration properties from crystal structure. Original data retrieved from Petretto et al. Original calculations done via ABINIT in the harmonic approximation based on density functional perturbation theory. Removed entries having a formation energy (or energy above the convex hull) more than 150meV. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_steels: Matbench v0.1 test dataset for predicting steel yield strengths from chemical composition alone. Retrieved from Citrine informatics. Deduplicated. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
mp_all_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.
mp_nostruct_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.
phonon_dielectric_mp: Phonon (lattice/atoms vibrations) and dielectric properties of 1296 compounds computed via ABINIT software package in the harmonic approximation based on density functional perturbation theory.
piezoelectric_tensor: 941 structures with piezoelectric properties, calculated with DFT-PBE.
steel_strength: 312 steels with experimental yield strength and ultimate tensile strength, extracted and cleaned (including de-duplicating) from Citrine.
wolverton_oxides: 4,914 perovskite oxides containing composition data, lattice constants, and formation + vacancy formation energies. All perovskites are of the form ABO3. Adapted from a dataset presented by Emery and Wolverton.

The full list of dataset names is:

['boltztrap_mp','brgoch_superhard_training','castelli_perovskites','citrine_thermal_conductivity','dielectric_constant','double_perovskites_gap','double_perovskites_gap_lumo','elastic_tensor_2015','expt_formation_enthalpy','expt_gap','flla','glass_binary','glass_binary_v2','glass_ternary_hipt','glass_ternary_landolt','heusler_magnetic','jarvis_dft_2d','jarvis_dft_3d','jarvis_ml_dft_training','m2ax','matbench_dielectric','matbench_expt_gap','matbench_expt_is_metal','matbench_glass','matbench_jdft2d','matbench_log_gvrh','matbench_log_kvrh','matbench_mp_e_form','matbench_mp_gap','matbench_mp_is_metal','matbench_perovskites','matbench_phonons','matbench_steels','mp_all_20181018','mp_nostruct_20181018','phonon_dielectric_mp','piezoelectric_tensor','steel_strength','wolverton_oxides']

A dataset can be loaded with the load_dataset() function and the dataset name. To save disk space, the datasets are not downloaded automatically when matminer is installed. Instead, the first time a dataset is loaded it is downloaded from the internet and stored in the matminer installation directory.
Let's load the dielectric constant dataset. It contains 1,056 structures with dielectric properties calculated with DFPT-PBE.

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
Output:
Fetching dielectric_constant.json.gz from https://ndownloader.figshare.com/files/13213475 to D:\anaconda3\lib\site-packages\matminer\datasets\dielectric_constant.json.gz

Manipulating and examining pandas DataFrame objects

The datasets are provided as pandas DataFrame objects. In Python, you can think of these as a kind of "spreadsheet" object. DataFrames have a number of useful methods for exploring and cleaning data, some of which we explore below.
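Before the methods covered below, two quick ways to get a feel for any DataFrame are the shape attribute and the dtypes attribute. This is a sketch on a small synthetic frame (the formulas and values are stand-ins, not one of the matminer datasets):

```python
import pandas as pd

# A small synthetic DataFrame standing in for a matminer dataset.
df = pd.DataFrame({
    "formula": ["Rb2Te", "CdCl2", "MnI2"],
    "band_gap": [1.88, 3.52, 1.17],
})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # the data type of each column
```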

Examining the dataset

The head() function prints a summary of the first few rows of the dataset. You can scroll across to see more columns. This makes it easy to see what kinds of data are available in the dataset.

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.head())
Output:
  material_id  ...                                             poscar
0      mp-441  ...  Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1    mp-22881  ...  Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2    mp-28013  ...  Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3   mp-567290  ...  La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4   mp-560902  ...  Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
[5 rows x 16 columns]

Sometimes, if a dataset is very large, it won't be possible to see all the available columns. Instead, the full list of columns can be viewed using the columns attribute:

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.columns)
Output:
Index(['material_id', 'formula', 'nsites', 'space_group', 'volume','structure', 'band_gap', 'e_electronic', 'e_total', 'n','poly_electronic', 'poly_total', 'pot_ferroelectric', 'cif', 'meta','poscar'],dtype='object')

pandas includes a function called describe() that helps determine statistics for the various numerical/categorical columns in the data. Note that by default, describe() only summarizes numerical columns.
Sometimes, describe() will reveal outliers that indicate mistakes in the data.

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.describe())
Output:
            nsites  space_group  ...  poly_electronic   poly_total
count  1056.000000  1056.000000  ...      1056.000000  1056.000000
mean      7.530303   142.970644  ...         7.248049    14.777898
std       3.388443    67.264591  ...        13.054947    19.435303
min       2.000000     1.000000  ...         1.630000     2.080000
25%       5.000000    82.000000  ...         3.130000     7.557500
50%       8.000000   163.000000  ...         4.790000    10.540000
75%       9.000000   194.000000  ...         7.440000    15.482500
max      20.000000   229.000000  ...       256.840000   277.780000
[8 rows x 7 columns]

Indexing the dataset

We can access a particular column of a DataFrame by indexing the object using the column name. For example:

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df["band_gap"])
Output:
0       1.88
1       3.52
2       1.17
3       1.12
4       2.87
...
1051    0.87
1052    3.60
1053    0.14
1054    0.21
1055    0.26
Name: band_gap, Length: 1056, dtype: float64
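Indexing with a list of column names, rather than a single name, returns a new DataFrame containing just those columns. A sketch on a small synthetic frame (the values are stand-ins):

```python
import pandas as pd

df = pd.DataFrame({
    "formula": ["Rb2Te", "CdCl2"],
    "band_gap": [1.88, 3.52],
    "volume": [120.0, 95.0],
})

# A list of names selects several columns at once, returning a DataFrame.
subset = df[["formula", "band_gap"]]
print(subset.columns.tolist())
```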

Alternatively, we can access a particular row of the DataFrame using the iloc attribute.

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.iloc[100])
Output:
material_id                                                    mp-7140
formula                                                            SiC
nsites                                                               4
space_group                                                        186
volume                                                       42.005504
structure            [[-1.87933700e-06  1.78517223e+00  2.53458835e...
band_gap                                                           2.3
e_electronic         [[6.9589498, -3.29e-06, 0.0014472600000000001]...
e_total              [[10.193825310000001, -3.7090000000000006e-05,...
n                                                                 2.66
poly_electronic                                                   7.08
poly_total                                                       10.58
pot_ferroelectric                                                False
cif                  #\#CIF1.1\n###################################...
meta                 {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...
poscar               Si2 C2\n1.0\n3.092007 0.000000 0.000000\n-1.54...
Name: 100, dtype: object

Filtering the dataset

pandas DataFrames make it very easy to filter the data based on a particular column. We can use the usual Python comparison operators (==, >, >=, <, etc.) to filter numerical values. For example, let's find all entries with a unit cell volume of 580 or greater. We do this by filtering on the volume column.
Note that we first generate a boolean mask: a series of True and False values that depends on the comparison. We can then use the mask to filter the DataFrame.

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
mask = df["volume"] >= 580
print(df[mask])
Output:
    material_id  ...                                             poscar
206    mp-23280  ...  As4 Cl12\n1.0\n4.652758 0.000000 0.000000\n0.0...
216     mp-9064  ...  Rb6 Te6\n1.0\n10.118717 0.000000 0.000000\n-5....
219    mp-23230  ...  P4 Cl12\n1.0\n6.523152 0.000000 0.000000\n0.00...
251     mp-2160  ...  Sb8 Se12\n1.0\n4.029937 0.000000 0.000000\n0.0...
[4 rows x 16 columns]

We can use this filtering approach to clean up a dataset. For example, if we only want our dataset to contain semiconductors (materials with a nonzero band gap), we can easily achieve this by filtering on the band_gap column.

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
mask = df["band_gap"] > 0
semiconductor_df = df[mask]
print(semiconductor_df)
Output:
     material_id  ...                                             poscar
0         mp-441  ...  Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1       mp-22881  ...  Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2       mp-28013  ...  Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3      mp-567290  ...  La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4      mp-560902  ...  Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
...          ...  ...                                                ...
1051   mp-568032  ...  Cd1 In2 Se4\n1.0\n5.912075 0.000000 0.000000\n...
1052   mp-696944  ...  La2 H2 Br4\n1.0\n4.137833 0.000000 0.000000\n-...
1053    mp-16238  ...  Li2 Ag1 Sb1\n1.0\n4.078957 0.000000 2.354987\n...
1054     mp-4405  ...  Rb3 Au1 O1\n1.0\n5.617516 0.000000 0.000000\n0...
1055     mp-3486  ...  K2 Sn2 Sb2\n1.0\n4.446803 0.000000 0.000000\n-...
[1056 rows x 16 columns]
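Boolean masks can also be combined with the element-wise operators & (and) and | (or), with each comparison wrapped in parentheses. A sketch on a small synthetic frame (the values are stand-ins for the dielectric dataset's columns):

```python
import pandas as pd

df = pd.DataFrame({
    "band_gap": [0.0, 1.2, 3.5, 0.0, 6.1],
    "volume":   [120.0, 610.0, 95.0, 640.0, 55.0],
})

# Keep semiconductors with small unit cells: both conditions must hold.
mask = (df["band_gap"] > 0) & (df["volume"] < 600)
filtered = df[mask]
print(len(filtered))
```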

Often, a dataset contains many additional columns that are not needed for machine learning. Before we can train a model on the data, we need to remove any extraneous columns. We can remove entire columns from the dataset with the drop() function. The function can be used to delete both rows and columns.

The function takes a list of items to remove. For columns, this is the column name, while for rows it is the row index. Finally, the axis option specifies whether the data being dropped are columns (axis=1) or rows (axis=0).

For example, to remove the nsites, space_group, e_electronic, and e_total columns, we can run:

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print("Before drop:")
print(df.describe())
print("--" * 20)
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"], axis=1)
print("After drop:")
print(cleaned_df.describe())
Output:
Before drop:
            nsites  space_group  ...  poly_electronic   poly_total
count  1056.000000  1056.000000  ...      1056.000000  1056.000000
mean      7.530303   142.970644  ...         7.248049    14.777898
std       3.388443    67.264591  ...        13.054947    19.435303
min       2.000000     1.000000  ...         1.630000     2.080000
25%       5.000000    82.000000  ...         3.130000     7.557500
50%       8.000000   163.000000  ...         4.790000    10.540000
75%       9.000000   194.000000  ...         7.440000    15.482500
max      20.000000   229.000000  ...       256.840000   277.780000
[8 rows x 7 columns]
----------------------------------------
After drop:
            volume     band_gap            n  poly_electronic   poly_total
count  1056.000000  1056.000000  1056.000000      1056.000000  1056.000000
mean    166.420376     2.119432     2.434886         7.248049    14.777898
std      97.425084     1.604924     1.148849        13.054947    19.435303
min      13.980548     0.110000     1.280000         1.630000     2.080000
25%      96.262337     0.890000     1.770000         3.130000     7.557500
50%     145.944691     1.730000     2.190000         4.790000    10.540000
75%     212.106405     2.885000     2.730000         7.440000    15.482500
max     597.341134     8.320000    16.030000       256.840000   277.780000
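The same drop() function removes rows when axis=0 (the default) is used, with the row index labels in place of column names. A sketch on a synthetic frame (the values are stand-ins):

```python
import pandas as pd

df = pd.DataFrame({"band_gap": [1.88, 3.52, 1.17, 1.12]})

# Drop rows by index label; axis=0 selects rows (this is also the default).
trimmed = df.drop([0, 2], axis=0)
print(trimmed.index.tolist())
```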

Generating new columns

pandas DataFrames also make it easy to perform simple calculations on the data. Think of this as like using formulas in an Excel spreadsheet. All the basic Python math operators (such as +, -, /, and *) can be used.

For example, the dielectric dataset contains both the electronic contribution to the dielectric constant (in the poly_electronic column) and the total (static) dielectric constant (in the poly_total column). The ionic contribution to the dielectric constant is given by the difference of the two:

poly_ionic = poly_total - poly_electronic

Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called poly_ionic. This is as simple as assigning the data to the new column, even if that column does not exist yet.

from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
df['poly_ionic'] = df['poly_total'] - df['poly_electronic']
print(df['poly_ionic'])
Output:
0        2.79
1        3.57
2        5.67
3       10.95
4        4.77
...
1051     4.09
1052     3.09
1053    19.99
1054    16.03
1055     3.08
Name: poly_ionic, Length: 1056, dtype: float64

2. Generating descriptors for machine learning

In this section, we will learn how to generate machine learning descriptors from pymatgen materials objects. First, we will generate some descriptors using matminer's "featurizer" classes. Next, we will use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them as input for a machine learning model.

Featurizers convert materials primitives into machine-learnable features. The general idea is that a featurizer accepts a materials primitive (e.g., a pymatgen Composition) and outputs a vector. For example: \begin{align}f(\mathrm{Fe}_2\mathrm{O}_3) \rightarrow [1.5, 7.8, 9.1, 0.09] \end{align}

Matminer contains featurizers for the following pymatgen objects: compositions, crystal structures, crystal sites, band structures, and densities of states.

Depending on the featurizer, the returned features may be: numerical, categorical, or mixed vectors; matrices; or other pymatgen objects (for further processing).

Since we spend most of our time working with pandas DataFrames, all featurizers also work directly on pandas DataFrames. We will provide examples of this later in this lesson.

Matminer contains more than 60 featurizers, most of which implement methods published in peer-reviewed papers. You can find the full list of featurizers on the matminer website. All featurizers have parallelization and convenient error tolerance built into their core methods.

In this lesson, we will go over the main methods implemented by all featurizers. By the end of this unit, you will be able to use a single, common software interface to generate descriptors for a wide range of materials informatics problems.

Featurization methods and basics

The core method of any matminer featurizer is featurize(). This method accepts a materials object and returns a machine-learning vector or matrix. Let's look at an example using a pymatgen Composition:

from pymatgen import Composition
fe2o3 = Composition("Fe2O3")
print("fe2o3:", fe2o3)
# As a simple example, we will get the element fractions using the ElementFraction featurizer.
from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
# Now we can featurize our composition.
element_fractions = ef.featurize(fe2o3)
print(element_fractions)
Output:
fe2o3: Fe2 O3
[0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We have successfully generated features for learning, but what do they mean? One way to check is to read the Features section of the documentation for the featurizer... but an easier way is to use the feature_labels() method.

from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)
Output:
['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr']

We can now look up the labels in the same order as the generated features.

from pymatgen import Composition
fe2o3 = Composition("Fe2O3")
# As a simple example, we will get the element fractions using the ElementFraction featurizer.
from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
# Now we can featurize our composition.
element_fractions = ef.featurize(fe2o3)
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])
Output:
O 0.6
Fe 0.4
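Rather than looking up indices one at a time, the labels and values can be paired with zip() to show every nonzero feature at once. This is a sketch using short stand-in lists that mimic the shape of the feature_labels() and featurize() outputs above:

```python
# Stand-in lists mimicking the output of feature_labels() and featurize().
labels = ["H", "He", "Li", "O", "Fe"]
fractions = [0, 0, 0, 0.6, 0.4]

# Pair every label with its value and keep only the nonzero entries.
nonzero = {label: frac for label, frac in zip(labels, fractions) if frac != 0}
print(nonzero)
```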

Featurizing dataframes

We have just generated some descriptors and their labels for a single sample, but most of the time our data live in pandas dataframes. Fortunately, matminer featurizers implement a featurize_dataframe() method for interacting with dataframes.

Let's grab a new dataset from matminer and use our ElementFraction featurizer on it.

First, we download the dataset as in the previous section. In this example, we will download a dataset of superhard materials.

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("brgoch_superhard_training")
print(df.head())
Output:
Fetching brgoch_superhard_training.json.gz from https://ndownloader.figshare.com/files/13858931 to D:\anaconda3\envs\pythonProject1\lib\site-packages\matminer\datasets\brgoch_superhard_training.json.gz
  formula  ...  suspect_value
0   AlPt3  ...          False
1   Mn2Nb  ...          False
2    HfO2  ...          False
3   Cu3Pt  ...          False
4   Mg3Pt  ...          False

Next, we can use the featurize_dataframe() method (implemented by all featurizers) to apply ElementFraction to all the data at once. The only required arguments are the input dataframe and the name of the input column (in this case, composition). By default, featurize_dataframe() parallelizes across the data using multiprocessing.

import pandas as pd
# Show all DataFrame columns (None means show all; a number can also be set)
pd.set_option('display.max_columns', None)
# Prevent the DataFrame display from wrapping lines (False disables wrapping, True enables it)
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("brgoch_superhard_training")
print(df.head())
print("---" * 20)

from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()

if __name__ == '__main__':
    df = ef.featurize_dataframe(df, "composition")
    print(df.head())
Output:
  formula  bulk_modulus  shear_modulus composition material_id                                          structure                                       brgoch_feats  suspect_value
0   AlPt3    225.230461      91.197748    (Al, Pt)      mp-188  [[0. 0. 0.] Al, [0.         1.96140395 1.96140...  {'atomic_number_feat_1': 123.5, 'atomic_number...          False
1   Mn2Nb    232.696340      74.590157    (Mn, Nb)    mp-12659  [[-2.23765223e-08  1.42974191e+00  5.92614104e...  {'atomic_number_feat_1': 45.5, 'atomic_number_...          False
2    HfO2    204.573433      98.564374     (Hf, O)      mp-352  [[2.24450185 3.85793022 4.83390736] O, [2.7788...  {'atomic_number_feat_1': 44.0, 'atomic_number_...          False
3   Cu3Pt    159.312640      51.778816    (Cu, Pt)    mp-12086  [[0.         1.86144248 1.86144248] Cu, [1.861...  {'atomic_number_feat_1': 82.5, 'atomic_number_...          False
4   Mg3Pt     69.637565      27.588765    (Mg, Pt)    mp-18707  [[0.         0.         2.73626461] Mg, [0.   ...  {'atomic_number_feat_1': 57.0, 'atomic_number_...          False
------------------------------------------------------------
  formula  bulk_modulus  shear_modulus composition material_id                                          structure                                       brgoch_feats  suspect_value  H  He   Li   Be    B    C    N         O    F  Ne   Na    Mg    Al   Si    P    S   Cl  Ar    K   Ca   Sc   Ti    V   Cr        Mn   Fe   Co   Ni    Cu   Zn   Ga   Ge   As   Se   Br  Kr   Rb   Sr    Y   Zr        Nb   Mo   Tc   Ru   Rh   Pd   Ag   Cd   In   Sn   Sb   Te    I  Xe   Cs   Ba  La  Ce  Pr  Nd  Pm  Sm  Eu  Gd  Tb  Dy  Ho  Er  Tm  Yb  Lu        Hf   Ta    W   Re   Os   Ir    Pt   Au   Hg   Tl   Pb   Bi  Po  At  Rn  Fr  Ra  Ac  Th  Pa  U  Np  Pu  Am  Cm  Bk  Cf  Es  Fm  Md  No  Lr
0   AlPt3    225.230461      91.197748    (Al, Pt)      mp-188  [[0. 0. 0.] Al, [0.         1.96140395 1.96140...  {'atomic_number_feat_1': 123.5, 'atomic_number...          False  0   0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0  0.0  0.00  0.25  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.00  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0.000000  0.0  0.0  0.0  0.0  0.0  0.75  0.0  0.0  0.0  0.0  0.0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0
1   Mn2Nb    232.696340      74.590157    (Mn, Nb)    mp-12659  [[-2.23765223e-08  1.42974191e+00  5.92614104e...  {'atomic_number_feat_1': 45.5, 'atomic_number_...          False  0   0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0  0.0  0.00  0.00  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.0  0.0  0.666667  0.0  0.0  0.0  0.00  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.333333  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0.000000  0.0  0.0  0.0  0.0  0.0  0.00  0.0  0.0  0.0  0.0  0.0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0
2    HfO2    204.573433      98.564374     (Hf, O)      mp-352  [[2.24450185 3.85793022 4.83390736] O, [2.7788...  {'atomic_number_feat_1': 44.0, 'atomic_number_...          False  0   0  0.0  0.0  0.0  0.0  0.0  0.666667  0.0   0  0.0  0.00  0.00  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.00  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0.333333  0.0  0.0  0.0  0.0  0.0  0.00  0.0  0.0  0.0  0.0  0.0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0
3   Cu3Pt    159.312640      51.778816    (Cu, Pt)    mp-12086  [[0.         1.86144248 1.86144248] Cu, [1.861...  {'atomic_number_feat_1': 82.5, 'atomic_number_...          False  0   0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0  0.0  0.00  0.00  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.75  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0.000000  0.0  0.0  0.0  0.0  0.0  0.25  0.0  0.0  0.0  0.0  0.0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0
4   Mg3Pt     69.637565      27.588765    (Mg, Pt)    mp-18707  [[0.         0.         2.73626461] Mg, [0.   ...  {'atomic_number_feat_1': 57.0, 'atomic_number_...          False  0   0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0  0.0  0.75  0.00  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.00  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0  0.0  0.0  0.000000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0  0.0  0.0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0.000000  0.0  0.0  0.0  0.0  0.0  0.25  0.0  0.0  0.0  0.0  0.0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0

Structure featurizers

We can use the same syntax for other kinds of featurizers. Now let's assign descriptors to a structure, using the same syntax we used for the composition featurizer. First, let's load a dataset containing structures.

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
Output:
Fetching phonon_dielectric_mp.json.gz from https://ndownloader.figshare.com/files/13297571 to D:\anaconda3\envs\pythonProject1\lib\site-packages\matminer\datasets\phonon_dielectric_mp.json.gz
         mpid  ...  formula
0     mp-1000  ...     BaTe
1  mp-1002124  ...      HfC
2  mp-1002164  ...      GeC
3    mp-10044  ...      BAs
4  mp-1008223  ...     CaSe

Let's use DensityFeatures to calculate some basic density features for these structures.

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
print(densityf.feature_labels())
Result:
['density', 'vpa', 'packing fraction']

These are the features we will obtain. Now we use featurize_dataframe() to generate these features for every sample in the dataframe. Because the featurizer takes structures as input, we select the "structure" column.

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
print("---" * 20)

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    print(df.head())

Result:
         mpid  eps_electronic  eps_total  last phdos peak                                          structure formula
0     mp-1000        6.311555  12.773454        98.585771  [[2.8943817  2.04663693 5.01321616] Te, [0. 0....    BaTe
1  mp-1002124       24.137743  32.965593       677.585725  [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...     HfC
2  mp-1002164        8.111021  11.169464       761.585719  [[0. 0. 0.] Ge, [ 3.45311592  3.45311592 -3.45...     GeC
3    mp-10044       10.032168  10.128936       701.585723  [[0.98372595 0.69559929 1.70386332] B, [0. 0. ...     BAs
4  mp-1008223        3.979201   6.394043       204.585763            [[0. 0. 0.] Ca, [ 4.95  4.95 -4.95] Se]    CaSe
------------------------------------------------------------
         mpid  eps_electronic  eps_total  last phdos peak                                          structure formula   density        vpa  packing fraction
0     mp-1000        6.311555  12.773454        98.585771  [[2.8943817  2.04663693 5.01321616] Te, [0. 0....    BaTe  4.937886  44.545547          0.596286
1  mp-1002124       24.137743  32.965593       677.585725  [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...     HfC  9.868234  16.027886          0.531426
2  mp-1002164        8.111021  11.169464       761.585719  [[0. 0. 0.] Ge, [ 3.45311592  3.45311592 -3.45...     GeC  5.760895  12.199996          0.394180
3    mp-10044       10.032168  10.128936       701.585723  [[0.98372595 0.69559929 1.70386332] B, [0. 0. ...     BAs  5.087634  13.991016          0.319600
4  mp-1008223        3.979201   6.394043       204.585763            [[0. 0. 0.] Ca, [ 4.95  4.95 -4.95] Se]    CaSe  2.750191  35.937000          0.428523

Conversion Featurizers

In addition to the BandStructure/DOS/Structure/Composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects in a fault-tolerant way (for example, adding oxidation states to a composition). These featurizers live in matminer.featurizers.conversions and use the same featurize/featurize_dataframe syntax as the other featurizers.

The dataset we loaded earlier only contains a formula column holding string objects. To turn these into a composition column containing pymatgen Composition objects, we can apply the StrToComposition conversion featurizer to the formula column.

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
print("---" * 20)

from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

if __name__ == '__main__':
    df = stc.featurize_dataframe(df, "formula")
    print(df.head())

Result:
         mpid  eps_electronic  eps_total  last phdos peak                                          structure formula
0     mp-1000        6.311555  12.773454        98.585771  [[2.8943817  2.04663693 5.01321616] Te, [0. 0....    BaTe
1  mp-1002124       24.137743  32.965593       677.585725  [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...     HfC
2  mp-1002164        8.111021  11.169464       761.585719  [[0. 0. 0.] Ge, [ 3.45311592  3.45311592 -3.45...     GeC
3    mp-10044       10.032168  10.128936       701.585723  [[0.98372595 0.69559929 1.70386332] B, [0. 0. ...     BAs
4  mp-1008223        3.979201   6.394043       204.585763            [[0. 0. 0.] Ca, [ 4.95  4.95 -4.95] Se]    CaSe
------------------------------------------------------------
         mpid  eps_electronic  eps_total  last phdos peak                                          structure formula composition
0     mp-1000        6.311555  12.773454        98.585771  [[2.8943817  2.04663693 5.01321616] Te, [0. 0....    BaTe    (Ba, Te)
1  mp-1002124       24.137743  32.965593       677.585725  [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...     HfC     (Hf, C)
2  mp-1002164        8.111021  11.169464       761.585719  [[0. 0. 0.] Ge, [ 3.45311592  3.45311592 -3.45...     GeC     (Ge, C)
3    mp-10044       10.032168  10.128936       701.585723  [[0.98372595 0.69559929 1.70386332] B, [0. 0. ...     BAs     (B, As)
4  mp-1008223        3.979201   6.394043       204.585763            [[0. 0. 0.] Ca, [ 4.95  4.95 -4.95] Se]    CaSe    (Ca, Se)

Advanced Capabilities

Before we move on, featurizers have a few powerful capabilities worth mentioning (and more that are not covered here).

Handling errors

Real data is often messy, and some featurizers will hit errors on particular samples. Set ignore_errors=True in featurize_dataframe() to skip such errors; if you would also like the errors returned in an additional column, set return_errors=True as well.

Citing the authors

Many featurizers implement methods from peer-reviewed research. Please cite these original works using the citations() method, which returns BibTeX-formatted citations as a Python list.

3. Machine learning models

In parts 1 and 2 we demonstrated how to download datasets and add machine-learnable features. In part 3 we show how to train machine learning models on a dataset and analyze the results.

Scikit-Learn

This part makes extensive use of the scikit-learn package, an open-source Python package for machine learning. Matminer is designed to make machine learning with scikit-learn as simple as possible. Other machine learning packages also exist, such as TensorFlow, which implements neural network architectures. These packages can also be used with matminer but are outside the scope of this workshop.

Loading and preparing a pre-featurized dataset

First, let's load a dataset we can use for machine learning. Beforehand, we added some composition and structure features to the elastic_tensor_2015 dataset used in parts 1 and 2.

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    print(df.head())

Result:
  material_id    formula  nsites  space_group      volume                                          structure  elastic_anisotropy     G_Reuss       G_VRH     G_Voigt     K_Reuss       K_VRH     K_Voigt  poisson_ratio                                  compliance_tensor                                     elastic_tensor                            elastic_tensor_original                                                cif  kpoint_density                                             poscar    density        vpa  packing fraction   composition
0    mp-10003    Nb4CoSi      12          124  194.419802  [[0.94814328 2.07280467 2.5112    ] Nb, [5.273...            0.030688   96.844535   97.141604   97.438674  194.267623  194.268884  194.270146       0.285701  [[0.004385293093993, -0.0016070693558990002, -...  [[311.33514638650246, 144.45092552856926, 126....  [[311.33514638650246, 144.45092552856926, 126....  #\#CIF1.1\n###################################...            7000  Nb8 Co2 Si2\n1.0\n6.221780 0.000000 0.000000\n...   7.834556  16.201654          0.688834  (Nb, Co, Si)
1    mp-10010  Al(CoSi)2       5          164   61.987320  [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278...            0.266910   93.939650   96.252006   98.564362  173.647763  175.449907  177.252050       0.268105  [[0.0037715428949660003, -0.000844229828709, -...  [[306.93357350984974, 88.02634955100905, 105.6...  [[306.93357350984974, 88.02634955100905, 105.6...  #\#CIF1.1\n###################################...            7000  Al1 Co2 Si2\n1.0\n3.932782 0.000000 0.000000\n...   5.384968  12.397466          0.644386  (Al, Co, Si)
2    mp-10015       SiOs       2          221   25.952539   [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os]            0.756489  120.962289  130.112955  139.263621  295.077545  295.077545  295.077545       0.307780  [[0.0019959391925840004, -0.000433146670736000...  [[569.5291276937579, 157.8517489654999, 157.85...  [[569.5291276937579, 157.8517489654999, 157.85...  #\#CIF1.1\n###################################...            7000  Si1 Os1\n1.0\n2.960692 0.000000 0.000000\n0.00...  13.968635  12.976265          0.569426      (Si, Os)
3    mp-10021         Ga       4           63   76.721433  [[0.         1.09045794 0.84078375] Ga, [0.   ...            2.376805   12.205989   15.101901   17.997812   49.025963   49.130670   49.235377       0.360593  [[0.021647143908635, -0.005207263618160001, -0...  [[69.28798774976904, 34.7875015216915, 37.3877...  [[70.13259066665267, 40.60474945058445, 37.387...  #\#CIF1.1\n###################################...            7000  Ga4\n1.0\n2.803229 0.000000 0.000000\n0.000000...   6.036267  19.180359          0.479802          (Ga)
4    mp-10025      SiRu2      12           62  160.300999  [[1.0094265  4.24771709 2.9955487 ] Si, [3.028...            0.196930  100.110773  101.947798  103.784823  255.055257  256.768081  258.480904       0.324682  [[0.00410214297725, -0.001272204332729, -0.001...  [[349.3767766177825, 186.67131003104407, 176.4...  [[407.4791016459293, 176.4759188081947, 213.83...  #\#CIF1.1\n###################################...            7000  Si4 Ru8\n1.0\n4.037706 0.000000 0.000000\n0.00...   9.539514  13.358418          0.598395      (Si, Ru)

We first need to split the dataset into the "target" property and the "features" used for learning. In this model we will use the bulk modulus (K_VRH) as the target property. We use the dataframe's values attribute to get the target property as a numpy array rather than a pandas Series object.
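
The Series/array distinction can be seen on a tiny hypothetical frame (the column name below simply mirrors the target used here; the values are made up):

```python
import pandas as pd

df_demo = pd.DataFrame({"K_VRH": [194.3, 175.4, 295.1]})

# Indexing a column returns a pandas Series...
series = df_demo["K_VRH"]
print(type(series).__name__)  # Series

# ...while .values extracts the underlying numpy array
y_demo = df_demo["K_VRH"].values
print(type(y_demo).__name__)  # ndarray
print(y_demo.shape)           # (3,)
```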

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    print(y)

Result:
[194.26888436 175.44990675 295.07754499 ...  89.41816126  99.38456533
   5.93865993]

Machine learning algorithms can only be trained on numeric features, so we need to remove any non-numeric columns from the dataset. We also want to remove the K_VRH column from the feature set, since the model should not know the target property in advance.

The dataset loaded above includes the structure, formula, and composition columns previously used to generate the machine-learnable features. Let's remove them using the pandas drop() function discussed in part 1. Remember that axis=1 indicates we are dropping columns rather than rows.
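
If you are not sure which columns are non-numeric, pandas can also select the numeric columns directly. A minimal sketch on a small made-up frame (note that select_dtypes keeps the numeric target, so K_VRH must still be dropped separately):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "formula": ["BaTe", "HfC"],  # non-numeric; cannot be fed to the model
    "K_VRH": [194.3, 33.0],      # numeric target; must also be excluded from X
    "density": [4.94, 9.87],
    "vpa": [44.5, 16.0],
})

# Keep only numeric columns, then drop the target column
X_demo = df_demo.select_dtypes(include="number").drop(columns=["K_VRH"])
print(list(X_demo.columns))  # ['density', 'vpa']
```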

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    print("There are {} possible descriptors:".format(len(X.columns)))
    print(X.columns)

Result:
There are 14 possible descriptors:
Index(['nsites', 'space_group', 'volume', 'elastic_anisotropy', 'G_Reuss',
       'G_VRH', 'G_Voigt', 'K_Reuss', 'K_Voigt', 'poisson_ratio',
       'kpoint_density', 'density', 'vpa', 'packing fraction'],
      dtype='object')

Trying a random forest model with scikit-learn

The scikit-learn library makes it easy to train machine learning models using the features we generated. It implements a wide range of regression models and includes tools for cross-validation.

To save time, we will experiment with only a single model in this example, but it is good practice to try several models and see which performs best for your machine learning problem. A good starting model is a random forest. Let's create a random forest model:

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

Note that the model we created has its number of estimators (n_estimators) set to 100. n_estimators is an example of a machine learning hyperparameter. Most models have many tunable hyperparameters, and for good performance they need to be fine-tuned for each individual machine learning problem. There is currently no easy way to know the optimal hyperparameters in advance; typically, trial and error is used.
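
One common way to systematize that trial and error is a grid search with cross-validation: scikit-learn's GridSearchCV refits the model for every combination of candidate values and keeps the best scorer. A minimal sketch on synthetic data (the parameter grid below is an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: the target mostly follows the first feature
rng = np.random.RandomState(0)
X_demo = rng.rand(80, 3)
y_demo = X_demo[:, 0] * 10 + rng.rand(80)

# Try every combination of these hyperparameter values with 3-fold CV
grid = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, None]},
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)  # the best-scoring combination found
```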
We can now train our model to predict the target property (y) from the input features (X). This is done with the fit() function.

rf.fit(X, y)
import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)

Evaluating model performance

Next, we need to evaluate the model's performance. To do this, we first ask the model to predict the bulk modulus for every entry in the original dataframe.

y_pred = rf.predict(X)

Next, we can check the accuracy of the model by looking at the root-mean-square error (RMSE) of the predictions. Scikit-learn provides a mean_squared_error() function to compute the mean squared error; we then take its square root to obtain the final performance metric.
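
The metric itself is simple. As a sanity check, here is the same computation done by hand on a tiny pair of made-up vectors:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])

# RMSE = sqrt(mean((y_true - y_hat)^2)) = sqrt((0 + 0 + 4) / 3)
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
print(rmse_manual)  # sqrt(4/3) ≈ 1.155
```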

import numpy as np
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

import numpy as np
from sklearn.metrics import mean_squared_error

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    y_pred = rf.predict(X)
    mse = mean_squared_error(y, y_pred)
    print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))

Result:
training RMSE = 0.801 GPa

An RMSE of 0.801 GPa looks very reasonable! However, because the model was trained and evaluated on exactly the same data, this is not a realistic estimate of how the model will perform on unseen materials (the main purpose of machine learning studies).
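
The gap between training error and true error shows up even on synthetic data: a default random forest nearly memorizes a noisy training set, so its training RMSE is much smaller than its error on held-out points. A minimal sketch (all data here is randomly generated):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_all = rng.rand(300, 3)
y_all = X_all[:, 0] * 10 + rng.rand(300)  # noisy synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=1)
model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

rmse_train = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
rmse_test = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
# The training RMSE systematically underestimates the held-out RMSE
print(rmse_train < rmse_test)  # True
```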

Cross validation

To get a more accurate estimate of predictive performance, and to verify that we are not overfitting, we need to look at the cross-validation score rather than the fitting score.

In cross validation, the data is randomly partitioned into n splits (10 in this example), each containing roughly the same number of samples. The model is trained on n-1 of the splits (the training set), and its performance is evaluated by comparing actual and predicted values on the remaining split (the test set). The process is repeated so that every split is used as the test set at some point. The cross-validation score is the average score across all test sets.

There are many ways of partitioning the data into splits. In this example we use the KFold method with the number of splits set to 10, i.e. 90% of the data is used as the training set and 10% as the test set.
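
To see what KFold actually produces, here is a sketch on 100 dummy samples; each of the 10 splits should contain 90 training and 10 test indices:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(100).reshape(100, 1)  # 100 dummy samples
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Each split is a (train_indices, test_indices) pair
sizes = [(len(train), len(test)) for train, test in kf.split(X_demo)]
print(sizes[0])   # (90, 10)
print(len(sizes)) # 10
```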

from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=1)

Note that we set random_state=1 so that every participant gets the same answer for their model.

Finally, the cross-validation scores can be obtained automatically with scikit-learn's cross_val_score() function. This function takes a machine learning model, the input features, and the target property as arguments. Note that we pass the kfold object as the cv argument so that cross_val_score() uses the correct test/train splits.

For each split, the model is trained from scratch before its performance is evaluated. Since we have to train and predict 10 times, cross validation usually takes some time to run. In our case the model is fairly small, so the process takes only about a minute. The final cross-validation score is the average over all splits.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    y_pred = rf.predict(X)
    mse = mean_squared_error(y, y_pred)
    print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
    rmse_scores = [np.sqrt(abs(s)) for s in scores]
    print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))

Result:
training RMSE = 0.801 GPa
Mean RMSE: 1.731

Note that our RMSE has increased because it now reflects the model's true predictive power. Still, a root-mean-square error of ~1.7 GPa is not bad!

Visualizing model performance

For every sample in the test set of each test/train split, we can visualize our model's predictive performance by plotting the predictions against the actual values.

First, we use the cross_val_predict method to obtain the predicted values for each split's test set. This is similar to the cross_val_score method, except that it returns the actual predictions rather than the model scores.

from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(rf, X, y, cv=kfold)

For plotting, we use matminer's PlotlyFig module, which helps you quickly generate publication-ready plots. PlotlyFig can produce many different types of plots; explaining its use in detail is outside the scope of this tutorial, but examples of the available plots are shown in the FigRecipes section of the matminer_examples repository.

from matminer.figrecipes.plot import PlotlyFig

pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               mode='notebook')
pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])],
      labels=df['formula'],
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}],
      showlegends=False)
# This code needs to be run in Jupyter.
# The simplest route is to install JupyterLab / Jupyter Notebook via Anaconda.
# Per the matminer.figrecipes docs, set NotebookApp.iopub_data_rate_limit=1.0e10 in
# jupyter_notebook_config.py, otherwise no plot is produced; alternatively launch with:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from matminer.figrecipes.plot import PlotlyFig

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    y_pred = cross_val_predict(rf, X, y, cv=kfold)
    pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
                   y_title='Predicted bulk modulus (GPa)',
                   mode='notebook')
    pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])],
          labels=df['formula'],
          modes=['markers', 'lines'],
          lines=[{}, {'color': 'black', 'dash': 'dash'}],
          showlegends=False)

Not bad! However, there are certainly some outliers (hover over the points with your mouse to see which materials they are).

Model interpretation

An important aspect of machine learning is being able to understand why a model makes certain predictions. Random forest models are particularly easy to interpret because they have a feature_importances_ attribute, which contains the importance of each feature in deciding the final prediction. Let's look at the feature importances of our model.
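
To get a feel for these numbers, here is a sketch on synthetic data where the target depends almost entirely on the first feature: the importances sum to 1 and concentrate on that feature (all data below is randomly generated for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = X_demo[:, 0] * 10 + rng.rand(200) * 0.1  # only feature 0 matters

rf_demo = RandomForestRegressor(n_estimators=50, random_state=1)
rf_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; the dominant feature stands out
print(rf_demo.feature_importances_.sum())     # 1.0
print(rf_demo.feature_importances_.argmax())  # 0
```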

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    print(rf.feature_importances_)

Result:
[1.33190021e-05 1.59243590e-05 3.00289964e-05 1.31361719e-04
 4.30431755e-04 4.14454063e-04 2.24079389e-04 8.12596104e-01
 1.85751384e-01 1.10635270e-04 1.32575216e-06 3.02149617e-05
 4.80359906e-05 2.02700688e-04]

To make sense of this, we need to know which feature each number corresponds to. We can use PlotlyFig to plot the importances of the five most important features.

importances = rf.feature_importances_
included = X.columns.values
indices = np.argsort(importances)[::-1]

pf = PlotlyFig(y_title='Importance (%)',
               title='Feature by importances',
               mode='notebook')
pf.bar(x=included[indices][0:5], y=importances[indices][0:5])

import pandas as pd

# Show all DataFrame columns (None means no limit; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping onto multiple lines
pd.set_option('expand_frame_repr', False)

from matminer.figrecipes.plot import PlotlyFig
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
import numpy as np

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    importances = rf.feature_importances_
    included = X.columns.values
    indices = np.argsort(importances)[::-1]
    pf = PlotlyFig(y_title='Importance (%)',
                   title='Feature by importances',
                   mode='notebook')
    pf.bar(x=included[indices][0:5], y=importances[indices][0:5])

4. Automated machine learning with Automatminer

Automatminer is a package for automatically creating ML pipelines using matminer's featurizers, feature-reduction techniques, and automated machine learning (AutoML). Automatminer works end to end, from raw data to predictions, without requiring any human input.

Put in a dataset, get out a machine that predicts material properties.
* Automatminer is competitive with state-of-the-art hand-tuned machine learning models across multiple domains of materials informatics.
* Automatminer also includes utilities for running MatBench, a materials science ML benchmark.
* Learn more about Automatminer and MatBench in the official documentation.

How does Automatminer work? Automatminer automatically decorates a dataset using hundreds of descriptor techniques from matminer's descriptor library, picks the most useful features for learning, and runs a separate AutoML pipeline. Once the pipeline has been fit, it can be summarized in a text file, saved to disk, or used to make predictions on new materials. In short: materials primitives (such as crystal structures) go in one end, and property predictions come out the other. MatPipe handles the intermediate operations, such as assigning descriptors, cleaning problematic data, data transformations, imputation, and machine learning.

MatPipe is Automatminer's central object. It follows the sklearn BaseEstimator syntax for fit and predict operations: simply fit on your training data, then predict on your test data. MatPipe uses pandas DataFrames for input and output: put in a dataframe (of materials), get out a dataframe (of property predictions).

Overview: in this section we cover the basic steps of training on and predicting data with automatminer, and we use automatminer's API to look inside our automated pipeline.
* First, we load a dataset of roughly 4600 dielectric constants from the Materials Project.
* Next, we fit an Automatminer MatPipe (pipeline) to the data.
* Then, we predict dielectric constants from structures and see how our predictions fare (note that this is not an easy problem!).
* We use MatPipe's introspection methods to examine our pipeline.
* Finally, we look at how to save and load pipelines for reproducible predictions.
* Note: for brevity, we use a single train-test split in this notebook. To run a full Automatminer benchmark, see the MatPipe.benchmark documentation.

Preparing a dataset for machine learning: let's load a dataset to play with. In this example we use matminer to load one of the MatBench v0.1 datasets.

from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    print(df.head())

Result:
                                           structure         n
0  [[4.29304147 2.4785886  1.07248561] S, [4.2930...  1.752064
1  [[3.95051434 4.51121437 0.28035002] K, [4.3099...  1.652859
2  [[-1.78688104  4.79604117  1.53044621] Rb, [-1...  1.867858
3  [[4.51438064 4.51438064 0.        ] Mn, [0.133...  2.676887
4  [[-4.36731958  6.8886097   0.50929706] Li, [-2...  1.793232

Inspecting the dataset, we can see that it contains only the "structure" and "n" (dielectric constant) columns.

Next, we can generate a train-test split for evaluating automatminer.

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)

Let's remove the target property from the test dataframe so we can be sure we are not giving automatminer any test information.

Our target variable is "n".

target = "n"
prediction_df = test_df.drop(columns=[target])
prediction_df.head()
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    print(prediction_df.head())

Result:
                                              structure
1802  [[3.71205866 2.14315394 1.14375057] Si, [-3.71...
1881  [[0. 0. 0.] Cd, [1.35314892 0.95682078 2.34372...
1288  [[-0.50714072  4.9893142   6.08288682] K, [-1....
4490  [[3.90704797 2.76270011 6.76720559] Si, [0.558...
32    [[1.91506173 1.23473956 4.58373805] P, [ 5.553...

Fitting and predicting with Automatminer's MatPipe

Now we have everything we need to start our AutoML pipeline. For simplicity, we will use a MatPipe preset. MatPipe is highly customizable, with hundreds of configuration options, but most use cases are served by one of the preset configurations, which we access with the from_preset method.

In this example, for time reasons, we use the "debug" preset, which spends about 1.5 minutes on machine learning. If you have more time, the "express" preset is a good choice.

from automatminer import MatPipe
pipe = MatPipe.from_preset("debug")
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")

Result:
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.metrics.scorer module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.warnings.warn(message, FutureWarning)
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.feature_selection.base module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.feature_selection. Anything that cannot be imported from sklearn.feature_selection is now part of the private API.warnings.warn(message, FutureWarning)
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.neighbors.unsupervised module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.warnings.warn(message, FutureWarning)
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\externals\joblib\__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.warnings.warn(msg, category=FutureWarning)

Fitting the pipeline

To fit an Automatminer MatPipe to data, pass in your training data and the desired target.

pipe.fit(train_df, target)
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)

Result:
2021-03-15 20:45:47 INFO     Problem type is: regression
2021-03-15 20:45:47 INFO     Fitting MatPipe pipeline to data.
2021-03-15 20:45:47 INFO     AutoFeaturizer: Starting fitting.
2021-03-15 20:45:47 INFO     AutoFeaturizer: Adding compositions from structures.
2021-03-15 20:45:47 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
StructureToOxidStructure:   0%|          | 0/3811 [00:00<?, ?it/s]
StructureToComposition:   0%|          | 0/3811 [00:00<?, ?it/s]
D:\anaconda3\envs\pythonProject1\lib\site-packages\matminer\featurizers\structure.py:743: ComplexWarning: Casting complex values to real discards the imaginary part
  zeros[:len(eigs)] = eigs
SineCoulombMatrix: 100%|██████████| 3811/3811 [00:17<00:00, 219.80it/s]
2021-03-15 20:48:44 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2021-03-15 20:48:44 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2021-03-15 20:48:44 INFO     AutoFeaturizer: Finished transforming.
2021-03-15 20:48:44 INFO     DataCleaner: Starting fitting.
2021-03-15 20:48:44 INFO     DataCleaner: Cleaning with respect to samples with sample na_method 'drop'
2021-03-15 20:48:44 INFO     DataCleaner: Replacing infinite values with nan for easier screening.
2021-03-15 20:48:44 INFO     DataCleaner: Before handling na: 3811 samples, 421 features
2021-03-15 20:48:44 INFO     DataCleaner: 0 samples did not have target values. They were dropped.
2021-03-15 20:48:44 INFO     DataCleaner: Handling feature na by max na threshold of 0.01 with method 'drop'.
2021-03-15 20:48:44 INFO     DataCleaner: After handling na: 3811 samples, 421 features
2021-03-15 20:48:44 INFO     DataCleaner: Finished fitting.
2021-03-15 20:48:44 INFO     FeatureReducer: Starting fitting.
2021-03-15 20:48:45 INFO     FeatureReducer: 285 features removed due to cross correlation more than 0.95
2021-03-15 20:52:46 INFO     TreeFeatureReducer: Finished tree-based feature reduction of 135 initial features to 13
2021-03-15 20:52:46 INFO     FeatureReducer: Finished fitting.
2021-03-15 20:52:46 INFO     FeatureReducer: Starting transforming.
2021-03-15 20:52:46 INFO     FeatureReducer: Finished transforming.
2021-03-15 20:52:46 INFO     TPOTAdaptor: Starting fitting.
27 operators have been imported by TPOT.
Optimization Progress:   0%|          | 0/10 [00:00<?, ?pipeline/s]
ExtraTreesRegressor(StandardScaler(SelectPercentile(input_matrix, SelectPercentile__percentile=99)), ExtraTreesRegressor__bootstrap=True, ExtraTreesRegressor__max_features=0.8500000000000002, ExtraTreesRegressor__min_samples_leaf=1, ExtraTreesRegressor__min_samples_split=5, ExtraTreesRegressor__n_estimators=500)
Optimization Progress: 100%|██████████| 30/30 [00:57<00:00,  1.41s/pipeline]
Generation 2 - Current Pareto front scores:
-3	-0.4658520716454634	ExtraTreesRegressor(StandardScaler(SelectPercentile(input_matrix, SelectPercentile__percentile=99)), ExtraTreesRegressor__bootstrap=True, ExtraTreesRegressor__max_features=0.8500000000000002, ExtraTreesRegressor__min_samples_leaf=1, ExtraTreesRegressor__min_samples_split=5, ExtraTreesRegressor__n_estimators=500)
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler..
1.01 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.
TPOT closed prematurely. Will use the current best pipeline.
2021-03-15 20:53:51 INFO     TPOTAdaptor: Finished fitting.
2021-03-15 20:53:51 INFO     MatPipe successfully fit.

Predicting new data

Our MatPipe is now fit. Let's predict our test data with MatPipe.predict. This should take only a few minutes:

prediction_df = pipe.predict(prediction_df)

The full script up to this point:

from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)

Output:
2020-07-27 14:36:25 INFO     Beginning MatPipe prediction using fitted pipeline.
2020-07-27 14:36:25 INFO     AutoFeaturizer: Starting transforming.
2020-07-27 14:36:25 INFO     AutoFeaturizer: Adding compositions from structures.
2020-07-27 14:36:25 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
2020-07-27 14:37:09 INFO     AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.
2020-07-27 14:37:15 INFO     AutoFeaturizer: Featurizing with ElementProperty.
2020-07-27 14:37:19 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
2020-07-27 14:37:22 INFO     AutoFeaturizer: Featurizing with SineCoulombMatrix.
2020-07-27 14:37:28 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2020-07-27 14:37:28 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2020-07-27 14:37:28 INFO     AutoFeaturizer: Finished transforming.
2020-07-27 14:37:28 INFO     DataCleaner: Starting transforming.
2020-07-27 14:37:28 INFO     DataCleaner: Cleaning with respect to samples with sample na_method 'fill'
2020-07-27 14:37:28 INFO     DataCleaner: Replacing infinite values with nan for easier screening.
2020-07-27 14:37:28 INFO     DataCleaner: Before handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO     DataCleaner: After handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO     DataCleaner: Target not found in df columns. Ignoring...
2020-07-27 14:37:28 INFO     DataCleaner: Finished transforming.
2020-07-27 14:37:28 INFO     FeatureReducer: Starting transforming.
2020-07-27 14:37:28 WARNING  FeatureReducer: Target not found in columns to transform.
2020-07-27 14:37:28 INFO     FeatureReducer: Finished transforming.
2020-07-27 14:37:28 INFO     TPOTAdaptor: Starting predicting.
2020-07-27 14:37:28 INFO     TPOTAdaptor: Prediction finished successfully.
2020-07-27 14:37:28 INFO     TPOTAdaptor: Finished predicting.
2020-07-27 14:37:28 INFO     MatPipe prediction completed.

Examine predictions

MatPipe places its predictions in a column named "{target} predicted":

prediction_df.head()

The full script up to this point:

from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)
    print(prediction_df.head())

Output:

      MagpieData range AtomicWeight  ...  n predicted
1802                     102.710600  ...     1.951822
1881                      15.189000  ...     3.295348
1288                      49.380600  ...     1.656971
4490                       0.000000  ...     4.706100
32                        32.572238  ...     2.754411
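Because the predicted values live in the "{target} predicted" column of prediction_df while the true values remain in test_df, lining the two up for a per-sample comparison is a one-line pandas operation. A minimal standalone sketch of the idea (the index values and the "true" numbers below are hypothetical stand-ins, not taken from the run above):

```python
import pandas as pd

# Hypothetical stand-ins for test_df["n"] (true values) and the
# "n predicted" column that MatPipe.predict adds to prediction_df.
true_n = pd.Series([1.90, 3.30, 1.70], index=[1802, 1881, 1288], name="n")
pred_n = pd.Series([1.951822, 3.295348, 1.656971],
                   index=[1802, 1881, 1288], name="n predicted")

# Align on the shared index and compute a per-sample absolute error.
comparison = pd.concat([true_n, pred_n], axis=1)
comparison["abs_error"] = (comparison["n"] - comparison["n predicted"]).abs()
print(comparison)
```

The same pattern works on the real test_df and prediction_df, since train_test_split preserves the original DataFrame index.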

Score predictions

Now let's score our predictions with the mean absolute error (MAE) and compare them against sklearn's dummy regressor.

from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe
from sklearn.metrics import mean_absolute_error
from sklearn.dummy import DummyRegressor

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)

    # Fit the dummy
    dr = DummyRegressor()
    dr.fit(train_df["structure"], train_df[target])
    dummy_test = dr.predict(test_df["structure"])

    # Score dummy and MatPipe
    true = test_df[target]
    matpipe_test = prediction_df[target + " predicted"]
    mae_matpipe = mean_absolute_error(true, matpipe_test)
    mae_dummy = mean_absolute_error(true, dummy_test)
    print("Dummy MAE: {}".format(mae_dummy))
    print("MatPipe MAE: {}".format(mae_matpipe))

Output:
Dummy MAE: 0.7772666142371938
MatPipe MAE: 0.5030822760911582
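For context, DummyRegressor with its default strategy="mean" ignores the input features entirely and always predicts the mean of the training targets, which is why it serves as a sensible floor that any real model should beat. A tiny self-contained illustration with toy numbers (unrelated to the dielectric data above):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# The dummy ignores X; only the training targets matter.
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_train = np.zeros((len(y_train), 1))

dr = DummyRegressor()  # strategy="mean" by default
dr.fit(X_train, y_train)

y_test = np.array([2.0, 4.0])
preds = dr.predict(np.zeros((len(y_test), 1)))
print(preds)                               # [2.5 2.5] -- the mean of y_train
print(mean_absolute_error(y_test, preds))  # 1.0
```

This also explains why the dummy could be "fit" on train_df["structure"] above: the structure column is never actually used.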

Examining the internals of MatPipe

Inspect a MatPipe's internals with the dict/text summaries from MatPipe.inspect (a long, comprehensive listing of all relevant attribute names) or MatPipe.summarize (an executive summary).

import pprint

# Get a summary and save a copy to json
summary = pipe.summarize(filename="MatPipe_predict_experimental_gap_from_composition_summary.json")
pprint.pprint(summary)

Output:
{'data_cleaning': {'drop_na_targets': 'True',
                   'encoder': 'one-hot',
                   'feature_na_method': 'drop',
                   'na_method_fit': 'drop',
                   'na_method_transform': 'fill'},
 'feature_reduction': {'reducer_params': "{'tree': {'importance_percentile': 0.9, 'mode': 'regression', 'random_state': 0}}",
                       'reducers': "('corr', 'tree')"},
 'features': ['MagpieData range AtomicWeight',
              'MagpieData avg_dev AtomicWeight',
              'MagpieData mean MeltingT',
              'MagpieData maximum Electronegativity',
              'MagpieData mean Electronegativity',
              'MagpieData avg_dev Electronegativity',
              'MagpieData avg_dev NUnfilled',
              'MagpieData mean GSvolume_pa',
              'sine coulomb matrix eig 0',
              'sine coulomb matrix eig 6',
              'sine coulomb matrix eig 7'],
 'featurizers': {'bandstructure': [BandFeaturizer()],
                 'composition': [ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x7f92058afaf0>,
                                                 features=['Number', 'MendeleevNumber', 'AtomicWeight', 'MeltingT',
                                                           'Column', 'Row', 'CovalentRadius', 'Electronegativity',
                                                           'NsValence', 'NpValence', 'NdValence', 'NfValence',
                                                           'NValence', 'NsUnfilled', 'NpUnfilled', 'NdUnfilled',
                                                           'NfUnfilled', 'NUnfilled', 'GSvolume_pa', 'GSbandgap',
                                                           'GSmagmom', 'SpaceGroupNumber'],
                                                 stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev', 'mode'])],
                 'dos': [DOSFeaturizer()],
                 'structure': [SineCoulombMatrix()]},
 'ml_model': "Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),\n"
             "         steps=[('selectpercentile',\n"
             "                 SelectPercentile(percentile=23,\n"
             "                                  score_func=<function f_regression at 0x7f92217f2040>)),\n"
             "                ('robustscaler', RobustScaler()),\n"
             "                ('randomforestregressor',\n"
             "                 RandomForestRegressor(bootstrap=False, max_features=0.05,\n"
             "                                       min_samples_leaf=7, min_samples_split=5,\n"
             "                                       n_estimators=20))])"}
# Explain the MatPipe's internals more comprehensively
details = pipe.inspect(filename="MatPipe_predict_experimental_gap_from_composition_details.json")
print(details)

Output:
{'autofeaturizer': {'autofeaturizer': {'cache_src': None, 'preset': 'debug', 'featurizers': {'composition': [ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x7f92058afaf0>,features=['Number', 'MendeleevNumber', 'AtomicWeight','MeltingT', 'Column', 'Row', 'CovalentRadius','Electronegativity', 'NsValence', 'NpValence','NdValence', 'NfValence', 'NValence', 'NsUnfilled','NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled','GSvolume_pa', 'GSbandgap', 'GSmagmom','SpaceGroupNumber'],stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev','mode'])], 'structure': [SineCoulombMatrix()], 'bandstructure': [BandFeaturizer()], 'dos': [DOSFeaturizer()]}, 'exclude': [], 'functionalize': False, 'ignore_cols': [], 'fitted_input_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 2, 'samples': 3811}, 'converted_input_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 3, 'samples': 3811}, 'ignore_errors': True, 'drop_inputs': True, 'multiindex': False, 'do_precheck': True, 'n_jobs': 2, 'guess_oxistates': True, 'features': ['MagpieData minimum Number', 'MagpieData maximum Number', 'MagpieData range Number', 'MagpieData mean Number', 'MagpieData avg_dev Number', 'MagpieData mode Number', 'MagpieData minimum MendeleevNumber', 'MagpieData maximum MendeleevNumber', 'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber', 'MagpieData avg_dev MendeleevNumber', 'MagpieData mode MendeleevNumber', 'MagpieData minimum AtomicWeight', 'MagpieData maximum AtomicWeight', 'MagpieData range AtomicWeight', 'MagpieData mean AtomicWeight', 'MagpieData avg_dev AtomicWeight', 'MagpieData mode AtomicWeight', 'MagpieData minimum MeltingT', 'MagpieData maximum MeltingT', 'MagpieData range MeltingT', 'MagpieData mean MeltingT', 'MagpieData avg_dev MeltingT', 'MagpieData mode MeltingT', 'MagpieData minimum Column', 'MagpieData maximum Column', 'MagpieData range Column', 'MagpieData mean Column', 'MagpieData avg_dev Column', 'MagpieData mode Column', 
'MagpieData minimum Row', 'MagpieData maximum Row', 'MagpieData range Row', 'MagpieData mean Row', 'MagpieData avg_dev Row', 'MagpieData mode Row', 'MagpieData minimum CovalentRadius', 'MagpieData maximum CovalentRadius', 'MagpieData range CovalentRadius', 'MagpieData mean CovalentRadius', 'MagpieData avg_dev CovalentRadius', 'MagpieData mode CovalentRadius', 'MagpieData minimum Electronegativity', 'MagpieData maximum Electronegativity', 'MagpieData range Electronegativity', 'MagpieData mean Electronegativity', 'MagpieData avg_dev Electronegativity', 'MagpieData mode Electronegativity', 'MagpieData minimum NsValence', 'MagpieData maximum NsValence', 'MagpieData range NsValence', 'MagpieData mean NsValence', 'MagpieData avg_dev NsValence', 'MagpieData mode NsValence', 'MagpieData minimum NpValence', 'MagpieData maximum NpValence', 'MagpieData range NpValence', 'MagpieData mean NpValence', 'MagpieData avg_dev NpValence', 'MagpieData mode NpValence', 'MagpieData minimum NdValence', 'MagpieData maximum NdValence', 'MagpieData range NdValence', 'MagpieData mean NdValence', 'MagpieData avg_dev NdValence', 'MagpieData mode NdValence', 'MagpieData minimum NfValence', 'MagpieData maximum NfValence', 'MagpieData range NfValence', 'MagpieData mean NfValence', 'MagpieData avg_dev NfValence', 'MagpieData mode NfValence', 'MagpieData minimum NValence', 'MagpieData maximum NValence', 'MagpieData range NValence', 'MagpieData mean NValence', 'MagpieData avg_dev NValence', 'MagpieData mode NValence', 'MagpieData minimum NsUnfilled', 'MagpieData maximum NsUnfilled', 'MagpieData range NsUnfilled', 'MagpieData mean NsUnfilled', 'MagpieData avg_dev NsUnfilled', 'MagpieData mode NsUnfilled', 'MagpieData minimum NpUnfilled', 'MagpieData maximum NpUnfilled', 'MagpieData range NpUnfilled', 'MagpieData mean NpUnfilled', 'MagpieData avg_dev NpUnfilled', 'MagpieData mode NpUnfilled', 'MagpieData minimum NdUnfilled', 'MagpieData maximum NdUnfilled', 'MagpieData range NdUnfilled', 'MagpieData 
mean NdUnfilled', 'MagpieData avg_dev NdUnfilled', 'MagpieData mode NdUnfilled', 'MagpieData minimum NfUnfilled', 'MagpieData maximum NfUnfilled', 'MagpieData range NfUnfilled', 'MagpieData mean NfUnfilled', 'MagpieData avg_dev NfUnfilled', 'MagpieData mode NfUnfilled', 'MagpieData minimum NUnfilled', 'MagpieData maximum NUnfilled', 'MagpieData range NUnfilled', 'MagpieData mean NUnfilled', 'MagpieData avg_dev NUnfilled', 'MagpieData mode NUnfilled', 'MagpieData minimum GSvolume_pa', 'MagpieData maximum GSvolume_pa', 'MagpieData range GSvolume_pa', 'MagpieData mean GSvolume_pa', 'MagpieData avg_dev GSvolume_pa', 'MagpieData mode GSvolume_pa', 'MagpieData minimum GSbandgap', 'MagpieData maximum GSbandgap', 'MagpieData range GSbandgap', 'MagpieData mean GSbandgap', 'MagpieData avg_dev GSbandgap', 'MagpieData mode GSbandgap', 'MagpieData minimum GSmagmom', 'MagpieData maximum GSmagmom', 'MagpieData range GSmagmom', 'MagpieData mean GSmagmom', 'MagpieData avg_dev GSmagmom', 'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber', 'MagpieData maximum SpaceGroupNumber', 'MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber', 'MagpieData avg_dev SpaceGroupNumber', 'MagpieData mode SpaceGroupNumber', 'sine coulomb matrix eig 0', 'sine coulomb matrix eig 1', 'sine coulomb matrix eig 2', 'sine coulomb matrix eig 3', 'sine coulomb matrix eig 4', 'sine coulomb matrix eig 5', 'sine coulomb matrix eig 6', 'sine coulomb matrix eig 7', 'sine coulomb matrix eig 8', 'sine coulomb matrix eig 9', 'sine coulomb matrix eig 10', 'sine coulomb matrix eig 11', 'sine coulomb matrix eig 12', 'sine coulomb matrix eig 13', 'sine coulomb matrix eig 14', 'sine coulomb matrix eig 15', 'sine coulomb matrix eig 16', 'sine coulomb matrix eig 17', 'sine coulomb matrix eig 18', 'sine coulomb matrix eig 19', 'sine coulomb matrix eig 20', 'sine coulomb matrix eig 21', 'sine coulomb matrix eig 22', 'sine coulomb matrix eig 23', 'sine coulomb matrix eig 24', 'sine coulomb matrix 
eig 25', 'sine coulomb matrix eig 26', 'sine coulomb matrix eig 27', 'sine coulomb matrix eig 28', 'sine coulomb matrix eig 29', 'sine coulomb matrix eig 30', 'sine coulomb matrix eig 31', 'sine coulomb matrix eig 32', 'sine coulomb matrix eig 33', 'sine coulomb matrix eig 34', 'sine coulomb matrix eig 35', 'sine coulomb matrix eig 36', 'sine coulomb matrix eig 37', 'sine coulomb matrix eig 38', 'sine coulomb matrix eig 39', 'sine coulomb matrix eig 40', 'sine coulomb matrix eig 41', 'sine coulomb matrix eig 42', 'sine coulomb matrix eig 43', 'sine coulomb matrix eig 44', 'sine coulomb matrix eig 45', 'sine coulomb matrix eig 46', 'sine coulomb matrix eig 47', 'sine coulomb matrix eig 48', 'sine coulomb matrix eig 49', 'sine coulomb matrix eig 50', 'sine coulomb matrix eig 51', 'sine coulomb matrix eig 52', 'sine coulomb matrix eig 53', 'sine coulomb matrix eig 54', 'sine coulomb matrix eig 55', 'sine coulomb matrix eig 56', 'sine coulomb matrix eig 57', 'sine coulomb matrix eig 58', 'sine coulomb matrix eig 59', 'sine coulomb matrix eig 60', 'sine coulomb matrix eig 61', 'sine coulomb matrix eig 62', 'sine coulomb matrix eig 63', 'sine coulomb matrix eig 64', 'sine coulomb matrix eig 65', 'sine coulomb matrix eig 66', 'sine coulomb matrix eig 67', 'sine coulomb matrix eig 68', 'sine coulomb matrix eig 69', 'sine coulomb matrix eig 70', 'sine coulomb matrix eig 71', 'sine coulomb matrix eig 72', 'sine coulomb matrix eig 73', 'sine coulomb matrix eig 74', 'sine coulomb matrix eig 75', 'sine coulomb matrix eig 76', 'sine coulomb matrix eig 77', 'sine coulomb matrix eig 78', 'sine coulomb matrix eig 79', 'sine coulomb matrix eig 80', 'sine coulomb matrix eig 81', 'sine coulomb matrix eig 82', 'sine coulomb matrix eig 83', 'sine coulomb matrix eig 84', 'sine coulomb matrix eig 85', 'sine coulomb matrix eig 86', 'sine coulomb matrix eig 87', 'sine coulomb matrix eig 88', 'sine coulomb matrix eig 89', 'sine coulomb matrix eig 90', 'sine coulomb matrix eig 91', 'sine 
coulomb matrix eig 92', 'sine coulomb matrix eig 93', 'sine coulomb matrix eig 94', 'sine coulomb matrix eig 95', 'sine coulomb matrix eig 96', 'sine coulomb matrix eig 97', 'sine coulomb matrix eig 98', 'sine coulomb matrix eig 99', 'sine coulomb matrix eig 100', 'sine coulomb matrix eig 101', 'sine coulomb matrix eig 102', 'sine coulomb matrix eig 103', 'sine coulomb matrix eig 104', 'sine coulomb matrix eig 105', 'sine coulomb matrix eig 106', 'sine coulomb matrix eig 107', 'sine coulomb matrix eig 108', 'sine coulomb matrix eig 109', 'sine coulomb matrix eig 110', 'sine coulomb matrix eig 111', 'sine coulomb matrix eig 112', 'sine coulomb matrix eig 113', 'sine coulomb matrix eig 114', 'sine coulomb matrix eig 115', 'sine coulomb matrix eig 116', 'sine coulomb matrix eig 117', 'sine coulomb matrix eig 118', 'sine coulomb matrix eig 119', 'sine coulomb matrix eig 120', 'sine coulomb matrix eig 121', 'sine coulomb matrix eig 122', 'sine coulomb matrix eig 123', 'sine coulomb matrix eig 124', 'sine coulomb matrix eig 125', 'sine coulomb matrix eig 126', 'sine coulomb matrix eig 127', 'sine coulomb matrix eig 128', 'sine coulomb matrix eig 129', 'sine coulomb matrix eig 130', 'sine coulomb matrix eig 131', 'sine coulomb matrix eig 132', 'sine coulomb matrix eig 133', 'sine coulomb matrix eig 134', 'sine coulomb matrix eig 135', 'sine coulomb matrix eig 136', 'sine coulomb matrix eig 137', 'sine coulomb matrix eig 138', 'sine coulomb matrix eig 139', 'sine coulomb matrix eig 140', 'sine coulomb matrix eig 141', 'sine coulomb matrix eig 142', 'sine coulomb matrix eig 143', 'sine coulomb matrix eig 144', 'sine coulomb matrix eig 145', 'sine coulomb matrix eig 146', 'sine coulomb matrix eig 147', 'sine coulomb matrix eig 148', 'sine coulomb matrix eig 149', 'sine coulomb matrix eig 150', 'sine coulomb matrix eig 151', 'sine coulomb matrix eig 152', 'sine coulomb matrix eig 153', 'sine coulomb matrix eig 154', 'sine coulomb matrix eig 155', 'sine coulomb matrix eig 
156', 'sine coulomb matrix eig 157', 'sine coulomb matrix eig 158', 'sine coulomb matrix eig 159', 'sine coulomb matrix eig 160', 'sine coulomb matrix eig 161', 'sine coulomb matrix eig 162', 'sine coulomb matrix eig 163', 'sine coulomb matrix eig 164', 'sine coulomb matrix eig 165', 'sine coulomb matrix eig 166', 'sine coulomb matrix eig 167', 'sine coulomb matrix eig 168', 'sine coulomb matrix eig 169', 'sine coulomb matrix eig 170', 'sine coulomb matrix eig 171', 'sine coulomb matrix eig 172', 'sine coulomb matrix eig 173', 'sine coulomb matrix eig 174', 'sine coulomb matrix eig 175', 'sine coulomb matrix eig 176', 'sine coulomb matrix eig 177', 'sine coulomb matrix eig 178', 'sine coulomb matrix eig 179', 'sine coulomb matrix eig 180', 'sine coulomb matrix eig 181', 'sine coulomb matrix eig 182', 'sine coulomb matrix eig 183', 'sine coulomb matrix eig 184', 'sine coulomb matrix eig 185', 'sine coulomb matrix eig 186', 'sine coulomb matrix eig 187', 'sine coulomb matrix eig 188', 'sine coulomb matrix eig 189', 'sine coulomb matrix eig 190', 'sine coulomb matrix eig 191', 'sine coulomb matrix eig 192', 'sine coulomb matrix eig 193', 'sine coulomb matrix eig 194', 'sine coulomb matrix eig 195', 'sine coulomb matrix eig 196', 'sine coulomb matrix eig 197', 'sine coulomb matrix eig 198', 'sine coulomb matrix eig 199', 'sine coulomb matrix eig 200', 'sine coulomb matrix eig 201', 'sine coulomb matrix eig 202', 'sine coulomb matrix eig 203', 'sine coulomb matrix eig 204', 'sine coulomb matrix eig 205', 'sine coulomb matrix eig 206', 'sine coulomb matrix eig 207', 'sine coulomb matrix eig 208', 'sine coulomb matrix eig 209', 'sine coulomb matrix eig 210', 'sine coulomb matrix eig 211', 'sine coulomb matrix eig 212', 'sine coulomb matrix eig 213', 'sine coulomb matrix eig 214', 'sine coulomb matrix eig 215', 'sine coulomb matrix eig 216', 'sine coulomb matrix eig 217', 'sine coulomb matrix eig 218', 'sine coulomb matrix eig 219', 'sine coulomb matrix eig 220', 'sine 
coulomb matrix eig 221', 'sine coulomb matrix eig 222', 'sine coulomb matrix eig 223', 'sine coulomb matrix eig 224', 'sine coulomb matrix eig 225', 'sine coulomb matrix eig 226', 'sine coulomb matrix eig 227', 'sine coulomb matrix eig 228', 'sine coulomb matrix eig 229', 'sine coulomb matrix eig 230', 'sine coulomb matrix eig 231', 'sine coulomb matrix eig 232', 'sine coulomb matrix eig 233', 'sine coulomb matrix eig 234', 'sine coulomb matrix eig 235', 'sine coulomb matrix eig 236', 'sine coulomb matrix eig 237', 'sine coulomb matrix eig 238', 'sine coulomb matrix eig 239', 'sine coulomb matrix eig 240', 'sine coulomb matrix eig 241', 'sine coulomb matrix eig 242', 'sine coulomb matrix eig 243', 'sine coulomb matrix eig 244', 'sine coulomb matrix eig 245', 'sine coulomb matrix eig 246', 'sine coulomb matrix eig 247', 'sine coulomb matrix eig 248', 'sine coulomb matrix eig 249', 'sine coulomb matrix eig 250', 'sine coulomb matrix eig 251', 'sine coulomb matrix eig 252', 'sine coulomb matrix eig 253', 'sine coulomb matrix eig 254', 'sine coulomb matrix eig 255', 'sine coulomb matrix eig 256', 'sine coulomb matrix eig 257', 'sine coulomb matrix eig 258', 'sine coulomb matrix eig 259', 'sine coulomb matrix eig 260', 'sine coulomb matrix eig 261', 'sine coulomb matrix eig 262', 'sine coulomb matrix eig 263', 'sine coulomb matrix eig 264', 'sine coulomb matrix eig 265', 'sine coulomb matrix eig 266', 'sine coulomb matrix eig 267', 'sine coulomb matrix eig 268', 'sine coulomb matrix eig 269', 'sine coulomb matrix eig 270', 'sine coulomb matrix eig 271', 'sine coulomb matrix eig 272', 'sine coulomb matrix eig 273', 'sine coulomb matrix eig 274', 'sine coulomb matrix eig 275', 'sine coulomb matrix eig 276', 'sine coulomb matrix eig 277', 'sine coulomb matrix eig 278', 'sine coulomb matrix eig 279', 'sine coulomb matrix eig 280', 'sine coulomb matrix eig 281', 'sine coulomb matrix eig 282', 'sine coulomb matrix eig 283', 'sine coulomb matrix eig 284', 'sine coulomb matrix 
eig 285', 'sine coulomb matrix eig 286', 'sine coulomb matrix eig 287'], 'auto_featurizer': True, 'removed_featurizers': [], 'composition_col': 'composition', 'structure_col': 'structure', 'bandstruct_col': 'bandstructure', 'dos_col': 'dos', 'is_fit': True, 'fittable_fcls': {'BagofBonds', 'PartialRadialDistributionFunction', 'BondFractions'}, 'needs_fit': False, 'min_precheck_frac': 0.9}}, 'cleaner': {'cleaner': {'max_na_frac': 0.01, 'feature_na_method': 'drop', 'encoder': 'one-hot', 'encode_categories': True, 'drop_na_targets': True, 'na_method_fit': 'drop', 'na_method_transform': 'fill', 'dropped_features': [], 'object_cols': [], 'number_cols': ['MagpieData minimum Number', 'MagpieData maximum Number', 'MagpieData range Number', 'MagpieData mean Number', 'MagpieData avg_dev Number', 'MagpieData mode Number', 'MagpieData minimum MendeleevNumber', 'MagpieData maximum MendeleevNumber', 'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber', 'MagpieData avg_dev MendeleevNumber', 'MagpieData mode MendeleevNumber', 'MagpieData minimum AtomicWeight', 'MagpieData maximum AtomicWeight', 'MagpieData range AtomicWeight', 'MagpieData mean AtomicWeight', 'MagpieData avg_dev AtomicWeight', 'MagpieData mode AtomicWeight', 'MagpieData minimum MeltingT', 'MagpieData maximum MeltingT', 'MagpieData range MeltingT', 'MagpieData mean MeltingT', 'MagpieData avg_dev MeltingT', 'MagpieData mode MeltingT', 'MagpieData minimum Column', 'MagpieData maximum Column', 'MagpieData range Column', 'MagpieData mean Column', 'MagpieData avg_dev Column', 'MagpieData mode Column', 'MagpieData minimum Row', 'MagpieData maximum Row', 'MagpieData range Row', 'MagpieData mean Row', 'MagpieData avg_dev Row', 'MagpieData mode Row', 'MagpieData minimum CovalentRadius', 'MagpieData maximum CovalentRadius', 'MagpieData range CovalentRadius', 'MagpieData mean CovalentRadius', 'MagpieData avg_dev CovalentRadius', 'MagpieData mode CovalentRadius', 'MagpieData minimum Electronegativity', 
'MagpieData maximum Electronegativity', 'MagpieData range Electronegativity', 'MagpieData mean Electronegativity', 'MagpieData avg_dev Electronegativity', 'MagpieData mode Electronegativity', 'MagpieData minimum NsValence', 'MagpieData maximum NsValence', 'MagpieData range NsValence', 'MagpieData mean NsValence', 'MagpieData avg_dev NsValence', 'MagpieData mode NsValence', 'MagpieData minimum NpValence', 'MagpieData maximum NpValence', 'MagpieData range NpValence', 'MagpieData mean NpValence', 'MagpieData avg_dev NpValence', 'MagpieData mode NpValence', 'MagpieData minimum NdValence', 'MagpieData maximum NdValence', 'MagpieData range NdValence', 'MagpieData mean NdValence', 'MagpieData avg_dev NdValence', 'MagpieData mode NdValence', 'MagpieData minimum NfValence', 'MagpieData maximum NfValence', 'MagpieData range NfValence', 'MagpieData mean NfValence', 'MagpieData avg_dev NfValence', 'MagpieData mode NfValence', 'MagpieData minimum NValence', 'MagpieData maximum NValence', 'MagpieData range NValence', 'MagpieData mean NValence', 'MagpieData avg_dev NValence', 'MagpieData mode NValence', 'MagpieData minimum NsUnfilled', 'MagpieData maximum NsUnfilled', 'MagpieData range NsUnfilled', 'MagpieData mean NsUnfilled', 'MagpieData avg_dev NsUnfilled', 'MagpieData mode NsUnfilled', 'MagpieData minimum NpUnfilled', 'MagpieData maximum NpUnfilled', 'MagpieData range NpUnfilled', 'MagpieData mean NpUnfilled', 'MagpieData avg_dev NpUnfilled', 'MagpieData mode NpUnfilled', 'MagpieData minimum NdUnfilled', 'MagpieData maximum NdUnfilled', 'MagpieData range NdUnfilled', 'MagpieData mean NdUnfilled', 'MagpieData avg_dev NdUnfilled', 'MagpieData mode NdUnfilled', 'MagpieData minimum NfUnfilled', 'MagpieData maximum NfUnfilled', 'MagpieData range NfUnfilled', 'MagpieData mean NfUnfilled', 'MagpieData avg_dev NfUnfilled', 'MagpieData mode NfUnfilled', 'MagpieData minimum NUnfilled', 'MagpieData maximum NUnfilled', 'MagpieData range NUnfilled', 'MagpieData mean NUnfilled', 
'MagpieData avg_dev NUnfilled', 'MagpieData mode NUnfilled', 'MagpieData minimum GSvolume_pa', 'MagpieData maximum GSvolume_pa', 'MagpieData range GSvolume_pa', 'MagpieData mean GSvolume_pa', 'MagpieData avg_dev GSvolume_pa', 'MagpieData mode GSvolume_pa', 'MagpieData minimum GSbandgap', 'MagpieData maximum GSbandgap', 'MagpieData range GSbandgap', 'MagpieData mean GSbandgap', 'MagpieData avg_dev GSbandgap', 'MagpieData mode GSbandgap', 'MagpieData minimum GSmagmom', 'MagpieData maximum GSmagmom', 'MagpieData range GSmagmom', 'MagpieData mean GSmagmom', 'MagpieData avg_dev GSmagmom', 'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber', 'MagpieData maximum SpaceGroupNumber', 'MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber', 'MagpieData avg_dev SpaceGroupNumber', 'MagpieData mode SpaceGroupNumber', 'sine coulomb matrix eig 0', 'sine coulomb matrix eig 1', 'sine coulomb matrix eig 2', 'sine coulomb matrix eig 3', 'sine coulomb matrix eig 4', 'sine coulomb matrix eig 5', 'sine coulomb matrix eig 6', 'sine coulomb matrix eig 7', 'sine coulomb matrix eig 8', 'sine coulomb matrix eig 9', 'sine coulomb matrix eig 10', 'sine coulomb matrix eig 11', 'sine coulomb matrix eig 12', 'sine coulomb matrix eig 13', 'sine coulomb matrix eig 14', 'sine coulomb matrix eig 15', 'sine coulomb matrix eig 16', 'sine coulomb matrix eig 17', 'sine coulomb matrix eig 18', 'sine coulomb matrix eig 19', 'sine coulomb matrix eig 20', 'sine coulomb matrix eig 21', 'sine coulomb matrix eig 22', 'sine coulomb matrix eig 23', 'sine coulomb matrix eig 24', 'sine coulomb matrix eig 25', 'sine coulomb matrix eig 26', 'sine coulomb matrix eig 27', 'sine coulomb matrix eig 28', 'sine coulomb matrix eig 29', 'sine coulomb matrix eig 30', 'sine coulomb matrix eig 31', 'sine coulomb matrix eig 32', 'sine coulomb matrix eig 33', 'sine coulomb matrix eig 34', 'sine coulomb matrix eig 35', 'sine coulomb matrix eig 36', 'sine coulomb matrix eig 37', 'sine coulomb matrix eig 
38', 'sine coulomb matrix eig 39', 'sine coulomb matrix eig 40', 'sine coulomb matrix eig 41', 'sine coulomb matrix eig 42', 'sine coulomb matrix eig 43', 'sine coulomb matrix eig 44', 'sine coulomb matrix eig 45', 'sine coulomb matrix eig 46', 'sine coulomb matrix eig 47', 'sine coulomb matrix eig 48', 'sine coulomb matrix eig 49', 'sine coulomb matrix eig 50', 'sine coulomb matrix eig 51', 'sine coulomb matrix eig 52', 'sine coulomb matrix eig 53', 'sine coulomb matrix eig 54', 'sine coulomb matrix eig 55', 'sine coulomb matrix eig 56', 'sine coulomb matrix eig 57', 'sine coulomb matrix eig 58', 'sine coulomb matrix eig 59', 'sine coulomb matrix eig 60', 'sine coulomb matrix eig 61', 'sine coulomb matrix eig 62', 'sine coulomb matrix eig 63', 'sine coulomb matrix eig 64', 'sine coulomb matrix eig 65', 'sine coulomb matrix eig 66', 'sine coulomb matrix eig 67', 'sine coulomb matrix eig 68', 'sine coulomb matrix eig 69', 'sine coulomb matrix eig 70', 'sine coulomb matrix eig 71', 'sine coulomb matrix eig 72', 'sine coulomb matrix eig 73', 'sine coulomb matrix eig 74', 'sine coulomb matrix eig 75', 'sine coulomb matrix eig 76', 'sine coulomb matrix eig 77', 'sine coulomb matrix eig 78', 'sine coulomb matrix eig 79', 'sine coulomb matrix eig 80', 'sine coulomb matrix eig 81', 'sine coulomb matrix eig 82', 'sine coulomb matrix eig 83', 'sine coulomb matrix eig 84', 'sine coulomb matrix eig 85', 'sine coulomb matrix eig 86', 'sine coulomb matrix eig 87', 'sine coulomb matrix eig 88', 'sine coulomb matrix eig 89', 'sine coulomb matrix eig 90', 'sine coulomb matrix eig 91', 'sine coulomb matrix eig 92', 'sine coulomb matrix eig 93', 'sine coulomb matrix eig 94', 'sine coulomb matrix eig 95', 'sine coulomb matrix eig 96', 'sine coulomb matrix eig 97', 'sine coulomb matrix eig 98', 'sine coulomb matrix eig 99', 'sine coulomb matrix eig 100', 'sine coulomb matrix eig 101', 'sine coulomb matrix eig 102', 'sine coulomb matrix eig 103', 'sine coulomb matrix eig 104', 'sine 
coulomb matrix eig 105', 'sine coulomb matrix eig 106', 'sine coulomb matrix eig 107', 'sine coulomb matrix eig 108', 'sine coulomb matrix eig 109', 'sine coulomb matrix eig 110', 'sine coulomb matrix eig 111', 'sine coulomb matrix eig 112', 'sine coulomb matrix eig 113', 'sine coulomb matrix eig 114', 'sine coulomb matrix eig 115', 'sine coulomb matrix eig 116', 'sine coulomb matrix eig 117', 'sine coulomb matrix eig 118', 'sine coulomb matrix eig 119', 'sine coulomb matrix eig 120', 'sine coulomb matrix eig 121', 'sine coulomb matrix eig 122', 'sine coulomb matrix eig 123', 'sine coulomb matrix eig 124', 'sine coulomb matrix eig 125', 'sine coulomb matrix eig 126', 'sine coulomb matrix eig 127', 'sine coulomb matrix eig 128', 'sine coulomb matrix eig 129', 'sine coulomb matrix eig 130', 'sine coulomb matrix eig 131', 'sine coulomb matrix eig 132', 'sine coulomb matrix eig 133', 'sine coulomb matrix eig 134', 'sine coulomb matrix eig 135', 'sine coulomb matrix eig 136', 'sine coulomb matrix eig 137', 'sine coulomb matrix eig 138', 'sine coulomb matrix eig 139', 'sine coulomb matrix eig 140', 'sine coulomb matrix eig 141', 'sine coulomb matrix eig 142', 'sine coulomb matrix eig 143', 'sine coulomb matrix eig 144', 'sine coulomb matrix eig 145', 'sine coulomb matrix eig 146', 'sine coulomb matrix eig 147', 'sine coulomb matrix eig 148', 'sine coulomb matrix eig 149', 'sine coulomb matrix eig 150', 'sine coulomb matrix eig 151', 'sine coulomb matrix eig 152', 'sine coulomb matrix eig 153', 'sine coulomb matrix eig 154', 'sine coulomb matrix eig 155', 'sine coulomb matrix eig 156', 'sine coulomb matrix eig 157', 'sine coulomb matrix eig 158', 'sine coulomb matrix eig 159', 'sine coulomb matrix eig 160', 'sine coulomb matrix eig 161', 'sine coulomb matrix eig 162', 'sine coulomb matrix eig 163', 'sine coulomb matrix eig 164', 'sine coulomb matrix eig 165', 'sine coulomb matrix eig 166', 'sine coulomb matrix eig 167', 'sine coulomb matrix eig 168', 'sine coulomb matrix 
eig 169', 'sine coulomb matrix eig 170', 'sine coulomb matrix eig 171', 'sine coulomb matrix eig 172', 'sine coulomb matrix eig 173', 'sine coulomb matrix eig 174', 'sine coulomb matrix eig 175', 'sine coulomb matrix eig 176', 'sine coulomb matrix eig 177', 'sine coulomb matrix eig 178', 'sine coulomb matrix eig 179', 'sine coulomb matrix eig 180', 'sine coulomb matrix eig 181', 'sine coulomb matrix eig 182', 'sine coulomb matrix eig 183', 'sine coulomb matrix eig 184', 'sine coulomb matrix eig 185', 'sine coulomb matrix eig 186', 'sine coulomb matrix eig 187', 'sine coulomb matrix eig 188', 'sine coulomb matrix eig 189', 'sine coulomb matrix eig 190', 'sine coulomb matrix eig 191', 'sine coulomb matrix eig 192', 'sine coulomb matrix eig 193', 'sine coulomb matrix eig 194', 'sine coulomb matrix eig 195', 'sine coulomb matrix eig 196', 'sine coulomb matrix eig 197', 'sine coulomb matrix eig 198', 'sine coulomb matrix eig 199', 'sine coulomb matrix eig 200', 'sine coulomb matrix eig 201', 'sine coulomb matrix eig 202', 'sine coulomb matrix eig 203', 'sine coulomb matrix eig 204', 'sine coulomb matrix eig 205', 'sine coulomb matrix eig 206', 'sine coulomb matrix eig 207', 'sine coulomb matrix eig 208', 'sine coulomb matrix eig 209', 'sine coulomb matrix eig 210', 'sine coulomb matrix eig 211', 'sine coulomb matrix eig 212', 'sine coulomb matrix eig 213', 'sine coulomb matrix eig 214', 'sine coulomb matrix eig 215', 'sine coulomb matrix eig 216', 'sine coulomb matrix eig 217', 'sine coulomb matrix eig 218', 'sine coulomb matrix eig 219', 'sine coulomb matrix eig 220', 'sine coulomb matrix eig 221', 'sine coulomb matrix eig 222', 'sine coulomb matrix eig 223', 'sine coulomb matrix eig 224', 'sine coulomb matrix eig 225', 'sine coulomb matrix eig 226', 'sine coulomb matrix eig 227', 'sine coulomb matrix eig 228', 'sine coulomb matrix eig 229', 'sine coulomb matrix eig 230', 'sine coulomb matrix eig 231', 'sine coulomb matrix eig 232', 'sine coulomb matrix eig 233', 'sine 
coulomb matrix eig 234', 'sine coulomb matrix eig 235', 'sine coulomb matrix eig 236', 'sine coulomb matrix eig 237', 'sine coulomb matrix eig 238', 'sine coulomb matrix eig 239', 'sine coulomb matrix eig 240', 'sine coulomb matrix eig 241', 'sine coulomb matrix eig 242', 'sine coulomb matrix eig 243', 'sine coulomb matrix eig 244', 'sine coulomb matrix eig 245', 'sine coulomb matrix eig 246', 'sine coulomb matrix eig 247', 'sine coulomb matrix eig 248', 'sine coulomb matrix eig 249', 'sine coulomb matrix eig 250', 'sine coulomb matrix eig 251', 'sine coulomb matrix eig 252', 'sine coulomb matrix eig 253', 'sine coulomb matrix eig 254', 'sine coulomb matrix eig 255', 'sine coulomb matrix eig 256', 'sine coulomb matrix eig 257', 'sine coulomb matrix eig 258', 'sine coulomb matrix eig 259', 'sine coulomb matrix eig 260', 'sine coulomb matrix eig 261', 'sine coulomb matrix eig 262', 'sine coulomb matrix eig 263', 'sine coulomb matrix eig 264', 'sine coulomb matrix eig 265', 'sine coulomb matrix eig 266', 'sine coulomb matrix eig 267', 'sine coulomb matrix eig 268', 'sine coulomb matrix eig 269', 'sine coulomb matrix eig 270', 'sine coulomb matrix eig 271', 'sine coulomb matrix eig 272', 'sine coulomb matrix eig 273', 'sine coulomb matrix eig 274', 'sine coulomb matrix eig 275', 'sine coulomb matrix eig 276', 'sine coulomb matrix eig 277', 'sine coulomb matrix eig 278', 'sine coulomb matrix eig 279', 'sine coulomb matrix eig 280', 'sine coulomb matrix eig 281', 'sine coulomb matrix eig 282', 'sine coulomb matrix eig 283', 'sine coulomb matrix eig 284', 'sine coulomb matrix eig 285', 'sine coulomb matrix eig 286', 'sine coulomb matrix eig 287'], 'fitted_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 421, 'samples': 3811}, 'fitted_target': 'n', 'dropped_samples': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 421, 'samples': 0}, 'max_problem_col_warning_threshold': 0.3, 'warnings': [], 'is_fit': True}}, 'reducer': {'reducer': {'reducers': 
('corr', 'tree'), 'corr_threshold': 0.95, 'n_pca_features': 'auto', 'tree_importance_percentile': 0.9, 'n_rebate_features': 0.3, '_keep_features': [], '_remove_features': [], 'removed_features': {'corr': ['MagpieData range Number', 'MagpieData mean Number', 'MagpieData avg_dev Number', 'MagpieData minimum MendeleevNumber', 'MagpieData minimum AtomicWeight', 'MagpieData maximum AtomicWeight', 'MagpieData mean AtomicWeight', 'MagpieData mode AtomicWeight', 'MagpieData maximum MeltingT', 'MagpieData minimum Column', 'MagpieData range NsValence', 'MagpieData mean NsValence', 'MagpieData avg_dev NsValence', 'MagpieData range NfValence', 'MagpieData minimum NsUnfilled', 'MagpieData range NsUnfilled', 'MagpieData maximum NdUnfilled', 'MagpieData range NdUnfilled', 'MagpieData avg_dev NdUnfilled', 'MagpieData maximum NfUnfilled', 'MagpieData range NfUnfilled', 'MagpieData mean NfUnfilled', 'MagpieData maximum GSvolume_pa', 'MagpieData range GSbandgap', 'MagpieData avg_dev GSbandgap', 'MagpieData maximum GSmagmom', 'MagpieData range GSmagmom', 'MagpieData avg_dev GSmagmom', 'sine coulomb matrix eig 11', 'sine coulomb matrix eig 15', 'sine coulomb matrix eig 17', 'sine coulomb matrix eig 21', 'sine coulomb matrix eig 22', 'sine coulomb matrix eig 23', 'sine coulomb matrix eig 25', 'sine coulomb matrix eig 26', 'sine coulomb matrix eig 27', 'sine coulomb matrix eig 28', 'sine coulomb matrix eig 29', 'sine coulomb matrix eig 30', 'sine coulomb matrix eig 32', 'sine coulomb matrix eig 33', 'sine coulomb matrix eig 35', 'sine coulomb matrix eig 37', 'sine coulomb matrix eig 38', 'sine coulomb matrix eig 39', 'sine coulomb matrix eig 40', 'sine coulomb matrix eig 43', 'sine coulomb matrix eig 44', 'sine coulomb matrix eig 45', 'sine coulomb matrix eig 47', 'sine coulomb matrix eig 48', 'sine coulomb matrix eig 49', 'sine coulomb matrix eig 50', 'sine coulomb matrix eig 51', 'sine coulomb matrix eig 52', 'sine coulomb matrix eig 53', 'sine coulomb matrix eig 54', 'sine coulomb 
matrix eig 55', 'sine coulomb matrix eig 56', 'sine coulomb matrix eig 57', 'sine coulomb matrix eig 58', 'sine coulomb matrix eig 59', 'sine coulomb matrix eig 60', 'sine coulomb matrix eig 61', 'sine coulomb matrix eig 62', 'sine coulomb matrix eig 63', 'sine coulomb matrix eig 64', 'sine coulomb matrix eig 65', 'sine coulomb matrix eig 66', 'sine coulomb matrix eig 67', 'sine coulomb matrix eig 68', 'sine coulomb matrix eig 70', 'sine coulomb matrix eig 71', 'sine coulomb matrix eig 72', 'sine coulomb matrix eig 73', 'sine coulomb matrix eig 74', 'sine coulomb matrix eig 76', 'sine coulomb matrix eig 77', 'sine coulomb matrix eig 78', 'sine coulomb matrix eig 79', 'sine coulomb matrix eig 80', 'sine coulomb matrix eig 83', 'sine coulomb matrix eig 84', 'sine coulomb matrix eig 86', 'sine coulomb matrix eig 87', 'sine coulomb matrix eig 88', 'sine coulomb matrix eig 89', 'sine coulomb matrix eig 90', 'sine coulomb matrix eig 91', 'sine coulomb matrix eig 92', 'sine coulomb matrix eig 93', 'sine coulomb matrix eig 94', 'sine coulomb matrix eig 95', 'sine coulomb matrix eig 96', 'sine coulomb matrix eig 97', 'sine coulomb matrix eig 98', 'sine coulomb matrix eig 100', 'sine coulomb matrix eig 101', 'sine coulomb matrix eig 102', 'sine coulomb matrix eig 103', 'sine coulomb matrix eig 104', 'sine coulomb matrix eig 105', 'sine coulomb matrix eig 106', 'sine coulomb matrix eig 107', 'sine coulomb matrix eig 108', 'sine coulomb matrix eig 109', 'sine coulomb matrix eig 110', 'sine coulomb matrix eig 111', 'sine coulomb matrix eig 112', 'sine coulomb matrix eig 113', 'sine coulomb matrix eig 114', 'sine coulomb matrix eig 115', 'sine coulomb matrix eig 116', 'sine coulomb matrix eig 117', 'sine coulomb matrix eig 118', 'sine coulomb matrix eig 119', 'sine coulomb matrix eig 120', 'sine coulomb matrix eig 121', 'sine coulomb matrix eig 122', 'sine coulomb matrix eig 123', 'sine coulomb matrix eig 124', 'sine coulomb matrix eig 125', 'sine coulomb matrix eig 126', 'sine 
coulomb matrix eig 127', 'sine coulomb matrix eig 128', 'sine coulomb matrix eig 129', 'sine coulomb matrix eig 130', 'sine coulomb matrix eig 131', 'sine coulomb matrix eig 132', 'sine coulomb matrix eig 133', 'sine coulomb matrix eig 134', 'sine coulomb matrix eig 135', 'sine coulomb matrix eig 136', 'sine coulomb matrix eig 137', 'sine coulomb matrix eig 138', 'sine coulomb matrix eig 139', 'sine coulomb matrix eig 140', 'sine coulomb matrix eig 141', 'sine coulomb matrix eig 142', 'sine coulomb matrix eig 143', 'sine coulomb matrix eig 144', 'sine coulomb matrix eig 145', 'sine coulomb matrix eig 146', 'sine coulomb matrix eig 147', 'sine coulomb matrix eig 148', 'sine coulomb matrix eig 149', 'sine coulomb matrix eig 150', 'sine coulomb matrix eig 152', 'sine coulomb matrix eig 153', 'sine coulomb matrix eig 154', 'sine coulomb matrix eig 155', 'sine coulomb matrix eig 156', 'sine coulomb matrix eig 157', 'sine coulomb matrix eig 158', 'sine coulomb matrix eig 159', 'sine coulomb matrix eig 160', 'sine coulomb matrix eig 161', 'sine coulomb matrix eig 162', 'sine coulomb matrix eig 163', 'sine coulomb matrix eig 164', 'sine coulomb matrix eig 165', 'sine coulomb matrix eig 166', 'sine coulomb matrix eig 167', 'sine coulomb matrix eig 168', 'sine coulomb matrix eig 169', 'sine coulomb matrix eig 170', 'sine coulomb matrix eig 171', 'sine coulomb matrix eig 172', 'sine coulomb matrix eig 173', 'sine coulomb matrix eig 174', 'sine coulomb matrix eig 175', 'sine coulomb matrix eig 176', 'sine coulomb matrix eig 177', 'sine coulomb matrix eig 178', 'sine coulomb matrix eig 179', 'sine coulomb matrix eig 180', 'sine coulomb matrix eig 181', 'sine coulomb matrix eig 182', 'sine coulomb matrix eig 183', 'sine coulomb matrix eig 184', 'sine coulomb matrix eig 185', 'sine coulomb matrix eig 186', 'sine coulomb matrix eig 187', 'sine coulomb matrix eig 188', 'sine coulomb matrix eig 189', 'sine coulomb matrix eig 190', 'sine coulomb matrix eig 191', 'sine coulomb matrix 
eig 192', 'sine coulomb matrix eig 193', 'sine coulomb matrix eig 194', 'sine coulomb matrix eig 195', 'sine coulomb matrix eig 196', 'sine coulomb matrix eig 197', 'sine coulomb matrix eig 198', 'sine coulomb matrix eig 199', 'sine coulomb matrix eig 200', 'sine coulomb matrix eig 201', 'sine coulomb matrix eig 202', 'sine coulomb matrix eig 203', 'sine coulomb matrix eig 204', 'sine coulomb matrix eig 205', 'sine coulomb matrix eig 206', 'sine coulomb matrix eig 207', 'sine coulomb matrix eig 208', 'sine coulomb matrix eig 209', 'sine coulomb matrix eig 210', 'sine coulomb matrix eig 211', 'sine coulomb matrix eig 212', 'sine coulomb matrix eig 213', 'sine coulomb matrix eig 214', 'sine coulomb matrix eig 215', 'sine coulomb matrix eig 216', 'sine coulomb matrix eig 217', 'sine coulomb matrix eig 218', 'sine coulomb matrix eig 219', 'sine coulomb matrix eig 220', 'sine coulomb matrix eig 221', 'sine coulomb matrix eig 222', 'sine coulomb matrix eig 223', 'sine coulomb matrix eig 224', 'sine coulomb matrix eig 225', 'sine coulomb matrix eig 226', 'sine coulomb matrix eig 227', 'sine coulomb matrix eig 228', 'sine coulomb matrix eig 229', 'sine coulomb matrix eig 230', 'sine coulomb matrix eig 231', 'sine coulomb matrix eig 232', 'sine coulomb matrix eig 233', 'sine coulomb matrix eig 234', 'sine coulomb matrix eig 235', 'sine coulomb matrix eig 236', 'sine coulomb matrix eig 237', 'sine coulomb matrix eig 238', 'sine coulomb matrix eig 239', 'sine coulomb matrix eig 240', 'sine coulomb matrix eig 241', 'sine coulomb matrix eig 242', 'sine coulomb matrix eig 243', 'sine coulomb matrix eig 244', 'sine coulomb matrix eig 245', 'sine coulomb matrix eig 246', 'sine coulomb matrix eig 247', 'sine coulomb matrix eig 248', 'sine coulomb matrix eig 249', 'sine coulomb matrix eig 250', 'sine coulomb matrix eig 251', 'sine coulomb matrix eig 252', 'sine coulomb matrix eig 253', 'sine coulomb matrix eig 254', 'sine coulomb matrix eig 255', 'sine coulomb matrix eig 256', 'sine 
coulomb matrix eig 257', 'sine coulomb matrix eig 258', 'sine coulomb matrix eig 259', 'sine coulomb matrix eig 260', 'sine coulomb matrix eig 261', 'sine coulomb matrix eig 262', 'sine coulomb matrix eig 263', 'sine coulomb matrix eig 264', 'sine coulomb matrix eig 265', 'sine coulomb matrix eig 266', 'sine coulomb matrix eig 267', 'sine coulomb matrix eig 268', 'sine coulomb matrix eig 269', 'sine coulomb matrix eig 270', 'sine coulomb matrix eig 271', 'sine coulomb matrix eig 272', 'sine coulomb matrix eig 273', 'sine coulomb matrix eig 274', 'sine coulomb matrix eig 275', 'sine coulomb matrix eig 276', 'sine coulomb matrix eig 277', 'sine coulomb matrix eig 278', 'sine coulomb matrix eig 279', 'sine coulomb matrix eig 280', 'sine coulomb matrix eig 281', 'sine coulomb matrix eig 282', 'sine coulomb matrix eig 283', 'sine coulomb matrix eig 284', 'sine coulomb matrix eig 285', 'sine coulomb matrix eig 286', 'sine coulomb matrix eig 287'], 'tree': ['MagpieData minimum Number', 'MagpieData maximum Number', 'MagpieData mode Number', 'MagpieData maximum MendeleevNumber', 'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber', 'MagpieData avg_dev MendeleevNumber', 'MagpieData mode MendeleevNumber', 'MagpieData minimum MeltingT', 'MagpieData range MeltingT', 'MagpieData avg_dev MeltingT', 'MagpieData mode MeltingT', 'MagpieData maximum Column', 'MagpieData range Column', 'MagpieData mean Column', 'MagpieData avg_dev Column', 'MagpieData mode Column', 'MagpieData minimum Row', 'MagpieData maximum Row', 'MagpieData range Row', 'MagpieData mean Row', 'MagpieData avg_dev Row', 'MagpieData mode Row', 'MagpieData minimum CovalentRadius', 'MagpieData maximum CovalentRadius', 'MagpieData range CovalentRadius', 'MagpieData mean CovalentRadius', 'MagpieData avg_dev CovalentRadius', 'MagpieData mode CovalentRadius', 'MagpieData minimum Electronegativity', 'MagpieData range Electronegativity', 'MagpieData mode Electronegativity', 'MagpieData minimum NsValence', 
'MagpieData maximum NsValence', 'MagpieData mode NsValence', 'MagpieData minimum NpValence', 'MagpieData maximum NpValence', 'MagpieData range NpValence', 'MagpieData mean NpValence', 'MagpieData avg_dev NpValence', 'MagpieData mode NpValence', 'MagpieData minimum NdValence', 'MagpieData maximum NdValence', 'MagpieData range NdValence', 'MagpieData mean NdValence', 'MagpieData avg_dev NdValence', 'MagpieData mode NdValence', 'MagpieData minimum NfValence', 'MagpieData maximum NfValence', 'MagpieData mean NfValence', 'MagpieData avg_dev NfValence', 'MagpieData mode NfValence', 'MagpieData minimum NValence', 'MagpieData maximum NValence', 'MagpieData range NValence', 'MagpieData mean NValence', 'MagpieData avg_dev NValence', 'MagpieData mode NValence', 'MagpieData maximum NsUnfilled', 'MagpieData mean NsUnfilled', 'MagpieData avg_dev NsUnfilled', 'MagpieData mode NsUnfilled', 'MagpieData minimum NpUnfilled', 'MagpieData maximum NpUnfilled', 'MagpieData range NpUnfilled', 'MagpieData mean NpUnfilled', 'MagpieData avg_dev NpUnfilled', 'MagpieData mode NpUnfilled', 'MagpieData minimum NdUnfilled', 'MagpieData mean NdUnfilled', 'MagpieData mode NdUnfilled', 'MagpieData minimum NfUnfilled', 'MagpieData avg_dev NfUnfilled', 'MagpieData mode NfUnfilled', 'MagpieData minimum NUnfilled', 'MagpieData maximum NUnfilled', 'MagpieData range NUnfilled', 'MagpieData mean NUnfilled', 'MagpieData mode NUnfilled', 'MagpieData minimum GSvolume_pa', 'MagpieData range GSvolume_pa', 'MagpieData avg_dev GSvolume_pa', 'MagpieData mode GSvolume_pa', 'MagpieData minimum GSbandgap', 'MagpieData maximum GSbandgap', 'MagpieData mean GSbandgap', 'MagpieData mode GSbandgap', 'MagpieData minimum GSmagmom', 'MagpieData mean GSmagmom', 'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber', 'MagpieData maximum SpaceGroupNumber', 'MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber', 'MagpieData avg_dev SpaceGroupNumber', 'MagpieData mode SpaceGroupNumber', 'sine coulomb 
matrix eig 1', 'sine coulomb matrix eig 2', 'sine coulomb matrix eig 3', 'sine coulomb matrix eig 4', 'sine coulomb matrix eig 5', 'sine coulomb matrix eig 8', 'sine coulomb matrix eig 9', 'sine coulomb matrix eig 10', 'sine coulomb matrix eig 12', 'sine coulomb matrix eig 13', 'sine coulomb matrix eig 14', 'sine coulomb matrix eig 16', 'sine coulomb matrix eig 18', 'sine coulomb matrix eig 19', 'sine coulomb matrix eig 20', 'sine coulomb matrix eig 24', 'sine coulomb matrix eig 31', 'sine coulomb matrix eig 34', 'sine coulomb matrix eig 36', 'sine coulomb matrix eig 41', 'sine coulomb matrix eig 42', 'sine coulomb matrix eig 46', 'sine coulomb matrix eig 69', 'sine coulomb matrix eig 75', 'sine coulomb matrix eig 81', 'sine coulomb matrix eig 82', 'sine coulomb matrix eig 85', 'sine coulomb matrix eig 99', 'sine coulomb matrix eig 151']}, 'retained_features': ['sine coulomb matrix eig 6', 'MagpieData range AtomicWeight', 'MagpieData avg_dev NUnfilled', 'MagpieData mean GSvolume_pa', 'sine coulomb matrix eig 0', 'MagpieData mean MeltingT', 'MagpieData mean Electronegativity', 'MagpieData avg_dev AtomicWeight', 'MagpieData maximum Electronegativity', 'sine coulomb matrix eig 7', 'MagpieData avg_dev Electronegativity'], 'reducer_params': {'tree': {'importance_percentile': 0.9, 'mode': 'regression', 'random_state': 0}}, '_pca': None, '_pca_feats': None, 'is_fit': True}}, 'learner': {'learner': {'mode': 'regression', 'tpot_kwargs': {'max_time_mins': 1, 'max_eval_time_mins': 1, 'population_size': 10, 'n_jobs': 2, 'cv': 5, 'verbosity': 3, 'memory': 'auto', 'template': 'Selector-Transformer-Regressor', 'config_dict': {'sklearn.linear_model.ElasticNetCV': {'l1_ratio': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  
]), 'tol': [1e-05, 0.0001, 0.001, 0.01, 0.1]}, 'sklearn.ensemble.ExtraTreesRegressor': {'n_estimators': [20, 100, 200, 500, 1000], 'max_features': array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3), 'bootstrap': [True, False]}, 'sklearn.ensemble.GradientBoostingRegressor': {'n_estimators': [20, 100, 200, 500, 1000], 'loss': ['ls', 'lad', 'huber', 'quantile'], 'learning_rate': [0.01, 0.1, 0.5, 1.0], 'max_depth': range(1, 11, 2), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3), 'subsample': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'alpha': [0.75, 0.8, 0.85, 0.9, 0.95, 0.99]}, 'sklearn.tree.DecisionTreeRegressor': {'max_depth': range(1, 11, 2), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3)}, 'sklearn.neighbors.KNeighborsRegressor': {'n_neighbors': range(1, 101), 'weights': ['uniform', 'distance'], 'p': [1, 2]}, 'sklearn.linear_model.LassoLarsCV': {'normalize': [True, False]}, 'sklearn.svm.LinearSVR': {'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 'dual': [True, False], 'tol': [1e-05, 0.0001, 0.001, 0.01, 0.1], 'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 15.0, 20.0, 25.0], 'epsilon': [0.0001, 0.001, 0.01, 0.1, 1.0]}, 'sklearn.ensemble.RandomForestRegressor': {'n_estimators': [20, 100, 200, 500, 1000], 'max_features': array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3), 'bootstrap': [True, False]}, 'sklearn.linear_model.RidgeCV': {}, 'sklearn.preprocessing.Binarizer': {'threshold': array([0.  
, 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}, 'sklearn.decomposition.FastICA': {'tol': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}, 'sklearn.cluster.FeatureAgglomeration': {'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']}, 'sklearn.preprocessing.MaxAbsScaler': {}, 'sklearn.preprocessing.MinMaxScaler': {}, 'sklearn.preprocessing.Normalizer': {'norm': ['l1', 'l2', 'max']}, 'sklearn.kernel_approximation.Nystroem': {'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'], 'gamma': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'n_components': range(1, 11)}, 'sklearn.decomposition.PCA': {'svd_solver': ['randomized'], 'iterated_power': range(1, 11)}, 'sklearn.preprocessing.PolynomialFeatures': {'degree': [2], 'include_bias': [False], 'interaction_only': [False]}, 'sklearn.kernel_approximation.RBFSampler': {'gamma': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}, 'sklearn.preprocessing.RobustScaler': {}, 'sklearn.preprocessing.StandardScaler': {}, 'tpot.builtins.ZeroCount': {}, 'tpot.builtins.OneHotEncoder': {'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25], 'sparse': [False], 'threshold': [10]}, 'sklearn.feature_selection.SelectFwe': {'alpha': array([0.   
, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008,0.009, 0.01 , 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017,0.018, 0.019, 0.02 , 0.021, 0.022, 0.023, 0.024, 0.025, 0.026,0.027, 0.028, 0.029, 0.03 , 0.031, 0.032, 0.033, 0.034, 0.035,0.036, 0.037, 0.038, 0.039, 0.04 , 0.041, 0.042, 0.043, 0.044,0.045, 0.046, 0.047, 0.048, 0.049]), 'score_func': {'sklearn.feature_selection.f_regression': None}}, 'sklearn.feature_selection.SelectPercentile': {'percentile': range(1, 100), 'score_func': {'sklearn.feature_selection.f_regression': None}}, 'sklearn.feature_selection.VarianceThreshold': {'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}, 'sklearn.feature_selection.SelectFromModel': {'threshold': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]), 'estimator': {'sklearn.ensemble.ExtraTreesRegressor': {'n_estimators': [100], 'max_features': array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])}}}}, 'scoring': 'neg_mean_absolute_error'}, 'models': None, 'random_state': None, 'greater_score_is_better': None, '_fitted_target': 'n', '_backend': TPOTRegressor(config_dict={'sklearn.cluster.FeatureAgglomeration': {'affinity': ['euclidean','l1','l2','manhattan','cosine'],'linkage': ['ward','complete','average']},'sklearn.decomposition.FastICA': {'tol': array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  
])},'sklearn.decomposition.PCA': {'iterated_power'...'tpot.builtins.OneHotEncoder': {'minimum_fraction': [0.05,0.1,0.15,0.2,0.25],'sparse': [False],'threshold': [10]},'tpot.builtins.ZeroCount': {}},log_file=<ipykernel.iostream.OutStream object at 0x7f9228143730>,max_eval_time_mins=1, max_time_mins=1, memory='auto', n_jobs=2,population_size=10, scoring='neg_mean_absolute_error',template='Selector-Transformer-Regressor', verbosity=3), '_features': ['MagpieData range AtomicWeight', 'MagpieData avg_dev AtomicWeight', 'MagpieData mean MeltingT', 'MagpieData maximum Electronegativity', 'MagpieData mean Electronegativity', 'MagpieData avg_dev Electronegativity', 'MagpieData avg_dev NUnfilled', 'MagpieData mean GSvolume_pa', 'sine coulomb matrix eig 0', 'sine coulomb matrix eig 6', 'sine coulomb matrix eig 7'], 'from_serialized': False, '_best_models': None, 'is_fit': True}}, 'pre_fit_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 2, 'samples': 3811}, 'post_fit_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 12, 'samples': 3811}, 'ml_type': 'regression', 'target': 'n', 'version': '1.0.3.20191111', 'is_fit': True}

Accessing MatPipe's internal objects directly

Instead of going through the text digest, you can access MatPipe's internal objects directly; you just need to know which attributes to look for. See the online API documentation or the source code for more information.

# Access some attributes of MatPipe directly, instead of via a text digest
print(pipe.learner.best_pipeline)

Output:
Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),
         steps=[('selectpercentile',
                 SelectPercentile(percentile=23,
                                  score_func=<function f_regression at 0x7f92217f2040>)),
                ('robustscaler', RobustScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=False, max_features=0.05,
                                       min_samples_leaf=7, min_samples_split=5,
                                       n_estimators=20))])
print(pipe.autofeaturizer.featurizers["composition"])

Output:
[ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x7f92058afaf0>,
                 features=['Number', 'MendeleevNumber', 'AtomicWeight', 'MeltingT', 'Column', 'Row',
                           'CovalentRadius', 'Electronegativity', 'NsValence', 'NpValence', 'NdValence',
                           'NfValence', 'NValence', 'NsUnfilled', 'NpUnfilled', 'NdUnfilled', 'NfUnfilled',
                           'NUnfilled', 'GSvolume_pa', 'GSbandgap', 'GSmagmom', 'SpaceGroupNumber'],
                 stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev', 'mode'])]
print(pipe.autofeaturizer.featurizers["structure"])

Output:
[SineCoulombMatrix()]
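The best_pipeline printed above is an ordinary scikit-learn Pipeline, so once retrieved it can be introspected with the standard scikit-learn API. As a minimal sketch of that introspection pattern, using a toy stand-in whose step names and hyperparameters mirror the output above (this is not the fitted object itself):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in mirroring the structure of pipe.learner.best_pipeline above
best_pipeline = Pipeline([
    ("robustscaler", RobustScaler()),
    ("randomforestregressor", RandomForestRegressor(
        bootstrap=False, max_features=0.05, min_samples_leaf=7,
        min_samples_split=5, n_estimators=20)),
])

# named_steps gives dict-style access to each step, e.g. to read hyperparameters
rf = best_pipeline.named_steps["randomforestregressor"]
print(rf.n_estimators)  # 20
```

The same attribute-chaining works on the real object: pipe.learner.best_pipeline.named_steps would expose the fitted selector, scaler, and regressor individually.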

Persistence of pipelines

Being able to reproduce your results is an important aspect of materials informatics. MatPipe provides convenient methods for saving and loading entire pipelines so that others can use them.

Save a MatPipe for later use with MatPipe.save. Load it back with MatPipe.load.

filename = "MatPipe_predict_experimental_gap_from_composition.p"
pipe.save(filename)
pipe_loaded = MatPipe.load(filename)

Output:
2020-07-27 14:37:33 INFO     Loaded MatPipe from file MatPipe_predict_experimental_gap_from_composition.p.
2020-07-27 14:37:33 WARNING  Only use this model to make predictions (do not retrain!). Backend was serialzed as only the top model, not the full automl backend. 
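Conceptually, saving a pipeline amounts to serializing the fitted object to disk and deserializing it later (the warning above about the backend being "serialized as only the top model" hints at this). As a library-free sketch of the same save/load round trip, using a hypothetical ToyPipe class in place of a fitted MatPipe:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted MatPipe; only illustrates the round trip.
class ToyPipe:
    def __init__(self, target):
        self.target = target

    def predict(self, values):
        # Trivial stand-in for a real model's predict method
        return [v * 2 for v in values]

pipe_toy = ToyPipe(target="gap expt")

# Analogous to pipe.save(filename): write the whole object to disk
path = os.path.join(tempfile.gettempdir(), "toy_pipe.p")
with open(path, "wb") as f:
    pickle.dump(pipe_toy, f)

# Analogous to MatPipe.load(filename): the loaded object is ready to predict
with open(path, "rb") as f:
    pipe_loaded = pickle.load(f)

print(pipe_loaded.target)           # gap expt
print(pipe_loaded.predict([1, 2]))  # [2, 4]
```

As the log warning notes, a loaded MatPipe should only be used for predictions, not retrained, since only the top model (not the full AutoML backend) is preserved.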
