Case study: this dataset records each applicant's academic profile. Below we analyze it to judge whether a student is likely to go on to graduate study.

Dataset features

1. GRE Score (290 to 340)
2. TOEFL Score (92 to 120)
3. University Rating (1 to 5)
4. SOP, the applicant's own motivation (1 to 5)
5. LOR, the strength of the letters of recommendation (1 to 5)
6. CGPA (6.8 to 9.92)
7. Research experience (0 or 1)
8. Chance of Admit, the intention/probability of pursuing a master's degree (0.34 to 0.97)

1. Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os, sys

2. Load and inspect the dataset

df = pd.read_csv("D:\\machine-learning\\score\\Admission_Predict.csv",sep = ",")

print('There are ',len(df.columns),'columns')

for c in df.columns:
    sys.stdout.write(str(c) + ', ')

There are 9 columns

Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR , CGPA, Research, Chance of Admit ,

There are 9 columns in total.

df.info()

RangeIndex: 400 entries, 0 to 399

Data columns (total 9 columns):

Serial No. 400 non-null int64

GRE Score 400 non-null int64

TOEFL Score 400 non-null int64

University Rating 400 non-null int64

SOP 400 non-null float64

LOR 400 non-null float64

CGPA 400 non-null float64

Research 400 non-null int64

Chance of Admit 400 non-null float64

dtypes: float64(4), int64(5)

memory usage: 28.2 KB

Dataset summary:

1. There are 9 features: serial number, GRE score, TOEFL score, university rating, SOP, LOR, CGPA, research experience, and chance of admission.

2. The dataset contains no missing values.

3. There are 400 rows in total.

# Tidy up the column names (the CSV header has a trailing space in 'Chance of Admit ')

df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})

# Show the first 5 rows

df.head()

3. Feature correlations

fig,ax = plt.subplots(figsize=(10,10))

sns.heatmap(df.corr(),ax=ax,annot=True,linewidths=0.05,fmt='.2f',cmap='magma')

plt.show()

Conclusions:

1. The features most strongly correlated with pursuing a master's degree are GRE Score, CGPA, and TOEFL Score.

2. LOR, SOP, and Research have comparatively weaker correlations.
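The same ranking can be read off numerically rather than from the heatmap colors. A minimal sketch, using the df loaded above (Serial No. is excluded since it is just an index):

# Rank features by their correlation with the target column
corr_with_target = df.corr()['Chance of Admit'].drop(['Chance of Admit', 'Serial No.'])
print(corr_with_target.sort_values(ascending=False))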

4. Data visualization: bivariate analysis

4.1 Number of candidates with research experience

print("Not Having Research:",len(df[df.Research ==0]))print("Having Research:",len(df[df.Research == 1]))

y= np.array([len(df[df.Research == 0]),len(df[df.Research == 1])])

x= np.arange(2)

plt.bar(x,y)

plt.title("Research Experience")

plt.xlabel("Canditates")

plt.ylabel("Frequency")

plt.xticks(x,('Not having research','Having research'))

plt.show()

Conclusion: 219 candidates have research experience, while 181 do not.

4.2 TOEFL scores

y = np.array([df['TOEFL Score'].min(),df['TOEFL Score'].mean(),df['TOEFL Score'].max()])

x= np.arange(3)

plt.bar(x,y)

plt.title('TOEFL Score')

plt.xlabel('Level')

plt.ylabel('TOEFL Score')

plt.xticks(x,('Worst','Average','Best'))

plt.show()

Conclusion: the lowest TOEFL score is 92 and the highest is a perfect 120; these applicants' English scores are quite strong.

4.3 GRE scores

df['GRE Score'].plot(kind='hist',bins=200,figsize=(6,6))

plt.title('GRE Score')

plt.xlabel('GRE Score')

plt.ylabel('Frequency')

plt.show()

Conclusion: scores around 310 and 330 are the most common.

4.4 CGPA vs. university rating

plt.scatter(df['University Rating'],df['CGPA'])

plt.title('CGPA Scores for University ratings')

plt.xlabel('University Rating')

plt.ylabel('CGPA')

plt.show()

Conclusion: the better the university, the higher the students' CGPAs tend to be.
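This reading of the scatter plot can be backed up with group means; a quick sketch on the same df:

# Average CGPA per university rating; higher ratings should show higher means
print(df.groupby('University Rating')['CGPA'].mean())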

4.5 GRE score vs. CGPA

plt.scatter(df['GRE Score'],df['CGPA'])

plt.title('CGPA for GRE Scores')

plt.xlabel('GRE Score')

plt.ylabel('CGPA')

plt.show()

Conclusion: higher CGPAs go together with higher GRE scores; the two are strongly correlated.
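The strength of this relationship can be quantified with the Pearson correlation coefficient (values near 1 indicate a strong positive linear relationship):

# Pearson correlation between GRE Score and CGPA
print(df['GRE Score'].corr(df['CGPA']))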

4.6 TOEFL score vs. GRE score

df[df['CGPA']>=8.5].plot(kind='scatter',x='GRE Score',y='TOEFL Score',color='red')

plt.xlabel('GRE Score')

plt.ylabel('TOEFL Score')

plt.title('CGPA >= 8.5')

plt.grid(True)

plt.show()

Conclusion: in most cases GRE and TOEFL scores are positively correlated, but a high GRE score does not guarantee a high TOEFL score.

4.7 University rating vs. chance of admission

s = df[df['Chance of Admit'] >= 0.75]['University Rating'].value_counts().head(5)

plt.title('University Ratings of Candidates with a 75% acceptance chance')

s.plot(kind='bar', figsize=(20,10), colormap='Pastel1')

plt.xlabel('University Rating')

plt.ylabel('Candidates')

plt.show()

Conclusion: candidates from higher-ranked universities are more likely to go on to graduate study.

4.8 SOP vs. CGPA

plt.scatter(df['CGPA'],df['SOP'])

plt.xlabel('CGPA')

plt.ylabel('SOP')

plt.title('SOP for CGPA')

plt.show()

Conclusion: candidates with high CGPAs show a stronger desire to pursue a master's degree (higher SOP ratings).

4.9 SOP vs. GRE score

plt.scatter(df['GRE Score'],df['SOP'])

plt.xlabel('GRE Score')

plt.ylabel('SOP')

plt.title('SOP for GRE Score')

plt.show()

Conclusion: candidates with stronger SOPs tend to have higher GRE scores.

5. Models

5.1 Prepare the dataset

# Load the dataset

df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')

serialNO= df['Serial No.'].values

df.drop(['Serial No.'],axis=1,inplace=True)

df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})

# Split the dataset

y = df['Chance of Admit'].values

x = df.drop(['Chance of Admit'], axis=1)

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.2,random_state=42)

# Normalize the features to [0, 1]

from sklearn.preprocessing import MinMaxScaler

scaleX = MinMaxScaler(feature_range=[0,1])

x_train[x_train.columns] = scaleX.fit_transform(x_train[x_train.columns])

x_test[x_test.columns] = scaleX.transform(x_test[x_test.columns])  # transform only: reuse the scaler fitted on x_train

5.2 Regression

5.2.1 Linear regression

from sklearn.linear_model import LinearRegression

lr=LinearRegression()

lr.fit(x_train,y_train)

y_head_lr = lr.predict(x_test)
print('Real value of y_test[1]: ' + str(y_test[1]) + ' -> predict value: ' + str(lr.predict(x_test.iloc[[1],:])))
print('Real value of y_test[2]: ' + str(y_test[2]) + ' -> predict value: ' + str(lr.predict(x_test.iloc[[2],:])))

from sklearn.metrics import r2_score
print('r_square score:', r2_score(y_test, y_head_lr))

y_head_lr_train = lr.predict(x_train)
print('r_square score (train data):', r2_score(y_train, y_head_lr_train))
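Because all features were scaled to [0, 1], the fitted coefficients give a rough view of each feature's weight in the linear model. A sketch using the lr model trained above:

# Learned weights, comparable here only because of the min-max scaling
coef = pd.Series(lr.coef_, index=x_train.columns)
print(coef.sort_values(ascending=False))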

5.2.2 Random forest regression

from sklearn.ensemble import RandomForestRegressor

rfr= RandomForestRegressor(n_estimators=100,random_state=42)

rfr.fit(x_train,y_train)

y_head_rfr = rfr.predict(x_test)
print('Real value of y_test[1]: ' + str(y_test[1]) + ' -> predict value: ' + str(rfr.predict(x_test.iloc[[1],:])))
print('Real value of y_test[2]: ' + str(y_test[2]) + ' -> predict value: ' + str(rfr.predict(x_test.iloc[[2],:])))
print('r_square score:', r2_score(y_test, y_head_rfr))

y_head_rfr_train = rfr.predict(x_train)
print('r_square score (train data):', r2_score(y_train, y_head_rfr_train))
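Random forests also expose per-feature importances, which should roughly agree with the correlation heatmap from section 3. A sketch using the fitted rfr:

# Feature importances learned by the random forest
importances = pd.Series(rfr.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))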

5.2.3 Decision tree regression

from sklearn.tree import DecisionTreeRegressor

dt= DecisionTreeRegressor(random_state=42)

dt.fit(x_train,y_train)

y_head_dt = dt.predict(x_test)
print('Real value of y_test[1]: ' + str(y_test[1]) + ' -> predict value: ' + str(dt.predict(x_test.iloc[[1],:])))
print('Real value of y_test[2]: ' + str(y_test[2]) + ' -> predict value: ' + str(dt.predict(x_test.iloc[[2],:])))
print('r_square score:', r2_score(y_test, y_head_dt))

y_head_dt_train = dt.predict(x_train)
print('r_square score (train data):', r2_score(y_train, y_head_dt_train))

5.2.4 Comparing the three regression models

y =np.array([r2_score(y_test,y_head_lr),r2_score(y_test,y_head_rfr),r2_score(y_test,y_head_dt)])

x= np.arange(3)

plt.bar(x,y)

plt.title('Comparison of Regression Algorithms')

plt.xlabel('Regression')

plt.ylabel('r2_score')

plt.xticks(x,("LinearRegression","RandomForestReg.","DecisionTreeReg."))

plt.show()

Conclusion: among the three regression models, linear regression performs best.
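R² can be complemented with error metrics on the original 0-1 scale of Chance of Admit; a sketch comparing the three models by MAE and RMSE (lower is better):

from sklearn.metrics import mean_absolute_error, mean_squared_error

for name, pred in [('LinearRegression', y_head_lr),
                   ('RandomForestReg.', y_head_rfr),
                   ('DecisionTreeReg.', y_head_dt)]:
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(name, 'MAE:', round(mae, 4), 'RMSE:', round(rmse, 4))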

5.2.5 Predicted vs. actual values for the three models

red = plt.scatter(np.arange(0,80,5),y_head_lr[0:80:5],color='red')

blue= plt.scatter(np.arange(0,80,5),y_head_rfr[0:80:5],color='blue')

green= plt.scatter(np.arange(0,80,5),y_head_dt[0:80:5],color='green')

black= plt.scatter(np.arange(0,80,5),y_test[0:80:5],color='black')

plt.title('Comparison of Regression Algorithms')

plt.xlabel('Index of candidate')

plt.ylabel('Chance of admit')

plt.legend([red,blue,green,black],['LR','RFR','DT','REAL'])

plt.show()

Conclusion: about 70% of the candidates in the dataset are likely to be admitted; the plot above shows that some points are still not predicted well.

5.3 Classification

5.3.1 Prepare the data

df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')

SerialNO= df['Serial No.'].values

df.drop(['Serial No.'],axis=1,inplace=True)

df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})

y= df['Chance of Admit'].values

x = df.drop(['Chance of Admit'], axis=1)

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

from sklearn.preprocessing import MinMaxScaler

scaleX= MinMaxScaler(feature_range=[0,1])

x_train[x_train.columns]=scaleX.fit_transform(x_train[x_train.columns])

x_test[x_test.columns] = scaleX.transform(x_test[x_test.columns])  # transform only: reuse the scaler fitted on x_train

# If chance > 0.8, set the label to 1, otherwise 0

y_train_01 = [1 if each > 0.8 else 0 for each in y_train]

y_test_01 = [1 if each > 0.8 else 0 for each in y_test]

y_train_01=np.array(y_train_01)

y_test_01= np.array(y_test_01)
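Before training the classifiers it is worth checking how balanced the two classes are, since plain accuracy can be misleading on skewed labels. A quick check:

# Counts of label 0 and label 1 after thresholding at 0.8
print('train labels:', np.bincount(y_train_01))
print('test labels :', np.bincount(y_test_01))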

5.3.2 Logistic regression

from sklearn.linear_model import LogisticRegression

lrc=LogisticRegression()

lrc.fit(x_train, y_train_01)
print('score:', lrc.score(x_test, y_test_01))
print('Real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(lrc.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(lrc.predict(x_test.iloc[[2],:])))

from sklearn.metrics import confusion_matrix

cm_lrc=confusion_matrix(y_test_01,lrc.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_lrc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

from sklearn.metrics import recall_score, precision_score, f1_score
print('precision_score is :', precision_score(y_test_01, lrc.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, lrc.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, lrc.predict(x_test)))

# Test for Train Dataset:

cm_lrc_train=confusion_matrix(y_train_01,lrc.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_lrc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

Conclusions:

1. From the confusion matrix, logistic regression misclassifies 23 samples on the training set, and 72 candidates are predicted to go on to a master's degree.

2. On the test set there are 7 misclassified samples.

5.3.3 Support vector machine (SVM)

from sklearn.svm import SVC

svm= SVC(random_state=1,kernel='rbf')

svm.fit(x_train, y_train_01)
print('score:', svm.score(x_test, y_test_01))
print('Real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(svm.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(svm.predict(x_test.iloc[[2],:])))

cm_svm=confusion_matrix(y_test_01,svm.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_svm,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

print('precision_score is :', precision_score(y_test_01, svm.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, svm.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, svm.predict(x_test)))

# Test for Train Dataset:

cm_svm_train=confusion_matrix(y_train_01,svm.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_svm_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

Conclusions:

1. From the confusion matrix, the SVM misclassifies 22 samples on the training set, and 70 candidates are predicted to go on to a master's degree.

2. On the test set there are 8 misclassified samples.

5.3.4 Naive Bayes

from sklearn.naive_bayes import GaussianNB

nb=GaussianNB()

nb.fit(x_train, y_train_01)
print('score:', nb.score(x_test, y_test_01))
print('Real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(nb.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(nb.predict(x_test.iloc[[2],:])))

cm_nb=confusion_matrix(y_test_01,nb.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_nb,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

print('precision_score is :', precision_score(y_test_01, nb.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, nb.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, nb.predict(x_test)))

# Test for Train Dataset:

cm_nb_train=confusion_matrix(y_train_01,nb.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_nb_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

Conclusions:

1. From the confusion matrix, naive Bayes misclassifies 20 samples on the training set, and 78 candidates are predicted to go on to a master's degree.

2. On the test set there are 7 misclassified samples.

5.3.5 Random forest classifier

from sklearn.ensemble import RandomForestClassifier

rfc= RandomForestClassifier(n_estimators=100,random_state=1)

rfc.fit(x_train, y_train_01)
print('score:', rfc.score(x_test, y_test_01))
print('Real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(rfc.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(rfc.predict(x_test.iloc[[2],:])))

cm_rfc=confusion_matrix(y_test_01,rfc.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_rfc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

print('precision_score is :', precision_score(y_test_01, rfc.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, rfc.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, rfc.predict(x_test)))

# Test for Train Dataset:

cm_rfc_train=confusion_matrix(y_train_01,rfc.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_rfc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

Conclusions:

1. From the confusion matrix, the random forest misclassifies 0 samples on the training set (a sign that it has essentially memorized the training data), and 88 candidates are predicted to go on to a master's degree.

2. On the test set there are 5 misclassified samples.

5.3.6 Decision tree classifier

from sklearn.tree import DecisionTreeClassifier

dtc= DecisionTreeClassifier(criterion='entropy',max_depth=3)

dtc.fit(x_train, y_train_01)
print('score:', dtc.score(x_test, y_test_01))
print('Real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(dtc.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(dtc.predict(x_test.iloc[[2],:])))

cm_dtc=confusion_matrix(y_test_01,dtc.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_dtc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

print('precision_score is :', precision_score(y_test_01, dtc.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, dtc.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, dtc.predict(x_test)))

# Test for Train Dataset:

cm_dtc_train=confusion_matrix(y_train_01,dtc.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_dtc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

Conclusions:

1. From the confusion matrix, the decision tree misclassifies 20 samples on the training set, and 78 candidates are predicted to go on to a master's degree.

2. On the test set there are 7 misclassified samples.

5.3.7 K-nearest neighbors classifier

from sklearn.neighbors import KNeighborsClassifier

scores = []
for each in range(1, 50):
    knn_n = KNeighborsClassifier(n_neighbors=each)
    knn_n.fit(x_train, y_train_01)
    scores.append(knn_n.score(x_test, y_test_01))

plt.plot(range(1,50),scores)

plt.xlabel('k')

plt.ylabel('Accuracy')

plt.show()
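Rather than reading the best k off the plot, it can be taken from the scores list directly. A sketch (note that choosing k on the test set is optimistic; a separate validation split would be sounder):

best_k = int(np.argmax(scores)) + 1  # range(1, 50) starts at k = 1
print('best k:', best_k, 'accuracy:', max(scores))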

knn= KNeighborsClassifier(n_neighbors=7)

knn.fit(x_train, y_train_01)
print('score 7 :', knn.score(x_test, y_test_01))
print('Real value of y_test_01[1]: ' + str(y_test_01[1]) + ' -> predict value: ' + str(knn.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: ' + str(y_test_01[2]) + ' -> predict value: ' + str(knn.predict(x_test.iloc[[2],:])))

cm_knn=confusion_matrix(y_test_01,knn.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_knn,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

print('precision_score is :', precision_score(y_test_01, knn.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, knn.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, knn.predict(x_test)))

# Test for Train Dataset:

cm_knn_train=confusion_matrix(y_train_01,knn.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_knn_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

Conclusions:

1. From the confusion matrix, KNN misclassifies 22 samples on the training set, and 71 candidates are predicted to go on to a master's degree.

2. On the test set there are 7 misclassified samples.

5.3.8 Comparing the classifiers

y =np.array([lrc.score(x_test,y_test_01),svm.score(x_test,y_test_01),nb.score(x_test,y_test_01),

dtc.score(x_test,y_test_01),rfc.score(x_test,y_test_01),knn.score(x_test,y_test_01)])

x= np.arange(6)

plt.bar(x,y)

plt.title('Comparison of Classification Algorithms')

plt.xlabel('Classification')

plt.ylabel('Score')

plt.xticks(x,("LogisticReg.","SVM","GNB","Dec.Tree","Ran.Forest","KNN"))

plt.show()

Conclusion: random forest and naive Bayes achieve the highest test scores.
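Because a single 80/20 split of 400 rows is noisy, cross-validation gives a steadier comparison. A minimal sketch that re-scores the six classifiers with 5-fold CV, re-deriving the 0/1 labels with the same 0.8 threshold and redoing the scaling inside each fold via a Pipeline to avoid leakage:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

y_01 = (y > 0.8).astype(int)  # same threshold as in section 5.3.1
for name, clf in [('LogisticReg.', lrc), ('SVM', svm), ('GNB', nb),
                  ('Dec.Tree', dtc), ('Ran.Forest', rfc), ('KNN', knn)]:
    pipe = make_pipeline(MinMaxScaler(), clf)  # scaler is fit on each training fold only
    cv_scores = cross_val_score(pipe, x, y_01, cv=5)
    print(name, 'mean:', round(cv_scores.mean(), 3), 'std:', round(cv_scores.std(), 3))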

5.4 Clustering

5.4.1 Prepare the data

df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')

df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})

serialNo= df['Serial No.']

df.drop(['Serial No.'],axis=1,inplace=True)

df= (df - np.min(df)) / (np.max(df)-np.min(df))

y= df['Chance of Admit']

x= df.drop(['Chance of Admit'],axis=1)

5.4.2 Dimensionality reduction

from sklearn.decomposition import PCA

pca= PCA(n_components=1,whiten=True)

pca.fit(x)

x_pca=pca.transform(x)

x_pca= x_pca.reshape(400)

dictionary= {'x':x_pca,'y':y}

data = pd.DataFrame(dictionary)
print('pca data:', data.head())
print()
print('orig data:', df.head())
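It is worth checking how much of the variance a single principal component keeps; a one-line check on the fitted pca:

# Fraction of total variance captured by the first principal component
print('explained variance ratio:', pca.explained_variance_ratio_)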

5.4.3 K-means clustering

from sklearn.cluster import KMeans

wcss = []
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

plt.plot(range(1,15),wcss)

plt.xlabel('k (number of clusters)')

plt.ylabel('WCSS')

plt.show()

df["Serial No."] =serialNo

kmeans= KMeans(n_clusters=3)

clusters_knn=kmeans.fit_predict(x)

df['label_kmeans'] =clusters_knn

plt.scatter(df[df.label_kmeans== 0 ]["Serial No."],df[df.label_kmeans == 0]['Chance of Admit'],color = "red")

plt.scatter(df[df.label_kmeans== 1 ]["Serial No."],df[df.label_kmeans == 1]['Chance of Admit'],color = "blue")

plt.scatter(df[df.label_kmeans== 2 ]["Serial No."],df[df.label_kmeans == 2]['Chance of Admit'],color = "green")

plt.title("K-means Clustering")

plt.xlabel("Candidates")

plt.ylabel("Chance of Admit")

plt.show()

plt.scatter(data.x[df.label_kmeans== 0 ],data[df.label_kmeans == 0].y,color = "red")

plt.scatter(data.x[df.label_kmeans== 1 ],data[df.label_kmeans == 1].y,color = "blue")

plt.scatter(data.x[df.label_kmeans== 2 ],data[df.label_kmeans == 2].y,color = "green")

plt.title("K-means Clustering")

plt.xlabel("X")

plt.ylabel("Chance of Admit")

plt.show()

Conclusion: the data fall into three groups: candidates who will go on to a master's degree, candidates who will not, and candidates who are undecided but fairly likely to continue.
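This interpretation of the three clusters can be verified numerically by summarizing Chance of Admit per cluster label (note the values here are min-max normalized). A sketch:

# Mean and size of each K-means cluster
print(df.groupby('label_kmeans')['Chance of Admit'].agg(['mean', 'count']))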

5.4.4 Hierarchical clustering

from scipy.cluster.hierarchy import linkage, dendrogram

merg= linkage(x,method='ward')

dendrogram(merg,leaf_rotation=90)

plt.xlabel('data points')

plt.ylabel('euclidean distance')

plt.show()

from sklearn.cluster import AgglomerativeClustering

hiyerartical_cluster= AgglomerativeClustering(n_clusters=3,affinity='euclidean',linkage='ward')

clusters_hiyerartical=hiyerartical_cluster.fit_predict(x)

df['label_hiyerartical'] =clusters_hiyerartical

plt.scatter(df[df.label_hiyerartical== 0 ]["Serial No."],df[df.label_hiyerartical == 0]['Chance of Admit'],color = "red")

plt.scatter(df[df.label_hiyerartical== 1 ]["Serial No."],df[df.label_hiyerartical == 1]['Chance of Admit'],color = "blue")

plt.scatter(df[df.label_hiyerartical== 2 ]["Serial No."],df[df.label_hiyerartical == 2]['Chance of Admit'],color = "green")

plt.title('Hierarchical Clustering')

plt.xlabel('Candidates')

plt.ylabel('Chance of Admit')

plt.show()

plt.scatter(data[df.label_hiyerartical== 0].x,data.y[df.label_hiyerartical==0],color='red')

plt.scatter(data[df.label_hiyerartical== 1].x,data.y[df.label_hiyerartical==1],color='blue')

plt.scatter(data[df.label_hiyerartical== 2].x,data.y[df.label_hiyerartical==2],color='green')

plt.title('Hierarchical Clustering')

plt.xlabel('X')

plt.ylabel('Chance of Admit')

plt.show()

Conclusion: hierarchical clustering gives results consistent with K-means; in addition, the dendrogram confirms k = 3 as a sensible number of clusters.
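The claimed agreement between the two clusterings can be quantified with the adjusted Rand index on the two label vectors (1.0 means identical partitions). A sketch:

from sklearn.metrics import adjusted_rand_score
print('ARI:', adjusted_rand_score(clusters_knn, clusters_hiyerartical))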

Conclusion: by working through this introductory dataset, you can learn

1. ways to explore and display features,

2. how to call the sklearn API, and

3. how to compare the quality of different models.

Code + dataset: https://github.com/Mounment/python-data-analyze/tree/master/kaggle/score

If you find it useful, please give the repo a star. Thanks!
