Common Machine Learning Algorithms (with Python and R Code)
A side-by-side comparison of the 10 most commonly used machine learning algorithms, with code in both Python and R:
① Linear Regression
② Logistic Regression
③ Decision Tree
④ Support Vector Machine (SVM)
⑤ Naive Bayes
⑥ k-Nearest Neighbors (kNN)
⑦ k-Means
⑧ Random Forest
⑨ Principal Component Analysis (PCA)
⑩ Gradient Boosting
- GBM
- XGBoost
- LightGBM
- CatBoost
I. Linear Regression
Linear regression comes in two main types: simple linear regression and multiple linear regression. Simple linear regression uses a single independent variable, while multiple linear regression (as the name suggests) uses more than one. When searching for the best-fit line, you can also fit a polynomial or a curve; this is known as polynomial or curvilinear regression. A minimal sketch of the polynomial case follows below, before the main code.
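To make the polynomial case concrete, here is a minimal sketch that fits a degree-2 curve with scikit-learn; the synthetic data and the chosen degree are assumptions for illustration only.

# a minimal polynomial-regression sketch on synthetic (made-up) data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(0, 10, 50).reshape(-1, 1)  # one independent variable
y = 2 * x.ravel()**2 - 3 * x.ravel() + np.random.normal(0, 5, 50)  # noisy quadratic target

poly = PolynomialFeatures(degree=2)  # expand x into [1, x, x^2]
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)  # still linear in the expanded features
print('R^2 on training data :', model.score(x_poly, y))

The trick is that a polynomial in x is still a linear model in the expanded feature set, so the ordinary LinearRegression estimator used below applies unchanged.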
1. Python code
# importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# read the train and test dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print(train_data.head())

# shape of the dataset
print('\nShape of training data :', train_data.shape)
print('\nShape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Item_Outlet_Sales

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'], axis=1)
train_y = train_data['Item_Outlet_Sales']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Item_Outlet_Sales'], axis=1)
test_y = test_data['Item_Outlet_Sales']

'''
Create the object of the Linear Regression model
You can also add other parameters and test your code here
Some parameters are : fit_intercept and normalize
Documentation of sklearn LinearRegression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
'''
model = LinearRegression()

# fit the model with the training data
model.fit(train_x, train_y)

# coefficients of the trained model
print('\nCoefficient of model :', model.coef_)
# intercept of the model
print('\nIntercept of model :', model.intercept_)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nItem_Outlet_Sales on training data', predict_train)

# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(train_y, predict_train) ** 0.5
print('\nRMSE on train dataset : ', rmse_train)

# predict the target on the testing dataset
predict_test = model.predict(test_x)
print('\nItem_Outlet_Sales on test data', predict_test)

# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(test_y, predict_test) ** 0.5
print('\nRMSE on test dataset : ', rmse_test)
2. R code
# Load train and test datasets
# Identify feature and response variable(s); values must be numeric vectors/matrices
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)

# Train the model using the training set and check the fit
linear <- lm(y_train ~ ., data = x)
summary(linear)

# Predict output
predicted <- predict(linear, x_test)
II. Logistic Regression
Don't be fooled by the name! Logistic regression is a classification algorithm, not a regression algorithm. It estimates discrete values (binary outcomes such as 0/1, yes/no, true/false) from a given set of independent variables. Put simply, it predicts the probability that an event occurs by fitting the data to a logit function, which is why it is also known as logit regression. Because it predicts probabilities, its output always lies between 0 and 1 (as you would expect).
Again, let's try to understand this through a simple example.
Suppose a friend gives you a puzzle to solve. There are only two possible outcomes: either you solve it or you don't. Now imagine being given a wide variety of puzzles and quizzes in an attempt to discover which subjects you are good at. The outcome of such a study might look like this: given a tenth-grade trigonometry problem, you have a 70% chance of solving it; given a fifth-grade history question, the probability of getting the answer is only 30%. This is exactly what logistic regression provides.
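To make "fitting the data to a logit function" concrete, here is a minimal sketch of the logistic (sigmoid) function, which squashes any linear score into a probability between 0 and 1; the intercept and coefficient are made-up values used purely for illustration.

import numpy as np

def sigmoid(z):
    # the logistic function maps any real-valued score into (0, 1)
    return 1 / (1 + np.exp(-z))

# made-up (assumed) intercept and coefficient for a single 'difficulty' feature
intercept, coef = 2.0, -1.5
for difficulty in [0.5, 1.0, 2.0]:
    p = sigmoid(intercept + coef * difficulty)
    print('difficulty', difficulty, '-> probability of solving :', round(p, 2))

Harder puzzles receive a lower predicted probability, mirroring the trigonometry-versus-history example above.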
1. Python code
# importing required libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')
print(train_data.head())

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the Logistic Regression model
You can also add other parameters and test your code here
Some parameters are : fit_intercept and penalty
Documentation of sklearn LogisticRegression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
'''
model = LogisticRegression()

# fit the model with the training data
model.fit(train_x, train_y)

# coefficients of the trained model
print('Coefficient of model :', model.coef_)
# intercept of the model
print('Intercept of model :', model.intercept_)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
2. R code
x <- cbind(x_train, y_train)

# Train the model using the training set and check the fit
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)

# Predict output (type = 'response' returns probabilities rather than log-odds)
predicted <- predict(logistic, x_test, type = 'response')
III. Decision Tree
1. Python code
# importing required libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the Decision Tree model
You can also add other parameters and test your code here
Some parameters are : max_depth and max_features
Documentation of sklearn DecisionTreeClassifier:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
'''
model = DecisionTreeClassifier()

# fit the model with the training data
model.fit(train_x, train_y)

# depth of the decision tree
print('Depth of the Decision Tree :', model.get_depth())

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
2. R code
library(rpart)
x <- cbind(x_train, y_train)

# grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
IV. Support Vector Machine (SVM)
1. Python code
# importing required libraries
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the Support Vector Classifier model
You can also add other parameters and test your code here
Some parameters are : kernel and degree
Documentation of sklearn Support Vector Classifier:
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
'''
model = SVC()

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
2. R code
library(e1071)
x <- cbind(x_train, y_train)

# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
V. Naive Bayes
1. Python code
# importing required libraries
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the Naive Bayes model
You can also add other parameters and test your code here
Some parameters are : var_smoothing
Documentation of sklearn GaussianNB:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
'''
model = GaussianNB()

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
2. R code
library(e1071)
x <- cbind(x_train, y_train)

# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
VI. k-Nearest Neighbors (kNN)
1. Python code
# importing required libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the K-Nearest Neighbors model
You can also add other parameters and test your code here
Some parameters are : n_neighbors, leaf_size
Documentation of sklearn K-Neighbors Classifier:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
'''
model = KNeighborsClassifier()

# fit the model with the training data
model.fit(train_x, train_y)

# number of neighbors used to predict the target
print('\nThe number of neighbors used to predict the target : ', model.n_neighbors)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)
2. R code
# knn() lives in the 'class' package and takes the training features,
# the test features and the training labels directly (no formula interface)
library(class)

# Fit and predict in one step
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
VII. k-Means
1. Python code
# importing required libraries
import pandas as pd
from sklearn.cluster import KMeans

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to divide the training data into different clusters
# and predict in which cluster a particular data point belongs.

'''
Create the object of the K-Means model
You can also add other parameters and test your code here
Some parameters are : n_clusters and max_iter
Documentation of sklearn KMeans:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
'''
model = KMeans()

# fit the model with the training data
model.fit(train_data)

# number of clusters
print('\nDefault number of Clusters : ', model.n_clusters)

# predict the clusters on the train dataset
predict_train = model.predict(train_data)
print('\nClusters on train data', predict_train)

# predict the clusters on the test dataset
predict_test = model.predict(test_data)
print('Clusters on test data', predict_test)

# Now, we will train a model with n_clusters = 3
model_n3 = KMeans(n_clusters=3)

# fit the model with the training data
model_n3.fit(train_data)

# number of clusters
print('\nNumber of Clusters : ', model_n3.n_clusters)

# predict the clusters on the train dataset
predict_train_3 = model_n3.predict(train_data)
print('\nClusters on train data', predict_train_3)

# predict the clusters on the test dataset
predict_test_3 = model_n3.predict(test_data)
print('Clusters on test data', predict_test_3)
2. R code
library(cluster)
fit <- kmeans(X, 3) # 3-cluster solution
VIII. Random Forest
1. Python code
# importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# view the top 3 rows of the dataset
print(train_data.head(3))

# shape of the dataset
print('\nShape of training data :', train_data.shape)
print('\nShape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the Random Forest model
You can also add other parameters and test your code here
Some parameters are : n_estimators and max_depth
Documentation of sklearn RandomForestClassifier:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
'''
model = RandomForestClassifier()

# fit the model with the training data
model.fit(train_x, train_y)

# number of trees used
print('Number of Trees used : ', model.n_estimators)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)
2. R code
library(randomForest)
x <- cbind(x_train, y_train)

# Fitting model (the response column in x is y_train)
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
IX. Principal Component Analysis (PCA)
1. Python code
# importing required libraries
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# read the train and test dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# view the top 3 rows of the dataset
print(train_data.head(3))

# shape of the dataset
print('\nShape of training data :', train_data.shape)
print('\nShape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Item_Outlet_Sales

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'], axis=1)
train_y = train_data['Item_Outlet_Sales']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Item_Outlet_Sales'], axis=1)
test_y = test_data['Item_Outlet_Sales']

print('\nTraining model with {} dimensions.'.format(train_x.shape[1]))

# create object of model
model = LinearRegression()

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)

# Root Mean Squared Error on train dataset
rmse_train = mean_squared_error(train_y, predict_train) ** 0.5
print('\nRMSE on train dataset : ', rmse_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)

# Root Mean Squared Error on test dataset
rmse_test = mean_squared_error(test_y, predict_test) ** 0.5
print('\nRMSE on test dataset : ', rmse_test)

# create the object of the PCA (Principal Component Analysis) model
# reduce the dimensions of the data to 12
'''
You can also add other parameters and test your code here
Some parameters are : svd_solver, iterated_power
Documentation of sklearn PCA:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
'''
model_pca = PCA(n_components=12)

# fit PCA on the training data, then apply the same transformation to the test data
new_train = model_pca.fit_transform(train_x)
new_test = model_pca.transform(test_x)

print('\nTraining model with {} dimensions.'.format(new_train.shape[1]))

# create object of model
model_new = LinearRegression()

# fit the model with the reduced training data
model_new.fit(new_train, train_y)

# predict the target on the new train dataset
predict_train_pca = model_new.predict(new_train)

# Root Mean Squared Error on train dataset
rmse_train_pca = mean_squared_error(train_y, predict_train_pca) ** 0.5
print('\nRMSE on new train dataset : ', rmse_train_pca)

# predict the target on the new test dataset
predict_test_pca = model_new.predict(new_test)

# Root Mean Squared Error on test dataset
rmse_test_pca = mean_squared_error(test_y, predict_test_pca) ** 0.5
print('\nRMSE on new test dataset : ', rmse_test_pca)
2. R code
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)
X. Gradient Boosting
- GBM
1. Python code
# importing required libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the GradientBoosting Classifier model
You can also add other parameters and test your code here
Some parameters are : learning_rate, n_estimators
Documentation of sklearn GradientBoosting Classifier:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
'''
model = GradientBoostingClassifier(n_estimators=100, max_depth=5)

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)
2. R code
library(caret)
x <- cbind(x_train, y_train)

# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
predicted <- predict(fit, x_test, type = "prob")[,2]
- XGBoost
1. Python code
# importing required libraries
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'], axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'], axis=1)
test_y = test_data['Survived']

'''
Create the object of the XGBoost model
You can also add other parameters and test your code here
Some parameters are : max_depth and n_estimators
Documentation of xgboost:
https://xgboost.readthedocs.io/en/latest/
'''
model = XGBClassifier()

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)
2. R code
require(caret)
x <- cbind(x_train, y_train)

# Fitting model
TrainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 4)
model <- train(y_train ~ ., data = x, method = "xgbLinear", trControl = TrainControl, verbose = FALSE)
# or, for the tree booster:
model <- train(y_train ~ ., data = x, method = "xgbTree", trControl = TrainControl, verbose = FALSE)
predicted <- predict(model, x_test)
- LightGBM
1. Python code
# importing required libraries
import numpy as np
import lightgbm as lgb

data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
test_data = train_data.create_valid('test.svm')

param = {'num_leaves': 31, 'num_trees': 100, 'objective': 'binary'}
param['metric'] = 'auc'

num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])
bst.save_model('model.txt')

# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)
2. R code
library(RLightGBM)
data(example.binary)

# Parameters
num_iterations <- 100
config <- list(objective = "binary", metric = "binary_logloss,auc", learning_rate = 0.1, num_leaves = 63, tree_learner = "serial", feature_fraction = 0.8, bagging_freq = 5, bagging_fraction = 0.8, min_data_in_leaf = 50, min_sum_hessian_in_leaf = 5.0)

# Create data handle and booster
handle.data <- lgbm.data.create(x)
lgbm.data.setField(handle.data, "label", y)
handle.booster <- lgbm.booster.create(handle.data, lapply(config, as.character))

# Train for num_iterations iterations and eval every 5 steps
lgbm.booster.train(handle.booster, num_iterations, 5)

# Predict
pred <- lgbm.booster.predict(handle.booster, x.test)

# Test accuracy
sum(y.test == (pred > 0.5)) / length(y.test)

# Save model (can be loaded again via lgbm.booster.load(filename))
lgbm.booster.save(handle.booster, filename = "/tmp/model.txt")

# Alternatively, use the caret wrapper shipped with RLightGBM:
require(caret)
require(RLightGBM)
data(iris)

model <- caretModel.LGBM()
fit <- train(Species ~ ., data = iris, method = model, verbosity = 0)
print(fit)
y.pred <- predict(fit, iris[,1:4])

library(Matrix)
model.sparse <- caretModel.LGBM.sparse()

# Generate a sparse matrix
mat <- Matrix(as.matrix(iris[,1:4]), sparse = T)
fit <- train(data.frame(idx = 1:nrow(iris)), iris$Species, method = model.sparse, matrix = mat, verbosity = 0)
print(fit)
- CatBoost
1. Python code
# importing required libraries
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

# Read training and testing files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Imputing missing values for both train and test
train.fillna(-999, inplace=True)
test.fillna(-999, inplace=True)

# Creating a training set for modeling and validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)

# indices of the categorical features (every column that is not a float)
categorical_features_indices = np.where(X.dtypes != float)[0]

# building model
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), plot=True)

submission = pd.DataFrame()
submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)
2. R code
set.seed(1)
require(titanic)
require(caret)
require(catboost)

tt <- titanic::titanic_train[complete.cases(titanic::titanic_train),]
data <- as.data.frame(as.matrix(tt), stringsAsFactors = TRUE)

drop_columns = c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[,!(names(data) %in% drop_columns)]
y <- data[,c("Survived")]

fit_control <- trainControl(method = "cv", number = 4, classProbs = TRUE)
grid <- expand.grid(depth = c(4, 6, 8), learning_rate = 0.1, iterations = 100, l2_leaf_reg = 1e-3, rsm = 0.95, border_count = 64)

report <- train(x, as.factor(make.names(y)), method = catboost.caret, verbose = TRUE, preProc = NULL, tuneGrid = grid, trControl = fit_control)
print(report)

importance <- varImp(report, scale = FALSE)
print(importance)