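Machine Learning Experiment (1): Using Machine Learning (the K-means Algorithm) to Identify the Main Causes of Household Electricity Use
Notice: All rights reserved. To repost, please contact the author and credit the source: http://blog.csdn.net/u013719780?viewmode=contents
About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into the latest Machine Learning techniques and is keenly interested in Deep Learning and Artificial Intelligence. He regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents
Using Machine Learning (the K-means Algorithm) to Determine the Main Causes of Household Electricity Use
This post performs some basic analysis of a household's electricity usage data.
It is divided into two parts:
Part One: some simple data cleaning and analysis;
Part Two: using an unsupervised machine learning algorithm, K-means, to determine the main cause of household electricity use during a given time period.
First, import the required packages and read the data sets. The code is as follows: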
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

sensor_data = pd.read_csv('merged-sensor-files.csv', names=["MTU", "Time", "Power", "Cost", "Voltage"], header=0)
weather_data = pd.read_json('weather.json', typ='series')

import json

# Read the weather file into a dict of {timestamp: temperature}
with open('weather.json') as f:
    json_data = json.load(f)

Time = []
Temperature = []
for time, temperature in json_data.items():
    Time.append(int(time))
    Temperature.append(float(temperature))

temperature = pd.DataFrame({'Time': Time, 'Temperature': Temperature})
temperature
| | Temperature | Time |
---|---|---|
0 | 84.4 | 1431468000 |
1 | 83.3 | 1431450000 |
2 | 70.7 | 1431403200 |
3 | 72.1 | 1431432000 |
4 | 84.2 | 1431464400 |
5 | 80.9 | 1431446400 |
6 | 68.6 | 1431424800 |
7 | 81.1 | 1431475200 |
8 | 80.7 | 1431442800 |
9 | 69.2 | 1431417600 |
10 | 76.2 | 1431435600 |
11 | 68.8 | 1431414000 |
12 | 72.1 | 1431396000 |
13 | 68.7 | 1431428400 |
14 | 80.1 | 1431439200 |
15 | 83.0 | 1431471600 |
16 | 69.0 | 1431410400 |
17 | 75.4 | 1431388800 |
18 | 71.0 | 1431399600 |
19 | 69.6 | 1431406800 |
20 | 67.9 | 1431421200 |
21 | 85.1 | 1431457200 |
22 | 87.0 | 1431460800 |
23 | 73.2 | 1431392400 |
24 | 84.5 | 1431453600 |
# Alternative: parse the file by hand, without the json module
with open('weather.json') as weather_file:
    f = weather_file.read()

Time = []
Temperature = []
for line in f.split(','):
    time, temperature = line.split(':')
    time = time.replace('"', '').replace('{', '')
    temperature = temperature.replace('"', '').replace('}', '')
    # print(time, temperature)
    Time.append(int(time))
    Temperature.append(float(temperature))
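The Time values in weather.json look like Unix epoch seconds (1431388800 corresponds to 2015-05-12 00:00:00 UTC). As a minimal sketch, assuming the file is a flat mapping from epoch seconds to temperature strings, the parsed values could be turned into a time-sorted frame as follows. The `sample_weather` dict and the `weather_to_frame` helper are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical excerpt shaped like weather.json: epoch seconds -> temperature.
sample_weather = {
    "1431392400": "73.2",
    "1431388800": "75.4",
    "1431396000": "72.1",
}


def weather_to_frame(raw):
    """Turn an {epoch_seconds: temperature} mapping into a time-sorted DataFrame."""
    frame = pd.DataFrame(
        {
            # unit="s" tells pandas the integers are seconds since 1970-01-01 UTC
            "Time": pd.to_datetime([int(t) for t in raw], unit="s"),
            "Temperature": [float(v) for v in raw.values()],
        }
    )
    frame = frame.sort_values("Time").reset_index(drop=True)
    frame["Hour"] = frame["Time"].dt.hour
    return frame


weather_frame = weather_to_frame(sample_weather)
print(weather_frame)
```

Sorting by Time keeps each temperature attached to its own timestamp, so the result does not depend on the order in which the keys happen to be read.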
# A quick look at the datasets
sensor_data.head(5)
| | MTU | Time | Power | Cost | Voltage |
---|---|---|---|---|---|
0 | MTU1 | 05/11/2015 19:59:06 | 4.102 | 0.62 | 122.4 |
1 | MTU1 | 05/11/2015 19:59:05 | 4.089 | 0.62 | 122.3 |
2 | MTU1 | 05/11/2015 19:59:04 | 4.089 | 0.62 | 122.3 |
3 | MTU1 | 05/11/2015 19:59:06 | 4.089 | 0.62 | 122.3 |
4 | MTU1 | 05/11/2015 19:59:04 | 4.097 | 0.62 | 122.4 |
sensor_data.describe()
| | MTU | Time | Power | Cost | Voltage |
---|---|---|---|---|---|
count | 88914 | 88914 | 88914 | 88914 | 88914 |
unique | 2 | 72359 | 2495 | 88 | 48 |
top | MTU1 | Time | 0.136 | 0.05 | 123.1 |
freq | 88891 | 23 | 6544 | 19476 | 5063 |
sensor_data.info()
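<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88914 entries, 0 to 88913
Data columns (total 5 columns):
MTU        88914 non-null object
Time       88914 non-null object
Power      88914 non-null object
Cost       88914 non-null object
Voltage    88914 non-null object
dtypes: object(5)
memory usage: 3.4+ MB
Task 1: Data Analysis
Data Cleaning
The data set merged-sensor-files.csv contains some problematic rows: every column, including the numeric ones, was read as object, and the most frequent value in the Time column is the string "Time" itself (23 occurrences).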
sensor_data.dtypes
MTU        object
Time       object
Power      object
Cost       object
Voltage    object
dtype: object
Next, find the problematic rows.
# Get the inconsistent rows indexes
faulty_row_idx = sensor_data[sensor_data["Power"] == " Power"].index.tolist()
faulty_row_idx
[3784,7582,11385,15004,18773,22363,26049,29795,33554,37193,40951,44563,48227,51934,55660,59431,63041,66706,70468,74305,77951,81617,85327]
Drop the problematic rows.
sensor_data.drop(faulty_row_idx, inplace=True)
sensor_data[sensor_data["Power"] == " Power"].index.tolist()
[]
The empty result confirms that the problematic rows have been removed.
With sensor_data cleaned up, the columns can now be converted to more appropriate data types.
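Matching the literal string " Power" works here because the stray rows are repeated header lines. A more general variant, shown only as a sketch on a toy frame, is to coerce the numeric columns and drop whatever fails to parse. The toy values and the assumption that the Cost and Voltage cells of a stray row also hold header text are illustrative and not taken from the real file.

```python
import pandas as pd

# Toy frame imitating the raw file: a repeated header row sits among the readings.
raw = pd.DataFrame(
    {
        "Power": ["4.102", " Power", "4.089"],
        "Cost": ["0.62", " Cost", "0.62"],
        "Voltage": ["122.4", " Voltage", "122.3"],
    }
)

numeric_cols = ["Power", "Cost", "Voltage"]

# errors="coerce" turns anything that is not a number into NaN instead of raising.
converted = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

# A row is faulty if any of its numeric columns failed to parse.
bad_rows = converted.isna().any(axis=1)
cleaned = converted[~bad_rows].reset_index(drop=True)

print("rows dropped:", int(bad_rows.sum()))
print(cleaned.dtypes)
```

This also catches malformed values other than repeated headers, at the cost of silently discarding them, so it is worth checking the dropped count before moving on.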
sensor_data[["Power", "Cost", "Voltage"]] = sensor_data[["Power", "Cost", "Voltage"]].astype(float)
sensor_data["Time"] = pd.to_datetime(sensor_data["Time"])
sensor_data["Hour"] = sensor_data["Time"].dt.hour
sensor_data.dtypes
MTU                object
Time       datetime64[ns]
Power             float64
Cost              float64
Voltage           float64
Hour                int32
dtype: object
This is better: every column now has a clearly defined data type. The next step is to convert the weather_data Series to a dataframe so that it is easier to work with.
temperature_data = weather_data.to_frame()
temperature_data.reset_index(level=0, inplace=True)
temperature_data.columns = ["Time", "Temperature"]
temperature_data.dtypes
# Relies on the Temperature list parsed earlier being in the same order as these rows
temperature_data['Temperature'] = Temperature
temperature_data["Hour"] = pd.DatetimeIndex(temperature_data["Time"]).hour
temperature_data[["Temperature"]] = temperature_data[["Temperature"]].astype(float)
temperature_data
| | Time | Temperature | Hour |
---|---|---|---|
0 | 2015-05-12 00:00:00 | 75.4 | 0 |
1 | 2015-05-12 01:00:00 | 73.2 | 1 |
2 | 2015-05-12 02:00:00 | 72.1 | 2 |
3 | 2015-05-12 03:00:00 | 71.0 | 3 |
4 | 2015-05-12 04:00:00 | 70.7 | 4 |
5 | 2015-05-12 05:00:00 | 69.6 | 5 |
6 | 2015-05-12 06:00:00 | 69.0 | 6 |
7 | 2015-05-12 07:00:00 | 68.8 | 7 |
8 | 2015-05-12 08:00:00 | 69.2 | 8 |
9 | 2015-05-12 09:00:00 | 67.9 | 9 |
10 | 2015-05-12 10:00:00 | 68.6 | 10 |
11 | 2015-05-12 11:00:00 | 68.7 | 11 |
12 | 2015-05-12 12:00:00 | 72.1 | 12 |
13 | 2015-05-12 13:00:00 | 76.2 | 13 |
14 | 2015-05-12 14:00:00 | 80.1 | 14 |
15 | 2015-05-12 15:00:00 | 80.7 | 15 |
16 | 2015-05-12 16:00:00 | 80.9 | 16 |
17 | 2015-05-12 17:00:00 | 83.3 | 17 |
18 | 2015-05-12 18:00:00 | 84.5 | 18 |
19 | 2015-05-12 19:00:00 | 85.1 | 19 |
20 | 2015-05-12 20:00:00 | 87.0 | 20 |
21 | 2015-05-12 21:00:00 | 84.2 | 21 |
22 | 2015-05-12 22:00:00 | 84.4 | 22 |
23 | 2015-05-12 23:00:00 | 83.0 | 23 |
24 | 2015-05-13 00:00:00 | 81.1 | 0 |
sensor_data.describe()
| | Power | Cost | Voltage | Hour |
---|---|---|---|---|
count | 88891.000000 | 88891.000000 | 88891.000000 | 88891.000000 |
mean | 1.315980 | 0.202427 | 123.127744 | 11.531865 |
std | 1.682181 | 0.252357 | 0.838768 | 6.921775 |
min | 0.113000 | 0.020000 | 121.000000 | 0.000000 |
25% | 0.255000 | 0.040000 | 122.600000 | 6.000000 |
50% | 0.367000 | 0.060000 | 123.100000 | 12.000000 |
75% | 1.765000 | 0.270000 | 123.700000 | 18.000000 |
max | 6.547000 | 0.990000 | 125.600000 | 23.000000 |
temperature_data.describe()
| | Temperature | Hour |
---|---|---|
count | 25.000000 | 25.00000 |
mean | 76.272000 | 11.04000 |
std | 6.635355 | 7.29429 |
min | 67.900000 | 0.00000 |
25% | 69.600000 | 5.00000 |
50% | 75.400000 | 11.00000 |
75% | 83.000000 | 17.00000 |
max | 87.000000 | 23.00000 |
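From the statistics above, the mean, minimum and maximum power draw are 1.316kW, 0.113kW and 6.547kW respectively. To understand the data better, we plot power and temperature against time. Before plotting, the data needs to be grouped by the 'Hour' column:
# Select the numeric columns explicitly; recent pandas versions raise on .mean() over the string column MTU
grouped_sensor_data = sensor_data.groupby("Hour", as_index=False)[["Power", "Cost", "Voltage"]].mean()
grouped_sensor_data
| | Hour | Power | Cost | Voltage |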
---|---|---|---|---|
0 | 0 | 0.173790 | 0.029468 | 124.723879 |
1 | 1 | 0.179594 | 0.033805 | 124.522469 |
2 | 2 | 0.185763 | 0.037013 | 123.929979 |
3 | 3 | 0.184510 | 0.036815 | 124.174454 |
4 | 4 | 0.181104 | 0.036366 | 123.847801 |
5 | 5 | 0.184242 | 0.036693 | 122.790974 |
6 | 6 | 0.672423 | 0.106142 | 123.375132 |
7 | 7 | 0.977755 | 0.150614 | 123.722441 |
8 | 8 | 0.382392 | 0.060904 | 122.997544 |
9 | 9 | 0.168447 | 0.027770 | 122.675906 |
10 | 10 | 0.373942 | 0.058812 | 122.986207 |
11 | 11 | 0.383065 | 0.059837 | 123.500554 |
12 | 12 | 0.378432 | 0.059604 | 122.783133 |
13 | 13 | 0.380076 | 0.059766 | 122.991571 |
14 | 14 | 0.378020 | 0.059666 | 122.815359 |
15 | 15 | 0.376586 | 0.059619 | 122.464499 |
16 | 16 | 4.365774 | 0.659342 | 121.766840 |
17 | 17 | 4.318118 | 0.652923 | 121.851496 |
18 | 18 | 4.779928 | 0.721469 | 122.301059 |
19 | 19 | 4.250034 | 0.642619 | 122.103700 |
20 | 20 | 1.967120 | 0.300640 | 122.770635 |
21 | 21 | 1.579896 | 0.242180 | 123.086060 |
22 | 22 | 2.542672 | 0.387109 | 123.542620 |
23 | 23 | 2.269941 | 0.346457 | 123.415791 |
# Average only Temperature so that the Time column is not carried into the result
grouped_temperature_data = temperature_data.groupby("Hour", as_index=False)[["Temperature"]].mean()
grouped_temperature_data
| | Hour | Temperature |
---|---|---|
0 | 0 | 78.25 |
1 | 1 | 73.20 |
2 | 2 | 72.10 |
3 | 3 | 71.00 |
4 | 4 | 70.70 |
5 | 5 | 69.60 |
6 | 6 | 69.00 |
7 | 7 | 68.80 |
8 | 8 | 69.20 |
9 | 9 | 67.90 |
10 | 10 | 68.60 |
11 | 11 | 68.70 |
12 | 12 | 72.10 |
13 | 13 | 76.20 |
14 | 14 | 80.10 |
15 | 15 | 80.70 |
16 | 16 | 80.90 |
17 | 17 | 83.30 |
18 | 18 | 84.50 |
19 | 19 | 85.10 |
20 | 20 | 87.00 |
21 | 21 | 84.20 |
22 | 22 | 84.40 |
23 | 23 | 83.00 |
Basic Visualizations:
%matplotlib inline
plt.style.use('ggplot')
fig = plt.figure(figsize=(13, 7))
plt.hist(sensor_data.Power, bins=50)
fig.suptitle('Power Histogram', fontsize=20)
plt.xlabel('Power', fontsize=16)
plt.ylabel('Count', fontsize=16)
[Figure: Power Histogram; x-axis: Power, y-axis: Count]
The histogram shows that power draw is low most of the time, but there are also periods of heavier use, between 3.5kW and 5kW. Next we plot how power use is distributed over the hours of the day.
fig = plt.figure(figsize=(13, 7))
plt.bar(grouped_sensor_data.Hour, grouped_sensor_data.Power)
fig.suptitle('Power Distribution with Hours', fontsize=20)
plt.xlabel('Hour', fontsize=16)
plt.ylabel('Power', fontsize=16)
plt.xticks(range(0, 24))
plt.show()
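The bar chart suggests the following:
- Demand is highest in the evening, probably because most appliances (AC, heating, TV, oven, washing machine and so on) are in use.
- Demand is very low during sleeping hours (0000 - 0500) and office hours (0900 - 1600), because appliances are switched off in both periods.
- Demand rises slightly during 0600 - 0900, probably because some appliances are active.
Steady states:
- During 0000 - 0500 demand is very low, in the range 0.17kW - 0.19kW
- Another steady state is 1000 - 1500, where demand stays within 0.37kW - 0.38kW
- The highest steady period is 1600 - 1900, where demand stays within 4.25kW - 4.78kW
Demand changes abruptly around 0700 and 1800, possibly because of random events, the use of particular appliances, or anomalous data.
Demand also wobbles slightly at 0900, dropping from 0.38kW to 0.17kW and then rising back to 0.37kW. The same pattern appears at 2100.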
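One way to make "abrupt change" concrete is to look at the hour-to-hour difference of the hourly means. The sketch below uses the hourly Power values from the grouped table, rounded to three decimals. The 1.0 kW threshold is an arbitrary choice for illustration, not something derived from the data.

```python
import pandas as pd

# Hourly mean power (kW), rounded from the grouped table; position = hour of day.
hourly_power = pd.Series(
    [0.174, 0.180, 0.186, 0.185, 0.181, 0.184, 0.672, 0.978,
     0.382, 0.168, 0.374, 0.383, 0.378, 0.380, 0.378, 0.377,
     4.366, 4.318, 4.780, 4.250, 1.967, 1.580, 2.543, 2.270],
    index=pd.RangeIndex(24, name="Hour"),
)

# Change relative to the previous hour; hour 0 has no predecessor, so it is NaN.
hourly_change = hourly_power.diff()

# Hypothetical threshold: treat a swing of more than 1 kW as an abrupt change.
JUMP_THRESHOLD_KW = 1.0
jumps = hourly_change[hourly_change.abs() > JUMP_THRESHOLD_KW]
print(jumps)
```

With this threshold only the switch into the evening plateau and the drop out of it are flagged. A lower threshold would also pick up the smaller morning bump, so the cut-off should be chosen with the question in mind.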
Let us go on to plot Temperature against Power and see whether there is any correlation.
fig = plt.figure(figsize=(13, 7))
# Both frames are ordered by Hour, so the two columns line up row by row
plt.bar(grouped_temperature_data.Temperature, grouped_sensor_data.Power)
fig.suptitle('Power Distribution with Temperature', fontsize=20)
plt.xlabel('Temperature in Fahrenheit', fontsize=16)
plt.ylabel('Power', fontsize=16)
plt.show()
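Temperature and power demand appear to be directly related. This makes sense: the data set was collected in May, and cooling appliances are switched on during the peak period (the evening).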
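The visual impression can be checked with a correlation coefficient. The following sketch computes the Pearson correlation between the hourly mean temperature and the hourly mean power, using values copied (and rounded) from the grouped tables. It is meant as a quick sanity check, not a causal analysis.

```python
import numpy as np

# Hourly means copied from the grouped tables; position = hour of day (0..23).
power_kw = np.array([
    0.174, 0.180, 0.186, 0.185, 0.181, 0.184, 0.672, 0.978,
    0.382, 0.168, 0.374, 0.383, 0.378, 0.380, 0.378, 0.377,
    4.366, 4.318, 4.780, 4.250, 1.967, 1.580, 2.543, 2.270,
])
temperature_f = np.array([
    78.25, 73.2, 72.1, 71.0, 70.7, 69.6, 69.0, 68.8,
    69.2, 67.9, 68.6, 68.7, 72.1, 76.2, 80.1, 80.7,
    80.9, 83.3, 84.5, 85.1, 87.0, 84.2, 84.4, 83.0,
])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(temperature_f, power_kw)[0, 1]
print(f"Pearson r between hourly temperature and power: {r:.2f}")
```

A positive coefficient is consistent with the plot, but temperature and occupancy both follow the time of day, so the correlation alone cannot separate cooling load from people simply being at home in the evening.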
Task 2: Machine Learning
To work with one complete data set, merge grouped_sensor_data and grouped_temperature_data.
merged_data = grouped_sensor_data.merge(grouped_temperature_data, on="Hour")
merged_data
| | Hour | Power | Cost | Voltage | Temperature |
---|---|---|---|---|---|
0 | 0 | 0.173790 | 0.029468 | 124.723879 | 78.25 |
1 | 1 | 0.179594 | 0.033805 | 124.522469 | 73.20 |
2 | 2 | 0.185763 | 0.037013 | 123.929979 | 72.10 |
3 | 3 | 0.184510 | 0.036815 | 124.174454 | 71.00 |
4 | 4 | 0.181104 | 0.036366 | 123.847801 | 70.70 |
5 | 5 | 0.184242 | 0.036693 | 122.790974 | 69.60 |
6 | 6 | 0.672423 | 0.106142 | 123.375132 | 69.00 |
7 | 7 | 0.977755 | 0.150614 | 123.722441 | 68.80 |
8 | 8 | 0.382392 | 0.060904 | 122.997544 | 69.20 |
9 | 9 | 0.168447 | 0.027770 | 122.675906 | 67.90 |
10 | 10 | 0.373942 | 0.058812 | 122.986207 | 68.60 |
11 | 11 | 0.383065 | 0.059837 | 123.500554 | 68.70 |
12 | 12 | 0.378432 | 0.059604 | 122.783133 | 72.10 |
13 | 13 | 0.380076 | 0.059766 | 122.991571 | 76.20 |
14 | 14 | 0.378020 | 0.059666 | 122.815359 | 80.10 |
15 | 15 | 0.376586 | 0.059619 | 122.464499 | 80.70 |
16 | 16 | 4.365774 | 0.659342 | 121.766840 | 80.90 |
17 | 17 | 4.318118 | 0.652923 | 121.851496 | 83.30 |
18 | 18 | 4.779928 | 0.721469 | 122.301059 | 84.50 |
19 | 19 | 4.250034 | 0.642619 | 122.103700 | 85.10 |
20 | 20 | 1.967120 | 0.300640 | 122.770635 | 87.00 |
21 | 21 | 1.579896 | 0.242180 | 123.086060 | 84.20 |
22 | 22 | 2.542672 | 0.387109 | 123.542620 | 84.40 |
23 | 23 | 2.269941 | 0.346457 | 123.415791 | 83.00 |
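The earlier visualizations showed that power demand is low when the temperature is low. That relationship, however, mostly concerns cooling appliances. The candidate appliances are: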
- Cooling Systems
- TV
- Geyser
- Lights
- Oven
- Home Security Systems
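Next we use the merged data set to judge whether these appliances are switched on.
AC, Refrigerator and Other Cooling Systems:
The "Power Distribution with Temperature" plot clearly shows power demand jumping as the temperature rises, which suggests that the home's cooling appliances are running.
TV:
During the evening hours (1600 - 2300), the TV is probably another factor behind the higher demand. This is fairly evident from the Power feature.
Geyser, Oven:
The slight increase in demand during the morning hours is probably related to these appliances being in use.
Lights:
Lighting has a relatively small effect on demand (assuming the house owner uses energy-saving bulbs).
Home Security Systems:
The slight increase during working hours is probably caused by home devices and other automated equipment.
Now we apply the K-Means clustering algorithm, using the features Hour, Power and Temperature from the original data set. First, we need to merge the sensor_data dataframe with grouped_temperature_data.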
data = sensor_data.merge(grouped_temperature_data, on="Hour")
data.drop(["Time", "MTU", "Cost", "Voltage"], axis=1, inplace=True)
data.head()
| | Power | Hour | Temperature |
---|---|---|---|
0 | 4.102 | 19 | 85.1 |
1 | 4.089 | 19 | 85.1 |
2 | 4.089 | 19 | 85.1 |
3 | 4.089 | 19 | 85.1 |
4 | 4.097 | 19 | 85.1 |
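Before fitting, it may help to recall what K-means does: it alternates between assigning every point to its nearest centroid and moving each centroid to the mean of its assigned points. The following is a minimal NumPy sketch of that loop (Lloyd's algorithm) on a handful of (Power, Hour, Temperature) rows taken from the previews in this post. The function name `kmeans_lloyd` and its parameters are hypothetical, and this is not how scikit-learn implements it internally.

```python
import numpy as np


def kmeans_lloyd(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign to nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Toy (Power, Hour, Temperature) rows: three night readings, three evening readings.
toy = np.array([
    [0.136, 4, 70.7], [0.125, 1, 73.2], [0.278, 3, 71.0],
    [4.255, 17, 83.3], [4.162, 19, 85.1], [3.559, 20, 87.0],
])
centers, labels = kmeans_lloyd(toy, k=2)
print(labels)
print(centers.round(2))
```

The cluster numbers themselves carry no meaning: which group ends up as 0 or 1 depends on the random start. scikit-learn's KMeans adds a smarter initialisation (k-means++) and several restarts on top of the same basic loop.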
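from sklearn.cluster import KMeans
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

np.random.seed(1234)
train_data, test_data = train_test_split(data, test_size=0.25, random_state=42)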
train_data.shape
(66668, 3)
test_data.shape
(22223, 3)
# n_jobs was removed from KMeans in scikit-learn 1.0; n_init=10 matches the older default
kmeans = KMeans(n_clusters=4, n_init=10)
kmeans_fit = kmeans.fit(train_data)
predict = kmeans_fit.predict(test_data)
# Work on an explicit copy so that pandas does not emit SettingWithCopyWarning
test_data = test_data.copy()
test_data["Cluster"] = predict
test_data.head(20)
| | Power | Hour | Temperature | Cluster |
---|---|---|---|---|
52595 | 0.114 | 8 | 69.2 | 1 |
86044 | 4.255 | 17 | 83.3 | 0 |
6091 | 3.559 | 20 | 87.0 | 0 |
60185 | 0.453 | 11 | 68.7 | 1 |
37054 | 0.136 | 4 | 70.7 | 2 |
59216 | 0.312 | 10 | 68.6 | 1 |
61848 | 0.453 | 11 | 68.7 | 1 |
278 | 4.162 | 19 | 85.1 | 0 |
30829 | 0.136 | 3 | 71.0 | 2 |
8751 | 0.955 | 21 | 84.2 | 0 |
35134 | 0.276 | 4 | 70.7 | 2 |
31476 | 0.278 | 3 | 71.0 | 2 |
55854 | 0.456 | 10 | 68.6 | 1 |
54992 | 0.370 | 8 | 69.2 | 1 |
78259 | 0.307 | 15 | 80.7 | 3 |
62724 | 0.313 | 11 | 68.7 | 1 |
54132 | 0.260 | 8 | 69.2 | 1 |
44204 | 0.114 | 9 | 67.9 | 1 |
7834 | 1.094 | 21 | 84.2 | 0 |
25231 | 0.125 | 1 | 73.2 | 2 |
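This looks like a reasonable clustering. We go one step further and map the labels to categories, so that the model can serve as a detector. The predicted labels can be assigned to the following categories: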
- 0 - Cooling Systems
- 1 - Oven, Geyser
- 2 - Night Lights
- 3 - Home Security Systems
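A mapping like the one above is easier to justify after looking at what each cluster contains. As a sketch, the per-cluster means of the features can be summarised with a groupby. The rows below are copied from the prediction preview, so the numbers only illustrate the idea and do not describe the full test set.

```python
import pandas as pd

# A few labelled rows copied from the prediction preview.
preview = pd.DataFrame(
    {
        "Power": [0.114, 4.255, 3.559, 0.453, 0.136, 0.136, 0.307],
        "Hour": [8, 17, 20, 11, 4, 3, 15],
        "Temperature": [69.2, 83.3, 87.0, 68.7, 70.7, 71.0, 80.7],
        "Cluster": [1, 0, 0, 1, 2, 2, 3],
    }
)

# Mean feature values and row counts per cluster.
profile = preview.groupby("Cluster").agg(
    mean_power=("Power", "mean"),
    mean_hour=("Hour", "mean"),
    mean_temperature=("Temperature", "mean"),
    rows=("Power", "size"),
)
print(profile.round(2))
```

On a fitted scikit-learn model the same kind of summary is available directly as `kmeans.cluster_centers_`, one row per cluster in the order of the training columns. Because cluster numbering can change between runs, the label mapping should be re-checked against such a profile after every refit.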
Next we merge the labels and the predictions into one data frame.
label_df = pd.DataFrame({
    "Cluster": [0, 1, 2, 3],
    "Appliances": ["Cooling System", "Oven, Geyser", "Night Lights", "Home Security Systems"],
})
label_df
| | Appliances | Cluster |
---|---|---|
0 | Cooling System | 0 |
1 | Oven, Geyser | 1 |
2 | Night Lights | 2 |
3 | Home Security Systems | 3 |
result = test_data.merge(label_df)
result.head()
| | Power | Hour | Temperature | Cluster | Appliances |
---|---|---|---|---|---|
0 | 0.114 | 8 | 69.2 | 1 | Oven, Geyser |
1 | 0.453 | 11 | 68.7 | 1 | Oven, Geyser |
2 | 0.312 | 10 | 68.6 | 1 | Oven, Geyser |
3 | 0.453 | 11 | 68.7 | 1 | Oven, Geyser |
4 | 0.456 | 10 | 68.6 | 1 | Oven, Geyser |
result.tail()
| | Power | Hour | Temperature | Cluster | Appliances |
---|---|---|---|---|---|
22218 | 0.306 | 15 | 80.7 | 3 | Home Security Systems |
22219 | 0.450 | 13 | 76.2 | 3 | Home Security Systems |
22220 | 4.426 | 16 | 80.9 | 3 | Home Security Systems |
22221 | 0.452 | 15 | 80.7 | 3 | Home Security Systems |
22222 | 0.307 | 15 | 80.7 | 3 | Home Security Systems |
The result dataframe shows that the Oven or Geyser is likely to be in use at hours 8, 9 and 10, while security devices are very likely to be in use during office hours (1000 - 1600).
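Statements of this kind can be read off a cross-tabulation of hour against appliance label. The sketch below builds one from the ten rows shown in the head and tail previews. `toy_result` is a hypothetical stand-in for the full `result` frame, so the shares it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for the result frame; rows copied from the head and tail previews.
toy_result = pd.DataFrame(
    {
        "Hour": [8, 11, 10, 11, 10, 15, 13, 16, 15, 15],
        "Appliances": ["Oven, Geyser"] * 5 + ["Home Security Systems"] * 5,
    }
)

# Share of each hour's readings that falls under each appliance label.
hourly_share = pd.crosstab(
    toy_result["Hour"], toy_result["Appliances"], normalize="index"
)
print(hourly_share)
```

Applied to the full result frame, each row of this table gives an empirical estimate of how likely each label is at that hour.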
In this analysis we only used data grouped by hour. With more data we could use more features, for example by grouping by day, week or month.
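As a sketch of what such features could look like, the timestamp can be expanded into calendar fields with the pandas `.dt` accessor. The readings below are hypothetical: the data set in this post covers only about one day, so the extra dates are made up purely to show the grouping.

```python
import pandas as pd

# Hypothetical multi-day readings (timestamps and values invented for illustration).
readings = pd.DataFrame(
    {
        "Time": pd.to_datetime(
            ["2015-05-11 19:59:06", "2015-05-12 07:15:00",
             "2015-05-12 18:30:00", "2015-05-16 18:30:00"]
        ),
        "Power": [4.102, 0.978, 4.780, 2.500],
    }
)

# Calendar features derived from the timestamp.
readings["Date"] = readings["Time"].dt.date
readings["DayOfWeek"] = readings["Time"].dt.dayofweek  # Monday=0 ... Sunday=6
readings["Month"] = readings["Time"].dt.month
readings["DayType"] = readings["DayOfWeek"].map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

daily_mean = readings.groupby("Date")["Power"].mean()
day_type_mean = readings.groupby("DayType")["Power"].mean()
print(daily_mean)
print(day_type_mean)
```

Any of these columns could be added to the feature set or used as a grouping key, provided the data spans enough days for the groups to be meaningful.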
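We should also take season and temperature into account, because appliances are used differently in different seasons. Good features make the model's predictions more accurate.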
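Feature quality also includes feature scale. K-means relies on Euclidean distance, so a feature with a wide numeric range (Temperature spans roughly 68 to 87, Hour 0 to 23) can outweigh one with a narrow range (Power spans roughly 0.1 to 6.5). A common remedy, sketched here as an optional variation and not as the method used in this post, is to standardise the features before clustering. The rows are hourly means copied from the merged table, and the choice of two clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# (Power, Hour, Temperature) rows, rounded from the merged hourly table.
features = np.array([
    [0.17, 2, 72.1], [0.18, 4, 70.7], [0.38, 11, 68.7], [0.38, 13, 76.2],
    [4.37, 16, 80.9], [4.32, 17, 83.3], [4.78, 18, 84.5], [2.27, 23, 83.0],
])

# Standardise each column to zero mean and unit variance, then cluster.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = model.fit_predict(features)
print(labels)

# After scaling, every feature has the same spread and hence the same weight.
scaled = model.named_steps["standardscaler"].transform(features)
print(scaled.std(axis=0).round(3))
```

Whether scaling improves the clusters depends on which feature is meant to matter most, so it is worth comparing the cluster profiles with and without it.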
This could also support a classification task, since the power consumption of certain appliances is already known.