Implementing the K-means algorithm on Hadoop
A while back, getting from a working Hadoop configuration to a running K-means MapReduce job kept me stuck for several days; yesterday I finally sorted out the configuration and runtime problems I had hit. The K-means algorithm itself looks simple, but it is still a bit of a challenge the first time you touch MapReduce. Fortunately I have mostly figured it out. The K-means code was downloaded from the web; below I walk through how it works.
KMeans.java
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeans {

    public static void main(String[] args) throws Exception {
        CenterInitial centerInitial = new CenterInitial();
        centerInitial.run(args);                   // initialize the cluster centers
        int times = 0;
        double s = 0, shold = 0.1;                 // shold is the convergence threshold
        do {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:9000");
            Job job = new Job(conf, "KMeans");     // create the KMeans MapReduce job
            job.setJarByClass(KMeans.class);       // class that drives the job
            job.setOutputKeyClass(Text.class);     // output key type: Text
            job.setOutputValueClass(Text.class);   // output value type: Text
            job.setMapperClass(KMapper.class);     // Mapper class
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setReducerClass(KReducer.class);   // Reducer class
            FileSystem fs = FileSystem.get(conf);
            fs.delete(new Path(args[2]), true);    // args[2] is the output directory; remove it if it already exists
            // the job's input and output paths both come from the program arguments
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            // run the job and check whether it finished successfully
            if (job.waitForCompletion(true)) {     // this MapReduce pass is done
                // compare the new centers with the previous ones: if they moved less than
                // the threshold we stop, otherwise they become the centers for the next pass
                NewCenter newCenter = new NewCenter();
                s = newCenter.run(args);
                times++;
            }
        } while (s > shold);                       // stop once the movement falls below the threshold
        System.out.println("Iterator: " + times);  // number of iterations
    }
}
```
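The driver references three helper classes that are not listed in the original post: CenterInitial (picks the initial centers), KMapper/KReducer (the two MapReduce stages), and NewCenter (measures how far the centers moved). Since the originals are not shown, here is a minimal sketch of what CenterInitial might look like, assuming it simply picks k points from the input file and writes them to the temporary centers file; the file names (2.txt, 3.txt) come from the description below, while the random selection and exact paths are assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of CenterInitial: pick k points from the input file and
// write them to the temporary centers file so the first map pass has centers to read.
public class CenterInitial {
    private static final int K = 2; // number of clusters in this example

    public void run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Read all points, e.g. "(1,1) (9,9) (2,3) ...", from the input file (path assumed).
        List<String> points = new ArrayList<String>();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(args[0] + "/2.txt"))));
        String line;
        while ((line = in.readLine()) != null) {
            for (String p : line.trim().split("\\s+")) {
                if (!p.isEmpty()) points.add(p);
            }
        }
        in.close();

        // Pick K points as initial centers (random here; the original may choose differently).
        Collections.shuffle(points);
        StringBuilder centers = new StringBuilder();
        for (int i = 0; i < K && i < points.size(); i++) {
            centers.append(points.get(i)).append(" ");
        }

        // Overwrite the temporary centers file (3.txt) under args[1].
        FSDataOutputStream out = fs.create(new Path(args[1] + "/3.txt"), true);
        out.writeBytes(centers.toString().trim());
        out.close();
        System.out.println("Initialization: " + centers.toString().trim());
    }
}
```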
Question: what exactly is args[]? This puzzled me for several days before I got the answer: args[] is simply the set of arguments passed to the program at startup. In Eclipse they are configured under Run Configurations, like this:
hdfs://localhost:9000/home/administrator/hadoop/kmeans/input hdfs://localhost:9000/home/administrator/hadoop/kmeans hdfs://localhost:9000/home/administrator/hadoop/kmeans/output
What each piece of code does is noted in the comments above. Here args[0] is the input directory, args[2] is the output directory, and args[1] (the kmeans directory itself) presumably holds the temporary centers file.
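If the program were packaged into a jar and launched from the shell instead of from Eclipse, the same three arguments would follow the class name on the command line (the jar name below is only illustrative):

```
hadoop jar kmeans.jar KMeans \
    hdfs://localhost:9000/home/administrator/hadoop/kmeans/input \
    hdfs://localhost:9000/home/administrator/hadoop/kmeans \
    hdfs://localhost:9000/home/administrator/hadoop/kmeans/output
```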
Input data, stored in 2.txt: (1,1) (9,9) (2,3) (10,30) (4,4) (34,40) (5,6) (15,20)
3.txt holds the temporary cluster centers
part-r-00000 holds the reduce output
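The Mapper and Reducer classes are also not listed in the original post. Judging from the debug lines in the log below ("0list/0c", "Center ... assigned point ...", "Mapper output: ..."), KMapper loads the current centers and, for each point, emits the nearest center as the key and the point as the value. A minimal sketch along those lines, with the centers-file path hard-coded purely as an assumption for this sketch (the original may obtain it differently):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch of KMapper: assign each point to its nearest current center.
public class KMapper extends Mapper<LongWritable, Text, Text, Text> {

    private List<String> centerStrings = new ArrayList<String>();
    private List<double[]> centers = new ArrayList<double[]>();

    // Parse "(x,y)" into {x, y}.
    private static double[] parse(String s) {
        String[] xy = s.replace("(", "").replace(")", "").split(",");
        return new double[] { Double.parseDouble(xy[0]), Double.parseDouble(xy[1]) };
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the current centers, e.g. "(10,30) (2,3)", from the temporary centers file.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path centerFile = new Path("hdfs://localhost:9000/home/administrator/hadoop/kmeans/3.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(centerFile)));
        String line;
        while ((line = in.readLine()) != null) {
            for (String c : line.trim().split("\\s+")) {
                if (c.isEmpty()) continue;
                centerStrings.add(c);
                centers.add(parse(c));
            }
        }
        in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds several points: "(1,1) (9,9) (2,3) ...".
        for (String p : value.toString().trim().split("\\s+")) {
            if (p.isEmpty()) continue;
            double[] point = parse(p);
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            // Nearest center by squared Euclidean distance.
            for (int i = 0; i < centers.size(); i++) {
                double[] c = centers.get(i);
                double d = (point[0] - c[0]) * (point[0] - c[0])
                         + (point[1] - c[1]) * (point[1] - c[1]);
                if (d < bestDist) { bestDist = d; best = i; }
            }
            // Corresponds to the "Center ... assigned point ..." / "Mapper output" lines.
            context.write(new Text(centerStrings.get(best)), new Text(p));
        }
    }
}
```

Likewise, the "count", "outVal", "ave0i/ave1i" and "Written to part" lines suggest KReducer averages the points grouped under each center, for example 24.0/7 = 3.4285715 and 27.0/7 = 3.857143 for the key (2,3), and writes the member points followed by the new center. A sketch consistent with that output (illustrative, not the original class):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of KReducer: average the points assigned to each center
// and append the new center to the output value.
public class KReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sumX = 0, sumY = 0;
        int count = 0;
        StringBuilder outVal = new StringBuilder();

        // Accumulate coordinate sums over all points grouped under this center.
        for (Text val : values) {
            String[] xy = val.toString().replace("(", "").replace(")", "").split(",");
            sumX += Double.parseDouble(xy[0]);
            sumY += Double.parseDouble(xy[1]);
            count++;
            outVal.append(val.toString()).append(" ");
        }

        // New center = component-wise mean, e.g. 24.0/7 = 3.4285715.
        float newX = (float) (sumX / count);
        float newY = (float) (sumY / count);
        String newCenter = "(" + newX + "," + newY + ")";

        // Output line: old center, member points ..., new center.
        context.write(key, new Text(outVal.toString() + newCenter));
    }
}
```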
The program's MapReduce passes and their results (the center-update step that produces the "Distance"/"New centers" lines is sketched after the log):
- Initialization: (10,30) (2,3)
- 13/01/26 08:58:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
- 13/01/26 08:58:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
- 13/01/26 08:58:38 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
- 13/01/26 08:58:38 INFO input.FileInputFormat: Total input paths to process : 2
- 13/01/26 08:58:38 WARN snappy.LoadSnappy: Snappy native library not loaded
- 13/01/26 08:58:38 INFO mapred.JobClient: Running job: job_local_0001
- 13/01/26 08:58:39 INFO util.ProcessTree: setsid exited with exit code 0
- 13/01/26 08:58:39 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@15718f2
- 13/01/26 08:58:39 INFO mapred.MapTask: io.sort.mb = 100
- 13/01/26 08:58:39 INFO mapred.MapTask: data buffer = 79691776/99614720
- 13/01/26 08:58:39 INFO mapred.MapTask: record buffer = 262144/327680
- 0list:1
- 0c:10
- 1list:1
- 1c:30
- Center (2,3) assigned point (1,1)
- Mapper output: (2,3) (1,1)
- 0list:9
- 0c:10
- 1list:9
- 1c:30
- Center (2,3) assigned point (9,9)
- Mapper output: (2,3) (9,9)
- 0list:2
- 0c:10
- 1list:3
- 1c:30
- Center (2,3) assigned point (2,3)
- Mapper output: (2,3) (2,3)
- 0list:10
- 0c:10
- 1list:30
- 1c:30
- Center (10,30) assigned point (10,30)
- Mapper output: (10,30) (10,30)
- 0list:4
- 0c:10
- 1list:4
- 1c:30
- Center (2,3) assigned point (4,4)
- Mapper output: (2,3) (4,4)
- 0list:34
- 0c:10
- 1list:40
- 1c:30
- Center (10,30) assigned point (34,40)
- Mapper output: (10,30) (34,40)
- 0list:5
- 0c:10
- 1list:6
- 1c:30
- Center (2,3) assigned point (5,6)
- Mapper output: (2,3) (5,6)
- 0list:15
- 0c:10
- 1list:20
- 1c:30
- Center (10,30) assigned point (15,20)
- Mapper output: (10,30) (15,20)
- 13/01/26 08:58:39 INFO mapred.MapTask: Starting flush of map output
- 13/01/26 08:58:39 INFO mapred.MapTask: Finished spill 0
- 13/01/26 08:58:39 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
- 13/01/26 08:58:39 INFO mapred.JobClient: map 0% reduce 0%
- 13/01/26 08:58:42 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:42 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
- 13/01/26 08:58:42 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@77eaf8
- 13/01/26 08:58:42 INFO mapred.MapTask: io.sort.mb = 100
- 13/01/26 08:58:42 INFO mapred.MapTask: data buffer = 79691776/99614720
- 13/01/26 08:58:42 INFO mapred.MapTask: record buffer = 262144/327680
- 0list:2
- 0c:10
- 1list:3
- 1c:30
- Center (2,3) assigned point (2,3)
- Mapper output: (2,3) (2,3)
- 0list:10
- 0c:10
- 1list:30
- 1c:30
- Center (10,30) assigned point (10,30)
- Mapper output: (10,30) (10,30)
- 0list:34
- 0c:10
- 1list:40
- 1c:30
- Center (10,30) assigned point (34,40)
- Mapper output: (10,30) (34,40)
- 0list:1
- 0c:10
- 1list:1
- 1c:30
- Center (2,3) assigned point (1,1)
- Mapper output: (2,3) (1,1)
- 13/01/26 08:58:42 INFO mapred.MapTask: Starting flush of map output
- 13/01/26 08:58:42 INFO mapred.MapTask: Finished spill 0
- 13/01/26 08:58:42 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
- 13/01/26 08:58:42 INFO mapred.JobClient: map 100% reduce 0%
- 13/01/26 08:58:45 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:45 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
- 13/01/26 08:58:45 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@18d7ace
- 13/01/26 08:58:45 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:45 INFO mapred.Merger: Merging 2 sorted segments
- 13/01/26 08:58:45 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 192 bytes
- 13/01/26 08:58:45 INFO mapred.LocalJobRunner:
- Reduce pass, first time
- (10,30)Reduce
- val:(10,30)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(34,40)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(10,30)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(34,40)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(15,20)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- count:5
- outVal:(10,30) (34,40) (10,30) (34,40) (15,20) /outVal
- ave0i103.0
- ave1i160.0
- Written to part: (10,30) (10,30) (34,40) (10,30) (34,40) (15,20) (20.6,32.0)
- Reduce pass, first time
- (2,3)Reduce
- val:(1,1)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(9,9)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(2,3)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(4,4)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(5,6)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(2,3)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- val:(1,1)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@141fab6
- temlength:2
- count:7
- outVal:(1,1) (9,9) (2,3) (4,4) (5,6) (2,3) (1,1) /outVal
- ave0i24.0
- ave1i27.0
- Written to part: (2,3) (1,1) (9,9) (2,3) (4,4) (5,6) (2,3) (1,1) (3.4285715,3.857143)
- 13/01/26 08:58:45 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
- 13/01/26 08:58:45 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:45 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
- 13/01/26 08:58:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/home/administrator/hadoop/kmeans/output
- 13/01/26 08:58:48 INFO mapred.LocalJobRunner: reduce > reduce
- 13/01/26 08:58:48 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
- 13/01/26 08:58:48 INFO mapred.JobClient: map 100% reduce 100%
- 13/01/26 08:58:48 INFO mapred.JobClient: Job complete: job_local_0001
- 13/01/26 08:58:48 INFO mapred.JobClient: Counters: 22
- 13/01/26 08:58:48 INFO mapred.JobClient: File Output Format Counters
- 13/01/26 08:58:48 INFO mapred.JobClient: Bytes Written=129
- 13/01/26 08:58:48 INFO mapred.JobClient: FileSystemCounters
- 13/01/26 08:58:48 INFO mapred.JobClient: FILE_BYTES_READ=1818
- 13/01/26 08:58:48 INFO mapred.JobClient: HDFS_BYTES_READ=450
- 13/01/26 08:58:48 INFO mapred.JobClient: FILE_BYTES_WRITTEN=122901
- 13/01/26 08:58:48 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=171
- 13/01/26 08:58:48 INFO mapred.JobClient: File Input Format Counters
- 13/01/26 08:58:48 INFO mapred.JobClient: Bytes Read=82
- 13/01/26 08:58:48 INFO mapred.JobClient: Map-Reduce Framework
- 13/01/26 08:58:48 INFO mapred.JobClient: Map output materialized bytes=200
- 13/01/26 08:58:48 INFO mapred.JobClient: Map input records=2
- 13/01/26 08:58:48 INFO mapred.JobClient: Reduce shuffle bytes=0
- 13/01/26 08:58:48 INFO mapred.JobClient: Spilled Records=24
- 13/01/26 08:58:48 INFO mapred.JobClient: Map output bytes=164
- 13/01/26 08:58:48 INFO mapred.JobClient: Total committed heap usage (bytes)=498860032
- 13/01/26 08:58:48 INFO mapred.JobClient: CPU time spent (ms)=0
- 13/01/26 08:58:48 INFO mapred.JobClient: SPLIT_RAW_BYTES=262
- 13/01/26 08:58:48 INFO mapred.JobClient: Combine input records=0
- 13/01/26 08:58:48 INFO mapred.JobClient: Reduce input records=12
- 13/01/26 08:58:48 INFO mapred.JobClient: Reduce input groups=2
- 13/01/26 08:58:48 INFO mapred.JobClient: Combine output records=0
- 13/01/26 08:58:48 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
- 13/01/26 08:58:48 INFO mapred.JobClient: Reduce output records=2
- 13/01/26 08:58:48 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
- 13/01/26 08:58:48 INFO mapred.JobClient: Map output records=12
- Result of the previous MapReduce pass, line 1: (10,30) (10,30) (34,40) (10,30) (34,40) (15,20) (20.6,32.0)
- Line 2: (2,3) (1,1) (9,9) (2,3) (4,4) (5,6) (2,3) (1,1) (3.4285715,3.857143)
- Distance for center 0: 116.36001
- Distance for center 1: 2.7755103
- New centers: (20.6,32.0) (3.4285715,3.857143)
- 13/01/26 08:58:49 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
- 13/01/26 08:58:49 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
- 13/01/26 08:58:49 INFO input.FileInputFormat: Total input paths to process : 2
- 13/01/26 08:58:49 INFO mapred.JobClient: Running job: job_local_0002
- 13/01/26 08:58:49 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@18aab40
- 13/01/26 08:58:49 INFO mapred.MapTask: io.sort.mb = 100
- 13/01/26 08:58:49 INFO mapred.MapTask: data buffer = 79691776/99614720
- 13/01/26 08:58:49 INFO mapred.MapTask: record buffer = 262144/327680
- 0list:1
- 0c:20.6
- 1list:1
- 1c:32.0
- Center (3.4285715,3.857143) assigned point (1,1)
- Mapper output: (3.4285715,3.857143) (1,1)
- 0list:9
- 0c:20.6
- 1list:9
- 1c:32.0
- Center (3.4285715,3.857143) assigned point (9,9)
- Mapper output: (3.4285715,3.857143) (9,9)
- 0list:2
- 0c:20.6
- 1list:3
- 1c:32.0
- Center (3.4285715,3.857143) assigned point (2,3)
- Mapper output: (3.4285715,3.857143) (2,3)
- 0list:10
- 0c:20.6
- 1list:30
- 1c:32.0
- Center (20.6,32.0) assigned point (10,30)
- Mapper output: (20.6,32.0) (10,30)
- 0list:4
- 0c:20.6
- 1list:4
- 1c:32.0
- Center (3.4285715,3.857143) assigned point (4,4)
- Mapper output: (3.4285715,3.857143) (4,4)
- 0list:34
- 0c:20.6
- 1list:40
- 1c:32.0
- Center (20.6,32.0) assigned point (34,40)
- Mapper output: (20.6,32.0) (34,40)
- 0list:5
- 0c:20.6
- 1list:6
- 1c:32.0
- Center (3.4285715,3.857143) assigned point (5,6)
- Mapper output: (3.4285715,3.857143) (5,6)
- 0list:15
- 0c:20.6
- 1list:20
- 1c:32.0
- Center (20.6,32.0) assigned point (15,20)
- Mapper output: (20.6,32.0) (15,20)
- 13/01/26 08:58:49 INFO mapred.MapTask: Starting flush of map output
- 13/01/26 08:58:49 INFO mapred.MapTask: Finished spill 0
- 13/01/26 08:58:49 INFO mapred.Task: Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
- 13/01/26 08:58:50 INFO mapred.JobClient: map 0% reduce 0%
- 13/01/26 08:58:52 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:52 INFO mapred.Task: Task 'attempt_local_0002_m_000000_0' done.
- 13/01/26 08:58:52 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@147358f
- 13/01/26 08:58:52 INFO mapred.MapTask: io.sort.mb = 100
- 13/01/26 08:58:52 INFO mapred.MapTask: data buffer = 79691776/99614720
- 13/01/26 08:58:52 INFO mapred.MapTask: record buffer = 262144/327680
- 0list:2
- 0c:20.6
- 1list:3
- 1c:32.0
- Center (3.4285715,3.857143) assigned point (2,3)
- Mapper output: (3.4285715,3.857143) (2,3)
- 0list:10
- 0c:20.6
- 1list:30
- 1c:32.0
- Center (20.6,32.0) assigned point (10,30)
- Mapper output: (20.6,32.0) (10,30)
- 0list:34
- 0c:20.6
- 1list:40
- 1c:32.0
- Center (20.6,32.0) assigned point (34,40)
- Mapper output: (20.6,32.0) (34,40)
- 0list:1
- 0c:20.6
- 1list:1
- 1c:32.0
- Center (3.4285715,3.857143) assigned point (1,1)
- Mapper output: (3.4285715,3.857143) (1,1)
- 13/01/26 08:58:52 INFO mapred.MapTask: Starting flush of map output
- 13/01/26 08:58:52 INFO mapred.MapTask: Finished spill 0
- 13/01/26 08:58:52 INFO mapred.Task: Task:attempt_local_0002_m_000001_0 is done. And is in the process of commiting
- 13/01/26 08:58:53 INFO mapred.JobClient: map 100% reduce 0%
- 13/01/26 08:58:55 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:55 INFO mapred.Task: Task 'attempt_local_0002_m_000001_0' done.
- 13/01/26 08:58:55 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2798e7
- 13/01/26 08:58:55 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:55 INFO mapred.Merger: Merging 2 sorted segments
- 13/01/26 08:58:55 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 317 bytes
- 13/01/26 08:58:55 INFO mapred.LocalJobRunner:
- Reduce pass, first time
- (20.6,32.0)Reduce
- val:(10,30)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(34,40)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(10,30)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(34,40)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(15,20)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- count:5
- outVal:(10,30) (34,40) (10,30) (34,40) (15,20) /outVal
- ave0i103.0
- ave1i160.0
- Written to part: (20.6,32.0) (10,30) (34,40) (10,30) (34,40) (15,20) (20.6,32.0)
- Reduce pass, first time
- (3.4285715,3.857143)Reduce
- val:(1,1)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(9,9)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(2,3)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(4,4)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(5,6)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(2,3)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- val:(1,1)
- values:org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@13043d2
- temlength:2
- count:7
- outVal:(1,1) (9,9) (2,3) (4,4) (5,6) (2,3) (1,1) /outVal
- ave0i24.0
- ave1i27.0
- Written to part: (3.4285715,3.857143) (1,1) (9,9) (2,3) (4,4) (5,6) (2,3) (1,1) (3.4285715,3.857143)
- 13/01/26 08:58:55 INFO mapred.Task: Task:attempt_local_0002_r_000000_0 is done. And is in the process of commiting
- 13/01/26 08:58:55 INFO mapred.LocalJobRunner:
- 13/01/26 08:58:55 INFO mapred.Task: Task attempt_local_0002_r_000000_0 is allowed to commit now
- 13/01/26 08:58:55 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0002_r_000000_0' to hdfs://localhost:9000/home/administrator/hadoop/kmeans/output
- 13/01/26 08:58:58 INFO mapred.LocalJobRunner: reduce > reduce
- 13/01/26 08:58:58 INFO mapred.Task: Task 'attempt_local_0002_r_000000_0' done.
- 13/01/26 08:58:59 INFO mapred.JobClient: map 100% reduce 100%
- 13/01/26 08:58:59 INFO mapred.JobClient: Job complete: job_local_0002
- 13/01/26 08:58:59 INFO mapred.JobClient: Counters: 22
- 13/01/26 08:58:59 INFO mapred.JobClient: File Output Format Counters
- 13/01/26 08:58:59 INFO mapred.JobClient: Bytes Written=148
- 13/01/26 08:58:59 INFO mapred.JobClient: FileSystemCounters
- 13/01/26 08:58:59 INFO mapred.JobClient: FILE_BYTES_READ=4442
- 13/01/26 08:58:59 INFO mapred.JobClient: HDFS_BYTES_READ=1262
- 13/01/26 08:58:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=246235
- 13/01/26 08:58:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=676
- 13/01/26 08:58:59 INFO mapred.JobClient: File Input Format Counters
- 13/01/26 08:58:59 INFO mapred.JobClient: Bytes Read=82
- 13/01/26 08:58:59 INFO mapred.JobClient: Map-Reduce Framework
- 13/01/26 08:58:59 INFO mapred.JobClient: Map output materialized bytes=325
- 13/01/26 08:58:59 INFO mapred.JobClient: Map input records=2
- 13/01/26 08:58:59 INFO mapred.JobClient: Reduce shuffle bytes=0
- 13/01/26 08:58:59 INFO mapred.JobClient: Spilled Records=24
- 13/01/26 08:58:59 INFO mapred.JobClient: Map output bytes=289
- 13/01/26 08:58:59 INFO mapred.JobClient: Total committed heap usage (bytes)=667418624
- 13/01/26 08:58:59 INFO mapred.JobClient: CPU time spent (ms)=0
- 13/01/26 08:58:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=262
- 13/01/26 08:58:59 INFO mapred.JobClient: Combine input records=0
- 13/01/26 08:58:59 INFO mapred.JobClient: Reduce input records=12
- 13/01/26 08:58:59 INFO mapred.JobClient: Reduce input groups=2
- 13/01/26 08:58:59 INFO mapred.JobClient: Combine output records=0
- 13/01/26 08:58:59 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
- 13/01/26 08:58:59 INFO mapred.JobClient: Reduce output records=2
- 13/01/26 08:58:59 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
- 13/01/26 08:58:59 INFO mapred.JobClient: Map output records=12
- Result of the previous MapReduce pass, line 1: (20.6,32.0) (10,30) (34,40) (10,30) (34,40) (15,20) (20.6,32.0)
- Line 2: (3.4285715,3.857143) (1,1) (9,9) (2,3) (4,4) (5,6) (2,3) (1,1) (3.4285715,3.857143)
- Distance for center 0: 0.0
- Distance for center 1: 0.0
- New centers: (20.6,32.0) (3.4285715,3.857143)
- Iterator: 2
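The "Distance for center 0/1" and "New centers" lines between the two jobs come from NewCenter, which the driver calls after each pass. The reported numbers look like squared Euclidean distances between the old and new centers: (20.6 - 10)^2 + (32.0 - 30)^2 = 112.36 + 4 = 116.36, and (3.4285715 - 2)^2 + (3.857143 - 3)^2 ≈ 2.78; in the second pass both distances are 0.0, so s falls below shold = 0.1 and the loop stops after 2 iterations. A sketch of what NewCenter.run might do, assuming it reads part-r-00000 under args[2], sums the squared center movements, and rewrites the centers file under args[1] (the exact file handling in the original may differ):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of NewCenter: compare the previous centers with the ones
// just computed by the reducer and report how far they moved.
public class NewCenter {

    private static double[] parse(String s) {
        String[] xy = s.replace("(", "").replace(")", "").split(",");
        return new double[] { Double.parseDouble(xy[0]), Double.parseDouble(xy[1]) };
    }

    public double run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Each reduce output line is: old center, member points ..., new center.
        // Keep the first token (old center) and the last token (new center).
        List<double[]> oldCenters = new ArrayList<double[]>();
        List<String> newCenters = new ArrayList<String>();
        Path part = new Path(args[2] + "/part-r-00000");
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(part)));
        String line;
        while ((line = in.readLine()) != null) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 2) continue;
            oldCenters.add(parse(tokens[0]));
            newCenters.add(tokens[tokens.length - 1]);
        }
        in.close();

        // Sum the squared distance each center has moved,
        // e.g. (20.6 - 10)^2 + (32.0 - 30)^2 = 116.36 for the first center.
        double s = 0;
        StringBuilder centers = new StringBuilder();
        for (int i = 0; i < oldCenters.size(); i++) {
            double[] oldC = oldCenters.get(i);
            double[] newC = parse(newCenters.get(i));
            s += (newC[0] - oldC[0]) * (newC[0] - oldC[0])
               + (newC[1] - oldC[1]) * (newC[1] - oldC[1]);
            centers.append(newCenters.get(i)).append(" ");
        }

        // Overwrite the temporary centers file for the next iteration.
        FSDataOutputStream out = fs.create(new Path(args[1] + "/3.txt"), true);
        out.writeBytes(centers.toString().trim());
        out.close();
        System.out.println("New centers: " + centers.toString().trim());
        return s;
    }
}
```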
from: http://blog.csdn.net/lskyne/article/details/8543885
http://blog.csdn.net/lskyne/article/details/8543923