hive进行词频统计

统计文件信息：

$ /opt/cdh-5.3.6/hadoop-2.5.0/bin/hdfs dfs -text /user/hadoop/wordcount/input/wc.input
hadoop spark
spark hadoop
oracle mysql postgresql
postgresql oracle mysql
mysql mongodb
hdfs yarn mapreduce
yarn hdfs
zookeeper

针对于以上文件使用hive做词频统计：

create table docs (line string);

load data inpath '/user/hadoop/wordcount/input/wc.input' into table docs;

create table word_counts as
select word,count(1) as count from
(select explode(split(line,' ')) as word from docs) word
group by word
order by word;

分段解释：

--使用split函数对表中行按空格进行分隔：

select split(line,' ') from docs；
["hadoop","spark",""]
["spark","hadoop"]
["oracle","mysql","postgresql"]
["postgresql","oracle","mysql"]
["mysql","mongodb"]
["hdfs","yarn","mapreduce"]
["yarn","hdfs"]
["zookeeper"]

--使用explode函数对split的结果集进行行拆列：

select explode(split(line,' ')) as word from docs；
word
hadoop
spark

spark
hadoop
oracle
mysql
postgresql
postgresql
oracle
mysql
mysql
mongodb
hdfs
yarn
mapreduce
yarn
hdfs
zookeeper

--以上输出内容已经满足对其做统计分析，这时通过sql对其进行分析：

select word,count(1) as count from
(select explode(split(line,' ')) as word from docs) word
group by word
order by word;

word    count
     1
hadoop    2
hdfs    2
mapreduce    1
mongodb    1
mysql    3
oracle    2
postgresql    2
spark    2
yarn    2
zookeeper    1

hive进行词频统计相关推荐

Hadoop综合大作业补交4次作业：获取全部校园新闻，网络爬虫基础练习，中文词频统计，熟悉常用的Linux操作...
1.用Hive对爬虫大作业产生的文本文件(或者英文词频统计下载的英文长篇小说)进行词频统计. (1)开启所有的服务,并创建文件夹wwc (2)查看目录下所有文件 (3)把hdfs文件系统中文件夹里的文 ...
Hadoop的改进实验（中文分词词频统计及英文词频统计）（1/4）
声明: 1)本文由我bitpeach原创撰写,转载时请注明出处,侵权必究. 2)本小实验工作环境为Windows系统下的百度云(联网),和Ubuntu系统的hadoop1-2-1(自己提前配好).如不 ...
MapRecuce 词频统计案例
文章目录初探MapReduce 一.MapReduce核心思想二.MapReduce编程实例-词频统计思路 1.map阶段(映射) 2.reduce阶段(归并阶段) 三.词频统计编程实现 1.准备 ...
用R语言做词频统计_R语言 | 词频统计
Python网络爬虫与文本数据分析本章内容导入停用词读数据,分词剔除停用词导入停用词表 library(dplyr) ## [1] "?" "." & ...
统计csv词频_中文词频统计
中文词频统计 1. 下载一长篇中文小说. <倚天屠龙记> 2. 从文件读取待分析文本. 3. 安装并使用jieba进行中文分词. pip install jieba import jieb ...
201671010417 金振兴词频统计软件项目报告
1.需求分析按照<构建之法>第2章中2.3所述PSP流程,使用JAVA编程语言,独立完成一个英文文本词频统计的软件开发. .程序可读入任意英文文本文件,该文件中英文词数大于等于1个. . ...
字符串操作、文件操作，英文词频统计预处理
1.字符串操作: 解析身份证号:生日.性别.出生地等凯撒密码编码与解码网址观察与批量生成 (1)解析身份证: 编译结果: (2)凯撒密码编码与解码编译结果: 2.英文词频统计预处理下载一首英文 ...
Python_note6 组合数据类型+jieba库+文本词频统计
集合类型和操作集合元素不可修改,由不可变数据类型组成,元素不可重复 a = {"python",123,("python",123)}使用{}建立集合 b = ...
软工作业3：词频统计
词频统计一.编译环境 (1)IDE:PyCharm 2018 (2)python版本:python3.6.3(Anaconda3-5.1.0 ) 二.程序分析 (1)读文件到缓冲区(process ...

hive进行词频统计

hive进行词频统计相关推荐

最新文章

热门文章