Spark入门（八）之WordCount

一、WordCount

计算文本里面的每个单词出现的个数，输出结果。

二、maven设置

<?xml version="1.0" encoding="UTF-8"?><project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.mk</groupId><artifactId>spark-test</artifactId><version>1.0</version><name>spark-test</name><url>http://spark.mk.com</url><properties><project.build.sourceEncoding>UTF-8</project.build.sourceEncoding><maven.compiler.source>1.8</maven.compiler.source><maven.compiler.target>1.8</maven.compiler.target><scala.version>2.11.1</scala.version><spark.version>2.4.4</spark.version><hadoop.version>2.6.0</hadoop.version></properties><dependencies><!-- scala依赖--><dependency><groupId>org.scala-lang</groupId><artifactId>scala-library</artifactId><version>${scala.version}</version></dependency><!-- spark依赖--><dependency><groupId>org.apache.spark</groupId><artifactId>spark-core_2.11</artifactId><version>${spark.version}</version></dependency><dependency><groupId>org.apache.spark</groupId><artifactId>spark-sql_2.11</artifactId><version>${spark.version}</version></dependency><dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>4.11</version><scope>test</scope></dependency></dependencies><build><pluginManagement><plugins><plugin><artifactId>maven-clean-plugin</artifactId><version>3.1.0</version></plugin><plugin><artifactId>maven-resources-plugin</artifactId><version>3.0.2</version></plugin><plugin><artifactId>maven-compiler-plugin</artifactId><version>3.8.0</version></plugin><plugin><artifactId>maven-surefire-plugin</artifactId><version>2.22.1</version></plugin><plugin><artifactId>maven-jar-plugin</artifactId><version>3.0.2</version></plugin></plugins></pluginManagement></build>
</project>

三、编程代码

public class WordCountApp implements SparkConfInfo{public static void main(String[]args){String filePath = "F:\\test\\log.txt";SparkSession sparkSession = new WordCountApp().getSparkConf("WordCount");Map<String, Integer> wordCountMap = sparkSession.sparkContext().textFile(filePath, 4).toJavaRDD().flatMap(v -> Arrays.asList(v.split("[(\\s+)(\r?\n),.。'’]")).iterator()).filter(v -> v.matches("[a-zA-Z-]+")).map(String::toLowerCase).mapToPair(v -> new Tuple2<>(v, 1)).reduceByKey(Integer::sum).collectAsMap();wordCountMap.forEach((k, v) -> System.out.println(k + ":" + v));sparkSession.stop();}
}public interface SparkConfInfo {default SparkSession getSparkConf(String appName){SparkConf sparkConf = new SparkConf();if(System.getProperty("os.name").toLowerCase().contains("win")) {sparkConf.setMaster("local[4]");System.out.println("使用本地模拟是spark");}else{sparkConf.setMaster("spark://hadoop01:7077,hadoop02:7077,hadoop03:7077");sparkConf.set("spark.driver.host","192.168.150.1");//本地ip，必须与spark集群能够相互访问，如：同一个局域网sparkConf.setJars(new String[] {".\\out\\artifacts\\spark_test\\spark-test.jar"});//项目构建生成的路径}SparkSession session = SparkSession.builder().appName(appName).config(sparkConf).config(sparkConf).getOrCreate();return session;}
}

文件内容

Spark Streaming is an extension of the core Spark API that enables scalable,high-throughput, fault-tolerant stream processing of live 。data streams. Data， can be ，ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems,Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), all of which are presented in this guide. You will find tabs throughout this guide that let you choose between code snippets of different languages. databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

输出

created:1
continuous:1
high-level:3
either:1
reduce:1
many:1
writing:1
learning:1
sources:2
is:2
spark:7
can:6
high-throughput:1
filesystems:1
using:1
rdds:1
of:6
input:1
scala:1
you:5
operations:1
kinesis:2
fact:1
or:4
provides:1
pushed:1
how:1
will:1
join:1
databases:1
window:1
be:4
data:5
from:3
s:1
abstraction:1
languages:1
to:2
all:1
and:5
that:2
fault-tolerant:1
core:1
a:4
expressed:1
internally:1
streaming:4
on:2
dashboards:1
java:1
let:1
processed:2
with:2
write:1
by:1
between:1
in:4
live:2
like:2
represented:1
code:1
are:1
stream:3
algorithms:2
sequence:1
streams:3
graph:1
an:1
flume:2
apply:1
kafka:2
sockets:1
the:1
out:1
presented:1
snippets:1
extension:1
scalable:1
guide:3
dstream:2
choose:1
represents:1
dstreams:3
find:1
shows:1
programs:2
such:1
functions:1
called:1
tcp:1
machine:1
api:1
throughout:1
tabs:1
start:1
which:2
this:3
different:1
processing:2
applying:1
enables:1
complex:1
finally:1
introduced:1
python:1
other:1
discretized:1
map:1
as:2

Spark入门（八）之WordCount相关推荐

【Spark分布式内存计算框架——Structured Streaming】3. Structured Streaming —— 入门案例：WordCount
1.3 入门案例:WordCount 入门案例与SparkStreaming的入门案例基本一致:实时从TCP Socket读取数据(采用nc)实时进行词频统计WordCount,并将结果输出到控制台C ...
Spark入门系列（二）| 1小时学会RDD编程
作者 | 梁云1991 转载自Python与算法之美(ID:Python_Ai_Road) 导读:本文为 Spark入门系列的第二篇文章,主要介绍 RDD 编程,实操性较强,感兴趣的同学可以动手实现一 ...
Spark入门实战系列--2.Spark编译与部署（下）--Spark编译安装
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 1.编译Spark Spark可以通过SBT和Maven两种方式进行编译,再通过make-d ...
Spark入门实战系列--7.Spark Streaming（上）--实时流计算Spark Streaming原理介绍
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 1.Spark Streaming简介 1.1 概述 Spark Streaming 是Sp ...
数据分析从零到精通第二课 Hive和Spark入门
03 离线利器:大数据离线处理工具 Hive 的常用技巧今天为你介绍数据分析师最常用的数据处理工具 Hive 的一些使用技巧.这些技巧我们在工作中使用得比较频繁,如果运用得当,将为我们省去不少时间精 ...
spark系列3：spark入门编程与介绍
3. Spark 入门目标通过理解 Spark 小案例, 来理解 Spark 应用理解编写 Spark 程序的两种常见方式 spark-shell spark-submit Spark 官方提供 ...
一起学习Spark入门
操作系统:CentOS-7.8 Spark版本:2.4.4 本篇文章是一个Spark入门文章,在文章中首先会对Spark进行简单概述,帮助大家先认识Spark,然后会介绍Spark安装部署上的基础知识 ...
Spark入门实战系列--6.SparkSQL（中）--深入了解SparkSQL运行计划及调优
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 1.1 运行环境说明 1.1.1 硬软件环境 l 主机操作系统:Windows 64位, ...
Spark入门实战系列--5.Hive（下）--Hive实战
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 1.Hive操作演示 1.1 内部表 1.1.1 创建表并加载数据第一步启动HDFS ...
Spark Streaming实现实时WordCount，DStream的使用，updateStateByKey(func)实现累计计算单词出现频率
一. 实战 1．用Spark Streaming实现实时WordCount 架构图: 说明:在hadoop1:9999下的nc上发送消息,消费端接收消息,然后并进行单词统计计算. * 2.安装并启动生 ...

Spark入门（八）之WordCount

一、WordCount

二、maven设置

三、编程代码

Spark入门（八）之WordCount相关推荐

最新文章

热门文章