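Big Data Computing Technology: Final Project

Course name	Big Data Computing Technology
Lab project name	Final Project

Project Objective

Analyze weather data. Write a web crawler to collect historical weather data from https://www.tianqi.com (click "天气" -> "历史天气", i.e. "Weather" -> "Historical Weather"), store the collected data in HDFS, and then analyze it with MapReduce and Hive.

Overall Workflow

Flowchart: Start → crawl data from the weather site → save the data in MySQL → migrate the data with Sqoop → offline analysis with MapReduce and Hive → data visualization → End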

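Experiment Steps

Crawler Program

Program Flow

Flowchart (nodes in order, each decision has a yes/no branch): Start → mainUI → enter database management? → databaseUI → perform a database operation? → database operation → run the crawler? → crawler operation → End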

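Project Directory

mainUI.py: provides the initial interface
databaseUI.py: provides the database operation interface
databaseOperate.py: provides the concrete database operations
- query_tables(): queries the database and returns the tables in the WeatherData schema
- query_tables_information(table_name): takes table_name (String) and returns the rows of that table
- delete_table_information(table_name): takes table_name (String), deletes the rows of that table, and returns a bool
- insert_information(table_name, data): takes table_name (String) and data (list) and inserts the data into the given table
reptile.py: the crawler itself; crawls the past year of weather data for the cities listed in the city table and stores it in MySQL
lib directory
- parameter.py: provides the global target website
- translateToPinyin.py: provides a Chinese-character-to-pinyin conversion method
output_file: output SQL files, including the table-creation and data-import files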
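
The four functions listed for databaseOperate.py can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's code: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, it adds an explicit conn parameter that the original signatures do not have, and the second sample row is made up.

```python
import re
import sqlite3

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(table_name):
    # Table names cannot be bound as SQL parameters, so validate them instead.
    if not _IDENT.match(table_name):
        raise ValueError(f"invalid table name: {table_name!r}")
    return table_name


def query_tables(conn):
    """Return the names of the user tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def query_tables_information(conn, table_name):
    """Return every row of table_name as a list of tuples."""
    return conn.execute(f"SELECT * FROM {_check(table_name)}").fetchall()


def delete_table_information(conn, table_name):
    """Delete all rows of table_name; return True on success, False on error."""
    try:
        conn.execute(f"DELETE FROM {_check(table_name)}")
        conn.commit()
        return True
    except (sqlite3.Error, ValueError):
        return False


def insert_information(conn, table_name, data):
    """Insert each row in data (a list of tuples) into table_name."""
    if not data:
        return
    placeholders = ", ".join("?" for _ in data[0])
    conn.executemany(
        f"INSERT INTO {_check(table_name)} VALUES ({placeholders})", data
    )
    conn.commit()


# In-memory stand-in for the WeatherData schema (sqlite3 instead of MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (cityID INTEGER, cityName TEXT)")
insert_information(conn, "city", [(1, "北京"), (2, "example-city")])  # 2nd row is made up
print(query_tables(conn))
print(query_tables_information(conn, "city"))
```

With a real MySQL connection the placeholder syntax and the table-listing query would differ, so treat the SQL strings here as specific to the sqlite3 stand-in.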

Crawl Results

city table
airQuality table
weather table

Note: for demonstration, only three input cities were crawled, covering 2021-4-1 to 2022-3-31, for a total of 1095 records
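
The record count in the note can be checked with a few lines of Python, and the same date range gives the list of monthly pages a crawler would need to visit. This is a sketch only: the HISTORY_URL pattern and the pinyin name "beijing" are assumptions for illustration and are not taken from the project, so verify the pattern against the live site before relying on it.

```python
from datetime import date

# Assumed URL pattern for a monthly history page; not taken from the project.
HISTORY_URL = "https://lishi.tianqi.com/{city}/{year}{month:02d}.html"


def months_between(start, end):
    """Return (year, month) pairs from start's month through end's month."""
    pairs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        pairs.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return pairs


def history_urls(city_pinyin, start, end):
    """Build one history-page URL per month for a city given in pinyin."""
    return [
        HISTORY_URL.format(city=city_pinyin, year=y, month=m)
        for y, m in months_between(start, end)
    ]


start, end = date(2021, 4, 1), date(2022, 3, 31)
days = (end - start).days + 1  # inclusive day count for one city
urls = history_urls("beijing", start, end)
print(len(urls), urls[0])
print(days, days * 3)  # days per city, and the total for three cities
```

The range covers 365 days per city, which matches the 1095 records reported for three cities.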

Installing and Using Sqoop

Install ZooKeeper

Change the data directory

dataDir=/usr/local/zookeeper/zkdata

Start ZooKeeper

Check the running status

Install Sqoop

Configure the environment (sqoop-env.sh)

export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export ZOOKEEPER_HOME=/usr/local/zookeeper
export ZOOCFGDIR=/usr/local/zookeeper
export HBASE_HOME=/usr/local/hbase

Add the dependency (MySQL JDBC driver)

sudo cp mysql-connector-java-8.0.29.jar /usr/local/sqoop1.4.6/lib/

Verify Sqoop

Test the database connection

bin/sqoop list-databases --connect jdbc:mysql://127.0.0.1:3306/ --username root --password 123456

Configure the HDFS web UI for browser access

<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50070</value>
</property>

Hadoop pseudo-distributed mode

hadoop/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.admin.user.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME</value>
  </property>
</configuration>

hadoop/etc/hadoop/yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Start the services

sbin/start-all.sh

Import the data

bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/WeatherData \
--username root \
--password 123456 \
--table city \
--target-dir /user/WeatherData/city \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/WeatherData \
--username root \
--password 123456 \
--table weather \
--target-dir /user/WeatherData/weather \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/WeatherData \
--username root \
--password 123456 \
--table airQuality \
--target-dir /user/WeatherData/airQuality \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
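
The three import commands above differ only in the table name, so they can be generated from one template. The sketch below builds the argument lists with the same flags and values as the commands above and only prints them; the names SQOOP, JDBC_URL, TABLES, and import_command are introduced here for illustration.

```python
import shlex

SQOOP = "bin/sqoop"
JDBC_URL = "jdbc:mysql://127.0.0.1:3306/WeatherData"
TABLES = ["city", "weather", "airQuality"]


def import_command(table, username="root", password="123456"):
    """Return the argument list for importing one MySQL table into HDFS."""
    return [
        SQOOP, "import",
        "--connect", JDBC_URL,
        "--username", username,
        "--password", password,
        "--table", table,
        "--target-dir", f"/user/WeatherData/{table}",
        "--delete-target-dir",
        "--num-mappers", "1",
        # Backslash followed by 't', the same two characters the shell passes for "\t"
        "--fields-terminated-by", r"\t",
    ]


for table in TABLES:
    # Print instead of executing; pass the list to subprocess.run(...) to run it.
    print(shlex.join(import_command(table)))
```

Keeping the arguments as a list avoids shell quoting problems if the commands are later executed from Python.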

Download one of the imported files to inspect it

Writing, Deploying, and Running MapReduce

Local Testing

Environment setup

Create a Maven project named MapReduceDemo

Add the following dependencies to pom.xml

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.3</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.30</version>
  </dependency>
</dependencies>

In the project's src/main/resources directory, create a file named log4j.properties with the following content

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

Create the package: com.owem.mapreduce.weatherAnalysis

Writing the program

Write the Bean class for the temperature analysis (fields and method bodies)

package com.owem.mapreduce.weatherAnalysis;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class WeatherAnalysisBean implements Writable {
    // Known weather types; the index of a type is its slot in weather[] and weatherArrays[]
    private static final String[] WEATHER_TYPES = {
            "阴", "晴", "多云", "霾", "小雨", "小雨转阴", "多云转小雨", "小雨到中雨",
            "小雨转多云", "中雨", "阴到小雨", "小雪", "中雪", "扬沙", "大雪", "雨夹雪",
            "阴转小雨", "小雨到大雨", "晴转小雨", "风", "大雨", "中雨到大雨", "阴到中雨", "雾"};

    // Date
    private Date date;
    // Minimum temperature of the day
    private int minTemp;
    // Maximum temperature of the day
    private int maxTemp;
    // Minimum temperature for each month (sums until computeTemp() turns them into averages)
    private float[] minTempArrays = new float[12];
    // Maximum temperature for each month (sums until computeTemp() turns them into averages)
    private float[] maxTempArrays = new float[12];
    // Weather of the day (one flag per weather type)
    private boolean[] weather = new boolean[24];
    // Number of days counted for each weather type
    private int[] weatherArrays = new int[24];

    // No-argument constructor
    public WeatherAnalysisBean() {
    }

    // Getters and setters
    public String getDate() {
        return (date.getYear() + 1900) + "-" + (date.getMonth() + 1) + "-" + date.getDate();
    }

    public void setDate(String s) throws ParseException {
        this.date = new SimpleDateFormat("yyyy-MM-dd").parse(s);
    }

    public int getMinTemp() {
        return minTemp;
    }

    public void setMinTemp(int minTemp) {
        this.minTemp = minTemp;
    }

    public int getMaxTemp() {
        return maxTemp;
    }

    public void setMaxTemp(int maxTemp) {
        this.maxTemp = maxTemp;
    }

    // Returns the weather type of the day, or null if none is set
    public String getWeather() {
        for (int index = 0; index < 24; index++) {
            if (weather[index]) {
                return WEATHER_TYPES[index];
            }
        }
        return null;
    }

    // Clears the flags, then sets the flag that matches s (unknown types set nothing)
    public void setWeather(String s) {
        for (int i = 0; i < 24; i++) {
            weather[i] = false;
        }
        for (int index = 0; index < 24; index++) {
            if (WEATHER_TYPES[index].equals(s)) {
                weather[index] = true;
                break;
            }
        }
    }

    // Adds the weather of the day to the per-type counts
    public void computeWeather() {
        for (int index = 0; index < 24; index++) {
            if (weather[index]) {
                weatherArrays[index]++;
                break;
            }
        }
    }

    // Accumulates the temperatures of the day into the month's totals
    public void setTemp() {
        int month = date.getMonth();
        minTempArrays[month] += minTemp;
        maxTempArrays[month] += maxTemp;
    }

    // Computes the average temperature of each month
    public void computeTemp() {
        int[] days = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        for (int i = 0; i < 12; i++) {
            minTempArrays[i] /= days[i];
            maxTempArrays[i] /= days[i];
        }
    }

    public String getMinTempArrays() {
        String result = "{";
        int i;
        for (i = 0; i < minTempArrays.length - 1; i++) {
            result = result + minTempArrays[i] + ",";
        }
        result = result + minTempArrays[i] + "}";
        return result;
    }

    public String getMaxTempArrays() {
        String result = "{";
        int i;
        for (i = 0; i < maxTempArrays.length - 1; i++) {
            result = result + maxTempArrays[i] + ",";
        }
        result = result + maxTempArrays[i] + "}";
        return result;
    }

    public String getWeatherArrays() {
        String result = "{";
        int i;
        for (i = 0; i < weatherArrays.length - 1; i++) {
            result = result + weatherArrays[i] + ",";
        }
        result = result + weatherArrays[i] + "}";
        return result;
    }

    // Serialization: write method
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(date.getTime());
        out.writeInt(minTemp);
        out.writeInt(maxTemp);
        for (float v : minTempArrays) {
            out.writeFloat(v);
        }
        for (float v : maxTempArrays) {
            out.writeFloat(v);
        }
        for (boolean v : weather) {
            out.writeBoolean(v);
        }
    }

    // Deserialization: read method (fields are read in the order they were written)
    @Override
    public void readFields(DataInput in) throws IOException {
        this.date = new Date(in.readLong());
        this.minTemp = in.readInt();
        this.maxTemp = in.readInt();
        for (int i = 0; i < 12; i++) {
            this.minTempArrays[i] = in.readFloat();
        }
        for (int i = 0; i < 12; i++) {
            this.maxTempArrays[i] = in.readFloat();
        }
        for (int i = 0; i < 24; i++) {
            this.weather[i] = in.readBoolean();
        }
    }

    @Override
    public String toString() {
        return getMinTempArrays() + "\t" + getMaxTempArrays() + "\t" + getWeatherArrays();
    }
}

Write the mapper class

package com.owem.mapreduce.weatherAnalysis;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.text.ParseException;

public class WeatherAnalysisMapper extends Mapper<LongWritable, Text, Text, WeatherAnalysisBean> {
    private Text outK = new Text();
    private WeatherAnalysisBean outV = new WeatherAnalysisBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1 Read one line
        String line = value.toString();

        // 2 Split on tabs
        String[] split = line.split("\t");

        // 3 Pick the needed fields by position: cityId, date, maxTemp, minTemp, weather
        String cityId = split[0];
        String date = split[1];
        String maxTemp = split[2];
        String minTemp = split[3];
        String weather = split[4];

        // 4 Populate the output key and value
        outK.set(cityId);
        try {
            outV.setDate(date);
        } catch (ParseException e) {
            throw new RuntimeException(e);
        }
        outV.setMaxTemp((int) Float.parseFloat(maxTemp));
        outV.setMinTemp((int) Float.parseFloat(minTemp));
        outV.setWeather(weather);

        // 5 Emit
        context.write(outK, outV);
    }
}

Write the reducer class

package com.owem.mapreduce.weatherAnalysis;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.text.ParseException;

public class WeatherAnalysisReducer extends Reducer<Text, WeatherAnalysisBean, Text, WeatherAnalysisBean> {

    @Override
    protected void reduce(Text key, Iterable<WeatherAnalysisBean> values, Context context) throws IOException, InterruptedException {
        // Use a fresh bean for every key so that the sums of one city
        // do not carry over into the next city handled by the same reducer
        WeatherAnalysisBean outV = new WeatherAnalysisBean();

        // 1 Iterate over the values and accumulate
        for (WeatherAnalysisBean value : values) {
            try {
                outV.setDate(value.getDate());
                outV.setMaxTemp(value.getMaxTemp());
                outV.setMinTemp(value.getMinTemp());
                outV.setWeather(value.getWeather());
                outV.computeWeather();
                outV.setTemp();
            } catch (ParseException e) {
                throw new RuntimeException(e);
            }
        }

        // 2 Finalize outV: turn the monthly sums into averages
        outV.computeTemp();

        // 3 Emit
        context.write(key, outV);
    }
}
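
For a quick local cross-check of the reducer's arithmetic, the same per-city, per-month averaging can be sketched in Python. It assumes the column order the mapper reads (cityId, date, maxTemp, minTemp, weather), truncates temperatures to integers as the mapper does, and divides each monthly sum by the fixed number of days in that month as computeTemp() does. The two sample rows are made up, and Python's double precision can differ slightly from Java's float.

```python
from collections import defaultdict
from datetime import datetime

DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def monthly_averages(lines):
    """Return {cityId: (avg_min_by_month, avg_max_by_month)} from TSV lines."""
    sums = defaultdict(lambda: ([0.0] * 12, [0.0] * 12))
    for line in lines:
        city_id, day, max_temp, min_temp, _weather = line.rstrip("\n").split("\t")
        month = datetime.strptime(day, "%Y-%m-%d").month - 1
        min_sums, max_sums = sums[city_id]
        # Truncate to int first, mirroring (int) Float.parseFloat(...) in the mapper
        min_sums[month] += int(float(min_temp))
        max_sums[month] += int(float(max_temp))
    return {
        city: (
            [s / d for s, d in zip(min_sums, DAYS_IN_MONTH)],
            [s / d for s, d in zip(max_sums, DAYS_IN_MONTH)],
        )
        for city, (min_sums, max_sums) in sums.items()
    }


# Made-up sample rows in the assumed column order
sample = [
    "1\t2021-04-01\t20.0\t10.0\t晴",
    "1\t2021-04-02\t22.0\t12.0\t多云",
]
mins, maxs = monthly_averages(sample)["1"]
print(mins[3], maxs[3])  # April averages for city 1
```

Because the divisor is the calendar length of the month rather than the number of rows seen, a month with missing days yields an average that is too low; the sample above shows this, since two days of data are divided by 30.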

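Write the driver class

package com.owem.mapreduce.weatherAnalysis;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WeatherAnalysisDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1 Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar
        job.setJarByClass(WeatherAnalysisDriver.class);

        // 3 Associate the mapper and reducer
        job.setMapperClass(WeatherAnalysisMapper.class);
        job.setReducerClass(WeatherAnalysisReducer.class);

        // 4 Set the mapper output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(WeatherAnalysisBean.class);

        // 5 Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(WeatherAnalysisBean.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("/Users/owem/Desktop/Profession/Java/MapReduceDemo/input/WeatherData/weather"));
        FileOutputFormat.setOutputPath(job, new Path("/Users/owem/Desktop/Profession/Java/MapReduceDemo/output/WeatherData"));

        // 7 Submit the job and exit with a status that reflects success or failure
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}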

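Local run

Run WeatherAnalysisDriver

The output files are generated

View the result: part-r-00000

Note: each output line has the form: cityId [monthly average minimum temperatures] [monthly average maximum temperatures] [weather type counts]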
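
A line of part-r-00000 can be read back into lists for later plotting. The sketch below assumes the layout produced by the bean's toString(): four tab-separated fields, where the last three are comma-separated values wrapped in braces. The function names are introduced here for illustration, and the sample line is made up and shortened to two values per field.

```python
def parse_braced(text, cast):
    """Parse a string such as '{1.0,2.5}' into a list using cast."""
    inner = text.strip()[1:-1]
    return [cast(item) for item in inner.split(",")] if inner else []


def parse_result_line(line):
    """Split one reducer output line into its four tab-separated fields."""
    city_id, min_avg, max_avg, weather_counts = line.rstrip("\n").split("\t")
    return {
        "cityId": city_id,
        "avg_min": parse_braced(min_avg, float),
        "avg_max": parse_braced(max_avg, float),
        "weather_counts": parse_braced(weather_counts, int),
    }


# Made-up line in the same shape, shortened to 2 values per field for readability
row = parse_result_line("1\t{-3.5,0.25}\t{4.0,8.5}\t{12,30}")
print(row)
```

In the real output the two temperature fields hold 12 values each (one per month) and the weather field holds 24 counts (one per weather type).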

To build the jar with Maven, add the following packaging plugins

<build>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.6.1</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

Deployment Testing

Change the input and output paths in the driver so that they come from the command-line arguments

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

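Build the jar with Maven

Two jars are generated: the larger one bundles the dependencies and the smaller one does not (the smaller one is normally used)
Upload the jar to Linux and make sure the file to be analyzed is on HDFS
1. Rename the jar to WDA.jar and upload it to the hadoop directory
2. Delete _SUCCESS from the weather folder on HDFS

Run the jar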

hadoop jar WDA.jar com.owem.mapreduce.weatherAnalysis.WeatherAnalysisDriver /user/WeatherData/weather /user/WeatherData/weatherOutput

Importing the Data into Hive

Start Hive

With the Hadoop cluster already running, start Hive:

Create the database and load the data

Create the database

Switch to the WeatherData database and create the tables

Table definitions:

city table

create table if not exists city
(
    cityID   int,
    cityName varchar(20)
);

airQuality table

create table if not exists airQuality
(
    CityID        int,
    `date`        date,
    quality_level varchar(20),
    AQI           int,
    PM2_5         int,
    PM10          int,
    So2           int,
    No2           int,
    Co            float,
    O3            int
);

weather table

create table if not exists weather
(
    cityID  int,
    weather varchar(20),
    `date`  date,
    maxTemp int,
    minTemp int
);

List the tables:

Import the data from MySQL into Hive

bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/WeatherData \
--username root \
--password 123456 \
--table city -m 1 \
--hive-import \
--hive-table WeatherData.city

bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/WeatherData \
--username root \
--password 123456 \
--table airQuality -m 1 \
--hive-import \
--hive-table WeatherData.airQuality

bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/WeatherData \
--username root \
--password 123456 \
--table weather -m 1 \
--hive-import \
--hive-table WeatherData.weather

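Missing-Value Handling

For example, check that the date column of the airQuality table contains no nulls:

select count(1) from airQuality where `date` is null;

The other columns can be checked the same way; the imported data is complete, with no missing values.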
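
The same kind of check can be run over an exported tab-separated file before it reaches Hive. This is a sketch under assumptions: the set of strings treated as missing (empty, null, NULL, \N) is a guess at how nulls might be spelled in the export, only the first four airQuality columns are used, and both sample rows are made up.

```python
NULL_MARKERS = {"", "null", "NULL", r"\N"}  # assumed spellings of a missing value


def count_missing(lines, columns):
    """Count missing values per column in tab-separated rows."""
    counts = {name: 0 for name in columns}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for name, value in zip(columns, fields):
            if value.strip() in NULL_MARKERS:
                counts[name] += 1
    return counts


columns = ["CityID", "date", "quality_level", "AQI"]
rows = [
    "1\t2021-04-01\t良\t56",
    "1\tnull\t优\t35",  # made-up row with a missing date
]
print(count_missing(rows, columns))
```

A count of zero for every column corresponds to the Hive query returning 0 for each column checked.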

Analyzing the Data in Hive

Count the sunny days in each month

select cityName, month(`date`), count(1)
from city, weather
where city.cityID = weather.cityID and weather.weather = '晴'
group by cityName, month(`date`);

Preliminary air-quality analysis

Air-quality distribution by month:

select month(`date`), quality_level, count(1)
from airQuality, city
where city.cityID = airQuality.cityID
group by month(`date`), quality_level
order by count(1) desc ;

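The results show that air quality is generally good in months 6, 7, 8, 9, and 10, and tends to be worse in months 10, 11, 12, 1, 2, and 3.

This pattern is even more pronounced when the query is restricted to Beijing: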

select month(`date`), quality_level, count(1)
from airQuality, city
where city.cityID = airQuality.cityID and city.cityID = 1
group by month(`date`), quality_level
order by count(1) desc ;

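The results also show a large difference between northern and southern cities in October, November, and December.

Air-quality distribution by month for each city: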

select cityName, month(`date`), quality_level, count(1)
from airQuality, city
where city.cityID = airQuality.cityID
group by cityName, month(`date`), quality_level;

Count the haze days

select cityName, count(1)
from city, weather
where city.cityID = weather.cityID and weather.weather like '%霾%'
group by cityName;

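Only Beijing shows haze, on three days.

Look up the conditions on those days

select cityName, weather.`date`, weather.weather, AQI, PM2_5, maxTemp, minTemp
from city, weather, airQuality
where city.cityID = weather.cityID and weather.cityID = airQuality.cityID and weather.weather like '%霾%' and weather.`date` = airQuality.`date`;

On days with haze the air quality is not necessarily poor. A likely explanation is that the haze formed only in the early morning while temperatures were low and cleared quickly as it warmed up, so the AQI for the whole day is not necessarily high.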

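Data Visualization

Export the data to an Excel file for analysis

Line chart of air quality from 2021-4-1 to 2022-3-31

Temperature chart from 2021-4-1 to 2022-3-31

Temperature chart for 2021-4 to 2022-3, drawn from the MapReduce results

Weather-type share chart for 2021-4 to 2022-3, drawn from the MapReduce results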
