Adding IP Geolocation to Hive with a Hive UDF and the GeoIP Library

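Summary: Hive is a data management system built on Hadoop. It serves analysts as an ad hoc analysis tool and acts as the execution engine for ETL and similar jobs, which makes it important for managing, analyzing, and processing big data today. GeoIP is an IP mapping database that is updated regularly and offers APIs in many languages, so it is a good data source for geography-related analysis.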

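A UDF is Hive's interface for user-defined functions; implementing it extends Hive's set of built-in functions. To add an IP mapping function to Hive, we only need to call the GeoIP Java API from inside a UDF.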

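The GeoIP data files can be downloaded from http://www.maxmind.com/download/geoip/database/ (I need both country and city information, so I downloaded http://www.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz ).

The GeoIP APIs for the various languages can be downloaded from http://www.maxmind.com/download/geoip/api/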

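package your.company.udf; // placeholder; must match the class name given to "create temporary function"

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.ql.exec.UDF;

import com.maxmind.geoip.Location;
import com.maxmind.geoip.LookupService;

public class IPToCC extends UDF {
    // Shared lookup service, created lazily on first use.
    private static LookupService cl = null;
    // Compile the patterns once instead of once per row.
    private static final Pattern IP_PATTERN = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+");
    private static final Pattern IP_NUM_PATTERN = Pattern.compile("\\d+");

    static synchronized LookupService getLS() throws IOException {
        // Relative path: the file registered with "add file" is expected
        // to be present in the working directory.
        String dbfile = "GeoLiteCity.dat";
        if (cl == null)
            cl = new LookupService(dbfile, LookupService.GEOIP_MEMORY_CACHE);
        return cl;
    }

    /**
     * @param str an IPv4 address such as "114.43.181.143", or the same
     *            address written as a decimal integer
     * @return country and city separated by a tab, or null if the input is
     *         not recognized or the address is not found
     */
    public String evaluate(String str) {
        if (str == null)
            return null;
        try {
            Location al = null;
            if (IP_PATTERN.matcher(str).matches())
                al = getLS().getLocation(str);
            else if (IP_NUM_PATTERN.matcher(str).matches())
                al = getLS().getLocation(Long.parseLong(str));
            // Unrecognized input, or an address missing from the database.
            if (al == null)
                return null;
            return String.format("%s\t%s", al.countryName, al.city);
        } catch (Exception e) {
            e.printStackTrace();
            closeLS();
            return null;
        }
    }

    // Close the service and reset the reference so the next call reopens it.
    private static synchronized void closeLS() {
        if (cl != null) {
            cl.close();
            cl = null;
        }
    }
}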
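
The evaluate method above also accepts an address written as a plain decimal integer and hands it to getLocation(long). The sketch below shows the usual big-endian mapping between dotted IPv4 text and that integer form. It assumes, and the post does not state, that this is the convention the GeoIP API expects. Ipv4Numbers and its methods are hypothetical names introduced only for illustration; they are not part of GeoIP or Hive.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the GeoIP API. */
final class Ipv4Numbers {
    // Four groups of one to three digits separated by dots.
    private static final Pattern DOTTED = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    private Ipv4Numbers() {}

    /** Converts "a.b.c.d" to a*2^24 + b*2^16 + c*2^8 + d. */
    static long toLong(String dotted) {
        if (dotted == null || !DOTTED.matcher(dotted).matches()) {
            throw new IllegalArgumentException("not a dotted IPv4 address: " + dotted);
        }
        long value = 0;
        for (String part : dotted.split("\\.")) {
            int octet = Integer.parseInt(part);
            if (octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            // Shift the previous octets up by one byte and append this one.
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Converts an integer in the range 0..2^32-1 back to "a.b.c.d". */
    static String toDotted(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("out of IPv4 range: " + value);
        }
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }
}
```

Under this mapping, Ipv4Numbers.toLong("114.43.181.143") gives 1915467151, and Ipv4Numbers.toDotted(1915467151L) gives back "114.43.181.143".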

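Using it is just as simple. Package the program above together with the GeoIP API classes into a JAR, iptocc.jar, and place it along with the data file (GeoLiteCity.dat) somewhere on the server where Hive runs. Then start Hive and run the following statements: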

add file /the/path/to/GeoLiteCity.dat;
add jar /the/path/to/iptocc.jar;
create temporary function ip2cc as 'your.company.udf.IPToCC';

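The function can then be used in the Hive CLI. It takes a string in standard dotted IPv4 notation and returns the country and city; it also transparently accepts an IPv4 address written as a long integer. To load the function automatically every time the Hive CLI starts, create a .hiverc file in the same directory as the hive command, put the three statements above in it, and restart the Hive CLI. If you run Hive Server on this machine and connect over JDBC, the function also works once those three statements have been executed. The one shortcoming is that HUE's Beeswax does not support registering user-defined functions.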
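
For the JDBC case, the three registration statements can be issued from Java before any query that calls ip2cc. The following is a minimal sketch under stated assumptions: Ip2ccSession and its methods are hypothetical names, the caller supplies an already opened java.sql.Connection (the driver class and JDBC URL depend on the Hive version and are not given in the post), and each statement is sent without the trailing semicolon that the CLI uses.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of Hive or its JDBC driver. */
final class Ip2ccSession {
    private Ip2ccSession() {}

    /** Builds the three registration statements from the post, without trailing semicolons. */
    static List<String> setupStatements(String datPath, String jarPath, String udfClass) {
        return Arrays.asList(
                "add file " + datPath,
                "add jar " + jarPath,
                "create temporary function ip2cc as '" + udfClass + "'");
    }

    /** Runs the registration statements, in order, on an already opened connection. */
    static void register(Connection conn, String datPath, String jarPath, String udfClass)
            throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            for (String sql : setupStatements(datPath, jarPath, udfClass)) {
                stmt.execute(sql);
            }
        }
    }
}
```

A temporary function lives only for the session that created it, so queries that use ip2cc should go through the same connection that was passed to register.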

It is not perfect, but a function like this makes later ad hoc geographic analysis more convenient, so it is well worth adding.

Reposted from: https://www.cnblogs.com/xd502djj/p/3253411.html
