1. Overview

Reposted from: [Data Warehouse] Metadata Lineage Analysis

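Most data warehouses are now built on the Hadoop platform. So what approaches are there for analyzing the lineage of the metadata in such a warehouse?

There are essentially two: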

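1. Parse the HQL script yourself, matching each line of text with regular expressions.

2. Parse it with the syntax-analysis class that already ships with Hive in the Hadoop ecosystem.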

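The second approach is the recommended one because it is simpler and more direct. The first is more complex: it has to account for many different statement shapes, and it is easy to miss some of them.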
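To make the trade-off concrete, here is a minimal sketch of the first approach using only java.util.regex. The class name RegexLineageSketch and both patterns are assumptions made for illustration, not code from the original post. It handles a plain insert ... select, but the second query shows how easily pattern matching goes wrong: a table name inside a trailing comment is reported as a source.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating regex-based lineage extraction.
public class RegexLineageSketch {

    // Matches the target of "insert into|overwrite [table] <name>".
    private static final Pattern TARGET = Pattern.compile(
            "(?i)\\binsert\\s+(?:into|overwrite)\\s+(?:table\\s+)?([\\w.]+)");

    // Matches the table name that directly follows "from" or "join".
    private static final Pattern SOURCE = Pattern.compile(
            "(?i)\\b(?:from|join)\\s+([\\w.]+)");

    // Collects group 1 of every match, sorted and de-duplicated.
    private static Set<String> findAll(Pattern pattern, String hql) {
        Set<String> names = new TreeSet<>();
        Matcher m = pattern.matcher(hql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static Set<String> targets(String hql) {
        return findAll(TARGET, hql);
    }

    public static Set<String> sources(String hql) {
        return findAll(SOURCE, hql);
    }

    public static void main(String[] args) {
        String simple = "insert into table aa select * from bb union all select * from cc";
        System.out.println("Input tables = " + sources(simple));   // [bb, cc]
        System.out.println("Output tables = " + targets(simple));  // [aa]

        // Pitfall: the regex cannot tell SQL from a comment, so zz is reported too.
        String tricky = "insert into table aa select * from (select * from bb) t -- from zz";
        System.out.println("Input tables = " + sources(tricky));   // [bb, zz]
    }
}
```

Each new statement shape (comments, CTEs, string literals containing keywords, lateral views) needs another special case, which is the maintenance burden the recommendation above is trying to avoid.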

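The built-in class is org.apache.hadoop.hive.ql.tools.LineageInfo, which ships with Hive.

It parses an HQL statement into a syntax tree and extracts the source and target Hive tables from it, which is exactly what lineage analysis needs.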

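The class has one shortcoming, though: it cannot handle HQL statements of the form create table xx as.

A small change to the code fixes this. The version below adds a case for HiveParser.TOK_CREATETABLE that records the created table as an output table: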

package com.neo.datamanager;

import org.apache.hadoop.hive.ql.lib.*;
import org.apache.hadoop.hive.ql.parse.*;

import java.io.IOException;
import java.util.*;

public class HiveLineageInfo implements NodeProcessor {

    // private static final Logger logger = LoggerFactory.getLogger(HiveLineageInfo.class);

    /**
     * Stores input tables in sql.
     */
    TreeSet inputTableList = new TreeSet();

    /**
     * Stores output tables in sql.
     */
    TreeSet OutputTableList = new TreeSet();

    /**
     * @return java.util.TreeSet
     */
    public TreeSet getInputTableList() {
        return inputTableList;
    }

    /**
     * @return java.util.TreeSet
     */
    public TreeSet getOutputTableList() {
        return OutputTableList;
    }

    /**
     * Implements the process method for the NodeProcessor interface.
     */
    public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
                          Object... nodeOutputs) throws SemanticException {
        ASTNode pt = (ASTNode) nd;
        switch (pt.getToken().getType()) {
            case HiveParser.TOK_CREATETABLE:
                OutputTableList.add(BaseSemanticAnalyzer.getUnescapedName((ASTNode) pt.getChild(0)));
                break;
            case HiveParser.TOK_TAB:
                OutputTableList.add(BaseSemanticAnalyzer.getUnescapedName((ASTNode) pt.getChild(0)));
                break;
            case HiveParser.TOK_TABREF:
                ASTNode tabTree = (ASTNode) pt.getChild(0);
                String table_name = (tabTree.getChildCount() == 1) ?
                        BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabTree.getChild(0)) :
                        BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabTree.getChild(0)) + "." + tabTree.getChild(1);
                inputTableList.add(table_name);
                break;
        }
        return null;
    }

    /**
     * parses given query and gets the lineage info.
     *
     * @param query
     * @throws ParseException
     */
    public void getLineageInfo(String query) throws ParseException, SemanticException {
        /*
         * Get the AST tree
         */
        ParseDriver pd = new ParseDriver();
        ASTNode tree = pd.parse(query);

        while ((tree.getToken() == null) && (tree.getChildCount() > 0)) {
            tree = (ASTNode) tree.getChild(0);
        }

        /*
         * initialize Event Processor and dispatcher.
         */
        inputTableList.clear();
        OutputTableList.clear();

        // create a walker which walks the tree in a DFS manner while maintaining
        // the operator stack. The dispatcher
        // generates the plan from the operator tree
        Map<Rule, NodeProcessor> rules = new LinkedHashMap<Rule, NodeProcessor>();

        // The dispatcher fires the processor corresponding to the closest matching
        // rule and passes the context along
        Dispatcher disp = new DefaultRuleDispatcher(this, rules, null);
        GraphWalker ogw = new DefaultGraphWalker(disp);

        // Create a list of topop nodes
        ArrayList topNodes = new ArrayList();
        topNodes.add(tree);
        ogw.startWalking(topNodes, null);
    }

    public static void main(String[] args) throws IOException, ParseException, SemanticException {
        String query = "insert into table aa  select * from bb union all select * from cc";
        HiveLineageInfo lep = new HiveLineageInfo();
        lep.getLineageInfo(query);
        System.out.println("Input tables = " + lep.getInputTableList());
        System.out.println("Output tables = " + lep.getOutputTableList());
    }
}

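For the sample query in main, the program should report bb and cc as the input tables and aa as the output table.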
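The two sets are only the raw material for lineage. One possible next step, sketched here as an assumption and not as part of the original post, is to turn them into source-to-target edges that can be stored or drawn as a graph. The class name LineageEdgesSketch and the "source -> target" string format are hypothetical; the sketch assumes that every input table of a statement feeds every output table of that statement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: converts per-statement table sets into lineage edges.
public class LineageEdgesSketch {

    /**
     * Returns one "source -> target" edge for every (input, output) pair.
     * Assumes each input table contributes to each output table.
     */
    public static List<String> edges(Set<String> inputs, Set<String> outputs) {
        List<String> result = new ArrayList<>();
        for (String target : outputs) {
            for (String source : inputs) {
                result.add(source + " -> " + target);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Table sets as they would come back for the sample query.
        Set<String> inputs = new TreeSet<>(Arrays.asList("bb", "cc"));
        Set<String> outputs = new TreeSet<>(Arrays.asList("aa"));

        for (String edge : edges(inputs, outputs)) {
            System.out.println(edge);
        }
        // prints:
        // bb -> aa
        // cc -> aa
    }
}
```

Collecting these edges across every script in the warehouse gives a table-level lineage graph. Column-level lineage needs more than the table sets this parser returns.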
