Table of Contents

  • Probabilistic Graphical Model (PGM)
  • Why we need probabilistic graphical models
  • Three major parts of PGM
  • Representation
    • Directed graphical models (Bayesian networks)
    • Undirected graphical models (Markov random fields)
    • Markov Properties of undirected graph
    • Comparison between Bayesian networks and Markov random fields
    • Moral graph
    • Factor Graph
  • Inference
  • Learning

Probabilistic Graphical Model (PGM)

Definition: A probabilistic graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables.

In general, a PGM obeys the following rules:
$$
\begin{aligned}
&\text{Sum Rule: } p(x_1) = \int p(x_1, x_2)\, dx_2 \\
&\text{Product Rule: } p(x_1, x_2) = p(x_1 \mid x_2)\, p(x_2) \\
&\text{Chain Rule: } p(x_1, x_2, \cdots, x_p) = \prod_{i=1}^{p} p(x_i \mid x_{i+1}, x_{i+2}, \ldots, x_p) \\
&\text{Bayesian Rule: } p(x_1 \mid x_2) = \frac{p(x_2 \mid x_1)\, p(x_1)}{p(x_2)}
\end{aligned}
$$

As the dimension of the data increases, the joint distribution given by the chain rule becomes harder to compute, since the number of parameters it requires grows exponentially. In fact, many models simplify this factorization by making conditional independence assumptions.
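To see concretely why the full chain rule becomes unwieldy, here is a minimal sketch (not from the original post) that counts the free parameters needed to specify a joint distribution over $p$ binary variables without any independence assumptions:

```python
# Each chain-rule factor p(x_i | x_{i+1}, ..., x_p) over binary variables is a
# table with one free probability per configuration of its conditioning set,
# i.e. 2^(p - i) free parameters, so the total grows exponentially in p.

def chain_rule_param_count(p: int) -> int:
    """Free parameters of a full chain-rule factorization of p binary variables."""
    return sum(2 ** (p - i) for i in range(1, p + 1))

for p in (2, 5, 10, 20):
    print(p, chain_rule_param_count(p))  # 3, 31, 1023, 1048575 (= 2^p - 1)
```

A graphical model reduces this count by conditioning each variable only on a small set of parents or neighbors instead of on all remaining variables.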

Why we need probabilistic graphical models

Reasons:

  • They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.

  • Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.

  • Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.

Three major parts of PGM

  • Representation: Express a probability distribution that models some real-world phenomenon.

  • Inference: Obtain answers to relevant questions from our models.

  • Learning: Fit a model to real-world data.

We are going to mainly focus on Representation in this post.

Representation

Representation: Express a probability distribution that models some real-world phenomenon.

Directed graphical models (Bayesian networks)

Directed graphical models are also known as Bayesian networks.

Intuition:

In a directed graph, vertices correspond to variables $x_i$ and edges indicate dependency relationships. Once the graphical representation (a directed acyclic graph) is given, we can easily calculate the joint probability. For example, from the figure above, we can calculate the joint probability $p(a,b,c,d,e)$ by

$$p(a,b,c,d,e) = p(a) \cdot p(b \mid a) \cdot p(c \mid b,d) \cdot p(d) \cdot p(e \mid c)$$
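As a minimal sketch (not from the original post; the CPD values below are hypothetical, chosen only to match the factorization above), the joint probability is obtained by multiplying one CPD per node:

```python
# Hypothetical CPDs for binary variables a, b, c, d, e, matching
# p(a,b,c,d,e) = p(a) p(b|a) p(c|b,d) p(d) p(e|c).
p_a = {0: 0.6, 1: 0.4}
p_d = {0: 0.7, 1: 0.3}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key: (b, a)
p_c_given_bd = {(c, b, d): 0.5 for c in (0, 1) for b in (0, 1) for d in (0, 1)}  # key: (c, b, d)
p_e_given_c = {(0, 0): 0.3, (1, 0): 0.7, (0, 1): 0.6, (1, 1): 0.4}  # key: (e, c)

def joint(a, b, c, d, e):
    """Joint probability from the Bayesian network factorization."""
    return (p_a[a] * p_b_given_a[(b, a)] * p_c_given_bd[(c, b, d)]
            * p_d[d] * p_e_given_c[(e, c)])

print(joint(1, 0, 1, 0, 1))  # probability of one full assignment
```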

Formal Definition:

A Bayesian network is a directed graph $G = (V, E)$ together with

  • A random variable $x_i$ for each node $i \in V$.

  • One conditional probability distribution (CPD) $p(x_i \mid x_{A_i})$ per node, specifying the probability of $x_i$ conditioned on its parents' values $x_{A_i}$.

Note:

  • Bayesian networks represent probability distributions that can be formed via products of smaller, local conditional probability distributions (one for each variable). Another way to say it is that each factor in the factorization of $p(a,b,c,d,e)$ is locally normalized (every factor sums to one over its own variable); a short derivation follows this list.

  • Directed models are often used as generative models.
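To make the local-normalization remark concrete, here is a short derivation (using the example factorization above, and assuming each CPD is properly normalized) showing that the product of locally normalized CPDs already sums to one, so no global normalizing constant is needed:

$$
\sum_{a,b,c,d,e} p(a)\, p(b \mid a)\, p(c \mid b,d)\, p(d)\, p(e \mid c)
= \sum_{a} p(a) \sum_{d} p(d) \sum_{b} p(b \mid a) \sum_{c} p(c \mid b,d) \underbrace{\sum_{e} p(e \mid c)}_{=\,1} = 1,
$$

since each inner sum over a CPD equals one, collapsing the expression from the inside out.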

Undirected graphical models (Markov random fields)

Undirected graphical models are also known as Markov random fields (MRFs).

Unlike in the directed case, we cannot say anything about how one variable is generated from another set of variables (as a conditional probability distribution would do).

Intuition:

Suppose we have five students doing a project and we want to evaluate how well they would cooperate. Since five people are too many to be evaluated as a whole, we divide them into small subgroups and evaluate these subgroups separately. In fact, these small subgroups are called cliques, which we will introduce later in this section.

Here, we introduce the concept of a potential function $\phi$ to evaluate how well they would cooperate. You can think of it as a score that measures how well a clique cooperates; higher scores indicate better cooperation. In fact, we require scores to be non-negative, and depending on how we define the potential functions, we get different models. As shown in the figure above, we could write $p(a,b,c,d,e)$ as

$$p(a,b,c,d,e) = \phi_1(a,b,c) \cdot \phi_2(b,d) \cdot \phi_3(d,e)$$

Note that the left-hand side of the equation is a probability, but the right-hand side is a product of potentials/scores. To make the right-hand side a valid probability, we need to introduce a normalization term $1/Z$. Hence it becomes

$$p(a,b,c,d,e) = \frac{1}{Z} \cdot \phi_1(a,b,c) \cdot \phi_2(b,d) \cdot \phi_3(d,e)$$

Here we say $p(a,b,c,d,e)$ is globally normalized. Also, we call $Z$ the partition function, which is

$$Z = \sum_{a,b,c,d,e} \phi_1(a,b,c) \cdot \phi_2(b,d) \cdot \phi_3(d,e) \tag{1}$$

Notice that the summation in $(1)$ is over the exponentially many possible assignments to $a, b, c, d$ and $e$. For this reason, computing $Z$ is intractable in general, but much work exists on how to approximate it.
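A minimal sketch (not from the original post; the potential tables are hypothetical) of computing the partition function by brute-force enumeration over binary variables:

```python
from itertools import product

# Hypothetical non-negative potentials over the cliques (a,b,c), (b,d), (d,e).
def phi1(a, b, c): return 2.0 if a == b == c else 1.0  # rewards agreement
def phi2(b, d):    return 1.5 if b == d else 0.5
def phi3(d, e):    return 1.0 + d * e

# Partition function: sum of the unnormalized product over all 2^5 assignments.
Z = sum(phi1(a, b, c) * phi2(b, d) * phi3(d, e)
        for a, b, c, d, e in product((0, 1), repeat=5))

def p(a, b, c, d, e):
    """Globally normalized MRF probability."""
    return phi1(a, b, c) * phi2(b, d) * phi3(d, e) / Z

print(Z, p(1, 1, 1, 1, 1))
```

With five binary variables the sum has only $2^5 = 32$ terms, but it grows exponentially with the number of variables, which is why exact computation of $Z$ is intractable in general.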

Formal Definition:

  • cliques: fully connected subgraphs.

  • maximal clique: A clique is a maximal clique if it is not contained in any larger clique.
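To make the two definitions above concrete, here is a minimal sketch (assuming the networkx library; the graph is hypothetical, chosen to match the earlier five-node example) that enumerates the maximal cliques of an undirected graph:

```python
import networkx as nx

# Hypothetical undirected graph with maximal cliques {a, b, c}, {b, d}, {d, e}.
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("b", "d"), ("d", "e")])

# find_cliques enumerates the maximal cliques of the graph.
print(sorted(sorted(c) for c in nx.find_cliques(G)))
# [['a', 'b', 'c'], ['b', 'd'], ['d', 'e']]
```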

A Markov random field (MRF) is a probability distribution $p$ over variables $x_1, \ldots, x_n$ defined by an undirected graph $G$ in which nodes correspond to variables $x_i$. The probability $p$ has the form

$$p(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c) \tag{2}$$

where $C$ denotes the set of cliques of $G$, and each factor $\phi_c$ is a non-negative function over the variables in a clique. The partition function

$$Z = \sum_{x_1, \ldots, x_n} \prod_{c \in C} \phi_c(x_c)$$

is a normalizing constant that ensures that the distribution sums to one.

Markov Properties of undirected graph

  • Global Markov Property: $p$ satisfies the global Markov property with respect to a graph $G$ if for any disjoint vertex subsets $A$, $B$, and $C$ such that $C$ separates $A$ and $B$, the random variables $X_A$ are conditionally independent of $X_B$ given $X_C$.
    Here, we say $C$ separates $A$ and $B$ if every path from a node in $A$ to a node in $B$ passes through a node in $C$ (this is the undirected analogue of d-separation). A minimal separation check is sketched after this list.

  • Local Markov Property: $p$ satisfies the local Markov property with respect to $G$ if each variable is conditionally independent of all other variables given its neighbors.

  • Pairwise Markov Property: $p$ satisfies the pairwise Markov property with respect to $G$ if for any pair of non-adjacent nodes $s, t \in V$, we have $X_s \perp X_t \mid X_{V \setminus \{s, t\}}$.
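As referenced above, here is a minimal sketch (assuming networkx; the graph is hypothetical) of checking whether a vertex set $C$ separates $A$ and $B$ in an undirected graph, by removing $C$ and testing connectivity:

```python
import networkx as nx

def separates(G: nx.Graph, A, B, C) -> bool:
    """True if every path from A to B passes through C,
    i.e. removing C disconnects every node in A from every node in B."""
    H = G.copy()
    H.remove_nodes_from(C)
    return not any(nx.has_path(H, a, b) for a in A for b in B)

# Hypothetical chain graph a - b - c - d - e.
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
print(separates(G, {"a", "b"}, {"d", "e"}, {"c"}))  # True: c blocks every path
print(separates(G, {"a"}, {"c"}, {"e"}))            # False: the path a - b - c avoids e
```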

Note:

  • A distribution ppp that satisfies the global Markov property is said to be a Markov random field or Markov network with respect to the graph.

  • Global Markov Property $\Rightarrow$ Local Markov Property $\Rightarrow$ Pairwise Markov Property.

  • A Markov random field reflects conditional independence since it satisfies the Local Markov Property.

  • To see whether a distribution is a Markov random field or Markov network, we have the following theorem:

    Hammersley-Clifford Theorem: Suppose $p$ is a strictly positive distribution, and $G$ is an undirected graph that indexes the domain of $p$. Then $p$ is Markov with respect to $G$ if and only if $p$ factorizes over the cliques of the graph $G$.

Comparison between Bayesian networks and Markov random fields

  • Bayesian networks effectively show causality, whereas MRFs cannot. Thus, MRFs are preferable for problems where there is no clear causality between random variables.

  • It is much easier to generate data from a Bayesian network, which is important in some applications.

  • In Markov random fields, computing the normalization constant $Z$ requires a summation over the exponentially many possible assignments. For this reason, computing $Z$ is intractable in general, but much work exists on how to approximate it.

Moral graph

A moral graph is used to find the equivalent undirected form of a directed acyclic graph.

The moralized counterpart of a directed acyclic graph is formed by

  1. Adding edges between all pairs of non-adjacent nodes that have a common child ("marrying" the parents).

  2. Making all edges in the graph undirected (see the code sketch at the end of this subsection).

Here is an example:

Note that a Bayesian network can always be converted into an undirected network.

Therefore, MRFs have more power than Bayesian networks, but are more difficult to deal with computationally. A general rule of thumb is to use Bayesian networks whenever possible, and only switch to MRFs if there is no natural way to model the problem with a directed graph.
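As referenced in the procedure above, here is a minimal sketch (assuming networkx; the DAG is hypothetical) of moralizing a directed acyclic graph:

```python
import networkx as nx
from itertools import combinations

def moralize(dag: nx.DiGraph) -> nx.Graph:
    """Moral graph: marry all co-parents of each node, then drop edge directions."""
    moral = nx.Graph(dag.edges())                       # undirected copy of the edges
    for node in dag.nodes():
        parents = list(dag.predecessors(node))
        moral.add_edges_from(combinations(parents, 2))  # connect non-adjacent co-parents
    return moral

# Hypothetical DAG: b and d are both parents of c, so moralization adds the edge (b, d).
dag = nx.DiGraph([("a", "b"), ("b", "c"), ("d", "c"), ("c", "e")])
print(sorted(tuple(sorted(e)) for e in moralize(dag).edges()))
```

Recent networkx versions also expose this operation directly as `networkx.algorithms.moral.moral_graph`.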

Factor Graph

A Markov network has an undesirable ambiguity from the factorization perspective. Consider the three-node Markov network in the figure (left). Any distribution that factorizes as

$$p(x_1, x_2, x_3) \propto \phi(x_1, x_2, x_3) \tag{3}$$

for some positive function $\phi$ is Markov with respect to this graph (check the Hammersley-Clifford Theorem mentioned earlier). However, we may wish to use a more restricted parameterization, where

$$p(x_1, x_2, x_3) \propto \phi_1(x_1, x_2)\, \phi_2(x_2, x_3)\, \phi_3(x_1, x_3) \tag{4}$$

The model family in $(4)$ is smaller, and therefore may be more amenable to parameter estimation. But the Markov network formalism cannot distinguish between these two parameterizations. In order to state models more precisely, the factorization in $(2)$ can be represented directly by means of a factor graph.

Definition (factor graph): A factor graph is a bipartite graph $G = (V, F, E)$ in which a variable node $x_i \in V$ is connected to a factor node $\phi_a \in F$ if $x_i$ is an argument to $\phi_a$.

An example of a factor graph is shown on the right side of the figure above. In the figure, the circles are variable nodes, and the shaded boxes are factor nodes. Notice that, unlike the undirected graph, the factor graph depicts the factorization of the model unambiguously.
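A minimal sketch (plain Python, hypothetical data structures; not a library API) of how a factor graph distinguishes the two parameterizations above, by listing factor nodes and their variable scopes explicitly:

```python
# A factor graph as a bipartite structure: variable nodes plus factor nodes,
# where each factor node records the variables it is connected to (its scope).
variables = ["x1", "x2", "x3"]

# Parameterization (3): a single factor node over all three variables.
factors_single = {"phi": ("x1", "x2", "x3")}

# Parameterization (4): three pairwise factor nodes.
factors_pairwise = {"phi1": ("x1", "x2"), "phi2": ("x2", "x3"), "phi3": ("x1", "x3")}

def bipartite_edges(factors):
    """Edges (factor node, variable node) of the factor graph."""
    return [(f, v) for f, scope in factors.items() for v in scope]

print(bipartite_edges(factors_single))    # one factor node connected to all three variables
print(bipartite_edges(factors_pairwise))  # three factor nodes, each connected to two variables
```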

Remark: Directed models can be thought of as a kind of factor graph, in which the individual factors are locally normalized in a special fashion so that globally $Z = 1$.

Inference

Inference: Obtain answers to relevant questions from our models.

  • Marginal inference: what is the probability of a given variable in our model after we sum everything else out?

$$p(y=1) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} p(y=1, x_1, x_2, \ldots, x_n)$$

  • Maximum a posteriori (MAP) inference: what is the most likely assignment to the variables in the model?

$$\max_{x_1, \ldots, x_n} p(y=1, x_1, \ldots, x_n)$$
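Both queries can be answered exactly by brute-force enumeration when the model is small. A minimal sketch (with a hypothetical unnormalized joint table; not from the original post):

```python
from itertools import product

# Hypothetical unnormalized joint over binary y, x1, x2;
# p(y, x1, x2) = table[(y, x1, x2)] / Z.
table = {(y, x1, x2): (1 + y + x1) * (1 + x2)
         for y, x1, x2 in product((0, 1), repeat=3)}
Z = sum(table.values())

# Marginal inference: p(y = 1), summing everything else out.
p_y1 = sum(table[(1, x1, x2)] for x1, x2 in product((0, 1), repeat=2)) / Z

# MAP inference: the assignment of (x1, x2) maximizing p(y = 1, x1, x2).
x_map = max(product((0, 1), repeat=2), key=lambda x: table[(1, *x)])

print(p_y1, x_map)
```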

Learning

Learning: Fit a model to real-world data.

  • Parameter learning: the graph structure is known and we want to estimate the parameters.

    • complete data case:

      • We use Maximum Likelihood Estimation (MLE) to estimate parameters (a counting sketch follows this list).
    • incomplete data case (latent variables):

      • We use the EM algorithm to approximate parameters.

      • Examples: Gaussian Mixture Model (GMM), Hidden Markov Model (HMM).

  • Structure learning: we want to estimate the graph, i.e., determine from data how the variables depend on each other.
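For the complete-data case mentioned above, here is a minimal sketch (hypothetical data and a hypothetical two-node network) of maximum likelihood parameter learning: in a Bayesian network with discrete variables, the MLE of each CPD entry is just a conditional frequency count.

```python
from collections import Counter

# Hypothetical complete data for a two-node network a -> b (binary variables).
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]  # samples of (a, b)

# MLE for p(a): relative frequencies.
count_a = Counter(a for a, _ in data)
p_a = {a: count_a[a] / len(data) for a in (0, 1)}

# MLE for p(b | a): conditional relative frequencies.
count_ab = Counter(data)
p_b_given_a = {(b, a): count_ab[(a, b)] / count_a[a] for a in (0, 1) for b in (0, 1)}

print(p_a)          # {0: 0.5, 1: 0.5}
print(p_b_given_a)  # includes p(b=1 | a=1) = 2/3
```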


Reference:

  • Bishop, Christopher M., “Pattern Recognition and Machine Learning,” Springer, 2006.
  • https://ermongroup.github.io/cs228-notes/
  • https://en.wikipedia.org/wiki/Moral_graph
  • https://space.bilibili.com/97068901
  • https://zhenkewu.com/assets/pdfs/slides/teaching/2016/biostat830/lecture_notes/Lecture4.pdf
  • https://skggm.github.io/skggm/tour
  • https://homepages.inf.ed.ac.uk/csutton/publications/crftutv2.pdf
