Paper title: Kernel-penalized regression for analysis of microbiome data

Google Scholar citations: 15



Published by: Institute of Mathematical Statistics

Authors: Timothy W. Randolph, Sen Zhao, ..., and Ali Shojaie


The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and percent fat.


1. Introduction

2. Kernel Penalized Regression for Microbiome Data

2.1 Background for PCoA and principal component regression

2.2 Penalized regression and DPCoA

2.3 Kernel-based regression with two kernels

2.4 Regression with compositional data

3. Numerical Experiments

3.1 Regression and DPCoA

3.2 Regression and PCoA with respect to a UniFrac kernel

3.3 Regression and PCoA using an edge-matrix kernel

4. Application to an observational study

5. Discussion


1. Biological Problem: What biological problems have been solved in this paper?

  • The analysis of human microbiome data

2. Main discoveries: What are the main discoveries in this paper?

  • Kernel-based methods are used to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome.
  • The same regression framework can address the compositional nature of multivariate predictors composed of relative abundances, that is, vectors whose entries sum to a constant.
  • An interesting feature of the proposed kernel-penalized regression framework is its ability to sidestep some of the problems inherent in compositional data analysis.
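To make the compositional constraint concrete, here is a minimal sketch of the closure operation (rescaling counts to relative abundances) and the centered log-ratio (CLR) transform, a standard device in compositional data analysis. It is an illustration only, not the paper's procedure: the function names, the toy counts, and the pseudocount of 0.5 used to avoid log(0) are assumptions of this sketch.

```python
import numpy as np

def closure(counts):
    """Scale each row (sample) so its entries sum to 1 (relative abundances)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of each row.

    The pseudocount is an assumption of this sketch; it avoids log(0)
    for taxa that are absent from a sample.
    """
    comp = closure(np.asarray(counts, dtype=float) + pseudocount)
    log_comp = np.log(comp)
    # Subtracting the row mean of the logs centers each sample at zero.
    return log_comp - log_comp.mean(axis=1, keepdims=True)

# Toy count table: 2 samples (rows) by 4 taxa (columns); values are made up.
counts = np.array([[10, 0, 30, 60],
                   [5, 5, 40, 50]])
rel = closure(counts)   # rows sum to 1
z = clr(counts)         # rows sum to 0
```

After closure every row sums to one, so the taxa cannot vary independently; after the CLR transform every row sums to zero, which is the constraint a regression on such predictors has to respect.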

3. ML (Machine Learning) Methods: What are the ML methods applied in this paper?

  • The paper describes a framework of high-dimensional regression models that extends distance-based ordination methods such as PCoA.
  • A primary motivation for PCoA graphical displays is the ability to incorporate biologically-inclined measures of (dis)similarity.
  • Proposed method: kernel-penalized regression
  • We show how phylogenetic and other structure can be incorporated via kernel penalized regression in either the primal (p-dimensional) feature space or the dual (n-dimensional) samples space
  • Previous methods: PCoA and standard (Euclidean-based) statistical models
  • Dataset: We apply our kernel-penalized regression framework to 16S rRNA gene data collected in a study of premenopausal women (Hullar et al., 2015). This study investigated aspects of gut microbial communities in stool samples from premenopausal women using 454 pyrosequencing of the 16S rRNA gene. The abundances of 127 species were zero for more than 90% of the subjects and were removed from our analysis. The data set we consider consists of p = 128 species sampled from n = 102 women.
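The idea of penalizing regression coefficients with a kernel can be sketched as a generalized ridge estimator. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the objective (y - Zb)' H (y - Zb) + lam * b' Q^{-1} b, where Q is a p x p positive-definite similarity matrix among taxa and H is an n x n positive-definite weight matrix among samples. The function name, the toy exponential kernel, and the simulated data are all hypothetical.

```python
import numpy as np

def kernel_penalized_ridge(Z, y, Q, lam, H=None):
    """Generalized ridge estimate (assumed objective, see lead-in):

        minimize (y - Z b)' H (y - Z b) + lam * b' Q^{-1} b

    Z : (n, p) predictor matrix
    Q : (p, p) positive-definite similarity (kernel) among features
    H : (n, n) positive-definite weight among samples; identity if None
    """
    n, p = Z.shape
    H = np.eye(n) if H is None else H
    # Normal equations: (Z'HZ + lam Q^{-1}) b = Z'H y.
    # Left-multiplying by Q avoids forming Q^{-1} explicitly:
    # (Q Z'HZ + lam I) b = Q Z'H y.
    A = Q @ Z.T @ H @ Z + lam * np.eye(p)
    return np.linalg.solve(A, Q @ Z.T @ H @ y)

# Simulated toy data with more features than samples (p > n).
rng = np.random.default_rng(0)
n, p = 20, 50
Z = rng.normal(size=(n, p))

# Toy kernel: neighboring features are treated as similar (hypothetical;
# a real analysis might derive Q from a phylogenetic tree instead).
idx = np.arange(p)
Q = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 5.0)

beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = Z @ beta_true + 0.1 * rng.normal(size=n)

lam = 1.0
beta_hat = kernel_penalized_ridge(Z, y, Q, lam)
```

With Q and H both set to the identity, this reduces to ordinary ridge regression. A non-identity Q shrinks the coefficients of similar features toward each other, which is how extrinsic structure such as phylogeny can enter the estimate.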

4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?

  • Traditional methods: dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample, such as principal coordinate analysis (PCoA).
  • None of these analyses proceeds to estimate the associations of individual taxa with the outcome.
  • In contrast, we focus on estimating the coefficient vector, which is a key aspect of any approach used to draw scientific conclusions based on the association of microbial communities with an outcome or phenotype.
  • Our approach, which differs somewhat from that of Li (2015), may also be viewed as a penalized version of the low-dimensional linear model for compositions by Tolosana-Delgado and van den Boogaart (2011), who use the isometric log-ratio (ILR) coordinates.
  • The framework also addresses well-known problems that arise from applying standard (Euclidean-based) statistical models to compositional data.
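For comparison, the traditional ordination baseline, classical PCoA, can be sketched in a few lines: double-center the squared distance matrix and take the leading eigenvectors. This is a generic textbook version rather than code from the paper; the function name and the simulated points are assumptions, and in practice the distance matrix would come from an ecological measure such as UniFrac rather than from Euclidean distances.

```python
import numpy as np

def pcoa(D, k=2):
    """Classical PCoA (metric MDS) from an (n, n) distance matrix D.

    Returns the (n, k) sample coordinates and the k leading eigenvalues.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gower double-centering
    evals, evecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]   # indices of the k largest
    evals, evecs = evals[order], evecs[:, order]
    # Non-Euclidean distances can give negative eigenvalues; clip them to 0.
    coords = evecs * np.sqrt(np.clip(evals, 0.0, None))
    return coords, evals

# Toy check: six simulated points in the plane, so two axes suffice.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

coords, evals = pcoa(D, k=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

The output is a set of sample coordinates suitable for plotting or clustering. It carries no per-taxon coefficients, which is the gap the regression framework is meant to fill.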

5. Biological Significance: What is the biological significance of these ML methods’ results?

  • In this analysis, we obtain estimates of associations between microbial species and percent fat measured in premenopausal women, and also provide inference for these estimates by applying a recent significance test in our kernel-penalized regression (KPR) framework.

6. Prospect: What are the potential applications of these machine learning methods in biological science?

  • The proposed framework also allows us to use existing inference frameworks for high-dimensional regression, and in particular the Grace test (Zhao and Shojaie, 2016), to assess the significance of estimated regression coefficients.
