BIOM: the Biological Observation Matrix, a common data format for microbiome data

Introduction

http://biom-format.org/

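BIOM is the most widely used format for storing results in the microbiome field. Its advantage is that several tables (the OTU or feature table, sample metadata, taxonomy annotations) can be stored in a single file, in a uniform and more compact format. It is currently supported by nearly all mainstream software in the field: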

QIIME
MG-RAST
PICRUSt
Mothur
phyloseq
MEGAN
VAMPS
metagenomeSeq
Phinch
RDP Classifier
USEARCH
PhyloToAST
EBI Metagenomics
GCModeller
MetaPhlAn 2

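The BIOM format was first published in 2012 by a team including Rob Knight in GigaScience, a journal co-founded by China's BGI (see Reference); the paper had been cited 242 times at the time of writing.

The Biological Observation Matrix (or BIOM, canonically pronounced "biome") is a core data type for microbiome analysis.

This post covers three topics:

The definition of the BIOM file format;
Using the biom command to convert between file formats, add metadata, summarize tables, and so on;
Working with BIOM files in Python and R.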

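Installing the biom tools

The commonly used tool for working with BIOM files is a Python package, which can be installed with pip, conda, and similar tools.

# Install the scientific computing package that biom depends on
pip install numpy
# Install the biom package
pip install biom-format
# Install support for the BIOM 2.0 format
pip install h5py
# Show the command-line help
biom

Installing the corresponding Python and R packages with conda is the more recommended route.

The names and versions of the corresponding bioconda packages can be looked up at https://bioconda.github.io/recipes.html

# Install the Python package
conda install biom-format # 2.1.7
# Install the R biom package
conda install r-biom
# Or install the R microbiome package, which includes r-biom
conda install bioconductor-microbiome

The main functions are as follows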

Usage: biom [OPTIONS] COMMAND [ARGS]...

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  add-metadata       Add metadata to a BIOM table.
  convert            Convert to/from the BIOM table format.
  from-uc            Create a BIOM table from a vsearch/uclust/usearch BIOM...
  head               Dump the first bit of a table.
  normalize-table    Normalize a BIOM table.
  show-install-info  Provide information about the biom-format installation.
  subset-table       Subset a BIOM table.
  summarize-table    Summarize sample or observation data in a BIOM table.
  table-ids          Dump IDs in a table.
  validate-table     Validate a BIOM-formatted file.

File format

http://biom-format.org/documentation/biom_format.html

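BIOM currently comes in two versions: 1.0 (JSON) and 2.0 (HDF5).

BIOM 1.0 is based on JSON, a format widely supported across programming languages, whose structure resembles a hash of key-value pairs. Depending on how sparse the data are, a different storage structure (dense or sparse) is chosen to save space.

BIOM 2.0 is based on HDF5, a binary format supported by many programming languages that is faster to read and more space-efficient.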

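Tips and FAQ

The goals of BIOM are to store and process large, sparse tables; to keep a study's core information in a single file; and to provide a format that is interchangeable between software packages.

Below are the two layouts commonly used to store an OTU table.

A dense representation of an OTU table:

OTU ID	PC.354	PC.355	PC.356
OTU0	0	0	4
OTU1	6	0	0
OTU2	1	0	7
OTU3	0	0	3

A sparse representation of an OTU table:

PC.354	OTU1	6
PC.354	OTU2	1
PC.356	OTU0	4
PC.356	OTU2	7
PC.356	OTU3	3

OTU tables are often 90% zeros, and sometimes as much as 99%. BIOM 1.0 supports both the sparse and the dense representation; BIOM 2.x supports only the sparse representation.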
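To make the difference between the two layouts concrete, here is a minimal sketch in plain Python that converts the dense example table above into sparse (row, column, value) triples and back. The function names dense_to_sparse and sparse_to_dense are made up for this illustration and are not part of the biom package.

```python
# Dense OTU table from the example above: rows are OTUs, columns are samples.
samples = ["PC.354", "PC.355", "PC.356"]
otus = ["OTU0", "OTU1", "OTU2", "OTU3"]
dense = [
    [0, 0, 4],
    [6, 0, 0],
    [1, 0, 7],
    [0, 0, 3],
]


def dense_to_sparse(matrix):
    """Return (row, column, value) triples for every non-zero cell."""
    return [
        (r, c, value)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value != 0
    ]


def sparse_to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix, filling every unlisted cell with zero."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for r, c, value in triples:
        matrix[r][c] = value
    return matrix


sparse = dense_to_sparse(dense)
# Fraction of non-zero cells, the quantity biom reports as "table density".
density = len(sparse) / (len(otus) * len(samples))

for r, c, value in sparse:
    print(samples[c], otus[r], value)
print(f"density: {density:.3f}")
```

Only 5 of the 12 cells are stored in the sparse form; the saving grows quickly as the share of zeros approaches the 90% to 99% typical of real OTU tables.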

BIOM wraps a study's core data (the OTU table, sample metadata, and OTU taxonomy annotations) in a single file.
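As a rough illustration of how one file can carry all three pieces, the sketch below builds a small BIOM 1.0 style JSON document by hand with Python's json module. The top-level keys follow my reading of the 1.0 specification linked above; the metadata values (genotype, taxonomy strings), the generated_by text, and the date are placeholders invented for this example, and the helper is_consistent is hypothetical. For real work, let the biom tool write the file.

```python
import json

# A hand-written BIOM 1.0 style document (illustrative values only).
biom_v1 = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",  # placeholder
    "date": "2012-01-01T00:00:00",           # placeholder timestamp
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [4, 3],  # 4 observations (rows) x 3 samples (columns)
    # Observation (OTU) metadata lives on the rows.
    "rows": [
        {"id": "OTU0", "metadata": {"taxonomy": ["k__Bacteria", "p__Firmicutes"]}},
        {"id": "OTU1", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria"]}},
        {"id": "OTU2", "metadata": None},
        {"id": "OTU3", "metadata": None},
    ],
    # Sample metadata lives on the columns.
    "columns": [
        {"id": "PC.354", "metadata": {"genotype": "KO"}},
        {"id": "PC.355", "metadata": {"genotype": "KO"}},
        {"id": "PC.356", "metadata": {"genotype": "WT"}},
    ],
    # Sparse data: [row index, column index, value].
    "data": [[0, 2, 4], [1, 0, 6], [2, 0, 1], [2, 2, 7], [3, 2, 3]],
}


def is_consistent(doc):
    """Check that shape, row/column lists, and data indices agree."""
    n_rows, n_cols = doc["shape"]
    if len(doc["rows"]) != n_rows or len(doc["columns"]) != n_cols:
        return False
    return all(0 <= r < n_rows and 0 <= c < n_cols for r, c, _ in doc["data"])


text = json.dumps(biom_v1, indent=2)
restored = json.loads(text)
print(restored["type"], restored["shape"], is_consistent(restored))
```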

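Quick Start

This section of the official documentation covers working with BIOM files interactively in Python. I rarely use it, so the details are in Appendix 1.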

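File format conversion

The convert command converts freely between text tables and the BIOM format. It is useful for:

Converting to a tab-separated table that is easy to view in programs such as Excel;
Converting between sparse and dense BIOM representations (the dense representation exists only in BIOM 1.0).

Tab-separated tables are usually called classic format tables, and tables in BIOM format are called biom tables.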

Converting a classic table to HDF5 or JSON

biom convert -i table.txt -o table.from_txt_json.biom --table-type="OTU table" --to-json
biom convert -i table.txt -o table.from_txt_hdf5.biom --table-type="OTU table" --to-hdf5

Converting biom to the classic format

biom convert -i table.biom -o table.from_biom.txt --to-tsv

Converting biom to the classic format, with the taxonomy annotation as the last column

biom convert -i table.biom -o table.from_biom_w_taxonomy.txt --to-tsv --header-key taxonomy

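Converting biom to the classic format, with the taxonomy annotation as the last column, renamed to ConsensusLineage

This is useful when downstream software requires a specific column name.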

biom convert -i table.biom -o table.from_biom_w_consensuslineage.txt --to-tsv --header-key taxonomy --output-metadata-id "ConsensusLineage"

Round-tripping a table with taxonomy annotations

biom convert -i table.biom -o table_tax.txt --to-tsv --header-key taxonomy
biom convert -i table_tax.txt -o new_table.biom --to-hdf5 --table-type="OTU table" --process-obs-metadata taxonomy
biom convert -i table_tax.txt -o new_table.biom --to-json --table-type="OTU table" --process-obs-metadata taxonomy
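If you need to read such an exported classic table without the biom package, a plain-Python sketch like the following may be enough. It assumes the layout that biom convert --to-tsv typically writes: a comment line, a header line starting with "#OTU ID", tab-separated counts, and the taxonomy as the last column with levels joined by semicolons. The function name read_classic_table and the taxonomy strings in the sample text are invented for this illustration.

```python
# Sample text in the assumed classic-table layout (values are made up).
classic_text = (
    "# Constructed from biom file\n"
    "#OTU ID\tPC.354\tPC.355\tPC.356\ttaxonomy\n"
    "OTU0\t0.0\t0.0\t4.0\tk__Bacteria; p__Firmicutes\n"
    "OTU1\t6.0\t0.0\t0.0\tk__Bacteria; p__Proteobacteria\n"
)


def read_classic_table(lines, metadata_column="taxonomy"):
    """Parse a classic table into sample IDs, counts, and taxonomy."""
    sample_ids = []
    has_metadata = False
    counts = {}
    taxonomy = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "#OTU ID":
            # Header line: decide whether the last column is metadata.
            has_metadata = fields[-1] == metadata_column
            sample_ids = fields[1:-1] if has_metadata else fields[1:]
        elif not fields[0] or fields[0].startswith("#"):
            continue  # blank line or comment line
        else:
            values = fields[1:-1] if has_metadata else fields[1:]
            counts[fields[0]] = [float(v) for v in values]
            if has_metadata:
                # Split the lineage on semicolons and trim the spaces.
                taxonomy[fields[0]] = [t.strip() for t in fields[-1].split(";")]
    return sample_ids, counts, taxonomy


sample_ids, counts, taxonomy = read_classic_table(classic_text.splitlines())
print(sample_ids)
print(counts["OTU1"], taxonomy["OTU1"])
```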

Converting an early QIIME 1.4 table to BIOM (rarely needed)

sed 's/Consensus Lineage/ConsensusLineage/' < otu_table.txt | sed 's/ConsensusLineage/taxonomy/' > otu_table.taxonomy.txt
biom convert -i otu_table.taxonomy.txt -o otu_table.from_txt.biom --table-type="OTU table" --process-obs-metadata taxonomy --to-hdf5

Adding sample groups and taxonomy annotations to a biom file

biom add-metadata -h # show help

Usage: biom add-metadata [OPTIONS]

Add metadata to a BIOM table.

Add sample and/or observation metadata to BIOM-formatted files. See examples here: http://biom-format.org/documentation/adding_metadata.html

Example usage:

Add sample metadata to a BIOM table:

$ biom add-metadata -i otu_table.biom -o table_with_sample_metadata.biom -m sample_metadata.txt

Options:
  -i, --input-fp FILE             The input BIOM table [required]
  -o, --output-fp FILE            The output BIOM table [required]
  -m, --sample-metadata-fp FILE   The sample metadata mapping file (will add sample metadata to the input BIOM table, if provided).
  --observation-metadata-fp FILE  The observation metadata mapping file (will add observation metadata to the input BIOM table, if provided).
  --sc-separated TEXT             Comma-separated list of the metadata fields to split on semicolons. This is useful for hierarchical data such as taxonomy or functional categories.
  --sc-pipe-separated TEXT        Comma-separated list of the metadata fields to split on semicolons and pipes ("|"). This is useful for hierarchical data such as functional categories with one-to-many mappings (e.g. x;y;z|x;y;w)).
  --int-fields TEXT               Comma-separated list of the metadata fields to cast to integers. This is useful for integer data such as "DaysSinceStart".
  --float-fields TEXT             Comma-separated list of the metadata fields to cast to floating point numbers. This is useful for real number data such as "pH".
  --sample-header TEXT            Comma-separated list of the sample metadata field names. This is useful if a header line is not provided with the metadata, if you want to rename the fields, or if you want to include only the first n fields where n is the number of entries provided here.
  --observation-header TEXT       Comma-separated list of the observation metadata field names. This is useful if a header line is not provided with the metadata, if you want to rename the fields, or if you want to include only the first n fields where n is the number of entries provided here.
  --output-as-json                Write the output file in JSON format.
  -h, --help                      Show this message and exit.

Your sample metadata (grouping) file looks like this

head sample.txt

#SampleID	BarcodeSequence	genotype
KO1	TAGCTT	KO
KO2	GGCTAC	KO
KO3	CGCGCG	KO
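Before handing a mapping file to biom add-metadata, it can help to inspect it yourself. The sketch below reads the sample file shown above with Python's csv module and groups samples by genotype. It assumes the file is tab-separated with the sample ID in the first column; read_sample_metadata is a hypothetical helper, not a biom function.

```python
import csv
import io

# The first lines of sample.txt from above, assumed to be tab-separated.
sample_text = (
    "#SampleID\tBarcodeSequence\tgenotype\n"
    "KO1\tTAGCTT\tKO\n"
    "KO2\tGGCTAC\tKO\n"
    "KO3\tCGCGCG\tKO\n"
)


def read_sample_metadata(text):
    """Return {sample ID: {column name: value}} from a mapping file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    id_column = reader.fieldnames[0]  # "#SampleID" in this file
    return {
        row[id_column]: {k: v for k, v in row.items() if k != id_column}
        for row in reader
    }


sample_metadata = read_sample_metadata(sample_text)

# Group sample IDs by genotype to confirm the experimental design.
groups = {}
for sample_id, fields in sample_metadata.items():
    groups.setdefault(fields["genotype"], []).append(sample_id)

print(sample_metadata["KO2"])
print(groups)
```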

Your taxonomy annotation file looks like this

head taxonomy.txt

#OTUID	taxonomy	confidence
OTU_325	k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Cryomorphaceae;g__;s__	0.880
OTU_324	k__Bacteria;p__Chlorobi;c__SJA-28;o__;f__;g__;s__	1.000

Adding sample metadata

biom add-metadata -i table.biom -o table.w_smd.biom --sample-metadata-fp sample.txt

Adding OTU annotations

biom add-metadata -i table.biom -o table.w_omd.biom --observation-metadata-fp taxonomy.txt

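Adding both sample metadata and OTU annotations

biom add-metadata -i table.biom -o table.w_md.biom --observation-metadata-fp taxonomy.txt --sample-metadata-fp sample.txt

This adds row (observation) and column (sample) metadata in one step.

You can also declare how metadata columns should be interpreted: as integers (--int-fields), as floating-point numbers (--float-fields), or as hierarchical taxonomy split on semicolons (--sc-separated).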

biom add-metadata -i table.biom -o table.w_md.biom --observation-metadata-fp taxonomy.txt --sample-metadata-fp sample.txt --sc-separated taxonomy --float-fields confidence
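Conceptually, --sc-separated turns a lineage string into a list of levels and --float-fields turns text into numbers. The following plain-Python sketch mimics that behaviour on the taxonomy.txt example from above. It illustrates the idea only and is not biom's actual implementation; parse_observation_metadata and its parameters are names invented here.

```python
# The taxonomy.txt example from above, assumed to be tab-separated.
taxonomy_text = (
    "#OTUID\ttaxonomy\tconfidence\n"
    "OTU_325\tk__Bacteria;p__Bacteroidetes;c__Flavobacteriia;"
    "o__Flavobacteriales;f__Cryomorphaceae;g__;s__\t0.880\n"
    "OTU_324\tk__Bacteria;p__Chlorobi;c__SJA-28;o__;f__;g__;s__\t1.000\n"
)


def parse_observation_metadata(text, sc_separated=(), float_fields=(), int_fields=()):
    """Parse a metadata file, casting the named columns."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split("\t")
    metadata = {}
    for line in lines[1:]:
        fields = line.split("\t")
        entry = {}
        for name, raw in zip(header[1:], fields[1:]):
            if name in sc_separated:
                # Hierarchical field: one list element per level.
                entry[name] = [part.strip() for part in raw.split(";")]
            elif name in float_fields:
                entry[name] = float(raw)
            elif name in int_fields:
                entry[name] = int(raw)
            else:
                entry[name] = raw  # left as plain text
        metadata[fields[0]] = entry
    return metadata


obs_metadata = parse_observation_metadata(
    taxonomy_text, sc_separated=("taxonomy",), float_fields=("confidence",)
)
print(obs_metadata["OTU_324"])
```

Without the casts, both columns would stay plain strings, so tools that summarize by taxonomic level or filter on confidence would have to re-parse them.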

--observation-header and --sample-header can be used to rename the columns:

biom add-metadata -i min_sparse_otu_table.biom -o table.w_smd.biom --sample-metadata-fp sam_md.txt --sample-header SampleID,BarcodeSequence,DateOfBirth

biom add-metadata -i min_sparse_otu_table.biom -o table.w_omd.biom --observation-metadata-fp obs_md.txt --observation-header OTUID,taxonomy,confidence

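You can also read in only the columns you name; listing n names keeps just the first n fields (here OTUID and taxonomy):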

biom add-metadata -i min_sparse_otu_table.biom -o table.w_omd.biom --observation-metadata-fp obs_md.txt --observation-header OTUID,taxonomy --sc-separated taxonomy

Summarizing a BIOM table

biom summarize-table -h

Summarizing the counts for each sample

biom summarize-table -i table.w_md.biom -o table.w_md_summary.txt

Example output:

Num samples: 27
Num observations: 975
Total count: 409647
Table density (fraction of non-zero values): 0.464

Counts/sample summary:
 Min: 2352.0
 Max: 35955.0
 Median: 14851.000
 Mean: 15172.111
 Std. dev.: 10691.823
 Sample Metadata Categories: BarcodeSequence; genotype
 Observation Metadata Categories: taxonomy; confidence

Counts/sample detail:
 OE4: 2352.0
 OE3: 2353.0
 OE8: 3091.0
 OE2: 3173.0
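To see where these numbers come from, here is a small plain-Python sketch that computes the same kinds of statistics (number of samples and observations, total count, table density, counts per sample) on the four-OTU example table from the file format section. It does not reproduce biom's implementation, the summarize function is a name made up for this example, and the standard deviation is left out.

```python
from statistics import mean, median

# The small dense example table: one row of counts per OTU.
samples = ["PC.354", "PC.355", "PC.356"]
table = {
    "OTU0": [0, 0, 4],
    "OTU1": [6, 0, 0],
    "OTU2": [1, 0, 7],
    "OTU3": [0, 0, 3],
}


def summarize(table, samples):
    """Compute table-level and per-sample count statistics."""
    n_obs = len(table)
    n_samples = len(samples)
    # Column sums: total count observed in each sample.
    per_sample = {
        s: sum(row[i] for row in table.values()) for i, s in enumerate(samples)
    }
    nonzero = sum(1 for row in table.values() for v in row if v != 0)
    return {
        "num_samples": n_samples,
        "num_observations": n_obs,
        "total_count": sum(per_sample.values()),
        "density": nonzero / (n_obs * n_samples),
        "counts_per_sample": per_sample,
    }


summary = summarize(table, samples)
counts = list(summary["counts_per_sample"].values())

print("Num samples:", summary["num_samples"])
print("Num observations:", summary["num_observations"])
print("Total count:", summary["total_count"])
print(f"Table density: {summary['density']:.3f}")
print("Min:", min(counts), "Max:", max(counts))
print("Median:", median(counts), "Mean:", mean(counts))
# List samples from the lowest to the highest count, as in the detail section.
for sample, count in sorted(summary["counts_per_sample"].items(), key=lambda kv: kv[1]):
    print(f" {sample}: {count}")
```

The per-sample detail is the part to look at when choosing a rarefaction depth, since the samples with the lowest counts are listed first.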

Summarizing the number of unique observations per sample, i.e. richness, an alpha-diversity measure

biom summarize-table -i table.w_md.biom --qualitative -o table.w_md_qual_summary.txt

The output looks like this:

Num samples: 27
Num observations: 975

Observations/sample summary:
 Min: 222
 Max: 633
 Median: 486.000
 Mean: 452.704
 Std. dev.: 138.713
 Sample Metadata Categories: BarcodeSequence; genotype
 Observation Metadata Categories: taxonomy; confidence

Observations/sample detail:
 OE3: 222
 OE4: 248
 OE8: 261
 OE1: 272
 OE2: 278
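The qualitative summary counts presence rather than abundance. A minimal plain-Python sketch of that idea, using the small example table from the file format section (observations_per_sample is a hypothetical name, not a biom function):

```python
# The small dense example table: one row of counts per OTU.
samples = ["PC.354", "PC.355", "PC.356"]
table = {
    "OTU0": [0, 0, 4],
    "OTU1": [6, 0, 0],
    "OTU2": [1, 0, 7],
    "OTU3": [0, 0, 3],
}


def observations_per_sample(table, samples):
    """Count how many OTUs have a non-zero count in each sample."""
    return {
        s: sum(1 for row in table.values() if row[i] != 0)
        for i, s in enumerate(samples)
    }


richness = observations_per_sample(table, samples)
# List samples from the lowest to the highest richness.
for sample, n_observed in sorted(richness.items(), key=lambda kv: kv[1]):
    print(f"{sample}: {n_observed}")
```

Note that PC.354 and PC.356 differ more in total count (7 versus 14) than in richness (2 versus 3), which is why the two summaries are reported separately.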

Reference

The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome.

Daniel McDonald, Jose C. Clemente, Justin Kuczynski, Jai Ram Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, and J. Gregory Caporaso.

GigaScience 2012, 1:7. doi:10.1186/2047-217X-1-7

http://biom-format.org/

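Appendix 1. Functions for working with biom interactively in Python

Functions

Once the biom package is installed, BIOM files can be read from an interactive Python session.

The load_table(f) function reads a biom file

Load and display the example data bundled with biom

>>> from biom import example_table
>>> print(example_table)
# Constructed from biom file
#OTU ID	S1	S2	S3
O1	0.0	1.0	2.0
O2	3.0	4.0	5.0

Read a biom table from a file

from biom import load_table
table = load_table('otutab.biom')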

The Table class

Table(data, observation_ids, sample_ids[, …])

import numpy as np
from biom.table import Table

# 10 observations (rows) x 4 samples (columns)
data = np.arange(40).reshape(10, 4)
sample_ids = ['S%d' % i for i in range(4)]
observ_ids = ['O%d' % i for i in range(10)]
sample_metadata = [{'environment': 'A'}, {'environment': 'B'},
                   {'environment': 'A'}, {'environment': 'B'}]
observ_metadata = [{'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Proteobacteria']},
                   {'taxonomy': ['Bacteria', 'Proteobacteria']},
                   {'taxonomy': ['Bacteria', 'Proteobacteria']},
                   {'taxonomy': ['Bacteria', 'Bacteroidetes']},
                   {'taxonomy': ['Bacteria', 'Bacteroidetes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']},
                   {'taxonomy': ['Bacteria', 'Firmicutes']}]
table = Table(data, observ_ids, sample_ids, observ_metadata,
              sample_metadata, table_id='Example Table')

table # summary of the table

print(table) # print the table

print(table.ids()) # sample IDs (ids is a method; its default axis is 'sample')

print(table.ids(axis='observation')) # observation IDs

print(table.nnz) # number of nonzero entries

I prefer the command-line approach; for more code on interactive use in Python, see http://biom-format.org/documentation/table_objects.html
