GFF files: how to extract gene annotation information from a GFF file

Original title: How to extract gene annotation information from a GFF file

GFF3 is one of the most common formats for gene annotation files (https://archive.broadinstitute.org/annotation/argo/help/gff3.html).

In short, a GFF3 file is a tab-delimited text file with nine columns, described as follows:

1. seqname

The name of the sequence. Typically a chromosome or a contig. Argo does not care what you put here. It will superimpose gff features on any sequence you like.

2. source

The program that generated this feature. Argo displays the value of this field in the inspector but does not do anything special with it.

3. feature

The name of this type of feature. The official GFF3 spec states that this should be a term from the SOFA ontology, but Argo does not do anything with this value except display it.

4. start

The starting position of the feature in the sequence. The first base is numbered 1.

5. end

The ending position of the feature (inclusive).

6. score

A score between 0 and 1000. If there is no score value, enter ".".

7. strand

Valid entries include '+', '-', or '.' (for don't know/don't care).

8. frame

If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'. Argo does not do anything with this field except display its value.

9. GFF3: grouping attributes

Attribute keys and values are separated by '=' signs. Values must be URI encoded. Attribute pairs are separated by semicolons. Certain special attributes are used for grouping and identification. This field is the one important difference between GFF flavors (https://archive.broadinstitute.org/annotation/argo/help/gff.html).
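To make the nine-column layout concrete, here is a minimal sketch that writes one toy record and prints each column next to its name. The file name toy.gff and all values in the record are made up for illustration; they do not come from the NCBI file.

```shell
# Toy GFF3 gene record (hypothetical values, tab-separated).
printf 'chr1\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=GENE_A;gene_biotype=protein_coding\n' > toy.gff

# Print column number, column name, and value for every nine-column record.
columns=$(awk 'BEGIN {
                 FS = "\t"
                 split("seqname source feature start end score strand frame attributes", name, " ")
               }
               !/^#/ && NF == 9 {
                 for (i = 1; i <= 9; i++) printf "%d\t%s\t%s\n", i, name[i], $i
               }' toy.gff)
printf '%s\n' "$columns"
```

Setting FS to a tab matters here: column 9 can contain spaces, so the default whitespace splitting would break it apart.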

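In bioinformatics analyses we often need to extract the gene annotation stored in column 9 and append it to a table of differentially expressed genes or genes of interest. Column 9 usually carries many attributes, and not every gene has the same set of them, so in practice only a few important ones are needed, such as Dbxref, gene_biotype, and description.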

This article uses the human GRCh38.p7 annotation file released by NCBI as the example and performs the extraction with awk.

(https://www.gnu.org/software/gawk/manual/gawk.html)

1. Download the annotation file for the species of interest:

(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.33_GRCh38.p7/GCF_000001405.33_GRCh38.p7_genomic.gff.gz)

Then decompress GCF_000001405.33_GRCh38.p7_genomic.gff.gz to obtain GCF_000001405.33_GRCh38.p7_genomic.gff.
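The decompression step can be sketched with gzip as follows. So that the commands can be tried offline, a tiny stand-in archive named toy_genomic.gff.gz is created first; that name and its content are hypothetical, and the real input is the NCBI download above.

```shell
# Stand-in for the downloaded archive (hypothetical content).
printf '##gff-version 3\nchr1\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=GENE_A\n' > toy_genomic.gff
gzip -f toy_genomic.gff        # creates toy_genomic.gff.gz and removes the plain file

# Option 1: decompress to a new file while keeping the archive.
gzip -dc toy_genomic.gff.gz > toy_genomic.gff

# Option 2: skip the intermediate file and stream straight into awk.
n_genes=$(gzip -dc toy_genomic.gff.gz | awk -F '\t' '$3 == "gene" {n++} END {print n + 0}')
echo "gene records: $n_genes"
```

Streaming avoids keeping a second, uncompressed copy of a large annotation file on disk, at the cost of decompressing again on every run.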

2. Check which attributes appear in column 9:

$ awk 'BEGIN{FS=OFS="\t"} $3=="gene"{split($9, a, ";"); for(i in a){split(a[i], b, "="); if(++c[b[1]]==1) print b[1]}}' GCF_000001405.33_GRCh38.p7_genomic.gff

The command prints the following attribute keys:

ID, Dbxref, Name, description, gbkey, gene, gene_biotype, pseudo, gene_synonym, partial, start_range, end_range, exception, Note
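Because awk does not guarantee any order when looping with for (i in a), the keys may come out in a different order on another system. A small variant, sketched here on made-up toy data, counts how many gene records carry each key and sorts the result, which also shows at a glance which attributes are missing from some genes.

```shell
# Toy input: two gene records and one mRNA record (hypothetical values).
printf 'chr1\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=GENE_A;gene_biotype=protein_coding;description=example gene A\n' > toy.gff
printf 'chr1\tExample\tgene\t2000\t2500\t.\t-\t.\tID=gene1;Name=GENE_B;gene_biotype=lncRNA\n' >> toy.gff
printf 'chr1\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0\n' >> toy.gff

# Count, per attribute key, the number of gene records that contain it.
key_counts=$(awk 'BEGIN { FS = OFS = "\t" }
  $3 == "gene" {
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++) {
      if (attr[i] == "") continue      # tolerate a trailing semicolon
      split(attr[i], kv, "=")
      count[kv[1]]++
    }
  }
  END { for (k in count) print k, count[k] }' toy.gff | LC_ALL=C sort)
printf '%s\n' "$key_counts"
```

In the toy data, description is counted once while ID is counted twice, reflecting that the second gene has no description.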

Then use the following command to view column 9 of the gene records in the GFF3 file:

$ awk -F "\t" '$3=="gene"{print $9}' GCF_000001405.33_GRCh38.p7_genomic.gff | cat -n | less

Each numbered line of the output is the raw attribute string of one gene record.

3. Extract the gene annotation with awk (taking Dbxref, gene_biotype, and description as the example):

$ awk 'BEGIN{FS=OFS="\t"} $3=="gene"{print $0}' GCF_000001405.33_GRCh38.p7_genomic.gff |
    sed 's/;/\t/g' |
    awk 'BEGIN{FS=OFS="\t"}
         {for(i=1; i<=NF; i++){split($i, a, "="); b[a[1]]=a[2]}}
         {print b["Name"],b["Dbxref"],b["gene_biotype"],b["description"]}
         {split("", b, ":")}'

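The terminal prints the extracted information, tab-separated, in the order Name, Dbxref, gene_biotype, description.

Note: some genes lack certain attributes. For example, the gene LOC105379212 has no description, so the corresponding column is an empty string.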
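The same extraction can also be written as a single awk program, without the intermediate sed step. The sketch below is one possible variant, not the article's own command: it splits each attribute at the first '=' only, so a value that itself contains '=' is kept whole. The toy records and their GeneID values are made up.

```shell
# Toy input: the second gene has no description (hypothetical values).
printf 'chr1\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0;Dbxref=GeneID:1;Name=GENE_A;gene_biotype=protein_coding;description=example gene A\n' > toy.gff
printf 'chr1\tExample\tgene\t2000\t2500\t.\t-\t.\tID=gene1;Dbxref=GeneID:2;Name=GENE_B;gene_biotype=lncRNA\n' >> toy.gff
printf 'chr1\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0\n' >> toy.gff

extracted=$(awk 'BEGIN { FS = OFS = "\t" }
  $3 == "gene" {
    split("", val)                       # clear values left from the previous record
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++) {
      eq = index(attr[i], "=")           # position of the first "="
      if (eq > 0) val[substr(attr[i], 1, eq - 1)] = substr(attr[i], eq + 1)
    }
    print val["Name"], val["Dbxref"], val["gene_biotype"], val["description"]
  }' toy.gff)
printf '%s\n' "$extracted"
```

As in the pipeline above, a missing attribute yields an empty field, so every output row still has four tab-separated columns.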

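4. The extracted rows can be given a header row and written to a file. Genes that are annotated on more than one chromosome appear more than once in column 1, so the duplicates are dropped as well. Pipe the output of step 3 into the following (the one-line 1i form with \t escapes assumes GNU sed):

$ sed '1i Name\tDbxref\tgene_biotype\tdescription' | awk -F "\t" '++a[$1]==1'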
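Where GNU sed is not available, the header and de-duplication step can be sketched with printf and awk alone. The file names genes_raw.tsv and gene_annotation.tsv are placeholders chosen for this example, and the rows stand in for the output of step 3.

```shell
# Stand-in for the step-3 output: GENE_A appears twice (hypothetical values).
printf 'GENE_A\tGeneID:1\tprotein_coding\texample gene A\n' >  genes_raw.tsv
printf 'GENE_B\tGeneID:2\tlncRNA\t\n'                        >> genes_raw.tsv
printf 'GENE_A\tGeneID:1\tprotein_coding\texample gene A\n' >> genes_raw.tsv

{
  printf 'Name\tDbxref\tgene_biotype\tdescription\n'   # header row
  awk -F '\t' '!seen[$1]++' genes_raw.tsv               # keep the first row for each gene name
} > gene_annotation.tsv

cat gene_annotation.tsv
```

Keeping only the first row per name silently discards the later copies, so this is appropriate only when the repeated rows carry the same annotation.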
