HTSeq用作测序数据和分析软件之间的“粘合剂”,主要有下述功能:通过碱基质量了解数据

统计基因组覆盖度

GFF file文件读取注释信息

将RNA-Seq实验中的比对结果分配给外显子和基因。

--------------------------

安装:pip install HTSeq

服用方法

1. 读取fastq,获取read信息

>>> import HTSeq

>>> import itertools

>>> fastq_file = HTSeq.FastqReader( "test.fastq.gz")

>>> for read in itertools.islice( fastq_file, 5 ):

... print(read)

...

TCGCTAGCAAGACTTTTTCTTTTTCTAGGCACAGTAGGTATTAGATTAAATATGGTAATCACTCACTTCACTTCTGGAAGCAACAGCCCGGTGCAGTGGAGTCAGCCACATGTTGTCCTTGGCATTTACACGAGCTCCTGGAATCAAACA

CCCAGAGGCGGCAAGCTTCCTGTTCCATGGAGCCGGTACTCAGTCAAGGTCAGTATCTTCTGATCCTCCTTCTGCGGCTGCTGGGAGGTGGCAGGCCGGACAGCAGAAGGACAGGAGTGGCTCCATGGGGGCAGGCTTTCTGTGGCTGTG

GTTTGTGCCAGATGACGATTTGAATTAATAAATTCATTTGGTATAAACCGCGATTATTTTTGCATCAACTTACTACAGTCTATGGTCTTTTGTGTCTGGCATCGTTCACTTAGCGTGTTTTCATGGTTCATCCAGATTGTAACATGTTTC

TTCAGCCATTTCCCACTTCGCTTTAAGAGCATGCCCTGTTTAATGGGGATGGCTCTGCCGCTCCCGATGGTGTCAGCATGATTCTCCGGGGCTTTCCTCTCTTTGTCTGGGTCACTCCCTTTCTCAGATGTAAACAGGTTGGACCAGCGC

GTGATTTTCTCCCTTGAGGAGTTGAAGGAGATTGAGAAGGACTGTGCCGTCTACGTTGGGCGCATGGAGAGGGTGGCCCGCCACAGCTCTGTCAGCAAGGAGGAGAAGGTGCGTGGCACCACAGGTCTCGGGGGCTTTCTTCGCCCCTGC

# read 变量仍然保留第十个read的信息

>>> read

>>> read.name

'A00312:90:HKVF5DSXX:3:1101:3441:1000 1:N:0:GCCTATCA'

>>> read.seq

b'GTGATTTTCTCCCTTGAGGAGTTGAAGGAGATTGAGAAGGACTGTGCCGTCTACGTTGGGCGCATGGAGAGGGTGGCCCGCCACAGCTCTGTCAGCAAGGAGGAGAAGGTGCGTGGCACCACAGGTCTCGGGGGCTTTCTTCGCCCCTGC'

>>> read.qual

array([37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37],

dtype=uint8)

遍历fastq文件,统计所有read的质量信息

# 创建长度序列读长的0向量

>>> import numpy

>>> len(read)

150

>>> qualsum = numpy.zeros(len(read),numpy.int)

>>> qualsum

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# 遍历fastq对象,所有位置碱基质量相加

>>> nreads = 0

>>> for read in fastq_file:

... qualsum += read.qual

... nreads += 1

>>>

>>> qualsum / float(nreads)

array([36.56013134, 36.63208692, 36.6780693 , 36.69489843, 36.71856081,

36.72337391, 36.70846614, 36.71600597, 36.71639324, 36.73297184,

... ...

36.20018642, 36.19426629, 36.19388805, 36.20587639, 36.17329932,

36.15208402, 36.13248803, 36.13966627, 36.12272692, 36.12639415])

# 画图展示

>>>from matplotlib import pyplot

>>> pyplot.plot( qualsum / nreads )

>>> pyplot.show()

fastq更多质控,在命令行使用 htseq-qa 实现。htseq-qa还可以用sam文件生成比对到和未比对到参考基因组的质控信息(看sam分析模块)。htseq-qa参数的意义看这Quality Assessment with

$python -m HTSeq.scripts.qa -t fastq -o test.pdf -r 150 test.fastq.gz

200000 reads processed

400000 reads processed

... ...

49200000 reads processed

49318779 reads processed

$

图好丑==

抽取特定区域reads重新写入新的bam文件

>>> bam_reader = HTSeq.BAM_Reader( "deduped.bam" )

>>> for a in itertools.islice( bam_reader, 5 ): # printing first 5 reads

... print(a)

...

## 此处报错没测通

>>> bam_writer = HTSeq.BAM_Writer.from_BAM_Reader( "region.bam", bam_reader ) #set-up BAM_Writer with same header as reader

>>> for a in bam_reader.fetch( region = "1:249000000-249200000" ): #fetching reads in a region

... print("Writing Alignment", a, "to file", bam_writer.filename)

... bam_writer.write( a )

Writing Alignment to file region.bam

Writing Alignment to file region.bam

...

Writing Alignment to file region.bam

>>> bam_writer.close()

# 查看sam中的a对象

>>> a

>>> a.read

>>> a.read.name

'A00250:179:HN33WDSXX:3:1659:5077:20838'

>>> a.read.qualstr

b'FF,F,FFFFFFFFFFFFFFFFFFFFFF,:FFFFFF:FFFFFFFF:F:FFF:FFFFF,FF:F:,F:,FF:FF,FF,:::::,,F::F,F:,,F,,,,,F:,,,,F,,:,,F:F,F,F,,,,,,,,,:FF:,,,,F,,F,,F,,,,F::,:F'

>>> a.read.qual

array([37, 37, 11, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 25, 37, 37, 37, 37, 37,

... ...

37, 11, 11, 37, 11, 11, 11, 11, 37, 25, 25, 11, 25, 37],

dtype=uint8)

>>> a.iv

>>> a.iv.chrom

'chr1'

>>> a.iv.start

10018

>>> a.iv.end

10146

>>> a.iv.strand

'-'

处理CNV时用到的GenomicArrays 和 GenomicInterval

1) genomicInterval用来存储染色体及位置信息等

>>> iv = HTSeq.GenomicInterval( "chr3", 123203, 127245, "+" )

>>> print(iv)

chr3:[123203,127245)/+

>>> iv

>>> iv.length

4042

# 看两个iv 之间的包含于被包含关系

>>> iv2 = HTSeq.GenomicInterval( "chr3", 123100, 125045, "+" )

>>> iv2.overlaps(iv)

True

>>> iv2.contains(iv)

False

>>> iv2.is_contained_in(iv)

False

>>>

2) GenomicArray可以根据每条染色体信息,并用iv处理染色体特定区域信息

# 构建ga

>>> chromlens = { 'chr1': 3000, 'chr2': 2000, 'chr3': 1000 }

>>> chromlens

{'chr1': 3000, 'chr2': 2000, 'chr3': 1000}

>>> ga = HTSeq.GenomicArray( chromlens, stranded=False, typecode="i" )

>>> ga

# 构建一个iv

>>> iv = HTSeq.GenomicInterval( "chr1", 100, 120, "." )

# 设iv区域copy number 为5

>>> ga[iv] = 5

# 构建另一个iv

>>> iv = HTSeq.GenomicInterval( "chr1", 110, 135, "." )

# 此iv区域 copy number 增加3

>>> ga[iv] += 3

>>> iv = HTSeq.GenomicInterval( "chr1", 90, 140, "." )

# 查看特定位置拷贝数

>>> list(ga[iv])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0]

储存每个碱基的copy number信息浪费资源,GenomicArray只储存copy number 一致的连续区域信息:

>>> for iv2,value in ga[iv].steps():

... print(iv2,value)

...

chr1:[90,100)/. 0

chr1:[100,110)/. 5

chr1:[110,120)/. 8

chr1:[120,135)/. 3

chr1:[135,140)/. 0

计算覆盖度实例

# 创建空的ga

>>> cvg = HTSeq.GenomicArray( "auto", stranded=True, typecode="i" )

>>> alignment_file = HTSeq.SAM_Reader("deduped.sam")

>>> for alngt in alignment_file:

... if alngt.aligned:

... cvg[alngt.iv] +=1

...

>>> pyplot.plot(list( cvg[ HTSeq.GenomicInterval( "chr1", 12000, 13000, "+" ) ] ) )

[]

>>> pyplot.show()

GenomicArrayOfSets用来存储注释信息

>>> gas = HTSeq.GenomicArrayOfSets( "auto", stranded=False )

>>> gas[ HTSeq.GenomicInterval( "chr1", 100, 250 ) ] += "A"

>>> gas[ HTSeq.GenomicInterval( "chr1", 360, 640 ) ] += "A"

>>> gas[ HTSeq.GenomicInterval( "chr1", 510, 950 ) ] += "B"

>>> read_iv = HTSeq.GenomicInterval( "chr1", 450, 800 )

>>> for iv, val in gas[ read_iv ].steps():

... print(iv, sorted(val))

...

chr1:[450,510)/. ['A']

chr1:[510,640)/. ['A', 'B']

chr1:[640,800)/. ['B']

python中ht_Python包学习-HTSeq相关推荐

  1. python xlrd安装_详解python中xlrd包的安装与处理Excel表格

    一.安装xlrd 地址 下载后,使用 pip install .whl安装即好. 查看帮助: >>> import xlrd >>> help(xlrd) Help ...

  2. python怎么安装本地的egg_怎么安装python中egg包

    怎么安装python中egg包 发布时间:2020-07-08 17:11:05 来源:亿速云 阅读:175 作者:Leah 怎么安装python中egg包?很多新手对此不是很清楚,为了帮助大家解决这 ...

  3. 对于python来说、一个模块就是一个文件-PYTHON中的包和模块

    为了更加友好的对python代码进行组织管理,python中出现了包和模块的概念 类似生活中整理我们的物品一样,将代码按照不同的功能进行整理整合,可以很大程度的提升代码可读性和代码质量,方便在项目中进 ...

  4. Python中索引的学习笔记

    1 前言 今天在学习FaceBoxes- 看到一个比较奇怪的代码,"order = scores.argsort()[::-1][:args.top_k]",不太懂这个" ...

  5. python中confIgparser模块学习

    python中configparser模块学习 ConfigParser模块在python中用来读取配置文件,配置文件的格式跟windows下的ini配置文件相似,可以包含一个或多个节(section ...

  6. python中引入包的时候报错AttributeError: module ‘sys‘ has no attribute ‘setdefaultencoding‘解决方法?

    python中引入包的时候报错AttributeError: module 'sys' has no attribute 'setdefaultencoding'解决方法? 参考文章: (1)pyth ...

  7. Python入门-Python中的包,impot,from,import

    #Python中的包 #包(python package)是一个分层次的目录(directory)结构,它将一组功能相近的模块组织在一个目录下 #作用:1.代码规范,2.避免模块名称冲突 #包与目录的 ...

  8. 基于python中jieba包的中文分词中详细使用(一)

    文章目录 基于python中jieba包的中文分词中详细使用(一) 01.前言 02.jieba的介绍 02.1 What 02.2特点 02.3安装与使用 02.4涉及到的算法 03.主要功能 03 ...

  9. 浅析Python中signal包的使用

    原文链接:https://www.jb51.net/article/74844.htm 这篇文章主要介绍了Python中signal包的使用,主要在Linux系统下对进程信号进行相关操作,需要的朋友可 ...

最新文章

  1. lunix系统安装及分区补充安装包
  2. Ubuntu 相关命令行工具
  3. .net中使用XPath语言在xml中判断是否存在节点值的方法
  4. 天下为公:TCP堵塞控制
  5. python中二进制表示_Python中的二进制搜索:直观介绍
  6. centos6.8自带mysql_CentOS6.8下MySQL数据库版本信息查看
  7. 传输层端口号的范围是多少?被分为哪两部分_6.传输层协议
  8. 马斯克脑机接口、BrainOS 相继发布,未来已来?
  9. idea module重命名后去掉后面带着的原来的名字
  10. 力扣(SQL)584. 寻找用户推荐人
  11. arcgis api for javascript 3.33 清空、删除图层
  12. 如何去掉百度地图 信息框的白色箭头
  13. Android 真机连接本地PC服务器
  14. ubuntu桌面出现问题,重启x桌面方法
  15. AJAX——百闻不如一见
  16. 机器学习的最佳入门学习资源
  17. stm32呼吸灯实验
  18. linux MQ 比较全面的操作命令
  19. freeswitch被国外ip扫描,iptables解决办法
  20. php json数据中 双引号变为quot;解决

热门文章

  1. Python中的lambda表达式与filter函数
  2. 使用Git命令从远程仓库获取项目代码
  3. 西安外国语大学计算机基础,西安外国语大学教务处.PDF
  4. matlab全景图素材,科学网—meshlab查看360度全景图像 - 王琳的博文
  5. vue ---- webpack扩展
  6. Linux下使用Nohup后台运行程序
  7. Spring Cloud Feign使用详解
  8. Java 8 Comparator: 列表排序
  9. java保留两位小数 四种方式
  10. 注解@Mapper、@MapperScan