python中ht_Python包学习-HTSeq
HTSeq用作测序数据和分析软件之间的“粘合剂”,主要有下述功能:通过碱基质量了解数据
统计基因组覆盖度
GFF file文件读取注释信息
将RNA-Seq实验中的比对结果分配给外显子和基因。
--------------------------
安装:pip install HTSeq
服用方法
1. 读取fastq,获取read信息
>>> import HTSeq
>>> import itertools
>>> fastq_file = HTSeq.FastqReader( "test.fastq.gz")
>>> for read in itertools.islice( fastq_file, 5 ):
... print(read)
...
TCGCTAGCAAGACTTTTTCTTTTTCTAGGCACAGTAGGTATTAGATTAAATATGGTAATCACTCACTTCACTTCTGGAAGCAACAGCCCGGTGCAGTGGAGTCAGCCACATGTTGTCCTTGGCATTTACACGAGCTCCTGGAATCAAACA
CCCAGAGGCGGCAAGCTTCCTGTTCCATGGAGCCGGTACTCAGTCAAGGTCAGTATCTTCTGATCCTCCTTCTGCGGCTGCTGGGAGGTGGCAGGCCGGACAGCAGAAGGACAGGAGTGGCTCCATGGGGGCAGGCTTTCTGTGGCTGTG
GTTTGTGCCAGATGACGATTTGAATTAATAAATTCATTTGGTATAAACCGCGATTATTTTTGCATCAACTTACTACAGTCTATGGTCTTTTGTGTCTGGCATCGTTCACTTAGCGTGTTTTCATGGTTCATCCAGATTGTAACATGTTTC
TTCAGCCATTTCCCACTTCGCTTTAAGAGCATGCCCTGTTTAATGGGGATGGCTCTGCCGCTCCCGATGGTGTCAGCATGATTCTCCGGGGCTTTCCTCTCTTTGTCTGGGTCACTCCCTTTCTCAGATGTAAACAGGTTGGACCAGCGC
GTGATTTTCTCCCTTGAGGAGTTGAAGGAGATTGAGAAGGACTGTGCCGTCTACGTTGGGCGCATGGAGAGGGTGGCCCGCCACAGCTCTGTCAGCAAGGAGGAGAAGGTGCGTGGCACCACAGGTCTCGGGGGCTTTCTTCGCCCCTGC
# read 变量仍然保留第十个read的信息
>>> read
>>> read.name
'A00312:90:HKVF5DSXX:3:1101:3441:1000 1:N:0:GCCTATCA'
>>> read.seq
b'GTGATTTTCTCCCTTGAGGAGTTGAAGGAGATTGAGAAGGACTGTGCCGTCTACGTTGGGCGCATGGAGAGGGTGGCCCGCCACAGCTCTGTCAGCAAGGAGGAGAAGGTGCGTGGCACCACAGGTCTCGGGGGCTTTCTTCGCCCCTGC'
>>> read.qual
array([37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37],
dtype=uint8)
遍历fastq文件,统计所有read的质量信息
# 创建长度序列读长的0向量
>>> import numpy
>>> len(read)
150
>>> qualsum = numpy.zeros(len(read),numpy.int)
>>> qualsum
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# 遍历fastq对象,所有位置碱基质量相加
>>> nreads = 0
>>> for read in fastq_file:
... qualsum += read.qual
... nreads += 1
>>>
>>> qualsum / float(nreads)
array([36.56013134, 36.63208692, 36.6780693 , 36.69489843, 36.71856081,
36.72337391, 36.70846614, 36.71600597, 36.71639324, 36.73297184,
... ...
36.20018642, 36.19426629, 36.19388805, 36.20587639, 36.17329932,
36.15208402, 36.13248803, 36.13966627, 36.12272692, 36.12639415])
# 画图展示
>>>from matplotlib import pyplot
>>> pyplot.plot( qualsum / nreads )
>>> pyplot.show()
fastq更多质控,在命令行使用 htseq-qa 实现。htseq-qa还可以用sam文件生成比对到和未比对到参考基因组的质控信息(看sam分析模块)。htseq-qa参数的意义看这Quality Assessment with
$python -m HTSeq.scripts.qa -t fastq -o test.pdf -r 150 test.fastq.gz
200000 reads processed
400000 reads processed
... ...
49200000 reads processed
49318779 reads processed
$
图好丑==
抽取特定区域reads重新写入新的bam文件
>>> bam_reader = HTSeq.BAM_Reader( "deduped.bam" )
>>> for a in itertools.islice( bam_reader, 5 ): # printing first 5 reads
... print(a)
...
## 此处报错没测通
>>> bam_writer = HTSeq.BAM_Writer.from_BAM_Reader( "region.bam", bam_reader ) #set-up BAM_Writer with same header as reader
>>> for a in bam_reader.fetch( region = "1:249000000-249200000" ): #fetching reads in a region
... print("Writing Alignment", a, "to file", bam_writer.filename)
... bam_writer.write( a )
Writing Alignment to file region.bam
Writing Alignment to file region.bam
...
Writing Alignment to file region.bam
>>> bam_writer.close()
# 查看sam中的a对象
>>> a
>>> a.read
>>> a.read.name
'A00250:179:HN33WDSXX:3:1659:5077:20838'
>>> a.read.qualstr
b'FF,F,FFFFFFFFFFFFFFFFFFFFFF,:FFFFFF:FFFFFFFF:F:FFF:FFFFF,FF:F:,F:,FF:FF,FF,:::::,,F::F,F:,,F,,,,,F:,,,,F,,:,,F:F,F,F,,,,,,,,,:FF:,,,,F,,F,,F,,,,F::,:F'
>>> a.read.qual
array([37, 37, 11, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 25, 37, 37, 37, 37, 37,
... ...
37, 11, 11, 37, 11, 11, 11, 11, 37, 25, 25, 11, 25, 37],
dtype=uint8)
>>> a.iv
>>> a.iv.chrom
'chr1'
>>> a.iv.start
10018
>>> a.iv.end
10146
>>> a.iv.strand
'-'
处理CNV时用到的GenomicArrays 和 GenomicInterval
1) genomicInterval用来存储染色体及位置信息等
>>> iv = HTSeq.GenomicInterval( "chr3", 123203, 127245, "+" )
>>> print(iv)
chr3:[123203,127245)/+
>>> iv
>>> iv.length
4042
# 看两个iv 之间的包含于被包含关系
>>> iv2 = HTSeq.GenomicInterval( "chr3", 123100, 125045, "+" )
>>> iv2.overlaps(iv)
True
>>> iv2.contains(iv)
False
>>> iv2.is_contained_in(iv)
False
>>>
2) GenomicArray可以根据每条染色体信息,并用iv处理染色体特定区域信息
# 构建ga
>>> chromlens = { 'chr1': 3000, 'chr2': 2000, 'chr3': 1000 }
>>> chromlens
{'chr1': 3000, 'chr2': 2000, 'chr3': 1000}
>>> ga = HTSeq.GenomicArray( chromlens, stranded=False, typecode="i" )
>>> ga
# 构建一个iv
>>> iv = HTSeq.GenomicInterval( "chr1", 100, 120, "." )
# 设iv区域copy number 为5
>>> ga[iv] = 5
# 构建另一个iv
>>> iv = HTSeq.GenomicInterval( "chr1", 110, 135, "." )
# 此iv区域 copy number 增加3
>>> ga[iv] += 3
>>> iv = HTSeq.GenomicInterval( "chr1", 90, 140, "." )
# 查看特定位置拷贝数
>>> list(ga[iv])
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0]
储存每个碱基的copy number信息浪费资源,GenomicArray只储存copy number 一致的连续区域信息:
>>> for iv2,value in ga[iv].steps():
... print(iv2,value)
...
chr1:[90,100)/. 0
chr1:[100,110)/. 5
chr1:[110,120)/. 8
chr1:[120,135)/. 3
chr1:[135,140)/. 0
计算覆盖度实例
# 创建空的ga
>>> cvg = HTSeq.GenomicArray( "auto", stranded=True, typecode="i" )
>>> alignment_file = HTSeq.SAM_Reader("deduped.sam")
>>> for alngt in alignment_file:
... if alngt.aligned:
... cvg[alngt.iv] +=1
...
>>> pyplot.plot(list( cvg[ HTSeq.GenomicInterval( "chr1", 12000, 13000, "+" ) ] ) )
[]
>>> pyplot.show()
GenomicArrayOfSets用来存储注释信息
>>> gas = HTSeq.GenomicArrayOfSets( "auto", stranded=False )
>>> gas[ HTSeq.GenomicInterval( "chr1", 100, 250 ) ] += "A"
>>> gas[ HTSeq.GenomicInterval( "chr1", 360, 640 ) ] += "A"
>>> gas[ HTSeq.GenomicInterval( "chr1", 510, 950 ) ] += "B"
>>> read_iv = HTSeq.GenomicInterval( "chr1", 450, 800 )
>>> for iv, val in gas[ read_iv ].steps():
... print(iv, sorted(val))
...
chr1:[450,510)/. ['A']
chr1:[510,640)/. ['A', 'B']
chr1:[640,800)/. ['B']
python中ht_Python包学习-HTSeq相关推荐
- python xlrd安装_详解python中xlrd包的安装与处理Excel表格
一.安装xlrd 地址 下载后,使用 pip install .whl安装即好. 查看帮助: >>> import xlrd >>> help(xlrd) Help ...
- python怎么安装本地的egg_怎么安装python中egg包
怎么安装python中egg包 发布时间:2020-07-08 17:11:05 来源:亿速云 阅读:175 作者:Leah 怎么安装python中egg包?很多新手对此不是很清楚,为了帮助大家解决这 ...
- 对于python来说、一个模块就是一个文件-PYTHON中的包和模块
为了更加友好的对python代码进行组织管理,python中出现了包和模块的概念 类似生活中整理我们的物品一样,将代码按照不同的功能进行整理整合,可以很大程度的提升代码可读性和代码质量,方便在项目中进 ...
- Python中索引的学习笔记
1 前言 今天在学习FaceBoxes- 看到一个比较奇怪的代码,"order = scores.argsort()[::-1][:args.top_k]",不太懂这个" ...
- python中confIgparser模块学习
python中configparser模块学习 ConfigParser模块在python中用来读取配置文件,配置文件的格式跟windows下的ini配置文件相似,可以包含一个或多个节(section ...
- python中引入包的时候报错AttributeError: module ‘sys‘ has no attribute ‘setdefaultencoding‘解决方法?
python中引入包的时候报错AttributeError: module 'sys' has no attribute 'setdefaultencoding'解决方法? 参考文章: (1)pyth ...
- Python入门-Python中的包,impot,from,import
#Python中的包 #包(python package)是一个分层次的目录(directory)结构,它将一组功能相近的模块组织在一个目录下 #作用:1.代码规范,2.避免模块名称冲突 #包与目录的 ...
- 基于python中jieba包的中文分词中详细使用(一)
文章目录 基于python中jieba包的中文分词中详细使用(一) 01.前言 02.jieba的介绍 02.1 What 02.2特点 02.3安装与使用 02.4涉及到的算法 03.主要功能 03 ...
- 浅析Python中signal包的使用
原文链接:https://www.jb51.net/article/74844.htm 这篇文章主要介绍了Python中signal包的使用,主要在Linux系统下对进程信号进行相关操作,需要的朋友可 ...
最新文章
- lunix系统安装及分区补充安装包
- Ubuntu 相关命令行工具
- .net中使用XPath语言在xml中判断是否存在节点值的方法
- 天下为公:TCP堵塞控制
- python中二进制表示_Python中的二进制搜索:直观介绍
- centos6.8自带mysql_CentOS6.8下MySQL数据库版本信息查看
- 传输层端口号的范围是多少?被分为哪两部分_6.传输层协议
- 马斯克脑机接口、BrainOS 相继发布,未来已来?
- idea module重命名后去掉后面带着的原来的名字
- 力扣(SQL)584. 寻找用户推荐人
- arcgis api for javascript 3.33 清空、删除图层
- 如何去掉百度地图 信息框的白色箭头
- Android 真机连接本地PC服务器
- ubuntu桌面出现问题,重启x桌面方法
- AJAX——百闻不如一见
- 机器学习的最佳入门学习资源
- stm32呼吸灯实验
- linux MQ 比较全面的操作命令
- freeswitch被国外ip扫描,iptables解决办法
- php json数据中 双引号变为quot;解决