纯Python read_counts 转FPKM v2

重新写了一个由reads_counts转FPKM矩阵的脚本，之前的那一般只适用于18个样本的，这里更新了一下，没有样本限制了。
还是分为3步：

grep "exon" genome.gtf > genome_exon.gtf
python count_genelen_from_gft.py genome_exon.gtf gene.len
python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix

第一步：把gtf文件中的exon抓取出来
第二步：计算每条基因的exon长度和
第三步：基于每个样本匹配到的reads总数、每个基因的长度、read_counts矩阵得到FPKM矩阵

第一步：
直接运行上面脚本的第一行就行了，记得更改输入的gtf的文件的文件名为你的文件名。
gtf范例

Morimp01GS001   AUGUSTUS        start_codon     47156   47158   .       -       0       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    47073   47158   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     47073   47158   0.6     -       0       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    46887   46999   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     46887   46999   0.92    -       1       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    46644   46784   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     46644   46784   0.91    -       2       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    45970   46559   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     45970   46559   0.92    -       2       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        stop_codon      45970   45972   .       -       0       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";Morimp01GS001   AUGUSTUS        start_codon     54027   54029   .       +       0       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        exon    54027   54291   .       +       .       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        CDS     54027   54291   0.59    +       0       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        exon    54361   54689   .       +       .       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        CDS     54361   54689   0.22    +       2       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        stop_codon      54687   54689   .       +       0       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";

输出结果范例：

Morimp01GS001   AUGUSTUS        exon    16797   16949   .       +       .       gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001   AUGUSTUS        exon    17024   17219   .       +       .       gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001   AUGUSTUS        exon    17298   18718   .       +       .       gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001   AUGUSTUS        exon    47073   47158   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    46887   46999   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";

第二步：
这里有同学反馈跑的有问题，一般是代码中基因id所在列选择的问题，我的选择标志是a[-2]（按照 " 双引号分割的），如果选择有问题建议将a[-2]更改为a[1]，反正就是要选到gene名所在位置
count_genelen_from_gft.py 脚本内容：

import sys,re
file1 = sys.argv[1]
file2 = sys.argv[2]
f1 = open(file1,'r')
f2 = open(file2,'w')
flag = "fuck"
exon = []
for i in f1:a = i.split("\"")if flag == a[-2]:pos = i.split("\t")exon.append(abs(int(pos[4])-int(pos[3]))+1)elif flag == "fuck":flag = a[-2]pos = i.split("\t")exon.append(abs(int(pos[4])-int(pos[3]))+1)else:f2.write("{0}\t{1}\n".format(flag,sum(exon)))exon = []flag = a[-2]pos = i.split("\t")exon.append(abs(int(pos[4])-int(pos[3]))+1)
f1.close()
f2.close()

输出的gene.len范例：

MIM04M24Gene00599       2898
MIM04M24Gene00600       1035
MIM04M24Gene08324       588
MIM04M24Gene08325       468
MIM04M26Gene00001       1770
MIM04M26Gene00002       930
MIM04M26Gene00003       594
MIM04M26Gene00004       426
MIM04M26Gene00005       1002
MIM04M26Gene00006       792
MIM04M26Gene00007       1125
MIM04M26Gene00008       4041
MIM04M26Gene00009       6537
MIM04M26Gene00010       309
MIM04M26Gene00011       1293
MIM04M26Gene00012       282
MIM04M26Gene00013       765
MIM04M26Gene00014       1680
MIM04M26Gene00015       1134
MIM04M26Gene00016       648

第三步：

python脚本：Caculate_FPKM_v2.py

import sys,re
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]
file4 = sys.argv[4]
f1 = open(file1,'r')    # mapped gene counts of each samples
f2 = open(file2,'r')    # length of each genes
f3 = open(file3,'r')    # raw reads counts matrix
f4 = open(file4,'w')    # output file of FPKM
a = []
arrf1 = []              # mapped gene counts of each samples
dickf2 = {}             # store the genelength of each gene to the dickf2 dicttory
dickf3 = {}
for i in f1:i = i.strip("\n")if re.match('sample',i,re.IGNORECASE) == None:a = i.split("\t")arrf1.append(int(a[1]))else:continue
f1.close()
print(len(arrf1))
for i in f2:i = i.strip("\n")a = i.split("\t")dickf2[a[0]] = int(a[1])
f2.close()
for i in f3:i = i.strip("\n")if re.match("Geneid",i,re.IGNORECASE) == None:a = i.split("\t")dickf3[a[0]] = a[1:len(arrf1)+1]else:f4.write(i+"\n")
f3.close()
for i in dickf3.keys():f4.write(i+"\t")for j in range(len(arrf1)):a = int(dickf3[i][j])#print(a)try:b = (a*1000000.0)/(arrf1[j]*(dickf2[i]/1000.0))except ZeroDivisionError:b = 0print(i)except KeyError:print(i)continuef4.write("{}".format(b))f4.write("\t")f4.write("\n")
f4.close()

输入的f1文件 mapped_reads 为map到每个样本的总的reads数量，注意第一列的头部一定要是sample开头：

sample  mapped_reads
NG-1    19410286
NG-2    19299615
NG-3    18635477
NY-1    21263905
NY-2    20375441
NY-4    17433317
WG-1    1488987
WG-2    19497358
WG-3    18309507
WY-1    20455991
WY-2    18644498
WY-3    31863298

输入的f2文件 gene.len 为每个gene的exon长度和信息文件（前面已有范例）。

输入的f3文件 raw_counts.matrix 为read_counts矩阵，注意第一列的头部为Geneid ，范例:

Geneid  NG-1    NG-2    NG-3    NY-1    NY-2    NY-4    WG-1    WG-2    WG-3    WY-1    WY-2    WY-3
TraesCS1A01G000100.1    0       1       0       3       10      22      0       3       2       32      75      51
TraesCS1A01G000200.1    1       0       0       0       3       16      0       2       2       17      14      25
TraesCS1A01G000300.1    0       0       1       0       2       7       0       2       0       4       10      20
TraesCS1A01G000400.1    2       2       1       0       6       58      1       20      8       14      42      44
TraesCS1A01G000500.1    1       0       2       0       2       6       0       1       4       1       4       17

最后的输出结果为FPKM矩阵，范例：

Geneid  NG-1    NG-2    NG-3    NY-1    NY-2    NY-4    WG-1    WG-2    WG-3    WY-1    WY-2    WY-3
TraesCS5A01G005100.1    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
TraesCS2B01G450000.1    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
TraesCS4A01G432900.1    0.28    0.17    0.23    0.0     0.0     0.0     0.72    0.11    0.15    0.0     0.0     0.0
TraesCSU01G138200.1     0.0     0.35    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
TraesCS5D01G490900.1    1.09    1.47    0.91    0.0     0.14    0.0     0.0     0.8     0.93    0.0     0.08    0.13

纯Python read_counts 转FPKM v2相关推荐

数学推导+纯Python实现机器学习算法：GBDT
Datawhale推荐作者:louwill,Machine Learning Lab 时隔大半年,机器学习算法推导系列终于有时间继续更新了.在之前的14讲中,笔者将监督模型中主要的单模型算法基本都过 ...
纯Python实现Torch API，康奈尔副教授为自己的课程创建了DIY教学库
视学算法报道编辑:杜伟该项目是为纽约校区 Cornell Tech 的「机器学习工程」课程开发的. 近日,机器之心在 GitHub 上发现了一个 DIY 教学库--MiniTorch,该库适用于希 ...
python数据库应用开发实例_纯Python开发的nosql数据库CodernityDB介绍和使用实例
看看这个logo,有些像python的小蛇吧 .这次介绍的数据库codernityDB是纯python开发的. 先前用了下tinyDB这个本地数据库,也在一个api服务中用了下,一开始觉得速度有些不给 ...
python递归_纯Python递归计算行列式
今天做leetcode周赛碰到一个判断三点共线的问题,好在数学系的我马上反应到了行列式,然鹅行列式展开来那么多项,,输错好几次的我直接判了三次罚时,凉凉趁空闲写了个纯py递归计算行列式的程序,兴许以 ...
数学推导+纯Python实现机器学习算法：逻辑回归
2019独角兽企业重金招聘Python工程师标准>>> 自本系列第一讲推出以来,得到了不少同学的反响和赞成,也有同学留言说最好能把数学推导部分写的详细点,笔者只能说尽力,因为打公式实 ...
python进行矩阵计算公式_纯python进行矩阵的相乘运算的方法示例
本文介绍了纯python进行矩阵的相乘运算的方法示例,分享给大家,具体如下: def matrixMultiply(A, B): # 获取A的行数和列数 A_row, A_col = shape(A) ...
纯Python包发布setup脚本编写示例
纯Python包发布setup脚本编写示例 2014 年 6 月 23 日IT.PythonIT.python 如果你有多个模块需要发布,而它们又存在于多个包中,那么指定整个包比指定模块可能要容易地多 ...
纯Python模块发布setup脚本编写示例
纯Python模块发布setup脚本编写示例 2014 年 6 月 22 日IT.PythonIT.python 如果你正准备发布几个模块,特别当它们并不是只在一个特定的包内,你可以在setup脚本中 ...
【机器学习基础】数学推导+纯Python实现机器学习算法30：系列总结与感悟
Python机器学习算法实现 Author:louwill Machine Learning Lab 终于到了最后的总结.从第一篇线性回归的文章开始到现在,已经接近有两年的时间了.当然,也不是纯写这3 ...

纯Python read_counts 转FPKM v2

纯Python read_counts 转FPKM v2相关推荐

最新文章

热门文章