python 处理snp的vcf文件,统计snp在基因的intron、exon还是上游、下游还是不在基因及基因附近
1、c处理vcf文件,初步统计snp位置
位置信息有5种
down_stream是gene下游2k以内,up_stream 是gene上游2k以内,gene_exon是snp在外显子内,gene_intron是gene在内含子内,uninfluncial是gene不在上诉所有区域
SnpType.py
内容:
#这个脚本是用来看snp哪些是在基因中(分为exon和intron),哪些是在基因的上游、下游
#最终输出结果为:scaffold position 位置 geneid
#输入文件1:snpeff结果文件(/public/home/wangwen_lab/zhangjiexiong/Ageing/SNP/snpeff/snpeffC_F/02.runC/mor_imp_strictestfilter_Snp_AncRef_selected_anno4.vcf)
#输入文件2:gff3文件(/public/home/wangwen_lab/zhangjiexiong/Ageing/SNP/genome.gff3)import sys,re
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]
f1 = open(file1,'r')
f2 = open(file2,'r')
f3 = open(file3,'w')
flag = 'fuck'
f1dick = {}
arr1 = []
for i in f1:if re.match("#",i):continuea = i.split("\t")if flag != a[0] and flag is not 'fuck':f1dick[flag] = arr1flag = a[0]arr1 = []# print(a[0])arr1.append(a[1])elif flag is 'fuck':flag = a[0]else:arr1.append(a[1])
f1dick[flag] = arr1
#for i in f1dick:
# print(i,f1dick[i])sc = "fuck"
f2dick1 = {}
f2dick2 = {}
f2arr1 = []
for i in f2:if re.match("M",i) == None:f2dick1[geneid] = f2arr1continuea = i.split("\t")if a[2] == 'gene':f2arr1 = []f2arr1.append(a[3:5])if sc != a[0]:f2dick2[sc] = f2dick1f2dick1 = {}sc = a[0]geneid = a[8].split(";")[0].split("=")[1]if a[2] == 'exon':f2arr1.append(a[3:5])
f2dick1[geneid] = f2arr1
f2dick2[sc] = f2dick1
del f2dick2['fuck']
f2.close()
for i in f1dick.keys():for j in f1dick[i]:#取到scaffold内的所有snp位点try:taq = 0for k in f2dick2[i].keys():#对gff3文件中所有的scaffold为i的gene进行遍历a = f2dick2[i][k]amax = max(a[0])amin = min(a[0])flag = 0if int(amin) <= int(j) <= int(amax):#判断在gene中后判断在exon还是在intronfor u in a[1:]:if int(min(u))<int(j)<int(max(u)):flag = "gene_exon"if flag != "gene_exon":flag = "gene_intron"elif int(min(a[0]))-2000 <= int(j) <=int(min(a[0])):#判断是不是在gene下游flag = 'down_stream'elif int(max(a[0]))+2000 >= int(j) >= int(max(a[0])):#判断是不是gene上游flag = "up_stream"else:passif flag != 0:f3.write(i+"\t"+j+"\t"+k+"\t"+flag+"\t"+max(a[0])+"\t"+min(a[0])+"\n")taq = 1if taq == 0:f3.write(i+"\t"+j+"\t"+"\t"+"Uninfluncial"+"\n")except KeyError:print("KeyError",sys.exc_info())
f3.close()
输入文件为vcf和gff3
vcf内容如下:
##contig=<ID=MI_M04M24_Scaffold001_2344856_2346921_target_region_2345149_2346921,length=2066>
##contig=<ID=MI_M04M24_Scaffold018_52187_53324_target_region_52187_53324,length=1138>
##contig=<ID=MI_M04M24_Scaffold037_441019_444340_target_region_441019_444340,length=3322>
##contig=<ID=MI_M04M24_Scaffold041_338612_340340_target_region_338612_340099,length=1729>
##contig=<ID=MI_M04M24_Scaffold041_340322_342654_target_region_340322_342654,length=2333>
##contig=<ID=MI_M04M24_Scaffold114_32039_35224_target_region_32039_35224,length=3186>
##reference=file:///public/home/wangwen_lab/zhangjiexiong/Ageing/SNP/genome.fasta
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Mor_importuna
Morimp01GS001 1387 . C T 969.80 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=1.56;ClippingRankSum=-4.910e-01;DP=28;ExcessHet=
Morimp01GS001 1418 . C T 1427.78 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=-1.073e+00;ClippingRankSum=-2.400e-02;DP=43;Exce
Morimp01GS001 8939 . A C 282.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.67;ClippingRankSum=-5.330e-01;DP=13;ExcessHet
Morimp01GS001 9251 . C T 1538.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.460;ClippingRankSum=0.631;DP=88;ExcessHet=3.0
Morimp01GS001 12787 . T G 2514.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=2.25;ClippingRankSum=0.383;DP=109;ExcessHet=3.0
Morimp01GS001 13324 . C G 2177.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=2.17;ClippingRankSum=1.50;DP=88;ExcessHet=3.010
Morimp01GS001 13446 . G C 1158.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.994;ClippingRankSum=-1.990e-01;DP=58;ExcessHe
Morimp01GS001 13866 . A C 2370.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.094e+00;ClippingRankSum=-6.080e-01;DP=100;Ex
Morimp01GS001 14181 . T C 1455.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.710;ClippingRankSum=-6.500e-01;DP=71;ExcessHe
Morimp01GS001 14320 . A G 1473.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.42;ClippingRankSum=-8.810e-01;DP=64;ExcessHet
Morimp01GS001 14344 . C T 1471.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.48;ClippingRankSum=-1.880e-01;DP=69;ExcessHet
Morimp01GS001 14606 . T G 2329.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.280e-01;ClippingRankSum=-1.583e+00;DP=94;Exc
Morimp01GS001 14869 . A G 2209.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-4.340e-01;ClippingRankSum=-2.854e+00;DP=99;Exc
Morimp01GS001 14897 . C T 2048.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.603;ClippingRankSum=1.03;DP=88;ExcessHet=3.01
Morimp01GS001 15056 . A T 1930.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.870;ClippingRankSum=-3.810e-01;DP=88;ExcessHe
Morimp01GS001 15384 . A G 1690.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.241e+00;ClippingRankSum=0.431;DP=86;ExcessHe
Morimp01GS001 15564 . C T 2023.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.198;ClippingRankSum=0.014;DP=85;ExcessHet=3.0
Morimp01GS001 15729 . C T 2458.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=
gff3如下:
Morimp01GS001 AUGUSTUS gene 16797 18718 0.81 + . ID=MIM04M26Gene00001;Name=MIM04M26Gene00001
Morimp01GS001 AUGUSTUS mRNA 16797 18718 0.81 + . ID=MIM04M26Gene00001.t1;Parent=MIM04M26Gene00001
Morimp01GS001 AUGUSTUS exon 16797 16949 . + . ID=MIM04M26Gene00001.t1.exon1;Parent=MIM04M26Gene00001.t1
Morimp01GS001 AUGUSTUS CDS 16797 16949 0.82 + 0 ID=MIM04M26Gene00001.t1.CDS1;Parent=MIM04M26Gene00001.t1
Morimp01GS001 AUGUSTUS exon 17024 17219 . + . ID=MIM04M26Gene00001.t1.exon2;Parent=MIM04M26Gene00001.t1
Morimp01GS001 AUGUSTUS CDS 17024 17219 0.99 + 0 ID=MIM04M26Gene00001.t1.CDS2;Parent=MIM04M26Gene00001.t1
Morimp01GS001 AUGUSTUS exon 17298 18718 . + . ID=MIM04M26Gene00001.t1.exon3;Parent=MIM04M26Gene00001.t1
Morimp01GS001 AUGUSTUS CDS 17298 18718 0.98 + 2 ID=MIM04M26Gene00001.t1.CDS3;Parent=MIM04M26Gene00001.t1Morimp01GS001 AUGUSTUS gene 45970 47158 0.55 - . ID=MIM04M26Gene00002;Name=MIM04M26Gene00002
Morimp01GS001 AUGUSTUS mRNA 45970 47158 0.55 - . ID=MIM04M26Gene00002.t1;Parent=MIM04M26Gene00002
Morimp01GS001 AUGUSTUS exon 47073 47158 . - . ID=MIM04M26Gene00002.t1.exon1;Parent=MIM04M26Gene00002.t1
Morimp01GS001 AUGUSTUS CDS 47073 47158 0.6 - 0 ID=MIM04M26Gene00002.t1.CDS1;Parent=MIM04M26Gene00002.t1
Morimp01GS001 AUGUSTUS exon 46887 46999 . - . ID=MIM04M26Gene00002.t1.exon2;Parent=MIM04M26Gene00002.t1
Morimp01GS001 AUGUSTUS CDS 46887 46999 0.92 - 1 ID=MIM04M26Gene00002.t1.CDS2;Parent=MIM04M26Gene00002.t1
Morimp01GS001 AUGUSTUS exon 46644 46784 . - . ID=MIM04M26Gene00002.t1.exon3;Parent=MIM04M26Gene00002.t1
Morimp01GS001 AUGUSTUS CDS 46644 46784 0.91 - 2 ID=MIM04M26Gene00002.t1.CDS3;Parent=MIM04M26Gene00002.t1
Morimp01GS001 AUGUSTUS exon 45970 46559 . - . ID=MIM04M26Gene00002.t1.exon4;Parent=MIM04M26Gene00002.t1
Morimp01GS001 AUGUSTUS CDS 45970 46559 0.92 - 2 ID=MIM04M26Gene00002.t1.CDS4;Parent=MIM04M26Gene00002.t1
输入结果为c.txt
对结果再进行排序
sort -k1 -k2 -V c.txt > snp_type_F.list
snp_type_F.list结果如下:
Morimp01GS001 1418 Uninfluncial
Morimp01GS001 1476 Uninfluncial
Morimp01GS001 8939 Uninfluncial
Morimp01GS001 9251 Uninfluncial
Morimp01GS001 12787 Uninfluncial
Morimp01GS001 13324 Uninfluncial
Morimp01GS001 13446 Uninfluncial
Morimp01GS001 13866 Uninfluncial
Morimp01GS001 14181 Uninfluncial
Morimp01GS001 14320 Uninfluncial
Morimp01GS001 14344 Uninfluncial
Morimp01GS001 14606 Uninfluncial
Morimp01GS001 14869 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 14897 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 15056 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 15384 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 15564 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 15729 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 15747 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 15943 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 15981 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 16351 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 16374 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 16376 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 16472 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 16496 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 16640 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 16761 MIM04M26Gene00001 down_stream 18718 16797
Morimp01GS001 17077 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 17340 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 17470 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 17478 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 17963 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 18077 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 18153 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 18448 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 18449 MIM04M26Gene00001 gene_exon 18718 16797
Morimp01GS001 18798 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 18802 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 18970 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19137 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19292 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19367 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19394 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19401 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19446 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19785 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19832 MIM04M26Gene00001 up_stream 18718 16797
Morimp01GS001 19885 MIM04M26Gene00001 up_stream 18718 16797
次文件position存在重复(eg: 既在A gene内,又在B gene的上游)
处理一下
display_SnpType.py
内容如下:
#这个是用来处理SnpType.py
import sys,re,math
file1 = sys.argv[1]
file2 = sys.argv[2]
f1 = open(file1,'r')
f2 = open(file2,'w')
la = [0,0]
a = []
fuck = []
arr =[]
arrline = []
for i in f1:if re.search("Uninfluncial",i):f2.write(i)else:line = i.strip()a = line.split("\t")if a[1] == la[1]:arr.append(la[4])arr.append(la[5])arrline.append(la)else:if la[1] == 0:la = acontinuearrline.append(la)arr.append(la[4])arr.append(la[5])for j in arr:c = abs(int(j)-int(la[1]))fuck.append(c)t = min(fuck)zjx = fuck.index(t)x = zjx//2fuck = []for zz in arrline[x]:f2.write(zz+"\t")f2.write("\n")arrline = []arr = []la = a
arrline.append(la)
arr.append(la[4])
arr.append(la[5])
for j in arr:c = abs(int(j)-int(la[1]))fuck.append(c)
t = min(fuck)
zjx = fuck.index(t)
x = zjx//2
fuck = []
for zz in arrline[x]:f2.write(zz+"\t")
f2.write("\n")
arrline = []
arr = []
f1.close()
f2.close()
python display_SnpType.py snp_type_F.list snp_type_F_clear.list
结果和处理之前差不多,就是根据距离筛选了与SNP的position关联最强的gene,一个position对应一个gene
python 处理snp的vcf文件,统计snp在基因的intron、exon还是上游、下游还是不在基因及基因附近相关推荐
- 2021.06.08|提取、比较各样品vcf文件中snp突变频率
目录 摘要 环境与方法 使用代码 分析结果 总结 摘要 接到一个wgs项目,要帮助客户统计vcf文件中snp突变频率,比较两个样品的突变位点.这个工作在上一个项目中是手动处理的,当时参考序列短,突变位 ...
- 【PYTHON小项目】VCF文件转EXCEL文件方法详解(附QUOTED-PRINTABLE编解码)
来源 整理大量通讯录时,发现从手机上下载的CSV文件不易转成EXCEL文件(QQ通讯录和百度云盘都试过了,CSV文件过大无法加载),导致整理起来特别麻烦,故试图自己写一个小程序来处理文件. 分析CSV ...
- java vcf文件 昵称怎么写_Annovar注释vcf-笔记
找了突变,获得了snp的vcf文件,肯定想知道这些突变位点到底是出现在哪些基因上以及那些转录本上:如果是出现在外显子上的突变,想了解这些突变会对编码蛋白造成怎么样的影响,这些统统能用annovar解决 ...
- R语言丨根据VCF文件设计引物,自动识别两样本差异SNP位点,调用samtools获取上下游参考序列
根据变异位点设计引物序列 今天碰到一个新问题:假如有一个vcf文件储存了两个样品的变异位点基因型数据,每行代表一个位点,我现在想找出两样本差异的SNP位点,再把差异位点用[REF/ALT]的形式表示, ...
- pta通讯录排序用python实现,python实现将android手机通讯录vcf文件转化为csv
经常会遇到将手机通讯录导出到电脑并转化为在电脑中可编辑的情况,在网上搜索了很久当前不外乎两种处理方式.1.使用电脑的outlook的通讯簿功能,将手机导出的vcf文件导入到outlook的通讯录中,然 ...
- Python+pandas读取Excel文件统计最受欢迎的前3位演员
推荐教材:<Python程序设计基础与应用>(ISBN:9787111606178),董付国,机械工业出版社,2018.8出版,2021.3第11次印刷 图书详情: 配套资源: 用书教师可 ...
- python提取文本中的手机号_Python从vcf文件中读取手机号并进行去重操作
文章目录 1. Python代码 file = open('test.vcf', 'r', encoding='utf-8') tels = [] for line in file: line = l ...
- 生信技能04 - 生信分析所需致病SNP位点Excel文件制作教程
致病SNP位点Excel文件制作教程 在生信分析项目中,有遗传相关的项目会涉及到SNP位点的查找,将该SNP位点文件作为程序输入文件,本文将介绍如果通过NCBI查看目标基因的致病SNP位点,并制作成可 ...
- Python 导出手机通讯录文件 VCF 文件中的手机号码
文章目录 Python 导出手机通讯录文件 VCF 文件中的手机号码 1.代码 Python 导出手机通讯录文件 VCF 文件中的手机号码 1.代码 if __name__ == '__main__' ...
最新文章
- Java基础知识强化之IO流笔记41:字符流缓冲流之复制文本文件案例02(使用 [ newLine() / readLine() ] )(重要)...
- 2020年行政区划代码_2020年梧州市行政区划,了解梧州市有几个区,详细数据
- oracle sql语句 只读,Oracle_SQL语句
- 在其他事件中repeater的取值
- mysql全文索引 插件,如何编写MySQL全文索引插件
- 欧洲互联网将“死于”版权法?
- H5常用代码:页面框架
- BT种子 kitty
- 传奇java手游_Java手机游戏神灯传奇代码JAVA游戏源码下载
- C# WinForm 使用SMS接口发送手机验证码+图形验证码+IP限制
- 计算机卡住了怎样恢复,电脑频繁假死怎么办 电脑死机数据恢复
- c语言编程培训都是小学,小学编程培训班明故宫哪里有C语言培训
- X-Order创始人陶荣祺:Libra让所有互联网应用成为开放金融的一部分
- 【Undertale-传说之下】-中文补丁汉化steam
- angular2 简述
- 数据库基本结构SQL语句
- 输出长剑(华山论剑)
- 怎样使用PDF阅读器
- cmakelist的作用及使用
- python pip下载路径_Python pip install如何修改默认下载路径
热门文章
- html做相册浏览,ul结合CSS制作网页相册滑动浏览效果
- HTML中使背景图片自适应浏览器大小
- 【开发教程14】AI语音人脸识别(会议记录仪/人脸打卡机)-AI人脸系统架构
- shellcode加载器--从入门到放弃
- SpringCache的简单入门(RedisCacheManager)(@Cacheable、@CachePut、@CacheEvict)
- whale 帷幄:crm客户管理营销系统全称是什么
- 给时光以生命,而不是给生命以时光--2018年终总结
- 倾斜摄影三维建模软件photoscan教程 [转]
- 阿里巴巴(菜鸟) - 算法工程师(机器学习)提前批笔试面试总结
- python教学视频k_10个Python奇趣教程,附视频讲解+练手项目。