实验记录 | 为什么mtDNA的fastq数据会比对到常染色体上？

刚刚验证了一下，原始的数据文件并没有弄错。
因为又从sra数据库中下载了这个文件，比对了作者提供的数据和我们下载解压后的数据，发现是同样的大小，因此，从原始数据层面可以控制是一致的。

Source name ATCC
Organism Homo sapiens
Characteristics cell line: TF1
cell type: hematopoietic cell line 造血干细胞系
clone id: G10
Treatment protocol NA
Growth protocol RPMI, 10%FBS, Pen/Strep, 2ng/ml GM-CSF
Extracted molecule genomic DNA 基因组DNA（其实可以想象也包括常染色体的DNA，要不然只取常染色体的怎么做到）
Extraction protocol Qiagen TCL lysis buffer
Qiagen Mito-RepliG
Single cell mitochondrial DNA-seq, Nextera

Data processing Paired-end reads were aligned to hg19 using BOWTIE2 using the parameter –X2000 allowing fragments of up to 2 kb to align. Reads were subsequently filtered for alignment quality of >Q20 and were required to be properly paired. Mitochondrial genotypes were inferred for each ATAC-seq sample using a custom python script
Samples were filtered such that individual cells each had at least 100x coverage at 9 established heteroplasmic variants
Genome_build: hg19
Supplementary_files_format_and_content: High confidence mitochondrial variants for 3 clones assayed by 3 separate single cell technologies

所以接下来，比较困惑的一个问题就是，数据筛选的问题，因为在sra数据库上关于该数据明明说的是采样于mtDNA，为什么我们在比对的过程中会有比对到常染色体（如chr1）的情况呢？这就很奇怪。

/home/xxzhang/workplace/software/bwa/bwa mem -v 1 -t 32 -a -M ./geneome/hg38/hg38.fasta /home/xxzhang/workplace/QBRC/example/example_dataset/sequencing/SRR7246238_1.fastq.gz /home/xxzhang/workplace/QBRC/example/example_dataset/sequencing/SRR7246238_2.fastq.gz >/home/xxzhang/workplace/output/dna.sam

这里面的话，真正的参数其实就三个。我们现在依次的将它们注释一下。

-v  1 ： verbosity level: 1=error
-a ：output all alignments for SE or unpaired PE
-M ：   mark shorter split hits as secondary

这几个参数，粗粗看了一下，是对比对结果格式的要求。比较奇怪。
我对于bwa算法，对于比对的细节是不够清楚的。
是不是这里的某个参数发生了改变。

Usage: bwa mem [options] <in1.fq> [in2.fq]

Algorithm options:

   -t INT        number of threads [1]-k INT        minimum seed length [19]-w INT        band width for banded alignment [100]-d INT        off-diagonal X-dropoff [100]-r FLOAT      look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]-y INT        seed occurrence for the 3rd round seeding [20]-c INT        skip seeds with more than INT occurrences [500]-D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]-W INT        discard a chain if seeded bases shorter than INT [0]-m INT        perform at most INT rounds of mate rescues for each read [50]-S            skip mate rescue-P            skip pairing; mate rescue performed unless -S also in use

Scoring options:

   -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]-B INT        penalty for a mismatch [4]-O INT[,INT]  gap open penalties for deletions and insertions [6,6]-E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]-L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]-U INT        penalty for an unpaired read pair [17]-x STR        read type. Setting -x changes multiple parameters unless overridden [null]pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0  (PacBio reads to ref)ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0  (Oxford Nanopore 2D-reads to ref)intractg: -B9 -O16 -L5  (intra-species contigs to ref)

Input/output options:

   -p            smart pairing (ignoring in2.fq)-R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]-H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]-o FILE       sam file to output results to [stdout]-j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)-5            for split alignment, take the alignment with the smallest coordinate as primary-q            don't modify mapQ of supplementary alignments-K INT        process INT input bases in each batch regardless of nThreads (for reproducibility) []-v INT        verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [3]-T INT        minimum score to output [30]-h INT[,INT]  if there are <INT hits with score >80% of the max score, output all in XA [5,200]-a            output all alignments for SE or unpaired PE-C            append FASTA/FASTQ comment to SAM output-V            output the reference FASTA header in the XR tag-Y            use soft clipping for supplementary alignments-M            mark shorter split hits as secondary-I FLOAT[,FLOAT[,INT[,INT]]]specify the mean, standard deviation (10% of the mean if absent), max(4 sigma from the mean if absent) and min of the insert size distribution.FR orientation only. [inferred]

Note: Please read the man page for detailed description of the command line and options.

这个注释不够清楚。这里的话，我不是很明白seed是什么意思？

seed :
BWA采用seed-and-extend策略。在seed阶段，BWA取read的碱基片段在reference上进行精确匹配，并选择满足一定长度要求和匹配次数的read片段作为seed，这个阶段算法的核心是基于FM-index的精确匹配；在extend阶段，BWA利用Smith-Waterman算法将seed在read和reference上向两边延伸比对（容忍gap），进而找到整个read在reference上符合条件的全局匹配。
参考链接：https://blog.csdn.net/weixin_34075551/article/details/92404768
这里的长度要求和匹配次数，对应的参数，就在这里的：

用于比对的fragment的长度大于2kb

reads的质量大于20，properly 配对

如果我换用bowtie，会有不同的表现效果吗？我想看看，如果我做如上的限定，是不是结果中还是会有常染色体的reads（因为它就是有的）。

>(base) [xxzhang@mu02 bowtie2]$ ./bowtie2
No index, query, or output file specified!
Bowtie 2 version 2.4.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage:bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]<bt2-idx>  Index filename prefix (minus trailing .X.bt2).NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.<m1>       Files with #1 mates, paired with files in <m2>.Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).<m2>       Files with #2 mates, paired with files in <m1>.Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).<r>        Files with unpaired reads.Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).<i>        Files with interleaved paired-end FASTQ/FASTA readsCould be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).<bam>      Files are unaligned BAM sorted by read name.<sam>      File for SAM output (default: stdout)<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can bespecified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):Input:-q                 query input files are FASTQ .fq/.fastq (default)--tab5             query input files are TAB5 .tab5--tab6             query input files are TAB6 .tab6--qseq             query input files are in Illumina's qseq format-f                 query input files are (multi-)FASTA .fa/.mfa-r                 query input files are raw one-sequence-per-line-F k:<int>,i:<int> query input files are continuous FASTA where readsare substrings (k-mers) extracted from a FASTA file <s>and aligned at offsets 1, 1+i, 1+2i ... end of reference-c                 <m1>, <m2>, <r> are sequences themselves, not files-s/--skip <int>    skip the first <int> reads/pairs in the input (none)-u/--upto <int>    stop after first <int> reads/pairs (no limit)-5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)-3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)--trim-to [3:|5:]<int> trim reads exceeding <int> bases from either 3' or 5' endIf the read end is not specified then it defaults to 3 (0)--phred33          qualities are Phred+33 (default)--phred64          qualities are Phred+64--int-quals        qualities encoded as space-delimited integersPresets:                 Same as:For --end-to-end:--very-fast            -D 5 -R 1 -N 0 -L 22 -i S,0,2.50--fast                 -D 10 -R 2 -N 0 -L 22 -i S,0,2.50--sensitive            -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)--very-sensitive       -D 20 -R 3 -N 0 -L 20 -i S,1,0.50For --local:--very-fast-local      -D 5 -R 1 -N 0 -L 25 -i S,1,2.00--fast-local           -D 10 -R 2 -N 0 -L 22 -i S,1,1.75--sensitive-local      -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)--very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50Alignment:-N <int>           max # mismatches in seed alignment; can be 0 or 1 (0)-L <int>           length of seed substrings; must be >3, <32 (22)-i <func>          interval between seed substrings w/r/t read len (S,1,1.15)--n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)--dpad <int>       include <int> extra ref chars on sides of DP table (15)--gbar <int>       disallow gaps within <int> nucs of read extremes (4)--ignore-quals     treat all quality values as 30 on Phred scale (off)--nofw             do not align forward (original) version of read (off)--norc             do not align reverse-complement version of read (off)--no-1mm-upfront   do not allow 1 mismatch alignments before attempting toscan for the optimal seeded alignments--end-to-end       entire read must align; no clipping (on)OR--local            local alignment; ends might be soft clipped (off)Scoring:--ma <int>         match bonus (0 for --end-to-end, 2 for --local)--mp <int>         max penalty for mismatch; lower qual = lower penalty (6)--np <int>         penalty for non-A/C/G/Ts in read/ref (1)--rdg <int>,<int>  read gap open, extend penalties (5,3)--rfg <int>,<int>  reference gap open, extend penalties (5,3)--score-min <func> min acceptable alignment score w/r/t read length(G,20,8 for local, L,-0.6,-0.6 for end-to-end)Reporting:(default)          look for multiple alignments, report best, with MAPQOR-k <int>           report up to <int> alns per read; MAPQ not meaningfulOR-a/--all           report all alignments; very slow, MAPQ not meaningfulEffort:-D <int>           give up extending after <int> failed extends in a row (15)-R <int>           for reads w/ repetitive seeds, try <int> sets of seeds (2)Paired-end:-I/--minins <int>  minimum fragment length (0)-X/--maxins <int>  maximum fragment length (500)--fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)--no-mixed         suppress unpaired alignments for paired reads--no-discordant    suppress discordant alignments for paired reads--dovetail         concordant when mates extend past each other--no-contain       not concordant when one mate alignment contains other--no-overlap       not concordant when mates overlap at allBAM:--align-paired-readsBowtie2 will, by default, attempt to align unpaired BAM reads.Use this option to align paired-end reads instead.--preserve-tags    Preserve tags from the original BAM record byappending them to the end of the corresponding SAM output.Output:-t/--time          print wall-clock time taken by search phases--un <path>        write unpaired reads that didn't align to <path>--al <path>        write unpaired reads that aligned at least once to <path>--un-conc <path>   write pairs that didn't align concordantly to <path>--al-conc <path>   write pairs that aligned concordantly at least once to <path>(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)--quiet            print nothing to stderr except serious errors--met-file <path>  send metrics to file at <path> (off)--met-stderr       send metrics to stderr (off)--met <int>        report internal counters & metrics every <int> secs (1)--no-unal          suppress SAM records for unaligned reads--no-head          suppress header lines, i.e. lines starting with @--no-sq            suppress @SQ header lines--rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field--rg <text>        add <text> ("lab:value") to @RG line of SAM header.Note: @RG line only printed when --rg-id is set.--omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.--sam-no-qname-truncSuppress standard behavior of truncating readname at first whitespaceat the expense of generating non-standard SAM.--xeq              Use '='/'X', instead of 'M,' to specify matches/mismatches in SAM record.--soft-clipped-unmapped-tlenExclude soft-clipped bases when reporting TLEN--sam-append-commentAppend FASTA/FASTQ comment to SAM recordPerformance:-p/--threads <int> number of alignment threads to launch (1)--reorder          force SAM output order to match order of input reads--mm               use memory-mapped I/O for index; many 'bowtie's can shareOther:--qc-filter        filter out reads that are bad according to QSEQ filter--seed <int>       seed for random number generator (0)--non-deterministicseed rand. gen. arbitrarily instead of using read attributes--version          print version information and quit-h/--help          print this usage message
(ERR): bowtie2-align exited with value 1

比对这个需要建立起索引，所以耗时还是挺多的。它这里也提到了seed。
所以，让我更想知道比对过程中的seed，究竟有什么作用？

-T INT 当比对的分值比 INT 小时，不输出该比对结果，这个参数只影响输出的结果，不影响比对的过程。
-a 将所有的比对结果都输出，包括 single-end 和 unpaired paired-end的 reads，但是这些比对的结果会被标记为次优。
-k INT Minimum seed length. Matches shorter than INT will be missed. The alignment speed is usually insensitive to this value unless it significantly deviates 20. [19]
最小种子匹配长度，影响比对速度，默认为19
参考链接：https://www.jianshu.com/p/19f58a07e6f4

这里，我想到的是可能是-a参数的添加，所以在结果文件中一些低质量比对的reads会出现。
那我现在，将bwa的指令修改为：

/home/xxzhang/workplace/software/bwa/bwa mem -v 1 -t 32  -M  -k 2000 -T 20 ./geneome/hg38/hg38.fasta /home/xxzhang/workplace/QBRC/example/example_dataset/sequencing/SRR7246238_1.fastq.gz /home/xxzhang/workplace/QBRC/example/example_dataset/sequencing/SRR7246238_2.fastq.gz >/home/xxzhang/workplace/output/re2.sam

想要查看一下sam文件的内容。
本身就是一个文本文件，所以，可以通过head之间打开。
现在把sam文件转换为bam文件。

修改后：

SRR7246238.49 141 * 0 0 * * 0 0 C ATAAACCTATCCCCCATTCTCCTCCTATCCCTCAACC /A/A//EEE//<A<EE//AEE/<//A/EEA/EE/AE/A A S:i:0 XS:i:0
SRR7246238.50 77 * 0 0 * * 0 0 C CAATTACCCCTACCATGAGCCCTACAAACAACTTACC AA/AAEAEEAAEE/EEAEAEE/EEEA//EE//E///E< A S:i:0 XS:i:0
SRR7246238.50 141 * 0 0 * * 0 0 A ATAAGAGGGATCACATGACTATAAGTCGCAGGTTAGT AAAAAEEE6EEE//EE//<EE///AEE6EAA//E/EEE A S:i:0 XS:i:0

修改前：

SRR7246238.1593 163 chr1 629106 0 38M = 629113 45 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTAT AAAAAEEEEEEE6EEEEEEEEEEEEAEEEEEEE/EEEE NM:i:0 MD:Z:38 MC:Z:38M AS:i:38 XS:i:38
SRR7246238.38509 163 chr1 629106 0 38M = 629173 105 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTAT AAAAAEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEE NM:i:0 MD:Z:38 MC:Z:38M AS:i:38 XS:i:38
SRR7246238.41649 99 chr1 629106 0 38M = 629123 55 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTAT AAAAAEEEEEEAEEEEEEEEEEEEEEEEAEAEEE//EE NM:i:0 MD:Z:38 MC:Z:38M AS:i:38 XS:i:38
SRR7246238.51559 99 chr1 629106 0 38M = 629139 71 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTAT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEE NM:i:0 MD:Z:38 MC:Z:38M AS:i:38 XS:i:38
SRR7246238.75215 163 chr1 629106 21 38M = 629184 116 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCATAT AAAAAAEEEEEEEEEEEEAAEEEAEEEEEEEEAEEEEE NM:i:1 MD:Z:34C3 MC:Z:38M AS:i:34 XS:i:34
SRR7246238.105602 99 chr1 629106 15 38M = 629183 115 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTAT AAAAA6EEEEEEEEEEEEEAEEEEEEEEEEAEA/AEAE NM:i:0 MD:Z:38 MC:Z:38M AS:i:38 XS:i:38
SRR7246238.107380 99 chr1 629106 0 38M = 629160 92 CTTCAACATCGAATACGCCGCAGGCCACTTCGCCCTAT AAAA6AAAE/AEEEEEAEE/AEE6<E//A/E/E<A<// NM:i:1 MD:Z:26C11 MC:Z:38M AS:i:33 XS:i:33
SRR7246238.108971 353 chr1 629106 0 38M chrM 9509 0 * * NM:i:0 MD:Z:38 MC:Z:38M AS:i:38
SRR7246238.122723 163 chr1 629106 15 38M = 629183 115 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTAT A/AAAEEEE66EAA/AEEEAEEE6EAAE/E6EEE</// NM:i:0 MD:Z:38 MC:Z:38M AS:i:38 XS:i:38
SRR7246238.126973 99 chr1 629106 0 38M = 629144 76 CTTCAACATCGAATACGCCGCAGGCCCCTTCGTCCGAT AAAAAEEAEEEEEE6AEEEEEEEEEAEE/AEAE<///E NM:i:2 MD:Z:32C2T2 MC:Z:38M AS:i:32 XS:i:32
SRR7246238.139731 163 chr1 629106 21 38M = 629184 116 CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTAT AAAAAEEEEEEEEEEAEEAEAEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:38 MC:Z:38M AS:i:38 XS:i:38

所以我们的这个尝试是毫无意义的。甚至结果比想象的更加的糟糕。
但是我也注意到了一个现象，比如这条染色体：

SRR7246238.108971 353 chr1 629106 0 38M chrM 9509 0 * * NM:i:0 MD:Z:38 MC:Z:38M AS:i:38
我们的paired-end的reads，一条是比对到了1号染色体，一条是比对到了线粒体DNA。

这个过程用到的转换的方法(代码的复用很重要，最讨厌不断的查查查)：

samtools view -h re.sam >re1.bam #sam转换为bam文件
samtools sort -o sort3.bam -O bam re1.bam #对bam文件进行排序 #注意不要搞混两个文件的前后位置，意义不同
samtools view tumor.bam | head -n10 #查看bam文件

我们重新使用bwa进行比对，不过是去除-a这个参数。

/home/xxzhang/workplace/software/bwa/bwa mem -v 1 -t 32  -M  ./geneome/hg38/hg38.fasta /home/xxzhang/workplace/QBRC/example/example_dataset/sequencing/SRR7246238_1.fastq.gz /home/xxzhang/workplace/QBRC/example/example_dataset/sequencing/SRR7246238_2.fastq.gz >/home/xxzhang/workplace/output/noa.sam

我们比较一下，各个染色体上能够比对到参考基因组的reads数：

samtools idxstats noa_sort.bam | head -n30

chr1 248956422 31133 210
chr2 242193529 4193 43
chr3 198295559 3097 15
chr4 190214555 3185 8
chr5 181538259 9450 70
chr6 170805979 22487 55
chr7 159345973 591 7
chr8 145138636 1069 7
chr9 138394717 11109 36
chr10 133797422 711 2
chr11 135086622 3567 17
chr12 133275309 19143 57
chr13 114364328 946 5
chr14 107043718 376 9
chr15 101991189 468 2
chr16 90338345 580 3
chr17 83257441 721 10
chr18 80373285 851 6
chr19 58617616 351 1
chr20 64444167 690 5
chr21 46709983 467 1
chr22 50818468 296 1
chrX 156040895 2883 14
chrY 57227415 342 3
chrM 16569 296706 1748

这个文件的第1列是染色体数，第2列是染色体的长度，第3列是比对到染色体的reads的长度，第4列是双端匹配不一致的情况。
所以，可以看到从整体上看，还是chrM的reads占有较高的优势的。
未加-a参数的时候，比对上的reads明显的增多，在这个维度可以控制到一点，但整体的数其实变化并不大。

我们继续反思，流程中，还有哪里可能会对结果产生影响的。

下一步，我比较怀疑的地方可能就是lofreq使用时候的阈值问题，会不会在这个阶段有影响。

/home/xxzhang/miniconda3/bin/lofreq call -s --sig 1 --bonf 1 -C 5 -f ./geneome/hg38/hg38.fasta --call-indels  --verbose --use-orphan -o ./output/lofreq.vcf /home/xxzhang/workplace/QBRC/output/tumor/tumor.bam

lofreq call: call variants from BAM fileUsage: lofreq call [options] in.bamOptions:
- Reference:-f | --ref FILE              Indexed reference fasta file (gzip supported) [null]
- Output:-o | --out FILE              Vcf output file [- = stdout]
- Regions:-r | --region STR            Limit calls to this region (chrom:start-end) [null]-l | --bed FILE              List of positions (chr pos) or regions (BED) [null]
- Base-call quality:-q | --min-bq INT            Skip any base with baseQ smaller than INT [6]-Q | --min-alt-bq INT        Skip alternate bases with baseQ smaller than INT [6]-R | --def-alt-bq INT        Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep) [0]-j | --min-jq INT            Skip any base with joinedQ smaller than INT [0]-J | --min-alt-jq INT        Skip alternate bases with joinedQ smaller than INT [0]-K | --def-alt-jq INT        Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep) [0]
- Base-alignment (BAQ) and indel-aligment (IDAQ) qualities:-B | --no-baq                Disable use of base-alignment quality (BAQ)-A | --no-idaq               Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging)-D | --del-baq               Delete pre-existing BAQ values, i.e. compute even if already present in BAM-e | --no-ext-baq            Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag)
- Mapping quality:-m | --min-mq INT            Skip reads with mapping quality smaller than INT [0]-M | --max-mq INT            Cap mapping quality at INT [255]-N | --no-mq                 Don't merge mapping quality in LoFreq's model
- Indels:** --call-indels           Enable indel calls (note: preprocess your file to include indel alignment qualities!) **--only-indels           Only call indels; no SNVs
- Source quality:-s | --src-qual              Enable computation of source quality-S | --ign-vcf FILE          Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas-T | --def-nm-q INT          If >= 0, then replace non-match base qualities with this default value [-1]
- P-values:-a | --sig                   P-Value cutoff / significance level [0.010000]-b | --bonf                  Bonferroni factor. 'dynamic' (increase per actually performed test) or INT ['dynamic']
- Misc.:-C | --min-cov INT           Test only positions having at least this coverage [1](note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done)-d | --max-depth INT         Cap coverage at this depth [1000000]--illumina-1.3          Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded--use-orphan            Count anomalous read pairs (i.e. where mate is not aligned properly)--plp-summary-only      No variant calling. Just output pileup summary per column--no-default-filter     Don't run default 'lofreq filter' automatically after calling variants--force-overwrite       Overwrite any existing output--verbose               Be verbose--debug                 Enable debugging

主要用的参数是：

-s： Enable computation of source quality
–sig：P-Value cutoff / significance level [0.010000] #对于Pvalue的cut-off值，这里我们是1
–bonf：Bonferroni factor. ‘dynamic’ (increase per actually performed test) or INT
-C : --min-cov INT Test only positions having at least this coverage [1] #这里的coverage我们设置的为5
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done)

所以，我觉得这个参数真的特别重要。这里，就又让我想到了之前关于这个lofreq筛选参数的讨论，我觉得问题会在这里。

参考链接：https://github.com/tianshilu/QBRC-Somatic-Pipeline/issues/7

我打算将--sig调整为0.01，-C参数调整为7，看看这个流程下来，会不会减少很多。

我可以这样的尝试一下，我似乎找到了问题的原因了。
我现在就做这样的调整。

(1)将bwa比对的参数去掉-a
(2)将lofreq的--sig调整为0.01，-C调整为10 #再多一点吧 看看在结果上会有什么差别？

看一下，最终的结果文件和我们之前的结果是否有差别。
现在开始修改。

修改完成，现在开始编写指令（就是文件在哪里，output的文件夹啥地方这种）。

perl somatic.pl NA NA ./example/example_dataset/sequencing/SRR7246238_1.fastq.gz ./example/example_dataset/sequencing/SRR7246238_2.fastq.gz 32 hg38 ./geneome/hg38/hg38.fasta /home/xxzhang/workplace/software/java/jdk1.7.0_80/bin/java ./output_DNA  human 1 ./disambiguate_pipeline>pipeline_DNA_10.txt

奇怪了，咋又出问题了，我也没动啥呀。

Can’t locate Parallel/ForkManager.pm in @INC (you may need to install the Parallel::ForkManager module) (@INC contains: /home/xxzhang/miniconda3/lib/site_perl/5.26.2/x86_64-linux-thread-multi /home/xxzhang/miniconda3/lib/site_perl/5.26.2 /home/xxzhang/miniconda3/lib/5.26.2/x86_64-linux-thread-multi /home/xxzhang/miniconda3/lib/5.26.2 .) at somatic.pl line 42.
BEGIN failed–compilation aborted at somatic.pl line 42.

和师兄核实了一下，还是perl路径的问题。原来师兄把opt/perl5从系统路径中删去了。那我现在看一下，我的的perl路径有哪些？然后，对其进行添加。

输入perl -V查看当前的INC的路径：

/home/xxzhang/miniconda3/lib/site_perl/5.26.2/x86_64-linux-thread-multi
/home/xxzhang/miniconda3/lib/site_perl/5.26.2
/home/xxzhang/miniconda3/lib/5.26.2/x86_64-linux-thread-multi
/home/xxzhang/miniconda3/lib/5.26.2

所以是没有系统的那个perl路径的，我现在尝试把forkmanager所在的那个路径添加到我自己的环境变量中。

现在尝试添加：

export PERL5LIB=/opt/ActivePerl-5.26/lib/perl5

问题解决。（后来又看了一下，这种问题解决只是暂时性。想要真正的问题解决，要把这个添加到环境变量中，这样没错重启就会被加载一次）

现在程序开始运作。我先去吃个饭。

回来了，程序运行完成了，比我想象的要快一点。

发现调整lofreq的参数之后，确实会带来变化，但是这种变化是我们意料之外的。改了之后，虽然也有重合的地方，但是germline依旧是超多的其他染色体上的variants，与此同时虽然variants少了很多，但是剩下的variants的结果，又不能很好的覆盖到作者的文档中，这个参数调节的让人很悲痛。想不明白哪里出了问题，我就应该和作者就这件事讨论一下的，觉得比自己在这里瞎捉摸要好很多。

但是现在的确是找不出什么问题。
我现在有一个想法是，先把lofreq那个文件找到，在这个基础上，看看作者的那个位点在不在这里面。如果在，那就说明是筛选标准的问题，如果不在，说明按照我那些参数就是没有call得到这些variants。

对于这个pipeline也是很迷。
我没有找到它在什么时候把lofreq给删去了，所以需要一直监督着其运转，从而得到我们想要的lofreq.vcf文件。

三次比较。

/home/xxzhang/miniconda3/bin/lofreq call -s --sig 0.01 --bonf 1 -C 5 -f /home/xxzhang/workplace/QBRC/geneome/hg38/hg38.fasta --call-indels  --verbose --use-orphan -o lofreq1.vcf tumor.bam/home/xxzhang/miniconda3/bin/lofreq call -s --sig 0.1 --bonf 1 -C 5 -f /home/xxzhang/workplace/QBRC/geneome/hg38/hg38.fasta --call-indels  --verbose --use-orphan -o lofreq2.vcf tumor.bam/home/xxzhang/miniconda3/bin/lofreq call -s --sig 0.5 --bonf 1 -C 7 -f /home/xxzhang/workplace/QBRC/geneome/hg38/hg38.fasta --call-indels  --verbose --use-orphan -o lofreq3.vcf tumor.bam

最后，发现这个参数很难调整。可以同样画三张韦恩图来表示。vcf之后的筛选不太可能会筛选掉这些位点。如果是vcf之前，通过改变--sig的改变，一些在作者的结果中出现的位点不再出现在我的结果中，但是与此同时，我的结果中出现了其他的位点。
所以，只有这个参数控制远远不够。

--min-mq这个参数是否有用？

感觉如同大海捞针一样，想要精确的调整很难。

实验记录 | 为什么mtDNA的fastq数据会比对到常染色体上？相关推荐

HIT-大数据分析Lab1：数据预处理-实验记录
本文是哈工大大数据分析实验1的完整实验记录,包含环境配置,相关知识介绍以及实验解析.希望对后来人有帮助(新手小白没什么头绪,走一步查一步对应的博客o(╥﹏╥)o),博客链接之间会穿插一些我自己的理解, ...
大学物理实验绪论笔记——关于物理实验的误差分析、处理与数据记录
一.完整的测量结果表达式举例以钢丝的杨氏模量为例: 测量结果为:E=(1.89±0.08)x (N/m2) 或 E=1.89x(1±4.3%) (N/m2) 应包括:测量量(代表符号)+测量量值+ ...
【伊利丹】Hadoop-2.5.0-CDH5.2.0 版本升级和数据均衡实验记录
引言由于开发的需要 hadoop集群的版本要从hadoop2.2.0版本升级到 hadoop-2.5.0-cdh5.2.0版本,在升级的过程中要确保数据的完整性,故此写了下面一章关于hadoop升 ...
实验1 应用SQL Server进行数据定义和管理
实验1 应用SQL Server进行数据定义和管理 [实验目的] 1)熟悉SQL Server的配置和管理. 2)掌握数据库的定义和修改方法. 3)掌握表的定义和修改方法. 4)掌握使用SQL语句进行 ...
【Oracle RAC+DG实验】Oracle RAC+ASM+DataGuard配置实验记录+常见问题
[Oracle RAC+DG实验]Oracle RAC+ASM+DataGuard配置实验记录+常见问题 1.环境规划: ---RAC环境介绍(primary database) ...
Flink的scala+python的shell模式实验记录汇总
根据[1],FLINK的shell有以下一些运行模式 ################################下面是scala-shell########################### ...
【深度学习】【U-net】医学图像(血管)分割实验记录
医学图像分割实验记录 U-net介绍数据集实验记录实验1 实验2(fail) 实验3(fail) 实验4(fail) 实验5(fail) 实验6(fail) 本项目仅用于大创实验,使用pytor ...
【深度学习】图像匹配Siamese网络实验记录
图像匹配Siamese网络实验记录 Ⅰ. Siamese 网络介绍 Ⅱ. 数据集 AT&T 分拣行李匹配图像 Ⅲ. 实验记录 A. 模型1 1. 实验1 2. 实验2 3. 实验3 B. 模型 ...
【实验记录】EA-MLP(演化算法--全连接神经网络)实验记录
large scale evaluation net -- MLP全连接实验记录 Ⅰ. Experiment detail Ⅱ. Method Vertex Edge DNA Evolution_po ...
【实验记录】Fashion-Mnist分类实验记录
Fashion-Mnist实验记录使用深度学习解决Fashion-Mnist分类问题 • Problem Description • Solution Design • Data Preparati ...

实验记录 | 为什么mtDNA的fastq数据会比对到常染色体上？

实验记录 | 为什么mtDNA的fastq数据会比对到常染色体上？相关推荐

最新文章

热门文章