




hmmscan -E 0.00001 --domE 0.00001 --cpu 2 --noali --acc --notextw --domtblout pfam.tab Pfam-A.hmm test.pep.fa
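The pfam.tab file written by --domtblout is a whitespace-delimited table with '#' comment lines. Below is a minimal Python sketch for reading it. It assumes the standard domtblout layout of 22 fixed fields followed by a free-text description; the field names in `DomainHit` and the helper names are chosen here for illustration and are not part of HMMER.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainHit(NamedTuple):
    """One row of a --domtblout file (field names are illustrative)."""
    target_name: str
    target_acc: str
    tlen: int
    query_name: str
    query_acc: str
    qlen: int
    seq_evalue: float   # full-sequence E-value
    seq_score: float    # full-sequence bit score
    seq_bias: float     # full-sequence bias correction
    dom_index: int      # this domain's number ("#")
    dom_total: int      # number of domains in the sequence ("of")
    c_evalue: float
    i_evalue: float
    dom_score: float
    dom_bias: float
    hmm_from: int
    hmm_to: int
    ali_from: int
    ali_to: int
    env_from: int
    env_to: int
    acc: float          # mean posterior probability of aligned residues
    description: str


# Converters for the 22 fixed columns, in file order.
_CONVERTERS = ([str, str, int, str, str, int] + [float] * 3 + [int] * 2
               + [float] * 4 + [int] * 6 + [float])


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data line, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split off the 22 fixed fields; the rest is the description.
        parts = line.split(maxsplit=22)
        if len(parts) < 22:
            raise ValueError(f"expected at least 22 fields, got {len(parts)}")
        fixed = [conv(raw) for conv, raw in zip(_CONVERTERS, parts[:22])]
        description = parts[22] if len(parts) > 22 else ""
        yield DomainHit(*fixed, description)
```

A typical use would be to open pfam.tab, pass the file handle to `parse_domtblout`, and filter the hits on `i_evalue`.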





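The output has two parts: one reported per sequence (full sequence) and one reported per domain (needed mainly because a single sequence can contain several domains). The columns used in these two formats are explained below (the original English text is kept, since it is probably the clearest).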

(1) target name: The name of the target sequence or profile.

(2) accession: The accession of the target sequence or profile, or '-' if none.

(3) query name: The name of the query sequence or profile.

(4) accession: The accession of the query sequence or profile, or '-' if none.

(5) hmmfrom: The position in the hmm at which the hit starts.

(6) hmm to: The position in the hmm at which the hit ends.

(7) alifrom: The position in the target sequence at which the hit starts.

(8) ali to: The position in the target sequence at which the hit ends.

(9) envfrom: The position in the target sequence at which the surrounding envelope starts. That is, the start position of the domain.

(10) env to: The position in the target sequence at which the surrounding envelope ends. That is, the end position of the domain.

(11) sq len: The length of the target sequence.

(12) strand: The strand on which the hit was found (“-” when alifrom > ali to).

(13) E-value: The expectation value (statistical significance) of the target, as above.

(14) score (full sequence): The score (in bits) for this hit. It includes the biased-composition correction.

(15) bias (full sequence): The biased-composition correction, as above.

(16) description of target: The remainder of the line is the target’s description line, as free text.

(17) c-Evalue: The “conditional E-value”, a permissive measure of how reliable this particular domain may be. The conditional E-value is calculated on a smaller search space than the independent E-value. The conditional E-value uses the number of targets that pass the reporting thresholds. The null hypothesis test posed by the conditional E-value is as follows. Suppose that we believe that there is already sufficient evidence (from other domains) to identify the set of reported sequences as homologs of our query; now, how many additional domains would we expect to find with at least this particular domain’s bit score, if the rest of those reported sequences were random nonhomologous sequence (i.e. outside the other domain(s) that were sufficient to identify them as homologs in the first place)?

(18) i-Evalue: The “independent E-value”, the E-value that the sequence/profile comparison would have received if this were the only domain envelope found in it, excluding any others. This is a stringent measure of how reliable this particular domain may be. The independent E-value uses the total number of targets in the target database.
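The two per-domain E-values differ only in the search-space size they use. As a rough sketch, assuming each one is the same per-domain P-value multiplied by a target count (all targets for the i-Evalue, only the reported targets for the c-Evalue), one can be converted into the other. The function and parameter names are hypothetical.

```python
def conditional_evalue(i_evalue: float, n_targets: int, n_reported: int) -> float:
    """Rescale an independent E-value to the smaller 'reported targets' search space.

    Assumption: i_evalue = P * n_targets and c_evalue = P * n_reported,
    where P is the per-domain P-value.
    """
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")
    p_value = i_evalue / n_targets   # recover the per-domain P-value
    return p_value * n_reported      # scale by the number of reported targets


# Example: 1000 targets searched, 10 of them passed the reporting thresholds.
print(conditional_evalue(1e-3, n_targets=1000, n_reported=10))
```

Since the number of reported targets cannot exceed the number of targets searched, the c-Evalue is never larger than the i-Evalue under this assumption, which is why it is described as the more permissive of the two.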


Envelope definition: The envelope defines a subsequence for which there is substantial probability mass supporting a homologous domain, whether or not a single discrete alignment can be identified. The envelope may extend beyond the endpoints of the MEA (maximum expected accuracy) alignment, and in fact often does, for weakly scoring domains.
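To see how far an envelope reaches beyond its alignment, the two coordinate pairs can be compared directly. This is a small sketch with a hypothetical helper name; it assumes 1-based inclusive coordinates on the same sequence, as in the tabular output.

```python
from typing import Tuple


def envelope_overhang(ali_from: int, ali_to: int,
                      env_from: int, env_to: int) -> Tuple[int, int]:
    """Return how many residues the envelope extends past the alignment
    on the left and on the right (0 when the endpoints coincide)."""
    left = ali_from - env_from
    right = env_to - ali_to
    return left, right


# Alignment 10..270 inside envelope 8..272: two extra residues on each side.
print(envelope_overhang(10, 270, 8, 272))
```

Large overhangs are what one would expect for the weakly scoring domains mentioned above.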

Envelope identification: Now, within each region, we will attempt to identify envelopes. An envelope is a subsequence of the target sequence that appears to contain alignment probability mass for a likely domain (one local alignment to the profile). When the region contains ≃1 expected domain, envelope identification is already done: the region’s start and end points are converted directly to the envelope coordinates of a putative domain.

There are a few cases where the region appears to contain more than one expected domain, where more than one domain is closely spaced on the target sequence and/or the domain scores are weak and the probability masses are ill-resolved from each other. These “multidomain regions”, when they occur, are passed off to an even more ad hoc resolution algorithm called stochastic traceback clustering. In stochastic traceback clustering, we sample many alignments from the posterior alignment ensemble, cluster those alignments according to their overlap in start/end coordinates, and pick clusters that sum up to sufficiently high probability. Consensus start and end points are chosen for each cluster of sampled alignments. These start/end points define envelopes. These envelopes identified by stochastic traceback clustering are not guaranteed to be nonoverlapping. It’s possible that there are alternative “solutions” for parsing the sequence into domains, when the correct parsing is ambiguous. HMMER will report all high-likelihood solutions, not just a single nonoverlapping parse.
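The clustering idea can be illustrated with a toy sketch. This is not HMMER's implementation: the sampled alignments are simply given as (start, end) pairs instead of being drawn from a posterior ensemble, the greedy grouping against each cluster's first member is a stand-in technique, and the thresholds `min_overlap` and `min_prob` as well as the use of the median as the consensus are assumptions made for illustration.

```python
from statistics import median
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def overlap_fraction(a: Interval, b: Interval) -> float:
    """Overlap length divided by the length of the shorter interval."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return inter / shorter


def cluster_envelopes(samples: List[Interval],
                      min_overlap: float = 0.8,
                      min_prob: float = 0.25) -> List[Interval]:
    """Group sampled alignments by overlap and keep well-supported clusters."""
    clusters: List[List[Interval]] = []
    for iv in samples:
        # Greedy assignment: join the first cluster whose founding
        # member overlaps this sample strongly enough.
        for members in clusters:
            if overlap_fraction(iv, members[0]) >= min_overlap:
                members.append(iv)
                break
        else:
            clusters.append([iv])

    envelopes = []
    for members in clusters:
        # The fraction of samples in a cluster approximates its probability.
        if len(members) / len(samples) >= min_prob:
            start = int(median(s for s, _ in members))
            end = int(median(e for _, e in members))
            envelopes.append((start, end))
    return sorted(envelopes)
```

Because each cluster is resolved independently, nothing in this sketch prevents two reported envelopes from overlapping, which mirrors the caveat in the text.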

It’s also possible (though rare) for stochastic clustering to identify no envelopes in the region.

In a tabular output (--tblout) file, the number of regions that had to be subjected to stochastic traceback clustering is given in the column labeled clu. This ought to be a small number (often it’s zero). The number of envelopes identified by stochastic traceback clustering that overlap with other envelopes is in the column labeled ov. If this number is non-zero, you need to be careful when you interpret the details of alignments in the output, because HMMER is going to be showing overlapping alternative solutions. The total number of domain envelopes identified (either by the simple method or by stochastic traceback clustering) is in the column labeled env. It ought to be almost the same as the expectation and the number of regions.
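When ov is non-zero, it can help to list which envelopes actually overlap before interpreting the alignments. The following is a minimal sketch with a hypothetical helper; it takes the (env from, env to) pairs reported for one query/target pair and is not guaranteed to count overlaps the same way HMMER does.

```python
from itertools import combinations
from typing import List, Tuple

Interval = Tuple[int, int]  # (env_from, env_to), 1-based inclusive


def overlapping_envelopes(envelopes: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return every pair of envelopes that share at least one residue."""
    pairs = []
    # After sorting, the first interval of each pair starts no later than
    # the second, so they overlap exactly when the first ends at or after
    # the second one's start.
    for a, b in combinations(sorted(envelopes), 2):
        if a[1] >= b[0]:
            pairs.append((a, b))
    return pairs


# Two of these three envelopes overlap (residues 260..272).
print(overlapping_envelopes([(8, 272), (260, 400), (500, 600)]))
```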

