Literature reading | Tracing the ancestry of modern bread wheats

Pont C, Leroy T, Seidel M, et al. Tracing the ancestry of modern bread wheats[J]. Nature genetics, 2019, 51(5): 905-911.

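1. Conclusions of the paper

1.1 Wheat genome diversity

To explore the origin and patterns of the diversity currently available in the wheat gene pool, the authors assembled a worldwide panel of 487 genotypes, covering wild diploid and tetraploid relatives, domesticated tetraploid and hexaploid landraces, old cultivars and modern elite cultivars. They used exome sequencing, with the Chinese Spring sequence as the reference genome. This yielded 620,158 high-confidence genetic variants.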

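The correlation between genes and structural variation supports the chromosome-level observation that variants show no bias from the gene-rich distal regions to the gene-poor pericentromeric regions. The abundance of structural variation differs both among subgenomes (B > A >> D) and among chromosomal regions. Overall, the data give a detailed, multi-scale overview of wheat genome diversity.

From their analysis, the authors identify three main factors driving the differences in diversity within the dataset: ① different vernalization requirements; ② different historical groups; ③ different geographical origins. A permutation test of clade monophyly conservation confirmed the strong grouping effect of all three factors. The deep structure of the phylogeny differs by continent, and was later affected by changes in growth habit resulting from strong selection in modern breeding.

Overlaying continent/country of origin onto the phylogenetic clusters reveals a west-east axis in the observed genetic diversity, consistent with the routes of human migration out of the Fertile Crescent.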

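1.2 Selection footprints in wheat

The authors used sliding windows to detect local reductions in diversity, taking geographic structure into account. To detect domestication signals, they computed the mean per-site nucleotide diversity in non-overlapping 1 Mb windows.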
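As a rough illustration of the windowed statistic described above, the sketch below computes mean per-site nucleotide diversity (π) in non-overlapping 1 Mb windows. It is not the authors' pipeline: the input layout (position, alternate allele count, number of sampled chromosomes), the function names, and the choice of denominator are all assumptions made for this example. In particular, with exome capture data the denominator should probably be the number of callable sites per window rather than the full window length, so an optional `callable_sites` mapping is provided.

```python
from collections import defaultdict


def site_pi(alt_count, n_chrom):
    """Unbiased nucleotide diversity contributed by one biallelic SNP."""
    if n_chrom < 2:
        return 0.0
    return 2.0 * alt_count * (n_chrom - alt_count) / (n_chrom * (n_chrom - 1))


def windowed_pi(snps, window=1_000_000, callable_sites=None):
    """Mean per-site pi in non-overlapping windows.

    snps: iterable of (position, alt_count, n_chrom), positions are 1-based.
    callable_sites: optional {window_start: number of callable sites}
                    (hypothetical input; defaults to the window length).
    Returns {window_start: mean per-site pi}; windows without SNPs are omitted.
    """
    sums = defaultdict(float)
    for pos, alt, n in snps:
        start = (pos - 1) // window * window  # 0-based start of the window
        sums[start] += site_pi(alt, n)
    result = {}
    for start, total in sums.items():
        denom = callable_sites.get(start, window) if callable_sites else window
        result[start] = total / denom
    return result
```

Running the same function separately on each group (for example wild relatives versus landraces) gives the per-window values that a diversity comparison would start from.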

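The comparison between wild wheat ancestors and hexaploid landraces supports a heterogeneous pattern of large reductions of diversity (ROD) across the genome during domestication. Comparing historical groups i, ii, iii and iv reveals two major rounds of decline: the first from i to ii, reflecting early breeding selection, and the second from iii to iv, corresponding to the Green Revolution.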
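These notes do not record how the paper computes ROD. A commonly used definition, assumed here purely for illustration, is ROD = 1 − π(derived group) / π(ancestral group), evaluated window by window. The function names below are hypothetical.

```python
def reduction_of_diversity(pi_ancestral, pi_derived):
    """ROD = 1 - pi_derived / pi_ancestral (assumed definition).

    Returns None when the ancestral diversity is zero, because the
    ratio is undefined in that window.
    """
    if pi_ancestral == 0:
        return None
    return 1.0 - pi_derived / pi_ancestral


def rod_per_window(pi_anc, pi_der):
    """Compare two {window_start: pi} mappings on their shared windows."""
    shared = sorted(pi_anc.keys() & pi_der.keys())
    return {w: reduction_of_diversity(pi_anc[w], pi_der[w]) for w in shared}
```

Under this definition a value near 1 means almost all diversity was lost in that window, a value near 0 means little change, and a negative value means the derived group is more diverse than the ancestral one.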

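To identify the markers and segments selected by breeders, the authors ran a genome-wide scan across all samples with PCAdapt and identified 5,089 polymorphic sites with elevated signal. Several known genes lie close to these sites (<5 Mb distance). Large intervals (>10 Mb) that arose and became fixed over the last two centuries are especially frequent on chromosome 1A, and are also common on 4A and 7B, two chromosomes that have undergone structural rearrangements.

The 8,308 and 9,948 significant footprint sites found in European and Asian genotypes, respectively, were extended into overlapping 2 Mb intervals, defining cumulative genomic regions of 950 Mb and 1.3 Gb that carry the selection signatures of each region. Only 168 Mb of these intervals is shared between the two, indicating that the two geographic origins had different selection targets.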
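The step from individual footprint sites to cumulative regions, and then to the shared portion, can be sketched as interval arithmetic. The example below is one assumed reading of the procedure, not the authors' code: each site is padded by 1 Mb on both sides (a 2 Mb interval centred on the site), overlapping intervals are merged, and the two merged sets are intersected. It handles a single chromosome, and the positions in the usage note are invented.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def footprint_regions(positions, flank=1_000_000):
    """Pad each site by `flank` on both sides, then merge (assumed padding)."""
    return merge_intervals((max(0, p - flank), p + flank) for p in positions)


def total_length(intervals):
    """Cumulative length of a list of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def intersect(a, b):
    """Intersection of two sorted, already merged interval lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

For example, `footprint_regions([5_000_000, 5_500_000, 20_000_000])` and `footprint_regions([6_000_000, 40_000_000])` can be passed to `intersect`, and `total_length` of the result gives the shared length on that chromosome. Summing over chromosomes would give genome-wide totals comparable in kind to the 950 Mb, 1.3 Gb and 168 Mb figures.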

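The authors then used multi-environment GWAS to test whether the observed allelic diversity is associated with two key life-history traits, heading date (HD) and plant height (PH). They found 48 and 40 genomic loci significantly associated with HD and PH, respectively, including regions containing known genes as well as regions with no known gene.

The authors argue that the dataset provides a basis for identifying candidate genes at loci that were detected previously but remain uncharacterized. In particular, the preceding density, selection-footprint and GWAS analyses clearly show that only a small fraction of homoeologous loci carry coincident signals, supporting the view that modern hexaploid bread wheats behave genetically as diploids. This is consistent with the rarity of convergent selection patterns between homoeologous regions noted earlier.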

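1.3 Origin of wheat

The authors used a network-based phylogenetic approach. It infers 1,000 maximum likelihood trees from repeated random haplotype samples (RRHS). Subsequent graph reconstruction and community clustering then recover the reticulate evolutionary history of modern hexaploid bread wheat and its diploid and tetraploid ancestors. The robustness of the approach was shown by comparing the parentage implied by the network with known pedigree relationships.

The integrated model of wheat evolution proposed by the authors combines three lines of evidence: ① a thorough analysis of the network edges and their weights; ② an assessment of tree topologies; ③ gene flow tests using Patterson’s D statistic.

In the proposed model, the evolutionary path is AA+SS -> AABB -(+DD)-> AABBDD.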

**Original text of the evolutionary model:** Our proposed model (Fig. 4b) largely refines the widely accepted evolutionary path leading to modern bread wheat with the hybridization of wild diploid AA and SS (close to BB) genotypes leading to wild tetraploid AABB progenitors, which subsequently hybridized with a wild diploid DD genotype resulting in the hexaploid T. aestivum (AADDBB) lineage. In our analysis, the wheat B genome is confirmed to be derived from the Aegilops section Sitopsis lineage, which gave rise to A. speltoides (SS), while the progenitors of A. tauschii and T. urartu represent the established origins of the D and A genome lineages, respectively. T. araraticum (also referenced as T. araraticum Jakubz) represents the closest wild descendant of the AAGG tetraploid ancestor. It appears to have been subsequently domesticated to form T. timopheevii (Zhuk.) Zhuk while also hybridizing with T. boeoticum leading to the hexaploid T. zhukovskyi (Menabde & Ericzjan) lineage (AAAAGG).

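The model confirms wild emmer (T. dicoccoides) as the closest descendant of the ancestor of the A and B subgenomes in modern tetraploid (AABB) and hexaploid (AABBDD) wheat. The data indicate that, early in domestication and cultivation, wild emmer gave rise to at least two distinct domesticated tetraploid lineages, T. dicoccum Schrank ex Schübl. and T. durum Desf.

Finally, the model supports the following hypothesis: bread wheat (T. aestivum) most likely arose from hybridization between the durum wheat lineage and a D-genome lineage close to wild A. tauschii. Later, hybridization between hexaploid bread wheat and domesticated emmer (T. dicoccum) produced T. spelta, which still carries evidence of T. dicoccum introgression today.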

**Original text of the hypothesis:** Finally, the model supports the hypothesis that T. aestivum is most likely to be derived from an ancestral hybridization event between the previous T. durum lineage and a D lineage close to wild A. tauschii (Fig. 4b and Supplementary Fig. 11). Subsequently, T. spelta emerged from the hybridization between the hexaploid T. aestivum and the tetraploid T. dicoccum, and still harbors evidence of T. dicoccum introgressions today (Supplementary Fig. 12).

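The authors then looked for the founders of the gene pool that became fixed during hexaploid wheat domestication. These founders may be among the ancient cultivated forms originating in the Fertile Crescent, and they gave rise to two distinct hexaploid groups (β and γ). Of these, γ is more common in Western Europe and β in Eastern Europe. This evolutionary divergence may reflect how human history and social activity shaped the genetic makeup of wheat germplasm.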

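2. Methods

2.1 Phylogenetic analysis

This part of the analysis covers the 435 hexaploid bread wheat genotypes, whose phylogeny was inferred from 91,554 SNPs in the triplets (2,855) of orthologous genes present on all three subgenomes (A, B and D).

The data were first analyzed with IQ-TREE (GTR+GAMMA(4) model) with 1,000 ultrafast bootstraps.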

The geographic regions of ancestral nodes were reconstructed as follows: 10,000 simulations were performed using the stochastic mapping algorithm of the R phytools package (ref. 32) (using the equal rates model), the region of a node was then chosen as the one with maximum sampled frequency.
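The final step, picking for each node the region sampled most often across the simulations, is simple enough to sketch. The snippet below is in Python rather than R and does not call phytools. It assumes the stochastic maps have already been summarized as one `{node_id: region}` dictionary per simulation, which is a hypothetical intermediate format chosen for this example.

```python
from collections import Counter


def reconstruct_regions(simulations):
    """Pick, for each node, the region with the highest sampled frequency.

    simulations: list of {node_id: region} dicts, one per stochastic map
                 (hypothetical summary format).
    Returns {node_id: (region, frequency)}.
    """
    counts = {}
    for sim in simulations:
        for node, region in sim.items():
            counts.setdefault(node, Counter())[region] += 1
    result = {}
    for node, counter in counts.items():
        region, hits = counter.most_common(1)[0]
        result[node] = (region, hits / sum(counter.values()))
    return result
```

Keeping the frequency alongside the chosen region makes it easy to see which ancestral nodes are confidently assigned and which are close calls.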

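The 11 major clades of the tree were defined on the basis of size, representativeness and statistical support, so as to give good coverage of the tree.

The world map was drawn with the R packages countrycode, geosphere and maps.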

The phylogenetic analysis of the 487 diploid, tetraploid and hexaploid wheats accounts for the ambiguities and possible biases that affect phylogenetic inference from SNP data, arising from varying levels of heterozygosity, linkage disequilibrium, incomplete lineage sorting and reticulate evolution. For this reason, the authors used a network-based approach to reconstruct the species history and community structure of the sampled Triticeae genotypes. To this end, they filtered the SNPs stringently and used RRHS to infer the topologies of 1,000 maximum likelihood trees.

Original text: The analysis of phylogeny for the 487 di- tetra- and hexaploid wheats was inferred in accounting for ambiguities and possible biases in phylogenetic inference from SNP data arising from varying levels of heterozygosity, linkage disequilibrium, incomplete lineage sorting and reticulate evolution. In that regard, we implemented a network-based approach to reconstruct the species history and community structure in the sampled Triticeae genotypes. To this end, we stringently filtered biallelic, polymorphic SNPs present in >90% of the genotypes from non-imputed data accounting for linkage disequilibrium (delivering 15,490 filtered SNPs), and implemented a repeated random haplotype sampling procedure including heterozygous sites (RRHS) to infer 1,000 maximum likelihood tree topologies with the ASC_GTRGAMMA model and JC69 distances in RAxML (asc-corr=felsenstein).
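The idea behind repeated random haplotype sampling can be illustrated with a minimal sketch. It assumes that heterozygous genotypes are encoded as two-base IUPAC ambiguity codes in a SNP alignment, and that each replicate resolves every heterozygous site by drawing one of its two alleles at random. The data layout and function names are assumptions for this example. Tree inference itself (RAxML in the paper) is not shown.

```python
import random

# Two-base IUPAC ambiguity codes and the alleles they stand for.
IUPAC = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def sample_haplotype(seq, rng):
    """Resolve each heterozygous site to one randomly chosen allele."""
    return "".join(rng.choice(IUPAC[b]) if b in IUPAC else b for b in seq)


def rrhs_replicates(alignment, n_reps=1000, seed=0):
    """Yield n_reps haplotype alignments sampled from a genotype alignment.

    alignment: {sample_name: SNP string with IUPAC codes at heterozygous sites}
    Each yielded dict has the same keys and sequence lengths as the input.
    """
    rng = random.Random(seed)
    for _ in range(n_reps):
        yield {name: sample_haplotype(seq, rng) for name, seq in alignment.items()}
```

Each replicate alignment would then be passed to a maximum likelihood program to obtain one tree, so that the spread across the 1,000 trees reflects the uncertainty introduced by heterozygous sites.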

For the 1,000 trees, the minimum spanning tree (MST) algorithm was applied to the evolutionary distances among the tips, and the resulting MST graphs were combined into a weighted phylogenetic consensus network whose nodes were clustered into clades in Cytoscape 3.6.

**Original text:** While these RRHS trees were also analyzed in the form of conventional consensus topologies and densitree visualizations to infer taxonomic clades, we analyzed the evolutionary distances among the tips of the 1,000 trees using the minimum spanning tree (MST) algorithm in Python. The MST graphs were subsequently combined into a weighted, phylogenetic consensus network whose nodes were clustered into clades using the Girvan–Newman EdgeBetweenness algorithm in Cytoscape 3.6 (ref. 33). The clustered network topology was plotted considering edge-betweenness in Cytoscape and taxonomic clades were inferred by intersection of community clusters with taxon information that was annotated using the AutoAnnotate plugin.

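The relative number of RRHS trees in which a given edge was selected by the MST algorithm was used as the edge weight, and was interpreted much like bootstrap support values in consensus tree topologies.

Original text: The relative number of RRHS trees for which a respective edge was selected by the MST algorithm were used as edge weights and were interpreted similar to bootstrap support values in the consensus tree topologies.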
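The combination of per-tree MSTs into weighted consensus edges can be sketched as follows. The paper states that the MST step was done in Python but these notes do not say with which library, so the version below uses a plain Kruskal implementation from the standard library. The input format (one dictionary of pairwise tip distances per tree, keyed by alphabetically ordered name pairs) is an assumption for this example, and the clustering step in Cytoscape is not reproduced.

```python
from collections import Counter
from itertools import combinations


def mst_edges(names, dist):
    """Kruskal's algorithm on a complete graph of tips.

    dist[(a, b)] is the distance between tips a and b, with a < b.
    Returns the list of (a, b) edges in the minimum spanning tree.
    """
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    pairs = sorted(combinations(sorted(names), 2), key=lambda e: dist[e])
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two components, so keep it
            parent[root_a] = root_b
            edges.append((a, b))
    return edges


def consensus_edge_weights(names, dist_per_tree):
    """Fraction of trees whose MST contains each edge."""
    counts = Counter()
    for dist in dist_per_tree:
        counts.update(mst_edges(names, dist))
    return {edge: c / len(dist_per_tree) for edge, c in counts.items()}
```

An edge with weight 1.0 appears in the MST of every sampled tree, while an edge with weight 0.5 appears in half of them, which is why the weights can be read in a similar way to bootstrap support.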

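The composition and the geographical and historical origins of the identified wheat communities were analyzed with chi-squared tests and shown as bar plots.

Original text: The composition, geographical and historical origins of the identified wheat communities were analyzed using chi squared tests and barplots in R.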
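For readers who want to see what the chi-squared test measures here, the sketch below computes the Pearson statistic and degrees of freedom for a contingency table of communities (rows) by origin categories (columns). The paper used R. This Python version is only an illustration, it does not compute a p-value, and it assumes no row or column of the table sums to zero.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table.

    table: list of rows of observed counts, e.g. communities x regions.
    Assumes every row and column total is positive.
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

A large statistic relative to the degrees of freedom indicates that community membership and origin are not independent, which is the kind of association the authors were testing for.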

Gene flow in subgenomes A and B was tested with Patterson’s D statistic using ANGSD, with a threshold of Z>4.

Original text: Gene flow in subgenomes A and B was investigated with the Patterson’s D statistic (or ABBA-BABA statistic) using ANGSD with a threshold of Z>4 (ref. 35).
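As a reminder of what the test computes, Patterson's D for four populations (((P1, P2), P3), outgroup) is D = (nABBA − nBABA) / (nABBA + nBABA), and its significance is usually judged with a Z-score from a block jackknife. The sketch below is not ANGSD: it assumes ABBA and BABA site counts have already been tallied per genomic block, and it uses a simple equal-weight delete-one jackknife, which is a simplification.

```python
import math


def patterson_d(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA); 0.0 when there are no sites."""
    total = abba + baba
    return (abba - baba) / total if total else 0.0


def d_with_jackknife(blocks):
    """Genome-wide D and its Z-score from per-block counts.

    blocks: list of (n_abba, n_baba) per genomic block (assumed input).
    Uses an equal-weight delete-one block jackknife for the standard error.
    """
    n = len(blocks)
    total_abba = sum(a for a, _ in blocks)
    total_baba = sum(b for _, b in blocks)
    d = patterson_d(total_abba, total_baba)
    # D recomputed with each block left out in turn.
    leave_one_out = [patterson_d(total_abba - a, total_baba - b) for a, b in blocks]
    mean = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((x - mean) ** 2 for x in leave_one_out)
    se = math.sqrt(variance)
    if se == 0:
        z = 0.0 if d == 0 else math.copysign(math.inf, d)
    else:
        z = d / se
    return d, z
```

With this sign convention, a positive D with |Z| above the chosen threshold (Z>4 in the paper) indicates an excess of ABBA sites, that is, more allele sharing between P2 and P3 than expected without gene flow.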

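The integrative model of wheat evolution (Fig. 4b) was assembled by manually consolidating three lines of evidence, together with the literature: ① the support values of the edges in the phylogenetic consensus network; ② the topologies of the various consensus and IUPAC trees; ③ the ABBA-BABA results.

Original text: An integrative model (Fig. 4b) of wheat evolution was built by manual consolidation of the support values of the edges in the phylogenetic consensus network (Supplementary Fig. 10 and Supplementary Table 7), the various consensus and IUPAC tree topologies (Supplementary Fig. 11), the ABBA-BABA results (Supplementary Fig. 12) as well as the literature.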

Where species relationships could not be resolved uniquely by the network approach alone, the ABBA-BABA results and the existing literature were taken into account.

Original text: Where species relationships remained ambiguous on the sole basis of the network approach (that is, when similar phylogenetic relatedness between groups of genotypes defines several possible evolutionary paths between putative progenitors and descendants), we then considered the results of the ABBA-BABA statistical test (Supplementary Fig. 12 and Supplementary Table 6) and the existing literature when available.

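Fig. 4b includes only the reticulation events that were identified in the phylogenetic consensus networks and supported by the ABBA-BABA analysis in both the A and B subgenomes.

Original text: Fig. 4b reports only the reticulation events identified on the basis of phylogenetic consensus networks supported by the ABBA-BABA analysis in both the A and B subgenomes.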