Email updates

Keep up to date with the latest news and content from Investigative Genetics and BioMed Central.

Open Access Research

Inference of human continental origin and admixture proportions using a highly discriminative ancestry informative 41-SNP panel

Caroline M Nievergelt1*, Adam X Maihofer1, Tatyana Shekhtman1, Ondrej Libiger2, Xudong Wang34, Kenneth K Kidd4 and Judith R Kidd4

Author Affiliations

1 Department of Psychiatry, School of Medicine, University of San Diego California, La Jolla, CA, 92093, USA

2 Department of Molecular and Experiment Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA

3 Chongqing Medical University, Chongqing, Sichuan, China

4 Genetics Dept, Yale University School of Medicine, New Haven, CT, 06520, USA

For all author emails, please log on.

Investigative Genetics 2013, 4:13  doi:10.1186/2041-2223-4-13


The electronic version of this article is the complete one and can be found online at: http://www.investigativegenetics.com/content/4/1/13


Received:19 November 2012
Accepted:14 May 2013
Published:1 July 2013

© 2013 Nievergelt et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Accurate determination of genetic ancestry is of high interest for many areas such as biomedical research, personal genomics and forensics. It remains an important topic in genetic association studies, as it has been shown that population stratification, if not appropriately considered, can lead to false-positive and -negative results. While large association studies typically extract ancestry information from available genome-wide SNP genotypes, many important clinical data sets on rare phenotypes and historical collections assembled before the GWAS area are in need of a feasible method (i.e., ease of genotyping, small number of markers) to infer the geographic origin and potential admixture of the study subjects. Here we report on the development, application and limitations of a small, multiplexable ancestry informative marker (AIM) panel of SNPs (or AISNP) developed specifically for this purpose.

Results

Based on worldwide populations from the HGDP, a 41-AIM AISNP panel for multiplex application with the ABI SNPlex and a subset with 31 AIMs for the Sequenome iPLEX system were selected and found to be highly informative for inferring ancestry among the seven continental regions Africa, the Middle East, Europe, Central/South Asia, East Asia, the Americas and Oceania. The panel was found to be least informative for Eurasian populations, and additional AIMs for a higher resolution are suggested. A large reference set including over 4,000 subjects collected from 120 global populations was assembled to facilitate accurate ancestry determination. We show practical applications of this AIM panel, discuss its limitations for admixed individuals and suggest ways to incorporate ancestry information into genetic association studies.

Conclusion

We demonstrated the utility of a small AISNP panel specifically developed to discern global ancestry. We believe that it will find wide application because of its feasibility and potential for a wide range of applications.

Keywords:
Ancestry Informative Markers; Multiplex; Global Ancestry; Population Stratification; Admixture; AISNP; AIMs

Background

Characterization of human ancestry has been of interest for decades as information about population structure can provide novel insight into the human past and remains an important topic in the rapidly evolving biomedical field. For example, because genetic variants conferring risk to a particular disease may be geographically restricted because of evolutionary forces such as mutation, genetic drift, migration and natural selection, the assessment of the genetic background in individuals chosen for a study is crucial in genetic epidemiology [1].

While still a topic of controversy [2], there is ample evidence that self-reported race, as for example used in the US Census, can predict ancestral clusters in a population sample. However, it does not completely inform on how genetic variation is apportioned within and between racial groups, nor does information on race reveal the extent of admixture [2,3].

Especially in the context of mapping disease genes, more objective and accurate methods of defining homogenous populations for the investigation of specific population-disease associations are required. This is not only paramount for specific mapping approaches such as admixture mapping [4], but has also been recognized as a crucial prerequisite for genetic association studies, as the presence of undetected population structure can lead to both false-positive results and failures to detect genuine associations [5]. Furthermore, it has been shown that the consequences of population structure on association outcomes increase markedly with sample size, and even modest levels of population structure within population groups cannot safely be ignored in the large studies needed to detect typical genetic effects in common diseases [6].

In order to assess genetic background diversity, a large number of ancestry informative marker (AIM) panels have been developed for particular applications. Genome-wide panels for admixture mapping have been developed for Hispanic populations [7], African Americans [8] or three-way admixture in the Americas [9], and smaller AIM panels have been designed to discern ancestry at either the global level [10-12] or within specific populations such as the Native and Mexican Americans [13-15], Europeans [16-20] or African Americans [21,22]. In addition, genome-wide association studies (GWAS) are able to leverage ancestral information from the allele frequencies of the several thousand SNPs generated for whole-genome applications, alleviating the need for specific AIM panels [5].

However, determining ancestry and controlling for population structure is just as important in smaller genetic association studies. These include for example candidate gene studies involving only a few genetic markers, replication of GWAS findings, or consist of smaller, highly valuable collections of rare pathological phenotypes and historical collections with limited amounts of DNA. Genotyping these samples on large AIM panels or leveraging ancestry information from preexisting genotyping is often not practical or possible.

To address this specific need, we set out to develop a highly informative AIM panel that would allow us to infer a subject’s ancestral origin at the continental level and estimate admixture proportions among at least seven main geographic regions Africa, the Middle East, Europe, Central and South Asia, East Asia, Oceania and the Americas. The selection of such AIMs has to focus on SNPs with the largest allele frequency differences between the continental regions of interest to achieve the desired resolution at the continental level. Such high resolution is required because genetic diversity of human populations follows gradients or geographic clines within and among continents rather than specific clusters or clades [3,23,24].

We further aimed for the development of a feasible method to determine ancestry, as resources such as funding and available DNA are often limited for these applications. We therefore developed panels of AISNPs suitable for multiplex application on two commonly used platforms, the ABI SNPlex [25] and Sequenome iPLEX [26] systems. Additionally, all markers are also included on the Illumina HumanHap550 array, thus allowing for a combined analysis with studies genotyped on the Illumina whole-genome arrays.

Lastly, we specifically focused on the applicability of our panel to determine the ancestry of subjects from any of the worldwide geographic origins. To date, most research involving genetic association studies has focused on populations of European descent, where longer LD blocks require fewer genetic markers to be genotyped [27]. However, current gene-mapping efforts specifically request more global research, thus increasing the need for global AIM panels. Furthermore, global ancestry determination is especially important in clinical samples ascertained in specific geographic regions such as Southern California that are inhabited by individuals with very diverse and often heavily admixed ancestries.

Here we describe the development of AIM panels based on the well-studied global reference populations from the HGDP-CEPH [28], which include 52 geographically diverse populations collected from seven continental regions. We then greatly expanded the reference population set by genotyping the AIMs in over 2,000 additional subjects of known ancestry with the goal of achieving the most comprehensive global reference collection possible. We report on these efforts and describe highly discriminative ancestry informative 41- and 31-marker panels for multiplex applications.

Methods

Reference populations

AIM panels were developed based on the global reference populations from the HGDP-CEPH [28]. A total of 941 subjects including 52 populations from the standardized H952 subset were selected [29]. Based on the geographic origin of the samples, HGDP subjects were assigned to one of seven geographic or continental regions: Africa (n = 131), the Middle East (including the North African Moabites, n = 133), Europe (n = 158), Central/South Asia (CS Asia, n = 198), East Asia (E Asia, n = 229), the Americas (n = 64) and Oceania (n = 28) (Additional file 1: Table S1).

Additional file 1: Table S1. Geographic sampling location, population name, number of subjects and source of genotype data of 120 reference populations.

Format: DOCX Size: 300KB Download fileOpen Data

AIM panel development

Genotypes of HGDP subjects from the Illumina 650Y SNP array are publicly available (http://hagsc.org/hgdp/files.html webcite). We used Infocalc 1.1 [30] to calculate the marker informativeness (I_n) among the seven continental regions for each of the 644,195 autosomal markers. The mean informativeness of all markers was 0.0539, with a wide range of I_n = 0.0003-0.406. AIMs were selected according to the following criteria: being autosomal, unambiguous (AC, AG, TC, TG) and present on the Illumina Hap550 array (n = 547,458). Next, the top 5,000 markers with the highest I_n were chosen (I_n > 0.077) and, to reduce the correlation of markers, were subjected to LD pruning using PLINK [31] at a VIF = 1.5. The resulting pool of AIMs included 1,442 SNPs (Additional file 2: Table S2).

Additional file 2: Table S2. Chromosomal position (GRCh37.p5), alleles and informativeness (I_n) of 1,442 continental AIMs and sequence information for the multiplex 41-AIM and 31-AIM panels.

Format: XLSX Size: 90KB Download fileOpen Data

A small panel for multiplexing applications was developed by first choosing from the pool of 1,442 AIMs the top ten markers with the highest allele frequency differences (δ) between each of the 21 pairwise continental region comparisons. This set of 210 markers was then further reduced in an iterative way by considering multiplex genotyping requirements for the ABI SNPlex genotyping system [25] and Sequenome iPLEX system [26], leading to the final 41-AIM set for ABI SNPlex genotyping and the matching 31-AIM set for Sequenome iPLEX genotyping.

Additional reference and test populations

To validate the AIM panels and increase the global coverage of the reference population set for downstream applications, we included two additional, very large data sets with worldwide populations: the International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/ webcite; phase III release 2 and 3) standard set HAP1161 [32] included 931 subjects from 11 populations, and the Yale data set included 2,146 subjects from 57 populations [33]. The combined reference set included 4,018 unrelated subjects from 120 (partially overlapping) populations (Additional file 1: Table S1). These reference populations have been described previously [33], and geographic features such as latitude and longitude of these populations are presented in the allele frequency database ALFRED (http://alfred.med.yale.edu/ webcite) [34]. Genotypes of at least 40 of the 41 AIMs were available for all reference subjects.

Finally, to illustrate a practical application of the 41-AIM panel with our complete set of global reference populations, a contemporary population sample of 2,392 subjects ascertained in Southern California [35] was genotyped using the ABI SNPlex system. Ancestry was determined for all subjects with < 5% genotypes missing.

Statistical analyses

Population structure and individual ancestry estimates were obtained using STRUCTURE v2.3.2.1. [36,37]. To assess the global informativeness of the 41-AIM panel in the original HGDP reference populations, five independent runs without prior population assignment were performed at K = 2 to K = 7, using 20,000 burn-in cycles and 20,000 MCMC replications under the admixture model. The “infer α” option with the same, uniform alpha for all populations was used under the λ = 1 option. All other parameters were set at default.

To further validate the 41-AIM panel, ancestry estimates of 3,077 independent subjects of known ancestry from 68 global populations (reference set 2) were determined at k = 7 using the above STRUCTURE parameters, but now including prior population information of the HGDP reference set. Allele frequencies were updated using only individuals with population information at a migration prior of 0.05. Graphs were plotted using DISTRUCT v1.1 [38].

CLUMPP v1.1.2 [39] was used to evaluate different replicates of STRUCTURE runs. To assign a subject to a specific cluster, we applied cutoffs of >85% and >50% cluster membership, respectively. These criteria were selected to facilitate a comparison with Seldin's 93-AIM panel [10]. Finally, to validate the AIM panels, the percentage of subjects that clustered correctly compared to the known geographic origin was calculated.

Population structure was further analyzed using principal component analysis (PCA) implemented in the EIGENSTRAT software [40] and multi-dimensional scaling (MDS) as implemented in PLINK. All other calculations were performed in R v2.15.0.

As a measure of informativeness of the different AIM panels at the population level, we calculated FST, a genetic distance measure for inter-population differentiation compared to intra-population variation. Significance of pairwise FSTs was established using 10,000 permutations. A Mantel test was used to correlate the FST matrices based on the 41-AIM and 31-AIM panels. Calculations were performed in ARLEQUIN 3.5 [41].

To investigate the informativeness of the AIM panels in detecting admixture at the individual level, subjects from two admixed populations of the Southern California test population (self-reported African Americans and self-reported Hispanic White and Native Americans) were selected. These subjects were subjected to the Illumina HumanOmniExpressExome array, and individual ancestry estimates were determined with a second, independent approach (see [42] for details). In brief, we used over 10,000 GWAS-derived SNPs, a set of 2,513 (partly overlapping) reference individuals and a two-step analysis approach implemented in ADMIXTURE [43]. Individual admixture estimates based on the GWAS-derived panel were then compared to the admixture estimates based on the 41- and 31-AIM panels for these two admixed populations (see above).

Results

Characterization of small AIM panels to determine continental ancestry

Fifty-two global populations from the HGDP-CEPH panel [28] were used to select AIMs optimized for the determination of continental ancestry. We developed a small 41-AIM panel specifically for multiplex application on the ABI system from a pre-selected pool of 1,442 highly informative AIMs (Additional file 1: Table S1). The panel was further reduced to 31 AIMs for application on the Sequenome iPLEX system.

Table 1 shows the informativeness (I_n) and pairwise allele frequency differences (δ) among the seven continental regions for each of the 41 AIMs. I_n ranges from 0.08 - 0.41 with a high mean of 0.23. The largest I_n and largest δ for each of the 21 continental comparisons are indicated in bold, highlighting the strength of a marker to distinguish between specific different global origins. Most continental comparisons included several markers with very high δ of >0.8. The smallest allele frequency differences were found for comparisons of regions within Eurasia where the top markers showed δ in the range of 0.4, indicating limited power to accurately distinguish subjects from Europe, the Middle East and Central/South Asia from each other.

Table 1. Informativeness (I_n) and allele frequency differences (δ) between seven continental regions for SNPs on the 41-AIM panel

The AIM panels were further characterized by calculating FST[41] as a measure of the panel’s relative strength to distinguish the seven geographic regions. Table 2 shows the genetic distance between the continental regions when using the 41-AIM (lower diagonal) and 31-AIM panel (upper diagonal), respectively. Inter-continent differentiation was based on allele frequencies from 51 HGDP populations; the atypical North African Mozabites were excluded here.

Table 2. Pairwise FST values among the seven continental regions calculated based on allele frequencies of 41 AIMs (below diagonal) and 31 AIMs (above diagonal)

In general, we found high FST values distinguishing the African, East Asian, American and Oceanian regions. As expected, the lower FST values among Europe, the Middle East and Central/South Asia reflect the I_n and δ found for the single markers. When comparing the FST values of the full 41-AIM panel with the reduced 31-AIM panel, no significant differences were found (Wilcoxon signed rank test, n = 21 paired comparisons, p > 0.38). In addition, a comparison of all pairwise FST values among the 52 populations showed a highly significant correlation among the FST values calculated based on the 41-AIM panel and the 31-AIM panel (Mantel test, r = 0.987, p < 0.001), further indicating no significant loss of power to discern global ancestry in the smaller panel.

Lastly, the population structure of the HGDP was analyzed using STRUCTURE. To facilitate a comparison with previous studies (e.g., [10,12,24,33,44]), we used similar model parameters without prior information about individual sampling locations. Figure 1 shows the most typical patterns with the highest likelihood from each of 20 independent runs at K = 2–7. Similar to Rosenberg‘s analyses including 377 microsatellites [44] and 993 SNPs [24], we found stable results with two clusters anchored by Africa and the Americas at K = 2 (20/20 runs) and a separation of Africa at K = 3 (19/20). At K = 4, a new cluster emerged isolating either the Americas (11/20) or alternatively Central/South Asia (9/20), and at K = 5 both of these regions were isolated (14/20). Most runs separated Europe from the Middle East at K = 6 (17/20), and at K = 7 the main continental regions for whose partitioning the panel was designed were separated from each other in the majority of runs (11/20) and with the highest likelihood.

thumbnailFigure 1. Inferred population structure of the HGDP subjects based on the 41-AIM panel with clusters ranging from K2 - K7.

Validation of the 41-AIM panel using additional populations of known origin

We further tested the performance of the 41-AIM panel in a realistic setting and estimated the ancestry of 3,077 test subjects from 68 regionally collected populations from the HapMap III and Yale collections. These test populations have been extensively characterized by us and others (see, e.g., [33] and [45]) and are well suited for this purpose. STRUCTURE was run with the HGDP as predefined reference populations at K = 7 (Yale samples were not genotyped for rs2717329). Table 3 shows the average cluster membership of individuals belonging to a specific population for each of the seven continental regions, Africa, the Middle East, Europe, Central/South Asia, East Asia, the Americas and Oceania (n = 68 populations). We calculated the percentage of subjects that clustered correctly, using criteria of >85% and >50% cluster membership (MS), respectively.

Table 3. Continental ancestry based on STRUCTURE analysis of 68 test populations genotyped on the 41-AIM panel with HGDP subjects included as reference populations

We found that African populations had very high cluster membership in the African cluster, but East African populations (e.g., Chagga, Maasai and Sandawe) showed slightly lower values. As expected, admixed African Americans as well as a population of Ethiopian Jews showed some cluster membership in Europe and the Middle East, and less than 50% of the subjects were included in the African group at the 85% MS criteria.

The ethnoreligious Samaritans, Yemenite Jews and Druze clustered with the Kuwaiti predominantly in the Middle East, but also showed a significant European contribution. As expected, most European populations clustered predominantly with Europe. However, there was a significant Middle Eastern component, even for the Northern European populations such as the Finns and Irish, demonstrating the somewhat reduced specificity of the 41-AIM panel to distinguish between Europe and the Middle East compared to the resolution between other continents. When applying the less stringent 50% MS criterion, most populations had over 90% of their subjects placed in Europe. Not surprisingly, the Russian populations Adygei, Chuvash, Komi Zyriane and Russian Vologda were found to have a significant Central/South Asian component.

The Central/South Asian cluster included the majority of the Gujarati, Keralite and Thoti Indians at the 50% MS criterion. As expected, the Kachari Assam, located in the East, also showed a significant East Asian contribution. However, there was no predominant placing in any of the seven continental groupings for the Khanty, a population from western Siberia. This is expected since the current continental grouping at K = 7 does not include a specific Siberian/North Asian cluster. The Khanty are currently our only representatives of this large geographic area.

The East Asian test subjects from 15 diverse populations clustered in East Asia with almost no exception. Most Southern Malaysians also showed some Central/South Asian contribution. Most Native American populations clustered predominantly in the Americas. Exceptions were the admixed Muscogee and HapMap Mexicans, which were not placed in this cluster, but showed a strong European component. The Oceanic cluster included all Papua-New Guinean and Nasioi Melanesian subjects. However, the Micronesian and Samoan subjects from this broad geographic area were not assigned to Oceania at the 50% MS criterion, but were found to be admixed with a strong East Asian component.

Finally, we combined the HapMap III and Yale collections with the HGDP, and further analyses were conducted with our complete reference population set including 4,018 subjects genotyped on the 41-AIM panel. A principal component analysis (PCA) including all 4,018 subjects and averaged for each of the 120 populations is shown in Figure 2. We found that the first PC explained 27.6% of the genetic variability in the data set and corresponded with the Africa to Americas gradient found by STRUCTURE at K = 2. PC2 explained an additional 16.8% variability and added a European component (Panel A). Of note is the misleading positioning of the admixed HapMap Mexicans (MEX) and Native American Muscogee (MUS), both falling within the East Asian cluster. Adding PC3, which accounted for another 6.2% of the genetic variability and includes the Native American component, resolved the structure and correctly placed the MEX and MUS between Europe and the Americas (Panel B).

thumbnailFigure 2. Principal component analysis (PCA) based on genotype data of 41 AIMs including 4,018 subjects from 120 populations from the HGDP, HapMap and Yale collections. Individual values of subjects belonging to the same population are averaged to highlight the relative location of specific populations.

We performed an analysis of the eigenvalues of the first 15 PCs and found that over 56% of the genetic variation among the seven continental regions was accounted for by the first five PCs (Figure 3).

thumbnailFigure 3. Eigenvalues of the first 15 principal components (PCs) indicating that most genetic variation among the seven continental regions captured by the 41-AIM panel is accounted for by the first 5 PCs.

Applications of the 41-AIM panel

To highlight a practical application of the 41-AIM panel and our large collection of reference populations, we considered the case of a genetic association study with subjects collected in Southern California. In order to minimize spurious results due to population stratification (i.e., false-positive associations between a phenotype and genetic marker), a PCA is often applied. PCs can be used as an easy tool to visualize large amounts of data or can be included as covariates in association analyses to adjust for population stratification.

PC plots of the first three PCs generated based on genotype data of the 41 AIMs are shown in Figure 4 for the complete reference set of 4,018 subjects from 120 populations. When placed in the context of clusters, several populations appear as admixed among the eight colored continental regions (see Table 3; Siberia has been added as its own region here) or are truly admixed (such as the African Americans, Mexicans, Mozabites and others). To increase resolution, we removed these populations (n = 13, open symbols) in specific applications.

thumbnailFigure 4. PC plots of 4,018 reference subjects based on genotype data from the 41-AIM panel. Subjects are color coded according the geographic sample origin. Admixed populations, African Americans and Mexicans are indicated by open symbols (see text). The % of variation explained ranges from 27.6% for PC1 to 6.2% for PC3.

Figure 5 shows PC plots of Southern Californian test subjects and 107 ‘typical’ reference populations. The first five PCs (PC1 - PC5) explain a total of 51.5% variability in the data, and each identifies different aspects of the population distribution. PC2 highlights the European-African gradient and identified African Americans (panel A), PC3 added the native American component and separated Mexican Americans from Central/South Asians (panel B), and PC4 separated Oceania (panel C). Corresponding to the small eigenvalue of the fifth PC (see Figure 3), PC5 explained only a small fraction of the genetic variability (2.1%) in this setting and did not lead to a strong separation of the eight geographic clusters (panel D). However, PC5 was found to show a North–south cline in Eurasian populations, as indicated by a significant correlation of PC5 values with the average latitude of 77 Eurasian populations (Spearman’s ρ = 0.62, p < 0.001). An often-applied alternative to the PCA is the multidimensional scaling (MDS) approach implemented in the genetic association software PLINK. MDS analyses lead to essentially the same results (Additional file 3: Figure S1).

thumbnailFigure 5. PC plots of the first five PCs for a visual inspection of a large population sample collected in Southern California (black). Subjects from 107 typical reference populations are color coded. The % of variation explained is indicated in parentheses for each PC.

Additional file 3: Figure S1. MDS plots of the first five MDS components for a visual inspection of a large population sample collected in Southern California (black). Subjects from 107 typical reference populations are color coded.

Format: DOCX Size: 901KB Download fileOpen Data

Guided by these visual approaches, subjects are then typically grouped into a small number of more homogeneous groups (e.g., European Americans or African Americans) prior to association analysis, using clustering methods such as implemented in STRUCTURE. Additional population stratification and varying degrees of individual admixture are then accounted for within these more homogeneous groups.

To assess the informativeness of the AIM panels in detecting admixture at the individual level, we compared STRUCTURE admixture estimates based on the 41 and 31 AIMs with independently derived estimates based on a large, GWAS-derived panel (see Methods). Subjects were selected based on self-report from the admixed African American and Hispanic White and Native American populations (Figure 6). Individual ancestry proportions derived from the 41-AIM and GWAS panels were strongly correlated for both the Hispanic White and Native American populations (n = 484, Pearson’s r = 0.81 for the proportion of Native American ancestry in panel A, r = 0.81 for the proportion of European ancestry in panel C) and the African Americans (n = 106, Pearson’s r = 0.86 for the proportion of African ancestry in panel B, r = 0.85 for the proportion of European ancestry in panel D; all p < 2 × 10-16). Slightly reduced correlations of individual admixture estimates were achieved between the GWAS-derived and smaller 31-AIM panels (A: r = 0.77, B: r = 0.78, C: r = 0.84 and D: r = 0.85, respectively). Importantly, a detailed inspection of the scatter plots indicates that the AIM panels lack sensitivity in detecting admixture in individuals with low proportions of admixture (in the range of < 20-25%) when compared to the GWAS-derived panel.

thumbnailFigure 6. Comparison of individual admixture estimates based on a large GWAS-derived marker panel and the 41-AIM panel in admixed populations collected in Southern California. Self-identified Hispanic-White and Native American subjects show a wide range in the degree of Native American and European ancestry proportions (panels A and C). Self-identified African Americans show a range in the degree of African and European ancestry proportions (panels B and D).

Discussion

Application and limitations

Our motivation was to develop a feasible method to discern continental ancestry that would enable a safeguard against the impact of population stratification in small genetic association studies where limited resources preclude large genotyping efforts. We achieved this by choosing a very small set of highly discriminative AIMs suitable for multiplexing applications, thus enabling lower cost and higher throughput. To ensure a wide application potential, we optimized our panel for two commonly used multiplex platforms, the ABI SNPlex [25] and Sequenome iPLEX systems [26]. At the same time, these AIMs perform well in single SNP TaqMan assays and can also be extracted from whole-genome arrays such as the Illumina HumanHap550 chip, thus allowing an easy combination of samples with genotyping from different sources. This is especially important for AIMs, where imputation of SNPs based on information from genotyped markers is not advisable.

Our panel was able to accurately discern the global ancestry of a large majority of subjects originating from one of the seven specific ancestral clusters. This was the case for both the full 41-AIM panel and the subset of 31 AIMs, indicating that a balanced reduction of markers in these small panels did not significantly impact the robustness of the results. A direct comparison of our findings with a previously published small panel of 93 AIMs published by Seldin’s group [10] showed 89.7% agreement in continental assignment of HGDP subjects (data not shown), further validating our panel.

Not surprisingly, the biggest limitation to imputing global ancestry was found for subjects from Eurasia, where low FST values of 0.06 - 0.09 among Europe, the Middle East and Central/South Asia indicated little genetic diversity. The clinal distributions of allele frequencies between Europe and East Asia pose a challenge for the identification of highly discriminative markers, a limitation also impacting other small AIM panels [12,46]. We therefore suggest supplementing our panel with additional high-resolution markers for studies with focus on Eurasia. Such markers suitable for discerning specific pairs of regions can easily be extracted from our extensive preselected list of global AIMs.

Impact of reference populations

Independent of the statistical method used to determine ancestry and admixture proportions, the results of these analyses depend not only on the informativeness of the genetic markers, but also strongly on the set of reference populations included. An omission of reference subjects from an ancestral group likely leads to misclassification of test subjects with similar ancestries. For example, we previously found that African Americans clustered strongly with Central Asians in a three-way admixture analysis (erroneously) including only reference subjects from Europe, Central Asia and the Americas.

We considered this crucial issue during panel development and leveraged the publicly available 52 HGDP populations collected across the globe [47]. We then increased our reference set by leveraging AIM frequencies from the HapMap III and performing additional AIM genotyping in our large global collection [33]. With over 4,000 subjects from 120 global populations, we thus assembled one of the largest reference sets published for the purpose of ancestry determination. However, specific regions such as Siberia are still underrepresented, and efforts to expand our reference subject collection are ongoing.

Admixed subjects

Whereas ancestry assignment of subjects from a specific geographic area represented by a cluster in the reference population set is a quantifiable and relatively straightforward task, admixed subjects resulting from ancient or recent contact of populations with distinct ancestries pose challenges.

If such a cohort consists of admixed subjects with known ancestry contributions, such as two-way admixed Mexican Americans collected from a distinct area in Southern California, the varying degree of European and Native American ancestries can easily be estimated in admixture analyses implemented in Statistical packages such as STRUCTURE, ADMIXTURE [43] or BAPS [48]. Our AIM panels were able to detect admixture in individuals from these populations, but as expected for such small panels, were less sensitive when individual admixture proportions were low.

However, in a clinical sample including subjects of unknown ancestral origin and complex population structure, as is often the case in our studies (e.g., [35,49,50]), the presented methods may lack specificity to distinguish between admixture of distinct populations and erroneously place admixed subjects together with intermediate populations. This is especially true when including only the first few components of multivariate data reduction methods such as MDS and PC analyses (see, e.g., Figure 2, Panel A, for Mexican Americans clustering with Central/South Asians). In these cases, adding demographic information such as self-declared race and ethnicity information is strongly suggested to help minimize misassignments.

Challenges for genetic association studies

Adding to the complexities of accurately differentiating ancestral groups and estimating admixture proportions is the appropriate incorporation of this information into the design of genetic association studies. While the negative impact of population structure on association studies is well known [6], and methods to control for it are established and now routinely applied to studies of relatively homogeneous cohorts such as typically collected for GWAS (see e.g. a recent review [5]), the situation remains challenging for heterogeneous clinical collections or epidemiological cohorts.

Depending on the composition and relative numbers of subjects from different ancestral backgrounds, common questions in such studies include the genetic definition of African Americans, which typically show degrees of European admixture that vary among individuals [51]. There is currently no consensus for an appropriate cutoff point between European Americans and African Americans. Even less trivial is the incorporation into association studies of three-way admixed subjects such as Caribbean Latinos originating from Puerto Rico and the Dominican Republic [52], typically showing both Native American and high levels of African ancestry.

For practical purposes, we often employ a multi-tier approach: we first group subjects into continental clusters using a majority criterion with statistical methods such as STRUCTURE and then confirm the plausibility of the grouping with demographic data, where available. Next, we aim to place most subjects into a very small number of clusters including genetically similar subjects, for example, by combining similar continental groups such as from Eurasia, and excluding outliers of minority ancestries. Lastly, we control for additional population stratification within clusters by incorporating MDS components into association studies and ultimately combine results in meta-analyses, where appropriate. Such a method was, for example, employed for the Southern Californian population sample presented here, which encompassed a wide array of self-declared ethnic groups. Our approach resulted in a four-cluster analysis with 61% European Americans, 18% subjects with Native American admixture, 7% subjects with African admixture, and 15% subjects of other ancestry and/or complex admixtures.

Conclusion

In conclusion, we demonstrated the utility and limitations of a small AIM panel specifically developed to discern global ancestry. We believe that it will find wide application because of its feasibility and potential for a wide range of applications. To allow this reference set to be readily accessible for others to use, we are entering the allele frequencies for these 41 SNPs into ALFRED (alfred.med.yale.edu) [34] as an “SNP Set.” To allow ready estimation of likelihoods of ancestry of individuals, these SNPs are also being entered as an additional AISNP Panel in FROG-kb (frog.med.yale.edu) [53].

Abbreviations

ALFRED: allele frequency database; AIM: ancestry informative marker; AISNP: ancestry informative single nucleotide polymorphism; δ: allele frequency differences; MS: cluster membership; GWAS: genome-wide association study; I_n: marker informativeness; MDS: multi-dimensional scaling; PC: principal component; PCA: principal component analysis.

Competing interests

The authors declare that they have no competing interests.

Author‘s contributions

All authors read and approved the final manuscript. CMN conceived of the study, participated in its design and coordination, and wrote the manuscript. TS carried out genotyping. AXM performed the statistical analyses. KKK and JRK participated in the coordination of the study and in the writing of the manuscript. XW carried out genotyping and assisted in preliminary analyses at Yale. All authors read and approved the final manuscript.

Acknowledgements

CMN is supported by NIH grants 1 R01MH093500, R01 AG030474, 3U01 HD052104, 1 U01 MH092758 and BUMED, and by the Marine Corps and Navy Bureau of Medicine and Surgery. KKK and JRK have been supported in part by grants 2010-DN-BX-K225 and 2010-DN-BX-K226 awarded by the National Institute of Justice, Office of Justice Programs, US Department of Justice, and also in part by NSF grant BCS0938633. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of the US Department of Justice.

References

  1. Tishkoff SA, Kidd KK: Implications of biogeography of human populations for ‘race’ and medicine.

    Nat Genet 2004, 36:S21-S27. PubMed Abstract | Publisher Full Text OpenURL

  2. Yaeger R, Avila-Bront A, Abdul K, Nolan PC, Grann VR, Birchette MG, Choudhry S, Burchard EG, Beckman KB, Gorroochurn P, et al.: Comparing genetic ancestry and self-described race in african americans born in the United States and in Africa.

    Cancer Epidemiol Biomarkers Prev 2008, 17:1329-1338. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Manica A, Prugnolle F, Balloux F: Geography is a better determinant of human genetic differentiation than ethnicity.

    Hum Genet 2005, 118:366-371. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Nievergelt CM, Schork NJ: Admixture mapping as a gene discovery approach for complex human traits and diseases.

    Curr Hypertens Rep 2005, 7:31-37. PubMed Abstract | Publisher Full Text OpenURL

  5. Price AL, Zaitlen NA, Reich D, Patterson N: New approaches to population stratification in genome-wide association studies.

    Nat Rev Genet 2010, 11:459-463. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies.

    Nat Genet 2004, 36:512-517. PubMed Abstract | Publisher Full Text OpenURL

  7. Mao X, Bigham AW, Mei R, Gutierrez G, Weiss KM, Brutsaert TD, Leon-Velarde F, Moore LG, Vargas E, McKeigue PM, et al.: A genomewide admixture mapping panel for Hispanic/Latino populations.

    Am J Hum Genet 2007, 80:1171-1178. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Tian C, Hinds DA, Shigeta R, Kittles R, Ballinger DG, Seldin MF: A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping.

    Am J Hum Genet 2006, 79:640-649. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Galanter JM, Fernandez-Lopez JC, Gignoux CR, Barnholtz-Sloan J, Fernandez-Rozadilla C, Via M, Hidalgo-Miranda A, Contreras AV, Figueroa LU, Raska P, et al.: Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas.

    PLoS Genet 2012, 8:e1002554. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Nassir R, Kosoy R, Tian C, White PA, Butler LM, Silva G, Kittles R, Alarcon-Riquelme ME, Gregersen PK, Belmont JW, et al.: An ancestry informative marker set for determining continental origin: validation and extension using human genome diversity panels.

    BMC Genet 2009, 10:39. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Paschou P, Lewis J, Javed A, Drineas P: Ancestry informative markers for fine-scale individual assignment to worldwide populations.

    J Med Genet 2010, 47:835-847. PubMed Abstract | Publisher Full Text OpenURL

  12. Phillips C, Salas A, Sanchez JJ, Fondevila M, Gomez-Tato A, Alvarez-Dios J, Calaza M, De Cal MC, Ballard D, Lareu MV, Carracedo A: Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs.

    Forensic Sci Int Genet 2007, 1:273-280. PubMed Abstract | Publisher Full Text OpenURL

  13. Kidd JR, Friedlaender F, Pakstis AJ, Furtado M, Fang R, Wang X, Nievergelt CM, Kidd KK: Single nucleotide polymorphisms and haplotypes in Native American populations.

    Am J Phys Anthropol 2011, 146:495-502. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Kosoy R, Nassir R, Tian C, White PA, Butler LM, Silva G, Kittles R, Alarcon-Riquelme ME, Gregersen PK, Belmont JW, et al.: Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America.

    Hum Mutat 2009, 30:69-78. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Collins-Schramm HE, Chima B, Morii T, Wah K, Figueroa Y, Criswell LA, Hanson RL, Knowler WC, Silva G, Belmont JW, Seldin MF: Mexican American ancestry-informative markers: examination of population structure and marker characteristics in European Americans, Mexican Americans, Amerindians and Asians.

    Hum Genet 2004, 114:263-271. PubMed Abstract | Publisher Full Text OpenURL

  16. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE, Qi L, Gregersen PK, Seldin MF: Analysis and application of European genetic substructure using 300 K SNP information.

    PLoS Genet 2008, 4:e4. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P, et al.: Discerning the ancestry of European Americans in genetic association studies.

    PLoS Genet 2008, 4:e236. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka R, Bradley DG, Shriver MD: Measuring European population stratification with microarray genotype data.

    Am J Hum Genet 2007, 80:948-956. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  19. Drineas P, Lewis J, Paschou P: Inferring geographic coordinates of origin for Europeans using small panels of ancestry informative markers.

    PLoS One 2010, 5:e11892. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  20. Tian C, Kosoy R, Nassir R, Lee A, Villoslada P, Klareskog L, Hammarstrom L, Garchon HJ, Pulver AE, Ransom M, et al.: European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups.

    Mol Med 2009, 15:371-383. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Parra EJ, Marcini A, Akey J, Martinson J, Batzer MA, Cooper R, Forrester T, Allison DB, Deka R, Ferrell RE, Shriver MD: Estimating African American admixture proportions by use of population-specific alleles.

    Am J Hum Genet 1998, 63:1839-1851. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Collins-Schramm HE, Chima B, Operario DJ, Criswell LA, Seldin MF: Markers informative for ancestry demonstrate consistent megabase-length linkage disequilibrium in the African American population.

    Hum Genet 2003, 113:211-219. PubMed Abstract | Publisher Full Text OpenURL

  23. Serre D, Paabo S: Evidence for gradients of human genetic diversity within and among continents.

    Genome Res 2004, 14:1679-1685. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, Feldman MW: Clines, clusters, and the effect of study design on the inference of human population structure.

    PLoS Genet 2005, 1:e70. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Tobler AR, Short S, Andersen MR, Paner TM, Briggs JC, Lambert SM, Wu PP, Wang Y, Spoonde AY, Koehler RT, et al.: The SNPlex genotyping system: a flexible and scalable platform for SNP genotyping.

    J Biomol Tech 2005, 16:398-406. PubMed Abstract | PubMed Central Full Text OpenURL

  26. Gabriel S, Ziaugra L, Tabbaa D: SNP genotyping using the Sequenom MassARRAY iPLEX platform.

    Curr Protoc Hum Genet 2009, 2:2-12. OpenURL

  27. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, et al.: The structure of haplotype blocks in the human genome.

    Science 2002, 296:2225-2229. PubMed Abstract | Publisher Full Text OpenURL

  28. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM: Worldwide human relationships inferred from genome-wide patterns of variation.

    Science 2008, 319:1100-1104. PubMed Abstract | Publisher Full Text OpenURL

  29. Rosenberg NA: Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives.

    Ann Hum Genet 2006, 70:841-847. PubMed Abstract | Publisher Full Text OpenURL

  30. Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry.

    Am J Hum Genet 2003, 73:1402-1422. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  31. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses.

    Am J Hum Genet 2007, 81:559-575. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  32. Pemberton TJ, Wang C, Li JZ, Rosenberg NA: Inference of unexpected genetic relatedness among individuals in HapMap Phase III.

    Am J Hum Genet 2010, 87:457-464. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  33. Kidd JR, Friedlaender FR, Speed WC, Pakstis AJ, De La Vega FM, Kidd KK: Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples.

    Investig Genet 2011, 2:1. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  34. Rajeevan H, Soundararajan U, Kidd JR, Pakstis AJ, Kidd KK: ALFRED: an allele frequency resource for research and teaching.

    Nucleic Acids Res 2011, 40:D1010-D1015. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  35. Baker DG, Nash WP, Litz BT, Geyer MA, Risbrough VB, Nievergelt CM, O‘Connor DT, Larson GE, Schork NJ, Vasterling JJ, et al.: Predictors of risk and resilience for posttraumatic stress disorder among ground combat marines: methods of the marine resiliency study.

    Prev Chronic Dis 2012, 9:E97. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  36. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data.

    Genetics 2000, 155:945-959. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  37. Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies.

    Genetics 2003, 164:1567-1587. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  38. Rosenberg NA: DISTRUCT: a program for the graphical display of population structure.

    Mol Ecol Notes 2004, 4:137-138. OpenURL

  39. Jakobsson M, Rosenberg NA: CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure.

    Bioinformatics 2007, 23:1801-1806. PubMed Abstract | Publisher Full Text OpenURL

  40. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies.

    Nat Genet 2006, 38:904-909. PubMed Abstract | Publisher Full Text OpenURL

  41. Excoffier L, Lischer HE: Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows.

    Mol Ecol Resour 2010, 10:564-567. PubMed Abstract | Publisher Full Text OpenURL

  42. Libiger O, Schork NJ: A method for inferring an individual‘s genetic ancestry and degree of admixture associated with six major continental populations.

    Front Genet 2013, 3:1-11. OpenURL

  43. Alexander DH, Novembre J, Lange K: Fast model-based estimation of ancestry in unrelated individuals.

    Genome Res 2009, 19:1655-1664. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  44. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW: Genetic structure of human populations.

    Science 2002, 298:2381-2385. PubMed Abstract | Publisher Full Text OpenURL

  45. Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, De Bakker PI, Deloukas P, Gabriel SB, et al.: Integrating common and rare genetic variation in diverse human populations.

    Nature 2010, 467:52-58. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  46. Rosenberg NA, Mahajan S, Gonzalez-Quevedo C, Blum MG, Nino-Rosales L, Ninis V, Das P, Hegde M, Molinari L, Zapata G, et al.: Low levels of genetic divergence across geographically and linguistically diverse populations from India.

    PLoS Genet 2006, 2:e215. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  47. Cann HM, De Toma C, Cazes L, Legrand MF, Morel V, Piouffre L, Bodmer J, Bodmer WF, Bonne-Tamir B, Cambon-Thomsen A, et al.: A human genome diversity cell line panel.

    Science 2002, 296:261-262. PubMed Abstract OpenURL

  48. Corander J, Waldmann P, Marttinen P, Sillanpaa MJ: BAPS 2: enhanced possibilities for the analysis of genetic population structure.

    Bioinformatics 2004, 20:2363-2369. PubMed Abstract | Publisher Full Text OpenURL

  49. Shimizu C, Matsubara T, Onouchi Y, Jain S, Sun S, Nievergelt CM, Shike H, Brophy VH, Takegawa T, Furukawa S, et al.: Matrix metalloproteinase haplotypes associated with coronary artery aneurysm formation in patients with Kawasaki disease.

    J Hum Genet 2010, 55(12):779-784. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  50. Mbewe-Campbell N, Wei Z, Zhang K, Friese RS, Mahata M, Schork AJ, Rao F, Chiron S, Biswas N, Kim HS, et al.: Genes and environment: novel, functional polymorphism in the human cathepsin L (CTSL1) promoter disrupts a xenobiotic response element (XRE) to alter transcription and blood pressure.

    J Hypertens 2012, 30(10):1961-1969. PubMed Abstract | Publisher Full Text OpenURL

  51. Zakharia F, Basu A, Absher D, Assimes TL, Go AS, Hlatky MA, Iribarren C, Knowles JW, Li J, Narasimhan B, et al.: Characterizing the admixed African ancestry of African Americans.

    Genome Biol 2009, 10:R141. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  52. Bryc K, Auton A, Nelson MR, Oksenberg JR, Hauser SL, Williams S, Froment A, Bodo JM, Wambebe C, Tishkoff SA, Bustamante CD: Genome-wide patterns of population structure and admixture in West Africans and African Americans.

    Proc Natl Acad Sci U S A 2010, 107:786-791. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  53. Rajeevan H, Soundararajan U, Pakstis AJ, Kidd KK: Introducing the Forensic Research/Reference on Genetics knowledge base.

    FROG-kb. Investig Genet 2012, 3:18. BioMed Central Full Text OpenURL