Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations

A discussion of the paper by Wang et al. 2020

This week in Journal Club at the Mathieson Lab we discussed the recently published paper by Wang et al. (2020) on the theoretical aspects of the transferability of polygenic risk scores across ancestries.

This work provides three predictors for the relative accuracy of a polygenic risk score (PRS), i.e., the accuracy in a test population whose ancestry differs from that of the GWAS sample that produced the summary statistics, divided by the accuracy in an independent test sample of the same ancestry as the original GWAS. The three predictors differ in their assumptions, but all of them assume that the causal variants are the same across populations and that their effect sizes are perfectly correlated. Predictor 1 requires knowing the causal variants, as well as their degree of linkage disequilibrium (LD) with the genome-wide significant (GWS) SNPs. In practice, that is rarely if ever known. Predictor 2 approximates this with a heuristic that selects candidate causal variants for each GWS SNP. Finally, Predictor 3 assumes (rather naively, as the authors state) that the GWS SNPs are the causal SNPs. In summary, the three predictors capture the impact of LD and allelic frequencies on PRS performance.
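To make the definition of relative accuracy concrete, here is a minimal sketch in Python. It assumes that "accuracy" means the squared correlation between the PRS and the phenotype; the function names and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def prs_accuracy(prs, phenotype):
    """Squared Pearson correlation between a PRS and a phenotype.

    Assumption: accuracy is measured as R^2, a common choice for PRS.
    """
    r = np.corrcoef(prs, phenotype)[0, 1]
    return r ** 2


def relative_accuracy(prs_target, pheno_target, prs_source, pheno_source):
    """Accuracy in the target-ancestry sample divided by accuracy in an
    independent sample of the same ancestry as the original GWAS."""
    return prs_accuracy(prs_target, pheno_target) / prs_accuracy(prs_source, pheno_source)


# Toy data (entirely made up): the PRS is a noisier proxy for the true
# genetic value in the target sample than in the source sample.
rng = np.random.default_rng(0)
n = 5000
g_source = rng.normal(size=n)  # true genetic values, source ancestry
g_target = rng.normal(size=n)  # true genetic values, target ancestry
pheno_source = g_source + rng.normal(size=n)
pheno_target = g_target + rng.normal(size=n)
prs_source = g_source + rng.normal(scale=0.5, size=n)
prs_target = g_target + rng.normal(scale=2.0, size=n)

ra = relative_accuracy(prs_target, pheno_target, prs_source, pheno_source)
print(f"relative accuracy: {ra:.2f}")
```

A relative accuracy below 1 means the score predicts worse in the target ancestry than in the ancestry of the original GWAS.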

Next, they simulated genotypes based on the UK Biobank (UKBB) data, exploring different heritability values and numbers of causal variants, and thus a range of genetic architectures. They used the 1000 Genomes data to impute variants in the UKBB data, which they divided into EUR (European), AFR (African), EAS (East Asian), and SAS (South Asian) based on the proximity of each individual to the principal components generated from the 1000 Genomes populations. They assigned effect sizes based on a normal distribution with mean 0 and variance 1 minus the heritability.
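One simple way to picture the ancestry assignment step is a nearest-centroid rule in principal component space. The sketch below is only an illustration under that assumption: the distance metric, the number of PCs, and the toy coordinates are all hypothetical, and the exact procedure used in the paper may differ.

```python
import numpy as np


def assign_ancestry(sample_pcs, ref_pcs, ref_labels):
    """Assign each sample to the reference group with the closest centroid.

    Assumption: Euclidean distance to group centroids in PC space.
    """
    ref_labels = np.asarray(ref_labels)
    groups = np.unique(ref_labels)  # sorted group names
    # One centroid per reference group, computed from its members' PCs.
    centroids = np.stack([ref_pcs[ref_labels == g].mean(axis=0) for g in groups])
    # Distance from every sample to every centroid: shape (n_samples, n_groups).
    dists = np.linalg.norm(sample_pcs[:, None, :] - centroids[None, :, :], axis=2)
    return groups[dists.argmin(axis=1)]


# Made-up PC coordinates for four reference individuals and two samples.
ref_pcs = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
ref_labels = ["EUR", "EUR", "AFR", "AFR"]
sample_pcs = np.array([[0.1, 0.1], [4.9, 5.1]])

assigned = assign_ancestry(sample_pcs, ref_pcs, ref_labels)
print(assigned.tolist())
```

In real data one would typically also exclude individuals who are far from every centroid, but whether and how that was done here is not covered in this post.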

They evaluated the relative accuracies (RA) for Predictors 1, 2 and 3, and compared them to the observed RA in the simulated genotypes. Generally, Predictors 1 and 2 were pretty close to the simulation-based observed RA, while Predictor 3 tended to overestimate it. They also verified that using a different imputation panel, different clumping thresholds, and different heritabilities and numbers of causal variants did not strongly affect RA, which decreases monotonically with distance from Europe, as previously shown.

Next, they used real summary statistics to construct PRSs for 8 traits, and tested the performance of these three predictors. Their main findings are that: 1) RA is higher with genetic proximity to Europe and 2) the loss of accuracy (LOA = 1 - RA) attributable to LD and allelic frequencies is highest in Africans. That is, for more genetically distant populations, differences in LD and allelic frequencies play a more substantial role, while for more closely related populations other factors (not investigated), such as differences in effect sizes, gene-by-environment interactions, etc., presumably have a greater relative importance. The authors say that they provide upper bounds for the proportion of LOA due to LD and allelic frequencies, which is useful as new studies are trying to understand and improve the transferability of PRS across ancestries.
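The arithmetic behind LOA can be sketched in a few lines of Python. The ratio used below for the "share of LOA explained" is my own reading of how a predicted RA (which reflects only LD and allelic frequencies) relates to the observed RA, and the numbers are hypothetical, not results from the paper.

```python
def loss_of_accuracy(ra):
    """LOA = 1 - RA, as defined in the text."""
    return 1.0 - ra


def share_of_loa_explained(ra_predicted, ra_observed):
    """Fraction of the observed LOA accounted for by the predicted LOA.

    Assumption: the predicted RA captures only LD and allele-frequency
    differences, so its LOA is the part attributable to those factors.
    """
    loa_observed = loss_of_accuracy(ra_observed)
    if loa_observed <= 0:
        raise ValueError("observed RA must be below 1 for LOA to be positive")
    return loss_of_accuracy(ra_predicted) / loa_observed


# Hypothetical values: the predictor says RA should be 0.6, but the
# observed RA in the target population is only 0.2.
share = share_of_loa_explained(ra_predicted=0.6, ra_observed=0.2)
print(f"share of LOA explained by LD and allelic frequencies: {share:.2f}")
```

With these made-up inputs, half of the observed loss would be attributed to LD and allelic frequencies, and the other half to factors the predictors do not model.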

My slides are available here.

Bárbara Bitarello
Postdoctoral Researcher
