Applications of Phylogenetics in Medicine and Public Health


One of the most common charges levelled against evolutionary biology is that it’s a useless science with no applications in the real world. If this were true, it wouldn’t do anything to tarnish the validity of evolutionary biology research, despite what many anti-evolutionists would have you believe. After all, the mark of a valid scientific theory isn’t the ability to derive lots of technological or humanitarian advances (although that’s obviously a nice bonus), but the ability to accurately describe existing data and predict future data. That being said, I’ll demonstrate here, using just one specific set of examples, that the charge that evolutionary biology is useless is patently false in the first place.

One of the most basic tools in the arsenal of an evolutionary biologist is the ability to construct phylogenetic trees from a dataset of characters to visualise the evolutionary relationships between a set of organisms. In the age of next-generation sequencing, the characters used are the different nucleotides in a set of DNA sequences, whereas in the past it was more common to just use anatomical character states. The general principle behind phylogenetics is that sequences that diverged most recently will share the most nucleotides in common, whereas more distantly-related sequences will have fewer in common, as illustrated below in Figure 1. It’s the same principle as you being more genetically similar to your immediate family than to your distant relatives.

Figure 1 | A mock phylogeny. Based on 5 nucleotide sequences, each 10 bases long, a phylogeny can be drawn to reconstruct the most likely phylogenetic relationship between them. To keep the example simple, the outgroup sequence (AAAAAAAAAA) is assumed to represent the ancestral sequence. Bases in black represent bases that haven’t mutated, while the colours correspond to mutations that occurred at different times in the phylogeny. Mutation events are represented by the coloured bars on the branches, and described by the text of the same colour (e.g. A->T describes a mutation from A to T at a particular position).

In this example, you can see that the mutations represented in green (A->T and A->C) are shared by all 4 of the in-group sequences, indicating that they occurred before any of them diverged from each other. The mutations represented in red and blue are each shared only by a subset of these 4, so they most likely occurred after the divergence between the top 2 and bottom 2 in-group sequences. The mutations represented in pink and orange are unique to a single sequence, so they occurred after those sequences diverged from their sister sequence. The sequences in this example are only 10 bases long, but in “proper” phylogenetic analyses, sequences of anywhere from hundreds to tens of millions of bases are used in order to resolve the phylogeny with a high degree of confidence. As you might expect, the algorithms required to parse these huge amounts of data and give accurate results, by accounting for things like biases in mutation rates and models of selection, get very complicated and are fortunately outside the scope of this blog post.
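The reasoning above can be sketched in a few lines of Python. This is only a toy illustration, not how real analyses are done: the five sequences are hypothetical stand-ins that I made up to match the description of Figure 1 (they are not read off the figure), and the tree is built with UPGMA, a simple distance-based clustering method that assumes a constant mutation rate, rather than the likelihood-based methods alluded to above.

```python
from itertools import combinations

# Hypothetical 10-base sequences chosen to mimic the pattern described for
# Figure 1: two mutations shared by all in-group sequences, one shared by
# each pair, and one private mutation in seq1 and in seq4.
SEQUENCES = {
    "outgroup": "AAAAAAAAAA",
    "seq1": "TCGACAAAAA",
    "seq2": "TCGAAAAAAA",
    "seq3": "TCATAAAAAA",
    "seq4": "TCATAGAAAA",
}


def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def leaves(tree):
    """Return the leaf names under a tree stored as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def upgma(seqs):
    """Repeatedly merge the two clusters with the smallest average distance."""
    clusters = sorted(seqs)  # each cluster is a leaf name or a nested tuple

    def avg_dist(c1, c2):
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(hamming(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append((c1, c2))
    return clusters[0]


print(upgma(SEQUENCES))
# ('outgroup', (('seq1', 'seq2'), ('seq3', 'seq4')))
```

The nested tuples mirror the branching order: seq1 and seq2 pair up, seq3 and seq4 pair up, the two pairs join, and the outgroup attaches last.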

While the basic reasoning behind this phylogenetic analysis is quite obvious and has therefore been used to reconstruct phylogenies (based on morphological data) since the late 19th century, the real revolution has been in the last 50 years or so since DNA sequences have become available. Right from the start, phylogenetics was a tool for understanding evolutionary relationships. One of the landmark examples was when Carl Woese and his colleagues constructed the 3-domain tree of life using sequences from bacteria, archaea, and eukaryotes in 1977. Because analysis of DNA sequences gives us so much more phylogenetic information than measuring just a few phenotypic traits, phylogenetics has gained the resolution required to allow its application to much finer relationships than the comparatively obvious divergences between phyla or even species. I’ll discuss 3 examples where phylogenetics has been applied at the much finer scale in recent years in applications in medicine and epidemiology: tracking the spread of pathogens across continents, tracking infections through hospital wards, and in studying the evolution of tumours in cancer patients.


Tracking global influenza pandemics with phylogeography

Phylogeography is the study of phylogenetics integrated with spatial data to make inferences about the geographical movements of organisms through time. In evolutionary biology, this is often applied to track animal and plant migration/dispersal patterns in the deep past, but it has also found an application in tracking another type of organism: pathogens. Phylogeographic techniques allow researchers to track infections as they spread geographically on a range of scales, from local to global. Just about every major disease outbreak in the post-genomic era has been studied in this way, including the recent 2014 Ebola outbreak in West Africa, where phylogeography was used to trace the origin of the outbreaks to Guinea, and describe how the pathogen spread to Sierra Leone twice independently. However, in this blog post I’ll describe a different example.

In our 21st-century world, populations are connected by commercial travel on a global scale. People can travel to other countries more easily than ever before, but this has the inevitable side-effect of allowing pathogenic bacteria and viruses to spread between populations that would otherwise be separated by distance. As such, understanding the dynamics of how diseases spread on a global scale is of great interest to epidemiologists, because this understanding is necessary to formulate the most effective countermeasures.

One such study was performed by Lemey et al. and published in 2014. They looked at the H3N2 subtype of the Influenza A virus (IAV) and how infections were distributed globally between 2002 and 2007. Using over 1,000 samples of the virus from around the world, they calculated a phylogeny (Figure 2b) and combined this with the geographical information about where each sample was collected to infer the movements of the virus around the world in the 5-year period. This was then compared to a model based on flight and passenger data obtained from over 4000 airports, and it was found that this model predicted the molecular phylogeny with a very high degree of accuracy. In other words, the flow of air passengers is the primary driver of H3N2 Influenza dissemination around the world. This intuitive result was already predicted from previous mathematical models, but this study provided the first empirical demonstration of this fact.

Figure 2 | Correlating passenger flight routes with the phylogeography of H3N2 subtype of Influenza A (IAV). (a) Coloured map of a set of airports around the world. (b) Phylogeny of the H3N2 influenza virus. Branches were coloured according to their inferred location. Image from Pybus et al. 2015, which was adapted from Lemey et al. 2014.

Because the phylogeny combined with spatial data allowed the researchers to infer the location of ancestral viral populations prior to their spread into the locations where the samples were collected, Lemey et al. were able to calculate that the primary source of the virus, year on year, was mainland China (note the mostly blue “trunk” of the phylogeny in Figure 2b). In other words, flights coming from mainland China were the primary carriers of the virus as it spread around the world. The authors discussed some of the implications of their results:

Although identifying the causes of pathogen spread is of great importance in spatial epidemiology, integrating this information in evolutionary models also offers major advantages for phylogeographic reconstructions and their relevance to infectious disease surveillance and pandemic preparedness. By capturing a more realistic process of spatial spread, our novel approach results in more credible reconstructions of spatial evolutionary history, which may shed further light on the persistence and migration dynamics of human influenza viruses.
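To give a feel for how a phylogeny plus sampling locations can point to where an ancestral viral population lived, here is a minimal sketch. It is not the statistical approach used by Lemey et al.; I have swapped in Fitch parsimony, a much simpler classical technique that picks the ancestral locations requiring the fewest movements. The tree shape, sample names and locations are all hypothetical.

```python
# Hypothetical tree stored as nested tuples; leaves are sample names.
TREE = ((("A", "B"), "C"), ("D", "E"))

# Hypothetical sampling location for each leaf.
LOCATION = {
    "A": "China",
    "B": "USA",
    "C": "China",
    "D": "China",
    "E": "Europe",
}


def fitch(tree, tip_location):
    """Bottom-up Fitch pass on a binary tree.

    Returns (candidate locations at this node,
             minimum number of location changes below it).
    """
    if isinstance(tree, str):
        return {tip_location[tree]}, 0
    left, right = tree
    left_set, left_moves = fitch(left, tip_location)
    right_set, right_moves = fitch(right, tip_location)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one location: no extra move needed.
        return shared, left_moves + right_moves
    # The children disagree: keep both options and count one movement.
    return left_set | right_set, left_moves + right_moves + 1


root_locations, moves = fitch(TREE, LOCATION)
print(root_locations, moves)
# {'China'} 2
```

In this made-up example the most parsimonious explanation is a root population in China with two export events, which is the same kind of inference (in a far cruder form) as reading a single-coloured “trunk” off a phylogeny coloured by location.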

In their 2015 review, Pybus et al. described another similar case of a global influenza pandemic, this time of the H1N1 subtype:

The emergence of pandemic H1N1 (pH1N1) influenza in 2009 was the first influenza pandemic in the post-genomic era. Genetic analysis of the pandemic in its early stages was aided by pre-planned and intensive virus sequencing in some countries, and by the immediate and open sharing of the resulting data through online databases. Consequently, the molecular epidemiology of the virus could be tracked in ‘real time’ as the epidemic unfolded [57,58]. This included phylogeographic analyses that studied the global dispersal of the virus during its establishment phase [59,60], which followed patterns of international air travel [13,61].

Tracking pathogens across countries and continents like this has become standard practice in identifying and controlling outbreaks. It’s now possible to identify a new outbreak, sequence the pathogen, and place it on a phylogenetic tree to identify the most likely source of the pathogen. For example, this sort of test was done by Rezza et al. in 2007 during an isolated outbreak of the Chikungunya virus in rural Italy. The index case of the outbreak was a man who was recently visited by a relative from India. Sure enough, when placed on a phylogeny, the virus was most closely related to the Indian version of the virus, revealing that an Indian origin for the virus was most likely.


Tracking the transmission of a rare superbug in the US NIH Clinical Centre

In 2011, the United States National Institutes of Health Clinical Centre experienced an outbreak of a carbapenem-resistant strain of Klebsiella pneumoniae (KPC). Over the course of about 6 months, 18 patients at the hospital were found to be infected with KPC, 10 of whom died, although only 6 of these deaths were directly attributable to the infection. Determining transmission routes between these patients is a monumental task, involving tracing the movements of patients to see if they ever shared a ward, tracing the movements of doctors, nurses, equipment, etc. Needless to say, understanding exactly how the infection is being spread between patients across a hospital is vital, not only to stop that one outbreak but also to prevent future incidents by informing outbreak protocols. To that end, Evan Snitkin and his colleagues at the NIH and National Human Genome Research Institute sampled and sequenced the genomes of the KPC bacteria that had infected each patient in the hope of resolving the transmission routes of the outbreak. They published their results in the journal Science Translational Medicine in August of 2012. Well-known science writer Carl Zimmer wrote a piece for Wired describing the case shortly after.

As the bacterial strain incubated in patients between transmission events, random mutations in the genomes of the bacterial cells accumulated, such that the exact genomic sequence of the bacteria would vary between infected individuals. If a patient acquired the infection from patient A, the genomic signature of that infection would be different to the infection they would have got if they were infected by patient B. Using the DNA sequences to construct a phylogeny of the bacterial strains between patients, weighted using epidemiological data on ward-sharing, a most likely transmission route between patients was worked out (Figure 3).

Figure 3 | Most likely transmission route map. The numbered nodes represent patients designated with numbers 1-18 in order that they presented with the infection. The arrows between the nodes represent transmission events, for example, patient 1 transmitted the infection to patients 3, 4 and 8. Red arrows represent transmission events that can be explained by ward sharing, while the black arrows represent transmission events that would have required more complex routes of infection. Dotted arrows point to patients where at least one other equally parsimonious transmission link exists leading to the given patient – in other words, an ambiguous transmission event. Image taken from Snitkin et al. 2012.

Additional transmission maps were also constructed, one using genetic data alone and a second using epidemiological data alone. The maps made using just the genetic data were much more similar to the final map that used both sets of data.
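The genetic-data-only version of this idea can be sketched very simply. The following is a toy under stated assumptions, not a reconstruction of the method in Snitkin et al.: the patient numbers and variant positions are hypothetical, patients are assumed to be numbered in the order they presented, and each patient is simply linked to whichever earlier patient carries the most similar isolate. It ignores the ward-sharing data that the authors used to weight their final map, and it cannot detect silent carriers.

```python
# Hypothetical data: for each patient, the set of genome positions at which
# their isolate differs from the index patient's isolate.
ISOLATES = {
    1: set(),
    2: {101},
    3: {101, 250},
    4: {977},
    5: {977, 1200},
}


def likely_sources(isolates):
    """Link each patient to the genetically closest earlier patient(s).

    Distance is the number of variant positions present in one isolate but
    not the other. Ties are all kept, since they are ambiguous links.
    """
    links = {}
    order = sorted(isolates)  # patient numbers double as presentation order
    for i, patient in enumerate(order[1:], start=1):
        earlier = order[:i]
        dist = {p: len(isolates[patient] ^ isolates[p]) for p in earlier}
        best = min(dist.values())
        links[patient] = [p for p in earlier if dist[p] == best]
    return links


print(likely_sources(ISOLATES))
# {2: [1], 3: [2], 4: [1], 5: [4]}
```

Here the made-up data imply two separate chains leaving patient 1 (1 to 2 to 3, and 1 to 4 to 5), which is the sort of structure that genetic data can reveal even when ward records show no direct contact.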

In their paper, the authors discussed the implications of identifying transmission events that can’t otherwise be explained by overlapping patient stays in the same hospital wards:

A second transmission from patient 1 was predicted to go through patient 4 and is based entirely on genetic data. Because no direct contact occurred between patients 1 and 4, we looked for evidence of patients acting as asymptomatic intermediates in a transmission between these two patients. Knowledge of patient locations in the hospital allowed us to identify patients whose movements within the hospital make them candidates for being silent transmission vectors. Specifically, we identified all patients who overlapped with the index patient and then with patient 4 before he cultured positive. Among the 1115 patients at the NIH Clinical Center during the outbreak, there were only 5 patients who overlapped with patients 1 and 4 and could have acted as vectors for transmission (Fig. 4). Patients B and D are especially compelling because of their extensive overlap with patients 1 and 4, but neither one cultured positive with surveillance cultures. Although the asymptomatic carrier could have been colonized below detection level or could have been an untested health care provider, this type of mining of epidemiological data has the potential to identify a handful of candidates who merit additional surveillance cultures and/or placement in contact isolation. As genome sequencing becomes even faster, such insights could be obtained in real time to perform targeted, thorough surveillance of patients of interest. (Emphasis mine)

The authors also discussed the benefits of using phylogenetics to help infer transmission routes in general, and look to the future:

In addition to supporting the implementation of these specific infection control procedures, our results suggest several ways in which whole-genome sequencing of outbreak isolates in real time could guide future infection control efforts. First, genetic data can allow for the identification of unexpected modes of transmission. For instance, it was initially assumed that patient 4 was colonized during a 24-hour stay in the ICU, during which he overlapped with patients 2 and 3. However, sequence analysis demonstrated that patient 4’s isolate could not have come from patients 2 or 3 and must have derived from an independent transmission chain from patient 1. This finding motivated the search for an intermediate patient to explain the transmission from patient 1 to patient 4, ultimately revealing four highly plausible candidates. If we had such knowledge in real time, these putative intermediate patients could have undergone more rigorous surveillance culturing to identify KPC–K. pneumoniae colonization and/or been placed on contact isolation, potentially terminating a silent transmission chain. Second, genomic sequencing may distinguish between alternate transmission scenarios, which may be critical for surmising the scope of an outbreak within the hospital. For example, patients 15 and 16 cultured positive for KPC-producing K. pneumoniae within 1 day of each other while residing in the same non-ICU ward. There was no obvious epidemiological link connecting these patients to the outbreak, which raised the possibility that patient 16 had brought a new strain to the hospital from a long-term care facility in which he had recently had a prolonged stay. Genomic sequencing revealed that patient 15’s and 16’s isolates both matched the dominant ICU strain, suggesting that transmission had occurred within our hospital to a non-ICU ward. Finally, genetic data can link patients directly to environmental or infrastructure isolates. Such findings can motivate refinement in cleaning and decontamination procedures by providing insight as to how and when contamination occurred.

Beyond applications to outbreak containment, we foresee that future applications of real-time sequencing have the potential to transform both the control and the treatment of hospital infection. In both epidemic and endemic settings, real-time genomic sequencing can ultimately provide a powerful tool to define the nosocomial epidemiology of important health care–associated pathogens with heretofore unprecedented precision.

This study isn’t an isolated example either: genotyping infections is a very common way to examine the transmission of diseases in hospitals. Another well-cited study was published by Schürch et al. in 2010.


The evolution and diversification of tumours

In the last few years, developments in sequencing technology have allowed us to sequence individual cells. Previously, large numbers of cells from a tissue or bacterial culture were required, and the sequences that were returned were essentially “averaged out” over all the cells from which the DNA was extracted. Even before this new capability existed, cancer biologists were well aware that tumours were heterogeneous: rather than being composed of identical cells, they contained cells that differed markedly from one another (Figure 4). Applying single-cell sequencing technology to tumours revealed that the cells within a tumour were not only very phenotypically diverse, but also very genetically diverse.

Figure 4 | Inter- and intra-tumour heterogeneity. Image taken from Burrell et al. 2013

In this way, a tumour can be thought of as something like a population of individual organisms, and as such, it evolves over time. Different subfamilies of cells outcompete each other as mutations accumulate, some of which help the cancer cells proliferate faster, for example. While tumours get started as a single cell, over time they become a heterogeneous mass of cells, all with different evolutionary histories from that single ancestral cell, as illustrated in Figure 5. Understanding the dynamics of cancer proliferation is obviously critical to inform the most effective treatments. Like populations of bacteria, tumours can evolve to become resistant to drug therapies, and different tumours will be susceptible to different therapies based on the phenotypes of the cells that comprise them, which are determined by the genetic mutations they accumulate over time.

Figure 5 | An illustration of tumour evolution. Image from

Using single-cell sequencing techniques, doctors can sample a subset of cells in a tumour and infer phylogenetic relationships of the individual cells. This means it becomes possible to infer which mutations were present in the first cancerous cells in that tumour – which mutations drove the cells to become cancerous in the first place. In the case of the mock-up in Figure 5, we can see that mutation A initiated the tumour.
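As a rough sketch of that inference, assume each sequenced cell has been reduced to the set of mutations it carries. The cell names and mutation labels below are hypothetical, and the sketch assumes perfect data in which mutations are never lost or missed, which real single-cell sequencing does not guarantee. Under those assumptions, mutations found in every sampled cell are candidates for having been present in the founding cell, while mutations found in only some cells arose later.

```python
from collections import Counter

# Hypothetical mutation calls for four sampled tumour cells.
CELLS = {
    "cell1": {"A", "B"},
    "cell2": {"A", "B", "D"},
    "cell3": {"A", "C"},
    "cell4": {"A", "C", "E"},
}


def classify_mutations(cells):
    """Split mutations by how many of the sampled cells carry them."""
    counts = Counter(m for muts in cells.values() for m in muts)
    n_cells = len(cells)
    # Present in every cell: candidates for the founding ("truncal") events.
    truncal = {m for m, c in counts.items() if c == n_cells}
    # Present in exactly one cell: arose most recently.
    private = {m for m, c in counts.items() if c == 1}
    # Everything in between marks a subfamily of cells.
    subclonal = set(counts) - truncal - private
    return truncal, subclonal, private


truncal, subclonal, private = classify_mutations(CELLS)
print(sorted(truncal), sorted(subclonal), sorted(private))
# ['A'] ['B', 'C'] ['D', 'E']
```

In this made-up example, mutation A is the only one shared by every cell, so it is the candidate for the initiating event, while B and C define two later subfamilies.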

The analogy of an evolving species in the wild also applies to metastases. Small numbers of tumour cells break off from the main tumour mass (often by shear forces) and travel through the bloodstream or lymphatic system to land in a new part of the body. This is analogous to a founder population migrating to a new environment and colonising it. These cancerous cells proliferate and diversify, forming a secondary tumour.

If a patient presents with multiple tumours, doctors can sample each of them and infer the phylogenetic relationships between the tumours, as shown in Figure 6. In fact, this method can be used to tell whether a series of tumours represents an original tumour and a set of metastases, or just a series of unrelated tumours that arose independently.

Figure 6 | Cancer phylogeny including metastases. Image from
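One crude way to picture the metastasis-versus-independent-tumour question is to ask how many mutations two tumours share. The sketch below is my own simplification rather than a method from any of the studies cited here: the tumour names, mutation labels and the 0.25 cut-off are all hypothetical, and real analyses build a proper phylogeny instead of thresholding a similarity score.

```python
from itertools import combinations

# Hypothetical mutation sets for three tumours sampled from one patient.
TUMOURS = {
    "primary": {"m1", "m2", "m3", "m4"},
    "met_liver": {"m1", "m2", "m3", "m5"},
    "other": {"m6", "m7", "m8"},
}


def jaccard(a, b):
    """Shared mutations as a fraction of all mutations seen in either tumour."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_pairs(tumours, threshold=0.25):
    """Flag tumour pairs whose shared-mutation fraction meets the cut-off."""
    return {
        (x, y): jaccard(tumours[x], tumours[y]) >= threshold
        for x, y in combinations(sorted(tumours), 2)
    }


print(related_pairs(TUMOURS))
# {('met_liver', 'other'): False, ('met_liver', 'primary'): True, ('other', 'primary'): False}
```

Tumours that share a large block of mutations most plausibly descend from a common ancestral cell, whereas a tumour sharing nothing with the others is better explained as an independent origin.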

Of course, intra-tumour evolution isn’t identical to inter-species evolution in terms of parameters like mutation rates, selection coefficients, etc, so new phylogenetic techniques have been tailored to best fit these quirks of tumour evolution. Until recently though, phylogenetic algorithms used in classic evolutionary biology were the only game in town. As Schwartz and Schäffer put it in their 2017 review of tumour phylogenetics:

Most studies of tumour phylogenetics to date have adapted standard algorithms that were developed for species phylogenetics (for example, maximum parsimony [21, 61], minimum evolution [73], neighbour joining [71, 80], UPGMA [21], or various maximum likelihood or Bayesian probabilistic inference methods [81, 82]), occasionally comparing multiple standard approaches in a single study [21, 83] (Tables 1,2). Only recently have new phylogeny algorithms emerged to deal with the peculiarities of tumour versus species evolution [84, 85, 86, 87, 88].

They also describe (in slightly more technical terms than I did above) some of the applications of phylogenetics in cancer research that currently exist and continue to be developed:

This variety of phylogeny methods has corresponded to a variety of applications. Tumour evolutionary trees, which were once merely conceptual models [2], are now central in the results of many studies [11]. Early uses of phylogeny methods often focused on applying the new tool of tumour phylogenetics to old problems, such as using evidence of evolutionary selection to separate driver mutations from passenger mutations [29, 50], or using novel algorithms to find the order and timing of driver mutations [89, 90, 91] or to determine how these driver mutations associate with progression stages [92]. Other key results have emerged organically, for example from studies addressing the still controversial question of whether tumour evolution follows the expectations of classical clonal evolution theory [93, 94, 95] in producing predominantly linear phylogenies [54, 76, 96, 97], whether it exhibits predominantly branched evolution exemplified by the early divergence of subclones [30, 33, 40, 42, 49, 73, 83, 98, 99, 100], or whether it occupies some continuum encompassing both extremes in different tumours [34, 101]. Researchers continue to find new applications for phylogeny models, such as the use of phylogenies prognostically to predict the likely future progression of a tumour [43, 58, 85, 92, 102]; such applications are an evolution of older approaches that have been used to predict progression from simpler measures of tumour heterogeneity [38, 58, 59, 102, 103, 104, 105].

Suffice it to say, phylogenetics plays a major role in modern cancer research.



Here I have given 3 broad examples of the common applications of phylogenetics at 3 different resolutions: the single-cell resolution in individual tissues, the between-patient resolution of disease strains in a single hospital, and the scale of global disease outbreaks. In these examples, phylogenetic methods were reliably implemented to inform medical and epidemiological strategies.

The science of phylogenetics, derived entirely from evolutionary biology, has clearly found a huge role in the practical field of medicine, but that’s not all. It also informs work in conservation biology by tracking the sizes and migrations of wild populations of endangered and potentially endangered species. Perhaps that will be the subject of a future blog post.


Comments and queries are welcome.



