This will be a short “detour” about a topic that might have crossed some readers’ minds after the latest American series of posts: can external evidence help decide which linguistic classification is more likely? Genetics is a discipline that often uses linguistic data (in the case of the Americas, it is usually Greenberg’s classification). Of course, a straightforward relationship between genes and languages would imply that peoples’ movement is the only mechanism responsible for language change, which is definitely not the case. However, it can offer insights – after all, if one can prove that some people did move, well, that must have had some linguistic consequence. I have here and there touched on some of these issues, but now I want to explore the genetic data more seriously.
About twenty years ago, Luca Cavalli-Sforza and other Italian geneticists published the venerable History and Geography of Human Genes, summarising decades of research all over the world. Beyond the presentation of phylogenetic trees for all human populations and a summary of the history of human colonisation of the globe, the book offered an innovative perspective with the use of principal component analysis (PCA) to shed light on finer scale movements within continents. While we could analyse map after map of the distribution of individual genes, what PCA does is to summarise the main trends in all that variability into just a few maps (for example, if there are several genes that show more or less the same distribution, they can be transformed into a single component).
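For readers who like to see the mechanics, here is a minimal sketch of that idea in Python with NumPy. The gene frequencies are invented purely for illustration (three hypothetical genes that follow the same geographic cline, plus one unrelated gene), and the function name `principal_components` is only a label for this example; none of it comes from Cavalli-Sforza's data or methods.

```python
import numpy as np

def principal_components(freqs):
    """PCA of a populations x genes matrix of allele frequencies.

    Returns the component scores (one row per population) and the
    share of total variance explained by each component.
    """
    centred = freqs - freqs.mean(axis=0)      # remove each gene's mean
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u * s                            # population coordinates on each PC
    explained = s**2 / np.sum(s**2)           # fraction of variance per PC
    return scores, explained

# Hypothetical data: six populations sampled along a transect (0 = east, 1 = west)
position = np.linspace(0.0, 1.0, 6)
rng = np.random.default_rng(0)
freqs = np.column_stack([
    0.20 + 0.6 * position,           # gene 1 increases westwards
    0.25 + 0.5 * position,           # gene 2 follows the same cline
    0.80 - 0.6 * position,           # gene 3 follows it in reverse
    rng.uniform(0.4, 0.6, size=6),   # gene 4 has no geographic pattern
])

scores, explained = principal_components(freqs)
print(explained.round(3))      # the first component dominates
print(scores[:, 0].round(3))   # its scores track the east-west position
```

Because the three clinal genes carry the same geographic signal, the first component absorbs nearly all of the variance, and plotting its scores on a map would reproduce the gradient in a single picture.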
Some of the principal component (PC) maps of Europe correlate with known or hypothetical migrations in the past, and a lot of them are repeated in the more popular Genes, peoples and languages. For example, Cavalli-Sforza found that the first PC – which explains 28% of the total variance – peaked in the Near East and decreased towards north-western Europe. What better proof that the Neolithic expansion involved migrations, not just diffusion of ideas? Such results have now been confirmed with actual archaeological DNA from the Neolithic settlers. The third PC, which explains about 10% of the variance in Europe, peaks north of the Black Sea and decreases towards the west. This can be seen as proof that the spread of Bronze Age cultures such as the Beaker and Corded Ware, thought to originate from the Yamnaya of the Russian steppe, again involved a wave of migrants. DNA from ancient skeletons has once more confirmed those results. The tricky part is that both population waves could be associated with the spread of Indo-European languages, so it becomes difficult to decide between Renfrew and Anthony.
Well, I have an answer to that question (I will explain one day, but here is a clue: the Basques are the key). The important thing to keep in mind is that genes reveal whether or not migrations took place, and thus have immense potential to unravel historical linguistic questions. Let us have a look now at the Americas – they turn out to be quite homogeneous (in comparison to Eurasia) when it comes to genetics – and therein lies the problem.
A forest of genes in South America
Phylogenetic (or neighbour-joining) trees are an intuitive way of showing how populations branched out based on their genetic distances. Because new analyses are conducted all the time and new results are constantly being published, we often don’t have a single tree, but a forest of possible phylogenies! For the Americas, they can vary a lot from the earliest to the latest research. Because the trees are usually colour-coded according to linguistic phyla (taken from Greenberg) in almost all publications, let’s do the same thing here and see how well my own classification fits the genetic data.
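To make the idea of building a tree from genetic distances concrete, here is a rough sketch of the neighbour-joining procedure in Python. The four populations A to D and their pairwise distances are hypothetical, and the function returns only the branching order as nested tuples (no branch lengths), so it should be read as an illustration of the principle rather than as the method used in any of the studies discussed here.

```python
import numpy as np

def neighbour_joining(labels, distances):
    """Return the branching order (nested tuples) for a distance matrix."""
    labels = list(labels)
    d = np.array(distances, dtype=float)
    while len(labels) > 2:
        n = len(labels)
        row_sums = d.sum(axis=1)
        # Q-criterion: low values mark pairs that are close to each other
        # and, at the same time, far from all the remaining populations
        q = (n - 2) * d - row_sums[:, None] - row_sums[None, :]
        np.fill_diagonal(q, np.inf)
        i, j = np.unravel_index(np.argmin(q), q.shape)
        # distance from the new internal node to every other node
        new_row = 0.5 * (d[i] + d[j] - d[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        merged = (labels[i], labels[j])
        # drop the joined pair and append the new node as last row/column
        new_row = new_row[keep]
        d = d[np.ix_(keep, keep)]
        d = np.vstack([np.column_stack([d, new_row]), np.append(new_row, 0.0)])
        labels = [labels[k] for k in keep] + [merged]
    return tuple(labels)

# Hypothetical genetic distances between four populations
labels = ["A", "B", "C", "D"]
dist = [
    [0, 3, 4, 6],
    [3, 0, 5, 7],
    [4, 5, 0, 4],
    [6, 7, 4, 0],
]

tree = neighbour_joining(labels, dist)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

When populations are genetically very close, as in South America, small changes in the distance matrix can flip which pair is joined first, which is one way of seeing why different studies can produce different trees.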
The first tree shown here was taken from The history and geography…, first published in 1994. It is based on 60 to 70 markers. Cavalli-Sforza and his colleagues present different trees depending on whether North or South America is included (I have used the second). One of the things we can immediately see is that there is no correlation between proposed language phyla and genetic proximity. In fact, there’s no correlation with well-established, smaller families either. The Parakanã, a Tupi population of eastern Amazonia, is the first to split, instead of grouping with the other Tupi-speakers. The Mapuche are almost in an isolated branch instead of appearing close to other Andean and Patagonian peoples. Even worse, the Quechua are shown as being closer to the Maya than to their Aymara neighbours! I believe the problem in this tree is the intrinsic homogeneity of South American populations, coupled with the small number of markers analysed. However, when the whole continent is considered (and groups from the same language family are averaged), the tree is insightful: 1. the Eskimo are closer to Siberian populations; 2. the Na-Dené from the Northwest coast (but not those further south, like the Navajo and Apache) are a bit more distant, but in the same cluster; 3. all the other populations form a separate branch. Needless to say, this was seen as the confirmation of the theory of three waves of migrations – one of them giving rise to the Na-Dené languages, whose distinctiveness and purported links to Siberia are still given serious consideration.
Now let’s examine a more recent tree. This one was produced by Wang and colleagues in 2007 and takes into consideration over 600 markers. I am only showing the South American portion of the tree. Overall, the conclusions of this study do not support three waves of migrants, but a single founding population: all Native American peoples grouped closer to each other than to the Siberians. As for linguistic correlates, it seems that a little more structure is emerging: Andean-Patagonian speakers are the first to branch, Quechua and Aymara groups being particularly close genetically. Strangely, some Central and North American populations appear in the following branch. Finally, we have a cluster for Chibcha speakers and another for my purported Chaco-Amazonian phylum. Arawak speakers are distributed across both, something I will comment on below. Two things are worth noting here: 1. as suggested by Wang’s study, the outlier position of Andean-Patagonian might indicate a Pacific coast colonisation route for South America; 2. Chibcha-speakers are closer to Amazonian populations, which would invalidate my claim that Chibcha belongs in the western group (assuming that genetic distance equals linguistic distance).
And now for the most recent tree, one published in 2012 by Reich and colleagues in Nature. This time, over 300,000 markers were used. This impressive study confirms some of the early conclusions of Cavalli-Sforza about the continent as a whole, but presents a different picture of South America. Eskimos appear closer to Siberian populations than to the rest of the Americas; Na-Dené speakers, however, are not clustered with them, but form a highly divergent branch in the American side of the tree. The authors support the idea of at least three gene flows from Siberia to the Americas, the last two giving rise to the Eskimo and Na-Dené groups. As for South America, we see a structure similar to that of the previous tree, but with significant changes: 1. now all the Central American groups are on a branch of their own, instead of mixed with South America; 2. Andean-Patagonian speakers are closer to Chaco-Amazonian ones, with Chibcha splitting earlier than them. In any case, if this is reflected in the language phyla, perhaps Chibcha should not be part of either the western or the eastern South American branch, but should instead stand as a distinct branch connecting South and Central/North America. Arawak speakers in the latest tree appear in different branches, and that is the only family that never shows a clear genetic correlate. I think this is extremely interesting, as it may confirm the suggestion that Arawak languages spread through trade rather than migration.
Why are the trees so different in this “genetic forest”? Well, maybe it’s difficult to arrive at a consensus because Amerind populations are genetically so close. For example, most people know that nearly all Native Americans, with the exception of some North American groups, belong to blood type O. This is because of a population bottleneck, as not many people crossed Beringia to populate the New World – and that happened relatively recently, so there hasn’t been much time for internal drift. The estimates for the founder population are usually not more than a few hundred persons, and even if the initial separation of this population from the other Siberians might have occurred some 20-30 thousand years ago, the “bottleneck” for the colonisation of the Americas proper is estimated on genetic grounds to be somewhere between 15 and 18 thousand years ago (with later migrations of the Eskimo and, possibly, the Na-Dené speakers). Moreover, Wang and colleagues find that a coastal route of colonisation of the Americas explains certain variance in the data better, a conclusion that is reinforced by the fact that ice sheets were blocking the inland route in North America before 13,000 years ago. A quick advance of the first settlers along the Pacific is in agreement with the early dates of some archaeological sites like Monte Verde in Chile. Needless to say, it also explains the linguistic divide between the western and eastern parts of the continent.
Let’s now have a quick look at the maps for the principal components in the Americas. The first PC has a very regular North-South gradient in North America. Cavalli-Sforza and colleagues interpret it as summarising the main differences between the Eskimo and (Canadian) Na-Dené, on the one hand, and all other Native Americans, on the other. In South America, there is little variation in this PC, but it does show a West-East gradient. Maybe that is related to the colonisation along the Pacific coast, coupled with a relative isolation between the highland and the lowland groups. The second PC shows several trends. In North America, it summarises again the difference between the northernmost groups (Eskimo and NW Na-Dené) and the remaining peoples. A little dot in the U.S. Southwest may relate to the local Na-Dené speakers, the Apache and Navajo. The peak in the American northeast is interpreted by Cavalli-Sforza as relating to European admixture (which is documented by other means). In South America, I believe this PC shows essentially the same trend as the previous one, but in more detail: the Central Andes, together with the northern coast of South America, are at one extreme, whereas the region of the lower Amazon is at the other. In my view, this shows perfectly the colonisation of South America along the Pacific coast, another entry along the Atlantic, and the genetic isolation between Andean and Amazonian populations. Finally, the third PC is a bit more difficult to interpret. It forms a mainly West-East gradient. In North America, the Eskimo and Na-Dené are again well differentiated, including the outlying position of the Apache and Navajo (in the little orange dot). In South America, there is a peak in the Mapuche area of Chile, decreasing towards the Guianas. Cavalli-Sforza sees that as evidence of African admixture in the latter.
Possibly this PC summarises the distinction between (southern) Andean and Patagonian peoples on one side, and the remaining groups on the other.