(Felsenstein Lab) Some rants and diatribes

Some rants and diatribes by Joe

I always have a collection of rants and diatribes on various topics, and it occurred to me that they should be more widely available. Currently, if you are talking to me and mention topic #15, I come forth with rant #15 on that topic. Why not put them online? I will be gradually adding rants here. Of course some of them will turn out to be oversimplifications and caricatures, and I will try to admit this if called out on them.

Rant #1: Genomics terminology is often stupid. Examples:

"Box"
(This is fading from use, fortunately). People will point to a sequence motif and call it a "box" as in "TATA box". Why "box"? Because when they first saw one of these they drew a box around it. But suppose they had drawn an elephant around it -- would we then call it a "TATA elephant"?
"Nextgen sequencing"
This is pure marketing terminology, adopted uncritically. But think: the current generation of sequencing methods is "next generation sequencing". What will happen in another generation? Will we say "back in the days of next generation sequencing"? Some embarrassment is called for.
Intron and exon
See, an "intron" is a piece of a gene that is cut out when making mature mRNA, and an "exon" is a piece of the gene that is left in. Only if you run time backwards does this make any sense.
Rant #2: The definition of "monophyletic" needs work. Right now the standard definition is that a group is monophyletic if it consists of an ancestor and all of its descendants. I see why that was done, but in some cases it does not work. For example, if we are discussing the set of species [Carrot, Salmon, Gorilla, Human], and I claim that the set [Gorilla, Human] is monophyletic, I'd like to be able to be right about that. But alas, that set does not contain the common ancestor of humans and gorillas, nor does it contain all of that ancestor's descendants: it omits the fossil ones, and present-day ones such as the two chimpanzee species.
My own definition of monophyletic is that a set of species is monophyletic if it has its own common ancestor, which is not the ancestor of any of the other species under discussion. That seems to work for all cases.
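Here is a minimal sketch of that definition in Python (an illustration added for this page, not code from any phylogeny package). It assumes the tree of the species under discussion is already known and written as nested tuples; the function names `clades` and `is_monophyletic` and the particular tree shape are assumptions made for the example. A set of species passes if some node of the tree has exactly that set, and no other species under discussion, as its descendants.

```python
def clades(tree):
    """Return (tips below this node, list of tip sets for every node in the subtree).

    A tree is either a species name (a tip) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        tips = frozenset([tree])
        return tips, [tips]
    tips = frozenset()
    all_clades = []
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        all_clades.extend(child_clades)
    all_clades.append(tips)
    return tips, all_clades


def is_monophyletic(tree, group):
    """True if some ancestor in the tree has exactly `group` as its sampled descendants."""
    _, all_clades = clades(tree)
    return frozenset(group) in all_clades


# Assumed rooted tree for the four species under discussion.
tree = ("Carrot", ("Salmon", ("Gorilla", "Human")))

print(is_monophyletic(tree, {"Gorilla", "Human"}))   # True
print(is_monophyletic(tree, {"Salmon", "Gorilla"}))  # False
```

Note that the answer depends only on the species under discussion: chimpanzees and fossil forms are simply not in the tree, so their absence from [Gorilla, Human] does no harm.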
Rant #3: The term "segmental duplication" can be misleading. When I first heard it, I thought it must refer to the duplication of some new unit of the genome, the "segment". But I wasn't sure what that unit was. Subsequently I realized that it simply meant duplications of parts of the genome that were not defined as gene duplications. I wonder whether the word "segmental" is really helpful.
Rant #4: Inference of phylogenies is not simply a matter of nested derived states. In many textbooks, museum displays, and blog arguments it is confidently asserted that the way we reconstruct phylogenies is to find shared derived states (synapomorphies), and that the nested pattern of these defines clades. The problem is that, if this were true, we would have no parallelisms, no convergences, no reversals in evolution. We would also always know where the tree was rooted, since we would always know which state was the ancestral state. We would not ever need computer programs to reconstruct phylogenies. Not even parsimony methods would be necessary -- you could always just work by hand and you would then always find a tree with no extra steps. Later in the same textbooks, there are sections on phylogeny methods that talk about the need for computer programs to reconcile conflict among characters. But there is not a word about how this conflicts with the statements earlier in the book.
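To see how quickly the nested-synapomorphy picture breaks down, here is a small Python sketch. It assumes binary characters whose ancestral state is known, and records for each character the set of species showing the derived state. Such characters can all fit one tree with no extra steps only if every pair of these sets is nested or disjoint. The species, the characters, and the function name `conflicting_pairs` are made up for illustration.

```python
from itertools import combinations


def conflicting_pairs(characters):
    """Return pairs of characters whose derived-state sets are neither nested nor disjoint."""
    conflicts = []
    for (name_a, set_a), (name_b, set_b) in combinations(characters.items(), 2):
        nested_or_disjoint = (
            set_a <= set_b or set_b <= set_a or not (set_a & set_b)
        )
        if not nested_or_disjoint:
            conflicts.append((name_a, name_b))
    return conflicts


# Toy data: which species show the derived state of each character.
characters = {
    "hair": {"Mouse", "Bat", "Human"},
    "wings": {"Bat", "Sparrow"},
    "feathers": {"Sparrow"},
}

print(conflicting_pairs(characters))  # [('hair', 'wings')]
```

One conflicting pair is enough: at least one of the two characters must have changed more than once on any tree, and deciding which is exactly the job that parsimony and other methods exist to do.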
Rant #5: The Modern Synthesis has not been replaced. Sure, all sorts of new phenomena have come along since the 1940s: neutral mutation, lateral gene transfer, symbiosis, evo-devo, epigenetics, etc. And we could declare the death of the current Synthesis each time one came along. But here's why we shouldn't do that:
1. Otherwise every time John Blotz pointed out a new phenomenon he could strut around publicizing the fact that he, the great Blotz, had invalidated the evolutionary synthesis, and now we had (ta-da!) the Blotzian Synthesis. But he would be shocked a year or two later when Jane Schmerz came along and invalidated the Blotzian Synthesis in favor of the new Schmerzian Synthesis. And so it would go, synthesis after synthesis, until everyone was totally confused, and most people were several syntheses behind.
2. Meanwhile the public would be continually told that all that stuff they learned in secondary school, about mutation and natural selection and some other evolutionary forces, was all wrong, because now we had the Blotzian (er, oops, actually the Schmerzian) Synthesis instead.
It would be (temporarily) great for Blotz's and Schmerz's careers and egos, but a disaster for everyone else.
Rant #6: The term "transitional fossil" is misleading terminology. Creationists often say that we have no "transitional fossils". Biologists reply that we have lots of them. The argument is partly over what a transitional fossil is supposed to be. It sounds like it is a fossil from an ancestral species, caught in the act of "transition" to a new form.
I think "transitional" is actually bad terminology. I think it dates to 50+ years ago when many evolutionary biologists naïvely assumed that any fossil that looked like an ancestor really was the ancestor. Archaeopteryx was assumed to actually be an ancestor of modern birds.
Now the definition is modified to mean having a "transitional" combination of character-states, which is what our fossils really are. We have lots of those. But we're plagued by the word "transitional". We need some term that does not also imply, to the unwary listener, that the fossil is known-to-be-the-ancestor.
Rant #7. There is no consensus, even among systematists, as to what the word "cladistics" means.
There seem to be various positions:
1. Cladistics is the position that all groups in the classification system should be monophyletic.
2. Cladistics is that, plus reconstructing the tree by nested synapomorphies.
3. Cladistics is those, plus using parsimony methods when you have reversals or parallelisms in your data.
4. Cladistics is all those, but only if you're a paid-up member of the Willi Hennig Society.
5. Cladistics is numbers 1-3, plus using likelihood or Bayesian methods to infer trees.
6. Cladistics is numbers 1-3 and number 5, plus inferring trees by distance matrix methods.
The straightforward and coherent definition seems, to me, to be #1. It is a position on classification, not on how the phylogeny should be inferred. As such it is a sensible position, and certainly the strongly dominant one these days. The other definitions are, however, talking about two things at once -- how we infer phylogenies and how we define groups in the classification system. You can find in the literature, including in textbooks, all of the other positions. The definition of "cladistics" is a disastrous mess. Systematists are wildly divided among these various positions. The one thing that unites all of them is the belief that the definition of cladistics is clear and widely agreed-upon. They are each sure that their definition is the one everyone else agrees with.
A glance at the Wikipedia page for "cladistics" shows it advocating a position somewhere between #5 and #6. On the Talk:Cladistics Wikipedia page people, including me, complain about this. The failure of agreement simply reflects the state of systematics and the wildly mixed messages that it gives to the outside world.
Rant #8. Can we please stop using the term "incomplete lineage sorting"?
Why? Because it is much less clear than just thinking about coalescence within each species and then, in their common ancestor, the remaining lineages coming together in the same population and coalescing as one goes further back.
Consider thinking of it as incomplete lineage sorting (ILS). Suppose we have three samples from species A and four from species B and the tree of species is drawn as growing upwards. Then we start at some point earlier than their common ancestor (actually, where?) and as we go up that lineage to the common ancestor, the lineages split (actually, how many times?). Then the resulting lineages go up into the two species (actually, how do they decide how many go each way? They can't do it independently, because then all of them might go into species B, in which case there are none to go into species A.) Once they get into those two species, they can split further, but have to end up with exactly the right sample sizes.
This is horribly complex. Turning around and starting with the samples and going back in time, it is just two coalescents, combination of the remaining lineages, and a further coalescent. That is much easier to think about, and to simulate. So let's think of this as the multispecies coalescent, not as the difficult concept of "incomplete lineage sorting". It was important in the early history of thinking about the multispecies coalescent, but it has been superseded by a much clearer conceptual framework.
However, there is still one argument for using "incomplete lineage sorting" as the name for the phenomenon. It is misleading, but it is difficult to find a phrase to replace it. Perhaps call the discrepancies "gene-tree/species-tree differences"?
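The backward-in-time description really is easy to simulate. Here is a minimal Python sketch for the two-species case with three samples from A and four from B. It assumes a standard Kingman coalescent with the same constant population size in both species and in their ancestor, and time measured in coalescent units, so that k lineages coalesce at rate k(k-1)/2. The function names, the tip labels, and the `split_time` parameter are all assumptions made for the example.

```python
import random


def coalesce(lineages, duration, rng):
    """Coalesce lineages backward in time for `duration` coalescent units.

    Returns the lineages still remaining at the end of the interval.
    Use duration=float("inf") to run until a single lineage is left.
    """
    lineages = list(lineages)
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence among k lineages.
        t += rng.expovariate(k * (k - 1) / 2)
        if t > duration:
            break
        # Pick two distinct lineages at random and join them.
        i, j = rng.sample(range(k), 2)
        joined = (lineages[i], lineages[j])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(joined)
    return lineages


def two_species_gene_tree(n_a, n_b, split_time, seed=None):
    """Two within-species coalescents, then one in the common ancestor."""
    rng = random.Random(seed)
    left_a = coalesce([f"A{i + 1}" for i in range(n_a)], split_time, rng)
    left_b = coalesce([f"B{i + 1}" for i in range(n_b)], split_time, rng)
    # The remaining lineages find themselves in the same ancestral population.
    (tree,) = coalesce(left_a + left_b, float("inf"), rng)
    return tree


print(two_species_gene_tree(3, 4, split_time=0.5, seed=1))
```

Nothing here has to decide how many lineages go into each species, or make the sample sizes come out right: the samples are where we start. Running it with a small `split_time` will often give trees in which the A copies and the B copies do not form separate groups.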
Rant #9: Paleoanthropology needs to communicate more sanely.
As far as I can see, here's how a typical cycle of communication between paleoanthropologists and the public goes:
1. They find a bone.
2. They hold a press conference.
3. They describe this as a new species of hominid.
4. They announce that up until now, no one has understood human evolution.
5. They say that with the finding of this bone, we now understand human evolution.
6. This is widely applauded in the popular science press.
7. They make a National Geographic special.
8. The cycle then starts again at step 1.
Thing is, next time they start by saying that up till now, no one has understood human evolution. Their listeners don't seem to remember that they recently announced that an understanding of human evolution was finally achieved.
OK, this must be a caricature (right?). I see why things like this are done, as they have to raise money, often from outside of normal scientific funding agencies. But it does leave the public confused, if also feverishly excited.
Rant #10. The term "gene tree" is ambiguous. OK, I know that I myself used the term "gene tree" in rant #8. But it is used in two senses: (1) coalescent trees, and (2) trees of gene duplications. These are very different, one having at its tips multiple copies at the same locus, the other having at each tip a representative of an individual locus in an individual species. A fork in the first kind of "gene tree" represents a coalescence, and a fork in the second kind of "gene tree" represents a gene duplication event. The day is soon coming when we will have data that consists of multiple copies of each member of a gene family in each of a set of species. When we do, will we describe the analysis of these as "gene tree / gene tree" methods?
Rant #11. The word "gene" has a three-way ambiguity. Actually more, since people are unsure how far upstream in front of the start codon a gene extends, and how far downstream beyond the stop codon. That problem is well-known. But there is another three-way ambiguity. Is one "gene"
1. a single copy, such as my maternal copy of hemoglobin-β? or
2. a single allele, such as the wild-type allele of hemoglobin-β? or
3. a single locus, such as the hemoglobin-β locus?
Actually, there is a fourth usage. There are about 20,000 "genes" in the human species. If we also consider (say) the gorilla, does it have those 20,000 genes or another 20,000 genes? It seems to me that the second usage in the above list is rare, but does persist in the term "gene frequency", which really means allele frequency. Mostly "gene" is either a single copy or a single locus. But that ambiguity persists.

More to come, soon.

Joe Felsenstein