I always have a collection of rants and diatribes on various topics, and it
occurred to me that they should be more widely available. Currently, if you
are talking to me and mention topic #15, I come forth with rant #15 on that
topic. Why not put them online? I will be gradually adding rants here. Of
course some of them will turn out to be oversimplifications and caricatures,
and I will try to admit this if called out on them.
- Rant #1: Genomics terminology is often stupid. Examples:
- "Box"
- (This is fading from use, fortunately.) People will point to a sequence
motif and call it a "box" as in "TATA box". Why "box"? Because when they
first saw one of these they drew a box around it. But suppose they had drawn
an elephant around it -- would we then call it a "TATA elephant"?
- "Nextgen sequencing"
- This is pure marketing terminology, adopted uncritically. But think: the
current generation of sequencing methods is "next generation
sequencing". What will happen in another generation? Will we say "back in the
days of next generation sequencing"? Some embarrassment is called for.
- Intron and exon
- See, an "intron" is a piece of a gene that is cut out
when making mature mRNA, and an "exon" is a piece of the gene that is left
in. Only if you run time backwards does this make any sense.
- Rant #2: The definition of "monophyletic" needs work. Right now the
standard definition is that a group is monophyletic if it consists of an
ancestor and all of its descendants. I see why that was done, but in some cases
it does not work. For example if we are discussing the set of species [Carrot,
Salmon, Gorilla, Human], and I claim that the set [Gorilla, Human] is
monophyletic, I'd like to be able to be right about that. But alas, that set
does not contain the common ancestor of humans and gorillas, nor does it
contain all descendants, the fossil ones and present-day ones such as the two
chimpanzee species.
My own definition of monophyletic is that a set of species is monophyletic if
it has its own common ancestor, which is not the ancestor of any of the
other species under discussion. That seems to work for all cases.
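That definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree shape and the internal node names (`anc_apes`, `anc_vertebrates`, `root`) are invented for the example, and "its own common ancestor" is read as the most recent common ancestor of the set.

```python
# Hypothetical rooted tree, written as child -> parent. The internal node
# names are invented labels for illustration only.
PARENT = {
    "Human": "anc_apes",
    "Gorilla": "anc_apes",
    "anc_apes": "anc_vertebrates",
    "Salmon": "anc_vertebrates",
    "anc_vertebrates": "root",
    "Carrot": "root",
}


def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def most_recent_common_ancestor(species, parent):
    """Return the most recent node that lies above every given species."""
    paths = [path_to_root(s, parent) for s in species]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking up from one species, the first shared node is the most recent.
    return next(node for node in paths[0] if node in shared)


def is_monophyletic(group, under_discussion, parent):
    """True if the group's common ancestor is not an ancestor of any
    other species under discussion."""
    group = set(group)
    mrca = most_recent_common_ancestor(group, parent)
    outsiders = set(under_discussion) - group
    return all(mrca not in path_to_root(s, parent) for s in outsiders)


species = ["Carrot", "Salmon", "Gorilla", "Human"]
print(is_monophyletic(["Gorilla", "Human"], species, PARENT))  # True
print(is_monophyletic(["Salmon", "Human"], species, PARENT))   # False
```

Note that the check never asks whether the ancestor itself, or the chimpanzees, are in the set. Only the species under discussion matter.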
- Rant #3: The term "segmental duplication" can be misleading.
When I first heard it, I thought it must refer to the duplication of some new
unit of the genome, the "segment". But I wasn't sure what that unit was.
Subsequently I realized that it simply meant duplications of parts of the
genome that were not defined as gene duplications. I wonder whether the
word "segmental" is really helpful.
- Rant #4: Inference of phylogenies is not simply a matter of
nested derived states. In many textbooks, museum displays, and blog
arguments it is confidently asserted that the way we reconstruct phylogenies is
to find shared derived states (synapomorphies), and that the nested pattern
of these defines clades. The problem is that, if this were true, we would
have no parallelisms, no convergences, no reversals in evolution. We would
also always know where the tree was rooted, since we would always know which
state was the ancestral state. We would not
ever need computer programs to reconstruct phylogenies. Not even parsimony
methods would be necessary -- you could always just work by hand and you would
then always find a tree with no extra steps. Later in the same textbooks,
there are sections on phylogeny methods that talk about the need for computer
programs to reconcile conflict among characters. But there is not a word
about how this conflicts with the statements earlier in the book.
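To see how quickly the "nested derived states" picture breaks, here is a small Python sketch. It assumes binary characters whose ancestral state is known to be 0 (exactly the knowledge the textbook account takes for granted), and the taxa and characters are made up for illustration. If the account were complete, the sets of taxa carrying each derived state would always be nested or disjoint; any pair that overlaps without nesting implies at least one parallelism or reversal.

```python
from itertools import combinations

# Made-up data: 1 = derived state, 0 = ancestral state (assumed known).
CHARACTERS = {
    "char1": {"A": 1, "B": 1, "C": 0, "D": 0},
    "char2": {"A": 1, "B": 1, "C": 1, "D": 0},
    "char3": {"A": 0, "B": 1, "C": 1, "D": 0},
}


def derived_set(character):
    """Taxa that carry the derived state (1) for one character."""
    return {taxon for taxon, state in character.items() if state == 1}


def conflicting_pairs(characters):
    """Return pairs of character names whose derived-state sets overlap
    without one being nested inside the other."""
    conflicts = []
    for (name_a, char_a), (name_b, char_b) in combinations(characters.items(), 2):
        a, b = derived_set(char_a), derived_set(char_b)
        nested_or_disjoint = a <= b or b <= a or a.isdisjoint(b)
        if not nested_or_disjoint:
            conflicts.append((name_a, name_b))
    return conflicts


print(conflicting_pairs(CHARACTERS))  # [('char1', 'char3')]
```

Here `char1` and `char2` nest nicely, but `char3` groups B with C while `char1` groups B with A. No tree fits both without an extra step, and that is the point at which one needs a method for reconciling the conflict.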
- Rant #5: The Modern Synthesis has not been replaced. Sure, all
sorts of new phenomena have come along since the 1940s: neutral mutation,
lateral gene transfer, symbiosis, evo-devo, epigenetics, etc. And we could
declare the death of the current Synthesis each time one came along. But here's
why we shouldn't do that:
- Otherwise every time John Blotz pointed out a new phenomenon he
could strut around publicizing the fact that he, the great Blotz, had
invalidated the evolutionary synthesis, and now we had (ta-da!) the Blotzian
Synthesis. But he would be shocked a year or two later when Jane Schmerz came
along and invalidated the Blotzian Synthesis in favor of the new Schmerzian
Synthesis. And so it would go, synthesis after synthesis, until everyone was totally confused, and most
people were several syntheses behind.
- Meanwhile the public would be continually told that all that stuff they
learned in secondary school, about mutation and natural selection and some
other evolutionary forces, was all wrong, because now we had the Blotzian (er,
oops, actually the Schmerzian) Synthesis instead.
It would be (temporarily) great for Blotz's and Schmerz's careers and egos, but a disaster
for everyone else.
- Rant #6: The term "transitional fossil" is misleading.
Creationists often say that we have no "transitional fossils". Biologists
reply that we have lots of them. The argument is partly over what a
transitional fossil is supposed to be. It sounds like it is a fossil from an
ancestral species, caught in the act of "transition" to a new form.
I think "transitional" is actually bad terminology. I think it dates to 50+
years ago when many evolutionary biologists naïvely assumed that any fossil
that looked like an ancestor really was the ancestor. Archaeopteryx was assumed
to actually be an ancestor of modern birds.
Now the definition is modified to mean having a "transitional" combination of
character-states, which is what our fossils really are. We have lots of those.
But we're plagued by
the word "transitional". We need some term that does not also imply, to the
unwary listener, that the fossil is known-to-be-the-ancestor.
- Rant #7: There is no consensus, even among systematists, as to what the
word "cladistics" means.
There seem to be various
positions:
1. Cladistics is the position that all groups in the classification system should be
monophyletic.
2. Cladistics is that, plus reconstructing the tree by nested synapomorphies.
3. Cladistics is those, plus using parsimony methods when you have reversals or
parallelisms in your data.
4. Cladistics is all those, but only if you're a paid-up member of the Willi
Hennig Society.
5. Cladistics is numbers 1-3, plus using likelihood or Bayesian methods to infer
trees.
6. Cladistics is numbers 1-3 and number 5, plus inferring trees by distance matrix
methods.
The straightforward and coherent definition seems, to
me, to be #1. It is a position on classification, not on how the phylogeny
should be inferred. As such it is a sensible position, and certainly the
strongly dominant one these days. The other definitions are, however,
talking about two things at
once -- how we infer phylogenies and how we define groups in the classification
system. You can find in the literature, including in textbooks, all of the
other positions. The definition of "cladistics" is a disastrous mess. Systematists are wildly divided among these
various positions. The one thing that unites all of them is the belief that
the definition of cladistics is clear and widely agreed-upon. They are each
sure that their definition is the one everyone else agrees with.
A glance at the Wikipedia page for "cladistics" will find
it advocating a position somewhere between #5 and #6. In the Talk:Cladistics
Wikipedia page people,
including me, complain about this. The failure of agreement simply reflects the state of systematics and the wildly mixed
messages that it gives to the outside world.
- Rant #8: Can we please stop using the term "incomplete lineage sorting"?
It is much less clear than just thinking about coalescence within each species and then, in their
common ancestor, the remaining lineages coming to be in the same population and then coalescing as one goes
further back.
Consider thinking of it as ILS. Suppose we have three samples from species A and four from species
B and the tree of species is drawn as growing upwards. Then we
start at some point earlier than their common ancestor (actually, where?) and as we go up that
lineage to the common ancestor, the
lineages split (actually, how many times?). Then the resulting lineages go up into the two species (actually, how
do they decide how many go each way? They can't do it independently, because then all of them might go into
species B, in which case there are none to go into species A.) Once they get into those two species, they can
split further, but have to end up with exactly the right sample sizes.
This is horribly complex. Turning around and starting with the samples and going back in time, it is just
two coalescents, combination of the remaining lineages, and a further coalescent. That is much easier to
think about, and to simulate. So let's think of this as the multispecies coalescent, not as the difficult
concept of "incomplete lineage sorting". It was important in the early history of thinking about the
multispecies coalescent, but it has been superseded by a much clearer conceptual framework.
However, there is still one argument for using "incomplete lineage sorting" as
the name for the phenomenon. It is misleading, but it is difficult to find a
phrase to replace it. Perhaps call the discrepancies "gene tree/species tree differences"?
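The backward-in-time description (two coalescents, pooling of the remaining lineages, then one more coalescent) really is easy to simulate. Here is a minimal Python sketch. The function names and parameter values are my own choices for illustration; it assumes a diploid Wright-Fisher population with the same size in both species and in their ancestor, and it tracks only how many lineages remain and when coalescences happen, not the tree topology.

```python
import random


def coalesce_until(n_lineages, pop_size, max_time, rng):
    """Run one coalescent backward in time.

    Each pair of lineages coalesces at rate 1/(2*pop_size) per generation
    (a diploid Wright-Fisher assumption). Stops at `max_time` or when one
    lineage remains. Returns (lineages left, list of coalescence times).
    """
    t = 0.0
    times = []
    k = n_lineages
    while k > 1:
        rate = k * (k - 1) / 2 / (2 * pop_size)
        t += rng.expovariate(rate)
        if t > max_time:
            break  # reached the speciation time with k lineages left
        times.append(t)
        k -= 1
    return k, times


def simulate_two_species(n_a, n_b, split_time, pop_size, rng):
    """Two within-species coalescents, pooling, then an ancestral coalescent."""
    left_a, times_a = coalesce_until(n_a, pop_size, split_time, rng)
    left_b, times_b = coalesce_until(n_b, pop_size, split_time, rng)
    pooled = left_a + left_b  # remaining lineages share one population
    _, times_anc = coalesce_until(pooled, pop_size, float("inf"), rng)
    times_anc = [split_time + t for t in times_anc]
    return left_a, left_b, times_a, times_b, times_anc


rng = random.Random(1)  # fixed seed so that runs are repeatable
left_a, left_b, times_a, times_b, times_anc = simulate_two_species(
    n_a=3, n_b=4, split_time=1000.0, pop_size=500, rng=rng
)
print(left_a, left_b, len(times_anc))
```

Nothing here has to "decide" how many lineages go into each species, or make the sample sizes come out right. Those awkward questions only arise when one runs the process forward.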
- Rant #9: Paleoanthropology needs to communicate more
sanely.
As far as I can see, here's how a typical cycle of communication between
paleoanthropologists and the public goes:
1. They find a bone.
2. They hold a press conference.
3. They describe this as a new species of hominid.
4. They announce that up until now, no one has understood human evolution.
5. They say that with the finding of this bone, we now understand human
evolution.
6. This is widely applauded in the popular science press.
7. They make a National Geographic special.
8. The cycle then starts again at step 1.
Thing is, next time they start by saying that up till now, no one has
understood human evolution. Their listeners don't seem to remember that they recently announced that an understanding
of human evolution was finally achieved.
OK, this must be a caricature (right?). I see why things like this
are done, as they have to raise money, often from outside of normal scientific
funding agencies. But it does leave the public confused, if also feverishly excited.
- Rant #10: The term "gene tree" is ambiguous. OK, I know that I myself used the term "gene tree" in rant
#8. But it is used in two senses: (1) coalescent trees, and (2)
trees of gene duplications. These are very different, one having at its tips
multiple copies at the same locus, the other having at each tip a representative of an individual
locus in an individual species. A fork in the first kind of "gene tree" represents a coalescence, and
a fork in the second kind of "gene tree" represents a gene duplication event. The day is soon
coming when we will have data that consists of multiple copies of each member of a gene family
in each of a set of
species. When we do, will we describe the analysis of these as "gene tree / gene tree" methods?
- Rant #11: The word "gene" has a three-way ambiguity. Actually more, since people are unsure
how far upstream in front of the start codon a gene extends, and how far downstream
beyond the stop codon. That
problem is well-known. But there is another three-way ambiguity. Is one "gene"
- a single copy, such as my maternal copy of hemoglobin-β? or
- a single allele, such as the wild-type allele of hemoglobin-β? or
- a single locus, such as the hemoglobin-β locus?
Actually, there is a fourth usage. There are about 20,000 "genes" in the human species. If
we also consider (say) the gorilla, does it have those 20,000 genes or another 20,000 genes?
It seems to me that the second usage in the above list is rare, but does persist in the term "gene frequency", which
really means allele frequency. Mostly "gene" is either a single copy or a single
locus. But that ambiguity persists.
More to come, soon.