version 2, 11/10/2016

                             R exercise 4a

           (Additional to the earlier R exercise for lecture 4)

This tells you how to read in forms such as the fishes data I gave you, or the
Vilmann rat data Fred gave you, and do Generalized Procrustes Analysis
superpositions.  It uses the code Fred gave on the course website in the file
"S.procrustes.from.scratch" which will be found in the course Catalyst files.

The material below modifies Fred's routine for reading in the Vilmann rat data
into one that can read in a general  n x m  array.

1.  This is Fred's statement for reading in the Vilmann rat dataset.

## read in two-dimensional coordinate data  [these are 18*8 octagons (x,y)]
## except that we are going to read the matrix first and get  n  and  m
## 
##
 
 vil.coordinate.data.144<-t(matrix(scan(
      file="vilmann.data.set.txt",n=18*144,what=numeric()),18,144))[,3:18]

But we want to do something that will work more generally.


2. Now we want instead to read in data like the simulated fishes data I gave
you.  You can do that like this (as we saw earlier) by the commands 

a <- read.table("fishes1.txt");

and convert it into a matrix of  n x m  and drop the first column (the names)
by doing

b <- data.matrix(a[,2:dim(a)[2]]);

That assumes that the file of data has a first column of names.


3.  Now here is a function that will take these forms and do a Generalized
Procrustes Analysis.  It starts by detecting  n  and  m.   You can just copy
this function and paste it into your R environment.  It is hard to understand
because Fred has very cleverly used complex numbers so each landmark is not a
pair of (x,y) coordinates but a single complex number   x + i y    That way
he can rotate a point by just doing a multiplication.


procrustes <- function(b) {
## JF: function doing Fred's Procrustes algorithm on a file of  n  rows which
##    has a name and then   m   coordinates, blank-separated.
##    argument   b   is the array  n x m  with the data
##
##   Procrustes from scratch (for two-dimensional data)

##      this version takes advantage of complex arithmetic
##      code is in Splus, very easy to convert to R

## JF:  get the dimensions

  n <- dim(b)[1];   ## number of specimens
  m <- dim(b)[2];   ## number of coordinates (twice the number of landmarks)

## JF: the code after this point is Fred's except generalized by me to  n x m
 
## convert to complex numbers
 vil.complex.144<-matrix(complex(r=b[,seq(1,m-1,2)], i=b[,seq(2,m,2)]),n,m/2)

## center to mean zero casewise
 vil.centered.144<-sweep(vil.complex.144, 1,apply(vil.complex.144,1,mean),"-")

## scale to unit Centroid Size casewise
 vil.scaled.144<-sweep(vil.centered.144,
        1,sqrt(apply(Mod(vil.centered.144)^2,1,sum)),"/")

## compute tentative Procrustes mean 
 vil.testmean.144<-apply(vil.scaled.144,2,mean)

## set up GPA cycle
 vil.iterates<-matrix(vil.scaled.144,n,m)

## loop
 print("Convergence: changes in iteration cycles 1-5");
 print("------------ ------- -- --------- ------ ---");
 for (iloop in 1:5) {

## for each case
    for (j in 1:n) {

## rotate
      vil.iterates[j,]<-vil.scaled.144[j,]*
           (sum(vil.testmean.144*Conj(vil.scaled.144[j,]))/
            sum(vil.scaled.144[j,]*Conj(vil.scaled.144[j,])))

## restore Centroid Size 1
      vil.iterates[j,]<-vil.iterates[j,]/sqrt(sum(
            Mod(vil.iterates[j,])^2))
}
## convergence criterion (hardly ever needed)

  print(sum(Mod(vil.testmean.144-apply(vil.iterates,2,mean))^2))

## replace tentative mean by new centroids
   vil.testmean.144<-apply(vil.iterates,2,mean)
}

## rotate mean to first principal axis horizontal
 vil.pc1<-princomp(cbind(Re(vil.testmean.144),Im(vil.testmean.144)))$loadings[,1]
 if (vil.pc1[1]<0) vil.pc1<- -vil.pc1; vil.pc1[2]<- -vil.pc1[2]
 vil.turn<-complex(r=vil.pc1[1],i=vil.pc1[2])
 vil.testmean.144<-vil.turn*vil.testmean.144
 vil.iterates<-vil.turn*vil.iterates

## JF: my modification of Fred's plot routines using ordinary "plot" and "line"

 xmin <- min(Re(vil.iterates));      ## find min, max of each specimen's x,
y's
 ymin <- min(Im(vil.iterates));
 xmax <- max(Re(vil.iterates));
 ymax <- max(Im(vil.iterates));
##  plots a single plot with the points (smaller ones if there are more)
##  with the mean form drawn with lines in red on top of them
 plot(vil.iterates,pch=16,asp=1,cex=5/sqrt(n),xlim=c(xmin,xmax),
        ylim=c(ymin,ymax),xlab="x", ylab="y");
 lines(vil.testmean.144,type="l",col="red",lwd=2);
 xvil <- Re(vil.iterates);           ## the  x  coordinates   n x (m/2)
 yvil <- Im(vil.iterates);           ## the  y  coordinates   n x (m/2)
 superposed <- matrix(0, n, m);      ##  empty matrix to hold output values
 for (i in 1:(m/2)) {                ## copy columns into superposition array
   superposed[,2*i-1] <- xvil[,i];   ## a column of x's, then ...
   superposed[,2*i] <- yvil[,i];     ## a column of y's
 }
return(superposed);
}


What that function does is superpose the forms by Generalized Procrustes
Analysis, get the superposed forms and the mean form as  vil.iterates  and
vil.testmean.   As it goes it prints out an indication of how much change
occurs in its 5 iterations of fitting -- you can see that this gets smaller
very rapidly.

It then plots in red the mean form (a plot that assumes that the coordinates
are around the outside of the form) as line segments, and plots the individual
GPA-superposed Procrustes coordinates.

It also returns as an result the array of superposed forms, which is an
n x m  array.  If you just run this function by typing  "procrustes(b)" you
will get that whole array printed out on your screen, which may startle you.
Better to run it by typing

bbb <- procrustes(b)


4. You should be able to use this to basically duplicate my file
"fishes1proc.txt".   I can also put up more simulated-fishes data sets if you
want.  You might benefit also from making up small datasets of your own and
running this function on them.


5. Finally, note that for forms from a phylogeny one uses the contrasts if one
wants to compute  eigenvalues and eigenvectors (or equivalently, "principal
warps" or the results of SVD computations).  But do not attempt to do a
Procrustes superposition of the contrasts -- it is meaningful on the original
forms, not the contrasts.  The analysis I had you do in the previous exercise
was equivalent to doing the Generalized Procrustes Analysis on the forms,
then taking contrasts, then getting an eigenvector of those, and then plotting
them on the mean form from the Generalized Procrustes Analysis, *not* on the
mean of contrast values, which wouldn't make sense.


6. An interesting exercise is to compare the eigenvectors (plotted onto the
mean fish by the routine I gave last time) from two analyses: one which
uses the contrasts to get the eigenvectors, and one which just assumes wrongly
that the forms are independent and identically distributed and gets the
eigenvectors from the covariances of the forms without any attempt to correct
for the phylogeny.

7. Enjoy.

                                                        Joe Felsenstein