version 2, 11/10/2016 R exercise 4a (Additional to the earlier R exercise for lecture 4) This tells you how to read in forms such as the fishes data I gave you, or the Vilmann rat data Fred gave you, and do Generalized Procrustes Analysis superpositions. It uses the code Fred gave on the course website in the file "S.procrustes.from.scratch" which will be found in the course Catalyst files. The material below modifies Fred's routine for reading in the Vilmann rat data into one that can read in a general n x m array. 1. This is Fred's statement for reading in the Vilmann rat dataset. ## read in two-dimensional coordinate data [these are 18*8 octagons (x,y)] ## except that we are going to read the matrix first and get n and m ## ## vil.coordinate.data.144<-t(matrix(scan( file="vilmann.data.set.txt",n=18*144,what=numeric()),18,144))[,3:18] But we want to do something that will work more generally. 2. Now we want instead to read in data like the simulated fishes data I gave you. You can do that like this (as we saw earlier) by the commands a <- read.table("fishes1.txt"); and convert it into a matrix of n x m and drop the first column (the names) by doing b <- data.matrix(a[,2:dim(a)[2]]); That assumes that the file of data has a first column of names. 3. Now here is a function that will take these forms and do a Generalized Procrustes Analysis. It starts by detecting n and m. You can just copy this function and paste it into your R environment. It is hard to understand because Fred has very cleverly used complex numbers so each landmark is not a pair of (x,y) coordinates but a single complex number x + i y That way he can rotate a point by just doing a multiplication. procrustes <- function(b) { ## JF: function doing Fred's Procrustes algorithm on a file of n rows which ## has a name and then m coordinates, blank-separated. ## argument b is the array n x m with the data ## ## Procrustes from scratch (for two-dimensional data) ## this version takes advantage of complex arithmetic ## code is in Splus, very easy to convert to R ## JF: get the dimensions n <- dim(b)[1]; ## number of specimens m <- dim(b)[2]; ## number of coordinates (twice the number of landmarks) ## JF: the code after this point is Fred's except generalized by me to n x m ## convert to complex numbers vil.complex.144<-matrix(complex(r=b[,seq(1,m-1,2)], i=b[,seq(2,m,2)]),n,m/2) ## center to mean zero casewise vil.centered.144<-sweep(vil.complex.144, 1,apply(vil.complex.144,1,mean),"-") ## scale to unit Centroid Size casewise vil.scaled.144<-sweep(vil.centered.144, 1,sqrt(apply(Mod(vil.centered.144)^2,1,sum)),"/") ## compute tentative Procrustes mean vil.testmean.144<-apply(vil.scaled.144,2,mean) ## set up GPA cycle vil.iterates<-matrix(vil.scaled.144,n,m) ## loop print("Convergence: changes in iteration cycles 1-5"); print("------------ ------- -- --------- ------ ---"); for (iloop in 1:5) { ## for each case for (j in 1:n) { ## rotate vil.iterates[j,]<-vil.scaled.144[j,]* (sum(vil.testmean.144*Conj(vil.scaled.144[j,]))/ sum(vil.scaled.144[j,]*Conj(vil.scaled.144[j,]))) ## restore Centroid Size 1 vil.iterates[j,]<-vil.iterates[j,]/sqrt(sum( Mod(vil.iterates[j,])^2)) } ## convergence criterion (hardly ever needed) print(sum(Mod(vil.testmean.144-apply(vil.iterates,2,mean))^2)) ## replace tentative mean by new centroids vil.testmean.144<-apply(vil.iterates,2,mean) } ## rotate mean to first principal axis horizontal vil.pc1<-princomp(cbind(Re(vil.testmean.144),Im(vil.testmean.144)))$loadings[,1] if (vil.pc1[1]<0) vil.pc1<- -vil.pc1; vil.pc1[2]<- -vil.pc1[2] vil.turn<-complex(r=vil.pc1[1],i=vil.pc1[2]) vil.testmean.144<-vil.turn*vil.testmean.144 vil.iterates<-vil.turn*vil.iterates ## JF: my modification of Fred's plot routines using ordinary "plot" and "line" xmin <- min(Re(vil.iterates)); ## find min, max of each specimen's x, y's ymin <- min(Im(vil.iterates)); xmax <- max(Re(vil.iterates)); ymax <- max(Im(vil.iterates)); ## plots a single plot with the points (smaller ones if there are more) ## with the mean form drawn with lines in red on top of them plot(vil.iterates,pch=16,asp=1,cex=5/sqrt(n),xlim=c(xmin,xmax), ylim=c(ymin,ymax),xlab="x", ylab="y"); lines(vil.testmean.144,type="l",col="red",lwd=2); xvil <- Re(vil.iterates); ## the x coordinates n x (m/2) yvil <- Im(vil.iterates); ## the y coordinates n x (m/2) superposed <- matrix(0, n, m); ## empty matrix to hold output values for (i in 1:(m/2)) { ## copy columns into superposition array superposed[,2*i-1] <- xvil[,i]; ## a column of x's, then ... superposed[,2*i] <- yvil[,i]; ## a column of y's } return(superposed); } What that function does is superpose the forms by Generalized Procrustes Analysis, get the superposed forms and the mean form as vil.iterates and vil.testmean. As it goes it prints out an indication of how much change occurs in its 5 iterations of fitting -- you can see that this gets smaller very rapidly. It then plots in red the mean form (a plot that assumes that the coordinates are around the outside of the form) as line segments, and plots the individual GPA-superposed Procrustes coordinates. It also returns as an result the array of superposed forms, which is an n x m array. If you just run this function by typing "procrustes(b)" you will get that whole array printed out on your screen, which may startle you. Better to run it by typing bbb <- procrustes(b) 4. You should be able to use this to basically duplicate my file "fishes1proc.txt". I can also put up more simulated-fishes data sets if you want. You might benefit also from making up small datasets of your own and running this function on them. 5. Finally, note that for forms from a phylogeny one uses the contrasts if one wants to compute eigenvalues and eigenvectors (or equivalently, "principal warps" or the results of SVD computations). But do not attempt to do a Procrustes superposition of the contrasts -- it is meaningful on the original forms, not the contrasts. The analysis I had you do in the previous exercise was equivalent to doing the Generalized Procrustes Analysis on the forms, then taking contrasts, then getting an eigenvector of those, and then plotting them on the mean form from the Generalized Procrustes Analysis, *not* on the mean of contrast values, which wouldn't make sense. 6. An interesting exercise is to compare the eigenvectors (plotted onto the mean fish by the routine I gave last time) from two analyses: one which uses the contrasts to get the eigenvectors, and one which just assumes wrongly that the forms are independent and identically distributed and gets the eigenvectors from the covariances of the forms without any attempt to correct for the phylogeny. 7. Enjoy. Joe Felsenstein