Nominal predicted variable (Discriminant function analysis)
Let's look at nominal predicted variables, which are going to be "Sex" (with two choices: Male or Female) and "Population" (with two choices: Buriat and Easter Islander). The data are from W.W. Howells' craniometric dataset and use only the first eight of his variables. Mark and copy the yellow highlighted section (the data block below) to your RGui.
BE = structure(list(Sex=structure(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),.Label=c("F","M"),class="factor"),Pop=structure(c(1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),.Label=c("BURIAT",
"EASTER"),class="factor"),GOL=c(166,171,175,176,180,179,173,170,169,175,163,181,166,176,158,170,167,170,162,179,174,167,173,171,171,183,170,169,171,175,174,174,168,160,173,171,176,
176,173,175,164,166,176,168,171,175,172,177,162,175,176,177,170,180,178,185,191,184,185,183,187,188,177,172,191,183,189,180,194,179,182,190,187,185,180,186,184,176,184,178,177,
184,181,170,180,180,183,174,185,187,183,175,182,168,182,171,178,174,191,176,187,178,191,172,180,190,189,173,171,187,172,186,180,183,180,175,185,185,188,176,180,183,172,187,177,
184,183,169,186,180,180,184,192,181,172,183,183,180,180,176,179,180,187,174,183,176,197,194,201,188,190,194,181,189,187,191,199,191,180,201,183,196,191,202,203,195,188,190,203,
189,187,190,184,191,192,192,190,187,189,194,192,187,196,191,183,190,189,199,198,201,189,193,195,193,200),NOL=c(166,169,175,175,179,178,171,169,168,175,162,179,165,176,157,169,
167,169,160,176,172,166,172,170,168,181,170,169,170,175,172,174,166,159,172,171,174,173,170,175,162,166,175,168,169,171,171,174,161,174,176,177,170,179,175,183,190,179,183,181,
184,186,174,171,188,181,187,178,191,177,177,186,184,182,177,182,181,175,182,174,176,180,178,170,177,177,180,170,183,183,179,173,178,165,179,170,176,171,185,175,185,175,190,171,
177,186,186,173,171,184,167,182,177,180,178,171,181,178,183,171,177,179,169,183,174,182,179,166,182,177,174,180,188,177,168,178,180,179,176,175,176,176,183,171,180,173,192,191,
198,182,188,190,178,185,183,185,194,188,178,197,179,189,186,197,198,191,184,186,197,185,183,187,181,188,187,187,186,181,185,189,188,184,191,187,180,186,186,191,192,195,183,188,
191,188,196),BNL=c(94,97,93,97,102,101,99,94,99,95,97,102,89,100,91,93,93,104,89,96,95,95,96,95,92,101,98,97,97,98,95,97,93,96,92,101,93,100,94,95,99,96,102,
98,97,99,93,101,91,97,101,104,96,103,104,106,104,100,102,101,108,112,104,98,98,106,106,107,103,99,100,101,104,103,104,100,95,97,107,99,103,99,98,102,101,107,99,98,104,
104,106,102,103,100,104,99,97,97,98,97,105,104,99,96,101,105,113,96,97,110,99,110,110,104,104,102,108,106,105,102,107,100,98,110,106,107,100,97,104,103,103,104,111,
101,102,108,104,113,102,105,105,105,106,98,106,100,117,114,116,107,109,113,110,110,107,110,110,114,102,117,108,115,110,116,114,113,109,120,111,111,112,113,110,109,110,111,111,
107,112,111,115,110,109,109,101,108,108,111,114,116,112,112,116,112,111),BBH=c(125,126,123,121,128,128,133,130,129,115,127,120,131,129,119,124,117,130,123,130,121,130,129,129,
130,130,127,119,126,125,133,122,130,123,131,131,135,130,131,127,138,126,129,122,128,132,121,126,124,123,133,133,127,138,141,134,138,126,131,140,141,135,137,128,130,133,139,132,
134,134,131,130,132,127,133,124,131,123,131,121,126,123,134,133,135,138,133,136,138,136,138,135,135,132,138,122,126,125,135,127,134,138,136,129,139,135,136,130,132,145,133,145,
139,137,140,131,140,140,135,137,140,137,132,138,137,142,137,123,138,137,135,137,138,134,131,142,134,145,136,141,136,140,137,132,137,134,155,141,150,142,139,145,140,142,142,141,
141,141,142,150,141,146,144,144,145,150,143,154,146,140,143,150,144,141,147,150,136,148,145,141,147,145,144,143,138,148,144,144,145,153,145,145,147,154,145),XCB=c(150,146,150,
151,155,157,156,151,144,142,148,149,150,140,142,144,149,154,150,155,142,144,153,142,161,155,147,138,141,148,147,153,152,147,149,141,149,146,154,147,148,143,148,153,147,146,155,
155,146,156,145,147,138,145,150,145,159,166,158,164,149,167,150,150,160,157,157,159,163,160,157,166,159,155,154,149,151,149,160,149,150,142,157,150,138,151,149,157,153,154,157,
153,149,155,152,160,153,164,167,150,162,154,152,145,145,159,164,163,152,132,126,122,131,133,129,129,135,126,127,129,130,130,129,130,128,129,127,131,133,132,132,128,135,123,122,
128,132,126,127,126,129,122,130,126,124,127,135,129,141,130,131,145,136,130,136,143,141,135,140,139,138,137,132,140,130,138,136,139,138,134,127,142,132,129,139,137,129,133,136,
135,142,133,137,134,139,129,136,129,131,135,126,130,139,128,136),XFB=c(124,112,122,124,124,129,127,128,128,115,122,125,124,112,118,121,122,127,120,125,113,118,123,115,130,125,
115,115,114,127,120,119,129,120,125,119,123,115,124,121,118,119,128,123,121,129,128,128,117,130,123,120,114,122,125,121,128,129,124,135,126,133,122,123,129,123,125,132,139,128,
125,122,132,126,125,122,124,127,131,121,119,122,126,125,118,119,123,128,125,127,127,129,115,127,122,127,119,130,145,129,135,128,128,114,127,134,133,135,123,113,106,100,115,104,
105,106,113,107,104,104,105,110,103,105,108,109,112,97,108,107,107,108,112,102,98,108,111,109,103,106,104,102,110,111,104,100,115,108,118,105,113,113,115,111,114,110,109,110,
114,115,116,115,110,118,108,115,109,117,113,112,107,115,112,109,119,114,111,108,117,109,120,111,111,112,115,110,112,109,115,109,110,112,113,107,118),ZYB=c(139,132,134,132,142,
139,143,141,126,129,131,128,129,132,132,134,132,134,126,138,132,138,137,128,134,136,137,134,130,133,131,136,134,137,145,133,143,133,141,132,133,132,137,135,134,136,138,146,129,
140,128,133,126,131,143,140,151,145,142,147,147,152,148,135,149,149,147,144,145,146,147,147,148,143,143,137,139,137,149,140,142,134,151,141,136,140,142,143,152,152,145,142,143,
147,147,139,138,151,154,138,151,148,139,139,138,147,158,142,140,129,123,131,125,126,132,118,127,130,122,126,123,126,128,132,126,123,130,121,118,124,129,126,127,123,126,124,125,
126,117,126,123,126,131,120,128,128,142,138,138,137,133,143,135,137,132,138,140,130,141,131,139,143,130,140,132,136,133,141,141,133,137,141,139,131,138,136,135,139,137,139,141,
132,140,129,135,134,136,131,136,140,135,143,136,133,139),AUB=c(131,128,128,131,138,134,138,137,124,125,127,121,125,127,123,132,127,133,123,134,127,130,129,123,131,128,133,125,
128,129,129,133,131,131,136,125,133,128,139,129,129,129,134,130,128,130,122,137,122,131,124,130,120,126,137,127,142,142,137,140,135,146,137,130,140,144,138,138,135,138,140,143,
142,140,135,127,132,123,137,132,138,128,144,132,125,134,135,138,140,137,138,134,137,135,142,134,134,142,145,135,139,139,127,131,123,140,149,136,130,118,113,118,116,120,121,110,
121,116,116,114,116,116,119,121,118,116,117,112,108,116,114,114,116,110,117,118,120,120,106,114,117,115,121,110,117,117,126,124,125,121,119,130,123,120,118,126,128,120,131,120,
125,130,120,132,119,125,120,131,127,125,119,128,128,120,121,127,123,124,120,129,127,120,126,118,120,123,125,117,124,130,121,122,124,124,126)),.Names=c("Sex","Pop","GOL","NOL","BNL","BBH",
"XCB","XFB","ZYB","AUB"),class="data.frame",row.names=c(NA,-195))
#The End!
We are going to convert the data to Darroch and Mosimann shape variables, which we can do in one fell swoop:
Shape.BE = cbind(BE[,1:2],BE[,-(1:2)]/apply(BE[,-(1:2)],1,prod)^(1/8)%o%rep(1,8))
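What that line does is divide each measurement by the geometric mean of that skull's eight measurements (the `apply(...,1,prod)^(1/8)` term), so every row of the resulting shape variables has a geometric mean of 1. A quick sanity check on a toy matrix (hypothetical numbers, not the Howells data):

```r
# Toy data: 3 "skulls" with 8 measurements each (made-up values)
toy <- matrix(c(166, 166,  94, 125, 150, 124, 139, 131,
                171, 169,  97, 126, 146, 112, 132, 128,
                175, 175,  93, 123, 150, 122, 134, 128),
              nrow = 3, byrow = TRUE)
# Same recipe as Shape.BE: divide each row by its geometric mean
gm <- apply(toy, 1, prod)^(1/8)
shape <- toy / gm %o% rep(1, 8)
# Every row of the shape matrix now has geometric mean 1
apply(shape, 1, prod)^(1/8)
```

Note that `%o%` binds more tightly than `/` in R, which is what makes the one-liner above (and the `Shape.BE` line) work.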
biotools is a great R package, but when some Mac users at UIUC last looked at it, it was not working on Macs. And we are only going to use one pretty small part, so here's the code you need (which will work on both Macs and non-Macs):
D2.disc = function(data, grouping, pooled.cov = NULL)
{
  if (!inherits(data, c("data.frame", "matrix")))
    stop("'data' must be a numeric data.frame or matrix!")
  if (length(grouping) != nrow(data))
    stop("incompatible dimensions!")
  data <- as.matrix(data)
  name.fac <- deparse(substitute(grouping))
  grouping <- as.factor(grouping)
  n <- nrow(data)
  p <- ncol(data)
  nlev <- nlevels(grouping)
  lev <- levels(grouping)
  pooledCov = function(data, classes)
  {
    n <- nrow(data)
    p <- ncol(data)
    stopifnot(n == length(classes))
    classes <- as.factor(classes)
    lev <- levels(classes)
    dfs <- tapply(classes, classes, length) - 1
    if (any(dfs < p))
      warning("too few observations for the number of variables!")
    covs <- aux <- list()
    for (i in 1:nlevels(classes)) {
      covs[[i]] <- cov(data[classes == lev[i], ])
      aux[[i]] <- covs[[i]] * dfs[i]
    }
    names(covs) <- lev
    pooled <- Reduce("+", aux)/sum(dfs)
    return(pooled)
  }
  confusionmatrix = function(obs, predict)
  {
    obs <- as.vector(obs)
    predict <- as.vector(predict)
    if (length(obs) != length(predict))
      stop("incompatible dimensions!")
    aux <- data.frame(gr = obs, newgr = predict)
    gr <- NULL
    ng <- length(unique(aux[["gr"]]))
    lev <- levels(as.factor(aux[["gr"]]))
    newlab <- paste("new", lev)
    m <- matrix(0, ng, ng, dimnames = list(lev, newlab))
    equal <- function(x, y) sum(x == y)
    for(i in 1:ng) {
      for(j in 1:ng) {
        m[i, j] <- equal(subset(aux, gr == lev[i])[["newgr"]], lev[j])
      }
    }
    return(m)
  }
  # pooled cov matrix
  if (is.null(pooled.cov)) {
    pooled.cov <- pooledCov(data, grouping)
  } else if (!is.matrix(pooled.cov)) {
    stop("'pooled.cov' must be a square matrix!")
  } else if (dim(pooled.cov)[1] != dim(pooled.cov)[2]) {
    stop("'pooled.cov' must be a square matrix!")
  } else if (any(dim(pooled.cov) != p)) {
    stop("'pooled.cov' has incompatible dimensions with 'data'!")
  }
  # means of each class
  med <- aggregate(data, list(grouping), mean)
  med <- as.matrix(med[, -1])
  rownames(med) <- lev
  # D2 dists
  dists <- matrix(NA, n, nlev, dimnames = list(rownames(data), lev))
  for(i in 1:n) {
    for(j in 1:nlev) {
      dists[i, j] <- mahalanobis(data[i, ], med[j, ], pooled.cov)
    }
  }
  # misclassifications
  id <- function(x) colnames(dists)[which.min(x)]
  pred <- apply(dists, 1, id)
  misclass <- character(n)
  for(i in 1:n) if (grouping[i] != pred[i]) misclass[i] <- "*"
  confusion <- confusionmatrix(grouping, pred)
  # output
  out <- list(call = match.call(),
              data = data,
              D2 = data.frame(dists, grouping, pred, misclass),
              means = med,
              pooled = pooled.cov,
              confusion.matrix = confusion)
  class(out) <- "D2.disc"
  return(out)
}
print.D2.disc = function(x, ...)
{
  cat("\nCall:\n")
  print(x$call)
  cat("\nMahalanobis distances from each class and class prediction (first 6 rows):\n")
  print(head(x$D2), ...)
  cat("\nClass means:\n")
  print(x$means, ...)
  cat("\nConfusion matrix:\n")
  print(x$confusion.matrix, ...)
  invisible(x)
}
And now we're on our merry way…
D2.disc(Shape.BE[,-(1:2)],Shape.BE[,1])
D2.disc(Shape.BE[,-(1:2)],Shape.BE[,2])
OK, so what that was doing was getting the Mahalanobis D2 for each case against the centroids for the two sexes and against the centroids for the two "Populations," and then "assigning" each case to the closest group. The "confusion matrix" is a tabulation of actual sexes or populations against classified sexes or populations. The cognoscenti will note that this matrix gives the "apparent error rate," which is slightly biased downward.

Let's look at this for "population," as that is clearly the direction in which we are headed (i.e., we "removed" size to try to get rid of sexual dimorphism). To make life simpler, let's drop "sex" from the data:
Shape.BE2=Shape.BE[,-1]
And now the apparent error rate is:
1-sum(diag(D2.disc(Shape.BE2[,-1],Shape.BE2[,1])$conf))/NROW(Shape.BE2)
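Equivalently, you can build the same kind of tabulation with base R's table() and read the apparent error rate off its diagonal. A sketch on made-up labels (not the craniometric data):

```r
# Hypothetical observed and predicted group labels
obs  <- c("BURIAT","BURIAT","BURIAT","EASTER","EASTER","EASTER","EASTER")
pred <- c("BURIAT","EASTER","BURIAT","EASTER","EASTER","BURIAT","EASTER")
conf <- table(obs, pred)          # rows = actual, columns = classified
conf
1 - sum(diag(conf))/length(obs)   # apparent error rate
```

The diagonal counts the correctly classified cases, so one minus its share of the total is the apparent error rate.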
R has a nifty little package called Daim for Diagnostic Accuracy of Classification Models (Dacm?) which shows you this when there is a binary classification:
library(Daim)
library(MASS)
mylda = function(formula, train, test){
  model <- lda(formula, train)
  predict(model, test)$posterior[,"pos"]
}
Daim(Pop~., model=mylda, data=Shape.BE2, labpos=c("BURIAT"))
"loob" is "leave-one-out bootstrap," which is similar to the "leave-one-out cross-validation" used in ForDISC. The 0.632 estimate is 0.368*(apparent error rate) + 0.632*(average error rate across bootstraps), where within bootstraps your "test cases" are only those cases that did not make it into the bootstrap sample. Why 0.632? If you have a sample of size N, then when you bootstrap sample you make N draws. At each draw there is a 1/N chance that you will draw a particular case, and consequently a 1 - 1/N chance that you will not draw it. So the probability that you never draw the case in one bootstrap sample is (1 - 1/N)^N, and the probability that you draw a given case one or more times in a bootstrap sample is 1 - (1 - 1/N)^N. As N gets large, (1 - 1/N)^N converges to e^-1 ≈ 0.368, so a given case makes it into a bootstrap sample with probability of about 0.632.
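You can check that arithmetic directly in R; with N = 195 (the number of crania here) the inclusion probability is already very close to the limiting value:

```r
N <- 195
p.never <- (1 - 1/N)^N   # probability a case never appears in one bootstrap sample
p.once  <- 1 - p.never   # probability it appears at least once
round(c(p.never, p.once, exp(-1), 1 - exp(-1)), 4)
```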
But back to the Mahalanobis distance:
sto = D2.disc(Shape.BE2[,-1],Shape.BE2[,1])
D2 = sto$D2
V.pool = sto$pooled
centroids = sto$means
Now, it's finally time to go Bayesian. If you've dealt with discriminant function analysis before, you've probably seen posterior probabilities. Where do they come from? Under multivariate normality with a common (pooled) covariance matrix, the likelihood for each group is proportional to exp(-D2/2), so normalizing the likelihoods across groups gives the posterior probabilities (here with equal priors):
LK = exp(-0.5*D2[,1:2])
(LK/apply(LK,1,sum))[105:115,]
That was with equal priors, so let's compare to lda (with equal priors) from the library MASS. You do not need to download this library, as it comes packaged with R.
predict(lda(Pop~.,data=Shape.BE2,prior=c(.5,.5)))$post[105:115,]
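The agreement between the "by hand" posteriors and lda's can be demonstrated on simulated data (a sketch with made-up two-group data, not the Howells measurements; group labels "A" and "B" are arbitrary):

```r
library(MASS)
set.seed(1)
n1 <- 40; n2 <- 50; p <- 3
g1 <- matrix(rnorm(n1*p), n1, p)            # group A
g2 <- matrix(rnorm(n2*p, mean = 1), n2, p)  # group B, shifted means
X   <- rbind(g1, g2)
grp <- factor(rep(c("A","B"), c(n1, n2)))
# Pooled covariance (unbiased, divisor n - g)
Sp <- ((n1-1)*cov(g1) + (n2-1)*cov(g2)) / (n1 + n2 - 2)
mu <- rbind(colMeans(g1), colMeans(g2))
# Squared Mahalanobis distances to each centroid
D  <- cbind(mahalanobis(X, mu[1,], Sp), mahalanobis(X, mu[2,], Sp))
LK <- exp(-0.5 * D)              # likelihood kernels
post.manual <- LK / rowSums(LK)  # posteriors under equal priors
post.lda <- predict(lda(X, grp, prior = c(.5, .5)))$posterior
# The two should agree up to numerical error
all.equal(unname(post.manual), unname(post.lda), tolerance = 1e-6)
```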
Now let's make a new case which is a simple average of the Buriat and Easter Island shape centroids:
X=data.frame(t(apply(centroids,2,mean)))
And look at the posterior probabilities:
predict(lda(Pop~.,data=Shape.BE2,prior=c(.5,.5)),newdata=X)$post
We can also get "typicality probabilities" for this new case by referring its squared Mahalanobis distance from each group's centroid (using that group's own covariance matrix) to a chi-squared distribution with eight degrees of freedom:
D1.2 = mahalanobis(X,centroids[1,],var(Shape.BE2[1:109,-1]))
D2.2 = mahalanobis(X,centroids[2,],var(Shape.BE2[110:195,-1]))
1-pchisq(D1.2,8)
1-pchisq(D2.2,8)
But the concept of "typicality" is very non-Bayesian. Instead, let's look at the posterior predictive distributions for the eight craniometrics using OpenBUGS. The model is pretty straightforward:
model{
The dismal part is getting the numbers to "feed" the data list. That can be done in R using dput, as we have been doing elsewhere. Beware the lower-case "e" in exponents: OpenBUGS does not like those, so search and replace with "E."
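The search-and-replace can be automated in R before you paste; a sketch using gsub on a deparsed numeric vector (the values are just illustrative):

```r
# dput/deparse write small numbers in "e" scientific notation, which OpenBUGS rejects
x <- c(0.0008459324, 0.0007712689, 9.92804e-05, -9.47864e-05)
txt <- paste(deparse(x), collapse = "")
# Replace the lower-case exponent marker with the "E" OpenBUGS expects
gsub("e-", "E-", txt)
```

This only touches negative exponents; for positive ones you would also replace "e+" with "E+".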
#data list(Rp=structure(.Data=c(0.0008459324, 0.0007712689, 9.92804E-05, -9.47864E-05,
discrim=read.coda.interactive()
Enter CODA index file name
(or a blank line to exit)