Fun with summary multivariate statistics, a probability density function, and a little bit of calculus

One can do quite a bit with multivariate summary statistics, but that depends on the author giving the intrepid reader all of the requisite summary statistics.  I have been on this particular soapbox since at least 1991:

Konigsberg LW. 1991. An historical note on the t-test for differences in sexual dimorphism between populations. Am J Phys Anthropol 84:93-96.

So let's turn back the time machine to the late 80s (some pretty awful music back then, I can tell you) and use R to look at sexual dimorphism in LDL and apoB levels in baboons on a regular chow diet versus a "challenge diet" (called "pink" because loading the monkey chow with cholesterol turned it pink).  I can guarantee you that I do not have the raw data anymore, but I did when I wrote the 1991 AJPA "Notes and Comments."

So paste the requisite summary stats into "R":

baboons = structure(list(R.res = structure(c(1, 0.487002, 0.681163, 0.476761,
0.487002, 1, 0.326327, 0.804516, 0.681163, 0.326327, 1, 0.399609,
0.476761, 0.804516, 0.399609, 1), .Dim = c(4L, 4L)), M.mu = structure(c(46.4487,
43.2333, 40.9091, 97.4231, 106.1, 87.5454, 39.3833, 37.6755,
35.9909, 57.959, 58.1111, 47.8772), .Dim = 3:4, .Dimnames = list(
    c("anubis", "hybrid", "cyno"), c("LDLChow", "LDLpink", "ApoBChow",
    "ApoBpink"))), F.mu = structure(c(55.0232, 48.5135, 48.5227,
105.749, 101.9189, 114.7045, 49.0575, 42.9811, 42.1795, 67.7031,
59.3252, 58.4773), .Dim = 3:4, .Dimnames = list(c("anubis", "hybrid",
"cyno"), c("LDLChow", "LDLpink", "ApoBChow", "ApoBpink"))), m = c(78,
90, 22), f = c(259, 111, 44), M.sdev = structure(c(17.1983, 17.4749,
20.8599, 41.4364, 54.5311, 40.6644, 13.9573, 12.9743, 16.0057,
21.8897, 22.4638, 16.8267), .Dim = 3:4), F.sdev = structure(c(17.1913,
20.7596, 23.6992, 44.1428, 51.2467, 65.2397, 14.788, 13.9412,
13.4013, 21.459, 20.8992, 24.0234), .Dim = 3:4)), .Names = c("R.res",
"M.mu", "F.mu", "m", "f", "M.sdev", "F.sdev"))

And list it to the screen so we can see what this stuff is and understand the use of the "$" symbol within a list.

baboons

$R.res
         [,1]     [,2]     [,3]     [,4]
[1,] 1.000000 0.487002 0.681163 0.476761
[2,] 0.487002 1.000000 0.326327 0.804516
[3,] 0.681163 0.326327 1.000000 0.399609
[4,] 0.476761 0.804516 0.399609 1.000000

$M.mu
       LDLChow  LDLpink ApoBChow ApoBpink
anubis 46.4487  97.4231  39.3833  57.9590
hybrid 43.2333 106.1000  37.6755  58.1111
cyno   40.9091  87.5454  35.9909  47.8772

$F.mu
       LDLChow  LDLpink ApoBChow ApoBpink
anubis 55.0232 105.7490  49.0575  67.7031
hybrid 48.5135 101.9189  42.9811  59.3252
cyno   48.5227 114.7045  42.1795  58.4773

$m
[1] 78 90 22

$f
[1] 259 111  44

$M.sdev
        [,1]    [,2]    [,3]    [,4]
[1,] 17.1983 41.4364 13.9573 21.8897
[2,] 17.4749 54.5311 12.9743 22.4638
[3,] 20.8599 40.6644 16.0057 16.8267

$F.sdev
        [,1]    [,2]    [,3]    [,4]
[1,] 17.1913 44.1428 14.7880 21.4590
[2,] 20.7596 51.2467 13.9412 20.8992
[3,] 23.6992 65.2397 13.4013 24.0234
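An aside before the script (my addition, not part of the original analysis): the key trick in the script is rebuilding sums of squares and cross-products from a correlation matrix and standard deviations, which works because a covariance matrix is just D R D, where D is a diagonal matrix of standard deviations.  A toy example with made-up values:

```r
# Covariance from correlation: Sigma = D %*% R %*% D,
# where D is a diagonal matrix of standard deviations (toy values)
R <- matrix(c(1, 0.5, 0.5, 1), 2, 2)   # toy correlation matrix
sdev <- c(2, 3)                        # toy standard deviations
D <- diag(sdev)
Sigma <- D %*% R %*% D
Sigma
#      [,1] [,2]
# [1,]    4    3
# [2,]    3    9
```

The script does the same thing at the SSCP scale with T %*% R %*% T.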

And a script to do the analysis (that was 1000+ lines of Fortran code back in the day, though thankfully I didn't have to write most of it).

Kberg91 = function (dat=baboons)
{
  # Unpack the summary statistics
  R = dat$R.res                         # pooled within-group correlation matrix
  M = dat$M.mu; F = dat$F.mu            # male and female mean matrices
  nM = dat$m; nF = dat$f                # male and female sample sizes
  M.sd = dat$M.sdev; F.sd = dat$F.sdev  # male and female standard deviations
  p = NROW(R)                           # number of variables
  r = NROW(M)                           # number of groups (subspecies)
  o.p = rep(1,p)
  o.r = rep(1,r)
  J = matrix(1,nrow=r,ncol=r)
  d = M-F                               # male-minus-female mean differences
  N = sum(nM)+sum(nF)                   # total sample size
  w = (nM*nF)/(nM+nF)                   # harmonic-mean weights

  # SSCP matrices for the sex effect and the sex-by-group interaction
  SSCPsex = t(d)%*%w%*%t(w)%*%d/as.numeric(t(o.r)%*%w)
  SSCPi = t(d)%*%(w%*%t(o.p)*d)-SSCPsex

  # Rebuild the error SSCP from the correlation matrix and the
  # standard deviations (note that T masks TRUE inside this function)
  T = diag(sqrt(apply((nM-1)%o%o.p*M.sd^2 + (nF-1)%o%o.p*F.sd^2,2,sum)))
  SSCPe = T%*%R%*%T

  # SSCP matrix for differences among groups (subspecies)
  Xm = nM%*%t(o.p)*M
  Xf = nF%*%t(o.p)*F
  SSCPsamp = t(Xm)%*%M + t(Xf)%*%F - t(Xm)%*%J%*%Xm/sum(nM) -
             t(Xf)%*%J%*%Xf/sum(nF) - SSCPi

  # Wilks' Lambda for the interaction, sex, and subspecies effects
  Lambda = det(SSCPe)/det(SSCPi+SSCPe)
  Lambda[2] = det(SSCPi+SSCPe)/det(SSCPsex+SSCPi+SSCPe)
  Lambda[3] = det(SSCPi+SSCPe)/det(SSCPsamp+SSCPi+SSCPe)

  # Hypothesis and error degrees of freedom
  vh = r - 1
  vh[2] = 1
  vh[3] = vh[1]
  ve = N - 2*r
  ve[2] = N - 2*(r-1)
  ve[3] = ve[2]

  # Rao's F approximation to Wilks' Lambda
  m = ve + vh - (p + vh + 1)/2
  s = sqrt(((p*vh)^2-4)/(p^2+vh^2-5))
  df1 = p*vh
  df2 = m*s - p*vh/2 + 1
  Rao.R = (1-Lambda^(1/s))/(Lambda^(1/s))*(m*s-p*vh/2+1)/(p*vh)
  p = 1-pf(Rao.R,df1,df2)

  out = cbind(round(Lambda,4),round(Rao.R,4),round(df1,0),
              round(df2,0),round(p,4))
  rownames(out) = c('Sex*Subspec','Sex','Subsp')
  colnames(out) = c("Wilks' Lambda","Rao's R",'df1','df2','Prob')
  out
}

Kberg91()
            Wilks' Lambda Rao's R df1  df2   Prob
Sex*Subspec        0.9820  1.3581   8 1190 0.2108
Sex                0.9348 10.4061   4  597 0.0000
Subsp              0.9060  7.5486   8 1194 0.0000

Which shows that there is significant sexual dimorphism and differences between the subspecies, but no sex-by-subspecies interaction.

Here's the published results from AJPA in 1991:

[image: table of results as published in Konigsberg (1991)]


The simplest of pdfs, and some calculus

The harsh (?) reality is that most of us are more comfy-cozy with continuous density functions, also known as "probability density functions" or just pdfs (which is confusing, what with Adobe Acrobat having subverted the acronym).  And our favorite pdf is probably (intentional pun) the normal pdf.  But that is a two-parameter pdf (mean and variance), so let's instead start simple and spend some quality time with the uniform pdf.

...and what is that "and some calculus" bit about?  Not to worry.  "R" and "SymPy Gamma" will be doing the heavy lifting for us.  I generally use wxMaxima for all my calculus and algebraic needs, but SymPy Gamma is a quick on-line way to go.


The uniform (rectangular) pdf

In its general setting, the uniform pdf is actually a two-parameter problem.  The parameters are usually labeled "a" and "b," and the main constraint on them is that a<b.  There are the additional constraints that a>-Inf ("-Inf" is R-ese for negative infinity) and b<Inf.  We will make our lives simpler and assume that a=0, so that the only parameter is b.  This may sound like a very boring distribution, but it is often used as a "diffuse prior" for a standard deviation.  It is also used a lot in the Bayesian analysis of (calibrating) radiocarbon dates.

I have sometimes walked into defenses with the bald-faced claim that I have an age-at-death estimation method that never fails, which is to say that the age was between 0 and 120 years.  So let's take that as our first uniform distribution.

R has the uniform built in as one of many pdfs.  So let's first get some info:

 ?dunif

Uniform {stats}R Documentation

The Uniform Distribution

Description

These functions provide information about the uniform distribution on the interval from min to max. dunif gives the density, punif gives the distribution function, qunif gives the quantile function, and runif generates random deviates.

Usage

dunif(x, min = 0, max = 1, log = FALSE)
punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
runif(n, min = 0, max = 1)
.
.
.

So what is the median of our ~U(0,120) pdf?  Another way of asking this is: what is the 50th percentile value?  It should be smack dab in the middle, so age 60, but we can use "qunif" to check this (and get the 25th and 75th percentiles as well):

 qunif(p=c(.25,.5,.75),0,120)
[1] 30 60 90
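That works because the quantile function of U(a,b) has the simple closed form a + p(b - a); a quick check of that claim (my addition):

```r
# Quantiles of U(a, b) are just a + p*(b - a)
a <- 0; b <- 120
p <- c(0.25, 0.5, 0.75)
a + p*(b - a)         # 30 60 90
qunif(p, a, b)        # same answer from R
```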

So what is the mean of our ~U(0,120) pdf?  It is also 60, which we can confirm by integrating x times the density:

mu.unif = function (x)
{
x*dunif(x,0,120)
}

integrate(mu.unif,0,120)

60 with absolute error < 6.7e-13
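Analytically, the mean of U(a,b) is (a+b)/2, so for a=0 and b=120 the answer of 60 drops right out.  A quick sketch (my addition) comparing the closed form to the numerical integral:

```r
# Mean of U(0, 120): closed form (a + b)/2 versus numerical integration
mu.unif <- function(x) x * dunif(x, 0, 120)
(0 + 120)/2                        # closed form: 60
integrate(mu.unif, 0, 120)$value   # numerical answer, also 60
```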

So we have seen qunif and dunif.  Speaking of which, R told us in the help above that dunif was the (probability) density (function).  In symbols, this is usually written as f(x).  So what is f(10), f(34.34), f(0), f(120), f(121), etc.?

 dunif(10,0,120)

[1] 0.008333333

 
1/120
[1] 0.008333333
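As for the rest of those values: the uniform density is the same 1/120 everywhere on [0, 120] (endpoints included) and zero outside the support.  A quick check (my addition):

```r
# f(x) = 1/120 everywhere on [0, 120], and zero outside the support
dunif(c(10, 34.34, 0, 120), 0, 120)  # all equal to 1/120
dunif(c(-1, 121), 0, 120)            # both 0
```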

And what is runif?  It generates random deviates; a density-scaled histogram of 10,000 draws should sit close to the flat line at 1/120:

 hist(runif(10000,0,120),pr=T,ylim=c(0,0.01))
 abline(h=1/120)

And punif?  It is the cumulative distribution function, also known simply as the distribution function.  It is the "p-value" that you have butted heads (but not tails?) with in stats classes.  In symbols, it is usually written as F(x).

 punif(0,0,120)
[1] 0
 punif(60,0,120)
[1] 0.5
 punif(90,0,120)
[1] 0.75


But wait, there's more!  You can also use punif to find the complement, which is sometimes referred to as the survivorship function.  In symbols, this is usually written as S(x), and it is equal to 1 - F(x):

plot(0:120,punif(0:120,0,120,lower=F),type='l')

But wait, there's more, and this one, believe it or not, is not in R: the hazard function.  From hazards analysis we have f(x) = S(x)h(x), which, if you know your abridged life table notation, is the continuous analog of d(x) = l(x)q(x), where d(x) is the proportion of deaths in the age interval, l(x) is survivorship to the beginning of the interval, and q(x) is the age-specific probability of death for the interval.  So h(x) is sort of like a probability, except that it can be greater than 1.0 (it is a "hazard rate").  Solving for h(x) = f(x)/S(x), we can plot it using dunif divided by 1 - punif.

plot(0:120,dunif(0:120,0,120)/(1-punif(0:120,0,120)),type='l')

From hazards analysis, h(x) is also the negative of the first derivative of log survivorship, which turns out to be 1/(120-x).  We can get this from "SymPy Gamma": copy diff(log(1-x/120),x) and paste it into SymPy Gamma.  That gives the derivative of log survivorship; negating it gives the hazard: 1/(120-x).

In wxMaxima it looks like so:

[image: wxMaxima derivative calculation]

lines(0:120,1/(120-(0:120)),col='red')
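And a numerical sanity check (my addition) that dunif/(1 - punif) really does equal 1/(120 - x):

```r
# Hazard of U(0, 120): f(x)/S(x) versus the closed form 1/(120 - x)
x <- seq(0, 119)
h.num  <- dunif(x, 0, 120)/(1 - punif(x, 0, 120))
h.anal <- 1/(120 - x)
max(abs(h.num - h.anal))   # essentially zero
```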

Out of curiosity, what is the variance of U(0,120)?

var.unif = function (x)
{
  # integrand for the variance: (x - mean)^2 * f(x), with mean 60 and f(x) = 1/120
  (x-60)^2*(1/120)
}


integrate(var.unif,0,120)

1200 with absolute error < 1.3e-11


And we can go back to SymPy Gamma to get the analytical form: copy and paste integrate((x-b/2)^2/b,(x,0,b)), or use wxMaxima to find that the variance is b^2/12.  And here's wxMaxima:

[image: wxMaxima variance calculation]
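Plugging b = 120 into b^2/12 reproduces the numerical answer, and the integral definition of the variance agrees (quick check, my addition):

```r
# Variance of U(0, b) is b^2/12; for b = 120 that is 1200
b <- 120
b^2/12
# check against the integral definition, using dunif for the density
integrate(function(x) (x - b/2)^2 * dunif(x, 0, b), 0, b)$value
```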
One really nice thing about wxMaxima is that you can export the results as text that R will understand.  For example, here are the partial derivatives of the Gompertz survival function:

If the Gompertz survivorship is written as:

exp(a/b*(1-exp(b*t)))

Then the partial derivative with respect to b is:
(-(a*t*exp(b*t))/b-(a*(1-exp(b*t)))/b^2)*exp((a*(1-exp(b*t)))/b)

and with respect to a is:
((1-exp(b*t))*exp((a*(1-exp(b*t)))/b))/b
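As a check (mine, not from the original post), the exported expression for the partial with respect to b can be compared against a central finite difference in R; the parameter values below are purely illustrative:

```r
# Gompertz survivorship and its partial derivative with respect to b
# (the wxMaxima export), checked by a central finite difference
S <- function(a, b, t) exp(a/b*(1 - exp(b*t)))
dS.db <- function(a, b, t)
  (-(a*t*exp(b*t))/b - (a*(1 - exp(b*t)))/b^2)*exp((a*(1 - exp(b*t)))/b)
a <- 0.01; b <- 0.1; t <- 50      # illustrative parameter values
fd <- (S(a, b + 1e-6, t) - S(a, b - 1e-6, t))/(2e-6)
c(analytic = dS.db(a, b, t), finite.diff = fd)   # these agree
```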

But let's get back to the uniform pdf.  The information above suggests some "fun with numbers" to make an inefficient standard random normal deviate generator.  A U(0,1) deviate has mean 1/2 and variance 1/12, so the sum of 12 of them has mean 6 and variance 1; subtract 6 and, by the central limit theorem, you have an approximately standard normal deviate.

my.rnorm = function (N=25000)
{
  set.seed(123456)
  z=vector()
  # sum of 12 U(0,1) deviates has mean 6 and variance 1,
  # so subtracting 6 gives an approximate N(0,1) deviate
  for(i in 1:N) z[i]=sum(runif(12))-6
  return(z)
}

z=my.rnorm()
hist(z,nc=50,pr=T); lines(seq(-4,4,.01),dnorm(seq(-4,4,.01)))
if(.Platform$OS.type=="windows") {
quartz<-function() windows()
}

quartz()
plot(density(z));
lines(seq(-4,4,.01),dnorm(seq(-4,4,.01)),col='red')
quartz()
plot(ecdf(z)); lines(seq(-4,4,.01),pnorm(seq(-4,4,.01)),col='red')
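And a quick numerical check (my addition) that the construction really does deliver a mean near 0 and a standard deviation near 1:

```r
# Moments of the 12-uniform approximate normal generator
set.seed(123456)
z <- replicate(25000, sum(runif(12)) - 6)
c(mean = mean(z), sd = sd(z))   # close to 0 and 1
```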

So are a (probability) density and a probability the same thing?  Well, first, what is a probability?  There is both the frequency definition (as one uses for allele frequencies) and the subjective definition (as one uses in transmission genetics), but either way probabilities are bounded by 0.0 and 1.0 (inclusive).  Probability densities, on the other hand, can be greater than 1.0; the only requirements are that a density be non-negative and integrate to 1.0 across its admissible range.

dnorm(0)
[1] 0.3989423
dnorm(0,0,.1)
[1] 3.989423
integrate(dnorm,mean=0,sd=1,-Inf,Inf)
1 with absolute error < 9.4e-05
integrate(dnorm,mean=0,sd=.01,-Inf,Inf)
1 with absolute error < 3.1e-08
dunif(.25,0,.5)
[1] 2

And where can I find more pdfs?  Try vol. 2 of Darryl Holman's documentation for mle (reference).