Introduction to Statistical Computing - Using R for Stats

👨‍💻 Eugene Hickey @ Atlantic Technological University 👨‍💻

  • eugene.hickey@tudublin.ie
  • @eugene100hickey
  • github.com/eugene100hickey
  • www.fizzics.ie

Week Two - Statistical Inference

  • Time we did some statistics on our data

  • This ties in nicely with Emma’s half of our course

  • We’ll look at this from three perspectives

    • distributions

    • confidence intervals

    • hypothesis tests

Distributions

  • There are lots of these, but we’ll mention five (and describe two)

    • binomial (discrete)

    • Poisson

    • normal (continuous)

    • Student’s t distribution

    • exponential

Binomial Distribution

  • number of successes in a sequence of trials

  • three quantities (strictly, n and p are the distribution’s parameters; k is the outcome)

    • the number of trials, n

    • the number of successes, k

    • the probability of success in a single trial, p

\[ f(k,n,p)=\Pr(k;n,p)=\Pr(X=k)={\binom {n}{k}}p^{k}(1-p)^{n-k} \]

  • for k = 0, 1, 2, …, n, where

\[ {\binom {n}{k}}={\frac {n!}{k!(n-k)!}} \]

What this looks like:

A rocket launch has a probability of success of 85%. For 100 launches, what is the profile of the number of successes?

How this works in R

  • number of successes (k) called x
  • number of trials (n) called size
  • probability (p) called prob
dbinom(x = 90, size = 100, prob = 0.85)
[1] 0.04435277
dbinom(x = 85, size = 100, prob = 0.85)
[1] 0.111091
dbinom(x = 80, size = 100, prob = 0.85)
[1] 0.04022449
dbinom(x = 70, size = 100, prob = 0.85)
[1] 6.456559e-05
dbinom(x = 60, size = 100, prob = 0.85)
[1] 8.850537e-10
dbinom(x = 50, size = 100, prob = 0.85)
[1] 1.902669e-16

dbinom(), pbinom(), qbinom(), and rbinom()

  • dbinom() calculates the probability of a specific outcome value

  • pbinom() calculates the sum of probabilities less than (or greater than) an outcome value

    • this is the cumulative probability

    • adds up a whole bunch of dbinom()’s

  • qbinom() calculates the number of successes below which there is a certain probability

    • like the inverse function of pbinom()

    • give it a probability and it works out the number of successes threshold

    • q stands for quantile

  • rbinom() generates random numbers of successes from a binomial distribution
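The four functions fit together neatly; a quick sketch using the rocket-launch numbers (n = 100, p = 0.85):

```r
# pbinom() is just a running sum of dbinom() values
sum(dbinom(0:85, size = 100, prob = 0.85))
pbinom(q = 85, size = 100, prob = 0.85)    # identical answer

# qbinom() reverses pbinom(): feed the probability back in, get k back out
qbinom(p = pbinom(85, size = 100, prob = 0.85), size = 100, prob = 0.85)  # 85

# rbinom() draws random success counts from the same distribution
rbinom(n = 5, size = 100, prob = 0.85)
```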

pbinom()

pbinom(q = 1:100, size = 100, prob = 0.85, lower.tail = TRUE)

pbinom(q = 1:100, size = 100, prob = 0.85, lower.tail = FALSE)

  • lower.tail = TRUE means k successes or less (successes \(\le\) threshold, k)

  • lower.tail = FALSE means more than k successes (successes \(>\) threshold, k)

  • worth remembering when to use lower.tail = TRUE/FALSE

Example

  • In a family of eight children, what is the chance of getting at least 7 girls?
dbinom(x = 7, size = 8, prob = 0.5) + dbinom(x = 8, size = 8, prob = 0.5)
[1] 0.03515625
pbinom(q = 6, size = 8, prob = 0.5, lower.tail = FALSE)
[1] 0.03515625
  • Note we had to use k - 1 = 6 in the pbinom() calculation
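Equivalently, the pbinom() answer is just the sum of the two dbinom() terms:

```r
# P(at least 7 girls) = P(7 girls) + P(8 girls)
sum(dbinom(7:8, size = 8, prob = 0.5))                   # 0.03515625
pbinom(q = 6, size = 8, prob = 0.5, lower.tail = FALSE)  # 0.03515625
```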

qbinom()

  • in families of eight children, 30% of them have X girls or less. What is X?

  • qbinom(p = 0.3, size = 8, prob = 0.5, lower.tail = TRUE)

  • gives X = 3

pbinom() versus qbinom()

  • pbinom() gives the probability of observing at most k successes (with lower.tail = TRUE)

    • it returns a probability
  • qbinom() gives the number of successes below which a given probability p lies

    • it returns a number of successes

    • gives us the position of some quantile
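A quick check shows why qbinom() returned 3 in the families example: 3 is the smallest number of successes whose cumulative probability reaches 0.3:

```r
qbinom(p = 0.3, size = 8, prob = 0.5)  # 3
pbinom(q = 2, size = 8, prob = 0.5)    # 0.1445312 - still below 0.3
pbinom(q = 3, size = 8, prob = 0.5)    # 0.3632812 - first to reach 0.3
```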

rbinom()

  • this gives a random number from the distribution

  • take a sample of 100 rocket launches; for this particular sample, perhaps 83 will be successful

  • command is rbinom(n = 1, size = 100, prob = 0.85)

    • n = 1 means just one sample
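Drawing many samples shows the distribution taking shape; the sample mean and standard deviation land close to the theoretical np and \(\sqrt{np(1-p)}\):

```r
set.seed(42)  # seed chosen arbitrarily, for reproducibility
launches <- rbinom(n = 10000, size = 100, prob = 0.85)
mean(launches)  # close to 100 * 0.85 = 85
sd(launches)    # close to sqrt(100 * 0.85 * 0.15), about 3.57
```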

Normal Distribution

  • spread in values of a continuous variable

  • two parameters

    • the mean value, \(\mu\)

    • the standard deviation, \(\sigma\)

\[ f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}} \]

  • \(\mu\) is the mean

  • \(\sigma\) is the standard deviation

What this looks like:

A human liver has an average mass of 1.5 kg with a standard deviation of 0.2 kg.

How this works in R

dnorm(x = 1.0, mean = 1.5, sd = 0.2)
[1] 0.0876415
dnorm(x = 1.2, mean = 1.5, sd = 0.2)
[1] 0.647588
dnorm(x = 1.4, mean = 1.5, sd = 0.2)
[1] 1.760327
dnorm(x = 1.6, mean = 1.5, sd = 0.2)
[1] 1.760327
dnorm(x = 1.8, mean = 1.5, sd = 0.2)
[1] 0.647588
dnorm(x = 2.0, mean = 1.5, sd = 0.2)
[1] 0.0876415

dnorm(), pnorm(), qnorm(), and rnorm()

  • dnorm() calculates the probability density at a specific outcome value

  • pnorm() calculates the sum of probabilities less than (or greater than) an outcome value

    • this is the cumulative probability

    • adds up a whole bunch of dnorm()’s

  • qnorm() calculates the value below which there is a certain probability

    • like the inverse function of pnorm()

    • give it a probability and it works out the value

  • rnorm() generates random values from the normal distribution
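For a continuous distribution, "adding up a whole bunch of dnorm()'s" means integration; pnorm() is the area under the dnorm() curve:

```r
# cumulative probability of a liver mass up to 1.2 kg, two ways
integrate(dnorm, lower = -Inf, upper = 1.2, mean = 1.5, sd = 0.2)$value
pnorm(q = 1.2, mean = 1.5, sd = 0.2)  # both give about 0.0668072
```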

pnorm()

pnorm(q = seq(0, 3, by = 0.01), mean = 1.5, sd = 0.2, lower.tail = TRUE)

pnorm(q = seq(0, 3, by = 0.01), mean = 1.5, sd = 0.2, lower.tail = FALSE)

  • lower.tail = TRUE means values less than

  • lower.tail = FALSE means values more than

  • because this is a continuous distribution we never have to worry about the pesky -1

Examples

  • The concentration of \(PM_{2.5}\) in air has an average value of 12 \(\mu g/m^{3}\) with a standard deviation of 5 \(\mu g/m^{3}\). What is the probability that a given day will exceed the recommended 25 \(\mu g/m^{3}\)?

    • pnorm(q = 25, mean = 12, sd = 5, lower.tail = FALSE)

    • gives 0.47%

  • Using the values above, what is the highest value of \(PM_{2.5}\) we can expect to see once a year?

    • qnorm(p = 1/365.25, mean = 12, sd = 5, lower.tail = FALSE)

    • gives 25.9 \(\mu g/m^{3}\)
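The two answers hang together: feeding qnorm()'s output back into pnorm() recovers a once-a-year probability:

```r
pnorm(q = 25, mean = 12, sd = 5, lower.tail = FALSE)          # 0.004661188
x <- qnorm(p = 1/365.25, mean = 12, sd = 5, lower.tail = FALSE)
x                                                             # about 25.9
pnorm(q = x, mean = 12, sd = 5, lower.tail = FALSE) * 365.25  # 1, i.e. once a year
```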

rnorm()

  • this gives a random number from the normal distribution

  • a random sample of one person’s liver mass

  • command is rnorm(n = 1, mean = 1.5, sd = 0.2)

    • n = 1 means just one sample
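As with rbinom(), a large random sample recovers the parameters we fed in:

```r
set.seed(42)  # seed chosen arbitrarily, for reproducibility
livers <- rnorm(n = 10000, mean = 1.5, sd = 0.2)
mean(livers)  # close to 1.5
sd(livers)    # close to 0.2
```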

Confidence Intervals

  • we rarely know population statistics about our data

  • more common to have to infer them from sample statistics

  • if we don’t know the real standard deviation

    • estimate it from the sample

    • replace the normal distribution by its little cousin, the t-distribution

    • t-distribution is more spread out (more uncertain) than the normal

    • as sample size gets bigger the t converges to the normal
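The convergence is easy to see by comparing 97.5% quantiles across degrees of freedom:

```r
qt(p = 0.975, df = 5)     # 2.570582 - wide
qt(p = 0.975, df = 30)    # 2.042272 - narrower
qt(p = 0.975, df = 1000)  # 1.962339 - nearly normal
qnorm(p = 0.975)          # 1.959964 - the normal's value
```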

Confidence Intervals

  • imagine taking a bunch of different samples from a population

  • each one will have a slightly different mean

  • for one sample, its mean is our best guess as to the true population mean

  • but also need to express our degree of certainty (or otherwise) in this guess

  • this gives the confidence interval

Confidence Interval Imagined

Confidence Intervals in R

  • function t.test() is handy for confidence interval calculation

  • in fact it’s a bit of a Swiss army knife: confidence intervals, hypothesis testing, …

  • give it a set of values and it’ll output the confidence interval for their mean

  • t.test(airquality$Wind)

  • t.test(airquality$Wind)$conf.int


    One Sample t-test

data:  airquality$Wind
t = 34.961, df = 152, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  9.394804 10.520229
sample estimates:
mean of x 
 9.957516 
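Behind the scenes, the interval t.test() reports is just mean ± t-critical × standard error; a sketch reproducing it by hand:

```r
x <- airquality$Wind              # airquality ships with base R
n <- length(x)                    # 153 observations, so df = 152
se <- sd(x) / sqrt(n)             # standard error of the mean
t_crit <- qt(p = 0.975, df = n - 1)
mean(x) + c(-1, 1) * t_crit * se  # 9.394804 10.520229, matching t.test()
```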

Hypothesis Testing

  • do two samples have different means?

  • consider gapminder dataset

    • was life expectancy in 2007 higher in Southern Europe than Northern Europe?
library(dplyr)
z <- dslabs::gapminder |> 
  filter(year == 2007, region %in% c("Northern Europe", "Southern Europe"))
t.test(life_expectancy ~ region, data = z)

    Welch Two Sample t-test

data:  life_expectancy by region
t = -0.10871, df = 14.573, p-value = 0.9149
alternative hypothesis: true difference in means between group Northern Europe and group Southern Europe is not equal to 0
95 percent confidence interval:
 -3.374106  3.047439
sample estimates:
mean in group Northern Europe mean in group Southern Europe 
                     77.67000                      77.83333 

Paired T-tests

Ten mice were placed on a high fat diet. Their weights were recorded before and afterwards. Is there a significant weight gain?

mouse  before  after
    1   265.6  274.6
    2    84.4  100.2
    3   196.7  206.0
    4   187.7  194.0
    5   199.0  203.9
    6   235.7  247.9
    7   259.2  262.9
    8   172.3  172.8
    9   188.2  198.1
   10   241.1  248.0

Graphically

With Paired t-test

before <- c(265.6, 84.4, 196.7, 187.7, 199.0, 235.7, 259.2, 172.3, 188.2, 241.1)
after <- c(274.6, 100.2, 206.0, 194.0, 203.9, 247.9, 262.9, 172.8, 198.1, 248.0)
t.test(x = before, y = after)

    Welch Two Sample t-test

data:  before and after
t = -0.33694, df = 17.986, p-value = 0.7401
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -56.79972  41.09972
sample estimates:
mean of x mean of y 
   202.99    210.84 
t.test(x = before, y = after, paired = TRUE)

    Paired t-test

data:  before and after
t = -5.6619, df = 9, p-value = 0.0003089
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -10.986396  -4.713604
sample estimates:
mean difference 
          -7.85 
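A paired t-test is really a one-sample t-test on the before-after differences, which is why the pairing of the mice matters so much:

```r
# the mouse weights, paired up by subtraction
diffs <- c(265.6, 84.4, 196.7, 187.7, 199.0, 235.7, 259.2, 172.3, 188.2, 241.1) -
         c(274.6, 100.2, 206.0, 194.0, 203.9, 247.9, 262.9, 172.8, 198.1, 248.0)
t.test(diffs)  # same t = -5.6619, df = 9, p-value = 0.0003089 as the paired test
```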

Workshop - Week Two

Perform the Following Tasks:

  1. Human gestation has an average period of 266 days with a standard deviation of 16 days. What is the probability that a baby will arrive exactly on their due date? Assume normal distribution. [2.5%]

  2. Make a 95% confidence interval for animal weight for the bison recorded in the knz_bison dataset. [742.16 - 756.74]

  3. Perform a hypothesis test that female and male bison have the same mean weight. [ \(p_{value} = 7 \times 10^{-5} << 0.05\)]

Assignments - Week Two

  1. Complete week two moodle quiz

  2. Complete swirl() exercises

  • install.packages("swirl")

  • library(swirl)

  • install_course("Statistical Inference")

  • swirl()

  • choose course Statistical Inference

  • do the three exercises 1 (Introduction), 9 (T Confidence Intervals), and 10 (Hypothesis Testing)

  • email a screen shot of the end of each lesson to eugene.hickey@associate.atu.ie

  • it’ll look a bit like the screen capture here