Introduction to Statistical Computing - Using R for Stats

👨‍💻 Eugene Hickey @ Atlantic Technological University 👨‍💻

  • eugene.hickey@tudublin.ie
  • @eugene100hickey
  • github.com/eugene100hickey
  • www.fizzics.ie

Week Two - Statistical Inference

  • Time we did some statistics on our data

  • This ties in nicely with Emma’s half of our course

  • We’ll look at this from three perspectives

    • distributions

    • confidence intervals

    • hypothesis tests

Distributions

  • There are lots of these, but we’ll mention five (and describe two)

    • binomial (discrete)

    • Poisson

    • normal (continuous)

    • Student’s t distribution

    • exponential

Binomial Distribution

  • number of successes in a sequence of trials

  • three quantities (strictly, n and p are the distribution’s parameters; k is the outcome)

    • the number of trials, n

    • the number of successes, k

    • the probability of success in a single trial, p

\[ f(k,n,p)=\Pr(k;n,p)=\Pr(X=k)={\binom {n}{k}}p^{k}(1-p)^{n-k} \]

  • for k = 0, 1, 2, …, n, where

\[ {\binom {n}{k}}={\frac {n!}{k!(n-k)!}} \]

What this looks like:

A rocket launch has a probability of success of 85%. For 100 launches, what is the profile of the number of successes?

How this works in R

  • number of successes (k) called x
  • number of trials (n) called size
  • probability (p) called prob
dbinom(x = 90, size = 100, prob = 0.85)
[1] 0.04435277
dbinom(x = 85, size = 100, prob = 0.85)
[1] 0.111091
dbinom(x = 80, size = 100, prob = 0.85)
[1] 0.04022449
dbinom(x = 70, size = 100, prob = 0.85)
[1] 6.456559e-05
dbinom(x = 60, size = 100, prob = 0.85)
[1] 8.850537e-10
dbinom(x = 50, size = 100, prob = 0.85)
[1] 1.902669e-16

dbinom(), pbinom(), qbinom(), and rbinom()

  • dbinom() calculates the probability of a specific outcome value

  • pbinom() calculates the sum of probabilities less than (or greater than) an outcome value

    • this is the cumulative probability

    • adds up a whole bunch of dbinom()’s

  • qbinom() calculates the number of successes below which there is a certain probability

    • like the inverse function of pbinom()

    • give it a probability and it works out the number of successes threshold

    • q stands for quantile

  • rbinom() generates random numbers of successes from a binomial distribution
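The four functions fit together neatly; a quick sketch using the rocket-launch numbers (n = 100, p = 0.85):

```r
# pbinom() is just a running sum of dbinom() values
sum(dbinom(0:85, size = 100, prob = 0.85))
pbinom(q = 85, size = 100, prob = 0.85)    # identical answer

# qbinom() reverses pbinom(): feed the probability back in, get k back out
qbinom(p = pbinom(85, size = 100, prob = 0.85), size = 100, prob = 0.85)  # 85

# rbinom() draws random success counts from the same distribution
rbinom(n = 5, size = 100, prob = 0.85)
```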

pbinom()

pbinom(q = 1:100, size = 100, prob = 0.85, lower.tail = TRUE)

pbinom(q = 1:100, size = 100, prob = 0.85, lower.tail = FALSE)

  • lower.tail = TRUE means k successes or less (successes \(\le\) threshold, k)

  • lower.tail = FALSE means more than k successes (successes \(>\) threshold, k)

  • worth remembering when to use lower.tail = TRUE/FALSE

Example

  • In a family of eight children, what is the chance of getting at least 7 girls?
dbinom(x = 7, size = 8, prob = 0.5) + dbinom(x = 8, size = 8, prob = 0.5)
[1] 0.03515625
pbinom(q = 6, size = 8, prob = 0.5, lower.tail = FALSE)
[1] 0.03515625
  • Note we had to use k - 1 = 6 in the pbinom() calculation
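Equivalently, the pbinom() answer is just the sum of the two dbinom() terms:

```r
# P(at least 7 girls) = P(7 girls) + P(8 girls)
sum(dbinom(7:8, size = 8, prob = 0.5))                   # 0.03515625
pbinom(q = 6, size = 8, prob = 0.5, lower.tail = FALSE)  # 0.03515625
```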

qbinom()

  • in families of eight children, 30% of them have X girls or less. What is X?

  • qbinom(p = 0.3, size = 8, prob = 0.5, lower.tail = TRUE)

  • gives X = 3

pbinom() versus qbinom()

  • pbinom() gives the probability of observing at most k successes (with lower.tail = TRUE)

    • it returns a probability
  • qbinom() gives the number of successes below which a given probability p lies

    • it returns a number of successes

    • gives us the position of some quantile
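A quick check shows why qbinom() returned 3 in the families example: 3 is the smallest number of successes whose cumulative probability reaches 0.3:

```r
qbinom(p = 0.3, size = 8, prob = 0.5)  # 3
pbinom(q = 2, size = 8, prob = 0.5)    # 0.1445312 - still below 0.3
pbinom(q = 3, size = 8, prob = 0.5)    # 0.3632812 - first to reach 0.3
```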

rbinom()

  • this gives a random number from the distribution

  • take a sample of 100 rocket launches; for this particular sample, perhaps 83 will be successful

  • command is rbinom(n = 1, size = 100, prob = 0.85)

    • n = 1 means just one sample
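Drawing many samples shows the distribution taking shape; the sample mean and standard deviation land close to the theoretical np and \(\sqrt{np(1-p)}\):

```r
set.seed(42)  # seed chosen arbitrarily, for reproducibility
launches <- rbinom(n = 10000, size = 100, prob = 0.85)
mean(launches)  # close to 100 * 0.85 = 85
sd(launches)    # close to sqrt(100 * 0.85 * 0.15), about 3.57
```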

Normal Distribution

  • spread in values of a continuous variable

  • two parameters

    • the mean value, \(\mu\)

    • the standard deviation, \(\sigma\)

\[ f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}} \]

  • \(\mu\) is the mean

  • \(\sigma\) is the standard deviation

What this looks like:

A human liver has an average mass of 1.5 kg with a standard deviation of 0.2 kg.

How this works in R

dnorm(x = 1.0, mean = 1.5, sd = 0.2)
[1] 0.0876415
dnorm(x = 1.2, mean = 1.5, sd = 0.2)
[1] 0.647588
dnorm(x = 1.4, mean = 1.5, sd = 0.2)
[1] 1.760327
dnorm(x = 1.6, mean = 1.5, sd = 0.2)
[1] 1.760327
dnorm(x = 1.8, mean = 1.5, sd = 0.2)
[1] 0.647588
dnorm(x = 2.0, mean = 1.5, sd = 0.2)
[1] 0.0876415

dnorm(), pnorm(), qnorm(), and rnorm()

  • dnorm() calculates the probability density at a specific outcome value

  • pnorm() calculates the sum of probabilities less than (or greater than) an outcome value

    • this is the cumulative probability

    • adds up a whole bunch of dnorm()’s

  • qnorm() calculates the value below which there is a certain probability

    • like the inverse function of pnorm()

    • give it a probability and it works out the value

  • rnorm() generates random values from the normal distribution
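For a continuous distribution, "adding up a whole bunch of dnorm()'s" means integration; pnorm() is the area under the dnorm() curve:

```r
# cumulative probability of a liver mass up to 1.2 kg, two ways
integrate(dnorm, lower = -Inf, upper = 1.2, mean = 1.5, sd = 0.2)$value
pnorm(q = 1.2, mean = 1.5, sd = 0.2)  # both give about 0.0668072
```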

pnorm()

pnorm(q = seq(0, 3, by = 0.01), mean = 1.5, sd = 0.2, lower.tail = TRUE)

pnorm(q = seq(0, 3, by = 0.01), mean = 1.5, sd = 0.2, lower.tail = FALSE)

  • lower.tail = TRUE means values less than

  • lower.tail = FALSE means values more than

  • because this is a continuous distribution we never have to worry about the pesky -1

Examples

  • The concentration of \(PM_{2.5}\) in air has an average value of 12 \(\mu g/m^{3}\) with a standard deviation of 5 \(\mu g/m^{3}\). What is the probability that a given day will exceed the recommended 25 \(\mu g/m^{3}\)?

    • pnorm(q = 25, mean = 12, sd = 5, lower.tail = FALSE)

    • gives 0.47%

  • Using the values above, what is the highest value of \(PM_{2.5}\) we can expect to see once a year?

    • qnorm(p = 1/365.25, mean = 12, sd = 5, lower.tail = FALSE)

    • gives 25.9 \(\mu g/m^{3}\)
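The two answers hang together: feeding qnorm()'s output back into pnorm() recovers a once-a-year probability:

```r
pnorm(q = 25, mean = 12, sd = 5, lower.tail = FALSE)          # 0.004661188
x <- qnorm(p = 1/365.25, mean = 12, sd = 5, lower.tail = FALSE)
x                                                             # about 25.9
pnorm(q = x, mean = 12, sd = 5, lower.tail = FALSE) * 365.25  # 1, i.e. once a year
```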

rnorm()

  • this gives a random number from the normal distribution

  • a random sample of one person’s liver mass

  • command is rnorm(n = 1, mean = 1.5, sd = 0.2)

    • n = 1 means just one sample
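As with rbinom(), a large random sample recovers the parameters we fed in:

```r
set.seed(42)  # seed chosen arbitrarily, for reproducibility
livers <- rnorm(n = 10000, mean = 1.5, sd = 0.2)
mean(livers)  # close to 1.5
sd(livers)    # close to 0.2
```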

Confidence Intervals

  • we rarely know population statistics about our data

  • more common to have to infer them from sample statistics

  • if we don’t know the real standard deviation

    • estimate it from the sample

    • replace the normal distribution by its little cousin, the t-distribution

    • t-distribution is more spread out (more uncertain) than the normal

    • as sample size gets bigger the t converges to the normal
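The convergence is easy to see by comparing 97.5% quantiles across degrees of freedom:

```r
qt(p = 0.975, df = 5)     # 2.570582 - wide
qt(p = 0.975, df = 30)    # 2.042272 - narrower
qt(p = 0.975, df = 1000)  # 1.962339 - nearly normal
qnorm(p = 0.975)          # 1.959964 - the normal's value
```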

Confidence Intervals

  • imagine taking a bunch of different samples from a population

  • each one will have a slightly different mean

  • for one sample, its mean is our best guess as to the true population mean

  • but also need to express our degree of certainty (or otherwise) in this guess

  • this gives the confidence interval

Confidence Interval Imagined

Confidence Intervals in R

  • function t.test() is handy for confidence interval calculation

  • in fact it’s a bit of a Swiss army knife: confidence intervals, hypothesis testing, …

  • give it a set of values and it’ll output the confidence interval for their mean

  • t.test(airquality$Wind)

  • t.test(airquality$Wind)$conf.int


    One Sample t-test

data:  airquality$Wind
t = 34.961, df = 152, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  9.394804 10.520229
sample estimates:
mean of x 
 9.957516 
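Behind the scenes, the interval t.test() reports is just mean ± t-critical × standard error; a sketch reproducing it by hand:

```r
x <- airquality$Wind              # airquality ships with base R
n <- length(x)                    # 153 observations, so df = 152
se <- sd(x) / sqrt(n)             # standard error of the mean
t_crit <- qt(p = 0.975, df = n - 1)
mean(x) + c(-1, 1) * t_crit * se  # 9.394804 10.520229, matching t.test()
```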

Hypothesis Testing

  • do two samples have different means?

  • consider gapminder dataset

    • was life expectancy in 2007 higher in Southern Europe than Northern Europe?
library(dplyr)
z <- dslabs::gapminder |> 
  filter(year == 2007, region %in% c("Northern Europe", "Southern Europe"))
t.test(life_expectancy ~ region, data = z)

    Welch Two Sample t-test

data:  life_expectancy by region
t = -0.10871, df = 14.573, p-value = 0.9149
alternative hypothesis: true difference in means between group Northern Europe and group Southern Europe is not equal to 0
95 percent confidence interval:
 -3.374106  3.047439
sample estimates:
mean in group Northern Europe mean in group Southern Europe 
                     77.67000                      77.83333 

Paired T-tests

Ten mice were placed on a high fat diet. Their weights were recorded before and afterwards. Is there a significant weight gain?

mouse  before  after
    1   265.6  274.6
    2    84.4  100.2
    3   196.7  206.0
    4   187.7  194.0
    5   199.0  203.9
    6   235.7  247.9
    7   259.2  262.9
    8   172.3  172.8
    9   188.2  198.1
   10   241.1  248.0

Graphically

With Paired t-test

before <- c(265.6, 84.4, 196.7, 187.7, 199.0, 235.7, 259.2, 172.3, 188.2, 241.1)
after <- c(274.6, 100.2, 206.0, 194.0, 203.9, 247.9, 262.9, 172.8, 198.1, 248.0)
t.test(x = before, y = after)

    Welch Two Sample t-test

data:  before and after
t = -0.33694, df = 17.986, p-value = 0.7401
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -56.79972  41.09972
sample estimates:
mean of x mean of y 
   202.99    210.84 
t.test(x = before, y = after, paired = TRUE)

    Paired t-test

data:  before and after
t = -5.6619, df = 9, p-value = 0.0003089
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -10.986396  -4.713604
sample estimates:
mean difference 
          -7.85 
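A paired t-test is really a one-sample t-test on the before-after differences, which is why the pairing of the mice matters so much:

```r
# the mouse weights, paired up by subtraction
diffs <- c(265.6, 84.4, 196.7, 187.7, 199.0, 235.7, 259.2, 172.3, 188.2, 241.1) -
         c(274.6, 100.2, 206.0, 194.0, 203.9, 247.9, 262.9, 172.8, 198.1, 248.0)
t.test(diffs)  # same t = -5.6619, df = 9, p-value = 0.0003089 as the paired test
```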

Workshop - Week Two

Perform the Following Tasks:

  1. Human gestation has an average period of 266 days with a standard deviation of 16 days. What is the probability that a baby will arrive exactly on their due date? Assume normal distribution. [2.5%]

  2. Make a 95% confidence interval for animal weight for the bison recorded in the knz_bison dataset. [742.16 - 756.74]

  3. Perform a hypothesis test that female and male bison have the same mean weight. [ \(p_{value} = 7 \times 10^{-5} << 0.05\)]

Assignments - Week Two

  1. Complete week two moodle quiz

  2. Complete swirl() exercises

  • install.packages("swirl")

  • library(swirl)

  • install_course("Statistical Inference")

  • swirl()

  • choose course Statistical Inference

  • do the three exercises 1 (Introduction), 9 (T Confidence Intervals), and 10 (Hypothesis Testing)

  • email a screen shot of the end of each lesson to eugene.hickey@associate.atu.ie

  • it’ll look a bit like the screen capture here