
👨‍💻 Eugene Hickey @ Atlantic Technological University 👨‍💻
Time we did some statistics on our data
This ties in nicely with Emma’s half of our course
We’ll look at this from three perspectives
distributions
confidence intervals
hypothesis tests
There are lots of these, but we’ll mention five (and describe two)
binomial (discrete)
poisson
normal (continuous)
Student’s t distribution
exponential
number of successes in a sequence of trials
three parameters
the number of trials, n
the number of successes, k
the probability of success in a single trial, p
\[ {\displaystyle f(k,n,p)=\Pr(k;n,p)=\Pr(X=k)={\binom {n}{k}}p^{k}(1-p)^{n-k}}\]
\[ {\displaystyle {\binom {n}{k}}={\frac {n!}{k!(n-k)!}}} \]
A rocket launch has a probability of success of 85%. From a 100 launches, what is the profile of the number of successes?

dbinom(), pbinom(), qbinom(), and rbinom()dbinom() calculates the probability of a specific outcome value
pbinom() calculates the sum of probabilities less than (or greater than), an outcome value
this is the cumulative probability
adds up a whole bunch of dbinom()’s
qbinom() calculates the number of successes below which there is a certain probability
like the inverse function of pbinom()
give it a probability and it works out the number of successes threshold
q stands for quantile
rbinom() generates random numbers of successs from a binomial distribution
pbinom()lower.tail = TRUE means k successes or less (successes \(\le\) threshold, k)
lower.tail = FALSE means more than k successes (successes \(>\) threshold, k)
worth remembering when to use lower.tail = TRUE/FALSE
pbinom() calculationqbinom()
in families of eight children, 30% of them have X girls or less. What is X?
qbinom(p = 0.3, size = 8, prob = 0.5, lower.tail = TRUE)
gives X = 3
pbinom() versus qbinom()pbinom() gives the probability of observing at least k successes
qbinom() gives the number of successes that would be observed with p probability
it returns a number of successes
gives us the position of some quantile
rbinom()this gives a random number from the distribution
take a sample of 100 rocket launches, then for this sample 83 will be successful
command is rbinom(n = 1, size=100, prob =0.85)
n = 1 means just one samplespread in values of a continuous variable
two parameters
the mean value, \(\mu\)
the standard deviation, \(\sigma\)
\[ {\displaystyle f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}} \]
\(\mu\) is the mean
\(\sigma\) is the standard deviation
A human liver has an average mass of 1.5kg with a standard deviation of 0.2kg.

dnorm(), pnorm(), qnorm(), and rnorm()dnorm() calculates the relative probability of a specific outcome value
pnorm() calculates the sum of probabilities less than (or greater than), an outcome value
this is the cumulative probability
adds up a whole bunch of dnorm()’s
qnorm() calculates the value below which there is a certain probability
like the inverse function of pnorm()
give it a probability and it works out the value
rnorm() generates random value from the normal distribution
pnorm()lower.tail = TRUE means values less than
lower.tail = FALSE means values more than
because this is a continuous distribution we never have to worry about the pesky -1
The concentration of \(PM_{2.5}\) in air has an average value of 12 \(\mu g/m^{3}\) with a standard deviation of 5 \(\mu g/m^{3}\). What is the probability that a given day will exceed the recommended 25\(\mu g/m^{3}\)?
pnorm(q = 25, mean =12, sd =5, lower.tail = FALSE)
gives 0.47%
Using the values above, what is the highest value of \(PM_{2.5}\) we can expect to see once a year?
qnorm(p = 1/365.25, mean =12, sd =5, lower.tail = FALSE)
gives 25.9\(\mu g/m^{3}\)
rnorm()this gives a random number from the normal distribution
a random sample of one person’s liver mass
command is rnorm(n = 1, mean = 1.5, sd = 0.2)
n = 1 means just one samplewe rarely know population statistics about our data
more common to have to infer them from sample statistics
if we don’t know the real standard deviation
estimate it from the sample
replace the normal distribution by its little cousin, the t-distribution
t-distribution is more spread out (more uncertain) than the normal
as sample size gets bigger the t converges to the normal
imagine taking a bunch of different samples from a population
each one will have a slightly different mean
for one sample, its mean is our best guess as to the true population mean
but also need to express our degree of certainty (or otherwise) in this guess
this gives the confidence interval

function t.test() is handy for confidence interval calculation
in fact it’s a bit of a Swiss army knife of conf. int., hypothesis testing….
give it a set of values and it’ll output the confidence interval for their mean
t.test(airquality$Wind)
t.test(airquality$Wind)$conf.int
One Sample t-test
data: airquality$Wind
t = 34.961, df = 152, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9.394804 10.520229
sample estimates:
mean of x
9.957516
do two samples have difference means?
consider gapminder dataset
z <- dslabs::gapminder |>
filter(year == 2007, region %in% c("Northern Europe", "Southern Europe"))
t.test(life_expectancy ~ region, data = z)
Welch Two Sample t-test
data: life_expectancy by region
t = -0.10871, df = 14.573, p-value = 0.9149
alternative hypothesis: true difference in means between group Northern Europe and group Southern Europe is not equal to 0
95 percent confidence interval:
-3.374106 3.047439
sample estimates:
mean in group Northern Europe mean in group Southern Europe
77.67000 77.83333
Ten mice were placed on a high fat diet. Their weights were recorded before and afterwards. Is there a significant weight gain?
| mouse | before | after |
|---|---|---|
| 1 | 265.6 | 274.6 |
| 2 | 84.4 | 100.2 |
| 3 | 196.7 | 206.0 |
| 4 | 187.7 | 194.0 |
| 5 | 199.0 | 203.9 |
| 6 | 235.7 | 247.9 |
| 7 | 259.2 | 262.9 |
| 8 | 172.3 | 172.8 |
| 9 | 188.2 | 198.1 |
| 10 | 241.1 | 248.0 |


These are largely based on the lterdatasampler package
remotes::install_github("lter/lterdatasampler")Human gestation has an average period of 266 days with a standard deviation of 16 days. What is the probability that a baby will arrive exactly on their due date? Assume normal distribution. [2.5%]
Make a 95% confidence interval for animal weight for the bisons recorded in the knz_bison dataset. [742.16 - 756.74]
Perform a hypothesis test that female and male bisons are the same weight. [ \(p_{value} = 7 \times 10^{-5} << 0.05\)]
Complete week two moodle quiz
Complete swirl() exercises
install.packages("swirl")
library(swirl)
install_course("Statistical Inference")
swirl()
choose course Statistical Inference
do the three exercises 1 (Introduction), 9 (T Confidence Intervals), and 10 (Hypothesis Testing)
email a screen shot of the end of each lesson to eugene.hickey@associate.atu.ie
it’ll look a bit like screen capture here
