Introduction to Statistical Computing
Getting & Cleaning Data

👨‍💻 Eugene Hickey @ Atlantic Technological University 👨‍💻

  • eugene.hickey@tudublin.ie
  • @eugene100hickey
  • github.com/eugene100hickey
  • www.fizzics.ie

Week Four - Getting and Cleaning Data

  • R is only any good when it has data to work with, and lots of it

  • it’s really not a good idea to type data into R by hand

  • we’ll look at three ways to load our data into an R session

    • reading it in from files

    • using an application programming interface (API) to get data from external data sources

    • web scraping

Ethics

  • Accessing data comes with some caveats

    • just because it’s on a webpage doesn’t mean we can use it as we want

    • must be:

      • public

      • non-personal

      • factual

    • also, be mindful of giving appropriate credit

    • and, of course, GDPR

  • This link discusses the issue nicely

Reading in from Files

  • getting data from a csv file

  • getting data from an excel file

  • getting data from a googlesheet

  • getting data from a pdf file

  • other file formats

Reading in from Files

  • two options

    • files are located on your local computer

    • files are located somewhere on the internet

  • R is pretty good at treating these the same
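In practice both cases use the same call; a quick sketch (the file names and URL here are placeholders):

```r
library(rio)

# a file on your local computer (placeholder path)
local_data <- import("data/my-file.csv")

# the same function works for a file somewhere on the internet (placeholder URL)
web_data <- import("https://example.com/my-file.csv")
```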

Reading in from Files

  • we’ll harness two packages for this

    • here

      • this package is useful for finding files on your computer

      • makes locating files a lot more streamlined

      • makes your code more portable

    • rio

      • this package is very streamlined for accessing data

      • as well as excel and csv, it will handle a host of other formats: SPSS, Minitab, JSON, XML, Stata, Matlab, and compressed formats like zip files

    • good at recognising file type automatically and leveraging appropriate function from other packages

here

  • typing the full address of a file is cumbersome, e.g. C:\Users\keyser-soze\Documents\this-week\shopping-list.doc

  • if someone copies or moves your file, the full address has to be reconstructed

  • differences between “\” in windows and “/” in Linux

    • sometimes even need a double “\\”

  • here gets around all this by taking a chain of directory names in quotes

    • here("keyser-soze", "Documents", "this-week", "shopping-list.doc")

    • starts from the project root; find out where that is by typing here()

  • using projects in R is also a good idea

  • these make it so much easier to copy a block of someone’s work (for example from github) without having to trace down all their file locations.
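As a sketch, the cumbersome path above becomes something like this (assuming the project root is C:\Users\keyser-soze\Documents):

```r
library(here)

here()   # reports the project root, e.g. "C:/Users/keyser-soze/Documents"

# build a platform-independent path relative to the project root
path <- here("this-week", "shopping-list.doc")
```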

rio

  • main function we’ll use is import()

    • import("some-file-name.xlsx")

    • import(system.file("extdata", "olive.csv", package = "dslabs"))

    • import("https://github.com/rfordatascience/tidytuesday/raw/refs/heads/main/data/2025/2025-01-21/exped_tidy.csv")

    • can specify a sheet from excel, import(system.file("extdata", "datasets.xls", package = "readxl"), which = "quakes")

    • can skip the first few rows with skip = 4

    • also works with googlesheets. Pass in the web address that we can get from
      the Share section of the googlesheet (we’ll look at this in this week’s
      worksheet later on)

  • two packages of note for reading pdf files: pdftools and tabulizer

  • tabulizer has an extract_tables() function.

    • good to specify the specific page or location in the pdf of the table you want

    • I (pretty much) always include output = "data.frame" in the call to extract_tables()

  • extract_tables() always returns a list, each element of the list being one of the tables in the document

    • use object_returned_by_extract_tables[[1]] to get the first table

system.file("examples", "data.pdf", package = "tabulizer") |>
  extract_tables(output = "data.frame", pages = 1) |>
  purrr::pluck(1)

  • Need Java installed, and R needs to know about it.

  • Might need to do this:

    • install 64-bit Java from java.com
    • then in the windows cmd prompt run
    • set PATH=C:\Program Files\Java\jre1.8.0_211\bin\server;%PATH% (make sure the path matches your Java version)
    • then in RStudio run Sys.setenv(JAVA_HOME="")

  • then remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")

  • can be a lot easier just to work in Posit Cloud (posit.cloud)

library(tabulizer)
library(purrr)   # for pluck()
organ_transplant <- "https://nhsbtdbe.blob.core.windows.net/umbraco-assets-corp/18494/centre-specific-activity-report.pdf"
extract_tables(organ_transplant, output = "data.frame", pages = 2) |> pluck(1) |> gt::gt()
Adult         X  X.1   X.2  X.3  X.4    X.5
Birmingham   27   23  17.4   17   16   +6.3
Glasgow      11    7  57.1    -    -      -
Harefield    21   33 -36.4   37   48  -22.9
Manchester   22   20  10.0   28   19  +47.4
Newcastle    22   19  15.8   30   25  +20.0
Papworth     36   39  -7.7   38   39   -2.6
Total       139  141  -1.4  150  147   +2.0
             NA   NA    NA
Paediatric   NA   NA    NA
GOSH          6   13 -53.8    2    3  -33.3
Newcastle    11    9  22.2    2    1 +100.0
Total        17   22 -22.7    4    4   +0.0
             NA   NA    NA
TOTAL       156  163  -4.3  154  151   +2.0

Other File Formats

  • see the bottom of this page for some other file formats that rio::import() can handle.

  • includes SPSS files (.sav), Minitab files (.mtp), Matlab files (.mat), etc.

Using Application Programming Interface (API) Access

  • people working in different domains simplify access to databases

  • produce an R package that does the trick

  • financial, social media, life science, census, music, ecology, earth observation…

    • pretty much anywhere there are large data stores

  • we’ll look at some examples of these

  • tend to be quite domain specific, everyone does it differently to suit the vagaries of their data source

  • sometimes need to register and generate user token to get access

    • can store this within RStudio so we just need to do this once

Life Sciences, Bioinformatics, NCBI, and Bioconductor

  • this is a whole ecosystem of work in R

  • vast troves of data

  • BiocManager package is the sine qua non for accessing this

    • instigates installation of relevant packages

    • even tighter controls over package quality than CRAN

  • I like the rentrez package to access gene sequences

  • we’ll look at this in the workshop
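A minimal rentrez session, along the lines of the workshop exercise, might look like this (requires internet access to NCBI; the search term is just an example):

```r
library(rentrez)

# search the NCBI nucleotide database for a gene of interest
z <- entrez_search(db = "nuccore", term = "brca1")
z$ids   # accession numbers found by the search

# fetch the first matching sequence in FASTA format
brca1_seq <- entrez_fetch(db = "nuccore", id = z$ids[1], rettype = "fasta")
```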

Financial

  • tidyquant is my go-to for stock data

    • tq_index_options() returns the names of the available stock indices

    • tq_index("my_index_name") returns a dataframe of the stocks making up that index

    • tq_get("my_stock_symbol") returns actual stock market data.
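Putting the three functions together (the index name and stock symbol are examples; needs an internet connection):

```r
library(tidyquant)

tq_index_options()           # names of the available indices, e.g. "SP500"

sp500 <- tq_index("SP500")   # dataframe of the stocks making up the S&P 500

aapl <- tq_get("AAPL",       # daily prices for one stock symbol
               from = "2024-01-01",
               to   = "2024-06-30")
```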

Social Media

  • let’s look at mastodon

  • package rtoot is useful

  • function auth_setup("mastodon.social", "public") to connect to your account

  • get_timeline_hashtag(hashtag = "eurovision", instance = "mastodon.social", limit = 200) to get dataframe of toots
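Put together, a sketch of an rtoot session (the hashtag is just an example; auth_setup() prompts in the browser the first time):

```r
library(rtoot)

# one-time setup; stores a token so future sessions skip this step
auth_setup("mastodon.social", "public")

# dataframe of recent toots carrying a given hashtag
toots <- get_timeline_hashtag(hashtag = "eurovision",
                              instance = "mastodon.social",
                              limit = 200)
```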

Political Science

  • list of useful packages here

Ecology

  • rfishbase

  • rgbif

  • GIFT with this tutorial

Earth Observation - Satellite Images

  • good resource for Ireland is Sentinel2

    • need to make an account on Copernicus

    • R packages sentinel2 and sen2r

    • see this presentation for a walkthrough

  • also Planet.com

    • free (but limited) academic access

Web Scraping

  • often want to take data from an internet page (or family of pages) and read in to R

  • two types

    • static web scraping, where the web address changes when you click links on the page

    • dynamic web scraping, where the web address stays the same but the page content changes

    • dynamic is tricky and requires the use of a virtual browser (RSelenium)

    • we’ll just look at static web scraping

Static Web Scraping

  • rvest package is super useful

  • three step process

    • read in the whole web page (read_html())

    • select out the elements that you want (html_elements())

    • format the results (html_text2())

  • to figure out the elements part, the Chrome extension called SelectorGadget is very useful. Could also look at the page source with Ctrl+U
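The three steps might look like this in practice (the URL and CSS selector match the Premier League workshop exercise; needs an internet connection):

```r
library(rvest)

# step 1: read in the whole web page
page <- read_html("https://en.wikipedia.org/wiki/2020%E2%80%9321_Premier_League")

# step 2: select the elements we want (selector found with SelectorGadget)
# step 3: format the results as plain text
page |>
  html_elements("tr:nth-child(11) .infobox-data a") |>
  html_text2()
```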

Workshop - Week Four

Perform the Following Tasks:

  1. Use rio::import() to read in the Ferris Wheel dataset available from https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-09/wheels.csv. Where is the highest ferris wheel? [Golden Gate Flyer]

  2. Look at the example excel file called deaths.xlsx from the readxl package (find it using system.file(package = "readxl") then look in the extdata directory). Read the "other" sheet from this file using the import() function. Note that the first four rows are full of junk so they should be skipped. See if you can figure out what the Date of Birth column represents. [days since 1900]

  3. Make a new googlesheet in the Chrome browser. After you’ve entered some data, click Share (top right corner). Click General Access and choose Anyone with the link. Copy the link. Go to RStudio now and use the link you’ve copied to import this googlesheet into R.

  4. Have a look at the Department of Education SCHOOL SELF-EVALUATION GUIDELINES 2016-2020 here. Use tabulizer to read in the table on page 27 of this report as a data.frame (note, it’ll be the sixth table tabulizer encounters).

  5. Install the eurostat package. Search for datasets on education using the search_eurostat() function. Get the eurostat dataset on “Quality of education” (the code should be “med_ps24”). Decode the column values using get_eurostat_dic() (e.g. get_eurostat_dic("indic_md")). Find the Annual, Proportion of youth (20-24 years) having attained ISCED 3 level, total (%) in Palestine in 2014. [35.8%]

  6. Install the BiocManager package (install.packages("BiocManager")). Install the rentrez and seqinr packages from Bioconductor (e.g. BiocManager::install("rentrez")). Find the NCBI accession numbers for the BRCA1 cancer gene using z <- entrez_search(db = "nuccore", term = "brca1"). This returns a list of 5 elements, the first of which (z[[1]]) will be the relevant accession numbers. Take the first of these accession numbers and fetch the gene sequence using entrez_fetch(db = "nuccore", id = "my_accession_number", rettype = "fasta").

  7. Install SelectorGadget as a Chrome extension. Go to the RTE website. Find the css selector for the top navigation bar (where it says “News Sport Entertainment…”). [“.primary-nav__link”]

  8. Go to the wikipedia page for the English Premier League Season 2020/2021. Find the css selector for the Best Goalkeeper (should be something like “tr:nth-child(11) .infobox-data a”). Use read_html(), html_elements(), and html_text2() from the rvest package to find the best goalkeeper for this season. [David Raya]

Assignments - Week Four

  1. Complete week four moodle quiz

  2. Complete swirl() exercises

  • swirl()

  • choose course R Programming E

  • do the three exercises 5 (Missing Values), 6 (Subsetting Vectors), and 7 (Matrices and Data Frames)

  • email the results to eugene.hickey@associate.atu.ie