Introduction to Statistical Computing
Getting & Cleaning Data

👨‍💻 Eugene Hickey @ Atlantic Technological University 👨‍💻

  • eugene.hickey@tudublin.ie
  • @eugene100hickey
  • github.com/eugene100hickey
  • www.fizzics.ie

Week Four - Getting and Cleaning Data

  • R is only any good when it has data to work with, and lots of it

  • it’s really not a good idea to type data into R by hand

  • we’ll look at three ways to load our data into an R session

    • reading it in from files

    • using an application programming interface (API) to get data from external data sources

    • web scraping

Ethics

  • Accessing data comes with some caveats

    • just because it’s on a webpage doesn’t mean we can use it as we want

    • must be:

      • public

      • non-personal

      • factual

    • also, be mindful of giving appropriate credit

    • and, of course, GDPR

  • This link discusses the issue nicely

Reading in from Files

  • getting data from a csv file

  • getting data from an excel file

  • getting data from a googlesheet

  • getting data from a pdf file

  • other file formats

Reading in from Files

  • two options

    • files are located on your local computer

    • files are located somewhere on the internet

  • R is pretty good at treating these the same
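In practice both cases use the same call; a quick sketch (the file names and URL here are placeholders):

```r
library(rio)

# a file on your local computer (placeholder path)
local_data <- import("data/my-file.csv")

# the same function works for a file somewhere on the internet (placeholder URL)
web_data <- import("https://example.com/my-file.csv")
```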

Reading in from Files

  • we’ll harness two packages for this

    • here

      • this package is useful for finding files on your computer

      • makes locating files a lot more streamlined

      • makes your code more portable

    • rio

      • this package is very streamlined for accessing data

      • as well as excel and csv, it will handle a host of other formats: SPSS, Minitab, JSON, XML, Stata, Matlab, and compressed formats like zip files

    • good at recognising file type automatically and leveraging appropriate function from other packages

here

  • typing the full address of a file is cumbersome, e.g. C:\Users\keyser-soze\Documents\this-week\shopping-list.doc

  • if someone copies or moves your file, the full address has to be reconstructed

  • differences between “\” in windows and “/” in Linux

    • sometimes even need a double “\\”

  • here gets around all this by taking a chain of directory names in quotes

    • here("keyser-soze", "Documents", "this-week", "shopping-list.doc")

    • starts from the project root; find out where that is by typing here()

  • using projects in R is also a good idea

  • these make it so much easier to copy a block of someone’s work (for example from github) without having to trace down all their file locations.
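As a sketch, the cumbersome path above becomes something like this (assuming the project root is C:\Users\keyser-soze\Documents):

```r
library(here)

here()   # reports the project root, e.g. "C:/Users/keyser-soze/Documents"

# build a platform-independent path relative to the project root
path <- here("this-week", "shopping-list.doc")
```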

rio

  • main function we’ll use is import()

    • import("some-file-name.xlsx")

    • import(system.file("extdata", "olive.csv", package = "dslabs"))

    • import("https://github.com/rfordatascience/tidytuesday/raw/refs/heads/main/data/2025/2025-01-21/exped_tidy.csv")

    • can specify a sheet from excel, import(system.file("extdata", "datasets.xls", package = "readxl"), which = "quakes")

    • can skip the first few rows with skip = 4

    • also works with googlesheets. Pass in the web address that we can get from
      the Share section of the googlesheet (we’ll look at this in this week’s
      worksheet later on)

  • two packages of note for reading pdf files: pdftools and tabulizer

  • tabulizer has an extract_tables() function.

    • good to specify the specific page or location in the pdf of the table you want

    • I (pretty much) always include output = "data.frame" in the call to extract_tables()

  • extract_tables() always returns a list, each element of the list being one of the tables in the document

    • use object_returned_by_extract_tables[[1]] to get the first table

system.file("examples", "data.pdf", package = "tabulizer") |>
  extract_tables(output = "data.frame", pages = 1) |>
  purrr::pluck(1)

  • Need Java installed, and R needs to know about it.

  • Might need to do this:

    • install 64-bit Java from java.com
    • then in the windows cmd prompt run
    • set PATH=C:\Program Files\Java\jre1.8.0_211\bin\server;%PATH% (make sure the path matches your Java version)
    • then in RStudio run Sys.setenv(JAVA_HOME="")

  • then remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")

  • can be a lot easier just to work in Posit Cloud (posit.cloud)

library(tabulizer)
library(purrr)   # for pluck()
organ_transplant <- "https://nhsbtdbe.blob.core.windows.net/umbraco-assets-corp/18494/centre-specific-activity-report.pdf"
extract_tables(organ_transplant, output = "data.frame", pages = 2) |> pluck(1) |> gt::gt()
Adult         X  X.1   X.2  X.3  X.4    X.5
Birmingham   27   23  17.4   17   16   +6.3
Glasgow      11    7  57.1    -    -      -
Harefield    21   33 -36.4   37   48  -22.9
Manchester   22   20  10.0   28   19  +47.4
Newcastle    22   19  15.8   30   25  +20.0
Papworth     36   39  -7.7   38   39   -2.6
Total       139  141  -1.4  150  147   +2.0
             NA   NA    NA
Paediatric   NA   NA    NA
GOSH          6   13 -53.8    2    3  -33.3
Newcastle    11    9  22.2    2    1 +100.0
Total        17   22 -22.7    4    4   +0.0
             NA   NA    NA
TOTAL       156  163  -4.3  154  151   +2.0

Other File Formats

  • see the bottom of this page for some other file formats that rio::import() can handle.

  • includes SPSS files (.sav), Minitab files (.mtp), Matlab files (.mat), etc.

Using Application Programming Interface (API) Access

  • people working in different domains simplify access to databases

  • produce an R package that does the trick

  • financial, social media, life science, census, music, ecology, earth observation…

    • pretty much anywhere there are large data stores

  • we’ll look at some examples of these

  • tend to be quite domain specific, everyone does it differently to suit the vagaries of their data source

  • sometimes need to register and generate user token to get access

    • can store this within RStudio so we just need to do this once

Life Sciences, Bioinformatics, NCBI, and Bioconductor

  • this is a whole ecosystem of work in R

  • vast troves of data

  • BiocManager package is the sine qua non for accessing this

    • instigates installation of relevant packages

    • even tighter controls over package quality than CRAN

  • I like the rentrez package to access gene sequences

  • we’ll look at this in the workshop
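A minimal rentrez session, along the lines of the workshop exercise, might look like this (requires internet access to NCBI; the search term is just an example):

```r
library(rentrez)

# search the NCBI nucleotide database for a gene of interest
z <- entrez_search(db = "nuccore", term = "brca1")
z$ids   # accession numbers found by the search

# fetch the first matching sequence in FASTA format
brca1_seq <- entrez_fetch(db = "nuccore", id = z$ids[1], rettype = "fasta")
```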

Financial

  • tidyquant is my go-to for stock data

    • tq_index_options() returns the names of the available stock indices

    • tq_index("my_index_name") returns a dataframe of the stocks making up that index

    • tq_get("my_stock_symbol") returns actual stock market data.
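Putting the three functions together (the index name and stock symbol are examples; needs an internet connection):

```r
library(tidyquant)

tq_index_options()           # names of the available indices, e.g. "SP500"

sp500 <- tq_index("SP500")   # dataframe of the stocks making up the S&P 500

aapl <- tq_get("AAPL",       # daily prices for one stock symbol
               from = "2024-01-01",
               to   = "2024-06-30")
```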

Social Media

  • let’s look at mastodon

  • package rtoot is useful

  • function auth_setup("mastodon.social", "public") to connect to your account

  • get_timeline_hashtag(hashtag = "eurovision", instance = "mastodon.social", limit = 200) to get dataframe of toots
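Put together, a sketch of an rtoot session (the hashtag is just an example; auth_setup() prompts in the browser the first time):

```r
library(rtoot)

# one-time setup; stores a token so future sessions skip this step
auth_setup("mastodon.social", "public")

# dataframe of recent toots carrying a given hashtag
toots <- get_timeline_hashtag(hashtag = "eurovision",
                              instance = "mastodon.social",
                              limit = 200)
```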

Political Science

  • list of useful packages here

Ecology

  • rfishbase

  • rgbif

  • GIFT with this tutorial

Earth Observation - Satellite Images

  • good resource for Ireland is Sentinel2

    • need to make an account on Copernicus

    • R packages sentinel2 and sen2r

    • see this presentation for a walkthrough

  • also Planet.com

    • free (but limited) academic access

Web Scraping

  • often want to take data from an internet page (or family of pages) and read in to R

  • two types

    • static web scraping, where the web address changes when you click links on the page

    • dynamic web scraping, where the web address stays the same but the page content changes

    • dynamic is tricky and requires the use of a virtual browser (RSelenium)

    • we’ll just look at static web scraping

Static Web Scraping

  • rvest package is super useful

  • three step process

    • read in the whole web page (read_html())

    • select out the elements that you want (html_elements())

    • format the results (html_text2())

  • to figure out the elements part, the Chrome extension called SelectorGadget is very useful. Could also look at the page source with Ctrl+U
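The three steps might look like this in practice (the URL and CSS selector match the Premier League workshop exercise; needs an internet connection):

```r
library(rvest)

# step 1: read in the whole web page
page <- read_html("https://en.wikipedia.org/wiki/2020%E2%80%9321_Premier_League")

# step 2: select the elements we want (selector found with SelectorGadget)
# step 3: format the results as plain text
page |>
  html_elements("tr:nth-child(11) .infobox-data a") |>
  html_text2()
```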

Workshop - Week Four

Perform the Following Tasks:

  1. Use rio::import() to read in the Ferris Wheel dataset available from https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-09/wheels.csv. Where is the highest ferris wheel? [Golden Gate Flyer]

  2. Look at the example excel file called deaths.xlsx from the readxl package (find it using system.file(package = "readxl") then look in the extdata directory). Read the "other" sheet from this file using the import() function. Note that the first four rows are full of junk so they should be skipped. See if you can figure out what the Date of Birth column represents. [days since 1900]

  3. Make a new googlesheet in the Chrome browser. After you’ve entered some data, click Share (top right corner). Click General Access and choose Anyone with the link. Copy the link. Go to RStudio now and use the link you’ve copied to import this googlesheet into R.

  4. Have a look at the Department of Education SCHOOL SELF-EVALUATION GUIDELINES 2016-2020 here. Use tabulizer to read in the table on page 27 of this report as a data.frame (note, it’ll be the sixth table tabulizer encounters).

  5. Install the eurostat package. Search for datasets on education using the search_eurostat() function. Get the eurostat dataset on “Quality of education” (the code should be “med_ps24”). Decode the column values using get_eurostat_dic() (e.g. get_eurostat_dic("indic_md")). Find the Annual, Proportion of youth (20-24 years) having attained ISCED 3 level, total (%) in Palestine in 2014. [35.8%]

  6. Install the BiocManager package (install.packages("BiocManager")). Install the rentrez and seqinr packages from Bioconductor (e.g. BiocManager::install("rentrez")). Find the NCBI accession numbers for the BRCA1 cancer gene using z <- entrez_search(db = "nuccore", term = "brca1"). This returns a list of 5 elements, the first of which (z[[1]]) will be the relevant accession numbers. Take the first of these accession numbers and fetch the gene sequence using entrez_fetch(db = "nuccore", id = "my_accession_number", rettype = "fasta").

  7. Install SelectorGadget as a Chrome extension. Go to the RTE website. Find the css selector for the top navigation bar (where it says “News Sport Entertainment…”). [“.primary-nav__link”]

  8. Go to the wikipedia page for the English Premier League Season 2020/2021. Find the css selector for the Best Goalkeeper (should be something like “tr:nth-child(11) .infobox-data a”). Use read_html(), html_elements(), and html_text2() from the rvest package to find the best goalkeeper for this season. [David Raya]

Assignments - Week Four

  1. Complete week four moodle quiz

  2. Complete swirl() exercises

  • swirl()

  • choose course R Programming E

  • do the three exercises 5 (Missing Values), 6 (Subsetting Vectors), and 7 (Matrices and Data Frames)

  • email the results to eugene.hickey@associate.atu.ie