In this workshop, we introduce R packages and practice building one with common bioinformatics tools. Much of this material is modified from Hadley Wickham’s book R Packages.
At the end of this workshop, you will be able to:
This workshop assumes no prior experience. However, you will need to complete the following steps prior to the workshop.
R packages are reproducible bundles of code. They represent what many think of as the final step in reproducible coding as you progress from scripts to functions to packages.
Some benefits of R packages as compared to R functions and scripts are:
library( )
So, when do you make the jump from functions to packages? It is up to you to consider if the benefits of a package outweigh the time to produce and maintain one. If you use the same function in multiple projects, many people use the same function, and/or multiple functions are often used together, you will likely benefit from bundling everything into a package. Making a package also forces you to more robustly test functions and will improve your function writing overall.
Creating an R package in a iterative process. You will use the same code repeatedly to test your package until is passes all tests. You may wish to save this code in an R script but we will not build live notes like we normally do in a workshop. My package building script is provided at the end of these notes.
Maybe not surprisingly, there are R packages to help you make R packages! First install these packages.
install.packages(c("devtools","usethis","roxygen2"))
Then load these packages.
library(devtools)
library(roxygen2)
You can create the skeleton of a package with create_package( )
. Input the full path where you would like to save this package locally.
usethis::create_package("~/Documents/GitHub/BIGslu/R_packages/BIGpicture")
## ✓ Setting active project to '/Users/kim/Documents/GitHub/BIGslu/R_packages/BIGpicture'
## ✓ Leaving 'DESCRIPTION' unchanged
## Package: BIGpicture
## Title: What the Package Does (One Line, Title Case)
## Version: 0.0.0.9000
## Authors@R (parsed):
## * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
## Description: What the package does (one paragraph).
## License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
## license
## Encoding: UTF-8
## LazyData: true
## Roxygen: list(markdown = TRUE)
## RoxygenNote: 7.1.2
## ✓ Setting active project to '<no active project>'
This will open a new R session in an R project formatted for packages. This automatically creates a number of files including the following.
.Rbuildignore
lists files needed for package development but that will be ignored when building the R package.gitignore
anticipates Git usage and ignores standard files created by R and RStudioDESCRIPTION
contains package metadataNAMESPACE
denotes internal functions and external R packages used by your package. Do not edit this file by-handR/
directory to hold R scripts with functionsBIGpicture.Rproj
RStudio ProjectYou do not have to version control your package, but it is strongly recommended. You can auto setup git with use_git( )
and allow RStudio to create the first commit like so.
usethis::use_git()
✓ Setting active project to '/Users/kim/Documents/GitHub/BIGslu/R_packages/BIGpicture'
✓ Initialising Git repo
✓ Adding '.Rhistory', '.Rdata', '.httr-oauth', '.DS_Store' to '.gitignore'
There are 5 uncommitted files:
* '.gitignore'
* '.Rbuildignore'
* 'BIGpicture.Rproj'
* 'DESCRIPTION'
* 'NAMESPACE'
Is it ok to commit them?
1: Nope
2: No way
3: For sure
Selection: 3
✓ Adding files
✓ Making a commit with message 'Initial commit'
• A restart of RStudio is required to activate the Git pane
Restart now?
1: Absolutely not
2: Negative
3: Absolutely
Selection: 3
This is the last time RStudio needs to restart during package development. So we will reload our development packages and no longer need to call them for each function with ::
library(devtools)
library(roxygen2)
You will generally not start package development until you have a working function to incorporate. For this package, we will add a function to plot model fit sigma
comparisons between two models run in kimma::kmFit( )
. The original function:
library(kimma)
#Linear models
model_result <- kmFit(dat = example.voom, kin = example.kin, patientID = "donorID",
run.lmekin = TRUE, run.lme = TRUE,
model = "~ virus + (1|donorID)", processors = 6)
#plot fxn
sigma_plot <- function(model_result, x, y){
library(tidyverse)
#Extract results
dat_x <- model_result[[x]] %>%
distinct(model,gene, sigma)
dat_y <- model_result[[y]] %>%
distinct(model,gene, sigma)
#Merge and format
dat <- bind_rows(dat_x,dat_y) %>%
pivot_wider(names_from = model, values_from = sigma) %>%
#add best fit variable
mutate(`Best fit` = ifelse(get(x)<get(y), x,
ifelse(get(y)<get(x), y, "none")))
#plot
plot <- ggplot(dat, aes(x=get(x), y=get(x), color=`Best fit`)) +
geom_point(alpha=0.3) +
labs(x=x, y=y) +
geom_abline() +
theme_classic()
message("Total genes best fit by")
table(dat$`Best fit`)
return(plot)
}
#test fxn
sigma_plot(model_result, x="lme", y="lmekin")
We don’t want to simply copy this script to our package’s R/
directory. Instead, we ensure correct setup by creating a new script with use_r( )
use_r("sigma_plot")
✓ Setting active project to '/Users/kim/Documents/GitHub/BIGslu/R_packages/BIGpicture'
• Modify 'R/sigma_plot.R'
• Call `use_test()` to create a matching test file
This opens a new R script where we copy just the function part of our original script.
sigma_plot <- function(model_result, x, y){
library(tidyverse)
#Extract results
dat_x <- model_result[[x]] %>%
distinct(model,gene, sigma)
dat_y <- model_result[[y]] %>%
distinct(model,gene, sigma)
#Merge and format
dat <- bind_rows(dat_x,dat_y) %>%
pivot_wider(names_from = model, values_from = sigma) %>%
#add best fit variable
mutate(`Best fit` = ifelse(get(x)<get(y), x,
ifelse(get(y)<get(x), y, "none")))
#plot
plot <- ggplot(dat, aes(x=get(x), y=get(x), color=`Best fit`)) +
geom_point(alpha=0.3) +
labs(x=x, y=y) +
geom_abline() +
theme_classic()
message("Total genes best fit by")
table(dat$`Best fit`)
return(plot)
}
We can test this function within the package environment by loading it with load_all( )
load_all()
ℹ Loading BIGpicture
Warning message:
── Conflicts ───────────────────────────────────────────────────────────── BIGpicture conflicts ──
x sigma_plot() masks BIGpicture::sigma_plot()
Did you accidentally source a file rather than using `load_all()`?
Run `rm(list = c("sigma_plot"))` to remove the conflicts.
Because we have sigma_plot
in our current R environment from testing the original script, R finds a conflict i.e. two functions by the same name. If we remove the original script’s function from the environment, we resolve this conflict. Now the only copy of sigma_plot
is from the BIGpicture
package.
rm(list = c("sigma_plot"))
load_all()
ℹ Loading BIGpicture
If we call sigma_plot( )
in this session, R will use the BIGpicture
function even though we do not have this package installed. This allows you to test your functions like a user while still in the development phase.
In your workflow, this would be a good point to make a git commit!
devtools
runs a number of standard checks of packages including those needed for CRAN upload. We run all of these with check( )
and see a number of issues with our package.
check()
0 errors ✓ | 2 warnings x | 1 note x
The three types of issues are
Any of these issues would result in the package be denied upload to CRAN. Let’s tackle each of these individually.
The first warning notes that we do not have a license in our package. Thus, no one can use it!
> checking DESCRIPTION meta-information ... WARNING
Non-standard license specification:
`use_mit_license()`, `use_gpl3_license()` or friends to pick a
license
Standardizable: FALSE
You can choose your own license but we recommend GPL3.
use_gpl3_license()
✓ Setting License field in DESCRIPTION to 'GPL (>= 3)'
✓ Writing 'LICENSE.md'
✓ Adding '^LICENSE\\.md$' to '.Rbuildignore'
If we rerun check( )
, we see one fewer warnings now.
check()
0 errors ✓ | 1 warning x | 1 note x
Our function uses the tidyverse
suite of packages. Thus, on its own, it will not work unless someone already has tidyverse
installed and loaded in their R session.
> checking dependencies in R code ... WARNING
'library' or 'require' call not declared from: ‘tidyverse’
'library' or 'require' call to ‘tidyverse’ in package code.
Please use :: or requireNamespace() instead.
See section 'Suggested packages' in the 'Writing R Extensions' manual.
Within our package, we need to tell R to install and load all dependencies automatically. This is done with use_package( )
. Since tidyverse
is actually a bunch of packages, we must call each individual package instead.
use_package("dplyr")
use_package("tidyr")
use_package("ggplot2")
✓ Adding 'dplyr' to Imports field in DESCRIPTION
• Refer to functions with `dplyr::fun()`
✓ Adding 'tidyr' to Imports field in DESCRIPTION
• Refer to functions with `tidyr::fun()`
✓ Adding 'ggplot2' to Imports field in DESCRIPTION
• Refer to functions with `ggplot2::fun()`
This added the following text to our DESCRIPTION
file.
Imports:
dplyr,
ggplot2,
tidyr
After we’ve added the packages, we still have some related notes to address.
> checking R code for possible problems ... NOTE
sigma_plot: no visible global function definition for ‘distinct’
sigma_plot: no visible global function definition for ‘bind_rows’
sigma_plot: no visible global function definition for ‘pivot_wider’
sigma_plot: no visible global function definition for ‘mutate’
sigma_plot: no visible global function definition for ‘ggplot’
sigma_plot: no visible global function definition for ‘aes’
sigma_plot: no visible global function definition for ‘geom_point’
sigma_plot: no visible global function definition for ‘labs’
sigma_plot: no visible global function definition for ‘geom_abline’
sigma_plot: no visible global function definition for ‘theme_classic’
Within package functions, we must explicitly call each package when needed using ::
. So in our function, we need to add dplyr::
, etc to each function in that package. We also remove the library( )
call.
sigma_plot <- function(model_result, x, y){
#Extract results
dat_x <- model_result[[x]] %>%
dplyr::distinct(model,gene, sigma)
dat_y <- model_result[[y]] %>%
dplyr::distinct(model,gene, sigma)
#Merge and format
dat <- dplyr::bind_rows(dat_x,dat_y) %>%
tidyr::pivot_wider(names_from = model, values_from = sigma) %>%
#add best fit variable
dplyr::mutate(`Best fit` = ifelse(get(x)<get(y), x,
ifelse(get(y)<get(x), y, "none")))
#plot
plot <- ggplot2::ggplot(dat, ggplot2::aes(x=get(x), y=get(x), color=`Best fit`)) +
ggplot2::geom_point(alpha=0.3) +
ggplot2::labs(x=x, y=y) +
ggplot2::geom_abline() +
ggplot2::theme_classic()
message("Total genes best fit by")
table(dat$`Best fit`)
return(plot)
}
And we re-check.
check()
0 errors ✓ | 0 warnings ✓ | 1 note x
While not used in this package, you may need to access Bioconductor or GitHub packages within your package. To do this, you run use_package( )
as we did for CRAN packages and then add the following to your DESCRIPTION
file to tell R to look for packages outside CRAN.
For Bioconductor. No specifics are needed since package names are already listed under Imports.
biocViews:
For GitHub. Input the user and repo name of your desired package(s).
Remotes: BIGslu/kimma, BIGslu/RNAetc
Pipes %>%
are a common dependency the requires some special formatting in packages. This is why they are listed as a note instead of a warning.
> checking R code for possible problems ... NOTE
sigma_plot: no visible global function definition for ‘%>%’
The pipe function is in the package magrittr
but use_package("magrittr")
does not result in everything we need. So, there is use_pipe( )
which does all we need!
use_pipe()
✓ Adding 'magrittr' to Imports field in DESCRIPTION
✓ Writing 'R/utils-pipe.R'
• Run `devtools::document()` to update 'NAMESPACE'
This creates the file R/utils-pipe.R
which contains everything R needs for pipes. magrittr
was also added to our DESCRIPTION
.
This leaves us with the following notes. These tell us there there are variables in our function with no default. They occur when variables not defined in the function are used or created within the function such as select
, mutate
, or pivot_wider
.
> checking R code for possible problems ... NOTE
sigma_plot: no visible binding for global variable ‘model’
sigma_plot: no visible binding for global variable ‘gene’
sigma_plot: no visible binding for global variable ‘sigma’
sigma_plot: no visible binding for global variable ‘Best fit’
Undefined global functions or variables:
Best fit gene model sigma
Consider adding
importFrom("stats", "sigma")
to your NAMESPACE file.
We resolve this note by defining each variable at the start of the function. I like to just keep a running list like so.
model <- gene <- sigma <- `Best fit` <- NULL
Now all our checks pass!
check()
0 errors ✓ | 0 warnings ✓ | 0 notes ✓
Now, you have a package! This is another good time to git commit! But there is a lot we can do to improve it.
Several steps thus far have modified the DESCRIPTION
file. Open this file and add info about this package for the title, version, author, and description.
Package: BIGpicture
Title: Standard Plots for BIGslu
Version: 0.0.1
Authors@R:
person(given = "Kim",
family = "Dill-McFarland",
role = c("aut", "cre"),
email = "kadm@uw.edu")
Description: Create plots related to bioinformatic analyses.
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Imports:
dplyr,
ggplot2,
magrittr,
tidyr
Each R function has a help page accessed with ?
. At this point, however, we have not provided any information for sigma_plot
so no page is available.
Help pages are created using roxygen2
, which we installed earlier but have not used yet. First, insert a template by opening sigma_plot.R
and selecting Code > Insert roxygen skeleton under the RStudio menu bar.
#' Title
#'
#' @param model_result
#' @param x
#' @param y
#'
#' @return
#' @export
#'
#' @examples
Then fill in the appropriate information. It is best practices to include the data type (character, vector, data frame, etc) in each variable’s description.
#' Model fit comparisons using sigma estimate
#'
#' @param model_result List of data frames output by kmFit( )
#' @param x Character string of model to plot on x-axis. Must match object names in model_result. For example, "lm", "lme", "lmekin"
#' @param y Character string of model to plot on y-axis. Must match object names in model_result. For example, "lm", "lme", "lmekin"
#'
#' @return ggplot object
#' @export
#'
#' @examples
We next create the help pages with document( )
. This makes a manual man/
directory with .Rd
files for each help page.
document()
ℹ Updating BIGpicture documentation
ℹ Loading BIGpicture
Warning: [/Users/kim/Documents/GitHub/BIGslu/R_packages/BIGpicture/R/sigma_plot.R:10] @examples requires a value
Writing NAMESPACE
Writing NAMESPACE
Writing sigma_plot.Rd
And now anyone using the package can access the help page.
?sigma_plot
Running check will now also perform tests of the documentation.
check()
0 errors ✓ | 0 warnings ✓ | 0 note x
Ever wonder about the “Examples” section at the end of R help pages? Why did the authors choose those specific examples? Why are some lines commented out? Why do some have tons and others none?
Well, this section serves two purposes. 1) It gives the user example code for how to use the associated function. 2) It tests the function when you build the package. This second purpose is actually the more important one, though as an R user, you never see these tests. When you include example code, R runs it every time you run check( )
and thus, you test your code on real data every time you make a change.
You may use base R example data in the example code or include your own custom example data relevant to your package. Here, we want BIGpicture
to play nice with other BIGslu packages like kimma
. Thus, we will provide our own example data.
Data within your package is housed within the data/
directory. Oftentimes, these date are a cleaned or reduced version of some other data set. It is highly recommended to use publicly available data and provide a cleaning script to produce the actual data used by the package. These documents generally live in data-raw/
.
Setup the raw data files with
usethis::use_data_raw()
This opens DATASET.R
where we will paste in the data cleaning code from the original function script.
library(kimma)
#Linear models
model_result <- kmFit(dat = example.voom, kin = example.kin, patientID = "donorID",
run.lmekin = TRUE, run.lme = TRUE,
model = "~ virus + (1|donorID)", processors = 6)
usethis::use_data(model_result, overwrite = TRUE)
Running that final line of code, sets up the data/
directory and saves our example data for the package.
✓ Adding 'R' to Depends field in DESCRIPTION
✓ Creating 'data/'
✓ Saving 'model_result' to 'data/model_result.rda'
• Document your data (see 'https://r-pkgs.org/data.html')
After loading BIGpicture
, these data can be accessed simply as model_result
.
As usual, we check the package build after this change. We see a new warning that our data object has no documentation.
check()
> checking for missing documentation entries ... WARNING
Undocumented code objects:
‘model_result’
Undocumented data sets:
‘model_result’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
0 errors ✓ | 1 warning x | 0 note x
Documenting data is similar to a function except that it gets a new file in R/
instead of this information being part of the data object itself. There is no template for data documentation so we modify an example to fit our needs. Generally, you want to include everything you’d get if you looked at the data with str( )
str(model_result)
Which leads to documentation like the following in R/model_result.R
#' kmFit example model results
#'
#' @format List of 2:
#' \enumerate{
#' \item \strong{lme} A data frame with 1000 rows and 7 columns
#' \describe{
#' \item{model}{character. Model name.}
#' \item{gene}{character. ENSEMBL gene ID.}
#' \item{variable}{character. Variable name.}
#' \item{estimate}{numeric. Model estimate, log fold change.}
#' \item{pval}{numeric. Significance estimate.}
#' \item{sigma}{numeric. Model fit, mean residual error.}
#' \item{FDR}{numeric. FDR corrected signficant estimate.}
#' }
#'
#' \item \strong{lmekin} A data frame with 2000 rows and 7 columns
#' \describe{
#' \item{model}{character. Model name.}
#' \item{gene}{character. ENSEMBL gene ID.}
#' \item{variable}{character. Variable name.}
#' \item{estimate}{numeric. Model estimate, log fold change.}
#' \item{pval}{numeric. Significance estimate.}
#' \item{sigma}{numeric. Model fit, mean residual error.}
#' \item{FDR}{numeric. FDR corrected signficant estimate.}
#' }
#' }
#' @source \url{https://github.com/altman-lab/P259_pDC_public}
#' @references Dill-McFarland et al. 2021. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. In revision
#' @description A dataset containing linear mixed effects model results with and without genetic kinship correction. Raw data were obtained from the kimma package.
#' @docType data
#' @name model_result
#' @keywords datasets
"model_result"
Checking again, we have resolved the previous warning.
check()
0 errors ✓ | 0 warnings ✓ | 0 note x
Now that we have example data in our package, we can use it in example code in the function. Taking the test code we had in the original script, we add it to the roxygen header examples like so.
#' @examples
#' sigma_plot(model_result, x="lme", y="lmekin")
Re-running check results in no errors so we know that the above code successfully completes using our custom function.
check()
0 errors ✓ | 0 warnings ✓ | 0 note x
If there was an error in the example code, check( )
would provide the output of the failed code similar to running it in the R console. For example, if we did not specify the y variable as in sigma_plot(model_result, x="lme")
, we would see the following.
> checking examples ... ERROR
Running examples in ‘BIGpicture-Ex.R’ failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: sigma_plot
> ### Title: Model fit comparisons using sigma estimate
> ### Aliases: sigma_plot
>
> ### ** Examples
>
> sigma_plot(model_result, x="lme")
Error in UseMethod("distinct") :
no applicable method for 'distinct' applied to an object of class "NULL"
Calls: sigma_plot -> %>% -> <Anonymous>
Execution halted
1 error x | 0 warnings ✓ | 0 notes ✓
This error tells us that distinct
is being run on a variable that does not exist, e.g. is NULL. To figure out what is going on, we would go back to the function and run it line-by-line until we reproduced the error.
If you’re planning to upload your package to GitHub, it is nice to have a README on the repo’s homepage to let potential users know a bit about the package. You can create a template like so and customize it as desired.
use_readme_rmd()
✓ Writing 'README.Rmd'
✓ Adding '^README\\.Rmd$' to '.Rbuildignore'
• Modify 'README.Rmd'
✓ Writing '.git/hooks/pre-commit'
library(devtools)
library(roxygen2)
# Create package
create_package("")
use_git()
# Add function
use_r("")
# Add dependencies
use_package("")
use_pipe()
# Update DESCRIPTION for remote dependencies
## biocViews:
## Remotes: BIGslu/kimma, BIGslu/RNAetc
# https://r-pkgs.org/whole-game.html#edit-description
# Add license
use_gpl3_license()
# Build package
load_all()
document()
check()
# Add documentation
# RStudio menu: Code > Insert roxygen skeleton
# Create manual
build_manual(pkg = ".", path = "../")
sessionInfo()