This workshop introduces popular data visualization methods for single cell RNA-seq data. Specifically, we will cover UMAP/tSNE and heatmaps.
During the workshop, we will build an R script together, which is available at https://github.com/BIGslu/workshops/blob/main/2023.02.07_scRNAseq.viz.workshop/2023.02.07_live.notes.R
The video recording is available at https://youtu.be/x-BtCyLEH8c
Please following the setup instructions at https://bigslu.github.io/workshops/setup/setup.html
If you are brand new to R, we recommend completely our 1-hour Introduction to R workshop to prepare, https://bigslu.github.io/workshops/introR.workshop/introR.html
Create a new R project and new R script to save your code. See Intro R for more information if you are unfamiliar with these.
You should have installed packages prior to the workshop. For more information on installation, see the setup instructions.
Every time you open a new RStudio session, the packages you
want to use need to be loaded into the R workspace with the
library
function. This tells R to access the package’s
functions and prevents RStudio from lags that would occur if it
automatically loaded every downloaded package every time you opened it.
To put this in perspective, I had 464 packages installed at the time
this document was made.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Because tidyverse
is a meta-package, it automatically
tells you what packages it is loading and their versions. In addition,
the Conflicts section let’s you know functions in the
tidyverse
that alter exist in your R session. Because you
chose to load the package, calling the function filter
will
use the tidyverse
function not the stats
function (which comes with base R). If you for some reason needed the
stats
version, you can specify it with
package::
like stats::filter
.
library(Seurat)
## Attaching SeuratObject
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.12.1
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite either one:
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
##
##
## The new InteractiveComplexHeatmap package can directly export static
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(circlize)
## ========================================
## circlize version 0.4.15
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
##
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
## in R. Bioinformatics 2014.
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(circlize))
## ========================================
These packages automatically print some information of dependencies, citation, etc. You don’t need to worry about these messages.
library(ggalluvial)
And finally, these packages load silently.
We will use example data provided by 10X. These data were pre-cleaned and normalized for this workshop so we can get right to plotting! You can see the cleaning steps in this markdown with more information in the Seurat tutorial.
Please download
the data and place the RData
file in your project
directory.
Then, load the data into R.
load("pbmc_clean.RData")
First let’s explore the final data quality. While Seurat
has a number of plotting functions, we will also build a plot from
scratch so that you can further customize as needed.
The percent mitochondrial data are contained in the data frame
pbmc[["percent.mt"]]
. See the data
cleaning notes for how this metric was calculated.
class(pbmc[["percent.mt"]])
## [1] "data.frame"
head(pbmc[["percent.mt"]])
## percent.mt
## AAACATACAACCAC-1 3.0177759
## AAACATTGAGCTAC-1 3.7935958
## AAACATTGATCAGC-1 0.8897363
## AAACCGTGCTTCCG-1 1.7430845
## AAACCGTGTATGCG-1 1.2244898
## AAACGCACTGGTAC-1 1.6643551
Seurat
provides an easy way to make violin plots with
VlnPlot
.
VlnPlot(pbmc, features = "percent.mt", group.by = "orig.ident")
Now let’s recapitulate this plot in ggplot
. Ggplot uses
layers connected by +
to progressively build more and more
complex and customized plots. To start, we create a simple dot plot but
we see that this is not very readable since there are so many data
points (one per cell in the data set).
ggplot(data = pbmc[["percent.mt"]]) +
aes(x = 1, y = percent.mt) +
geom_point()
We can improve readability by jittering the points on the x axis.
ggplot(data = pbmc[["percent.mt"]]) +
aes(x = 1, y = percent.mt) +
geom_jitter(height = 0, width = 0.2)
Next, we add a violin by adding another layer.
ggplot(data = pbmc[["percent.mt"]]) +
aes(x = 1, y = percent.mt) +
geom_jitter(height = 0, width = 0.4) +
geom_violin()
Oh no! Our violin trend covers up our points. This is because ggplot adds layers progressively. We easily fix this by calling the violin layer before the jitter layer.
ggplot(data = pbmc[["percent.mt"]]) +
aes(x = 1, y = percent.mt) +
geom_violin() +
geom_jitter(height = 0, width = 0.4)
Finally, let’s customize the plot to look more like the
Seurat
version (though not exactly the same as some
defaults in the original plot are unnecessary - like the legend).
Checkout the other workshops at this end of this document to learn more
about ggplot customization.
ggplot(data = pbmc[["percent.mt"]]) +
aes(x = 1, y = percent.mt) +
geom_violin(fill = "salmon") + #Add color to violin
geom_jitter(height = 0, width = 0.4, size = 0.5) + #Make points smaller
#Add nice labels
labs(x = "Identity pbmc3k",
y = "",
title = "percent.mt") +
#Format background and axes
theme_classic() +
#Remove meaningless x-axis values
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank())
Your exact jitter plot will look different than mine. This is because this function uses a random seed to determine the x-axis values. You can force a reproducible plot by specifically setting the random seed.
The most common seeds are 0, 1, 123, and 42. You can use any number and change it up between scripts if you’d like.
set.seed(4389)
ggplot(data = pbmc[["percent.mt"]]) +
aes(x = 1, y = percent.mt) +
geom_jitter(height = 0, width = 0.2)
Alternatively, you can improve readability by using a bee swarm.
Checkout more in the ggbeeswarm
package.
geom_hline( )
pbmc[["nFeature_RNA"]]
and our original
quality filtering retained cells with 200 - 2500 features.Ordination simplifies the multi-dimensional gene expression data to a small number of latent variables that can be plotted on a simple XY plane. You may be familiar with principle component analysis (PCA) which is linear ordination. Single cell data performs better with non-linear ordinations like those described below.
In our ordination plots, we will color cell clusters and differentially expressed genes determined in the data cleaning notes.
Uniform Manifold Approximation and Projection (UMAP) is one possible non-linear ordination. We, unfortunately, don’t have time to go into the math but you can learn more in McInnes et al.
Let’s make a UMAP before discussing more. First, calculate the UMAP for 10 axes (i.e. latent variables).
pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:50:41 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:50:41 Read 2638 rows and found 10 numeric columns
## 16:50:41 Using Annoy for neighbor search, n_neighbors = 30
## 16:50:41 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:50:42 Writing NN index file to temp file /var/folders/_c/rdwjy6qn3bl7xt82xy_wp90w0000gn/T//RtmpMk2cP7/file11f42582a1c63
## 16:50:42 Searching Annoy index using 1 thread, search_k = 3000
## 16:50:42 Annoy recall = 100%
## 16:50:42 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:50:43 Initializing from normalized Laplacian + noise (using irlba)
## 16:50:43 Commencing optimization for 500 epochs, with 105124 positive edges
## 16:50:47 Optimization finished
Seurat
can quickly plot the results with
DimPlot
.
DimPlot(pbmc, reduction = "umap", group.by = "orig.ident")
And further color the pre-determined cell clusters.
DimPlot(pbmc, reduction = "umap", group.by = "ident")
Next, we can re-capitulate this plot in ggplot. We do some data
wrangling in the tidyverse
to extract and format all the
data we need. See the other workshops at the end of this document to
learn more about the tidyverse
.
#Extract the XY values for UMP
umap_xy <- as.data.frame(pbmc[["umap"]]@cell.embeddings)
#Add cell cluster information
cell_clusters <- as.data.frame(pbmc$seurat_clusters)
#Combine
umap_xy_clusters <- umap_xy %>%
#move rownames to data column
rownames_to_column("cell") %>%
#merge with cluster info
full_join(
cell_clusters %>% rownames_to_column("cell")
) %>%
#clean column names
rename(ident = `pbmc$seurat_clusters`)
## Joining, by = "cell"
head(umap_xy_clusters)
## cell UMAP_1 UMAP_2 ident
## 1 AAACATACAACCAC-1 3.448149 -3.393726 2
## 2 AAACATTGAGCTAC-1 3.107640 11.810047 3
## 3 AAACATTGATCAGC-1 6.537256 -4.587552 2
## 4 AAACCGTGCTTCCG-1 -10.413080 3.015030 1
## 5 AAACCGTGTATGCG-1 3.034070 2.898531 6
## 6 AAACGCACTGGTAC-1 4.263805 -5.398699 2
Then plot!
ggplot(data = umap_xy_clusters) +
aes(x = UMAP_1, y = UMAP_2) +
geom_point(size = 0.5) +
aes(color = ident) +
labs(title = "ident", color = "") +
theme_classic()
Sometimes UMAP is not automatically installed with
Seurat
. If you get an error, try unloading the package,
installing UMAP, and re-loading the package.
detach("package:Seurat", unload = TRUE)
reticulate::py_install(packages = 'umap-learn')
library(Seurat)
We can plot gene expression per cell on the ordination. In
Seurat
, you list the gene(s) (i.e. features) and it does so
with FeaturePlot
FeaturePlot(pbmc, reduction = "umap", features = c("LYZ", "MS4A1"))
Or we can extract and manipulate the pbmc
data to make a
similar plot ourselves.
#Define gene of interest
genes_of_interest <- c("LYZ", "MS4A1")
#Extract normalized gene expression data
pbmc_counts <- as.data.frame(GetAssayData(
object = pbmc, slot = "data"))
#Filter for gene of interest
umap_xy_expression <- pbmc_counts %>%
rownames_to_column("gene") %>%
filter(gene %in% genes_of_interest) %>%
#long format
pivot_longer(-gene, names_to = "cell") %>%
#combine with UMAP data
full_join(umap_xy_clusters)
## Joining, by = "cell"
head(umap_xy_expression)
## # A tibble: 6 × 6
## gene cell value UMAP_1 UMAP_2 ident
## <chr> <chr> <dbl> <dbl> <dbl> <fct>
## 1 MS4A1 AAACATACAACCAC-1 0 3.45 -3.39 2
## 2 MS4A1 AAACATTGAGCTAC-1 2.58 3.11 11.8 3
## 3 MS4A1 AAACATTGATCAGC-1 0 6.54 -4.59 2
## 4 MS4A1 AAACCGTGCTTCCG-1 0 -10.4 3.02 1
## 5 MS4A1 AAACCGTGTATGCG-1 0 3.03 2.90 6
## 6 MS4A1 AAACGCACTGGTAC-1 0 4.26 -5.40 2
ggplot(data = umap_xy_expression) +
aes(x = UMAP_1, y = UMAP_2) +
geom_point(size = 0.5) +
aes(color = value) + #Color by expression
scale_color_gradient(low = "lightgrey",
high = "blue") + #Change colors
facet_wrap(~ gene, scales = "free") + #Split by gene
labs(color = "") +
theme_classic()
Note that the ggplot version has a single color scale across all facets of the plot. This is helpful in cases where the ranges of gene expression are similar (like here) but can mask differences if scales are wildly different. If you want two separate scales, we recommend making two separate ggplots.
Similarly, we can use the tSNE algorithm to plot these data in
Seurat
.
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne", group.by = "ident")
Or make the plot from scratch.
#Extract the XY values for UMP
tsne_xy <- as.data.frame(pbmc[["tsne"]]@cell.embeddings)
#Use the same cell clusters as UMAP
#Combine
tsne_xy_clusters <- tsne_xy %>%
#move rownames to data column
rownames_to_column("cell") %>%
#merge with cluster info
full_join(
cell_clusters %>% rownames_to_column("cell")
) %>%
#clean column names
rename(ident = `pbmc$seurat_clusters`)
## Joining, by = "cell"
head(tsne_xy_clusters)
## cell tSNE_1 tSNE_2 ident
## 1 AAACATACAACCAC-1 -11.648669 6.765393 2
## 2 AAACATTGAGCTAC-1 -22.229239 -21.442232 3
## 3 AAACATTGATCAGC-1 -1.887318 23.947976 2
## 4 AAACCGTGCTTCCG-1 29.506157 -10.035204 1
## 5 AAACCGTGTATGCG-1 -33.962000 9.758798 6
## 6 AAACGCACTGGTAC-1 -2.840661 13.441918 2
ggplot(data = tsne_xy_clusters) +
aes(x = tSNE_1, y = tSNE_2) +
geom_point(size = 0.5) +
aes(color = ident) +
labs(title = "ident", color = "") +
theme_classic()
You may wish to visualize expression of more genes than is reasonable to show in individual ordinations. Heatmaps are one option to show a lot of data in a small space.
Here, we will plot the expression of the top 3 marker genes in
clusters. Gene markers were determined in data
cleaning notes and are contained in pbmc_markers
top3 <- pbmc_markers %>%
group_by(cluster) %>%
top_n(n = 3, wt = avg_log2FC)
head(top3)
## # A tibble: 6 × 7
## # Groups: cluster [2]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
## 2 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
## 3 3.28e- 49 1.05 0.333 0.103 4.50e- 45 0 LEF1
## 4 0 5.57 0.996 0.215 0 1 S100A9
## 5 0 5.48 0.975 0.121 0 1 S100A8
## 6 1.07e-267 4.55 1 0.517 1.47e-263 1 LYZ
Seurat
has DoHeatmap
to plot these
data.
DoHeatmap(pbmc, features = top3$gene)
Note that by default, this function only shows data with log
expression from -2.5 to 2.5. Anything outside this range is set to this
minima or maxima. This is an important distinction as your gene of
interest may not fall within the range. You can change this range with
the parameters disp.min
and disp.max
. It is
rare that you’ll need to change the minimum given the log transformation
but the maximum is low for some genes.
For example, to plot all of the data for our top 3 genes, we need to increase the maximum to 10. However, this makes the scale less readable and may point to a need for additional data cleaning as these high values are likely outliers.
DoHeatmap(pbmc, features = top3$gene, disp.max = 10)
To create the default Seurat
heatmap from scratch, there
are a lot of R package options. Many people have their
favorites and some packages have some features, some have others. Here,
we will use ComplexHeatmap
, because it can do nearly
everything you can think of in heatmaps. However, the trade-off if that
it’s syntax is somewhat confusing. There is an extensive reference
for ComplexHeatmap that will help you continue to customize heatmaps
in the future.
First, we must extract and reformat the data.
ComplexHeatmap
takes an input matrix that looks mostly like
the final heatmap, e.g. the rows are genes and the columns are cells.
Unfortunately, the pbmc_markers
data from
Seurat
does not contain cell level expression (just
averages), so we must go to the normalized expression data in
pbmc
.
We extract these data similar to the raw counts we used in ordination
only now we get the log-scaled data to be consistent with the
Seurat
heatmap.
pbmc_counts_log <- as.data.frame(GetAssayData(
object = pbmc,
slot = "scale.data"))
Filter these data for the top cluster genes, recode outlier values to 2.5, order cells and genes by cluster, and convert to a matrix with rownames.
#Order cells by clusters. Remember that we originally
# got these data from as.data.frame(pbmc$seurat_clusters)
cell_order <- cell_clusters %>%
arrange(`pbmc$seurat_clusters`)
pbmc_counts_log_top3 <- pbmc_counts_log %>%
#Filter top genes
rownames_to_column("gene") %>%
filter(gene %in% top3$gene) %>%
#Remove outliers
mutate_if(is.numeric, ~ifelse(. > 2.5, 2.5, .)) %>%
#Reorder cells
select(gene, all_of(rownames(cell_order))) %>%
#Reorder genes
mutate(gene = factor(gene, levels = top3$gene)) %>%
arrange(gene) %>%
#Convert to matrix
column_to_rownames("gene") %>%
as.matrix()
Then create a default ComplexHeatmap
. We turn off
clustering to keep the order the same as the original matrix.
Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
cluster_rows = FALSE, cluster_columns = FALSE)
Now, let’s modify the heatmap to look more like
Seurat
.
#Make color scale to span the data values
#Force black center at 0.5 to match Seurat
col_min <- min(pbmc_counts_log_top3, na.rm = TRUE)
col_max <- max(pbmc_counts_log_top3, na.rm = TRUE)
col_fun <- colorRamp2(c(col_min, 0.5, col_max),
c("magenta", "black", "yellow"))
#Format cluster annotation
hm_clusters <- cell_order$`pbmc$seurat_clusters`
names(hm_clusters) <- rownames(cell_order)
#Make cluster colors the same as ggplot default
hm_colors <- scales::hue_pal()(9)
names(hm_colors) <- as.character(c(0:8))
#Format ComplexHeatmap annotation
#This is the not great syntax that the reference can help with
hm_anno <- HeatmapAnnotation(
Identity = hm_clusters,
col = list(Identity = hm_colors),
show_annotation_name = FALSE)
#Make heatmap
Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
cluster_rows = FALSE, cluster_columns = FALSE,
name = "Expression",
column_split = hm_clusters,
col = col_fun, #recolor
show_column_names = FALSE, #remove cell (column) labels
row_names_side = "left", #move gene (row) labels
top_annotation = hm_anno) #add cluster annotation
The Seurat clustering above is aimed at grouping the cell clusters in numeric order. However, you may glean more information with hierarchical clustering from the actual expression data.
For example, if you cluster across all genes and within each cluster, we see that clusters 5 and 1 are very similar except in the gene FCGR3A.
Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
cluster_rows = TRUE, cluster_columns = TRUE, #Turn clustering on
name = "Expression",
column_split = hm_clusters,
col = col_fun, #recolor
show_column_names = FALSE, #remove cell (column) labels
row_names_side = "left", #move gene (row) labels
top_annotation = hm_anno #add cluster annotation
)
Or if we allow completely de novo clustering, we see that some cell clusters group well given only this gene set while others do not and likely require more marker genes to resolve.
Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
cluster_rows = TRUE, cluster_columns = TRUE, #Turn clustering on
name = "Expression",
col = col_fun, #recolor
show_column_names = FALSE, #remove cell (column) labels
row_names_side = "left", #move gene (row) labels
top_annotation = hm_anno #add cluster annotation
)
We don’t have any variables in these example data that lend themselves well to alluvial plotting. So, we will create a dummy T-cell receptor data set and use alluvial plots to compare V and J segment usage similar to Figure 2 in this paper.
Create the dummy data first. Be sure to set the same seed if you want your plot to look exactly like the one in this tutorial.
These data are simply the V and J name pairs for 100 “cells” from two individuals.
set.seed(86)
tcr <- data.frame(
Patient = c("pt01", "pt02"),
TRBV = paste0("TRBV", sample(1:3, 200, replace = TRUE)),
TRBJ = paste0("TRBJ", sample(1:3, 200, replace = TRUE))
)
head(tcr)
## Patient TRBV TRBJ
## 1 pt01 TRBV3 TRBJ3
## 2 pt02 TRBV1 TRBJ3
## 3 pt01 TRBV1 TRBJ1
## 4 pt02 TRBV1 TRBJ2
## 5 pt01 TRBV3 TRBJ3
## 6 pt02 TRBV2 TRBJ1
We then calculate how many times each VJ pair occurs in each patient.
Note that count( )
is a special function in the
tidyverse
which is shorthand for
mutate(n = n( ))
.
tcr_summary <- tcr %>%
group_by(Patient) %>%
count(TRBV, TRBJ) %>%
ungroup()
tcr_summary
## # A tibble: 18 × 4
## Patient TRBV TRBJ n
## <chr> <chr> <chr> <int>
## 1 pt01 TRBV1 TRBJ1 8
## 2 pt01 TRBV1 TRBJ2 10
## 3 pt01 TRBV1 TRBJ3 10
## 4 pt01 TRBV2 TRBJ1 9
## 5 pt01 TRBV2 TRBJ2 17
## 6 pt01 TRBV2 TRBJ3 12
## 7 pt01 TRBV3 TRBJ1 6
## 8 pt01 TRBV3 TRBJ2 17
## 9 pt01 TRBV3 TRBJ3 11
## 10 pt02 TRBV1 TRBJ1 7
## 11 pt02 TRBV1 TRBJ2 9
## 12 pt02 TRBV1 TRBJ3 16
## 13 pt02 TRBV2 TRBJ1 16
## 14 pt02 TRBV2 TRBJ2 10
## 15 pt02 TRBV2 TRBJ3 12
## 16 pt02 TRBV3 TRBJ1 9
## 17 pt02 TRBV3 TRBJ2 9
## 18 pt02 TRBV3 TRBJ3 12
We make a default alluvial plot in ggplot with
ggalluvial
. We set the y variable to total counts
(n
) and define each x axis value in the order that we want.
You can add more x groups beyond 3 as well.
ggplot(tcr_summary) +
aes(y = n, axis1 = Patient, axis2 = TRBV, axis3 = TRBJ) +
geom_alluvium(aes(fill = Patient))
Then adding labels and further improving the plot. Since this is a random dummy data set, there are no trends to note.
ggplot(tcr_summary) +
aes(y = n, axis1 = Patient, axis2 = TRBV, axis3 = TRBJ) +
geom_alluvium(aes(fill = Patient)) +
#Boxes around levels
geom_stratum(fill = NA) +
#Label levels
geom_text(aes(label = after_stat(stratum)),
stat = "stratum", size = 5) +
#Relabel x axis
scale_x_discrete(limits = c("Patient", "TRBV", "TRBJ")) +
#Format background and legend
theme_bw() +
theme(legend.position = "none")
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
You could also color by the V segment if that is what you want to highlight.
ggplot(tcr_summary) +
aes(y = n, axis1 = Patient, axis2 = TRBV, axis3 = TRBJ) +
geom_alluvium(aes(fill = TRBV)) + #Change color variable
#Boxes around levels
geom_stratum(fill = NA) +
#Label levels
geom_text(aes(label = after_stat(stratum)),
stat = "stratum", size = 5) +
#Relabel x axis
scale_x_discrete(limits = c("Patient", "TRBV", "TRBJ")) +
#Format background and legend
theme_bw() +
theme(legend.position = "none")
Sometimes alluvial plots are shown as circles instead. While you can
modify ggalluvial
plots to circle form, it requires a lot
of code. Instead, we will use circlize
as it is
specifically designed for these circular plots, called chord
diagrams.
Like alluvial plots, chord diagrams can have any number of groups. First, we just show the VJ connections with patient information. The default colors are random so your plot may differ.
tcr_summary %>%
select(-Patient) %>%
chordDiagram(.)
You can set the colors instead of them being random.
chord_colors <- c("#CC79A7","#D55E00","#F0E442",
"#0072B2","#009E73","#56B4E9")
names(chord_colors) <- c("TRBV1", "TRBV2", "TRBV3",
"TRBJ1", "TRBJ2", "TRBJ3")
tcr_summary %>%
select(-Patient) %>%
chordDiagram(., grid.col = chord_colors)
Unlike the alluvial plot, adding patient as a group along the circle does not make sense as you’d then need to follow two different connections around. There is honestly not an easy way to add patient to the above plot. We recommend creating 1 plot per patient to do these comparisons like so.
tcr_summary %>%
filter(Patient == "pt01") %>%
select(-Patient) %>%
chordDiagram(., grid.col = chord_colors)
title("Patient 01")
tcr_summary %>%
filter(Patient == "pt02") %>%
select(-Patient) %>%
chordDiagram(., grid.col = chord_colors)
title("Patient 02")
You can see more ways to customize chord plots at https://jokergoo.github.io/circlize_book/book