Description

This workshop introduces popular data visualization methods for single cell RNA-seq data. Specifically, we will cover UMAP/tSNE and heatmaps.

During the workshop, we will build an R script together, which is available at https://github.com/BIGslu/workshops/blob/main/2023.02.07_scRNAseq.viz.workshop/2023.02.07_live.notes.R

The video recording is available at https://youtu.be/x-BtCyLEH8c

Prior to this workshop

Please following the setup instructions at https://bigslu.github.io/workshops/setup/setup.html

If you are brand new to R, we recommend completely our 1-hour Introduction to R workshop to prepare, https://bigslu.github.io/workshops/introR.workshop/introR.html

Setup

R project and script

Create a new R project and new R script to save your code. See Intro R for more information if you are unfamiliar with these.

Load packages

You should have installed packages prior to the workshop. For more information on installation, see the setup instructions.

Every time you open a new RStudio session, the packages you want to use need to be loaded into the R workspace with the library function. This tells R to access the package’s functions and prevents RStudio from lags that would occur if it automatically loaded every downloaded package every time you opened it. To put this in perspective, I had 464 packages installed at the time this document was made.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 1.0.0 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Because tidyverse is a meta-package, it automatically tells you what packages it is loading and their versions. In addition, the Conflicts section let’s you know functions in the tidyverse that alter exist in your R session. Because you chose to load the package, calling the function filter will use the tidyverse function not the stats function (which comes with base R). If you for some reason needed the stats version, you can specify it with package:: like stats::filter.

library(Seurat)
## Attaching SeuratObject
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.12.1
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(circlize)
## ========================================
## circlize version 0.4.15
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
## 
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
##   in R. Bioinformatics 2014.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(circlize))
## ========================================

These packages automatically print some information of dependencies, citation, etc. You don’t need to worry about these messages.

library(ggalluvial)

And finally, these packages load silently.

Download and load data

We will use example data provided by 10X. These data were pre-cleaned and normalized for this workshop so we can get right to plotting! You can see the cleaning steps in this markdown with more information in the Seurat tutorial.

Please download the data and place the RData file in your project directory.

Then, load the data into R.

load("pbmc_clean.RData")

Data quality

First let’s explore the final data quality. While Seurat has a number of plotting functions, we will also build a plot from scratch so that you can further customize as needed.

The percent mitochondrial data are contained in the data frame pbmc[["percent.mt"]]. See the data cleaning notes for how this metric was calculated.

class(pbmc[["percent.mt"]])
## [1] "data.frame"
head(pbmc[["percent.mt"]])
##                  percent.mt
## AAACATACAACCAC-1  3.0177759
## AAACATTGAGCTAC-1  3.7935958
## AAACATTGATCAGC-1  0.8897363
## AAACCGTGCTTCCG-1  1.7430845
## AAACCGTGTATGCG-1  1.2244898
## AAACGCACTGGTAC-1  1.6643551

Violin plot

Seurat provides an easy way to make violin plots with VlnPlot.

VlnPlot(pbmc, features = "percent.mt", group.by = "orig.ident")

Now let’s recapitulate this plot in ggplot. Ggplot uses layers connected by + to progressively build more and more complex and customized plots. To start, we create a simple dot plot but we see that this is not very readable since there are so many data points (one per cell in the data set).

ggplot(data = pbmc[["percent.mt"]]) +
  aes(x = 1, y = percent.mt) +
  geom_point()

We can improve readability by jittering the points on the x axis.

ggplot(data = pbmc[["percent.mt"]]) +
  aes(x = 1, y = percent.mt) +
  geom_jitter(height = 0, width = 0.2)

Next, we add a violin by adding another layer.

ggplot(data = pbmc[["percent.mt"]]) +
  aes(x = 1, y = percent.mt) +
  geom_jitter(height = 0, width = 0.4) +
  geom_violin()

Oh no! Our violin trend covers up our points. This is because ggplot adds layers progressively. We easily fix this by calling the violin layer before the jitter layer.

ggplot(data = pbmc[["percent.mt"]]) +
  aes(x = 1, y = percent.mt) +
  geom_violin() +
  geom_jitter(height = 0, width = 0.4)

Finally, let’s customize the plot to look more like the Seurat version (though not exactly the same as some defaults in the original plot are unnecessary - like the legend). Checkout the other workshops at this end of this document to learn more about ggplot customization.

ggplot(data = pbmc[["percent.mt"]]) +
  aes(x = 1, y = percent.mt) +
  geom_violin(fill = "salmon") + #Add color to violin
  geom_jitter(height = 0, width = 0.4, size = 0.5) + #Make points smaller
  #Add nice labels
  labs(x = "Identity pbmc3k", 
       y = "",
       title = "percent.mt") +
  #Format background and axes
  theme_classic() +
  #Remove meaningless x-axis values
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

About random seeds

Your exact jitter plot will look different than mine. This is because this function uses a random seed to determine the x-axis values. You can force a reproducible plot by specifically setting the random seed.

The most common seeds are 0, 1, 123, and 42. You can use any number and change it up between scripts if you’d like.

set.seed(4389)
ggplot(data = pbmc[["percent.mt"]]) +
  aes(x = 1, y = percent.mt) +
  geom_jitter(height = 0, width = 0.2)

An alternative to jitter

Alternatively, you can improve readability by using a bee swarm. Checkout more in the ggbeeswarm package.

Practice violin plot

  • Further customize the mitochondrial plot by adding a horizontal line showing the 5% cutoff used in data clean. Hint: checkout geom_hline( )
  • Make a violin plot showing the number of RNA features per cell. These data are in pbmc[["nFeature_RNA"]] and our original quality filtering retained cells with 200 - 2500 features.

Ordination

Ordination simplifies the multi-dimensional gene expression data to a small number of latent variables that can be plotted on a simple XY plane. You may be familiar with principle component analysis (PCA) which is linear ordination. Single cell data performs better with non-linear ordinations like those described below.

In our ordination plots, we will color cell clusters and differentially expressed genes determined in the data cleaning notes.

UMAP

Uniform Manifold Approximation and Projection (UMAP) is one possible non-linear ordination. We, unfortunately, don’t have time to go into the math but you can learn more in McInnes et al.

Let’s make a UMAP before discussing more. First, calculate the UMAP for 10 axes (i.e. latent variables).

pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:50:41 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:50:41 Read 2638 rows and found 10 numeric columns
## 16:50:41 Using Annoy for neighbor search, n_neighbors = 30
## 16:50:41 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:50:42 Writing NN index file to temp file /var/folders/_c/rdwjy6qn3bl7xt82xy_wp90w0000gn/T//RtmpMk2cP7/file11f42582a1c63
## 16:50:42 Searching Annoy index using 1 thread, search_k = 3000
## 16:50:42 Annoy recall = 100%
## 16:50:42 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:50:43 Initializing from normalized Laplacian + noise (using irlba)
## 16:50:43 Commencing optimization for 500 epochs, with 105124 positive edges
## 16:50:47 Optimization finished

Seurat can quickly plot the results with DimPlot.

DimPlot(pbmc, reduction = "umap", group.by = "orig.ident")

And further color the pre-determined cell clusters.

DimPlot(pbmc, reduction = "umap", group.by = "ident")

Next, we can re-capitulate this plot in ggplot. We do some data wrangling in the tidyverse to extract and format all the data we need. See the other workshops at the end of this document to learn more about the tidyverse.

#Extract the XY values for UMP
umap_xy <- as.data.frame(pbmc[["umap"]]@cell.embeddings)
#Add cell cluster information
cell_clusters <- as.data.frame(pbmc$seurat_clusters)
#Combine
umap_xy_clusters <- umap_xy %>% 
  #move rownames to data column
  rownames_to_column("cell") %>% 
  #merge with cluster info
  full_join(
    cell_clusters %>% rownames_to_column("cell")
  ) %>% 
  #clean column names
  rename(ident = `pbmc$seurat_clusters`)
## Joining, by = "cell"
head(umap_xy_clusters)
##               cell     UMAP_1    UMAP_2 ident
## 1 AAACATACAACCAC-1   3.448149 -3.393726     2
## 2 AAACATTGAGCTAC-1   3.107640 11.810047     3
## 3 AAACATTGATCAGC-1   6.537256 -4.587552     2
## 4 AAACCGTGCTTCCG-1 -10.413080  3.015030     1
## 5 AAACCGTGTATGCG-1   3.034070  2.898531     6
## 6 AAACGCACTGGTAC-1   4.263805 -5.398699     2

Then plot!

ggplot(data = umap_xy_clusters) +
  aes(x = UMAP_1, y = UMAP_2) +
  geom_point(size = 0.5) +
  aes(color = ident) +
  labs(title = "ident", color = "") +
  theme_classic()

Error with UMAP

Sometimes UMAP is not automatically installed with Seurat. If you get an error, try unloading the package, installing UMAP, and re-loading the package.

detach("package:Seurat", unload = TRUE)
reticulate::py_install(packages = 'umap-learn')
library(Seurat)

Highlight specific genes

We can plot gene expression per cell on the ordination. In Seurat, you list the gene(s) (i.e. features) and it does so with FeaturePlot

FeaturePlot(pbmc, reduction = "umap", features = c("LYZ", "MS4A1"))

Or we can extract and manipulate the pbmc data to make a similar plot ourselves.

#Define gene of interest
genes_of_interest <- c("LYZ", "MS4A1")

#Extract normalized gene expression data
pbmc_counts <- as.data.frame(GetAssayData(
  object = pbmc, slot = "data"))

#Filter for gene of interest
umap_xy_expression <- pbmc_counts %>% 
  rownames_to_column("gene") %>% 
  filter(gene %in% genes_of_interest) %>% 
  #long format
  pivot_longer(-gene, names_to = "cell") %>% 
  #combine with UMAP data
  full_join(umap_xy_clusters)
## Joining, by = "cell"
head(umap_xy_expression)
## # A tibble: 6 × 6
##   gene  cell             value UMAP_1 UMAP_2 ident
##   <chr> <chr>            <dbl>  <dbl>  <dbl> <fct>
## 1 MS4A1 AAACATACAACCAC-1  0      3.45  -3.39 2    
## 2 MS4A1 AAACATTGAGCTAC-1  2.58   3.11  11.8  3    
## 3 MS4A1 AAACATTGATCAGC-1  0      6.54  -4.59 2    
## 4 MS4A1 AAACCGTGCTTCCG-1  0    -10.4    3.02 1    
## 5 MS4A1 AAACCGTGTATGCG-1  0      3.03   2.90 6    
## 6 MS4A1 AAACGCACTGGTAC-1  0      4.26  -5.40 2
ggplot(data = umap_xy_expression) +
  aes(x = UMAP_1, y = UMAP_2) +
  geom_point(size = 0.5) +
  aes(color = value) + #Color by expression
  scale_color_gradient(low = "lightgrey", 
                       high = "blue") + #Change colors
  facet_wrap(~ gene, scales = "free") + #Split by gene
  labs(color = "") +
  theme_classic()

Note that the ggplot version has a single color scale across all facets of the plot. This is helpful in cases where the ranges of gene expression are similar (like here) but can mask differences if scales are wildly different. If you want two separate scales, we recommend making two separate ggplots.

tSNE

Similarly, we can use the tSNE algorithm to plot these data in Seurat.

pbmc <- RunTSNE(pbmc, dims = 1:10)

DimPlot(pbmc, reduction = "tsne", group.by = "ident")

Or make the plot from scratch.

#Extract the XY values for UMP
tsne_xy <- as.data.frame(pbmc[["tsne"]]@cell.embeddings)
#Use the same cell clusters as UMAP
#Combine
tsne_xy_clusters <- tsne_xy %>% 
  #move rownames to data column
  rownames_to_column("cell") %>% 
  #merge with cluster info
  full_join(
    cell_clusters %>% rownames_to_column("cell")
  ) %>% 
  #clean column names
  rename(ident = `pbmc$seurat_clusters`)
## Joining, by = "cell"
head(tsne_xy_clusters)
##               cell     tSNE_1     tSNE_2 ident
## 1 AAACATACAACCAC-1 -11.648669   6.765393     2
## 2 AAACATTGAGCTAC-1 -22.229239 -21.442232     3
## 3 AAACATTGATCAGC-1  -1.887318  23.947976     2
## 4 AAACCGTGCTTCCG-1  29.506157 -10.035204     1
## 5 AAACCGTGTATGCG-1 -33.962000   9.758798     6
## 6 AAACGCACTGGTAC-1  -2.840661  13.441918     2
ggplot(data = tsne_xy_clusters) +
  aes(x = tSNE_1, y = tSNE_2) +
  geom_point(size = 0.5) +
  aes(color = ident) +
  labs(title = "ident", color = "") +
  theme_classic()

Practice ordination

Heatmap

You may wish to visualize expression of more genes than is reasonable to show in individual ordinations. Heatmaps are one option to show a lot of data in a small space.

Here, we will plot the expression of the top 3 marker genes in clusters. Gene markers were determined in data cleaning notes and are contained in pbmc_markers

top3 <- pbmc_markers %>%
  group_by(cluster) %>%
  top_n(n = 3, wt = avg_log2FC)

head(top3)
## # A tibble: 6 × 7
## # Groups:   cluster [2]
##       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene  
##       <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
## 1 1.74e-109       1.07 0.897 0.593 2.39e-105 0       LDHB  
## 2 1.17e- 83       1.33 0.435 0.108 1.60e- 79 0       CCR7  
## 3 3.28e- 49       1.05 0.333 0.103 4.50e- 45 0       LEF1  
## 4 0               5.57 0.996 0.215 0         1       S100A9
## 5 0               5.48 0.975 0.121 0         1       S100A8
## 6 1.07e-267       4.55 1     0.517 1.47e-263 1       LYZ

Seurat has DoHeatmap to plot these data.

DoHeatmap(pbmc, features = top3$gene)

Note that by default, this function only shows data with log expression from -2.5 to 2.5. Anything outside this range is set to this minima or maxima. This is an important distinction as your gene of interest may not fall within the range. You can change this range with the parameters disp.min and disp.max. It is rare that you’ll need to change the minimum given the log transformation but the maximum is low for some genes.

For example, to plot all of the data for our top 3 genes, we need to increase the maximum to 10. However, this makes the scale less readable and may point to a need for additional data cleaning as these high values are likely outliers.

DoHeatmap(pbmc, features = top3$gene, disp.max = 10)

To create the default Seurat heatmap from scratch, there are a lot of R package options. Many people have their favorites and some packages have some features, some have others. Here, we will use ComplexHeatmap, because it can do nearly everything you can think of in heatmaps. However, the trade-off if that it’s syntax is somewhat confusing. There is an extensive reference for ComplexHeatmap that will help you continue to customize heatmaps in the future.

First, we must extract and reformat the data. ComplexHeatmap takes an input matrix that looks mostly like the final heatmap, e.g. the rows are genes and the columns are cells. Unfortunately, the pbmc_markers data from Seurat does not contain cell level expression (just averages), so we must go to the normalized expression data in pbmc.

We extract these data similar to the raw counts we used in ordination only now we get the log-scaled data to be consistent with the Seurat heatmap.

pbmc_counts_log <- as.data.frame(GetAssayData(
  object = pbmc, 
  slot = "scale.data"))

Filter these data for the top cluster genes, recode outlier values to 2.5, order cells and genes by cluster, and convert to a matrix with rownames.

#Order cells by clusters. Remember that we originally 
# got these data from as.data.frame(pbmc$seurat_clusters)
cell_order <- cell_clusters %>% 
  arrange(`pbmc$seurat_clusters`)

pbmc_counts_log_top3 <- pbmc_counts_log %>% 
  #Filter top genes
  rownames_to_column("gene") %>% 
  filter(gene %in% top3$gene) %>% 
  #Remove outliers
  mutate_if(is.numeric, ~ifelse(. > 2.5, 2.5, .)) %>% 
  #Reorder cells
  select(gene, all_of(rownames(cell_order))) %>% 
  #Reorder genes
  mutate(gene = factor(gene, levels = top3$gene)) %>% 
  arrange(gene) %>% 
  #Convert to matrix
  column_to_rownames("gene") %>% 
  as.matrix()

Then create a default ComplexHeatmap. We turn off clustering to keep the order the same as the original matrix.

Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
        cluster_rows = FALSE, cluster_columns = FALSE)

Now, let’s modify the heatmap to look more like Seurat.

#Make color scale to span the data values
#Force black center at 0.5 to match Seurat
col_min <- min(pbmc_counts_log_top3, na.rm = TRUE)
col_max <- max(pbmc_counts_log_top3, na.rm = TRUE)
col_fun <- colorRamp2(c(col_min, 0.5, col_max),
                      c("magenta", "black", "yellow"))

#Format cluster annotation
hm_clusters <- cell_order$`pbmc$seurat_clusters`
names(hm_clusters) <- rownames(cell_order)

#Make cluster colors the same as ggplot default
hm_colors <- scales::hue_pal()(9)
names(hm_colors) <- as.character(c(0:8))

#Format ComplexHeatmap annotation
#This is the not great syntax that the reference can help with
hm_anno <- HeatmapAnnotation(
  Identity = hm_clusters,
  col = list(Identity = hm_colors),
  show_annotation_name = FALSE)

#Make heatmap
Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
        cluster_rows = FALSE, cluster_columns = FALSE,
        name = "Expression",
        column_split = hm_clusters,
        col = col_fun, #recolor
        show_column_names = FALSE, #remove cell (column) labels
        row_names_side = "left", #move gene (row) labels
        top_annotation = hm_anno) #add cluster annotation

Heatmap clustering

The Seurat clustering above is aimed at grouping the cell clusters in numeric order. However, you may glean more information with hierarchical clustering from the actual expression data.

For example, if you cluster across all genes and within each cluster, we see that clusters 5 and 1 are very similar except in the gene FCGR3A.

Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
        cluster_rows = TRUE, cluster_columns = TRUE, #Turn clustering on
        name = "Expression",
        column_split = hm_clusters,
        col = col_fun, #recolor
        show_column_names = FALSE, #remove cell (column) labels
        row_names_side = "left", #move gene (row) labels
        top_annotation = hm_anno #add cluster annotation
        )

Or if we allow completely de novo clustering, we see that some cell clusters group well given only this gene set while others do not and likely require more marker genes to resolve.

Heatmap(pbmc_counts_log_top3, use_raster = FALSE,
        cluster_rows = TRUE, cluster_columns = TRUE, #Turn clustering on
        name = "Expression",
        col = col_fun, #recolor
        show_column_names = FALSE, #remove cell (column) labels
        row_names_side = "left", #move gene (row) labels
        top_annotation = hm_anno #add cluster annotation
        )

Practice heatmap

  • Create a heatmap of the top 20 marker genes in clusters. Which clusters better resolve with de novo clustering?
  • Create a heatmap of the top 3 marker genes in only clusters 0 and 1.

Alluvial plot

We don’t have any variables in these example data that lend themselves well to alluvial plotting. So, we will create a dummy T-cell receptor data set and use alluvial plots to compare V and J segment usage similar to Figure 2 in this paper.

Create the dummy data first. Be sure to set the same seed if you want your plot to look exactly like the one in this tutorial.

These data are simply the V and J name pairs for 100 “cells” from two individuals.

set.seed(86)
tcr <- data.frame(
  Patient = c("pt01", "pt02"),
  TRBV = paste0("TRBV", sample(1:3, 200, replace = TRUE)),
  TRBJ = paste0("TRBJ", sample(1:3, 200, replace = TRUE))
)

head(tcr)
##   Patient  TRBV  TRBJ
## 1    pt01 TRBV3 TRBJ3
## 2    pt02 TRBV1 TRBJ3
## 3    pt01 TRBV1 TRBJ1
## 4    pt02 TRBV1 TRBJ2
## 5    pt01 TRBV3 TRBJ3
## 6    pt02 TRBV2 TRBJ1

We then calculate how many times each VJ pair occurs in each patient. Note that count( ) is a special function in the tidyverse which is shorthand for mutate(n = n( )).

tcr_summary <- tcr %>% 
  group_by(Patient) %>% 
  count(TRBV, TRBJ) %>% 
  ungroup()

tcr_summary
## # A tibble: 18 × 4
##    Patient TRBV  TRBJ      n
##    <chr>   <chr> <chr> <int>
##  1 pt01    TRBV1 TRBJ1     8
##  2 pt01    TRBV1 TRBJ2    10
##  3 pt01    TRBV1 TRBJ3    10
##  4 pt01    TRBV2 TRBJ1     9
##  5 pt01    TRBV2 TRBJ2    17
##  6 pt01    TRBV2 TRBJ3    12
##  7 pt01    TRBV3 TRBJ1     6
##  8 pt01    TRBV3 TRBJ2    17
##  9 pt01    TRBV3 TRBJ3    11
## 10 pt02    TRBV1 TRBJ1     7
## 11 pt02    TRBV1 TRBJ2     9
## 12 pt02    TRBV1 TRBJ3    16
## 13 pt02    TRBV2 TRBJ1    16
## 14 pt02    TRBV2 TRBJ2    10
## 15 pt02    TRBV2 TRBJ3    12
## 16 pt02    TRBV3 TRBJ1     9
## 17 pt02    TRBV3 TRBJ2     9
## 18 pt02    TRBV3 TRBJ3    12

We make a default alluvial plot in ggplot with ggalluvial. We set the y variable to total counts (n) and define each x axis value in the order that we want. You can add more x groups beyond 3 as well.

ggplot(tcr_summary) +
  aes(y = n, axis1 = Patient, axis2 = TRBV, axis3 = TRBJ) +
  geom_alluvium(aes(fill = Patient))

Then adding labels and further improving the plot. Since this is a random dummy data set, there are no trends to note.

ggplot(tcr_summary) +
  aes(y = n, axis1 = Patient, axis2 = TRBV, axis3 = TRBJ) +
  geom_alluvium(aes(fill = Patient)) +
  #Boxes around levels
  geom_stratum(fill = NA) +
  #Label levels
  geom_text(aes(label = after_stat(stratum)),
            stat = "stratum", size = 5) +
  #Relabel x axis
  scale_x_discrete(limits = c("Patient", "TRBV", "TRBJ")) +
  #Format background and legend
  theme_bw() +
  theme(legend.position = "none")
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.

You could also color by the V segment if that is what you want to highlight.

ggplot(tcr_summary) +
  aes(y = n, axis1 = Patient, axis2 = TRBV, axis3 = TRBJ) +
  geom_alluvium(aes(fill = TRBV)) + #Change color variable
  #Boxes around levels
  geom_stratum(fill = NA) +
  #Label levels
  geom_text(aes(label = after_stat(stratum)),
            stat = "stratum", size = 5) +
  #Relabel x axis
  scale_x_discrete(limits = c("Patient", "TRBV", "TRBJ")) +
  #Format background and legend
  theme_bw() +
  theme(legend.position = "none")

Sometimes alluvial plots are shown as circles instead. While you can modify ggalluvial plots to circle form, it requires a lot of code. Instead, we will use circlize as it is specifically designed for these circular plots, called chord diagrams.

Like alluvial plots, chord diagrams can have any number of groups. First, we just show the VJ connections with patient information. The default colors are random so your plot may differ.

tcr_summary %>%
  select(-Patient) %>%
  chordDiagram(.)

You can set the colors instead of them being random.

chord_colors <- c("#CC79A7","#D55E00","#F0E442",
                  "#0072B2","#009E73","#56B4E9")
names(chord_colors) <- c("TRBV1", "TRBV2", "TRBV3",
                         "TRBJ1", "TRBJ2", "TRBJ3")

tcr_summary %>%
  select(-Patient) %>%
  chordDiagram(., grid.col = chord_colors)

Unlike the alluvial plot, adding patient as a group along the circle does not make sense as you’d then need to follow two different connections around. There is honestly not an easy way to add patient to the above plot. We recommend creating 1 plot per patient to do these comparisons like so.

tcr_summary %>%
  filter(Patient == "pt01") %>% 
  select(-Patient) %>%
  chordDiagram(., grid.col = chord_colors)
title("Patient 01")

tcr_summary %>%
  filter(Patient == "pt02") %>% 
  select(-Patient) %>%
  chordDiagram(., grid.col = chord_colors)
title("Patient 02")

You can see more ways to customize chord plots at https://jokergoo.github.io/circlize_book/book