Trajectory analysis of flow cytometry data
Before you start
This vignette assumes you have performed prior analysis using the basic vignette. To start, you need a flow_object, with UMAP coordinates and louvain clusters computed. For the purpose of the vignette, the package provides a test dataset.
Important: you may need to reanalyze your data upstream differently if you wish to perform trajectory analysis. Trajectory analysis is sensitive to smaller number of events, as it is much harder to find transitions between cell states when you have fewer cells in those transitions. Depending on your biological system, you may need to subsample more heavily at the start (250k - 1M events) to make sure that you are capturing all states in sufficient quantities.
Load libraries
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(unsupflowhelper))
suppressPackageStartupMessages(library(flowCore))
suppressPackageStartupMessages(library(patchwork)) ## to combine multiple plots in a single plot
suppressPackageStartupMessages(library(monocle3))
suppressPackageStartupMessages(library(tradeSeq))
::opts_chunk$set(echo = TRUE)
knitr
set.seed(1)
Load data
Go back to the basic vignette to see how to generate or load a flow_object.
Here, we will use the test dataset
data("th_trajectory")
Partitions
Explanation
The first concept to define are partitions. Monocle3 partitions act as supergroups of cells, or metaclusters if you will. They are meant to identify larger groups of cells that are still somewhat connected despite being organised in multiple clusters, but to separate from other groups of cells with a larger distance gap. A good example would be, in biological terms, CD4 and CD8 T cells. CD4 T cells would be part of one partition, and CD8 T cells would be part of another. They both contain naive, effector and memory subsets, but differ enough to be separated.
By default, the trajectory analysis will try to find trajectories across all cells, but very often that doesn’t make much biological sense. There is no reason for CD4 T cells to differentiate into CD8 T cells in the periphery. Partitioning helps you set up the analysis to generate parallel trajectories within each partition.
However, partitioning is not a perfect capture all of each group of cells. Sometimes, you are looking at related entities (say Naive CD4 T cells) and activated effector cells, but partitioning split them into separate partitions because the gap between them is too large: this is something you see when looking at a very potent stimulating agent like a TLR agonist, or a MHC tetramer.
Partitioning is meant to guide you in how to analyze your data, but ultimately, it’s your knowledge of the biology and the hypotheses that you’re testing that will be the final judge.
Code
By default, all cells are set to partition "1"
automatically. By running SetPartitions()
, you can either
let monocle3 detect partitions automatically
(automatic = T
), or supply it with a data.frame that
contains 1. row.names as cellID and 2. partition labels in a column
called partition. The former can be more practical if you want to have
your partitions named instead of generic numbers.
<- SetPartitions(th_trajectory, automatic = T)
th_trajectory
plot_umap(th_trajectory, group.by = "Condition") + plot_umap(th_trajectory, group.by = "partition")
Here, you see that cells towards the top (Th2; Partition 1) are split into a separate partition from the bottom (Partition 1) despite having a small group of cells connecting them to Partition 2. We know that biologically , those Th0 and Th1 cells are related, as the former differentiates into the latter. The partitioning here suggests that either 1. there are not enough events in that “bridge”, or that Th2 cells are too different from the rest phenotypically. Indeed, you can see they separate in the UMAP plot quite drastically.
Export to cds
Cluster subsampling
It is not always advisable to generate a trajectory analysis on the intact dataset, especially if you subsampled more heavily as mentioned in the Before you start section. While this approach makes it easier to find transition states, it also makes it computationally heavy.
The solution for this, is to perform cluster-based subsampling. This subsampling method randomly selects n events per cluster to equalize the data and make it more computationally manageable.
<- ClusterSubsampling(th_trajectory, n = 2000) flow_obj_th
Export flow_object to cds
Monocle3 works on cell_data_set
objects, not
flow_object
. Unsupflowhelper provides a wrapper function to
generate this object. However, you can provide either the intact dataset
(if it’s computationally feasible) or the subsampled one.
Furthermore, you can use the SubsetFlowObject
function
to select the partition(s) you wish to generate a
trajectory analysis for.
<- flow_to_cds(SubsetFlowObject(flow_obj_th, subset = partition %in% c("1", "2")))
cds
### monocle3 has some plotting functions of it's own.
plot_cells(cds, color_cells_by = "cluster", show_trajectory_graph = F)
Build trajectories
Learn graph
This is where monocle3 will attempt to find a trajectory across the
cells. By default, monocle3 will use embedded
partitions to find trajectories within each. If you wish to perform a
trajectory analysis within partitions, set
use_partition = FALSE
.
Also, different parameters will influence if the trajectory gets closed in a loop or open-ended, or the minimal branch length of secondary branches. More details in monocle3’s help documentation.
<- learn_graph(cds, close_loop = F, use_partition = F, learn_graph_control = list(minimal_branch_len = 15))
cds
plot_cells(cds, color_cells_by = "cluster", show_trajectory_graph = T)
cds umap plot
Pick graph segments
Having a graph is a great first start, but now we have to give it some directionality and meaning.
First step would be to infer where the tree
starts. If your dataset has multiple timepoints, you
could compare them using plot_density_umap()
to figure out
if one tree end matches with a region that is enriched in the earliest
timepoint. Another approach is to use your knowledge of the biological
system: in this case, we know that the bottom left clusters, i.e. nodes
1,8 and 9, are located in the region where Th0 samples are located.
Given that uncommitted Th0 cells are known to differentiate into Th1 or
Th2 cells, this seems like a good starting point.
Second, we observe that the graph splits off in 2 trajectories at the node 7. Thus we are looking at 2 possible trajectories. Each of those trajectories will have a parallel analysis.
<- choose_graph_segments(cds, clear_cds = F)
cds_th1 <- order_cells(cds_th1) cds_th1
<- choose_graph_segments(cds, clear_cds = F)
cds_th2 <- order_cells(cds_th2) cds_th2
Find trajectory Markers
After you ran order_cells
, your cds
object
contains pseudotime values for that trajectory.
Now, you might want to know which markers are variable across that trajectory. Which markers are correlated to pseudotime, are different between start and end, or perhaps even transient changes.
R package tradeSeq covers this for single-cell RNA-seq. While this package was not made for flow cytometry data, it also works to a good extent. One thing that differs is that, given single-cell data is sparse compared to flow, pvalues are extremely significant. A better metric to asses top markers is the waldStat, with higher values representing stronger markers.
<- FindTrajectoryMarkers(cds = cds_th1) th1_markers
Show correlated markers
$associationTest
th1_markers
<- FindTrajectoryMarkers(th1_markers) th1_markers_test
waldStat | df | pvalue | meanLogFC | |
---|---|---|---|---|
CD45RA | 1145.84230 | 5 | 0.0000000 | 0.2146127 |
CD45RO | 953.53324 | 5 | 0.0000000 | 0.2606542 |
Tbet | 771.22620 | 5 | 0.0000000 | 0.1690975 |
IL17A | 410.58815 | 5 | 0.0000000 | 0.1175697 |
IL4 | 216.60385 | 5 | 0.0000000 | 0.1477331 |
CD27 | 147.86698 | 5 | 0.0000000 | 0.0468434 |
IFNg | 97.80167 | 5 | 0.0000000 | 0.1783196 |
GATA3 | 50.67563 | 5 | 0.0000000 | 0.0333202 |
RORgt | 44.62429 | 5 | 0.0000000 | 0.0631953 |
TNFa | 24.10348 | 5 | 0.0002074 | 0.0863878 |
Show markers that differ between beginning and End.
$startVsEndTest th1_markers
waldStat | df | pvalue | logFClineage1 | |
---|---|---|---|---|
GATA3 | 452.0088153 | 1 | 0.0000000 | 0.4808084 |
CD45RO | 194.7305547 | 1 | 0.0000000 | -0.4800809 |
IFNg | 158.7021158 | 1 | 0.0000000 | 0.5862225 |
IL17A | 95.5864034 | 1 | 0.0000000 | -0.2945250 |
Tbet | 54.1693951 | 1 | 0.0000000 | 0.3379965 |
CD45RA | 34.8993640 | 1 | 0.0000000 | -0.2400213 |
RORgt | 33.7015927 | 1 | 0.0000000 | 0.2710490 |
IL4 | 27.0121192 | 1 | 0.0000002 | -0.1836674 |
TNFa | 6.9767628 | 1 | 0.0082575 | 0.1352903 |
CD27 | 0.0588948 | 1 | 0.8082513 | 0.0048742 |
Finally, the last value is the fitted GAM model by tradeSeq. You can
use it combined with startVsEnd with more specific pseudotimevalues
instead of the extremities, to find potentially transient markers, as
described in tradeSeq
’s main
vignette
startVsEndTest(th1_markers$fitGAM, pseudotimeValues = c(1, 5))
Import trajectories
Now that pseudotime values and trajectories are computed, we can import them back into a flow_object.
<- import_trajectory(th_trajectory,
th_trajectory cds = cds_th1,
trajectory_name = "Th1 trajectory")
<- import_trajectory(th_trajectory,
th_trajectory cds = cds_th2,
trajectory_name = "Th2 trajectory")
Now, if you used ClusterSubsampling, you’ll notice a lot of missing values, especially in high frequency clusters.
Imported pseudotimes
Pseudotime imputation
If you did use Cluster subsampling, you have lineages and markers, but it will be difficult to compare enrichment of cells witin or across trajectories in different metadata groups, as you potentially subsampled more heavily one group than another.
This issue is then solved by imputing pseudotime values from nearest neighbors within a trajectory.
<- impute_pseudotime(th_trajectory, trajectory = "Th1 trajectory") th_trajectory
Imputation
Visualization
Pseudotime UMAP
Now that trajectories and pseudotime have been added to your flow object, you have access to more visuals.
You can visualize pseudotime values for a given trajectory on the
UMAP plot with the plot_umap_pseudotime
function.
plot_umap_pseudotime(th_trajectory, trajectory = "Th1 trajectory", dot_size = 1)
Moreover, you can plot the trajectories on other UMAP plots, such as
plot_umap
, plot_density_umap
or
plot_umap_heat
with the parameter
show_trajectories = T
.
plot_umap(th_trajectory, show_trajectories = T, group.by = "louvain")
plot_umap_heat(th_trajectory, markers = c("Tbet","GATA3"), show_trajectories = T )
Pseudotime density
You may want to compare sample condition distributions
plot_grouped_pseudotime(th_trajectory, group.by = "Condition", trajectory = "Th1 trajectory", adjust = 2)
Trajectory markers
You will likely wish to see the expression of markers in relation to a given trajectory.
By default, the plot_pseudotime_markers
function will
produce a heatmap for all included markers.
plot_pseudotime_markers(th_trajectory, trajectory = "Th1 trajectory")
Marker heatmap
However, as you can see, some markers (i.e. GATA3) do not vary in
relation to pseudotime. You can visually appreciate that, perhaps select
some biologically relevant markers, or get markers from
FindTrajectoryMarkers
(above).
plot_pseudotime_markers(th_trajectory, trajectory = "Th1 trajectory", markers = c("Tbet", "IFNg", "CD45RO", "CD27", "CD45RA"))
Selected markers heatmap
RPM Bioinfo Solutions, adam.nicolas.pelletier@rpmbioinfo-solutions.com↩︎