Trajectory analysis of flow cytometry data

Before you start

This vignette assumes you have performed prior analysis using the basic vignette. To start, you need a flow_object, with UMAP coordinates and louvain clusters computed. For the purpose of the vignette, the package provides a test dataset.

Important: you may need to reanalyze your data upstream differently if you wish to perform trajectory analysis. Trajectory analysis is sensitive to smaller number of events, as it is much harder to find transitions between cell states when you have fewer cells in those transitions. Depending on your biological system, you may need to subsample more heavily at the start (250k - 1M events) to make sure that you are capturing all states in sufficient quantities.

Load libraries

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(unsupflowhelper))
suppressPackageStartupMessages(library(flowCore))
suppressPackageStartupMessages(library(patchwork)) ## to combine multiple plots in a single plot
suppressPackageStartupMessages(library(monocle3))
suppressPackageStartupMessages(library(tradeSeq))
knitr::opts_chunk$set(echo = TRUE)


set.seed(1)

Load data

Go back to the basic vignette to see how to generate or load a flow_object.

Here, we will use the test dataset

data("th_trajectory")

Partitions

Explanation

The first concept to define are partitions. Monocle3 partitions act as supergroups of cells, or metaclusters if you will. They are meant to identify larger groups of cells that are still somewhat connected despite being organised in multiple clusters, but to separate from other groups of cells with a larger distance gap. A good example would be, in biological terms, CD4 and CD8 T cells. CD4 T cells would be part of one partition, and CD8 T cells would be part of another. They both contain naive, effector and memory subsets, but differ enough to be separated.

By default, the trajectory analysis will try to find trajectories across all cells, but very often that doesn’t make much biological sense. There is no reason for CD4 T cells to differentiate into CD8 T cells in the periphery. Partitioning helps you set up the analysis to generate parallel trajectories within each partition.

However, partitioning is not a perfect capture all of each group of cells. Sometimes, you are looking at related entities (say Naive CD4 T cells) and activated effector cells, but partitioning split them into separate partitions because the gap between them is too large: this is something you see when looking at a very potent stimulating agent like a TLR agonist, or a MHC tetramer.

Partitioning is meant to guide you in how to analyze your data, but ultimately, it’s your knowledge of the biology and the hypotheses that you’re testing that will be the final judge.

Code

By default, all cells are set to partition "1" automatically. By running SetPartitions(), you can either let monocle3 detect partitions automatically (automatic = T), or supply it with a data.frame that contains 1. row.names as cellID and 2. partition labels in a column called partition. The former can be more practical if you want to have your partitions named instead of generic numbers.

th_trajectory <- SetPartitions(th_trajectory, automatic = T)


plot_umap(th_trajectory, group.by = "Condition") + plot_umap(th_trajectory, group.by = "partition")

Here, you see that cells towards the top (Th2; Partition 1) are split into a separate partition from the bottom (Partition 1) despite having a small group of cells connecting them to Partition 2. We know that biologically , those Th0 and Th1 cells are related, as the former differentiates into the latter. The partitioning here suggests that either 1. there are not enough events in that “bridge”, or that Th2 cells are too different from the rest phenotypically. Indeed, you can see they separate in the UMAP plot quite drastically.

Export to cds

Cluster subsampling

It is not always advisable to generate a trajectory analysis on the intact dataset, especially if you subsampled more heavily as mentioned in the Before you start section. While this approach makes it easier to find transition states, it also makes it computationally heavy.

The solution for this, is to perform cluster-based subsampling. This subsampling method randomly selects n events per cluster to equalize the data and make it more computationally manageable.

flow_obj_th <- ClusterSubsampling(th_trajectory, n = 2000)

Export flow_object to cds

Monocle3 works on cell_data_set objects, not flow_object. Unsupflowhelper provides a wrapper function to generate this object. However, you can provide either the intact dataset (if it’s computationally feasible) or the subsampled one.

Furthermore, you can use the SubsetFlowObject function to select the partition(s) you wish to generate a trajectory analysis for.

cds <- flow_to_cds(SubsetFlowObject(flow_obj_th, subset = partition %in% c("1", "2")))

### monocle3 has some plotting functions of it's own. 
plot_cells(cds, color_cells_by = "cluster", show_trajectory_graph = F)

Build trajectories

Learn graph

This is where monocle3 will attempt to find a trajectory across the cells. By default, monocle3 will use embedded partitions to find trajectories within each. If you wish to perform a trajectory analysis within partitions, set use_partition = FALSE.

Also, different parameters will influence if the trajectory gets closed in a loop or open-ended, or the minimal branch length of secondary branches. More details in monocle3’s help documentation.

cds <- learn_graph(cds, close_loop = F, use_partition = F, learn_graph_control = list(minimal_branch_len = 15))


plot_cells(cds, color_cells_by = "cluster", show_trajectory_graph = T)

cds umap plot

Pick graph segments

Having a graph is a great first start, but now we have to give it some directionality and meaning.

First step would be to infer where the tree starts. If your dataset has multiple timepoints, you could compare them using plot_density_umap() to figure out if one tree end matches with a region that is enriched in the earliest timepoint. Another approach is to use your knowledge of the biological system: in this case, we know that the bottom left clusters, i.e. nodes 1,8 and 9, are located in the region where Th0 samples are located. Given that uncommitted Th0 cells are known to differentiate into Th1 or Th2 cells, this seems like a good starting point.

Second, we observe that the graph splits off in 2 trajectories at the node 7. Thus we are looking at 2 possible trajectories. Each of those trajectories will have a parallel analysis.

cds_th1 <- choose_graph_segments(cds, clear_cds = F)
cds_th1 <- order_cells(cds_th1)

order_cells_th1

cds_th2 <- choose_graph_segments(cds, clear_cds = F)
cds_th2 <- order_cells(cds_th2)

Choose_graph_segments_th2 order_cells_th2

Find trajectory Markers

After you ran order_cells, your cds object contains pseudotime values for that trajectory.

Now, you might want to know which markers are variable across that trajectory. Which markers are correlated to pseudotime, are different between start and end, or perhaps even transient changes.

R package tradeSeq covers this for single-cell RNA-seq. While this package was not made for flow cytometry data, it also works to a good extent. One thing that differs is that, given single-cell data is sparse compared to flow, pvalues are extremely significant. A better metric to asses top markers is the waldStat, with higher values representing stronger markers.

th1_markers <- FindTrajectoryMarkers(cds = cds_th1)

Show correlated markers

th1_markers$associationTest


th1_markers_test <- FindTrajectoryMarkers(th1_markers)

	waldStat	df	pvalue	meanLogFC
CD45RA	1145.84230	5	0.0000000	0.2146127
CD45RO	953.53324	5	0.0000000	0.2606542
Tbet	771.22620	5	0.0000000	0.1690975
IL17A	410.58815	5	0.0000000	0.1175697
IL4	216.60385	5	0.0000000	0.1477331
CD27	147.86698	5	0.0000000	0.0468434
IFNg	97.80167	5	0.0000000	0.1783196
GATA3	50.67563	5	0.0000000	0.0333202
RORgt	44.62429	5	0.0000000	0.0631953
TNFa	24.10348	5	0.0002074	0.0863878

Show markers that differ between beginning and End.

th1_markers$startVsEndTest

	waldStat	df	pvalue	logFClineage1
GATA3	452.0088153	1	0.0000000	0.4808084
CD45RO	194.7305547	1	0.0000000	-0.4800809
IFNg	158.7021158	1	0.0000000	0.5862225
IL17A	95.5864034	1	0.0000000	-0.2945250
Tbet	54.1693951	1	0.0000000	0.3379965
CD45RA	34.8993640	1	0.0000000	-0.2400213
RORgt	33.7015927	1	0.0000000	0.2710490
IL4	27.0121192	1	0.0000002	-0.1836674
TNFa	6.9767628	1	0.0082575	0.1352903
CD27	0.0588948	1	0.8082513	0.0048742

Finally, the last value is the fitted GAM model by tradeSeq. You can use it combined with startVsEnd with more specific pseudotimevalues instead of the extremities, to find potentially transient markers, as described in tradeSeq’s main vignette

startVsEndTest(th1_markers$fitGAM, pseudotimeValues = c(1, 5))

Import trajectories

Now that pseudotime values and trajectories are computed, we can import them back into a flow_object.

th_trajectory <- import_trajectory(th_trajectory, 
                                   cds = cds_th1,
                                   trajectory_name = "Th1 trajectory")

th_trajectory <- import_trajectory(th_trajectory, 
                                   cds = cds_th2,
                                   trajectory_name = "Th2 trajectory")

Now, if you used ClusterSubsampling, you’ll notice a lot of missing values, especially in high frequency clusters.

Imported pseudotimes

Pseudotime imputation

If you did use Cluster subsampling, you have lineages and markers, but it will be difficult to compare enrichment of cells witin or across trajectories in different metadata groups, as you potentially subsampled more heavily one group than another.

This issue is then solved by imputing pseudotime values from nearest neighbors within a trajectory.

th_trajectory <- impute_pseudotime(th_trajectory, trajectory = "Th1 trajectory")

Imputation

Visualization

Pseudotime UMAP

Now that trajectories and pseudotime have been added to your flow object, you have access to more visuals.

You can visualize pseudotime values for a given trajectory on the UMAP plot with the plot_umap_pseudotime function.

plot_umap_pseudotime(th_trajectory, trajectory = "Th1 trajectory", dot_size = 1)

Moreover, you can plot the trajectories on other UMAP plots, such as plot_umap, plot_density_umap or plot_umap_heat with the parameter show_trajectories = T.

plot_umap(th_trajectory, show_trajectories = T, group.by = "louvain")

plot_umap_heat(th_trajectory, markers = c("Tbet","GATA3"), show_trajectories = T )

Pseudotime density

You may want to compare sample condition distributions

plot_grouped_pseudotime(th_trajectory, group.by = "Condition", trajectory = "Th1 trajectory", adjust = 2)

Trajectory markers

You will likely wish to see the expression of markers in relation to a given trajectory.

By default, the plot_pseudotime_markers function will produce a heatmap for all included markers.

plot_pseudotime_markers(th_trajectory, trajectory = "Th1 trajectory")

Marker heatmap

However, as you can see, some markers (i.e. GATA3) do not vary in relation to pseudotime. You can visually appreciate that, perhaps select some biologically relevant markers, or get markers from FindTrajectoryMarkers (above).

plot_pseudotime_markers(th_trajectory, trajectory = "Th1 trajectory", markers = c("Tbet", "IFNg", "CD45RO", "CD27", "CD45RA"))

Selected markers heatmap

RPM Bioinfo Solutions, adam.nicolas.pelletier@rpmbioinfo-solutions.com ↩︎