Getting Started¶
Prerequisite¶
Make sure that the LANG and LC_ALL environment variables have been set. You can use the following command to check this:
locale
If some are not set, you can set them to the default locale, for instance:
export LANG="C"
export LC_ALL="C"
Dependencies¶
Make sure you have the following software installed:
- Nextflow
  - Currently VSN-Pipelines requires Nextflow version 20.10.0 or higher.
- A container system, either of:
  - Docker
  - Singularity
NOTE: Due to licensing restrictions, to use the cellranger components of VSN you must build and/or provide a container with cellranger and bcl2fastq2 installed yourself. A sample Dockerfile can be found in ./src/cellranger/; you must download bcl2fastq2 from the Illumina website and cellranger from the 10x Genomics website yourself to build this container.
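As a rough sketch of the build step (the installer file names and the image tag below are placeholders, not values required by VSN; the sample Dockerfile may expect different file names):
# Copy the installers downloaded from Illumina and 10x Genomics into the build context
cp bcl2fastq2-installer.zip cellranger-installer.tar.gz ./src/cellranger/
# Build the container image from the sample Dockerfile
docker build -t cellranger-bcl2fastq2:custom ./src/cellranger/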
Quick start¶
To run a quick test of the single sample analysis pipeline, we can use the 1k PBMC datasets provided by 10x Genomics. This will take only ~3min to run.
The data first needs to be downloaded (instructions can be found here).
Next, update to the latest pipeline version:
nextflow pull vib-singlecell-nf/vsn-pipelines
Next, generate a config file using the standard settings for the test data, and the appropriate profiles (e.g., replace singularity with docker if necessary):
nextflow config vib-singlecell-nf/vsn-pipelines \
    -profile tenx,singularity,single_sample > single_sample.config
The test pipeline can now be run using the config file just generated, specifying the single_sample workflow as an entrypoint:
nextflow -C single_sample.config \
    run vib-singlecell-nf/vsn-pipelines \
    -entry single_sample
Example Output¶
$ nextflow -C nextflow.config run $VSN -entry single_sample
N E X T F L O W ~ version 20.10.0
Launching `/staging/leuven/stg_00002/lcb/dwmax/documents/aertslab/GitHub/vib-singlecell-nf/vsn-pipelines/main.nf` [silly_pare] - revision: 77be3ba59d
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
executor > local (83)
[44/e02c9e] process > single_sample:SINGLE_SAMPLE:SC__FILE_CONVERTER (1) [100%] 2 of 2 ✔
[22/723593] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:QC_FILTER:SC__SCANPY__COMPUTE_QC_STATS (2) [100%] 2 of 2 ✔
[2e/10d845] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:QC_FILTER:SC__SCANPY__CELL_FILTER (2) [100%] 2 of 2 ✔
[d6/fbe4b6] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:QC_FILTER:SC__SCANPY__GENE_FILTER (2) [100%] 2 of 2 ✔
[22/d4a31b] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:QC_FILTER:GENERATE_DUAL_INPUT_REPORT:SC__SCANPY__GENERATE_DUAL_INPUT_REPORT (2) [100%] 2 of 2 ✔
[20/b43313] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:QC_FILTER:GENERATE_DUAL_INPUT_REPORT:SC__SCANPY__REPORT_TO_HTML (2) [100%] 2 of 2 ✔
[e3/ee3f9c] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:NORMALIZE_TRANSFORM:SC__SCANPY__NORMALIZATION (2) [100%] 2 of 2 ✔
[79/7f4e25] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:NORMALIZE_TRANSFORM:PUBLISH_H5AD_NORMALIZED:COMPRESS_HDF5 (2) [100%] 2 of 2 ✔
[40/370971] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:NORMALIZE_TRANSFORM:PUBLISH_H5AD_NORMALIZED:SC__PUBLISH (2) [100%] 2 of 2 ✔
[f1/aa0726] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:NORMALIZE_TRANSFORM:PUBLISH_H5AD_NORMALIZED:SC__PUBLISH_PROXY (2) [100%] 2 of 2 ✔
[76/e42ef9] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:NORMALIZE_TRANSFORM:SC__SCANPY__DATA_TRANSFORMATION (2) [100%] 2 of 2 ✔
[04/11b8b8] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:SC__SCANPY__FIND_HIGHLY_VARIABLE_GENES (2) [100%] 2 of 2 ✔
[1e/e1d058] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:SC__SCANPY__SUBSET_HIGHLY_VARIABLE_GENES (2) [100%] 2 of 2 ✔
[07/b3580a] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:SC__SCANPY__FEATURE_SCALING (2) [100%] 2 of 2 ✔
[b4/00bf5e] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:PUBLISH_H5AD_HVG_SCALED:COMPRESS_HDF5 (2) [100%] 2 of 2 ✔
[8f/4d5d49] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:PUBLISH_H5AD_HVG_SCALED:SC__PUBLISH (2) [100%] 2 of 2 ✔
[9a/3c5d0d] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:PUBLISH_H5AD_HVG_SCALED:SC__PUBLISH_PROXY (2) [100%] 2 of 2 ✔
[dc/40cda6] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:GENERATE_REPORT:SC__SCANPY__GENERATE_REPORT (2) [100%] 2 of 2 ✔
[62/9dc791] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:HVG_SELECTION:GENERATE_REPORT:SC__SCANPY__REPORT_TO_HTML (2) [100%] 2 of 2 ✔
[8c/ed79b8] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:DIM_REDUCTION_PCA:SC__SCANPY__DIM_REDUCTION__PCA (2) [100%] 2 of 2 ✔
[be/ed9c2e] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:NEIGHBORHOOD_GRAPH:SC__SCANPY__NEIGHBORHOOD_GRAPH (2) [100%] 2 of 2 ✔
[01/ec367e] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:DIM_REDUCTION_TSNE_UMAP:SC__SCANPY__DIM_REDUCTION__TSNE (2) [100%] 2 of 2 ✔
[ea/7fbf7c] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:DIM_REDUCTION_TSNE_UMAP:SC__SCANPY__DIM_REDUCTION__UMAP (2) [100%] 2 of 2 ✔
[e5/a5a70a] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:DIM_REDUCTION_TSNE_UMAP:GENERATE_REPORT:SC__SCANPY__GENERATE_REPORT (2) [100%] 2 of 2 ✔
[dd/b38b9b] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:DIM_REDUCTION_TSNE_UMAP:GENERATE_REPORT:SC__SCANPY__REPORT_TO_HTML (2) [100%] 2 of 2 ✔
[5f/5bcb4d] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:CLUSTER_IDENTIFICATION:SC__SCANPY__CLUSTERING (2) [100%] 2 of 2 ✔
[fa/9765a9] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:CLUSTER_IDENTIFICATION:GENERATE_REPORT:SC__SCANPY__GENERATE_REPORT (2) [100%] 2 of 2 ✔
[aa/7b6adb] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:CLUSTER_IDENTIFICATION:GENERATE_REPORT:SC__SCANPY__REPORT_TO_HTML (2) [100%] 2 of 2 ✔
[0f/82f171] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:CLUSTER_IDENTIFICATION:SC__SCANPY__MARKER_GENES (2) [100%] 2 of 2 ✔
[96/04fc81] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:UTILS__GENERATE_WORKFLOW_CONFIG_REPORT [100%] 1 of 1 ✔
[ee/7fe3fa] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:SC__SCANPY__MERGE_REPORTS (2) [100%] 2 of 2 ✔
[6f/7cbcb5] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:SC__SCANPY__REPORT_TO_HTML (2) [100%] 2 of 2 ✔
[87/7e681b] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:FINALIZE:SC__H5AD_TO_FILTERED_LOOM (2) [100%] 2 of 2 ✔
[f0/176c0c] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:FINALIZE:FILE_CONVERTER_TO_SCOPE:SC__H5AD_TO_LOOM (1) [100%] 2 of 2 ✔
[b3/608cde] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:FINALIZE:FILE_CONVERTER_TO_SCANPY:SC__H5AD_MERGE (2) [100%] 2 of 2 ✔
[d1/43da78] process > single_sample:SINGLE_SAMPLE:SCANPY__SINGLE_SAMPLE:PUBLISH:SC__PUBLISH_PROXY (2) [100%] 2 of 2 ✔
[c3/6209f8] process > single_sample:PUBLISH_SINGLE_SAMPLE_SCOPE:COMPRESS_HDF5 (2) [100%] 2 of 2 ✔
[d5/e1a0c3] process > single_sample:PUBLISH_SINGLE_SAMPLE_SCOPE:SC__PUBLISH (2) [100%] 2 of 2 ✔
[4b/2e236a] process > single_sample:PUBLISH_SINGLE_SAMPLE_SCOPE:SC__PUBLISH_PROXY (2) [100%] 2 of 2 ✔
[87/f3f350] process > single_sample:PUBLISH_SINGLE_SAMPLE_SCANPY:COMPRESS_HDF5 (2) [100%] 2 of 2 ✔
[d4/2c09af] process > single_sample:PUBLISH_SINGLE_SAMPLE_SCANPY:SC__PUBLISH (2) [100%] 2 of 2 ✔
[da/3817b5] process > single_sample:PUBLISH_SINGLE_SAMPLE_SCANPY:SC__PUBLISH_PROXY (2) [100%] 2 of 2 ✔
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Completed at: 12-Nov-2020 10:55:52
Duration : 2m 36s
CPU hours : 0.6
Succeeded : 83
Output¶
The pipelines will generate the following types of results in the output directory (params.global.outdir), by default out/:
- data: contains the workflow output file (in h5ad format), plus symlinks to all the intermediate files.
- loom: contains final loom files which can be imported inside the SCope visualization tool for further visualization of the results.
- notebooks: contains all the notebooks generated along the pipeline (e.g.: Quality control report).
- pipeline_reports: Nextflow DAG, execution, timeline, and trace reports.
For a full list of the pipelines available please see the pipelines page.
Further pipeline configuration details¶
This pipeline can be fully configured and run on custom data with a few steps.
The recommended method is to first run nextflow config ... to generate a complete config file (with the default parameters) in your working directory. The tool-specific parameters, as well as the Docker/Singularity profiles, are included when specifying the appropriate profiles to nextflow config.
First, update to the latest pipeline version (this will update the Nextflow cache of the repository, typically located in ~/.nextflow/assets/vib-singlecell-nf/):
nextflow pull vib-singlecell-nf/vsn-pipelines
Next, a config file needs to be generated. This step will merge parameters from multiple profiles together to create a master config which specifies all parameters used by the pipeline. In this example, these are tenx for the input data, singularity to use the Singularity container system (replace with docker if necessary), and single_sample to load the defaults for the single sample pipeline. In your working directory, run nextflow config ... with the appropriate profiles:
nextflow config vib-singlecell-nf/vsn-pipelines \
    -profile tenx,singularity,single_sample > single_sample.config
Now, edits can be made to single_sample.config. Generally, the default values are acceptable for a first pass, but certain variables (input directory, etc.) need to be changed. In particular, the following parameters are frequently modified in practice:
- params.global.project_name: a project name which will be included in some of the output file names.
- params.data.tenx.cellranger_mex, which should point to the outs/ folder generated by Cell Ranger (if using 10x data). See Information on using 10x Genomics datasets for additional info.
- Filtering parameters (params.sc.scanpy.filter): filtering parameters, which will be applied to all samples, can be set here: min/max genes, mitochondrial read fraction, and min cells. See Multi-sample parameters for additional info on how to specify sample-specific parameters.
- Louvain cluster resolution: params.sc.scanpy.clustering.resolution.
- Cell- and sample-level annotations are also possible.
Run the workflow using the new config file (using -C is recommended to use only this file), specifying the proper workflow as the entry point:
nextflow -C single_sample.config \
    run vib-singlecell-nf/vsn-pipelines \
    -entry single_sample
Additional resources for running on custom data¶
Finally, see the list of case studies with specific examples and full config files at VSN-Pipelines-examples.
Input Data Formats¶
Depending on the type of data you run the pipeline with, one or more appropriate profiles should be set when running nextflow config. These profiles are indicated in the sections below.
Specifying multiple samples¶
All the input data parameters are compatible with the following features:
- Glob patterns
"data/10x/1k_pbmc/1k_pbmc_*/outs/"
- Comma separated paths (paths can contain glob patterns)
"data/10x/1k_pbmc/1k_pbmc_v2_chemistry/outs/, data/10x/1k_pbmc/1k_pbmc_v3_chemistry/outs/"
- Array of paths (paths can contain glob patterns)
[
"data/10x/1k_pbmc/1k_pbmc_v2_chemistry/outs/",
"data/10x/1k_pbmc/1k_pbmc_v3_chemistry/outs/"
]
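Any of these forms can be used directly as the value of an input data parameter in the generated config. A minimal sketch using the tenx input channel described below (paths are illustrative):
data {
    tenx {
        cellranger_mex = [
            "data/10x/1k_pbmc/1k_pbmc_v2_chemistry/outs/",
            "data/10x/1k_pbmc/1k_pbmc_v3_chemistry/outs/"
        ]
    }
}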
Cell Ranger (10x Genomics)¶
Data from a standard Cell Ranger output directory can be easily ingested into the pipeline by using the proper input channel (tenx_mex or tenx_h5, depending on which file should be used).
Multiple samples can be selected by providing the path to this directory using glob patterns.
/home/data/
└── cellranger
├── sample_A
│ └── outs
│ ├── filtered_feature_bc_matrix
│ │ ├── barcodes.tsv
│ │ ├── genes.tsv
│ │ └── matrix.mtx
│ └── filtered_feature_bc_matrix.h5
└── sample_B
└── outs
├── filtered_feature_bc_matrix
│ ├── barcodes.tsv
│ ├── genes.tsv
│ └── matrix.mtx
└── filtered_feature_bc_matrix.h5
MEX¶
To use the Cell Ranger Market Exchange (MEX) files, use the following profile when generating the config file:
-profile tenx
This profile adds the following parameter (params.data.tenx.cellranger_mex
) into the generated .config file:
[...]
data {
tenx {
cellranger_mex = "/home/data/cellranger/sample*/outs/"
}
}
[...]
H5¶
To use the Cell Ranger h5
file as input, use the following profile:
-profile tenx_h5
This profile adds the params.data.tenx.cellranger_h5
parameter into the generated .config file:
[...]
data {
tenx {
cellranger_h5 = "/home/data/cellranger/sample*/outs/"
}
}
[...]
Input file detection¶
Setting the input directory appropriately, using a glob in the directory path in place of the sample names, will collect all the samples listed in the filtered_[feature|gene]_bc_matrix
directories listed above.
For example, in params.data.tenx
, setting:
cellranger_mex = "/home/data/cellranger/sample*/outs/"
or
cellranger_h5 = "/home/data/cellranger/sample*/outs/"
will recursively find all 10x samples in that directory.
The pipeline will use either the outs/filtered_feature_bc_matrix/ or the outs/raw_feature_bc_matrix/ directory, depending on the setting of params.sc.file_converter.useFilteredMatrix (true uses filtered; false uses raw).
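For example, to have the converter use the raw matrix instead of the filtered one, the parameter could be set in the generated config as follows (a minimal sketch showing only the relevant entry):
params {
    sc {
        file_converter {
            // false selects outs/raw_feature_bc_matrix/; true selects outs/filtered_feature_bc_matrix/
            useFilteredMatrix = false
        }
    }
}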
H5AD (Scanpy)¶
Use the following profile when generating the config file:
-profile h5ad
In the generated .config file, make sure the file_paths
parameter is set with the paths to the .h5ad
files:
[...]
data {
h5ad {
file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.h5ad"
suffix = "_SUFFIX.SC__FILE_CONVERTER.h5ad"
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name); e.g. with the suffix above, 1k_pbmc_v2_chemistry_SUFFIX.SC__FILE_CONVERTER.h5ad yields the sample name 1k_pbmc_v2_chemistry.
In case there are multiple .h5ad files that need to be processed with different suffixes, the multi-labelled strategy should be used to define the h5ad parameter:
[...]
data {
h5ad {
GROUP1 {
file_paths = "[path-to-group1-files]/*.SUFFIX1.h5ad"
suffix = ".SUFFIX1.h5ad"
group = ["technology", "10x"]
}
GROUP2 {
file_paths = "[path-to-group1-files]/*.SUFFIX2.h5ad"
suffix = ".SUFFIX2.h5ad"
group = ["technology", "smart-seq2"]
}
}
}
[...]
Notes:
- GROUP1, GROUP2 are just example names here. They can be replaced by any value as long as they are alphanumeric (underscores are allowed).
- All the different suffixes defined should be unique.
- file_paths and suffix do allow lists of paths/globs in the multi-labelled strategy.
- group [optional] should be an array of 2 elements, where the first element defines the group name and the second the group value. This will add a cell-based annotation for each group of files.
Loom¶
Use the following profile when generating the config file:
-profile loom
In the generated .config file, make sure the file_paths
parameter is set with the paths to the .loom
files:
[...]
data {
loom {
file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.loom"
suffix = "_SUFFIX.SC__FILE_CONVERTER.loom"
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
Seurat Rds¶
Use the following profile when generating the config file:
-profile seurat_rds
In the generated .config file, make sure the file_paths
parameter is set with the paths to the .Rds
files:
[...]
data {
seurat_rds {
file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.Rds"
suffix = "_SUFFIX.SC__FILE_CONVERTER.Rds"
}
}
[...]
- The pipelines expect a Seurat v3 object contained in the .Rds file. (Seurat v2 objects are currently not supported).
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
TSV¶
Use the following profile when generating the config file:
-profile tsv
In the generated .config file, make sure the file_paths
parameter is set with the paths to the .tsv
files:
[...]
data {
tsv {
file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.tsv"
suffix = "_SUFFIX.SC__FILE_CONVERTER.tsv"
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
CSV¶
Use the following profile when generating the config file:
-profile csv
In the generated .config file, make sure the file_paths
parameter is set with the paths to the .csv
files:
[...]
data {
csv {
file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.csv"
suffix = "_SUFFIX.SC__FILE_CONVERTER.csv"
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
Pipelines¶
Single-sample Pipelines¶
Pipelines to run on a single sample or multiple samples separately and in parallel.
single_sample
¶
The single_sample workflow will process 10x data, taking in 10x-structured data and a metadata file.
The standard analysis steps are run: filtering, normalization, log-transformation, HVG selection, dimensionality reduction, clustering, and loom file generation.
The output is a loom file with the results embedded.
single_sample_scenic
¶
Runs the single_sample
workflow above, then runs the scenic
workflow on the output, generating a comprehensive loom file with the combined results.
This could be very resource intensive, depending on the dataset.
single_sample_scrublet
¶
Runs the single_sample workflow above together with the Scrublet workflow.
The single_sample workflow is run on the input data.
The scrublet workflow is run on the input data.
The final processed file from the single_sample pipeline is annotated with the cell-based data generated by Scrublet.
The pipelines generate the following relevant files for each sample:
Output File | Description |
---|---|
out/data/*.SINGLE_SAMPLE_SCRUBLET.loom | SCope-ready loom file containing resulting loom file from a single_sample workflow but with additional metadata (doublet scores and predicted doublet for the cells) based on Scrublet run. |
out/data/scrublet/*.SC__SCRUBLET__DOUBLET_DETECTION.ScrubletObject.pklz | Pickled file containing the Scrublet object. |
out/data/scrublet/*.SCRUBLET.SC__ANNOTATE_BY_CELL_METADATA.h5ad | h5ad file with raw data and doublets annotated. |
out/data/scrublet/*.SINGLE_SAMPLE_SCRUBLET.h5ad | h5ad file resulting from a single_sample workflow run and with doublets (inferred from Scrublet) removed. |
Currently there are 3 methods available to call doublets from Scrublet doublet scores:
- (Default) Scrublet will try to automatically identify the doublet score threshold. The threshold is then used to call doublets based on the doublet scores available in the scrublet__doublet_scores column. The doublets called are available in the scrublet__predicted_doublets column.
- It can happen that Scrublet fails to find the automatic threshold. In that case, the pipeline will fail and let you know that either the method described in 3. has to be used or a custom threshold has to be provided. Either way, the pipeline will generate the Scrublet histograms. This is especially helpful if the user decides to select a custom threshold, which will need to be reflected in the config as follows:
params {
sc {
scrublet {
threshold = [
"<sample-name>": <custom-threshold>
]
}
}
}
- This method is specific to samples generated by the 10x Genomics single-cell platform. It is based on the rate of the expected number of doublets in 10x Genomics samples. The number of doublets called (D) will be equal to the expected doublet rate (given a number of cells) times the number of cells in that 10x Genomics sample; for instance, if the expected doublet rate for a sample were 8% and it contained 10,000 cells, D = 0.08 × 10,000 = 800. The cells are then ranked by their Scrublet doublet score (descending order) and the top D cells are called as doublets.
decontx
¶
Runs the decontx
workflow.
The pipelines generate the following files for each sample:
Output File | Description |
---|---|
out/data/*.CELDA_DECONTX_{FILTER,CORRECT}.h5ad | A h5ad file with either the filtered matrix using one of the provided filters or the corrected (decontaminated) matrix by DecontX. |
out/data/celda/*.CELDA__DECONTX.Rds | A Rds file containing the SingleCellExperiment object processed by DecontX. |
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Table.tsv | A cell-based .tsv file containing data generated by DecontX and additional outlier masks. |
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Thresholds.tsv | A .tsv containing a table with the different thresholds used for generating the outlier masks. |
out/data/celda/*.CELDA__DECONTX.Contamination_Score_Density_with_{doublemad,scater_isOutlier_3MAD,custom_gt_0.5}.pdf | A .pdf plot showing the density of the decontamination score from DecontX and the outlier area highlighted for the given outlier threshold. |
out/data/celda/*.CELDA__DECONTX.UMAP_Contamination_Score.pdf | A .pdf plot showing the DecontX contamination score on top of a UMAP generated from the decontaminated matrix. |
out/data/celda/*.CELDA__DECONTX.UMAP_Clusters.pdf | A .pdf plot showing a UMAP generated by DecontX and from the decontaminated matrix. |
single_sample_decontx
¶
Runs the single_sample workflow above together with the DecontX workflow.
The DecontX workflow is run on the input data.
The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX.
See single_sample
and decontx
to know more about the files generated by this pipeline.
single_sample_decontx_scrublet
¶
Runs the single_sample workflow above together with the DecontX and Scrublet workflows.
The single_sample workflow is run on the input data.
The decontx workflow is run on the input data.
The scrublet workflow is run on the output of the DecontX workflow.
The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX and Scrublet.
See single_sample
, decontx
and scrublet
to know more about the files generated by this pipeline.
scenic
¶
Runs the scenic
workflow alone, generating a loom file with only the SCENIC results.
Currently, the required input is a loom file (set by params.sc.scenic.filteredLoom).
scenic_multiruns
¶
Runs the scenic
workflow multiple times (set by params.sc.scenic.numRuns
), generating a loom file with the aggregated results from the multiple SCENIC runs.
Note that this is not a complete entry-point itself, but a configuration option for the scenic module. Simply adding -profile scenic_multiruns during the config step will activate this analysis option for any of the standard entrypoints.
cellranger¶
Runs the cellranger workflow (mkfastq, then count).
Input parameters are specified within the config file:
- params.sc.cellranger.mkfastq.csv: path to the CSV samplesheet
- params.sc.cellranger.mkfastq.runFolder: path of the Illumina BCL run folder
- params.sc.cellranger.count.transcriptome: path to the Cell Ranger compatible transcriptome reference
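A sketch of how these entries might look in the generated config (all paths are placeholders):
params {
    sc {
        cellranger {
            mkfastq {
                csv = '/path/to/samplesheet.csv'           // CSV samplesheet
                runFolder = '/path/to/bcl_run_folder/'     // Illumina BCL run folder
            }
            count {
                transcriptome = '/path/to/cellranger_reference/'  // Cell Ranger compatible transcriptome reference
            }
        }
    }
}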
cellranger_count_metadata¶
Given the data stored as:
MKFASTQ_ID_SEQ_RUN1
|-- MAKE_FASTQS_CS
-- outs
|-- fastq_path
|-- HFLC5BBXX
|-- test_sample1
| |-- sample1_S1_L001_I1_001.fastq.gz
| |-- sample1_S1_L001_R1_001.fastq.gz
| |-- sample1_S1_L001_R2_001.fastq.gz
| |-- sample1_S1_L002_I1_001.fastq.gz
| |-- sample1_S1_L002_R1_001.fastq.gz
| |-- sample1_S1_L002_R2_001.fastq.gz
| |-- sample1_S1_L003_I1_001.fastq.gz
| |-- sample1_S1_L003_R1_001.fastq.gz
| |-- sample1_S1_L003_R2_001.fastq.gz
|-- test_sample2
| |-- sample2_S2_L001_I1_001.fastq.gz
| |-- sample2_S2_L001_R1_001.fastq.gz
| |-- ...
|-- Reports
|-- Stats
|-- Undetermined_S0_L001_I1_001.fastq.gz
...
-- Undetermined_S0_L003_R2_001.fastq.gz
MKFASTQ_ID_SEQ_RUN2
|-- MAKE_FASTQS_CS
-- outs
|-- fastq_path
|-- HFLY8GGLL
|-- test_sample1
| |-- ...
|-- test_sample2
| |-- ...
|-- ...
and a metadata table:
sample_name | fastqs_parent_dir_path | fastqs_dir_name | fastqs_sample_prefix | expect_cells |
---|---|---|---|---|
Sample1_Bio_Rep1 | MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX | test_sample1 | sample1 | 5000 |
Sample1_Bio_Rep1 | MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL | test_sample1 | sample1 | 5000 |
Sample1_Bio_Rep2 | MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX | test_sample2 | sample2 | 5000 |
Sample1_Bio_Rep2 | MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL | test_sample2 | sample2 | 5000 |
Optional columns:
- short_uuid: sample_name will be prefixed with this value. This should be the same between sequencing runs of the same biological replicate.
- expect_cells: This number will be used as the argument for the --expect-cells parameter in cellranger count.
- chemistry: This chemistry will be used as the argument for the --chemistry parameter in cellranger count.
and a config:
nextflow config \
~/vib-singlecell-nf/vsn-pipelines \
-profile cellranger_count_metadata \
> nextflow.config
and a workflow run command:
nextflow run \
~/vib-singlecell-nf/vsn-pipelines \
-entry cellranger_count_metadata
The workflow will run Cell Ranger count on 2 samples, each using the 2 sequencing runs.
NOTES:
- If fastqs_dir_name does not exist, set it to none.
demuxlet/freemuxlet¶
Runs the demuxlet
or freemuxlet
workflows (dsc-pileup
[with prefiltering], then freemuxlet
or demuxlet
)
Input parameters are specified within the config file:
- params.sc.popscle.vcf: path to the VCF file for demultiplexing
- params.sc.popscle.freemuxlet.nSamples: number of clusters to extract (should match the number of samples pooled)
- params.sc.popscle.demuxlet.field: field in the VCF with genotype information
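A sketch of how these entries might look in the generated config (the path and values are placeholders; adjust them to your experiment):
params {
    sc {
        popscle {
            vcf = '/path/to/genotypes.vcf'   // VCF used for demultiplexing
            freemuxlet {
                nSamples = 2                 // number of pooled samples / clusters to extract
            }
            demuxlet {
                field = 'GT'                 // placeholder: VCF field with genotype information
            }
        }
    }
}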
Sample Aggregation Pipelines¶
Pipelines to aggregate multiple datasets together.
bbknn
¶
Runs the bbknn
workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, then the batch-effect correction steps: BBKNN, clustering, dimensionality reduction (UMAP only)).
The output is a loom file with the results embedded.
Source: https://github.com/Teichlab/bbknn/blob/master/examples/pancreas.ipynb
Output File | Description |
---|---|
out/data/*.BBKNN.h5ad | Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data. |
out/data/*.BBKNN.loom | SCope-ready loom file containing all results. |
bbknn_scenic
¶
Runs the bbknn
workflow above, then runs the scenic
workflow on the output, generating a comprehensive loom file with the combined results.
This could be very resource intensive, depending on the dataset.
Output File | Description |
---|---|
out/data/*.BBKNN.h5ad | Scanpy-ready h5ad file containing all results from a bbknn workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data. |
out/data/*.BBKNN_SCENIC.loom | SCope-ready loom file containing all results from a bbknn workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …). |
harmony
¶
Runs the harmony
workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (Harmony), clustering, dimensionality reduction (t-SNE and UMAP)).
The output is a loom file with the results embedded.
Output File | Description |
---|---|
out/data/*.HARMONY.h5ad | Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data. |
out/data/*.HARMONY.loom | SCope-ready loom file containing all results. |
harmony_scenic
¶
Runs the harmony
workflow above, then runs the scenic
workflow on the output, generating a comprehensive loom file with the combined results.
This could be very resource intensive, depending on the dataset.
Output File | Description |
---|---|
out/data/*.HARMONY.h5ad | Scanpy-ready h5ad file containing all results from a harmony workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data. |
out/data/*.HARMONY_SCENIC.loom | SCope-ready loom file containing all results from a harmony workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …). |
mnncorrect
¶
Runs the mnncorrect
workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (mnnCorrect), clustering, dimensionality reduction (t-SNE and UMAP)).
The output is a loom file with the results embedded.
Output File | Description |
---|---|
out/data/*.MNNCORRECT.h5ad | Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data. |
out/data/*.MNNCORRECT.loom | SCope-ready loom file containing all results. |
Utility Pipelines¶
Contrary to the aforementioned pipelines, these are not end-to-end. They are used to perform small incremental processing steps.
cell_annotate¶
Runs the cell_annotate workflow, which will perform a cell-based annotation of the data using a set of provided .tsv metadata files.
We show a use case below with 10x Genomics data, where it will annotate different samples using the obo method. For more information about this cell-based annotation feature please visit the Cell-based metadata annotation section.
First, generate the config:
nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile tenx,utils_cell_annotate,singularity \
   > nextflow.config
Make sure the following parts of the generated config are properly set:
[...]
data {
tenx {
cellranger_mex = '~/out/counts/*/outs/'
}
}
sc {
scanpy {
container = 'vibsinglecellnf/scanpy:0.5.2'
}
cell_annotate {
off = 'h5ad'
method = 'obo'
indexColumnName = 'BARCODE'
cellMetaDataFilePath = "~/out/data/*.best"
sampleSuffixWithExtension = '_demuxlet.best'
annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
}
[...]
}
[...]
Now we can run it with the following command:
nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
   -entry cell_annotate
cell_annotate_filter
¶
Runs the cell_annotate_filter workflow, which will perform a cell-based annotation of the data using a set of provided .tsv metadata files, followed by cell-based filtering.
We show a use case below with 10x Genomics data, where it will annotate different samples using the obo method. For more information about this cell-based annotation feature please visit the Cell-based metadata annotation section and the Cell-based metadata filtering section.
First, generate the config:
nextflow config \
~/vib-singlecell-nf/vsn-pipelines \
-profile tenx,utils_cell_annotate,utils_cell_filter,singularity \
> nextflow.config
Make sure the following parts of the generated config are properly set:
[...]
data {
tenx {
cellranger_mex = '~/out/counts/*/outs/'
}
}
sc {
scanpy {
container = 'vibsinglecellnf/scanpy:0.5.2'
}
cell_annotate {
off = 'h5ad'
method = 'obo'
indexColumnName = 'BARCODE'
cellMetaDataFilePath = "~/out/data/*.best"
sampleSuffixWithExtension = '_demuxlet.best'
annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
}
cell_filter {
off = 'h5ad'
method = 'internal'
filters = [
[
id:'NO_DOUBLETS',
sampleColumnName:'sample_id',
filterColumnName:'DROPLET.TYPE',
valuesToKeepFromFilterColumn: ['SNG']
]
]
}
[...]
}
[...]
Now we can run it with the following command:
nextflow -C nextflow.config \
run ~/vib-singlecell-nf/vsn-pipelines \
-entry cell_filter
sra
¶
Runs the sra workflow, which will download all (or a user-defined selection of) FASTQ files from a particular SRA project ID and format them with proper, human-friendly names.
First, generate the config:
nextflow config \
~/vib-singlecell-nf/vsn-pipelines \
-profile sra,singularity \
> nextflow.config
NOTE: If you’re a VSC user, you might want to add the vsc
profile.
Now we can run it with the following command:
nextflow -C nextflow.config \
run ~/vib-singlecell-nf/vsn-pipelines \
-entry sra
$ nextflow -C nextflow.config run ~/vib-singlecell-nf/vsn-pipelines -entry sra
N E X T F L O W ~ version 20.11.0-edge
Launching `~/vib-singlecell-nf/vsn-pipelines/main.nf` [cranky_kare] - revision: c5e34d476a
executor > local (1)
[c3/4bf7a2] process > sra:DOWNLOAD_FROM_SRA:SRA_TO_METADATA (1) [100%] 1 of 1 _
[- ] process > sra:DOWNLOAD_FROM_SRA:DOWNLOAD_FASTQS_FROM_SRA_ACC_ID -
[- ] process > sra:DOWNLOAD_FROM_SRA:NORMALIZE_SRA_FASTQS -
[SRR11442507, scATAC_Control_Superior_and_Middle_Temporal_Gyri_1]
[SRR11442506, scATAC_Control_Substantia_Nigra_2]
...
Advanced Features¶
Two-pass strategy¶
Typically, cell- and gene-level filtering is one of the first steps performed in the analysis pipelines.
This usually results in the pipeline being run in two passes.
In the first pass, the default filters are applied (which are probably not valid for new datasets), and a separate QC report is generated for each sample.
These QC reports can be inspected and the filters can be adjusted in the config file, either for all samples (by editing the params.sc.scanpy.filter settings directly), or for individual samples by using the strategy described in Multi-sample parameters.
Then, the second pass restarts the pipeline with the correct filtering parameters applied (use nextflow run ... -resume to skip already completed steps).
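In practice, the two passes could look like this (using the config file and entry point from the quick start above; adjust them to your own setup):
# First pass: run with the default filters and inspect the QC reports in out/notebooks/
nextflow -C single_sample.config run vib-singlecell-nf/vsn-pipelines -entry single_sample

# ... adjust the params.sc.scanpy.filter settings in single_sample.config ...

# Second pass: rerun with the adjusted filters, resuming already completed steps
nextflow -C single_sample.config run vib-singlecell-nf/vsn-pipelines -entry single_sample -resume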
Other notes¶
In order to run a specific pipeline (e.g. single_sample), the pipeline name must be specified as a profile when running nextflow config ... (so that the default parameters are included), and as the entry workflow when running the pipeline with nextflow run.
One exception to this is that the -entry pipeline can be one that is a subset of the one present in the config file.
For example, in a pipeline with a long-running step that occurs after filtering (e.g. single_sample_scenic), it can be useful to generate the full config file (nextflow config vib-singlecell-nf/vsn-pipelines -profile single_sample_scenic), then run a first pass for filtering using nextflow run vib-singlecell-nf/vsn-pipelines -entry single_sample, and a second pass using the full pipeline (-entry single_sample_scenic).
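Spelled out as commands, that sequence might look like this (the data and container profiles are illustrative; add the ones appropriate for your setup):
# Generate one config containing the full single_sample_scenic parameter set
nextflow config vib-singlecell-nf/vsn-pipelines \
    -profile tenx,singularity,single_sample_scenic > single_sample_scenic.config

# First pass: only the filtering/preprocessing subset of the pipeline
nextflow -C single_sample_scenic.config run vib-singlecell-nf/vsn-pipelines -entry single_sample

# Second pass: the full pipeline, resuming the already completed steps
nextflow -C single_sample_scenic.config run vib-singlecell-nf/vsn-pipelines -entry single_sample_scenic -resume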
Avoid re-running SCENIC and use pre-existing results¶
Often one would like to test different batch effect correction methods with SCENIC. Naively, one would run the following commands:
nextflow config ~/vibsinglecellnf -profile tenx,bbknn,dm6,scenic,scenic_use_cistarget_motifs,singularity > bbknn.config
nextflow -C bbknn.config run vib-singlecell-nf/vsn-pipelines -entry bbknn_scenic
and,
nextflow config ~/vibsinglecellnf -profile tenx,harmony,dm6,scenic,scenic_use_cistarget_motifs,singularity > harmony.config
nextflow -C harmony.config run vib-singlecell-nf/vsn-pipelines -entry harmony_scenic
The annoying bit here is that we run SCENIC twice. This is what we would like to avoid, since the SCENIC results will be the same. To avoid this, one can run the following command to generate the harmony config:
nextflow config ~/vibsinglecellnf -profile tenx,harmony,scenic_append_only,singularity > harmony.config
This will add a different scenic entry in the config:
params {
sc {
scenic {
container = 'vibsinglecellnf/scenic:0.9.19'
report_ipynb = '/src/scenic/bin/reports/scenic_report.ipynb'
existingScenicLoom = ''
sampleSuffixWithExtension = '' // Suffix after the sample name in the file path
scenicoutdir = "${params.global.outdir}/scenic/"
scenicScopeOutputLoom = 'SCENIC_SCope_output.loom'
}
}
}
Make sure that the following entries are correctly set before running the pipeline,
existingScenicLoom = ''
sampleSuffixWithExtension = '' // Suffix after the sample name in the file path
Finally run the pipeline,
nextflow -C harmony.config run vib-singlecell-nf/vsn-pipelines -entry harmony_scenic
Set the seed¶
Some steps in the pipelines are non-deterministic. In order to have reproducible results, a seed is set by default to:
workflow.manifest.version.replaceAll("\\.","").toInteger()
The seed is a number derived from the version of the pipeline used at the time of the analysis run.
To override the seed (an integer), edit the nextflow.config file with:
params {
global {
seed = [your-custom-seed]
}
}
Change log fold change (logFC) and false discovery rate (FDR) thresholds for the marker genes stored in the final SCope loom¶
By default, the logFC and FDR thresholds are set to 0 and 0.05 respectively.
If you want to change those thresholds applied on the markers genes, edit the nextflow.config
with the following entries,
params {
sc {
scope {
markers {
log_fc_threshold = 0.5
fdr_fc_threshold = 0.01
}
}
}
}
This filter will only be applied on the final loom file of the VSN-Pipelines. All the intermediate files prior to the loom file will still contain all the markers.
Automated selection of the optimal number of principal components¶
When generating the config using nextflow config
(see above), add the pcacv
profile.
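For example (the other profiles shown are illustrative):
nextflow config vib-singlecell-nf/vsn-pipelines \
    -profile tenx,singularity,single_sample,pcacv > single_sample.config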
Remarks:
- Make sure the nComps config parameter (under dim_reduction > pca) is not set.
- If nPcs is not set for the t-SNE or UMAP config entries, then all the PCs from the PCA will be used in the computation.
Currently, only the Scanpy related pipelines have this feature implemented.
Cell-based metadata annotation¶
There are 2 ways of using this feature: either when running an end-to-end pipeline (e.g.: single_sample, harmony, bbknn, …) or on its own as an independent workflow.
The profile utils_cell_annotate
should be added along with the other profiles when generating the main config using the nextflow config
command.
For more detailed information about those parameters, please check the cell_annotate parameter details section below.
Please check the cell_annotate workflow.
The utils_cell_annotate
profile is adding the following part to the config:
params {
sc {
cell_annotate {
off = 'h5ad'
method = ''
cellMetaDataFilePath = ''
sampleSuffixWithExtension = ''
indexColumnName = ''
sampleColumnName = ''
annotationColumnNames = ['']
}
}
}
Two methods (params.sc.cell_annotate.method) are available:
- aio
- obo
If you have a single file containing the metadata information of all your samples, use the aio method, otherwise use obo.
For both methods, here are the mandatory parameters to set:
- off should be set to h5ad.
- method: choose either obo or aio.
- annotationColumnNames is an array of column names from cellMetaDataFilePath containing different annotation metadata to add.
If aio is used, the following additional parameters are required:
- cellMetaDataFilePath is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
- indexColumnName is the column name from cellMetaDataFilePath containing the cell IDs information. This column may have unique values; if it does not, it is important that the combination of the values from indexColumnName and sampleColumnName is unique.
- sampleColumnName is the column name from cellMetaDataFilePath containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the Input Data Formats section.
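An illustrative aio configuration could look like this (the file path and column names are placeholders):
params {
    sc {
        cell_annotate {
            off = 'h5ad'
            method = 'aio'
            cellMetaDataFilePath = '/path/to/all_samples_metadata.tsv'  // single .tsv covering all samples
            indexColumnName = 'barcode'         // column containing the cell IDs
            sampleColumnName = 'sample_id'      // column containing the sample names
            annotationColumnNames = ['cell_type', 'batch']  // columns added as cell-based annotations
        }
    }
}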
If obo is used, the following parameters are required:
- cellMetaDataFilePath:
  - In multi-sample mode, this is a file path containing a glob pattern. The target file paths should each point to a .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
  - In single-sample mode, this is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
  - Note: the file name(s) of cellMetaDataFilePath is/are required to contain the sample ID(s).
- sampleSuffixWithExtension is the suffix used to extract the sample ID from the file name(s) of cellMetaDataFilePath. The suffix should be the part after the sample name in the file path.
- indexColumnName is the column name from cellMetaDataFilePath containing the cell IDs information. This column must have unique values.
Sample-based metadata annotation¶
The profile utils_sample_annotate
should be added when generating the main config using nextflow config
. This will add the following entry in the config:
params {
sc {
sample_annotate {
iff = '10x_cellranger_mex'
off = 'h5ad'
type = 'sample'
metadataFilePath = 'data/10x/1k_pbmc/metadata.tsv'
}
}
}
Then, the following parameters should be updated to use the module feature:
- metadataFilePath is a .tsv file (with header) with at least 2 columns, where the first column needs to match the sample IDs. Any other columns will be added as annotations in the final loom, i.e. all the cells related to their sample will get annotated with their given annotations.
id | chemistry | … |
---|---|---|
1k_pbmc_v2_chemistry | v2 | … |
1k_pbmc_v3_chemistry | v3 | … |
Annotating the samples using this system will allow any user to query all the annotations using the SCope portal. This is especially relevant when samples need to be compared across specific annotations (see the compare tab in SCope).
Cell-based metadata filtering¶
There are 2 ways of using this feature: either when running an end-to-end pipeline (e.g.: single_sample, harmony, bbknn, …) or on its own as an independent workflow.
The utils_cell_filter
profile is required when generating the config file. This profile will add the following part:
params {
sc {
cell_filter {
off = 'h5ad'
method = ''
filters = [
[
id: '',
sampleColumnName: '',
filterColumnName: '',
valuesToKeepFromFilterColumn: ['']
]
]
}
}
}
For more detailed information about the parameters to set in params.sc.cell_filter
, please check the cell_filter parameter details section below.
Please check the cell_filter workflow or cell_annotate_filter workflow to perform cell-based annotation and cell-based filtering sequentially.
Two methods (params.sc.cell_filter.method) are available:
- internal
- external
If you have a single file containing the metadata information of all your samples, use the external method, otherwise use internal.
For both methods, here are the mandatory parameters to set:
- off should be set to h5ad.
- method: choose either internal or external.
- filters is a list of maps where each map is required to have the following parameters:
  - id is a short identifier for the filter.
  - valuesToKeepFromFilterColumn is an array of values from the filterColumnName that should be kept (other values will be filtered out).
If internal is used, the following additional parameters are required:
- filters is a list of maps where each map is required to have the following parameters:
  - sampleColumnName is the column name containing the sample ID/name information. It should exist in the obs column attribute of the h5ad.
  - filterColumnName is the column name that will be used to filter out cells. It should exist in the obs column attribute of the h5ad.
If external is used, the following additional parameters are required:
- filters is a list of maps where each map is required to have the following parameters:
  - cellMetaDataFilePath is a file path pointing to a single .tsv file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering.
  - indexColumnName is the column name from cellMetaDataFilePath containing the cell IDs information. This column must have unique values.
  - [optional] sampleColumnName is the column name from cellMetaDataFilePath containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the Input Data Formats section.
  - [optional] filterColumnName is the column name from cellMetaDataFilePath which will be used to filter out cells.
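An illustrative external filter configuration could look like this (the file path, column names and filter id are placeholders):
params {
    sc {
        cell_filter {
            off = 'h5ad'
            method = 'external'
            filters = [
                [
                    id: 'KEEP_SINGLETS',
                    cellMetaDataFilePath: '/path/to/cell_metadata.tsv',  // .tsv with cell IDs, sample names and the filter column
                    indexColumnName: 'barcode',
                    sampleColumnName: 'sample_id',
                    filterColumnName: 'droplet_type',
                    valuesToKeepFromFilterColumn: ['SNG']
                ]
            ]
        }
    }
}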
Multi-sample parameters¶
It's possible to define custom parameters for the different samples. It's as easy as defining a hashmap in Groovy or a dictionary-like structure in Python. You'll just have to repeat the following structure for the parameters for which you want to enable the multi-sample feature:
params {
sc {
scanpy {
container = 'vibsinglecellnf/scanpy:0.5.2'
filter {
report_ipynb = '/src/scanpy/bin/reports/sc_filter_qc_report.ipynb'
// Here we enable the multi-sample feature for the cellFilterMinNgenes parameter
cellFilterMinNGenes = [
'1k_pbmc_v2_chemistry': 600,
'1k_pbmc_v3_chemistry': 800
]
// cellFilterMaxNGenes will be set to 4000 for all the samples
cellFilterMaxNGenes = 4000
// Here we again enable the multi-sample feature for the cellFilterMaxPercentMito parameter
cellFilterMaxPercentMito = [
'1k_pbmc_v2_chemistry': 0.15,
'1k_pbmc_v3_chemistry': 0.05
]
// geneFilterMinNCells will be set to 3 for all the samples
geneFilterMinNCells = 3
iff = '10x_mtx'
off = 'h5ad'
outdir = 'out'
}
}
}
}
If you want to apply custom parameters for some specific samples and have a “general” parameter for the rest of the samples, you should use the ‘default’ key as follows:
params {
sc {
scanpy {
container = 'vibsinglecellnf/scanpy:0.5.2'
filter {
report_ipynb = '/src/scanpy/bin/reports/sc_filter_qc_report.ipynb'
// Here we enable the multi-sample feature for the cellFilterMinNgenes parameter
cellFilterMinNGenes = [
'1k_pbmc_v2_chemistry': 600,
'default': 800
]
[...]
}
}
}
}
Using this config, the parameter params.sc.scanpy.cellFilterMinNGenes
will be applied with a threshold value of 600
to 1k_pbmc_v2_chemistry
. The rest of the samples will use the value 800
to filter the cells having less than that number of genes.
This strategy can be applied to any other parameter of the config.
Parameter exploration¶
Since v0.9.0, it is possible to explore several combinations of parameters. The latest version of the VSN-Pipelines allows exploring the following parameters:
- params.sc.scanpy.clustering
  - method
    methods = ['louvain','leiden']
  - resolution
    resolutions = [0.4, 0.8]
In case the parameter exploration mode is used within the params.sc.scanpy.clustering parameter, it will generate a range of different clusterings.
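For instance, enabling the exploration of both parameters could look like this in the config (values taken from the list above):
params {
    sc {
        scanpy {
            clustering {
                methods = ['louvain', 'leiden']   // explore both clustering methods
                resolutions = [0.4, 0.8]          // and both resolutions (4 combinations in total)
            }
        }
    }
}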
For non-experts, it's often difficult to know which clustering to pick. It is, however, possible to use the DIRECTS module in order to select a default clustering. In order to use this automated clustering selection method, add the directs profile when generating the main config using nextflow config. The config will get populated with:
directs {
container = 'vibsinglecellnf/directs:0.1.0'
labels {
processExecutor = 'local'
}
select_default_clustering {
fromMinClusterSize = 5
toMinClusterSize = 100
byMinClusterSize = 5
}
}
Currently, only the Scanpy related pipelines have this feature implemented.
Regress out variables¶
By default, no variables are regressed out. To enable this feature, the scanpy_regress_out profile should be added when generating the main config using nextflow config. This will add the following entry in the config:
params {
sc {
scanpy {
regress_out {
variablesToRegressOut = []
off = 'h5ad'
}
}
}
}
Add any variables to regress out to variablesToRegressOut, e.g. 'n_counts' or 'percent_mito'.
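For example, to regress out the two variables mentioned above, the regress_out block shown earlier could be set as:
regress_out {
    variablesToRegressOut = ['n_counts', 'percent_mito']
    off = 'h5ad'
}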
Skip steps¶
By default, the pipelines are run from raw data (unfiltered data, not normalized).
If you have already performed some steps independently with another tool, it's possible to skip them in the pipelines. Currently, the following steps can be skipped:
- Scanpy
filtering
- Scanpy
normalization
In order to skip the Scanpy filtering step, we need to add 3 new profiles when generating the config:
min
scanpy_data_transformation
scanpy_normalization
The following command will create a Nextflow config which the pipeline will understand, and the Scanpy filtering step will not be run:
nextflow config \
~/vib-singlecell-nf/vsn-pipelines \
-profile min,[data-profile],scanpy_data_transformation,scanpy_normalization,[...],singularity > nextflow.config
- [data-profile]: can be one of the different possible data profiles, e.g. h5ad.
- [...]: can be other profiles like bbknn, harmony, pcacv, ...
Quiet mode¶
By default, VSN will output some additional messages to the terminal, such as the global seed, and the names and paths of the samples detected by the input channel.
These messages can be suppressed by using the --quiet
flag when starting the nextflow process:
nextflow -C example.config run vib-singlecell-nf/vsn-pipelines -entry single_sample --quiet
Case Studies¶
See the full list of case studies and examples at VSN-Pipelines-examples.
scATAC-seq Preprocessing¶
This pipeline takes fastq files from paired end single cell ATAC-seq, and applies preprocessing steps to align the reads to a reference genome, and produce a bam file and scATAC-seq fragments file.
This workflow is currently available in the develop_atac
branch (use nextflow pull vib-singlecell-nf/vsn-pipelines -r develop_atac
to sync this branch).
Pipeline Steps¶
The full steps are:
- Barcode correction:
- For ‘standard’ and ‘multiome’ samples (e.g. 10x Genomics or similar) correction is performed against a whitelist by this method from aertslab/single_cell_toolkit.
- For ‘biorad’ samples, barcode correction is performed by this script in our aertslab/single_cell_toolkit (previously, this was done with BAP).
- Fastq barcoding: Add the barcode sequence to the comment field of the fastq sequence identifier. Uses methods from aertslab/single_cell_toolkit.
- Read/adapter trimming (Trim_Galore or fastp).
- Mapping to a reference genome:
- Mark PCR and optical duplicates (MarkDuplicates (Picard) or MarkDuplicatesSpark (GATK)).
- Estimate library complexity with EstimateLibraryComplexity (Picard).
- A fragments file is created using Sinto.
Pipeline Details¶
Input Metadata¶
The input to this pipeline is a (tab-delimited) metadata table with the sample ID, sequencing technology, and locations of the fastq files. Note that the fastq file fields must be full paths; this is not shown here for clarity:
sample_name | technology | fastq_PE1_path | fastq_barcode_path | fastq_PE2_path |
---|---|---|---|---|
sample_1 | standard | sample_1_R1.fastq.gz | sample_1_R2.fastq.gz | sample_1_R3.fastq.gz |
sample_2 | multiome | sample_2_R1.fastq.gz | sample_2_R2.fastq.gz | sample_2_R3.fastq.gz |
sample_3 | biorad | sample_3_R1.fastq.gz | sample_3_R3.fastq.gz | |
sample_4 | revcomp_wl | sample_4_R1.fastq.gz | sample_2_R2.fastq.gz | sample_4_R3.fastq.gz |
sample_5 | hydrop_3x96 | sample_5_R1.fastq.gz | sample_5_R2.fastq.gz | sample_5_R3.fastq.gz |
sample_6 | hydrop_2x384 | sample_5_R1.fastq.gz | sample_5_R2.fastq.gz | sample_5_R3.fastq.gz |
The columns represent:
- sample_name: sample name for labeling the sample in the pipeline and output files. This can be any arbitrary string.
- technology: this controls the barcode correction and processing methods to use for the fastq files. Currently only the biorad option involves different processing steps. Otherwise, the value in this field (e.g. standard, multiome) controls which barcode whitelist is used for correction. See below for additional details.
- fastq_PE1_path: the full path to the fastq file for the first read in a pair.
- fastq_barcode_path: the full path to the fastq file containing the barcodes. This column can be blank/empty depending on the technology setting (e.g. biorad).
- fastq_PE2_path: the full path to the fastq file for the second read in a pair.
Fastq input¶
Fastq input for each sample can be given as a single set of files (R1, R2, R3), or it can be multiple files, in the case of samples which have been split across multiple sequencing lanes. A combination of these cases can be processed together in one metadata file.
Within the pipeline, all of the reads from each set of files will be considered and labeled as one read group. Read group names are taken from the first read in the fastq:
@A01044:19:HLYKFDRXX:1:2101:4291:1000 1:N:0:ACTCAGAC
will produce a RG in the bam:
@RG ID:A01044:19:HLYKFDRXX:1 SM:sample_1 LB:A01044:19:HLYKFDRXX:1__sample_1 PL:ILLUMINA
Single fastq input¶
In this situation, there is only one set of reads per sample, the metadata file will look very similar to one of the rows from above. There will be one read group in the final bam file.
Split fastq input¶
In this case, multiple fastq files (in rows) for each sample can be given.
In this example, there are two sets of fastqs for sample_1
that were run on two separate lanes.
Note that the sample ID is the same for both rows:
sample_name | technology | fastq_PE1_path | fastq_barcode_path | fastq_PE2_path |
---|---|---|---|---|
sample_1 | standard | sample_1_S1_L001_R1_001.fastq.gz | sample_1_S1_L001_R2_001.fastq.gz | sample_1_S1_L001_R3_001.fastq.gz |
sample_1 | standard | sample_1_S1_L002_R1_001.fastq.gz | sample_1_S1_L002_R2_001.fastq.gz | sample_1_S1_L002_R3_001.fastq.gz |
In this situation, each set of fastqs will be processed separately for the barcode correction, barcode addition to the fastq comment field, adaptor trimming, and mapping steps. Following mapping, each mapped bam is merged and duplicates are marked using the full data. Downstream steps are done with the merged data.
Generating the metadata file¶
Note that there is an easy way to create the metadata from the file paths for each sample by using the following bash command (expand to view). Special thanks here to Gert Hulselmans for expanding the capabilities of this function.
metadata generator
create_atac_metadata () {
local sample="${1}";
local technology="${2}";
local fastq_prefix="${3}";
local read_labels="${4}";
if [ "${sample}" == "header" ]; then
printf 'sample_name\ttechnology\tfastq_PE1_path\tfastq_barcode_path\tfastq_PE2_path\n';
return 0;
fi
if [ ${#@} -ne 4 ] ; then
printf 'Usage: create_atac_metadata sample technology fastq_prefix read_labels\n\n';
printf 'Arguments:\n';
printf ' sample: sample name\n';
printf ' technology: "standard", "hydrop_3x96", "hydrop_2x384", or "biorad"\n';
printf ' fastq_prefix: path prefix to FASTQ files.\n';
printf ' read_labels: comma separated read labels for R1, R2 and R3 that select: R1,R2,R3.\n';
return 1;
fi
read_labels_arr=(${read_labels//,/ });
    # Get R1, R2 and R3 FASTQ filenames for the given FASTQ prefix.
R1=(${fastq_prefix}*${read_labels_arr[0]}*.{fastq,fq,fastq.gz,fq.gz})
R2=(${fastq_prefix}*${read_labels_arr[1]}*.{fastq,fq,fastq.gz,fq.gz})
R3=(${fastq_prefix}*${read_labels_arr[2]}*.{fastq,fq,fastq.gz,fq.gz})
for i in "${!R1[@]}" ; do
# Check if R1 FASTQ file exist (and is not just a glob like "${sample}*R1*.fq").
if [ -e "${R1[i]}" ] ; then
printf "${sample}\t${technology}\t${R1[i]}\t${R2[i]}\t${R3[i]}\n";
fi
done
}
To run, use the options:
- Sample ID (if this parameter is "header", it will print the metadata header and stop)
- Technology (e.g. "standard")
- The "file prefix" full path to your fastq files, matching the common portions of the file names (without any glob * expansions)
- The "read labels" to indicate how the files are named and match the remainder of the file names (e.g. "R1,R2,R3", "R1,UMI,R2", etc.)
create_atac_metadata header > auto_metadata.tsv
create_atac_metadata sample_1 standard /path/to/sample_1_subset_S R1,R2,R3 >> auto_metadata.tsv
create_atac_metadata sample_2 standard /path/to/sample_2_subset_S R1,R2,R3 >> auto_metadata.tsv
create_atac_metadata sample_5 hydrop_3x96 /path/to/sample_5_ R1,R2,R3 >> auto_metadata.tsv
create_atac_metadata sample_6 hydrop_2x384 /path/to/sample_6_ R1,R2,R3 >> auto_metadata.tsv
Technology types¶
The “technology” field in the metadata table controls two things:
- How technology-specific pipeline steps are applied.
Currently there are three specific settings (
biorad
,hydrop_3x96
, andhydrop_2x384
) that use alternate pipelines processes (to extract and correct the barcode sequence from the input fastqs). Using any other keyword is allowed, and samples will be run with the standard pipeline steps (barcode correction against a whitelist). - Which whitelist is used for barcode correction.
The “technology” field must match a key in the
params.tools.singlecelltoolkit.barcode_correction.whitelist
parameter list in the config file for that sample to be associated with a particular barcode whitelist. The “technology” field and whitelist key name can be set to any arbitrary string (e.g.standard
), with the exception of the technology-specific keywords above.
The main modes are:
standard¶
The standard setting is the main pipeline mode. It assumes a typical 10x Genomics style format with two read pair fastqs and a barcode fastq (note that in the example here, the barcode correction has already been performed, writing the CB tag into the comment of the barcode fastq):
$ zcat sample_1_R1.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 1:N:0:ACTCAGAC
NTTGTCTCAGCACCCCCCGACATGGATTCAGGCTGTCTCTTATACACATC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
$ zcat sample_1_R2.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 2:N:0:ACTCAGAC CB:Z:CTGTTCGCAAAGCATA
CTGTTCGCAAAGCATA
+
F:FFFFFFFFFFFFFF
$ zcat sample_1_R3.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 3:N:0:ACTCAGAC
CCTGAATCCATGTCGGGGGGTGCTGAGACAAGCTGTCTCTTATACACAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The barcoding step here uses a helper script from aertslab/single_cell_toolkit which transforms this input into two paired fastq files with the barcode information embedded in the fastq comments field:
$ zcat sample_1_dex_R1.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 CR:Z:CTGTTCGCAAAGCATA CY:Z:F:FFFFFFFFFFFFFF CB:Z:CTGTTCGCAAAGCATA
NTTGTCTCAGCACCCCCCGACATGGATTCAGGCTGTCTCTTATACACATC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
$ zcat sample_1_dex_R2.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 CR:Z:CTGTTCGCAAAGCATA CY:Z:F:FFFFFFFFFFFFFF CB:Z:CTGTTCGCAAAGCATA
CCTGAATCCATGTCGGGGGGTGCTGAGACAAGCTGTCTCTTATACACAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
multiome/alternate¶
The multiome or alternately-named settings work with the same pipeline steps as standard, with the exception of the whitelist used for barcode correction. The whitelists are supplied in the params file (params.tools.singlecelltoolkit.barcode_correction.whitelist). This can be used to supply alternate whitelists for certain samples, for example if you need to supply a reverse-complemented whitelist for samples run on certain sequencing machines.
hydrop_3x96/hydrop_2x384¶
The HyDrop settings (either hydrop_3x96 or hydrop_2x384, depending on the library preparation used) process data generated by the HyDrop ATAC protocol (see hydrop.aertslab.org and the associated preprint). This approach differs from the standard pipeline only in the initial step, which extracts and processes the HyDrop barcodes from the sequencing output.
Here, this script is used to take the R2 read from the sequencer:
$ zcat sample_5_R2.fastq.gz | head -n 4
@VH00445:5:AAAL5KYM5:1:1101:63923:1019 2:N:0:ACACGTGGAC
CACTGGTGGTAGGGTACTCGGACAAGTGGAGCAGTAGCTGAAGTGTAGAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
and transform it into:
$ zcat sample_5_hydrop_barcode_R2.fastq.gz
@VH00445:5:AAAL5KYM5:1:1101:63923:1019 2:N:0:ACACGTGGAC
CACTGGTGGTGACAAGTGGAAAGTGTAGAA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
The two HyDrop modes (hydrop_3x96, hydrop_2x384) differ only in the way the initial barcode extraction script works. Following this, they are processed in the same way as the standard pipeline, including whitelist-based barcode correction (note that the two HyDrop modes require different barcode whitelists to be used here).
biorad¶
The biorad setting processes BioRad data using this script in our aertslab/single_cell_toolkit (previously, this was done with BAP).
This takes input data:
$ zcat sample_2_R1.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 1:N:0:TAAGGCGA
GATCACCATATGCATGACATTCACGAGTCACTGAGTAACGCCTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTGCAATGGCTGGAGCACACCCCATACTCATTCTGGTCTCCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFF,FF,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFF,FFFFFF
$ zcat sample_2_R2.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 2:N:0:TAAGGCGA
GTGTTTGGCTGAGGAAAGTGTGTGAAGCCCCGATATGTGA
+
FFF,FFF:FFF:FF,FFFFF:F:FFFFFFFFFFF,,F:FF
and directly produces paired fastq files with the barcode added in the fastq comments field:
$ zcat sample_2_dex_R1.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 CR:Z:GATCACCATTCACGTAACGCC CY:Z:FFFFFFFFFFFFFF:FFFFFF CB:Z:GATCACCATTCACGTAACGCC br:Z:0,0,0_0,0,0,1
CTGCAATGGCTGGAGCACACCCCATACTCATTCTGGTCTCCTT
+
F:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFF,FFFFFF
$ zcat sample_2_dex_R2.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 CR:Z:GATCACCATTCACGTAACGCC CY:Z:FFFFFFFFFFFFFF:FFFFFF CB:Z:GATCACCATTCACGTAACGCC br:Z:0,0,0_0,0,0,1
GTGTTTGGCTGAGGAAAGTGTGTGAAGCCCCGATATGTGA
+
FFF,FFF:FFF:FF,FFFFF:F:FFFFFFFFFFF,,F:FF
Running the workflow¶
Technical considerations¶
Direct the Nextflow work directory to an alternate path (e.g. a scratch drive) using the NXF_WORK environment variable:

nwork=/path/to/scratch/example_project
mkdir $nwork
export NXF_WORK=$nwork
Note that if you start a new shell, NXF_WORK must be set again, or the pipeline will not resume properly (see the config sketch after this list for an alternative).
- Temporary directory mapping. For large BAM files, the system default temp location may become full. A workaround is to include a volume mapping to an alternate /tmp (-B /alternate/path/to/tmp:/tmp) using the volume mount options in Docker or Singularity. For example, in the container engine options:
  - Singularity run options:
    runOptions = '--cleanenv -H $PWD -B /data,/alternate/path/to/tmp:/tmp'
  - Docker run options:
    runOptions = '-i -v /data:/data -v /alternate/path/to/tmp:/tmp'
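As an alternative to exporting NXF_WORK in every new shell, the work directory can also be pinned in the config file itself. workDir is a standard Nextflow setting rather than anything VSN-specific; the path below is hypothetical.

// Equivalent to exporting NXF_WORK before each run (hypothetical scratch path)
workDir = '/path/to/scratch/example_project/work'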
Configuration¶
To generate a config file, use the atac_preprocess profile along with docker or singularity. Note that the full path to vib-singlecell-nf/vsn-pipelines/main_atac.nf must be used:
nextflow config \
vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-profile atac_preprocess,singularity \
> atac_preprocess.config
Note
It is also possible to run the pycisTopic QC steps directly after this atac_preprocess pipeline, with a single command. Please see here for details on how to run with this configuration.
Parameters¶
The ATAC-specific parameters are described here. The important parameters to verify are listed below; a combined config sketch follows the list.
- params.data.atac_preprocess.metadata: the path to the metadata file.
- params.tools.bwamaptools.bwa_fasta: the path to the bwa reference fasta file. This should already be indexed with bwa index, and the index files must be located in the same directory as the fasta file. Note that bwa and bwa-mem2 use different indexes that are not interchangeable.
- params.tools.singlecelltoolkit.barcode_correction.whitelist: whitelists for barcode correction are supplied here. The whitelists are matched to samples based on the parameter key here (‘standard’, ‘multiome’, ‘hydrop_3x96’, ‘hydrop_2x384’, etc.) and the technology field listed for each sample in the metadata file. Barcode whitelists can (optionally) be gzipped. There are currently no checks performed to ensure that the sample barcodes have any overlap with the whitelist (the barcode correction reports should be checked for this).
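Put together, the relevant entries in the generated config might look roughly like this. It is a sketch only: the paths are hypothetical, and the whitelist block is abbreviated to a single key (see the fuller whitelist sketch in the Technology types section above).

params {
    data {
        atac_preprocess {
            metadata = 'metadata.tsv'                     // path to the metadata file described above
        }
    }
    tools {
        bwamaptools {
            bwa_fasta = '/path/to/reference/genome.fa'    // bwa index files must sit next to this fasta
        }
        singlecelltoolkit {
            barcode_correction {
                whitelist {
                    standard = '/path/to/barcode_whitelist.txt.gz'
                }
            }
        }
    }
}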
Choice of tools¶
Several steps have options for the choice of method to use.
These options are controlled within params.atac_preprocess_tools (see the sketch below).
- Adapter trimming (adapter_trimming_method): either Trim_Galore (default) or fastp.
- Duplicate marking (mark_duplicates_method): either MarkDuplicates (Picard tools, default) or MarkDuplicatesSpark (GATK). We currently recommend Picard MarkDuplicates because it can perform barcode-aware marking of PCR duplicates. MarkDuplicatesSpark has the advantage of parallelization; however, it requires a large SSD for temporary files.
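For example, selecting the defaults explicitly would look something like the following sketch; only the two option names and values listed above are assumed.

params {
    atac_preprocess_tools {
        adapter_trimming_method = 'Trim_Galore'      // or 'fastp'
        mark_duplicates_method  = 'MarkDuplicates'   // or 'MarkDuplicatesSpark'
    }
}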
Additionally:
- Mapping: use the parameter params.tools.bwamaptools.bwa_version to select either bwa or bwa-mem2 (see the sketch below). These should give virtually identical results; however, bwa-mem2, while faster, has used more memory in our tests. Note that the index (bwa_index) is not interchangeable between the two versions.
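A corresponding sketch for the mapper selection (the chosen value is illustrative; only bwa_version is taken from the text above):

params {
    tools {
        bwamaptools {
            bwa_version = 'bwa-mem2'   // or 'bwa'; the two use different, non-interchangeable indexes
        }
    }
}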
Optional parameters¶
- Within params.tools.sinto.fragments (see the sketch below):
  - One of (but not both) barcodetag or barcode_regex needs to be set to tell Sinto where to find the barcodes in the bam file. The default is to use barcodetag with a value of CB.
  - mapq: controls quality filtering settings for generating the fragments file. Reads with a mapping quality score lower than this value are discarded (default 30).
Execution¶
After configuring, the workflow can be run with:
nextflow -C atac_preprocess.config run \
vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-entry atac_preprocess -resume
Output¶
An example output tree is shown here.
out/
├── data
│ ├── bam
│ │ ├── sample_1.bwa.out.possorted.bam
│ │ ├── sample_1.bwa.out.possorted.bam.bai
│ │ ├── sample_2.bwa.out.possorted.bam
│ │ └── sample_2.bwa.out.possorted.bam.bai
│ ├── fragments
│ │ ├── sample_1.sinto.fragments.tsv.gz
│ │ ├── sample_1.sinto.fragments.tsv.gz.tbi
│ │ ├── sample_2.sinto.fragments.tsv.gz
│ │ └── sample_2.sinto.fragments.tsv.gz.tbi
│ └── reports
│ ├── barcode
│ │ ├── sample_1____S7_R1_001.corrected.bc_stats.log
│ │ └── sample_2____S8_R1_001.corrected.bc_stats.log
│ ├── mapping_stats
│ │ ├── sample_1.mapping_stats.tsv
│ │ └── sample_2.mapping_stats.tsv
│ ├── mark_duplicates
│ │ ├── sample_1.library_complexity_metrics.txt
│ │ ├── sample_1.mark_duplicates_metrics.txt
│ │ ├── sample_2.library_complexity_metrics.txt
│ │ └── sample_2.mark_duplicates_metrics.txt
│ └── trim
│ ├── sample_1____S7_R1_001.fastp.trimming_report.html
│ └── sample_2____S8_R1_001.fastp.trimming_report.html
└── nextflow_reports
├── execution_report.html
├── execution_timeline.html
├── execution_trace.txt
└── pipeline_dag.dot
scATAC-seq QC and Cell Calling¶
This workflow uses the Python implementation of cisTopic (pycisTopic) to perform quality control and cell calling. The inputs here are a fragments file and a bam file for each sample.
This workflow is currently available in the develop_atac branch (use nextflow pull vib-singlecell-nf/vsn-pipelines -r develop_atac to sync this branch).
Running the workflow¶
Technical considerations¶
Direct the Nextflow work directory to an alternate path (e.g. a scratch drive) using the NXF_WORK environment variable:

nwork=/path/to/scratch/example_project
mkdir $nwork
export NXF_WORK=$nwork

Note that if you start a new shell, NXF_WORK must be set again, or the pipeline will not resume properly.
- Important for pycisTopic Ray issues: the system default temp location may become full. A workaround is to include a volume mapping to an alternate /tmp (-B /alternate/path/to/tmp:/tmp) using the volume mount options in Docker or Singularity. For example, in the container engine options:
  - Singularity run options:
    runOptions = '--cleanenv -H $PWD -B /data,/alternate/path/to/tmp:/tmp'
  - Docker run options:
    runOptions = '-i -v /data:/data -v /alternate/path/to/tmp:/tmp'
- Use the --quiet flag with nextflow run to suppress the printing of each file that is detected by the pipeline.
Configuration¶
For each sample, this pipeline takes a bam file and a fragments file. These can be specified separately, or located from a Cell Ranger ATAC/ARC outs/ path.
Input with independent bam and fragments files¶
Use the profiles bam and fragments:
nextflow config vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-profile atac_qc_filtering,bam,fragments,vsc,pycistopic_hg38 \
> atac_qc.config
Preset profiles are available for human (pycistopic_hg38), mouse (pycistopic_mm10), and fly (pycistopic_dmel).
Alternatively, these profiles can be omitted and the relevant settings (biomart, macs2) configured manually in the config.
Input data (bam and fragments files) are specified in the data section:
data {
fragments {
file_paths = '/staging/leuven/stg_00002/lcb/cflerin/analysis/asap/20210527_hydrop-atac_asabr/atac_preprocess/out_run1/data/fragments/ASA__*tsv.gz'
suffix = '.sinto.fragments.tsv.gz'
index_extension = '.tbi'
}
bam {
file_paths = '/staging/leuven/stg_00002/lcb/cflerin/analysis/asap/20210527_hydrop-atac_asabr/atac_preprocess/out_run1/data/bam/ASA*bam'
suffix = '.bwa.out.possorted.bam'
index_extension = '.bai'
}
}
Multiple files can be specified with * in file_paths or by separating the paths with a comma.
Warning
The suffix for both bam and fragments will be removed from the filename to get sample IDs. The sample names obtained must match between bam and fragments for the files to be paired properly in the workflow. For example, sample_1.bwa.out.possorted.bam and sample_1.sinto.fragments.tsv.gz both yield the sample ID sample_1.
Input with Cell Ranger ATAC data¶
Use the tenx_atac profile:
nextflow config vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-profile atac_qc_filtering,tenx_atac,vsc,pycistopic_hg38 \
> atac_qc.config
Input data (the Cell Ranger outs/ path) are specified in the data section:
data {
tenx_atac {
cellranger_mex = '/data/cellranger_atac_2.0/*/outs,/data/processed/cellranger_arc_2.0.0/*/outs'
}
}
Multiple files can be specified with * in tenx_atac or by separating the paths with a comma.
Input directly from the preprocessing pipeline¶
It is also possible to run these QC steps directly after the atac_preprocess pipeline, with a single command. In this case, all the appropriate configuration profiles must be included when generating the config:
nextflow config vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-profile atac_preprocess,atac_qc_filtering,pycistopic_hg38,vsc \
> atac_preprocess_and_qc.config
Note that here we do not include the bam and fragments profiles that specify the input data locations to the QC steps, since these are piped directly from the preprocessing pipeline. One caveat is that this could make it harder to run the QC pipeline with -resume later on, especially if the Nextflow work/ directory is not saved due to disk space concerns.
To execute the preprocessing and QC pipelines in one step, use the atac_preprocess_with_qc entry point:
nextflow -C atac_preprocess_and_qc.config run \
vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-entry atac_preprocess_with_qc -resume --quiet
Execution¶
After configuring, the workflow can be run with:
nextflow -C atac_qc.config run \
vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-entry atac_qc_filtering --quiet -resume
After completing, view the report in out/notebooks/<project_name>__pycisTopic_QC_report.html. To change the filtering settings, use the params.tools.pycistopic.call_cells section.
Adjusting the filter settings¶
In the pycisTopic parameters, filter settings can be applied in this section:
pycistopic {
call_cells {
report_ipynb = '/src/pycistopic/bin/pycisTopic_qc_report_template.ipynb'
use_density_coloring_on_scatterplot = true
use_detailed_title_on_scatterplot = true
filter_frags_lower = '1000'
filter_frags_upper = ''
filter_tss_lower = '8'
filter_tss_upper = ''
filter_frip_lower = ''
filter_frip_upper = ''
filter_dup_rate_lower = ''
filter_dup_rate_upper = ''
}
}
If a setting is empty (''), the filter will not be applied. If a setting is a single value (e.g. filter_frags_lower = 1000), that filter value will be applied to all samples.
To use sample-specific filters, this can be written as:
filter_frags_lower = [
'default': 1000,
'Sample_1': 1500,
'Sample_2': 2000,
]
The default setting (optional) is applied to all samples not listed in the array. If the default setting is missing, no filter will be applied to samples not listed in the array (all barcodes are kept).
After setting the filters, the pipeline can be re-run to apply the new filters (use -resume).
The additional settings control the output of the scatter plots in the report:
* use_density_coloring_on_scatterplot: slower when turned on; it can be helpful to set this to false until the proper thresholds are determined.
* use_detailed_title_on_scatterplot: adds the cell count and median values after filtering to the title of each plot.
Output¶
An example output tree is shown here.
out/
├── data
│ ├── macs2
│ │ ├── sample_1.peaks.narrowPeak
│ │ ├── sample_1.summits.bed
│ │ ├── sample_2.peaks.narrowPeak
│ │ └── sample_2.summits.bed
│ └── pycistopic
│ └── qc
│ ├── benchmark_library_downsampled__metadata.pickle
│ ├── benchmark_library_downsampled__profile_data.pickle
│ ├── selected_barcodes
│ │ ├── sample_1.cell_barcodes.txt
│ │ └── sample_2.cell_barcodes.txt
│ └── selected_barcodes_nFrag
│ ├── sample_1.barcodes_nFrag_thr.txt
│ └── sample_2.barcodes_nFrag_thr.txt
└── notebooks
├── example_project__pycisTopic_QC_report.html
└── example_project__pycisTopic_QC_report.ipynb
- macs2: contains the narrowPeak and bed file for each sample.
- pycistopic:
  - qc: contains Python objects (in pickle format) for the metadata and profile data computed by pycisTopic.
    - selected_barcodes: contains a text file with selected cell barcodes (one per line) based on the thresholds set in the config file.
    - selected_barcodes_nFrag: contains a text file with barcodes (one per line) that have unique fragment counts greater than the params.tools.pycistopic.compute_qc_stats.n_frag setting in the pycisTopic parameters (see the sketch below).
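The n_frag threshold referenced above sits in the pycisTopic parameters; a minimal sketch (the value shown is hypothetical):

params {
    tools {
        pycistopic {
            compute_qc_stats {
                n_frag = 100   // barcodes with more unique fragments than this are written to selected_barcodes_nFrag
            }
        }
    }
}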
Development Guide¶
Create module¶
Tool-based modules are located in src/<tool-name>, and each module has a specific structure for scripts and Nextflow processes (see Repository structure below).
Case study: Add Harmony¶
Harmony is a method published in Nature Methods that performs integration of single-cell data.
Links:
- GitHub: https://github.com/immunogenomics/harmony
- Tutorial: https://github.com/immunogenomics/harmony/blob/master/vignettes/quickstart.Rmd
Steps:
Create a new issue on the vsn-pipelines GitHub repository explaining which module you are going to add (e.g.: Add Harmony batch correction method).
Fork the vsn-pipelines repository to your own GitHub account (if you are an external collaborator).
From your local copy of the vsn-pipelines GitHub repository, create a new branch called feature/[github-issue-id]-[description]. In this case:
[github-issue-id] = 115
[description] = add_harmony_batch_correction_method
It is highly recommended to start from the develop branch:

git checkout develop
git fetch
git pull
git checkout -b feature/115-add_harmony_batch_correction_method
Use the template repository in the vib-singlecell-nf organisation to create the framework for the new module in src/<tool-name>:

git clone --depth=1 https://github.com/vib-singlecell-nf/template.git src/harmony
Now you can start to edit files in the tool module, which is now located in src/<tool-name>. Optionally, you can delete the .git directory in the new module to avoid confusion in future local development:

rm -rf src/harmony/.git
Create the Dockerfile recipe
FROM dweemx/sctx-seurat:3.1.2

RUN apt-get -y update && \
    apt-get install -y libcurl4-openssl-dev libxml2-dev zlib1g-dev libhdf5-dev && \
    apt-get install -y libssl-dev && \
    # png.h: No such file or directory
    apt-get install -y libpng-dev && \
    R -e "install.packages('optparse')" && \
    R -e "devtools::install_github(repo = 'aertslab/SCopeLoomR')" && \
    R -e "devtools::install_github('immunogenomics/harmony')" && \
    # Need to run ps
    apt-get -y install procps && \
    apt-get -y install libxml2 && \
    # Clean
    rm -rf /tmp/* && \
    apt-get autoremove -y && \
    apt-get autoclean -y && \
    rm -rf /var/cache/apt/* && \
    rm -rf /var/lib/apt/lists/* && \
    apt-get clean
Rename the nextflow.config file to create the harmony.config configuration file.
- Each process's options should be at their own level. With a single process, you do not need one extra level (a contrasting multi-process sketch follows the config below).
params {
    sc {
        harmony {
            container = 'vibsinglecellnf/harmony:1.0'
            report_ipynb = "${params.misc.test.enabled ? '../../..' : ''}/src/harmony/bin/reports/sc_harmony_report.ipynb"
            varsUse = ['batch']
        }
    }
}
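For contrast, a module with more than one process would give each process its own level under the tool name. This is a hypothetical sketch; the tool and process names are invented purely for illustration.

params {
    sc {
        sometool {
            container = 'vibsinglecellnf/sometool:1.0'
            step_one {
                // options read only by the first process
            }
            step_two {
                // options read only by the second process
            }
        }
    }
}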
The report_ipynb Jupyter Notebook is available here.
Create the R script to run Harmony:
#!/usr/bin/env Rscript print("##################################################") print("# Harmony: Algorithm for single cell integration #") print("##################################################") # Loading dependencies scripts library("optparse") parser <- OptionParser( prog = "run_harmony.R", description = "Scalable integration of single cell RNAseq data for batch correction and meta analysis" ) parser <- add_option( parser, c("-i", "--input-file"), action = "store", default = NULL, help = "Input file [default]" ) parser <- add_option( parser, c("-a", "--vars-use"), action = "store", default = NULL, help = "If meta_data is dataframe, this defined which variable(s) to remove (character vector)." ) parser <- add_option( parser, c("-p", "--do-pca"), action = "store", default = FALSE, help = "Whether to perform PCA on input matrix." ) parser <- add_option( parser, c("-o", "--output-prefix"), action = "store", default = "foo", help="Prefix path to save output files. [default %default]" ) parser <- add_option( parser, c("-s", "--seed"), action = "store", default = 617, help="Seed. [default %default]" ) args <- parse_args(parser) cat("Parameters: \n") print(args) if(is.null(args$`vars-use`)) { stop("The parameter --vars-use has to be set.") } # Required by irlba::irlba (which harmony depends on) for reproducibility if(!is.null(args$seed)) { set.seed(args$seed) } else { warnings("No seed is set, this will likely give none reproducible results.") } input_ext <- tools::file_ext(args$`input-file`) if(input_ext == "h5ad") { # Current fix until https://github.com/satijalab/seurat/issues/2485 is fixed file <- hdf5r::h5file(filename = args$`input-file`, mode = 'r') if(!("X_pca" %in% names(x = file[["obsm"]]))) { stop("X_pca slot is not found in the AnnData (h5ad).") } obs <- file[['obs']][] pca_embeddings <- t(x = file[["obsm"]][["X_pca"]][,]) row.names(x = pca_embeddings) <- obs$index colnames(x = pca_embeddings) <- paste0("PCA_", seq(from = 1, to = ncol(x = pca_embeddings))) metadata <- obs # seurat <- Seurat::ReadH5AD(file = args$`input-file`) # if(!("pca" %in% names(seurat@reductions)) || is.null(x = seurat@reductions$pca)) # stop("Expects a PCA embeddings data matrix but it does not exist.") # data <- seurat@reductions$pca # pca_embeddings <- data@cell.embeddings # metadata <- seurat@meta.data } else { stop(paste0("Unrecognized input file format: ", input_ext, ".")) } print(paste0("PCA embeddings matrix has ", dim(x = data)[1], " rows, ", dim(x = data)[2], " columns.")) if(sum(args$`vars-use` %in% colnames(x = metadata)) != length(x = args$`vars-use`)) { stop("Some argument value from the parameter(s) --vars-use are not found in the metadata.") } # Run Harmony # Expects PCA matrix (Cells as rows and PCs as columns.) harmony_embeddings <- harmony::HarmonyMatrix( data_mat = pca_embeddings , meta_data = metadata , vars_use = args$`vars-use` , do_pca = args$`do-pca` , verbose = FALSE ) # Save the results ## PCA corrected embeddings write.table( x = harmony_embeddings, file = paste0(args$`output-prefix`, ".tsv"), quote = FALSE, sep = "\t", row.names = TRUE, col.names = NA )
Create the Nextflow process that will run the Harmony R script defined in the previous step.
nextflow.preview.dsl=2

binDir = !params.containsKey("test") ? "${workflow.projectDir}/src/harmony/bin/" : ""

process SC__HARMONY__HARMONY_MATRIX {

    container params.sc.harmony.container
    publishDir "${params.global.outdir}/data/intermediate", mode: 'symlink'
    clusterOptions "-l nodes=1:ppn=${params.global.threads} -l walltime=1:00:00 -A ${params.global.qsubaccount}"

    input:
        tuple val(sampleId), path(f)

    output:
        tuple val(sampleId), path("${sampleId}.SC__HARMONY__HARMONY_MATRIX.tsv")

    script:
        def sampleParams = params.parseConfig(sampleId, params.global, params.sc.harmony)
        processParams = sampleParams.local
        varsUseAsArguments = processParams.varsUse.collect({ '--vars-use' + ' ' + it }).join(' ')
        """
        ${binDir}run_harmony.R \
            --seed ${params.global.seed} \
            --input-file ${f} \
            ${varsUseAsArguments} \
            --output-prefix "${sampleId}.SC__HARMONY__HARMONY_MATRIX"
        """
}
Create a Nextflow “subworkflow” that will call the Nextflow process defined in the previous step and perform some other tasks (dimensionality reduction, cluster identification, marker genes identification and report generation)
This step is not required. However, if this step is skipped, the code would still need to be added into the main harmony
workflow (workflows/harmony.nf, see the next step)nextflow.preview.dsl=2 ////////////////////////////////////////////////////// // process imports: include { clean; } from '../../utils/processes/utils.nf' params(params) include { COMBINE_BY_PARAMS; } from "../../utils/workflows/utils.nf" params(params) include { PUBLISH as PUBLISH_BEC_OUTPUT; PUBLISH as PUBLISH_BEC_DIMRED_OUTPUT; PUBLISH as PUBLISH_FINAL_HARMONY_OUTPUT; } from "../../utils/workflows/utils.nf" params(params) include { SC__HARMONY__HARMONY_MATRIX; } from './../processes/runHarmony.nf' params(params) include { SC__H5AD_UPDATE_X_PCA; } from './../../utils/processes/h5adUpdate.nf' params(params) include { NEIGHBORHOOD_GRAPH; } from './../../scanpy/workflows/neighborhood_graph.nf' params(params) include { DIM_REDUCTION_TSNE_UMAP; } from './../../scanpy/workflows/dim_reduction.nf' params(params) include { SC__SCANPY__CLUSTERING_PARAMS; } from './../../scanpy/processes/cluster.nf' params(params) include { CLUSTER_IDENTIFICATION; } from './../../scanpy/workflows/cluster_identification.nf' params(params) // Don't only import a specific process (the function needs also to be imported) // reporting: include { GENERATE_DUAL_INPUT_REPORT } from './../../scanpy/workflows/create_report.nf' params(params) ////////////////////////////////////////////////////// // Define the workflow workflow BEC_HARMONY { take: normalizedTransformedData dimReductionData // Expects (sampleId, anndata) clusterIdentificationPreBatchEffectCorrection main: // Run Harmony harmony_embeddings = SC__HARMONY__HARMONY_MATRIX( dimReductionData.map { it -> tuple(it[0], it[1]) } ) SC__H5AD_UPDATE_X_PCA( dimReductionData.map { it -> tuple(it[0], it[1]) }.join(harmony_embeddings) ) PUBLISH_BEC_OUTPUT( SC__H5AD_UPDATE_X_PCA.out, "BEC_HARMONY.output", "h5ad", null, false ) NEIGHBORHOOD_GRAPH( SC__H5AD_UPDATE_X_PCA.out.join( dimReductionData.map { it -> tuple(it[0], it[2], *it[3..(it.size()-1)]) } ) ) // Run dimensionality reduction DIM_REDUCTION_TSNE_UMAP( NEIGHBORHOOD_GRAPH.out ) PUBLISH_BEC_DIMRED_OUTPUT( DIM_REDUCTION_TSNE_UMAP.out.dimred_tsne_umap, "BEC_HARMONY.dimred_output", "h5ad", null, false ) // Run clustering // Define the parameters for clustering def clusteringParams = SC__SCANPY__CLUSTERING_PARAMS( clean(params.sc.scanpy.clustering) ) CLUSTER_IDENTIFICATION( normalizedTransformedData, DIM_REDUCTION_TSNE_UMAP.out.dimred_tsne_umap, "Post Batch Effect Correction (Harmony)" ) marker_genes = CLUSTER_IDENTIFICATION.out.marker_genes.map { it -> tuple( it[0], // sampleId it[1], // data !clusteringParams.isParameterExplorationModeOn() ? null : it[2..(it.size()-1)], // Stash params ) } PUBLISH_FINAL_HARMONY_OUTPUT( marker_genes.map { it -> tuple(it[0], it[1], it[2]) }, "BEC_HARMONY.final_output", "h5ad", null, clusteringParams.isParameterExplorationModeOn() ) // This will generate a dual report with results from // - Pre batch effect correction // - Post batch effect correction becDualDataPrePost = COMBINE_BY_PARAMS( clusterIdentificationPreBatchEffectCorrection, // Use PUBLISH output to avoid "input file name collision" PUBLISH_FINAL_HARMONY_OUTPUT.out, clusteringParams ) harmony_report = GENERATE_DUAL_INPUT_REPORT( becDualDataPrePost, file(workflow.projectDir + params.sc.harmony.report_ipynb), "SC_BEC_HARMONY_report", clusteringParams.isParameterExplorationModeOn() ) emit: data = CLUSTER_IDENTIFICATION.out.marker_genes cluster_report = CLUSTER_IDENTIFICATION.out.report harmony_report }
In the vsn-pipelines repository, create a new main workflow called harmony.nf under workflows/
:nextflow.preview.dsl=2 //////////////////////////////////////////////////////// // Import sub-workflows/processes from the utils module: include { getBaseName } from '../src/utils/processes/files.nf' include { clean; SC__FILE_CONVERTER; SC__FILE_CONCATENATOR } from '../src/utils/processes/utils.nf' params(params) include { COMBINE_BY_PARAMS } from '../src/utils/workflows/utils.nf' params(params) include { SC__H5AD_TO_FILTERED_LOOM } from '../src/utils/processes/h5adToLoom.nf' params(params) include { FILE_CONVERTER } from '../src/utils/workflows/fileConverter.nf' params(params) include { UTILS__GENERATE_WORKFLOW_CONFIG_REPORT } from '../src/utils/processes/reports.nf' params(params) //////////////////////////////////////////////////////// // Import sub-workflows/processes from the tool module: include { QC_FILTER } from '../src/scanpy/workflows/qc_filter.nf' params(params) include { NORMALIZE_TRANSFORM } from '../src/scanpy/workflows/normalize_transform.nf' params(params) include { HVG_SELECTION } from '../src/scanpy/workflows/hvg_selection.nf' params(params) include { NEIGHBORHOOD_GRAPH } from '../src/scanpy/workflows/neighborhood_graph.nf' params(params) include { DIM_REDUCTION_PCA } from '../src/scanpy/workflows/dim_reduction_pca.nf' params(params) include { DIM_REDUCTION_TSNE_UMAP } from '../src/scanpy/workflows/dim_reduction.nf' params(params) // cluster identification include { SC__SCANPY__CLUSTERING_PARAMS } from '../src/scanpy/processes/cluster.nf' params(params) include { CLUSTER_IDENTIFICATION } from '../src/scanpy/workflows/cluster_identification.nf' params(params) include { BEC_HARMONY } from '../src/harmony/workflows/bec_harmony.nf' params(params) // reporting: include { SC__SCANPY__MERGE_REPORTS } from '../src/scanpy/processes/reports.nf' params(params) include { SC__SCANPY__REPORT_TO_HTML } from '../src/scanpy/processes/reports.nf' params(params) workflow harmony { take: data main: out = data | \ SC__FILE_CONVERTER | \ FILTER_AND_ANNOTATE_AND_CLEAN if(params.sc.scanpy.containsKey("filter")) { out = QC_FILTER( out ).filtered // Remove concat } if(params.sc.containsKey("file_concatenator")) { out = SC__FILE_CONCATENATOR( out.map { it -> it[1] }.toSortedList( { a, b -> getBaseName(a, "SC") <=> getBaseName(b, "SC") } ) ) } if(params.sc.scanpy.containsKey("data_transformation") && params.sc.scanpy.containsKey("normalization")) { out = NORMALIZE_TRANSFORM( out ) } out = HVG_SELECTION( out ) DIM_REDUCTION_PCA( out ) NEIGHBORHOOD_GRAPH( DIM_REDUCTION_PCA.out ) DIM_REDUCTION_TSNE_UMAP( NEIGHBORHOOD_GRAPH.out ) // Perform the clustering step w/o batch effect correction (for comparison matter) clusterIdentificationPreBatchEffectCorrection = CLUSTER_IDENTIFICATION( NORMALIZE_TRANSFORM.out, DIM_REDUCTION_TSNE_UMAP.out.dimred_tsne_umap, "Pre Batch Effect Correction" ) // Perform the batch effect correction BEC_HARMONY( NORMALIZE_TRANSFORM.out, // include only PCA since Harmony will correct this DIM_REDUCTION_PCA.out, clusterIdentificationPreBatchEffectCorrection.marker_genes ) // Conversion // Convert h5ad to X (here we choose: loom format) if(params.sc.containsKey("file_concatenator")) { filteredloom = SC__H5AD_TO_FILTERED_LOOM( SC__FILE_CONCATENATOR.out ) scopeloom = FILE_CONVERTER( BEC_HARMONY.out.data.groupTuple(), 'HARMONY.final_output', 'loom', SC__FILE_CONCATENATOR.out ) } else { filteredloom = SC__H5AD_TO_FILTERED_LOOM( SC__FILE_CONVERTER.out ) scopeloom = FILE_CONVERTER( BEC_HARMONY.out.data.groupTuple(), 'HARMONY.final_output', 'loom', SC__FILE_CONVERTER.out ) } project = 
CLUSTER_IDENTIFICATION.out.marker_genes.map { it -> it[0] } UTILS__GENERATE_WORKFLOW_CONFIG_REPORT( file(workflow.projectDir + params.utils.workflow_configuration.report_ipynb) ) // Collect the reports: // Define the parameters for clustering def clusteringParams = SC__SCANPY__CLUSTERING_PARAMS( clean(params.sc.scanpy.clustering) ) // Pairing clustering reports with bec reports if(!clusteringParams.isParameterExplorationModeOn()) { clusteringBECReports = BEC_HARMONY.out.cluster_report.map { it -> tuple(it[0], it[1]) }.combine( BEC_HARMONY.out.harmony_report.map { it -> tuple(it[0], it[1]) }, by: 0 ).map { it -> tuple(it[0], *it[1..it.size()-1], null) } } else { clusteringBECReports = COMBINE_BY_PARAMS( BEC_HARMONY.out.cluster_report.map { it -> tuple(it[0], it[1], *it[2]) }, BEC_HARMONY.out.harmony_report, clusteringParams ) } ipynbs = project.combine( UTILS__GENERATE_WORKFLOW_CONFIG_REPORT.out ).join( HVG_SELECTION.out.report.map { it -> tuple(it[0], it[1]) } ).combine( clusteringBECReports, by: 0 ).map { it -> tuple(it[0], it[1..it.size()-2], it[it.size()-1]) } // reporting: SC__SCANPY__MERGE_REPORTS( ipynbs, "merged_report", clusteringParams.isParameterExplorationModeOn() ) SC__SCANPY__REPORT_TO_HTML(SC__SCANPY__MERGE_REPORTS.out) emit: filteredloom scopeloom }
Add a new Nextflow profile in the profiles section of the main nextflow.config of the vsn-pipelines repository:

profiles {
    harmony {
        includeConfig 'src/scanpy/scanpy.config'
        includeConfig 'src/harmony/harmony.config'
    }
    ...
}
Finally, add a new entry in main.nf of the vsn-pipelines repository:

// run multi-sample with harmony, output a scope loom file
workflow harmony {

    include { harmony as HARMONY } from './workflows/harmony' params(params)
    include { PUBLISH as PUBLISH_HARMONY } from "./src/utils/workflows/utils" params(params)

    getDataChannel | HARMONY

    PUBLISH_HARMONY(
        HARMONY.out.scopeloom,
        "HARMONY",
        "loom",
        null,
        false
    )

}
You should now be able to configure (nextflow config ...) and run the harmony pipeline (nextflow run ...).
After confirming that your module is functional, you should create a pull request to merge your changes into the develop branch.
- Make sure you have removed all references to TEMPLATE in your repository.
- Include some basic documentation for your module so people know what it does and how to use it.
The pull request will be reviewed and accepted once it is confirmed to be working. Once the develop branch is merged into master, the new tool will be part of the new release of VSN Pipelines!
Repository structure¶
Root¶
The repository root contains a main.nf and associated nextflow.config. The root main.nf imports and calls sub-workflows defined in the modules.
Modules¶
A “module” consists of a folder labeled with the tool name (Scanpy, SCENIC, utils, etc.), with subfolders for:
- bin/ (scripts passed into the container)
- processes/ (where Nextflow processes are defined)
The root of the modules folder contains workflow files + associated configs (as many as there are workflows):
- main.nf + nextflow.config
- single_sample.nf + scenic.config
- …
src/
├── cellranger
│ ├── main.nf
│ ├── nextflow.config
│ └── processes
│ ├── count.nf
│ └── mkfastq.nf
│
├── channels
│ └── tenx.nf
│
├── scenic
│ ├── bin
│ │ ├── grnboost2_without_dask.py
│ ├── processes
│ │ ├── aucell.nf
│ │ ├── cistarget.nf
│ │ ├── grnboost2withoutDask.nf
│ ├── main.nf
│ └── scenic.config
│
└── utils
├── bin
│ ├── h5ad_to_loom.py
│ ├── sc_file_concatenator.py
│ └── sc_file_converter.py
├── utils.config
└── processes
├── files.nf
├── h5ad_to_loom.nf
├── utils_1.test.nf
├── utils_2.test.nf
└── utils.nf
Workflows¶
Workflows (chains of nf processes) are defined in the module root folder (e.g. src/Scanpy/bec_bbknn.nf). Workflows import multiple processes and define the workflow by name:
include SC__CELLRANGER__MKFASTQ from './processes/mkfastq' params(params)
include SC__CELLRANGER__COUNT from './processes/count' params(params)
workflow CELLRANGER {
main:
SC__CELLRANGER__MKFASTQ(file(params.sc.cellranger.mkfastq.csv), path(params.sc.cellranger.mkfastq.runFolder))
SC__CELLRANGER__COUNT(file(params.sc.cellranger.count.transcriptome), SC__CELLRANGER__MKFASTQ.out.flatten())
emit:
SC__CELLRANGER__COUNT.out
}
Workflow imports¶
Entire sub-workflows can also be imported in other workflows with one command (inheriting all of the process imports from the workflow definition):
include CELLRANGER from '../cellranger/main.nf' params(params)
This leads to the ability to easily define high-level workflows in the master nf file, vib-singlecell-nf/vsn-pipelines/main.nf:
include CELLRANGER from './src/cellranger/main.nf' params(params)
include BEC_BBKNN from './src/scanpy/bec_bbknn.nf' params(params)
include SCENIC from './src/scenic/main.nf' params(params)
workflow {
CELLRANGER()
BEC_BBKNN( CELLRANGER.out )
SCENIC( BEC_BBKNN.out )
}
Parameters structure¶
Parameters are stored in a separate config file per workflow, plus the main nextflow.config.
These parameters are merged when starting the run using e.g.:
includeConfig 'src/scenic/nextflow.config'
The parameter structure internally (post-merge) is:
params {
global {
baseFilePath = "/opt/vib-singlecell-nf"
project_name = "MCF7"
...
}
sc {
utils {
file_converter {
...
}
file_annotator {
...
}
file_concatenator {
...
}
}
scanpy {
container = 'docker://vib-singlecell-nf/scanpy:0.5.0'
filter {
...
}
data_transformation {
...
}
normalization {
...
}
feature_selection {
...
}
feature_scaling {
...
}
dim_reduction {
pca {
method = 'pca'
...
}
umap {
method = 'tsne'
...
}
}
batch_effect_correct {
...
}
clustering {
...
}
}
}
}
Module testing¶
Modules and processes can be tested independently; you can find an example in src/utils/main.test.nf. The SC__FILE_CONVERTER process is tested against the tiny dataset available in data/01.count.
Attributions¶
VSN-Pipelines is a collection of workflows targeted toward the analysis of single cell data. VSN is dependent on, and takes functions from, many tools, developed both internally and externally, which are listed here.
Tools¶
- GreenleafLab/ArchR
- caleblareau/bap
- lh3/bwa
- Samtools
- campbio/celda
- Directs
- DropletUtils
- Drop-seq Tools
- EDirect
- OpenGene/fastp
- hangnoh/flybaseR
- dweemx/flybaseR
- immunogenomics/harmony
- pcacv
- Picard
- statgen/popscle
- aertslab/popscle_helper_tools
- aertslab/cisTopic
- theislab/scanpy
- aertslab/pySCENIC
- aertslab/SCENIC
- swolock/scrublet
- aertslab/single_cell_toolkit
- timoast/sinto
- constantAmateur/SoupX
- ncbi/sra-tools
- alexdobin/STAR
- Trim Galore
VSN-Pipelines¶
A repository of pipelines for single-cell data analysis in Nextflow DSL2.
Full documentation is available on Read the Docs, or take a look at the Quick Start guide.
This main repo contains multiple workflows for analyzing single cell transcriptomics data, and depends on a number of tools, which are organized into subfolders within the src/ directory.
The VIB-Singlecell-NF organization contains this main repo along with a collection of example runs (VSN-Pipelines-examples).
Currently available workflows are listed below.
If VSN-Pipelines is useful for your research, consider citing:
- VSN-Pipelines All Versions (latest): 10.5281/zenodo.3703108.
Raw Data Processing Workflows¶
These are set up to run Cell Ranger and DropSeq pipelines.
Pipeline / Entrypoint | Purpose | Documentation |
---|---|---|
cellranger | Process 10x Chromium data | cellranger |
demuxlet_freemuxlet | Demultiplexing | demuxlet_freemuxlet |
nemesh | Process Drop-seq data | nemesh |
Single Sample Workflows¶
The Single Sample Workflows perform a “best practices” scRNA-seq analysis. Multiple samples can be run in parallel, treating each sample separately.
Pipeline / Entrypoint | Purpose | Documentation |
---|---|---|
single_sample | Independent samples | |
single_sample_scenic | Ind. samples + SCENIC | |
scenic | SCENIC GRN inference | |
scenic_multiruns | SCENIC run multiple times | |
single_sample_scenic_multiruns | Ind. samples + multi-SCENIC | |
single_sample_scrublet | Ind. samples + Scrublet | |
decontx | DecontX | |
single_sample_decontx | Ind. samples + DecontX | |
single_sample_decontx_scrublet | Ind. samples + DecontX + Scrublet |
Sample Aggregation Workflows¶
The Sample Aggregation Workflows perform a “best practices” scRNA-seq analysis on a merged and batch-corrected group of samples. Available batch correction methods include BBKNN, mnnCorrect, and Harmony.
Pipeline / Entrypoint | Purpose | Documentation |
---|---|---|
bbknn | Sample aggregation + BBKNN | |
bbknn_scenic | BBKNN + SCENIC | |
harmony | Sample aggregation + Harmony | |
harmony_scenic | Harmony + SCENIC | |
mnncorrect | Sample aggregation + mnnCorrect |
In addition, the pySCENIC implementation of the SCENIC workflow is integrated here and can be run in conjunction with any of the above workflows. The output of each of the main workflows is a loom-format file, which is ready for import into the interactive single-cell web visualization tool SCope. In addition, data is also output in h5ad format, and reports are generated for the major pipeline steps.
scATAC-seq workflows¶
Single cell ATAC-seq processing steps are now included in VSN Pipelines. Currently, a preprocessing workflow is available, which will take fastq inputs, apply barcode correction, read trimming, and bwa mapping, and output bam and fragments files for further downstream analysis. See here for complete documentation.