Pipelines¶

Single-sample Pipelines¶

Pipelines to run on a single sample or multiple samples separately and in parallel.

single_sample ¶

The single_sample workflow processes 10x data, taking in 10x-structured data and a metadata file. It runs the standard analysis steps: filtering, normalization, log-transformation, HVG selection, dimensionality reduction, clustering, and loom file generation. The output is a loom file with the results embedded.

Single-sample Workflow

single_sample_scenic ¶

Runs the single_sample workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results. This could be very resource intensive, depending on the dataset.

Single-sample SCENIC Workflow

single_sample_scrublet ¶

Runs the single_sample workflow above together with the Scrublet workflow.

Single-sample Scrublet Workflow

Both the single_sample workflow and the scrublet workflow run from the input data. The final processed file from the single_sample pipeline is annotated with the cell-based data generated by Scrublet.

The pipelines generate the following relevant files for each sample:

Output Files (not exhaustive list)¶
Output File	Description
out/data/*.SINGLE_SAMPLE_SCRUBLET.loom	SCope-ready loom file resulting from a single_sample workflow run, with additional cell metadata (doublet scores and predicted doublets) from the Scrublet run.
out/data/scrublet/*.SC__SCRUBLET__DOUBLET_DETECTION.ScrubletObject.pklz	Pickled file containing the Scrublet object.
out/data/scrublet/*.SCRUBLET.SC__ANNOTATE_BY_CELL_METADATA.h5ad	h5ad file with raw data and doublets annotated.
out/data/scrublet/*.SINGLE_SAMPLE_SCRUBLET.h5ad	h5ad file resulting from a `single_sample` workflow run and with doublets (inferred from Scrublet) removed.

Currently there are 3 methods available to call doublets from Scrublet doublet scores:

1. (Default) Scrublet will try to automatically identify the doublet score threshold. The threshold is then used to call doublets based on the doublet scores available in the scrublet__doublet_scores column. The doublets called are available in the scrublet__predicted_doublets column.
2. Scrublet can fail to find the threshold automatically. In that case, the pipeline will fail and report that either the method described in 3. has to be used or a custom threshold has to be provided. Either way, the pipeline will generate the Scrublet histograms, which are especially helpful when choosing a custom threshold. The custom threshold needs to be set in the config as follows:

params {
    tools {
        scrublet {
            threshold = [
              "<sample-name>": <custom-threshold>
            ]
        }
    }
}

3. This method is specific to samples generated by the 10x Genomics single-cell platform. It is based on the expected doublet rate in 10x Genomics samples. The number of doublets called (D) is equal to the doublet rate (given the number of cells) times the number of cells in that 10x Genomics sample. The cells are then ranked by their Scrublet doublet score (descending order) and the top D cells are called as doublets.
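The threshold-based and rate-based calls described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's own code: the function names, the example scores and the 20% doublet rate are all made up for the example, and the strict `>` comparison against the threshold is an assumption.

```python
def call_doublets_by_threshold(doublet_scores, threshold):
    """Methods 1 and 2: a cell is a doublet if its score exceeds the threshold.

    The strict '>' comparison is an assumption of this sketch.
    """
    return [score > threshold for score in doublet_scores]


def call_doublets_by_rate(doublet_scores, doublet_rate):
    """Method 3: call the D highest-scoring cells as doublets, D = rate * n_cells."""
    n_cells = len(doublet_scores)
    n_doublets = round(doublet_rate * n_cells)
    # Rank cell indices by doublet score, highest first.
    ranked = sorted(range(n_cells), key=lambda i: doublet_scores[i], reverse=True)
    top = set(ranked[:n_doublets])
    return [i in top for i in range(n_cells)]


# Hypothetical doublet scores for 10 cells and a hypothetical 20% doublet rate.
scores = [0.05, 0.80, 0.10, 0.65, 0.02, 0.30, 0.07, 0.12, 0.04, 0.09]
by_rate = call_doublets_by_rate(scores, doublet_rate=0.2)
by_threshold = call_doublets_by_threshold(scores, threshold=0.5)
print(by_rate)
print(by_threshold)
```

With these toy values D = 2, so the two highest-scoring cells are called as doublets; the 0.5 threshold happens to select the same two cells.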

decontx ¶

Runs the decontx workflow.

DecontX Workflow

The pipelines generate the following files for each sample:

Output Files¶
Output File	Description
out/data/*.CELDA_DECONTX_{FILTER,CORRECT}.h5ad	An h5ad file with either the matrix filtered using one of the provided filters or the matrix corrected (decontaminated) by DecontX.
out/data/celda/*.CELDA__DECONTX.Rds	An Rds file containing the SingleCellExperiment object processed by DecontX.
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Table.tsv	A cell-based .tsv file containing data generated by DecontX and additional outlier masks: decontX_contamination, decontX_clusters, celda_decontx__{doublemad,scater_isOutlier_3MAD,custom_gt_0.5}_predicted_outliers
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Thresholds.tsv	A .tsv file containing a table with the different thresholds used to generate the outlier masks.
out/data/celda/*.CELDA__DECONTX.Contamination_Score_Density_with_{doublemad,scater_isOutlier_3MAD,custom_gt_0.5}.pdf	A .pdf plot showing the density of the decontamination score from DecontX and the outlier area highlighted for the given outlier threshold.
out/data/celda/*.CELDA__DECONTX.UMAP_Contamination_Score.pdf	A .pdf plot showing the DecontX contamination score on top of a UMAP generated from the decontaminated matrix.
out/data/celda/*.CELDA__DECONTX.UMAP_Clusters.pdf	A .pdf plot showing a UMAP generated by DecontX and from the decontaminated matrix.
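
To give an idea of what the outlier masks listed above represent, here is a minimal Python sketch of two of them: a 3-MAD mask and a fixed cut-off at 0.5. It is not the pipeline's implementation. It assumes the 3-MAD mask flags only the upper tail, uses the 1.4826 scaling constant that R's `mad` applies by default, and does not cover the doublemad filter. The function names and contamination values are invented for the example.

```python
from statistics import median


def upper_mad_threshold(scores, nmads=3.0, constant=1.4826):
    """Return median + nmads * scaled MAD (upper tail only; an assumption here)."""
    center = median(scores)
    mad = constant * median([abs(s - center) for s in scores])
    return center + nmads * mad


def outlier_mask(scores, threshold):
    """Flag every cell whose contamination score is above the threshold."""
    return [s > threshold for s in scores]


# Hypothetical per-cell contamination scores (fractions between 0 and 1).
contamination = [0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.30, 0.90]

mad_threshold = upper_mad_threshold(contamination)      # roughly 0.11 for these values
mad_outliers = outlier_mask(contamination, mad_threshold)
custom_outliers = outlier_mask(contamination, 0.5)      # analogous to custom_gt_0.5

print(mad_outliers)
print(custom_outliers)
```

On this toy data the MAD-based mask flags the last two cells, while the fixed 0.5 cut-off flags only the last one, which is why the thresholds table and density plots are worth checking before choosing a filter.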

single_sample_decontx ¶

Runs the single_sample workflow above together with the DecontX workflow.

Single-sample DecontX Workflow

The DecontX workflow runs from the input data. The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX.

See single_sample and decontx to know more about the files generated by this pipeline.

single_sample_decontx_scrublet ¶

Runs the single_sample workflow above together with the DecontX and Scrublet workflows.

Single-sample DecontX Scrublet Workflow

The single_sample and decontx workflows both run from the input data, while the scrublet workflow runs from the output of the DecontX workflow. The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX and Scrublet.

See single_sample, decontx and scrublet to know more about the files generated by this pipeline.

scenic ¶

Runs the scenic workflow alone, generating a loom file with only the SCENIC results. Currently, the required input is a loom file (set by params.tools.scenic.filteredLoom).

SCENIC Workflow

scenic_multiruns ¶

Runs the scenic workflow multiple times (set by params.tools.scenic.numRuns), generating a loom file with the aggregated results from the multiple SCENIC runs.

Note that this is not a complete entry-point itself, but a configuration option for the scenic module. Simply adding -profile scenic_multiruns during the config step will activate this analysis option for any of the standard entrypoints.

SCENIC Multi-runs Workflow

cellranger¶

Runs the cellranger workflow (mkfastq, then count). Input parameters are specified within the config file:

params.tools.cellranger.mkfastq.csv: path to the CSV samplesheet
params.tools.cellranger.mkfastq.runFolder: path of Illumina BCL run folder
params.tools.cellranger.count.transcriptome: path to the Cell Ranger compatible transcriptome reference

cellranger_count_metadata¶

Given the data stored as:

MKFASTQ_ID_SEQ_RUN1
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLC5BBXX
            |-- test_sample1
            |   |-- sample1_S1_L001_I1_001.fastq.gz
            |   |-- sample1_S1_L001_R1_001.fastq.gz
            |   |-- sample1_S1_L001_R2_001.fastq.gz
            |   |-- sample1_S1_L002_I1_001.fastq.gz
            |   |-- sample1_S1_L002_R1_001.fastq.gz
            |   |-- sample1_S1_L002_R2_001.fastq.gz
            |   |-- sample1_S1_L003_I1_001.fastq.gz
            |   |-- sample1_S1_L003_R1_001.fastq.gz
            |   |-- sample1_S1_L003_R2_001.fastq.gz
            |-- test_sample2
            |   |-- sample2_S2_L001_I1_001.fastq.gz
            |   |-- sample2_S2_L001_R1_001.fastq.gz
            |   |-- ...
        |-- Reports
        |-- Stats
        |-- Undetermined_S0_L001_I1_001.fastq.gz
        ...
        `-- Undetermined_S0_L003_R2_001.fastq.gz
MKFASTQ_ID_SEQ_RUN2
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLY8GGLL
            |-- test_sample1
            |   |-- ...
            |-- test_sample2
            |   |-- ...
        |-- ...

and a metadata table:

Minimally Required Metadata Table¶
sample_name	fastqs_parent_dir_path	fastqs_dir_name	fastqs_sample_prefix	expect_cells
Sample1_Bio_Rep1	MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX	test_sample1	sample1	5000
Sample1_Bio_Rep1	MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL	test_sample1	sample1	5000
Sample1_Bio_Rep2	MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX	test_sample2	sample2	5000
Sample1_Bio_Rep2	MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL	test_sample2	sample2	5000

Optional columns:

short_uuid: sample_name will be prefixed by this value. This should be the same between sequencing runs of the same biological replicate.
expect_cells: This number will be used as argument for the --expect-cells parameter in cellranger count.
chemistry: This chemistry will be used as argument for the --chemistry parameter in cellranger count.

and a config:

nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile cellranger_count_metadata \
   > nextflow.config

and a workflow run command:

nextflow run \
    ~/vib-singlecell-nf/vsn-pipelines \
    -entry cellranger_count_metadata

The workflow will run Cell Ranger count on 2 samples, each using the 2 sequencing runs.

NOTES:

If fastqs_dir_name does not exist, set it to none
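
To illustrate how the metadata table maps onto two Cell Ranger count runs, the following Python sketch groups the rows by sample_name and collects the FASTQ directories contributed by each sequencing run. It is only a sketch of the grouping idea, not the pipeline's code: the function name is hypothetical, the paths follow the directory layout shown earlier, and skipping the sub-directory when fastqs_dir_name is `none` is an assumption about how that value is handled.

```python
import csv
import io
from collections import defaultdict

# Tab-separated metadata, one row per (sample, sequencing run).
METADATA_TSV = (
    "sample_name\tfastqs_parent_dir_path\tfastqs_dir_name\tfastqs_sample_prefix\texpect_cells\n"
    "Sample1_Bio_Rep1\tMKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX\ttest_sample1\tsample1\t5000\n"
    "Sample1_Bio_Rep1\tMKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL\ttest_sample1\tsample1\t5000\n"
    "Sample1_Bio_Rep2\tMKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX\ttest_sample2\tsample2\t5000\n"
    "Sample1_Bio_Rep2\tMKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL\ttest_sample2\tsample2\t5000\n"
)


def fastq_dirs_per_sample(tsv_text):
    """Group the FASTQ directories of all sequencing runs by sample_name."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        path = row["fastqs_parent_dir_path"]
        # Assumption: 'none' means the FASTQ files sit directly in the parent dir.
        if row["fastqs_dir_name"].lower() != "none":
            path = f"{path}/{row['fastqs_dir_name']}"
        groups[row["sample_name"]].append(path)
    return dict(groups)


sample_dirs = fastq_dirs_per_sample(METADATA_TSV)
for sample, dirs in sample_dirs.items():
    # One Cell Ranger count run per sample, fed by every listed directory.
    print(sample, ",".join(dirs))
```

Each of the two samples ends up with two directories, one per sequencing run, which matches the statement that Cell Ranger count runs on 2 samples using both runs.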

demuxlet/freemuxlet¶

Runs the demuxlet or freemuxlet workflows (dsc-pileup [with prefiltering], then freemuxlet or demuxlet). Input parameters are specified within the config file:

params.tools.popscle.vcf: path to the VCF file for demultiplexing
params.tools.popscle.freemuxlet.nSamples: Number of clusters to extract (should match the number of samples pooled)
params.tools.popscle.demuxlet.field: Field in the VCF with genotype information

nemesh¶

Runs the nemesh pipeline (Drop-seq) on a single sample or multiple samples separately.

Source

Sample Aggregation Pipelines¶

Pipelines to aggregate multiple datasets together.

bbknn ¶

Runs the bbknn workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, then the batch-effect correction steps: BBKNN, clustering, dimensionality reduction (UMAP only)). The output is a loom file with the results embedded.

Source: https://github.com/Teichlab/bbknn/blob/master/examples/pancreas.ipynb

BBKNN Workflow

Output Files (not exhaustive list)¶
Output File	Description
out/data/*.BBKNN.h5ad	Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.BBKNN.loom	SCope-ready loom file containing all results.

bbknn_scenic ¶

Runs the bbknn workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results. This could be very resource intensive, depending on the dataset.

BBKNN SCENIC Workflow

Output Files (not exhaustive list)¶
Output File	Description
out/data/*.BBKNN.h5ad	Scanpy-ready h5ad file containing all results from a bbknn workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.BBKNN_SCENIC.loom	SCope-ready loom file containing all results from a bbknn workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …).

harmony ¶

Runs the harmony workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (Harmony), clustering, dimensionality reduction (t-SNE and UMAP)). The output is a loom file with the results embedded.

Harmony Workflow

Output Files (not exhaustive list)¶
Output File	Description
out/data/*.HARMONY.h5ad	Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.HARMONY.loom	SCope-ready loom file containing all results.

harmony_scenic ¶

Runs the harmony workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results. This could be very resource intensive, depending on the dataset.

HARMONY SCENIC Workflow

Output Files (not exhaustive list)¶
Output File	Description
out/data/*.HARMONY.h5ad	Scanpy-ready h5ad file containing all results from a harmony workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.HARMONY_SCENIC.loom	SCope-ready loom file containing all results from a harmony workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …).

mnncorrect ¶

Runs the mnncorrect workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (mnnCorrect), clustering, dimensionality reduction (t-SNE and UMAP)). The output is a loom file with the results embedded.

mnnCorrect Workflow

Output Files (not exhaustive list)¶
Output File	Description
out/data/*.MNNCORRECT.h5ad	Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.MNNCORRECT.loom	SCope-ready loom file containing all results.

Utility Pipelines¶

Contrary to the aforementioned pipelines, these are not end-to-end. They are used to perform small incremental processing steps.

cell_annotate¶

Runs the cell_annotate workflow, which performs a cell-based annotation of the data using a set of provided .tsv metadata files. The use case below uses 10x Genomics data, where different samples are annotated using the obo method. For more information about this cell-based annotation feature, see the Cell-based metadata annotation section.

First, generate the config:

nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile tenx,utils_cell_annotate,singularity \
   > nextflow.config

Make sure the following parts of the generated config are properly set:

[...]
data {
  tenx {
     cellranger_mex = '~/out/counts/*/outs/'
  }
}
tools {
    scanpy {
        container = 'vibsinglecellnf/scanpy:1.8.1'
    }
    cell_annotate {
        off = 'h5ad'
        method = 'obo'
        indexColumnName = 'BARCODE'
        cellMetaDataFilePath = "~/out/data/*.best"
        sampleSuffixWithExtension = '_demuxlet.best'
        annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
    }
    [...]
}
[...]

Now we can run it with the following command:

nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
   -entry cell_annotate

cell_annotate_filter ¶

Runs the cell_annotate_filter workflow, which performs a cell-based annotation of the data using a set of provided .tsv metadata files, followed by a cell-based filtering. The use case below uses 10x Genomics data, where different samples are annotated using the obo method. For more information, see the Cell-based metadata annotation section and the Cell-based metadata filtering section.

First, generate the config:

nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile tenx,utils_cell_annotate,utils_cell_filter,singularity \
   > nextflow.config

Make sure the following parts of the generated config are properly set:

[...]
data {
  tenx {
     cellranger_mex = '~/out/counts/*/outs/'
  }
}
tools {
    scanpy {
        container = 'vibsinglecellnf/scanpy:1.8.1'
    }
    cell_annotate {
        off = 'h5ad'
        method = 'obo'
        indexColumnName = 'BARCODE'
        cellMetaDataFilePath = "~/out/data/*.best"
        sampleSuffixWithExtension = '_demuxlet.best'
        annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
    }
    cell_filter {
        off = 'h5ad'
        method = 'internal'
        filters = [
            [
                id:'NO_DOUBLETS',
                sampleColumnName:'sample_id',
                filterColumnName:'DROPLET.TYPE',
                valuesToKeepFromFilterColumn: ['SNG']
            ]
        ]
    }
    [...]
}
[...]
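
The effect of the NO_DOUBLETS filter in the config above can be pictured with a short Python sketch: after annotation, only the cells whose DROPLET.TYPE value is listed in valuesToKeepFromFilterColumn are kept. This is an illustration of the idea only, not the pipeline's implementation. The function name is hypothetical and the barcodes are made up; SNG, DBL and AMB are used as example droplet types.

```python
def filter_cells(cells, filter_column, values_to_keep):
    """Keep only the cells whose value in filter_column is in values_to_keep."""
    keep = set(values_to_keep)
    return [cell for cell in cells if cell[filter_column] in keep]


# Hypothetical annotated cells (made-up barcodes) for a single sample.
annotated_cells = [
    {"BARCODE": "AAACCTGAGAAACCAT-1", "DROPLET.TYPE": "SNG"},
    {"BARCODE": "AAACCTGAGAAACCGC-1", "DROPLET.TYPE": "DBL"},
    {"BARCODE": "AAACCTGAGAAACCTA-1", "DROPLET.TYPE": "AMB"},
    {"BARCODE": "AAACCTGAGAAACGAG-1", "DROPLET.TYPE": "SNG"},
]

# Mirrors filterColumnName: 'DROPLET.TYPE' and valuesToKeepFromFilterColumn: ['SNG'].
singlets = filter_cells(annotated_cells, "DROPLET.TYPE", ["SNG"])
print([cell["BARCODE"] for cell in singlets])
```

Listing several values in valuesToKeepFromFilterColumn would keep the union of the matching cells in the same way.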

Now we can run it with the following command:

nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
   -entry cell_annotate_filter

sra¶

Runs the sra workflow, which downloads all (or a user-defined selection of) FASTQ files from a particular SRA project and renames them with proper, human-readable names.

First, generate the config:

nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile sra,singularity \
   > nextflow.config

NOTES:

The download of SRA files is by default limited to 20 Gb. If this limit needs to be increased please set params.tools.sratoolkit.maxSize accordingly. This limit can be ‘removed’ by setting the parameter to an arbitrarily high number (e.g.: 9999999999999).
If you’re a VSC user, you might want to add the vsc profile.
The final output (FASTQ files) will be available in out/data/sra.
If you’re downloading 10x Genomics scATAC-seq data, make sure to set params.tools.sratoolkit.includeTechnicalReads = true and properly set params.utils.sra_normalize_fastqs.fastq_read_suffixes. In the case of downloading the scATAC-seq samples of SRP254409, fastq_read_suffixes would be set to ["R1", "R2", "I1", "I2"].

Now we can run it with the following command:

nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
    -entry sra

$ nextflow -C nextflow.config run ~/vib-singlecell-nf/vsn-pipelines -entry sra
N E X T F L O W  ~  version 21.04.3
Launching `~/vib-singlecell-nf/vsn-pipelines/main.nf` [sleepy_goldstine] - revision: ba1dedbf51
executor >  local (23)
[12/25b9d4] process > sra:DOWNLOAD_FROM_SRA:SRA_TO_METADATA (1)                                             [100%] 1 of 1 _
[e2/d5a429] process > sra:DOWNLOAD_FROM_SRA:SRATOOLKIT__DOWNLOAD_FASTQS:DOWNLOAD_FASTQS_FROM_SRA_ACC_ID (4) [ 33%] 3 of 9
[30/cba7a0] process > sra:DOWNLOAD_FROM_SRA:SRATOOLKIT__DOWNLOAD_FASTQS:FIX_AND_COMPRESS_SRA_FASTQ (3)      [100%] 3 of 3
[76/97ce6e] process > sra:DOWNLOAD_FROM_SRA:NORMALIZE_SRA_FASTQS (3)                                        [100%] 3 of 3
[8c/3125c4] process > sra:PUBLISH:SC__PUBLISH (11)                                                          [100%] 12 of 12
...