Pipelines

Single-sample Pipelines

Pipelines to run on a single sample or multiple samples separately and in parallel.

single_sample

The single_sample workflow processes 10x data, taking 10x-structured data and a metadata file as input. It runs the standard analysis steps: filtering, normalization, log-transformation, HVG selection, dimensionality reduction, clustering, and loom file generation. The output is a loom file with the results embedded.
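As a rough illustration of what the first three steps do to a count matrix, here is a toy sketch in plain Python. It is not the pipeline's code: the function name, the min_counts cutoff and the target_sum value are hypothetical, and the real workflow applies its own configurable filters and parameters.

```python
import math


def filter_normalize_log(counts, min_counts=3, target_sum=10.0):
    """Toy version of filtering, normalization and log-transformation.

    counts maps a cell name to its list of raw gene counts.
    min_counts and target_sum are hypothetical illustration values.
    """
    processed = {}
    for cell, gene_counts in counts.items():
        total = sum(gene_counts)
        # Filtering: drop cells with too few counts overall.
        if total < min_counts:
            continue
        # Normalization to a common total, then log(1 + x).
        processed[cell] = [math.log1p(c * target_sum / total) for c in gene_counts]
    return processed


counts = {"cell_a": [4, 0, 6], "cell_b": [0, 1, 0], "cell_c": [5, 5, 0]}
processed = filter_normalize_log(counts)
for cell, values in processed.items():
    print(cell, [round(v, 3) for v in values])
```

In this toy input, cell_b is removed by the filter and the two remaining cells end up on the same scale.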

Single-sample Workflow


single_sample_scenic

Runs the single_sample workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results. This could be very resource intensive, depending on the dataset.

Single-sample SCENIC Workflow


single_sample_scrublet

Runs the single_sample workflow above together with the Scrublet workflow.

Single-sample Scrublet Workflow

Both the single_sample workflow and the scrublet workflow run from the input data. The final processed file from the single_sample pipeline is then annotated with the cell-based data generated by Scrublet.

The pipelines generate the following relevant files for each sample:

Output Files (not exhaustive list)
Output File | Description
out/data/*.SINGLE_SAMPLE_SCRUBLET.loom | SCope-ready loom file from the single_sample workflow with additional cell metadata from the Scrublet run (doublet scores and predicted doublets).
out/data/scrublet/*.SC__SCRUBLET__DOUBLET_DETECTION.ScrubletObject.pklz | Pickled file containing the Scrublet object.
out/data/scrublet/*.SCRUBLET.SC__ANNOTATE_BY_CELL_METADATA.h5ad | h5ad file with raw data and doublets annotated.
out/data/scrublet/*.SINGLE_SAMPLE_SCRUBLET.h5ad | h5ad file resulting from a single_sample workflow run, with doublets (inferred by Scrublet) removed.

Currently there are 3 methods available to call doublets from Scrublet doublet scores:

  1. (Default) Scrublet will try to automatically identify the doublet score threshold. The threshold is then used to call doublets based on the doublet scores available in the scrublet__doublet_scores column. The doublets called are available in the scrublet__predicted_doublets column.
  2. Scrublet may fail to find the threshold automatically. In that case, the pipeline will fail and report that either the method defined in 3. has to be used or a custom threshold has to be provided. Either way, the pipeline will generate the Scrublet histograms. These are especially helpful if the user decides to select a custom threshold, which needs to be reflected in the config as follows:
params {
    tools {
        scrublet {
            threshold = [
              "<sample-name>": <custom-threshold>
            ]
        }
    }
}
  3. This method is specific to samples generated by the 10x Genomics single-cell platform. It is based on the expected doublet rate in 10x Genomics samples. The number of doublets called (D) is equal to the doublet rate (given a number of cells) times the number of cells in that 10x Genomics sample. The cells are then ranked by their Scrublet doublet score (descending order) and the top D cells are called as doublets.
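The difference between threshold-based calling (methods 1 and 2) and rank-based calling (method 3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the scores are made up, the strict "greater than" comparison is an assumption, and the doublet rate is passed in as a parameter rather than looked up from any 10x Genomics table.

```python
def call_doublets_by_threshold(scores, threshold):
    """Flag every cell whose doublet score is above the threshold."""
    return [score > threshold for score in scores]


def call_doublets_by_expected_rate(scores, doublet_rate):
    """Flag the top D cells by score, where D = doublet_rate * number of cells."""
    n_cells = len(scores)
    n_doublets = round(doublet_rate * n_cells)
    # Rank cell indices by descending doublet score.
    ranked = sorted(range(n_cells), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_doublets])
    return [i in top for i in range(n_cells)]


# Made-up doublet scores for 10 cells.
scores = [0.05, 0.62, 0.10, 0.48, 0.91, 0.07, 0.33, 0.02, 0.15, 0.70]

by_threshold = call_doublets_by_threshold(scores, threshold=0.5)
by_rate = call_doublets_by_expected_rate(scores, doublet_rate=0.2)
print(sum(by_threshold), "doublets called by threshold")
print(sum(by_rate), "doublets called by expected rate")
```

With a fixed threshold the number of doublets depends on the score distribution, whereas the rank-based method always calls exactly D cells regardless of how the scores are distributed.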

decontx

Runs the decontx workflow.

DecontX Workflow

The pipelines generate the following files for each sample:

Output Files
Output File | Description
out/data/*.CELDA_DECONTX_{FILTER,CORRECT}.h5ad | An h5ad file with either the matrix filtered using one of the provided filters or the matrix corrected (decontaminated) by DecontX.
out/data/celda/*.CELDA__DECONTX.Rds | An Rds file containing the SingleCellExperiment object processed by DecontX.
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Table.tsv | A cell-based .tsv file containing data generated by DecontX and additional outlier masks:
  • decontX_contamination
  • decontX_clusters
  • celda_decontx__{doublemad,scater_isOutlier_3MAD,custom_gt_0.5}_predicted_outliers
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Thresholds.tsv | A .tsv file containing a table with the different thresholds used to generate the outlier masks.
out/data/celda/*.CELDA__DECONTX.Contamination_Score_Density_with_{doublemad,scater_isOutlier_3MAD,custom_gt_0.5}.pdf | A .pdf plot showing the density of the contamination score from DecontX, with the outlier area highlighted for the given outlier threshold.
out/data/celda/*.CELDA__DECONTX.UMAP_Contamination_Score.pdf | A .pdf plot showing the DecontX contamination score on top of a UMAP generated from the decontaminated matrix.
out/data/celda/*.CELDA__DECONTX.UMAP_Clusters.pdf | A .pdf plot showing a UMAP generated by DecontX from the decontaminated matrix.
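To give an idea of what an outlier mask over the contamination score looks like, the sketch below computes two masks in plain Python: a fixed cutoff at 0.5 and a cutoff at 3 scaled MADs above the median. This is an assumption-laden illustration, not the pipeline's implementation: the function names are hypothetical, the values are made up, and the exact rules behind the doublemad and scater_isOutlier_3MAD masks are not reproduced here.

```python
from statistics import median


def mask_custom_gt(contamination, cutoff=0.5):
    """Flag cells whose contamination score is above a fixed cutoff."""
    return [value > cutoff for value in contamination]


def mask_mad_upper(contamination, nmads=3.0, constant=1.4826):
    """Flag cells more than nmads scaled MADs above the median (assumed rule)."""
    med = median(contamination)
    mad = constant * median(abs(value - med) for value in contamination)
    upper = med + nmads * mad
    return [value > upper for value in contamination]


# Made-up contamination scores for 9 cells.
contamination = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.55, 0.90]

print(mask_custom_gt(contamination))
print(mask_mad_upper(contamination))
```

The fixed cutoff is independent of the data, while the MAD-based cutoff adapts to the spread of scores in each sample.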

single_sample_decontx

Runs the single_sample workflow above together with the DecontX workflow.

Single-sample DecontX Workflow

The DecontX workflow runs from the input data. The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX.

See single_sample and decontx to know more about the files generated by this pipeline.


single_sample_decontx_scrublet

Runs the single_sample workflow above together with the DecontX and Scrublet workflows.

Single-sample DecontX Scrublet Workflow

Both the single_sample workflow and the decontx workflow run from the input data. The scrublet workflow runs from the output of the DecontX workflow. The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX and Scrublet.

See single_sample, decontx and scrublet to know more about the files generated by this pipeline.


scenic

Runs the scenic workflow alone, generating a loom file with only the SCENIC results. Currently, the required input is a loom file (set by params.tools.scenic.filteredLoom).

SCENIC Workflow


scenic_multiruns

Runs the scenic workflow multiple times (set by params.tools.scenic.numRuns), generating a loom file with the aggregated results from the multiple SCENIC runs.

Note that this is not a complete entry-point itself, but a configuration option for the scenic module. Simply adding -profile scenic_multiruns during the config step will activate this analysis option for any of the standard entrypoints.

SCENIC Multi-runs Workflow


cellranger

Runs the cellranger workflow (mkfastq, then count). Input parameters are specified within the config file:

  • params.tools.cellranger.mkfastq.csv: path to the CSV samplesheet
  • params.tools.cellranger.mkfastq.runFolder: path of Illumina BCL run folder
  • params.tools.cellranger.count.transcriptome: path to the Cell Ranger compatible transcriptome reference

cellranger_count_metadata

Given the data stored as:

MKFASTQ_ID_SEQ_RUN1
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLC5BBXX
            |-- test_sample1
            |   |-- sample1_S1_L001_I1_001.fastq.gz
            |   |-- sample1_S1_L001_R1_001.fastq.gz
            |   |-- sample1_S1_L001_R2_001.fastq.gz
            |   |-- sample1_S1_L002_I1_001.fastq.gz
            |   |-- sample1_S1_L002_R1_001.fastq.gz
            |   |-- sample1_S1_L002_R2_001.fastq.gz
            |   |-- sample1_S1_L003_I1_001.fastq.gz
            |   |-- sample1_S1_L003_R1_001.fastq.gz
            |   |-- sample1_S1_L003_R2_001.fastq.gz
            |-- test_sample2
            |   |-- sample2_S2_L001_I1_001.fastq.gz
            |   |-- sample2_S2_L001_R1_001.fastq.gz
            |   |-- ...
        |-- Reports
        |-- Stats
        |-- Undetermined_S0_L001_I1_001.fastq.gz
        ...
        `-- Undetermined_S0_L003_R2_001.fastq.gz
MKFASTQ_ID_SEQ_RUN2
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLY8GGLL
            |-- test_sample1
            |   |-- ...
            |-- test_sample2
            |   |-- ...
        |-- ...

and a metadata table:

Minimally Required Metadata Table
sample_name | fastqs_parent_dir_path | fastqs_dir_name | fastqs_sample_prefix | expect_cells
Sample1_Bio_Rep1 | MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX | test_sample1 | sample1 | 5000
Sample1_Bio_Rep1 | MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL | test_sample1 | sample1 | 5000
Sample1_Bio_Rep2 | MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX | test_sample2 | sample2 | 5000
Sample1_Bio_Rep2 | MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL | test_sample2 | sample2 | 5000

Optional columns:

  • short_uuid: sample_name will be prefixed with this value. It should be the same across sequencing runs of the same biological replicate.
  • expect_cells: This number will be used as argument for the --expect-cells parameter in cellranger count.
  • chemistry: This chemistry will be used as argument for the --chemistry parameter in cellranger count.

and a config:

nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile cellranger_count_metadata \
   > nextflow.config

and a workflow run command:

nextflow run \
    ~/vib-singlecell-nf/vsn-pipelines \
    -entry cellranger_count_metadata

The workflow will run Cell Ranger count on 2 samples, each using the 2 sequencing runs.
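The way rows sharing a sample_name are combined can be sketched as follows. This is an illustration only, not the pipeline's code: the function name is hypothetical, the table is assumed to be tab-separated, the paths follow the directory tree shown above, and joining the FASTQ directories with commas is an assumption about how they would be handed to a single count run.

```python
import csv
import io
from collections import defaultdict

# Metadata rows, one per (sample, sequencing run) pair.
ROWS = [
    ("sample_name", "fastqs_parent_dir_path", "fastqs_dir_name",
     "fastqs_sample_prefix", "expect_cells"),
    ("Sample1_Bio_Rep1", "MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX",
     "test_sample1", "sample1", "5000"),
    ("Sample1_Bio_Rep1", "MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL",
     "test_sample1", "sample1", "5000"),
    ("Sample1_Bio_Rep2", "MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX",
     "test_sample2", "sample2", "5000"),
    ("Sample1_Bio_Rep2", "MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL",
     "test_sample2", "sample2", "5000"),
]
METADATA_TSV = "\n".join("\t".join(row) for row in ROWS)


def group_fastq_dirs(tsv_text):
    """Collect the FASTQ directories of every sequencing run per sample_name."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        parent = row["fastqs_parent_dir_path"]
        sub_dir = row["fastqs_dir_name"]
        # A fastqs_dir_name of "none" means there is no sample sub-directory.
        path = parent if sub_dir == "none" else f"{parent}/{sub_dir}"
        groups[row["sample_name"]].append(path)
    return dict(groups)


fastq_dirs = group_fastq_dirs(METADATA_TSV)
for sample, dirs in fastq_dirs.items():
    print(sample, ",".join(dirs))
```

Four metadata rows collapse into two samples, each with one FASTQ directory per sequencing run.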

NOTES:

  • If fastqs_dir_name does not exist, set it to none

demuxlet/freemuxlet

Runs the demuxlet or freemuxlet workflows (dsc-pileup [with prefiltering], then freemuxlet or demuxlet). Input parameters are specified within the config file:

  • params.tools.popscle.vcf: path to the VCF file for demultiplexing
  • params.tools.popscle.freemuxlet.nSamples: Number of clusters to extract (should match the number of samples pooled)
  • params.tools.popscle.demuxlet.field: Field in the VCF with genotype information

nemesh

Runs the nemesh pipeline (Drop-seq) on a single sample or multiple samples separately.

Source


Sample Aggregation Pipelines

Pipelines to aggregate multiple datasets together.

bbknn

Runs the bbknn workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, then the batch-effect correction steps: BBKNN, clustering, dimensionality reduction (UMAP only)). The output is a loom file with the results embedded.

Source: https://github.com/Teichlab/bbknn/blob/master/examples/pancreas.ipynb

BBKNN Workflow

Output Files (not exhaustive list)
Output File | Description
out/data/*.BBKNN.h5ad | Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.BBKNN.loom | SCope-ready loom file containing all results.

bbknn_scenic

Runs the bbknn workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results. This could be very resource intensive, depending on the dataset.

BBKNN SCENIC Workflow

Output Files (not exhaustive list)
Output File | Description
out/data/*.BBKNN.h5ad | Scanpy-ready h5ad file containing all results from a bbknn workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.BBKNN_SCENIC.loom | SCope-ready loom file containing all results from a bbknn workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …).

harmony

Runs the harmony workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (Harmony), clustering, dimensionality reduction (t-SNE and UMAP)). The output is a loom file with the results embedded.

Harmony Workflow

Output Files (not exhaustive list)
Output File | Description
out/data/*.HARMONY.h5ad | Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.HARMONY.loom | SCope-ready loom file containing all results.

harmony_scenic

Runs the harmony workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results. This could be very resource intensive, depending on the dataset.

HARMONY SCENIC Workflow

Output Files (not exhaustive list)
Output File | Description
out/data/*.HARMONY.h5ad | Scanpy-ready h5ad file containing all results from a harmony workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.HARMONY_SCENIC.loom | SCope-ready loom file containing all results from a harmony workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …).

mnncorrect

Runs the mnncorrect workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (mnnCorrect), clustering, dimensionality reduction (t-SNE and UMAP)). The output is a loom file with the results embedded.


mnnCorrect Workflow

Output Files (not exhaustive list)
Output File | Description
out/data/*.MNNCORRECT.h5ad | Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.MNNCORRECT.loom | SCope-ready loom file containing all results.

Utility Pipelines

Contrary to the aforementioned pipelines, these are not end-to-end. They are used to perform small incremental processing steps.

cell_annotate

Runs the cell_annotate workflow, which performs a cell-based annotation of the data using a set of provided .tsv metadata files. The use case below shows 10x Genomics data where different samples are annotated using the obo method. For more information about this feature, see the Cell-based metadata annotation section.

First, generate the config:

nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile tenx,utils_cell_annotate,singularity \
   > nextflow.config

Make sure the following parts of the generated config are properly set:

[...]
data {
  tenx {
     cellranger_mex = '~/out/counts/*/outs/'
  }
}
tools {
    scanpy {
        container = 'vibsinglecellnf/scanpy:1.8.1'
    }
    cell_annotate {
        off = 'h5ad'
        method = 'obo'
        indexColumnName = 'BARCODE'
        cellMetaDataFilePath = "~/out/data/*.best"
        sampleSuffixWithExtension = '_demuxlet.best'
        annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
    }
    [...]
}
[...]

Now we can run it with the following command:

nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
   -entry cell_annotate
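Conceptually, the annotation step looks up each cell's barcode in the metadata file and copies the requested columns onto the cell. The sketch below illustrates that lookup in plain Python. The function and variable names are hypothetical, the rows are made-up toy values, and the pipeline's own implementation may differ.

```python
def annotate_cells(barcodes, metadata_rows, index_column, annotation_columns):
    """Attach the selected metadata columns to each cell, matched by barcode."""
    by_barcode = {row[index_column]: row for row in metadata_rows}
    annotated = {}
    for barcode in barcodes:
        row = by_barcode.get(barcode, {})
        # Cells missing from the metadata get None for every annotation column.
        annotated[barcode] = {col: row.get(col) for col in annotation_columns}
    return annotated


# Toy rows standing in for one sample's metadata file (values are made up).
metadata_rows = [
    {"BARCODE": "AAAC-1", "DROPLET.TYPE": "SNG", "SNG.BEST.GUESS": "donor1"},
    {"BARCODE": "AAAG-1", "DROPLET.TYPE": "DBL", "SNG.BEST.GUESS": "donor2"},
]

annotated = annotate_cells(
    barcodes=["AAAC-1", "AAAG-1", "AAAT-1"],
    metadata_rows=metadata_rows,
    index_column="BARCODE",
    annotation_columns=["DROPLET.TYPE", "SNG.BEST.GUESS"],
)
for barcode, values in annotated.items():
    print(barcode, values)
```

Here index_column and annotation_columns play the roles of indexColumnName and annotationColumnNames in the config above.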

cell_annotate_filter

Runs the cell_annotate_filter workflow, which performs a cell-based annotation of the data using a set of provided .tsv metadata files, followed by a cell-based filtering. The use case below shows 10x Genomics data where different samples are annotated using the obo method. For more information about these features, see the Cell-based metadata annotation section and the Cell-based metadata filtering section.

First, generate the config:

nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile tenx,utils_cell_annotate,utils_cell_filter,singularity \
   > nextflow.config

Make sure the following parts of the generated config are properly set:

[...]
data {
  tenx {
     cellranger_mex = '~/out/counts/*/outs/'
  }
}
tools {
    scanpy {
        container = 'vibsinglecellnf/scanpy:1.8.1'
    }
    cell_annotate {
        off = 'h5ad'
        method = 'obo'
        indexColumnName = 'BARCODE'
        cellMetaDataFilePath = "~/out/data/*.best"
        sampleSuffixWithExtension = '_demuxlet.best'
        annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
    }
    cell_filter {
        off = 'h5ad'
        method = 'internal'
        filters = [
            [
                id:'NO_DOUBLETS',
                sampleColumnName:'sample_id',
                filterColumnName:'DROPLET.TYPE',
                valuesToKeepFromFilterColumn: ['SNG']
            ]
        ]
    }
    [...]
}
[...]

Now we can run it with the following command:

nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
   -entry cell_annotate_filter
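The effect of a filter such as NO_DOUBLETS can be pictured as keeping only the cells whose value in the filter column is in the list of values to keep. The following plain-Python sketch shows that idea on made-up cells. The function name is hypothetical and this is not the pipeline's implementation.

```python
def filter_cells(cells, filter_column, values_to_keep):
    """Keep only cells whose filter_column value is in values_to_keep."""
    keep = set(values_to_keep)
    return [cell for cell in cells if cell.get(filter_column) in keep]


# Made-up annotated cells from one sample.
cells = [
    {"barcode": "AAAC-1", "sample_id": "sample1", "DROPLET.TYPE": "SNG"},
    {"barcode": "AAAG-1", "sample_id": "sample1", "DROPLET.TYPE": "DBL"},
    {"barcode": "AAAT-1", "sample_id": "sample1", "DROPLET.TYPE": "AMB"},
]

singlets = filter_cells(cells, filter_column="DROPLET.TYPE", values_to_keep=["SNG"])
print([cell["barcode"] for cell in singlets])
```

The filter_column and values_to_keep arguments correspond to filterColumnName and valuesToKeepFromFilterColumn in the config above.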

sra

Runs the sra workflow, which downloads all (or a user-defined selection of) FASTQ files from a particular SRA project and renames them with proper, human-readable names.

First, generate the config:

nextflow config \
  ~/vib-singlecell-nf/vsn-pipelines \
    -profile sra,singularity \
    > nextflow.config

NOTES:

  • The download of SRA files is by default limited to 20 Gb. If this limit needs to be increased please set params.tools.sratoolkit.maxSize accordingly. This limit can be ‘removed’ by setting the parameter to an arbitrarily high number (e.g.: 9999999999999).
  • If you’re a VSC user, you might want to add the vsc profile.
  • The final output (FASTQ files) will be available in out/data/sra.
  • If you’re downloading 10x Genomics scATAC-seq data, make sure to set params.tools.sratoolkit.includeTechnicalReads = true and properly set params.utils.sra_normalize_fastqs.fastq_read_suffixes. In the case of downloading the scATAC-seq samples of SRP254409, fastq_read_suffixes would be set to ["R1", "R2", "I1", "I2"].

Now we can run it with the following command:

nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
    -entry sra

$ nextflow -C nextflow.config run ~/vib-singlecell-nf/vsn-pipelines -entry sra
N E X T F L O W  ~  version 21.04.3
Launching `~/vib-singlecell-nf/vsn-pipelines/main.nf` [sleepy_goldstine] - revision: ba1dedbf51
executor >  local (23)
[12/25b9d4] process > sra:DOWNLOAD_FROM_SRA:SRA_TO_METADATA (1)                                             [100%] 1 of 1 _
[e2/d5a429] process > sra:DOWNLOAD_FROM_SRA:SRATOOLKIT__DOWNLOAD_FASTQS:DOWNLOAD_FASTQS_FROM_SRA_ACC_ID (4) [ 33%] 3 of 9
[30/cba7a0] process > sra:DOWNLOAD_FROM_SRA:SRATOOLKIT__DOWNLOAD_FASTQS:FIX_AND_COMPRESS_SRA_FASTQ (3)      [100%] 3 of 3
[76/97ce6e] process > sra:DOWNLOAD_FROM_SRA:NORMALIZE_SRA_FASTQS (3)                                        [100%] 3 of 3
[8c/3125c4] process > sra:PUBLISH:SC__PUBLISH (11)                                                          [100%] 12 of 12
...