Advanced Features
Two-pass strategy
Typically, cell- and gene-level filtering is one of the first steps performed in the analysis pipelines.
This usually results in the pipeline being run in two passes.
In the first pass, the default filters are applied (which are probably not valid for new datasets), and a separate QC report is generated for each sample.
These QC reports can be inspected and the filters adjusted in the config file, either for all samples (by editing the params.tools.scanpy.filter settings directly) or for individual samples (by using the strategy described in multi-sample parameters).
Then, the second pass restarts the pipeline with the correct filtering parameters applied (use nextflow run ... -resume to skip already completed steps).
Other notes
In order to run a specific pipeline (e.g. single_sample),
the pipeline name must be specified as a profile when running nextflow config ... (so that the default parameters are included),
and as the entry workflow when running the pipeline with nextflow run.
One exception to this is that the -entry pipeline can be one that is a subset of the one present in the config file.
For example, in a pipeline with a long-running step that occurs after filtering (e.g. single_sample_scenic),
it can be useful to generate the full config file (nextflow config vib-singlecell-nf/vsn-pipelines -profile single_sample_scenic),
then run a first pass for filtering using nextflow run vib-singlecell-nf/vsn-pipelines -entry single_sample, and a second pass using the full pipeline (-entry single_sample_scenic).
Avoid re-running SCENIC and use pre-existing results
Often one would like to test different batch effect correction methods with SCENIC. Naively, one would run the following commands:
nextflow config ~/vibsinglecellnf -profile tenx,bbknn,dm6,scenic,scenic_use_cistarget_motifs,singularity > bbknn.config
nextflow -C bbknn.config run vib-singlecell-nf/vsn-pipelines -entry bbknn_scenic
and,
nextflow config ~/vibsinglecellnf -profile tenx,harmony,dm6,scenic,scenic_use_cistarget_motifs,singularity > harmony.config
nextflow -C harmony.config run vib-singlecell-nf/vsn-pipelines -entry harmony_scenic
The drawback is that SCENIC runs twice, even though its results will be the same in both runs. To avoid this, generate harmony.config with the scenic_append_only profile instead:
nextflow config ~/vibsinglecellnf -profile tenx,harmony,scenic_append_only,singularity > harmony.config
This will add a different scenic entry in the config:
params {
    tools {
        scenic {
            container = 'vibsinglecellnf/scenic:0.11.2'
            report_ipynb = '/src/scenic/bin/reports/scenic_report.ipynb'
            existingScenicLoom = ''
            sampleSuffixWithExtension = '' // Suffix after the sample name in the file path
            scenicoutdir = "${params.global.outdir}/scenic/"
            scenicScopeOutputLoom = 'SCENIC_SCope_output.loom'
        }
    }
}
Make sure that the following entries are correctly set before running the pipeline,
existingScenicLoom = ''
sampleSuffixWithExtension = '' // Suffix after the sample name in the file path
Finally run the pipeline,
nextflow -C harmony.config run vib-singlecell-nf/vsn-pipelines -entry harmony_scenic
Set the seed
Some steps in the pipelines are non-deterministic. In order to have reproducible results, a seed is set by default to:
workflow.manifest.version.replaceAll("\\.","").toInteger()
The seed is a number derived from the version of the pipeline used at the time of the analysis run.
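As a rough illustration of the default above, the same derivation can be written in Python. This is only a sketch of the arithmetic, not the pipeline's own code: the function name seed_from_version and the version string 0.27.0 are made up for the example.

```python
def seed_from_version(version: str) -> int:
    """Drop the dots from a version string and parse the remaining digits as an integer."""
    return int(version.replace(".", ""))


# "0.27.0" is a hypothetical version string used only for illustration.
print(seed_from_version("0.27.0"))  # 270
```

In this sketch, a version string containing anything other than digits and dots raises a ValueError.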
To override the seed (integer), you have to edit the nextflow.config file with:
params {
    global {
        seed = [your-custom-seed]
    }
}
Change log fold change (logFC) and false discovery rate (FDR) thresholds for the marker genes stored in the final SCope loom
By default, the logFC and FDR thresholds are set to 0 and 0.05 respectively.
If you want to change the thresholds applied to the marker genes, edit the nextflow.config with the following entries:
params {
    tools {
        scope {
            markers {
                log_fc_threshold = 0.5
                fdr_fc_threshold = 0.01
            }
        }
    }
}
This filter will only be applied on the final loom file of the VSN-Pipelines. All the intermediate files prior to the loom file will still contain all the markers.
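The effect of the two thresholds can be sketched in Python. The comparison directions are an assumption (a marker is kept when its logFC is at least the logFC threshold and its FDR is at most the FDR threshold); the function filter_markers, the field names and the example genes are all hypothetical and not taken from the pipeline.

```python
from typing import Dict, List


def filter_markers(
    markers: List[Dict],
    log_fc_threshold: float = 0.0,
    fdr_threshold: float = 0.05,
) -> List[Dict]:
    """Keep the markers that pass both thresholds (assumed comparison directions)."""
    return [
        m
        for m in markers
        if m["logFC"] >= log_fc_threshold and m["FDR"] <= fdr_threshold
    ]


# Made-up marker table, for illustration only.
markers = [
    {"gene": "geneA", "logFC": 1.2, "FDR": 0.001},
    {"gene": "geneB", "logFC": 0.3, "FDR": 0.02},
    {"gene": "geneC", "logFC": 0.9, "FDR": 0.20},
]

default_kept = [m["gene"] for m in filter_markers(markers)]
strict_kept = [m["gene"] for m in filter_markers(markers, 0.5, 0.01)]
print(default_kept)  # ['geneA', 'geneB']
print(strict_kept)   # ['geneA']
```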
Automated selection of the optimal number of principal components
When generating the config using nextflow config (see above), add the pcacv profile.
Remarks:
- Make sure the nComps config parameter (under dim_reduction.pca) is not set.
- If nPcs is not set for t-SNE or UMAP config entries, then all the PCs from the PCA will be used in the computation.
Currently, only the Scanpy related pipelines have this feature implemented.
Cell-based metadata annotation
There are 2 ways of using this feature: either when running an end-to-end pipeline (e.g. single_sample, harmony, bbknn, …) or on its own as an independent workflow.
The profile utils_cell_annotate should be added along with the other profiles when generating the main config using the nextflow config command.
For more detailed information about those parameters, please check the cell_annotate parameter details section below.
Please check the cell_annotate workflow.
The utils_cell_annotate profile adds the following part to the config:
params {
    tools {
        cell_annotate {
            off = 'h5ad'
            method = ''
            cellMetaDataFilePath = ''
            sampleSuffixWithExtension = ''
            indexColumnName = ''
            sampleColumnName = ''
            annotationColumnNames = ['']
        }
    }
}
Two methods (params.tools.cell_annotate.method) are available:
- aio
- obo
If you have a single file containing the metadata information of all your samples, use the aio method; otherwise, use obo.
For both methods, here are the mandatory parameters to set:
- off should be set to h5ad
- method: choose either obo or aio
- annotationColumnNames is an array of column names from cellMetaDataFilePath containing the different annotation metadata to add.
If aio is used, the following additional parameters are required:
- cellMetaDataFilePath is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
- indexColumnName is the column name from cellMetaDataFilePath containing the cell IDs information. The values in this column do not have to be unique on their own, but if they are not, the combination of the values from indexColumnName and sampleColumnName must be unique.
- sampleColumnName is the column name from cellMetaDataFilePath containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the Input Data Formats section.
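The uniqueness requirement for the aio method can be checked before running the pipeline. Below is a minimal Python sketch, assuming a tab-separated file with a header; the function duplicated_cell_keys and the column names barcode, sample_id and cell_type are hypothetical and only serve the example.

```python
import csv
import io
from collections import Counter


def duplicated_cell_keys(tsv_text, index_column, sample_column):
    """Return the (sample, cell ID) pairs that occur more than once in the TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter((row[sample_column], row[index_column]) for row in reader)
    return sorted(key for key, n in counts.items() if n > 1)


# The same barcode appears in two samples, but each (sample, barcode) pair is unique.
example = (
    "barcode\tsample_id\tcell_type\n"
    "AAAC-1\t1k_pbmc_v2_chemistry\tT cell\n"
    "AAAC-1\t1k_pbmc_v3_chemistry\tB cell\n"
    "AAAG-1\t1k_pbmc_v3_chemistry\tMonocyte\n"
)
print(duplicated_cell_keys(example, "barcode", "sample_id"))  # []
```

An empty list means every combination of sample and cell ID is unique, which is the condition described above.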
If obo is used, the following parameters are required:
- cellMetaDataFilePath
  - In multi-sample mode, is a file path containing a glob pattern. The target file paths should each point to a .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
  - In single-sample mode, is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
  - Note: the file name(s) of cellMetaDataFilePath is/are required to contain the sample ID(s).
- sampleSuffixWithExtension is the suffix used to extract the sample ID from the file name(s) of cellMetaDataFilePath. The suffix should be the part after the sample name in the file path.
- indexColumnName is the column name from cellMetaDataFilePath containing the cell IDs information. This column must have unique values.
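To make the role of sampleSuffixWithExtension concrete, here is a small Python sketch of how a sample ID could be recovered from a file name by stripping that suffix. It illustrates the idea only and is not the pipeline's implementation; the function sample_id_from_path, the directory layout and the .cells.tsv suffix are assumptions.

```python
from pathlib import Path


def sample_id_from_path(file_path: str, sample_suffix_with_extension: str) -> str:
    """Strip the suffix (including the extension) from the file name to get the sample ID."""
    file_name = Path(file_path).name
    if not file_name.endswith(sample_suffix_with_extension):
        raise ValueError(
            f"{file_name!r} does not end with {sample_suffix_with_extension!r}"
        )
    return file_name[: len(file_name) - len(sample_suffix_with_extension)]


# Hypothetical metadata file for one sample.
print(sample_id_from_path("data/metadata/1k_pbmc_v2_chemistry.cells.tsv", ".cells.tsv"))
# 1k_pbmc_v2_chemistry
```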
Sample-based metadata annotation
The profile utils_sample_annotate should be added when generating the main config using nextflow config. This will add the following entry in the config:
params {
    tools {
        sample_annotate {
            iff = '10x_cellranger_mex'
            off = 'h5ad'
            type = 'sample'
            metadataFilePath = 'data/10x/1k_pbmc/metadata.tsv'
        }
    }
}
Then, the following parameters should be updated to use the module feature:
- metadataFilePath is a .tsv file (with header) with at least 2 columns where the first column needs to match the sample IDs. Any other columns will be added as annotations in the final loom, i.e. all the cells of a given sample will be annotated with that sample's annotations.
| id | chemistry | … |
|---|---|---|
| 1k_pbmc_v2_chemistry | v2 | … |
| 1k_pbmc_v3_chemistry | v3 | … |
Annotating the samples using this system will allow any user to query all the annotations using the SCope portal. This is especially relevant when samples need to be compared across specific annotations (check the compare tab in SCope).
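How sample-level annotations end up on individual cells can be sketched in Python. This is an illustration of the idea, not the pipeline's code: the function annotate_cells and the cell barcodes are hypothetical, and the metadata mirrors the example table above.

```python
import csv
import io


def annotate_cells(cell_to_sample, metadata_tsv):
    """Give every cell the annotations of the sample it belongs to."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    id_column = reader.fieldnames[0]  # the first column holds the sample IDs
    per_sample = {
        row[id_column]: {k: v for k, v in row.items() if k != id_column}
        for row in reader
    }
    return {cell: per_sample[sample] for cell, sample in cell_to_sample.items()}


metadata = (
    "id\tchemistry\n"
    "1k_pbmc_v2_chemistry\tv2\n"
    "1k_pbmc_v3_chemistry\tv3\n"
)
# Made-up cell barcodes mapped to their sample.
cells = {"AAAC-1": "1k_pbmc_v2_chemistry", "AAAG-1": "1k_pbmc_v3_chemistry"}
annotated = annotate_cells(cells, metadata)
print(annotated["AAAC-1"])  # {'chemistry': 'v2'}
```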
Cell-based metadata filtering
There are 2 ways of using this feature: either when running an end-to-end pipeline (e.g. single_sample, harmony, bbknn, …) or on its own as an independent workflow.
The utils_cell_filter profile is required when generating the config file. This profile will add the following part:
params {
    tools {
        cell_filter {
            off = 'h5ad'
            method = ''
            filters = [
                [
                    id: '',
                    sampleColumnName: '',
                    filterColumnName: '',
                    valuesToKeepFromFilterColumn: ['']
                ]
            ]
        }
    }
}
For more detailed information about the parameters to set in params.tools.cell_filter, please check the cell_filter parameter details section below.
Please check the cell_filter workflow, or the cell_annotate_filter workflow to perform cell-based annotation and cell-based filtering sequentially.
Two methods (params.tools.cell_filter.method) are available:
- internal
- external
If you have a single file containing the metadata information of all your samples, use the external method; otherwise, use internal.
For both methods, here are the mandatory parameters to set:
- off should be set to h5ad
- method: choose either internal or external
- filters is a List of Maps where each Map is required to have the following parameters:
  - id is a short identifier for the filter
  - valuesToKeepFromFilterColumn is an array of values from the filterColumnName that should be kept (other values will be filtered out).
If internal is used, the following additional parameters are required:
- filters is a List of Maps where each Map is required to have the following parameters:
  - sampleColumnName is the column name containing the sample ID/name information. It should exist in the obs column attribute of the h5ad.
  - filterColumnName is the column name that will be used to filter out cells. It should exist in the obs column attribute of the h5ad.
If external is used, the following additional parameters are required:
- filters is a List of Maps where each Map is required to have the following parameters:
  - cellMetaDataFilePath is a file path pointing to a single .tsv file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering.
  - indexColumnName is the column name from cellMetaDataFilePath containing the cell IDs information. This column must have unique values.
  - (optional) sampleColumnName is the column name from cellMetaDataFilePath containing the sample ID/name information. Make sure that the values from this column match the sample IDs inferred from the data files. To know how those are inferred, please read the Input Data Formats section.
  - (optional) filterColumnName is the column name from cellMetaDataFilePath which will be used to filter out cells.
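The meaning of a single entry in filters can be sketched in Python: cells are kept only when their value in filterColumnName is listed in valuesToKeepFromFilterColumn. This is a simplified illustration, not the pipeline's code; the function apply_filter, the filter id and the cell records are hypothetical.

```python
def apply_filter(cells, filter_spec):
    """Keep the cells whose value in the filter column is one of the values to keep."""
    keep = set(filter_spec["valuesToKeepFromFilterColumn"])
    column = filter_spec["filterColumnName"]
    return [cell for cell in cells if cell[column] in keep]


# Made-up cell metadata, one dictionary per cell.
cells = [
    {"cell_id": "AAAC-1", "sample_id": "1k_pbmc_v2_chemistry", "cell_type": "T cell"},
    {"cell_id": "AAAG-1", "sample_id": "1k_pbmc_v2_chemistry", "cell_type": "B cell"},
    {"cell_id": "AAAT-1", "sample_id": "1k_pbmc_v3_chemistry", "cell_type": "Monocyte"},
]
filter_spec = {
    "id": "keep_lymphocytes",  # hypothetical filter identifier
    "sampleColumnName": "sample_id",
    "filterColumnName": "cell_type",
    "valuesToKeepFromFilterColumn": ["T cell", "B cell"],
}
kept_ids = [cell["cell_id"] for cell in apply_filter(cells, filter_spec)]
print(kept_ids)  # ['AAAC-1', 'AAAG-1']
```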
Multi-sample parameters
It’s possible to define custom parameters for the different samples. It’s as easy as defining a hashmap in Groovy or a dictionary-like structure in Python. You’ll just have to repeat the following structure for each parameter for which you want to enable the multi-sample feature:
params {
    tools {
        scanpy {
            container = 'vibsinglecellnf/scanpy:1.8.1'
            filter {
                report_ipynb = '/src/scanpy/bin/reports/sc_filter_qc_report.ipynb'
                // Here we enable the multi-sample feature for the cellFilterMinNGenes parameter
                cellFilterMinNGenes = [
                    '1k_pbmc_v2_chemistry': 600,
                    '1k_pbmc_v3_chemistry': 800
                ]
                // cellFilterMaxNGenes will be set to 4000 for all the samples
                cellFilterMaxNGenes = 4000
                // Here we again enable the multi-sample feature for the cellFilterMaxPercentMito parameter
                cellFilterMaxPercentMito = [
                    '1k_pbmc_v2_chemistry': 0.15,
                    '1k_pbmc_v3_chemistry': 0.05
                ]
                // geneFilterMinNCells will be set to 3 for all the samples
                geneFilterMinNCells = 3
                iff = '10x_mtx'
                off = 'h5ad'
                outdir = 'out'
            }
        }
    }
}
If you want to apply custom parameters for some specific samples and have a “general” parameter for the rest of the samples, you should use the ‘default’ key as follows:
params {
    tools {
        scanpy {
            container = 'vibsinglecellnf/scanpy:1.8.1'
            filter {
                report_ipynb = '/src/scanpy/bin/reports/sc_filter_qc_report.ipynb'
                // Here we enable the multi-sample feature for the cellFilterMinNGenes parameter
                cellFilterMinNGenes = [
                    '1k_pbmc_v2_chemistry': 600,
                    'default': 800
                ]
                [...]
            }
        }
    }
}
Using this config, the parameter params.tools.scanpy.filter.cellFilterMinNGenes will be applied with a threshold value of 600 to 1k_pbmc_v2_chemistry. The rest of the samples will use the value 800 to filter out the cells having fewer than that number of genes.
This strategy can be applied to any other parameter of the config.
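The lookup described above can be sketched in Python, where a Groovy map becomes a dictionary. This shows the resolution logic as described in this section, not the pipeline's actual code; the function resolve_parameter is hypothetical, and raising an error when neither the sample nor a default key is present is a choice made for this sketch.

```python
from typing import Any


def resolve_parameter(value: Any, sample_id: str) -> Any:
    """Return the value of a config parameter for one sample."""
    if not isinstance(value, dict):
        # A plain value applies to every sample.
        return value
    if sample_id in value:
        return value[sample_id]
    if "default" in value:
        return value["default"]
    raise KeyError(f"no value for sample {sample_id!r} and no 'default' key")


cell_filter_min_n_genes = {"1k_pbmc_v2_chemistry": 600, "default": 800}
print(resolve_parameter(cell_filter_min_n_genes, "1k_pbmc_v2_chemistry"))  # 600
print(resolve_parameter(cell_filter_min_n_genes, "1k_pbmc_v3_chemistry"))  # 800
print(resolve_parameter(4000, "1k_pbmc_v3_chemistry"))                     # 4000
```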
Parameter exploration
Since v0.9.0, it is possible to explore several combinations of parameters. The latest version of the VSN-Pipelines allows exploring the following parameters:
- params.tools.scanpy.clustering
  - method: methods = ['louvain','leiden']
  - resolution: resolutions = [0.4, 0.8]
If the parameter exploration mode is used within the params.tools.scanpy.clustering parameter, it will generate a range of different clusterings.
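To get a feel for how many clusterings such a configuration produces, the combinations can be enumerated in Python. The sketch assumes that every method is paired with every resolution, which this section does not state explicitly.

```python
from itertools import product

methods = ["louvain", "leiden"]
resolutions = [0.4, 0.8]

# Assumption: every method is combined with every resolution.
clusterings = list(product(methods, resolutions))
for method, resolution in clusterings:
    print(f"{method} at resolution {resolution}")
print(len(clusterings))  # 4
```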
For non-experts, it’s often difficult to know which clustering to pick. It’s however possible to use the DIRECTS module to select a default clustering. To use
this automated clustering selection method, add the directs profile when generating the main config using nextflow config. The config will get populated with:
directs {
    container = 'vibsinglecellnf/directs:0.1.0'
    labels {
        processExecutor = 'local'
    }
    select_default_clustering {
        fromMinClusterSize = 5
        toMinClusterSize = 100
        byMinClusterSize = 5
    }
}
Currently, only the Scanpy related pipelines have this feature implemented.
Regress out variables
By default, no variable is regressed out. To enable this feature, the scanpy_regress_out profile should be added when generating the main config using nextflow config. This will add the following entry in the config:
params {
    tools {
        scanpy {
            regress_out {
                variablesToRegressOut = []
                off = 'h5ad'
            }
        }
    }
}
Add any variable to regress out to variablesToRegressOut, e.g. 'n_counts', 'percent_mito'.
Highly Variable Genes Selection
This step is a wrapper around the Scanpy scanpy.pp.highly_variable_genes function, and its parameters follow the documentation available at scanpy-pp-highly-variable-genes.
By default, it will use the seurat flavor to select variable genes and will keep the same default values as the documentation for the 4 different thresholds: min_mean, max_mean, min_disp, max_disp.
params {
    tools {
        scanpy {
            feature_selection {
                report_ipynb = "${params.misc.test.enabled ? '../../..' : ''}/src/scanpy/bin/reports/sc_select_variable_genes_report.ipynb"
                flavor = 'seurat'
                minMean = 0.0125
                maxMean = 3
                minDisp = 0.5
                off = 'h5ad'
            }
        }
    }
}
Two other flavors are available: cell_ranger and seurat_v3. In order to use the seurat_v3 flavor, one additional parameter must be specified in the config file, nTopGenes, as follows:
params {
    tools {
        scanpy {
            feature_selection {
                report_ipynb = "${params.misc.test.enabled ? '../../..' : ''}/src/scanpy/bin/reports/sc_select_variable_genes_report.ipynb"
                flavor = 'seurat_v3'
                nTopGenes = 2000
                off = 'h5ad'
            }
        }
    }
}
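Since this step wraps scanpy.pp.highly_variable_genes, the config entries presumably correspond to keyword arguments of that function. The Python sketch below spells out one plausible correspondence; the mapping table and the function to_scanpy_kwargs are assumptions made for illustration, not code from the pipeline. Only the keyword names on the right-hand side (flavor, min_mean, max_mean, min_disp, max_disp, n_top_genes) are actual arguments of the Scanpy function.

```python
# Assumed correspondence between config entries and Scanpy keyword arguments.
CONFIG_TO_SCANPY = {
    "flavor": "flavor",
    "minMean": "min_mean",
    "maxMean": "max_mean",
    "minDisp": "min_disp",
    "maxDisp": "max_disp",
    "nTopGenes": "n_top_genes",
}


def to_scanpy_kwargs(config: dict) -> dict:
    """Translate config entries into keyword arguments, ignoring unrelated keys such as 'off'."""
    kwargs = {CONFIG_TO_SCANPY[k]: v for k, v in config.items() if k in CONFIG_TO_SCANPY}
    if kwargs.get("flavor") == "seurat_v3" and "n_top_genes" not in kwargs:
        raise ValueError("the seurat_v3 flavor requires nTopGenes")
    return kwargs


kwargs = to_scanpy_kwargs({"flavor": "seurat_v3", "nTopGenes": 2000, "off": "h5ad"})
print(kwargs)  # {'flavor': 'seurat_v3', 'n_top_genes': 2000}
# With Scanpy installed, these could then be passed on as:
#   scanpy.pp.highly_variable_genes(adata, **kwargs)
```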
Skip steps
By default, the pipelines are run from raw data (unfiltered data, not normalized).
If you have already performed some of these steps independently with another tool, it’s possible to skip them in the pipelines. Currently, here are the steps that can be skipped:
- Scanpy filtering
- Scanpy normalization
In order to skip the Scanpy filtering step, we need to add 3 new profiles when generating the config:
- min
- scanpy_data_transformation
- scanpy_normalization
The following command will create a Nextflow config that the pipeline will understand, and the Scanpy filtering step will not be run:
nextflow config \
    ~/vib-singlecell-nf/vsn-pipelines \
    -profile min,[data-profile],scanpy_data_transformation,scanpy_normalization,[...],singularity \
    > nextflow.config
- [data-profile]: Can be one of the different possible data profiles, e.g. h5ad
- [...]: Can be other profiles like bbknn, harmony, pcacv, …
Quiet mode
By default, VSN will output some additional messages to the terminal, such as the global seed, and the names and paths of the samples detected by the input channel.
These messages can be suppressed by using the --quiet flag when starting the nextflow process:
nextflow -C example.config run vib-singlecell-nf/vsn-pipelines -entry single_sample --quiet