Advanced Features¶
Two-pass strategy¶
Typically, cell- and gene-level filtering is one of the first steps performed in the analysis pipelines.
This usually results in the pipeline being run in two passes.
In the first pass, the default filters are applied (which are probably not valid for new datasets), and a separate QC report is generated for each sample.
These QC reports can be inspected and the filters can be adjusted in the config file either for all samples (by editing the params.tools.scanpy.filter
settings directly, or for individual samples by using the strategy described in multi-sample parameters.
Then, the second pass restarts the pipeline with the correct filtering parameters applied (use nextflow run ... -resume
to skip already completed steps).
Other notes¶
In order to run a specific pipeline (e.g. single_sample
),
the pipeline name must be specified as a profile when running nextflow config ...
(so that the default parameters are included),
and as the entry workflow when running the pipeline with nextflow run
.
One exception to this is that the -entry
pipeline can be one that is a subset of the one present in the config file.
For example, in a pipeline with long running step that occurs after filtering (e.g. single_sample_scenic
),
it can be useful to generate the full config file (nextflow config vib-singlecell-nf/vsn-pipelines -profile single_sample_scenic
),
then run a first pass for filtering using nextflow run vib-singlecell-nf/vsn-pipelines -entry single_sample
, and a second pass using the full pipeline -entry single_sample_scenic
).
Avoid re-running SCENIC and use pre-existing results¶
Often one would like to test different batch effect correction methods with SCENIC. Naively, one would run the following commands:
nextflow config ~/vibsinglecellnf -profile tenx,bbknn,dm6,scenic,scenic_use_cistarget_motifs,singularity > bbknn.config
nextflow -C bbknn.config run vib-singlecell-nf/vsn-pipelines -entry bbknn_scenic
and,
nextflow config ~/vibsinglecellnf -profile tenx,bbknn,dm6,scenic,scenic_use_cistarget_motifs,singularity > bbknn.config
nextflow -C harmony.config run vib-singlecell-nf/vsn-pipelines -entry harmony_scenic
The annoying bit here is that we run SCENIC twice. This is what we would like to avoid since the SCENIC results will be the same. To avoid this one can run the following code for generating the harmony_scenic.config,
nextflow config ~/vibsinglecellnf -profile tenx,harmony,scenic_append_only,singularity > harmony.config
This will add a different scenic entry in the config:
params {
tools {
scenic {
container = 'vibsinglecellnf/scenic:0.11.2'
report_ipynb = '/src/scenic/bin/reports/scenic_report.ipynb'
existingScenicLoom = ''
sampleSuffixWithExtension = '' // Suffix after the sample name in the file path
scenicoutdir = "${params.global.outdir}/scenic/"
scenicScopeOutputLoom = 'SCENIC_SCope_output.loom'
}
}
}
Make sure that the following entries are correctly set before running the pipeline,
existingScenicLoom = ''
sampleSuffixWithExtension = '' // Suffix after the sample name in the file path
Finally run the pipeline,
nextflow -C harmony.config run vib-singlecell-nf/vsn-pipelines -entry harmony_scenic
Set the seed¶
Some steps in the pipelines are non-deterministic. In order to have reproducible results, a seed is set by default to:
workflow.manifest.version.replaceAll("\\.","").toInteger()
The seed is a number derived from the version of the pipeline used at the time of the analysis run.
To override the seed (integer) you have edit the nextflow.config
file with:
params {
global {
seed = [your-custom-seed]
}
}
This filter will only be applied on the final loom file of the VSN-Pipelines. All the intermediate files prior to the loom file will still contain all of them the markers.
Change log fold change (logFC) and false discovery rate (FDR) thresholds for the marker genes stored in the final SCope loom¶
By default, the logFC and FDR thresholds are set to 0 and 0.05 respectively.
If you want to change those thresholds applied on the markers genes, edit the nextflow.config
with the following entries,
params {
tools {
scope {
markers {
log_fc_threshold = 0.5
fdr_fc_threshold = 0.01
}
}
}
}
This filter will only be applied on the final loom file of the VSN-Pipelines. All the intermediate files prior to the loom file will still contain all of them the markers.
Automated selection of the optimal number of principal components¶
When generating the config using nextflow config
(see above), add the pcacv
profile.
Remarks:
- Make sure
nComps
config parameter (underdim_reduction.pca
) is not set. - If
nPcs
is not set for t-SNE or UMAP config entries, then all the PCs from the PCA will be used in the computation.
Currently, only the Scanpy related pipelines have this feature implemented.
Cell-based metadata annotation¶
There are 2 ways of using this feature: either when running an end-to-end pipeline (e.g.: single_sample
, harmony
, bbknn
, …) or on its own as a independent workflow.
The profile utils_cell_annotate
should be added along with the other profiles when generating the main config using the nextflow config
command.
For more detailed information about those parameters, please check the `cell_annotate parameter details <Parameters of cell_annotate_>`_ section below.
Please check the cell_annotate workflow.
The utils_cell_annotate
profile is adding the following part to the config:
params {
tools {
cell_annotate {
off = 'h5ad'
method = ''
cellMetaDataFilePath = ''
sampleSuffixWithExtension = ''
indexColumnName = ''
sampleColumnName = ''
annotationColumnNames = ['']
}
}
}
Two methods (params.utils.cell_annotate.method
) are available:
aio
obo
If you have a single file containing the metadata information of all your samples, use aio
method otherwise use obo
.
For both methods, here are the mandatory parameters to set:
off
should be set toh5ad
method
choose eitherobo
oraio
annotationColumnNames
is an array of columns names fromcellMetaDataFilePath
containing different annotation metadata to add.
If aio
used, the following additional parameters are required:
cellMetaDataFilePath
is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.indexColumnName
is the column name fromcellMetaDataFilePath
containing the cell IDs information. This column can have unique values; if it’s not the case, it’s important that the combination of the values from theindexColumnName
and thesampleColumnName
are unique.sampleColumnName
is the column name fromcellMetaDataFilePath
containing the sample ID/name information. Make sure that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the Input Data Formats section.
If obo
is used, the following parameters are required:
cellMetaDataFilePath
- In multi-sample mode, is a file path containing a glob pattern. The target file paths should each pointing to a .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
- In single-sample mode, is a file path pointing to a single .tsv file (with header) with at least 2 columns: a column containing all the cell IDs and an annotation column.
- Note: the file name(s) of
cellMetaDataFilePath
is/are required to contain the sample ID(s).
sampleSuffixWithExtension
is the suffix used to extract the sample ID from the file name(s) ofcellMetaDataFilePath
. The suffix should be the part after the sample name in the file path.indexColumnName
is the column name fromcellMetaDataFilePath
containing the cell IDs information. This column must have unique values.
Sample-based metadata annotation¶
The profile utils_sample_annotate
should be added when generating the main config using nextflow config
. This will add the following entry in the config:
params {
tools {
sample_annotate {
iff = '10x_cellranger_mex'
off = 'h5ad'
type = 'sample'
metadataFilePath = 'data/10x/1k_pbmc/metadata.tsv'
}
}
}
Then, the following parameters should be updated to use the module feature:
metadataFilePath
is a .tsv file (with header) with at least 2 columns where the first column need to match the sample IDs. Any other columns will be added as annotation in the final loom i.e.: all the cells related to their sample will get annotated with their given annotations.
id | chemistry | … |
---|---|---|
1k_pbmc_v2_chemistry | v2 | … |
1k_pbmc_v3_chemistry | v3 | … |
Sample-annotating the samples using this system will allow any user to query all the annotation using the SCope portal. This is especially relevant when samples needs to be compared across specific annotations (check compare tab with SCope).
Cell-based metadata filtering¶
There are 2 ways of using this feature: either when running an end-to-end pipeline (e.g.: single_sample
, harmony
, bbknn
, …) or on its own as a independent workflow.
The utils_cell_filter
profile is required when generating the config file. This profile will add the following part:
params {
tools {
cell_filter {
off = 'h5ad'
method = ''
filters = [
[
id: '',
sampleColumnName: '',
filterColumnName: '',
valuesToKeepFromFilterColumn: ['']
]
]
}
}
}
For more detailed information about the parameters to set in params.utils.cell_filter
, please check the cell_filter parameter details section below.
Please check the cell_filter workflow or cell_annotate_filter workflow to perform cell-based annotation and cell-based filtering sequentially.
Two methods (params.utils.cell_filter.method
) are available:
internal
external
If you have a single file containing the metadata information of all your samples, use external
method otherwise use internal
.
For both methods, here are the mandatory parameters to set:
off
should be set toh5ad
method
choose eitherinternal
orexternal
filters
is a List of Maps where each Map is required to have the following parameters:id
is a short identifier for the filtervaluesToKeepFromFilterColumn
is array of values from thefilterColumnName
that should be kept (other values will be filtered out).
If internal
used, the following additional parameters are required:
filters
is a List of Maps where each Map is required to have the following parameters:sampleColumnName
is the column name containing the sample ID/name information. It should exist in theobs
column attribute of the h5ad.filterColumnName
is the column name that will be used to filter out cells. It should exist in theobs
column attribute of the h5ad.
If external
used, the following additional parameters are required:
filters
is a List of Maps where each Map is required to have the following parameters:cellMetaDataFilePath
is a file path pointing to a single .tsv file (with header) with at least 3 columns: a column containing all the cell IDs, another containing the sample ID/name information, and a column to use for the filtering.indexColumnName
is the column name fromcellMetaDataFilePath
containing the cell IDs information. This column must have unique values.- optional
sampleColumnName
is the column name fromcellMetaDataFilePath
containing the sample ID/name information. Make sure that the values from this column match the samples IDs inferred from the data files. To know how those are inferred, please read the Input Data Formats section. - optional
filterColumnName
is the column name fromcellMetaDataFilePath
which be used to filter out cells.
Multi-sample parameters¶
It’s possible to define custom parameters for the different samples. It’s as easy as defining a hashmap in groovy or a dictionary-like structure in Python. You’ll just have to repeat the following structure for the parameters which you want to enable the multi-sample feature for
params {
tools {
scanpy {
container = 'vibsinglecellnf/scanpy:1.8.1'
filter {
report_ipynb = '/src/scanpy/bin/reports/sc_filter_qc_report.ipynb'
// Here we enable the multi-sample feature for the cellFilterMinNgenes parameter
cellFilterMinNGenes = [
'1k_pbmc_v2_chemistry': 600,
'1k_pbmc_v3_chemistry': 800
]
// cellFilterMaxNGenes will be set to 4000 for all the samples
cellFilterMaxNGenes = 4000
// Here we again enable the multi-sample feature for the cellFilterMaxPercentMito parameter
cellFilterMaxPercentMito = [
'1k_pbmc_v2_chemistry': 0.15,
'1k_pbmc_v3_chemistry': 0.05
]
// geneFilterMinNCells will be set to 3 for all the samples
geneFilterMinNCells = 3
iff = '10x_mtx'
off = 'h5ad'
outdir = 'out'
}
}
}
If you want to apply custom parameters for some specific samples and have a “general” parameter for the rest of the samples, you should use the ‘default’ key as follows:
params {
tools {
scanpy {
container = 'vibsinglecellnf/scanpy:1.8.1'
filter {
report_ipynb = '/src/scanpy/bin/reports/sc_filter_qc_report.ipynb'
// Here we enable the multi-sample feature for the cellFilterMinNgenes parameter
cellFilterMinNGenes = [
'1k_pbmc_v2_chemistry': 600,
'default': 800
]
[...]
}
}
}
Using this config, the parameter params.tools.scanpy.cellFilterMinNGenes
will be applied with a threshold value of 600
to 1k_pbmc_v2_chemistry
. The rest of the samples will use the value 800
to filter the cells having less than that number of genes.
This strategy can be applied to any other parameter of the config.
Parameter exploration¶
Since v0.9.0
, it is possible to explore several combinations of parameters. The latest version of the VSN-Pipelines allows to explore the following parameters:
params.tools.scanpy.clustering
method
methods = ['louvain','leiden']
resolution
resolutions = [0.4, 0.8]
In case the parameter exploration mode is used within the params.tools.scanpy.clustering
parameter, it will generated a range of different clusterings.
For non-expert, it’s often difficult to know which clustering to pick. It’s however possible to use the DIRECTS
module in order to select a default clustering. In order, to use
this automated clustering selection method, add the directs
profile when generating the main config using nextflow config
. The config will get populated with:
directs {
container = 'vibsinglecellnf/directs:0.1.0'
labels {
processExecutor = 'local'
}
select_default_clustering {
fromMinClusterSize = 5
toMinClusterSize = 100
byMinClusterSize = 5
}
}
Currently, only the Scanpy related pipelines have this feature implemented.
Regress out variables¶
By default, don’t regress any variable out. To enable this features, the scanpy_regress_out
profile should be added when generating the main config using nextflow config
. This will add the following entry in the config:
params {
tools {
scanpy {
regress_out {
variablesToRegressOut = []
off = 'h5ad'
}
}
}
}
Add any variable in variablesToRegressOut
to regress out: e.g.: ‘n_counts’, ‘percent_mito’.
Highly Variable Genes Selection¶
This step is a wrapper around the Scanpy scanpy.pp.highly_variable_genes
function and regarding the parameters used it is following the documentation available at scanpy-pp-highly-variable-genes.
By default, it will use the seurat
flavor to select variable genes and will also keep the same default values for the 4 different thresholds (as the documentation): min_mean
, max_mean
, min_disp
, max_disp
.
params {
tools {
scanpy {
feature_selection {
report_ipynb = "${params.misc.test.enabled ? '../../..' : ''}/src/scanpy/bin/reports/sc_select_variable_genes_report.ipynb"
flavor = 'seurat'
minMean = 0.0125
maxMean = 3
minDisp = 0.5
off = 'h5ad'
}
}
}
}
Other flavors are available as cell_ranger
and seurat_v3
. In order to use the seurat_v3
flavor, one parameter is required to be specified: nTopGenes
in the config file as follows:
params {
tools {
scanpy {
feature_selection {
report_ipynb = "${params.misc.test.enabled ? '../../..' : ''}/src/scanpy/bin/reports/sc_select_variable_genes_report.ipynb"
flavor = 'seurat_v3'
nTopGenes = 2000
off = 'h5ad'
}
}
}
}
Skip steps¶
By default, the pipelines are run from raw data (unfiltered data, not normalized).
If you have already performed an independent steps with another it’s possible to skip some steps from the pipelines. Currently, here are the steps that can be skipped:
- Scanpy
filtering
- Scanpy
normalization
In order to skip the Scanpy filtering step, we need to add 3 new profiles when generating the config:
min
scanpy_data_transformation
scanpy_normalization
The following command, will create a Nextflow config which the pipeline will understand and will not run the Scanpy filtering step:
nextflow config \
~/vib-singlecell-nf/vsn-pipelines \
-profile min,[data-profile],scanpy_data_transformation,scanpy_normalization,[...],singularity \
> nextflow.config
[data-profile]
: Can be one of the different possible data profiles e.g.:h5ad
[...]
: Can be other profiles likebbknn
,harmony
,pcacv
, …
Quiet mode¶
By default, VSN will output some additional messages to the terminal, such as the global seed, and the names and paths of the samples detected by the input channel.
These messages can be suppressed by using the --quiet
flag when starting the nextflow process:
nextflow -C example.config run vib-singlecell-nf/vsn-pipelines -entry single_sample --quiet