Input Data Formats¶
Depending on the type of data you run the pipeline with, one or more appropriate profiles should be set when running nextflow config.
These profiles are indicated in the sections below.
Specifying multiple samples¶
All the input data parameters are compatible with the following features:
- Glob patterns
"data/10x/1k_pbmc/1k_pbmc_*/outs/"
- Comma separated paths (paths can contain glob patterns)
"data/10x/1k_pbmc/1k_pbmc_v2_chemistry/outs/, data/10x/1k_pbmc/1k_pbmc_v3_chemistry/outs/"
- Array of paths (paths can contain glob patterns)
[
"data/10x/1k_pbmc/1k_pbmc_v2_chemistry/outs/",
"data/10x/1k_pbmc/1k_pbmc_v3_chemistry/outs/"
]
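The three accepted forms can be treated as one list of patterns. The following is a minimal Python sketch of that idea, not the pipeline's own code: the function names `split_input_spec` and `expand_input_spec` are hypothetical, and it assumes that whitespace after a comma is ignored.

```python
from glob import glob
from typing import List, Union


def split_input_spec(spec: Union[str, List[str]]) -> List[str]:
    """Split a glob string, a comma-separated string, or a list into patterns."""
    items = spec if isinstance(spec, list) else spec.split(",")
    # Assumption: surrounding whitespace is not significant; empty entries are dropped.
    return [item.strip() for item in items if item.strip()]


def expand_input_spec(spec: Union[str, List[str]]) -> List[str]:
    """Expand every pattern with glob and return sorted, de-duplicated matches."""
    matches = set()
    for pattern in split_input_spec(spec):
        matches.update(glob(pattern))
    return sorted(matches)


if __name__ == "__main__":
    spec = "data/10x/1k_pbmc/1k_pbmc_v2_chemistry/outs/, data/10x/1k_pbmc/1k_pbmc_v3_chemistry/outs/"
    print(split_input_spec(spec))
```

With this sketch, a single glob string, a comma-separated string, and an array all reduce to the same list of patterns before any file system lookup happens.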
Cell Ranger (10x Genomics)¶
Data from a standard Cell Ranger output directory can be ingested into the pipeline by using the appropriate input channel (tenx_mex or tenx_h5, depending on which file should be used).
Multiple samples can be selected by providing the path to this directory using glob patterns.
/home/data/
└── cellranger
    ├── sample_A
    │   └── outs
    │       ├── filtered_feature_bc_matrix
    │       │   ├── barcodes.tsv
    │       │   ├── genes.tsv
    │       │   └── matrix.mtx
    │       └── filtered_feature_bc_matrix.h5
    └── sample_B
        └── outs
            ├── filtered_feature_bc_matrix
            │   ├── barcodes.tsv
            │   ├── genes.tsv
            │   └── matrix.mtx
            └── filtered_feature_bc_matrix.h5
MEX¶
To use the Cell Ranger Market Exchange (MEX) files, use the following profile when generating the config file:
-profile tenx
This profile adds the following parameter (params.data.tenx.cellranger_mex) into the generated .config file:
[...]
data {
    tenx {
        cellranger_mex = "/home/data/cellranger/sample*/outs/"
    }
}
[...]
H5¶
To use the Cell Ranger h5 file as input, use the following profile:
-profile tenx_h5
This profile adds the params.data.tenx.cellranger_h5 parameter into the generated .config file:
[...]
data {
    tenx {
        cellranger_h5 = "/home/data/cellranger/sample*/outs/"
    }
}
[...]
Input file detection¶
Setting the input directory with a glob in place of the sample names will collect all the samples from the filtered_[feature|gene]_bc_matrix directories shown above.
For example, in params.data.tenx, setting:
cellranger_mex = "/home/data/cellranger/sample*/outs/"
or
cellranger_h5 = "/home/data/cellranger/sample*/outs/"
will recursively find all 10x samples in that directory.
The pipeline will use either outs/filtered_feature_bc_matrix/ or outs/raw_feature_bc_matrix/, depending on the setting of params.utils.file_converter.useFilteredMatrix (true uses filtered; false uses raw).
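The choice between the filtered and raw matrix can be pictured as a single switch. This is a minimal Python sketch under stated assumptions, not the pipeline's implementation: `select_matrix_dir` is a hypothetical helper, its `use_filtered_matrix` argument stands in for params.utils.file_converter.useFilteredMatrix, and only the filtered_feature_bc_matrix / raw_feature_bc_matrix naming is handled.

```python
from pathlib import Path


def select_matrix_dir(outs_dir: str, use_filtered_matrix: bool = True) -> Path:
    """Return the matrix directory to read from under a Cell Ranger outs/ directory."""
    # True selects the filtered matrix, False selects the raw matrix.
    name = "filtered_feature_bc_matrix" if use_filtered_matrix else "raw_feature_bc_matrix"
    return Path(outs_dir) / name


if __name__ == "__main__":
    for flag in (True, False):
        print(flag, select_matrix_dir("/home/data/cellranger/sample_A/outs/", flag))
```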
H5AD (Scanpy)¶
Use the following profile when generating the config file:
-profile h5ad
In the generated .config file, make sure the file_paths parameter is set with the paths to the .h5ad files:
[...]
data {
    h5ad {
        file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.h5ad"
        suffix = "_SUFFIX.SC__FILE_CONVERTER.h5ad"
    }
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
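To illustrate how a suffix maps a file path to a sample name, here is a minimal Python sketch. It is not the pipeline's code: `sample_name_from_path` is a hypothetical helper, and it assumes that the directory part of the path is dropped before the suffix is removed.

```python
from pathlib import PurePath


def sample_name_from_path(file_path: str, suffix: str) -> str:
    """Derive a sample name by removing the configured suffix from the file name."""
    # Assumption: only the file name (not its parent directories) forms the sample name.
    name = PurePath(file_path).name
    if not name.endswith(suffix):
        raise ValueError(f"{file_path!r} does not end with suffix {suffix!r}")
    return name[: len(name) - len(suffix)]


if __name__ == "__main__":
    path = "data/1k_pbmc_v2_chemistry_SUFFIX.SC__FILE_CONVERTER.h5ad"
    print(sample_name_from_path(path, "_SUFFIX.SC__FILE_CONVERTER.h5ad"))
```

Under these assumptions, a suffix that does not match the end of a file name is reported as an error rather than silently producing a wrong sample name.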
In case there are multiple .h5ad files that need to be processed with different suffixes, the multi-labelled strategy should be used to define the h5ad parameter:
[...]
data {
    h5ad {
        GROUP1 {
            file_paths = "[path-to-group1-files]/*.SUFFIX1.h5ad"
            suffix = ".SUFFIX1.h5ad"
            group = ["technology", "10x"]
        }
        GROUP2 {
            file_paths = "[path-to-group2-files]/*.SUFFIX2.h5ad"
            suffix = ".SUFFIX2.h5ad"
            group = ["technology", "smart-seq2"]
        }
    }
}
[...]
Notes:
- GROUP1, GROUP2 are just example names here. They can be replaced by any value as long as they are alphanumeric (underscores are allowed).
- All the defined suffixes should be unique.
- file_paths and suffix do allow lists of paths/globs in the multi-labelled strategy.
- group [optional] should be an array of 2 elements, where the first element defines the group name and the second the group value. This will add a cell-based annotation for each group of files.
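The constraints in these notes can be checked before launching a run. The following is a minimal Python sketch, assuming the h5ad block has been loaded into a plain dictionary; `validate_h5ad_groups` is a hypothetical helper and not part of the pipeline.

```python
import re
from typing import Dict, List

# Labels may contain letters, digits, and underscores only.
LABEL_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def validate_h5ad_groups(groups: Dict[str, dict]) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    seen_suffixes = {}  # suffix -> label that first used it
    for label, entry in groups.items():
        if not LABEL_PATTERN.match(label):
            problems.append(f"{label}: label must be alphanumeric (underscores allowed)")
        suffixes = entry.get("suffix", [])
        if isinstance(suffixes, str):
            suffixes = [suffixes]  # suffix may be a single string or a list
        for suffix in suffixes:
            if suffix in seen_suffixes:
                problems.append(f"{label}: suffix {suffix!r} already used by {seen_suffixes[suffix]}")
            else:
                seen_suffixes[suffix] = label
        group = entry.get("group")  # optional [name, value] pair
        if group is not None and (not isinstance(group, (list, tuple)) or len(group) != 2):
            problems.append(f"{label}: group must be a [name, value] pair")
    return problems


if __name__ == "__main__":
    config = {
        "GROUP1": {"suffix": ".SUFFIX1.h5ad", "group": ["technology", "10x"]},
        "GROUP2": {"suffix": ".SUFFIX2.h5ad", "group": ["technology", "smart-seq2"]},
    }
    print(validate_h5ad_groups(config))
```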
Loom¶
Use the following profile when generating the config file:
-profile loom
In the generated .config file, make sure the file_paths parameter is set with the paths to the .loom files:
[...]
data {
    loom {
        file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.loom"
        suffix = "_SUFFIX.SC__FILE_CONVERTER.loom"
    }
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
Seurat Rds¶
Use the following profile when generating the config file:
-profile seurat_rds
In the generated .config file, make sure the file_paths parameter is set with the paths to the .Rds files:
[...]
data {
    seurat_rds {
        file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.Rds"
        suffix = "_SUFFIX.SC__FILE_CONVERTER.Rds"
    }
}
}
[...]
- The pipelines expect a Seurat v3 object in the .Rds file (Seurat v2 objects are currently not supported).
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
TSV¶
Use the following profile when generating the config file:
-profile tsv
In the generated .config file, make sure the file_paths parameter is set with the paths to the .tsv files:
[...]
data {
    tsv {
        file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.tsv"
        suffix = "_SUFFIX.SC__FILE_CONVERTER.tsv"
    }
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).
CSV¶
Use the following profile when generating the config file:
-profile csv
In the generated .config file, make sure the file_paths parameter is set with the paths to the .csv files:
[...]
data {
    csv {
        file_paths = "data/1k_pbmc_v*_chemistry_SUFFIX.SC__FILE_CONVERTER.csv"
        suffix = "_SUFFIX.SC__FILE_CONVERTER.csv"
    }
}
}
[...]
- The suffix parameter is used to infer the sample name from the file paths (it is removed from the input file path to derive a sample name).