scATAC-seq Preprocessing¶
This pipeline takes fastq files from paired end single cell ATAC-seq, and applies preprocessing steps to align the reads to a reference genome, and produce a bam file and scATAC-seq fragments file.
This workflow is currently available in the develop_atac
branch (use nextflow pull vib-singlecell-nf/vsn-pipelines -r develop_atac
to sync this branch).
Pipeline Steps¶
The full steps are:
- Barcode correction:
- For ‘standard’ and ‘multiome’ samples (e.g. 10x Genomics or similar) correction is performed against a whitelist by this method from aertslab/single_cell_toolkit.
- For ‘biorad’ samples, barcode correction is performed by this script in our aertslab/single_cell_toolkit (previously, this was done with BAP).
- Fastq barcoding: Add the barcode sequence to the comment field of the fastq sequence identifier. Uses methods from aertslab/single_cell_toolkit.
- Read/adapter trimming (Trim_Galore or fastp).
- Mapping to a reference genome:
- Mark PCR and optical duplicates (MarkDuplicates (Picard) or MarkDuplicatesSpark (GATK)).
- Estimate library complexity with EstimateLibraryComplexity (Picard).
- A fragments file is created using Sinto.
Pipeline Details¶
Input Metadata¶
The input to this pipeline is a (tab-delimited) metadata table with the sample ID, sequencing technology, and locations of the fastq files. Note that the fastq file fields must be full paths; this is not shown here for clarity:
sample_name | technology | fastq_PE1_path | fastq_barcode_path | fastq_PE2_path |
---|---|---|---|---|
sample_1 | standard | sample_1_R1.fastq.gz | sample_1_R2.fastq.gz | sample_1_R3.fastq.gz |
sample_2 | multiome | sample_2_R1.fastq.gz | sample_2_R2.fastq.gz | sample_2_R3.fastq.gz |
sample_3 | biorad | sample_3_R1.fastq.gz | sample_3_R3.fastq.gz | |
sample_4 | revcomp_wl | sample_4_R1.fastq.gz | sample_2_R2.fastq.gz | sample_4_R3.fastq.gz |
sample_5 | hydrop_3x96 | sample_5_R1.fastq.gz | sample_5_R2.fastq.gz | sample_5_R3.fastq.gz |
sample_6 | hydrop_2x384 | sample_5_R1.fastq.gz | sample_5_R2.fastq.gz | sample_5_R3.fastq.gz |
The columns represent:
sample_name
Sample name for labeling the sample in the pipeline and output files. This can be any arbitrary string.technology
: This controls the barcode correction and processing methods to use for the fastq files. Currently only thebiorad
option involves different processing steps. Otherwise, the value in this field (e.g.standard
,multiome
) controls which barcode whitelist is used for correction. See below for additional details.fastq_PE1_path
: The full path to the fastq file for the first read in a pair.fastq_barcode_path
: The full path to the fastq file containing the barcodes. This column can be blank/empty depending on the technology setting (e.g.biorad
).fastq_PE2_path
: The full path to the fastq file for the second read in a pair.
Fastq input¶
Fastq input for each sample can be given as a single set of files (R1, R2, R3), or it can be multiple files, in the case of samples which have been split across multiple sequencing lanes. A combination of these cases can be processed together in one metadata file.
Within the pipeline, all of the reads from each set of files will be considered and labeled as one read group. Read group names are taken from the first read in the fastq:
@A01044:19:HLYKFDRXX:1:2101:4291:1000 1:N:0:ACTCAGAC
will produce a RG in the bam:
@RG ID:A01044:19:HLYKFDRXX:1 SM:sample_1 LB:A01044:19:HLYKFDRXX:1__sample_1 PL:ILLUMINA
Single fastq input¶
In this situation, there is only one set of reads per sample, the metadata file will look very similar to one of the rows from above. There will be one read group in the final bam file.
Split fastq input¶
In this case, multiple fastq files (in rows) for each sample can be given.
In this example, there are two sets of fastqs for sample_1
that were run on two separate lanes.
Note that the sample ID is the same for both rows:
sample_name | technology | fastq_PE1_path | fastq_barcode_path | fastq_PE2_path |
---|---|---|---|---|
sample_1 | standard | sample_1_S1_L001_R1_001.fastq.gz | sample_1_S1_L001_R2_001.fastq.gz | sample_1_S1_L001_R3_001.fastq.gz |
sample_1 | standard | sample_1_S1_L002_R1_001.fastq.gz | sample_1_S1_L002_R2_001.fastq.gz | sample_1_S1_L002_R3_001.fastq.gz |
In this situation, each set of fastqs will be processed separately for the barcode correction, barcode addition to the fastq comment field, adaptor trimming, and mapping steps. Following mapping, each mapped bam is merged and duplicates are marked using the full data. Downstream steps are done with the merged data.
Generating the metadata file¶
Note that there is an easy way to create the metadata from the file paths for each sample by using the following bash command (expand to view). Special thanks here to Gert Hulselmans for expanding the capabilities of this function.
metadata generator
create_atac_metadata () {
local sample="${1}";
local technology="${2}";
local fastq_prefix="${3}";
local read_labels="${4}";
if [ "${sample}" == "header" ]; then
printf 'sample_name\ttechnology\tfastq_PE1_path\tfastq_barcode_path\tfastq_PE2_path\n';
return 0;
fi
if [ ${#@} -ne 4 ] ; then
printf 'Usage: create_atac_metadata sample technology fastq_prefix read_labels\n\n';
printf 'Arguments:\n';
printf ' sample: sample name\n';
printf ' technology: "standard", "hydrop_3x96", "hydrop_2x384", or "biorad"\n';
printf ' fastq_prefix: path prefix to FASTQ files.\n';
printf ' read_labels: comma separated read labels for R1, R2 and R3 that select: R1,R2,R3.\n';
return 1;
fi
read_labels_arr=(${read_labels//,/ });
# Get R1, R2 and R3 FASTQ filenames for
R1=(${fastq_prefix}*${read_labels_arr[0]}*.{fastq,fq,fastq.gz,fq.gz})
R2=(${fastq_prefix}*${read_labels_arr[1]}*.{fastq,fq,fastq.gz,fq.gz})
R3=(${fastq_prefix}*${read_labels_arr[2]}*.{fastq,fq,fastq.gz,fq.gz})
for i in "${!R1[@]}" ; do
# Check if R1 FASTQ file exist (and is not just a glob like "${sample}*R1*.fq").
if [ -e "${R1[i]}" ] ; then
printf "${sample}\t${technology}\t${R1[i]}\t${R2[i]}\t${R3[i]}\n";
fi
done
}
To run use the options:
- Sample ID (if this parameter is “header”, it will print the metadata header and stop)
- Technology (e.g. “standard”)
- The “file prefix” full path to your fastq files, matching the common portions of the file names (without any glob
*
expansions) - The “read labels” to indicate how the files are named and match the remainder of the file names (e.g. “R1,R2,R3”, “R1,UMI,R2”, etc.)
create_atac_metadata header > auto_metadata.tsv
create_atac_metadata sample_1 standard /path/to/sample_1_subset_S R1,R2,R3 >> auto_metadata.tsv
create_atac_metadata sample_2 standard /path/to/sample_2_subset_S R1,R2,R3 >> auto_metadata.tsv
create_atac_metadata sample_5 hydrop_3x96 /path/to/sample_5_ R1,R2,R3 >> auto_metadata.tsv
create_atac_metadata sample_6 hydrop_2x384 /path/to/sample_6_ R1,R2,R3 >> auto_metadata.tsv
Technology types¶
The “technology” field in the metadata table controls two things:
- How technology-specific pipeline steps are applied.
Currently there are three specific settings (
biorad
,hydrop_3x96
, andhydrop_2x384
) that use alternate pipelines processes (to extract and correct the barcode sequence from the input fastqs). Using any other keyword is allowed, and samples will be run with the standard pipeline steps (barcode correction against a whitelist). - Which whitelist is used for barcode correction.
The “technology” field must match a key in the
params.tools.singlecelltoolkit.barcode_correction.whitelist
parameter list in the config file for that sample to be associated with a particular barcode whitelist. The “technology” field and whitelist key name can be set to any arbitrary string (e.g.standard
), with the exception of the technology-specific keywords above.
The main modes are:
standard
¶
The standard
setting is the main pipeline mode.
It assumes a typical 10x Genomics style format with two read pair fastqs and a barcode fastq (note that in the example here, the barcode correction has already been performed, writing the CB
tag into the comment of the barcode fastq):
$ zcat sample_1_R1.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 1:N:0:ACTCAGAC
NTTGTCTCAGCACCCCCCGACATGGATTCAGGCTGTCTCTTATACACATC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
$ zcat sample_1_R2.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 2:N:0:ACTCAGAC CB:Z:CTGTTCGCAAAGCATA
CTGTTCGCAAAGCATA
+
F:FFFFFFFFFFFFFF
$ zcat sample_1_R3.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 3:N:0:ACTCAGAC
CCTGAATCCATGTCGGGGGGTGCTGAGACAAGCTGTCTCTTATACACAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The barcoding step here uses a helper script from aertslab/single_cell_toolkit which transforms this input into two paired fastq files with the barcode information embedded in the fastq comments field:
$ zcat sample_1_dex_R1.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 CR:Z:CTGTTCGCAAAGCATA CY:Z:F:FFFFFFFFFFFFFF CB:Z:CTGTTCGCAAAGCATA
NTTGTCTCAGCACCCCCCGACATGGATTCAGGCTGTCTCTTATACACATC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
$ zcat sample_1_dex_R2.fastq.gz | head -n 4
@A00311:74:HMLK5DMXX:1:1101:2013:1000 CR:Z:CTGTTCGCAAAGCATA CY:Z:F:FFFFFFFFFFFFFF CB:Z:CTGTTCGCAAAGCATA
CCTGAATCCATGTCGGGGGGTGCTGAGACAAGCTGTCTCTTATACACAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
multiome
/alternate¶
The multiome
or alternately-named settings work with the same pipeline steps as standard
with the exception of the whitelist used for barcode correction.
The whitelists are supplied in the params file (params.tools.singlecelltoolkit.barcode_correction.whitelist
).
This can be used to supply alternate whitelists for certain samples, for example if you need to supply a reverse complemented whitelist for samples run in certain sequencing machines.
hydrop_3x96
/hydrop_2x384
¶
The HyDrop settings (either hydrop_3x96
or hydrop_2x384
depending on the library preparation used) processes data generated by the HyDrop ATAC protocol
(see hydrop.aertslab.org and the associated preprint).
This approach differs from the standard pipeline in only the initial step, which is to extract and process the HyDrop barcodes from the sequencing output.
Here, this script is used to take the R2 read from the sequencer:
$ zcat sample_5_R2.fastq.gz | head -n 4
@VH00445:5:AAAL5KYM5:1:1101:63923:1019 2:N:0:ACACGTGGAC
CACTGGTGGTAGGGTACTCGGACAAGTGGAGCAGTAGCTGAAGTGTAGAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
and transform it into:
$ zcat sample_5_hydrop_barcode_R2.fastq.gz
@VH00445:5:AAAL5KYM5:1:1101:63923:1019 2:N:0:ACACGTGGAC
CACTGGTGGTGACAAGTGGAAAGTGTAGAA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
The two HyDrop modes (hydrop_3x96
, hydrop_2x384
) differ only in the way the initial barcode extraction script works.
Following this, they are processed in the same way as the standard pipeline, including whitelist-based barcode correction (note that the two HyDrop modes require different barcode whitelists to be used here).
biorad
¶
The biorad
setting processes BioRad data using
this script
in our aertslab/single_cell_toolkit
(previously, this was done with BAP).
This takes input data:
$ zcat sample_2_R1.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 1:N:0:TAAGGCGA
GATCACCATATGCATGACATTCACGAGTCACTGAGTAACGCCTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTGCAATGGCTGGAGCACACCCCATACTCATTCTGGTCTCCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFF,FF,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFF,FFFFFF
$ zcat sample_2_R2.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 2:N:0:TAAGGCGA
GTGTTTGGCTGAGGAAAGTGTGTGAAGCCCCGATATGTGA
+
FFF,FFF:FFF:FF,FFFFF:F:FFFFFFFFFFF,,F:FF
and directly produces paired fastq files with the barcode added in the fastq comments field:
$ zcat sample_2_dex_R1.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 CR:Z:GATCACCATTCACGTAACGCC CY:Z:FFFFFFFFFFFFFF:FFFFFF CB:Z:GATCACCATTCACGTAACGCC br:Z:0,0,0_0,0,0,1
CTGCAATGGCTGGAGCACACCCCATACTCATTCTGGTCTCCTT
+
F:FFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFF,FFFFFF
$ zcat sample_2_dex_R2.fastq.gz | head -n 4
@A00794:327:HTJ55DRXX:1:2101:1154:1016 CR:Z:GATCACCATTCACGTAACGCC CY:Z:FFFFFFFFFFFFFF:FFFFFF CB:Z:GATCACCATTCACGTAACGCC br:Z:0,0,0_0,0,0,1
GTGTTTGGCTGAGGAAAGTGTGTGAAGCCCCGATATGTGA
+
FFF,FFF:FFF:FF,FFFFF:F:FFFFFFFFFFF,,F:FF
Running the workflow¶
Technical considerations¶
Direct the Nextflow work directory to an alternate path (e.g. a scratch drive) using the
NXF_WORK
environmental variable:nwork=/path/to/scratch/example_project mkdir $nwork export NXF_WORK=$nwork
Note that if you start a new shell, NXF_WORK
must be set again, or the pipeline will not resume properly.
- Temporary directory mapping.
For large BAM files, the system default temp location may become full.
A workaround is to include a volume mapping to the alternate
/tmp
-B /alternate/path/to/tmp:/tmp
using the volume mount options in Docker or Singularity. For example in the container engine options:
- Singularity run options:
runOptions = '--cleanenv -H $PWD -B /data,/alternate/path/to/tmp:/tmp'
- Docker run options:
runOptions = '-i -v /data:/data -v /alternate/path/to/tmp:/tmp'
Configuration¶
To generate a config file, use the atac_preprocess
profile along with docker
or singularity
.
Note that the full path to vib-singlecell-nf/vsn-pipelines/main_atac.nf
must be used:
nextflow config \
vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-profile atac_preprocess,singularity \
> atac_preprocess.config
Note
It is also possible to run the pycisTopic QC steps directly after this atac_preprocess
pipeline, with a single command.
Please see
here
for details on how to run with this configuration.
Parameters¶
The ATAC-specific parameters are described here. The important parameters to verify are:
params.data.atac_preprocess.metadata
: the path to the metadata file.params.tools.bwamaptools.bwa_fasta
: the path to the bwa reference fasta file. This should be already indexed withbwa index
, and the index files located in the same directory as the fasta file. Note thatbwa
andbwa-mem2
use different indexes that are not interchangeable.params.tools.singlecelltoolkit.barcode_correction.whitelist
: Whitelists for barcode correction are supplied here. The whitelists are matched to samples based on the parameter key here (‘standard’, ‘multiome’, ‘hydrop_3x96’, ‘hydrop_2x384’, etc.) and the technology field listed for each sample in the metadata file. Barcode whitelists can (optionally) be gzipped. There are currently no checks performed to ensure that the sample barcodes have any overlap to the whitelist (the barcode correction reports should be checked for this).
Choice of tools¶
Several steps have options for the choice of method to use.
These options are controlled within params.atac_preprocess_tools
.
- Adapter trimming (
adapter_trimming_method
): Can be either ofTrim_Galore
(default), orfastp
. - Duplicate marking (
mark_duplicates_method
): Can be either ofMarkDuplicates
(Picard tools, default) orMarkDuplicatesSpark
(GATK). We currently recommend Picard MarkDuplicates because it has the capability to perform barcode-aware marking of PCR duplicates. MarkDuplicatesSpark has the advantage of parallelization, however it requires a large SSD to use for temporary files.
Additionally:
- Mapping: Use parameter
params.tools.bwamaptools.bwa_version
to select eitherbwa
orbwa-mem2
. These should give virtually identical results, howeverbwa-mem2
, while faster, has used more memory in our tests. Note that the index (bwa_index
) is not interchangeable between the versions.
Optional parameters¶
- Within
params.tools.sinto.fragments
:- One of (but not both)
barcodetag
orbarcode_regex
needs to be set to tell Sinto where to find the barcodes in the bam file. The default is to usebarcodetag
ofCB
. mapq
: Controls quality filtering settings for generating the fragments file. Discards reads with quality score lower than this number (default 30).
- One of (but not both)
Execution¶
After configuring, the workflow can be run with:
nextflow -C atac_preprocess.config run \
vib-singlecell-nf/vsn-pipelines/main_atac.nf \
-entry atac_preprocess -resume
Output¶
An example output tree is shown here.
out/
├── data
│ ├── bam
│ │ ├── sample_1.bwa.out.possorted.bam
│ │ ├── sample_1.bwa.out.possorted.bam.bai
│ │ ├── sample_2.bwa.out.possorted.bam
│ │ └── sample_2.bwa.out.possorted.bam.bai
│ ├── fragments
│ │ ├── sample_1.sinto.fragments.tsv.gz
│ │ ├── sample_1.sinto.fragments.tsv.gz.tbi
│ │ ├── sample_2.sinto.fragments.tsv.gz
│ │ └── sample_2.sinto.fragments.tsv.gz.tbi
│ └── reports
│ ├── barcode
│ │ ├── sample_1____S7_R1_001.corrected.bc_stats.log
│ │ └── sample_2____S8_R1_001.corrected.bc_stats.log
│ ├── mapping_stats
│ │ ├── sample_1.mapping_stats.tsv
│ │ └── sample_2.mapping_stats.tsv
│ ├── mark_duplicates
│ │ ├── sample_1.library_complexity_metrics.txt
│ │ ├── sample_1.mark_duplicates_metrics.txt
│ │ ├── sample_2.library_complexity_metrics.txt
│ │ └── sample_2.mark_duplicates_metrics.txt
│ └── trim
│ ├── sample_1____S7_R1_001.fastp.trimming_report.html
│ └── sample_2____S8_R1_001.fastp.trimming_report.html
└── nextflow_reports
├── execution_report.html
├── execution_timeline.html
├── execution_trace.txt
└── pipeline_dag.dot