Running ViromeXplore

Running the Workflows

Usage

To run the workflows, use the following commands:

nextflow ViromeXplore.nf --pipeline qc_classify --reads "basename_{1,2}.fastq"
nextflow ViromeXplore.nf --pipeline viral_assembly --reads "basename_{1,2}.fastq"
nextflow ViromeXplore.nf --pipeline find_viruses --contigs contigs.fasta
nextflow ViromeXplore.nf --pipeline high_quality_genomes --reads "basename_{1,2}.fastq" --contigs contigs.fasta --viral_contigs viral_contigs.fasta
nextflow ViromeXplore.nf --pipeline taxonomy_annotation --viral_contigs viral_contigs_or_genomes.fasta
nextflow ViromeXplore.nf --pipeline host_prediction --phylogeny viral_phylogeny.nwk --taxonomy host_taxonomy.tsv --matrix virus_host_abundances.tsv

Containers are available for all processes. Use the appropriate profile to run the workflows:

For Docker: -profile docker
For Singularity (default): -profile singularity

If you are running on a cluster system (e.g., SLURM), you can combine a container profile with the cluster profile. To configure resource usage, modify the config/local.config file, then run with one of:

-profile singularity,slurm
-profile docker,slurm

Make sure to include the selected profile when running the workflow.
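
As a minimal sketch of a complete invocation, the pipeline and profile options above can be combined in a small wrapper script. The `run` helper and the `DRY_RUN` variable are illustrative names, not part of ViromeXplore. With `DRY_RUN=1` (the default in this sketch) the command is only printed, so it can be reviewed before launching.

```shell
#!/bin/sh
# Sketch: combine a pipeline with a container/cluster profile.
# `run` and DRY_RUN are hypothetical helpers, not part of ViromeXplore.

PROFILE="singularity,slurm"   # or: docker | singularity | docker,slurm
READS="basename_{1,2}.fastq"  # quoted so the shell passes the pattern to Nextflow unchanged

# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
# Note that the printed line does not show the quoting of individual arguments.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run nextflow ViromeXplore.nf --pipeline qc_classify --reads "$READS" -profile "$PROFILE"
```

Setting `DRY_RUN=0` in the environment makes the same script launch the workflow instead of printing it.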

Mandatory Arguments

--pipeline Specifies the pipeline to run. Valid options: qc_classify, viral_assembly, find_viruses, high_quality_genomes, taxonomy_annotation, host_prediction

For `qc_classify` and `viral_assembly` pipelines:

--reads Input reads in FASTQ format, e.g.: basename_{1,2}.fastq

For the `find_viruses` pipeline:

--contigs Contigs file in FASTA format, e.g.: contigs.fasta

For `high_quality_genomes` pipeline:

--reads Input reads in FASTQ format: basename_{1,2}.fastq
--contigs Contigs file obtained from assembly: contigs.fasta
--viral_contigs Viral contigs or genomes: viral_contigs.fasta

For `taxonomy_annotation` pipeline:

--viral_contigs Viral contigs or genomes: viral_contigs_or_genomes.fasta

For `host_prediction` pipeline:

--phylogeny Phylogenetic tree of the viruses (Newick format): viral_phylogeny.nwk
--taxonomy Lineage of host taxa (tab-delimited): host_taxonomy.tsv
--matrix Virus-host abundance matrix (tab-delimited): virus_host_abundances.tsv. Columns represent taxa; rows represent samples.
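
Because the taxonomy file and the abundance matrix must be tab-delimited, a quick structural check before launching `host_prediction` can catch export mistakes early. The sketch below only verifies that every line has the same number of tab-separated fields. `check_tsv` is a hypothetical helper, and the check assumes the header line has as many fields as the data rows, which may not hold for every export format.

```shell
#!/bin/sh
# Sketch: check that a file is tab-delimited with the same number of fields on every line.
# check_tsv is a hypothetical helper; it does not inspect the values themselves.
check_tsv() {
    [ -r "$1" ] || return 1
    awk -F '\t' '
        NR == 1 { n = NF }               # remember the field count of the first line
        NF != n { bad = 1 }              # flag any line with a different count
        END     { exit (bad || n < 2) }  # fail on ragged rows or a file without tabs
    ' "$1"
}

MATRIX="virus_host_abundances.tsv"
if [ ! -r "$MATRIX" ]; then
    echo "skipping check: $MATRIX not found"
elif check_tsv "$MATRIX"; then
    echo "$MATRIX: consistent tab-delimited layout"
else
    echo "$MATRIX: ragged rows or not tab-delimited" >&2
fi
```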

Optional Arguments

--result_dir Directory to store output files. Default: ``results``
--cpus Number of CPUs to use. Default: all available
--memory Memory (in GB) to allocate. Default: 12 GB
--help Display help message.
--workdir Work directory for Nextflow. Default: ``work``

Tool-Specific Parameters

ViromeQC

samp_type Sample type. Default: ``environmental``

VirSorter2

virsorter_minlength Minimum contig length to keep. Default: ``1500``

Fastp

phred_quality Minimum phred quality score for filtering. Default: ``30``

MEGAHIT

kmers_assembly K-mer sizes to use for assembly. Default: ``21,35,49,63,77,91,105,119,127``

COBRA

cobra_assembly Assembly method used. Default: ``megahit``
min_kmer Minimum k-mer size. Default: ``21``
max_kmer Maximum k-mer size. Default: ``127``
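
As a hedged example, the optional and tool-specific defaults above could be overridden for a single run as sketched below. This assumes the tool-specific names are passed with a leading `--` like the other pipeline parameters, and that `viral_assembly` is the pipeline that uses Fastp and MEGAHIT; neither is stated explicitly here. The values are illustrative only.

```shell
#!/bin/sh
# Sketch: override optional and tool-specific defaults for one run.
# Assumes the tool-specific names take a leading "--" like other Nextflow
# pipeline parameters; the values below are illustrative, not recommendations.

set -- nextflow ViromeXplore.nf \
    --pipeline viral_assembly \
    --reads "basename_{1,2}.fastq" \
    --result_dir results_q20 \
    --cpus 8 \
    --phred_quality 20 \
    --kmers_assembly "21,41,61,81" \
    -profile docker

# Print the assembled command for review; launch it with: "$@"
echo "$*"
```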

Custom Database Arguments

By default, ViromeXplore uses bundled reference databases. However, users may specify custom databases for the following tools:

--virsorterdb Custom VirSorter2 database path.
--checkvdb Custom CheckV database path.
--kaijudb Custom Kaiju database path.
--virushostdb Custom virus-host database path.
--genomaddb Custom geNomad database path.
--eggnogdb Custom EggNOG database path.

Note

All parameters (including defaults and database locations) are defined in the nextflow.config file. Users may edit this file directly or override parameters via the command line.
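
As an alternative to editing nextflow.config or writing long command lines, Nextflow's standard `-params-file` option loads parameters from a YAML or JSON file. The sketch below is an assumption about how this could be applied here: the file name and values are illustrative, and it has not been confirmed that ViromeXplore reads every parameter this way.

```shell
#!/bin/sh
# Sketch: keep run parameters in a YAML file instead of on the command line.
# -params-file is a standard Nextflow option; file name and values are illustrative.
cat > my_params.yml <<'EOF'
pipeline: find_viruses
contigs: contigs.fasta
virsorter_minlength: 3000
result_dir: results_find_viruses
EOF

# Print the matching command for review; remove "echo" to launch it.
echo nextflow ViromeXplore.nf -params-file my_params.yml -profile singularity
```

Keeping one parameter file per run also leaves a record of the settings used to produce each result directory.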

Available Pipelines

qc_classify Detects non-viral contamination and classifies reads. (Requires Illumina FASTQ files)
viral_assembly Performs QC and assembly of virome reads. (Requires Illumina FASTQ files)
find_viruses Identifies and annotates viral sequences. (Requires FASTA contigs file)
high_quality_genomes Estimates abundance and improves genome completeness. (Requires FASTA contigs, viral contigs, and Illumina FASTQ files)
taxonomy_annotation Assigns taxonomy and gene functions to viral genomes. (Requires FASTA viral contigs/genomes)
host_prediction Predicts virus-host interactions using abundance, taxonomy, and phylogeny. (Requires Newick tree, taxonomy file, and abundance matrix)