Tutorial: Analyzing a Virome Dataset with ViromeXplore

This tutorial uses a 10,000-read subset of a human gut virome (SRR829034) to walk through the workflows contained in the ViromeXplore software. The demo dataset is included in the GitHub repository, so no separate download is needed.

Quality Control and Classification

Run the following command to perform quality control and viral classification:

nextflow ViromeXplore.nf --pipeline qc_classify --reads "demo/demo_SRR829034_{1,2}.fastq"
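If you prefer to launch the workflows from a script, the command above can be assembled programmatically. The following is only a minimal sketch: the helper name `build_command` is hypothetical, and it uses only the `--pipeline` and `--reads` options that appear in this tutorial.

```python
import shlex


def build_command(pipeline, **params):
    """Return the argument list for one ViromeXplore workflow invocation.

    `build_command` is a hypothetical helper, not part of ViromeXplore.
    Each keyword argument becomes a `--name value` pair.
    """
    argv = ["nextflow", "ViromeXplore.nf", "--pipeline", pipeline]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_command("qc_classify", reads="demo/demo_SRR829034_{1,2}.fastq")

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))

# To actually run it, one could pass the list to subprocess, e.g.:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list keeps the `{1,2}` read pattern intact, since no shell is involved to expand the braces before Nextflow sees them.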

The classification results can be explored in an interactive Krona plot.

[Figure: Krona plot]

The ViromeQC results show an enrichment score of 2.26, indicating that the virome is about 2.26× more enriched in viral sequences than a comparable metagenome.

Viral Assembly

Assemble the viral reads:

nextflow ViromeXplore.nf --pipeline viral_assembly --reads "demo/demo_SRR829034_{1,2}.fastq"

The fastp report shows that most bases had a quality score above Q30, so no reads were removed.

[Figure: fastp quality plot]

This workflow also writes the assembled contigs to a FASTA file. A quick inspection shows a total of 27 contigs.
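One way to do such a quick inspection is to count the FASTA header lines. The sketch below is illustrative only: the function name `count_fasta_records` and the in-memory example sequences are made up for demonstration, and for the tutorial run you would open the assembly file instead (results/megahit_output/final.contigs.fa, the path used by the find_viruses step).

```python
import io


def count_fasta_records(handle):
    """Count sequences by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory FASTA with made-up contig names and sequences.
example = io.StringIO(">contig_a\nACGT\n>contig_b\nGGCC\nTTAA\n")
print(count_fasta_records(example))  # 2

# For a real file (path assumed from the tutorial):
#   with open("results/megahit_output/final.contigs.fa") as fh:
#       print(count_fasta_records(fh))
```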

Identify Viral Sequences

Run the viral identification workflow on the assembled contigs:

nextflow ViromeXplore.nf --pipeline find_viruses --contigs results/megahit_output/final.contigs.fa

VirSorter2 results: 3 sequences were classified as viral:

  • 2 dsDNA phages

  • 1 ssDNA virus

Refer to the following table for more details:

| seqname | dsDNAphage | NCLDV | RNA | ssDNA | lavidaviridae | max_score | max_score_group | length | hallmark | viral | cellular |
|---|---|---|---|---|---|---|---|---|---|---|---|
| k77_12 | 0.987 | 0.260 | 0.365 | 1.000 | 0.267 | 1.000 | ssDNA | 2898 | 1 | 60.000 | 0.000 |
| k77_25 | 0.993 | 0.253 | 0.185 | 0.913 | 0.280 | 0.993 | dsDNAphage | 5029 | 4 | 62.500 | 0.000 |
| k77_26 | 1.000 | 0.473 | 0.120 | 0.980 | 0.947 | 1.000 | dsDNAphage | 6399 | 2 | 100.000 | 0.000 |
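The summary counts (2 dsDNA phages, 1 ssDNA virus) follow from the max_score_group column of the table above. A minimal sketch of that tally is shown here; the values are copied from the table, only a subset of columns is included, and the function name `tally_groups` is hypothetical.

```python
from collections import Counter
import csv
import io

# Subset of the score table above, written as tab-separated text.
TABLE = (
    "seqname\tmax_score\tmax_score_group\tlength\n"
    "k77_12\t1.000\tssDNA\t2898\n"
    "k77_25\t0.993\tdsDNAphage\t5029\n"
    "k77_26\t1.000\tdsDNAphage\t6399\n"
)


def tally_groups(tsv_text):
    """Count how many sequences fall into each max_score_group."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["max_score_group"] for row in reader)


for group, count in tally_groups(TABLE).items():
    print(group, count)
```

The same function could be pointed at the score table written by the workflow, provided that file is tab-separated and uses the column names shown above.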

Check Viral Genome Quality

Results for CheckV:

  • 2 dsDNA viruses: High quality (100% complete)

  • ssDNA virus: 55% complete

| contig_id | contig_length | proviral_length | aai_expected_length | aai_completeness | aai_confidence | aai_error | aai_num_hits | aai_top_hit | aai_id | aai_af | hmm_completeness_lower | hmm_completeness_upper | hmm_num_hits | kmer_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| k77_12 | 2899 | NA | 5203.194186213554 | 55.71577566105885 | high | 1.36078157711096 | 34 | DTR_883679 | 93.88 | 92.87 | 45.29686796215677 | 66.77545783982293 | 3 | 1.0 |
| k77_25 | 5106 | NA | 5006.547801713951 | 100.0 | high | 1.48247660798039 | 173 | DTR_883655 | 100.0 | 98.39 | 85.48428064735897 | 99.99999999999999 | 5 | 1.01 |
| k77_26 | 6476 | NA | 6397.397761926372 | 100.0 | high | 1.28739096702271 | 62 | DTR_883654 | 99.74 | 93.64 | 95.88582479210984 | 100.00000000000001 | 3 | 1.01 |
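To relate the aai_completeness values above to quality labels, the sketch below applies CheckV's published completeness thresholds (above 90% is high quality, 50 to 90% medium quality, below 50% low quality). It is a simplification and an assumption of this example, not CheckV's own logic: the "Complete" tier, which depends on evidence such as terminal repeats, is not modeled, and CheckV's own quality summary remains the authoritative source. The function name `quality_tier` is hypothetical.

```python
def quality_tier(completeness):
    """Map a completeness percentage to a simplified CheckV-style tier."""
    if completeness is None:
        return "Not-determined"
    if completeness > 90:
        return "High-quality"
    if completeness >= 50:
        return "Medium-quality"
    return "Low-quality"


# aai_completeness values copied from the table above.
aai_completeness = {
    "k77_12": 55.71577566105885,
    "k77_25": 100.0,
    "k77_26": 100.0,
}

for contig, value in aai_completeness.items():
    print(contig, quality_tier(value))
```

Under these thresholds the two 100% complete contigs land in the high-quality tier, while k77_12 at roughly 55.7% falls into the medium-quality range.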

Extend Incomplete Genomes

nextflow ViromeXplore.nf --pipeline high_quality_genomes \
  --reads "results/fastp_output/demo_SRR829034_{1,2}.fastp.fq.gz" \
  --contigs results/megahit_output/final.contigs.fa \
  --viral_contigs results/checkv_output/viruses.fna

Log output shows no extension was performed (expected for demo data):

[01/23] Reading contigs and getting the contig end sequences...
[05/23] A total of 3 query contigs were imported.
...
no query was extended, exit! this is normal if you only provide few queries.

Taxonomic Annotation

Since the extension step produced no extended genomes, we run taxonomy annotation directly on the viral contigs:

nextflow ViromeXplore.nf --pipeline taxonomy_annotation \
  --viral_contigs results/checkv_output/viruses.fna

Taxonomy results: all viral genomes were successfully classified.

  • Family: Microviridae

| seq_name | length | topology | coordinates | n_genes | genetic_code | virus_score | fdr | n_hallmarks | marker_enrichment | taxonomy |
|---|---|---|---|---|---|---|---|---|---|---|
| k77_26 | 6476 | DTR | NA | 9 | 11 | 0.9837 | NA | 2 | 14.7373 | Viruses;Monodnaviria;Sangervirae;Phixviricota;Malgrandaviricetes;Petitvirales;Microviridae |
| k77_25 | 5106 | DTR | NA | 8 | 11 | 0.9827 | NA | 3 | 11.1680 | Viruses;Monodnaviria;Sangervirae;Phixviricota;Malgrandaviricetes;Petitvirales;Microviridae |
| k77_12 | 2899 | No terminal repeats | NA | 5 | 11 | 0.9783 | NA | 1 | 4.6856 | Viruses;Monodnaviria;Sangervirae;Phixviricota;Malgrandaviricetes;Petitvirales;Microviridae |
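The family can be read off the semicolon-separated lineage in the taxonomy column. The sketch below relies on the ICTV naming convention that family names end in "-viridae"; the function name `family_of` is hypothetical, and the lineage string is copied from the table above.

```python
def family_of(lineage):
    """Return the first rank whose name ends in 'viridae', or None."""
    for rank in lineage.split(";"):
        if rank.endswith("viridae"):
            return rank
    return None


lineage = (
    "Viruses;Monodnaviria;Sangervirae;Phixviricota;"
    "Malgrandaviricetes;Petitvirales;Microviridae"
)
print(family_of(lineage))  # Microviridae
```

Matching on the suffix rather than on a fixed position keeps the lookup working for lineages that stop at a higher rank, in which case the function returns None.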

Functional annotation: the eggNOG analysis revealed proteins related to:

  • Structural molecule activity

  • ATP binding

  • Viral process

Directory structure

Each pipeline step creates its own directory, keeping the workflow organized and reproducible, for example:

results/
│
├── samtools_output/
├── megahit_output/
├── bowtie_output/
├── vsearch_output/
├── eggnog_mapper_output/
├── checkv_output/
├── viromeQC_output/
├── fastp_output/
├── geNomad_output/
├── virsorter_out/
├── cdhit_output/
├── mapping_summary_output/
├── kaiju_output/
└── cobra_output/
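Because each step writes to a predictable directory, it is easy to check which outputs exist before starting a downstream step. The following is a minimal sketch under stated assumptions: the helper `missing_outputs` is hypothetical, the expected names are a subset of the listing above, and a temporary directory stands in for a real results/ folder.

```python
from pathlib import Path
import tempfile

# Subset of the output directories from the listing above.
EXPECTED = ["fastp_output", "megahit_output", "checkv_output"]


def missing_outputs(results_dir, expected=EXPECTED):
    """Return the expected output directories that do not exist yet."""
    root = Path(results_dir)
    return [name for name in expected if not (root / name).is_dir()]


# Demonstration with a temporary directory in place of results/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fastp_output").mkdir()
    print(missing_outputs(tmp))  # ['megahit_output', 'checkv_output']
```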