guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_fast.cfg \
-i flongle_fast5_pass/ -s flongle_test2 -x 'auto' --recursive
guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
--fast5_out -i flongle_fast5_pass/ -s flongle_hac_fastq -x 'auto' --recursive
$ guppy_basecaller --compress_fastq -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
-i flongle_fast5_pass/ -s flongle_hac_fastq -x 'auto' --recursive
ONT Guppy basecalling software version 3.4.1+213a60d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.jsn
input path: flongle_fast5_pass/
save path: flongle_hac_fastq
chunk size: 1000
chunks per runner: 512
records per file: 4000
fastq compression: ON
num basecallers: 1
gpu device: auto
kernel path:
runners per device: 4
Found 105 fast5 files to process.
Init time: 2790 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2493578 ms, Samples called: 3970728746, samples/s: 1.59238e+06
Finishing up any open output files.
Basecalling completed successfully.
So from the above we see that in high accuracy mode it takes the Xavier about 41.5 minutes (a caller time of 2493578 ms) to complete the basecalling using the default configuration files. For reference, the fast calling mode took ~8 minutes.
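The run times quoted here come from the "Caller time" figure in Guppy's summary line, which is in milliseconds. A minimal sketch of the conversion, assuming plain integer arithmetic is precise enough; `ms_to_min_sec` is a hypothetical helper name, not part of Guppy:

```shell
# Convert a Guppy "Caller time" in milliseconds to minutes and seconds.
# ms_to_min_sec is a hypothetical helper, not something Guppy provides.
ms_to_min_sec() {
    total_s=$(( $1 / 1000 ))    # whole seconds, fractional part dropped
    printf '%dm%02ds\n' $(( total_s / 60 )) $(( total_s % 60 ))
}

ms_to_min_sec 2493578   # high accuracy run above, prints 41m33s
```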
- When performing GPU basecalling there is always one CPU support thread per GPU caller, so the number of callers (--num_callers) dictates the maximum number of CPU threads used.
- Max chunks per runner (--chunks_per_runner): The maximum number of chunks which can be submitted to a single neural network runner before it starts computation. Increasing this figure will increase GPU basecalling performance when it is enabled.
- Number of GPU runners per device (--gpu_runners_per_device): The number of neural network runners to create per CUDA device. Increasing this number may improve performance on GPUs with a large number of compute cores, but will increase GPU memory use. This option only affects GPU calling.
There is a rough rule of thumb relating these parameters to the amount of GPU memory available:
runners * chunks_per_runner * chunk_size < 100000 * [max GPU memory in GB]
For example, a GPU with 8 GB of memory would require:
runners * chunks_per_runner * chunk_size < 800000
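The rule of thumb is easy to check before launching a run. The sketch below applies it as written; `fits_gpu_memory` is a hypothetical helper name, and the two example calls use parameter sets listed in this section (2 runners with 48 chunks per runner, and 4 runners with 512 chunks per runner, both with a chunk size of 1000). Since the rule is only rough, treat the result as a starting point rather than a hard limit.

```shell
# Rough check of the rule of thumb:
#   runners * chunks_per_runner * chunk_size < 100000 * GPU memory in GB
# fits_gpu_memory is a hypothetical helper, not a Guppy command.
fits_gpu_memory() {
    runners=$1; chunks_per_runner=$2; chunk_size=$3; gpu_gb=$4
    load=$(( runners * chunks_per_runner * chunk_size ))
    budget=$(( 100000 * gpu_gb ))
    if [ "$load" -lt "$budget" ]; then
        echo "ok: $load < $budget"
    else
        echo "too large: $load >= $budget"
    fi
}

fits_gpu_memory 2 48 1000 8     # ok: 96000 < 800000
fits_gpu_memory 4 512 1000 8    # too large: 2048000 >= 800000
```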
--num_callers 1
--gpu_runners_per_device 2
--chunks_per_runner 48
chunk_size = 1000
gpu_runners_per_device = 4
chunks_per_runner = 512
chunks_per_caller = 10000
guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_fast.cfg -i flongle_fast5_pass/ \
-s flongle_test2 -x 'auto' --recursive --num_callers 4 --gpu_runners_per_device 8 --chunks_per_runner 256
$ guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_fast.cfg \
-i flongle_fast5_pass/ -s flongle_test2 -x 'auto' --recursive --num_callers 4 \
--gpu_runners_per_device 8 --chunks_per_runner 256
ONT Guppy basecalling software version 3.4.1+213a60d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass/
save path: flongle_test2
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: auto
kernel path:
runners per device: 8
Found 105 fast5 files to process.
Init time: 880 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 428745 ms, Samples called: 3970269916, samples/s: 9.26021e+06
Finishing up any open output files.
Basecalling completed successfully.
I was able to shave a minute off the fast model on the Xavier (above) getting it down to ~7 minutes.
Update: (13th Dec 2019)
Just modifying the number of chunks per runner has allowed me to get the time down to under 6.5 mins (see table below).
| chunks_per_runner | time |
|---|---|
| (160) default | ~8 mins |
| 256 | 7 mins 6 secs |
| 512 | 6 mins 28 secs |
| 1024 | 6 mins 23 secs |
It looks like we might have reached an optimal point here. Next I'll test some of the other parameters and see if we can speed this up further.
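A sweep like the one in the table can be scripted rather than run by hand. This is a minimal sketch, assuming Guppy writes its summary ("Caller time: ... ms") to standard output as in the logs shown here. `caller_time_ms` and `sweep_chunks` are hypothetical helper names, and the `flongle_sweep_<n>` save paths are made up for illustration.

```shell
# Pull the caller time (in ms) out of Guppy's summary line on stdin.
caller_time_ms() {
    sed -n 's/^Caller time: \([0-9][0-9]*\) ms.*/\1/p'
}

# Run the fast model once per chunks_per_runner value and report the time.
# Assumption: the "Caller time" line goes to stdout; save paths are hypothetical.
sweep_chunks() {
    for cpr in "$@"; do
        ms=$(guppy_basecaller --disable_pings --compress_fastq \
                -c dna_r9.4.1_450bps_fast.cfg \
                -i flongle_fast5_pass/ -s "flongle_sweep_${cpr}" \
                -x 'auto' --recursive \
                --chunks_per_runner "$cpr" | caller_time_ms)
        echo "chunks_per_runner=$cpr caller_time_ms=$ms"
    done
}

# Example (needs Guppy and the fast5 data): sweep_chunks 256 512 1024
```

Nothing runs until `sweep_chunks` is called, so the helpers can be sourced safely on a machine without Guppy installed.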
guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
--num_callers 4 --gpu_runners_per_device 8 --fast5_out -i flongle_fast5_pass/ \
-s flongle_hac_basemod_fastq -x 'auto' --recursive
$ guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_fast.cfg \
-i flongle_fast5_pass/ -s flongle_test2 -x 'auto' --recursive --num_callers 8 \
--gpu_runners_per_device 8 --chunks_per_runner 1024
ONT Guppy basecalling software version 3.4.1+213a60d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass/
save path: flongle_test2
chunk size: 1000
chunks per runner: 1024
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: auto
kernel path:
runners per device: 8
Found 105 fast5 files to process.
Init time: 897 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 383865 ms, Samples called: 3970269916, samples/s: 1.03429e+07
Finishing up any open output files.
Basecalling completed successfully.
$ guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_fast.cfg \
-i flongle_fast5_pass/ -s flongle_test2 -x 'auto' --recursive --num_callers 4 \
--gpu_runners_per_device 8 --chunks_per_runner 1024 --chunk_size 2000
ONT Guppy basecalling software version 3.4.1+213a60d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass/
save path: flongle_test2
chunk size: 2000
chunks per runner: 1024
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: auto
kernel path:
runners per device: 8
Found 105 fast5 files to process.
Init time: 1180 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 503532 ms, Samples called: 3970269916, samples/s: 7.88484e+06
Finishing up any open output files.
Basecalling completed successfully.
$ guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_fast.cfg \
-i flongle_fast5_pass/ -s flongle_test2 -x 'auto' --recursive --num_callers 8 \
--gpu_runners_per_device 16 --chunks_per_runner 1024 --chunk_size 1000
ONT Guppy basecalling software version 3.4.1+213a60d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass/
save path: flongle_test2
chunk size: 1000
chunks per runner: 1024
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: auto
kernel path:
runners per device: 16
Found 105 fast5 files to process.
Init time: 1113 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 383466 ms, Samples called: 3970269916, samples/s: 1.03536e+07
Finishing up any open output files.
Basecalling completed successfully.
The parameters below seem to give the best speed-up I have found so far, with a resulting run time of 6 mins 23 secs.
$ guppy_basecaller --disable_pings --compress_fastq -c dna_r9.4.1_450bps_fast.cfg \
-i flongle_fast5_pass/ -s flongle_test2 -x 'auto' --recursive --num_callers 4 \
--gpu_runners_per_device 8 --chunks_per_runner 1024 --chunk_size 1000
ONT Guppy basecalling software version 3.4.1+213a60d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass/
save path: flongle_test2
chunk size: 1000
chunks per runner: 1024
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: auto
kernel path:
runners per device: 8
Found 105 fast5 files to process.
Init time: 926 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 382714 ms, Samples called: 3970269916, samples/s: 1.0374e+07
Finishing up any open output files.
Basecalling completed successfully.
# --device "cuda:0 cuda:1" should now scale nicely across both cards, though I haven't checked
guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
--ipc_threads 16 \
--num_callers 8 \
--gpu_runners_per_device 4 \
--chunks_per_runner 512 \
--device "cuda:0 cuda:1" \
--recursive \
--fast5_out \
-i fast5_input \
-s fastq_output
guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_fast.cfg \
--ipc_threads 16 \
--num_callers 8 \
--gpu_runners_per_device 64 \
--chunks_per_runner 256 \
--device "cuda:0 cuda:1" \
--recursive \
-i fast5_input \
-s fastq_output
There has been some discussion about the speed of the recent releases of Guppy (3.4.1 and 3.4.2). I was interested in running some benchmarks across different versions. I had a hunch the slowdown may have something to do with the newly introduced compression of the fast5 files...
The only things I am changing are the version of Guppy being used, and in the case of 3.4.3 I am trying with and without vbz compression of the fast5 files. Everything else is as below:
System:
- Debian Sid (unstable)
- 2x 12-Core Intel Xeon Gold 5118 (48 threads)
- 256 GB RAM
- Titan RTX
- Nvidia drivers - 418.56
Guppy GPU basecalling parameters:
- disable_pings
- compress_fastq
- dna_r9.4.1_450bps_fast.cfg
- num_callers 8
- gpu_runners_per_device 64
- chunks_per_runner 256
- device "cuda:0"
- recursive
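Since only the Guppy version changes between runs, the benchmark commands can be generated in a loop. The sketch below is a dry run that only prints each command. It assumes every release is unpacked under `~/Downloads/software/guppy/<version>/`, as in the raw output further down; `bench_commands` is a hypothetical helper name. For the vbz run, `-i` would point at the compressed copy of the fast5 files instead.

```shell
# Print the benchmark command for each Guppy version given (dry run only).
# Assumes each release lives under ~/Downloads/software/guppy/<version>/.
# bench_commands is a hypothetical helper, not part of Guppy.
bench_commands() {
    for v in "$@"; do
        echo "$HOME/Downloads/software/guppy/$v/ont-guppy/bin/guppy_basecaller" \
            --disable_pings --compress_fastq \
            -c dna_r9.4.1_450bps_fast.cfg \
            --num_callers 8 --gpu_runners_per_device 64 --chunks_per_runner 256 \
            --device '"cuda:0"' --recursive \
            -i flongle_fast5_pass -s "testrun_fast_$v"
    done
}

bench_commands 3.1.5 3.2.4 3.3.0 3.3.3 3.4.3
```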
| guppy version | time (seconds) | samples/s |
|---|---|---|
| 3.1.5# | 93.278 | 4.25638e+07 |
| 3.2.4# | 94.141 | 4.21737e+07 |
| 3.3.0# | 94.953 | 4.1813e+07 |
| 3.3.3# | 95.802 | 4.14425e+07 |
| 3.4.3 (no vbz compressed fast5) | 270.953 | 1.4653e+07 |
| 3.4.3 (vbz compressed fast5) | 82.877 | 4.79056e+07 |
Note: the versions marked with # did not support vbz compression of fast5 files.
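To put the table in relative terms, each time can be divided by the 3.1.5 baseline. A small sketch using awk for the floating point division; `relative_time` is a hypothetical helper name and the inputs are the times in seconds from the table:

```shell
# Caller time relative to a baseline run; 2.90 means 2.9x as long.
# relative_time is a hypothetical helper, not part of Guppy.
relative_time() {
    awk -v t="$1" -v base="$2" 'BEGIN { printf "%.2f\n", t / base }'
}

relative_time 270.953 93.278   # 3.4.3, uncompressed fast5, vs 3.1.5: 2.90
relative_time 82.877 93.278    # 3.4.3, vbz compressed fast5, vs 3.1.5: 0.89
```

So in this benchmark 3.4.3 took roughly 2.9 times as long as 3.1.5 on uncompressed fast5 files, and about 11% less time once the files were vbz compressed.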
You can view the 'raw' results/output for each run below:
~/Downloads/software/guppy/3.1.5/ont-guppy/bin/guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_fast.cfg \
--num_callers 8 \
--gpu_runners_per_device 64 \
--chunks_per_runner 256 \
--device "cuda:0" \
--recursive \
-i flongle_fast5_pass \
-s testrun_fast_3.1.5
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/3.1.5/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/3.1.5/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass
save path: testrun_fast_3.1.5
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: cuda:0
kernel path:
runners per device: 64
Found 105 fast5 files to process.
Init time: 1000 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 93278 ms, Samples called: 3970269916, samples/s: 4.25638e+07
Finishing up any open output files.
Basecalling completed successfully.
~/Downloads/software/guppy/3.2.4/ont-guppy/bin/guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_fast.cfg \
--num_callers 8 \
--gpu_runners_per_device 64 \
--chunks_per_runner 256 \
--device "cuda:0" \
--recursive \
-i flongle_fast5_pass \
-s testrun_fast_3.2.4
ONT Guppy basecalling software version 3.2.4+d9ed22f
config file: /home/miles/Downloads/software/guppy/3.2.4/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/3.2.4/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass
save path: testrun_fast_3.2.4
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: cuda:0
kernel path:
runners per device: 64
Found 105 fast5 files to process.
Init time: 836 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 94141 ms, Samples called: 3970269916, samples/s: 4.21737e+07
Finishing up any open output files.
Basecalling completed successfully.
~/Downloads/software/guppy/3.3.0/ont-guppy/bin/guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_fast.cfg \
--num_callers 8 \
--gpu_runners_per_device 64 \
--chunks_per_runner 256 \
--device "cuda:0" \
--recursive \
-i flongle_fast5_pass \
-s testrun_fast_3.3.0
ONT Guppy basecalling software version 3.3.0+ef22818
config file: /home/miles/Downloads/software/guppy/3.3.0/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/3.3.0/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass
save path: testrun_fast_3.3.0
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: cuda:0
kernel path:
runners per device: 64
Found 105 fast5 files to process.
Init time: 722 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 94953 ms, Samples called: 3970269916, samples/s: 4.1813e+07
Finishing up any open output files.
Basecalling completed successfully.
~/Downloads/software/guppy/3.3.3/ont-guppy/bin/guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_fast.cfg \
--num_callers 8 \
--gpu_runners_per_device 64 \
--chunks_per_runner 256 \
--device "cuda:0" \
--recursive \
-i flongle_fast5_pass \
-s testrun_fast_3.3.3
ONT Guppy basecalling software version 3.3.3+fa743a6
config file: /home/miles/Downloads/software/guppy/3.3.3/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/3.3.3/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass
save path: testrun_fast_3.3.3
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: cuda:0
kernel path:
runners per device: 64
Found 105 fast5 files to process.
Init time: 726 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 95802 ms, Samples called: 3970269916, samples/s: 4.14425e+07
Finishing up any open output files.
Basecalling completed successfully.
~/Downloads/software/guppy/3.4.3/ont-guppy/bin/guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_fast.cfg \
--num_callers 8 \
--gpu_runners_per_device 64 \
--chunks_per_runner 256 \
--device "cuda:0" \
--recursive \
-i flongle_fast5_pass \
-s testrun_fast_3.4.3_uncompressed
ONT Guppy basecalling software version 3.4.3+f4fc735
config file: /home/miles/Downloads/software/guppy/3.4.3/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/3.4.3/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_fast5_pass
save path: testrun_fast_3.4.3_uncompressed
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: cuda:0
kernel path:
runners per device: 64
Found 105 fast5 files to process.
Init time: 738 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 270953 ms, Samples called: 3970269916, samples/s: 1.4653e+07
Finishing up any open output files.
Basecalling completed successfully.
~/Downloads/software/guppy/3.4.3/ont-guppy/bin/guppy_basecaller \
--disable_pings \
--compress_fastq \
-c dna_r9.4.1_450bps_fast.cfg \
--num_callers 8 \
--gpu_runners_per_device 64 \
--chunks_per_runner 256 \
--device "cuda:0" \
--recursive \
-i flongle_compressed \
-s testrun_fast_3.4.3
ONT Guppy basecalling software version 3.4.3+f4fc735
config file: /home/miles/Downloads/software/guppy/3.4.3/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/3.4.3/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: flongle_compressed
save path: testrun_fast_3.4.3
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 8
gpu device: cuda:0
kernel path:
runners per device: 64
Found 105 fast5 files to process.
Init time: 721 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 82877 ms, Samples called: 3970269916, samples/s: 4.79056e+07
Finishing up any open output files.
Basecalling completed successfully.
