Usage

There are two search modes, specified by the --mode argument. By default (i.e. when --mode is not specified), the mode is remote, and a fully remote cblaster search is run using NCBI’s BLAST API. Alternatively, local mode performs a search against a local DIAMOND database, which is much quicker but requires some initial setup.

Performing a remote search via the NCBI BLAST API

At a minimum, a search could look like one of the following:

$ cblaster search -qf query.fasta
$ cblaster search -qi QBE85649.1 QBE85648.1 QBE85647.1 QBE85646.1 ...

This will launch a remote search against the non-redundant (nr) protein database, retrieve and parse the results, then report any blocks of hits to the terminal. By default, hits are only reported if they are above 30% identity and 50% query coverage, and have an e-value below 0.01. To be stricter, change those values as follows:

$ cblaster search -qf query.fasta --min_identity 70 --min_coverage 90 --evalue 0.001
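To make the thresholds concrete, the sketch below applies the default and the stricter cut-offs to a small, made-up hit table using awk. The table layout (subject, identity, coverage, e-value), the values, and the `filter_hits` helper are assumptions for illustration only; this is not how cblaster stores or filters hits internally.

```shell
#!/bin/sh
# Hypothetical hit table: subject, percent identity, query coverage, e-value.
hits='hitA 85.2 97.0 1e-50
hitB 45.0 60.0 1e-5
hitC 25.0 80.0 1e-3
hitD 72.0 40.0 1e-20'

# Keep rows that pass all three thresholds and join the names on one line.
# Strict comparisons are an assumption about boundary handling.
filter_hits() {
    # $1 = minimum identity, $2 = minimum coverage, $3 = maximum e-value
    printf '%s\n' "$hits" |
        awk -v id="$1" -v cov="$2" -v ev="$3" \
            '$2 + 0 > id + 0 && $3 + 0 > cov + 0 && $4 + 0 < ev + 0 { print $1 }' |
        paste -sd' ' -
}

default_hits=$(filter_hits 30 50 0.01)   # the documented defaults
strict_hits=$(filter_hits 70 90 0.001)   # the stricter values shown above

echo "default: $default_hits"
echo "strict:  $strict_hits"
```

With the stricter values, only hits that are both highly similar and nearly full-length survive, which is usually what you want when the default search returns too many marginal clusters.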

You can also pass in NCBI search queries using -eq / --entrez_query to pre-filter the target database, which can greatly reduce run-times and give more targeted results. For example, to only search against Aspergillus sequences:

$ cblaster search -qf query.fasta --entrez_query '"Aspergillus"[ORGN]'

See the NCBI Entrez help documentation for a full description of Entrez search terms.
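Entrez queries contain double quotes and square brackets, both of which are special to most shells. A minimal sketch of one way to keep the query intact is to store it in a single-quoted variable; the `show_arg` helper is hypothetical and only prints what a program would receive.

```shell
#!/bin/sh
# Single quotes keep the inner double quotes and the square brackets literal,
# so the shell neither strips the quotes nor treats [ORGN] as a glob pattern.
entrez_query='"Aspergillus"[ORGN]'

# Hypothetical helper that prints its first argument exactly as received.
show_arg() { printf '%s\n' "$1"; }

received=$(show_arg "$entrez_query")
echo "argument passed: $received"

# The same variable can then be handed to cblaster unchanged:
# cblaster search -qf query.fasta --entrez_query "$entrez_query"
```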

Searching a local database using DIAMOND

Alternatively, a local DIAMOND database can be searched by specifying:

$ cblaster search -qf query.fasta --mode local --database db.dmnd

For this to work, the database must consist of sequences derived from NCBI, such that their identifiers can be used for retrieval of sequences/genomic context. The easiest way to set this up is via NCBI’s batch assembly download option. For example, to build a database of Aspergillus protein sequences:

  1. Search the NCBI Assembly database for Aspergillus genomes
  2. Click ‘Download Assemblies’, select ‘Protein FASTA’ and click ‘Download’
  3. Extract all FASTA files and concatenate them
$ pigz -d *.gz
$ cat *.faa >> database.faa
  4. Build the DIAMOND database
$ diamond makedb --in database.faa --db database
...
$ ls
database.faa
database.dmnd
  5. Run cblaster against the newly created database
$ cblaster search -m local -qf query.fasta -db database.dmnd <options>
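The concatenation and database-building steps above can be sketched as a small script. To stay runnable anywhere, it creates two tiny, made-up FASTA files in a scratch directory instead of using real NCBI downloads. The file names and the `database.fasta` output name are assumptions, and the DIAMOND step is skipped when `diamond` is not installed.

```shell
#!/bin/sh
set -eu

# Scratch directory with two made-up protein FASTA files standing in for
# the extracted NCBI assembly downloads.
workdir=$(mktemp -d)
cd "$workdir"
printf '>protA\nMKVLAAGIVGLLLAQPSFAADT\n>protB\nMLSRAVCGTSRQLAPALGYLGS\n' > assembly1.faa
printf '>protC\nMTTQAPTFTQPLQSVVVLEGST\n' > assembly2.faa

# Concatenate into a name that does not match *.faa, so a re-run never
# feeds the combined file back into itself.
cat ./*.faa > database.fasta

# Sanity check: each sequence contributes exactly one header line.
n_seqs=$(grep -c '^>' database.fasta)
echo "sequences: $n_seqs"

# Build the DIAMOND database only if diamond is available on this machine.
if command -v diamond >/dev/null 2>&1; then
    diamond makedb --in database.fasta --db database
else
    echo "diamond not found; skipping makedb"
fi
```

Counting header lines before building the database is a cheap way to catch an empty or partially extracted download.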

Alternatively, you could use ncbi-genome-download to retrieve the sequences from the command line.
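As a rough sketch of that route, the script below assembles an ncbi-genome-download command for Aspergillus protein FASTA files. The chosen options (`--genera`, `--formats protein-fasta`, `--output-folder`, and the `fungi` group) are assumptions based on that tool's documented interface, and the `aspergillus` folder name is hypothetical. Check `ncbi-genome-download --help` for your installed version. The command is only printed unless `RUN=1` is set, since a real run downloads a large amount of data.

```shell
#!/bin/sh
# Assumed invocation: protein FASTA files for the genus Aspergillus from the
# "fungi" group, written to ./aspergillus (hypothetical folder name).
cmd='ncbi-genome-download --genera Aspergillus --formats protein-fasta --output-folder aspergillus fungi'

# Print the command by default; execute it only when RUN=1 is set and the
# tool is actually installed.
if [ "${RUN:-0}" = "1" ] && command -v ncbi-genome-download >/dev/null 2>&1; then
    $cmd
else
    echo "$cmd"
fi
```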