Estimating genomic neighbourhood with the gne module

In cblaster, the most important parameter when detecting hit clusters is the maximum inter-hit gap parameter. This determines how far cblaster will look between any two hits before terminating a given cluster. By default, this parameter is set to 20,000 bp; if no new hit is found within 20,000 bp of the previous hit in a cluster, cblaster will terminate extension of that cluster. Though the 20 kb cutoff has worked quite well for us when looking at fungi or bacteria, where gene density within clusters is quite high, it may not work for all datasets. For example, plant gene clusters may have key biosynthetic genes spread out over large stretches of the chromosome, with many genes in between; this is where the gne module comes in.
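The gap rule described above can be illustrated with a short Python sketch. This is not cblaster's actual implementation: the function name `split_into_clusters` is hypothetical, and each hit is reduced to a single coordinate on one scaffold, whereas real hits are genes with start and end positions.

```python
def split_into_clusters(positions, max_gap=20_000):
    """Group hit positions into clusters using a maximum inter-hit gap.

    Simplifying assumption: each hit is one integer coordinate (bp).
    A new cluster starts whenever the distance to the previous hit
    is greater than max_gap.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            # Within max_gap of the previous hit: extend the current cluster
            clusters[-1].append(pos)
        else:
            # First hit, or gap too large: terminate and start a new cluster
            clusters.append([pos])
    return clusters


# Made-up hit coordinates, purely for illustration
hits = [1_000, 9_000, 25_000, 60_000, 62_000]
print(split_into_clusters(hits))                  # default 20,000 bp gap
print(split_into_clusters(hits, max_gap=40_000))  # more permissive gap
```

With these made-up coordinates, the 35,000 bp stretch between the third and fourth hits splits them into two clusters at the default gap, while a 40,000 bp gap merges everything into one.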

The gne module lets you robustly detect an appropriate value for this parameter by repeatedly re-running cluster detection on a saved search session at different gap values over some interval. It then generates plots of the mean and median cluster sizes (bp), as well as the total number of predicted clusters, at each value. gne is run on a search session, like so:

$ cblaster gne session.json

This generates output that looks like this:

[Figure: gne module plot]

From this plot, you can gain insight into the size of the genomic neighbourhood of your query proteins; in this case, the result clusters tend to be just under 20 kb.
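To make the idea of re-running cluster detection over a range of gap values more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the gne module's code: `cluster_sizes` and `sweep` are hypothetical names, hits are single coordinates, and a cluster's size is taken as the distance between its first and last hit (so a lone hit has size 0).

```python
from statistics import mean, median


def cluster_sizes(positions, max_gap):
    """Return the size (bp) of each cluster formed at the given gap.

    Assumption: size = last hit position - first hit position.
    """
    sizes = []
    start = prev = None
    for pos in sorted(positions):
        if prev is None or pos - prev > max_gap:
            if prev is not None:
                sizes.append(prev - start)  # close the previous cluster
            start = pos                     # open a new cluster
        prev = pos
    if prev is not None:
        sizes.append(prev - start)          # close the final cluster
    return sizes


def sweep(positions, gaps):
    """For each gap value, report (gap, cluster count, mean size, median size)."""
    rows = []
    for gap in gaps:
        sizes = cluster_sizes(positions, gap)
        if sizes:  # skip the statistics if there are no hits at all
            rows.append((gap, len(sizes), mean(sizes), median(sizes)))
    return rows


# Made-up hit coordinates, purely for illustration
hits = [1_000, 9_000, 25_000, 60_000, 62_000]
for row in sweep(hits, [0, 10_000, 20_000, 40_000]):
    print(row)
```

As the gap grows, the number of clusters falls and their sizes rise until every hit has merged into one cluster; the point where these values level off suggests a sensible gap setting.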

The gne module generates a list of gap values (total number determined by the samples parameter) from 0 to some upper limit (determined by the max_gap parameter). These numbers can be chosen in two ways. By default, gne will take evenly spaced (i.e. linear) values over the range 0-100,000 bp. Alternatively, you can choose to generate these values via a log scale, which will result in more samples at lower values than at higher ones. This can be specified using the --scale argument:

$ cblaster gne session.json --scale log

As these plots typically resemble logarithmic growth (i.e. rise steeply, then level off), it can make sense to sample more heavily in the more unstable region of the curve.
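The difference between the two sampling schemes can be sketched in plain Python. The helpers `linear_gaps` and `log_gaps` are hypothetical and only approximate the idea; in particular, starting the log scale at 1 bp (a log scale cannot start at 0) is an assumption rather than documented cblaster behaviour.

```python
def linear_gaps(max_gap, samples):
    """Evenly spaced gap values from 0 to max_gap (inclusive)."""
    if samples < 2:
        raise ValueError("need at least two samples")
    step = max_gap / (samples - 1)
    return [round(i * step) for i in range(samples)]


def log_gaps(max_gap, samples, start=1):
    """Geometrically spaced gap values from start to max_gap (inclusive).

    Assumption: the series begins at 1 bp, since log(0) is undefined.
    """
    if samples < 2:
        raise ValueError("need at least two samples")
    ratio = (max_gap / start) ** (1 / (samples - 1))
    return [round(start * ratio ** i) for i in range(samples)]


print(linear_gaps(100_000, 5))
print(log_gaps(100_000, 6))
```

The linear series spends the same number of samples on every part of the range, whereas most of the log series falls below 10,000 bp, where the curve tends to change fastest.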

In case you would like the underlying data (e.g. for creating your own plots), gne can generate delimited output. To do this, simply use the -o or --output argument to specify a file to save the data to, and the -d or --delimiter argument to specify the delimiting character. For example, to generate a CSV file:

$ cblaster gne session.json --output gne.csv --delimiter ","

Like plots generated by the search module, gne plots can be saved as static HTML. To do this, provide a file to the -p or --plot argument:

$ cblaster gne session.json -p gne.html