cblaster.classes module

This module stores the classes (Organism, Scaffold, Hit) used in cblaster.

class cblaster.classes.Cluster(indices=None, subjects=None, intermediate_genes=None, query_sequence_order=None, score=None, start=None, end=None, number=None)

Bases: cblaster.classes.Serializer

A cluster of subjects on the same scaffold

indices

indexes of the subjects in the list of subjects

Type:list
of the parent scaffold
subjects

Subject objects that are in this cluster. Note:

Type:list
These are not serialised for this cluster
intermediate_genes
Type:list
start

The start coordinate of the cluster on the parent scaffold

Type:int
end

The end coordinate of the cluster on the parent scaffold

Type:int
number

number that is unique for each cluster in order to identify them

Type:int
NUMBER = count(0)
calculate_score(query_sequence_order=None)

Calculate the score of the current cluster

The score is based on accumulated blastbitscore, total amount of hits against the query and a synteny score if query sequence order is provided. If there are multiple hits in a subject the hit with the top bitscore is selected for the calculation.

Parameters:
  • query_sequence_order (list) – list of sequences of the order in the query file, is
  • provided if the query has a meningfull order (only) –
Returns:

a float

classmethod from_dict(d, subjects=None)

Loads class from dict.

intermediate_end

The end of the cluster taking the intermediate genes into account

intermediate_start

The start of the cluster taking the intermediate genes into account

names
remove_subject(subject, scaffold_index)

Safely remove a subject from a cluster.

This is important when subjects become empty when recomputing a session with different treshholds

Parameters:
  • subject (Subject) – cblaster Subject object
  • scaffold_index (int) – the index of the subject in the scaffold it is saved in
sequences
to_dict(save_subjects=False)

Serialises class to dict.

class cblaster.classes.Hit(query, subject, identity, coverage, evalue, bitscore)

Bases: cblaster.classes.Serializer

A BLAST hit identified during a cblaster search.

This class is first instantiated when parsing BLAST results, and is then updated with genomic coordinates after querying either the Identical Protein Groups (IPG) resource on NCBI, or a local JSON database.

query

Name of query sequence.

Type:str
subject

Name of subject sequence.

Type:str
identity

Percentage identity (%) of hit.

Type:float
coverage

Query coverage (%) of hit.

Type:float
evalue

E-value of hit.

Type:float
bitscore

Bitscore of hit.

Type:float
copy(**kwargs)

Creates a copy of this Hit with any additional args.

classmethod from_dict(d)

Loads class from dict.

to_dict()

Serialises class to dict.

values(decimals=4)

Formats hit attributes for printing.

Parameters:decimals (int) – Total decimal places to show in score values.
Returns:List of formatted attribute strings.
class cblaster.classes.Organism(name, strain, scaffolds=None)

Bases: cblaster.classes.Serializer

A unique organism containing hits found in a cblaster search.

Every strain (or lack thereof) is a unique Organism, and will be reported separately in cblaster results.

name

Organism name, typically the genus and species epithet.

Type:str
strain

Strain name of this organism, e.g. CBS 536.65.

Type:str
scaffolds

Scaffold objects belonging to this organism.

Type:dict
clusters
classmethod from_dict(d)

Loads class from dict.

full_name

The full name (including strain) of the organism. Note: if strain found in name, returns just name.

summary(decimals=4, hide_headers=True, delimiter=None)
to_dict()

Serialises class to dict.

total_hit_clusters

Counts total amount of hit clusters in this Organism.

class cblaster.classes.Scaffold(accession, clusters=None, subjects=None)

Bases: cblaster.classes.Serializer

A genomic scaffold containing hits found in a cblaster search.

accession

Name of this scaffold, typically NCBI accession.

Type:str
subjects

Subject objects located on this scaffold.

Type:list
clusters

Clusters of hits identified on this scaffold.

Type:list
add_clusters(subject_lists, query_sequence_order=None)

Add clusters to this scaffold

After clusters are added they are sorted based on score

Parameters:
  • subject_lists (list) – a list of lists of Subject objects that are
  • a clusters (form) –
  • query_sequence_order (list) – list of sequences of the order in the query file, is
  • provided if the query has a meningfull order (only) –
classmethod from_dict(d)

Loads class from dict.

remove_subject(subject)

Safely remove a subject from a cluster by removing it from the cluster as well.

Parameters:subject (Subject) – cblaster Subject object
summary(hide_headers=False, delimiter=None, decimals=4)
to_dict()

Serialises class to dict.

class cblaster.classes.Serializer

Bases: abc.ABC

JSON serialisation mixin class.

Classes that inherit from this class should implement to_dict and from_dict methods.

classmethod from_dict(d)

Loads class from dict.

classmethod from_json(js)

Instantiates class from JSON handle.

to_dict()

Serialises class to dict.

to_json(fp=None, **kwargs)

Serialises class to JSON.

class cblaster.classes.Session(queries=None, sequences=None, params=None, organisms=None, query=None)

Bases: cblaster.classes.Serializer

Stores the state of a cblaster search.

This class stores query proteins, search parameters, Organism objects created during searches, as well as methods for generating summary tables. It can also be dumped to/loaded from JSON for re-filtering, plotting, etc.

>>> s = Session()
>>> with open("session.json", "w") as fp:
...     s.to_json(fp)
>>> with open("session.json") as fp:
...     s2 = Session.from_json(fp)
>>> s == s2
True
queries

Names of query sequences.

Type:list
params

Search parameters.

Type:dict
organisms

Organism objects created in a search.

Type:list
sequences

Query sequence translations

Type:dict
query

cblaster Cluster object for query

Type:Cluster
format(form, fp=None, **kwargs)

Generates a summary table.

Parameters:
  • form (str) – Type of table to generate (‘summary’ or ‘binary’).
  • fp (file handle) – File handle to write to.
Raises:

ValueErrorform not ‘binary’ or ‘summary’

Returns:

Summary table.

classmethod from_dict(d)

Loads class from dict.

classmethod from_file(file)
classmethod from_files(files)
to_dict()

Serialises class to dict.

class cblaster.classes.Subject(id=None, hits=None, name=None, ipg=None, start=None, end=None, strand=None, sequence=None)

Bases: cblaster.classes.Serializer

A sequence representing one or more BLAST hits.

This class is instantiated during the contextual lookup stage. It is important since it allows for subject sequences which hit >1 of the query sequences, while still staying non-redundant.

hits

Hit objects referencing this subject sequence.

Type:list
ipg

NCBI Identical Protein Group (IPG) id.

Type:int
start

Start of sequence on parent scaffold.

Type:int
end

End of sequence on parent scaffold.

Type:int
strand

Strandedness of the sequence (‘+’ or ‘-‘).

Type:str
classmethod from_dict(d)

Loads class from dict.

to_dict()

Serialises class to dict.

values(decimals=4)

cblaster.context module

cblaster.database module

cblaster.extract module

cblaster.extract_clusters module

cblaster.formatters module

cblaster result formatters.

cblaster.formatters.add_field_whitespace(rows, lengths)

Fills table fields with whitespace to specified lengths.

cblaster.formatters.binary(session, hide_headers=False, delimiter=None, key=<built-in function len>, attr='identity', decimals=4, sort_clusters=False)

Generates a binary summary table from a Session object.

cblaster.formatters.generate_header_string(text, symbol='-')

Generates a 2-line header string with underlined text.

>>> header = generate_header_string("header string", symbol="*")
>>> print(header)
header string
*************
cblaster.formatters.get_cell_values(queries, subjects, key=<built-in function len>, attr=None)

Generates the values of cells in the binary matrix.

This function calls some specified key function (def. max) against all values of a specified attribute (def. None) from Hits inside Subjects which match each query. By default, this function will just count all matching Hits (i.e. len() is called on all Hits whose query attribute matches). To find maximum identities, for example, provide key=max and attr=’identity’ to this function.

Parameters:
  • queries (list) – Names of query sequences.
  • subjects (list) – Subject objects to generate vlaues for.
  • key (callable) – Some callable that takes a list and produces a value.
  • attr (str) – A Hit attribute to calculate values with in key function.
cblaster.formatters.get_maximum_row_lengths(rows)

Finds the longest lengths of fields per column in a collection of rows.

cblaster.formatters.gne_summary(data, hide_headers=False, delimiter=None, decimals=4)
cblaster.formatters.humanise(rows)

Formats a collection of fields as human-readable.

cblaster.formatters.set_decimals(value, decimals=4)
cblaster.formatters.summarise_cluster(cluster, decimals=4, hide_headers=True, delimiter=None)

Generates a summary table for a hit cluster.

Parameters:
  • cluster (Cluster) – collection of Subject objects
  • decimals (int) – number of decimal points to show
  • hide_headers (bool) – hide column headers in output
  • delimiter (str) – delimiting string between the subjects
Returns:

summary table

cblaster.formatters.summarise_gne(data, hide_headers=False, delimiter=None, decimals=4)
cblaster.formatters.summarise_organism(organism, hide_headers=True, delimiter=None, decimals=4)
cblaster.formatters.summarise_scaffold(scaffold, hide_headers=True, delimiter=None, decimals=4)
cblaster.formatters.summary(session, hide_headers=False, delimiter=None, decimals=4, sort_clusters=False)

cblaster.genome_parsers module

cblaster.genome_parsers.find_fasta(gff_path)

Finds a FASTA file corresponding to the given GFF path.

cblaster.genome_parsers.find_feature(array, ftype)
cblaster.genome_parsers.find_files(paths, recurse=True, level=0)
cblaster.genome_parsers.find_gene_name(qualifiers)

Finds a gene name in a dictionary of feature qualifiers.

cblaster.genome_parsers.find_overlapping_location(feature, locations)

Finds the index of a gene location containing feature.

Parameters:
  • feature (SeqFeature) – Feature being matched to a location
  • locations (list) – Start and end coordinates of gene features
Returns:

Index of matching start/end, if any None: No match found

Return type:

int

cblaster.genome_parsers.find_regions(directives)

Looks for ##sequence-region directives in a list of GFF3 directives.

cblaster.genome_parsers.find_translation(record, feature)
cblaster.genome_parsers.iter_overlapping_features(features)
cblaster.genome_parsers.merge_cds_features(features)
cblaster.genome_parsers.organisms_to_tuples(organisms)

Generates insertion tuples from parsed organisms.

Parameters:organisms (list) – Organism dictionaries parsed by parse_file
Returns:SQLite3 database insertion tuples for all genes
Return type:list
cblaster.genome_parsers.parse_cds_features(features, record_start)
cblaster.genome_parsers.parse_file(path, to_tuples=False)

Dispatches a given file path to the correct parser given its extension.

Parameters:
  • path (str) – Path to genome file
  • to_tuples (bool) – Generate insertion tuples from parsed SeqRecords
Returns:

File name and list of SeqRecord objects corresponding to scaffolds in file

Return type:

dict

cblaster.genome_parsers.parse_gff(path)

Parses GFF and corresponding FASTA using GFFutils.

Parameters:path (str) – Path to GFF file. Should have a corresponding FASTA file of the same name with a valid FASTA suffix (.fa, .fasta, .fsa, .fna, .faa).
Returns:SeqRecord objects corresponding to each scaffold in the file
Return type:list
cblaster.genome_parsers.parse_infile(path, format)
cblaster.genome_parsers.return_file_handle(input_file)

Handles compressed and uncompressed files.

cblaster.genome_parsers.seqrecord_to_tuples(record, source)

cblaster.helpers module

cblaster.helpers.dict_to_cluster(sequences, spacing=500)

Creates a mock Cluster from a sequence dictionary.

cblaster.helpers.efetch_sequences(headers)

Retrieve protein sequences from NCBI for supplied accessions.

This function uses EFetch from the NCBI E-utilities to retrieve the sequences for all synthases specified in headers. The calls to EFetch can not exceed 500 accessions this means that the calls have to be limited. It then calls fasta.parse to parse the returned response; note that extra processing has to occur because the returned FASTA will contain a full sequence description in the header line after the accession.

Parameters:headers (list) – Valid NCBI sequence identifiers (accession, GI, etc.).
Returns:a dictionary of sequences keyed on header id
cblaster.helpers.fasta_seqrecords_to_cluster(records, spacing=500)

Creates a mock Cluster from a SeqIO FASTA parser handle.

cblaster.helpers.find_sqlite_db(path)
cblaster.helpers.form_command(parameters)

Flatten a dictionary to create a command list for use in subprocess.run()

cblaster.helpers.get_program_path(aliases)

Get programs path given a list of program names.

Parameters:aliases (list) – Program aliases, e.g. [“diamond”, “diamond-aligner”]
Raises:ValueError – Could not find any of the given aliases on system $PATH.
Returns:Path to program executable.
cblaster.helpers.get_project_root()
cblaster.helpers.get_sequences(query_file=None, query_ids=None, query_profiles=None)

Convenience function to get dictionary of query sequences from file or IDs.

Parameters:
  • query_file (str) – Path to FASTA genbank or EMBL file containing query
  • sequences. (protein) –
  • query_ids (list) – NCBI sequence accessions.
  • query_profiles (list) – Pfam profile accessions.
Raises:

ValueError – Did not receive values for query_file or query_ids.

Returns:

Dictionary of query sequences keyed on accession.

Return type:

sequences (dict)

cblaster.helpers.parse_query_sequences(query_file=None, query_ids=None, query_profiles=None)

Creates a Cluster object from query sequences.

If EMBL/GenBank, Cluster will use exact genomic coordinates parsed from file. Otherwise, a fake Cluster will be created where genes are drawn to scale, but always on positive strand and with fixed intergenic distance.

cblaster.helpers.seqrecord_to_cluster(record)

Creates a Cluster object from a SeqIO GenBank/EMBL parser handle.

cblaster.helpers.sequences_to_fasta(sequences)

Formats sequence dictionary as FASTA.

cblaster.intermediate_genes module

cblaster.local module

cblaster.local.diamond(fasta, database, max_evalue=0.01, min_identity=30, min_coverage=50, hitlist_size=5000, cpus=None, sensitivity='fast')

Launch a local DIAMOND search against a database.

Parameters:
  • fasta (str) – Path to FASTA format query file
  • database (str) – Path to DIAMOND database generated with cblaster makedb
  • max_evalue (float) – Maximum e-value threshold
  • min_identity (float) – Minimum identity (%) cutoff
  • min_coverage (float) – Minimum coverage (%) cutoff
  • hitlist_size (int) – Maximum number of hits to save
  • cpus (int) – Number of CPU threads for DIAMOND to use
Returns:

Rows from DIAMOND search result table (split by newline)

Return type:

list

cblaster.local.parse(results, min_identity=30, min_coverage=50, max_evalue=0.01)

Parse a string containing results of a BLAST/DIAMOND search.

Parameters:
  • results (list) – Results returned by diamond() or blastp()
  • min_identity (float) – Minimum identity (%) cutoff
  • min_coverage (float) – Minimum coverage (%) cutoff
  • max_evalue (float) – Maximum e-value threshold
Returns:

Hit objects representing hits that surpass scoring thresholds

Return type:

list

cblaster.local.search(database, sequences=None, query_file=None, query_ids=None, blast_file=None, dmnd_sensitivity='fast', min_identity=30, min_coverage=50, max_evalue=0.01, hitlist_size=5000, **kwargs)

Launch a new BLAST search using either DIAMOND or command-line BLASTp (remote).

Parameters:
  • database (str) – Path to DIAMOND database
  • sequences (dict) – Query sequences
  • query_file (str) – Path to FASTA file containing query sequences
  • query_ids (list) – NCBI sequence accessions
  • blast_file (str) – Path to the file blast results are written to
Raises:

ValueError – No value given for query_file or query_ids

Returns:

Parsed rows with hits from DIAMOND results table

Return type:

list

cblaster.main module

cblaster.parsers module

Argument parsers.

cblaster.parsers.add_binary_arguments(group)
cblaster.parsers.add_clustering_group(search)
cblaster.parsers.add_config_subparser(subparsers)
cblaster.parsers.add_extract_clusters_subparser(subparsers)
cblaster.parsers.add_extract_subparser(subparsers)
cblaster.parsers.add_filtering_group(search)
cblaster.parsers.add_gne_output_group(parser)
cblaster.parsers.add_gne_params_group(parser)
cblaster.parsers.add_gne_subparser(subparsers)
cblaster.parsers.add_gui_subparser(subparsers)
cblaster.parsers.add_input_group(search)
cblaster.parsers.add_intermediate_genes_group(search)
cblaster.parsers.add_makedb_subparser(subparsers)
cblaster.parsers.add_output_arguments(group)
cblaster.parsers.add_output_group(search)
cblaster.parsers.add_plot_clusters_subparser(subparsers)
cblaster.parsers.add_search_subparser(subparsers)
cblaster.parsers.add_searching_group(search)
cblaster.parsers.full_database_path(database, *acces_modes)

Make sure the database path is also correct, but do not check when providing one of the NCBI databases

Parameters:
  • database (str) – a string that is the path to the database creation files or a NCBI database identifier
  • acces_modes (List) – a list of integers of acces modes for which at least one should be allowed
Returns:

a string that is the full path to the database file or a NCBI database identifier

cblaster.parsers.full_path(file_path, *acces_modes, dir=False)

Test if a file path or directory exists and has the correct permissions and create a full path

For reading acces the file has to be pressent and there has to be read acces. For writing acces the directory with the file has to be present and there has to be write acces in that directory.

Parameters:
  • file_path (str) – relative or absoluete path to a file
  • acces_modes (List) – a list of integers of acces modes for which at least one should be allowed
  • dir (bool) – if the path is to a directory or not
Returns:

A string that is the full path to the provided file_path

Raises:
  • argparse.ArgumentTypeError when the provided path does not exist or the file does not have the correct
  • permissions to be accessed
cblaster.parsers.get_parser()
cblaster.parsers.max_cpus(value)

Ensure that the cpu’s do not go above the available amount. Setting to high cpu’s will crash database creation badly

Parameters:value (int) – number of cpu’s as provided by the user
Returns:value as an integer with 1 <= value <= multiprocessing.cpu_count()
cblaster.parsers.parse_args(args)

cblaster.plot module

cblaster.plot_clusters module

cblaster.remote module

This module handles all interaction with NCBI’s BLAST API, including launching new remote searches, polling for completion status, and retrieval of results.

cblaster.remote.check(rid)

Check completion status of a BLAST search given a Request Identifier (RID).

Parameters:

rid (str) – NCBI BLAST search request identifier (RID)

Returns:

Search has completed successfully and hits were reported

Return type:

bool

Raises:
  • ValueError – Search has failed. This is caused either by program error (in which case, NCBI requests you submit an error report with the RID) or expiration of the RID (only stored for 24 hours).
  • ValueError – Search has completed successfully, but no hits were reported.
cblaster.remote.parse(handle, sequences=None, query_file=None, query_ids=None, max_evalue=0.01, min_identity=30, min_coverage=50)

Parse Tabular results from remote BLAST search performed via API.

Since the API provides no option for returning query coverage, which is a metric we want to use for filtering hits, query sequences must be passed to this function so that their lengths can be compared to the alignment length.

Parameters:
  • handle (list) – File handle (or file handle-like) object corresponding to BLAST results. Note that this function expects an iterable of tab-delimited lines and performs no validation/error checking
  • sequences (dict) – Query sequences
  • query_file (str) – Path to FASTA format query file
  • query_ids (list) – NCBI sequence identifiers
  • max_evalue (float) – Maximum e-value
  • min_identity (float) – Minimum percent identity
  • min_coverage (float) – Minimum percent query coverage
Returns:

Hit objects corresponding to criteria passing BLAST hits

Return type:

list

cblaster.remote.poll(rid, delay=60, max_retries=-1)

Poll BLAST API with given Request Identifier (RID) until results are returned.

As per NCBI usage guidelines, this function will only poll once per minute; this is calculated each time such that wait is constant (i.e. accounts for differing response time on the status check).

Parameters:
  • rid (str) – NCBI BLAST search request identifier (RID)
  • delay (int) – Total delay (seconds) between polling
  • max_retries (int) – Maximum number of polling attempts (-1 for unlimited)
Returns:

BLAST search results split by newline

Return type:

list

cblaster.remote.retrieve(rid, hitlist_size=5000)

Retrieve BLAST results corresponding to a given Request Identifier (RID).

Parameters:
  • rid (str) – NCBI BLAST search request identifiers (RID)
  • hitlist_size (int) – Total number of hits to retrieve
Returns:

BLAST search results split by newline, with HTML parts removed

Return type:

list

cblaster.remote.search(rid=None, sequences=None, query_file=None, query_ids=None, min_identity=0.3, min_coverage=0.5, max_evalue=0.01, blast_file=None, hitlist_size=500, **kwargs)

Perform a remote BLAST search via the NCBI’s BLAST API.

This function launches a new search given a query FASTA file or list of valid NCBI identifiers, polls the API to check the completion status of the search, then retrieves and parses the results.

It is also possible to call other BLAST variants using the program argument.

Parameters:
  • rid (str) – NCBI BLAST search request identifier (RID)
  • sequences (dict) – Query sequences
  • query_file (str) – Path to FASTA format query file
  • query_ids (list) – NCBI sequence identifiers
  • min_identity (float) – Minimum percent identity
  • min_coverage (float) – Minimum percent query coverage
  • max_evalue (float) – Maximum e-value
  • blast_file (str) – Path to file blast results are written to
  • hitlist_size (int) – Number of database sequences to keep
Returns:

Hit objects corresponding to criteria passing BLAST hits

Return type:

list

cblaster.remote.start(sequences=None, query_file=None, query_ids=None, database='nr', program='blastp', megablast=False, filtering='F', evalue=0.1, nucl_reward=None, nucl_penalty=None, gap_costs='11 1', matrix='BLOSUM62', hitlist_size=500, threshold=11, word_size=6, comp_based_stats=2, entrez_query=None)

Launch a remote BLAST search using NCBI BLAST API.

Note that the HITLIST_SIZE, ALIGNMENTS and DESCRIPTIONS parameters must all be set together in order to mimic max_target_seqs behaviour.

Usage guidelines:

  1. Don’t contact server more than once every 10 seconds
  2. Don’t poll for a single RID more than once a minute
  3. Use URL parameter email/tool
  4. Run scripts weekends or 9pm-5am Eastern time on weekdays if >50 searches

For a full description of the parameters, see:

  1. BLAST API documentation<https://ncbi.github.io/blast-cloud/dev/api.html>
  2. BLAST documentation <https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp>
Parameters:
  • sequences (dict) – Query sequence dict generated by helpers.get_sequences()
  • query_file (str) – Path to a query FASTA file
  • query_ids (list) – Collection of NCBI sequence identifiers
  • database (str) – Target NCBI BLAST database
  • program (str) – BLAST variant to run
  • megablast (bool) – Enable megaBLAST option (only with BLASTn)
  • filtering (str) – Low complexity filtering
  • evalue (float) – E-value cutoff
  • nucl_reward (int) – Reward for matching bases (only with BLASTN/megaBLAST)
  • nucl_penalty (int) – Penalty for mismatched bases (only with BLASTN/megaBLAST)
  • gap_costs (str) – Gap existence and extension costs
  • matrix (str) – Scoring matrix name
  • hitlist_size (int) – Number of database sequences to keep
  • threshold (int) – Neighbouring score for initial words
  • word_size (int) – Size of word for initial matches
  • comp_based_stats (int) – Composition based statistics algorithm
  • entrez_query (str) – NCBI Entrez search term for pre-filtering the BLAST database
Returns:

Request Identifier (RID) assigned to the search rtoe (int): Request Time Of Execution (RTOE), estimated run time of the search

Return type:

rid (str)

cblaster.sql module