cblaster.classes module¶
This module stores the classes (Organism, Scaffold, Hit) used in cblaster.
-
class
cblaster.classes.
Cluster
(indices=None, subjects=None, intermediate_genes=None, query_sequence_order=None, score=None, start=None, end=None, number=None)¶ Bases:
cblaster.classes.Serializer
A cluster of subjects on the same scaffold
-
indices
¶ Indices of the subjects in the parent scaffold's list of subjects.
Type: list
-
subjects
¶ Subject objects that are in this cluster. Note: these are not serialised for this cluster.
Type: list
-
intermediate_genes
¶ Type: list
-
start
¶ The start coordinate of the cluster on the parent scaffold
Type: int
-
end
¶ The end coordinate of the cluster on the parent scaffold
Type: int
-
number
¶ number that is unique for each cluster in order to identify them
Type: int
-
NUMBER
= count(0)¶
-
calculate_score
(query_sequence_order=None)¶ Calculate the score of the current cluster
The score is based on the accumulated BLAST bitscore, the total number of hits against the query, and a synteny score if a query sequence order is provided. If a subject has multiple hits, the hit with the highest bitscore is selected for the calculation.
Parameters: query_sequence_order (list) – list of sequences in the order of the query file; only provided if the query has a meaningful order
Returns: a float
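The scoring described above can be sketched roughly as follows. This is only an illustration: the documentation does not state how the three components are weighted, so the `hit_weight` and `synteny_weight` factors, the definition of synteny used here, and the plain-tuple representation of hits are all assumptions rather than cblaster's actual implementation.

```python
def sketch_cluster_score(subjects, query_sequence_order=None,
                         hit_weight=1.0, synteny_weight=1.0):
    """Illustrative score: summed top bitscores + hit count + optional synteny.

    `subjects` is a list of lists of (query_name, bitscore) tuples, one inner
    list per subject. The weights are hypothetical, not cblaster's values.
    """
    # For each subject keep only the hit with the highest bitscore.
    top_hits = [max(hits, key=lambda h: h[1]) for hits in subjects if hits]

    total_bitscore = sum(bitscore for _, bitscore in top_hits)
    hit_count = len({query for query, _ in top_hits})

    synteny = 0
    if query_sequence_order:
        # Assumed definition: count consecutive subjects whose best queries
        # are also neighbours in the query order.
        positions = [query_sequence_order.index(q) for q, _ in top_hits
                     if q in query_sequence_order]
        synteny = sum(1 for a, b in zip(positions, positions[1:])
                      if abs(a - b) == 1)

    return float(total_bitscore + hit_weight * hit_count
                 + synteny_weight * synteny)


subjects = [[("q1", 100.0), ("q2", 50.0)], [("q2", 80.0)]]
unordered = sketch_cluster_score(subjects)
ordered = sketch_cluster_score(subjects, query_sequence_order=["q1", "q2"])
```

In this toy input the second hit of the first subject is ignored because only the top bitscore per subject contributes.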
-
classmethod
from_dict
(d, subjects=None)¶ Loads class from dict.
-
intermediate_end
¶ The end of the cluster taking the intermediate genes into account
-
intermediate_start
¶ The start of the cluster taking the intermediate genes into account
-
names
¶
-
remove_subject
(subject, scaffold_index)¶ Safely remove a subject from a cluster.
This is important when subjects become empty when recomputing a session with different thresholds
Parameters: - subject (Subject) – cblaster Subject object
- scaffold_index (int) – the index of the subject in the scaffold it is saved in
-
sequences
¶
-
to_dict
(save_subjects=False)¶ Serialises class to dict.
-
-
class
cblaster.classes.
Hit
(query, subject, identity, coverage, evalue, bitscore)¶ Bases:
cblaster.classes.Serializer
A BLAST hit identified during a cblaster search.
This class is first instantiated when parsing BLAST results, and is then updated with genomic coordinates after querying either the Identical Protein Groups (IPG) resource on NCBI, or a local JSON database.
-
query
¶ Name of query sequence.
Type: str
-
subject
¶ Name of subject sequence.
Type: str
-
identity
¶ Percentage identity (%) of hit.
Type: float
-
coverage
¶ Query coverage (%) of hit.
Type: float
-
evalue
¶ E-value of hit.
Type: float
-
bitscore
¶ Bitscore of hit.
Type: float
-
copy
(**kwargs)¶ Creates a copy of this Hit with any additional args.
-
classmethod
from_dict
(d)¶ Loads class from dict.
-
to_dict
()¶ Serialises class to dict.
-
values
(decimals=4)¶ Formats hit attributes for printing.
Parameters: decimals (int) – Total decimal places to show in score values.
Returns: List of formatted attribute strings.
-
-
class
cblaster.classes.
Organism
(name, strain, scaffolds=None)¶ Bases:
cblaster.classes.Serializer
A unique organism containing hits found in a cblaster search.
Every strain (or lack thereof) is a unique Organism, and will be reported separately in cblaster results.
-
name
¶ Organism name, typically the genus and species epithet.
Type: str
-
strain
¶ Strain name of this organism, e.g. CBS 536.65.
Type: str
-
scaffolds
¶ Scaffold objects belonging to this organism.
Type: dict
-
clusters
¶
-
classmethod
from_dict
(d)¶ Loads class from dict.
-
full_name
¶ The full name (including strain) of the organism. Note: if strain found in name, returns just name.
-
summary
(decimals=4, hide_headers=True, delimiter=None)¶
-
to_dict
()¶ Serialises class to dict.
-
total_hit_clusters
¶ Counts the total number of hit clusters in this Organism.
-
-
class
cblaster.classes.
Scaffold
(accession, clusters=None, subjects=None)¶ Bases:
cblaster.classes.Serializer
A genomic scaffold containing hits found in a cblaster search.
-
accession
¶ Name of this scaffold, typically NCBI accession.
Type: str
-
subjects
¶ Subject objects located on this scaffold.
Type: list
-
clusters
¶ Clusters of hits identified on this scaffold.
Type: list
-
add_clusters
(subject_lists, query_sequence_order=None)¶ Add clusters to this scaffold
After clusters are added they are sorted based on score
Parameters: - subject_lists (list) – a list of lists of Subject objects that form clusters
- query_sequence_order (list) – list of sequences in the order of the query file; only provided if the query has a meaningful order
-
classmethod
from_dict
(d)¶ Loads class from dict.
-
remove_subject
(subject)¶ Safely remove a subject from this scaffold, also removing it from any cluster that contains it.
Parameters: subject (Subject) – cblaster Subject object
-
summary
(hide_headers=False, delimiter=None, decimals=4)¶
-
to_dict
()¶ Serialises class to dict.
-
-
class
cblaster.classes.
Serializer
¶ Bases:
abc.ABC
JSON serialisation mixin class.
Classes that inherit from this class should implement to_dict and from_dict methods.
-
classmethod
from_dict
(d)¶ Loads class from dict.
-
classmethod
from_json
(js)¶ Instantiates class from JSON handle.
-
to_dict
()¶ Serialises class to dict.
-
to_json
(fp=None, **kwargs)¶ Serialises class to JSON.
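As a minimal sketch of the mixin pattern described for this class, the following shows how `to_json` and `from_json` could be built on top of subclass-provided `to_dict` and `from_dict`. The class name `SerializerSketch`, the `Point` subclass, and the choice to return a string when no handle is given are assumptions for illustration, not the cblaster source.

```python
import io
import json
from abc import ABC, abstractmethod


class SerializerSketch(ABC):
    """Minimal JSON mixin: subclasses supply to_dict and from_dict."""

    @abstractmethod
    def to_dict(self):
        """Return a JSON-compatible dict describing this object."""

    @classmethod
    @abstractmethod
    def from_dict(cls, d):
        """Build an instance from a dict produced by to_dict."""

    def to_json(self, fp=None, **kwargs):
        # Write to the handle if one is given, otherwise return a string.
        if fp:
            json.dump(self.to_dict(), fp, **kwargs)
            return None
        return json.dumps(self.to_dict(), **kwargs)

    @classmethod
    def from_json(cls, js):
        # Accept either a JSON string or an open file handle.
        d = json.loads(js) if isinstance(js, str) else json.load(js)
        return cls.from_dict(d)


class Point(SerializerSketch):
    """Hypothetical subclass used only to exercise the mixin."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_dict(self):
        return {"x": self.x, "y": self.y}

    @classmethod
    def from_dict(cls, d):
        return cls(d["x"], d["y"])


buffer = io.StringIO()
Point(1, 2).to_json(buffer)
buffer.seek(0)
restored = Point.from_json(buffer)
```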
-
-
class
cblaster.classes.
Session
(queries=None, sequences=None, params=None, organisms=None, query=None)¶ Bases:
cblaster.classes.Serializer
Stores the state of a cblaster search.
This class stores query proteins, search parameters, Organism objects created during searches, as well as methods for generating summary tables. It can also be dumped to/loaded from JSON for re-filtering, plotting, etc.
>>> s = Session()
>>> with open("session.json", "w") as fp:
...     s.to_json(fp)
>>> with open("session.json") as fp:
...     s2 = Session.from_json(fp)
>>> s == s2
True
-
queries
¶ Names of query sequences.
Type: list
-
params
¶ Search parameters.
Type: dict
-
organisms
¶ Organism objects created in a search.
Type: list
-
sequences
¶ Query sequence translations
Type: dict
-
format
(form, fp=None, **kwargs)¶ Generates a summary table.
Parameters: - form (str) – Type of table to generate (‘summary’ or ‘binary’).
- fp (file handle) – File handle to write to.
Raises: ValueError – form not ‘binary’ or ‘summary’
Returns: Summary table.
-
classmethod
from_dict
(d)¶ Loads class from dict.
-
classmethod
from_file
(file)¶
-
classmethod
from_files
(files)¶
-
to_dict
()¶ Serialises class to dict.
-
-
class
cblaster.classes.
Subject
(id=None, hits=None, name=None, ipg=None, start=None, end=None, strand=None, sequence=None)¶ Bases:
cblaster.classes.Serializer
A sequence representing one or more BLAST hits.
This class is instantiated during the contextual lookup stage. It is important since it allows for subject sequences which hit >1 of the query sequences, while still staying non-redundant.
-
hits
¶ Hit objects referencing this subject sequence.
Type: list
-
ipg
¶ NCBI Identical Protein Group (IPG) id.
Type: int
-
start
¶ Start of sequence on parent scaffold.
Type: int
-
end
¶ End of sequence on parent scaffold.
Type: int
-
strand
¶ Strandedness of the sequence (‘+’ or ‘-’).
Type: str
-
classmethod
from_dict
(d)¶ Loads class from dict.
-
to_dict
()¶ Serialises class to dict.
-
values
(decimals=4)¶
-
cblaster.context module¶
cblaster.database module¶
cblaster.extract module¶
cblaster.extract_clusters module¶
cblaster.formatters module¶
cblaster result formatters.
-
cblaster.formatters.
add_field_whitespace
(rows, lengths)¶ Fills table fields with whitespace to specified lengths.
-
cblaster.formatters.
binary
(session, hide_headers=False, delimiter=None, key=<built-in function len>, attr='identity', decimals=4, sort_clusters=False)¶ Generates a binary summary table from a Session object.
-
cblaster.formatters.
generate_header_string
(text, symbol='-')¶ Generates a 2-line header string with underlined text.
>>> header = generate_header_string("header string", symbol="*")
>>> print(header)
header string
*************
-
cblaster.formatters.
get_cell_values
(queries, subjects, key=<built-in function len>, attr=None)¶ Generates the values of cells in the binary matrix.
This function calls a specified key function (default: len) against all values of a specified attribute (default: None) from Hits inside Subjects which match each query. By default, this function will just count all matching Hits (i.e. len() is called on all Hits whose query attribute matches). To find maximum identities, for example, provide key=max and attr=‘identity’ to this function.
Parameters: - queries (list) – Names of query sequences.
- subjects (list) – Subject objects to generate values for.
- key (callable) – Some callable that takes a list and produces a value.
- attr (str) – A Hit attribute to calculate values with in key function.
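The interplay of `key` and `attr` described above might look like the sketch below. The stand-in objects, the function name `cell_values_sketch`, and the choice to report empty cells as 0 are assumptions made for illustration.

```python
from types import SimpleNamespace


def cell_values_sketch(queries, subjects, key=len, attr=None):
    """For each query, apply `key` to the matching hits (or one attribute)."""
    values = []
    for query in queries:
        hits = [hit for subject in subjects for hit in subject.hits
                if hit.query == query]
        if attr:
            # Reduce the hits to a single attribute, e.g. their identities.
            hits = [getattr(hit, attr) for hit in hits]
        # Assumption: an empty cell is reported as 0 so max() never fails.
        values.append(key(hits) if hits else 0)
    return values


# Stand-ins for Hit and Subject objects, for illustration only.
hit_a = SimpleNamespace(query="q1", identity=90.0)
hit_b = SimpleNamespace(query="q1", identity=70.0)
hit_c = SimpleNamespace(query="q2", identity=55.0)
subjects = [SimpleNamespace(hits=[hit_a, hit_c]), SimpleNamespace(hits=[hit_b])]

counts = cell_values_sketch(["q1", "q2", "q3"], subjects)
best = cell_values_sketch(["q1", "q2", "q3"], subjects, key=max, attr="identity")
```

With the default arguments each cell is a hit count; with `key=max` and `attr="identity"` each cell is the best identity for that query.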
-
cblaster.formatters.
get_maximum_row_lengths
(rows)¶ Finds the longest lengths of fields per column in a collection of rows.
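A rough sketch of how column widths could be measured and then used to pad fields, in the way `add_field_whitespace` is described as doing. Both function names and the use of left-justification are assumptions; the actual formatting in cblaster may differ.

```python
def maximum_row_lengths_sketch(rows):
    """Longest field length in each column of a collection of rows."""
    return [max(len(str(field)) for field in column) for column in zip(*rows)]


def add_field_whitespace_sketch(rows, lengths):
    """Left-justify every field to the width of its column."""
    return [[str(field).ljust(length) for field, length in zip(row, lengths)]
            for row in rows]


rows = [["Query", "Identity"], ["geneA", "98.5"], ["g2", "7"]]
lengths = maximum_row_lengths_sketch(rows)
padded = add_field_whitespace_sketch(rows, lengths)
```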
-
cblaster.formatters.
gne_summary
(data, hide_headers=False, delimiter=None, decimals=4)¶
-
cblaster.formatters.
humanise
(rows)¶ Formats a collection of fields as human-readable.
-
cblaster.formatters.
set_decimals
(value, decimals=4)¶
-
cblaster.formatters.
summarise_cluster
(cluster, decimals=4, hide_headers=True, delimiter=None)¶ Generates a summary table for a hit cluster.
Parameters: - cluster (Cluster) – collection of Subject objects
- decimals (int) – number of decimal points to show
- hide_headers (bool) – hide column headers in output
- delimiter (str) – delimiting string between the subjects
Returns: summary table
-
cblaster.formatters.
summarise_gne
(data, hide_headers=False, delimiter=None, decimals=4)¶
-
cblaster.formatters.
summarise_organism
(organism, hide_headers=True, delimiter=None, decimals=4)¶
-
cblaster.formatters.
summarise_scaffold
(scaffold, hide_headers=True, delimiter=None, decimals=4)¶
-
cblaster.formatters.
summary
(session, hide_headers=False, delimiter=None, decimals=4, sort_clusters=False)¶
cblaster.genome_parsers module¶
-
cblaster.genome_parsers.
find_fasta
(gff_path)¶ Finds a FASTA file corresponding to the given GFF path.
-
cblaster.genome_parsers.
find_feature
(array, ftype)¶
-
cblaster.genome_parsers.
find_files
(paths, recurse=True, level=0)¶
-
cblaster.genome_parsers.
find_gene_name
(qualifiers)¶ Finds a gene name in a dictionary of feature qualifiers.
-
cblaster.genome_parsers.
find_overlapping_location
(feature, locations)¶ Finds the index of a gene location containing feature.
Parameters: - feature (SeqFeature) – Feature being matched to a location
- locations (list) – Start and end coordinates of gene features
Returns: Index of the matching start/end, or None if no match is found
Return type: int or None
-
cblaster.genome_parsers.
find_regions
(directives)¶ Looks for ##sequence-region directives in a list of GFF3 directives.
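For orientation, a GFF3 `##sequence-region` directive has the form `##sequence-region seqid start end`. The sketch below collects such directives into a dictionary; the function name, the return shape, and the tolerance for directives without the leading `##` are assumptions, not the behaviour of `find_regions` itself.

```python
def find_regions_sketch(directives):
    """Map sequence IDs to (start, end) from sequence-region directives."""
    regions = {}
    for directive in directives:
        # Tolerate directives with or without the leading "##".
        fields = directive.lstrip("#").split()
        if len(fields) == 4 and fields[0] == "sequence-region":
            _, seqid, start, end = fields
            regions[seqid] = (int(start), int(end))
    return regions


directives = [
    "##gff-version 3",
    "##sequence-region ctg123 1 1497228",
    "sequence-region ctg456 100 2000",
]
regions = find_regions_sketch(directives)
```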
-
cblaster.genome_parsers.
find_translation
(record, feature)¶
-
cblaster.genome_parsers.
iter_overlapping_features
(features)¶
-
cblaster.genome_parsers.
merge_cds_features
(features)¶
-
cblaster.genome_parsers.
organisms_to_tuples
(organisms)¶ Generates insertion tuples from parsed organisms.
Parameters: organisms (list) – Organism dictionaries parsed by parse_file
Returns: SQLite3 database insertion tuples for all genes
Return type: list
-
cblaster.genome_parsers.
parse_cds_features
(features, record_start)¶
-
cblaster.genome_parsers.
parse_file
(path, to_tuples=False)¶ Dispatches a given file path to the correct parser given its extension.
Parameters: - path (str) – Path to genome file
- to_tuples (bool) – Generate insertion tuples from parsed SeqRecords
Returns: File name and list of SeqRecord objects corresponding to scaffolds in file
Return type: dict
-
cblaster.genome_parsers.
parse_gff
(path)¶ Parses GFF and corresponding FASTA using GFFutils.
Parameters: path (str) – Path to GFF file. Should have a corresponding FASTA file of the same name with a valid FASTA suffix (.fa, .fasta, .fsa, .fna, .faa).
Returns: SeqRecord objects corresponding to each scaffold in the file
Return type: list
-
cblaster.genome_parsers.
parse_infile
(path, format)¶
-
cblaster.genome_parsers.
return_file_handle
(input_file)¶ Handles compressed and uncompressed files.
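One common way to handle both compressed and uncompressed input is to inspect the first bytes of the file. The sketch below assumes gzip is the only compression format of interest and uses the hypothetical name `open_maybe_gzip`; the documentation does not say which formats `return_file_handle` supports or how it detects them.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(input_file):
    """Return a text-mode handle, transparently decompressing gzip files."""
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    with open(input_file, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(input_file, "rt")
    return open(input_file, "r")


# Round-trip demonstration with a temporary gzip file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genome.gbk.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("LOCUS test\n")
    with open_maybe_gzip(path) as handle:
        first_line = handle.readline()
```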
-
cblaster.genome_parsers.
seqrecord_to_tuples
(record, source)¶
cblaster.helpers module¶
-
cblaster.helpers.
dict_to_cluster
(sequences, spacing=500)¶ Creates a mock Cluster from a sequence dictionary.
-
cblaster.helpers.
efetch_sequences
(headers)¶ Retrieve protein sequences from NCBI for supplied accessions.
This function uses EFetch from the NCBI E-utilities to retrieve the sequences for all accessions specified in headers. A single call to EFetch cannot exceed 500 accessions, so the calls have to be limited in size. It then calls fasta.parse to parse the returned response; note that extra processing has to occur because the returned FASTA will contain a full sequence description in the header line after the accession.
Parameters: headers (list) – Valid NCBI sequence identifiers (accession, GI, etc.).
Returns: a dictionary of sequences keyed on header id
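The two processing steps mentioned above, limiting each request to 500 accessions and trimming the description from returned FASTA headers, can be sketched without any network access. The helper names below are hypothetical and the example accession is made up.

```python
def batch_accessions(headers, batch_size=500):
    """Yield successive slices of at most `batch_size` accessions."""
    for start in range(0, len(headers), batch_size):
        yield headers[start:start + batch_size]


def accession_from_header(header_line):
    """Keep only the accession from a FASTA header with a description."""
    return header_line.lstrip(">").split(maxsplit=1)[0]


headers = [f"ACC{i:05d}" for i in range(1234)]
sizes = [len(batch) for batch in batch_accessions(headers)]
accession = accession_from_header(">WP_012345678.1 hypothetical protein [Example sp.]")
```

Each batch would then be sent in its own EFetch request.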
-
cblaster.helpers.
fasta_seqrecords_to_cluster
(records, spacing=500)¶ Creates a mock Cluster from a SeqIO FASTA parser handle.
-
cblaster.helpers.
find_sqlite_db
(path)¶
-
cblaster.helpers.
form_command
(parameters)¶ Flatten a dictionary to create a command list for use in subprocess.run()
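A possible shape for such a flattening step is shown below. How `form_command` treats `None`, booleans, and list values is not documented here, so those conventions, along with the function name and the example flags, are assumptions.

```python
def form_command_sketch(parameters):
    """Flatten {flag: value} into [flag, value, ...] with string items."""
    command = []
    for flag, value in parameters.items():
        if value is None or value is False:
            continue  # assumption: unset options are simply omitted
        command.append(flag)
        if value is True:
            continue  # assumption: True marks a bare switch
        values = value if isinstance(value, (list, tuple)) else [value]
        command.extend(str(item) for item in values)
    return command


command = form_command_sketch({
    "--query": "q.faa",
    "--threads": 4,
    "--quiet": True,
    "--db": None,
    "--outfmt": [6, "qseqid", "sseqid"],
})
```

The resulting list of strings is in the form that `subprocess.run()` accepts as its first argument.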
-
cblaster.helpers.
get_program_path
(aliases)¶ Get a program's path given a list of program names.
Parameters: aliases (list) – Program aliases, e.g. [“diamond”, “diamond-aligner”]
Raises: ValueError – Could not find any of the given aliases on system $PATH.
Returns: Path to program executable.
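A lookup of this kind can be written with `shutil.which` from the standard library. This is a sketch under the assumption that the first alias found wins; the function name is hypothetical.

```python
import shutil


def program_path_sketch(aliases):
    """Return the path of the first alias found on $PATH."""
    for alias in aliases:
        path = shutil.which(alias)
        if path:
            return path
    raise ValueError(f"Could not find any of {aliases} on system $PATH")


try:
    program_path_sketch(["no-such-program-abc123"])
except ValueError as error:
    message = str(error)
```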
-
cblaster.helpers.
get_project_root
()¶
-
cblaster.helpers.
get_sequences
(query_file=None, query_ids=None, query_profiles=None)¶ Convenience function to get dictionary of query sequences from file or IDs.
Parameters: - query_file (str) – Path to FASTA, GenBank or EMBL file containing query protein sequences.
- query_ids (list) – NCBI sequence accessions.
- query_profiles (list) – Pfam profile accessions.
Raises: ValueError – Did not receive values for query_file or query_ids.
Returns: Dictionary of query sequences keyed on accession.
Return type: dict
-
cblaster.helpers.
parse_query_sequences
(query_file=None, query_ids=None, query_profiles=None)¶ Creates a Cluster object from query sequences.
If EMBL/GenBank, Cluster will use exact genomic coordinates parsed from file. Otherwise, a fake Cluster will be created where genes are drawn to scale, but always on positive strand and with fixed intergenic distance.
-
cblaster.helpers.
seqrecord_to_cluster
(record)¶ Creates a Cluster object from a SeqIO GenBank/EMBL parser handle.
-
cblaster.helpers.
sequences_to_fasta
(sequences)¶ Formats sequence dictionary as FASTA.
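For illustration, formatting a sequence dictionary as FASTA can be as small as the sketch below. Whether cblaster wraps long sequence lines or adds a trailing newline is not stated, so this version does neither; the function name is hypothetical.

```python
def sequences_to_fasta_sketch(sequences):
    """Format {name: sequence} as FASTA text, one record per entry."""
    return "\n".join(f">{name}\n{seq}" for name, seq in sequences.items())


fasta = sequences_to_fasta_sketch({"geneA": "MKT", "geneB": "MSL"})
```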
cblaster.hmm_search module¶
Hmmfetch and hmmsearch implementation
-
cblaster.hmm_search.
check_pfam_db
(path)¶ Check if Pfam-A db exists else download
Parameters: path – String, path where to check
-
cblaster.hmm_search.
fetch_pfam_profiles
(hmm, keys)¶ Fetch hmm profiles from db and save in a file
Parameters: - db_path – String, path where db are stored
- keys_ls – String, Path to file with acc-nr
Returns: List, strings with acc-numbers
Return type: ls_keys
-
cblaster.hmm_search.
get_pfam_accession
(dat_path: Union[str, pathlib.Path], keys: Collection[str]) → Tuple[Set[str], Set[str]]¶ Get full accession number of Pfam profiles
Looks for keys in ID and AC attributes, such that accessions can be retrieved by name or accession.
Parameters: - keys – Strings of accession profiles numbers
- db_path – Path to dat.gz file with the full acc-nr
Returns: List, string of full acc-number
Return type: key_lines
-
cblaster.hmm_search.
get_profile_names
(profiles: Collection[str]) → Collection[str]¶ Extracts names from profile HMMs using regular expressions.
-
cblaster.hmm_search.
group_profiles
(profiles: Collection[str]) → tuple¶ Group input query profile HMMs by Pfam, custom or invalid.
If the profile is found on disk, it will be loaded directly. If not found locally, but starts with PF, will try to extract from Pfam. Otherwise, marked as invalid.
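The three-way grouping described above could be sketched as follows. The function name, the return order (Pfam, custom, invalid), and the decision to check the file system before the `PF` prefix are assumptions based only on the description.

```python
import os
import tempfile
from pathlib import Path


def group_profiles_sketch(profiles):
    """Split profile names into (pfam, custom, invalid) lists."""
    pfam, custom, invalid = [], [], []
    for profile in profiles:
        if Path(profile).is_file():
            custom.append(profile)      # found on disk: load directly
        elif profile.startswith("PF"):
            pfam.append(profile)        # looks like a Pfam accession
        else:
            invalid.append(profile)
    return pfam, custom, invalid


with tempfile.TemporaryDirectory() as tmp:
    local_hmm = os.path.join(tmp, "custom.hmm")
    Path(local_hmm).write_text("HMMER3/f\n")
    pfam, custom, invalid = group_profiles_sketch(
        ["PF00001", local_hmm, "not_a_profile"]
    )
```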
-
cblaster.hmm_search.
parse_hmmer_output
(results)¶ Parse hmmsearch output
Parameters: file_list – List, string of file name of results that need parsing
Returns: list of class objects with information on query, subject, identity, coverage, e-value and bit score
Return type: hit_info
-
cblaster.hmm_search.
perform_hmmer
(fasta: str, query_profiles: List[str], pfam: str, session: cblaster.classes.Session, hmm_out: str = None) → Optional[Collection[cblaster.classes.Hit]]¶ Main function for running a HMMER search
Parameters: - fasta – Path to database FASTA file
- query_profiles – Pfam names/accessions, or paths to profile HMM files
- pfam – Path to folder containing Pfam database
- session – cblaster search session
Returns: List of class objects with the hits
-
cblaster.hmm_search.
read_profiles
(files: Collection[str]) → Collection[str]¶ Reads in profile HMMs from a list of files.
-
cblaster.hmm_search.
run_hmmsearch
(fasta, query)¶ Run the hmmsearch command
Parameters: - path_pfam – String, Path to the pfam database
- path_db – String, Path to db that will be searched for profiles
- ls_keys – List, string of pfam profile names
Returns: List, String of result file names
Return type: temp_res
cblaster.intermediate_genes module¶
cblaster.local module¶
-
cblaster.local.
diamond
(fasta, database, max_evalue=0.01, min_identity=30, min_coverage=50, hitlist_size=5000, cpus=None, sensitivity='fast')¶ Launch a local DIAMOND search against a database.
Parameters: - fasta (str) – Path to FASTA format query file
- database (str) – Path to DIAMOND database generated with cblaster makedb
- max_evalue (float) – Maximum e-value threshold
- min_identity (float) – Minimum identity (%) cutoff
- min_coverage (float) – Minimum coverage (%) cutoff
- hitlist_size (int) – Maximum number of hits to save
- cpus (int) – Number of CPU threads for DIAMOND to use
Returns: Rows from DIAMOND search result table (split by newline)
Return type: list
-
cblaster.local.
parse
(results, min_identity=30, min_coverage=50, max_evalue=0.01)¶ Parse a string containing results of a BLAST/DIAMOND search.
Parameters: - results (list) – Results returned by diamond() or blastp()
- min_identity (float) – Minimum identity (%) cutoff
- min_coverage (float) – Minimum coverage (%) cutoff
- max_evalue (float) – Maximum e-value threshold
Returns: Hit objects representing hits that surpass scoring thresholds
Return type: list
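The threshold filtering can be illustrated with a small stand-alone parser. The six-column layout (query, subject, identity, coverage, e-value, bitscore), the `HitRow` tuple, and the use of strict comparisons are assumptions for this sketch; the actual column order depends on how the search was invoked.

```python
from collections import namedtuple

HitRow = namedtuple("HitRow", "query subject identity coverage evalue bitscore")


def parse_rows_sketch(results, min_identity=30, min_coverage=50, max_evalue=0.01):
    """Keep tab-delimited rows that pass all three thresholds."""
    hits = []
    for row in results:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comments
        query, subject, *scores = row.rstrip("\n").split("\t")
        identity, coverage, evalue, bitscore = map(float, scores[:4])
        if (identity > min_identity and coverage > min_coverage
                and evalue < max_evalue):
            hits.append(HitRow(query, subject, identity, coverage, evalue, bitscore))
    return hits


rows = [
    "q1\tsubjA\t85.2\t97.0\t1e-50\t310.5",
    "q1\tsubjB\t25.0\t90.0\t1e-10\t60.1",   # identity too low
    "q2\tsubjC\t40.0\t45.0\t1e-20\t120.0",  # coverage too low
]
hits = parse_rows_sketch(rows)
```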
-
cblaster.local.
search
(database, sequences=None, query_file=None, query_ids=None, blast_file=None, dmnd_sensitivity='fast', min_identity=30, min_coverage=50, max_evalue=0.01, hitlist_size=5000, **kwargs)¶ Launch a new BLAST search using either DIAMOND or command-line BLASTp (remote).
Parameters: - database (str) – Path to DIAMOND database
- sequences (dict) – Query sequences
- query_file (str) – Path to FASTA file containing query sequences
- query_ids (list) – NCBI sequence accessions
- blast_file (str) – Path to the file blast results are written to
Raises: ValueError – No value given for query_file or query_ids
Returns: Parsed rows with hits from DIAMOND results table
Return type: list
cblaster.main module¶
cblaster.parsers module¶
Argument parsers.
-
cblaster.parsers.
add_binary_arguments
(group)¶
-
cblaster.parsers.
add_clustering_group
(search)¶
-
cblaster.parsers.
add_config_subparser
(subparsers)¶
-
cblaster.parsers.
add_extract_clusters_subparser
(subparsers)¶
-
cblaster.parsers.
add_extract_subparser
(subparsers)¶
-
cblaster.parsers.
add_filtering_group
(search)¶
-
cblaster.parsers.
add_gne_output_group
(parser)¶
-
cblaster.parsers.
add_gne_params_group
(parser)¶
-
cblaster.parsers.
add_gne_subparser
(subparsers)¶
-
cblaster.parsers.
add_gui_subparser
(subparsers)¶
-
cblaster.parsers.
add_input_group
(search)¶
-
cblaster.parsers.
add_intermediate_genes_group
(search)¶
-
cblaster.parsers.
add_makedb_subparser
(subparsers)¶
-
cblaster.parsers.
add_output_arguments
(group)¶
-
cblaster.parsers.
add_output_group
(search)¶
-
cblaster.parsers.
add_plot_clusters_subparser
(subparsers)¶
-
cblaster.parsers.
add_search_subparser
(subparsers)¶
-
cblaster.parsers.
add_searching_group
(search)¶
-
cblaster.parsers.
full_database_path
(database, *acces_modes)¶ Make sure the database path is correct, but do not check it when one of the NCBI databases is provided
Parameters: - database (str) – a string that is the path to the database creation files or an NCBI database identifier
- acces_modes (List) – a list of integer access modes, at least one of which should be allowed
Returns: a string that is the full path to the database file or an NCBI database identifier
-
cblaster.parsers.
full_path
(file_path, *acces_modes, dir=False)¶ Test if a file path or directory exists and has the correct permissions, and create a full path
For read access, the file has to be present and readable. For write access, the directory containing the file has to be present and writable.
Parameters: - file_path (str) – relative or absolute path to a file
- acces_modes (List) – a list of integer access modes, at least one of which should be allowed
- dir (bool) – whether the path points to a directory
Returns: A string that is the full path to the provided file_path
Raises: argparse.ArgumentTypeError – the provided path does not exist or the file does not have the correct permissions to be accessed
-
cblaster.parsers.
get_parser
()¶
-
cblaster.parsers.
max_cpus
(value)¶ Ensure that the number of CPUs does not exceed the number available. Setting the number of CPUs too high will crash database creation
Parameters: value (int) – number of CPUs as provided by the user
Returns: value as an integer with 1 <= value <= multiprocessing.cpu_count()
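A validator like this is typically passed to argparse as a `type=` callable. The sketch below clamps the requested value into the documented range; the function name, the `--cpus` flag, and the error raised for non-integer input are assumptions.

```python
import argparse
import multiprocessing


def max_cpus_sketch(value):
    """argparse `type=` callable that clamps to 1..cpu_count()."""
    try:
        requested = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer") from None
    return max(1, min(requested, multiprocessing.cpu_count()))


parser = argparse.ArgumentParser()
parser.add_argument("--cpus", type=max_cpus_sketch, default=1)
args = parser.parse_args(["--cpus", "100000"])
```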
-
cblaster.parsers.
parse_args
(args)¶
cblaster.plot module¶
cblaster.plot_clusters module¶
cblaster.remote module¶
This module handles all interaction with NCBI’s BLAST API, including launching new remote searches, polling for completion status, and retrieval of results.
-
cblaster.remote.
check
(rid)¶ Check completion status of a BLAST search given a Request Identifier (RID).
Parameters: rid (str) – NCBI BLAST search request identifier (RID)
Returns: Search has completed successfully and hits were reported
Return type: bool
Raises: - ValueError – Search has failed. This is caused either by program error (in which case, NCBI requests you submit an error report with the RID) or expiration of the RID (only stored for 24 hours).
- ValueError – Search has completed successfully, but no hits were reported.
-
cblaster.remote.
parse
(handle, sequences=None, query_file=None, query_ids=None, max_evalue=0.01, min_identity=30, min_coverage=50)¶ Parse Tabular results from remote BLAST search performed via API.
Since the API provides no option for returning query coverage, which is a metric we want to use for filtering hits, query sequences must be passed to this function so that their lengths can be compared to the alignment length.
Parameters: - handle (list) – File handle (or file handle-like) object corresponding to BLAST results. Note that this function expects an iterable of tab-delimited lines and performs no validation/error checking
- sequences (dict) – Query sequences
- query_file (str) – Path to FASTA format query file
- query_ids (list) – NCBI sequence identifiers
- max_evalue (float) – Maximum e-value
- min_identity (float) – Minimum percent identity
- min_coverage (float) – Minimum percent query coverage
Returns: Hit objects corresponding to criteria passing BLAST hits
Return type: list
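To illustrate why query sequences are needed, the sketch below derives query coverage from one tabular BLAST line and the query length. It assumes the standard 12-column tabular layout (with qstart and qend in the seventh and eighth columns) and measures the aligned query span; the function name and example line are made up.

```python
def query_coverage_sketch(line, sequences):
    """Percent of the query covered by one tabular BLAST hit."""
    fields = line.rstrip("\n").split("\t")
    query = fields[0]
    qstart, qend = int(fields[6]), int(fields[7])
    return (qend - qstart + 1) / len(sequences[query]) * 100


sequences = {"q1": "M" * 200}
line = "q1\tWP_000001.1\t62.5\t120\t45\t0\t11\t130\t5\t124\t1e-40\t150"
coverage = query_coverage_sketch(line, sequences)
```

Here the hit spans 120 of the 200 query residues, so it would pass the default min_coverage of 50.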
-
cblaster.remote.
poll
(rid, delay=60, max_retries=-1)¶ Poll BLAST API with given Request Identifier (RID) until results are returned.
As per NCBI usage guidelines, this function will only poll once per minute; this is calculated each time such that wait is constant (i.e. accounts for differing response time on the status check).
Parameters: - rid (str) – NCBI BLAST search request identifier (RID)
- delay (int) – Total delay (seconds) between polling
- max_retries (int) – Maximum number of polling attempts (-1 for unlimited)
Returns: BLAST search results split by newline
Return type: list
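The constant-interval polling described above can be sketched with injectable callables, so that it runs without contacting NCBI. The function name, the `check`/`retrieve`/`sleep`/`clock` parameters, and the error raised when retries run out are assumptions for illustration.

```python
import time


def poll_sketch(rid, check, retrieve, delay=60, max_retries=-1,
                sleep=time.sleep, clock=time.monotonic):
    """Call `check(rid)` every `delay` seconds until it returns True."""
    attempts = 0
    while max_retries < 0 or attempts < max_retries:
        started = clock()
        if check(rid):
            return retrieve(rid)
        attempts += 1
        # Subtract the time spent on the status check so that the
        # interval between polls stays close to `delay`.
        sleep(max(0.0, delay - (clock() - started)))
    raise RuntimeError(f"No results for {rid} after {attempts} attempts")


# Demonstration with stand-in callables instead of real API requests.
statuses = iter([False, False, True])
waits = []
results = poll_sketch(
    "EXAMPLE_RID",
    check=lambda rid: next(statuses),
    retrieve=lambda rid: ["row 1", "row 2"],
    delay=60,
    sleep=waits.append,
    clock=lambda: 0.0,
)
```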
-
cblaster.remote.
retrieve
(rid, hitlist_size=5000)¶ Retrieve BLAST results corresponding to a given Request Identifier (RID).
Parameters: - rid (str) – NCBI BLAST search request identifiers (RID)
- hitlist_size (int) – Total number of hits to retrieve
Returns: BLAST search results split by newline, with HTML parts removed
Return type: list
-
cblaster.remote.
search
(rid=None, sequences=None, query_file=None, query_ids=None, min_identity=0.3, min_coverage=0.5, max_evalue=0.01, blast_file=None, hitlist_size=500, **kwargs)¶ Perform a remote BLAST search via the NCBI’s BLAST API.
This function launches a new search given a query FASTA file or list of valid NCBI identifiers, polls the API to check the completion status of the search, then retrieves and parses the results.
It is also possible to call other BLAST variants using the program argument.
Parameters: - rid (str) – NCBI BLAST search request identifier (RID)
- sequences (dict) – Query sequences
- query_file (str) – Path to FASTA format query file
- query_ids (list) – NCBI sequence identifiers
- min_identity (float) – Minimum percent identity
- min_coverage (float) – Minimum percent query coverage
- max_evalue (float) – Maximum e-value
- blast_file (str) – Path to file blast results are written to
- hitlist_size (int) – Number of database sequences to keep
Returns: Hit objects corresponding to criteria passing BLAST hits
Return type: list
-
cblaster.remote.
start
(sequences=None, query_file=None, query_ids=None, database='nr', program='blastp', megablast=False, filtering='F', evalue=0.1, nucl_reward=None, nucl_penalty=None, gap_costs='11 1', matrix='BLOSUM62', hitlist_size=500, threshold=11, word_size=6, comp_based_stats=2, entrez_query=None)¶ Launch a remote BLAST search using NCBI BLAST API.
Note that the HITLIST_SIZE, ALIGNMENTS and DESCRIPTIONS parameters must all be set together in order to mimic max_target_seqs behaviour.
Usage guidelines:
- Don’t contact server more than once every 10 seconds
- Don’t poll for a single RID more than once a minute
- Use URL parameter email/tool
- Run scripts weekends or 9pm-5am Eastern time on weekdays if >50 searches
For a full description of the parameters, see:
- BLAST API documentation <https://ncbi.github.io/blast-cloud/dev/api.html>
- BLAST documentation <https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp>
Parameters: - sequences (dict) – Query sequence dict generated by helpers.get_sequences()
- query_file (str) – Path to a query FASTA file
- query_ids (list) – Collection of NCBI sequence identifiers
- database (str) – Target NCBI BLAST database
- program (str) – BLAST variant to run
- megablast (bool) – Enable megaBLAST option (only with BLASTn)
- filtering (str) – Low complexity filtering
- evalue (float) – E-value cutoff
- nucl_reward (int) – Reward for matching bases (only with BLASTN/megaBLAST)
- nucl_penalty (int) – Penalty for mismatched bases (only with BLASTN/megaBLAST)
- gap_costs (str) – Gap existence and extension costs
- matrix (str) – Scoring matrix name
- hitlist_size (int) – Number of database sequences to keep
- threshold (int) – Neighbouring score for initial words
- word_size (int) – Size of word for initial matches
- comp_based_stats (int) – Composition based statistics algorithm
- entrez_query (str) – NCBI Entrez search term for pre-filtering the BLAST database
Returns: - rid (str) – Request Identifier (RID) assigned to the search
- rtoe (int) – Request Time Of Execution (RTOE), estimated run time of the search
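As a rough sketch of the two ends of this call, the code below assembles a subset of the request parameters (setting HITLIST_SIZE, ALIGNMENTS and DESCRIPTIONS together, as noted above) and extracts the RID and RTOE from a response. The helper names and the example response text are made up for illustration, and no request is actually sent.

```python
import re


def build_put_parameters(query, database="nr", program="blastp",
                         evalue=0.1, hitlist_size=500):
    """Assemble a subset of BLAST URL API parameters for a Put request."""
    return {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
        "EXPECT": evalue,
        # Set all three together to mimic max_target_seqs behaviour.
        "HITLIST_SIZE": hitlist_size,
        "ALIGNMENTS": hitlist_size,
        "DESCRIPTIONS": hitlist_size,
    }


def parse_rid_rtoe(response_text):
    """Extract the RID and RTOE from the text of a Put response."""
    rid = re.search(r"^\s*RID = (\S+)", response_text, re.MULTILINE)
    rtoe = re.search(r"^\s*RTOE = (\d+)", response_text, re.MULTILINE)
    if not (rid and rtoe):
        raise ValueError("Response did not contain RID and RTOE")
    return rid.group(1), int(rtoe.group(1))


params = build_put_parameters(">q1\nMKTAYIAK")
# Made-up response fragment in the QBlastInfo layout, for illustration.
rid, rtoe = parse_rid_rtoe(
    "QBlastInfoBegin\n    RID = ABC123XYZ\n    RTOE = 18\nQBlastInfoEnd"
)
```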