tfcomb.annotation module

tfcomb.annotation.annotate_regions(regions, gtf, config=None, best=True, threads=1, verbosity=1)

Annotate regions with genes from .gtf using UROPA [1].

Parameters:

regions (tobias.utils.regions.RegionList() or pandas.DataFrame) – A RegionList object with positions of genomic elements e.g. TFBS or a DataFrame containing chr/start/stop-coordinates. If DataFrame, the function assumes that the order of columns is: ‘chromosome’, ‘start’, ‘end’, ‘id’, ‘score’, ‘strand’.
gtf (str) – Path to .gtf file containing genomic elements for annotation.
config (dict, optional) – A dictionary indicating how regions should be annotated. Default is to annotate feature ‘gene’ within -10000;1000bp of the gene start, i.e. from 10000 bp upstream to 1000 bp downstream. See ‘Examples’ for how to set up a custom configuration dictionary.
best (bool, optional) – Whether to return only the best annotation per region or all valid annotations. Default: True (only the best annotations are kept).
threads (int, optional) – Number of threads to use for multiprocessing. Default: 1.
verbosity (int, optional) – Level of verbosity of logger. One of 0, 1, 2. Default: 1.

Returns:

DataFrame including regions and annotation information. If no annotation could be made, a warning is displayed and None is returned.

Return type:

pd.DataFrame or None

References

Examples

>>> import pandas as pd
>>> from tfcomb.annotation import annotate_regions
>>> custom_config = {"queries": [{"distance": [10000, 1000],
...                               "feature_anchor": "start",
...                               "feature": "gene"}],
...                  "priority": True,
...                  "show_attributes": "all"}

# Annotate regions (data/ refers to the data directory of the tfcomb GitHub repository).
# BED files are tab-separated; header=None assumes the file has no header line.

>>> regions = pd.read_csv("data/GM12878_hg38_chr4_ATAC_peaks.bed", sep="\t", header=None)
>>> annotate_regions(regions, gtf="data/chr4_genes.gtf",
...                  config=custom_config)
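
The configuration dictionary shown above can also be assembled programmatically. The following is a minimal sketch, not part of tfcomb: build_annotation_config is a hypothetical helper, and it assumes that the first value of "distance" is the upstream distance and the second the downstream distance, as the description of the default suggests.

```python
def build_annotation_config(upstream=10000, downstream=1000,
                            feature="gene", anchor="start"):
    """Return a dict shaped like the 'custom_config' example.

    Hypothetical helper for illustration; tfcomb does not provide it.
    """
    if upstream < 0 or downstream < 0:
        raise ValueError("distances must be non-negative")
    query = {
        # Assumption: [upstream, downstream] relative to the feature anchor.
        "distance": [upstream, downstream],
        "feature_anchor": anchor,
        "feature": feature,
    }
    return {"queries": [query], "priority": True, "show_attributes": "all"}


# Same values as the example configuration.
config = build_annotation_config()

# A narrower window around the gene start.
narrow = build_annotation_config(upstream=2000, downstream=500)
```

The resulting dictionary can then be passed as config=... to annotate_regions().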

tfcomb.annotation.get_annotated_genes(regions, attribute='gene_name')

Get a list of genes from the annotated regions produced by annotate_regions().

Parameters:

regions (RegionList() or list of OneTFBS objects) – Annotated regions, as produced by annotate_regions().
attribute (str) – The name of the attribute in the 9th column of the .gtf file. Default: ‘gene_name’.
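
To illustrate what an attribute in the 9th column of a .gtf file looks like, the sketch below parses attribute strings and collects the values of one key. It is a conceptual illustration only, not how get_annotated_genes() is implemented; the function names and the example strings are hypothetical.

```python
import re

# Matches key "value" pairs as found in the 9th column of a GTF line.
_ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_attributes(attribute_column):
    """Parse a GTF attribute string into a dict of key -> value."""
    return dict(_ATTRIBUTE_PATTERN.findall(attribute_column))


def collect_attribute(attribute_columns, attribute="gene_name"):
    """Collect unique values of one attribute, keeping first-seen order."""
    seen = {}
    for column in attribute_columns:
        value = parse_gtf_attributes(column).get(attribute)
        if value is not None:
            seen.setdefault(value, None)
    return list(seen)


# Hypothetical attribute strings, not taken from a real annotation file.
example_columns = [
    'gene_id "G1"; gene_name "geneA"; gene_biotype "protein_coding";',
    'gene_id "G2"; gene_name "geneB";',
    'gene_id "G1"; gene_name "geneA";',
]
genes = collect_attribute(example_columns)
```

Passing attribute="gene_id" to collect_attribute would return the identifiers instead of the names, mirroring the role of the attribute parameter.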

class tfcomb.annotation.GOAnalysis(*args: Any, **kwargs: Any)

Bases: DataFrame

aspect_translation = {'BP': 'Biological Process', 'CC': 'Cellular Component', 'MF': 'Molecular Function'}

enrichment(genes, organism='hsapiens', background=None, propagate_counts=True, min_depth=1, verbosity=1)

Perform a GO-term enrichment based on a list of genes. This is a TF-COMB wrapper for goatools.

Parameters:

genes (list) – A list of gene ids.
organism (str, optional) – The organism from which the genes originate. Default: ‘hsapiens’.
background (list, optional) – A specific list of background gene ids to use. Default: The list of protein coding genes of the given ‘organism’.
propagate_counts (bool) – Whether to propagate counts up the tree to parent GO terms. Default: True.
min_depth (int) – Minimum depth of GO-terms to show in output table. Default: 1.
verbosity (int, optional) – Level of verbosity of logger. Default: 1.
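
The effect of propagating counts to parent terms can be shown on a toy hierarchy. The sketch below is a conceptual illustration only and does not reproduce the goatools implementation; the term names, gene names, and helper functions are all made up.

```python
from collections import Counter

# Toy hierarchy (child -> parents); the term names are made up.
PARENTS = {
    "term_leaf_1": ["term_mid"],
    "term_leaf_2": ["term_mid"],
    "term_mid": ["term_root"],
    "term_root": [],
}

# Hypothetical direct gene-to-term annotations.
DIRECT = {
    "geneA": {"term_leaf_1"},
    "geneB": {"term_leaf_2"},
    "geneC": {"term_mid"},
}


def ancestors(term):
    """Return all ancestors of a term in the toy hierarchy."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


def count_genes_per_term(annotations, propagate):
    """Count genes per term, optionally crediting each term's ancestors."""
    counts = Counter()
    for terms in annotations.values():
        credited = set(terms)
        if propagate:
            for term in terms:
                credited |= ancestors(term)
        counts.update(credited)
    return counts


direct_counts = count_genes_per_term(DIRECT, propagate=False)
propagated_counts = count_genes_per_term(DIRECT, propagate=True)
```

Without propagation, term_mid is credited with one gene and term_root with none. With propagation, each gene also counts toward every ancestor of its terms, so both term_mid and term_root are credited with all three genes.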