TF-COMB stands for "Transcription Factor Co-Occurrence using Market Basket analysis" and is a python module for identifying co-occurring TFs in regulatory regions.
Additionally, TF-COMB applies various downstream methods such as distance, orientation and network analysis.
This notebook shows how to analyze the TF-co-occurrences of predefined TFBS e.g. from ChIP-seq peaks. The data used here is obtained from the ENCODE project (https://pubmed.ncbi.nlm.nih.gov/29126249/) and consists of all TF ChIP-seq experiments for the celltype GM12878, and subset to chr4 in the human genome.
Next, the function .market_basket() is used to perform the co-occurrence analysis of the sites just added to .TFBS:
[6]:
C.market_basket()
Internal counts for 'TF_counts' were not set. Please run .count_within() to obtain TF-TF co-occurrence counts.
WARNING: No counts found in <CombObj>. Running <CombObj>.count_within() with standard parameters.
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Running with multiprocessing threads == 1. To change this, give 'threads' in the parameter of the function.
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules
As is shown in the info messages, this also runs the .count_within() function of ‘C’ (if no counts were found yet). If you want to set specific parameters for count_within, you can split these calculations such as seen here:
The CombObj ‘C’ contains a number of different visualizations for the identified TF-TF co-occurrence pairs. The default measure plotted is ‘cosine’, but many of the options of the plots can be changed as seen in the examples below:
As seen in the bubble plot above, the rules are duplicated due to the cosine measure, which scores TF1-TF2 the same way as TF2-TF1. In order to simplify the results, you can use .simplify_rules() to remove the duplicates from the .rules:
[14]:
C.simplify_rules()
[15]:
C.rules
[15]:
TF1
TF2
TF1_TF2_count
TF1_count
TF2_count
cosine
zscore
CTCF-RAD21
CTCF
RAD21
1751
2432
2241
0.750038
18.643056
RAD21-SMC3
RAD21
SMC3
1376
2241
1638
0.718192
20.314026
CTCF-SMC3
CTCF
SMC3
1361
2432
1638
0.681898
20.245177
IKZF1-IKZF2
IKZF1
IKZF2
1726
2922
2324
0.662343
11.215960
SMC3-ZNF143
SMC3
ZNF143
1060
1638
1652
0.644383
21.838431
...
...
...
...
...
...
...
...
HDAC6-JUNB
HDAC6
JUNB
1
172
1866
0.001765
-6.905016
RELB-SUZ12
RELB
SUZ12
1
2011
172
0.001700
-6.793962
ATF7-SUZ12
ATF7
SUZ12
1
2061
172
0.001680
-6.119294
RAD21-SUZ12
RAD21
SUZ12
1
2241
172
0.001611
-7.500216
IKZF2-SUZ12
IKZF2
SUZ12
1
2324
172
0.001582
-6.649424
10642 rows × 7 columns
Now, you can again use plot_bubble() to visualize the highest-scoring co-occurring pairs:
As shown in previous notebooks, we can now perform the market basket analysis on the sites within .TFBS:
[4]:
C.market_basket(threads=10)
Internal counts for 'TF_counts' were not set. Please run .count_within() to obtain TF-TF co-occurrence counts.
WARNING: No counts found in <CombObj>. Running <CombObj>.count_within() with standard parameters.
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 14%
INFO: Progress: 28%
INFO: Progress: 40%
INFO: Progress: 52%
INFO: Progress: 64%
INFO: Progress: 76%
INFO: Progress: 88%
INFO: Finished!
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules
In this example, the first rules contain more TF1_TF2_counts than the individual counts of TF1 and TF2:
[9]:
C.rules.head(1)
[9]:
TF1
TF2
TF1_TF2_count
TF1_count
TF2_count
cosine
zscore
POU3F2-SMARCA5
POU3F2
SMARCA5
239
302
241
0.885902
129.586528
How can that be? If there are multiple combinations of TF1-TF2 within the same window, these combinations can add up to more than the number of individual TF positions. This effect can be controlled by setting binarize=True in count_within. This will ensure that each co-occurrence is only counted once per window:
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 16%
INFO: Progress: 28%
INFO: Progress: 32%
INFO: Progress: 44%
INFO: Progress: 54%
INFO: Progress: 64%
INFO: Progress: 72%
INFO: Progress: 80%
INFO: Progress: 92%
INFO: Finished!
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules
[11]:
C.rules.head()
[11]:
TF1
TF2
TF1_TF2_count
TF1_count
TF2_count
cosine
zscore
ZNF121-ZNF770
ZNF121
ZNF770
1249
1242
2394
0.724335
103.405506
ZNF770-ZNF121
ZNF770
ZNF121
1249
2394
1242
0.724335
103.405506
PAX5-ZNF770
PAX5
ZNF770
863
968
2394
0.566906
82.606752
ZNF770-PAX5
ZNF770
PAX5
863
2394
968
0.566906
82.606752
SP1-SP2
SP1
SP2
570
1718
2150
0.296581
35.851669
It is also possible to play around with the maximum overlap allowed. The default is no overlap allowed, but setting max_overlap to 1 (all overlaps allowed), highlights some of the TFs which are highly overlapping:
With increasing number of transcription factors, co-occurrence analysis generally yields thousand of rules, which makes interpretation difficult. In this notebook, we will show how some different methods to select rules of interest from the original analysis. We will use the same data as within the ‘chipseq analysis’ notebook:
[1]:
import tfcomb.objects
C = tfcomb.objects.CombObj().from_pickle("../data/GM12878.pkl")
In this example, we will use the function .select_significant_rules() to select rules based on two measures. Default for this function are the “zscore” and “cosine” measures.
[3]:
selected = C.select_significant_rules()
INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold
INFO: Creating subset of TFBS and rules using thresholds
The function returns an instance of a CombObj containing only the selected rules and TFBS for the TFs within .rules:
INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold
INFO: Creating subset of TFBS and rules using thresholds
If you are only interested in the rules concerning a list of TFs, it is possible to use .select_TF_rules to subset the rules to a subset of TF names. Here, we create a subset of rules containing the TFs in TF_list:
[10]:
TF_list = ["ELK1", "CTCF", "ZNF143", "YY1"]
[11]:
selected3 = C.select_TF_rules(TF_list)
INFO: Selected 6 rules
INFO: Creating subset of object
WARNING: 1/1 names in 'TF_list' were not found within 'TF1' names: ['NOTFOUND']
WARNING: 1/1 names in 'TF_list' were not found within 'TF2' names: ['NOTFOUND']
InputError: No rules could be selected - please adjust TF_list and/or TF1/TF2 parameters.
We can now use the CombObj to directly visualize the network. The first time the function is run, the .network attribute will be created within the CombObj:
[3]:
C.plot_network()
WARNING: The .network attribute is not set yet - running build_network().
INFO: Finished! The network is found within <CombObj>.network.
[3]:
You can also change the colors of the nodes and edges via edge_cmap and node_cmap:
By default, the nodes are colored by count of binding sites, and edges are colored by the ‘cosine’ score. However, this is easily adjusted via the ‘color__by’ parameters:
It is often of interest to partition individual TFs into groups of highly connected TFs (highly co-occurring). This is possible using the cluster_network() function. Below we show different settings for performing TF clustering based on co-occurring networks.
Louvain clustering can also work with weighted edges, as seen here for cosine:
[13]:
C.cluster_network(weight="cosine")
INFO: Added 'cluster' attribute to the network attributes
[14]:
C.plot_network(color_node_by="cluster")
[14]:
[15]:
#It is also possible to plot the network without label by overwriting labels via node_attributes:C.plot_network(color_node_by="cluster",legend_size=0,node_attributes={"label":""})
The two objects contain ENCODE ChIP-seq peaks (1bp centered on the middle of the peak) from the celltypes GM12878 and K562 respectively.
[2]:
A = tfcomb.objects.CombObj(verbosity=0)
A.prefix = "GM12878"
A.TFBS_from_bed("../data/GM12878_hg38_chr4_TF_chipseq.bed")
A.market_basket()
A.set_verbosity(1) #reset verbosity to INFO
Internal counts for 'TF_counts' were not set. Please run .count_within() to obtain TF-TF co-occurrence counts.
[3]:
B = tfcomb.objects.CombObj(verbosity=0)
B.prefix = "K562"
B.TFBS_from_bed("../data/K562_hg38_chr4_TF_chipseq.bed")
B.market_basket()
Internal counts for 'TF_counts' were not set. Please run .count_within() to obtain TF-TF co-occurrence counts.
We will now use the .compare-function of CombObj ‘A’ to directly compare it with CombObj ‘B’. What you will see is that many of the TFs are different between the object and are thus removed:
INFO: Selecting rules for contrast: ('GM12878', 'K562')
INFO: measure_threshold is None; trying to calculate optimal threshold
INFO: mean_threshold is None; trying to calculate optimal threshold
INFO: Creating subset of rules using thresholds
We can also plot the network to show the pairs which are either increasing or decreasing in ‘cosine’ measure between the two cell types:
[9]:
selection.plot_network()
INFO: Finished! The network is found within <CombObj>.network.
[9]:
The strictness of the automatic threshold can be adjusted with measure_threshold_percent and mean_threshold_percent:
INFO: Selecting rules for contrast: ('GM12878', 'K562')
INFO: measure_threshold is None; trying to calculate optimal threshold
INFO: mean_threshold is None; trying to calculate optimal threshold
INFO: Creating subset of rules using thresholds
[11]:
selection2.plot_network()
INFO: Finished! The network is found within <CombObj>.network.
[11]:
It is also possible to set specific thresholds with measure_threshold and mean_threshold (these will overwrite the automatic thresholding set by measure_threshold_percent and mean_threshold_percent):
Due to the way the market basket analysis is working, the C.rules result table contains entries for every transcription factor combination present in the data.
For a binding analysis not all of these rules are of particular interest. In this notebook the method .select_significant_rules() is used to filter rules of interest. For more details on rule selection please refer to the example notebook Select rules.
[2]:
selection=C.select_significant_rules()
INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold
INFO: Creating subset of TFBS and rules using thresholds
There are two different ways to run this analysis. The automated way will be shown in this chapter. Followed by an in depth guide showing the analysis step by step in the next chapter.
Here we will start with the automated analysis for all selected rules.
[3]:
selection.analyze_distances(threads=6)# adjust threads if needed
INFO: DistObject successfully created! It can be accessed via <CombObj>.distObj
INFO: Preparing to count distances.
INFO: Setting up binding sites for counting
INFO: Calculating distances
INFO: Done finding distances! Results are found in .distances
INFO: You can now run .smooth() and/or .correct_background() to preprocess sites before finding peaks.
INFO: Or you can find peaks directly using .analyze_signal_all()
INFO: Smoothing signals with window size 3
INFO: Background correction finished! Results can be found in .corrected
INFO: Analyzing Signal with threads 6
INFO: Calculating zscores for signals
INFO: Finding preferred distances
INFO: Done analyzing signal. Results are found in .peaks
On the x-axis the distance in bp is shown. For example a distance of 100 means the anchors (depending on anchor mode, please refer to Anchor mode on this topic) are 100 bp away of each other. On the y-axis of the plot the calculated zscore per distance is shown.
For an explaination of the results please refer to the in depth guide below.
The binding distance analysis can be started from within any combObj. In this example the appropiate object is called selection, since we only want to use significant rules, not all. The analysis can also be done without pre selection.
[6]:
selection.create_distObj()
INFO: DistObject successfully created! It can be accessed via <CombObj>.distObj
As stated in the information message, the distObj should be created successfully and filled with all important information to start the distance analysis. This includes parameters set for the market masket analysis.
[7]:
selection.distObj
[7]:
<tfcomb.objects.DistObj at 0x7f87d1e6a950>
The 3792 rules (market basket results) selected earlier by select.significant_rules() are automatically passed to the distance object during creation:
[8]:
selection.distObj.rules
[8]:
TF1
TF2
TF1_TF2_count
TF1_count
TF2_count
cosine
zscore
POU3F2-SMARCA5
POU3F2
SMARCA5
239
302
241
0.885902
129.586528
SMARCA5-POU3F2
SMARCA5
POU3F2
239
241
302
0.885902
129.586528
POU2F1-SMARCA5
POU2F1
SMARCA5
263
426
241
0.820810
135.355691
SMARCA5-POU2F1
SMARCA5
POU2F1
263
241
426
0.820810
135.355691
SMARCA5-ZNF582
SMARCA5
ZNF582
172
241
195
0.793419
117.370387
...
...
...
...
...
...
...
...
NFYB-EGR2
NFYB
EGR2
24
219
1304
0.044911
3.401991
MYOG-TAF1
MYOG
TAF1
23
386
681
0.044860
3.375730
TAF1-MYOG
TAF1
MYOG
23
681
386
0.044860
3.375730
ETV1-ZBTB17
ETV1
ZBTB17
20
117
1699
0.044858
3.064184
ZBTB17-ETV1
ZBTB17
ETV1
20
1699
117
0.044858
3.064184
3778 rows × 7 columns
To unify the analysis steps between the market basket and the binding distance ones, the parameters for the:
1. minimal distance
2. maximal distance
3. maximal allowed overlap
4. directionality
5. anchor
will be set automatically when creating the distance object according to the values used for the market basket analysis step.
After creating the distObj, the first step is to count the distances. This will be done with .count_distances(). In this step it is important to decide if the directionality should be taken into account.
If directionality is taken into account the position of the transcription factors do matter. This means there is a difference between TFA -> TFB and TFB -> TFA (compare Orientation analysis notebook). Otherwise TFA -> TFB and TFB -> TFA are the same.
Per default the directionality decision is copied from the market basket step.
[10]:
selection.distObj.directional
[10]:
False
[11]:
selection.distObj.count_distances()
INFO: Preparing to count distances.
INFO: Setting up binding sites for counting
INFO: Calculating distances
INFO: Done finding distances! Results are found in .distances
INFO: You can now run .smooth() and/or .correct_background() to preprocess sites before finding peaks.
INFO: Or you can find peaks directly using .analyze_signal_all()
The resulting dataframe is constructed as followed: - columns: First the transcription factor for the pair (TF1, TF2) followed by the distances in bp - rows: each row representing one rule (pair) with the corresponding results
TFCOMB distance analysis supports three different anchor modes: inner , outer and center. The recommended mode is inner.
inner (default, recommented) is the distance between the transcription factors, it is measures as start(transcription factor B) - end(transcription factor A). If for example transcription factor B is directly adjacent to Transcription factor A, the difference will be zero.
center is the distance measured from mid of transcription factor to mid of transcription factor
outer (uncommonly used) is the distance measured including both transcription factors. end(transcription factor B) - start(transcription factor A)
Since we didn’t count directional, the values for the pairs TF1-TF2 and TF2-TF1 should be equal. For example in the DataFrame above the results for ZNF121-ZNF770 are the same as for ZNF770-ZNF121. This is not true if directionality is considered.
Note: If directionality is not considered, the duplicates can be filtered with .simplify_rules()
There are different ways to plot the distance distribution, you can for example create a kernel density estimate (KDE) plot or a histogram of the distribution.
<AxesSubplot:title={'center':'ZNF121-ZNF770'}, xlabel='Distance (bp)', ylabel='Count per distance'>
For both plots the x-axis shows the distance in bp.
For example a distance of 100 means the anchors (depending on anchor mode, please refere to Anchor mode on this topic) are 100 bp away of each other. Here is an explanation for the neg distance.
For the y-axis of the kde plot the density estimation is shown.
For the y-axis of the histogram the counts per distance is shown.
In order to collect distances from more than one basepair, e.g. in a window, it is possible to smooth the counted distances. This is done using the function .smooth() of the distObj:
[15]:
selection.distObj.smooth(window_size=3)
INFO: Smoothing signals with window size 3
The smoothed (min-max-scaled) distances are visible in the .smoothed variable:
To separate the signal from the background noise a linear regression for every signal needs to be fitted next. The result of the fitted line for every pair can be found in the .linres attribute of the distance object.
[19]:
selection.distObj.correct_background(threads=6)
INFO: Background correction finished! Results can be found in .corrected
The fitted lowess function can be plotted with the .plot() command. The red line indicates the linear regression line which was fitted during this step.
If prominence is set to zscore, the threshold is set in relation to the zscore (of the corrected signal) for each signal. For example if threshold is 2, the threshold prominence (for the score translated signal) will be a zscore of two.
INFO: Analyzing Signal with threads 6
INFO: Calculating zscores for signals
INFO: Finding preferred distances
INFO: Done analyzing signal. Results are found in .peaks
[22]:
TF1
TF2
Distance
Peak Heights
Prominences
Threshold
TF1_TF2_count
Distance_count
Distance_percent
Distance_window
ZFP82-SMARCA5
ZFP82
SMARCA5
8
6.547058
7.582435
2
198
46
23.232323
[7;9]
SMARCA5-ZFP82
SMARCA5
ZFP82
8
6.547058
7.582435
2
198
46
23.232323
[7;9]
ZNF394-SMARCA5
ZNF394
SMARCA5
5
6.940786
7.474373
2
187
74
39.572193
[4;6]
SMARCA5-ZNF394
SMARCA5
ZNF394
5
6.940786
7.474373
2
187
74
39.572193
[4;6]
POU3F2-SMARCA5
POU3F2
SMARCA5
9
6.584317
7.402690
2
234
57
24.358974
[8;10]
...
...
...
...
...
...
...
...
...
...
...
ETV5-ZSCAN22
ETV5
ZSCAN22
98
2.010752
2.010752
2
64
3
4.687500
[97;99]
ZBTB17-RFX1
ZBTB17
RFX1
1
2.010639
2.010639
2
30
1
3.333333
[1;2]
RFX1-ZBTB17
RFX1
ZBTB17
1
2.010639
2.010639
2
30
1
3.333333
[1;2]
WT1-ZIC3
WT1
ZIC3
48
2.226574
2.005542
2
51
4
7.843137
[47;49]
ZIC3-WT1
ZIC3
WT1
48
2.226574
2.005542
2
51
4
7.843137
[47;49]
8160 rows × 10 columns
The resulting dataframe is constructed as followed: - columns: First the transcription factor for the pair (TF1, TF2) followed by
1. __Distance__ in bp \
2. __Peak Heights__ height of the peak (calculated by [_find_peaks()_](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html)). \
3. __Prominences__ prominence of the peak (calculated by [_find_peaks()_](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html)). \
4. __Threshold__ minimum prominence needed to be considered as peak (in this example the [zscore](#5.B.-Zscore) of the signal). \
5. __TF1_TF2_count__ is the total number of co-occurring TF1-TF2 sites. \
6. __Distance_count__ The number of co-occurrences for the given distance. \
7. __Distance_percent__ The percent of Distance_count of the total sites (TF1_TF2_count). \
8. __Distance_window__ The distances collected for the given distance. Will be a window if smoothing was applied. \
rows: each row representing one rule (pair) at a distinct prefered binding distance with the corresponding results
The signal and peaks can now be plotted with the .plot() command.
The results in the .peaks table and the plot matches. For this ZNF121-ZNF770 3 peaks were found, which also can be seen in the plot at positions:
distance: 26
distance: 58
distance: 66
Meaning the zscore threshold with stringency of 2 identifies 3 different preferred distances for this pair. Please note, that the height of the distances differ!
#### Flat threshold
Using a flat threshold, it is possible to only find peaks with a certain number of counts as seen here:
INFO: Analyzing Signal with threads 6
INFO: Calculating zscores for signals
INFO: Finding preferred distances
INFO: Done analyzing signal. Results are found in .peaks
[25]:
TF1
TF2
Distance
Peak Heights
Prominences
Threshold
TF1_TF2_count
Distance_count
Distance_percent
Distance_window
ZNF121-ZNF770
ZNF121
ZNF770
26
125.175641
125.839985
80
1319
381
28.885519
[25;27]
ZNF770-ZNF121
ZNF770
ZNF121
26
125.175641
125.839985
80
1319
381
28.885519
[25;27]
ZNF121-ZNF770
ZNF121
ZNF770
58
121.304647
123.277229
80
1319
376
28.506444
[57;59]
ZNF770-ZNF121
ZNF770
ZNF121
58
121.304647
123.277229
80
1319
376
28.506444
[57;59]
PAX5-ZNF770
PAX5
ZNF770
56
104.564327
106.526327
80
896
325
36.272321
[55;57]
ZNF770-PAX5
ZNF770
PAX5
56
104.564327
106.526327
80
896
325
36.272321
[55;57]
[26]:
selection.distObj.plot(("ZNF121","ZNF770"))
[26]:
<AxesSubplot:title={'center':'ZNF121-ZNF770'}, xlabel='Distance (bp)', ylabel='Count per distance'>
The returned go_table is a subclass of pandas.DataFrame, which contains options for plotting the GO-enrichment results as seen here:
[5]:
type(go_table)
[5]:
tfcomb.annotation.GOAnalysis
[6]:
_=go_table.plot_bubble()
The default aspect shown is “BP” (Biological Process), but by setting aspect, either of “CC” (Cellular Component) and “MF” (Molecular Function can be shown:
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 12%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 91%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 165810 TFBS (401 unique names) within given regions
Internal counts for 'TF_counts' were not set. Please run .count_within() to obtain TF-TF co-occurrence counts.
WARNING: No counts found in <CombObj>. Running <CombObj>.count_within() with standard parameters.
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Running with multiprocessing threads == 1. To change this, give 'threads' in the parameter of the function.
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules
pairs.write_bed("TFBS_pair_positions.bed",fmt="bed")#show the content of fileimportpandasaspdpd.read_csv("TFBS_pair_positions.bed",sep="\t",header=None,nrows=5)
[6]:
0
1
2
3
4
5
0
chr4
49092715
49092730
SMARCA5
.
-
1
chr4
49092743
49092775
POU3F2
.
+
2
chr4
49092715
49092730
SMARCA5
.
-
3
chr4
49092788
49092865
POU3F2
.
+
4
chr4
49092743
49092775
POU3F2
.
+
The pairs can also be written out as ‘bedpe’ format, which contains the positions of both sites:
/workspace/.conda/tfcomb_env/lib/python3.7/site-packages/tfcomb/network.py:12: MatplotlibDeprecationWarning:
The mpl_toolkits.axes_grid module was deprecated in Matplotlib 2.1 and will be removed two minor releases later. Use mpl_toolkits.axes_grid1 and mpl_toolkits.axisartist, which provide the same functionality instead.
from mpl_toolkits.axes_grid.inset_locator import inset_axes
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 11%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 336722 TFBS (401 unique names) within given regions
# mandatory setup for plotting (except pairLines)
pairs.bigwig_path = "../data/GM12878_corrected.bw"
# optional
# globally set signal windows size counted from alignment point
# pairs.comp_plotting_tables(flank=(10, 150), # default = 100
# align="left") # one of ['center', 'left', 'right']
A heatmap of all binding pairs sorted by distance between the transcription factors.
[7]:
import numpy as np
pairs.pairMap(logNorm_cbar=None, # One of [None, "centerLogNorm", "SymLogNorm"]. Select type of colorbar normalization.
show_binding=True, # Show the TF binding positions.
flank_plot="strand", # One of ["strand", "orientation"]. Select what is shown in the flanking plots.
figsize=(6, 6), # Figure size
output=None, # Path to output file.
flank=(50, 150), # Number of bases extended from center. Default = 100 or last used size
align="left", # Pair alignment one of ["center", "left", "right"]. Default = "center" or last used.
alpha=0.3, # Alpha for center line, TF binding positions and diagonal lines
cmap="seismic", # Color palette to use
show_diagonal=True, # Show diagonal binding lines
legend_name_score="Bigwig Score", # Set name of scoring legend (upper left)
xtick_num=10, # Number of shown x-axis ticks
log=np.log1p, # Log function to apply on the scores (any provided by numpy). Default np.log1p
dpi=100) # Dots per inch
Aggregated binding of selected positions. Either select by one or more distances (between TFs) to aggregate signal or select a range of binding positions sorted by increasing distance.
[8]:
pairs.pairTrack(# either
dist=list(range(10,20)), # Select sites to aggregate by one or more distances between TF-pair.
# or
# (range will be used if both are set)
start=None, # Start of site range to be aggregated.
end=None, # End of site range to be aggregated. Sites are sorted by increasing distance.
ymin=None, # y-axis minimum. Default is min value of selected data.
ymax=None, # y-axis maximum. Default is max value of selected data.
ylabel="Bigwig score", # y-axis label
output=None, # Path to output file.
flank=(50, 100), # Number of bases extended from center. Default = 100 or last used size
align="left", # Pair alignment one of ["center", "left", "right"]. Default = "center" or last used.
dpi=70, # Dots per inch
figsize=(6, 4)) # Figure size
pairs.pairTrackAnimation(site_num=None, # Number of sites aggregated in each plot. If None will aggregate by distance.
step=10, # Step size between plots. Ignored if site_num=None
ymin=None, # y-axis minimum. Default is min value of selected data.
ymax=None, # y-axis maximum. Default is max value of selected data.
ylabel="Bigwig score", # y-axis label
interval=50, # Time between plots in milliseconds.
repeat_delay=0, # Delay until the gif is repeated in milliseconds. Not functional
repeat=True, # Whether the gif should be repeated. Only affects jupyter.
output=None, # Path to output file.
flank=(50, 150), # Number of bases extended from center. Default = 100 or last used size
align="left", # Pair alignment one of ["center", "left", "right"]. Default = "center" or last used.
dpi=70, # Dots per inch
figsize=(6, 4)) # Figure size
To explore relationships between items bought by customers a frequent pattern mining approach called market basket analysis (MBA) can be applied.
The basic idea behind TF-COMB is utilizing such a MBA in a way applicable to transcription factor (TF) binding data to uncover
co-occurrences between TFs. Here we present the classical MBA approach and how we altered it to cope with the special biological challenges.
As mentioned above we took this approach and asked if we can transfer the concept to be applied to biological data in a way that
we can answer the question: “If we observe TFA, is it likely we will also observe TFB?”
Assuming TF factors to be items posed some challenges due to the nature of the data, explained below.
What are transactions/windows in TF binding data?
The first question translating the MBA concept to TF binding data was how to define transactions/baskets. Since we assumed TF factors to be
the items we want to examine, we need a counterpart for transactions. Therefore, we constructed arbitrary windows of a given size w.
However, the human genome for example is huge and utilizing a sliding window approach to slice the genome is computational expensive as well
as resulting in many windows without valuable information content. Therefore, we assumed every genomic TF start position to denote the start
of a window, resulting in a number of windows (transactions) equals observed TFs.
TFs and their respective binding sites do not only occur side-by-side but sometimes also overlap each other. To enable full control of these
special cases, TF-COMB offers a parameter to control the amount of allowed overlap ranging from 0 (no overlap allowed) to 1 (full overlap allowed).
Overlap is calculated in percentage of the shorter binding site overlapping the larger binding site.
To assess the importance of the found rules a variety of metrics exist [1] . Besides the above mentioned support, which is calculated automatically,
TF-COMB supports cosine, confidence, lift or jaccard.
To calculate whether a pair is within a specific window of size w and to interpret predicted preferred distances,
it is important to define an anchor point from which the distance is measured.
TF-COMB supports three different anchor modes: inner, outer and center.
The default mode is inner.
inner (default) is the distance between the transcription factors, it is measures as \(start(TF B) - end(TF A)\) . If for example transcription factor B is directly adjacent to Transcription factor A, the difference will be zero.
center is the distance measured from mid of transcription factor to mid of transcription factor \(center(TF B) - center (TF A)\)
outer (uncommonly used) is the distance measured including both transcription factors. \(end(TF B) - start(TF A)\)
Function to calculate TFBS from motifs and genome fasta within the given genomic regions.
Parameters:
regions (str or tobias.utils.regions.RegionList) – Path to a .bed-file containing regions or a tobias-format RegionList object.
motifs (str or tobias.utils.motifs.MotifList) – Path to a file containing JASPAR/MEME-style motifs or a tobias-format MotifList object.
genome (str) – Path to the genome fasta-file to use for scan.
motif_pvalue (float, optional) – The pvalue threshold for the motif search. Default: 1e-05.
motif_naming (str, optional) – How to name TFs based on input motifs. Must be one of: ‘name’, ‘id’, ‘name_id’ or ‘id_name’. Default: “name”.
gc (float between 0-1, optional) – Set GC-content for the motif background model. Default: 0.5.
resolve_overlapping (str, optional) – Control how to treat overlapping occurrences of the same TF. Must be one of “merge”, “highest_score” or “off”. If “highest_score”, the highest scoring overlapping site is kept.
If “merge”, the sites are merged, keeping the information of the first site. If “off”, overlapping TFBS are kept. Default: “merge”.
extend_bp (int, optional) – Extend input regions with ‘extend_bp’ before scanning. Default: 0.
threads (int, optional) – How many threads to use for multiprocessing. Default: 1.
overwrite (boolean, optional) – Whether to overwrite existing sites within .TFBS. Default: False (sites are appended to .TFBS).
Returns:
.TFBS_from_motifs fills the objects’ .TFBS variable
Cluster TFBS based on overlap of individual binding sites. This can be used to pre-process motif-derived TFBS into TF “families” of TFs with similar motifs.
This changes the .name attribute of each site within .TFBS to represent the cluster (or the original TF name if no cluster was found).
Parameters:
threshold (float from 0-1, optional) – The threshold to set when clustering binding sites. Default: 0.5.
merge_overlapping (bool, optional) – Whether to merge overlapping sites following clustering.
If True, overlapping sites from the same cluster will be merged to one site (spanning site1-start -> site2-end).
If False, the original sites (but with updated names) will be kept in .TFBS. Default: True.
Subset .TFBS in object to specific regions or TF names. Can be used to select only a subset of TFBS (e.g. only in promoters) to run analysis on. Note: Either ‘names’ or ‘regions’ must be given - not both.
Parameters:
names (list of strings, optional) – A list of names to keep. Default: None.
regions (str or RegionList, optional) – Path to a .bed-file containing regions or a tobias-format RegionList object. Default: None.
Count co-occurrences between TFBS. This function requires .TFBS to be filled by either TFBS_from_motifs, TFBS_from_bed or TFBS_from_tobias.
This function can be followed by .market_basket to calculate association rules.
Parameters:
min_dist (int) – Minimum distance between two TFBS to be counted as co-occurring. Distances are calculated depending on the ‘anchor’ given. Default: 0.
max_dist (int) – Maximum distance between two TFBS to be counted as co-occurring. Distances are calculated depending on the ‘anchor’ given. Default: 100.
min_overlap (float between 0-1, optional) – Minimum overlap fraction needed between sites, e.g. 0 = no overlap needed, 1 = full overlap needed. Default: 0.
max_overlap (float between 0-1, optional) – Controls how much overlap is allowed for individual sites. A value of 0 indicates that overlapping TFBS will not be saved as co-occurring.
Float values between 0-1 indicate the fraction of overlap allowed (the overlap is always calculated as a fraction of the smallest TFBS). A value of 1 allows all overlaps. Default: 0 (no overlap allowed).
stranded (bool) – Whether to take strand of TFBSs into account. Default: False.
directional (bool) – Decide if direction of found pairs should be taken into account, e.g. whether “<—TF1—> <—TF2—>” is only counted as
TF1-TF2 (directional=True) or also as TF2-TF1 (directional=False). Default: False.
binarize (bool, optional) – Whether to count a TF1-TF2 more than once per window (e.g. in the case of “<TF1> <TF2> <TF2> (…)”). Default: False.
anchor (str, optional) – The anchor to use for calculating distance. Must be one of [“inner”, “outer”, “center”]
n_background (int, optional) – Number of random co-occurrence backgrounds to obtain. This number effects the runtime of .count_within, but ‘threads’ can be used to speed up background calculation. Default: 50.
threads (int, optional) – Number of threads to use. Default: 1.
Returns:
Fills the object variables .TF_counts and .pair_counts.
Get genomic locations of a particular TF pair. Requires .TFBS to be filled.
If ‘count_within’ was run, the parameters used within the latest ‘count_within’ run are used. Else, the default values of tfcomb.utils.get_pair_locations() are used.
Both options can be overwritten by setting kwargs.
Parameters:
pair (tuple) – Name of TF1, TF2 in pair.
TF1_strand (str, optional) – Strand of TF1 in pair. Default: None (strand is not taken into account).
TF2_strand (str, optional) – Strand of TF2 in pair. Default: None (strand is not taken into account).
kwargs (arguments) – Any additional arguments are passed to tfcomb.utils.get_pair_locations.
Runs market basket analysis on the TF1-TF2 counts. Requires prior run of .count_within().
Parameters:
measure (str or list of strings, optional) – The measure(s) to use for market basket analysis. Can be any of: [“cosine”, “confidence”, “lift”, “jaccard”]. Default: ‘cosine’.
threads (int, optional) – Threads to use for multiprocessing. This is passed to .count_within() in case the <CombObj> does not contain any counts yet. Default: 1.
keep_zero (bool, optional) – Whether to keep rules with 0 occurrences in .rules table. Default: False (remove 0-rules).
n_baskets (int, optional) – The number of baskets used for calculating market basket measures. Default: 1e6.
Raises:
InputError – If the measure given is not within available measures.
Simplify rules so that TF1-TF2 and TF2-TF1 pairs only occur once within .rules.
This is useful for association metrics such as ‘cosine’, where the association of TF1->TF2 equals TF2->TF1.
This function keeps the first unique pair occurring within the rules table.
Select rules based on a list of TF names. The parameters TF1/TF2 can be used to select for which TF to create the selection on (by default: both TF1 and TF2).
Parameters:
TF_list (list) – List of TF names fitting to TF1/TF2 within .rules.
TF1 (bool, optional) – Whether to subset the rules containing ‘TF_list’ TFs within “TF1”. Default: True.
TF2 (bool, optional) – Whether to subset the rules containing ‘TF_list’ TFs within “TF2”. Default: True.
reduce_TFBS (bool, optional) – Whether to reduce the .TFBS of the new object to the TFs remaining in .rules after selection. Setting this to ‘False’ will improve speed, but also increase memory consumption. Default: True.
inplace (bool, optional) – Whether to make selection on current CombObj. If False,
how (string, optional) – How to join TF1 and TF2 subset. Default: inner
Raises:
InputError – If both TF1 and TF2 are False or if no rules were selected based on input.
Returns:
If inplace == False; tfcomb.objects.CombObj() – An object containing a subset of <Combobj>.rules.
custom_list (list of strings) – List of TF pairs (e.g. a string “TF1-TF2”) fitting to TF1/TF2 combination within .rules.
reduce_TFBS (bool, optional) – Whether to reduce the .TFBS of the new object to the TFs remaining in .rules after selection. Setting this to ‘False’ will improve speed, but also increase memory consumption. Default: True.
Select the top ‘n’ rules within .rules. By default, the .rules are sorted for the measure value, so n=100 will select the top 100 highest values for the measure (e.g. cosine).
Parameters:
n (int) – The number of rules to select.
reduce_TFBS (bool, optional) – Whether to reduce the .TFBS of the new object to the TFs remaining in .rules after selection. Setting this to ‘False’ will improve speed, but also increase memory consumption. Default: True.
Make selection of rules based on distribution of x/y-measures
Parameters:
x (str, optional) – The name of the column within .rules containing the measure to be selected on. Default: ‘cosine’.
y (str, optional) – The name of the column within .rules containing the pvalue to be selected on. Default: ‘zscore’
x_threshold (float, optional) – A minimum threshold for the x-axis measure to be selected. If None, the threshold will be estimated from the data. Default: None.
x_threshold_percent (float between 0-1, optional) – If x_threshold is not given, x_threshold_percent controls the strictness of the automatic threshold selection. Default: 0.05.
y_threshold (float, optional) – A minimum threshold for the y-axis measure to be selected. If None, the threshold will be estimated from the data. Default: None.
y_threshold_percent (float between 0-1, optional) – If y_threshold is not given, y_threshold_percent controls the strictness of the automatic threshold selection. Default: 0.05.
reduce_TFBS (bool, optional) – Whether to reduce the .TFBS of the new object to the TFs remaining in .rules after selection. Setting this to ‘False’ will improve speed, but also increase memory consumption. Default: True.
plot (bool, optional) – Whether to show the ‘measure vs. pvalue’-plot or not. Default: True.
kwargs (arguments) – Additional arguments are forwarded to tfcomb.plotting.scatter
table (str or pandas.DataFrame) – A table containing data to add to .rules. If table is a string, ‘table’ is assumed to be the path to a tab-separated table containing a header line and rows of data.
merge (str) – Which information to merge - must be one of “pair”, “TF1” or “TF2”. The option “pair” is used to merge infromation about TF-TF pairs such as protein-protein-interactions.
The ‘TF1’ and ‘TF2’ can be used to include TF-specific information such as expression levels.
TF1_col (str, optional) – The column in table corresponding to “TF1” name. If merge == “TF2”, ‘TF1’ is ignored. Default: “TF1”.
TF2_col (str, optional) – The column in table corresponding to “TF2” name. If merge == “TF1”, ‘TF2’ is ignored. Default: “TF2”.
prefix (str, optional) – A prefix to add to the columns. Can be useful for adding the same information to both TF1 and TF2 (e.g. by using “TF1” and “TF2” prefixes),
or adding same-name columns from different tables. Default: None (no prefix).
Builds a TF-TF co-occurrence network for the rules within object. This is a wrapper for the tfcomb.network.build_nx_network() function,
which uses the python networkx package.
Parameters:
kwargs (arguments) – Any additional arguments are passed to tfcomb.network.build_nx_network().
Return type:
None - fills the .network attribute of the CombObj with a networkx.Graph object
Plot the rules in .rules as a network using Graphviz for python. This function is a wrapper for
building the network (using tfcomb.network.build_network) and subsequently plotting the network (using tfcomb.plotting.network).
Parameters:
color_node_by (str, optional) – A column in .rules or .TF_table to color nodes by. Default: ‘TF1_count’.
color_edge_by (str, optional) – A column in .rules to color edges by. Default: ‘cosine’.
size_edge_by (str, optional) – A column in rules to size edge width by. Default: ‘TF1_TF2_count’.
kwargs (arguments) – All other arguments are passed to tfcomb.plotting.network.
Utility function to create a DiffCombObj directly from a comparison between this CombObj and another CombObj. Requires .market_basket() run on both objects.
Runs DiffCombObj.normalize (if chosen) and DiffCombObj.calculate_foldchanges() under the hood.
Note
Set .prefix for each object to get proper naming of output log2fc columns.
Parameters:
obj_to_compare (tfcomb.objects.CombObj) – Another CombObj to compare to the current CombObj.
measure (str, optional) – The measure to compare between objects. Default: ‘cosine’.
join (string) – How to join the TF names of the two objects. Must be one of “inner” or “outer”. If “inner”, only TFs present in both objects are retained.
If “outer”, TFs from both objects are used, and any missing counts are set to 0. Default: “inner”.
normalize (bool, optional) – Whether to normalize values between objects. Default: True.
join (string) – How to join the TF names of the two objects. Must be one of “inner” or “outer”. If “inner”, only TFs present in both objects are retained.
If “outer”, TFs from both objects are used, and any missing counts are set to 0. Default: “inner”.
fillna (True) – If “join” == “outer”, there can be missing counts for individual rules. If fillna == True, these counts are set to 0. Else, the counts are NA.
Default: True.
Normalize the values for the DiffCombObj given measure (.measure) using quantile normalization.
Overwrites the <prefix>_<measure> columns in .rules with the normalized values.
Select differentially regulated rules using a MA-plot on the basis of measure and mean of measures per contrast.
Parameters:
contrast (tuple) – Name of the contrast to use in tuple format e.g. (<prefix1>,<prefix2>). Default: None (the first contrast is shown).
measure (str, optional) – The measure to use for selecting rules. Default: “cosine” (internally converted to <prefix1>/<prefix2>_<measure>_log2fc).
measure_threshold (tuple, optional) – Threshold for ‘measure’ for selecting rules. Default: None (the threshold is estimated automatically)
measure_threshold_percent (float between 0-1) – If measure_threshold is not set, measure_threshold_percent controls the strictness of the automatic threshold. If you increase this value, more differential rules will be found and vice versa. Default: 0.05.
mean_threshold (float, optional) – Threshold for ‘mean’ for selecting rules. Default: None (the threshold is estimated automatically)
mean_threshold_percent (float between 0-1) – if mean_threshold is not set, mean_threshold_percent controls the strictness of the automatic threshold. If you increase this value, more differential rules will be found and vice versa. Default: 0.05.
plot (boolean, optional) – Whether to plot the volcano plot. Default: True.
kwargs (arguments, optional) – Additional arguments are passed to tfcomb.plotting.scatter.
Returns:
An object containing a subset of <DiffCombobj>.rules
Functionality to plot a heatmap of differentially co-occurring TF pairs for a certain contrast.
Parameters:
contrast (tuple, optional) – Name of the contrast to use in tuple format e.g. (<prefix1>,<prefix2>). Default: None (the first contrast is shown).
n_rules (int, optional) – Number of rules to show from each contrast (default: 10). Note: This is the number of rules either up/down, meaning that the rules shown are n_rules * 2.
color_by (str, optional) – Default: “cosine” (converted to “<prefix1>/<prefix2>_<color_by>”)
sort_by (str, optional) – Column in .rules to sort rules by. Default: None (keep sort)
kwargs (arguments, optional) – Additional arguments are passed to tfcomb.plotting.heatmap.
Plot the network of differential co-occurring TFs.
Parameters:
contrast (tuple) – Name of the contrast to use in tuple format e.g. (<prefix1>,<prefix2>). Default: None (the first contrast is shown).
color_node_by (str, optional) – Name of measure to color node by. If column is not in .rules, the name will be internally converted to “prefix1/prefix2_<color_edge_by>”. Default: None.
size_node_by (str, optional) – Column in .rules to size_node_by. If column is not in .rules, the name will be internally converted to “prefix1/prefix2_<size_node_by>” Default: None.
color_edge_by (str, optional) – The name of measure or column to color edge by (will be internally converted to “prefix1/prefix2_<color_edge_by>”). Default: “cosine_log2fc”.
size_edge_by (str, optional) – The name of measure or column to size edge by.
kwargs (arguments) – Any additional arguments are passed to tfcomb.plotting.network.
Count distances for co_occurring TFs, can be followed by analyze_distances
to determine preferred binding distances
Parameters:
directional (bool or None, optional) – Decide if direction of found pairs should be taken into account, e.g. whether “<—TF1—> <—TF2—>” is only counted as
TF1-TF2 (directional=True) or also as TF2-TF1 (directional=False). If directional is None, self.directional will be used.
Default: None.
stranded (bool or None, optional) – Whether to take strand of TFBS into account when counting distances. If stranded is None, self.stranded will be used.
Default: None
percentage (bool, optional) – Whether to count distances as bp or percentage of longest TF1/TF2 region. If True, output will be collected in 1-percent increments from 0-1.
If False, output depends on the min/max distance values given in the DistObj. Default: False.
Scale the counted distances per pair. Saves the scaled counts into .scaled and updates .datasource.
Parameters:
how (str, optional) – How to scale the counts. Must be one of: [“min-max”, “fraction”]. If “min-max”, all counts are scaled between 0 and 1.
If “fraction”, the sum of all counts are scled between 0 and 1. Default: “min-max”.
Helper function for smoothing all rules with a given window size.
Parameters:
window_size (int, optional) – Window size for the rolling smoothing window. A bigger window produces larger flanking ranks at the sides. Default: 3.
reduce (bool, optional) – Reduce the distances to the positions with a full window, i.e. if the window size is 3, the first and last distances are removed.
This prevents flawed enrichment of peaks at the borders of the distances. Default: True.
Returns:
Fills the object variable .smoothed and updates .datasource
After background correction is done, the signal is analyzed for peaks,
indicating preferred binding distances. There can be more than one peak (more than one preferred binding distance) per
Signal. Peaks are called with scipy.signal.find_peaks().
Parameters:
threads (int) – Number of threads used. Default: 1.
method (str) – Method for transforming counts. Can be one of: “zscore” or “flat”.
If “zscore”, the zscore for the pairs is used.
If “flat”, no transformation is performed.
Default: “zscore”.
threshold (float) – The lower threshold for selecting peaks. Default: 2.
min_count (int) – Minimum count of TF1-TF2 occurrences for a preferred distance to be called. Default: 1 (all occurrences are considered).
save (str) – Path to save the peaks table to. Default: None (table is not written).
Returns:
Fills the object variable self.peaks, self.peaking_count
pair (tuple(str, str)) – TF names to plot. e.g. (“NFYA”,”NFYB”)
method (str, optional) – Plotting method. One of:
- ‘peaks’: Shows the z-score signal and any peaks found by analyze_signal_all.
- ‘correction’: Shows the fit of the lowess curve to the data.
- ‘datasource’, ‘distances’, ‘scaled’, ‘corrected’, ‘smoothed’: Shows the signal of the counts given in the .<method> table.
Default: ‘peaks’.
style (str, optional) – What style to plot the datasource in. Can be one of: [“hist”, “kde”, “line”]. Default: “hist”.
show_peaks (bool, optional) – Whether to show the identified peak(s) (if any were found) in the plot. Default: True.
save (str, optional) – Path to save the plots to. If save is None plots won’t be plotted.
Default: None
config (dict, optional) –
Config for some plotting methods.
e.g. {“nbins”:100} for histogram like plots or {“bwadjust”:0.1} for kde (densitiy) plot.
If set to None, below mentioned default parameters are used.
[kde]: bwadjust, Default: 0.1 (see seaborn.kdeplot())
Default: None
collapse (str, optional) – None if negative data should not be collapsed. [“min”,”max”,”mean”,”sum”] allowed as methods. See ._collapse_negative() for more information.
ax (plt.axis) – Plot to an existing axis object. Default: None (a new axis will be created).
color (str, optional) – Color of the plot hist/line/kde. Default: “tab:blue”.
max_dist (int, optional) – Option to set the max_dist independent of the max_dist used for counting distances. Default: None (max_dist is not changed).
kwargs (arguments) – Additional arguments are passed to plt.hist().
Plot the rules in .rules as a network using Graphviz for python. This function is a wrapper for
building the network (using tfcomb.network.build_network) and subsequently plotting the network (using tfcomb.plotting.network).
Parameters:
color_node_by (str, optional) – A column in .rules or .TF_table to color nodes by. Default: ‘TF1_count’.
color_edge_by (str, optional) – A column in .rules to color edges by. Default: ‘Distance’.
size_edge_by (str, optional) – A column in rules to size edge width by. Default: ‘TF1_TF2_count’.
**kwargs (arguments) – All other arguments are passed to tfcomb.plotting.network.
Build a network object from a table using either ‘networkx’ or ‘graph-tool’.
Parameters:
edge_table (pd.DataFrame) – Table containing rows of edges and edge information between node1/node2.
node1 (str, optional) – The column to use as node1 ID. Default: “TF1”.
node2 (str, optional) – The column to use as node2 ID. Default: “TF2”.
node_table (pandas.DataFrame) – A table of attributes to use for nodes. Default: node attributes are estimated from the columns in edge_table.
directed (bool, optional) – Whether edges are directed or not. Default: False.
multi (bool, optional) – Allow multiple edges between two vertices. NOTE: Only valid for tool == ‘networkx’. If False, the first occurrence of TF1-TF2/TF2-TF1 in the table is used. Default: False.
tool (str, optional) – Which module to use for generating network. Must be one of ‘networkx’ or ‘graph-tool’. Default: ‘networkx’.
verbosity (int, optional) – Verbosity of logging (0/1/2/3). Default: 1.
Returns:
if tool is ‘networkx’ (networkx.Graph / networkx.DiGraph / networkx.MultiGraph / networkx.MultiDiGraph - depending on parameters given.)
Annotate regions with genes from .gtf using UROPA [1].
Parameters:
regions (tobias.utils.regions.RegionList() or pandas.DataFrame) – A RegionList object with positions of genomic elements e.g. TFBS or a DataFrame containing chr/start/stop-coordinates. If DataFrame, the function assumes that the order of columns is:
‘chromosome’, ‘start’, ‘end’, ‘id’, ‘score’, ‘strand’.
gtf (str) – Path to .gtf file containing genomic elements for annotation.
config (dict, optional) – A dictionary indicating how regions should be annotated. Default is to annotate feature ‘gene’ within -10000;1000bp of the gene start. See ‘Examples’ of how to set up a custom configuration dictionary.
best (boolean) – Whether to return the best annotation or all valid annotations. Default: True (only best are kept).
threads (int, optional) – Number of threads to use for multiprocessing. Default: 1.
verbosity (int, optional) – Level of verbosity of logger. One of 0,1, 2. Default: 1.
Returns:
Dataframe including regions and annotation information (if applicable, otherwise a warning will be displayed and None is returned).
Plot scatter-plot of x/y values within table. Can also set thresholds and label values within plot.
Parameters:
table (pd.DataFrame) – A table containing columns of ‘measure’ and ‘pvalue’.
x (str) – Name of column in table containing values to map on the x-axis.
y (str) – Name of column in table containing values to map on the y-axis.
x_threshold (float, tuple of floats or None, optional) – Gives the option to visualize an x-axis threshold within plot. If None, no measure threshold is set. Default: None.
y_threshold (float, tuple of floats or None, optional) – Gives the option to visualize an y-axis threshold within plot. If None, no measure threshold is set. Default: None.
label (str or list, optional) – If None, no point labels are plotted. If “selection”, the . Default: None.
label_fontsize (float, optional) – Size of labels. Default: 9.
label_color (str, optional) – Color of labels. Default: ‘red’.
title (str, optional) – Title of plot. Default: None.
kwargs (arguments) – Any additional arguments are passed to sns.jointplot.
Plot network of a networkx object using Graphviz for python.
Parameters:
network (networkx.Graph) – A networkx Graph/DiGraph object containing the network to plot.
color_node_by (str, optional) – The name of a node attribute
color_edge_by (str, optional) – The name of an edge attribute
size_node_by (str, optional) – The name of a node attribute
size_edge_by (str, optional) – The name of an edge attribute
engine (str, optional) – The graphviz engine to use for finding network layout. Default: “sfdp”.
size (str, optional) – Size of the output figure. Default: “8,8”.
min_edge_size (float, optional) – Default: 2.
max_edge_size (float, optional) – Default: 8.
min_node_size (float, optional) – Default: 14.
max_node_size (float, optional) – Default: 20.
legend_size (int, optional) – Fontsize for legend explaining color_node_by/color_edge_by/size_node_by/size_edge_by. Set to 0 to hide legend. Default: ‘auto’.
node_border (bool, optional) – Whether to plot border on nodes. Can be useful if the node colors are very light. Default: False.
node_cmap (str, optional) – Name of colormap for node coloring. Default: None (colors are automatically chosen).
edge_cmap (str, optional) – Name of colormap for edge coloring. Default: None (colors are automatically chosen).
node_attributes (dict, optional) – Additional node attributes to apply to graph. Default: No additional attributes.
save (str, optional) – Path to save network figure to. Format is inferred from the filename - if not valid, the default format is ‘.pdf’.
verbosity (int, optional) – verbosity of the logging. Default: 1.
Raises:
TypeError – If network is not a networkx.Graph object
InputError – If any of ‘color_node_by’, ‘color_edge_by’ or ‘size_edge_by’ is not in node/edge attributes, or if ‘engine’ is not a valid graphviz engine.
Plot TFBS in genome view via the ‘DnaFeaturesViewer’ package.
Parameters:
TFBS (list) – A list of OneTFBS objects or any other object containing .chrom, .start, .end and .name variables.
window_chrom (str, optional if 'window' is given) – The chromosome of the window to show.
window_start (int, optional if 'window' is given) – The genomic coordinates for the start of the window.
window_end (int, optional if 'window' is given) – The genomic coordinates for the end of the window.
window (Object with .chr, .start, .end) – If window_chrom/window_start/window_end are not given, window can be given as an object containing .chrom, .start, .end variables
fasta (str, optional) – The path to a fasta file containing sequence information to show. Default: None.
bigwigs (str, list or dict of strings, optional) – Give the paths to bigwig signals to show within graph. Default: None.
bigwigs_sharey (bool or list, optional) – Whether bigwig signals should share y-axis range. If True, all signals will be shared.
It is also possible to give a list of bigwig indices (starting at 0), which should share y-axis values, e.g. [0,1,3] for the 1st, 2nd and 4th bigwig to share signal.
If list of lists, each lists correspond to a grouping, e.g. [[0,2], [1,3]]. Default: False.
Write the locations of (TF1, TF2) pairs to a bed-file.
Parameters:
locations (list) – The output of get_pair_locations().
outfile (str) – The path which the pair locations should be written to.
fmt (str, optional) – The format of the output file. Must be one of “bed” or “bedpe”. If “bed”, the TF1/TF2 sites are written individually (see merge option to merge sites). If “bedpe”, the sites are written in BEDPE format. Default: “bed”.
merge (bool, optional) – If fmt=”bed”, ‘merge’ controls how the locations are written out. If True, will be written as one region spanning TF1.start-TF2.end. If False, TF1/TF2 sites are written individually. Default: False.
flank (int or tuple, default 100) – Window size of TFBSpair. Adds given amount of bases in both directions counted from alignment anchor (see align) between binding sites. Use a tuple of ints to set left and right flank independently.
align (str, default 'center') –
Position from which the flanking regions are added. Must be one of ‘center’, ‘left’, ‘right’.
’center’: Midpoint between binding positions (rounded down if uneven).
‘left’: End of first binding position in pair.
‘right’: Start of second binding position in pair.
Create a heatmap of TF binding pairs sorted for distance.
Parameters:
logNorm_cbar (str, default None) –
[None, “centerLogNorm”, “SymLogNorm”]
Choose a type of normalization for the colorbar.
SymLogNorm:
Use matplotlib.colors.SymLogNorm. This does not center to 0
centerLogNorm:
Use custom matplotlib.colors.SymLogNorm from stackoverflow. Note this creates a weird colorbar.
show_binding (bool, default True) – Shows the TF binding positions as a grey background.
flank_plot (str, default 'strand') – [“strand”, “orientation”, None]
Decide if the plots flanking the heatmap should be colored for strand, strand-orientation or disabled.
output (str, default None) – Save plot to given file.
flank (int or int tuple, default None) – Bases added to both sides counted from center. Forwarded to comp_plotting_tables().
align (str, default None) – Alignment of pairs. One of [‘left’, ‘right’, ‘center’]. Forwarded to comp_plotting_tables().
alpha (float, default 0.7) – Alpha value for diagonal lines, TF binding positions and center line.
cmap (matplotlib colormap name or object, or list of colors, default 'seismic') – Color palette used in the main heatmap. Forwarded to seaborn.heatmap(cmap)
show_diagonal (boolean, default True) – Shows diagonal lines for identifying preference in binding distance.
legend_name_score (str, default 'Bigwig Score') – Name of the score legend (upper legend).
xtick_num (int, default 10) – Number of ticks shown on the x-axis. Disable ticks with None or values < 0.
log (function, default numpy.log1p) – Function applied to each row of scores. Excludes 0 and will use absolute value for negative numbers adding the sign afterwards.
Use any of the numpy.log functions. For example numpy.log, numpy.log10 or numpy.log1p. None to disable.
dpi (float, default 300) – The resolution of the figure in dots-per-inch.
Create an aggregated footprint track on the paired binding sites.
Either aggregate all sites for a specific distance or give a range of sites that should be aggregated.
If the second approach spans multiple distances the binding locations are shown as a range as well.
Parameters:
dist (int or int list, default None) – Show track for one or more distances between binding sites.
start (int, default None) – Define start of range of sites that should be aggregated. If set will ignore ‘dist’.
end (int, default None) – Define end of range of sites that should be aggregated. If set will ignore ‘dist’.
ymin (int, default None) – Y-axis minimum limit.
ymax (int, default None) – Y-axis maximum limit.
ylabel (str, default 'Bigwig signal') – Label for the y-axis.
output (str, default None) – Save plot to given file.
flank (int or int tuple, default None) – Bases added to both sides counted from center. Forwarded to comp_plotting_tables().
align (str, default None) – Alignment of pairs. One of [‘left’, ‘right’, ‘center’]. Forwarded to comp_plotting_tables().
dpi (float, default 70) – The resolution of the figure in dots-per-inch.
_ret_param (bool, default False) – Intended for internal animation use!
If True will cause the function to return a dict of function call parameters used to create plot.
Multiprocessing-safe function to scan for motif occurrences
Parameters:
genome (str or) – If string , genome will be opened
regions (RegionList()) – A RegionList() object of regions
resolve (str) – How to resolve overlapping sites from the same TF. Must be one of “off”, “highest_score” or “merge”. If “highest_score”, the highest scoring overlapping site is kept.
If “merge”, the sites are merged, keeping the information of the first site. If “off”, overlapping TFBS are kept. Default: “merge”.
Resolve overlapping sites within a list of genomic regions.
Parameters:
sites (RegionList) – A list of TFBS/regions with .chrom, .start, .end and .name information.
how (str) – How to resolve the overlapping site. Must be one of “highest_score”, “merge”. If “highest_score”, the highest scoring overlapping site is kept.
If “merge”, the sites are merged, keeping the information of the first site. Default: “merge”.
per_name (bool) – Whether to resolve overlapping only per name or across all sites. If ‘True’ overlaps are only resolved if the name of the sites are equal.
If ‘False’, overlaps are resolved across all sites. Default: True.
Overlap regions in regionlist ‘a’ with regions from regionlist ‘b’ and add
a boolean attribute to the regions in ‘a’ containing overlap status with ‘b’.
Parameters:
a (list of OneTFBS objects) – A list of objects containing genomic locations.
b (list of OneTFBS objects) – A list of objects containing genomic locations to overlap with ‘a’ regions.
att (str, optional) – The name of the attribute to add to ‘a’ objects. Default: “overlap”.
Function to get upper/lower threshold(s) based on the distribution of data. The threshold is calculated as the probability of “percent” (upper=1-percent).
Parameters:
data (list or array) – An array of data to find threshold on.
which (str) – Which threshold to calculate. Can be one of “upper”, “lower”, “both”. Default: “upper”.
percent (float between 0-1) – Controls how strict the threshold should be set in comparison to the distribution. Default: 0.05.
Return type:
If which is one of “upper”/”lower”, get_threshold returns a float. If “both”, get_threshold returns a list of two float thresholds.