Binding distance analysis

In this example notebook, a binding distance analysis will be shown and explained step by step.

To start such an analysis, a market basket analysis is required first. The data used is the same as in the TFBS from motif.

For more details on how to perform the market basket step please have a look at the TFBS from motif or ChIP-seq analysis examples.

Content:

Prepare object
Select rules
Automated analysis
Step-by-step:
1. Create object
2. Count distances
3. Smoothing counts
4. Scale counts
5. Correct background
6. Analyse signal
- z-score
- Flat
Further downstream analysis
1. Analyzing hubs
2. Signal classification

Prepare a CombObj

[1]:

import tfcomb.objects

C = tfcomb.objects.CombObj()
C.TFBS_from_motifs(regions="../data/GM12878_hg38_chr4_ATAC_peaks.bed",
                   motifs="../data/HOCOMOCOv11_HUMAN_motifs.txt",
                   genome="../data/hg38_chr4.fa.gz",
                   threads=4)
C.count_within(max_overlap=0.0, threads=4)
C.market_basket()
C.rules.head()

INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 11%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 71%
INFO: Progress: 82%
INFO: Progress: 91%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 165810 TFBS (401 unique names) within given regions
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 16%
INFO: Progress: 20%
INFO: Progress: 32%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 62%
INFO: Progress: 72%
INFO: Progress: 80%
INFO: Progress: 92%
INFO: Finished!
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules

[1]:

	TF1	TF2	TF1_TF2_count	TF1_count	TF2_count	cosine	zscore
POU3F2-SMARCA5	POU3F2	SMARCA5	239	302	241	0.885902	129.586528
SMARCA5-POU3F2	SMARCA5	POU3F2	239	241	302	0.885902	129.586528
POU2F1-SMARCA5	POU2F1	SMARCA5	263	426	241	0.820810	135.355691
SMARCA5-POU2F1	SMARCA5	POU2F1	263	241	426	0.820810	135.355691
SMARCA5-ZNF582	SMARCA5	ZNF582	172	241	195	0.793419	117.370387

Selecting rules

Due to the way the market basket analysis is working, the C.rules result table contains entries for every transcription factor combination present in the data.

For a binding analysis not all of these rules are of particular interest. In this notebook the method .select_significant_rules() is used to filter rules of interest. For more details on rule selection please refer to the example notebook Select rules.

[2]:

selection = C.select_significant_rules()

INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold
INFO: Creating subset of TFBS and rules using thresholds

../_images/examples_Distance_analysis_7_1.png

Method 1: Automated analysis

There are two different ways to run this analysis. The automated way will be shown in this chapter. Followed by an in depth guide showing the analysis step by step in the next chapter.

Here we will start with the automated analysis for all selected rules.

[3]:

selection.analyze_distances(threads=6) # adjust threads if needed

INFO: DistObject successfully created! It can be accessed via <CombObj>.distObj
INFO: Preparing to count distances.
INFO: Setting up binding sites for counting
INFO: Calculating distances
INFO: Done finding distances! Results are found in .distances
INFO: You can now run .smooth() and/or .correct_background() to preprocess sites before finding peaks.
INFO: Or you can find peaks directly using .analyze_signal_all()
INFO: Smoothing signals with window size 3
INFO: Background correction finished! Results can be found in .corrected
INFO: Analyzing Signal with threads 6
INFO: Calculating zscores for signals
INFO: Finding preferred distances
INFO: Done analyzing signal. Results are found in .peaks

[4]:

selection.distObj.peaks.loc[(selection.distObj.peaks.TF1 == "ZNF121") & (selection.distObj.peaks.TF2 == "ZNF770")]

[4]:

	TF1	TF2	Distance	Peak Heights	Prominences	Threshold	TF1_TF2_count	Distance_count	Distance_percent	Distance_window
ZNF121-ZNF770	ZNF121	ZNF770	26	4.043651	4.443567	2	1319	381	28.885519	[25;27]
ZNF121-ZNF770	ZNF121	ZNF770	58	3.906961	4.353073	2	1319	376	28.506444	[57;59]
ZNF121-ZNF770	ZNF121	ZNF770	66	2.068096	2.270661	2	1319	220	16.679303	[65;67]

[5]:

_ = selection.distObj.plot(("ZNF121", "ZNF770"))

../_images/examples_Distance_analysis_13_0.png

On the x-axis the distance in bp is shown. For example a distance of 100 means the anchors (depending on anchor mode, please refer to Anchor mode on this topic) are 100 bp away of each other. On the y-axis of the plot the calculated zscore per distance is shown.

For an explaination of the results please refer to the in depth guide below.

Method 2: Step-by-Step Analysis

Besides the automated way, the analysis can be done step-by-step. This chapter is an detailed guide, covering all 5 major steps.

1. Create a distance object

The binding distance analysis can be started from within any combObj. In this example the appropiate object is called selection, since we only want to use significant rules, not all. The analysis can also be done without pre selection.

[6]:

selection.create_distObj()

INFO: DistObject successfully created! It can be accessed via <CombObj>.distObj

As stated in the information message, the distObj should be created successfully and filled with all important information to start the distance analysis. This includes parameters set for the market masket analysis.

[7]:

selection.distObj

[7]:

<tfcomb.objects.DistObj at 0x7f87d1e6a950>

The 3792 rules (market basket results) selected earlier by select.significant_rules() are automatically passed to the distance object during creation:

[8]:

selection.distObj.rules

[8]:

	TF1	TF2	TF1_TF2_count	TF1_count	TF2_count	cosine	zscore
POU3F2-SMARCA5	POU3F2	SMARCA5	239	302	241	0.885902	129.586528
SMARCA5-POU3F2	SMARCA5	POU3F2	239	241	302	0.885902	129.586528
POU2F1-SMARCA5	POU2F1	SMARCA5	263	426	241	0.820810	135.355691
SMARCA5-POU2F1	SMARCA5	POU2F1	263	241	426	0.820810	135.355691
SMARCA5-ZNF582	SMARCA5	ZNF582	172	241	195	0.793419	117.370387
...	...	...	...	...	...	...	...
NFYB-EGR2	NFYB	EGR2	24	219	1304	0.044911	3.401991
MYOG-TAF1	MYOG	TAF1	23	386	681	0.044860	3.375730
TAF1-MYOG	TAF1	MYOG	23	681	386	0.044860	3.375730
ETV1-ZBTB17	ETV1	ZBTB17	20	117	1699	0.044858	3.064184
ZBTB17-ETV1	ZBTB17	ETV1	20	1699	117	0.044858	3.064184

3778 rows × 7 columns

To unify the analysis steps between the market basket and the binding distance ones, the parameters for the:
minimal distance
maximal distance
maximal allowed overlap
directionality
anchor

will be set automatically when creating the distance object according to the values used for the market basket analysis step.

[9]:

selection.distObj.max_overlap

[9]:

0.0

2. Distance counting

After creating the distObj, the first step is to count the distances. This will be done with .count_distances(). In this step it is important to decide if the directionality should be taken into account.

If directionality is taken into account the position of the transcription factors do matter. This means there is a difference between TFA -> TFB and TFB -> TFA (compare Orientation analysis notebook). Otherwise TFA -> TFB and TFB -> TFA are the same.

Per default the directionality decision is copied from the market basket step.

[10]:

selection.distObj.directional

[10]:

False

[11]:

selection.distObj.count_distances()

INFO: Preparing to count distances.
INFO: Setting up binding sites for counting
INFO: Calculating distances
INFO: Done finding distances! Results are found in .distances
INFO: You can now run .smooth() and/or .correct_background() to preprocess sites before finding peaks.
INFO: Or you can find peaks directly using .analyze_signal_all()

The resulting dataframe is constructed as followed: - columns: First the transcription factor for the pair (TF1, TF2) followed by the distances in bp - rows: each row representing one rule (pair) with the corresponding results

[12]:

selection.distObj.distances

[12]:

	TF1	TF2	0	1	2	3	4	5	6	7	...	91	92	93	94	95	96	97	98	99	100
POU3F2-SMARCA5	POU3F2	SMARCA5	2	0	0	7	0	15	0	1	...	0	0	7	1	6	0	0	3	0	3
SMARCA5-POU3F2	SMARCA5	POU3F2	2	0	0	7	0	15	0	1	...	0	0	7	1	6	0	0	3	0	3
POU2F1-SMARCA5	POU2F1	SMARCA5	0	2	0	39	0	0	12	1	...	4	0	5	0	0	4	0	1	0	1
SMARCA5-POU2F1	SMARCA5	POU2F1	0	2	0	39	0	0	12	1	...	4	0	5	0	0	4	0	1	0	1
SMARCA5-ZNF582	SMARCA5	ZNF582	0	0	38	0	36	1	1	2	...	0	4	0	2	0	0	2	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
NFYB-EGR2	NFYB	EGR2	2	0	1	0	0	0	1	0	...	0	0	0	0	0	0	1	0	0	0
MYOG-TAF1	MYOG	TAF1	0	1	1	0	1	0	0	0	...	0	0	0	0	0	0	0	0	1	0
TAF1-MYOG	TAF1	MYOG	0	1	1	0	1	0	0	0	...	0	0	0	0	0	0	0	0	1	0
ETV1-ZBTB17	ETV1	ZBTB17	0	0	0	0	0	0	1	1	...	0	0	0	1	1	0	1	0	0	0
ZBTB17-ETV1	ZBTB17	ETV1	0	0	0	0	0	0	1	1	...	0	0	0	1	1	0	1	0	0	0

3778 rows × 103 columns

Additional options for counting distances

Negative distances

Negative distances indicate overlapping. Caveat: this is basepair resolution and strongly dependend on motif length.

Negative distance can occur if the distance anchor is set to inner mode (see Anchor mode) and overlapping is allowed.

Anchor mode

TFCOMB distance analysis supports three different anchor modes: inner , outer and center. The recommended mode is inner.

inner (default, recommented) is the distance between the transcription factors, it is measures as start(transcription factor B) - end(transcription factor A). If for example transcription factor B is directly adjacent to Transcription factor A, the difference will be zero.
center is the distance measured from mid of transcription factor to mid of transcription factor
outer (uncommonly used) is the distance measured including both transcription factors. end(transcription factor B) - start(transcription factor A)

Directionality

Since we didn’t count directional, the values for the pairs TF1-TF2 and TF2-TF1 should be equal. For example in the DataFrame above the results for ZNF121-ZNF770 are the same as for ZNF770-ZNF121. This is not true if directionality is considered.

Note: If directionality is not considered, the duplicates can be filtered with .simplify_rules()

Plotting

There are different ways to plot the distance distribution, you can for example create a kernel density estimate (KDE) plot or a histogram of the distribution.

[13]:

selection.distObj.plot(("ZNF121", "ZNF770"), style="kde")

[13]:

<AxesSubplot:title={'center':'ZNF121-ZNF770'}, xlabel='Distance (bp)', ylabel='Count per distance'>

../_images/examples_Distance_analysis_37_1.png

[14]:

selection.distObj.plot(("ZNF121", "ZNF770"), style="hist")

[14]:

<AxesSubplot:title={'center':'ZNF121-ZNF770'}, xlabel='Distance (bp)', ylabel='Count per distance'>

../_images/examples_Distance_analysis_38_1.png

For both plots the x-axis shows the distance in bp.

For example a distance of 100 means the anchors (depending on anchor mode, please refere to Anchor mode on this topic) are 100 bp away of each other. Here is an explanation for the neg distance.

For the y-axis of the kde plot the density estimation is shown.

For the y-axis of the histogram the counts per distance is shown.

3. Smoothing counts

In order to collect distances from more than one basepair, e.g. in a window, it is possible to smooth the counted distances. This is done using the function .smooth() of the distObj:

[15]:

selection.distObj.smooth(window_size=3)

INFO: Smoothing signals with window size 3

The smoothed (min-max-scaled) distances are visible in the .smoothed variable:

[16]:

selection.distObj.smoothed.head()

[16]:

	TF1	TF2	1	2	3	4	5	6	7	8	...	90	91	92	93	94	95	96	97	98	99
POU3F2-SMARCA5	POU3F2	SMARCA5	0.666667	2.333333	2.333333	7.333333	5.000000	5.333333	10.333333	10.333333	...	1.666667	1.666667	2.333333	2.666667	4.666667	2.333333	2.000000	1.000000	1.000000	2.000000
SMARCA5-POU3F2	SMARCA5	POU3F2	0.666667	2.333333	2.333333	7.333333	5.000000	5.333333	10.333333	10.333333	...	1.666667	1.666667	2.333333	2.666667	4.666667	2.333333	2.000000	1.000000	1.000000	2.000000
POU2F1-SMARCA5	POU2F1	SMARCA5	0.666667	13.666667	13.000000	13.000000	4.000000	4.333333	5.333333	1.333333	...	1.333333	1.333333	3.000000	1.666667	1.666667	1.333333	1.333333	1.666667	0.333333	0.666667
SMARCA5-POU2F1	SMARCA5	POU2F1	0.666667	13.666667	13.000000	13.000000	4.000000	4.333333	5.333333	1.333333	...	1.333333	1.333333	3.000000	1.666667	1.666667	1.333333	1.333333	1.666667	0.333333	0.666667
SMARCA5-ZNF582	SMARCA5	ZNF582	12.666667	12.666667	24.666667	12.333333	12.666667	1.333333	1.333333	3.333333	...	1.333333	1.333333	1.333333	2.000000	0.666667	0.666667	0.666667	0.666667	1.000000	0.333333

5 rows × 101 columns

4. Scale counts (optional)

Optionally, it is possible to scale the counts to be in the same ranges regardless of number of counts per pair.

[17]:

selection_copy = selection.copy() #create a copy to not alter the real selection object
selection_copy.distObj.scale()

[18]:

selection_copy.distObj.scaled.head()

[18]:

	TF1	TF2	1	2	3	4	5	6	7	8	...	90	91	92	93	94	95	96	97	98	99
POU3F2-SMARCA5	POU3F2	SMARCA5	0.017857	0.107143	0.107143	0.375	0.250000	0.267857	0.535714	0.535714	...	0.071429	0.071429	0.107143	0.125000	0.232143	0.107143	0.089286	0.035714	0.035714	0.089286
SMARCA5-POU3F2	SMARCA5	POU3F2	0.017857	0.107143	0.107143	0.375	0.250000	0.267857	0.535714	0.535714	...	0.071429	0.071429	0.107143	0.125000	0.232143	0.107143	0.089286	0.035714	0.035714	0.089286
POU2F1-SMARCA5	POU2F1	SMARCA5	0.025000	1.000000	0.950000	0.950	0.275000	0.300000	0.375000	0.075000	...	0.075000	0.075000	0.200000	0.100000	0.100000	0.075000	0.075000	0.100000	0.000000	0.025000
SMARCA5-POU2F1	SMARCA5	POU2F1	0.025000	1.000000	0.950000	0.950	0.275000	0.300000	0.375000	0.075000	...	0.075000	0.075000	0.200000	0.100000	0.100000	0.075000	0.075000	0.100000	0.000000	0.025000
SMARCA5-ZNF582	SMARCA5	ZNF582	0.513514	0.513514	1.000000	0.500	0.513514	0.054054	0.054054	0.135135	...	0.054054	0.054054	0.054054	0.081081	0.027027	0.027027	0.027027	0.027027	0.040541	0.013514

5 rows × 101 columns

5. Background correction

To separate the signal from the background noise a linear regression for every signal needs to be fitted next. The result of the fitted line for every pair can be found in the .linres attribute of the distance object.

[19]:

selection.distObj.correct_background(threads=6)

INFO: Background correction finished! Results can be found in .corrected

The fitted lowess function can be plotted with the .plot() command. The red line indicates the linear regression line which was fitted during this step.

[20]:

_ = selection.distObj.plot(("ZNF121", "ZNF770"), method="correction")

../_images/examples_Distance_analysis_52_0.png

After correction, the plotting function will show the corrected counts:

[21]:

_ = selection.distObj.plot(("ZNF121", "ZNF770"))

../_images/examples_Distance_analysis_54_0.png

6. Analyse signal

As a last step, the corrected signal can now be analyzed. Peaks will be called with *scipy.signal.find_peaks()*. Two different methods are available:

“zscore”
“flat” (number)

Zscore method

If prominence is set to zscore, the threshold is set in relation to the zscore (of the corrected signal) for each signal. For example if threshold is 2, the threshold prominence (for the score translated signal) will be a zscore of two.

[22]:

selection.distObj.analyze_signal_all(method="zscore", threshold=2, threads=6)
selection.distObj.peaks

INFO: Analyzing Signal with threads 6
INFO: Calculating zscores for signals
INFO: Finding preferred distances
INFO: Done analyzing signal. Results are found in .peaks

[22]:

	TF1	TF2	Distance	Peak Heights	Prominences	Threshold	TF1_TF2_count	Distance_count	Distance_percent	Distance_window
ZFP82-SMARCA5	ZFP82	SMARCA5	8	6.547058	7.582435	2	198	46	23.232323	[7;9]
SMARCA5-ZFP82	SMARCA5	ZFP82	8	6.547058	7.582435	2	198	46	23.232323	[7;9]
ZNF394-SMARCA5	ZNF394	SMARCA5	5	6.940786	7.474373	2	187	74	39.572193	[4;6]
SMARCA5-ZNF394	SMARCA5	ZNF394	5	6.940786	7.474373	2	187	74	39.572193	[4;6]
POU3F2-SMARCA5	POU3F2	SMARCA5	9	6.584317	7.402690	2	234	57	24.358974	[8;10]
...	...	...	...	...	...	...	...	...	...	...
ETV5-ZSCAN22	ETV5	ZSCAN22	98	2.010752	2.010752	2	64	3	4.687500	[97;99]
ZBTB17-RFX1	ZBTB17	RFX1	1	2.010639	2.010639	2	30	1	3.333333	[1;2]
RFX1-ZBTB17	RFX1	ZBTB17	1	2.010639	2.010639	2	30	1	3.333333	[1;2]
WT1-ZIC3	WT1	ZIC3	48	2.226574	2.005542	2	51	4	7.843137	[47;49]
ZIC3-WT1	ZIC3	WT1	48	2.226574	2.005542	2	51	4	7.843137	[47;49]

8160 rows × 10 columns

The resulting dataframe is constructed as followed: - columns: First the transcription factor for the pair (TF1, TF2) followed by

__Distance__ in bp \
__Peak Heights__ height of the peak (calculated by [_find_peaks()_](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html)). \
__Prominences__ prominence of the peak (calculated by [_find_peaks()_](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html)). \
__Threshold__ minimum prominence needed to be considered as peak (in this example the [zscore](#5.B.-Zscore) of the signal). \
__TF1_TF2_count__ is the total number of co-occurring TF1-TF2 sites. \
__Distance_count__ The number of co-occurrences for the given distance. \
__Distance_percent__ The percent of Distance_count of the total sites (TF1_TF2_count). \
__Distance_window__ The distances collected for the given distance. Will be a window if smoothing was applied. \

rows: each row representing one rule (pair) at a distinct prefered binding distance with the corresponding results

The signal and peaks can now be plotted with the .plot() command.

[23]:

selection.distObj.plot(("ZNF121", "ZNF770"))

[23]:

<AxesSubplot:title={'center':'ZNF121-ZNF770'}, xlabel='Distance (bp)', ylabel='Z-score normalized counts'>

../_images/examples_Distance_analysis_61_1.png

This plot shows the corrected signal with the called peaks, indicated by a cross.

The grey dottet line shows the decision boundary (in this example a zscore of 2)

[24]:

selection.distObj.peaks.loc[(selection.distObj.peaks.TF1 == "ZNF121") & (selection.distObj.peaks.TF2 == "ZNF770")]

[24]:

	TF1	TF2	Distance	Peak Heights	Prominences	Threshold	TF1_TF2_count	Distance_count	Distance_percent	Distance_window
ZNF121-ZNF770	ZNF121	ZNF770	26	4.043651	4.443567	2	1319	381	28.885519	[25;27]
ZNF121-ZNF770	ZNF121	ZNF770	58	3.906961	4.353073	2	1319	376	28.506444	[57;59]
ZNF121-ZNF770	ZNF121	ZNF770	66	2.068096	2.270661	2	1319	220	16.679303	[65;67]

The results in the .peaks table and the plot matches. For this ZNF121-ZNF770 3 peaks were found, which also can be seen in the plot at positions:

distance: 26
distance: 58
distance: 66

Meaning the zscore threshold with stringency of 2 identifies 3 different preferred distances for this pair. Please note, that the height of the distances differ!

#### Flat threshold

Using a flat threshold, it is possible to only find peaks with a certain number of counts as seen here:

[25]:

selection.distObj.analyze_signal_all(method="flat", threshold=80, threads=6)
selection.distObj.peaks

INFO: Analyzing Signal with threads 6
INFO: Calculating zscores for signals
INFO: Finding preferred distances
INFO: Done analyzing signal. Results are found in .peaks

[25]:

	TF1	TF2	Distance	Peak Heights	Prominences	Threshold	TF1_TF2_count	Distance_count	Distance_percent	Distance_window
ZNF121-ZNF770	ZNF121	ZNF770	26	125.175641	125.839985	80	1319	381	28.885519	[25;27]
ZNF770-ZNF121	ZNF770	ZNF121	26	125.175641	125.839985	80	1319	381	28.885519	[25;27]
ZNF121-ZNF770	ZNF121	ZNF770	58	121.304647	123.277229	80	1319	376	28.506444	[57;59]
ZNF770-ZNF121	ZNF770	ZNF121	58	121.304647	123.277229	80	1319	376	28.506444	[57;59]
PAX5-ZNF770	PAX5	ZNF770	56	104.564327	106.526327	80	896	325	36.272321	[55;57]
ZNF770-PAX5	ZNF770	PAX5	56	104.564327	106.526327	80	896	325	36.272321	[55;57]

[26]:

selection.distObj.plot(("ZNF121", "ZNF770"))

[26]:

<AxesSubplot:title={'center':'ZNF121-ZNF770'}, xlabel='Distance (bp)', ylabel='Count per distance'>

../_images/examples_Distance_analysis_68_1.png

Further downstream analysis

1. Analyzing hubs

This function allows to summarize the number of different partners (with at least one peak) each transcription factor has.

[27]:

selection.distObj.analyze_hubs().sort_values()

[27]:

PAX5      1
ZNF121    1
ZNF770    2
dtype: int64

2. Signal classification

This function allows to classify the pairs (in .distance DataFrame) wheather the signal is peaking or not.

[28]:

selection.distObj.classify_rules()
selection.distObj.classified.sort_values(by="isPeaking")

INFO: classifying rules
INFO: classifcation done. Results can be found in .classified

[28]:

	TF1	TF2	isPeaking
POU3F2-SMARCA5	POU3F2	SMARCA5	False
ELK4-WT1	ELK4	WT1	False
WT1-ELK4	WT1	ELK4	False
ELF2-SP1	ELF2	SP1	False
SP1-ELF2	SP1	ELF2	False
...	...	...	...
ZBTB17-ETV1	ZBTB17	ETV1	False
ZNF121-ZNF770	ZNF121	ZNF770	True
PAX5-ZNF770	PAX5	ZNF770	True
ZNF770-PAX5	ZNF770	PAX5	True
ZNF770-ZNF121	ZNF770	ZNF121	True

3778 rows × 3 columns

True means the signal has at least one peak found, False otherwise.