ChIP-seq analysis

This notebook shows how to analyze the TF-co-occurrences of predefined TFBS e.g. from ChIP-seq peaks. The data used here is obtained from the ENCODE project (https://pubmed.ncbi.nlm.nih.gov/29126249/) and consists of all TF ChIP-seq experiments for the celltype GM12878, and subset to chr4 in the human genome.

Setup a CombObj

First, we load the tfcomb package and set up an empty CombObj:

[1]:
from tfcomb import CombObj
C = CombObj()
[2]:
C
[2]:
<CombObj>

Read TFBS from .bed-file

Next, we are going to fill the CombObj ‘C’ with binding sites from GM12878 ChIP-seq experiments:

[3]:
C.TFBS_from_bed("../data/GM12878_hg38_chr4_TF_chipseq.bed")
INFO: Reading sites from '../data/GM12878_hg38_chr4_TF_chipseq.bed'...
INFO: Processing sites
INFO: Read 112109 sites (151 unique names)

Now, the CombObj contains the .TFBS variable holding all TFBS to use for analysis:

[4]:
C.TFBS[:10]
[4]:
[chr4   11875   11876   ZBTB33  1000    .,
 chr4   116639  116640  JUNB    974     .,
 chr4   116678  116679  RUNX3   1000    .,
 chr4   121620  121621  RUNX3   1000    .,
 chr4   124050  124051  ZNF217  948     .,
 chr4   124052  124053  SMARCA5 802     .,
 chr4   124169  124170  NR2F1   634     .,
 chr4   124289  124290  CBX5    906     .,
 chr4   124363  124364  E4F1    1000    .,
 chr4   124365  124366  PKNOX1  1000    .]

The CombObj will now reflect that TFBS were added:

[5]:
C
[5]:
<CombObj: 112109 TFBS (151 unique names)>

Perform market basket analysis

Next, the function .market_basket() is used to perform the co-occurrence analysis of the sites just added to .TFBS:

[6]:
C.market_basket()
Internal counts for 'TF_counts' were not set. Please run .count_within() to obtain TF-TF co-occurrence counts.
WARNING: No counts found in <CombObj>. Running <CombObj>.count_within() with standard parameters.
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Running with multiprocessing threads == 1. To change this, give 'threads' in the parameter of the function.
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules

As is shown in the info messages, this also runs the .count_within() function of ‘C’ (if no counts were found yet). If you want to set specific parameters for count_within, you can split these calculations such as seen here:

C.count_within(max_distance=200)
C.market_basket()

In any case, running .market_basket() will fill out the .rules variable of the CombObj:

[7]:
C.rules.head()
[7]:
TF1 TF2 TF1_TF2_count TF1_count TF2_count cosine zscore
CTCF-RAD21 CTCF RAD21 1751 2432 2241 0.750038 18.643056
RAD21-CTCF RAD21 CTCF 1751 2241 2432 0.750038 18.643056
RAD21-SMC3 RAD21 SMC3 1376 2241 1638 0.718192 20.314026
SMC3-RAD21 SMC3 RAD21 1376 1638 2241 0.718192 20.314026
CTCF-SMC3 CTCF SMC3 1361 2432 1638 0.681898 20.245177

Printing the CombObj again will also tell you how many rules were found in the market basket analysis:

[8]:
C
[8]:
<CombObj: 112109 TFBS (151 unique names) | Market basket analysis: 21284 rules>

Visualize results

The CombObj ‘C’ contains a number of different visualizations for the identified TF-TF co-occurrence pairs. The default measure plotted is ‘cosine’, but many of the options of the plots can be changed as seen in the examples below:

Heatmap

[9]:
_ = C.plot_heatmap()
../_images/examples_chipseq_analysis_24_0.png
[10]:
#Showing more rules
_ = C.plot_heatmap(n_rules=50)
../_images/examples_chipseq_analysis_25_0.png
[11]:
#Coloring the heatmap by TF1_TF2_count
_ = C.plot_heatmap(color_by="TF1_TF2_count")
../_images/examples_chipseq_analysis_26_0.png

Bubble plot

[12]:
_ = C.plot_bubble()
../_images/examples_chipseq_analysis_28_0.png
[13]:
#Show more rules
_ = C.plot_bubble(n_rules=40, figsize=(12,4))
../_images/examples_chipseq_analysis_29_0.png

Handling duplicate rules

As seen in the bubble plot above, the rules are duplicated due to the cosine measure, which scores TF1-TF2 the same way as TF2-TF1. In order to simplify the results, you can use .simplify_rules() to remove the duplicates from the .rules:

[14]:
C.simplify_rules()
[15]:
C.rules
[15]:
TF1 TF2 TF1_TF2_count TF1_count TF2_count cosine zscore
CTCF-RAD21 CTCF RAD21 1751 2432 2241 0.750038 18.643056
RAD21-SMC3 RAD21 SMC3 1376 2241 1638 0.718192 20.314026
CTCF-SMC3 CTCF SMC3 1361 2432 1638 0.681898 20.245177
IKZF1-IKZF2 IKZF1 IKZF2 1726 2922 2324 0.662343 11.215960
SMC3-ZNF143 SMC3 ZNF143 1060 1638 1652 0.644383 21.838431
... ... ... ... ... ... ... ...
HDAC6-JUNB HDAC6 JUNB 1 172 1866 0.001765 -6.905016
RELB-SUZ12 RELB SUZ12 1 2011 172 0.001700 -6.793962
ATF7-SUZ12 ATF7 SUZ12 1 2061 172 0.001680 -6.119294
RAD21-SUZ12 RAD21 SUZ12 1 2241 172 0.001611 -7.500216
IKZF2-SUZ12 IKZF2 SUZ12 1 2324 172 0.001582 -6.649424

10642 rows × 7 columns

Now, you can again use plot_bubble() to visualize the highest-scoring co-occurring pairs:

[16]:
_ = C.plot_bubble()
../_images/examples_chipseq_analysis_35_0.png

Save CombObj to a pickle object

We can now save the CombObj to a pickle object to be used in other analysis:

[17]:
C.to_pickle("../data/GM12878.pkl")