ChIP-seq analysis

This notebook shows how to analyze the TF-co-occurrences of predefined TFBS e.g. from ChIP-seq peaks. The data used here is obtained from the ENCODE project (https://pubmed.ncbi.nlm.nih.gov/29126249/) and consists of all TF ChIP-seq experiments for the celltype GM12878, and subset to chr4 in the human genome.

Setup a CombObj

First, we load the tfcomb package and set up an empty CombObj:

[1]:

from tfcomb import CombObj
C = CombObj()

[2]:

[2]:

<CombObj>

Read TFBS from .bed-file

Next, we are going to fill the CombObj ‘C’ with binding sites from GM12878 ChIP-seq experiments:

[3]:

C.TFBS_from_bed("../data/GM12878_hg38_chr4_TF_chipseq.bed")

INFO: Reading sites from '../data/GM12878_hg38_chr4_TF_chipseq.bed'...
INFO: Processing sites
INFO: Read 112109 sites (151 unique names)

Now, the CombObj contains the .TFBS variable holding all TFBS to use for analysis:

[4]:

C.TFBS[:10]

[4]:

[chr4   11875   11876   ZBTB33  1000    .,
 chr4   116639  116640  JUNB    974     .,
 chr4   116678  116679  RUNX3   1000    .,
 chr4   121620  121621  RUNX3   1000    .,
 chr4   124050  124051  ZNF217  948     .,
 chr4   124052  124053  SMARCA5 802     .,
 chr4   124169  124170  NR2F1   634     .,
 chr4   124289  124290  CBX5    906     .,
 chr4   124363  124364  E4F1    1000    .,
 chr4   124365  124366  PKNOX1  1000    .]

The CombObj will now reflect that TFBS were added:

[5]:

[5]:

<CombObj: 112109 TFBS (151 unique names)>

Perform market basket analysis

Next, the function .market_basket() is used to perform the co-occurrence analysis of the sites just added to .TFBS:

[6]:

C.market_basket()

Internal counts for 'TF_counts' were not set. Please run .count_within() to obtain TF-TF co-occurrence counts.
WARNING: No counts found in <CombObj>. Running <CombObj>.count_within() with standard parameters.
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Running with multiprocessing threads == 1. To change this, give 'threads' in the parameter of the function.
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules

As is shown in the info messages, this also runs the .count_within() function of ‘C’ (if no counts were found yet). If you want to set specific parameters for count_within, you can split these calculations such as seen here:

C.count_within(max_distance=200)
C.market_basket()

In any case, running .market_basket() will fill out the .rules variable of the CombObj:

[7]:

C.rules.head()

[7]:

	TF1	TF2	TF1_TF2_count	TF1_count	TF2_count	cosine	zscore
CTCF-RAD21	CTCF	RAD21	1751	2432	2241	0.750038	18.643056
RAD21-CTCF	RAD21	CTCF	1751	2241	2432	0.750038	18.643056
RAD21-SMC3	RAD21	SMC3	1376	2241	1638	0.718192	20.314026
SMC3-RAD21	SMC3	RAD21	1376	1638	2241	0.718192	20.314026
CTCF-SMC3	CTCF	SMC3	1361	2432	1638	0.681898	20.245177

Printing the CombObj again will also tell you how many rules were found in the market basket analysis:

[8]:

[8]:

<CombObj: 112109 TFBS (151 unique names) | Market basket analysis: 21284 rules>

Visualize results

The CombObj ‘C’ contains a number of different visualizations for the identified TF-TF co-occurrence pairs. The default measure plotted is ‘cosine’, but many of the options of the plots can be changed as seen in the examples below:

Heatmap

[9]:

_ = C.plot_heatmap()

../_images/examples_chipseq_analysis_24_0.png

[10]:

#Showing more rules
_ = C.plot_heatmap(n_rules=50)

../_images/examples_chipseq_analysis_25_0.png

[11]:

#Coloring the heatmap by TF1_TF2_count
_ = C.plot_heatmap(color_by="TF1_TF2_count")

../_images/examples_chipseq_analysis_26_0.png

Bubble plot

[12]:

_ = C.plot_bubble()

../_images/examples_chipseq_analysis_28_0.png

[13]:

#Show more rules
_ = C.plot_bubble(n_rules=40, figsize=(12,4))

../_images/examples_chipseq_analysis_29_0.png

Handling duplicate rules

As seen in the bubble plot above, the rules are duplicated due to the cosine measure, which scores TF1-TF2 the same way as TF2-TF1. In order to simplify the results, you can use .simplify_rules() to remove the duplicates from the .rules:

[14]:

C.simplify_rules()

[15]:

C.rules

[15]:

	TF1	TF2	TF1_TF2_count	TF1_count	TF2_count	cosine	zscore
CTCF-RAD21	CTCF	RAD21	1751	2432	2241	0.750038	18.643056
RAD21-SMC3	RAD21	SMC3	1376	2241	1638	0.718192	20.314026
CTCF-SMC3	CTCF	SMC3	1361	2432	1638	0.681898	20.245177
IKZF1-IKZF2	IKZF1	IKZF2	1726	2922	2324	0.662343	11.215960
SMC3-ZNF143	SMC3	ZNF143	1060	1638	1652	0.644383	21.838431
...	...	...	...	...	...	...	...
HDAC6-JUNB	HDAC6	JUNB	1	172	1866	0.001765	-6.905016
RELB-SUZ12	RELB	SUZ12	1	2011	172	0.001700	-6.793962
ATF7-SUZ12	ATF7	SUZ12	1	2061	172	0.001680	-6.119294
RAD21-SUZ12	RAD21	SUZ12	1	2241	172	0.001611	-7.500216
IKZF2-SUZ12	IKZF2	SUZ12	1	2324	172	0.001582	-6.649424

10642 rows × 7 columns

Now, you can again use plot_bubble() to visualize the highest-scoring co-occurring pairs:

[16]:

_ = C.plot_bubble()

../_images/examples_chipseq_analysis_35_0.png

Save CombObj to a pickle object

We can now save the CombObj to a pickle object to be used in other analysis:

[17]:

C.to_pickle("../data/GM12878.pkl")