Causality

Causal Impact R Package in Python

How to run Google’s CausalImpact R package from your Python environment

Alexandros Zenonos, PhD
Towards Data Science
4 min read · Aug 28, 2020



CausalImpact is an R package developed by Google for causal inference using Bayesian Structural time-series models. You can find the R version here.

In short, what this package does is make counterfactual predictions. In other words, what would have happened in a parallel (sort of) universe if the intervention had never taken place? Here is a quick example straight from Google’s website: “Given a response time series (e.g., clicks) and a set of control time series (e.g., clicks in non-affected markets or clicks on other sites), the package constructs a Bayesian structural time-series model. This model is then used to try and predict the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had never occurred.”

CausalImpact 1.2.1, Brodersen et al., Annals of Applied Statistics (2015). http://google.github.io/CausalImpact/

Image Description: Part A (original) shows, as a solid dark line, the time series we are monitoring; the dotted blue line is the counterfactual prediction, and the vertical grey line marks the moment the intervention was made. We can observe that from that moment onwards the blue and black lines drift apart. Part B (pointwise) shows the difference between those lines over time, which in essence is the causal effect we are interested in, while Part C (cumulative) shows the cumulative difference over time.

You can of course work in R directly, but for Python lovers I am not aware of an exact equivalent package. There are some Python libraries that implement parts of the original paper, but when I checked a few of them I noticed differences in their results. Long story short, here is how you can run the original R package from Python, and the approach generalises to pretty much any other R package.

What worked for me was to create a new Conda environment with both Python libraries and core R packages pre-installed. Here is an example: conda create -n r_env numpy pandas statsmodels r-essentials r-base

Creating the environment will take some time. Also, note that using it from a Jupyter notebook requires further configuration, so I tend to edit the code in a regular editor and run it from the command line instead.

We also need rpy2, which does all the heavy lifting for us: it is a Python interface to the R language. pip install rpy2 would do.
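
As a quick, optional check that rpy2 is installed and can find your R installation, you can evaluate a tiny R expression from Python (a minimal sketch; it assumes you run it inside the conda environment created above):

#Sanity check: ask R for its version string through rpy2
import rpy2.robjects as robjects
print(robjects.r('R.version.string')[0])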

Load all the libraries as below:

#rpy2 lib
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects import Formula
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects import numpy2ri
from rpy2.ipython.ggplot import image_png #only needed for inline plots in Jupyter
#typical python libs
import numpy as np
import pandas as pd
import datetime
#arma
from statsmodels.tsa.arima_process import ArmaProcess

Create a Pandas dataframe with at least three columns: one for the datetime (we will make it the index below), one or more predictors (x1, x2, …, xn) and a response variable y. This assumes you already have your data ready for consumption; otherwise, you can create some as below:

# Creating synthetic data - skip this if you are running it on your
# own data - The data generation is taken from
# https://github.com/dafiti/causalimpact
ar = np.array([1, 0.9])
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
y[70:] += 5
base = datetime.datetime.today()
dt = [base - datetime.timedelta(days=x) for x in range(100)]
df = pd.DataFrame({'y': y, 'x1': X,'date':dt}, columns=['y', 'x1','date'])

Make sure you have the datetime column as the index, as discussed above.

#make datetime an index to the df
df.set_index('date',inplace=True)
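
Before moving on, it is worth a quick (optional) check that the index is now a datetime index and that y is the first column, since CausalImpact treats the first column of the data as the response variable:

#Optional sanity check of the dataframe structure
print(df.index.dtype)       #should be datetime64[ns]
print(df.columns.tolist())  #should start with 'y', e.g. ['y', 'x1']
print(df.head())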

Define the period before the intervention and the period after it. Here the periods are given as row positions in the dataframe (in the synthetic data the intervention happens at row 70, where the shift was added to y).

#Set pre and post intervention periods
pre_period = [0, 69]
post_period = [70, 99]
#R conversion
pre_period=robjects.FloatVector(pre_period)
post_period=robjects.FloatVector(post_period)

This gives you the interval from the start of the series up to the intervention (pre_period) and from the intervention to the last observation in your data (post_period).
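
If you would rather specify the intervention by calendar date than by row position, a small sketch like the one below translates a date into the same row-based periods (intervention_date is a hypothetical name, and here it is simply read off the synthetic data):

#Hypothetical: derive the row-based periods from an intervention date
intervention_date = df.index[70]           #assumption: the intervention is at row 70
loc = df.index.get_loc(intervention_date)  #row position of the intervention
pre_period = robjects.FloatVector([0, loc - 1])
post_period = robjects.FloatVector([loc, len(df) - 1])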

#Load R libraries from within Python - R interface
utils=rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
packnames=('CausalImpact','bsts') # any other R library required
names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
#StrVector is needed to pass the package names to R's install.packages
from rpy2.robjects.vectors import StrVector
if len(names_to_install) > 0:
    utils.install_packages(StrVector(names_to_install))
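
Optionally, you can confirm from Python that the R packages are now visible to rpy2, using the same isinstalled helper as above:

#Optional check: both R packages should now report as installed
for pkg in packnames:
    print(pkg, rpackages.isinstalled(pkg))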

This might take some time; sip some coffee, and some more.

But now here we are. Ready to run causal inference on our data.

#Activate automatic numpy/pandas <-> R conversion
numpy2ri.activate()
pandas2ri.activate()
#Convert the pandas dataframe to an R data.frame
rdf=robjects.conversion.py2rpy(df)
#Run CausalImpact in R
causalimpact=importr('CausalImpact')
impact=causalimpact.CausalImpact(rdf,pre_period,post_period)
#Summary table
summary_func=robjects.r('function(x) summary(x)')
summary_func(impact)
#Summary with descriptive report
summary_report_func=robjects.r('function(x) summary(x,"report")')
summary_report_func(impact)
#Create causality plot
img_file='causalimpact.png'
rstr="""
library(ggplot2)
function(x,y){
p<-plot(x)
ggsave(y,plot=p)
}
"""rfunc=robjects.r(rstr)
rfunc(impact,img_file)
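
If you also want the numbers back as Python objects rather than only printed to the console, the object returned by CausalImpact is an R list whose elements include the summary table and the textual report. Here is a minimal sketch for pulling them out (assuming the pandas2ri conversion activated above, so the R data.frame comes back as a pandas dataframe):

#Extract results back into Python objects
get_summary=robjects.r('function(x) x$summary')
get_report=robjects.r('function(x) x$report')
summary_df=get_summary(impact)     #pandas dataframe with 'Average' and 'Cumulative' rows
report_text=get_report(impact)[0]  #plain-text description of the estimated effect
print(summary_df)
print(report_text)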

For more options and discussion of the CausalImpact package, see the official documentation: http://google.github.io/CausalImpact/

