I have posted a couple of blogs on the powerful technique of (multidimensional) Robust Synthetic Control here and here. In this post I will give a short tutorial on how you can use mRSC to perform your own counterfactual analysis, using the Python package my collaborator Jehangir has made available on GitHub.
We will work with the canonical example of synthetic-control counterfactual analysis: the impact of California's Prop 99 anti-tobacco legislation. All the data and code are included in the GitHub repository linked above. The Python code below is as run in a Jupyter notebook, and assumes the "tslib" library referenced above has been downloaded and is importable.
Preliminaries: importing the libraries.
In [1]:
import sys, os
sys.path.append("../..")
sys.path.append("..")
sys.path.append(os.getcwd())
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
import copy
from tslib.src import tsUtils
from tslib.src.synthcontrol.syntheticControl import RobustSyntheticControl
from tslib.tests import testdata
Reading in the data and converting it into a usable form
In [2]:
directory = os.path.dirname(testdata.__file__)
filename = directory + '/prop99.csv'
df = pd.read_csv(filename)
df = df[df['SubMeasureDesc'] == 'Cigarette Consumption (Pack Sales Per Capita)']
pivot = df.pivot_table(values='Data_Value', index='LocationDesc', columns=['Year'])
dfProp99 = pd.DataFrame(pivot.to_records())
allColumns = dfProp99.columns.values
The code above extracts the relevant portion for our analysis; the result looks like the following. This is the format you should get your own dataset into for analysis: one row per "unit", and one column per timepoint.
In [3]:
pd.set_option('display.max_columns', 10)
dfProp99.head()
Out[3]:
| | LocationDesc | 1970 | 1971 | 1972 | 1973 | ... | 2010 | 2011 | 2012 | 2013 | 2014 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 89.8 | 95.4 | 101.1 | 102.9 | ... | 71.5 | 68.4 | 67.2 | 64.6 | 61.7 |
| 1 | Alaska | 121.3 | 123.0 | 130.0 | 125.8 | ... | 43.8 | 43.3 | 41.2 | 39.0 | 37.2 |
| 2 | Arizona | 115.2 | 109.6 | 125.0 | 128.3 | ... | 24.8 | 27.1 | 25.0 | 24.4 | 23.0 |
| 3 | Arkansas | 100.3 | 104.1 | 103.9 | 108.0 | ... | 63.2 | 61.1 | 60.5 | 57.5 | 54.4 |
| 4 | California | 123.0 | 121.0 | 123.5 | 124.4 | ... | 26.3 | 26.0 | 25.2 | 23.9 | 22.7 |
5 rows × 46 columns
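If your own data starts in long format (one row per unit-and-year observation) rather than this wide layout, the same `pivot_table` pattern used above gets you there. A minimal sketch with hypothetical column names (`unit`, `year`, `value`):

```python
import pandas as pd

# Hypothetical long-format data: one row per (unit, year) observation.
long_df = pd.DataFrame({
    'unit':  ['A', 'A', 'B', 'B'],
    'year':  [1970, 1971, 1970, 1971],
    'value': [89.8, 95.4, 121.3, 123.0],
})

# Pivot to the wide layout used in this analysis: one row per unit,
# one column per timepoint.
wide = pd.pivot_table(long_df, values='value', index='unit', columns='year')
wide = pd.DataFrame(wide.to_records())  # flatten the index, as above
print(wide)
```

Note that `to_records()` turns the year columns into string names ('1970', '1971', ...), which is why the later cells index years as strings.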
Now we extract the donor pool ("otherStates", every state except the treated one), the intervention unit ("caStateKey"), the start year of the dataset, the end of the training period (the intervention year), and the end of the test period.
In [4]:
states = list(np.unique(dfProp99['LocationDesc']))
years = np.delete(allColumns, [0])
caStateKey = 'California'
states.remove(caStateKey)
otherStates = states
yearStart = 1970
yearTrainEnd = 1989
yearTestEnd = 2015
p = 1.0  # probability that each training entry is observed (1.0 = no missing data)
Now we prep the data and split it into training and test sets.
In [5]:
trainingYears = []
for i in range(yearStart, yearTrainEnd, 1):
    trainingYears.append(str(i))
testYears = []
for i in range(yearTrainEnd, yearTestEnd, 1):
    testYears.append(str(i))
trainDataMasterDict = {}
trainDataDict = {}
testDataDict = {}
for key in otherStates:
    series = dfProp99.loc[dfProp99['LocationDesc'] == key]
    trainDataMasterDict.update({key: series[trainingYears].values[0]})
    # randomly hide a fraction (1 - p) of the training values; with p = 1.0 nothing is hidden
    (trainData, pObservation) = tsUtils.randomlyHideValues(copy.deepcopy(trainDataMasterDict[key]), p)
    trainDataDict.update({key: trainData})
    testDataDict.update({key: series[testYears].values[0]})
series = dfProp99[dfProp99['LocationDesc'] == caStateKey]
trainDataMasterDict.update({caStateKey: series[trainingYears].values[0]})
trainDataDict.update({caStateKey: series[trainingYears].values[0]})
testDataDict.update({caStateKey: series[testYears].values[0]})
trainMasterDF = pd.DataFrame(data=trainDataMasterDict)
trainDF = pd.DataFrame(data=trainDataDict)
testDF = pd.DataFrame(data=testDataDict)
This is what the train and test matrices look like. (In the library, each donor state and the intervention unit is a separate column, and each timepoint is a row.)
In [6]:
trainDF.head()
Out[6]:
| | Alabama | Alaska | Arizona | Arkansas | Colorado | ... | Washington | West Virginia | Wisconsin | Wyoming | California |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 89.8 | 121.3 | 115.2 | 100.3 | 124.8 | ... | 96.7 | 114.5 | 106.4 | 132.2 | 123.0 |
| 1 | 95.4 | 123.0 | 109.6 | 104.1 | 125.5 | ... | 97.0 | 111.5 | 105.4 | 131.7 | 121.0 |
| 2 | 101.1 | 130.0 | 125.0 | 103.9 | 134.3 | ... | 88.5 | 117.5 | 108.8 | 140.0 | 123.5 |
| 3 | 102.9 | 125.8 | 128.3 | 108.0 | 137.9 | ... | 91.0 | 116.6 | 109.5 | 141.2 | 124.4 |
| 4 | 108.2 | 130.4 | 133.1 | 109.7 | 132.8 | ... | 98.6 | 119.9 | 111.8 | 145.8 | 126.7 |
5 rows × 51 columns
In [7]:
testDF.head()
Out[7]:
| | Alabama | Alaska | Arizona | Arkansas | Colorado | ... | Washington | West Virginia | Wisconsin | Wyoming | California |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105.6 | 94.4 | 96.8 | 118.3 | 88.8 | ... | 86.1 | 104.0 | 100.3 | 111.4 | 82.4 |
| 1 | 108.6 | 100.2 | 88.9 | 113.1 | 87.4 | ... | 83.4 | 104.1 | 94.1 | 96.9 | 77.8 |
| 2 | 107.9 | 101.8 | 81.2 | 116.8 | 90.2 | ... | 78.7 | 100.1 | 95.5 | 109.1 | 68.7 |
| 3 | 109.1 | 98.5 | 79.0 | 126.0 | 88.3 | ... | 81.1 | 98.0 | 96.2 | 110.8 | 67.5 |
| 4 | 108.5 | 95.2 | 80.3 | 113.8 | 88.6 | ... | 79.4 | 111.0 | 91.2 | 108.4 | 63.4 |
5 rows × 51 columns
Next, we do a sanity check to confirm that the matrix is indeed approximately low-rank. It checks out: about 4 singular values capture nearly all of the energy.
In [8]:
(U, s, Vh) = np.linalg.svd((trainDF) - np.mean(trainDF))
s2 = np.power(s, 2)
spectrum = np.cumsum(s2)/np.sum(s2)
plt.plot(spectrum)
plt.grid()
plt.title("Cumulative energy")
plt.figure()
plt.plot(s2)
plt.grid()
plt.xlabel("Ordered Singular Values")
plt.ylabel("Energy")
plt.title("Singular Value Spectrum")
Out[8]:
Text(0.5,1,'Singular Value Spectrum')
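Rather than eyeballing the plot, you can read the cutoff directly off the cumulative spectrum. A self-contained sketch, using a synthetic rank-4 matrix as a stand-in for the centered `trainDF` (the real check would use the `spectrum` computed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the centered trainDF: a rank-4 matrix plus small noise.
low_rank = rng.normal(size=(19, 4)) @ rng.normal(size=(4, 50))
noisy = low_rank + 0.01 * rng.normal(size=low_rank.shape)

(U, s, Vh) = np.linalg.svd(noisy)
s2 = np.power(s, 2)
spectrum = np.cumsum(s2) / np.sum(s2)

# Smallest number of singular values capturing at least 99% of the energy.
k = int(np.argmax(spectrum >= 0.99)) + 1
print(k)
```

Applying the same `argmax` line to the real spectrum gives a principled choice for `singvals` in the next cell.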
This is the all-important step: we now create and fit the robust synthetic control model, using the 4 singular values suggested by the spectrum above.
In [9]:
singvals = 4
rscModel = RobustSyntheticControl(
    caStateKey, singvals, len(trainDF), probObservation=1.0,
    modelType='svd', svdMethod='numpy', otherSeriesKeysArray=otherStates)
rscModel.fit(trainDF)
denoisedDF = rscModel.model.denoisedDF()
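At this point it can be instructive to inspect the learned donor weights (`rscModel.model.weights`, which the next cell uses for prediction). Unlike classical synthetic control, RSC fits an unconstrained linear regression on denoised data, so the weights need not be sparse, non-negative, or sum to one; typically every donor gets some weight. A minimal ranking sketch, using a hypothetical stand-in array since fitting the real model requires the data above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rscModel.model.weights over a small donor pool;
# the real weights align with the otherStates column order.
otherStates = ['Colorado', 'Connecticut', 'Montana', 'Nevada', 'Utah']
weights = np.array([0.10, 0.05, 0.20, 0.25, 0.40])

# Rank donors by absolute weight to see which states dominate the
# synthetic California.
ranked = pd.Series(weights, index=otherStates).sort_values(
    key=np.abs, ascending=False)
print(ranked.head(3))
```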
Next up: the counterfactual predictions (over the test years) and the model fit (over the training years).
In [10]:
predictions = np.dot(testDF[otherStates], rscModel.model.weights)
actual = dfProp99.loc[dfProp99['LocationDesc'] == caStateKey]
actual = actual.drop('LocationDesc', axis=1)
actual = actual.iloc[0]
model_fit = np.dot(trainDF[otherStates][:], rscModel.model.weights)
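The plot below shows the gap visually, but the same arrays also give a simple numeric summary: the average actual-minus-counterfactual gap over the post-intervention years. A self-contained sketch with short hypothetical arrays standing in for the real `actual[testYears]` and `predictions` computed above:

```python
import numpy as np

# Hypothetical post-intervention values standing in for the real
# actual[testYears] and predictions arrays.
actual_post = np.array([82.4, 77.8, 68.7, 67.5, 63.4])
counterfactual = np.array([90.1, 88.5, 86.0, 84.2, 82.9])

# Average gap between treated unit and its synthetic control over the
# test window: a rough estimate of the effect on the treated unit.
gap = actual_post - counterfactual
att = gap.mean()
print(round(att, 2))
```

A negative average gap here would indicate that per-capita consumption fell below what the synthetic control predicts.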
And finally, the plot! That's it. You can clearly see the impact of Prop 99, via the sharper decline in per-capita smoking compared to the counterfactual.
In [11]:
fig, ax = plt.subplots(1,1)
tick_spacing = 5
# workaround for a matplotlib tick-labeling quirk: the first label is
# swallowed, so prepend a dummy entry
label_markings = np.insert(years[::tick_spacing], 0, 'dummy')
ax.set_xticks(np.arange(len(label_markings)))
ax.set_xticklabels(label_markings, rotation=45)
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.plot(years, actual, label='actual')
plt.xlabel('Year')
plt.ylabel('Per capita cigarette consumption')
plt.plot(trainingYears, model_fit, label='fitted model')
plt.plot(testYears, predictions, label='counterfactual')
plt.title(caStateKey+', Singular Values used: '+str(singvals))
plt.axvline(x=str(yearTrainEnd), color='k', linestyle='--', linewidth=4)
plt.grid()
plt.legend()
Out[11]:
<matplotlib.legend.Legend at 0x7f9478d14ac8>