
A short tutorial on the Robust Synthetic Control python library, Part 1: counterfactuals


I have posted a couple of blogs on the powerful technique of (multidimensional) Robust Synthetic Control here and here. In this post I will give a short tutorial on how you can use mRSC to perform your own analysis using the Python package my collaborator Jehangir has made available on GitHub. This post focuses on counterfactual analysis.

We will work with the canonical example of synthetic-control-based counterfactual analysis: the impact of California's Prop 99. All the data and code are included in the GitHub repository linked above. I will show the Python code as run in a Jupyter Notebook, assuming the "tslib" library referenced above has already been downloaded and is available.
Preliminaries: importing the libraries.
In [1]:
import sys, os
sys.path.append("../..")
sys.path.append("..")
sys.path.append(os.getcwd())

from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates



import numpy as np
import pandas as pd
import copy

from tslib.src import tsUtils
from tslib.src.synthcontrol.syntheticControl import RobustSyntheticControl
from tslib.tests import testdata
Reading in the data and converting it into a usable form
In [2]:
directory = os.path.dirname(testdata.__file__)

filename = directory + '/prop99.csv'


df = pd.read_csv(filename)
df = df[df['SubMeasureDesc'] == 'Cigarette Consumption (Pack Sales Per Capita)']
pivot = df.pivot_table(values='Data_Value', index='LocationDesc', columns=['Year'])
dfProp99 = pd.DataFrame(pivot.to_records())

allColumns = dfProp99.columns.values
The code above extracts the relevant portion for our analysis, and the result looks like the following. This is the format you should get your dataset into for analysis: one row per "unit", and one column for every time point.
In [3]:
pd.set_option('display.max_columns', 10)
dfProp99.head()
Out[3]:
  LocationDesc   1970   1971   1972   1973  ...  2010  2011  2012  2013  2014
0      Alabama   89.8   95.4  101.1  102.9  ...  71.5  68.4  67.2  64.6  61.7
1       Alaska  121.3  123.0  130.0  125.8  ...  43.8  43.3  41.2  39.0  37.2
2      Arizona  115.2  109.6  125.0  128.3  ...  24.8  27.1  25.0  24.4  23.0
3     Arkansas  100.3  104.1  103.9  108.0  ...  63.2  61.1  60.5  57.5  54.4
4   California  123.0  121.0  123.5  124.4  ...  26.3  26.0  25.2  23.9  22.7
5 rows × 46 columns
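As an aside (my own addition, not part of the original notebook): if your data starts in long format, with one row per unit-and-time observation, here is a minimal sketch of reshaping it into the wide format above, using hypothetical column names unit, year, and value:

# hypothetical long-format data: one row per (unit, year) observation
long_df = pd.DataFrame({
    'unit':  ['A', 'A', 'B', 'B'],
    'year':  [1970, 1971, 1970, 1971],
    'value': [10.0, 11.0, 20.0, 21.0],
})
# pivot to one row per unit and one column per year, as in the Prop 99 data above
wide_df = pd.DataFrame(long_df.pivot_table(values='value', index='unit', columns='year').to_records())
print(wide_df)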
Now we extract the donor pool ("otherStates"), the intervention unit ("caStateKey"), the start and end years for the dataset, and the year of the intervention.
In [4]:
states = list(np.unique(dfProp99['LocationDesc']))
years = np.delete(allColumns, [0])
caStateKey = 'California'
states.remove(caStateKey)
otherStates = states

yearStart = 1970
yearTrainEnd = 1989
yearTestEnd = 2015

p = 1.0  # probability that a training entry is observed; 1.0 means no values are hidden
Now we prep the data and split it into training and test sets.
In [5]:
trainingYears = []
for i in range(yearStart, yearTrainEnd, 1):
 trainingYears.append(str(i))

testYears = []
for i in range(yearTrainEnd, yearTestEnd, 1):
 testYears.append(str(i))

trainDataMasterDict = {}
trainDataDict = {}
testDataDict = {}
for key in otherStates:
 series = dfProp99.loc[dfProp99['LocationDesc'] == key]

 trainDataMasterDict.update({key: series[trainingYears].values[0]})

 # randomly hide a fraction (1 - p) of the training data; with p = 1.0 nothing is hidden
 (trainData, pObservation) = tsUtils.randomlyHideValues(copy.deepcopy(trainDataMasterDict[key]), p)
 trainDataDict.update({key: trainData})
 testDataDict.update({key: series[testYears].values[0]})
series = dfProp99[dfProp99['LocationDesc'] == caStateKey]
trainDataMasterDict.update({caStateKey: series[trainingYears].values[0]})
trainDataDict.update({caStateKey: series[trainingYears].values[0]})
testDataDict.update({caStateKey: series[testYears].values[0]})

trainMasterDF = pd.DataFrame(data=trainDataMasterDict)
trainDF = pd.DataFrame(data=trainDataDict)
testDF = pd.DataFrame(data=testDataDict)
This is what the training and test matrices look like (in the library we represent each donor-pool unit, and the intervention unit, as a separate column, and each time point as a row).
In [6]:
trainDF.head()
Out[6]:
   Alabama  Alaska  Arizona  Arkansas  Colorado  ...  Washington  West Virginia  Wisconsin  Wyoming  California
0     89.8   121.3    115.2     100.3     124.8  ...        96.7          114.5      106.4    132.2       123.0
1     95.4   123.0    109.6     104.1     125.5  ...        97.0          111.5      105.4    131.7       121.0
2    101.1   130.0    125.0     103.9     134.3  ...        88.5          117.5      108.8    140.0       123.5
3    102.9   125.8    128.3     108.0     137.9  ...        91.0          116.6      109.5    141.2       124.4
4    108.2   130.4    133.1     109.7     132.8  ...        98.6          119.9      111.8    145.8       126.7
5 rows × 51 columns
In [7]:
testDF.head()
Out[7]:
   Alabama  Alaska  Arizona  Arkansas  Colorado  ...  Washington  West Virginia  Wisconsin  Wyoming  California
0    105.6    94.4     96.8     118.3      88.8  ...        86.1          104.0      100.3    111.4        82.4
1    108.6   100.2     88.9     113.1      87.4  ...        83.4          104.1       94.1     96.9        77.8
2    107.9   101.8     81.2     116.8      90.2  ...        78.7          100.1       95.5    109.1        68.7
3    109.1    98.5     79.0     126.0      88.3  ...        81.1           98.0       96.2    110.8        67.5
4    108.5    95.2     80.3     113.8      88.6  ...        79.4          111.0       91.2    108.4        63.4
5 rows × 51 columns
Next, we do a sanity check to see whether the matrix is indeed approximately low-rank. It checks out: it looks like we will need only about 4 singular values.
In [8]:
(U, s, Vh) = np.linalg.svd((trainDF) - np.mean(trainDF))
s2 = np.power(s, 2)
spectrum = np.cumsum(s2)/np.sum(s2)

plt.plot(spectrum)
plt.grid()
plt.title("Cumulative energy")
plt.figure()
plt.plot(s2)
plt.grid()


plt.xlabel("Ordered Singular Values") 
plt.ylabel("Energy")


plt.title("Singular Value Spectrum")
Out[8]:
Text(0.5,1,'Singular Value Spectrum')
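The plots above show the cumulative energy and the singular value spectrum. As a side note (my own addition, with an arbitrarily chosen threshold), you can also read the required number of singular values off the cumulative-energy curve programmatically:

# smallest k such that the top-k singular values capture at least 95% of the energy
# (the 0.95 threshold is my own choice; tune it for your data)
energy_threshold = 0.95
k = int(np.argmax(spectrum >= energy_threshold)) + 1
print("singular values needed for", energy_threshold, "of the energy:", k)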
This is the all-important step: we now create the robust synthetic control model, specifying the treated unit, the number of singular values to keep, the number of pre-intervention time points, and the donor pool.
In [9]:
singvals = 4
rscModel = RobustSyntheticControl(caStateKey, singvals, len(trainDF), probObservation=1.0, modelType='svd', svdMethod='numpy', otherSeriesKeysArray=otherStates)
rscModel.fit(trainDF)
denoisedDF = rscModel.model.denoisedDF()
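Before generating predictions, it is instructive to inspect the learned weights on the donor states. The rscModel.model.weights attribute is the same one used in the dot products below; the inspection code here is my own addition and assumes the weights vector is aligned with otherStates, which is how it is used below:

# show the donor states with the largest learned weights (in absolute value)
weights = pd.Series(np.ravel(rscModel.model.weights), index=otherStates)
top = weights.abs().sort_values(ascending=False).head(10).index
print(weights[top])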
Next up, getting the counterfactual predictions and model fit.
In [10]:
# counterfactual predictions over the post-intervention years
predictions = np.dot(testDF[otherStates], rscModel.model.weights)

# actual California series over the entire period
actual = dfProp99.loc[dfProp99['LocationDesc'] == caStateKey]
actual = actual.drop('LocationDesc', axis=1)
actual = actual.iloc[0]

# in-sample fit over the pre-intervention (training) years
model_fit = np.dot(trainDF[otherStates], rscModel.model.weights)
And finally, the plot! That's it. You can clearly see the impact Prop 99 had: a sharper decline in per capita smoking in California than in the counterfactual.
In [11]:
fig, ax = plt.subplots(1,1)
tick_spacing = 5
# workaround for a matplotlib tick-labeling quirk: prepend a dummy entry so the visible labels line up with the ticks
label_markings = np.insert(years[::tick_spacing], 0, 'dummy')

ax.set_xticks(np.arange(len(label_markings)))
ax.set_xticklabels(label_markings, rotation=45)
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))

plt.plot(years, actual ,label='actual')
plt.xlabel('Year')
plt.ylabel('Per capita cigarette consumption')
plt.plot(trainingYears, model_fit, label='fitted model')
plt.plot(testYears, predictions, label='counterfactual')
plt.title(caStateKey+', Singular Values used: '+str(singvals))

# mark the intervention year with a vertical dashed line
plt.axvline(x=str(yearTrainEnd), color='k', linestyle='--', linewidth=4)
plt.grid()
plt.legend()
Out[11]:
<matplotlib.legend.Legend at 0x7f9478d14ac8>
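As a quick follow-up (my own addition, not part of the original notebook), the gap between the counterfactual and the actual series gives a simple numerical summary of the estimated effect:

# estimated effect: counterfactual minus actual over the post-intervention years
counterfactual_post = np.ravel(predictions)
actual_post = actual[testYears].values.astype(float)
effect = counterfactual_post - actual_post
print("average estimated reduction (packs per capita per year):", round(float(effect.mean()), 1))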

Comments

  1. It is very interesting. Just one question. I have tried to replicate this exercise and rscModel.model.weights gives weight to all states. Is it a good result?
    Thank you in advance.

  2. Could you please explain why there is a gap between the fitted model line (orange) and the treatment year (dash line)? Thank you very much!

