I have posted a couple of blogs on the powerful technique of (multidimensional) Robust Synthetic Control here and here. In this post I will give a short tutorial on how you can use mRSC to perform your own counterfactual analysis, using the Python package my collaborator Jehangir has made available on GitHub.
We will work with the canonical example of synthetic-control counterfactual analysis: the impact of California's Prop 99 anti-tobacco legislation. All the data and code are included in the GitHub repository linked above. The Python code below is as run in a Jupyter notebook, and assumes the "tslib" library referenced above has been downloaded and is importable.
Preliminaries: importing the libraries.
In [1]:
import sys, os
sys.path.append("../..")
sys.path.append("..")
sys.path.append(os.getcwd())
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
import copy
from tslib.src import tsUtils
from tslib.src.synthcontrol.syntheticControl import RobustSyntheticControl
from tslib.tests import testdata
Reading in the data and converting it into a usable form
In [2]:
directory = os.path.dirname(testdata.__file__)
filename = directory + '/prop99.csv'
df = pd.read_csv(filename)
df = df[df['SubMeasureDesc'] == 'Cigarette Consumption (Pack Sales Per Capita)']
pivot = df.pivot_table(values='Data_Value', index='LocationDesc', columns=['Year'])
dfProp99 = pd.DataFrame(pivot.to_records())
allColumns = dfProp99.columns.values
The code above extracts the relevant portion for our analysis; the result looks like the following. This is the format you should get your own dataset into for analysis: one row per "unit", and one column per timepoint.
In [3]:
pd.set_option('display.max_columns', 10)
dfProp99.head()
Out[3]:
| | LocationDesc | 1970 | 1971 | 1972 | 1973 | ... | 2010 | 2011 | 2012 | 2013 | 2014 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 89.8 | 95.4 | 101.1 | 102.9 | ... | 71.5 | 68.4 | 67.2 | 64.6 | 61.7 |
| 1 | Alaska | 121.3 | 123.0 | 130.0 | 125.8 | ... | 43.8 | 43.3 | 41.2 | 39.0 | 37.2 |
| 2 | Arizona | 115.2 | 109.6 | 125.0 | 128.3 | ... | 24.8 | 27.1 | 25.0 | 24.4 | 23.0 |
| 3 | Arkansas | 100.3 | 104.1 | 103.9 | 108.0 | ... | 63.2 | 61.1 | 60.5 | 57.5 | 54.4 |
| 4 | California | 123.0 | 121.0 | 123.5 | 124.4 | ... | 26.3 | 26.0 | 25.2 | 23.9 | 22.7 |
5 rows × 46 columns
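If your own data starts in long format (one row per unit-and-year observation) rather than this wide layout, the same `pivot_table` pattern used above gets you there. A minimal sketch with hypothetical column names (`unit`, `year`, `value`):

```python
import pandas as pd

# Hypothetical long-format data: one row per (unit, year) observation.
long_df = pd.DataFrame({
    'unit':  ['A', 'A', 'B', 'B'],
    'year':  [1970, 1971, 1970, 1971],
    'value': [89.8, 95.4, 121.3, 123.0],
})

# Pivot to the wide layout used in this analysis: one row per unit,
# one column per timepoint.
wide = pd.pivot_table(long_df, values='value', index='unit', columns='year')
wide = pd.DataFrame(wide.to_records())  # flatten the index, as above
print(wide)
```

Note that `to_records()` turns the year columns into string names ('1970', '1971', ...), which is why the later cells index years as strings.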
Now we extract the donor pool ("otherStates", every state except the treated one), the intervention unit ("caStateKey"), the start year of the dataset, the end of the training period (the intervention year), and the end of the test period.
In [4]:
states = list(np.unique(dfProp99['LocationDesc']))
years = np.delete(allColumns, [0])
caStateKey = 'California'
states.remove(caStateKey)
otherStates = states
yearStart = 1970
yearTrainEnd = 1989
yearTestEnd = 2015
p = 1.0  # probability that each training entry is observed (1.0 = no missing data)
Now we prep the data and split it into training and test sets.
In [5]:
trainingYears = []
for i in range(yearStart, yearTrainEnd, 1):
    trainingYears.append(str(i))
testYears = []
for i in range(yearTrainEnd, yearTestEnd, 1):
    testYears.append(str(i))
trainDataMasterDict = {}
trainDataDict = {}
testDataDict = {}
for key in otherStates:
    series = dfProp99.loc[dfProp99['LocationDesc'] == key]
    trainDataMasterDict.update({key: series[trainingYears].values[0]})
    # randomly hide a fraction (1 - p) of the training values; with p = 1.0 nothing is hidden
    (trainData, pObservation) = tsUtils.randomlyHideValues(copy.deepcopy(trainDataMasterDict[key]), p)
    trainDataDict.update({key: trainData})
    testDataDict.update({key: series[testYears].values[0]})
series = dfProp99[dfProp99['LocationDesc'] == caStateKey]
trainDataMasterDict.update({caStateKey: series[trainingYears].values[0]})
trainDataDict.update({caStateKey: series[trainingYears].values[0]})
testDataDict.update({caStateKey: series[testYears].values[0]})
trainMasterDF = pd.DataFrame(data=trainDataMasterDict)
trainDF = pd.DataFrame(data=trainDataDict)
testDF = pd.DataFrame(data=testDataDict)
This is what the train and test matrices look like. (In the library, each donor state and the intervention unit is a separate column, and each timepoint is a row.)
In [6]:
trainDF.head()
Out[6]:
| | Alabama | Alaska | Arizona | Arkansas | Colorado | ... | Washington | West Virginia | Wisconsin | Wyoming | California |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 89.8 | 121.3 | 115.2 | 100.3 | 124.8 | ... | 96.7 | 114.5 | 106.4 | 132.2 | 123.0 |
| 1 | 95.4 | 123.0 | 109.6 | 104.1 | 125.5 | ... | 97.0 | 111.5 | 105.4 | 131.7 | 121.0 |
| 2 | 101.1 | 130.0 | 125.0 | 103.9 | 134.3 | ... | 88.5 | 117.5 | 108.8 | 140.0 | 123.5 |
| 3 | 102.9 | 125.8 | 128.3 | 108.0 | 137.9 | ... | 91.0 | 116.6 | 109.5 | 141.2 | 124.4 |
| 4 | 108.2 | 130.4 | 133.1 | 109.7 | 132.8 | ... | 98.6 | 119.9 | 111.8 | 145.8 | 126.7 |
5 rows × 51 columns
In [7]:
testDF.head()
Out[7]:
| | Alabama | Alaska | Arizona | Arkansas | Colorado | ... | Washington | West Virginia | Wisconsin | Wyoming | California |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105.6 | 94.4 | 96.8 | 118.3 | 88.8 | ... | 86.1 | 104.0 | 100.3 | 111.4 | 82.4 |
| 1 | 108.6 | 100.2 | 88.9 | 113.1 | 87.4 | ... | 83.4 | 104.1 | 94.1 | 96.9 | 77.8 |
| 2 | 107.9 | 101.8 | 81.2 | 116.8 | 90.2 | ... | 78.7 | 100.1 | 95.5 | 109.1 | 68.7 |
| 3 | 109.1 | 98.5 | 79.0 | 126.0 | 88.3 | ... | 81.1 | 98.0 | 96.2 | 110.8 | 67.5 |
| 4 | 108.5 | 95.2 | 80.3 | 113.8 | 88.6 | ... | 79.4 | 111.0 | 91.2 | 108.4 | 63.4 |
5 rows × 51 columns
Next, we do a sanity check to confirm that the matrix is indeed approximately low-rank. It checks out: about 4 singular values capture nearly all of the energy.
In [8]:
(U, s, Vh) = np.linalg.svd((trainDF) - np.mean(trainDF))
s2 = np.power(s, 2)
spectrum = np.cumsum(s2)/np.sum(s2)
plt.plot(spectrum)
plt.grid()
plt.title("Cumulative energy")
plt.figure()
plt.plot(s2)
plt.grid()
plt.xlabel("Ordered Singular Values")
plt.ylabel("Energy")
plt.title("Singular Value Spectrum")
Out[8]:
Text(0.5,1,'Singular Value Spectrum')
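Rather than eyeballing the plot, you can read the cutoff directly off the cumulative spectrum. A self-contained sketch, using a synthetic rank-4 matrix as a stand-in for the centered `trainDF` (the real check would use the `spectrum` computed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the centered trainDF: a rank-4 matrix plus small noise.
low_rank = rng.normal(size=(19, 4)) @ rng.normal(size=(4, 50))
noisy = low_rank + 0.01 * rng.normal(size=low_rank.shape)

(U, s, Vh) = np.linalg.svd(noisy)
s2 = np.power(s, 2)
spectrum = np.cumsum(s2) / np.sum(s2)

# Smallest number of singular values capturing at least 99% of the energy.
k = int(np.argmax(spectrum >= 0.99)) + 1
print(k)
```

Applying the same `argmax` line to the real spectrum gives a principled choice for `singvals` in the next cell.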
This is the all-important step: we now create and fit the robust synthetic control model, using the 4 singular values suggested by the spectrum above.
In [9]:
singvals = 4
rscModel = RobustSyntheticControl(
    caStateKey, singvals, len(trainDF), probObservation=1.0,
    modelType='svd', svdMethod='numpy', otherSeriesKeysArray=otherStates)
rscModel.fit(trainDF)
denoisedDF = rscModel.model.denoisedDF()
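At this point it can be instructive to inspect the learned donor weights (`rscModel.model.weights`, which the next cell uses for prediction). Unlike classical synthetic control, RSC fits an unconstrained linear regression on denoised data, so the weights need not be sparse, non-negative, or sum to one; typically every donor gets some weight. A minimal ranking sketch, using a hypothetical stand-in array since fitting the real model requires the data above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rscModel.model.weights over a small donor pool;
# the real weights align with the otherStates column order.
otherStates = ['Colorado', 'Connecticut', 'Montana', 'Nevada', 'Utah']
weights = np.array([0.10, 0.05, 0.20, 0.25, 0.40])

# Rank donors by absolute weight to see which states dominate the
# synthetic California.
ranked = pd.Series(weights, index=otherStates).sort_values(
    key=np.abs, ascending=False)
print(ranked.head(3))
```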
Next up: the counterfactual predictions (over the test years) and the model fit (over the training years).
In [10]:
predictions = np.dot(testDF[otherStates], rscModel.model.weights)
actual = dfProp99.loc[dfProp99['LocationDesc'] == caStateKey]
actual = actual.drop('LocationDesc', axis=1)
actual = actual.iloc[0]
model_fit = np.dot(trainDF[otherStates][:], rscModel.model.weights)
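The plot below shows the gap visually, but the same arrays also give a simple numeric summary: the average actual-minus-counterfactual gap over the post-intervention years. A self-contained sketch with short hypothetical arrays standing in for the real `actual[testYears]` and `predictions` computed above:

```python
import numpy as np

# Hypothetical post-intervention values standing in for the real
# actual[testYears] and predictions arrays.
actual_post = np.array([82.4, 77.8, 68.7, 67.5, 63.4])
counterfactual = np.array([90.1, 88.5, 86.0, 84.2, 82.9])

# Average gap between treated unit and its synthetic control over the
# test window: a rough estimate of the effect on the treated unit.
gap = actual_post - counterfactual
att = gap.mean()
print(round(att, 2))
```

A negative average gap here would indicate that per-capita consumption fell below what the synthetic control predicts.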
And finally, the plot! That's it. You can clearly see the impact of Prop 99, via the sharper decline in per-capita smoking compared to the counterfactual.
In [11]:
fig, ax = plt.subplots(1,1)
tick_spacing = 5
# workaround for a matplotlib tick-labeling quirk: the first label is
# swallowed, so prepend a dummy entry
label_markings = np.insert(years[::tick_spacing], 0, 'dummy')
ax.set_xticks(np.arange(len(label_markings)))
ax.set_xticklabels(label_markings, rotation=45)
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.plot(years, actual, label='actual')
plt.xlabel('Year')
plt.ylabel('Per capita cigarette consumption')
plt.plot(trainingYears, model_fit, label='fitted model')
plt.plot(testYears, predictions, label='counterfactual')
plt.title(caStateKey+', Singular Values used: '+str(singvals))
plt.axvline(x=str(yearTrainEnd), color='k', linestyle='--', linewidth=4)
plt.grid()
plt.legend()
Out[11]:
<matplotlib.legend.Legend at 0x7f9478d14ac8>