Inference on Variable Importance Measure, Part 2
Reference paper
Statistical Inference on Variable Importance (Inference on Variable Importance Measure (VIM))
import os
os.chdir(os.path.dirname(os.path.abspath(__file__)))
import numpy as np
import pandas as pd
import tqdm
Settings
The simulation data are generated under the following setting:
- 데이터 개수 $n = 1000$
- 변수 개수 $p = 5$
- treatment ($\sigma(\cdot)$ is the sigmoid function): $A \mid X \sim \mathrm{Bernoulli}\big(\sigma(X^\top \beta_A)\big)$, where $\beta_A = (0.5, -0.4, 0.3, 0.2, -0.1)$
- Conditional ATE: $\tau(X) = X^\top \beta_\tau$, where $\beta_\tau = (1, -2, -3, -4, 5)$
- Outcome: $Y = X^\top \beta_Y + A \cdot \tau(X) + \epsilon$ with $\epsilon \sim N(0, 1)$, where $\beta_Y = (-5, -4, 3, -2, 1)$
Based on the probability distributions above, the simulation data can be generated in Python as follows.
np.random.seed(1)
n = 1000
p = 5
# covariates: X ~ Uniform(-1, 1), n x p
X = np.random.uniform(low=-1, high=1, size=(n, p))
# treatment: A | X ~ Bernoulli(sigmoid(X @ beta_A))
beta = np.array([[0.5, -0.4, 0.3, 0.2, -0.1]])
logit = X @ beta.T
prob = 1 / (1 + np.exp(-logit))
treatment = np.random.binomial(n=1, p=prob)
# conditional ATE: tau(X) = X @ beta_tau
beta = np.array([[1, -2, -3, -4, 5]])
cate = X @ beta.T
# outcome: Y = X @ beta_Y + A * tau(X) + standard normal noise
beta = np.array([[-5, -4, 3, -2, 1]])
outcome = X @ beta.T + treatment * cate + np.random.normal(size=(n, 1))
# assemble the data frame with covariates, treatment, and outcome
data = np.concatenate([X, treatment, outcome], axis=1)
covariates = ['X{}'.format(i+1) for i in range(p)]
data = pd.DataFrame(data, columns=covariates + ['treatment', 'outcome'])
Implementation
1. When and where should machine learning methods be used?
A total of three machine learning models are needed:
- conditional expectation: $Q_n(a, x) = \hat{\mathbb{E}}[Y \vert A=a, X=x]$, where $\hat{P}_n$ is any distribution of $(X, A, Y)$ such that
  - the marginal distribution of $X$ is the empirical distribution, and
  - the conditional mean of $Y$ given $A = a$ for $a = 0, 1$ and $X = X_i$ for $i = 1, \cdots, n$ equals $\hat{\mathbb{E}} [Y \vert A=a, X=X_i]$, estimated with a random forest regression algorithm
- propensity score: $\pi_n(a, x) = \hat{Pr}(A=a \vert X=x)$, estimated with a random forest classification algorithm
- subset conditional expectation: $Q_{n,s}(a, x) = \hat{\mathbb{E}} [Q_n(a, x) \vert X_{-s} = x_{-s} ]$, estimated with a random forest regression algorithm
  - regress $Q_n(A_i, X_i)$ on $X_{i, -s}$
Using these three models, the following additional functions are defined and computed:
- treatment rule: $f_n(x) = I(Q_n(1, x) > Q_n(0, x))$
- subset treatment rule: $f_{n, s}(x) = I(Q_{n, s}(1, x) > Q_{n, s}(0, x))$
- Estimate of the nonparametric EIF of $P \mapsto V(f_0, P)$ at $P_0$:
  $$\hat{\phi}_n(x, a, y) = \frac{I(a = f_n(x))}{\pi_n(f_n(x), x)} \left\{ y - Q_n(f_n(x), x) \right\} + Q_n(f_n(x), x) - V(f_n, \hat{P}_n)$$
- Estimate of the nonparametric EIF of $P \mapsto V(f_{0, s}, P)$ at $P_0$:
  $$\hat{\phi}_{n, s}(x, a, y) = \frac{I(a = f_{n, s}(x))}{\pi_n(f_{n, s}(x), x)} \left\{ y - Q_n(f_{n, s}(x), x) \right\} + Q_n(f_{n, s}(x), x) - V(f_{n, s}, \hat{P}_n)$$
- Estimate of the oracle predictiveness: $V(f_n, \hat{P}_n) = \frac{1}{n} \sum_{i=1}^n Q_n(f_n(X_i), X_i)$
- Estimate of the subset oracle predictiveness: $V(f_{n, s}, \hat{P}_n) = \frac{1}{n} \sum_{i=1}^n Q_n(f_{n, s}(X_i), X_i)$
- one-step estimator of $V(f_0, P_0)$: $v_n = V(f_n, \hat{P}_n) + \frac{1}{n} \sum_{i=1}^n \hat{\phi}_n(X_i, A_i, Y_i)$
- one-step estimator of $V(f_{0, s}, P_0)$: $v_{n, s} = V(f_{n, s}, \hat{P}_n) + \frac{1}{n} \sum_{i=1}^n \hat{\phi}_{n, s}(X_i, A_i, Y_i)$
- asymptotic variance: $\sigma_{n, s}^2 = \frac{1}{n} \sum_{i=1}^n \left\{ \hat{\phi}_n(X_i, A_i, Y_i) - \hat{\phi}_{n, s}(X_i, A_i, Y_i) \right\}^2$
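Putting these together (this summary follows the one-step construction of the referenced paper and is added here for readability), the VIM estimate and its Wald-type 95% confidence interval are
$$\psi_{n, s} = v_n - v_{n, s}, \qquad \psi_{n, s} \pm 1.96 \sqrt{\sigma_{n, s}^2 / n},$$
which are the quantities computed in the code of sections 2 and 3 below.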
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
np.random.seed(0)
K = 2 # cross-fitted inference
m = data.shape[0] // K
index_list = [m * i for i in range(K)] + [data.shape[0]]
`K` is the number of data folds for cross-fitted inference, and `index_list` holds the observation indices used to split the data.
- $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2$
- $\mathcal{D}_1$: used to fit the machine learning algorithms
- $\mathcal{D}_2$: used to compute the nonparametric EIF and the predictiveness measures
Using `index_list`, the train ($\mathcal{D}_1$) and test ($\mathcal{D}_2$) data for a given fold `i` can be split as follows.
idx = index_list[i]
train = data.iloc[index_list[i] : index_list[i+1]]
test = pd.concat([data.iloc[: index_list[i]], data.iloc[index_list[i+1] : ]], axis=0)
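For orientation, here is a minimal sketch of the enclosing cross-fitting loop that the per-fold snippet above (and the per-fold code in section 2) runs inside; the loop structure is assumed from the full script linked at the end of section 2, and the print statement is only illustrative.
# sketch of the cross-fitting loop (assumed structure)
for i in range(K):
    train = data.iloc[index_list[i] : index_list[i + 1]]
    test = pd.concat([data.iloc[: index_list[i]], data.iloc[index_list[i + 1]:]], axis=0)
    # fold i: fit the models of section 2 on `train`, then evaluate the EIFs on `test`
    print('fold {}: train {}, test {}'.format(i, train.shape, test.shape))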
2. Defining the machine learning models and the required functions
- conditional expectation: $Q_n(a, x) = \hat{\mathbb{E}}[Y \vert A=a, X=x]$
conditional_mean = RandomForestRegressor(random_state=0)
conditional_mean.fit(train[covariates + ['treatment']], train['outcome'])
- propensity score: $\pi_n(a, x) = \hat{Pr}(A=a \vert X=x)$
propensity_score = RandomForestClassifier(random_state=0)
propensity_score.fit(train[covariates], train['treatment'])
- treatment rule: $f_n(x) = I(Q_n(1, x) > Q_n(0, x))$
test1 = pd.DataFrame.copy(test)
test1['treatment'] = 1
test0 = pd.DataFrame.copy(test)
test0['treatment'] = 0
treatment_rule = (conditional_mean.predict(test1[covariates + ['treatment']])
                  > conditional_mean.predict(test0[covariates + ['treatment']])).astype(int)
- subset conditional expectation: $Q_{n,s}(a, x) = \hat{\mathbb{E}} [Q_n(a, x) \vert X_{-s} = x_{-s} ]$
# regress the fitted values Q_n(A_i, X_i) on X_{i,-s};
# `subset_complement` (the covariates with the subset s removed) is assumed to be defined
# beforehand -- see the aggregation sketch at the end of this section and the full script
pred = conditional_mean.predict(train[covariates + ['treatment']])
conditional_mean_residual = RandomForestRegressor(random_state=0)
conditional_mean_residual.fit(train[subset_complement + ['treatment']], pred)
- subset treatment rule: $f_{n, s}(x) = I(Q_{n, s}(1, x) > Q_{n, s}(0, x))$
test1 = pd.DataFrame.copy(test)
test1['treatment'] = 1
test0 = pd.DataFrame.copy(test)
test0['treatment'] = 0
treatment_rule_residual = (conditional_mean_residual.predict(test1[subset_complement + ['treatment']])
                           > conditional_mean_residual.predict(test0[subset_complement + ['treatment']])).astype(int)
- Estimate of the nonparametric EIF of $P \mapsto V(f_0, P)$ at $P_0$
indicator = np.array((test['treatment'] == treatment_rule).astype(float))
prob = (propensity_score.predict_proba(test[covariates]) * np.eye(2)[treatment_rule]).sum(axis=1)
test_ = pd.DataFrame.copy(test)
test_['treatment'] = treatment_rule
EIF = (indicator / prob) * (test_['outcome'] - conditional_mean.predict(test_[covariates + ['treatment']]))
EIF += conditional_mean.predict(test_[covariates + ['treatment']])
predictiveness = conditional_mean.predict(test_[covariates + ['treatment']]).mean()
EIF -= predictiveness
Here, $V(f_n, \hat{P}_n)$ equals `predictiveness`.
- Estimate of the nonparametric EIF of $P \mapsto V(f_{0, s}, P)$ at $P_0$
indicator = np.array((test['treatment'] == treatment_rule_residual).astype(float))
prob = (propensity_score.predict_proba(test[covariates]) * np.eye(2)[treatment_rule_residual]).sum(axis=1)
test_ = pd.DataFrame.copy(test)
test_['treatment'] = treatment_rule_residual
EIF_residual = (indicator / prob) * (test_['outcome'] - conditional_mean.predict(test_[covariates + ['treatment']]))
EIF_residual += conditional_mean.predict(test_[covariates + ['treatment']])
predictiveness_residual = conditional_mean.predict(test_[covariates + ['treatment']]).mean()
EIF_residual -= predictiveness_residual
Here, $V(f_{n, s}, \hat{P}_n)$ equals `predictiveness_residual`.
- Estimate of VIM: $v_n - v_{n,s}$
oracle = (predictiveness + EIF.mean()) - (predictiveness_residual + EIF_residual.mean())
- Estimate of the asymptotic variance
asymptotic_variance = ((EIF - EIF_residual) ** 2).mean()
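The snippets above compute everything for a single fold and a single covariate subset $s$. As a bridge to section 3, here is a hedged sketch of how the fold-level results could be aggregated into the `VIM` and `var` lists used there. `fold_estimates` is a hypothetical helper (not in the original post) that reruns the section-2 steps for one fold and one subset and returns `(oracle, asymptotic_variance)`; the actual bookkeeping is in the full script linked below.
# illustrative aggregation over single-covariate subsets and over cross-fitting folds
def aggregate_vim(data, covariates, index_list, K, fold_estimates):
    VIM, var = [], []
    for s in covariates:
        subset_complement = [c for c in covariates if c != s]  # X_{-s}
        vim_folds, var_folds = [], []
        for i in range(K):
            train = data.iloc[index_list[i] : index_list[i + 1]]
            test = pd.concat([data.iloc[: index_list[i]], data.iloc[index_list[i + 1]:]], axis=0)
            # `fold_estimates` is assumed to reproduce the computations of section 2
            oracle_i, avar_i = fold_estimates(train, test, subset_complement)
            vim_folds.append(oracle_i)
            var_folds.append(avar_i)
        VIM.append(np.mean(vim_folds))  # cross-fitted VIM estimate for subset s
        var.append(np.mean(var_folds))  # cross-fitted asymptotic variance estimate
    return VIM, var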
(See https://github.com/CausalityXAI/xai/blob/main/ate/pom_vim_inference.py for the complete Python code.)
3. 95% confidence interval for the VIM
# (covariate, CI lower, CI upper, point estimate) for each single-covariate subset;
# `VIM` and `var` are assumed to hold the cross-fitted VIM estimates and asymptotic
# variance estimates (sigma_{n,s}^2), hence the square root in the standard error
result = [(name,
           est - 1.96 * np.sqrt(v / data.shape[0]),
           est + 1.96 * np.sqrt(v / data.shape[0]),
           est) for est, v, name in zip(VIM, var, covariates)]
# visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
plt.errorbar([x[0] for x in result],
[x[3] for x in result],
[x[2] - x[3] for x in result],
linestyle='None', marker='o')
plt.ylabel('VIM')
plt.savefig('assets/VIM_toy.png')
Interpretation
- $X_5$ has the highest population-level importance.
- What is population-level importance? It is the amount by which the oracle predictiveness decreases when $X_5$ is removed from the full set of covariates.
- For $s = \{5\}$, the VIM estimate is $\psi_{n, s} = 0.64$, and an asymptotically valid 95% confidence interval for $\psi_{0, s}$ is $(0.20, 1.08)$.
- In other words, if we exclude $X_5$ from the full set of covariates, the oracle potential outcome mean (the predictiveness measure) we can attain decreases by 0.64.
Comments