Inference on Variable Importance Measure, Part 2

5 minute read

References

  1. Demystifying statistical learning based on efficient influence functions
  2. Semiparametric doubly robust targeted double machine learning: a review
  3. A General Framework for Inference on Algorithm-Agnostic Variable Importance
  4. Variable importance measures for heterogeneous causal effects

Statistical Inference on Variable Importance Measures (VIM)

import os
os.chdir(os.path.dirname(os.path.abspath(__file__)))
import numpy as np
import pandas as pd
import tqdm

Settings

The simulation data setting is as follows:

  • Number of observations: $n = 1000$
  • Number of covariates: $p = 5$
  • Treatment ($\sigma(\cdot)$ is the sigmoid function):
\[A \sim Ber(\sigma(0.5 \cdot X_1 -0.4 \cdot X_2 + 0.3 \cdot X_3 + 0.2 \cdot X_4 -0.1 \cdot X_5))\]
  • Conditional ATE:
\[\tau(X) = 1 \cdot X_1 -2 \cdot X_2 -3 \cdot X_3 -4 \cdot X_4 +5 \cdot X_5\]
  • Outcome:
\[Y \sim N(-5 \cdot X_1 -4 \cdot X_2 + 3 \cdot X_3 -2 \cdot X_4 +1 \cdot X_5 + A \cdot \tau(X), 1)\]

Based on the probability distributions above, the simulation data can be generated in Python as follows.

np.random.seed(1)
n = 1000
p = 5
X = np.random.uniform(low=-1, high=1, size=(n, p))

# treatment assignment: A ~ Ber(sigmoid(X @ beta))
beta = np.array([[0.5, -0.4, 0.3, 0.2, -0.1]])
logit = X @ beta.T
prob = 1 / (1 + np.exp(-logit))
treatment = np.random.binomial(n=1, p=prob)

# conditional ATE tau(X) and outcome Y = X @ beta + A * tau(X) + N(0, 1) noise
beta = np.array([[1, -2, -3, -4, 5]])
cate = X @ beta.T
beta = np.array([[-5, -4, 3, -2, 1]])
outcome = X @ beta.T + treatment * cate + np.random.normal(size=(n, 1))

data = np.concatenate([X, treatment, outcome], axis=1)
covariates = ['X{}'.format(i+1) for i in range(p)]
data = pd.DataFrame(data, columns=covariates + ['treatment', 'outcome'])
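
As a quick sanity check (not part of the original script), one can inspect the generated data, e.g. its shape and the empirical treatment rate:

print(data.shape)                # (1000, 7): five covariates, treatment, outcome
print(data['treatment'].mean())  # empirical treatment probability, roughly 0.5
print(data.head())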

Implementation

1. When and where should machine learning methods be used?

In total, three machine learning models are needed:

  1. Conditional expectation: $Q_n(a, x) = \hat{\mathbb{E}}[Y \vert A=a, X=x]$, where $\hat{P}_n$ is any distribution of $(X, A, Y)$ such that
    • the marginal distribution of $X$ is the empirical distribution, and
    • the conditional mean of $Y$ given $A = a$ for $a = 0, 1$ and $X = X_i$ for $i = 1, \cdots, n$ equals $\hat{\mathbb{E}} [Y \vert A=a, X=X_i]$, estimated with a random forest regression algorithm
  2. Propensity score: $\pi_n(a, x) = \hat{Pr}(A=a \vert X=x)$, estimated with a random forest classification algorithm
  3. Subset conditional expectation: $Q_{n,s}(a, x) = \hat{\mathbb{E}} [Q_n(a, x) \vert X_{-s} = x_{-s} ]$, estimated with a random forest regression algorithm
    • i.e., regress $Q_n(A_i, X_i)$ on $X_{i, -s}$

Using these three models, the following additional quantities are defined and computed (the resulting VIM estimate and its 95% confidence interval are summarized right after this list):

  • treatment rule: $f_n(x) = I(Q_n(1, x) > Q_n(0, x))$
  • subset treatment rule: $f_{n, s}(x) = I(Q_{n, s}(1, x) > Q_{n, s}(0, x))$
  • Estimate of the nonparametric EIF of $P \mapsto V(f_0, P)$ at $P_0$:
\[\phi_n: z \mapsto \frac{I(a = f_n(x))}{\pi_n(f_n(x), x)} (y - Q_n(f_n(x), x)) + Q_n(f_n(x), x) - V(f_n, \hat{P}_n)\]
  • Estimate of the nonparametric EIF of $P \mapsto V(f_{0, s}, P)$ at $P_0$:
\[\phi_{n,s}: z \mapsto \frac{I(a = f_{n, s}(x))}{\pi_n(f_{n, s}(x), x)} (y - Q_n(f_{n, s}(x), x)) + Q_n(f_{n, s}(x), x) - V(f_{n,s}, \hat{P}_n)\]
  • Estimate of the oracle predictiveness:
\[V(f_n, \hat{P}_n) = \frac{1}{n}\sum_{i=1}^n \hat{\mathbb{E}}[Y \vert A=f_n(X_i), X=X_i]\]
  • Estimate of the subset oracle predictiveness:
\[V(f_{n, s}, \hat{P}_n) = \frac{1}{n}\sum_{i=1}^n \hat{\mathbb{E}}[Y \vert A=f_{n, s}(X_i), X=X_i]\]
  • one-step estimator of $V(f_0, P_0)$:
\[v_n = V(f_n, \hat{P}_n) + \frac{1}{n} \sum_{i=1}^n \phi_n(Z_i)\]
  • one-step estimator of $V(f_{0, s}, P_0)$:
\[v_{n,s} = V(f_{n,s}, \hat{P}_n) + \frac{1}{n} \sum_{i=1}^n \phi_{n, s}(Z_i)\]
  • Asymptotic variance:
\[\tau_{n,s}^2 = \frac{1}{n} \sum_{i=1}^n (\phi_n(Z_i) - \phi_{n, s}(Z_i))^2\]
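
Combining these quantities, the VIM estimate for a subset $s$ and the Wald-type 95% confidence interval used in part 3 below are

\[\psi_{n,s} = v_n - v_{n,s}, \qquad \psi_{n,s} \pm 1.96 \cdot \frac{\tau_{n,s}}{\sqrt{n}}\]
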
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

np.random.seed(0)
K = 2  # number of folds for cross-fitted inference
m = data.shape[0] // K
index_list = [m * i for i in range(K)] + [data.shape[0]]

  • K is the number of data folds for cross-fitted inference, and index_list holds the observation indices at which the data are split.
    • $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2$
    • $\mathcal{D}_1$: fitting the algorithms (nuisance models)
    • $\mathcal{D}_2$: computing the nonparametric EIF and the predictiveness measure
  • Using index_list, the train ($\mathcal{D}_1$) and test ($\mathcal{D}_2$) data for fold i can be split as follows.
    idx = index_list[i]
    train = data.iloc[index_list[i] : index_list[i+1]]
    test = pd.concat([data.iloc[: index_list[i]], data.iloc[index_list[i+1] : ]], axis=0)


2. Defining the machine learning models and required functions

  • conditional expectation: $Q_n(a, x) = \hat{\mathbb{E}}[Y \vert A=a, X=x]$
    conditional_mean = RandomForestRegressor(random_state=0)
    conditional_mean.fit(train[covariates + ['treatment']], train['outcome'])
    
  • propensity score: $\pi_n(a, x) = \hat{Pr}(A=a \vert X=x)$
    propensity_score = RandomForestClassifier(random_state=0)
    propensity_score.fit(train[covariates], train['treatment'])
    
  • treatment rule: $f_n(x) = I(Q_n(1, x) > Q_n(0, x))$
    test1 = pd.DataFrame.copy(test)
    test1['treatment'] = 1
    test0 = pd.DataFrame.copy(test)
    test0['treatment'] = 0
    treatment_rule = (conditional_mean.predict(test1[covariates + ['treatment']]) >
                      conditional_mean.predict(test0[covariates + ['treatment']])).astype(int)
    
  • subset conditional expectation $Q_{n,s}(a, x) = \hat{\mathbb{E}} [Q_n(a, x) \vert X_{-s} = x_{-s} ]$
    pred = conditional_mean.predict(train[covariates + ['treatment']])
    conditional_mean_residual = RandomForestRegressor(random_state=0)
    # subset_complement: covariates excluding the subset s, i.e. X_{-s} (not defined in this snippet)
    conditional_mean_residual.fit(train[subset_complement + ['treatment']], pred)
    
  • subset treatment rule: $f_{n, s}(x) = I(Q_{n, s}(1, x) > Q_{n, s}(0, x))$
    test1 = pd.DataFrame.copy(test)
    test1['treatment'] = 1
    test0 = pd.DataFrame.copy(test)
    test0['treatment'] = 0
    treatment_rule_residual = (conditional_mean_residual.predict(test1[subset_complement + ['treatment']]) >
                               conditional_mean_residual.predict(test0[subset_complement + ['treatment']])).astype(int)
    
  • Estimate of the nonparametric EIF of $P \mapsto V(f_0, P)$ at $P_0$
    indicator = np.array((test['treatment'] == treatment_rule).astype(float))  # I(A_i = f_n(X_i))
    prob = (propensity_score.predict_proba(test[covariates]) * np.eye(2)[treatment_rule]).sum(axis=1)  # pi_n(f_n(X_i), X_i)
    test_ = pd.DataFrame.copy(test)
    test_['treatment'] = treatment_rule
    EIF = (indicator / prob) * (test_['outcome'] - conditional_mean.predict(test_[covariates + ['treatment']]))
    EIF += conditional_mean.predict(test_[covariates + ['treatment']])
    predictiveness = conditional_mean.predict(test_[covariates + ['treatment']]).mean()  # V(f_n, P_hat_n)
    EIF -= predictiveness
    

Here, $V(f_n, \hat{P}_n)$ equals predictiveness.

  • Estimate of the nonparametric EIF of $P \mapsto V(f_{0, s}, P)$ at $P_0$
    indicator = np.array((test['treatment'] == treatment_rule_residual).astype(float))
    prob = (propensity_score.predict_proba(test[covariates]) * np.eye(2)[treatment_rule_residual]).sum(axis=1)
    test_ = pd.DataFrame.copy(test)
    test_['treatment'] = treatment_rule_residual
    EIF_residual = (indicator / prob) * (test_['outcome'] - conditional_mean.predict(test_[covariates + ['treatment']]))
    EIF_residual += conditional_mean.predict(test_[covariates + ['treatment']])
    predictiveness_residual = conditional_mean.predict(test_[covariates + ['treatment']]).mean()
    EIF_residual -= predictiveness_residual
    

Here, $V(f_{n, s}, \hat{P}_n)$ equals predictiveness_residual.

  • Estimate of VIM: $v_n - v_{n,s}$
    oracle = (predictiveness + EIF.mean()) - (predictiveness_residual + EIF_residual.mean())
    
  • Estimate of the asymptotic variance $\tau_{n,s}^2$
    asymptotic_variance = ((EIF - EIF_residual) ** 2).mean()
    

(See https://github.com/CausalityXAI/xai/blob/main/ate/pom_vim_inference.py for the complete Python code.)
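
Part 3 below uses two lists, VIM and var, holding the cross-fitted VIM estimate and asymptotic-variance estimate for each covariate; they are not constructed in the snippets above. The following is a minimal sketch of how they could be assembled by packaging the part-2 snippets into a helper function and averaging over the K folds. The helper name, the per-covariate subsets $s = \{j\}$, and the fold averaging are illustrative assumptions; the linked script is the authoritative version.

def fold_vim_and_variance(train, test, subset_complement):
    # Q_n: E[Y | A, X] via random forest regression (part 2)
    conditional_mean = RandomForestRegressor(random_state=0)
    conditional_mean.fit(train[covariates + ['treatment']], train['outcome'])

    # pi_n: Pr(A | X) via random forest classification (part 2)
    propensity_score = RandomForestClassifier(random_state=0)
    propensity_score.fit(train[covariates], train['treatment'])

    # Q_{n,s}: regress Q_n(A_i, X_i) on X_{i,-s} (part 2)
    pred = conditional_mean.predict(train[covariates + ['treatment']])
    conditional_mean_residual = RandomForestRegressor(random_state=0)
    conditional_mean_residual.fit(train[subset_complement + ['treatment']], pred)

    # treatment rules f_n and f_{n,s} on the evaluation fold
    test1, test0 = test.copy(), test.copy()
    test1['treatment'], test0['treatment'] = 1, 0
    rule = (conditional_mean.predict(test1[covariates + ['treatment']]) >
            conditional_mean.predict(test0[covariates + ['treatment']])).astype(int)
    rule_s = (conditional_mean_residual.predict(test1[subset_complement + ['treatment']]) >
              conditional_mean_residual.predict(test0[subset_complement + ['treatment']])).astype(int)

    def eif_and_predictiveness(rule):
        # nonparametric EIF estimate and plug-in predictiveness for a given rule
        indicator = np.array((test['treatment'] == rule).astype(float))
        prob = (propensity_score.predict_proba(test[covariates]) * np.eye(2)[rule]).sum(axis=1)
        test_ = test.copy()
        test_['treatment'] = rule
        fitted = conditional_mean.predict(test_[covariates + ['treatment']])
        predictiveness = fitted.mean()
        eif = (indicator / prob) * (test_['outcome'] - fitted) + fitted - predictiveness
        return eif, predictiveness

    EIF, predictiveness = eif_and_predictiveness(rule)
    EIF_s, predictiveness_s = eif_and_predictiveness(rule_s)
    vim = (predictiveness + EIF.mean()) - (predictiveness_s + EIF_s.mean())  # v_n - v_{n,s}
    tau2 = ((EIF - EIF_s) ** 2).mean()                                       # tau_{n,s}^2
    return vim, tau2

VIM, var = [], []
for name in covariates:                          # one VIM per covariate, s = {name}
    subset_complement = [c for c in covariates if c != name]
    folds = [fold_vim_and_variance(
                 data.iloc[index_list[i] : index_list[i+1]],
                 pd.concat([data.iloc[: index_list[i]], data.iloc[index_list[i+1] :]], axis=0),
                 subset_complement)
             for i in range(K)]
    VIM.append(np.mean([f[0] for f in folds]))   # cross-fitted VIM estimate
    var.append(np.mean([f[1] for f in folds]))   # cross-fitted asymptotic variance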

3. 95% confidence interval for the VIM

result = [(name,
           vim - 1.96 * np.sqrt(v) / np.sqrt(data.shape[0]),  # lower 95% limit
           vim + 1.96 * np.sqrt(v) / np.sqrt(data.shape[0]),  # upper 95% limit
           vim)
          for vim, v, name in zip(VIM, var, covariates)]  # np.sqrt(v): var is assumed to hold tau_{n,s}^2

# Visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
plt.errorbar([x[0] for x in result],          # covariate names
             [x[3] for x in result],          # VIM point estimates
             [x[2] - x[3] for x in result],   # half-width of the 95% CI
             linestyle='None', marker='o')
plt.ylabel('VIM')
plt.savefig('assets/VIM_toy.png')

Interpretation

  • $X_5$ has the highest population-level importance.
    • Population-level importance here means the amount by which the oracle predictiveness decreases when $X_5$ is removed from the full set of covariates.
  • For $s = \{5\}$, the VIM estimate is $\psi_{n, s} = 0.64$, and an asymptotically valid 95% confidence interval for $\psi_{0, s}$ is $(0.20, 1.08)$.
  • In other words, if $X_5$ is dropped from the full set of covariates, the oracle potential outcome mean (the predictiveness measure) we can attain decreases by 0.64.
