Distance-Based Analysis of Ordinal Data and Ordinal Time Series (Paper Reading)
notations
- ordered categorical range $S = \{s_0, \cdots, s_m\}$ where $s_0 < s_1 < \cdots < s_m$
- ordinal random variable $X \in \{s_0, \cdots, s_m\}$
- rank count variable $I \in \{0, \cdots, m\}$
- $X = s_I$
distance function
- distance function $d: S \times S \to [0, \infty)$, $(s_k, s_l) \mapsto d(s_k, s_l)$
- example: block distance
- $d_{o, 1}(s_k, s_l) = \vert k - l \vert$
- distance values depend only on the ranks $k, l$, not on the actual labels $s_0, \cdots, s_m$
- possible properties of ordinal distances
- Maximization $d(s_0, s_m) = \max_{x, y \in S} d(x, y)$
- $d$ is said to be compatible with the ordering if
\[x < y < z \text{ implies that } d(x, z) > d(x, y), d(y, z)\]
- $d$ is said to be additive if for given $d_1, \cdots, d_m > 0$, it holds that
\[d(s_i, s_{i+k}) = d_{i+1} + \cdots + d_{i+k} \text{ for all } i = 0, \cdots, m-1, k = 1, \cdots, m-i\]
- Centrosymmetry
\[d(s_i, s_j) = d(s_{m-i}, s_{m-j}) \text{ for all } 0 \leq i < j \leq m\]
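The four properties above can be checked numerically for a given distance. The following is a minimal sketch, assuming categories are represented by their rank indices $0, \cdots, m$; the function names (`block_distance`, `is_order_compatible`, and so on) are illustrative and do not come from the paper.

```python
from itertools import combinations
from math import isclose


def block_distance(k, l):
    """Block distance d_{o,1} on rank indices: |k - l|."""
    return abs(k - l)


def distance_matrix(m, d=block_distance):
    """Matrix D[k][l] = d(s_k, s_l) for k, l = 0..m."""
    return [[d(k, l) for l in range(m + 1)] for k in range(m + 1)]


def has_maximization(D):
    """d(s_0, s_m) is the largest distance in the matrix."""
    m = len(D) - 1
    return D[0][m] == max(max(row) for row in D)


def is_order_compatible(D):
    """x < y < z implies d(x, z) > d(x, y) and d(x, z) > d(y, z)."""
    # combinations() yields index triples with x < y < z
    return all(D[x][z] > D[x][y] and D[x][z] > D[y][z]
               for x, y, z in combinations(range(len(D)), 3))


def is_additive(D):
    """d(s_i, s_{i+k}) equals the sum of the k adjacent step distances."""
    m = len(D) - 1
    steps = [D[i][i + 1] for i in range(m)]  # steps[i] plays the role of d_{i+1}
    return all(s > 0 for s in steps) and all(
        isclose(D[i][i + k], sum(steps[i:i + k]))
        for i in range(m) for k in range(1, m - i + 1))


def is_centrosymmetric(D):
    """d(s_i, s_j) = d(s_{m-i}, s_{m-j}) for all i < j."""
    m = len(D) - 1
    return all(isclose(D[i][j], D[m - i][m - j])
               for i in range(m + 1) for j in range(i + 1, m + 1))


D = distance_matrix(4)
print(has_maximization(D), is_order_compatible(D),
      is_additive(D), is_centrosymmetric(D))  # True True True True
```

As a contrast, passing a 0/1 (Hamming-type) distance such as `lambda k, l: int(k != l)` to `distance_matrix` fails the compatibility and additivity checks, since it ignores how far apart two ranks are.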
location
\[loc = \arg\min_{x \in S} E_X[d(X, x)]\]
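Because $S$ is finite, the location can be found by evaluating the expected distance at every category. A minimal sketch, assuming the distribution is given as a list `p` with `p[i]` $= P(X = s_i)$ and the block distance on rank indices; `expected_distance` and `location` are hypothetical helper names.

```python
def block_distance(k, l):
    """Block distance on rank indices: |k - l|."""
    return abs(k - l)


def expected_distance(p, x, d=block_distance):
    """E[d(X, s_x)] for a probability vector p over ranks 0..m."""
    return sum(p_i * d(i, x) for i, p_i in enumerate(p))


def location(p, d=block_distance):
    """Rank index minimising E[d(X, s_x)]; ties go to the smallest index."""
    return min(range(len(p)), key=lambda x: expected_distance(p, x, d))


p_sym = [0.1, 0.2, 0.4, 0.2, 0.1]   # symmetric around the middle category
p_low = [0.6, 0.2, 0.1, 0.1]        # mass concentrated on s_0
print(location(p_sym), location(p_low))  # 2 0
```

The result is always a rank index, so the location is itself an element of $S$. How ties are broken is a choice of this sketch, not something fixed by the definition.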
dispersion
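- dispersion around the location:
\[disp = E_X[d(X, loc)]\]
- alternatively, with $X_1, X_2$ independent copies of $X$ and $p_i = P(X = s_i)$: \(disp = E[d(X_1, X_2)] = \sum_{i,j=0}^m d(s_i, s_j) p_i p_j \leq d(s_0, s_m)\), where the bound follows from the maximization property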
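Both dispersion formulas above can be evaluated directly from the probabilities. The sketch below assumes a probability list `p` over ranks $0, \cdots, m$ and the block distance; the names `dispersion_about_location` and `dispersion_pairwise` are made up for this example.

```python
def block_distance(k, l):
    """Block distance on rank indices: |k - l|."""
    return abs(k - l)


def dispersion_about_location(p, d=block_distance):
    """E[d(X, loc)], with loc the minimiser of the expected distance."""
    m = len(p) - 1
    return min(sum(p_i * d(i, x) for i, p_i in enumerate(p))
               for x in range(m + 1))


def dispersion_pairwise(p, d=block_distance):
    """E[d(X_1, X_2)] = sum_{i,j} d(s_i, s_j) p_i p_j for independent copies."""
    return sum(d(i, j) * p_i * p_j
               for i, p_i in enumerate(p)
               for j, p_j in enumerate(p))


one_point = [0.0, 1.0, 0.0, 0.0, 0.0]   # all mass on one category
extreme = [0.5, 0.0, 0.0, 0.0, 0.5]     # mass split between s_0 and s_m

print(dispersion_pairwise(one_point))   # 0.0
print(dispersion_pairwise(extreme))     # 2.0, below the bound d(s_0, s_m) = 4
```

A one-point distribution has zero dispersion under either definition, and the pairwise version never exceeds $d(s_0, s_m)$ when the maximization property holds.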
asymmetry
- reflected copy $X^r = s_{m-I}$ where $P(X = s_i) = p_{i}$ and $P(X^r = s_i) = p_{m-i}$
- If $p_i = p_{m-i}$ for all $i$, then $X =_d X^r$: symmetric distribution
\[\begin{align}
asym &= E[d(X_1, X_2^r)] - disp \quad (X_1, X_2 \text{ independent copies of } X) \\
&= \sum_{i,j=0}^m d(s_i, s_j) p_i (p_{m-j} - p_j) \\
&= 0 \quad (\text{if the distribution of $X$ is symmetric})
\end{align}\]
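The sum formula for the asymmetry translates directly into code. This is a minimal sketch, assuming the reflected copy is drawn independently of $X$ (which is what the double sum over $p_i p_{m-j}$ expresses), a probability list `p` over ranks $0, \cdots, m$, and the block distance; `asymmetry` is a hypothetical name.

```python
def block_distance(k, l):
    """Block distance on rank indices: |k - l|."""
    return abs(k - l)


def asymmetry(p, d=block_distance):
    """sum_{i,j} d(s_i, s_j) p_i (p_{m-j} - p_j)."""
    m = len(p) - 1
    return sum(d(i, j) * p[i] * (p[m - j] - p[j])
               for i in range(m + 1)
               for j in range(m + 1))


p_sym = [0.1, 0.2, 0.4, 0.2, 0.1]   # p_i = p_{m-i}
p_low = [0.6, 0.2, 0.1, 0.1]        # mass concentrated on s_0

print(asymmetry(p_sym))   # 0.0, every factor p_{m-j} - p_j vanishes
print(asymmetry(p_low))   # positive for this skewed distribution
```

For the symmetric vector every term is zero, matching the last line of the derivation; the skewed vector gives a strictly positive value.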
skewness
\[\begin{align}
skew &= E[d(X, s_m)] - E[d(X, s_0)] \\
&= \sum_{i=0}^m d(s_i, s_m) p_i - \sum_{i=0}^m d(s_i, s_0) p_i \\
&= \sum_{i=0}^m d(s_i, s_m) p_i - \sum_{i=0}^m d(s_{m-i}, s_m) p_i \quad (\text{centrosymmetry}) \\
&= \sum_{i=0}^m d(s_i, s_m) p_i - \sum_{i=0}^m d(s_i, s_m) p_{m-i} \quad (\text{reindexing } i \to m-i) \\
&= \sum_{i=0}^m d(s_i, s_m) p_i - \sum_{i=0}^m d(s_i, s_m) p_i \quad (\text{symmetric distribution}) \\
&= 0
\end{align}\]
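The skewness definition can be checked numerically against the derivation above. A minimal sketch, assuming a probability list `p` over ranks $0, \cdots, m$ and the block distance (which is centrosymmetric); `skewness` is an illustrative name.

```python
def block_distance(k, l):
    """Block distance on rank indices: |k - l|."""
    return abs(k - l)


def skewness(p, d=block_distance):
    """E[d(X, s_m)] - E[d(X, s_0)] for a probability vector p."""
    m = len(p) - 1
    to_top = sum(p_i * d(i, m) for i, p_i in enumerate(p))
    to_bottom = sum(p_i * d(i, 0) for i, p_i in enumerate(p))
    return to_top - to_bottom


p_sym = [0.1, 0.2, 0.4, 0.2, 0.1]   # symmetric: skewness should vanish
p_low = [0.6, 0.2, 0.1, 0.1]        # mass near s_0: far from s_m on average

print(abs(skewness(p_sym)) < 1e-12)  # True
print(skewness(p_low) > 0)           # True
```

With the block distance the expression reduces to $m - 2E[I]$, so it is positive when the mass sits near $s_0$ and negative when it sits near $s_m$.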
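- A general way of defining distance
- Given a suitable distance function that satisfies the required conditions on an abstract space, the distance defined in 1 can be applied for modeling
- ordered categorical → could it be sampled and modeled with Gumbel-Softmax?
- In regression on categorical data, could heteroskedasticity be handled by weighting with the inverse of the dispersion?
- The location is always obtained as exactly one element of the category set $S$ (unlike the usual approach of estimating a value first and then picking the nearest category)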
Reference
- Weiß, C. H. (2019). Distance-based analysis of ordinal data and ordinal time series. Journal of the American Statistical Association.