[Fitting] Ordinary Least Squares : OLS, 최소자승법

728x90

Ordinary Least Squares : OLS, 최소자승법

Solution을 구할 수 없는 Over-determined system에서 solution의 근사치(approximation)을 구할 수 있게 해주는 방법임.
Machine Learning에서 Supervised Learning의 대표적인 task인 Regression을 해결하는 가장 간단한 알고리즘임.
input과 output의 linear relation을 파악한다 (비선형도 features를 추가하는 방식으로 확장이 가능하긴함)

Over-determined System

Over-deterrmined system은 linear system (연립방정식)에서 지나치게 equation(식)이 많아서 모든 식을 만족하는 solution이 없는 경우를 말한다.

쉽게 예를 든다면, 2차원 평면에서 다른 2개의 점만 있을 때에는 이 두 점을 정확히 지나는 직선을 항상 구할 수 있지만,
점이 3개가 될 경우에 이들 세 개의 점을 모두 지나는 직선은 존재하지 않을 수 있다.
만약 점이 더 많아지면 모든 점을 지나는 직선이 존재하기는 더욱 어려워진다.

여기서 직선을 정의 (=model을 정의)하는 2개의 변수(slope, intercept)가 바로 우리가 구하고자 하는 solution (linear model을 결정하는 parameters)이므로,

unknow이 2개이니

서로 다른 equation(식)이 2개(=2개의 점)면 이들을 모두 지나는 직선에 해당하는 solution을 구할 수 있지만,
solution이 만족해야하는 equation이 지나치게 많아지면 (=training dataset의 data sample의 수가 많아짐을 의미)
대부분의 경우 solution이 존재할 수 없게 된다.

구해야하는 Unknown의 수를 $n$이라고 하고, 주어진 equation의 수를 $m$이라고 할 경우 다음과 같다.

$m < n$ : under-determined system.
$m = n$ : determined system.
$m > n$ : overdeterined system.

대부분의 linear regression은 over-determined system에 대한 solution을 구해야한다.

2022.12.02 - [.../Math] - [Math] ill-posed, well-posed, ill-conditioned, well-conditioned matrix (or problem)

[Math] ill-posed, well-posed, ill-conditioned, well-conditioned matrix (or problem)

well-posed matrix and well-conditioned matrix $A\textbf{x}=\textbf{b}$와 같은 linear system에서 system matrix $A$가 invertible하다면 해당 linear system(달리 말하면 연립방정식)이 well-posed라고 할 수 있다. 하지만, 해당 matri

dsaint31.tistory.com

OLS의 Loss function

다음의 linear system 경우, equation이 4개가 존재하고 unknown이 3개인 over-determined system이다.

$$
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 2 & 1 & 0 \end{bmatrix}\begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix}= \begin{bmatrix}1 \\ 2 \\ 1 \\ 0 \end{bmatrix}$$

OLS는 over-determined system인 $A\bf{x} \simeq \bf{b}$ (,where $A \in \mathbb{R}^{m\times n}, \bf{b} \in \mathbb{R}^n, \text{ and } m \gg n$)에서 근사해 (the approximation of the solution , $\hat{\textbf{x}}$)을 구해준다.
이 때 approxiation $\hat{\textbf{x}}$는 다음과 같이 정의된다.

$$\hat{\textbf{x}}= \underset{\textbf{x}} {\text{arg } \text{min }} \Vert \textbf{b}-A\textbf{x} \Vert$$

$A\textbf{x}$는 matrix $A$의 column space ($\text{Col }A$)에 속하는 vector이고 해당 vector가 $\textbf{b}$와 거리, 즉 difference vector ($\textbf{b}-A\textbf{x}$)의 magnitude가 가장 작은 경우를 만족하는 solution이 바로 $\hat{\textbf{x}}$이다.

$\Vert \textbf{b}-A\textbf{x} \Vert$ 은 Residual Sum of Squares (잔차제곱합)의 제곱근을 의미.
Residual Sum of Squares는 linear model의 추정치($A\textbf{x}$)와 정답($\textbf{b}$)간의 error의 정도를 나타냄.
- Machine Learning의 관점에서는 바로 loss function임 (최소화시켜야하는 대상).
$\Vert ~ \Vert$는 L2-norm이기 때문에 제곱근을 생략하면 결국 Sum of Squares임.
때문에 제곱 (squares)을 최소화(least)한다라는 이름(Least Square)이 붙은 것임.

기하학적으로 본 OLS

기하학적으로 쉽게 살펴보기 위해 $\text{Col }A$를 평면이라고 하면 다음 그림으로 위의 관계가 설명된다.

$\text{Col }A$는 matrix $A$의 모든 column들을 각각 vector라고 생각하고 이들의 linear combination으로 만들어질 수 있는 모든 vector이 구성하는 vector들을 포함하는 일종의 집합(이를 span이라고 칭함)이다.
$$ A=\begin{bmatrix}\textbf{a}_1 & \textbf{a}_2 & \cdots & \textbf{a}_n\end{bmatrix} \\ A\textbf{x} = A\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_n\end{bmatrix} = x_1\textbf{a}_1 + x_2\textbf{a}_2 + \cdots + x_n\textbf{a}_n$$
위 그림에서 $A\textbf{x}_0$, $A\textbf{x}_1$, $A\hat{\textbf{x}}$ 모두 $\text{Col }A$에 포함되며, 평면으로 $\text{Col }A$를 나타낸 경우엔 해당 평면에 포함된 점들에 대한 위치 벡터들이라고 볼 수 있다.

위의 그림에서 보이듯이 $\Vert\textbf{b}-A\textbf{x}\Vert$가 최소가 되는 difference vector $(\textbf{b}-A\hat{\textbf{x}})$는 반드시 $\text{Col }A$에 포함되는 모든 vector들과 orthogonal(직교)해야하며 이를 식을 표현하면 다음과 같다.

$$(x_1 \textbf{a}_1 + x_2 \textbf{a}_2 + \cdots +x_n \textbf{a}_n ) \perp (\textbf{b}-A\hat{\textbf{x}}) \text{ for any vector }\textbf{x}$$

위 식을 풀어쓰면 다음과 같다.
$$ \begin{matrix}
\begin{matrix} \textbf{a}_1 \perp (\textbf{b}-A\hat{\textbf{x}}) \\ \textbf{a}_2 \perp (\textbf{b}-A\hat{\textbf{x}}) \\ \vdots\\ \textbf{a}_n \perp (\textbf{b}-A\hat{\textbf{x}}) \\ \end{matrix} \Rightarrow \begin{matrix} {\textbf{a}_1}^T (\textbf{b}-A\hat{\textbf{x}})=0 \\ {\textbf{a}_2}^T (\textbf{b}-A\hat{\textbf{x}})=0 \\ \vdots \\ {\textbf{a}_n}^T (\textbf{b}-A\hat{\textbf{x}})=0 \\ \end{matrix} \Rightarrow A^T(\textbf{b}-A\hat{\textbf{x}})=\textbf{0} \Rightarrow A^TA\hat{\textbf{x}}=A^T\textbf{b} \end{matrix}$$

Normal Equation

위의 식을 통해 주어진 least square problem $𝐴\textbf{𝐱} \simeq \textbf{b}$에 대한 근사해가 만족해야하는 다음의 equation이 얻어진다.

$$A^TA\hat{\textbf{x}}=A^T\textbf{b}$$

위 식에서 $A^TA$는 symmetric matrix인데, 해당 matrix가 만약 invertible하다면 다음과같이 $\hat{\textbf{x}}$에 대해 정리된 식을 얻을 수 있다.

$$\hat{\textbf{x}} = (A^TA)^{-1}A^T\textbf{b}$$

위 식이 바로 normal equation이고, linear regression 에서 closed-form solution이 된다.

2022.04.29 - [.../Math] - Closed-form solution and Closed-form expression

Closed-form solution and Closed-form expression

Closed-form solution 💡 Solution(해)이 closed-form expression으로 주어진 것을 가르킴. 다음의 문장들 은 위와 같은 뜻. 방정식(equation)을 analytic method로 solution을 구할 수 있다. equation의 solution이 closed-form solu

dsaint31.tistory.com

$A^TA$는 symmetric matrix이므로, 이론적으로 $A^TA$가 positive definte matrix인 경우 inverse를 구할 수 있고 normal equation으로 solution이 구해진다.

symmetric matrix의 경우, eigen value들이 항상 real number이고 eigen vector들이 orthogonal임.
$A^TA$의 경우 항상 postive semi-definite이다.
symmetric matrix는 항상 diagnolizable(대각화)이 가능하기 때문에, eigen value들이 0만 아니라며 항상 invertible임.

즉, 이론적으로는 singular(=non-invertible)일 수도 있으나, 실제 학습데이터로 각 feature의 값을 column으로 놓고 각 instance(or case)를 행으로 배치하여 matrix $A$를 만드는 대부분의 경우 $A^TA$는 invsertible하며 때문에 normal equation으로 closed-form solution를 구할 수 있다.

Over-determined problem을 푸는 다른 방법.

같은 Least square regression이지만, independent variables에서도 error가 있다고 가정하고 처리하는 Total Linear Regression도 있음(보다 많이 이용됨)

2024.06.22 - [.../Math] - [CV] Fitting: Total Least Square Regression

[CV] Fitting: Total Least Square Regression

Total Least Squares (TLS) RegressionTotal Least Squares (TLS) 회귀는 데이터의 모든 방향에서의 오차를 최소화하는 회귀 방법임.이는 특히 독립 변수와 종속 변수 모두에 오차가 포함되어 있는 경우에 유용함.

dsaint31.tistory.com

Closed-form equation으로 접근하는 Analystic approach를 사용하는 경우에는,

어느 경우에나 solution을 구할 수 있는 Singular Value Decomposition에 기반한 pseudo inverse를 이용하는 경우가 일반적이고,

Iterative method로 푸는 경우엔 Gradient descent method가 가장 많이 사용된다.

2022.12.02 - [.../Math] - [LA] Pseudo Inverse Matrix

[LA] Pseudo Inverse Matrix

Pseudo Inverse Matrix square matrix 가 아닌 matrix $A_{m\times n}$에서 inverse에 해당하는 matrix를 구하는데 사용됨. $$ A^+_{n\times m}=\left\{\begin{array}{ll} (A^TA)^{-1}A^T &,A^+A=I_{n\times n}&\text{if }m \ge n&(1)\\A^T(AA^T)^{-1}&,AA^

dsaint31.tistory.com

'Programming > ML' 카테고리의 다른 글

[Fitting] Total Least Square Regression (1)	2024.06.22
[ML] RANdom SAmple Consensus (RANSAC) (0)	2024.06.13
[ML] Feature Scaling (0)	2024.05.04
[Statistics] coefficient of determination (결정계수 ~ R squared) (0)	2024.05.01
[ML] Underfit (0)	2023.09.21

Ordinary Least Squares : OLS, 최소자승법

Over-determined System

OLS의 Loss function

기하학적으로 본 OLS

Normal Equation

Over-determined problem을 푸는 다른 방법.

'Programming > ML' 카테고리의 다른 글

티스토리툴바