[OpenCV] K-Means Tree Hyper-parameters: FLANN based Matcher

이 글은 OpenCV에서
K-Means Tree 를 중심으로
FLANN based Matcher의
하이퍼파라미터를 정리한 글임.

https://gist.github.com/dsaint31x/e06fee90d97dd177afe4e38c5b9db68c

dip_kmeans_hyperparams.ipynb

dip_kmeans_hyperparams.ipynb. GitHub Gist: instantly share code, notes, and snippets.

gist.github.com

1. K-Means Tree란

K-Means Tree는

OpenCV의 FLANN (Fast Library for Approximate Nearest Neighbors)에서
사용 가능한 고속 근사 검색 알고리즘 중 하나.

K-Means Tree는

대규모 및 고차원 데이터에서 효율적으로 검색할 수 있도록 설계되었으며,
하이퍼파라미터 조정을 통해 성능을 최적화할 수 있음

2. OpenCV 구현물

OpenCV에서는 다음의

Index Parameters 와
Search Parameters로

K-Means Tree의 Hyper-parameters 를 설정.

2-1.Index Parameters

index_params는 K-Means Tree 생성 방식을 정의하는 주요설정.

다음의 각 파라미터는 Tree의 구조와 근사 검색 성능에 영향을 줍니다.

2-1-1. algorithm

설명: 사용하려는 알고리즘을 지정.
값:
- 2 → K-Means Tree 알고리즘 (FLANN_KMEANS).

예시:

index_params = dict(algorithm=2)

0 인 경우 linear search
1 인 경우 k-d Tree
3 인 경우 Composite Tree로 k-d Tree와 K-Means Tree 가 혼합된 형태.

2-1-1. branching

설명: 각 노드에서 데이터를 몇 개의 클러스터로 나눌지 정의합니다. 이는 K-Means 알고리즘의 K값과 동일.
값: 양의 정수 (일반적으로 2~64)
- 영향:
  - 작은 값: Tree 깊이가 증가하며, 각 노드에 포함된 데이터가 많아짐
  - 큰 값: Tree 깊이가 얕아지며, 클러스터 분할이 더 세밀해짐.

예시:

 index_params = dict(
       algorithm=2, 
       branching=32, # 각 노드에서 32개의 클러스터 생성
)

2-1-3. iterations

설명: K-Means 클러스터링 과정에서 centroid 를 찾기 위한 업데이트를 반복하는 최대 횟수.
값: 양의 정수 (일반적으로 10~100).
- 영향:
  - 작은 값: Tree 생성 속도는 빠르지만, 클러스터 Centroid(중심)이 최적화되지 않을 수 있음.
  - 큰 값: 클러스터 중심이 더 정확히 수렴하지만, 생성 시간이 증가.

예시:

 index_params = dict(
       algorithm=2, 
       branching=32, 
       iterations=20,
)

2-1-4. centers_init

설명: K-Means 클러스터링에서 초기 클러스터 중심을 선택하는 방법.
Python OpenCV (4.10 현재)에선 지정이 제대로 안됨 (default로만 사용가능)
값:
- 0 → 랜덤 초기화. (default)
- 1 → Gonzales 알고리즘.
- 2 → K-Means++ 초기화.
영향:
- K-Means++ 초기화가 일반적으로 더 빠르게 수렴하고 안정적인 클러스터를 생성.

예시:

 index_params = dict(
        algorithm=1, 
        branching=32, 
        centers_init="kmeanspp",
)

2-1-5. trees

설명:
- 생성할 k-d Tree(1)의 개수를 정의: K-Means(2)에서 사용되지 않음.
- 여러 Tree를 병렬적으로 생성.
- 여러 Tree를 생성하면 검색 정확도가 향상됨.
값: 양의 정수 (일반적으로 1~10)
영향:
- 작은 값: 검색 속도가 빠르지만 정확도가 낮아질 수 있음.
- 큰 값: 검색 정확도 증가, 메모리 사용량 및 Tree 생성 시간이 증가.

예시:

 index_params = dict(
    algorithm=1,  # Use k-d Tree
    trees=4,      # Number of parallel trees
)

2-2. Search Parameters

search_params는

검색 과정에서의
근사성과
속도를 조정하는 hyperparmaeters들로 구성됨.

2-2-1. checks

설명: 검색 시 방문할 최대 노드 수를 정의.
값:
- 0 → 전체 탐색 (정확도 100%이지만 느림).
- 양의 정수 → 방문할 노드 수 제한 (근사 검색).
영향:
- 작은 값: 검색 속도가 빠르지만 정확도가 낮아질 수 있음.
- 큰 값: 검색 정확도가 증가하지만 검색 시간이 길어질 수 있음

예시:

search_params = dict(checks=50)

2-2-2. eps

설명: 근사 허용 오차. 검색 중 결과의 정확도와 탐색 범위를 조정합니다.
값: 양의 실수 (0.0 이상).
- eps = 0.0 → 완벽한 탐색.
- eps > 0.0 → 근사 검색 허용.
영향:
- 작은 값: 더 정확한 검색.
- 큰 값: 검색 속도 향상, 정확도 감소.

예시:

search_params = dict(eps=0.01)  # 근사 허용 오차 1%

2-3. Full Example

OpenCV에서 K-Means Tree를 설정하고 쿼리하는 전체 코드는 다음과 같음:

 import cv2
import numpy as np
# Generate random 4D data
data = np.random.rand(1000, 4).astype(np.float32)
 
# Define index parameters for K-Means Tree
index_params = dict(
    algorithm=2,                  # Use K-Means Tree
    branching=32,                 # Number of clusters per node    
    iterations=10,                # K-Means iterations    
    # centers_init="kmeanspp",      # Center initialization    
    # trees=4                       # Number of parallel trees for k-d Tree
)
# Define search parameter
search_params = dict(
    checks=50,    # Number of nodes to visit    
    eps=0.01      # Approximation accuracy
)
 
# Create the FLANN index
flann_index = cv2.flann_Index(
    data, 
    index_params,
)
 
# Query with a random point
query = np.random.rand(1, 4).astype(np.float32)
k = 5  # Number of nearest neighbors
indices, dists = flann_index.knnSearch(
     query, 
     k, 
     params=search_params,
)
 
# Print results
print(f"Query Point:{query}")
print(f"Nearest Neighbor Indices:{indices}")
print(f"Distances:{dists}")

3. Key Trade-offs

3-1. Accuracy vs Speed:

checks와 eps 파라미터를 통해 검색 속도와 정확도의 균형을 조절:

checks 파라미터
- 높은 값 (예: 100 이상): 더 많은 노드를 방문하여 정확도가 향상되지만 검색 속도가 느려짐
- 낮은 값 (예: 32 이하): 빠른 검색이 가능하나 정확도가 떨어질 수 있음
eps 파라미터
- 0에 가까운 값: 거의 정확한 최근접 이웃을 찾지만 계산 비용이 증가
- 큰 값 (예: 0.1 이상): 근사치를 허용하여 검색 속도가 향상되나 정확도는 감소

3-2. Memory Usage:

trees 값을 높이면 다음과 같은 Trade-off가 발생:

장점:
- 여러 Tree를 병렬 검색하여 검색 정확도 향상
- 다양한 경로로 근접 이웃을 찾아 결과의 신뢰성 증가
단점:
- k-d Tree에서만 의미가 있고, K-Means Tree에서는 의미가 없음.
- 각 Tree 마다 추가 메모리가 필요하여 전체 메모리 사용량 증가
- 여러 Tree를 동시에 구축해야 하므로 초기 인덱스 생성 시간 증가
- 검색 시 여러 Tree를 탐색하므로 검색 시간도 일부 증가

3-3. Initialization:

cpp 버전에서는 제대로 동작하지만,
Python Binding에선 정상적인 설정이 되지 않고 있음.

초기화 방법(centers_init)이
다음과 같이 K-Means Clustering의 수렴 속도와 클러스터 품질에 영향을 미침:

FLANN_CENTERS_RANDOM
- 무작위로 중심점을 선택하여 빠르게 시작
- 하지만 부적절한 초기 중심점으로 인해 수렴이 늦어질 수 있음
FLANN_CENTERS_GONZALES
- 첫 중심점을 무작위로 선택한 후, 기존 중심점들과 가장 먼 점들을 순차적으로 선택
- 중심점들이 고르게 분포하여 안정적인 클러스터링 가능
FLANN_CENTERS_KMEANSPP
- 첫 중심점을 무작위로 선택한 후,
- 기존 중심점들로부터의 거리에 비례하는 확률로 새로운 중심점 선택
- 일반적으로 가장 빠른 수렴과 높은 품질의 클러스터링 결과 제공

[DIP] K-Means Tree: Nearest Neighbor Search

https://gist.github.com/dsaint31x/f7f59e6b99f6ef3dc46e4e7da8826cd9 dip_kmeans_tree.ipynbdip_kmeans_tree.ipynb. GitHub Gist: instantly share code, notes, and snippets.gist.github.com 0. 소개K-Means Tree는대규모 데이터에서효율적으로 Nearest

dsaint31.tistory.com

https://docs.opencv.org/4.x/dc/dc3/tutorial_py_matcher.html

OpenCV: Feature Matching

Goal In this chapter We will see how to match features in one image with others. We will use the Brute-Force matcher and FLANN Matcher in OpenCV Basics of Brute-Force Matcher Brute-Force matcher is simple. It takes the descriptor of one feature in first se

docs.opencv.org

https://docs.opencv.org/4.x/db/d18/classcv_1_1flann_1_1GenericIndex.html?utm_source=chatgpt.com

OpenCV: cv::flann::GenericIndex< Distance > Class Template Reference

The FLANN nearest neighbor index class. This class is templated with the type of elements for which the index is built. More... template<typename Distance> class cv::flann::GenericIndex< Distance > The FLANN nearest neighbor index class. This class is temp

docs.opencv.org

https://dsaint31.me/mkdocs_site/ML/ch07/clustering/#k-means-and-k-medoids

BME228

Clustering (군집) feature space에서 가까운(=유사한) sample들을 모아 하나의 cluster로 묶는 task. input : label이 되어있지 않은 training data output : 유사한 sample들이 묶여있는 cluster hyper-parameter : cluster 를 몇개

dsaint31.me

'Programming > DIP' 카테고리의 다른 글

[DIP] Least Square Method for Fitting (0)	2024.11.25
[Math] Maximum Likelihood Estimator: M-Estimator (0)	2024.11.25
[DIP] K-Means Tree: Nearest Neighbor Search (0)	2024.11.25
[CV] Optical Flow: Lucas Kanade Method (LK method, 1981) (2)	2024.11.17
[CV] Optical Flow: Horn-Schunck Method (1981) (0)	2024.11.17

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`