liuyubobobo's "Machine Learning" Study Notes (7)


The K-Nearest Neighbors Algorithm (KNN)


1. Illustration

To classify the green point, we take the three points nearest to it: two are red and one is blue, so the green point is predicted to be red.

2. Implementing KNN in Code

X_train: the training data

y_train: the training labels

x: the data point to be predicted

import numpy as np
from math import sqrt

# Euclidean distance from x to every sample in the training set
distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]

The Euclidean distance (used to compute the distances above):

Two dimensions: $\sqrt{(x^a - x^b)^2 + (y^a - y^b)^2}$

Three dimensions: $\sqrt{(x^a - x^b)^2 + (y^a - y^b)^2 + (z^a - z^b)^2}$

n dimensions: $\sqrt{\sum_{i=1}^n (x_i^a - x_i^b)^2}$
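The n-dimensional formula can also be evaluated for all training samples at once with NumPy broadcasting; a minimal sketch, equivalent to the list comprehension above:

import numpy as np

# (m, n) - (n,) broadcasts x across every row of X_train;
# square, sum along each row, then take the square root
distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))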

k = 6
nearest = np.argsort(distances)
topK_y = [y_train[i] for i in nearest[:k]]  # labels of the k points nearest to x
# [1, 1, 1, 1, 1, 0]

from collections import Counter
votes = Counter(topK_y)  # a counter over the labels
# Counter({1: 5, 0: 1})
votes.most_common(1)  # the most frequent (label, count) pair
# [(1, 5)]
predict_y = votes.most_common(1)[0][0]  # the predicted label
# 1
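Putting the pieces together, here is a minimal self-contained version of the classifier (the function name kNN_classify is my own):

import numpy as np
from collections import Counter

def kNN_classify(k, X_train, y_train, x):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(distances)
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]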

3. The KNN Model

For KNN, the training set itself is the model: fitting stores the data and produces no learned parameters.

4. Using KNN from scikit-learn

from sklearn.neighbors import KNeighborsClassifier

KNN_classifier = KNeighborsClassifier(n_neighbors=6)  # construct an instance
KNN_classifier.fit(X_train, y_train)  # fit the model
KNN_classifier.predict(x)  # passing a 1-D vector is not recommended
KNN_classifier.predict(x.reshape(1, -1))  # reshape into a 2-D array with a single row

5. Evaluating the Performance of a Machine Learning Algorithm

Train-test split (train_test_split)

import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    if seed is not None:  # "is not None" so that seed=0 is also honored
        np.random.seed(seed)
    shuffle_indexes = np.random.permutation(len(X))  # a random permutation of the len(X) indices
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    X_train = X[train_indexes]
    X_test = X[test_indexes]
    y_train = y[train_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test

Measuring the accuracy of the KNN algorithm (accuracy_score)

X_train, X_test, y_train, y_test = train_test_split(X, y)
KNN_classifier = KNeighborsClassifier(n_neighbors=3)
KNN_classifier.fit(X_train, y_train)
y_predict = KNN_classifier.predict(X_test)
accuracy_score = sum(y_predict == y_test) / len(y_test)

train_test_split and accuracy_score in sklearn

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)

6. Hyperparameters

Hyperparameters: parameters that must be chosen before the algorithm runs.

Model parameters: parameters learned while the algorithm runs.

The KNN algorithm has no model parameters; k is a typical hyperparameter.

Ways to find good hyperparameters: domain knowledge, empirical values, and experimental search. For example, searching for the best k:

from sklearn.neighbors import KNeighborsClassifier

best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score

print("best_k = ", best_k)
print("best_score = ", best_score)

Another hyperparameter in KNN: weights

Possible values of weights (a usage sketch follows):

"uniform": all k neighbors vote with equal weight

"distance": each neighbor's vote is weighted by the inverse of its distance
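A minimal sketch of distance weighting, assuming the X_train and y_train from earlier:

from sklearn.neighbors import KNeighborsClassifier

# With weights="distance", closer neighbors contribute more to the vote (weight 1/distance)
knn_clf = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn_clf.fit(X_train, y_train)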

7. Grid Search

Exhaustively try every combination of hyperparameters and keep the combination that performs best.

Define the ranges of hyperparameters to search over:

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)]
    }
]

Import GridSearchCV and fit:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
# n_jobs: number of CPU cores to use; default=1, and -1 means all cores
# verbose: how much progress information to print
grid_search.fit(X_train, y_train)

Inspect the results:

grid_search.best_estimator_  # the best model
grid_search.best_score_  # the best (cross-validated) accuracy
grid_search.best_params_  # the best hyperparameters
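The best estimator can then be used directly; for example, assuming the X_test and y_test from Section 5:

knn_clf = grid_search.best_estimator_  # a fitted KNeighborsClassifier with the best hyperparameters
knn_clf.score(X_test, y_test)  # evaluate it on the test set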

8. Data Normalization

When features differ greatly in scale, some features dominate the distance computation. For example:

            Tumor size (cm)    Time since discovery (days)
Sample 1    1                  200
Sample 2    5                  100

Here the distance between the samples is dominated by the discovery time.
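Concretely, the Euclidean distance between the two samples is

$\sqrt{(1 - 5)^2 + (200 - 100)^2} = \sqrt{16 + 10000} \approx 100.08$

so the tumor-size difference contributes almost nothing (16 out of 10016) to the squared distance.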

The solution: data normalization.

Min-max scaling (normalization): maps all values into the range 0 to 1; suitable when the distribution has clear boundaries.

$X_{scale} = \frac{X - X_{min}}{X_{max} - X_{min}}$

Mean-variance scaling (standardization): transforms all values to a distribution with mean 0 and variance 1.

Suitable when the data has no obvious boundaries, or contains extreme values.

$X_{scale} = \frac{X - X_{mean}}{X_{std}}$
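A minimal NumPy sketch of both scalings on a single feature vector (my own illustration, not library code):

import numpy as np

X = np.random.randint(0, 100, size=50).astype(float)

# Min-max scaling: maps X into [0, 1]
X_minmax = (X - np.min(X)) / (np.max(X) - np.min(X))

# Mean-variance scaling: mean 0, variance 1
X_standard = (X - np.mean(X)) / np.std(X)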

Normalizing the test set with scikit-learn

Taking standardization as an example: the key information retained during fit is the mean and standard deviation of the training set.

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
standardScaler.mean_  # the per-feature means
# array([5.83416667, 3.08666667, 3.70833333, 1.17      ])
standardScaler.scale_  # the per-feature standard deviations
# array([0.81019502, 0.44327067, 1.76401924, 0.75317107])

# Scale both the training set and the test set
X_train = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

# Check the effect of the scaling
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test_standard, y_test)  # be sure to pass X_test_standard, not X_test!
# 1.0

As shown, the accuracy is 1.0, an excellent result.

For min-max scaling, import sklearn.preprocessing.MinMaxScaler instead.
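A sketch of the same workflow with MinMaxScaler, assuming a fresh X_train / X_test split as in Section 5:

from sklearn.preprocessing import MinMaxScaler

minMaxScaler = MinMaxScaler()
minMaxScaler.fit(X_train)  # learn the per-feature min and max from the training set
X_train_minmax = minMaxScaler.transform(X_train)
X_test_minmax = minMaxScaler.transform(X_test)  # apply the training set's min/max to the test set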

9. Some Extensions

Using the K-nearest neighbors algorithm for regression

Import sklearn.neighbors.KNeighborsRegressor; see the scikit-learn documentation.
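A minimal regression sketch (the toy data here is hypothetical, just to illustrate the API):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data (hypothetical)
X = np.random.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)  # the R^2 score on the test set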

Drawbacks of the K-nearest neighbors algorithm:

Low efficiency: with m training samples and n features, predicting a single new point costs O(m*n).

Highly data-dependent: a few erroneous samples can strongly distort the prediction.

Predictions are not interpretable.

The curse of dimensionality: as the number of dimensions grows, two "seemingly close" points get farther and farther apart.
