To decide the class of the green point, pick the three points closest to it: two of them are red and one is blue, so the green point is predicted to be red.
X_train: the training data
y_train: the training labels
x: the sample to be predicted
```python
import numpy as np
from math import sqrt

# distance from x to every sample in the training set
distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
```

Euclidean distance (the distance formula used above):
2-D: $\sqrt{(x^a - x^b)^2 + (y^a - y^b)^2}$

3-D: $\sqrt{(x^a - x^b)^2 + (y^a - y^b)^2 + (z^a - z^b)^2}$

n-D: $\sqrt{\sum_{i=1}^{n} (x_i^a - x_i^b)^2}$
```python
k = 6
nearest = np.argsort(distances)             # indices of the training samples, sorted by distance
topK_y = [y_train[i] for i in nearest[:k]]  # labels of the k points nearest to x
# [1, 1, 1, 1, 1, 0]

from collections import Counter
votes = Counter(topK_y)                     # tally the votes
# Counter({1: 5, 0: 1})
votes.most_common(1)                        # the most frequent (label, count) pair
# [(1, 5)]
predict_y = votes.most_common(1)[0][0]      # the predicted label
# 1
```

For KNN, the training set itself is the model.
Train-test split (train_test_split)
```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    if seed is not None:
        np.random.seed(seed)
    shuffle_indexes = np.random.permutation(len(X))  # a random permutation of the indices 0..len(X)-1
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    X_train = X[train_indexes]
    X_test = X[test_indexes]
    y_train = y[train_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test
```

Measuring the accuracy of the KNN algorithm (accuracy_score)
```python
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y)
KNN_classifier = KNeighborsClassifier(n_neighbors=3)
KNN_classifier.fit(X_train, y_train)
y_predict = KNN_classifier.predict(X_test)
accuracy_score = sum(y_predict == y_test) / len(y_test)
```

train_test_split and accuracy_score in sklearn
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
```

Hyperparameters: parameters that must be decided before the algorithm runs.
Model parameters: parameters learned while the algorithm runs.
KNN has no model parameters; k is a typical hyperparameter.
Ways to find good hyperparameters:
- domain knowledge
- empirically good default values
- search by experiment

```python
from sklearn.neighbors import KNeighborsClassifier

best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score

print("best_k =", best_k)
print("best_score =", best_score)
```

Another hyperparameter in KNN: weights
Possible values of weights:

- "uniform": all k neighbors get equal weight
- "distance": each neighbor is weighted by the inverse of its distance
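A minimal sketch comparing the two weighting schemes (the n_neighbors=5 value is arbitrary, and the X_train/X_test split from above is assumed to exist):

```python
from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, y_train, X_test, y_test from the earlier split
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')

knn_uniform.fit(X_train, y_train)
knn_distance.fit(X_train, y_train)

print(knn_uniform.score(X_test, y_test))   # accuracy with equal votes
print(knn_distance.score(X_test, y_test))  # accuracy with inverse-distance votes
```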
Grid search: exhaustively try every combination of hyperparameters and keep the one that performs best.
Define the ranges over which each hyperparameter is searched:
```python
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)]
    }
]
```

Import GridSearchCV and fit:
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
# n_jobs: number of CPU cores to use (default 1; -1 means all cores)
# verbose: how much progress information to print
grid_search.fit(X_train, y_train)
```

Inspect the results:
```python
grid_search.best_estimator_  # the best model
grid_search.best_score_      # the best accuracy
grid_search.best_params_     # the best hyperparameters
```

When feature scales differ greatly, some features dominate the distance. For example:
|          | Tumor size (cm) | Time since discovery (days) |
|----------|-----------------|-----------------------------|
| Sample 1 | 1               | 200                         |
| Sample 2 | 5               | 100                         |

Here the distance between the samples is dominated by the discovery time.
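To see the domination numerically, compare the per-feature contributions to the squared distance for the two samples in the table (a quick check, using only the numbers above):

```python
import numpy as np

sample_1 = np.array([1.0, 200.0])  # tumor size (cm), days since discovery
sample_2 = np.array([5.0, 100.0])

diff = sample_1 - sample_2
print(diff ** 2)                   # [   16. 10000.] -> the time axis dominates
print(np.sqrt(np.sum(diff ** 2)))  # ~100.08, almost entirely due to the time axis
```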
The solution: data normalization.
Min-max scaling (normalization): maps all values into the range [0, 1]; suitable when the distribution has clear bounds.
$X_{scale} = \frac{X - X_{min}}{X_{max} - X_{min}}$
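A minimal NumPy sketch of the formula (the random data is illustrative, not from the original text):

```python
import numpy as np

X = np.random.randint(0, 100, size=(10, 2)).astype(float)
# Apply the min-max formula column by column
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # [0. 0.] [1. 1.]
```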
Mean-variance scaling (standardization): transforms all values to a distribution with mean 0 and variance 1.
Suitable when the distribution has no clear bounds, or when extreme values are present.
$X_{scale} = \frac{X - X_{mean}}{X_{std}}$
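The corresponding sketch for standardization (again with illustrative random data):

```python
import numpy as np

X = np.random.randint(0, 100, size=(10, 2)).astype(float)
# Apply the mean-variance formula column by column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
```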
Scaling the test set with scikit-learn
Taking standardization as the example: the key information retained during fit is the mean and the standard deviation of the training set.
```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
standardScaler.mean_   # the per-feature means
# array([5.83416667, 3.08666667, 3.70833333, 1.17      ])
standardScaler.scale_  # the per-feature standard deviations
# array([0.81019502, 0.44327067, 1.76401924, 0.75317107])

# Scale both the training set and the test set
X_train = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

# Check the effect of the scaling
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test_standard, y_test)  # be sure to pass X_test_standard, not X_test!
# 1.0
```

The accuracy is 1.0, so the scaling works well.
For min-max scaling, import sklearn.preprocessing.MinMaxScaler instead.
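A brief sketch of MinMaxScaler usage, mirroring the StandardScaler workflow above (assuming the same X_train/X_test):

```python
from sklearn.preprocessing import MinMaxScaler

minMaxScaler = MinMaxScaler()
minMaxScaler.fit(X_train)                       # learns the per-feature min and max
X_train_minmax = minMaxScaler.transform(X_train)
X_test_minmax = minMaxScaler.transform(X_test)  # reuse the training-set min/max on the test set
```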
Using k-nearest neighbors for regression
Import sklearn.neighbors.KNeighborsRegressor (see its documentation).
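A minimal regression sketch (the synthetic sine data and all parameter values are illustrative assumptions, not from the original text):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

# A toy 1-D regression problem: noisy samples of a sine curve
rng = np.random.default_rng(666)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

knn_reg = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)  # R^2; the prediction is the (weighted) mean of the k neighbors' y values
```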
Drawbacks of k-nearest neighbors
- Low efficiency: with m training samples and n features, predicting a single new sample costs O(m*n).
- Highly data-dependent: erroneous samples have a large impact.
- Predictions are not interpretable.
- Curse of dimensionality: as the dimension grows, the distance between two "seemingly close" points keeps increasing (see the sketch below).
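A quick numeric illustration of the last point, using random points in the unit hypercube (a sketch, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(666)
for d in [1, 10, 100, 1000, 10000]:
    a, b = rng.random(d), rng.random(d)
    print(d, np.sqrt(np.sum((a - b) ** 2)))
# The distance keeps growing with d (roughly like sqrt(d/6)),
# so "nearest" becomes less and less meaningful in high dimensions
```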