kNN-识别手写数字
阅读原文时间:2023年07月09日阅读:4

最后,我们要进行手写数字分类任务,但是现在我们是用kNN算法,可能会比较慢

首先,完整地看完2.3.1和2.3.2的内容,然后找到trainingDigits和testDigits文件夹,大致浏览下

那么思路应该是:

  • 从文件夹中获取文件名,,并且文件名中包含了标记,再分别打开每个文件

  • 对打开的每个文件,对其向量化

  • 然后从上述文件获得的每个向量,数据集,标记集和选定的k,用分类器进行输出

    import numpy as np

    def txt2vec(filename):
    # 3232的规模,用11024的向量接收
    vecContent = np.zeros((1, 1024))
    with open(filename, 'r') as fobj:
    for i in range(32):
    line = fobj.readline()
    for j in range(32):
    vecContent[0, 32 * i + j] = int(line[j])
    return vecContent

    打印输出看一下结果

    filename = './trainingDigits/0_0.txt'
    a = txt2vec(filename)
    print(a[0, 0:64])

    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.

    1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1.
    2. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

没有问题,这样我们的txt转换成vector函数就做好了

接下来,有一个难点,要把trainingDigits和testDigits文件夹的文件名分别获得,并得到标记

需要使用listdir函数,需要从os导包

import numpy as np
from os import listdir
trainingFilePath = './trainingDigits'
testFilePath = './testDigits'
# 获得trainingDigits的各文件
trainingFileList = listdir(trainingFilePath)
# 获得标记集
labelSet = []
dataSetNum = len(trainingFileList)
print(dataSetNum)

1934

之后便把之前写的代码综合起来

import numpy as np
import kNN

def txt2vec(filename):
    # 32*32的规模,用1*1024的向量接收
    vecContent = np.zeros((1, 1024))
    with open(filename, 'r') as fobj:
        for i in range(32):
            line = fobj.readline()
            for j in range(32):
                vecContent[0, 32 * i + j] = int(line[j])
        return vecContent

# 打印输出看一下结果
# filename = './trainingDigits/0_0.txt'
# a = txt2vec(filename)
# print(a[0, 0:64])

trainingFilePath = './trainingDigits'
testFilePath = './testDigits'

from os import listdir

def hwPredict():
    # 获得trainingDigits的各文件
    trainingFileList = listdir(trainingFilePath)
    # 获得标记集
    labelSet = []
    dataSetNum = len(trainingFileList)
    # 获得数据集
    dataSet = np.zeros((dataSetNum, 1024))
    # print(dataSetNum)
    for i in range(dataSetNum):
        # 获得每一个txt文件
        eachTrainingFile = trainingFileList[i]
        # 因为文件时0_0.txt类型,所以先按.分割,再按_分割
        eachTrainingFile = eachTrainingFile.split('.')[0]
        eachTrainingFileLabel = int(eachTrainingFile.split('_')[0])
        labelSet.append(eachTrainingFileLabel)
        # 通过txt2vec获得数据集
        trainingFilename = 'trainingDigits/' + eachTrainingFile + '.txt'
        dataSet[i, :] = txt2vec(trainingFilename)
        # print(len(dataSet))
        # print(dataSet.shape)
        # print(type(dataSet))
        # print(labelSet)
    # 现在我们的数据集和label都做好了
    # 开始用测试集的数据来进行判断

    testFileList = listdir(testFilePath)
    # print(testFileList)
    errorCount = 0.0
    testSetNum = len(testFileList)
    # print(testSetNum)
    for i in range(testSetNum):
        # 老样子,先进行每个向量的划分
        eachTestFile = testFileList[i]
        # print(eachTestFile)
        eachTestFile = eachTestFile.split('.')[0]
        # print(eachTestFile)
        eachTestFileLabel = int(eachTestFile.split('_')[0])
        # 转换成向量
        testFilename = 'trainingDigits/' + eachTestFile + '.txt'
        testVector = txt2vec(testFilename)
        # print(testVector)
        testClassifierResult = kNN.classifier(testVector,dataSet,labelSet,3)
        print("the classifier came back with:%d,the real answer is:%d"%(testClassifierResult,eachTestFileLabel))
        if testClassifierResult != eachTestFileLabel:
            errorCount += 1.0
    print("\nthe total number of errors is:",errorCount)
    print("\nthe total error rate is:",errorCount/testSetNum)

hwPredict()

结果如下:

the classifier came back with:0,the real answer is:0
the classifier came back with:0,the real answer is:0
the classifier came back with:0,the real answer is:0
...
the classifier came back with:9,the real answer is:9
the classifier came back with:9,the real answer is:9

the total number of errors is: 13.0

the total error rate is: 0.013742071881606765

kNN算法至此告一段落,代码均上传至https://github.com/lpzju/-

kNN算法在分类算法中最简单最有效,但是复杂度也比较大,且使用大量存储空间。另一个缺点是无法给出任何数据的基础结构信息