[computer|[computer vision] Bag of Visual Word (BOW) [computervision]BagofVisualWord(BOW)

Bag of Visual Word (BoW, BoF, 词袋) 简介 BoW 是传统的计算机视觉方法，用一些特征（一些向量）来表示一个图像。BoW的核心思想是利用一组较为通用的特征，将图像用这些特征来表示，不同图像对于同一个特征的响应也是不同的，最终一个图像可以转化成关于这一组特征的一个频率直方图（向量）。这里有个挺清晰的介绍。BoW 常常用在 content-based image retrieval (CBIR) 任务上。
例如下面这张图（来源 Brown Computer Vision 2021 ）形象的介绍了BoW的，首先有一堆图片，然后提取这些图片中的特征，然后提取具有代表性的通用特征，然后计算不同图像对于这些特征的响应，从而将图像转换成关于这组特征的一个特征向量。

文章图片
实践 【[computer|[computer vision] Bag of Visual Word (BOW)】本文不过多的介绍理论部分，主要使用opencv来进行一些实践操作。
数据集
本文使用的是一个比较老的数据集是 ZuBuD 数据集，是苏黎世联邦理工构建的数据集，开放下载。数据集是苏黎世城市内的一些建筑，训练集有1005张图像，包含201个建筑，测试集有115张图像，用来测试 image retrieval，有ground truth信息，即指定来哪些图像是对应的，如下随便找了两张图片。

文章图片

文章图片
以下是 ground truth 的部分信息，例如第一行代表测试集中编号为 1 的图像对应到训练集中，应该是编号 100。

TEST TRAIN 001 100 002 102 003 104 004 105 005 107 006 109 ... ...

总体思路

对每个图像提取sift特征
将训练集的所有特征放在一起进行聚类
对训练集中的图像计算直方图
对测试集中的图像计算直方图
从训练集中找和测试图像直方图最接近的图像作为结果
计算正确率

代码部分
有了上述思路后，代码的逻辑也比较清晰了，下面给出所有的代码，详细的解释在注释里。

#1.对每个图像提取sift特征 #2.将训练集合的所有特征放在一起进行聚类 #3.对每个图像计算直方图 #4.对测试图像计算直方图 #5.从训练集中寻找和测试图像直方图最近接近的图像作为结果 #6.计算正确率import cv2 import os import matplotlib.pyplot as plt import numpy as np import time from sklearn.cluster import MiniBatchKMeansDataPath = "../Dataset/ZuBuD" #数据集的根目录 TrainPath = os.path.join(DataPath, "png-ZuBuD") #训练集的根目录 TestPath = os.path.join(DataPath,"1000city","qimage") #测试集的根目录 trainList = os.listdir(TrainPath) #训练集图像的所有名字TrainSIFTPath = "../Dataset/ZuBuD/Train_SIFT" #训练集图像SIFT保存的路径（保存在文件中时有用） TestSIFTPath = "../Dataset/ZuBuD/Test_SIFT" #测试集图像SIFT保存的路径（保存在文件中时有用）TrainSIFT = []#训练集的SIFT特征，为了后面numpy方便拼接 TestSIFT = []#测试集的SIFT特征Train_SIFT_dict = {}#同上，只不过用名字来索引特征 Test_SIFT_dict = {}#批量生成SIFT特征 def genSIFT(dataDir,outdir, outlist,outdict): begin = time.time() sift = cv2.SIFT_create() imgList = os.listdir(dataDir) if not os.path.exists(outdir): os.mkdir(outdir) count = 0 for name in imgList: ext = os.path.splitext(name)[-1] if ext!=".png" and ext!=".JPG" and ext!=".jpg" : continue #读取图片、转成灰度、提取描述子 path = os.path.join(dataDir,name) imgdata = https://www.it610.com/article/cv2.imread(path) gray = cv2.cvtColor(imgdata,cv2.COLOR_BGR2GRAY) _, des = sift.detectAndCompute(gray, None) outlist.append(des) outdict[name] = des #np.save(os.path.join(outdir,name),des) print(len(imgList),count) count = count + 1 end = time.time()#聚类，也是生成通用特征、词袋，这里用的是MiniBatchKMeans，这个比KMeans快，精度没有差很多 def cluster(featureList, n): #将所有训练图片的SIFT特征放在一起进行聚类 begin = time.time() X = np.concatenate(featureList) kmeans = MiniBatchKMeans(n_clusters=n, random_state=0,verbose=1).fit(X) end = time.time() return kmeans#计算余弦距离，为了计算相似度 def get_cos_similar(v1, v2): num = float(np.dot(v1, v2)) denom = np.linalg.norm(v1) * np.linalg.norm(v2) return 0.5 + 0.5 * (num / denom) if denom != 0 else 0#读取groundtruth文件，生成数据对 def getGroundTruth(dataPath): gtpair = {} with open(os.path.join(dataPath,"zubud_groundtruth.txt")) as f: gt = f.readlines() for i, line in enumerate(gt): if i == 0: continue test, train = line[:-1].split("\t") gtpair[test] = train return gtpair#根据聚类的结果，也就是词袋生成频率向量，这里就将图像转成了一个向量表示 def getFeatureHistogram(dataDict,kmeans): outDict = {} for k in dataDict.keys(): feat = dataDict[k] his = np.bincount(kmeans.predict(feat)) if his.shape[0] < kmeans.n_clusters: diff = kmeans.n_clusters - his.shape[0] for i in range(diff): his = np.append(his,0) outDict[k] = his return outDict#这里时进行测试，这里使用了一种比较朴素的方法，也就是测试图像 #和训练集里的图像挨个比较，取余弦距离最大的那个作为结果。 def predict(testHisDict, trainHisDict, gtpair): predict = {}for testk in testHisDict.keys(): testhis = testHisDict[testk] score = 0.0 index = "" for traink in trainHisDict.keys(): trainhis = trainHisDict[traink] s = get_cos_similar(testhis,trainhis) if s > score: score = s index = traink predict[testk] = indexsuc = 0 for k in predict.keys(): tk = k[5:8] pk = predict[k][7:10] if gtpair[tk] == pk: suc = suc+1 return suc/len(predict)#将以上步骤串起来，调整聚类的类别，来观察精度 def pipeline(n_list): result = []#1.对训练集、测试集提取sift特征 t0 = time.time() genSIFT(TrainPath,TrainSIFTPath,TrainSIFT,Train_SIFT_dict) genSIFT(TestPath,TestSIFTPath,TestSIFT,Test_SIFT_dict) t1 = time.time() #2.读取ground truth gtpair = getGroundTruth(DataPath)#3.对训练集提取的sift进行聚类，生成 visual word for n in n_list: t3 = time.time() clu = cluster(TrainSIFT, n) t4 = time.time() #4.计算每个图像关于 visual word 的直方图 train_his = getFeatureHistogram(Train_SIFT_dict, clu) test_his = getFeatureHistogram(Test_SIFT_dict, clu) t5 = time.time() #5.利用余弦距离计算相似度 acc = predict(test_his,train_his, gtpair) t6 = time.time() info = {"sift":t1-t0,"clu":t4-t3,"calvw":t5-t4,"predict":t6-t5,"acc":acc} result.append(info) print(info) return resultresult = pipeline([50,100,300,600,1000,2000]) print(result)

测试结果
本文一共测试了6组聚类的类别，随着类别增多，准确的逐渐上升，但是太对类别准确度反而会下降，这是因为在实验中发现每张图像平均也就能提取1000～1500个特征点，2000个类别太多啦。下面是绘制的准确度折线图，因为1000 - 2000之间没有测试，因此可能准确率还会有所提升。600个类别的准确率为 75.65%， 1000个准确率为 78.26%。