[OpenCV in Practice] 5: Deep-Learning-Based Text Detection

In this article, we will try to find the words in an image, one by one, using a text detector based on a recent paper:
EAST: An Efficient and Accurate Scene Text Detector.
Note that text detection is different from text recognition. In text detection, we only detect the bounding boxes around the text. In text recognition, we actually determine what is written inside the box. For example, in the image below, text detection gives you the bounding box around the word, while text recognition tells you that the box contains the word STOP. This article covers text detection only.
This article uses a TensorFlow model, called through OpenCV's DNN module. We will walk through how the algorithm works step by step. You will need OpenCV 3.4.3 or above to run the code. Reading other OpenCV DNN models follows the same pattern.
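As a quick sanity check, you can print your OpenCV version before running anything; a minimal sketch:

Python:

import cv2 as cv
# The DNN-based text detector needs OpenCV 3.4.3 or above
print(cv.__version__)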
The steps involved are as follows:
1. Download the EAST model
2. Load the model into memory
3. Prepare the input image
4. Forward-pass the blob through the network
5. Process the output

# 1 Loading the network
We will use the cv::dnn::readNet (C++) or cv.dnn.readNet (Python) function to load the network into memory. It automatically detects the configuration and framework from the given file name. In our case it is a .pb file, so it assumes a TensorFlow network is being loaded. Note that, unlike some other model formats, there is no separate file describing the model structure.
C++

Net net = readNet(model);

Python
net = cv.dnn.readNet(model)
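If you would rather be explicit about the framework instead of relying on readNet's auto-detection, OpenCV also offers framework-specific loaders; for this frozen TensorFlow graph the following call is equivalent (a sketch, assuming the model file sits at the path used later in this article):

Python:

import cv2 as cv
# Explicit TensorFlow loader; no separate config file is needed for this model
net = cv.dnn.readNetFromTensorflow("./model/frozen_east_text_detection.pb")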

# 2 Reading the image

We need to create a 4-D input blob to feed the image into the network. This is done with the blobFromImage function.
C++
blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);

Python
blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

We need to specify a few parameters for this function:
1. The first argument is the image itself.
2. The second argument specifies the scaling of each pixel value. It is not needed here, so we keep it at 1.0.
3. The third argument is the network's default input size of 320×320, which we must specify when creating the blob. It is best to match the network's input size.
4. The fourth argument is the mean used when the model was trained, which must be subtracted from the input.
5. The fifth argument is whether we want to swap the R and B channels. This is required because OpenCV uses BGR while TensorFlow uses RGB (Caffe models use BGR).
6. The last argument is whether we want to crop the image to a center crop. Here we specify False.
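To verify the blob matches what the network expects, you can inspect its shape; a minimal sketch, assuming a test image at ./image/stop1.jpg (the path used in the Python code later in this article):

Python:

import cv2 as cv
frame = cv.imread("./image/stop1.jpg")
blob = cv.dnn.blobFromImage(frame, 1.0, (320, 320), (123.68, 116.78, 103.94), True, False)
# blobFromImage returns NCHW layout: (batch, channels, height, width) -> (1, 3, 320, 320)
print(blob.shape)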
# 3 Forward pass

Now that the input is ready, we pass it through the network. The network has two outputs: one gives the positions of the text boxes, the other gives the confidence score of each detected box. The two output layers are:
feature_fusion/concat_3
feature_fusion/Conv_7/Sigmoid
You can verify these two outputs by opening the .pb model directly in Netron and looking at the final layers. Netron is an excellent model-structure visualization tool that supports TensorFlow, Caffe, Keras, MXNet, and other frameworks.
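If you do not want to install Netron, you can also list a network's output layers programmatically; a minimal sketch (getUnconnectedOutLayersNames requires a reasonably recent OpenCV):

Python:

import cv2 as cv
net = cv.dnn.readNet("./model/frozen_east_text_detection.pb")
# Layers with no outgoing connections are the network's outputs
print(net.getUnconnectedOutLayersNames())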
The C++ code to read the outputs is as follows:
std::vector<String> outputLayers(2);
outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
outputLayers[1] = "feature_fusion/concat_3";

The Python code to read the outputs is as follows:
outputLayers = []
outputLayers.append("feature_fusion/Conv_7/Sigmoid")
outputLayers.append("feature_fusion/concat_3")

Next, we obtain the outputs by passing the input image through the network. As mentioned before, the output consists of two parts: confidence scores and geometry.
C++
std::vector<Mat> output;
net.setInput(blob);
net.forward(output, outputLayers);
Mat scores = output[0];
Mat geometry = output[1];

Python:
net.setInput(blob)
output = net.forward(outputLayers)
scores = output[0]
geometry = output[1]

# 4 Processing the output

As mentioned earlier, we use the outputs of the two layers to decode the positions and orientations of the text boxes. We may get many candidate boxes, so we need to filter out the best-looking ones from the batch. This is done with the non-maximum suppression (NMS) algorithm.
Non-maximum suppression is used widely in object detection; a sketch of the idea is given in the NMS subsection below.
1 Decoding
C++:
std::vector<RotatedRect> boxes;
std::vector<float> confidences;
decode(scores, geometry, confThreshold, boxes, confidences);

Python:
[boxes, confidences] = decode(scores, geometry, confThreshold)
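For intuition, here is what decode computes for a single grid cell, mirroring the full implementation at the end of this article: the five geometry channels hold the distances from the cell to the four box edges plus a rotation angle, and each cell maps back to the input image at a stride of 4. The numbers below are made-up placeholders, purely for illustration:

Python:

import math

x, y = 10, 7                                         # grid position (hypothetical)
d0, d1, d2, d3, angle = 12.0, 30.0, 10.0, 28.0, 0.1  # edge distances and angle (made up)
offsetX, offsetY = x * 4.0, y * 4.0                  # feature map is 4x smaller than the input
cosA, sinA = math.cos(angle), math.sin(angle)
h = d0 + d2                                          # box height = top + bottom distances
w = d1 + d3                                          # box width  = right + left distances
# Reference point on the box, then two opposite corners, then the center
ox = offsetX + cosA * d1 + sinA * d2
oy = offsetY - sinA * d1 + cosA * d2
p1 = (-sinA * h + ox, -cosA * h + oy)
p3 = (-cosA * w + ox, sinA * w + oy)
center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
box = (center, (w, h), -angle * 180.0 / math.pi)     # format expected by NMSBoxesRotated
print(box)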

2 Non-maximum suppression
We use the OpenCV function NMSBoxes (C++) or NMSBoxesRotated (Python) to filter out false positives and obtain the final predictions.
C++:
std::vector<int> indices;
NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

Python:
indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
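NMSBoxesRotated takes the boxes in exactly the ((cx, cy), (w, h), angle) format that decode returns, and gives back the indices of the boxes to keep, not the boxes themselves. The idea behind NMS is simple: keep the highest-scoring box, discard every remaining box that overlaps it too much, and repeat. A minimal greedy sketch for axis-aligned boxes (illustration only; the rotated-rectangle overlap that NMSBoxesRotated computes is more involved):

Python:

def iou(a, b):
    # a, b: (x1, y1, x2, y2) axis-aligned boxes
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh):
    # Process boxes in order of decreasing score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep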

3 Results and code

3.1 Results

The C++ code was run under VS2017. Note that OpenCV 3.4.5 or above is required here, otherwise reading the model will fail. The model file (frozen_east_text_detection.pb) is too large to attach; see the reference links for a download.
The results are shown below; the detections look quite good, and the speed is acceptable.
(Three result images: sample photos with the detected text boxes drawn in green.)
3.2 Code

The C++ code has been modified somewhat; the Python code has not. I am not very familiar with text detection, so the comments are sparse, but the code itself should not need many changes.
C++ code:
// text_detection.cpp : this file contains the "main" function; program execution begins and ends here.
#include "pch.h"  // VS2017 precompiled header
#include <iostream>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace cv;
using namespace cv::dnn;

// Decode the raw network outputs into rotated boxes and confidences
void decode(const Mat& scores, const Mat& geometry, float scoreThresh,
            std::vector<RotatedRect>& detections, std::vector<float>& confidences);

/**
 * @brief Run EAST text detection on an image
 *
 * @param srcImg        input image
 * @param inpWidth      network input width
 * @param inpHeight     network input height
 * @param confThreshold confidence threshold
 * @param nmsThreshold  non-maximum suppression threshold
 * @param net           loaded network
 * @return Mat          image with the detected boxes drawn
 */
Mat text_detect(Mat srcImg, int inpWidth, int inpHeight, float confThreshold, float nmsThreshold, Net net)
{
    // Output blobs
    std::vector<Mat> output;
    std::vector<String> outputLayers(2);
    outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
    outputLayers[1] = "feature_fusion/concat_3";

    // Input image
    Mat frame, blob;
    frame = srcImg.clone();

    // Build the network input
    blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);
    net.setInput(blob);

    // Forward pass
    net.forward(output, outputLayers);
    // Confidence scores
    Mat scores = output[0];
    // Geometry
    Mat geometry = output[1];

    // Decode predicted bounding boxes: positions and orientations
    std::vector<RotatedRect> boxes;
    std::vector<float> confidences;
    decode(scores, geometry, confThreshold, boxes, confidences);

    // Apply the non-maximum suppression procedure
    std::vector<int> indices;
    NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

    // Render detections; scale the boxes back to the original image size
    Point2f ratio((float)frame.cols / inpWidth, (float)frame.rows / inpHeight);
    for (size_t i = 0; i < indices.size(); ++i)
    {
        RotatedRect& box = boxes[indices[i]];
        Point2f vertices[4];
        box.points(vertices);
        // Rescale the corner points
        for (int j = 0; j < 4; ++j)
        {
            vertices[j].x *= ratio.x;
            vertices[j].y *= ratio.y;
        }
        // Draw the box
        for (int j = 0; j < 4; ++j)
            line(frame, vertices[j], vertices[(j + 1) % 4], Scalar(0, 255, 0), 2, LINE_AA);
    }

    // Put efficiency information (inference time)
    std::vector<double> layersTimes;
    double freq = getTickFrequency() / 1000;
    double t = net.getPerfProfile(layersTimes) / freq;
    std::string label = format("Inference time: %.2f ms", t);
    putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

    return frame;
}

// Model path
auto model = "./model/frozen_east_text_detection.pb";
// Test image
auto detect_image = "./image/patient.jpg";
// Network input size
auto inpWidth = 320;
auto inpHeight = 320;
// Confidence threshold
auto confThreshold = 0.5;
// Non-maximum suppression threshold
auto nmsThreshold = 0.4;

int main()
{
    // Load the model
    Net net = readNet(model);
    // Read the test image
    Mat srcImg = imread(detect_image);
    if (!srcImg.empty())
        cout << "read image success!" << endl;
    Mat resultImg = text_detect(srcImg, inpWidth, inpHeight, confThreshold, nmsThreshold, net);
    imshow("result", resultImg);
    waitKey();
    return 0;
}

/**
 * @brief Decode the network outputs into text boxes
 *
 * @param scores      confidence scores
 * @param geometry    box geometry
 * @param scoreThresh confidence threshold
 * @param detections  decoded rotated boxes
 * @param confidences box confidences
 */
void decode(const Mat& scores, const Mat& geometry, float scoreThresh,
            std::vector<RotatedRect>& detections, std::vector<float>& confidences)
{
    detections.clear();
    // Sanity-check the output shapes
    CV_Assert(scores.dims == 4); CV_Assert(geometry.dims == 4);
    CV_Assert(scores.size[0] == 1); CV_Assert(geometry.size[0] == 1);
    CV_Assert(scores.size[1] == 1); CV_Assert(geometry.size[1] == 5);
    CV_Assert(scores.size[2] == geometry.size[2]);
    CV_Assert(scores.size[3] == geometry.size[3]);

    const int height = scores.size[2];
    const int width = scores.size[3];
    for (int y = 0; y < height; ++y)
    {
        // Detection probabilities
        const float* scoresData = scores.ptr<float>(0, 0, y);
        // Box geometry: distances to the four edges
        const float* x0_data = geometry.ptr<float>(0, 0, y);
        const float* x1_data = geometry.ptr<float>(0, 1, y);
        const float* x2_data = geometry.ptr<float>(0, 2, y);
        const float* x3_data = geometry.ptr<float>(0, 3, y);
        // Box angle
        const float* anglesData = geometry.ptr<float>(0, 4, y);

        // Walk over all candidate boxes in this row
        for (int x = 0; x < width; ++x)
        {
            float score = scoresData[x];
            // Ignore boxes below the threshold
            if (score < scoreThresh)
                continue;

            // Decode a prediction.
            // Multiply by 4 because the feature maps are 4 times smaller than the input image.
            float offsetX = x * 4.0f, offsetY = y * 4.0f;
            // Angle and its sine/cosine
            float angle = anglesData[x];
            float cosA = std::cos(angle);
            float sinA = std::sin(angle);
            float h = x0_data[x] + x2_data[x];
            float w = x1_data[x] + x3_data[x];

            Point2f offset(offsetX + cosA * x1_data[x] + sinA * x2_data[x],
                           offsetY - sinA * x1_data[x] + cosA * x2_data[x]);
            Point2f p1 = Point2f(-sinA * h, -cosA * h) + offset;
            Point2f p3 = Point2f(-cosA * w, sinA * w) + offset;
            // Rotated rectangle: center point, size (width, height), angle in degrees
            RotatedRect r(0.5f * (p1 + p3), Size2f(w, h), -angle * 180.0f / (float)CV_PI);
            // Save the box and its confidence
            detections.push_back(r);
            confidences.push_back(score);
        }
    }
}

Python code:
# Import required modules
import cv2 as cv
import math
import argparse

parser = argparse.ArgumentParser(description='Use this script to run text detection deep learning networks using OpenCV.')
# Input argument
parser.add_argument('--input', help='Path to input image or video file. Skip this argument to capture frames from a camera.')
# Model argument
parser.add_argument('--model', default='./model/frozen_east_text_detection.pb',
                    help='Path to a binary .pb file of model contains trained weights.')
# Width argument
parser.add_argument('--width', type=int, default=320,
                    help='Preprocess input image by resizing to a specific width. It should be multiple by 32.')
# Height argument
parser.add_argument('--height', type=int, default=320,
                    help='Preprocess input image by resizing to a specific height. It should be multiple by 32.')
# Confidence threshold
parser.add_argument('--thr', type=float, default=0.5, help='Confidence threshold.')
# Non-maximum suppression threshold
parser.add_argument('--nms', type=float, default=0.4, help='Non-maximum suppression threshold.')
args = parser.parse_args()

############ Utility functions ############
def decode(scores, geometry, scoreThresh):
    detections = []
    confidences = []

    ############ CHECK DIMENSIONS AND SHAPES OF geometry AND scores ############
    assert len(scores.shape) == 4, "Incorrect dimensions of scores"
    assert len(geometry.shape) == 4, "Incorrect dimensions of geometry"
    assert scores.shape[0] == 1, "Invalid dimensions of scores"
    assert geometry.shape[0] == 1, "Invalid dimensions of geometry"
    assert scores.shape[1] == 1, "Invalid dimensions of scores"
    assert geometry.shape[1] == 5, "Invalid dimensions of geometry"
    assert scores.shape[2] == geometry.shape[2], "Invalid dimensions of scores and geometry"
    assert scores.shape[3] == geometry.shape[3], "Invalid dimensions of scores and geometry"
    height = scores.shape[2]
    width = scores.shape[3]
    for y in range(0, height):
        # Extract data from scores
        scoresData = scores[0][0][y]
        x0_data = geometry[0][0][y]
        x1_data = geometry[0][1][y]
        x2_data = geometry[0][2][y]
        x3_data = geometry[0][3][y]
        anglesData = geometry[0][4][y]
        for x in range(0, width):
            score = scoresData[x]
            # If score is lower than threshold score, move to next x
            if score < scoreThresh:
                continue
            # Calculate offset
            offsetX = x * 4.0
            offsetY = y * 4.0
            angle = anglesData[x]
            # Calculate cos and sin of angle
            cosA = math.cos(angle)
            sinA = math.sin(angle)
            h = x0_data[x] + x2_data[x]
            w = x1_data[x] + x3_data[x]
            # Calculate offset
            offset = ([offsetX + cosA * x1_data[x] + sinA * x2_data[x],
                       offsetY - sinA * x1_data[x] + cosA * x2_data[x]])
            # Find points for rectangle
            p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
            p3 = (-cosA * w + offset[0], sinA * w + offset[1])
            center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
            detections.append((center, (w, h), -1 * angle * 180.0 / math.pi))
            confidences.append(float(score))
    # Return detections and confidences
    return [detections, confidences]

if __name__ == "__main__":
    # Read and store arguments
    confThreshold = args.thr
    nmsThreshold = args.nms
    inpWidth = args.width
    inpHeight = args.height
    model = args.model

    # Load network
    net = cv.dnn.readNet(model)

    # Create a new named window
    kWinName = "EAST: An Efficient and Accurate Scene Text Detector"
    outputLayers = []
    outputLayers.append("feature_fusion/Conv_7/Sigmoid")
    outputLayers.append("feature_fusion/concat_3")

    # Read frame
    frame = cv.imread("./image/stop1.jpg")

    # Get frame height and width
    height_ = frame.shape[0]
    width_ = frame.shape[1]
    rW = width_ / float(inpWidth)
    rH = height_ / float(inpHeight)

    # Create a 4D blob from frame.
    blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

    # Run the model
    net.setInput(blob)
    output = net.forward(outputLayers)
    t, _ = net.getPerfProfile()
    label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())

    # Get scores and geometry
    scores = output[0]
    geometry = output[1]
    [boxes, confidences] = decode(scores, geometry, confThreshold)

    # Apply NMS
    indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
    for i in indices:
        # Get 4 corners of the rotated rect
        vertices = cv.boxPoints(boxes[i[0]])
        # Scale the bounding box coordinates based on the respective ratios
        for j in range(4):
            vertices[j][0] *= rW
            vertices[j][1] *= rH
        for j in range(4):
            p1 = (vertices[j][0], vertices[j][1])
            p2 = (vertices[(j + 1) % 4][0], vertices[(j + 1) % 4][1])
            cv.line(frame, p1, p2, (0, 255, 0), 2, cv.LINE_AA)
        # cv.putText(frame, "{:.3f}".format(confidences[i[0]]), (vertices[0][0], vertices[0][1]), cv.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1, cv.LINE_AA)

    # Put efficiency information
    cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))

    # Display the frame
    cv.imshow("result", frame)
    cv.waitKey(0)
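Assuming the script is saved as text_detection.py (the file name is my choice; use any), it can be run with the thresholds exposed by the argparse flags above, for example:

python text_detection.py --thr 0.5 --nms 0.4

Note that, as written, the script reads ./image/stop1.jpg directly rather than using the --input argument.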

