Question
I have a video of a slideshow in which the presenter handwrites notes onto the slides.
I would like to create a program that detects whether a slide is being filled in (with handwritten notes, for example) or whether a new slide has appeared.
One method I considered is OCR, but it is not suitable here, since the only text that changes is either handwritten or mathematical notation.
What I have done so far: I go through the video and always compare the previous frame with the current frame. I extract the bounding-box coordinates of all elements that were added relative to the previous frame, and I store the highest y-coordinate. The highest y-coordinate belongs to the element furthest down the image (as seen from the top). In theory, this should give me an indication of whether the slide is filling up.
In practice, I cannot really make use of this data.
The video in question can be downloaded here: http://www.filedropper.com/00_6
Here is my code:
from skimage.measure import compare_ssim
import cv2
import numpy as np
# Packages for live plot visualisation
import pyqtgraph as pg
from pyqtgraph.Qt import QtGui, QtCore
from tqdm import tqdm
def get_y_corrd_of_lowest_added_element(prev_frame, frame):
    """
    Given two images, detect the bounding boxes of all elements that
    differ between them and return the y coordinate of the lowest
    added element (as seen from the top of the image).

    Parameters
    ----------
    prev_frame : numpy array
        Original image.
    frame : numpy array
        New image, based on the original image.

    Returns
    -------
    int
        Lowest y coordinate of the elements that were added.
    """
    # Compute SSIM between the two images
    (score, diff) = compare_ssim(prev_frame, frame, full=True)
    # The diff image contains the actual image differences between the two
    # images and is represented as a floating-point data type in the range
    # [0, 1], so we must convert the array to 8-bit unsigned integers in the
    # range [0, 255] before we can use it with OpenCV
    diff = (diff * 255).astype("uint8")
    # Threshold the difference image, then find contours to obtain the
    # regions of the two input images that differ
    thresh = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    contours = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = contours[0] if len(contours) == 2 else contours[1]
    # Collect the y coordinates of the bounding boxes of all elements that
    # were added to the frame compared with the previous frame
    y_list = [0]
    for c in contours:
        area = cv2.contourArea(c)
        if area > 40:
            x, y, w, h = cv2.boundingRect(c)
            # Append the y coordinate to the list
            y_list.append(y)
    y_list.sort()
    return y_list[-1]
def transform(frame):
    # Convert to greyscale
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Downscale to half size
    small = cv2.resize(frame, (0, 0), fx=0.5, fy=0.5)
    return small
vidcap = cv2.VideoCapture("ADD PATH TO VIDEO HERE")
success, prev_frame = vidcap.read()
prev_frame = transform(prev_frame)

# For real-time plotting
# Source: http://www.pyqtgraph.org/downloads/0.10.0/pyqtgraph-0.10.0-deb/pyqtgraph-0.10.0/examples/PlotSpeedTest.py
app = QtGui.QApplication([])
win = pg.GraphicsWindow()
win.resize(800, 800)
p = win.addPlot()
p.setTitle('Lowest Y')
plot = p.plot([])

# Store the lowest y coordinate of the added elements for each frame
y_lowest_list = []
while success:
    success, frame = vidcap.read()
    if not success:
        # End of video: stop before transform() receives None
        break
    # Convert to greyscale and downscale
    frame = transform(frame)
    # Show the frame
    cv2.imshow("frame", frame)
    cv2.waitKey(1)
    # Extract the lowest y coordinate of the newly added elements
    y = get_y_corrd_of_lowest_added_element(prev_frame, frame)
    y_lowest_list.append(y)
    # The current frame becomes the previous frame for the next iteration
    prev_frame = frame
    # Update the real-time plot
    plot.setData(y_lowest_list)
    app.processEvents()
# Close the real-time plot
win.close()
Does anyone have an idea?
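(For illustration, here is a hedged sketch of how the y-coordinate series collected above might be post-processed, under the assumption that writing makes the series grow slowly while a new slide makes it fall back toward the top of the image. detect_new_slides and drop_thresh are hypothetical names, not part of the code above.)

import numpy as np

def detect_new_slides(y_lowest_list, drop_thresh=100):
    """Return frame indices where the lowest-y series drops sharply.

    Under the assumption above, a sharp drop suggests a new slide.
    drop_thresh is in pixels and would need tuning per video.
    """
    y = np.asarray(y_lowest_list)
    drops = np.diff(y)
    return np.where(drops < -drop_thresh)[0] + 1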
Answer 1:
You can try this code; see the comments:
import cv2
import numpy as np

def get_bg_and_ink_level(frame):
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Nearly-white pixels belong to the slide background
    background = cv2.threshold(frame[:, :, 2], 245, 255, cv2.THRESH_BINARY)[1]
    background_level = cv2.mean(background)  # for future use, if you need to select frames without hands
    # HSV range of the blue ink
    ink_color_low = (117, 60, 150)
    ink_color_high = (130, 207, 225)
    only_ink = cv2.inRange(frame, ink_color_low, ink_color_high)
    ink_level = cv2.mean(only_ink)
    return background_level[0], ink_level[0]

vidcap = cv2.VideoCapture('0_0.mp4')
success, frame = vidcap.read()
bg = []
ink = []
i = 0
while success:
    lv = get_bg_and_ink_level(frame)
    bg.append(lv[0])
    ink.append(lv[1])
    success, frame = vidcap.read()

# Search for frames where the blue ink is removed from the picture,
# i.e. where the mean ink level drops sharply from one frame to the next.
d_ink = np.diff(ink)
d_ink[-1] = -2.0  # force the last frame to count as a slide change
idx = np.where(d_ink < -1.0)

# Save the frame just before each detected slide change
for i in idx[0]:
    vidcap.set(cv2.CAP_PROP_POS_FRAMES, i)
    flag, frame = vidcap.read()
    out_name = 'frame' + str(i) + '.jpg'
    cv2.imwrite(out_name, frame)
Result at frame 15708:
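(Side note: the HSV ink range above is specific to this video's blue pen. One hedged way to find such a range for a different pen is to sample pixel values interactively; 'sample.jpg' below is a placeholder for any frame grabbed from the video, not a file from the answer.)

import cv2

img = cv2.imread('sample.jpg')  # placeholder: any frame from the video
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

def show_hsv(event, x, y, flags, param):
    # Print the HSV value of the clicked pixel to help tune
    # ink_color_low / ink_color_high
    if event == cv2.EVENT_LBUTTONDOWN:
        print('HSV at ({}, {}):'.format(x, y), hsv[y, x])

cv2.namedWindow('frame')
cv2.setMouseCallback('frame', show_hsv)
cv2.imshow('frame', img)
cv2.waitKey(0)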
Answer 2:
As a first pass at the problem, I'd probably want to just count the number of pixels that are different between the two images. It has several desirable properties:
- It's an actual distance metric.
- It's dirt-cheap computationally.
- Slides with more handwriting are farther from the original than slides with little writing (e.g. if you progressively added more writing and wanted to order those).
- If there's even a moderate amount of content on the slides, you'll plausibly (though not necessarily) find that any two unrelated slides are farther from each other than two slides that are the same except for the handwriting (especially with thin writing like this).
It's not a perfect solution of course -- e.g., if you acquire the slides by taking photos then almost every slide will differ at every pixel. Take a moment to think about it with respect to your use case and data collection methods.
It's pretty common for images in Python to be represented as NumPy arrays. Supposing that's the case for you as well, the following example computes the metric in question (or could readily be modified to give you similarity rather than distance):
import numpy as np

def dist(a, b):
    # Supposes some sort of pixel representation like BGR or HSL with
    # shape (w, h, channels) or (h, w, channels)
    return np.sum(np.sum(a != b, axis=-1) != 0)
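(As a usage sketch building on dist above: applied to consecutive frames of the video, small nonzero distances suggest handwriting being added, while large spikes suggest a slide change. The video path 'slides.mp4' is a placeholder.)

import cv2
import numpy as np

vidcap = cv2.VideoCapture('slides.mp4')  # placeholder path
ok, prev = vidcap.read()
dists = []
while ok:
    ok, cur = vidcap.read()
    if not ok:
        break
    dists.append(dist(prev, cur))  # dist() as defined above
    prev = cur
# Frames just after the largest jumps are candidate new slides.
print(np.argsort(dists)[-5:] + 1)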
Source: https://stackoverflow.com/questions/63624333/quantify-how-much-a-slide-has-been-filled-with-handwriting