by starkdg at Jun 01, '19

ClipSeekr is a real-time video clip recognition system designed to detect known video sequences as they occur in a video stream.

How It Works

ClipSeekr works by indexing fingerprints of video clips. A 64-bit fingerprint is created for each frame of the clip from the spatial frequency information extracted from its discrete cosine transform. These 64-bit integers are then stored in a reverse index. This reverse index is simply a Redis database of key-value pairs, where the key is a frame’s fingerprint and the value consists of a clip ID and some sequence information.

Unknown streams can then be monitored to recognize the appearance of these indexed clips. The basic principle is simple: when the number of consecutive frames recognized for a particular ID reaches a specified threshold, the clip is identified together with its timestamp in the stream. This threshold is adjustable, but a good value for a 29.97 fps stream seems to be between 5 and 10 consecutive frames.
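The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the repository: the names `dct_matrix`, `frame_fingerprint` and `ClipMatcher` are made up for the example, the fingerprint thresholds the 64 lowest-frequency DCT coefficients against their median (the exact bit layout ClipSeekr uses is not described here), frames are assumed to be already reduced to 32x32 grayscale, and a plain dict stands in for the Redis database.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)
    return m

def frame_fingerprint(gray, size=32):
    """64-bit fingerprint of a size x size grayscale frame (assumed layout)."""
    d = dct_matrix(size)
    coeffs = d @ gray.astype(np.float64) @ d.T   # 2-D DCT of the frame
    low = coeffs[:8, :8].flatten()               # 64 lowest spatial frequencies
    fp = 0
    for bit in low > np.median(low):             # one bit per coefficient
        fp = (fp << 1) | int(bit)
    return fp

class ClipMatcher:
    """Counts consecutive index hits per clip ID; a dict stands in for Redis."""

    def __init__(self, threshold=5):
        self.index = {}            # fingerprint -> (clip_id, frame_number)
        self.threshold = threshold
        self.run_id = None
        self.run_len = 0

    def add_clip(self, clip_id, fingerprints):
        for n, fp in enumerate(fingerprints):
            self.index.setdefault(fp, (clip_id, n))

    def feed(self, fp):
        """Return the clip ID once, when its run reaches the threshold."""
        hit = self.index.get(fp)
        if hit is None:
            self.run_id, self.run_len = None, 0
            return None
        if hit[0] == self.run_id:
            self.run_len += 1
        else:
            self.run_id, self.run_len = hit[0], 1
        return hit[0] if self.run_len == self.threshold else None

# Synthetic demo: random arrays stand in for decoded video frames.
rng = np.random.default_rng(0)
clip = [rng.random((32, 32)) for _ in range(12)]
noise = [rng.random((32, 32)) for _ in range(20)]

matcher = ClipMatcher(threshold=5)
matcher.add_clip("ad-1", [frame_fingerprint(f) for f in clip])

detections = []
for t, frame in enumerate(noise[:10] + clip + noise[10:]):
    clip_id = matcher.feed(frame_fingerprint(frame))
    if clip_id:
        detections.append((t, clip_id))
```

In this sketch the frame position `t` plays the role of the stream timestamp, and a clip is reported only once per uninterrupted run of hits. Lookups are exact-match on the 64-bit key, which is what a key-value store offers out of the box.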


The code can be found in the GitHub repository here:


Test Results

To evaluate this method, we streamed four hours of television
and copied the commercial spots into new files for indexing.
Altogether, there were 142 of these ad spots, 135 of which
were unique video sequences. In brief, only one spot failed
to be detected outright (a “false negative”), while five
were detected falsely (“false positives”). The rest were
successfully detected within seconds of their occurrence in the
stream. This makes for a false positive rate of roughly 3.3% and
a false negative rate of roughly 0.7% (one miss in 142 spots).
The following table logs the results more precisely. The first
two columns mark the clips and the timestamps where they
actually occur in the stream. The next two columns indicate the
clips that were recognized, along with their timestamps.

A black font represents correct detections; a red font
represents false positives; and blue is for false negatives.


The only spot that failed to be detected was a McDonald’s
commercial called “Uber Eats”. The only noteworthy thing about
it is that its frames seemed exceptionally dark and low in
contrast, so the fingerprints may not have had enough
definition. Another noteworthy issue is the second detection of
the spot called “Jack Daniels”. The first detection was a
correct match. The second was actually a different clip, but it
shared enough frames with the first that it was recognized as
the first. This is an inherent weakness in the fingerprinting
system, since there is not enough temporal information preserved
to differentiate the two in real time.

A few notes for further study:

  • While the fingerprinting method is fairly robust to many
    distortions, it is not robust to changes in the screen format.
    For instance, many broadcast streams manipulate the screen
    format to include varying amounts of black space in the margins.
    Also, the presence of various logos and other textual occlusions
    further obscures the spatial information of the frames.
    Alternative fingerprinting methods can be explored for this:
    scale-invariant feature points, or feature points combined with
    region-based descriptors.
  • The limited temporal information restricts the ability of the
    system to differentiate between clips that share a significant
    portion of frames in common. For example, two commercial
    spots are often composed of the same sequences, only edited
    differently. Unfortunately, the real-time nature of the problem
    prohibits a second pass of the data. Recognition decisions are
    constrained to only looking at past frames.
  • Given the success of convolutional neural nets for image
    recognition tasks, it would be interesting to add a
    recurrent component to better model a sequence of frames.
    Previous work in extracting image fingerprints from convolutional
    network models shows promise in differentiating images: