Intro

I was in between two big projects at work when I was assigned a small project for a couple of weeks. We had an in-house GUI that could ingest video and write detections from a model to a database. I found some surveillance-style datasets online and came up with the idea of using image-language embeddings to identify key frames, or important events, in long-form video sequences.

Method

First, I used YOLOv8 for object detection (link to my YOLOv8 code below) and ByteTrack for tracking objects across video frames. For the key frames, I used Salesforce’s BLIP, since it had recently been released and was open source. Every few frames, I took the BLIP image embedding and computed its cosine similarity against the previous embedding; if the similarity fell below a set threshold (meaning the scene had changed enough), I marked that frame as a key frame.
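Here is a minimal sketch of that key-frame loop. It is not my exact code: it assumes the Hugging Face transformers BLIP classes (BlipProcessor, BlipModel), OpenCV for frame reading, the "Salesforce/blip-image-captioning-base" checkpoint, and hand-picked values for the frame stride and similarity threshold.

```python
# Sketch only: BLIP image embeddings every few frames, key frame whenever the
# cosine similarity to the previous embedding drops below a threshold.
import cv2
import torch
import torch.nn.functional as F
from transformers import BlipProcessor, BlipModel

device = "cuda" if torch.cuda.is_available() else "cpu"
CKPT = "Salesforce/blip-image-captioning-base"   # assumed checkpoint
processor = BlipProcessor.from_pretrained(CKPT)
model = BlipModel.from_pretrained(CKPT).to(device).eval()

EMBED_EVERY_N = 24     # embed every couple dozen frames (assumed stride)
SIM_THRESHOLD = 0.90   # assumed value; below this the scene changed enough

@torch.no_grad()
def embed(frame_bgr):
    """Normalized BLIP image embedding for a single BGR video frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    inputs = processor(images=rgb, return_tensors="pt").to(device)
    feats = model.get_image_features(**inputs)   # (1, dim)
    return F.normalize(feats, dim=-1)

def key_frames(video_path):
    """Yield (frame_index, frame) wherever the embedding jumps."""
    cap = cv2.VideoCapture(video_path)
    prev_emb, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % EMBED_EVERY_N == 0:
            emb = embed(frame)
            if prev_emb is not None and F.cosine_similarity(emb, prev_emb).item() < SIM_THRESHOLD:
                yield idx, frame
            prev_emb = emb
        idx += 1
    cap.release()

for idx, _frame in key_frames("dog.mp4"):        # hypothetical input file
    print(f"key frame at index {idx}")
```

The detection and tracking side runs separately; Ultralytics YOLOv8 ships ByteTrack as a built-in tracker option, e.g. YOLO("yolov8n.pt").track(source, tracker="bytetrack.yaml").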

Link to YOLO page:

Video Captioning with BLIP

Example 1

Dog - in this short video, I was able to identify one key frame (apart from the first and last frames): the moment the person first enters the frame.

The captions are:

  • Frame 0: A dog laying on the sidewalk in a city

  • Frame 864: A person walking down a sidewalk with a dog

  • Frame 1032: A dog laying on the sidewalk in front of a store

Example 2

Surveillance - I don’t have any key-frame examples for this one, but here are some example detections from surveillance input.
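
For reference, captions like the ones listed under Example 1 can be generated with BLIP’s captioning head. A minimal sketch, assuming the Hugging Face transformers API and a hypothetical saved key-frame image (not necessarily how the captions above were produced):

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
CKPT = "Salesforce/blip-image-captioning-base"   # assumed checkpoint
processor = BlipProcessor.from_pretrained(CKPT)
captioner = BlipForConditionalGeneration.from_pretrained(CKPT).to(device).eval()

@torch.no_grad()
def caption(image: Image.Image) -> str:
    """Generate a short caption for one key frame."""
    inputs = processor(images=image, return_tensors="pt").to(device)
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# "frame_0864.jpg" is a hypothetical dump of key frame 864 from the dog video.
print(caption(Image.open("frame_0864.jpg")))
```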