Intro
I was between two big projects at work when I was assigned a small project for a couple of weeks. We had an in-house GUI that could ingest video and write a model's detections to a database. I found some surveillance-style datasets online and came up with the idea of using image-language embeddings to identify key frames, or important events, in long-form video sequences.
Method
First, I used YOLOv8 (link below for my YOLOv8 code) for object detection and ByteTrack for tracking objects across video frames. For the key frames, I used Salesforce's BLIP, since it had recently been released and was open source. Every few frames, I took the BLIP image embedding and compared it to the previous embedding with cosine similarity; if the change was over a certain threshold (i.e., the similarity dropped low enough), I identified that frame as a key frame.
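To make the key-frame step concrete, here is a minimal sketch, assuming the Hugging Face transformers BLIP checkpoint "Salesforce/blip-image-captioning-base"; the sampling interval and similarity threshold are illustrative placeholders, not the values I actually used.

```python
import cv2
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; any BLIP checkpoint with a vision encoder works the same way.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

def blip_embedding(frame_bgr):
    """Pooled BLIP vision embedding for a single OpenCV (BGR) frame."""
    image = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.vision_model(pixel_values=inputs["pixel_values"])
    return out.pooler_output.squeeze(0)

def find_key_frames(video_path, every_n=24, sim_threshold=0.9):
    """Flag a sampled frame as a key frame when its embedding has drifted
    away from the previous key frame (cosine similarity below the threshold)."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_emb, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            emb = blip_embedding(frame)
            if prev_emb is None or F.cosine_similarity(emb, prev_emb, dim=0) < sim_threshold:
                key_frames.append(idx)
                prev_emb = emb
        idx += 1
    cap.release()
    return key_frames
```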
Link to YOLO page:
Video Captioning with BLIP
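Since the YOLO link above points to the full detection code, here is just a rough sketch of how YOLOv8 and ByteTrack pair up via the ultralytics package; the weights file and video path are placeholder assumptions.

```python
from ultralytics import YOLO

# Placeholder weights and video source; swap in your own model and file.
model = YOLO("yolov8n.pt")

# tracker="bytetrack.yaml" selects ByteTrack for ID assignment across frames.
results = model.track(source="surveillance.mp4", tracker="bytetrack.yaml", stream=True)

for frame_idx, result in enumerate(results):
    boxes = result.boxes
    if boxes.id is None:  # no tracked objects in this frame
        continue
    for box, track_id, cls in zip(boxes.xyxy, boxes.id, boxes.cls):
        # Each row: bounding box, persistent track ID, and class name,
        # i.e. the per-frame detections that get written to the database.
        print(frame_idx, int(track_id), model.names[int(cls)], box.tolist())
```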
Example 1
Dog - in this short video, I was able to identify one key frame (apart from the first and last frames): the moment the person first enters the frame.
The captions for those frames are below (a minimal captioning sketch follows the examples):
Frame 0: A dog laying on the sidewalk in a city
Frame 864: A person walking down a sidewalk with a dog
Frame 1032: A dog laying on the sidewalk in front of a store
Example 2
Surveillance - I don't have any key-frame examples here, but here are some example detections from the surveillance input.
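The per-frame captions in Example 1 come from BLIP's captioning head. Here is a minimal sketch of generating one, again assuming the "Salesforce/blip-image-captioning-base" checkpoint; the file name in the usage comment is only an illustration.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_frame(image_path):
    """Generate a short caption for one saved key frame."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# e.g. caption_frame("frame_864.png") -> "a person walking down a sidewalk with a dog"
```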