Clustering Features from ViT-B/16

Background

I was looking at the DINOv2 demo here and was especially interested in the instance retrieval section. I graduated with a minor in Art History, so any time I see a parallel between AI and art I get excited.

Vision Transformer

To replicate their demo, I needed to download a model file. At the time of this experiment, Meta had not open-sourced the DINOv2 model weights, so I went on Hugging Face and found Meta’s ViT-B/16 weights for DINOv2’s predecessor here.

The Experiment

I found an image dataset on Kaggle consisting of different classes of fruit. I ran each image through the Vision Transformer and extracted its features as an array. Visualizing these features showed distinct clusters, with each fruit occupying its own place in the feature space. To mimic the Meta demo, I fed in a new query image and extracted its features. I then computed the cosine similarity between the query image and every image in the dataset and pulled the images with the highest similarity scores. Although I couldn’t use the newest model weights, I was still able to produce a similar result.
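
The retrieval step can be sketched roughly as follows. This is a minimal illustration, not the exact code from my experiment: it assumes the features have already been extracted, and it uses tiny 3-dimensional toy vectors in place of the real ViT features. The function names (cosine_similarity, top_k_similar) and the variable names are hypothetical, chosen only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, dataset, k=3):
    """Return (index, score) pairs for the k dataset vectors most similar to query."""
    scores = [(i, cosine_similarity(query, feats)) for i, feats in enumerate(dataset)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy stand-ins for the per-image feature arrays (assumed, not real model output).
dataset_features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_features = [1.0, 0.05, 0.0]

# Indices of the two dataset images closest to the query in feature space.
matches = top_k_similar(query_features, dataset_features, k=2)
for index, score in matches:
    print(f"image {index}: similarity {score:.4f}")
```

I kept this in plain Python so it runs without any dependencies. With real feature arrays and a larger dataset, a vectorized version in NumPy or PyTorch would be the more practical choice.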