VibeBuilders.ai Logo
VibeBuilders.ai
[P] I built an open SotA image tagging model to do what CLIP won't

[P] I built an open SotA image tagging model to do what CLIP won't

fpgaminer
April 15, 2025
reddit

I'm a hobbyist ML researcher and finally, after a year of work, built a state of the art machine vision model from scratch. It's ViT-B/16 based, 448x448x3 input, 91M parameters, trained for 660M samples, with multi-label classification as the target task, on over 5000 unique tags.

All the big foundation vision models today were trained on heavily filtered datasets, greatly limiting the concepts they can represent, in line with arbitrary sets of rules for what is deemed "wholesome" by leading tech companies. Everything from innocuous to spicy is on the chopping block of those filters. And because CLIP pervades the industry, from StableDiffusion to LLaVA, so does OpenAI's sensibilities.

My goal was to build a vision model for tagging images, mainly for labelling images for SD finetunes, but which wasn't as heavily filtered and handicapped as CLIP/BLIP/LLaVA. Something more inclusive, diverse, and sex positive.

Starting from the wonderful work of SmilingWolf (https://github.com/SmilingWolf/SW-CV-ModelZoo) and the Danbooru2021 dataset, I iterated for a year on the model, training, and manually labeling a thousand images to help the model generalize beyond the danbooru domain.

I'm releasing the first version of this model, dubbed JoyTag, today: https://github.com/fpgaminer/joytag

It achieves a mean F1 score of 0.578 across all of its over 5000 tags and across both the anime/manga styled images of the original danbooru dataset, but also photographs and other mediums thanks to the auxiliary training data I provided to it.

It was quite the struggle getting to this point, and I probably spent more time and money than any sane person should have. I learned a lot about dealing with datasets as large as danbooru2021, training models at scale, and how to keep yourself awake all night so your 8xA100 rental doesn't crash and blow all your money.

In my manual testing outside of even the validation set, the model has generalized well to unseen images, so I'm quite happy with the results thus far. There's plenty more work to do expanding its dataset to improve that F1 score further, and roundout its weak points. With inclusivity and diversity being a major goal of this project, I'm disappointed by some of its remaining limitations (as documented in the GitHub README). But I'm already busy manually tagging more images using my model-augmented workflow.

I'm happy to answer questions about the project, the training procedure, anything. All the training parameters are documented on GitHub, but there are so many little details that were hard won over the year. Like that damned loss multiplier. Ugh.

Github: https://github.com/fpgaminer/joytag Model download: https://huggingface.co/fancyfeast/joytag/tree/main Demo: https://huggingface.co/spaces/fancyfeast/joytag

Vibe Score

LLM Vibe Score

0

Sentiment

Human Vibe Score

1

Rate this Resource

Join the VibeBuilders.ai Newsletter

The newsletter helps digital entrepreneurs how to harness AI to build your own assets for your funnel & ecosystem without bloating your subscription costs.

Start the free 5-day AI Captain's Command Line Bootcamp when you sign up:

By subscribing, you agree to our Privacy Policy and Terms of Service.