A learning project for building a custom vision-language model from personal photos.
This is a small pipeline I put together to understand how fine-tuning a VLM works end to end. It takes a folder of photos, uses `llava:13b` to generate captions, and writes the results out in a format ready for fine-tuning.
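The two core steps can be sketched roughly as below. This is a minimal illustration, not the project's actual code: it assumes `llava:13b` is served by a local Ollama instance on the default port (11434) and uses a LLaVA-style conversation record as the output format; the function names, prompt, and sample paths are my own placeholders.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Ollama server


def caption_image(image_path: str,
                  prompt: str = "Describe this photo in one detailed sentence.") -> str:
    """Send one image to llava:13b via Ollama and return the generated caption."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = json.dumps({
        "model": "llava:13b",
        "prompt": prompt,
        "images": [image_b64],   # Ollama accepts base64-encoded images here
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()


def to_training_record(image_path: str, caption: str, idx: int) -> dict:
    """Wrap one (image, caption) pair in a LLaVA-style conversation record."""
    return {
        "id": f"photo-{idx}",
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe this photo."},
            {"from": "gpt", "value": caption},
        ],
    }


# Building a record needs no running server, so it can be tried standalone:
record = to_training_record("photos/001.jpg",
                            "A dog asleep on a porch at sunset.", 1)
print(json.dumps(record))
```

In the real pipeline, `caption_image` would be called for every file in the photo folder and the resulting records written out as one JSON (or JSONL) dataset for the fine-tuning step.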