Suggestion: few-shot (one-shot) prompting

#1
by MoralHazard - opened

Thanks for this fantastic model! I really appreciate the variety of prompts it was trained on. I'd like to suggest another: one-shot prompting using a caption from a similar image.

Given the tiny size of the model, it often hallucinates and misses details, even when tags and character descriptions are provided in the prompt. I've found dramatic improvements in the cosine distance between the generated short captions and the 'pruned' ground truth from your anime multicaptions dataset when adding the ground-truth caption from an image with a similar tag list (excluding the image itself, ofc) to the prompt. Your 1.5m image dataset works great for quickly finding similar images using feature-hashed tag embeddings and ANN search.
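Roughly, the retrieval step looks like this (a simplified sketch using scikit-learn, with exact brute-force search standing in for a real ANN index; all names here are illustrative):

```python
# Simplified sketch of tag-based retrieval: hash each image's tag list into a
# fixed-size vector, then look up the nearest neighbour by cosine distance.
from sklearn.feature_extraction import FeatureHasher
from sklearn.neighbors import NearestNeighbors

hasher = FeatureHasher(n_features=4096, input_type="string")

# One tag list per image in the dataset (toy data; in practice this is the full 1.5m set).
dataset_tags = [
    ["1girl", "blonde_hair", "onsen", "towel"],
    ["6+girls", "onsen", "towel", "blush", "water"],
    # ...
]
X = hasher.transform(dataset_tags)

# Brute-force cosine search here; swap in an ANN library for the full dataset.
index = NearestNeighbors(metric="cosine").fit(X)

def most_similar_index(query_tags):
    """Index of the dataset image whose tag list is closest to query_tags,
    skipping the query image itself if it happens to be in the index."""
    q = hasher.transform([query_tags])
    _, idx = index.kneighbors(q, n_neighbors=2)
    return int(idx[0][1])
```

The caption of the returned image is what gets spliced into the prompt as the one-shot example.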

However, I'm shoehorning the example caption into the "info" component of your prompting schema, which was designed to hold information about the image to be captioned, not about a similar-but-distinct image. Sometimes details from the example bleed into the new caption, something I imagine would happen less often if the model were trained to receive examples that are similar to, but don't directly describe, the image to be captioned.

Thank you!

Sure, the training scheme (especially for the 2b model) needs a rework and more augmentation. The idea of using short descriptions from similar images (or something like it) while highlighting the differences in the answer sounds quite promising; it could improve the results and give more attention to the captioned picture.

But I'm not sure I understand your proposal correctly; could you please explain it in a bit more detail or with an example?

I confess I might be the confused one now. Are you suggesting some sort of contrastive learning, using the related captions as similar-but-incorrect examples? I'm not saying you shouldn't do that, but what I was suggesting was simply to add another option to the grounding_prompt component of your schema and fine-tune the model a bit on generating captions using the additional context of a similar-but-not-identical caption.

Here is an example from my pipeline. It follows your schema except that I've added an "example" component: the "pruned" caption from your dataset for an image with a similar set of tags. I also changed the 'grounding_info' prefix to ' Here is a caption for a similar image for reference: '

You need to write a medium-short and convenient caption for the picture. Here is a caption for a similar image for reference: A group of nude girls relax in an outdoor onsen. Arcueid Brunestud has long blonde hair in a braid, sitting with a serious expression. Artoria Pendragon sits beside her with light gray hair in a bun, appearing attentive. Illyasviel von Einzbern, with dark brown twintails, is shyly covering her breasts. Irisviel von Einzbern has long purple hair and is smiling while holding a towel on her lap. Medea stands behind, observing with long black hair. Tohsaka Rin, with short dark green hair, appears aloof at the edge. Matou Sakura is partially submerged, looking away shyly with long sky-blue hair. The background features dark grey rocks, wooden structures, and autumn foliage, creating a peaceful and intimate atmosphere with steam rising from the water. Here are grounding tags for better understanding: 00s 6+girls alcohol aqua_eyes artoria_pendragon_(all) artoria_pendragon_(fate) bazett_fraga_mcremitz blonde_hair blue_eyes blush breasts caren_hortensia censored cleavage closed_eyes convenient_censoring euryale_(fate)
everyone fate/hollow_ataraxia fate/stay_night fate_(series) flat_chest fujimura_taiga hair_down hotincool illyasviel_von_einzbern impossible_towel leysritt_(fate) long_hair long_legs looking_at_viewer luviagelita_edelfelt matou_sakura medea_(fate) medium_breasts medusa_(fate) medusa_(rider)_(fate) multiple_girls naked_towel onsen open_mouth purple_eyes purple_hair red_eyes saber_(fate) sake sella_(fate) siblings silver_hair sisters sitting square_pupils standing stheno_(fate) tohsaka_rin towel twins water yellow_eyes. Here is a list of characters that are present in the picture: artoria_pendragon_(all), artoria_pendragon_(fate), saber_(fate), tohsaka_rin, medusa_(fate), medusa_(rider)_(fate), matou_sakura, medea_(fate), euryale_(fate). Here are popular tags or traits for each character on the picture: artoria_pendragon_(all): [girl, blonde_hair, short_hair, green_eyes]
artoria_pendragon_(fate): [girl, blonde_hair, short_hair, green_eyes]
saber_(fate): [girl, blonde_hair, short_hair, green_eyes]
tohsaka_rin: [girl, black_hair, long_hair, blue_eyes]
medusa_(fate): [girl, purple_hair, long_hair, purple_eyes]
medusa_(rider)_(fate): [girl, purple_hair, long_hair, purple_eyes]
matou_sakura: [girl, purple_hair, long_hair, purple_eyes]
medea_(fate): [girl, blue_hair, long_hair, blue_eyes]
euryale_(fate): [girl, purple_hair, twintails, purple_eyes].


Generated caption for https://img3.gelbooru.com/images/0d/d5/0dd548ffbd641215e127bc9f678a771f.jpg: A group of girls are enjoying a relaxing time in an outdoor onsen. Artoria Pendragon, with her blonde hair and green eyes, is standing with a towel around her waist, looking slightly embarrassed. Tohsaka Rin, with her black hair and blue eyes, is partially submerged, looking away shyly. Medea, with long purple hair and purple eyes, is sitting with a towel around her, looking calm. Euryale, with long purple hair and purple eyes, is also in the onsen, looking serene. Medusa, with long purple hair and purple eyes, is sitting with a towel around her, looking slightly annoyed. The scene is peaceful, with steam rising from the water, surrounded by rocks and autumn foliage.

The caption generated with the same pipeline but without the example caption (just tags and character info) was totally hallucinated (and perhaps a bit too explicit so I won't post it).
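For clarity, the prompt assembly itself is trivial; schematically it looks something like this (a simplified sketch, with function and argument names that are mine rather than part of your schema; only the "similar image" prefix string is my actual change, the other prefixes mirror the prompt above):

```python
# Rough sketch of gluing the prompt components together.
def build_prompt(task_text, example_caption, tags, characters, char_traits):
    parts = [task_text]
    if example_caption:
        # The added "example" component: the pruned caption of a similar image.
        parts.append("Here is a caption for a similar image for reference: " + example_caption)
    parts.append("Here are grounding tags for better understanding: " + " ".join(tags))
    parts.append("Here is a list of characters that are present in the picture: " + ", ".join(characters))
    traits_text = "\n".join(f"{name}: [{', '.join(t)}]" for name, t in char_traits.items())
    parts.append("Here are popular tags or traits for each character on the picture: " + traits_text)
    return " ".join(parts)
```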

This already works very well. On a test dataset, the mean cosine distance between embeddings of the 'pruned' ground-truth captions and embeddings of captions generated with the example was about 40% lower than the corresponding distance for captions generated without the one-shot prompting. However, I don't think the model has been explicitly trained on this sort of prompting, and models this small are often inflexible, so I imagine the results could be even better if it were specifically trained to utilize this added context.
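The distance measurement itself is straightforward; something along these lines (the embedding model below is just an example choice, any sentence encoder would do):

```python
# Mean cosine distance between generated captions and the 'pruned' ground truth.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def mean_cosine_distance(generated, ground_truth):
    a = encoder.encode(generated, normalize_embeddings=True)
    b = encoder.encode(ground_truth, normalize_embeddings=True)
    # With unit-normalized embeddings, cosine similarity is the row-wise dot product.
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

# Compare mean_cosine_distance(captions_with_example, pruned)
# against mean_cosine_distance(captions_without_example, pruned).
```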

Given the huge quality gain and the very low overhead (feature hashing an image's tag list and querying against the dataset takes almost no time), this seems like a very attractive method for getting high-quality captions from a very lightweight model.

Sure, contrastive training is indeed a good option here. I've been thinking about something similar, but in the form of more advanced augmentation for the info and other grounding components, to make them work more stably and pull less attention away from the picture.
As for using very close captions at inference time: that's a pretty interesting approach. Thank you for explaining; I will look into it further and try to implement it in the next versions.
I appreciate your contribution.
