Florence-2-base-Castollux-v0.5

A microsoft/Florence-2-base fine-tune intended to improve the quality and formatting of image captioning.

Uses the <CAPTION> task prompt. Training on <DETAILED_CAPTION> or <MORE_DETAILED_CAPTION> did not seem to make a difference in quality compared to <CAPTION>.
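A minimal inference sketch with the <CAPTION> prompt, following the usual Florence-2 loading pattern in transformers. The repo ships custom code, so trust_remote_code=True is needed. The helper name caption_image and the generation settings (max_new_tokens, num_beams) are illustrative assumptions, not settings taken from this card.

```python
TASK_PROMPT = "<CAPTION>"
MODEL_ID = "PJMixers-Images/Florence-2-base-Castollux-v0.5"


def caption_image(image_path: str, max_new_tokens: int = 512) -> str:
    """Caption one image with the <CAPTION> task prompt (hypothetical helper)."""
    # Heavy imports live inside the function so the module itself loads quickly.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # The repo contains custom modeling code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=TASK_PROMPT, images=image, return_tensors="pt").to(device, dtype)

    # Generation settings below are assumptions, not values from the model card.
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        num_beams=3,
        do_sample=False,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        text, task=TASK_PROMPT, image_size=(image.width, image.height)
    )
    return parsed[TASK_PROMPT]
```

Calling caption_image("photo.jpg") should return a single caption string; the file name here is only a placeholder.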

In v0.1 and v0.2, I only filtered out captions with more than 1000 Florence-2 tokens. From v0.3 onward, any caption above 512 T5 tokens is filtered out.
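The length filter can be sketched roughly as follows. This is not the actual filtering code used for the dataset: the function names are hypothetical, the T5 checkpoint (google-t5/t5-base) is an assumption since the card does not say which T5 tokenizer was used, and it is also unknown whether the end-of-sequence token was counted.

```python
from typing import Callable, Iterable, List

MAX_T5_TOKENS = 512  # limit stated for v0.3 and up


def filter_by_token_count(
    captions: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_T5_TOKENS,
) -> List[str]:
    """Keep only captions whose token count does not exceed max_tokens."""
    return [c for c in captions if count_tokens(c) <= max_tokens]


def make_t5_counter(tokenizer_name: str = "google-t5/t5-base") -> Callable[[str], int]:
    """Build a token counter from a Hugging Face tokenizer.

    The checkpoint name is an assumption. Note that T5 tokenizers append an
    end-of-sequence token, which is included in this count.
    """
    from transformers import AutoTokenizer  # imported lazily; downloads on first use

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    return lambda text: len(tokenizer(text).input_ids)
```

Passing the counter in as a callable keeps the filter independent of any particular tokenizer, so the same function would also cover the older 1000 Florence-2 token limit by swapping the counter and max_tokens.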

Model Evaluation

Selected images are not cherry-picked; they are from the first batch of images in the randomly chosen eval split.

For each source image, the ground truth caption is followed by the original model prediction and the fine-tuned model prediction.

eval_img_1
Ground Truth: A woman with dark hair and hoop earrings is positioned in the lower right quadrant of the image, looking upwards and to the right side of the frame. She is wearing a brown fur coat that covers her shoulders and chest, and her bare shoulders are visible. Her makeup includes defined eyebrows, eyeliner, and lipstick. The background is divided into dark and light areas, with a dark structure resembling a bridge or overpass dominating the upper portion of the image, casting a shadow over the woman. Above the dark structure, a lighter area suggests the sky, which is partially obscured by the dark structure. The background behind the woman is blurred and out of focus, showing hints of greenery and possibly buildings in the distance. The lighting in the image is warm and golden, suggesting it might be sunset or golden hour, with the light source appearing to be from the right side, illuminating the woman's face and the fur coat. The image has a grainy texture.
Original Model Prediction: A woman in a fur coat standing on a bridge.
Fine-tuned Model Prediction: A woman with dark hair is positioned in the center of the frame, looking upwards and to the right, wearing a large, brown fur coat that is draped over her shoulders and arms. She has dark eyeliner and eyeshadow, and is wearing a gold hoop earring in her left ear. Her lips are slightly parted, and she has a ring on her right hand. The background is dark and out of focus, suggesting an overpass or bridge structure, with a cloudy sky visible in the upper portion of the image. The lighting in the image appears to be warm and golden, casting shadows on the woman and the overpass.

eval_img_2
Ground Truth: A very close-up view of a cow's pink nose with black spots and white whiskers, being held from below by a fair-skinned hand. The cow's mouth is slightly open showing teeth and saliva. The cow's face is white with black patches and a yellow ear tag on its right ear with the number "9128" clearly visible on it. In the background, green grass is visible, along with parts of other black and white cows, trees, and a bright sky with some white clouds.
Original Model Prediction: A close up of a person's hand holding a cow's nose.
Fine-tuned Model Prediction: A close-up shot shows a white and black spotted cow with a pink nose and mouth, positioned in the center of the frame, facing towards the right side of the image. The cow has black markings on its face and ears, and a yellow tag with the number "9126" printed on it is visible on the right ear of the cow. In the background, there are other black and white cows, some of which are partially visible behind the cow on the left, and green grass is visible in the foreground. The sky is overcast and grey. A hand with visible skin is in the lower foreground, holding the cow's mouth.

eval_img_3
Ground Truth: A fox with reddish-orange fur on its head, back, and legs, and white fur on its chest and belly stands facing forward in the center of the image, looking directly at the viewer. The fox has pointed ears with dark tips and dark eyes. Its nose is black, and its fur appears soft and dense. The background is filled with blurred green foliage, including tall grass and various leaves, suggesting a natural outdoor setting. A thin, dark, and slightly angled branch is positioned vertically on the left side of the image, partially obscuring the fox's left leg. The fox appears to be standing on a patch of ground covered with fallen leaves and possibly some white material. At the bottom left corner of the image, the text "@Pataluth" is visible in white letters.
Original Model Prediction: A red fox standing on top of a log in the grass.
Fine-tuned Model Prediction: A fox with reddish-brown fur, a white chest, and black legs stands facing forward and looking directly at the viewer with brown eyes. The fox has pointed ears and a bushy tail. The background is filled with tall green grass and foliage, creating a natural outdoor setting. In the bottom left corner of the image, there is text that reads "@Patauth".

eval_img_4
Ground Truth: A squirrel with grey and brown fur and a large bushy tail is sitting upright on a ground covered with green grass and brown leaves, holding a piece of a green fruit in its front paws and eating it, with its head tilted slightly to the left. The fruit is partially eaten and shows a white interior with small green seeds. Another whole green fruit is to the left of the squirrel, and another is partially visible in the bottom right corner. The background is blurred green grass and some out-of-focus trees or bushes.
Original Model Prediction: A squirrel eating an acorn in the grass.
Fine-tuned Model Prediction: A squirrel with grey and brown fur and a bushy tail is standing on green grass and eating a nut. The squirrel is facing towards the left side of the image and is holding the nut with its front paws. Behind the squirrel and to the left of the squirrel, there is a large, round, light green mushroom with a textured surface and a yellowish-green interior. The ground is covered with green grass, brown leaves, and fallen leaves. The background is blurred and green, suggesting a natural outdoor setting.

eval_img_5
Ground Truth: A view looking up shows a wall with the legible text, "Education is the most powerful weapon which you can use to change the world." followed by "Nelson Mandela" in a smaller font, positioned on the right side of the wall. Below the text is a small potted tree with dark green leaves. To the left of the wall are metal railings for stairs leading upwards. The stairs have wooden steps. In the foreground to the right is a dark structure with rectangular cutouts at the top. To the left of the wall are several round wooden tables with metal legs and brown chairs with metal frames. Further to the left are silver metal carts with wheels. The background shows windows and parts of other floors. There are circular lights on the ceiling in the upper left. A silver cylindrical object is mounted on the wall to the left of the text. The lighting in the image is somewhat dim.
Original Model Prediction: A large room with a quote on the wall.
Fine-tuned Model Prediction: A brightly lit indoor space features a staircase with wooden steps and a black metal railing on the right side, leading up to a wall with a white text that reads "EDUCATION is the most powerful weapon which you can use to change the world." and below it is a green leafy plant. To the left of the staircase is a metal trolley with wheels, and further to the left is a cylindrical grey pillar with a green "F" sign on it, and to the right of the pillar is a window with multiple panes. Above the window is a ceiling with a circular light fixture. The floor is dark, and the walls are a light beige color.

eval_img_6
Ground Truth: A woman with short black hair and sunglasses perched on her head is standing on grass, wearing a white sheer long-sleeved top with ruffled cuffs and a purple and black plaid spaghetti strap dress, holding a small purple handbag with floral details and a grey fur pompom, and wearing multiple rings and necklaces, with her eyes closed and her right hand touching her forehead, in a grassy area with trees casting shadows and fallen leaves on the grass, a red car parked on a road behind her, and a house visible in the background through the trees, under bright sunlight.
Original Model Prediction: A woman in a purple plaid dress standing in the grass.
Fine-tuned Model Prediction: A woman with short black hair and sunglasses on her head is standing barefoot in a grassy area with fallen leaves. She is wearing a white long-sleeved top with ruffled cuffs and a purple and white plaid pinafore dress. She has multiple necklaces and rings on her fingers. A tattoo is visible on her left thigh. A purple handbag with a fluffy pom-pom is held in her right hand. Behind her are large trees with green leaves and branches, casting shadows on the grass. In the background, a red car is parked on the side of the road. A building is visible in the distance behind the trees. The sky is blue and clear.

eval_img_7
Ground Truth: A view of a beach with a cloudy sky overhead shows various layers of clouds, some appearing darker and more dense, while others are lighter with hints of pink and white, particularly towards the horizon line. The sea stretches across the middle of the image, its surface reflecting the light from the sky, creating a shimmering effect with subtle ripples and waves. Foamy white waves are rolling onto the shore, creating a distinct line of white against the darker sand. The sand in the foreground is wet and appears dark grey or brown, with a smooth texture where the water has receded, and a slightly more textured appearance further back. The horizon line is visible in the distance, separating the sea from the sky. There are a few small, dark, indistinct objects visible on the horizon line.
Original Model Prediction: A view of the ocean from a beach at sunset.
Fine-tuned Model Prediction: A beach scene is captured under a cloudy sky that transitions from light blue at the top to dark grey and white towards the horizon. The sky is filled with large, layered clouds of varying shades of grey and grey, some of which are illuminated by soft, warm light. The clouds are scattered across the sky, creating a gradient effect. Below the clouds, the ocean stretches out to the horizon, with a few small islands visible on the water's surface. The water is a dark blue-grey color, reflecting the sky and clouds. In the foreground, the wet sand of the beach is visible, with white foamy waves crashing onto the shore. The sand appears to be a light beige color with darker speckles. The overall lighting is soft and diffused, suggesting a sunset or sunrise setting.

eval_img_8
Ground Truth: A dirt path scattered with rocks leads upwards into the distance in the center of the image, bordered by snow-covered ground and rocks. Coniferous trees are present on both sides of the path, with some having green needles and others displaying golden yellow needles. Snow covers the ground, rocks, and trees. In the background, snow-covered mountains with rocky peaks are partially obscured by clouds. The sky is blue with white clouds.
Original Model Prediction: A snow covered trail in the middle of a snowy mountain.
Fine-tuned Model Prediction: A dirt path winds through a snowy landscape, leading towards a mountain range in the background under a blue sky with white clouds. The path is made of brown dirt and is surrounded by snow-covered rocks and patches of green and yellow vegetation. The ground is also covered in snow, with patches of snow visible on the rocks and grass. On the left side of the path, there is a rocky slope covered in white snow, and on the right side, there are tall pine trees with thin trunks and some bare trees with yellow and green foliage. Behind the trees, the mountain range is visible, with snow covering the peaks and slopes. The sky is bright blue with scattered white clouds, and the overall scene is bright and sunny.

[Figure: val_loss plot]

Training Settings

Trained with Florence-2ner using this config and ~20K images:

{
    "model_name": "microsoft/Florence-2-base",
    "task_prompt": "<CAPTION>",
    "dataset_path": "./0000_Datasets/Gemini-512lim",
    "wandb_project_name": "Florence-2-base",
    "run_name": "Florence-2-base-Castollux-v0.5-run2",
    "epochs": 2,
    "optimizer": "CAME",
    "learning_rate": 5e-6,
    "lr_scheduler": "REX",
    "gradient_checkpointing": true,
    "freeze_vision": false,
    "freeze_language": false,
    "freeze_other": false,
    "train_batch_size": 8,
    "eval_batch_size": 8,
    "gradient_accumulation_steps": 8,
    "clip_grad_norm": 1,
    "weight_decay": 1e-2,
    "save_total_limit": 3,
    "save_steps": 10,
    "eval_steps": 10,
    "warmup_steps": 50,
    "eval_split_ratio": 0.01,
    "seed": 42,
    "filtering_processes": 128,
    "attn_implementation": "sdpa"
}
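
As a rough reading of the config above, the effective batch size and the approximate number of optimizer steps can be worked out as below. This is a back-of-the-envelope sketch: the image count of 20,000 is only the "~20K" figure rounded, the function name is hypothetical, and whether the trainer rounds the last partial batch up or drops it is an assumption.

```python
import math


def estimate_training_steps(
    num_images: int,
    eval_split_ratio: float,
    train_batch_size: int,
    gradient_accumulation_steps: int,
    epochs: int,
) -> dict:
    """Estimate effective batch size and optimizer steps from config values."""
    # Images held out for evaluation are not used for training.
    train_images = num_images - round(num_images * eval_split_ratio)
    # Samples consumed per optimizer update.
    effective_batch = train_batch_size * gradient_accumulation_steps
    # Assumption: the final partial batch still counts as a step.
    steps_per_epoch = math.ceil(train_images / effective_batch)
    return {
        "train_images": train_images,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


estimate = estimate_training_steps(
    num_images=20_000,  # approximation of "~20K images"
    eval_split_ratio=0.01,
    train_batch_size=8,
    gradient_accumulation_steps=8,
    epochs=2,
)
print(estimate)
```

Under these assumptions the effective batch size is 64 and the run lasts roughly 620 optimizer steps, so the 50 warmup steps would cover about the first 8% of training.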
Model size: 271M params (Safetensors). Tensor type: BF16.