Florence 2 VQA - Engineering Drawings

Model Overview

The Florence 2 VQA model is fine-tuned for visual question answering (VQA) on engineering drawings. It takes an image (e.g., a technical drawing) and a textual question as input, and generates a text answer grounded in the content of the image.


Model Details

  • Base Model: microsoft/Florence-2-base-ft
  • Task: Visual Question Answering (VQA)
  • Architecture: Causal Language Model (CLM)
  • Framework: Hugging Face Transformers

How to Use the Model

Install Dependencies

Make sure you have the required libraries installed:

pip install transformers torch datasets pillow gradio

Load the Model

To load the model and processor for inference, use the following code:

from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Determine if a GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load configuration from the base model
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Load the model using the base model's configuration
model = AutoModelForCausalLM.from_pretrained(
    "fauzail/Florence-2-VQA",
    config=config,
    trust_remote_code=True
).to(device)

Load the Processor

from transformers import AutoProcessor

# Load the processor for the model
processor = AutoProcessor.from_pretrained("fauzail/Florence-2-VQA", trust_remote_code=True)
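
The processor bundles the image processor and the tokenizer, so a single call can prepare both the drawing and the question for the model.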

Define the Prediction Function

Once the model and processor are loaded, define a prediction function that takes an image path and a question as input:

from PIL import Image

def predict(image_path, question):
    # Load the image and ensure it is in RGB format
    image = Image.open(image_path).convert("RGB")

    # Prepare inputs using the processor
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)

    # Generate the output; max_new_tokens bounds the answer length so it is not cut off early
    outputs = model.generate(**inputs, max_new_tokens=256)

    # Decode the output tokens into a human-readable answer
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

Test the Model

Now, test the model using an image and a question:

image_path = "test.png"  # Replace with your image path
question = "Describe the image in detail."

# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)

Alternative: Use Gradio for Interactive Web Interface

If you prefer an interactive interface, you can use Gradio to deploy the model:

import gradio as gr

# Define the prediction function for Gradio (receives a PIL image directly)
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface; type="pil" ensures the function receives a PIL image
interface = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
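
By default, launch() serves the interface on a local URL; passing share=True creates a temporary public link if you want to share the demo.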

Training Details

  • Preprocessing:
    • Images were resized and normalized.
    • Text data (questions and answers) was tokenized using the Florence tokenizer.
  • Hyperparameters:
    • Learning Rate: 1e-6
    • Batch Size: 2
    • Gradient Accumulation Steps: 4
    • Epochs: 10

Training was performed using mixed precision for efficiency.
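
For orientation, here is a minimal sketch of a training setup consistent with the hyperparameters above, built on the Hugging Face Trainer. The train_dataset and collate_fn names are hypothetical placeholders; the actual training script for this checkpoint has not been published.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="florence2-vqa-finetune",
    learning_rate=1e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    fp16=True,  # mixed precision, as noted above
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical: (drawing, question, answer) examples
    data_collator=collate_fn,     # hypothetical: batches images and text via the processor
)
trainer.train()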

