Florence 2 VQA - Engineering Drawings

Model Overview

The Florence 2 VQA model is fine-tuned for visual question answering (VQA) on engineering drawings. It takes an image (e.g., a technical drawing) and a textual question as input, and generates a text answer grounded in the content of the image.


Model Details

  • Base Model: microsoft/Florence-2-base-ft
  • Task: Visual Question Answering (VQA)
  • Architecture: Causal Language Model (CLM)
  • Framework: Hugging Face Transformers

How to Use the Model

Install Dependencies

Make sure you have the required libraries installed:

pip install transformers torch datasets pillow gradio

Load the Model

To load the model and processor for inference, use the following code:

from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Determine if a GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load configuration from the base model
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Load the model using the base model's configuration
model = AutoModelForCausalLM.from_pretrained(
    "fauzail/Florence-2-VQA",
    config=config,
    trust_remote_code=True
).to(device)

Load the Processor

from transformers import AutoProcessor

# Load the processor for the model
processor = AutoProcessor.from_pretrained("fauzail/Florence-2-VQA", trust_remote_code=True)
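
The processor bundles the image processor and the tokenizer, so a single call can prepare both the drawing and the question for the model.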

Define the Prediction Function

Once the model and processor are loaded, define a prediction function that takes an image path and a question as input:

from PIL import Image

def predict(image_path, question):
    # Load the image and ensure it is in RGB format
    image = Image.open(image_path).convert("RGB")

    # Prepare inputs using the processor
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)

    # Generate the output; max_new_tokens bounds the answer length so it is not cut off early
    outputs = model.generate(**inputs, max_new_tokens=256)

    # Decode the output tokens into a human-readable answer
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

Test the Model

Now, test the model using an image and a question:

image_path = "test.png"  # Replace with your image path
question = "Describe the image in detail."

# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)

Alternative: Use Gradio for Interactive Web Interface

If you prefer an interactive interface, you can use Gradio to deploy the model:

import gradio as gr

# Define the prediction function for Gradio (receives a PIL image directly)
def predict(image, question):
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface; type="pil" ensures the function receives a PIL image
interface = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)

# Launch the Gradio interface
interface.launch()
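
By default, launch() serves the interface on a local URL; passing share=True creates a temporary public link if you want to share the demo.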

Training Details

  • Preprocessing:
    • Images were resized and normalized.
    • Text data (questions and answers) was tokenized using the Florence tokenizer.
  • Hyperparameters:
    • Learning Rate: 1e-6
    • Batch Size: 2
    • Gradient Accumulation Steps: 4
    • Epochs: 10

Training was performed using mixed precision for efficiency.
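
For orientation, here is a minimal sketch of a training setup consistent with the hyperparameters above, built on the Hugging Face Trainer. The train_dataset and collate_fn names are hypothetical placeholders; the actual training script for this checkpoint has not been published.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="florence2-vqa-finetune",
    learning_rate=1e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    fp16=True,  # mixed precision, as noted above
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical: (drawing, question, answer) examples
    data_collator=collate_fn,     # hypothetical: batches images and text via the processor
)
trainer.train()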

