OCR Quality Assessment Using a Unigram Language Model

This HuggingFace model repository contains a unigram language model built for OCR quality assessment.
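As a rough illustration of how a bloom filter dictionary can support OCR quality assessment, the sketch below scores a text by the share of its tokens found in the filter. This is an assumption about the general approach, not a description of this repository's scoring method: the TinyBloom class, the known_token_ratio function, the SHA-256 based hashing, and the \w+ tokenization are all hypothetical stand-ins.

```python
import hashlib
import re
from typing import Iterator


class TinyBloom:
    """Minimal Bloom filter for illustration only (hypothetical, not the repository's format)."""

    def __init__(self, n_bits: int, n_hashes: int) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, token: str) -> Iterator[int]:
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, token: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(token))


def known_token_ratio(text: str, bloom: TinyBloom) -> float:
    """Fraction of lowercased word tokens that the filter reports as known."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in bloom for token in tokens) / len(tokens)
```

A text whose tokens are all in the dictionary scores 1.0, and a score well below that hints at OCR noise. Because bloom filters admit false positives, the score can slightly overestimate quality.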

Model & Bloom Filter Integration

The build process creates bloom filter dictionaries with the following metadata:

Version: A specific version identifier (e.g. v1.0.0)
Language: The target language (e.g. en)
Model Name: A short identifier (e.g. wp for Wikipedia)
False Positive Probability: The target FP probability (e.g. 0.001)
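
To show what the false positive probability implies for file size, the sketch below applies the standard Bloom filter sizing formulas: for n entries and target probability p, the bit count is m = -n ln(p) / (ln 2)^2 and the hash count is k = (m / n) ln 2. Whether the build process uses exactly these formulas is an assumption, and the function name bloom_parameters is hypothetical.

```python
import math
from typing import Tuple


def bloom_parameters(n_items: int, fp_probability: float) -> Tuple[int, int]:
    """Return (bit count, hash count) for a standard Bloom filter.

    Uses the textbook formulas; the actual build tooling may size
    its filters differently (assumption).
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 < fp_probability < 1.0:
        raise ValueError("fp_probability must be strictly between 0 and 1")
    n_bits = math.ceil(-n_items * math.log(fp_probability) / (math.log(2) ** 2))
    n_hashes = max(1, round(n_bits / n_items * math.log(2)))
    return n_bits, n_hashes
```

With one million entries and a target of 0.001, this gives roughly 14.4 million bits (about 1.8 MB) and 10 hash functions.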

The bloom filter dictionaries are first generated in a designated build directory (BUILD_DIR) and then copied into this repository using a flat layout: all built bloom filter files sit in a single directory (e.g. bloom/), with no nested subfolders.
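
The flat layout can be sketched as follows. This is a minimal illustration, not the repository's actual copy step: the copy_flat function and the *.bloom file pattern are assumptions, since the real file extension and naming scheme are not stated here.

```python
import shutil
from pathlib import Path
from typing import List


def copy_flat(build_dir: Path, bloom_dir: Path, pattern: str = "*.bloom") -> List[Path]:
    """Copy every matching file under build_dir into bloom_dir, dropping subfolders.

    The "*.bloom" pattern is a hypothetical default.
    """
    bloom_dir.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    seen_names = set()
    for src in sorted(build_dir.rglob(pattern)):
        if not src.is_file():
            continue
        if src.name in seen_names:
            # Two build files with the same name cannot coexist in a flat directory.
            raise ValueError(f"duplicate file name in flat layout: {src.name}")
        seen_names.add(src.name)
        dest = bloom_dir / src.name
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

Raising on duplicate names is a design choice for the sketch: in a flat directory, a silent overwrite would hide one of the two build artifacts.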

Deployment Workflow

The Makefile provides the following targets:

copy-bloom: Copies the built bloom filter file to bloom/.
commit-bloom: Automatically stages and commits the update with a descriptive commit message.
push-bloom: Pushes the commit to the remote repository.
deploy-bloom: Aggregates the above steps into one deployment command.
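
The sequence these targets describe can be sketched in Python as below. This is not the Makefile itself: the function names, the commit message, and the exact git invocations are assumptions chosen to mirror the copy, commit, and push steps.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def deploy_plan(bloom_file: str, message: str) -> List[List[str]]:
    """Return the git commands for the commit and push steps (hypothetical helper)."""
    target = f"bloom/{Path(bloom_file).name}"
    return [
        ["git", "add", target],            # stage the copied file
        ["git", "commit", "-m", message],  # commit with a descriptive message
        ["git", "push"],                   # publish to the remote repository
    ]


def deploy(build_file: Path, repo_dir: Path, message: str) -> None:
    """Copy a built file into repo_dir/bloom, then stage, commit, and push it."""
    dest_dir = repo_dir / "bloom"
    dest_dir.mkdir(exist_ok=True)
    shutil.copy2(build_file, dest_dir / build_file.name)
    for command in deploy_plan(build_file.name, message):
        # check=True stops the sequence as soon as one step fails.
        subprocess.run(command, cwd=repo_dir, check=True)
```

Keeping the command list separate from its execution makes the steps easy to inspect or dry-run before anything is pushed.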

This integration keeps the workflow modular: build artifacts created in BUILD_DIR are brought into the HuggingFace model repository with a single deploy-bloom invocation.

...existing model usage and evaluation instructions...