# Fine-tuning YOLOv11 to detect stamps and signatures on banking documents - a practical walkthrough

DEV Community
Muhammad umair akram

Every day, banking ops teams manually review thousands of documents.

There are a few reasonable starting points for document object detection:

- **Layout-aware models** like LayoutLMv3 or Donut - strong for structured forms, but heavier, harder to fine-tune for a narrow task, and slower at inference. Overkill if you only need to detect a small set of objects (stamps, signatures, initials).
- **Classical OpenCV approaches** - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.
- **YOLO family (v8, v11)** - the sweet spot for object detection on documents. Fast, well-documented, easy to fine-tune, and the precision/recall tradeoff is tunable to ops-team requirements.

I went with YOLOv11. The `ultralytics` Python package handles most of the busywork, inference runs well under 100ms per page on a modest GPU, and the architecture handles small objects - which stamps often are at low scan resolutions - better than older versions.

## The 80%: data preparation and annotation

Anyone who's shipped CV in production will tell you the same thing: the model is the easy part. Data is where the time goes.

**Annotation tooling.** I used Roboflow - clean web UI for bounding-box labeling, automatic train/val/test splits, easy export to YOLO format. CVAT is the open-source alternative if you can't use a SaaS for compliance reasons.

**Class taxonomy.** Resist the urge to define ten classes on day one. Start with the smallest set that solves the business problem:

- `signature`
- `stamp`
- (optionally `handwritten_initials` if your forms include them)

More classes means more labeled examples per class, more failure modes, and a harder model to debug. You can always split a class later. You can rarely merge messy ones cleanly.

**Train/val/test split discipline.** Separate documents into the three splits by source, not just randomly.
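A source-aware split can be sketched like this. It's a rough sketch, not code from the project: `source_of` is a hypothetical helper that maps a scan to its originating branch, customer, or form template, and the 80/10/10 ratios are just a common default.

```python
import random
from collections import defaultdict

def split_by_source(paths, source_of, seed=42):
    """Assign whole sources to train/val/test so no source straddles splits."""
    by_source = defaultdict(list)
    for p in paths:
        by_source[source_of(p)].append(p)

    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    n = len(sources)
    cut_train, cut_val = int(n * 0.8), int(n * 0.9)
    splits = {'train': [], 'val': [], 'test': []}
    for i, s in enumerate(sources):
        key = 'train' if i < cut_train else 'val' if i < cut_val else 'test'
        splits[key].extend(by_source[s])
    return splits
```

Splitting at the source level costs you some balance in split sizes, but it's the only way the validation metric measures object detection rather than template memorization.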
If the same form template appears in both train and val, your validation metric is lying to you - the model is learning the form layout, not the object. In a regulated environment where wrong predictions cost real money, you cannot afford a lying validation set.

**Augmentation strategy - and why the defaults are wrong for documents.** The off-the-shelf YOLO augmentation defaults are designed for natural images. They include rotation up to 30°, mosaic, MixUp. For documents, that's actively wrong:

- Rotation should be tightly limited (±5°). Documents are upright. Heavy rotation creates training examples that don't reflect production input.
- Mosaic augmentation should be off. Pasting four documents into a 2×2 grid produces inputs that don't exist at inference time.
- What helps instead: brightness/contrast variation (different scan qualities), JPEG compression noise (low-quality scans), partial occlusion (parts of the document obscured), Gaussian blur (out-of-focus phone shots).

> The single biggest accuracy gain in my project came from augmenting for phone-photographed scans. Production data was messier than my training set assumed - closing that gap mattered more than any architecture change.

## Training configuration that actually matters

Most YOLO hyperparameters are fine at defaults. The ones that move the needle on documents:

```python
from ultralytics import YOLO

model = YOLO('yolo11m.pt')
results = model.train(
    data='dataset.yaml',
    epochs=100,
    imgsz=1024,    # higher imgsz matters for small stamps
    batch=8,
    lr0=0.001,
    patience=20,   # early stopping if mAP stalls
    augment=True,
    mosaic=0.0,    # off for documents
    degrees=5,     # limit rotation
    fliplr=0.0,    # don't horizontally flip docs
)
```

Two things worth flagging:

**`imgsz=1024`, not 640.** Stamps at low resolution can become a few pixels - too small for the model to detect reliably.
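A back-of-envelope check shows why. Assume an A4 page (long side ~29.7 cm), a ~2 cm stamp, and YOLO's standard detection strides of 8/16/32; these numbers are illustrative, not from the original project:

```python
def stamp_pixels(imgsz, page_cm=29.7, stamp_cm=2.0):
    """Approximate stamp height in pixels after resizing the page's long side to imgsz."""
    return imgsz * stamp_cm / page_cm

# Footprint of the stamp on the coarsest feature map (stride 32):
for imgsz in (640, 1024):
    px = stamp_pixels(imgsz)
    print(f"imgsz={imgsz}: stamp ~{px:.0f} px, ~{px / 32:.1f} cells at stride 32")
```

At 640 the stamp covers barely more than one stride-32 cell; at 1024 it covers over two, which gives the detection head meaningfully more signal to work with.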
Higher input size costs more compute per image, but the precision gain on small objects is substantial.

**Disable horizontal flipping.** A flipped form is a wrong form. Augmentations that produce never-seen-in-production inputs hurt generalization on the inputs you actually care about.

## The metric you should actually optimize for

Most tutorials default to `mAP@0.5`. For document AI in a regulated environment, that's the wrong primary metric.

Ops teams care about **precision**. When the model says "there's a signature here," they need it to be right. A false positive sends a document downstream that shouldn't be there, costing reviewer time. A false negative is recoverable - the document falls back to manual review, which is the existing baseline.

Track both, but if you have to optimize one, optimize precision. Your ops manager will thank you.

## Inference and deployment

A model that runs on a GPU is fun. A model that runs on a CPU is shippable. For most document-AI workloads - where you're processing on the order of dozens to hundreds of pages per minute, not millions - CPU inference with an ONNX-exported model is faster to deploy, cheaper to run, and far more compatible with locked-down production environments where GPU drivers are a fight you don't want.

The flow is:

1. Train with `ultralytics` (PyTorch backend, GPU during training)
2. Export the trained weights to ONNX
3. Serve via `ultralytics`'s ONNX-runtime path on CPU at inference time

Step 2 is one line:

```python
from ultralytics import YOLO

model = YOLO('best.pt')
model.export(format='onnx')  # writes best.onnx alongside best.pt
```

Step 3 - the inference service:

```python
from fastapi import FastAPI, UploadFile
from ultralytics import YOLO
from PIL import Image
import io

app = FastAPI()
model = YOLO('best.onnx')  # ONNX runtime, CPU-only

@app.post('/detect')
async def detect(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read()))
    results = model(image)

    detections = []
    for r in results:
        for box in r.boxes:
            detections.append({
                'class': model.names[int(box.cls)],
                'confidence': float(box.conf),
                'bbox': box.xyxy.tolist()[0],
            })

    return {'detections': detections}
```

The most important line in that snippet is `model = YOLO('best.onnx')` at module level - load the model **once at startup**, never per request. Reloading the model on every request is the most common production mistake I've seen on YOLO endpoints. It's the difference between 50ms response time and 5,000ms.

For the container: a slim Python base image (`python:3.11-slim`) is enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image ends up under 500MB, starts in seconds, and runs anywhere - including locked-down corporate VMs and on-prem environments where shipping a GPU-dependent service is months of approvals you don't have.

That's the real tradeoff: you give up a small amount of per-request latency in exchange for a service that deploys today, not next quarter.

## What the tutorials don't tell you

Three lessons the standard YOLO blog posts skip:

**1. The long tail of weird scans is where production breaks.** Faxed pages with horizontal banding, partially photocopied documents, phone shots with one corner cut off, watermarks bleeding through from the back side. Your training set won't include enough of these.
Get a sample of real production input as fast as possible - even just 50 images - and use them for evaluation, not training. They tell you what the world actually looks like.

**2. Log every prediction with the input image hash.** When the model fails in production, you want to be able to find the exact input that broke it, retroactively. Hash the input, log the prediction, store both. That's how you build round-2 training data without hunting.

**3. Don't chase mAP@0.5.** Diminishing returns. If your business needs 95% precision at 70% recall, optimize for that operating point - not for a metric that summarizes the whole curve. Talk to your ops team. Get the actual numbers they care about. Train against those.

## Closing

The model is not the bottleneck for document AI. The bottleneck is annotation discipline, augmentation tuned to real production input, and deployment that doesn't blow up under load. If you're building computer vision for regulated industries - banking, insurance, legal, healthcare - the playbook above is what's worked for me. The frameworks change. The data discipline doesn't.