# Fine-tuning YOLOv11 to detect stamps and signatures on banking documents - a practical walkthrough

DEV Community
Muhammad umair akram

Every day, banking ops teams manually review thousands of documents.

There are a few reasonable starting points for document object detection:

- **Layout-aware models** like LayoutLMv3 or Donut - strong for structured forms, but heavier, harder to fine-tune for a narrow task, and slower at inference. Overkill if you only need to detect a small set of objects (stamps, signatures, initials).
- **Classical OpenCV approaches** - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.
- **YOLO family (v8, v11)** - the sweet spot for object detection on documents. Fast, well-documented, easy to fine-tune, and the precision/recall tradeoff is tunable to ops-team requirements.

I went with YOLOv11. The `ultralytics` Python package handles most of the busywork, inference runs well under 100ms per page on a modest GPU, and the architecture handles small objects - which stamps often are at low scan resolutions - better than older versions.

## The 80%: data preparation and annotation

Anyone who's shipped CV in production will tell you the same thing: the model is the easy part. Data is where the time goes.

**Annotation tooling.** I used Roboflow - clean web UI for bounding-box labeling, automatic train/val/test splits, easy export to YOLO format. CVAT is the open-source alternative if you can't use a SaaS for compliance reasons.

**Class taxonomy.** Resist the urge to define ten classes on day one. Start with the smallest set that solves the business problem:

- `signature`
- `stamp`
- (optionally `handwritten_initials` if your forms include them)

More classes means more labeled examples per class, more failure modes, and a harder model to debug. You can always split a class later. You can rarely merge messy ones cleanly.

**Train/val/test split discipline.** Separate documents into the three splits by source, not just randomly.
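A source-aware split can be sketched like this. It's a rough sketch, not code from the project: `source_of` is a hypothetical helper that maps a scan to its originating branch, customer, or form template, and the 80/10/10 ratios are just a common default.

```python
import random
from collections import defaultdict

def split_by_source(paths, source_of, seed=42):
    """Assign whole sources to train/val/test so no source straddles splits."""
    by_source = defaultdict(list)
    for p in paths:
        by_source[source_of(p)].append(p)

    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    n = len(sources)
    cut_train, cut_val = int(n * 0.8), int(n * 0.9)
    splits = {'train': [], 'val': [], 'test': []}
    for i, s in enumerate(sources):
        key = 'train' if i < cut_train else 'val' if i < cut_val else 'test'
        splits[key].extend(by_source[s])
    return splits
```

Splitting at the source level costs you some balance in split sizes, but it's the only way the validation metric measures object detection rather than template memorization.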
If the same form template appears in both train and val, your validation metric is lying to you - the model is learning the form layout, not the object. In a regulated environment where wrong predictions cost real money, you cannot afford a lying validation set.

**Augmentation strategy - and why the defaults are wrong for documents.** The off-the-shelf YOLO augmentation defaults are designed for natural images. They include rotation up to 30°, mosaic, MixUp. For documents, that's actively wrong:

- Rotation should be tightly limited (±5°). Documents are upright. Heavy rotation creates training examples that don't reflect production input.
- Mosaic augmentation should be off. Pasting four documents into a 2×2 grid produces inputs that don't exist at inference time.
- What helps instead: brightness/contrast variation (different scan qualities), JPEG compression noise (low-quality scans), partial occlusion (parts of the document obscured), Gaussian blur (out-of-focus phone shots).

> The single biggest accuracy gain in my project came from augmenting for phone-photographed scans. Production data was messier than my training set assumed - closing that gap mattered more than any architecture change.

## Training configuration that actually matters

Most YOLO hyperparameters are fine at defaults. The ones that move the needle on documents:

```python
from ultralytics import YOLO

model = YOLO('yolo11m.pt')
results = model.train(
    data='dataset.yaml',
    epochs=100,
    imgsz=1024,    # higher imgsz matters for small stamps
    batch=8,
    lr0=0.001,
    patience=20,   # early stopping if mAP stalls
    augment=True,
    mosaic=0.0,    # off for documents
    degrees=5,     # limit rotation
    fliplr=0.0,    # don't horizontally flip docs
)
```

Two things worth flagging:

**`imgsz=1024`, not 640.** Stamps at low resolution can become a few pixels - too small for the model to detect reliably.
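A back-of-envelope check shows why. Assume an A4 page (long side ~29.7 cm), a ~2 cm stamp, and YOLO's standard detection strides of 8/16/32; these numbers are illustrative, not from the original project:

```python
def stamp_pixels(imgsz, page_cm=29.7, stamp_cm=2.0):
    """Approximate stamp height in pixels after resizing the page's long side to imgsz."""
    return imgsz * stamp_cm / page_cm

# Footprint of the stamp on the coarsest feature map (stride 32):
for imgsz in (640, 1024):
    px = stamp_pixels(imgsz)
    print(f"imgsz={imgsz}: stamp ~{px:.0f} px, ~{px / 32:.1f} cells at stride 32")
```

At 640 the stamp covers barely more than one stride-32 cell; at 1024 it covers over two, which gives the detection head meaningfully more signal to work with.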
Higher input size costs more compute per image, but the precision gain on small objects is substantial.

**Disable horizontal flipping.** A flipped form is a wrong form. Augmentations that produce never-seen-in-production inputs hurt generalization on the inputs you actually care about.

## The metric you should actually optimize for

Most tutorials default to `mAP@0.5`. For document AI in a regulated environment, that's the wrong primary metric.

Ops teams care about **precision**. When the model says "there's a signature here," they need it to be right. A false positive sends a document downstream that shouldn't be there, costing reviewer time. A false negative is recoverable - the document falls back to manual review, which is the existing baseline.

Track both, but if you have to optimize one, optimize precision. Your ops manager will thank you.

## Inference and deployment

A model that runs on a GPU is fun. A model that runs on a CPU is shippable. For most document-AI workloads - where you're processing on the order of dozens to hundreds of pages per minute, not millions - CPU inference with an ONNX-exported model is faster to deploy, cheaper to run, and far more compatible with locked-down production environments where GPU drivers are a fight you don't want.

The flow is:

1. Train with `ultralytics` (PyTorch backend, GPU during training)
2. Export the trained weights to ONNX
3. Serve via `ultralytics`'s ONNX-runtime path on CPU at inference time

Step 2 is one line:

```python
from ultralytics import YOLO

model = YOLO('best.pt')
model.export(format='onnx')  # writes best.onnx alongside best.pt
```

Step 3 - the inference service:

```python
from fastapi import FastAPI, UploadFile
from ultralytics import YOLO
from PIL import Image
import io

app = FastAPI()
model = YOLO('best.onnx')  # ONNX runtime, CPU-only

@app.post('/detect')
async def detect(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read()))
    results = model(image)

    detections = []
    for r in results:
        for box in r.boxes:
            detections.append({
                'class': model.names[int(box.cls)],
                'confidence': float(box.conf),
                'bbox': box.xyxy.tolist()[0],
            })

    return {'detections': detections}
```

The most important line in that snippet is `model = YOLO('best.onnx')` at module level - load the model **once at startup**, never per request. Reloading the model on every request is the most common production mistake I've seen on YOLO endpoints. It's the difference between 50ms response time and 5,000ms.

For the container: a slim Python base image (`python:3.11-slim`) is enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image ends up under 500MB, starts in seconds, and runs anywhere - including locked-down corporate VMs and on-prem environments where shipping a GPU-dependent service is months of approvals you don't have.

That's the real tradeoff: you give up a small amount of per-request latency in exchange for a service that deploys today, not next quarter.

## What the tutorials don't tell you

Three lessons the standard YOLO blog posts skip:

**1. The long tail of weird scans is where production breaks.** Faxed pages with horizontal banding, partially photocopied documents, phone shots with one corner cut off, watermarks bleeding through from the back side. Your training set won't include enough of these.
Get a sample of real production input as fast as possible - even just 50 images - and use them for evaluation, not training. They tell you what the world actually looks like.

**2. Log every prediction with the input image hash.** When the model fails in production, you want to be able to find the exact input that broke it, retroactively. Hash the input, log the prediction, store both. That's how you build round-2 training data without hunting.

**3. Don't chase mAP@0.5.** Diminishing returns. If your business needs 95% precision at 70% recall, optimize for that operating point - not for a metric that summarizes the whole curve. Talk to your ops team. Get the actual numbers they care about. Train against those.

## Closing

The model is not the bottleneck for document AI. The bottleneck is annotation discipline, augmentation tuned to real production input, and deployment that doesn't blow up under load. If you're building computer vision for regulated industries - banking, insurance, legal, healthcare - the playbook above is what's worked for me. The frameworks change. The data discipline doesn't.