How I built an on-device plate OCR scanner for iPhone with Apple Vision
Nikos Katsikanis - April 18, 2026
I wanted live number-plate text recognition in an iPhone app without sending frames to a server or leaning on a third-party SDK. Apple’s text recognition was already decent. The hard part was making live reads stable enough to trust.
What I actually wanted
I wanted the whole pipeline to run locally, in real time, inside an iOS app using Apple’s own frameworks. That meant camera frames from AVFoundation, text recognition from Vision, image work from Core Image, and the rest in Swift.
Getting OCR (optical character recognition) out of one frame was easy enough. Making the scanner behave well in a live camera feed was the real job.
Live plate scanning has real-world noise:
- motion blur
- changing light
- reflections
- poor focus
- background text
- partial reads
- different guesses on different frames
So the interesting problem was not how to run OCR. It was how to stop a good OCR engine from wobbling in all the normal ways a live camera feed wobbles.
The stack stayed simple
- AVFoundation for the camera feed
- Vision for text recognition
- Core Image for orientation and cropping
- Swift for matching, voting, smoothing, and state
No server calls. No external SDK. The core pieces were VNRecognizeTextRequest on the Vision side and captureDevicePointConverted(fromLayerPoint:) on the camera side.
final class ScannerStore: NSObject, ObservableObject {
    let session = AVCaptureSession()
    // Session configuration and camera control stay off the main thread.
    private let sessionQueue = DispatchQueue(label: "scanner.session")
    // OCR gets its own queue so slow recognition never stalls capture.
    private let ocrQueue = DispatchQueue(label: "scanner.ocr")
    private var isProcessingFrame = false
    private var lastProcessedAt: TimeInterval = 0
}
I kept the scanner inside one long-lived store object so camera setup, OCR requests, and published detection state all stayed in one place. Two queues mattered: one for camera work and one for OCR. If OCR blocks camera work, the scanner feels bad immediately.
Technique #1: I did not OCR every frame
A common mistake is trying to process every frame. That burns CPU, heats the phone, drains battery, and leaves the UI reacting to stale work.
guard shouldProcessFrame(now: now) else { return }
I gated frame processing in two ways:
- skip frames while an OCR request is still running
- enforce a minimum interval between scans
It was much better to process fewer recent frames than a backlog of old ones.
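That gate is just a couple of guards over the two stored properties. A minimal sketch, assuming a half-second minimum interval (the exact value is a tuning choice, not from the original):

```swift
import Foundation

// Hypothetical frame gate: skip frames while OCR is still in flight,
// and enforce a minimum interval between scans.
final class FrameGate {
    private var isProcessingFrame = false
    private var lastProcessedAt: TimeInterval = 0
    private let minimumInterval: TimeInterval

    init(minimumInterval: TimeInterval = 0.5) {
        self.minimumInterval = minimumInterval
    }

    func shouldProcessFrame(now: TimeInterval) -> Bool {
        // Skip if a Vision request is still running.
        guard !isProcessingFrame else { return false }
        // Skip if the last scan was too recent.
        guard now - lastProcessedAt >= minimumInterval else { return false }
        isProcessingFrame = true
        lastProcessedAt = now
        return true
    }

    // Call from the OCR completion handler so the next frame can pass.
    func finishedProcessing() {
        isProcessingFrame = false
    }
}
```

The point is that both conditions drop the *current* frame rather than queueing it; a fresh frame will arrive shortly anyway.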
Technique #2: the ROI (region of interest) was the biggest single gain
The biggest improvement did not come from OCR settings. It came from telling Vision where to look.
private let defaultVisionRegion =
CGRect(x: 0.05, y: 0.35, width: 0.90, height: 0.30)
request.regionOfInterest = roi
Instead of scanning the whole frame, I constrained recognition to a centre band where users naturally place the plate. That gave me less irrelevant text, faster OCR, fewer false positives, and much more stable output.
Without ROI, Vision will happily read signs, bumper stickers, dealership branding, reflections, or whatever else happens to be in frame. With ROI, it stops wandering.
Apple documents regionOfInterest in Vision’s normalised image space, which is why it helps to be explicit about coordinate conversion instead of hand-waving it.
UIKit overlays and Vision do not share the same coordinates
If the guide box on screen uses top-left coordinates and Vision uses bottom-left coordinates, the Y axis has to be flipped. If OCR looks offset from the visible guide box, this is one of the first places I check.
func updateOverlayRegion(uiRect: CGRect, in viewSize: CGSize) {
    // Normalise the UIKit rect (top-left origin) to unit coordinates.
    let nx = uiRect.origin.x / viewSize.width
    let nyTop = uiRect.origin.y / viewSize.height
    let nw = uiRect.size.width / viewSize.width
    let nh = uiRect.size.height / viewSize.height
    // Vision's origin is bottom-left, so flip the Y axis.
    let visionY = 1.0 - nyTop - nh
    let roi = CGRect(x: nx, y: visionY, width: nw, height: nh)
    // roi is what gets assigned to request.regionOfInterest.
}
That part follows Vision’s documented coordinate system for image requests and bounding boxes, so I treat it as plumbing that has to be right before I tune anything else.
Technique #3: focus and exposure should follow the same target
If the app already knows where the plate should be, the camera should prioritise that area too.
The one caveat is coordinate space. The camera device does not use Vision’s ROI coordinates directly, so the visible guide box or ROI centre needs to be converted before being applied.
let pointOfInterest = previewLayer.captureDevicePointConverted(
fromLayerPoint: overlayCenter
)
device.focusPointOfInterest = pointOfInterest
device.exposurePointOfInterest = pointOfInterest
device.focusMode = .continuousAutoFocus
device.exposureMode = .continuousAutoExposure
This made frames noticeably sharper. A lot of OCR problems are really camera problems first. Apple’s preview-layer conversion method is the bit that keeps the target point honest.
Technique #4: I tuned Vision for plate text, not words
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
request.usesLanguageCorrection = false
request.recognitionLanguages = ["en-GB"]
request.regionOfInterest = roi
request.minimumTextHeight = 0.09
These settings mattered for practical reasons:
- .accurate, because a wrong read costs more than a slightly slower pass
- usesLanguageCorrection = false, because number plates are identifiers, not normal words
- recognitionLanguages, because a likely locale can make the recogniser behave more predictably, even though plate reading is not natural language
- minimumTextHeight, because tiny background text should be ignored
Apple documents minimumTextHeight relative to image height. In practice I tuned it empirically based on how large the plate usually appeared in my chosen framing, instead of pretending one threshold fit every case.
if roi.height >= 0.25 {
    request.minimumTextHeight = 0.10
} else if roi.height >= 0.18 {
    request.minimumTextHeight = 0.09
} else {
    request.minimumTextHeight = 0.07
}
Technique #5: I kept only the largest plausible text block
Even inside the ROI, Vision often returns multiple text observations. In my guided scanner flow, the plate was usually the largest meaningful text block, so I kept the biggest plausible bounding box and ignored the rest.
let largest = observations.max(by: {
    ($0.boundingBox.width * $0.boundingBox.height) <
    ($1.boundingBox.width * $1.boundingBox.height)
})
That stripped out a lot of accidental matches from badges, small labels, nearby signs, and partial text. It is a guided-scanner heuristic, not a universal truth.
Technique #6: I asked for multiple candidates and normalised common mistakes
A naive implementation trusts only the top candidate.
observation.topCandidates(3)
That mattered because OCR often returns close alternatives:
- AB12CDE
- ABI2CDE
- AB12ODE
One frame might rank the wrong one first, but the right answer is often still in the candidate list. Before scoring them, I normalised obvious OCR confusions like O and 0, I and 1, or S and 5. That is where domain knowledge beats generic OCR.
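A minimal normalisation sketch, assuming the current UK format of two letters, two digits, three letters; the confusion map and the format check are illustrative, not exhaustive:

```swift
import Foundation

// Hypothetical position-aware normaliser for UK-style plates (AA99 AAA).
// Characters are coerced toward the class expected at each position.
func normalisePlate(_ raw: String) -> String? {
    let cleaned = raw.uppercased().filter { $0.isLetter || $0.isNumber }
    guard cleaned.count == 7 else { return nil }

    // Common OCR confusions, mapped in both directions.
    let toLetter: [Character: Character] = ["0": "O", "1": "I", "5": "S", "8": "B"]
    let toDigit:  [Character: Character] = ["O": "0", "I": "1", "S": "5", "B": "8"]

    var result = ""
    for (index, char) in cleaned.enumerated() {
        // Positions 2 and 3 are digits in this format; the rest are letters.
        let wantsDigit = (index == 2 || index == 3)
        if wantsDigit {
            result.append(char.isNumber ? char : (toDigit[char] ?? char))
        } else {
            result.append(char.isLetter ? char : (toLetter[char] ?? char))
        }
    }
    return result
}
```

With this, ABI2CDE collapses to AB12CDE before scoring, so the two readings vote for the same value instead of splitting the tally.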
Technique #7: confidence voting was better than trusting one frame
Instead of believing the first candidate from one frame, I summed confidence by normalised plate value.
totals[plate, default: 0] += confidence
If the same normalised plate kept showing up across candidates and frames, it pulled ahead naturally. That was far more stable than trusting one frame’s top guess.
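The per-frame voting loop can be sketched with plain tuples standing in for Vision's observation and candidate types, so the logic is clear in isolation; `normalise` is whatever plate normalisation the app applies first:

```swift
// Hypothetical per-frame vote: each inner array stands in for one
// observation's topCandidates(3), modelled as (string, confidence)
// pairs. Confidence is summed per normalised plate value.
func voteOnFrame(candidateLists: [[(string: String, confidence: Double)]],
                 normalise: (String) -> String?) -> [String: Double] {
    var totals: [String: Double] = [:]
    for candidates in candidateLists {
        for candidate in candidates {
            // Drop candidates that do not normalise to a plausible plate.
            guard let plate = normalise(candidate.string) else { continue }
            totals[plate, default: 0] += candidate.confidence
        }
    }
    return totals
}
```

Because normalisation runs before the tally, near-miss candidates reinforce the right answer instead of competing with it.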
Technique #8: temporal smoothing made it feel trustworthy
Even strong OCR lies for one frame. Focus wobbles, glare hits, or a car moves just enough to produce nonsense. So I stopped asking whether one frame was correct and started asking whether the same answer had stayed strongest for long enough.
private var recentPlates: [(plate: String,
confidence: Double,
time: Date)] = []
I kept a rolling window and only accepted a plate once it had enough accumulated weight over a short time span. This was one of the biggest gains in the whole scanner.
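A sketch of that rolling window, assuming a 1.5-second window and an acceptance weight of 2.0 (both are tuning assumptions, not values from the original):

```swift
import Foundation

// Hypothetical temporal smoother: keep recent reads, drop anything
// older than the window, and only accept a plate once its summed
// confidence inside the window clears a threshold.
final class TemporalSmoother {
    private var recentPlates: [(plate: String, confidence: Double, time: Date)] = []
    private let window: TimeInterval
    private let acceptWeight: Double

    init(window: TimeInterval = 1.5, acceptWeight: Double = 2.0) {
        self.window = window
        self.acceptWeight = acceptWeight
    }

    // Returns the accepted plate, or nil while evidence is still thin.
    func record(plate: String, confidence: Double, at time: Date = Date()) -> String? {
        recentPlates.append((plate, confidence, time))
        recentPlates.removeAll { time.timeIntervalSince($0.time) > window }

        var totals: [String: Double] = [:]
        for entry in recentPlates {
            totals[entry.plate, default: 0] += entry.confidence
        }
        if let best = totals.max(by: { $0.value < $1.value }),
           best.value >= acceptWeight {
            return best.key
        }
        return nil
    }
}
```

A single glare-corrupted frame cannot clear the threshold on its own, which is exactly the behaviour that made the scanner feel trustworthy.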
Technique #9: duplicate cooldowns stopped spammy detections
Once a plate was accepted, I usually did not want the same result firing over and over.
if recentlySeen(plate) { return }
That mattered for repeated pass-through scenarios like gates, parking workflows, or evidence capture.
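A minimal cooldown sketch; the 10-second interval is an assumption to tune per workflow:

```swift
import Foundation

// Hypothetical cooldown: once a plate is accepted, suppress repeat
// detections of the same value for a fixed interval.
final class DetectionCooldown {
    private var lastAccepted: [String: Date] = [:]
    private let cooldown: TimeInterval

    init(cooldown: TimeInterval = 10) {
        self.cooldown = cooldown
    }

    // True if this plate fired recently; otherwise records it and lets it through.
    func recentlySeen(_ plate: String, at time: Date = Date()) -> Bool {
        if let last = lastAccepted[plate],
           time.timeIntervalSince(last) < cooldown {
            return true
        }
        lastAccepted[plate] = time
        return false
    }
}
```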
The full matching pipeline
- frame gating
- ROI constraint
- focus and exposure bias
- Vision OCR
- minimum text size filter
- largest text block only
- multi-candidate extraction
- normalisation
- confidence voting
- temporal smoothing
- duplicate cooldown
None of those changes is dramatic on its own. Together they turned the scanner from a demo into something I would actually trust in a live feed.
What moved the needle most
- ROI constraints
- temporal smoothing
- multi-candidate voting plus normalisation
- largest text filtering
- focus and exposure targeting
I cropped saved evidence to the same region the scanner used
If I needed to save or upload an evidence image, I cropped it to the same ROI used during recognition.
private func cropRectForVisionRegion(_ region: CGRect,
                                     in extent: CGRect) -> CGRect {
    // region is in Vision's normalised space; extent is the CIImage extent.
    // Both use a bottom-left origin, so no Y flip is needed here.
    let rect = CGRect(
        x: extent.minX + (region.origin.x * extent.width),
        y: extent.minY + (region.origin.y * extent.height),
        width: region.width * extent.width,
        height: region.height * extent.height
    )
    // Pad a little for context, then clamp to the image bounds.
    return rect.insetBy(dx: -24, dy: -16).intersection(extent)
}
That kept saved images aligned with what the scanner had actually analysed instead of saving a misleading full frame.
If I built v2
- a fallback .fast OCR pass when accurate mode finds nothing
- country-specific plate formats
- adaptive thresholds based on lighting
- local detection history
- bounding-box tracking between frames
- motion-assisted frame selection
Closing
Apple gives you the OCR engine. The product quality comes from the heuristics around it.
For me, that meant constraining the search space, improving the camera input, comparing multiple candidates, normalising obvious mistakes, and requiring consistency over time. That is what turned a working prototype into a plate scanner that felt stable enough to use live on an iPhone.