AiFormParser

Browser diagnostics

Confirms that the OCR engine and the local LLM run correctly in this browser. Everything on this page stays client-side; nothing is uploaded.

Client-side capability check

Runtime acceleration backend

Surfaces what wllama actually picks at runtime, so you can tell a single-threaded WASM fallback or a missing WebGPU adapter apart from a healthy multi-thread + GPU setup without opening devtools. The pre-load row probes the browser; the post-load row reports what wllama settled on after the model is loaded.

Backend (post-load): load a model to populate.

Pre-load snapshot (JSON)

(pending)
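As a rough illustration of what a pre-load probe can look at, the sketch below checks the standard browser signals that decide whether multi-threaded WASM and WebGPU are available. The snapshot field names and the `probePreload` helper are hypothetical (the page's real JSON shape is not shown here); only `crossOriginIsolated`, `SharedArrayBuffer`, `navigator.hardwareConcurrency`, and `navigator.gpu.requestAdapter()` are real web platform APIs.

```typescript
// Hypothetical snapshot shape; the page's actual JSON fields may differ.
interface PreloadSnapshot {
  crossOriginIsolated: boolean;
  sharedArrayBuffer: boolean;
  hardwareConcurrency: number;
  webgpuAdapter: boolean;
  multiThreadCapable: boolean;
}

// Minimal view of the globals the probe reads, so it can be exercised outside a browser.
interface ProbeEnv {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
  navigator?: {
    hardwareConcurrency?: number;
    gpu?: { requestAdapter(): Promise<unknown> };
  };
}

async function probePreload(
  env: ProbeEnv = globalThis as unknown as ProbeEnv,
): Promise<PreloadSnapshot> {
  const isolated = env.crossOriginIsolated === true;
  const sab = typeof env.SharedArrayBuffer !== "undefined";
  const cores = env.navigator?.hardwareConcurrency ?? 1;

  // requestAdapter() resolves to null when no suitable adapter exists;
  // navigator.gpu itself is undefined when WebGPU is not exposed at all.
  let adapter = false;
  try {
    adapter = (await env.navigator?.gpu?.requestAdapter()) != null;
  } catch {
    adapter = false;
  }

  return {
    crossOriginIsolated: isolated,
    sharedArrayBuffer: sab,
    hardwareConcurrency: cores,
    webgpuAdapter: adapter,
    // Threaded WASM needs SharedArrayBuffer, which browsers gate on cross-origin isolation.
    multiThreadCapable: isolated && sab,
  };
}
```

A snapshot with `multiThreadCapable: false` would point toward the single-threaded WASM fallback, while `webgpuAdapter: false` would point toward a missing adapter. What wllama actually settles on is still only known from the post-load row.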

Console log capture

Mirrors the devtools console (wllama wrapper, llama.cpp native log from suppressNativeLog: false, runtime diagnostics) so you can copy it without opening devtools. Capture starts when the page loads, so reload before running a check if you want a clean trace.
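One common way to mirror console output is to wrap the console methods so that each call is recorded before being forwarded to the original. The sketch below shows that pattern; `installConsoleCapture`, `ConsoleLike`, and the entry shape are illustrative names, not the page's actual implementation.

```typescript
type Level = "log" | "info" | "warn" | "error" | "debug";
type ConsoleLike = Record<Level, (...args: unknown[]) => void>;

interface CapturedEntry {
  level: Level;
  time: number; // milliseconds since epoch, from Date.now()
  text: string;
}

// Render one console argument as text; objects are JSON-encoded when possible.
function formatArg(arg: unknown): string {
  if (typeof arg === "string") return arg;
  try {
    return JSON.stringify(arg) ?? String(arg);
  } catch {
    return String(arg); // e.g. circular structures
  }
}

function installConsoleCapture(target: ConsoleLike) {
  const levels: Level[] = ["log", "info", "warn", "error", "debug"];
  const entries: CapturedEntry[] = [];
  const originals = {} as ConsoleLike;

  for (const level of levels) {
    const original = target[level];
    originals[level] = original;
    target[level] = (...args: unknown[]) => {
      entries.push({ level, time: Date.now(), text: args.map(formatArg).join(" ") });
      original.apply(target, args); // still forward to the real console
    };
  }

  return {
    entries,
    clear(): void {
      entries.length = 0;
    },
    restore(): void {
      for (const level of levels) target[level] = originals[level];
    },
  };
}
```

In a browser this would be called as `installConsoleCapture(console)` as early as possible during page load, which matches the note above that anything logged before capture starts is not in the trace.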

LLM diagnostic

Loads the selected model, generates a few tokens, validates structured output (JSON-schema tool call), then OCRs a synthetic image to confirm multimodal extraction works end to end. Run each check individually or use Run all. Each run appends a row to the results table.
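To give a feel for the structured-output step, here is a minimal sketch of checking a model's tool-call text against a schema. It handles only a tiny, hypothetical subset of JSON Schema (required keys and primitive types) and is not the validator the page uses; `MiniSchema` and `validateToolCall` are made-up names.

```typescript
type PrimitiveType = "string" | "number" | "boolean";

// Hypothetical, heavily reduced schema shape for illustration only.
interface MiniSchema {
  required: string[];
  properties: Record<string, { type: PrimitiveType }>;
}

function validateToolCall(raw: string, schema: MiniSchema): { ok: boolean; errors: string[] } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, errors: ["output is not a JSON object"] };
  }

  const obj = parsed as Record<string, unknown>;
  const errors: string[] = [];

  for (const key of schema.required) {
    if (!(key in obj)) errors.push(`missing required field: ${key}`);
  }
  for (const [key, spec] of Object.entries(schema.properties)) {
    if (key in obj && typeof obj[key] !== spec.type) {
      errors.push(`field ${key} should be ${spec.type}`);
    }
  }
  return { ok: errors.length === 0, errors };
}
```

A check of this kind passes only when the generated text parses and matches the expected shape, which is what separates a structured-output failure from a plain generation failure in the results table.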

Model

Enable vision encoder

Compute offload

Diagnostic only. Both controls trigger a model reload on the next run. If the Model options YAML below explicitly sets the same key, the YAML value wins and the matching picker option is disabled.

Model options (YAML; merged into the user-pipeline defaults at load time). Editing these unloads the current model so the next run reloads it. Available parameters. Set any value to model_default to drop that key from the request, falling back to the model's own default. The diagnostic-only key wllama_compat_fallback (default false here) controls CPU-only loads. When it is false, an n_gpu_layers: 0 load attempts the WebGPU main bundle directly (it currently traps with "unreachable", so this is only useful for testing custom wllama builds). Set it to true to route CPU-only loads through the compat bundle, as production does. This key is consumed locally and is not sent to wllama.

Completion options (YAML; passed per-completion, no reload needed). Includes temperature plus any SamplingParams. Set any value to model_default (the default for temperature) to drop that key, so the model's own baked-in default wins. Thinking is disabled by default via chat_template_kwargs: {enable_thinking: false, reasoning: false} so short generations do not get stuck inside the <think> preamble; re-enable by editing this textarea.
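The model_default convention used by both option textareas above can be sketched as a merge followed by a filter. The example works on already-parsed objects (YAML parsing is left out) and assumes a shallow merge in which the textarea overrides the pipeline defaults; `resolveOptions` and `LOCAL_ONLY_KEYS` are illustrative names rather than the page's real code.

```typescript
type Options = Record<string, unknown>;

const MODEL_DEFAULT = "model_default";
// Keys consumed by the page itself and never forwarded to wllama.
const LOCAL_ONLY_KEYS = ["wllama_compat_fallback"];

function resolveOptions(
  defaults: Options,
  overrides: Options,
): { request: Options; local: Options } {
  // Assumption: shallow merge, textarea values override pipeline defaults.
  const merged: Options = { ...defaults, ...overrides };
  const request: Options = {};
  const local: Options = {};

  for (const [key, value] of Object.entries(merged)) {
    // model_default drops the key so the model's own default applies.
    if (value === MODEL_DEFAULT) continue;
    if (LOCAL_ONLY_KEYS.includes(key)) {
      local[key] = value;
    } else {
      request[key] = value;
    }
  }
  return { request, local };
}

// Usage: temperature is dropped, the thinking switches are forwarded as-is.
const resolved = resolveOptions(
  { temperature: 0.2, n_gpu_layers: 0 }, // made-up pipeline defaults
  {
    temperature: "model_default",
    wllama_compat_fallback: true,
    chat_template_kwargs: { enable_thinking: false, reasoning: false },
  },
);
console.log(resolved.request, resolved.local);
```

Dropping the key, rather than sending a placeholder value, is what lets the model's baked-in default win.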

Step	Result	Time	Detail

Live token stream (drag the bottom-right corner to resize; previous step runs stay visible until Clear results):

LLM benchmarking

Sweeps thread count (1, 3, and the value the user pipeline picks on this device) against compute offload (all GPU, all CPU, GPU with the vision encoder forced to CPU). For each combination, the model is reloaded, then a short text generation and a synthetic multimodal OCR are run, each stopped after 10 generated tokens (thinking tokens included). TTFT and tok/s are recorded per task. The final table is sorted by multimodal tok/s, descending, since OCR is the production workload. Uses the Model picker and the Model / Completion options textareas above. Cancel interrupts after the current combination's load.

Combination	Text TTFT	Text tok/s	OCR TTFT	OCR tok/s
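The sweep and its metrics can be sketched as follows. Several details are assumptions made for the example: duplicate thread counts are collapsed, tok/s is computed over the decode phase only (tokens after the first, divided by the time since the first token), and all type and function names are made up.

```typescript
type Offload = "all-gpu" | "all-cpu" | "gpu-vision-cpu";

interface Combination {
  threads: number;
  offload: Offload;
}

interface TaskMetrics {
  ttftMs: number;
  tokPerSec: number;
}

interface BenchRow {
  combo: Combination;
  text: TaskMetrics;
  ocr: TaskMetrics;
}

// Cross thread counts with offload modes; duplicates (e.g. a pipeline value of 3) are collapsed.
function buildCombinations(pipelineThreads: number): Combination[] {
  const threadCounts = Array.from(new Set([1, 3, pipelineThreads]));
  const offloads: Offload[] = ["all-gpu", "all-cpu", "gpu-vision-cpu"];
  const combos: Combination[] = [];
  for (const threads of threadCounts) {
    for (const offload of offloads) combos.push({ threads, offload });
  }
  return combos;
}

// startMs: when the request was issued; tokenTimesMs: arrival time of each generated token.
function measure(startMs: number, tokenTimesMs: number[]): TaskMetrics {
  if (tokenTimesMs.length === 0) return { ttftMs: NaN, tokPerSec: 0 };
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const decodeSeconds = (last - first) / 1000;
  const tokPerSec = decodeSeconds > 0 ? (tokenTimesMs.length - 1) / decodeSeconds : 0;
  return { ttftMs: first - startMs, tokPerSec };
}

// OCR is the production workload, so rank by multimodal throughput, highest first.
function sortByOcrThroughput(rows: BenchRow[]): BenchRow[] {
  return rows.slice().sort((a, b) => b.ocr.tokPerSec - a.ocr.tokPerSec);
}
```

With a pipeline thread count other than 1 or 3 this yields nine combinations, each of which costs a full model reload, which is why Cancel can only take effect between combinations.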