Document processing

This page explains how Better Comply indexes controlled documents for AI retrieval (RAG) and how to choose the right processing mode for your deployment.

Who this is for

Operators deploying Better Comply who need to decide how document indexing should run in their environment.

Why this matters

When an admin uploads or creates a controlled document, the system splits its content into fragments and embeds them as vectors in Postgres (pgvector). Those fragments are what the AI content generator and training assistant retrieve when producing relevant content.

This embedding step is slow and CPU-intensive. It must:

  • Survive a user refreshing the page mid-import.
  • Run correctly on a Cloud Run instance that may be scaled to zero between requests (no continuous CPU billing).
  • Work the same on a self-hosted Docker deployment with no cloud services.

The DOCUMENT_PROCESSING_MODE environment variable selects how this work is triggered.
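As a rough illustration of how a variable like this might be resolved at startup, the sketch below maps an env-like object to one of the three modes described on this page. The helper name `resolveProcessingMode` is hypothetical, and rejecting unknown values is an assumption; only the `inline` default comes from this page.

```typescript
// The three mode values documented on this page.
type ProcessingMode = "inline" | "worker" | "cloud-tasks";

const KNOWN_MODES: ProcessingMode[] = ["inline", "worker", "cloud-tasks"];

// Hypothetical helper: resolve the mode from an env-like object.
// Falling back to "inline" matches the documented default; throwing on
// unknown values is an assumption, not confirmed backend behaviour.
function resolveProcessingMode(env: { [key: string]: string | undefined }): ProcessingMode {
  const raw = (env["DOCUMENT_PROCESSING_MODE"] || "").trim();
  if (raw === "") return "inline";
  for (const mode of KNOWN_MODES) {
    if (mode === raw) return mode;
  }
  throw new Error("Unsupported DOCUMENT_PROCESSING_MODE: " + raw);
}
```

In a Node.js process the argument would typically be `process.env`.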

Converting uploads to Markdown

When an admin chooses to convert an uploaded file into an editable Markdown controlled document, the source binary is handled before indexing:

  • A legacy .doc is normalised to .docx via LibreOffice before extraction. If the LibreOffice binary is not installed, the conversion falls back to AI extraction so the import still succeeds.
  • The original uploaded file is retained as provenance on the resulting version and stays downloadable, even though the indexed body is the converted Markdown.
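The fallback rule for legacy .doc files can be pictured as a small decision function. This is a sketch only: `planExtraction` and the plan names are hypothetical, and the `direct-extract` branch for other file types is an assumption, since this page only describes the .doc path.

```typescript
// Hypothetical plan names, used only for this illustration.
type ExtractionPlan = "libreoffice-then-extract" | "ai-extraction" | "direct-extract";

// Decide how an uploaded file would be turned into Markdown.
// Only the legacy .doc branch is described on this page; treating every
// other file type as "direct-extract" is an assumption.
function planExtraction(fileName: string, libreOfficeAvailable: boolean): ExtractionPlan {
  const isLegacyDoc = /\.doc$/i.test(fileName);
  if (!isLegacyDoc) return "direct-extract";
  // Without the LibreOffice binary the import still succeeds via AI extraction.
  return libreOfficeAvailable ? "libreoffice-then-extract" : "ai-extraction";
}
```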

How the queue works

The document's status column in the database is the queue:

pending → processing → ready
                     ↘ error

  1. When POST /v1/process-document is called, the document status moves to pending (or processing, depending on mode).
  2. The drain picks up pending documents atomically (preventing two workers from processing the same document).
  3. On success, status moves to ready. On failure, it moves to error.
  4. A document stuck in processing for longer than DOC_STALE_PROCESSING_MS (default 5 minutes) is eligible to be reclaimed by the next drain cycle. This handles the case where a Cloud Run instance was reclaimed mid-job.
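The lifecycle above can be sketched as an in-memory model. This is an illustration under stated assumptions, not the backend's implementation: the `Doc` shape, the `processingStartedAt` field, and the function names are hypothetical, and a synchronous loop stands in for what would need to be a single atomic statement in Postgres.

```typescript
type DocStatus = "pending" | "processing" | "ready" | "error";

interface Doc {
  id: string;
  status: DocStatus;
  // Epoch milliseconds when the document entered "processing" (hypothetical field).
  processingStartedAt: number | null;
}

// Claim up to `batchSize` documents: pending ones, plus processing ones
// whose claim is older than `staleMs`. Claimed documents are marked
// "processing" immediately so a second caller cannot pick them up.
function claimBatch(docs: Doc[], batchSize: number, staleMs: number, now: number): Doc[] {
  const claimed: Doc[] = [];
  for (const doc of docs) {
    if (claimed.length >= batchSize) break;
    const stale =
      doc.status === "processing" &&
      doc.processingStartedAt !== null &&
      now - doc.processingStartedAt > staleMs;
    if (doc.status === "pending" || stale) {
      doc.status = "processing";
      doc.processingStartedAt = now;
      claimed.push(doc);
    }
  }
  return claimed;
}

// Record the outcome of one indexing attempt.
function finish(doc: Doc, ok: boolean): void {
  doc.status = ok ? "ready" : "error";
  doc.processingStartedAt = null;
}
```

Calling `claimBatch` twice in a row with the same timestamp returns an empty second batch, which is the property that keeps two workers off the same document.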

Processing modes

inline (default)

DOCUMENT_PROCESSING_MODE=inline

After returning the 202 Accepted response, the backend triggers processing via setImmediate in the same Node.js process.

  • Good for: local development, single always-on container.
  • Risk on Cloud Run: Cloud Run throttles CPU once a response completes, so work scheduled after the 202 may stall, and it is cut short if the instance is reclaimed before indexing finishes. Once the document has been in processing longer than DOC_STALE_PROCESSING_MS, the next admin-triggered call or drain reclaims and retries it.

worker

DOCUMENT_PROCESSING_MODE=worker

POST /v1/process-document only sets the document to pending. A separate HTTP call to POST /v1/process-document-queue drains the queue batch by batch. This endpoint is secret-gated and uses the same RECERTIFICATION_SCAN_SECRET as the other cron endpoints.

Drive the drain with whichever scheduler your host provides:

Cloud Scheduler (GCP):

gcloud scheduler jobs create http better-comply-doc-queue \
--schedule="*/1 * * * *" \
--uri="https://<cloud-run-url>/v1/process-document-queue" \
--http-method=POST \
--headers="x-internal-secret=<secret>,content-type=application/json" \
--time-zone=UTC \
--message-body='{}'
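On hosts without Cloud Scheduler, the same request can be issued by any HTTP client from a system cron job. The sketch below only builds the request description, mirroring the method, path, headers, and body of the gcloud command above; `buildDrainRequest` and `HttpRequestSpec` are hypothetical names.

```typescript
interface HttpRequestSpec {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

// Build the drain request for a given deployment URL and shared secret.
// Trailing slashes on the base URL are trimmed so the path joins cleanly.
function buildDrainRequest(baseUrl: string, secret: string): HttpRequestSpec {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/v1/process-document-queue",
    method: "POST",
    headers: {
      "x-internal-secret": secret,
      "content-type": "application/json",
    },
    body: "{}",
  };
}
```

The resulting object has the shape `fetch(spec.url, spec)` accepts, so a short script run once a minute would behave like the scheduler job.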

In-process timer (self-hosted Docker):

DOCUMENT_PROCESSING_MODE=worker
ENABLE_INPROCESS_DOC_WORKER=true
DOC_WORKER_POLL_MS=30000
DOC_WORKER_BATCH_SIZE=3

With ENABLE_INPROCESS_DOC_WORKER=true, the backend starts an internal timer that calls the drain on the configured interval. This eliminates the need for an external scheduler and works well on a single always-on Docker container.
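One detail any interval-driven drain has to handle is a tick arriving while the previous drain is still running. The sketch below shows one common guard. Whether the backend's timer works this way is not stated on this page, so treat `DrainTicker` and its behaviour as assumptions.

```typescript
// Hypothetical wrapper around a drain function that skips a tick
// while the previous drain is still in flight.
class DrainTicker {
  private running: boolean = false;
  private drain: () => Promise<unknown>;

  constructor(drain: () => Promise<unknown>) {
    this.drain = drain;
  }

  // Called on every timer tick. Returns false when the tick was skipped.
  tryStart(): boolean {
    if (this.running) return false;
    this.running = true;
    const done = () => {
      this.running = false;
    };
    // Clear the flag whether the drain resolves or rejects.
    this.drain().then(done, done);
    return true;
  }
}
```

A timer such as `setInterval(() => ticker.tryStart(), pollMs)` would then call the drain at most once at a time, even when a batch takes longer than DOC_WORKER_POLL_MS.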

Warning

Do not set ENABLE_INPROCESS_DOC_WORKER=true on a multi-instance Cloud Run deployment. Each instance would run its own timer, and the instances would contend on the queue. Use Cloud Tasks or Cloud Scheduler instead.

cloud-tasks

DOCUMENT_PROCESSING_MODE=cloud-tasks
GCP_PROJECT_ID=my-project-123
CLOUD_TASKS_LOCATION=us-central1
CLOUD_TASKS_QUEUE=doc-index
CLOUD_TASKS_TARGET_URL=https://<cloud-run-url>

POST /v1/process-document enqueues a Cloud Task that calls POST /v1/process-document-task per document. Cloud Tasks retries the delivery if the instance is unavailable, making this the most durable option for GCP deployments.

The Cloud Tasks API is called using the Cloud Run instance's metadata-server token, so no SDK or extra credentials are needed.

One-time setup:

gcloud tasks queues create doc-index --location=us-central1
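To make the SDK-free enqueue concrete, the sketch below builds a Cloud Tasks REST `tasks.create` request from the four settings above. It is an illustration, not the backend's code: `buildCreateTaskRequest` is a hypothetical name, the `{ documentId }` payload is an assumed body shape, and any headers the callback endpoint requires are not described on this page, so none are added. The access token would come from the metadata server (`http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token`, requested with the `Metadata-Flavor: Google` header).

```typescript
interface CloudTasksConfig {
  projectId: string; // GCP_PROJECT_ID
  location: string;  // CLOUD_TASKS_LOCATION
  queue: string;     // CLOUD_TASKS_QUEUE
  targetUrl: string; // CLOUD_TASKS_TARGET_URL
}

// Build the REST call that creates one task for one document.
function buildCreateTaskRequest(cfg: CloudTasksConfig, documentId: string, accessToken: string) {
  const parent =
    "projects/" + cfg.projectId + "/locations/" + cfg.location + "/queues/" + cfg.queue;
  // Assumed payload shape; the real task body is not documented here.
  const payload = JSON.stringify({ documentId: documentId });
  return {
    url: "https://cloudtasks.googleapis.com/v2/" + parent + "/tasks",
    method: "POST",
    headers: {
      authorization: "Bearer " + accessToken,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      task: {
        httpRequest: {
          url: cfg.targetUrl.replace(/\/+$/, "") + "/v1/process-document-task",
          httpMethod: "POST",
          headers: { "content-type": "application/json" },
          // The REST API expects the task body as a base64 string.
          body: btoa(payload),
        },
      },
    }),
  };
}
```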

Mode comparison

| Property | inline | worker + scheduler | cloud-tasks |
|---|---|---|---|
| Survives instance reclaim | No (stale reclaim on next drain) | Yes (next scheduler tick) | Yes (Cloud Tasks retry) |
| External dependencies | None | Cloud Scheduler or system cron | Cloud Tasks queue (GCP) |
| Multi-instance safe | Yes (best-effort per instance) | Yes (atomic DB claim) | Yes |
| Self-hosted / non-cloud | Yes | Yes (system cron or in-process timer) | No |

Stuck jobs

A document stuck in processing past DOC_STALE_PROCESSING_MS (default 5 minutes) is flagged in the admin UI with an "Indexing stalled" badge and a one-click "Reprocess" button. Clicking Reprocess calls POST /v1/process-document again, which resets the document to pending and triggers a fresh indexing cycle.

Operators can tune the stale threshold via DOC_STALE_PROCESSING_MS, given in milliseconds (minimum 60000, i.e. 1 minute; maximum 3600000, i.e. 1 hour).
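The bounds can be illustrated with a small resolver. Whether the backend clamps an out-of-range value or rejects it is not stated here, so the clamping below is an assumption, and `resolveStaleThreshold` is a hypothetical name.

```typescript
const STALE_DEFAULT_MS = 300000; // 5 minutes, the documented default
const STALE_MIN_MS = 60000;      // documented minimum
const STALE_MAX_MS = 3600000;    // documented maximum

// Resolve the raw env value to a usable threshold.
// Missing or non-numeric input falls back to the default;
// out-of-range input is clamped (an assumption, see above).
function resolveStaleThreshold(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return STALE_DEFAULT_MS;
  const parsed = Number(raw);
  if (!isFinite(parsed)) return STALE_DEFAULT_MS;
  return Math.min(STALE_MAX_MS, Math.max(STALE_MIN_MS, Math.floor(parsed)));
}
```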

Tuning batch size and poll interval

When using DOCUMENT_PROCESSING_MODE=worker:

  • DOC_WORKER_BATCH_SIZE (default 3): how many documents to process per drain call. Increase it for faster indexing at the cost of higher peak CPU.
  • DOC_WORKER_POLL_MS (default 30000): only relevant when ENABLE_INPROCESS_DOC_WORKER=true. Lower values mean faster pickup; values below 5000 (5 seconds) are not recommended.
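The two settings together put a ceiling on indexing throughput. The helper below is a back-of-the-envelope estimate that assumes every drain call finishes within one poll interval; `maxDocsPerHour` is a hypothetical name, not part of the product.

```typescript
// Upper bound on documents indexed per hour for the in-process timer,
// assuming each drain completes before the next tick.
function maxDocsPerHour(batchSize: number, pollMs: number): number {
  const drainsPerHour = 3600000 / pollMs;
  return Math.floor(drainsPerHour * batchSize);
}

const withDefaults = maxDocsPerHour(3, 30000); // 120 drains x 3 documents = 360
```

With the defaults the ceiling is 360 documents per hour. If a bulk import needs more, raising DOC_WORKER_BATCH_SIZE increases it linearly, subject to the CPU trade-off noted above.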