Document processing
This page explains how Better Comply indexes controlled documents for AI retrieval (RAG) and how to choose the right processing mode for your deployment.
It is aimed at operators deploying Better Comply who need to decide how document indexing should run in their environment.
Why this matters
When an admin uploads or creates a controlled document, the system splits its content into fragments and embeds them as vectors in Postgres (pgvector). Those fragments are what the AI content generator and training assistant retrieve when producing relevant content.
This embedding step is slow and CPU-intensive. It must:
- Survive a user refreshing the page mid-import.
- Run correctly on a Cloud Run instance that may be scaled to zero between requests (no continuous CPU billing).
- Work the same on a self-hosted Docker deployment with no cloud services.
The DOCUMENT_PROCESSING_MODE environment variable selects how this work is triggered.
Converting uploads to Markdown
When an admin chooses to convert an uploaded file into an editable Markdown controlled document, the source binary is handled before indexing:
- A legacy .doc is normalised to .docx via LibreOffice before extraction. If the LibreOffice binary is not installed, the conversion falls back to AI extraction so the import still succeeds.
- The original uploaded file is retained as provenance on the resulting version and stays downloadable, even though the indexed body is the converted Markdown.
How the queue works
The document's status column in the database is the queue:
pending → processing → ready
                     ↘ error
- When POST /v1/process-document is called, the document status moves to pending (or processing, depending on mode).
- The drain picks up pending documents atomically (preventing two workers from processing the same document).
- On success, status moves to ready. On failure, it moves to error.
- A document stuck in processing for longer than DOC_STALE_PROCESSING_MS (default 5 minutes) is eligible to be reclaimed by the next drain cycle. This handles the case where a Cloud Run instance was reclaimed mid-job.
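The selection rules above can be illustrated with a small in-memory model. This is only a sketch: the real drain presumably performs the claim as one atomic database statement, and the processingStartedAt field and the claimBatch and finish helpers are hypothetical names chosen for this example.

```typescript
// In-memory model of the status-column queue. It illustrates which documents
// a drain may claim; it is not the actual database implementation.
type Status = "pending" | "processing" | "ready" | "error";

interface Doc {
  id: string;
  status: Status;
  // Hypothetical field: when the document last entered "processing" (epoch ms).
  processingStartedAt?: number;
}

const DEFAULT_STALE_MS = 5 * 60 * 1000; // documented default of 5 minutes

function claimBatch(
  docs: Doc[],
  batchSize: number,
  now: number,
  staleMs: number = DEFAULT_STALE_MS,
): Doc[] {
  // Claimable: anything pending, plus anything stuck in processing too long.
  const claimable = docs.filter(
    (d) =>
      d.status === "pending" ||
      (d.status === "processing" &&
        d.processingStartedAt !== undefined &&
        now - d.processingStartedAt > staleMs),
  );
  const claimed = claimable.slice(0, batchSize);
  for (const d of claimed) {
    d.status = "processing";
    d.processingStartedAt = now;
  }
  return claimed;
}

// Terminal transition: ready on success, error on failure.
function finish(doc: Doc, ok: boolean): void {
  doc.status = ok ? "ready" : "error";
}
```

Because claimed documents are marked processing with a fresh timestamp, a second drain running immediately afterwards finds nothing to claim, which is the property the atomic claim is meant to guarantee.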
Processing modes
inline (default)
DOCUMENT_PROCESSING_MODE=inline
After returning the 202 Accepted response, the backend triggers processing via setImmediate in the same Node.js process.
- Good for: local development, single always-on container.
- Risk on Cloud Run: Cloud Run throttles CPU after a response completes. The indexing job may be cut short if the instance is reclaimed before the work finishes. The next admin-triggered call or drain will reclaim and retry it.
worker
DOCUMENT_PROCESSING_MODE=worker
POST /v1/process-document only sets the document to pending. A separate HTTP call to POST /v1/process-document-queue drains the queue batch by batch. This endpoint is secret-gated (same RECERTIFICATION_SCAN_SECRET as the other crons).
Drive the drain with whichever scheduler your host provides:
Cloud Scheduler (GCP):
gcloud scheduler jobs create http better-comply-doc-queue \
--schedule="*/1 * * * *" \
--uri="https://<cloud-run-url>/v1/process-document-queue" \
--http-method=POST \
--headers="x-internal-secret=<secret>,content-type=application/json" \
--time-zone=UTC \
--message-body='{}'
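On hosts without Cloud Scheduler, the same request can be issued from any script run by system cron. The sketch below only assembles the request that the gcloud command above describes; buildDrainRequest and DrainRequest are hypothetical names, and the header name is taken from that command.

```typescript
// Shape of the HTTP call that drains the queue.
interface DrainRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildDrainRequest(baseUrl: string, secret: string): DrainRequest {
  // Strip trailing slashes so the path is joined with exactly one "/".
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/v1/process-document-queue`,
    method: "POST",
    headers: {
      "x-internal-secret": secret,
      "content-type": "application/json",
    },
    body: "{}", // same empty JSON body as the Cloud Scheduler job
  };
}
```

On Node 18 or later the result can be passed to the global fetch, for example fetch(req.url, { method: req.method, headers: req.headers, body: req.body }).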
In-process timer (self-hosted Docker):
DOCUMENT_PROCESSING_MODE=worker
ENABLE_INPROCESS_DOC_WORKER=true
DOC_WORKER_POLL_MS=30000
DOC_WORKER_BATCH_SIZE=3
With ENABLE_INPROCESS_DOC_WORKER=true, the backend starts an internal timer that calls the drain on the configured interval. This eliminates the need for an external scheduler and works well on a single always-on Docker container.
Do not set ENABLE_INPROCESS_DOC_WORKER=true on a multi-instance Cloud Run deployment. Each instance would run its own timer and contend on the queue. Use Cloud Tasks or Cloud Scheduler instead.
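A minimal sketch of what an in-process timer of this kind might look like is shown below. It is not the backend's actual code: makeTick, startWorker, and the Drain type are hypothetical, and the overlap guard (skipping a tick while the previous drain is still running) is an assumption about sensible behaviour rather than something this page documents.

```typescript
// A drain function processes up to batchSize documents and resolves when done.
type Drain = (batchSize: number) => Promise<void>;

function makeTick(drain: Drain, batchSize: number): () => Promise<boolean> {
  let running = false;
  return async function tick(): Promise<boolean> {
    if (running) return false; // previous drain still in flight: skip this tick
    running = true;
    try {
      await drain(batchSize);
      return true;
    } finally {
      running = false;
    }
  };
}

// Starts the timer and returns a function that stops it.
function startWorker(drain: Drain, pollMs: number, batchSize: number): () => void {
  const tick = makeTick(drain, batchSize);
  const handle = setInterval(() => {
    tick().catch((err) => console.error("drain failed", err));
  }, pollMs);
  return () => clearInterval(handle);
}
```

The guard only prevents overlap inside one process. It does nothing for several instances each running their own timer, which is why the warning above still applies.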
cloud-tasks
DOCUMENT_PROCESSING_MODE=cloud-tasks
GCP_PROJECT_ID=my-project-123
CLOUD_TASKS_LOCATION=us-central1
CLOUD_TASKS_QUEUE=doc-index
CLOUD_TASKS_TARGET_URL=https://<cloud-run-url>
POST /v1/process-document enqueues a Cloud Task that calls POST /v1/process-document-task per document. Cloud Tasks retries the delivery if the instance is unavailable, making this the most durable option for GCP deployments.
The Cloud Tasks API is called using the Cloud Run instance's metadata-server token - no SDK or extra credentials needed.
One-time setup:
gcloud tasks queues create doc-index --location=us-central1
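To show how the four cloud-tasks variables fit together, the sketch below assembles a Cloud Tasks REST create-task call from them. It is illustrative only: buildCreateTaskCall and CloudTasksConfig are hypothetical names, the per-document payload is left out because its shape is not documented here, and any authentication the target endpoint requires is not shown.

```typescript
interface CloudTasksConfig {
  projectId: string; // GCP_PROJECT_ID
  location: string;  // CLOUD_TASKS_LOCATION
  queue: string;     // CLOUD_TASKS_QUEUE
  targetUrl: string; // CLOUD_TASKS_TARGET_URL
}

function buildCreateTaskCall(cfg: CloudTasksConfig) {
  // Fully qualified queue name used by the Cloud Tasks v2 REST API.
  const parent = `projects/${cfg.projectId}/locations/${cfg.location}/queues/${cfg.queue}`;
  const target = cfg.targetUrl.replace(/\/+$/, "");
  return {
    url: `https://cloudtasks.googleapis.com/v2/${parent}/tasks`,
    body: {
      task: {
        httpRequest: {
          httpMethod: "POST",
          url: `${target}/v1/process-document-task`,
          headers: { "content-type": "application/json" },
          // Payload omitted: the REST API expects it base64-encoded in `body`.
        },
      },
    },
  };
}

// On Cloud Run, an access token for the call can be read from the metadata server.
const METADATA_TOKEN_URL =
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token";
const METADATA_HEADERS = { "Metadata-Flavor": "Google" };
```

The token returned by the metadata server would be sent as a Bearer token in the Authorization header of the create-task request.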
Mode comparison
| | inline | worker + scheduler | cloud-tasks |
|---|---|---|---|
| Survives instance reclaim | No (stale reclaim on next drain) | Yes (next scheduler tick) | Yes (Cloud Tasks retry) |
| External dependencies | None | Cloud Scheduler or system cron | Cloud Tasks queue (GCP) |
| Multi-instance safe | Yes (best-effort per instance) | Yes (atomic DB claim) | Yes |
| Self-hosted / non-cloud | Yes | Yes (system cron or in-process timer) | No |
Stuck jobs
A document stuck in processing past DOC_STALE_PROCESSING_MS (default 5 minutes) is visible in the admin UI as an "Indexing stalled" badge with a one-click "Reprocess" button. Clicking Reprocess calls POST /v1/process-document again, which resets the document to pending and triggers a fresh indexing cycle.
Operators can tune the stale threshold via DOC_STALE_PROCESSING_MS (in milliseconds; minimum 60 000, i.e. 1 minute; maximum 3 600 000, i.e. 1 hour).
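The documented default and bounds can be expressed as a small resolver. This is a sketch under one loud assumption: that out-of-range values are clamped to the nearest bound and unparseable values fall back to the default. The backend may instead reject such values; resolveStaleMs is a hypothetical helper.

```typescript
const STALE_MIN_MS = 60_000;      // documented minimum (1 minute)
const STALE_MAX_MS = 3_600_000;   // documented maximum (1 hour)
const STALE_DEFAULT_MS = 300_000; // documented default (5 minutes)

function resolveStaleMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return STALE_DEFAULT_MS;
  const n = Number(raw);
  if (!Number.isFinite(n)) return STALE_DEFAULT_MS;
  // Assumption: clamp into the documented range rather than fail startup.
  return Math.min(STALE_MAX_MS, Math.max(STALE_MIN_MS, Math.floor(n)));
}
```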
Tuning batch size and poll interval
When using DOCUMENT_PROCESSING_MODE=worker:
- DOC_WORKER_BATCH_SIZE (default 3) - how many documents to process per drain call. Increase for faster indexing at the cost of higher peak CPU.
- DOC_WORKER_POLL_MS (default 30000) - only relevant when ENABLE_INPROCESS_DOC_WORKER=true. Lower values mean faster pickup; values below 5 seconds are not recommended.
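As a rough guide to how these two settings interact, the sketch below reads them with their documented defaults and computes an upper bound on pickup rate for the in-process timer. The helper names are hypothetical, and the bound assumes each tick claims one full batch and finishes before the next tick.

```typescript
type Env = Record<string, string | undefined>;

interface WorkerConfig {
  batchSize: number;
  pollMs: number;
  inProcess: boolean;
}

// Accepts only positive integers; anything else falls back to the default.
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && Number.isInteger(n) && n > 0 ? n : fallback;
}

function readWorkerConfig(env: Env): WorkerConfig {
  return {
    batchSize: positiveInt(env["DOC_WORKER_BATCH_SIZE"], 3),
    pollMs: positiveInt(env["DOC_WORKER_POLL_MS"], 30000),
    inProcess: env["ENABLE_INPROCESS_DOC_WORKER"] === "true",
  };
}

// Upper bound on documents picked up per minute by the in-process timer.
function maxDocsPerMinute(cfg: WorkerConfig): number {
  return (cfg.batchSize * 60000) / cfg.pollMs;
}
```

With the defaults this gives at most 3 × 60000 / 30000 = 6 documents per minute, so a large import benefits more from a bigger batch or a shorter interval than from either default.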
Related
- Environment variables - full reference for DOCUMENT_PROCESSING_MODE and related variables
- Scheduled jobs - how to set up the process-document-queue cron