Bulk OCR thousands of PDFs with curl and the XRPpdf API

A step-by-step guide to batch-processing thousands of scanned PDFs into searchable documents using curl, bash, and the XRPpdf API — with webhooks, error handling, and rate-limit awareness.

You have a folder full of scanned PDFs. Maybe hundreds. Maybe thousands. You need every one of them searchable. Here's how to do it with curl, a few lines of bash, and the XRPpdf API.

Prerequisites

  1. An XRP wallet linked at xrppdf.com
  2. Page credits funded (Scale tier: 30 XRP → 10,000 pages at $0.0043/page)
  3. An API key from your dashboard
  4. curl and jq installed

Step 1: Submit all files

#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob   # an empty input directory yields zero loop iterations

API_KEY="xrpocr_live_YOUR_KEY_HERE"
INPUT_DIR="./scanned-pdfs"
LOG_FILE="./ocr-jobs.log"

# Clear previous log
> "$LOG_FILE"

for pdf in "$INPUT_DIR"/*.pdf; do
  filename=$(basename "$pdf")

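  # "|| true" keeps one failed upload from aborting the whole batch under set -e
  response=$(curl -s -X POST https://xrppdf.com/api/v1/ocr \
    -H "Authorization: Bearer $API_KEY" \
    -F "file=@$pdf" || true)

  # A non-JSON response (for example an HTML error page) leaves job_id empty
  job_id=$(echo "$response" | jq -r '.job_id // empty' 2>/dev/null || true)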

  if [[ -n "$job_id" ]]; then
    echo "$filename $job_id" >> "$LOG_FILE"
    echo "✓ Submitted: $filename → $job_id"
  else
    error=$(echo "$response" | jq -r '.error // "unknown error"' 2>/dev/null || echo "unreadable response")
    echo "✗ Failed: $filename — $error" >&2
  fi
done

echo "Done. $(wc -l < "$LOG_FILE") jobs submitted."

This loops through every PDF in ./scanned-pdfs/, submits each one, and logs the filename-to-job-ID mapping.

Step 2: Poll for results and download

#!/usr/bin/env bash
set -euo pipefail

API_KEY="xrpocr_live_YOUR_KEY_HERE"
LOG_FILE="./ocr-jobs.log"
OUTPUT_DIR="./searchable-pdfs"
mkdir -p "$OUTPUT_DIR"

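while IFS= read -r line; do
  [[ -z "$line" ]] && continue

  # Log lines are "filename job_id"; split on the last space so
  # filenames that contain spaces stay intact
  filename=${line% *}
  job_id=${line##* }
  echo -n "Checking $filename ($job_id)... "

  # Poll until done (60 attempts × 5 s = max 5 minutes per job)
  finished=false
  for attempt in $(seq 1 60); do
    status_json=$(curl -s \
      -H "Authorization: Bearer $API_KEY" \
      "https://xrppdf.com/api/v1/jobs/$job_id" || true)

    # Treat an unreadable response as "not ready yet" and keep polling
    status=$(echo "$status_json" | jq -r '.status' 2>/dev/null || true)

    if [[ "$status" == "complete" ]]; then
      # -f makes curl fail on an HTTP error instead of saving the error body as a PDF
      if curl -sf -o "$OUTPUT_DIR/$filename" \
        -H "Authorization: Bearer $API_KEY" \
        "https://xrppdf.com/download/$job_id"; then
        echo "✓ Downloaded"
      else
        echo "✗ Download failed" >&2
      fi
      finished=true
      break
    elif [[ "$status" == "error" ]]; then
      echo "✗ Error: $(echo "$status_json" | jq -r '.error')" >&2
      finished=true
      break
    else
      sleep 5
    fi
  done

  if [[ "$finished" == false ]]; then
    echo "✗ Timed out after 5 minutes" >&2
  fi
done < "$LOG_FILE"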

echo "Done. Results in $OUTPUT_DIR/"

Better approach: use webhooks

Polling works, but webhooks are cleaner — especially for large batches. Instead of looping and sleeping, let XRPpdf call you when each job finishes.

1. Register a webhook

curl -X POST https://xrppdf.com/api/webhooks \
  -H "Authorization: Bearer $SESSION_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-server.com/ocr-callback"}'

You'll get back a secret (shown once). Save it.

2. Receive callbacks

Every completed job sends a POST to your URL:

{
  "event": "job.complete",
  "job_id": "abc123",
  "status": "complete",
  "pages": 12,
  "processing_seconds": 8.4
}

Headers include an HMAC signature for verification:

X-XRPOCR-Signature: sha256=<hex>
X-XRPOCR-Timestamp: 1713456789
X-XRPOCR-Job-Id: abc123

3. Verify the signature

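import hashlib
import hmac

def verify_webhook(secret: str, timestamp: str, body: bytes,
                   signature: str) -> bool:
    # Signed payload is "<timestamp>." followed by the raw request body;
    # pass the bytes exactly as received, before any JSON parsing
    expected = hmac.new(
        secret.encode(),
        f"{timestamp}.".encode() + body,
        hashlib.sha256
    ).hexdigest()
    # compare_digest runs in constant time, which avoids timing leaks
    return hmac.compare_digest(f"sha256={expected}", signature)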
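
A valid signature proves who sent the request, but not when. A captured callback could be replayed later unless the timestamp is also checked. The sketch below is one way to do that, assuming X-XRPOCR-Timestamp is in Unix seconds (as the sample value suggests). The helper name is_fresh and the five-minute tolerance are illustrative choices, not values documented by XRPpdf.

```python
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # assumed tolerance; not an XRPpdf-documented value


def is_fresh(timestamp: str, now: Optional[float] = None,
             max_skew: int = MAX_SKEW_SECONDS) -> bool:
    """Return True if the header timestamp is within max_skew seconds of now."""
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        return False  # missing or malformed header: treat as stale
    current = time.time() if now is None else now
    return abs(current - sent_at) <= max_skew
```

A receiver would typically reject the request unless both verify_webhook and is_fresh return True.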

4. Download on callback

import os
import requests

def handle_callback(job_id: str, api_key: str, output_dir: str):
    r = requests.get(
        f"https://xrppdf.com/download/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=60,
    )
    # Fail loudly on 4xx/5xx rather than writing an error body to disk
    r.raise_for_status()
    with open(os.path.join(output_dir, f"{job_id}.pdf"), "wb") as f:
        f.write(r.content)
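
The handler above names each result after its job ID. To restore the original filenames, the log written in Step 1 can be loaded into a lookup table. This is a minimal sketch, assuming the log keeps the "filename job_id" layout from the submit script and that job IDs never contain spaces. The function name load_job_map is hypothetical.

```python
from pathlib import Path


def load_job_map(log_path: str) -> dict:
    """Map job_id -> original filename from lines of the form 'filename job_id'."""
    job_map = {}
    for raw in Path(log_path).read_text(encoding="utf-8").splitlines():
        line = raw.rstrip()
        if not line:
            continue
        # Split on the LAST space so filenames containing spaces survive.
        filename, _, job_id = line.rpartition(" ")
        if filename and job_id:
            job_map[job_id] = filename
    return job_map
```

Inside the callback, job_map.get(job_id, f"{job_id}.pdf") then gives the output name, falling back to the job ID for jobs that are not in the log.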

Rate limits and concurrency

Tier             Concurrent jobs   Notes
Default          2                 Good for casual use
Pro (8 XRP)      Higher limit      Contact for adjustment
Scale (30 XRP)   Up to 50          Built for batch workflows

If you hit the concurrency limit, the API returns HTTP 429 with {"error": "...", "in_flight": 2, "limit": 2}. Back off and retry.
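
"Back off and retry" could be sketched as exponential backoff with a cap. In this sketch, submit is a hypothetical callable that performs one upload and returns (status_code, parsed_body). The attempt count and delays are arbitrary starting points, not values recommended by XRPpdf.

```python
import time
from typing import Callable, Tuple


def submit_with_backoff(submit: Callable[[], Tuple[int, dict]],
                        max_attempts: int = 6,
                        base_delay: float = 2.0,
                        max_delay: float = 60.0,
                        sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call submit() until it stops returning HTTP 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = submit()
        if status != 429:
            return body
        if attempt < max_attempts - 1:
            # Delay doubles each time (2 s, 4 s, 8 s, ...), capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

Passing sleep as a parameter keeps the function easy to test without real waiting.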

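A simple throttle for the submit script. It needs bash 4.3 or later for wait -n, and it assumes submit_job is the curl-and-log body of the Step 1 loop wrapped in a function. It caps parallel uploads only; jobs keep running on the server after the upload returns, so a 429 is still possible and should still be handled.

MAX_CONCURRENT=10
active=0

for pdf in "$INPUT_DIR"/*.pdf; do
  submit_job "$pdf" &
  # Plain assignment: ((active++)) returns non-zero when active is 0,
  # which aborts the script under set -e
  active=$((active + 1))

  if ((active >= MAX_CONCURRENT)); then
    wait -n || true   # a failed upload should not abort the batch
    active=$((active - 1))
  fi
done
wait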

Idempotency keys

If your network is unreliable, add an idempotency key to prevent double-processing:

curl -X POST https://xrppdf.com/api/v1/ocr \
  -H "Authorization: Bearer $API_KEY" \
  -H "Idempotency-Key: batch-2026-04-18-invoice-0042" \
  -F "file=@$pdf"

Reusing the same key within 24 hours replays the original response, so a retried upload is not charged twice.
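
For a large batch, the key has to stay the same across retries of one file and differ between files. One way to get that is to derive it from a batch label plus a hash of the filename and file contents. This is a sketch only: the function name and key layout are illustrative, and any length or character limits the API places on keys are not covered here.

```python
import hashlib


def idempotency_key(batch: str, filename: str, content: bytes) -> str:
    """Build a deterministic key: the same batch and file always give the same key."""
    h = hashlib.sha256()
    h.update(filename.encode("utf-8"))
    h.update(b"\0")  # separator so name and content cannot run together
    h.update(content)
    # Hex digest keeps the key ASCII-only even for unusual filenames.
    return f"{batch}-{h.hexdigest()[:32]}"
```

The result goes in the Idempotency-Key header in place of a hand-written string.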

Cost at scale

USD figures use the live XRP/RLUSD feed at the time of writing, about $1.42 per XRP. Your actual cost moves with the exchange rate.

Pages    Tier         XRP cost   Approx. USD
100      100-bundle   2 XRP      $2.83
1,000    Pro          8 XRP      $11.33
10,000   Scale        30 XRP     $42.50
50,000   5× Scale     150 XRP    $212.50

Credits never expire. Buy once, use over weeks or months.
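
The USD column is the XRP cost multiplied by the exchange rate, so it can be recomputed for any batch size and rate. The sketch below takes bundle sizes and prices from the table above and assumes you buy whole bundles of a single tier. The names BUNDLES and batch_cost are hypothetical.

```python
import math

# (pages per bundle, XRP per bundle), taken from the pricing table
BUNDLES = {
    "100-bundle": (100, 2),
    "Pro": (1_000, 8),
    "Scale": (10_000, 30),
}


def batch_cost(pages: int, tier: str, usd_per_xrp: float) -> tuple:
    """Return (XRP, approx USD) to cover `pages` using whole bundles of one tier."""
    bundle_pages, bundle_xrp = BUNDLES[tier]
    bundles_needed = math.ceil(pages / bundle_pages)
    xrp = bundles_needed * bundle_xrp
    return xrp, round(xrp * usd_per_xrp, 2)
```

For example, batch_cost(50_000, "Scale", rate) prices five Scale bundles at whatever rate you pass in.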

Full API docs

Everything above — endpoints, auth, webhooks, idempotency, error codes — is documented at xrppdf.com/docs.


Ready to batch-process? Get an API key: link a wallet, fund credits, start calling.