Minimum Hardening Baseline
- Terminate TLS in front of the API.
- Run the service as a non-root user.
- Restrict inbound access to required ports only.
- Isolate renderer sidecars from unnecessary network paths.
That baseline is the starting point, not the finish line. A self-hosted scraper talks to untrusted public pages and can sit close to valuable internal systems, so it deserves the same discipline as any other internet-facing API.
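The baseline above can be sketched as a deployment fragment. This is an illustrative docker-compose sketch only: the service name, image tag, UID, and port are assumptions, not the product's actual defaults.

```yaml
# Illustrative only: service name, image, UID/GID, and port are assumptions.
services:
  api:
    image: scraper-api:latest
    user: "10001:10001"        # run as a non-root user
    ports:
      - "127.0.0.1:3002:3002"  # bind to loopback; TLS terminates at the proxy in front
```

Binding to loopback keeps the API reachable only through the TLS-terminating proxy, which covers the first and third baseline items in one place.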
Network and Access Control
- Put a reverse proxy or gateway in front of the service.
- Restrict who can reach the API by network, identity, or both.
- Avoid exposing internal health or admin surfaces to the public internet.
- If browser rendering is enabled, isolate the renderer from internal systems it does not need to reach.
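One way to realize these rules is at the reverse proxy. The nginx sketch below is a hedged example, not a recommended production config: the upstream address, paths, and CIDR range are all assumptions you would replace with your own.

```nginx
# Illustrative sketch; upstream address, paths, and CIDR are assumptions.
upstream scraper_api { server 127.0.0.1:3002; }

server {
    listen 443 ssl;
    # ... ssl_certificate / ssl_certificate_key directives here ...

    location /health { allow 10.0.0.0/8; deny all;          # internal monitoring only
                       proxy_pass http://scraper_api; }
    location /admin  { deny all; }                          # never public
    location /       { proxy_pass http://scraper_api; }     # the one public surface
}
```

Identity-based restriction (mTLS or an auth gateway) layers on top of this; network allow-lists alone are the floor, not the ceiling.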
Runtime Isolation
Treat page fetching and browser rendering as higher-risk components than your application logic.
- run them with the least privilege possible,
- keep filesystem access narrow,
- and isolate sidecars so a renderer problem does not automatically become a broader platform problem.
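Those three points map naturally onto container settings. A minimal sketch, assuming a Docker Compose deployment with a renderer sidecar (the image and network names are hypothetical):

```yaml
# Illustrative sketch: drop capabilities, narrow the filesystem, and put the
# renderer on its own network so it cannot reach internal services.
services:
  renderer:
    image: renderer-sidecar:latest   # image name is an assumption
    cap_drop: [ALL]                  # least privilege
    read_only: true                  # narrow filesystem access
    tmpfs: [/tmp]                    # browsers still need scratch space
    networks: [scrape-net]           # egress-only network, no internal routes
networks:
  scrape-net:
    driver: bridge
```

With the renderer on a dedicated network, a compromised page can at worst reach the internet it was already fetching from, not your databases or internal APIs.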
Secrets and Keys
- Keep API keys, proxy credentials, and LLM keys out of image builds.
- Inject secrets at runtime through your platform's secret store.
- Rotate keys during environment changes or incident response, not only on a fixed calendar.
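A fail-fast startup check makes runtime injection enforceable: if the platform's secret store did not provide a key, the service refuses to boot rather than limping along. The variable names below are illustrative, not the product's actual configuration keys.

```python
import os

# Illustrative names; substitute the variables your deployment actually uses.
REQUIRED_SECRETS = ["SCRAPER_API_KEY", "PROXY_CREDENTIALS", "LLM_API_KEY"]


def load_secrets(env=os.environ):
    """Read secrets injected at runtime; fail fast if any are missing."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing secrets: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SECRETS}
```

Because the values come only from the process environment, nothing ends up baked into the image and rotation is a redeploy, not a rebuild.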
Operational Guidance
- Rotate API keys during deployment cutovers.
- Keep browser-rendering dependencies on the smallest possible surface area.
- Expose /health only where your load balancer or monitoring needs it.
- Review warning-heavy targets separately; they often indicate anti-bot defenses rather than renderer bugs.
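Reviewing warning-heavy targets is easier if warnings are grouped by domain rather than read as a flat stream; a site that repeatedly trips warnings usually has anti-bot defenses. A small sketch of that triage (the threshold is an arbitrary example):

```python
from collections import Counter
from urllib.parse import urlparse


def warning_hotspots(warning_urls, threshold=3):
    """Group warnings by target domain so anti-bot-heavy sites stand out.

    Returns {domain: warning_count} for domains at or above the threshold.
    """
    counts = Counter(urlparse(url).netloc for url in warning_urls)
    return {domain: n for domain, n in counts.items() if n >= threshold}
```

Domains that surface here deserve a separate look before anyone files a renderer bug.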
Monitoring and Auditability
At minimum, watch:
- API error rate,
- warning frequency,
- crawl job duration,
- renderer availability,
- and resource spikes on the browser sidecar.
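The watch list above reduces to a threshold check over a handful of metrics. A minimal sketch, where the metric names and limits are illustrative placeholders for whatever your monitoring stack exports:

```python
def check_health(metrics, limits):
    """Return an alert string for each watched metric that exceeds its limit.

    metrics and limits are plain dicts, e.g. {"error_rate": 0.12};
    metrics without a configured limit are ignored.
    """
    return [
        f"{name}={value} exceeds limit {limits[name]}"
        for name, value in metrics.items()
        if name in limits and value > limits[name]
    ]
```

The same shape covers error rate, warning frequency, job duration, renderer availability, and sidecar memory; only the limit values differ.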
Keep enough logs to answer three questions after an incident:
- what URL or workload triggered the issue,
- whether it was an engine problem or a target-site problem,
- and what data, if any, was still returned.
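One structured log line per failed job is enough to answer all three questions. A sketch, assuming JSON logs (the field names are illustrative):

```python
import json
from datetime import datetime, timezone


def incident_record(url, error_class, partial_data_returned):
    """Emit one JSON log line answering the three post-incident questions."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "url": url,                                      # what triggered the issue
        "error_class": error_class,                      # e.g. "engine" vs "target_site"
        "partial_data_returned": partial_data_returned,  # was anything still returned?
    })
```

Keeping the error class explicit at log time is what lets you separate engine problems from target-site problems later without re-running the job.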