Purity

Discord content moderation platform with real-time ML-powered NSFW detection. Built the system and led a team across four microservices, GPU inference, and Terraform infrastructure.

TypeScript · NestJS · FastAPI · Discord.js · ONNX · Terraform · Docker
January 15, 2025 · 4 min read

Discord servers get flooded with bad content. Images, videos, links -- people push boundaries constantly and manual moderation falls apart at any real scale. Most moderation bots are either too slow (API-based classification with round-trip latency) or too blunt (one model, one threshold, no nuance). I built a system that runs five ML models on GPU, processes content in real time, and lets server admins configure exactly how aggressive they want it to be. Led the project end-to-end, managed a team of 5+ developers, and owned everything from Terraform configs to CUDA inference.

Architecture split

The system is four microservices in a monorepo, split into two deployment groups. The control plane (React dashboard + NestJS API) handles auth, settings, and metrics. The hot path (Discord bot + FastAPI ML service + Redis) handles the actual scanning. These deploy independently. You can ship a dashboard change without touching the bot, and the hot path runs on a dedicated VPS optimized for GPU workloads.

[Architecture diagram: control plane and hot path deployment groups]

GPU inference pipeline

Five ONNX models: NudeNet, a ViT-based hentai classifier, two YOLO models (weapons, drugs), and Falcon NSFW. They all run through ONNX Runtime with CUDA, graph optimizations enabled, and CUDA graphs for reduced kernel launch overhead.
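
As a rough sketch of what that per-model setup might look like (the path is a placeholder, and note that CUDA graph capture in ONNX Runtime normally also requires I/O binding, omitted here for brevity):

```python
import onnxruntime as ort

def load_session(model_path: str) -> ort.InferenceSession:
    opts = ort.SessionOptions()
    # Full graph optimizations: constant folding, node fusion, layout changes.
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    providers = [
        # enable_cuda_graph records the kernel launch sequence once and
        # replays it, cutting per-inference launch overhead.
        ("CUDAExecutionProvider", {"enable_cuda_graph": "1"}),
        "CPUExecutionProvider",  # fallback if CUDA is unavailable
    ]
    return ort.InferenceSession(model_path, sess_options=opts, providers=providers)
```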

The interesting part is how batching and early stopping interact. The consumer checks Redis queue depth: if enough items are waiting, it pulls a batch and classifies the images concurrently via asyncio.gather(). Within a single image's classification, though, models run through asyncio.as_completed() -- the moment any model flags content as unsafe, the remaining tasks get cancelled. No point running the drugs detector if NudeNet already flagged it.
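
A minimal sketch of that pattern, assuming each model is wrapped in an async callable that returns True for unsafe content:

```python
import asyncio

async def classify_image(image, models) -> bool:
    """First unsafe verdict wins; everything still running gets cancelled."""
    tasks = [asyncio.ensure_future(model(image)) for model in models]
    try:
        for verdict in asyncio.as_completed(tasks):
            if await verdict:
                return True       # early stop: one model flagged it
        return False              # all models came back safe
    finally:
        for t in tasks:
            t.cancel()            # no-op for tasks that already finished

async def process_batch(images, models):
    # Batch level: every image in the batch is classified concurrently.
    return await asyncio.gather(*(classify_image(img, models) for img in images))
```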

Preprocessing is shared: one PIL-to-NumPy conversion generates both 224x224 and 640x640 arrays, so we're not duplicating work across models. OpenCV runs single-threaded (cv2.setNumThreads(1)) to avoid GIL contention in the async context. There's also a Discord-specific optimization: we rewrite cdn.discordapp.com URLs to media.discordapp.net with resize params, so we're downloading 256px JPEGs instead of full-resolution PNGs. Cuts download time significantly.
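
The URL rewrite is small but easy to get wrong silently, so here is a hedged sketch; the exact query parameters Discord's media proxy accepts are an assumption here:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def to_thumbnail_url(url: str, size: int = 256) -> str:
    """Point a cdn.discordapp.com URL at the resizing media proxy."""
    parts = urlsplit(url)
    if parts.netloc != "cdn.discordapp.com":
        return url  # leave non-CDN URLs untouched
    query = dict(parse_qsl(parts.query))
    query.update({"width": str(size), "height": str(size), "format": "jpeg"})
    return urlunsplit(parts._replace(netloc="media.discordapp.net",
                                     query=urlencode(query)))
```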

At startup, the ML service runs warmup passes -- dummy inference plus real images -- to stabilize cuDNN autotuning and AMP kernels before handling live traffic.
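
The dummy half of that warmup is roughly this (input shapes are assumptions matching the shared 224x224 / 640x640 preprocessing above):

```python
import numpy as np
import onnxruntime as ort

def warmup(session: ort.InferenceSession, shape: tuple, passes: int = 5) -> None:
    """Run zero-filled inputs through a session so cuDNN autotuning settles
    before live traffic; real sample images then follow the same path."""
    input_name = session.get_inputs()[0].name
    dummy = np.zeros(shape, dtype=np.float32)
    for _ in range(passes):
        session.run(None, {input_name: dummy})
```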

Redis as the nervous system

The bot pushes jobs to moderation:queue via LPUSH. The ML service consumes with BRPOP (5s timeout). Results go back through moderation:results, consumed by the bot with an infinite-timeout BRPOP. Pushing and consuming use separate Redis clients, because a blocking read ties up its connection and would stall every other command on it.
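
The consumer side of that loop, sketched with redis-py's asyncio client (run_models is a hypothetical stand-in for the classification call):

```python
import asyncio, json
import redis.asyncio as redis

QUEUE, RESULTS = "moderation:queue", "moderation:results"

async def consume(url: str = "redis://localhost") -> None:
    consumer = redis.from_url(url)  # dedicated client for blocking reads
    producer = redis.from_url(url)  # separate client so pushes never stall
    while True:
        item = await consumer.brpop(QUEUE, timeout=5)
        if item is None:
            continue               # 5s timeout expired; loop and block again
        job = json.loads(item[1])
        result = await run_models(job)           # hypothetical classifier call
        await producer.lpush(RESULTS, json.dumps(result))
```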

Jobs that fail get their attempt counter incremented. After max attempts, they move to moderation:dead. Each job carries a settings snapshot -- a frozen copy of the server's configuration at enqueue time. This matters because admins can change settings while jobs are in flight. Without the snapshot, you'd apply new thresholds to content that was queued under old ones.
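
The retry bookkeeping is small enough to sketch in full (the max-attempts value is an assumption; the post doesn't state it):

```python
import json

MAX_ATTEMPTS = 3  # assumption: the actual limit isn't stated

async def handle_failure(producer, raw: bytes) -> None:
    """Requeue a failed job, or dead-letter it once attempts run out.

    The job keeps its settings snapshot untouched, so a retry is still
    judged against the thresholds that applied when it was enqueued."""
    job = json.loads(raw)
    job["attempts"] = job.get("attempts", 0) + 1
    queue = "moderation:dead" if job["attempts"] >= MAX_ATTEMPTS else "moderation:queue"
    await producer.lpush(queue, json.dumps(job))
```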

Connection resilience comes from exponential backoff -- 1s doubling up to a 30s cap -- plus a health check at the start of every consumer loop iteration.
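
Something like this wrapper, assuming redis-py's exception types:

```python
import asyncio
from redis.exceptions import ConnectionError as RedisConnectionError

async def resilient_loop(client, iteration) -> None:
    """Ping before each iteration; back off exponentially when Redis drops."""
    delay = 1.0
    while True:
        try:
            await client.ping()        # health check at the top of the loop
            await iteration()
            delay = 1.0                # a successful pass resets the backoff
        except RedisConnectionError:
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30.0)  # 1s doubling up to the 30s cap
```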

Settings sync without polling

When an admin changes moderation settings in the dashboard, the bot needs to know immediately. The flow: API writes to PostgreSQL, triggers pg_notify('guild_settings_updated', payload). The bot's PgListenerService maintains a persistent connection with LISTEN, catches the notification, and seeds the Redis cache. The in-memory cache gets invalidated simultaneously.
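
The real listener lives in the TypeScript bot; here is the same pattern sketched with asyncpg, where seed_cache is a hypothetical callback that writes the fresh settings into Redis and drops the in-memory entry. The 60-second keepalive described below is folded in:

```python
import asyncio, json
import asyncpg

async def listen_for_settings(dsn: str, seed_cache) -> None:
    conn = await asyncpg.connect(dsn)

    def on_notify(connection, pid, channel, payload):
        # NOTIFY payloads arrive as strings; hand off without blocking libpq.
        asyncio.get_running_loop().create_task(seed_cache(json.loads(payload)))

    await conn.add_listener("guild_settings_updated", on_notify)
    while True:
        await asyncio.sleep(60)
        await conn.execute("SELECT 1")  # keepalive against idle timeouts
```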

This gives us a four-layer cache: Redis (5min TTL) -> in-memory Map -> HTTP fallback to the API -> hardcoded defaults. The hot path never touches the database during normal operation. If Redis goes down, we fall back to memory. If that's cold, we hit the API. If the API is unreachable, defaults. Each layer degrades gracefully.
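
A sketch of that lookup order, with the HTTP client and endpoint as assumptions:

```python
import json

async def get_settings(guild_id: str, redis_client, memory: dict,
                       http, defaults: dict) -> dict:
    """Redis -> in-memory map -> API -> hardcoded defaults."""
    try:
        cached = await redis_client.get(f"settings:{guild_id}")
        if cached is not None:
            return json.loads(cached)
    except Exception:
        pass                                   # Redis down: fall through
    if guild_id in memory:
        return memory[guild_id]                # warm in-memory layer
    try:
        resp = await http.get(f"/guilds/{guild_id}/settings")  # hypothetical endpoint
        return resp.json()
    except Exception:
        return defaults                        # API unreachable: last resort
```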

The PG listener runs a keepalive SELECT 1 every 60 seconds to prevent idle timeout and reconnects automatically with backoff if the connection drops.

Infrastructure

Terraform manages a Hetzner Cloud VPS (application services) and DigitalOcean (managed PostgreSQL). The firewall config is modular, with SSH whitelisting, per-port inbound rules, and iptables egress filtering. CI/CD runs on GitHub Actions with path-based change detection -- it diffs against main and only builds services whose files changed. Docker layer caching via the GHA cache backend keeps build times reasonable. The control plane and hot path stacks get separate deploy steps.

ONNX models are stored in DigitalOcean Spaces and downloaded during the Docker build. Database migrations only run if the API service changed. There's a force-deploy escape hatch for when you replace a VPS and need to redeploy everything.

Metrics

Services push metrics to an API aggregator endpoint, signed with HMAC. Histogram percentiles (p50, p95, p99) are computed client-side before pushing. Counters send deltas, not absolutes -- they reset after each capture. The dashboard consumes these via SSE, so admins see live moderation stats without refreshing.
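
Sketching the client-side capture (field names and the nearest-rank percentile choice are assumptions):

```python
import hashlib, hmac, json, math, time

def percentile(sorted_values: list[float], p: float) -> float:
    # Nearest-rank percentile over the window's samples.
    if not sorted_values:
        return 0.0
    idx = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[idx]

def capture(samples: list[float], counter_deltas: dict[str, int],
            secret: bytes) -> dict:
    """Build one signed metrics push: percentiles computed locally,
    counters sent as deltas since the previous capture."""
    samples = sorted(samples)
    body = json.dumps({
        "ts": time.time(),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "counters": counter_deltas,   # caller resets its counters after this
    })
    signature = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}
```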

Tech stack

TypeScript, Python, NestJS, FastAPI, React, Discord.js, PyTorch, ONNX Runtime (CUDA), PostgreSQL, Redis, Prisma, Docker, Terraform, GitHub Actions, Nginx.