CallSnag

AI voice assistant platform that answers phone calls. Built the backend for an iOS app -- Twilio VoIP, ElevenLabs conversational AI, contract-first API with OpenAPI 3.1.0.

Node.js · Express · Twilio · ElevenLabs · Google Gemini · MySQL
December 20, 2024 · 3 min read

An iOS app where you train an AI with your voice and preferences, and it answers your phone calls. The user picks a routing policy -- maybe unknown callers go to AI, contacts ring through normally -- and the system handles the rest. Twilio for telephony, ElevenLabs for the voice agent, Gemini for prompt generation and call summarization. I built the entire backend. The iOS app was handled by a separate team; I owned everything server-side.

Call routing

Incoming calls hit a Twilio webhook. The backend validates the signature, looks up the phone number, checks if the caller is a known contact or on the whitelist, then applies the user's routing policy. There are six policies: all calls to the AI agent, only unknown callers to the agent, contacts to voicemail while unknowns go to AI, and a few other combinations. If the user's AI agent is disabled, the system overrides to voicemail regardless of policy. Emergency numbers (911, 112, 999, etc.) are blocked from being routed to AI.
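A minimal sketch of that decision, assuming a resolveRoute helper; the policy names, route constants, and the fallback for emergency numbers are mine, not the project's:

```javascript
// Illustrative routing decision. Policy names and outcomes are assumptions.
const EMERGENCY_NUMBERS = new Set(['911', '112', '999']);

function resolveRoute(user, caller) {
  // Emergency numbers are never routed to the AI agent.
  if (EMERGENCY_NUMBERS.has(caller.number)) return 'RING_THROUGH'; // assumption: skip the agent entirely

  // If the user's agent is disabled, voicemail wins regardless of policy.
  if (!user.agentEnabled) return 'VOICEMAIL';

  const isKnown = caller.isContact || caller.isWhitelisted;

  switch (user.routingPolicy) {
    case 'ALL_TO_AGENT':          return 'AGENT';
    case 'UNKNOWN_TO_AGENT':      return isKnown ? 'RING_THROUGH' : 'AGENT';
    case 'CONTACTS_TO_VOICEMAIL': return isKnown ? 'VOICEMAIL' : 'AGENT';
    // ...remaining policy combinations omitted
    default:                      return 'VOICEMAIL';
  }
}
```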

Calls follow a state machine: Ringing -> Answered -> SentToAgent or SentToVoicemail -> Complete. Status callbacks from Twilio update the state asynchronously. The routing logic generates TwiML on the fly based on the policy and caller context.
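A hedged sketch of what that webhook handler could look like with the Twilio Node SDK in Express; lookupUserByNumber, lookupCaller, and agentForwardingNumberFor are hypothetical helpers, and the exact TwiML verbs per route are assumptions:

```javascript
const express = require('express');
const twilio = require('twilio');
const { VoiceResponse } = twilio.twiml;

const app = express();

app.post('/webhooks/twilio/voice', express.urlencoded({ extended: false }), async (req, res) => {
  // Reject anything that doesn't carry a valid Twilio signature.
  const url = `${process.env.BASE_URL}/webhooks/twilio/voice`;
  const valid = twilio.validateRequest(
    process.env.TWILIO_AUTH_TOKEN,
    req.header('X-Twilio-Signature'),
    url,
    req.body
  );
  if (!valid) return res.status(403).send('Invalid signature');

  // Decide where the call goes, then answer with TwiML built on the fly.
  const user = await lookupUserByNumber(req.body.To);     // hypothetical helper
  const caller = await lookupCaller(user, req.body.From); // hypothetical helper
  const route = resolveRoute(user, caller);

  const twiml = new VoiceResponse();
  if (route === 'AGENT') {
    twiml.dial(agentForwardingNumberFor(user)); // hypothetical: number imported into ElevenLabs
  } else if (route === 'VOICEMAIL') {
    twiml.redirect('/webhooks/twilio/voicemail');
  } else {
    twiml.dial(user.deviceNumber); // ring the user's real phone (assumption)
  }
  res.type('text/xml').send(twiml.toString());
});
```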


Voice agents and training

ElevenLabs handles the conversational AI. Agent creation goes through their /convai/agents/create endpoint with voice configuration, model selection, prompt, and first message. Phone numbers get imported into ElevenLabs with the Twilio SID and auth token, then linked to the agent.
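Agent creation could look roughly like this; the endpoint path comes from the write-up, but the request-body fields and response shape are assumptions:

```javascript
// Rough shape of agent creation against the ElevenLabs API (field names assumed).
async function createAgent({ name, voiceId, prompt, firstMessage }) {
  const res = await fetch('https://api.elevenlabs.io/v1/convai/agents/create', {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      name,
      conversation_config: {
        agent: { prompt: { prompt }, first_message: firstMessage },
        tts: { voice_id: voiceId },
      },
    }),
  });
  if (!res.ok) throw new Error(`ElevenLabs agent creation failed: ${res.status}`);
  return res.json(); // assumed to contain the new agent id
}
```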

Training is the interesting part. Users can train their agent through text conversations or voice conversations. Voice conversations are transcribed and the transcript is extracted. All training data feeds into Gemini, which generates the agent's prompt dynamically from the accumulated transcripts. When the user adds or changes training data, the knowledge base updates and the agent's prompt regenerates. The agent learns what to say, how to respond, and what information matters to the user -- all derived from actual conversations rather than manual prompt writing.
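A sketch of the regeneration step with the official Gemini Node SDK; the model name and instruction wording are placeholders, not taken from the project:

```javascript
const { GoogleGenerativeAI } = require('@google/generative-ai');

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

// Rebuild the agent's prompt from the accumulated training transcripts.
async function regenerateAgentPrompt(trainingTranscripts) {
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' }); // model name assumed
  const result = await model.generateContent(
    'Write a system prompt for a personal phone-answering assistant. ' +
      'Base the tone, preferences, and facts strictly on these training conversations:\n\n' +
      trainingTranscripts.join('\n---\n')
  );
  return result.response.text(); // becomes the agent's new prompt
}
```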

Agents also have webhook tools for dynamic data access during calls. The agent can pull user session info, contacts, and preferences in real time through authenticated webhook endpoints.
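An illustrative webhook tool endpoint in Express; the route, shared-secret auth scheme, payload shape, and helper names are assumptions about how such a tool could be wired up:

```javascript
const express = require('express');
const app = express();
app.use(express.json());

// Webhook tool the agent can call mid-conversation to fetch live data.
app.post('/webhooks/tools/contacts', async (req, res) => {
  // Shared-secret header so only the voice-agent platform can hit this (assumption).
  if (req.header('X-Tool-Secret') !== process.env.TOOL_WEBHOOK_SECRET) {
    return res.status(401).json({ error: 'unauthorized' });
  }

  const { agentId, callerNumber } = req.body;
  const user = await findUserByAgentId(agentId);      // hypothetical helper
  const contacts = await getContactsForUser(user.id); // hypothetical helper
  res.json({
    contacts,
    callerIsContact: contacts.some((c) => c.number === callerNumber),
  });
});
```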

Hallucination prevention

The AI generates structured data from calls -- caller name, callback number, a summary. We don't trust any of it blindly. Generated contact information gets validated against existing user data. If the AI says "call from John" but there's no John in the contacts, it gets flagged. Call summaries are generated by Gemini from the raw transcript, not from the AI agent's interpretation.
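A sketch of what that validation could look like; the helper name and the exact flagging rules are illustrative:

```javascript
// Post-call validation: anything the AI "remembers" about the caller is
// checked against data we actually hold before it is stored or surfaced.
async function validateExtractedCaller(userId, extracted) {
  const contacts = await getContactsForUser(userId); // hypothetical helper

  // Flag names that don't match any known contact instead of trusting them.
  const nameMatches = extracted.callerName
    ? contacts.some((c) => c.name.toLowerCase() === extracted.callerName.toLowerCase())
    : false;

  // Callback numbers must at least look like plausible E.164 numbers.
  const numberPlausible = /^\+?[1-9]\d{6,14}$/.test(extracted.callbackNumber || '');

  return {
    ...extracted,
    flags: {
      unknownCallerName: !!extracted.callerName && !nameMatches,
      suspectCallbackNumber: !!extracted.callbackNumber && !numberPlausible,
    },
  };
}
```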

Contract-first API

The backend was designed API-first with an OpenAPI 3.1.0 spec and Swagger UI. The iOS team worked against the spec while I built the implementation. JSDoc annotations on controllers generate the spec automatically. Every endpoint has documented request/response schemas, error codes, and auth requirements. No undocumented endpoints, no surprises.
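With a generator like swagger-jsdoc (my assumption for the tooling), a documented route looks roughly like this; the path, schema, and handler names are invented for illustration:

```javascript
const router = require('express').Router();

/**
 * @openapi
 * /calls/{id}/summary:
 *   get:
 *     summary: Get the AI-generated summary for a completed call
 *     security:
 *       - bearerAuth: []
 *     parameters:
 *       - in: path
 *         name: id
 *         required: true
 *         schema: { type: string }
 *     responses:
 *       200:
 *         description: Call summary with caller name and callback number
 *       404:
 *         description: Call not found
 */
router.get('/calls/:id/summary', authenticate, callsController.getSummary); // hypothetical middleware + handler

module.exports = router;
```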

The database schema is 15+ MySQL tables, accessed through a connection pool. External services (Twilio, ElevenLabs, Gemini, OneSignal) are abstracted behind a manager/singleton pattern so they can be swapped or mocked without touching business logic.
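A sketch of that manager/singleton wrapper, with illustrative class and method names:

```javascript
const twilio = require('twilio');

// Each external service sits behind a manager like this one.
class TwilioManager {
  static #instance = null;

  static getInstance() {
    if (!TwilioManager.#instance) TwilioManager.#instance = new TwilioManager();
    return TwilioManager.#instance;
  }

  constructor() {
    this.client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);
  }

  // Business logic goes through methods like this, never the SDK directly,
  // so tests can swap in a stubbed manager.
  async getCall(callSid) {
    return this.client.calls(callSid).fetch();
  }
}

module.exports = TwilioManager;
```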

Tech stack

JavaScript, Node.js, Express, MySQL, Twilio, ElevenLabs API, Google Gemini, DigitalOcean Spaces, OneSignal, Docker, WebSockets, JWT.