The Gap Between "Works in Demo" and "Works in Production"
Every messaging bridge demo looks the same: send a message on Platform A, watch it appear on Platform B in under a second. Building that demo takes roughly two hours of engineering. Making the bridge reliable in production — at scale, through platform outages, rate limiting, concurrent edits, and network failures — takes roughly two years.
This guide covers the architectural decisions that determine whether a messaging bridge works reliably in production.
The Fundamental Reliability Requirements
A production messaging bridge must satisfy five reliability properties:
- No message loss: A message sent on the source platform must eventually appear on the destination platform, even if there are transient failures
- No duplicate delivery: A message must appear exactly once on the destination platform, even if the webhook delivers the event multiple times
- Order preservation: Messages must appear in the correct order on the destination platform
- Thread coherence: Replies must be nested under the correct parent thread
- Edit/delete synchronization: Edits and deletions on the source must propagate to the destination
Failing any of these properties produces a bridge that is unreliable in practice — and unreliable bridges get turned off. The compliance team pulls the plug when they realize messages are appearing out of order in audit logs. The operations team pulls the plug when they realize some messages are being delivered twice.
The Event Processing Pipeline
A production-grade bridge event processing pipeline has four stages:
Stage 1: Receive and acknowledge
The webhook endpoint has one job: receive the event, validate the signature, write the event to a durable queue, and respond 200 within the platform's timeout window (3 seconds for Slack, Teams, and Webex alike).
No processing happens in Stage 1. No database queries. No API calls to downstream systems. If any of those operations are slow or fail, the webhook acknowledgment misses the timeout window, the platform marks the delivery as failed, and the reliability chain breaks.
[Platform Webhook] → [Endpoint: validate sig, enqueue, respond 200] → [Queue: SQS / Redis Streams / Kafka]
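A minimal sketch of the Stage 1 handler, assuming an HMAC-SHA256 signature scheme and a list-backed queue (each platform has its own signing scheme, and a real deployment would enqueue to SQS, Redis Streams, or Kafka):

```python
import hashlib
import hmac

def handle_webhook(raw_body: bytes, signature: str, secret: bytes, queue: list) -> int:
    """Validate the signature, enqueue the raw event, return an HTTP status.
    No parsing, no database queries, no downstream API calls happen here."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401  # reject forged or corrupted deliveries
    queue.append(raw_body)  # stand-in for the durable queue write
    return 200
```

Keeping the handler this small is what makes the 3-second acknowledgment window easy to hit consistently.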
Stage 2: Deduplication
Event delivery is "at least once" for all five platforms — the same event may be delivered multiple times. Before processing any event, check whether it has already been processed:
```python
event_id = hash(platform + event_type + message_id + timestamp)
# Atomic SET NX avoids the race between checking and marking —
# a separate exists() check lets two workers both pass it.
if not redis.set(f"processed:{event_id}", "1", nx=True, ex=300):  # 5-minute window
    return  # skip, already processed
# proceed with processing
```
The deduplication key should combine platform, event type, message ID, and timestamp. Using message ID alone is insufficient — the same message ID can appear in different event types (created, updated, deleted).
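The key construction can be sketched concretely (function name hypothetical); note that created and deleted events for the same message yield distinct keys:

```python
import hashlib

def make_event_id(platform: str, event_type: str, message_id: str, timestamp: str) -> str:
    # Stable hash over all four fields — message_id alone would collide
    # across created/updated/deleted events for the same message.
    raw = f"{platform}:{event_type}:{message_id}:{timestamp}"
    return hashlib.sha256(raw.encode()).hexdigest()
```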
Stage 3: Transform and route
The transform stage performs:
- Parse: Extract the relevant fields from the platform-specific event payload
- Identity resolve: Map source user ID to destination user ID (email-based resolution)
- Format translate: Convert source markdown/Block Kit/Adaptive Card to destination format
- Thread resolve: Look up the parent thread mapping (source thread ID → destination thread ID)
- Route: Determine the destination channel based on the bridge configuration
Each of these operations may fail. The transform stage must handle failures gracefully:
- Identity resolution failure → fall back to display name without @mention
- Format translation error → fall back to plain text
- Thread resolution miss → post as top-level message (not as a reply)
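The fallback chain above can be sketched as follows (all names are hypothetical; `translate_format` stands in for the real Block Kit / Adaptive Card translator):

```python
def translate_format(blocks):
    # Stand-in translator; raises on payload shapes it does not understand.
    if not isinstance(blocks, list):
        raise ValueError("unsupported payload")
    return " ".join(b["text"] for b in blocks)

def transform(event, identity_map, thread_map):
    """Build a destination payload, degrading gracefully at each step."""
    # Identity resolution failure → plain display name, no @mention
    user = identity_map.get(event["user_id"])
    mention = f"<@{user}>" if user else event["display_name"]

    # Format translation error → fall back to plain text
    try:
        body = translate_format(event["blocks"])
    except Exception:
        body = event.get("text", "")

    # Thread resolution miss → parent=None, post as top-level message
    parent = thread_map.get(event.get("thread_id"))
    return {"author": mention, "body": body, "parent": parent}
```

The key design point: every lookup failure degrades the message rather than dropping it, preserving the no-message-loss property.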
Stage 4: Deliver with retry
The delivery stage calls the destination platform API. The critical properties:
Idempotency key: Include a unique key in each delivery request that allows the destination platform to deduplicate if the request is retried. Teams and Webex support client-request-id headers for this purpose. Slack does not have native delivery idempotency — the deduplication must be at the bridge level.
Exponential backoff with jitter: On 429 (rate limited) or 5xx (server error) responses, wait and retry with exponential backoff. Add random jitter to prevent thundering herd when multiple workers retry simultaneously.
```python
import random
import time

def retry_with_backoff(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as e:
            # Honor the platform's Retry-After hint when it provides one
            wait = e.retry_after or (2 ** attempt + random.uniform(0, 1))
            time.sleep(wait)
        except TransientError:
            time.sleep(2 ** attempt + random.uniform(0, 1))
    raise MaxRetriesExceeded()
```
Dead letter queue: After max retries, move the failed event to a dead letter queue. Do not silently discard it. The DLQ should trigger an alert and be reviewed for patterns (a consistent failure usually indicates a misconfiguration or a platform API change).
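Delivery, retry, and the dead letter queue fit together roughly like this (a sketch; `deliver` and `dlq` are stand-ins for the platform client and the queue, and the backoff is injectable so it can be disabled in tests):

```python
import time

def process_event(event, deliver, dlq, max_attempts=5, backoff=lambda a: 2 ** a):
    """Attempt delivery; after max_attempts failures, move the event to the
    dead letter queue instead of silently discarding it."""
    for attempt in range(max_attempts):
        try:
            return deliver(event)
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(backoff(attempt))
    dlq.append(event)  # feeds an alert and manual pattern review
    return None
```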
State Management: The Thread Mapping Store
Thread coherence — ensuring that replies on Platform A appear as replies to the correct message on Platform B — requires a persistent, low-latency state store. The data model:
```
thread_map: {
  "{platform_a}:{message_id}" → "{platform_b}:{message_id}",
  "{platform_b}:{message_id}" → "{platform_a}:{message_id}"
}
```
This bidirectional map is written at message delivery time and read at reply-routing time. The read must complete before the delivery API call — it is on the hot path.
Storage choice: Redis is the standard choice for this store. It provides sub-millisecond reads, atomic writes, and TTL support (you can expire old thread mappings after 30 days, when reply threads are typically inactive).
Consistency model: Use atomic Redis operations (SETNX, MULTI/EXEC) to prevent race conditions when two concurrent replies arrive for the same parent message.
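A sketch of the bidirectional write using set-if-absent semantics (`FakeRedis` mimics only the `SET NX EX` behavior the pattern relies on; a real deployment would use a Redis client):

```python
class FakeRedis:
    """In-memory stand-in exposing the one primitive the pattern needs."""
    def __init__(self):
        self.data = {}
    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.data:
            return None  # key already set — the concurrent writer won
        self.data[key] = value
        return True

def record_thread_mapping(r, src_key, dst_key, ttl=30 * 86400):
    # Write both directions; NX ensures the first concurrent writer wins
    # and an existing mapping is never overwritten with a different target.
    a = r.set(f"thread:{src_key}", dst_key, nx=True, ex=ttl)
    b = r.set(f"thread:{dst_key}", src_key, nx=True, ex=ttl)
    return bool(a and b)
```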
Handling Platform Outages
No messaging platform has 100% uptime. When a destination platform is unavailable, a production bridge must:
- Detect the outage (sustained 5xx responses or connection timeouts)
- Buffer outbound messages in the queue rather than attempting delivery
- Implement a circuit breaker that stops sending requests to the unavailable platform
- Resume delivery when the platform recovers (circuit breaker closes)
- Drain the buffer in order, respecting rate limits
The circuit breaker pattern prevents the bridge from hammering an unavailable platform with retries and ensures that when the platform recovers, the message backlog is drained cleanly.
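A minimal circuit breaker sketch (thresholds are illustrative, and a production breaker would typically add a half-open state that sends a single probe request before fully closing):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow traffic again
    after `cooldown` seconds so the buffered backlog can start draining."""
    def __init__(self, threshold=5, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at = None  # cooldown elapsed — try the platform again
            self.failures = 0
            return True
        return False  # open: keep messages buffered in the queue

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The injectable `clock` keeps the breaker deterministic under test while defaulting to monotonic time in production.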
Monitoring and Observability
The metrics that matter for a messaging bridge in production:
| Metric | Alert Threshold |
|---|---|
| Event receipt to delivery latency (p99) | > 5 seconds |
| Dead letter queue depth | > 0 |
| Deduplication cache hit rate | Sudden increase (duplicates should be rare) |
| Webhook acknowledgment failure rate | > 0.1% |
| Rate limit event frequency | Trending up = approaching limit |
| Thread mapping cache miss rate | > 5% = state store issue |
A bridge without structured observability is not production-ready — you have no visibility into whether the reliability properties are being satisfied.
Read the Slack Events API deep dive → | Read the Teams Graph API deep dive → | See SyncRivo's reliability architecture →