Inside Merge: how we’re building the leading sync engine
Our sync jobs move millions of records every day for frontier LLM providers, leading banks, and thousands of other B2B SaaS companies.
To power this scale, our engineering team is constantly rethinking how we can deliver faster, more reliable, and more resilient integrations.
To that end, here are some of the measures we’ve recently taken to raise the bar for sync performance.
Evolving concurrency from batching to dynamic scheduling
Our initial approach to concurrency used fixed-size batches, with a "Sync Issuer" coordinating the work. In practice, this meant:
- Processing API requests sequentially
- Grouping substeps into fixed batches (e.g., batch size of 2)
- Waiting for an entire batch to complete before proceeding
This approach came with a few drawbacks. Notably, performance was constrained by the slowest batch member, and sync issuers were left waiting instead of making more API requests—leading to wasted time.
This led us to adopt a fundamentally different approach: “Dynamic Node Scheduling.”
Here’s a snapshot of how it works:
1. The Sync Issuer makes all API requests as quickly as possible.
2. Each result becomes a `QUEUED` sync node.
3. Up to batch-size nodes run simultaneously as `RUNNING`.
4. Completed nodes automatically trigger queued nodes.
This eliminated the bottleneck caused by slow batch members and prevented idle sync issuer time. Taken together, these changes have sped up syncs by up to 15x.
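To make this concrete, here's a minimal sketch of the scheduling pattern. It isn't our production code, and the class and method names are illustrative; it just shows queued nodes starting as soon as a running node finishes, rather than waiting for a batch boundary:

```python
import threading
from queue import Queue

# Sketch of dynamic node scheduling: results are enqueued as QUEUED nodes,
# at most `max_concurrency` run at once, and each completion immediately
# starts the next queued node (no fixed batches).
class DynamicNodeScheduler:
    def __init__(self, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self.queued: Queue = Queue()
        self.lock = threading.Lock()
        self.running = 0

    def submit(self, node):
        """Called by the Sync Issuer as soon as an API result is available."""
        self.queued.put(node)          # node enters the QUEUED state
        self._maybe_start_next()

    def _maybe_start_next(self):
        with self.lock:
            if self.running >= self.max_concurrency or self.queued.empty():
                return
            node = self.queued.get()
            self.running += 1          # node enters the RUNNING state
        threading.Thread(target=self._run, args=(node,), daemon=True).start()

    def _run(self, node):
        try:
            node()                     # process the sync node
        finally:
            with self.lock:
                self.running -= 1
            self._maybe_start_next()   # completion triggers the next queued node
```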
Adopting intelligent rate limit management
Careful rate limit management makes syncs faster, as it eliminates the delays and retries associated with hitting actual rate limits.
With this in mind, we use a shared Redis cache to track API request activity across all concurrent processes.
This allows us to:
- Monitor usage across different rate limit types (we’ve catalogued these for each integration)
- Coordinate between multiple processing jobs
- Trigger exceptions when usage approaches 80% of a given rate limit
- Schedule optimal retry timing based on encoded cooloff periods
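As an illustration, here's a minimal sketch of how a shared counter in Redis can coordinate usage across concurrent workers. The key names, window math, threshold, and exception class are assumptions for the example, not our exact implementation:

```python
import time
import redis

class RateLimitApproaching(Exception):
    """Raised when usage crosses the safety threshold so callers can back off."""

def record_request(r: redis.Redis, integration: str, limit: int,
                   window_seconds: int, threshold: float = 0.8) -> int:
    # Fixed-window counter keyed by integration and window index; every
    # concurrent worker increments the same shared key.
    window = int(time.time()) // window_seconds
    key = f"ratelimit:{integration}:{window}"

    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds * 2)  # let stale windows expire on their own

    if count >= limit * threshold:
        # Approaching the provider's limit: signal callers to pause and
        # retry after the window's cooloff period instead of hitting a 429.
        raise RateLimitApproaching(
            f"{integration}: {count}/{limit} requests in current window"
        )
    return count
```

A worker would call something like `record_request(redis.Redis(), "hris_provider", limit=100, window_seconds=60)` before each API request and back off when the exception fires.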
This approach lets us sustain throughput close to each provider's limits at scale without actually tripping them.
For example, we recently synced 1.3 million objects for a frontier LLM provider and operated within 3% of their theoretical maximum throughput by dynamically managing rate limits and backing off at the right times.
Engineering fault-tolerant infrastructure at scale
We’ve introduced fault-tolerant state persistence to ensure sync jobs survive interruptions.
When AWS issues a termination notice—or our memory monitoring detects trouble—the system immediately serializes the job’s entire state into a JSON snapshot. This snapshot, capturing hundreds of variables, is written to Elastic File System (EFS) within the two-minute window available.
When a replacement server comes online, it retrieves the state file, reconstructs the sync environment with complete fidelity, and resumes execution without losing progress. No manual intervention required.
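As a rough illustration of the checkpoint/restore flow (paths, state fields, and the SIGTERM hook are assumptions; the real snapshot captures hundreds of variables, and the termination notice may be surfaced differently):

```python
import json
import os
import signal

SNAPSHOT_PATH = "/mnt/efs/sync-snapshots/job-state.json"  # hypothetical EFS mount

class SyncJob:
    def __init__(self, state: dict):
        self.state = state  # cursors, processed counts, retry queues, etc.

    def snapshot(self):
        """Serialize the job's state so a replacement worker can resume it."""
        tmp_path = SNAPSHOT_PATH + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp_path, SNAPSHOT_PATH)  # atomic rename avoids torn writes

    @classmethod
    def resume_or_start(cls, initial_state: dict) -> "SyncJob":
        """On startup, prefer an existing snapshot over a fresh run."""
        if os.path.exists(SNAPSHOT_PATH):
            with open(SNAPSHOT_PATH) as f:
                return cls(json.load(f))
        return cls(initial_state)

job = SyncJob.resume_or_start({"cursor": None, "synced": 0})

# On a termination notice (or a memory alarm), persist state before exit.
signal.signal(signal.SIGTERM, lambda signum, frame: job.snapshot())
```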
This breakthrough lets us run jobs of any duration with confidence in their completion.
We’ve also realized significant business benefits: by leaning further into spot instances, we’ve cut daily compute costs by 40%, and our engineers no longer need to repeatedly intervene in large account syncs.
Final thoughts
We’re proud of the progress so far, but our mission isn’t to be better than competitors. It’s to deliver the best sync performance possible for our customers.
With customer feedback and ongoing experimentation—whether it’s in scheduling, retry logic, or infrastructure resilience—we’ll continue to push the limits of what’s possible in data synchronization.