Designing a Resumable Serverless Sync Pipeline
By KINETK Team
The most impressive AI demo collapses if the data underneath cannot keep up.
A search API that returns stale results loses trust on the second query. A trend dashboard that shows yesterday's narratives is a worse product than no dashboard at all. An agent reasoning about a creator who deleted their account three weeks ago will get the wrong answer with full confidence. None of these failures look like AI failures. They look like the product is broken, because in any meaningful sense, it is.
This post is about the layer that prevents those failures. It walks through the design of a continuous, resumable sync pipeline that moves new and updated scan records from a raw event store into a canonical metadata store every hour. The pipeline is idempotent by design, transactional at the page level, and observable through health checks and freshness signals. It is unglamorous. It is also the thing that makes the rest of the system trustworthy.
For Everyone
Why continuous sync
Before continuous sync, the alternative is a one-off migration. Run a script, copy the data, declare victory. This is fine for a launch, and disastrous as a product.
Real social data is not a static dataset. New scans land continuously. Old records get updated when a creator changes their handle, a post's engagement counts climb, a video gets reshot, a community gets renamed. Tags drift. Captions get edited. If the canonical store stops moving, the API on top of it slowly disconnects from reality.
A continuous sync pipeline closes that gap. Every hour, it asks: what changed in the upstream store since the last time I checked? It pulls those changes, runs them through the same parsing and normalization rules, and updates the knowledge graph. The system never has a "stale period" longer than the interval between runs.
This sounds simple. The work is in the corner cases.
The five problems any continuous data pipeline has to solve
Every team that builds one of these eventually meets the same five problems.
The first is scheduling. How does the pipeline know when to run? The naive answer is "every hour, run it." The real answer has to handle more: the previous run may still be going (do not start a second one in parallel); the schedule may have been paused while something else was being investigated (the next run should pick up where the last one left off, not from now); and the time of day matters for some signals but not others. It has to handle all of this at a volume of hundreds of scans a second.
The second is drift. Upstream sources change their shape over time. New fields appear, old fields get renamed, some records suddenly have a JSON field where they used to have a string. The pipeline cannot break when this happens. It also cannot silently drop the new shape. The right behavior is to absorb the change, log it, and keep moving.
The third is malformed inputs. Real data is dirty. Some records have control characters in captions, some have invalid UTF-8 sequences, some have JSON that was double-encoded by a bug six months ago and never fixed. A parser that throws on any of these takes down the whole batch. A parser that silently swallows them produces wrong results. The right behavior is to recognize each known failure mode, handle it explicitly, and keep moving.
The fourth is retries. Anything that talks to a remote system will fail occasionally. A network blip, a database connection timeout, a transient permission glitch. The pipeline has to retry the work without writing the same record twice. The way to do that is to make every operation idempotent: if you write the same row twice, you get the same result you would have gotten writing it once.
The fifth is observability. If the pipeline silently stops, every downstream feature slowly rots. Health checks need to expose not just "is the Lambda up" but "did the last run actually finish, and how long ago was it." Alarms need to fire on errors and on absence-of-runs. Without those signals, the failure mode is "everything looks fine until a customer notices the data is two weeks old."
A walking example
The clock ticks over. The scheduler invokes the job. Here is what happens.
The job reads the timestamp of the last successful run from a small parameter store. It rounds back one day to handle midnight boundary cases and short-lived consistency lags in the upstream index. It then walks the upstream store, day partition by day partition, asking for any record whose update timestamp is newer than the last successful run.
For each page of records that comes back, it does five things. It parses each record into a uniform shape, handling all the messy input cases mentioned above. It groups the records by creator, by tag, and by community. It writes those entity rows first. It then writes the content rows, with foreign keys pointing back to the entities. It writes the tag-edge junction rows last. All of this happens inside a single database transaction, so a failure in any step rolls everything back. Then it moves to the next page.
After every page has succeeded, and only after every page has succeeded, the Lambda writes a new "last successful run" timestamp into the parameter store. If the Lambda times out before that point, the next invocation reads the same checkpoint and re-processes the same window. Because the writes are idempotent, the re-processing produces the same database state. No record is ever skipped, and no record is ever written twice in a way that produces drift.
Health checks run independently. They read the most recent checkpoint and report how many minutes have passed since the last successful run. Anything older than three hours triggers an alert. Anything that returns an error during the run also triggers an alert. The two alarms together cover both "it failed" and "it never ran."
That is the entire happy path. The work is in making sure every step's failure mode is handled explicitly.
Why this is harder than it sounds
The corner cases compound.
Upstream eventual consistency means a record written exactly at the boundary of a query window can be invisible to the query for a few minutes. The one-day overlap on the partition scan handles this. Without it, you would silently miss any record that landed in the few-minute window where the upstream index had not caught up.
Daylight saving time and timezone handling are not theoretical concerns. A Lambda running in UTC, scanning a partition keyed by date, with a checkpoint stored as ISO 8601, can produce off-by-one errors twice a year if you do not think carefully about which timezone owns the partition key.
JSON encoded twice (a string containing a JSON-encoded string) is a real failure mode that shows up in the wild more often than people expect. It happens when an upstream system does JSON.stringify(...) on something that is already a string. Your parser has to recognize the case and unwrap it.
Control characters and lone UTF-16 surrogates inside captions break a surprising number of things downstream. Some JSON libraries reject lone surrogate halves. The parser has to clean these out before they get anywhere near the database.
These are not exotic bugs. They are the predictable consequences of running a pipeline against real social data. The work of designing a sync pipeline is mostly the work of cataloguing and handling these failure modes deliberately rather than getting paged by them at 3am.
For Builders
For technical readers, here is the runtime in detail.
Architecture
sequenceDiagram
participant EB as Scheduler
participant L as Sync Lambda
participant CP as Checkpoint Store
participant DDB as Raw Scan Store
participant P as Parser
participant PG as Canonical Database
EB->>L: Trigger sync
L->>CP: Read last synced timestamp
L->>DDB: Query updated records by time window
DDB-->>L: Paged raw items
L->>P: Parse, sanitize, normalize
P-->>L: Enriched content rows
L->>PG: Transactional upserts (per page)
PG-->>L: Processed UUIDs
L->>CP: Write checkpoint after all pages succeed
Six logical steps. Each has a clear input shape, a clear output shape, and a single failure mode it has to handle.
Step 1: trigger and checkpoint read
The Lambda is triggered on a fixed schedule by the cloud scheduler. If the previous run is still going when the next tick fires, the new invocation throttles. The schedule is hourly; if the previous run is still running an hour later, something is wrong, and the alarm on no-successful-runs will fire before the data drifts noticeably.
The checkpoint is an ISO timestamp stored in a parameter store. On the first run, when the parameter does not exist, we default to twenty-four hours in the past. This bounds the cold-start work to a single day rather than to the entire history of the upstream store.
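Here is a minimal sketch of that read, assuming the checkpoint lives in SSM Parameter Store as an ISO 8601 string; the parameter name is illustrative. (The no-second-run-in-parallel behavior itself usually comes from capping the function's concurrency at one, so an overlapping invocation throttles rather than starting.)

// Checkpoint read with a bounded cold-start default (sketch).
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});
const CHECKPOINT_PARAM = '/sync/last-successful-run'; // illustrative name
const DAY_MS = 24 * 60 * 60 * 1000;

export async function readCheckpoint(): Promise<Date> {
  try {
    const res = await ssm.send(new GetParameterCommand({ Name: CHECKPOINT_PARAM }));
    return new Date(res.Parameter!.Value!); // stored as ISO 8601
  } catch (err) {
    // First run: the parameter does not exist yet, so bound the work to one day.
    if ((err as Error).name !== 'ParameterNotFound') throw err;
    return new Date(Date.now() - DAY_MS);
  }
}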
Step 2: query with an overlap window
The upstream raw store is a key-value store with a global secondary index keyed on a date partition (YYYY-MM-DD) and an update timestamp sort key. We walk that index day by day, starting one full day before the last checkpoint and ending today.
The one-day overlap is deliberate. It handles three real failure modes. First, midnight boundary records that landed slightly before the checkpoint timestamp but slightly after midnight in the partition's timezone. Second, the upstream index's eventual consistency: a record written milliseconds before the previous run finished may not have been visible to that run's query. Third, the rare clock-skew event between the writer and the index.
Within each day partition, we filter the sort key to updatedAt > lastSyncedAt so we do not re-process records the previous run already saw. Only the partition scan window widens; the sort-key filter keeps the actual record set tight.
The query yields pages, not a flat list. We process page by page, which keeps memory bounded regardless of how busy the upstream store has been.
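A sketch of that walk, assuming the raw store is DynamoDB and the index is keyed on a scanDate string partition with an updatedAt ISO sort key; table, index, and attribute names are illustrative.

// Walk day partitions from one day before the checkpoint through today,
// filtering the sort key so only records updated since the checkpoint return.
import { DynamoDBClient, QueryCommand, AttributeValue } from '@aws-sdk/client-dynamodb';

const ddb = new DynamoDBClient({});
const DAY_MS = 24 * 60 * 60 * 1000;

function* dayPartitions(start: Date, end: Date): Generator<string> {
  // Iterate whole UTC days so the current day is always included.
  const first = Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate());
  const last = Date.UTC(end.getUTCFullYear(), end.getUTCMonth(), end.getUTCDate());
  for (let t = first; t <= last; t += DAY_MS) {
    yield new Date(t).toISOString().slice(0, 10); // YYYY-MM-DD
  }
}

export async function* pagesSince(lastSyncedAt: Date) {
  const overlapStart = new Date(lastSyncedAt.getTime() - DAY_MS); // one-day overlap
  for (const day of dayPartitions(overlapStart, new Date())) {
    let startKey: Record<string, AttributeValue> | undefined;
    do {
      const res = await ddb.send(new QueryCommand({
        TableName: 'raw-scans',     // illustrative
        IndexName: 'byScanDate',    // illustrative
        KeyConditionExpression: 'scanDate = :day AND updatedAt > :since',
        ExpressionAttributeValues: {
          ':day': { S: day },
          ':since': { S: lastSyncedAt.toISOString() },
        },
        ExclusiveStartKey: startKey,
      }));
      if (res.Items?.length) yield res.Items; // one page at a time; memory stays bounded
      startKey = res.LastEvaluatedKey;
    } while (startKey);
  }
}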
Step 3: parse with wrapNumbers
For each page, the typed key-value records get unmarshalled into plain objects. We pass wrapNumbers: true to the unmarshaller. This matters because some of the integer fields in the records (follower counts on large accounts, view counts on viral posts) can exceed Number.MAX_SAFE_INTEGER (2^53 - 1), the point past which JavaScript numbers can no longer represent every integer exactly. Without wrapNumbers, those values cannot be converted to plain JavaScript numbers without risking precision loss. With it, they arrive as a wrapper type the rest of the pipeline can convert deliberately.
The parser also normalizes inconsistent field names from the upstream side: contentUrl versus url, content_id versus contentId versus postId, and a few others. The result is a uniform SourceRow shape that the next step consumes.
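A sketch of the parse step under those assumptions: with wrapNumbers set, numeric attributes come back as wrapper objects carrying the original digit string, which this code keeps as strings rather than coercing to JavaScript numbers. The SourceRow fields shown are a simplified subset.

import { unmarshall } from '@aws-sdk/util-dynamodb';
import type { AttributeValue } from '@aws-sdk/client-dynamodb';

interface SourceRow {
  contentId: string;
  url: string | null;
  updatedAt: string;
  viewCount: string | null; // kept as a string to avoid precision loss
  metadata: unknown;
}

// Wrapped numbers look like { value: "12345678901234567890" }; plain numbers
// and strings pass straight through.
function asNumericString(v: unknown): string | null {
  if (v == null) return null;
  if (typeof v === 'object' && v !== null && 'value' in v) {
    return String((v as { value: unknown }).value);
  }
  return String(v);
}

export function toSourceRow(item: Record<string, AttributeValue>): SourceRow {
  const raw = unmarshall(item, { wrapNumbers: true });
  return {
    // Upstream field names drift; accept the known aliases.
    contentId: String(raw.contentId ?? raw.content_id ?? raw.postId),
    url: (raw.contentUrl ?? raw.url ?? null) as string | null,
    updatedAt: String(raw.updatedAt),
    viewCount: asNumericString(raw.viewCount),
    metadata: raw.metadata ?? null,
  };
}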
Step 4: defensive metadata extraction
Each record’s metadata arrives as a JSON blob, but the parser is built to handle the messy cases that show up in production. If the blob is null, it returns null. If it is already an object, it passes through unchanged. If it arrives as a string, the parser cleans it first by stripping any UTF-8 BOM and removing NUL bytes. It also detects double-encoded JSON and unwraps it before deserializing, so a JSON string containing another JSON string still resolves correctly.
Once deserialized, the parser extracts platform-specific fields such as engagement stats, author information, timestamps, tags, and community references. Any strings it processes are cleaned by removing control characters, applying NFC normalization, and converting empty strings to null. Before the metadata is written to the database’s JSON column, it is validated one final time with JSON.parse; if anything remains invalid, the system stores null instead of a malformed value that could break later queries.
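The sketch below shows that shape handling and string cleanup in one possible form; the platform-specific field extraction is elided.

// Defensive metadata parsing: null in, null out; objects pass through;
// strings are cleaned, unwrapped if double-encoded, then deserialized.
export function parseMetadata(blob: unknown): Record<string, unknown> | null {
  if (blob == null) return null;
  if (typeof blob === 'object') return blob as Record<string, unknown>;
  if (typeof blob !== 'string') return null;

  // Strip a UTF-8 BOM and NUL bytes before attempting to deserialize.
  const text = blob.replace(/^\uFEFF/, '').replace(/\u0000/g, '');
  try {
    let parsed: unknown = JSON.parse(text);
    // Double-encoded JSON: the first parse yields another JSON string.
    if (typeof parsed === 'string') parsed = JSON.parse(parsed);
    return typeof parsed === 'object' && parsed !== null
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null; // store null rather than a malformed value
  }
}

// Cleanup applied to every text field pulled out of the metadata: drop
// control characters and lone surrogate halves, normalize to NFC, and
// collapse empty strings to null.
export function cleanString(value: string | null): string | null {
  if (value == null) return null;
  const cleaned = value
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '')
    .replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, '')
    .normalize('NFC')
    .trim();
  return cleaned.length > 0 ? cleaned : null;
}

Step 5: transactional upserts per page
Each parsed page is written inside a single database transaction, in a fixed order: entity rows (creators, tags, communities) first, content rows with foreign keys to those entities next, tag-edge junction rows last, and the processed UUIDs come back to the caller. A failure at any point rolls the whole page back, and the upserts are written so that replaying a page produces the same database state. Here is a minimal sketch, assuming node-postgres and a Postgres canonical store; table and column names are illustrative, and the tag and community writes follow the same pattern as the creator write shown.

import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the environment

interface EnrichedRow {
  contentId: string;
  creatorId: string;
  creatorHandle: string;
  url: string | null;
  updatedAt: string;
  metadata: Record<string, unknown> | null;
}

export async function upsertPage(rows: EnrichedRow[]): Promise<string[]> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // 1. Entity rows first, so the content rows always have something to reference.
    for (const row of rows) {
      await client.query(
        `INSERT INTO creators (id, handle)
         VALUES ($1, $2)
         ON CONFLICT (id) DO UPDATE SET handle = EXCLUDED.handle`,
        [row.creatorId, row.creatorHandle],
      );
    }

    // 2. Content rows, with foreign keys back to the entities.
    const processed: string[] = [];
    for (const row of rows) {
      const res = await client.query(
        `INSERT INTO content (id, creator_id, url, metadata, updated_at)
         VALUES ($1, $2, $3, $4, $5)
         ON CONFLICT (id) DO UPDATE SET
           url = EXCLUDED.url,
           metadata = EXCLUDED.metadata,
           updated_at = EXCLUDED.updated_at
         RETURNING id`,
        [row.contentId, row.creatorId, row.url, row.metadata, row.updatedAt],
      );
      processed.push(res.rows[0].id);
    }

    // 3. Tag-edge junction rows last (same pattern, ON CONFLICT DO NOTHING).

    await client.query('COMMIT');
    return processed; // processed UUIDs, reported back to the run
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}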
Step 6: write the checkpoint last
The checkpoint write happens outside the page loop, only after every page has been processed successfully. This is the single most important reliability property of the pipeline.
If the Lambda times out mid-run, the checkpoint is unchanged. The next invocation reads the same checkpoint and reprocesses the same window. Re-processing is safe because every database write is idempotent. No record is skipped, no record is double-counted in a way that produces drift.
The alternative is to write the checkpoint after each page. That would let the pipeline make progress under timeouts, at the cost of partial-state risk: if a page failed midway through but its predecessor succeeded, the checkpoint would advance past records that did not actually land. Choosing checkpoint-after-success trades occasional re-processing for guaranteed completeness.
The checkpoint write itself uses exponential backoff with up to three attempts, because the parameter store occasionally throttles under enough concurrent activity. A failed checkpoint write does not roll back the actual database state; it just means the next run sees the previous checkpoint and re-processes one window.
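A sketch of that write, again assuming SSM Parameter Store; the parameter name, retry count, and delays mirror the description above but are otherwise illustrative. Checkpointing the run's start time, rather than the completion time, is one reasonable choice and an assumption here.

import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});
const CHECKPOINT_PARAM = '/sync/last-successful-run'; // illustrative name
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Called once, after every page in the run has committed. Writing the time the
// run started keeps records updated mid-run inside the next window.
export async function writeCheckpoint(runStartedAt: Date): Promise<void> {
  let lastError: unknown;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      await ssm.send(new PutParameterCommand({
        Name: CHECKPOINT_PARAM,
        Value: runStartedAt.toISOString(),
        Type: 'String',
        Overwrite: true,
      }));
      return;
    } catch (err) {
      lastError = err; // parameter store throttling, usually
      await sleep(250 * 2 ** attempt); // exponential backoff: 250ms, 500ms, 1s
    }
  }
  // Failing here does not corrupt data: the next run just re-processes one window.
  throw lastError;
}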
Observability
Two alarms on the sync Lambda cover the main failure modes.
The first fires when the function throws an error in any 5-minute window. Errors include database unavailability, parsing exceptions that escape the per-record handlers, and any unhandled promise rejection. The alarm publishes to a topic that operations subscribes to.
The second fires when the function has not been invoked in three hours. This catches the case where the schedule got disabled, the function got throttled, or the cloud scheduler's permissions got revoked. Without this alarm, "the data slowly stops moving" is a silent failure mode.
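For concreteness, here is one way those two alarms can be expressed, assuming the infrastructure is defined with AWS CDK; construct IDs and the topic wiring are illustrative. The three-hour check becomes "fewer than one invocation per three-hour period, with missing data treated as breaching."

import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sns from 'aws-cdk-lib/aws-sns';

export class SyncAlarms extends Construct {
  constructor(scope: Construct, id: string, syncFn: lambda.IFunction, alertTopic: sns.ITopic) {
    super(scope, id);

    // Alarm 1: the function threw an error in any 5-minute window.
    const errors = new cloudwatch.Alarm(this, 'SyncErrors', {
      metric: syncFn.metricErrors({ period: Duration.minutes(5), statistic: 'Sum' }),
      threshold: 1,
      evaluationPeriods: 1,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
    });
    errors.addAlarmAction(new cwActions.SnsAction(alertTopic));

    // Alarm 2: the function has not been invoked at all in three hours.
    const silent = new cloudwatch.Alarm(this, 'SyncNotRunning', {
      metric: syncFn.metricInvocations({ period: Duration.hours(3), statistic: 'Sum' }),
      threshold: 1,
      evaluationPeriods: 1,
      comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
      treatMissingData: cloudwatch.TreatMissingData.BREACHING, // "no data" means "not running"
    });
    silent.addAlarmAction(new cwActions.SnsAction(alertTopic));
  }
}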
What this enables
A continuous sync pipeline is unglamorous infrastructure. It does not show up in product demos. It does not have a customer-facing API. The only time anyone notices it is when it is broken.
That is also why it matters. Every other piece of the intelligence stack on top of this layer assumes that the canonical store is fresh, idempotent, and complete. The semantic search pipeline assumes the join key into the vector index is accurate and up to date. The narrative cluster job assumes engagement counts reflect reality within the last hour. The campaign brief endpoint assumes creator stats are not weeks old. The agents reading the API assume that the world they are reasoning about is the world that exists.
If the sync pipeline lies, all of those assumptions break, and the failures are diffuse. A search returns results that look fine but are stale. A trend dashboard shows a narrative that is already over. An agent confidently recommends a creator who deleted their account two weeks ago. None of these failures look like the sync pipeline broke. They look like the product is wrong.
That is why we treat reliability as part of the product. Every reliability decision in this pipeline is a downstream feature decision in disguise. The one-day overlap on the partition scan is the reason a search query never returns yesterday's data missing today's updates. The transactional upsert order is the reason creator stats and content rows never get out of sync. The checkpoint-after-success rule is the reason a Lambda timeout never silently advances state.
The most impressive AI demo collapses if the data layer underneath cannot keep up. The unsexy half of intelligence infrastructure is what makes the impressive half possible. Both halves get built. Only one gets shown off.