From Raw Scans to KG-Ready Metadata

By KINETK Team

A single TikTok scan record arrives with a like count in a field called diggCount. The same content, reshot on Instagram, has a like count in likeCount. The same content, screenshotted on Reddit, has a top-level score. The same content, repinned on Pinterest, has stats.likes.

None of these fields mean exactly the same thing. None of them are interchangeable. And yet for any downstream system that wants to ask "which creators are getting traction with this kind of content," the differences need to disappear.
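
As a sketch of what making those differences disappear involves, here is roughly what the per-platform mapping looks like. The field paths are the ones named above; the function and dictionary names are hypothetical, and as the comment notes, this normalizes shape, not meaning.

LIKE_FIELD_PATHS = {
    "tiktok":    ["diggCount"],
    "instagram": ["likeCount"],
    "reddit":    ["score"],
    "pinterest": ["stats", "likes"],
}

def extract_like_count(platform: str, record: dict):
    """Walk the platform-specific path to a canonical like count.

    Note: these counts are not semantically identical (Reddit's score
    nets downvotes against upvotes, for instance); this step normalizes
    shape, not meaning.
    """
    value = record
    for key in LIKE_FIELD_PATHS[platform]:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value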

This post is about the layer that makes that disappearance possible. It is the boring, brutal work of ingestion: parsing inconsistent JSON, extracting the parts that matter, turning fields buried in opaque metadata into first-class entities, and writing it all into a schema that can be queried, ranked, and joined to a vector index without per-platform special cases.

The intelligence graph is only as good as the contract this layer enforces. Get this wrong and every downstream feature inherits the mess.


The hidden difficulty

People underestimate ingestion. From the outside it sounds like a parsing problem. In practice it is a category problem.

Each platform has its own metadata dialect. Field names disagree. Field meanings disagree. A "like" on Reddit is not the same as a "like" on TikTok, even if you ignore the field-name difference and just look at the number. Reddit's score nets upvotes against downvotes; TikTok's diggCount does not. Engagement counts can be missing, can be strings, can be numbers, can arrive wrapped in a JSON string, can be null, can be the literal string "null". Captions can have control characters embedded in them. Tags arrive as both structured arrays and as hashtags hidden inside descriptions. Authors arrive as plain strings on some platforms and as nested objects with follower stats on others.
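
The coercion that litany implies looks something like the sketch below. The function name and exact fallbacks are ours; the shapes it guards against are the ones just listed.

def coerce_count(raw):
    """Coerce an engagement count to int or None, whatever shape it arrived in."""
    if raw is None:
        return None
    if isinstance(raw, bool):            # bool subclasses int; a True count is junk
        return None
    if isinstance(raw, (int, float)):
        return int(raw)
    if isinstance(raw, str):
        cleaned = raw.strip()
        if cleaned.lower() in ("", "null", "none"):
            return None
        try:
            return int(float(cleaned))   # also tolerates "1.2e3"-style strings
        except (ValueError, OverflowError):
            return None
    return None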

If you let any of this leak into the rest of the system, you pay for it forever. Every downstream feature will need to know that TikTok stores follower counts in one shape and Instagram stores them in another. Every analyst query will need a footnote. Every machine learning model trained on the data will inherit the per-platform quirks.

The job of the ingestion layer is to make the dialects go away. What comes out the other side is a small set of tables with stable column names, normalized values, and explicit relationships. After that, nothing downstream has to know that TikTok and Instagram exist, except where it actively wants to.
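
For a concrete picture of what "stable column names" means, here is one plausible shape for the normalized row. Names like creator_id, author_handle, published_at, and graph_sync_status appear elsewhere in this post; the rest are illustrative.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ContentRow:
    platform: str                  # "tiktok", "instagram", "reddit", ...
    creator_id: int                # FK into creators
    author_handle: str             # denormalized for join-free lookup
    like_count: int | None         # canonical names, whatever the source called them
    comment_count: int | None
    share_count: int | None
    view_count: int | None
    published_at: datetime         # normalized timestamp
    raw_metadata: str              # original blob, validated JSON, archival only
    graph_sync_status: str = "pending"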

A worked example

Walk a single scan record through the pipeline.

A scanner picks up a fitness clip on TikTok. The raw record arrives in our first database with a metadata blob containing the creator's handle, follower count, the video's view count, like count under diggCount, comment count, share count, the description, a tags array, and a publish timestamp.

The ingestion layer does the following.

It pulls the creator's handle and follower stats and writes a row to creators, keyed on (TIKTOK, @creator_handle). If the creator already exists, the existing row is updated in place with the refreshed follower count. The row gets a creator_id.
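
A minimal version of that upsert, assuming PostgreSQL and a unique constraint on (platform, handle); column names beyond those mentioned in this post are guesses:

UPSERT_CREATOR = """
INSERT INTO creators (platform, handle, follower_count)
VALUES (%(platform)s, %(handle)s, %(follower_count)s)
ON CONFLICT (platform, handle)
DO UPDATE SET follower_count = EXCLUDED.follower_count
RETURNING creator_id;
"""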

It extracts hashtags from both the structured tags array and from the description text (looking for #word patterns). It normalizes each one (lowercase, hash stripped, deduplicated within the batch) and writes any new ones to tags. Each tag gets a tag_id.
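
The extraction itself is small. A sketch, with the regex and function name being ours:

import re

HASHTAG_RE = re.compile(r"#(\w+)")

def normalize_tags(structured_tags: list[str], description: str) -> list[str]:
    """Merge the tags array with #word patterns mined from the description,
    lowercase, strip the hash, and deduplicate within the batch."""
    candidates = list(structured_tags) + HASHTAG_RE.findall(description or "")
    seen, out = set(), []
    for tag in candidates:
        norm = tag.lstrip("#").lower()
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out

# normalize_tags(["Fitness", "#HomeWorkout"], "day 30! #fitness #grind")
# -> ["fitness", "homeworkout", "grind"]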

It writes the content row itself to content. The row carries the creator's creator_id as a foreign key, plus a denormalized author_handle for fast lookup without a join. Engagement counts get normalized into the canonical column names. The publish timestamp gets converted from whatever shape the platform used (Unix seconds vs Unix milliseconds) into a normalized database timestamp. The metadata blob gets validated as JSON and stored as a single column for archival; we do not query it directly anymore.
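
The seconds-versus-milliseconds conversion can be done by magnitude alone, since the two scales are three orders of magnitude apart for any plausible publish date. A sketch, with the threshold being our assumption:

from datetime import datetime, timezone

def normalize_epoch(raw: float) -> datetime:
    """Accept a Unix epoch in seconds or milliseconds and return UTC."""
    seconds = raw / 1000 if raw > 1e11 else raw   # > 1e11 implies milliseconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc)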

The same record reshot on Instagram a day later goes through the same pipeline. It produces a different content row, possibly a different creator row (different handle), and shares some tag rows but not others. There are now two content rows that contain the same actual video, but the system does not know that yet. Cross-platform deduplication is the job of a later similarity pass.

What this enables

After ingestion, a query like "which creators are gaining traction in fitness content this week" does not need any knowledge of the underlying platforms. It joins content to creators on creator_id, joins content to content_tags to tags on the relevant tag, filters by published_at and engagement_rate, and returns. The query is the same whether the content came from TikTok, Instagram, Reddit, or anywhere else.
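
Concretely, the query might look like the sketch below. The table names and the published_at and engagement_rate columns come from this post; content_id, the tag name column, and the thresholds are illustrative.

TRACTION_QUERY = """
SELECT cr.handle,
       COUNT(*)               AS posts,
       AVG(c.engagement_rate) AS avg_engagement
FROM content c
JOIN creators     cr ON cr.creator_id = c.creator_id
JOIN content_tags ct ON ct.content_id = c.content_id
JOIN tags         t  ON t.tag_id      = ct.tag_id
WHERE t.name = 'fitness'
  AND c.published_at >= now() - interval '7 days'
GROUP BY cr.handle
ORDER BY avg_engagement DESC
LIMIT 20;
"""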

This is what "KG-ready" actually means. The schema is shaped so that the questions you want to ask can be expressed without per-platform special cases.

Why this is harder than it sounds

Three things tend to go wrong.

The first is the loss of provenance. It is tempting to flatten platform-specific fields into common ones and then forget that the originals existed. We keep the raw metadata blob alongside the normalized columns precisely so that when a downstream consumer needs the original shape (for an audit, for a re-extraction, for handling a platform field we did not know about at ingestion time), it is still there.

The second is duplicate handling. The same TikTok video, reposted on Instagram and screenshotted on Reddit, is one piece of content from a cultural standpoint and three rows in the database. If we collapse them at ingestion time we lose information about how the content moved. If we never collapse them we cannot answer questions about the underlying piece of culture. The schema handles both cases through a canonical_uuid pointer.
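
One plausible shape for that pointer: the later similarity pass stamps the duplicate rows with a shared canonical_uuid, after which both the per-platform rows and the underlying piece of culture stay queryable. A sketch, with all names besides canonical_uuid illustrative:

MARK_DUPLICATES = """
UPDATE content
SET canonical_uuid = %(canonical_uuid)s
WHERE content_id = ANY(%(content_ids)s);
"""

# Aggregate engagement per underlying piece of culture, across platforms:
PER_CULTURE = """
SELECT canonical_uuid,
       COUNT(*)        AS copies,
       SUM(like_count) AS total_likes
FROM content
WHERE canonical_uuid IS NOT NULL
GROUP BY canonical_uuid;
"""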

The third is sync state. Once content is ingested, it becomes part of the knowledge graph. If the row is updated later (re-scanned, or enriched by a downstream job), the graph needs to know it has changed. We track this with a graph_sync_status field that moves through pending (newly ingested), exported (synced into downstream graph stores), dirty (changed since last export), and excluded (intentionally kept out of the graph). The state is what lets graph construction stay incremental rather than rebuilding from scratch every time.

None of these are visible from the API surface. They are the seams that make the API surface possible.


The normalization diagram

flowchart LR
    A[Raw platform metadata] --> B[Deserialize and sanitize]
    B --> C[Platform-specific extraction]
    C --> D[Normalized content row]
    C --> E[Creator entity]
    C --> F[Tag entities]
    C --> G[Community entity]
    D --> H[content table]
    E --> I[creators table]
    F --> J[tags table]
    F --> K[content_tags edges]
    G --> L[communities table]

Five outputs from one raw record. Each output gets its own table or junction row. The relationships between them are explicit, indexable, and queryable.

Graph sync lifecycle

Both content and creators carry a graph_sync_status field with four states.

  • pending is the default for new rows. The graph export job picks these up and writes them to whatever downstream graph store needs them.
  • exported means the export has happened. The row has a graph_node_id and a graph_exported_at timestamp.
  • dirty means the row was updated after export. The next export pass picks up dirty rows and re-syncs them.
  • excluded is for rows that should never enter the graph (spam, broken records, low-quality data). Excluded rows still live in content; they are just filtered out of graph-facing queries.

The state is indexed with a partial index WHERE graph_sync_status IN ('pending', 'dirty') so the export job can find work to do without scanning the whole table. Most rows are exported most of the time, so the partial index stays small.
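
In PostgreSQL terms, the index and the work query it serves look roughly like this; the index name and batch size are illustrative:

PARTIAL_INDEX = """
CREATE INDEX content_graph_sync_pending_idx
ON content (graph_sync_status)
WHERE graph_sync_status IN ('pending', 'dirty');
"""

FIND_WORK = """
SELECT content_id
FROM content
WHERE graph_sync_status IN ('pending', 'dirty')
LIMIT 500;
"""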


What the rest of the system gets

After this layer runs, the rest of the system gets to operate as if all platforms were the same.

The semantic search pipeline joins vector candidates back to content on weaviate_id and gets uniform engagement, recency, and creator signals without per-platform branches. The narrative cluster job groups content by tag co-occurrence and treats #fitness consistently across the data. The campaign brief endpoint pulls creator stats from creators, content evidence from content, and tag weight from tags and content_tags, all without knowing which platform the content originally came from.

This is the layer agents will rely on without ever seeing it. It is the contract that makes the rest of the surface possible.

The phrase that stays with us inside the team is that AI systems are only as intelligent as their ingestion contracts. Every piece of intelligence the platform offers is, ultimately, a consumer of these tables. Get the contract right and the surface above it gets to be expressive. Get it wrong and every feature inherits the mess.

This is the work that does not show up in a demo. It is also the work that determines whether the demo can scale into a product.