
Building a Multimodal Knowledge Graph for the Autonomous Internet

By KINETK Team


The internet is increasingly visual, cross-platform, and narrative-driven. A trend on TikTok shows up reshot on Instagram, screenshotted on Reddit, and stitched into a YouTube short within hours. The text that travels with it is sparse, noisy, and inconsistent across platforms. The next generation of software, the AI agents people are building right now, will need to make sense of all of this in real time.

At Kinetk we are building the contextual intelligence layer those agents will rely on. This post explains how we are putting it together: why we combine semantic search over media with a normalized graph of creators, communities, metadata, and narratives, and why neither piece alone is enough.


For Everyone

The problem

Most social intelligence systems today fall into two camps. One indexes captions and hashtags and runs text search over them. The other tracks engagement metrics in platform-by-platform dashboards. Both miss the shape of culture as it actually moves online.

Culture today is visual. Captions lie or are missing. Hashtags drift in meaning. The same fitness clip can spread across five platforms with five different titles, three different creators, and zero overlap in hashtags. If your tool only reads captions, you miss most of it.

Culture is also relational. A creator who reaches a million viewers but has no audience overlap with peers in their niche is shaping a narrative differently from a creator who sits at the center of a tightly connected community. Engagement counts alone cannot tell you which one matters more for a brief or a campaign.

And culture moves fast. By the time a topic shows up in a trending dashboard, it has already peaked. The interesting moment is the rise, when something is forming but has not yet caught the attention of the mainstream tools.

A different approach

Instead of either pure text search or pure metrics, Kinetk combines three layers.

The first is a multimodal vector layer. Think of it as a way to search across pixels and frames the same way you search across words. Two clips are close in this space if they look and feel alike, regardless of what their captions say. This is how we handle the case where the captions disagree but the content is the same.

The second is a clean, structured database. We take messy platform-specific scan results and turn them into stable rows about content, creators, communities, and tags, with relationships between them. This is the part that lets you ask questions like "who else is posting in this space" or "what tags travel together" without writing a custom parser for every platform.

The third is a layer of derived intelligence. Some answers, like "what narratives are emerging this week", are too expensive to compute fresh on every query. So we precompute them and store them as graph-shaped read models that the rest of the system can pull on demand.

Together, these three layers turn raw social signal into something an AI agent or an analyst can actually reason over.

A running example

Imagine a fitness trend forming across short-form video platforms. A specific exercise sequence picks up momentum first among five mid-sized creators on one platform, then gets reshot by larger creators on a second platform, then begins to appear in screenshots on a third.

With text search alone, you would find scattered mentions of the exercise name, but only if creators happened to caption it. You would miss everyone who just posted the video without typing the term.

With dashboards alone, you might see a small bump in a generic fitness category. The actual trend is a tight cluster, and the cluster is invisible at the platform-level rollup.

Kinetk handles this differently. The video embeddings group all the visually similar clips into one cluster, even when their captions disagree. The structured database tells you which creators posted those clips, what communities they belong to, and which tags travel with them. The narrative layer flags the cluster as emerging because it is growing fast, has diverse creators, and is appearing on more than one platform.

What an agent gets back is not just a search result, but a story. This is the trend, here is who is shaping it, here is where it is moving, here is the evidence.

What this system can answer

In plain language, the system can answer:

  • Show me content that looks like this clip, regardless of platform.
  • Which creators are spreading similar content right now?
  • Which communities are picking up a given topic, and how fast?
  • What narratives are forming this week that nobody is writing about yet?
  • Give me evidence-backed context about a topic so I can write a brief or train an agent.

None of these are new questions. What is new is being able to answer them across platforms, on top of visual content, with structured context, in a single query.

Why this is harder than it sounds

The shape of the system looks tidy in a diagram. The work to make it real is not.

Three problems tend to be underestimated. The first is entity resolution. The same creator usually exists on three or four platforms with different handles and different metadata schemas. The same video gets reshot, screenshotted, and remixed until the only thing tying versions together is the visual content itself.

The second is freshness. Culture moves in hours, not months. A trend dashboard that refreshes weekly is already too late. Continuous ingestion across the public social web, with parsing, embedding, deduplication, and graph updates running constantly, is its own engineering project.

The third is multimodal coverage. Most search systems index text. The interesting signal in modern social media is in the pixels and frames. Generating, storing, and querying high-dimensional embeddings at scale, then keeping them in sync with a relational store, requires infrastructure choices that compound over time.

None of these problems are unsolvable on their own. The reason this work has not been done already is that solving all three at once, and keeping the result coherent as platforms change, takes a kind of compounding pipeline work that does not show up in a demo. That is the actual moat.


For Builders

For technical readers, here is how the layers come together.

The architecture

flowchart LR
    A[Raw scan events] --> B[Metadata normalization]
    C[Media embeddings] --> D[Vector index]
    B --> E[KG-ready database]
    D --> F[Semantic retrieval]
    E --> F
    E --> G[Graph read models]
    F --> H[Re-ranked evidence]
    G --> H
    H --> I[Agents, APIs, dashboard]

Three pipelines feed the intelligence layer. Raw scan events get parsed and normalized into a relational store. Media files get embedded into a multimodal vector space. The two stores are joined by stable identifiers and then used together at query time.
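
As a minimal sketch of that join, assume a vector index that returns (content_id, score) pairs and a relational content table keyed by the same identifier. The names below are illustrative, not our actual API.

from dataclasses import dataclass

@dataclass
class VectorHit:
    content_id: str   # stable join key shared with the relational store
    score: float      # similarity against the query embedding

def enrich_hits(hits: list[VectorHit], content_table: dict[str, dict]) -> list[dict]:
    """Join semantic recall with canonical metadata at query time."""
    enriched = []
    for hit in hits:
        row = content_table.get(hit.content_id)
        if row is None:   # embedded but not yet ingested into the relational store
            continue
        enriched.append({**row, "similarity": hit.score})
    return enriched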

Multimodal embeddings as recall

We use a multimodal embedding model that maps images, videos, and text into a shared high-dimensional space. A search for a text phrase returns similar visual content, and a search for an image returns similar videos. This is the recall stage. It casts a wide net of plausibly relevant items even when the text metadata is unhelpful.

There is one caveat to design around. Cross-modal similarity is systematically lower than within-modal similarity. A text query rarely scores high against a video, while a video can score very high against another video. We normalize within result sets rather than across them, and we never use a global threshold for relevance.
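
A minimal sketch of that choice, assuming raw cosine similarities from the shared space; the scaling method here is illustrative.

import numpy as np

def normalize_within_set(scores: np.ndarray) -> np.ndarray:
    """Rescale one result set's similarities to [0, 1].

    Cross-modal scores (text query against video) sit in a lower band than
    within-modal scores (video against video), so a single global cutoff
    would systematically drop cross-modal results.
    """
    lo, hi = scores.min(), scores.max()
    if hi - lo < 1e-9:              # degenerate set: every candidate scores the same
        return np.ones_like(scores)
    return (scores - lo) / (hi - lo)

# Each query's candidates are normalized independently, so a text-to-video
# search and a video-to-video search can both surface their best matches.
text_to_video = normalize_within_set(np.array([0.21, 0.18, 0.15]))
video_to_video = normalize_within_set(np.array([0.93, 0.88, 0.71]))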

The hard problem in this layer is not running the model. It is keeping a coherent embedding space across formats and platforms, handling video with no usable caption, and deciding when two visually similar clips are the same content versus a coincidental match. Doing this on a continuous feed at production scale is what most off-the-shelf vector setups are not built for.
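
As one hedged sketch of the same-content decision, consider within-modal (video-to-video) similarity plus a runtime check. The threshold and the duration heuristic are placeholders; the production policy layers in more signals per platform.

import numpy as np

def likely_same_content(emb_a: np.ndarray, emb_b: np.ndarray,
                        duration_a: float, duration_b: float,
                        sim_threshold: float = 0.97) -> bool:
    """Treat two clips as one piece of content only on very strong evidence."""
    cos = float(np.dot(emb_a, emb_b) /
                (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    # Reshoots and re-uploads keep roughly the same runtime; a clip that merely
    # resembles the original usually does not. (Illustrative heuristic only.)
    durations_close = abs(duration_a - duration_b) <= 0.1 * max(duration_a, duration_b)
    return cos >= sim_threshold and durations_close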

Canonical metadata

The vector store gives us recall. The structured store gives us context. The canonical database holds:

  • content: one row per scanned media item, with platform, creator, engagement, tags, and a join key into the vector store
  • creators: one row per author per platform, with follower stats and verification
  • communities: platform-level groups such as subreddits
  • tags: canonical, per-platform, normalized
  • content_tags: the many-to-many edge table that lets us treat tags as graph nodes
  • content_similarity_edges: vector-derived neighbor edges, written by a scheduled job
  • narrative_clusters and topic_metrics: precomputed read models, refreshed on a schedule

Each of these tables is intentionally simple. The complexity lives in the relationships between them, not inside any single row. Engagement rate is a generated column. Tags are stored as both an array on content and as proper rows in tags so that we can query them either way.
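
A simplified sketch of those rows and edges, using Python dataclasses in place of the actual DDL. Field names are illustrative; the points to notice are the stable join key into the vector store, the derived engagement rate, and the explicit edge tables.

from dataclasses import dataclass, field

@dataclass
class Content:
    content_id: str                 # join key shared with the vector index
    platform: str
    creator_id: str
    likes: int
    comments: int
    shares: int
    views: int
    tags: list[str] = field(default_factory=list)  # denormalized copy for fast filters

    @property
    def engagement_rate(self) -> float:
        """Mirrors the generated column: interactions per view."""
        return (self.likes + self.comments + self.shares) / max(self.views, 1)

@dataclass
class ContentTag:
    """Many-to-many edge so tags can be treated as graph nodes."""
    content_id: str
    tag_id: str

@dataclass
class SimilarityEdge:
    """Vector-derived neighbor edge, written by a scheduled job."""
    source_id: str
    target_id: str
    similarity: float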

Re-ranking with real-world signals

Vector similarity gets us a candidate set. The candidates then get scored against a small set of signals that matter for usefulness: engagement (likes, comments, shares, normalized for view count), recency (with decay), creator weight (followers and verification), and tag-cluster fit. The result is a ranked list where the top items are not just visually similar, but also contextually meaningful.

This step is the difference between a search tool and an intelligence tool. Embeddings alone retrieve everything that resembles the query. Re-ranking with real-world signals returns the things actually worth surfacing to a downstream agent or user.

The naive version of re-ranking is to trust raw engagement counts. In practice, a like on one platform does not mean what a like means on another, follower counts often overstate reach, and recency without decay overweights the last hour. Each signal has to be normalized in a platform-aware way, and the weights have to be tuned to the kind of question the system is being asked.
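
A minimal re-ranking sketch that combines those signals; the weights, decay constant, and per-platform baselines below are placeholders rather than tuned values.

import math
import time

# Hypothetical per-platform engagement baselines: a like does not mean the same
# thing everywhere, so raw rates are scaled before mixing.
PLATFORM_ENGAGEMENT_BASELINE = {"tiktok": 0.06, "instagram": 0.03, "reddit": 0.01}

def rerank_score(candidate: dict, half_life_hours: float = 48.0) -> float:
    weights = {"similarity": 0.4, "engagement": 0.25, "recency": 0.2,
               "creator": 0.1, "tag_fit": 0.05}

    # Engagement normalized for view count, then against a platform baseline.
    interactions = candidate["likes"] + candidate["comments"] + candidate["shares"]
    rate = interactions / max(candidate["views"], 1)
    baseline = PLATFORM_ENGAGEMENT_BASELINE.get(candidate["platform"], 0.03)
    engagement = min(rate / baseline, 2.0) / 2.0      # cap outliers, scale to [0, 1]

    # Exponential recency decay: half the weight every half_life_hours.
    age_hours = (time.time() - candidate["posted_at"]) / 3600
    recency = 0.5 ** (age_hours / half_life_hours)

    # Creator weight: log-scaled follower count plus a small verification boost.
    creator = min(math.log10(candidate["followers"] + 1) / 7, 1.0)
    if candidate.get("verified"):
        creator = min(creator + 0.1, 1.0)

    return (weights["similarity"] * candidate["similarity"]
            + weights["engagement"] * engagement
            + weights["recency"] * recency
            + weights["creator"] * creator
            + weights["tag_fit"] * candidate.get("tag_cluster_fit", 0.0))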

Precomputed read models for narratives

Some queries are too expensive to compute live. "Show me emerging narratives this week" is one of them. It would require clustering tens of millions of items, scoring co-occurrence across creators and platforms, and ranking by velocity, every time someone hit the endpoint.

Instead, we run a scheduled job that builds tag co-occurrence communities, expands them through high-confidence vector neighbors, and stores the result in a narrative_clusters table along with momentum metrics. The API then reads the precomputed clusters and joins them with live evidence at query time. Reads stay fast, even when the underlying graph is large.

The same pattern applies to topic metrics, creator metrics, and similarity edges. Anything expensive moves to a scheduled refresh. Anything time-sensitive happens at request time.

Building the clusters well is its own problem. Tag co-occurrence alone is too noisy. Vector neighbors alone are too dense. The two have to be combined and pruned, with creator diversity and platform spread used as filters, so that a real emerging narrative does not get drowned out by a single creator posting many variations of the same clip.
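
A hedged sketch of that job, assuming networkx for community detection; content_rows and neighbor_edges stand in for the content and content_similarity_edges tables, and the thresholds and choice of algorithm are illustrative.

from collections import Counter
from itertools import combinations
import networkx as nx

def build_narrative_clusters(content_rows, neighbor_edges,
                             min_cooccurrence=3, min_creators=5, min_platforms=2):
    # 1. Tag co-occurrence graph, pruned by a minimum co-occurrence count.
    cooc = Counter()
    for row in content_rows:
        for a, b in combinations(sorted(set(row["tags"])), 2):
            cooc[(a, b)] += 1
    graph = nx.Graph()
    for (a, b), n in cooc.items():
        if n >= min_cooccurrence:
            graph.add_edge(a, b, weight=n)
    if graph.number_of_nodes() == 0:
        return []

    # 2. Communities of co-occurring tags are candidate narratives.
    communities = nx.algorithms.community.greedy_modularity_communities(graph, weight="weight")

    clusters = []
    for tag_group in communities:
        members = [r for r in content_rows if set(r["tags"]) & set(tag_group)]
        member_ids = {r["content_id"] for r in members}

        # 3. Expand with high-confidence vector neighbors of member content.
        for edge in neighbor_edges:
            if edge["source_id"] in member_ids and edge["similarity"] >= 0.9:
                member_ids.add(edge["target_id"])

        # 4. Prune: one prolific creator or one platform is not an emerging narrative.
        creators = {r["creator_id"] for r in members}
        platforms = {r["platform"] for r in members}
        if len(creators) >= min_creators and len(platforms) >= min_platforms:
            clusters.append({"tags": sorted(tag_group), "content_ids": member_ids})
    return clusters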


What this enables, concretely

Consider an agent helping a brand decide where to spend on a creator campaign. Without a layer like this, the agent has to scrape platforms separately, build its own normalization, compute its own audience overlap, and trust its own engagement weights. The work takes months and the result is brittle the moment a platform changes a field name.

With a multimodal knowledge graph in place, the agent makes a single query. It gets back a ranked list of creators who fit the campaign theme, the narratives they are shaping, the communities they reach, engagement signals already weighted across platforms, and the underlying content as evidence. Every claim is traceable to a row.

For richer queries the system can do more. A discovery call returns the same evidence plus the narratives the content sits inside, with context-aware labels and a small set of quantified arbitrage observations specific to the query. The agent does not have to write its own clustering or come up with its own narrative names. The layer hands them back ready to use.

That is the difference between an agent that retrieves and an agent that reasons. Retrieval is solved. Reasoning at the scale of the social internet is what is missing, and reasoning needs structure.

The internet is becoming a place where software, not people, drives most decisions. Those autonomous systems will need contextual, cross-platform intelligence about the world they operate in. A multimodal knowledge graph is one of the necessary parts, and the work of building it well takes years of compounding pipeline development. That is the work we are doing.

If any of this resonates with you, whether you are a researcher, a builder, a founder, or a creator thinking about how AI is going to reshape your work, we would love to hear from you.