
Beyond Vector Search: Reranking Social Content with Real-World Signals

By KINETK Team


A user types "luxury running watch for marathon training" into a search.

The naive answer is to embed the query, find the nearest vectors in your video index, and return them in distance order. You will get visually similar watches. You will also get five-year-old stock photos, dead accounts, and clips that happened to have a watch in frame but are about something else entirely.

The right answer is harder. You want content that is visually relevant, currently meaningful, posted by creators who actually reach the right audience, and surfaced with enough context that an agent or a person can use the result for something. That is the gap between vector search and intelligence.

This post explains how Kinetk closes that gap. It walks through query expansion, multimodal embedding, reciprocal rank fusion, and signal-weighted reranking. It is the pipeline that powers our discovery API today, not a future plan.


For Everyone

Why "nearest vector wins" is not enough

Vector search is a strong recall engine. Given a query, it can find content that is semantically or visually similar even when the words do not line up. That is a step change from text search.

But recall is not ranking. The top of the result list, the items the user actually sees, needs to be more than "closest in latent space." On social content in particular, three things go wrong if you stop at vector distance.

The first is single-query bias. A user types one phrase. That phrase is one of many ways to describe what they want. A search for "luxury running watch for marathon training" will miss content that was matched by "premium GPS watch for runners" or "marathon-grade smart watch." The right query depends on how creators actually caption their content, which the user does not know.

The second is the multimodal gap. When you compare a text query to a video using a shared embedding model, the similarity scores are systematically lower than when you compare two videos to each other. A perfect text-to-video match might score 0.5 in cosine similarity. A duplicate video match might score 0.9. If you use a flat threshold for relevance, text queries return nothing useful. If you do not normalize, results from different modalities are not comparable.

The third is signal blindness. Vector distance does not know which creators have real audiences, whether a clip was posted yesterday or three years ago, or whether a piece of content is mid-engagement spike or already dead. All three matter to whether a result is worth showing.

The four-step pattern

Kinetk addresses these three problems with a four-step pipeline.

We expand the user's query into a few semantically related variants, so the search is not betting everything on one phrasing.

We embed each variant in our multimodal space and run vector search against the right target vectors for the content type, so a single query reaches both image and video content with the appropriate index.

We fuse the result lists using reciprocal rank fusion, a stable merge that gives more weight to content that appeared in multiple lists. RRF is a well-studied technique from information retrieval, and we use the standard form.

We then join the candidates back to the normalized metadata in our database and rerank them with real-world signals: engagement that is normalized across platforms, recency with decay, creator reach, and content fit with the underlying tag or narrative cluster.

The first three steps cast a wider, more robust net. The fourth step decides what is actually worth showing.

A worked example

Let's walk through it with the running watch query.

Step one. The user query goes to a small language model with a tight prompt. The model returns three variants such as "premium GPS watch for runners," "high-end marathon training watch," and "luxury sports smart watch for endurance athletes." The original query is kept too. We now have four queries.

Step two. Each query gets embedded in the same multimodal space we use for video and image content. We run each one against the relevant target vectors. Text queries against image content go through the image vector. Text queries against video content go through the video vector. For unknown content types we fan out and try both.

Step three. We merge the resulting lists with reciprocal rank fusion. A clip that was returned by three of the four query variants gets scored higher than a clip that only matched one. A clip that matched on both the image vector and the video vector gets credit for both. The fusion is stable across query rephrasings, which is the property we care about.

Step four. We take the top candidates and look them up in our database. We now know the creator, their follower count, the platform, the post date, the tags, the engagement counts, and the narrative cluster the content belongs to if any. We rerank with these signals.

The final list is not the closest vectors. It is the most useful, current content for what the user actually meant.

What this returns

A search query through this pipeline returns a ranked list of content with the relevance score and a breakdown of which signals contributed. It returns the creator of each piece with cross-platform context. It returns the community or narrative cluster the content sits in. It returns the tag and engagement profile.

This is not "ten search results." It is evidence-backed context that an agent or analyst can build on without having to re-derive any of it.

Why this is harder than it sounds

The pieces above sound simple. Each one has a non-obvious failure mode that takes real engineering to handle.

Query expansion has to fail gracefully. If the LLM is slow, returns an unparseable response, or the API key is not configured, the search still has to run. Our expansion treats the original query as the floor, not the ceiling, so any failure path falls back to single-query search rather than breaking.

Multimodal embedding has to be coherent across formats. The same model has to produce comparable vectors for text, images, and video segments. The result-set normalization has to account for the multimodal gap so cross-modal queries do not silently return nothing.

Rank fusion has to merge across two axes at once: multiple query variants and multiple target vectors. Naive concatenation overcounts the most popular vector. RRF's per-rank decay handles it cleanly, which is why it is the standard.

The database join has to be cheap and exact. Vector candidates come back with a join key into the structured store. If the join is slow or the data is stale, the rerank stage starves. We index the join key, batch the lookup, and run it in parallel with the next vector page when possible.

Signal-weighted reranking has to be platform-aware. The fields that mean "likes" on TikTok, Instagram, Reddit, and Pinterest are not the same field, do not carry the same meaning, and do not compose into a single score without normalization. The hard work was not picking the signals. It was making them comparable.


For Builders

For technical readers, here is how the pipeline is structured.

The architecture

flowchart TD
    Q[User query] --> E[LLM query expansion]
    E --> V[Text-to-multimodal embedding]
    V --> W[Vector search per target vector]
    W --> RRF[Reciprocal rank fusion]
    RRF --> J[Join content metadata in database]
    J --> S[Signal-weighted reranking]
    S --> A[Evidence-backed results]

Five stages. Each is independently testable, each has its own failure mode, and each can degrade without taking the whole pipeline down.
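
As a rough sketch of how those stages compose, the control flow looks something like the Python below. The function names (expand_query, embed_texts, vector_search, rrf_fuse, enrich, rerank) are illustrative placeholders rather than our production API; the real pipeline wraps each call in batching, timeouts, and diagnostics.

# Illustrative composition of the stages; helper names are placeholders.
def search(query: str, filters: dict, expand: bool = True) -> list[dict]:
    variants = expand_query(query) if expand else [query]         # stage 1: always >= 1 query
    embeddings = embed_texts(variants)                             # stage 2: shared multimodal space
    ranked_lists = {
        (variant, target): vector_search(vec, target=target)      # one list per (variant, target vector)
        for variant, vec in zip(variants, embeddings)
        for target in ("image", "video")
    }
    candidates = rrf_fuse(ranked_lists, k=60)                      # stage 3: reciprocal rank fusion
    enriched = enrich(candidates, filters)                         # stage 4: batched metadata join + filters
    return rerank(enriched)                                        # stage 5: signal-weighted scoring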

Query expansion

The user query is sent to a small LLM with a tight prompt: produce N semantically related queries for searching social media, return a JSON array, no explanation.

The expansion is optional in two senses. First, callers can disable it on a per-request basis. Second, the implementation is defensive at every level. Missing API key returns the original query. Non-2xx response returns the original query. Unparseable response returns the original query. Timeout returns the original query. The result is always a list of at least one query, never an error.

This matters because any user-facing latency budget will sometimes be exceeded by an outbound LLM call. When it is, we want the search to complete with the original query rather than fail. Expansion is a multiplier on quality, not a precondition for results.
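
A minimal sketch of that defensive shape, assuming an OpenAI-compatible chat endpoint and a hypothetical EXPANSION_API_KEY environment variable (both are placeholders, not our actual configuration):

import json
import os

import requests

def expand_query(query: str, n: int = 3, timeout: float = 2.0) -> list[str]:
    # Returns the original query plus up to n variants. Never raises:
    # every failure path degrades to single-query search.
    api_key = os.environ.get("EXPANSION_API_KEY")
    if not api_key:
        return [query]
    prompt = (
        f"Produce {n} semantically related queries for searching social media "
        f"content. Return a JSON array of strings, no explanation.\nQuery: {query}"
    )
    try:
        resp = requests.post(
            "https://llm.example.com/v1/chat/completions",   # placeholder endpoint
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": "small-model", "messages": [{"role": "user", "content": prompt}]},
            timeout=timeout,
        )
        if resp.status_code != 200:
            return [query]
        variants = json.loads(resp.json()["choices"][0]["message"]["content"])
        if not isinstance(variants, list):
            return [query]
        variants = [v.strip() for v in variants if isinstance(v, str) and v.strip()]
        return [query] + variants[:n]
    except Exception:
        # timeout, connection error, malformed JSON: fall back to the original query
        return [query]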

Multimodal embedding

We use a multimodal embedding model that produces a high-dimensional vector for text, image, or video input. The same vector space is used at both query time and ingestion time, so a text query lands in the same coordinate system as the indexed media. Cross-modal retrieval works because of that shared space, not because of any tricks at search time.

Two practical choices. We embed the expansion variants in a single batch when the model supports it, which keeps the per-query latency closer to one round trip than four. And we treat the embedding service as a circuit-breakable dependency: if it is slow, we cap how many variants we will embed before issuing the search.
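
A compressed sketch of both choices; embed_batch and is_degraded stand in for the embedding client and the circuit-breaker check, which are not shown:

MAX_VARIANTS_WHEN_DEGRADED = 2

def embed_variants(variants: list[str]) -> list[list[float]]:
    # Cap the fan-out when the embedding service is slow, then embed the
    # remaining variants in a single batched call instead of one call each.
    if is_degraded("embedding-service"):
        variants = variants[:MAX_VARIANTS_WHEN_DEGRADED]   # original query is always index 0
    return embed_batch(variants, modality="text")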

Reciprocal rank fusion

The classic RRF formula scores a document as the sum over result lists of 1 / (rank + k), where k is a constant and rank is the document's position in that list, starting at 1. We use k = 60, which is the constant from Cormack, Clarke, and Buettcher's original 2009 paper and remains the standard in nearly every practical RRF deployment.

For each (variant, target_vector) result list, we walk the rows and add 1 / (rank + 60) to an accumulator keyed on the content's identifier. Content that appears in multiple lists accumulates score. The accumulator also tracks which target vectors hit, so the rerank stage can credit cross-modal robustness as a separate signal: a clip that surfaced in both image and video search is more confidently on-topic than one that only surfaced in one.

The output is a single fused candidate set, ranked by accumulated RRF score. This is what we hand to the database join stage.
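
In code, the fusion is a small accumulator. A sketch, where ranked_lists maps each (variant, target_vector) pair to the ordered content ids returned by that search:

from collections import defaultdict

def rrf_fuse(ranked_lists: dict[tuple[str, str], list[str]], k: int = 60):
    # Accumulate 1 / (rank + k) per appearance, and remember which target
    # vectors each candidate was retrieved by for the rerank stage.
    scores = defaultdict(float)
    hit_targets = defaultdict(set)
    for (variant, target), content_ids in ranked_lists.items():
        for rank, content_id in enumerate(content_ids, start=1):
            scores[content_id] += 1.0 / (rank + k)
            hit_targets[content_id].add(target)          # e.g. {"image", "video"}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [(cid, scores[cid], hit_targets[cid]) for cid in ordered]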

A few practical notes on k. The constant is what stabilizes RRF against rank ties at the top of the list. If you use a smaller constant, the top result of any list dominates the merge. If you use a larger constant, the contribution of position flattens and the merge starts behaving like uniform voting. Sixty is the empirical sweet spot from the IR literature, which is why everyone uses it. We did not tune it.
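
To make the effect of k concrete, evaluate the formula at a couple of ranks. With k = 60, rank 1 contributes 1/61 ≈ 0.0164 and rank 2 contributes 1/62 ≈ 0.0161, a gap of under 2 percent, so agreement across lists matters more than winning any single list. With k = 1, rank 1 contributes 0.5 and rank 2 contributes 0.33, a 50 percent gap, and the top item of one list can outvote consistent mid-rank appearances in several others.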

Database join and metadata enrichment

The fused candidate set is identifiers, not full records. We look them up in the canonical database store using indexed batch queries. The join returns content rows enriched with creator, community, tags, engagement counters, and pointers into the narrative cluster table when applicable.

This is also where filtering happens. If the request asked for a particular platform, time window, content type, or creator, we apply those filters at the database level. We do not do metadata filtering inside the vector store, because vector filters interact badly with the recall guarantees of HNSW-style indexes once the filter is selective enough to push the search out of the graph's dense region.
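
A sketch of the join under assumed table and column names (content, creators, narrative_membership are illustrative, not our actual schema), using a single indexed batch query with the request filters pushed into the same statement, psycopg2-style:

JOIN_SQL = """
    SELECT c.content_id, c.platform, c.published_at, c.tags,
           c.view_count, c.like_count, c.comment_count, c.share_count,
           cr.handle, cr.follower_count, n.cluster_id
    FROM content c
    JOIN creators cr ON cr.creator_id = c.creator_id
    LEFT JOIN narrative_membership n ON n.content_id = c.content_id
    WHERE c.content_id = ANY(%(ids)s)
      AND (%(platform)s IS NULL OR c.platform = %(platform)s)
      AND (%(since)s IS NULL OR c.published_at >= %(since)s)
"""

def enrich(candidate_ids, conn, platform=None, since=None):
    # One round trip for the whole candidate set; content_id is indexed,
    # so the lookup stays cheap even for large fused sets.
    with conn.cursor() as cur:
        cur.execute(JOIN_SQL, {"ids": candidate_ids, "platform": platform, "since": since})
        return cur.fetchall()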

Signal-weighted reranking

With enriched candidates in hand, we compute a final score per candidate using a weighted sum of normalized signals.

final_score =
    w_sim   * similarity
  + w_eng   * engagement
  + w_rec   * recency
  + w_reach * author_reach
  + w_depth * engagement_depth

The weights are tunable per intent. A campaign brief search uses different weights than an emerging-narrative discovery search, because the question being asked is different.

Each signal has its own normalization step. Similarity is the candidate's RRF score normalized within the result set, not across queries, because of the multimodal gap. Engagement is a log-scale combination of view, like, comment, and share counts, normalized by the maximum value in the candidate set; the combination intentionally weights deeper interactions above shallower ones. (TikTok's diggCount does not mean the same as Instagram's likeCount, and neither equals Reddit's score, so platform-specific corrections happen upstream of this stage.) Recency is the candidate's published timestamp normalized between the oldest and newest in the set. Engagement depth is a quality signal we maintain alongside the raw counts, intended to surface content that is more deeply engaged with rather than only widely seen.
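
A condensed sketch of that scoring step. The weights and field names are illustrative, and the platform-specific corrections described above are assumed to have already run, so the counts here are comparable across platforms:

import math

def rerank(rows, w_sim=0.4, w_eng=0.25, w_rec=0.2, w_reach=0.1, w_depth=0.05):
    # rows carry rrf_score, engagement counts, published_at (datetime),
    # follower_count, and engagement_depth; weights are illustrative defaults.
    if not rows:
        return rows
    max_rrf = max(r["rrf_score"] for r in rows) or 1.0
    max_eng = max(log_engagement(r) for r in rows) or 1.0
    max_reach = max(r.get("follower_count") or 0 for r in rows) or 1.0
    oldest = min(r["published_at"] for r in rows)
    newest = max(r["published_at"] for r in rows)
    span = (newest - oldest).total_seconds() or 1.0
    for r in rows:
        r["final_score"] = (
            w_sim   * (r["rrf_score"] / max_rrf)
          + w_eng   * (log_engagement(r) / max_eng)
          + w_rec   * ((r["published_at"] - oldest).total_seconds() / span)
          + w_reach * ((r.get("follower_count") or 0) / max_reach)
          + w_depth * (r.get("engagement_depth") or 0.0)
        )
    return sorted(rows, key=lambda r: r["final_score"], reverse=True)

def log_engagement(r):
    # Log-scale combination, weighting deeper interactions above shallower ones.
    return (1.0 * math.log1p(r.get("view_count") or 0)
          + 2.0 * math.log1p(r.get("like_count") or 0)
          + 3.0 * math.log1p(r.get("comment_count") or 0)
          + 4.0 * math.log1p(r.get("share_count") or 0))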

Cluster-level signals (tag arbitrage, narrative membership, platform opportunity, attribute lift) are not part of per-content reranking. They are computed by a separate analytics layer that runs on the discover endpoint, on top of these scored rows. We keep the per-content score simple and stable; the richer cluster view layers cleanly on top.

Sparse metadata fallback

Real social data is incomplete. A scan record might have a creator handle but no follower count. It might have a published timestamp but no tag list. It might have an embedding but no platform string. The pipeline cannot stop on missing fields.

The scoring stage handles this with an explicit fallback path. When fewer than 30% of the candidate rows have view counts, we drop the engagement and reach signals entirely and rerank using only similarity and recency. This avoids meaningless engagement scores when the data is sparse, and gives the system stable behavior on the long tail of low-coverage queries. It is a real behavior of the deployed pipeline, not a theoretical safety net.
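
A sketch of that gate, sitting in front of the weighted sum from the previous section; the 30% threshold is the deployed behavior, while the specific weight values are illustrative:

def choose_weights(rows):
    # With sparse view-count coverage, engagement and reach scores would be
    # dominated by missing data, so fall back to similarity + recency only.
    coverage = sum(1 for r in rows if r.get("view_count") is not None) / max(len(rows), 1)
    if coverage < 0.30:
        return {"w_sim": 0.6, "w_eng": 0.0, "w_rec": 0.4, "w_reach": 0.0, "w_depth": 0.0}
    return {"w_sim": 0.4, "w_eng": 0.25, "w_rec": 0.2, "w_reach": 0.1, "w_depth": 0.05}

Callers combine the two as rerank(rows, **choose_weights(rows)).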

The only signal we hard-require is the join into our database. If a vector candidate cannot be matched to a content row, it is dropped. This shows up at the edges of the dataset and surfaces in our diagnostics so we can backfill the metadata where needed.


What this enables, concretely

Consider an analyst building a campaign brief for a smart-watch launch. They want creators who are currently active in marathon-training content, with high audience overlap among health-conscious millennials, on the platforms the brand cares about.

Without this pipeline, the analyst writes a search, looks at results, refines the search, looks again, opens spreadsheets, normalizes platforms by hand, and after a day or two has a list of guesses. The brief gets built on whatever the searcher found and whatever they remembered.

With this pipeline, the analyst writes one search. The result is a ranked list of currently-trending creators making relevant content, with engagement normalized across platforms, recency weighted, follower reach factored in, and the underlying clips returned as evidence. The brief gets built on data the analyst can audit row by row.

The same pipeline serves agents that build briefs autonomously. The output shape is structured, the fields are explicit, and every score is traceable to a signal. An LLM agent that reads the result has everything it needs to compose a brief without being asked to redo the retrieval work.

The pipeline also serves two endpoint shapes. A lighter one returns ranked content with diagnostics, intended for callers that want raw evidence. A heavier one runs the same retrieval and then layers narrative clustering, tag and creator analytics, platform-arbitrage signals, and context-aware narrative labels on top. The agent picks the one that matches the question.

Vector search is a starting point. The actual product is what happens after the candidates come back. That is the work this pipeline does, and it is the work that turns a similarity engine into something a campaign team or an autonomous agent can rely on.