Show HN: Ranked Search for Semi-Structured Data

We’ve been working on a search problem that requires querying both text and numbers simultaneously. For example, in a dataset of clothing items with descriptions and prices, a search for “slim pants for $20” should prioritize skinny jeans for $25 over slim pants for $50 because they are semantically similar and the price is closer. I’ve found that standard embedding models struggle with numerical ordering, while text-to-SQL methods rely on exact matches and often filter out too many results.

To solve this, we built a system designed specifically for structured datasets like CSVs or tables. Here’s a demo link where you can upload a small CSV to try out (no login required): https://demo.tryvoker.com.

Unlike most RAG approaches, we process each column independently, handling text with embeddings and numbers with custom scoring. When a user submits a query, we parse it into relevant fields—for instance, extracting “slim pants” as the description and “20” as the price. We then compute cosine similarity between the description embeddings and “slim pants” while also calculating the percent error between the user’s price input and the numerical field. These individual similarity scores are then combined across all columns to generate a final ranking.

Right now, our system works best with well-structured data, so some preprocessing is often needed. We’re working on improving this by detecting and restructuring messy data automatically, such as pivoting columns or extracting attributes from large text fields. We’re also adding feedback mechanisms, like a thumbs up/down system, to refine future search results based on user input. I’d love to hear about your experiences with similar search challenges and would appreciate any feedback!

Comments URL: https://news.ycombinator.com/item?id=43196710

Points: 10

# Comments: 3

https://demo.tryvoker.com

Vytvořeno 3h | 27. 2. 2025 20:30:16

Chcete-li přidat komentář, přihlaste se