Boneyard Tools

AI Near-Duplicate Finder

Spot entries that MEAN the same thing even when the words differ. Paste a list (one item per line) and a MiniLM embedding model compares every pair by meaning, so paraphrases like 'the meeting is at 3pm' and 'our 3 o'clock catch-up' get flagged even though exact-match and fuzzy string tools miss them. Each near-duplicate pair is shown with its similarity score, highest first. Everything runs in your browser, so nothing is uploaded; the model downloads once on first use and is cached after that.

How to find near-duplicate entries

  1. Paste your list into the box, one item per line.
  2. Set the similarity threshold (higher is stricter: only very close matches count).
  3. Click Find near-duplicates; the first run loads the model, then near-duplicate pairs appear ranked by similarity.
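As a rough illustration of step 1, here is a minimal sketch of turning pasted text into a list of items. The function name parseLines is hypothetical, and trimming whitespace and skipping blank lines are assumptions; the tool's actual handling of blank lines is not documented here.

```typescript
// Hypothetical helper: split pasted text into one item per line.
// Assumption: surrounding whitespace is trimmed and blank lines are ignored.
function parseLines(input: string): string[] {
  return input
    .split(/\r?\n/) // handle both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```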

Examples

Catch a paraphrase exact-match would miss

Lines: 'The meeting is at 3pm tomorrow.' / 'Tomorrow's meeting starts at 3 in the afternoon.' / 'The invoice total is $420.'
Flags the two meeting lines as a near-duplicate pair; the invoice line is left alone.

Frequently asked questions

How is this different from a normal duplicate finder?

A normal duplicate finder needs the text to match character for character (or close to it). This tool compares meaning using sentence embeddings, so it catches paraphrases like 'the meeting is at 3pm' and 'our 3 o'clock catch-up' even though they share almost no words. It scores pairs by cosine similarity between their embeddings, not by string matching.
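For readers curious what cosine similarity computes, here is a minimal sketch. The function name cosineSimilarity is illustrative and not taken from the tool's source; it only shows the standard formula, the dot product divided by the product of the two vector lengths.

```typescript
// Cosine similarity between two equal-length vectors.
// The result lies in [-1, 1]; values near 1 mean the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treating it as similarity 0 is an assumption.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the score depends only on the direction of the two vectors, two lines with very different wording can still score high if their embeddings point the same way.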

Is my text uploaded anywhere?

No. The MiniLM embedding model runs entirely in your browser via WebAssembly. Your list is processed on your device and never sent to a server. Only the model is downloaded, once, then cached.

Which AI model does this use?

all-MiniLM-L6-v2, a compact sentence-transformer (about 23 MB) that maps each line to a 384-dimensional vector. It is fast, widely used for semantic similarity, and runs locally through transformers.js and ONNX.

What does the similarity threshold control?

It is the minimum cosine similarity for two lines to be flagged as near-duplicates. A higher threshold (closer to 98%) only catches very close paraphrases; a lower one (toward 50%) catches looser matches but may flag merely related lines. The default of 82% is a good starting point.
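The role of the threshold can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the names findNearDuplicates, NearDuplicatePair and cosine are hypothetical, the embeddings are assumed to be already computed as plain number arrays, and the threshold is expressed as a fraction (0.82 for 82%) and treated as inclusive.

```typescript
// Hypothetical result shape for one flagged pair.
interface NearDuplicatePair {
  i: number; // index of the first line
  j: number; // index of the second line (always greater than i)
  score: number; // cosine similarity of the two embeddings
}

// Cosine similarity helper; returns 0 for zero vectors (an assumption).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let k = 0; k < a.length; k++) {
    dot += a[k] * b[k];
    normA += a[k] * a[k];
    normB += b[k] * b[k];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare every pair once, keep those at or above the threshold,
// and return them with the highest similarity first.
function findNearDuplicates(
  embeddings: number[][],
  threshold: number = 0.82
): NearDuplicatePair[] {
  const pairs: NearDuplicatePair[] = [];
  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      const score = cosine(embeddings[i], embeddings[j]);
      if (score >= threshold) {
        pairs.push({ i, j, score });
      }
    }
  }
  return pairs.sort((p, q) => q.score - p.score);
}

// Toy 2-dimensional vectors standing in for real 384-dimensional embeddings.
const toyEmbeddings = [
  [1, 0],
  [0.9, 0.1],
  [0, 1],
];
const flagged = findNearDuplicates(toyEmbeddings); // flags only the pair (0, 1)
```

Lowering the threshold in this sketch to 0.1 would also flag the pair (1, 2), which mirrors how a looser setting starts to pick up lines that are merely related.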

How big a list can it handle?

It comfortably handles hundreds to a few thousand lines. Because every pair is compared, the work grows with the square of the line count (n lines need n(n-1)/2 comparisons), so very large lists take noticeably longer. The model download happens only on first use; after that it is served from the cache.
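To get a feel for how quickly the work grows, the number of pairs can be computed directly. The helper name pairCount is illustrative; the formula is the standard count of unordered pairs among n items.

```typescript
// Number of unordered pairs among n lines: n * (n - 1) / 2.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}

const pairsFor100 = pairCount(100); // 4950
const pairsFor1000 = pairCount(1000); // 499500
const pairsFor3000 = pairCount(3000); // 4498500
```

Ten times as many lines means roughly a hundred times as many comparisons, which is why a list of a few thousand lines is noticeably slower than one of a few hundred.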
