Boneyard Tools

AI Semantic Dedupe

Clean a list down to its unique ideas, not just its unique strings. Paste a list (one item per line) and a MiniLM embedding model removes lines that MEAN the same as an earlier one, keeping the first of each near-duplicate group. It catches paraphrases that exact-match dedupe leaves behind, then hands back a tidy, copy-ready list and shows which removed line mapped to which kept one. Everything runs in your browser, so nothing is uploaded; the model downloads once on first use, then is cached.

How to remove near-duplicate lines

  1. Paste your list into the box, one item per line.
  2. Set the similarity threshold: a higher value removes only close paraphrases and so keeps more lines; a lower value also removes looser matches.
  3. Click Remove duplicates. The first run downloads the model; once it finishes, copy the cleaned list from the output.

Examples

Drop a paraphrase a plain dedupe keeps

Lines: 'Ship the order by Friday.' / 'Make sure the order goes out before Friday.' / 'Refund the customer.'
Keeps 'Ship the order by Friday.' and 'Refund the customer.'; removes the Friday paraphrase.

Frequently asked questions

How is this different from removing duplicate lines normally?

A plain duplicate remover only drops lines that match character for character. This compares meaning using sentence embeddings, so it also removes paraphrases like 'ship the order by Friday' and 'make sure the order goes out before Friday' that share few words. It uses cosine similarity, not string matching.
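As a rough illustration of what "comparing meaning" means here, the sketch below computes cosine similarity between vectors. The three-number vectors are hypothetical stand-ins chosen for readability; the real tool works with model-generated embeddings, and the function and variable names are assumptions for this example only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT real model output.
ship_by_friday = [0.9, 0.1, 0.0]
goes_out_before_friday = [0.8, 0.2, 0.1]
refund_customer = [0.0, 0.1, 0.9]

paraphrase_score = cosine_similarity(ship_by_friday, goes_out_before_friday)
unrelated_score = cosine_similarity(ship_by_friday, refund_customer)

# The two strings differ, so an exact-match dedupe would keep both lines.
strings_match = "ship the order by Friday" == "make sure the order goes out before Friday"
```

With these toy values the paraphrase pair scores well above the unrelated pair, even though the strings themselves do not match.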

Which line is kept when two are near-duplicates?

The first one in your list. The tool walks lines top to bottom and keeps a line unless its similarity to an already-kept line reaches the threshold, so earlier entries win. The output preserves the original order of the lines that are kept.
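The top-to-bottom pass described above can be sketched as follows. This is a minimal illustration, not the tool's actual source: the vectors are hypothetical toy values rather than model output, the function names are invented for the example, and mapping a removed line to the first kept line it matches (rather than the best match) is an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def greedy_dedupe(vectors, threshold=0.85):
    """Return (kept indices in original order, {removed index: kept index})."""
    kept = []
    removed = {}
    for i, vec in enumerate(vectors):
        # Compare only against lines that were already kept.
        match = next((k for k in kept if cosine(vec, vectors[k]) >= threshold), None)
        if match is None:
            kept.append(i)       # no earlier near-duplicate: keep this line
        else:
            removed[i] = match   # assumption: record the first kept line it matched
    return kept, removed

lines = [
    "Ship the order by Friday.",
    "Make sure the order goes out before Friday.",
    "Refund the customer.",
]
# Hypothetical toy vectors standing in for real embeddings.
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

kept, removed = greedy_dedupe(vectors)
cleaned = [lines[i] for i in kept]
```

Because each line is compared only with lines already kept, the result depends on input order: reordering the list can change which member of a near-duplicate group survives.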

Is my text uploaded anywhere?

No. The MiniLM embedding model runs entirely in your browser via WebAssembly. Your list is processed on your device and never sent to a server. Only the model is downloaded, once, then cached.

Which AI model does this use?

all-MiniLM-L6-v2, a compact sentence-transformer (about 23 MB) that maps each line to a 384-dimensional vector. It is fast, widely used for semantic similarity, and runs locally through transformers.js and ONNX.

What does the similarity threshold control?

It is the minimum cosine similarity for a line to count as a near-duplicate of an earlier one and be removed. A higher threshold (closer to 98%) removes only very close paraphrases; a lower one removes looser matches but risks dropping merely related lines. The default of 85% is a balanced starting point.
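To make the effect of the threshold concrete, here is a small sketch that applies different percentage settings to a few candidate lines. The similarity scores are hypothetical values picked for illustration, not measurements from the model, and the function name is an assumption.

```python
def removed_at(similarities, threshold_percent):
    """Names of candidates whose similarity meets the threshold (given as a percentage)."""
    threshold = threshold_percent / 100.0
    return [name for name, score in similarities.items() if score >= threshold]

# Hypothetical cosine similarities to an earlier, already-kept line.
candidates = {
    "close paraphrase": 0.99,
    "loose paraphrase": 0.90,
    "related topic": 0.70,
}

strict = removed_at(candidates, 98)    # only the closest match is removed
default = removed_at(candidates, 85)   # both paraphrases are removed
loose = removed_at(candidates, 65)     # the merely related line is removed too
```

The last case shows the risk of a low setting: a line that is only topically related gets treated as a duplicate.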
