AI Near-Duplicate Finder
Spot entries that MEAN the same thing even when the words differ. Paste a list (one item per line) and a MiniLM embedding model compares every pair by meaning, so paraphrases like 'the meeting is at 3pm' and 'our 3 o'clock catch-up' get flagged even though exact-match and fuzzy string tools miss them. Each near-duplicate pair is shown with its similarity score, highest first. Everything runs in your browser, so nothing is uploaded; the model downloads once on first use, then is cached.
How to find near-duplicate entries
- Paste your list into the box, one item per line.
- Set the similarity threshold (higher is stricter: only very close matches count).
- Click Find near-duplicates; the first run loads the model, then near-duplicate pairs appear ranked by similarity.
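Behind those steps, the comparison amounts to scoring every unordered pair of lines once and ranking the pairs that clear the threshold. The sketch below is illustrative only, not the tool's actual source: the names `findNearDuplicates` and `dot` are hypothetical, and it assumes each line has already been turned into a unit-length embedding vector, in which case the dot product equals the cosine similarity.

```typescript
// Hypothetical result shape: indices of the two lines plus their score.
interface Pair {
  i: number;
  j: number;
  score: number;
}

// Dot product; equals cosine similarity when both vectors have unit length.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let k = 0; k < a.length; k++) sum += a[k] * b[k];
  return sum;
}

// Compare every unordered pair (i < j) once and rank the hits highest first.
function findNearDuplicates(vectors: number[][], threshold: number): Pair[] {
  const hits: Pair[] = [];
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      const score = dot(vectors[i], vectors[j]);
      if (score >= threshold) hits.push({ i, j, score });
    }
  }
  return hits.sort((x, y) => y.score - x.score);
}

// Toy unit-length 2-D vectors standing in for real embeddings (assumed data).
const vectors = [
  [1, 0],
  [0.8, 0.6],
  [0, 1],
];
const hits = findNearDuplicates(vectors, 0.7);
console.log(hits);
```

With these toy vectors only lines 0 and 1 clear a threshold of 0.7, so a single pair is reported.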
Examples
Catch a paraphrase exact-match would miss
Lines: 'The meeting is at 3pm tomorrow.' / 'Tomorrow's meeting starts at 3 in the afternoon.' / 'The invoice total is $420.'
Flags the two meeting lines as a near-duplicate pair; the invoice line is left alone.
Frequently asked questions
How is this different from a normal duplicate finder?
A normal duplicate finder needs the text to match character for character (or close to it). This compares meaning using sentence embeddings, so it catches paraphrases like 'the meeting is at 3pm' and 'our 3 o'clock catch-up' even though they share almost no words. Pairs are scored by the cosine similarity of their embedding vectors, not by string matching.
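For readers curious about what cosine similarity computes, here is a minimal sketch. The function name `cosineSimilarity` and the three-number vectors are made up for illustration; real embeddings from the model have far more dimensions, and the zero-vector convention is an assumption.

```typescript
// Cosine similarity between two equal-length vectors.
// The result lies in [-1, 1]; values near 1 mean the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumed convention: a zero vector is treated as similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings (illustrative values only).
const meetingA = [0.9, 0.1, 0.0];
const meetingB = [0.8, 0.2, 0.1];
const invoice = [0.0, 0.1, 0.9];

console.log(cosineSimilarity(meetingA, meetingB).toFixed(2)); // high: similar direction
console.log(cosineSimilarity(meetingA, invoice).toFixed(2)); // low: nearly orthogonal
```

The score depends only on the direction of the vectors, not on which characters the two lines happen to share.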
Is my text uploaded anywhere?
No. The MiniLM embedding model runs entirely in your browser via WebAssembly. Your list is processed on your device and never sent to a server. Only the model is downloaded, once, then cached.
Which AI model does this use?
all-MiniLM-L6-v2, a compact sentence-transformer (about 23 MB) that maps each line to a 384-dimensional vector. It is fast, widely used for semantic similarity, and runs locally through transformers.js and ONNX Runtime.
What does the similarity threshold control?
It is the minimum cosine similarity for two lines to be flagged as near-duplicates. A higher threshold (closer to 98%) only catches very close paraphrases; a lower one (toward 50%) catches looser matches but may flag merely related lines. The default of 82% is a good starting point.
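To see how the threshold changes what gets flagged, consider this small sketch. The `ScoredPair` type, the `flagNearDuplicates` name, the third example line, and all three scores are invented for illustration; real scores come from the embedding model. It also assumes a pair exactly at the threshold counts as a match.

```typescript
// A scored pair: two lines plus their cosine similarity (0..1).
interface ScoredPair {
  a: string;
  b: string;
  similarity: number;
}

// Keep pairs at or above the threshold (a percentage, as on the slider)
// and list them highest first.
function flagNearDuplicates(pairs: ScoredPair[], thresholdPercent: number): ScoredPair[] {
  const minSimilarity = thresholdPercent / 100;
  return pairs
    .filter((p) => p.similarity >= minSimilarity)
    .sort((x, y) => y.similarity - x.similarity);
}

// Illustrative scores only; these are not measured values.
const scored: ScoredPair[] = [
  { a: "The meeting is at 3pm tomorrow.", b: "The invoice total is $420.", similarity: 0.12 },
  { a: "The meeting is at 3pm tomorrow.", b: "Tomorrow's meeting starts at 3 in the afternoon.", similarity: 0.91 },
  { a: "Tomorrow's meeting starts at 3 in the afternoon.", b: "The meeting room is booked tomorrow.", similarity: 0.64 },
];

const strict = flagNearDuplicates(scored, 82); // default threshold
const loose = flagNearDuplicates(scored, 50); // lowest setting mentioned above
console.log(strict.length, loose.length); // prints "1 2"
```

At 82% only the paraphrase pair survives; at 50% the merely related pair is flagged as well, which is the trade-off described above.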
How big a list can it handle?
It comfortably handles hundreds to a few thousand lines. Because every pair is compared, the work grows with the square of the line count: doubling the list roughly quadruples the number of comparisons, so very large lists take noticeably longer. Everything still runs in your browser, and the model is downloaded only once.
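A quick back-of-the-envelope sketch makes the growth concrete. A list of n lines has n × (n − 1) / 2 unordered pairs; the helper name `pairCount` is hypothetical, and the figures count comparisons only, not actual running time.

```typescript
// Number of unordered pairs among n lines: n * (n - 1) / 2.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}

for (const n of [100, 1000, 2000, 4000]) {
  console.log(`${n} lines -> ${pairCount(n)} comparisons`);
}
// 100 lines -> 4950 comparisons
// 1000 lines -> 499500 comparisons
// 2000 lines -> 1999000 comparisons
// 4000 lines -> 7998000 comparisons
```

Going from 1,000 to 2,000 lines raises the count from about half a million to about two million comparisons.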
Related tools
Semantic Dedupe
Remove near-duplicate lines from a list, including paraphrases that share no words, and get a clean output. An AI runs in your browser. Nothing is uploaded.
Find Similar Lines
Give one reference line and a list, and an AI ranks every line by how close in meaning it is. Catches paraphrases. Runs in your browser, nothing uploaded.
Semantic Search
Search any text by meaning, not keywords. Paste a list or document, type a query, and an AI ranks the closest matches in your browser. Nothing is uploaded.
Remove Duplicate Lines
Remove duplicate lines from text online and keep the first of each. Options for case sensitivity and trimming whitespace. Free, instant, and private.
Acronym Generator
Turn any phrase into an acronym from the first letter of each main word. Skip small words like 'the' and 'of', keep them, or add dot separators.
Add Line Numbers
Add line numbers to any text online. Set the start value, step, and separator, pad numbers to align, and copy or download the result. Free and private.