Question 1

How is this different from a normal duplicate finder?

Accepted Answer

A normal duplicate finder needs the text to match character for character, or nearly so. This tool compares meaning using sentence embeddings, so it catches paraphrases like "the meeting is at 3pm" and "our 3 o'clock catch-up" even though they share almost no words. It scores each pair by the cosine similarity of their embeddings rather than by string matching.
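To make the cosine-similarity idea concrete, here is a minimal sketch. The function name and the three-dimensional toy vectors are illustrative assumptions, not the tool's actual code; real embeddings come from the model and are far longer.

```typescript
// Cosine similarity: the dot product divided by the product of the vector lengths.
// Result is in [-1, 1]; values near 1 mean the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0 (an assumption).
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical toy vectors standing in for real sentence embeddings.
const meetingAt3pm = [0.9, 0.1, 0.3];
const threeOclockCatchUp = [0.8, 0.2, 0.4];
const groceryList = [0.0, 0.9, -0.2];

// The two "meeting" vectors point in nearly the same direction, so they score high.
const paraphraseScore = cosineSimilarity(meetingAt3pm, threeOclockCatchUp);
// The unrelated vector points elsewhere, so it scores close to zero.
const unrelatedScore = cosineSimilarity(meetingAt3pm, groceryList);
```

The score depends only on the direction of the vectors, not on which words the two lines share, which is why paraphrases can score high.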

Question 2

Is my text uploaded anywhere?

Accepted Answer

No. The MiniLM embedding model runs entirely in your browser via WebAssembly. Your list is processed on your device and never sent to a server. The only download is the model itself, which is fetched once and then cached.

Question 3

Which AI model does this use?

Accepted Answer

all-MiniLM-L6-v2, a compact sentence-transformer (about 23 MB) that maps each line to a 384-dimensional vector. It is fast, widely used for semantic similarity, and runs locally through transformers.js and ONNX.

Question 4

What does the similarity threshold control?

Accepted Answer

It is the minimum cosine similarity for two lines to be flagged as near-duplicates. A higher threshold (closer to 98%) only catches very close paraphrases; a lower one (toward 50%) catches looser matches but may flag merely related lines. The default of 82% is a good starting point.
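The effect of the threshold can be sketched as a simple filter over already-scored pairs. The `ScoredPair` shape, the function name, the sample scores, and the inclusive `>=` comparison are all assumptions made for illustration.

```typescript
// A pair of line indices with the cosine similarity of their embeddings.
interface ScoredPair {
  i: number;
  j: number;
  similarity: number; // 0..1 for typical sentence pairs
}

// Keep only pairs at or above the threshold, given as a percentage (e.g. 82).
function flagNearDuplicates(pairs: ScoredPair[], thresholdPercent: number): ScoredPair[] {
  const cutoff = thresholdPercent / 100;
  return pairs.filter((p) => p.similarity >= cutoff);
}

// Made-up scores, not output from the real model.
const scored: ScoredPair[] = [
  { i: 0, j: 1, similarity: 0.97 },
  { i: 0, j: 2, similarity: 0.85 },
  { i: 1, j: 3, similarity: 0.61 },
  { i: 2, j: 3, similarity: 0.4 },
];

const strict = flagNearDuplicates(scored, 98); // nothing is close enough
const standard = flagNearDuplicates(scored, 82); // the 0.97 and 0.85 pairs
const loose = flagNearDuplicates(scored, 50); // also picks up the 0.61 pair
```

Lowering the threshold never removes a match; it only admits more pairs, which is why looser settings start to flag lines that are merely related.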

Question 5

How big a list can it handle?

Accepted Answer

It comfortably handles hundreds to a few thousand lines. Because every pair is compared, the work grows with the square of the line count: doubling the list roughly quadruples the number of comparisons, so very large lists take noticeably longer. The model downloads once on first use and is then cached, and everything runs in your browser, so nothing is uploaded.
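The quadratic growth is easy to see by counting pairs. This is a rough sketch with hypothetical function names; it counts comparisons only and says nothing about how long each one takes on a given device.

```typescript
// Number of unordered pairs among n lines: n * (n - 1) / 2.
function pairCount(lineCount: number): number {
  return (lineCount * (lineCount - 1)) / 2;
}

// The same count obtained by walking every pair (i, j) with i < j,
// which is the loop shape an all-pairs comparison would use.
function countPairsByLoop(lineCount: number): number {
  let count = 0;
  for (let i = 0; i < lineCount; i++) {
    for (let j = i + 1; j < lineCount; j++) {
      count++;
    }
  }
  return count;
}

const pairsFor100 = pairCount(100); // 4950
const pairsFor1000 = pairCount(1000); // 499500
const pairsFor3000 = pairCount(3000); // 4498500
```

Going from 100 to 1,000 lines multiplies the list by ten but the comparisons by roughly a hundred.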

AI Near-Duplicate Finder
