Question 1

How is this different from removing duplicate lines normally?

Accepted Answer

A plain duplicate remover only drops lines that match character for character. This tool compares meaning using sentence embeddings, so it also removes paraphrases such as 'ship the order by Friday' and 'make sure the order goes out before Friday', which share few words. It scores pairs of lines by cosine similarity rather than by string matching.
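
To make the contrast concrete, here is a minimal sketch of exact matching versus cosine similarity. The three-dimensional vectors are hypothetical stand-ins chosen for illustration; they are not real model output, and the helper name cosine_similarity is an assumption rather than the tool's own code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real sentence embeddings.
ship_by_friday = [0.9, 0.1, 0.2]
goes_out_before_friday = [0.8, 0.2, 0.3]
unrelated = [0.1, 0.9, 0.1]

# String comparison sees two different lines.
exact_match = (
    "ship the order by Friday" == "make sure the order goes out before Friday"
)

# Vector comparison sees two lines pointing in nearly the same direction.
paraphrase_score = cosine_similarity(ship_by_friday, goes_out_before_friday)
unrelated_score = cosine_similarity(ship_by_friday, unrelated)
```

With these toy values the paraphrase pair scores well above 0.85 while the unrelated pair scores well below it, even though the exact string comparison is false.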

Question 2

Which line is kept when two are near-duplicates?

Accepted Answer

The first one in your list. The tool walks the lines top to bottom and keeps a line unless its similarity to an already-kept line meets or exceeds the threshold, so earlier entries win. The output preserves the original order of the lines that are kept.
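
The keep-first pass described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual source: the function name dedupe_keep_first, the toy vectors, and the line 'book the meeting room' are all hypothetical, and the vectors would normally come from the embedding model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def dedupe_keep_first(lines, vectors, threshold=0.85):
    """Walk top to bottom; keep a line only if no kept line is too similar."""
    kept_lines = []
    kept_vectors = []
    for line, vec in zip(lines, vectors):
        is_near_duplicate = any(
            cosine_similarity(vec, kept) >= threshold for kept in kept_vectors
        )
        if not is_near_duplicate:
            kept_lines.append(line)   # appended in input order
            kept_vectors.append(vec)
    return kept_lines

# Hypothetical input: toy vectors, not real embeddings.
lines = [
    "ship the order by Friday",
    "book the meeting room",
    "make sure the order goes out before Friday",
]
vectors = [
    [0.9, 0.1, 0.2],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.3],
]

result = dedupe_keep_first(lines, vectors)
```

Because each line is compared only against lines already kept, the earlier of two near-duplicates always survives, and the result keeps the input order.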

Question 3

Is my text uploaded anywhere?

Accepted Answer

No. The MiniLM embedding model runs entirely in your browser via WebAssembly. Your list is processed on your device and never sent to a server. Only the model is downloaded, once, then cached.

Question 4

Which AI model does this use?

Accepted Answer

all-MiniLM-L6-v2, a compact sentence-transformer (about 23 MB) that maps each line to a 384-dimensional vector. It is fast, widely used for semantic similarity, and runs locally through transformers.js and ONNX.

Question 5

What does the similarity threshold control?

Accepted Answer

It is the minimum cosine similarity for a line to count as a near-duplicate of an earlier one and be removed. A higher threshold (closer to 98%) removes only very close paraphrases; a lower one removes looser matches but risks dropping merely related lines. The default of 85% is a balanced starting point.
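
As a rough illustration of how the threshold changes the outcome, the sketch below runs the same keep-first pass at three settings. The two-dimensional vectors are hypothetical and were picked so that their similarity to the first line is roughly 0.99, 0.90 and 0.70; the function name count_kept is likewise an assumption. Percentages are written as fractions, so 85% is 0.85.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def count_kept(vectors, threshold):
    """Number of lines that survive a keep-first pass at this threshold."""
    kept = []
    for vec in vectors:
        if not any(cosine_similarity(vec, k) >= threshold for k in kept):
            kept.append(vec)
    return len(kept)

# Hypothetical toy vectors, ordered as they would appear in the list.
vectors = [
    [1.0, 0.0],      # first line
    [0.99, 0.141],   # very close paraphrase (similarity about 0.99)
    [0.9, 0.436],    # looser paraphrase (similarity about 0.90)
    [0.7, 0.714],    # merely related line (similarity about 0.70)
]

kept_at_strict = count_kept(vectors, 0.98)   # only the closest paraphrase goes
kept_at_default = count_kept(vectors, 0.85)  # the looser paraphrase goes too
kept_at_loose = count_kept(vectors, 0.60)    # the related line is dropped as well
```

In this toy data the strict setting keeps three lines, the default keeps two, and the loose setting keeps only the first line, which is the over-removal risk mentioned above.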

AI Semantic Dedupe
