How a URL extractor finds and cleans links
How this tool recognizes web addresses in messy text, trims sentence punctuation, balances brackets, and de-duplicates the list.
What counts as a URL
The extractor scans your text for runs that begin with one of three markers: http://, https:// or a bare www. prefix. From that starting point it grabs every following character until it hits whitespace, an angle bracket, a straight double quote or an apostrophe. That permissive approach is deliberate, because a real link can contain slashes, question marks, ampersands, hashes and encoded characters that a stricter pattern would cut short. The trade-off is that the raw match sometimes swallows a closing bracket or a full stop that belongs to the surrounding sentence, which the next step cleans up.
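The scan described above can be sketched as a single regular expression. This is a minimal illustration, not the tool's actual code: the pattern, the name URL_PATTERN and the helper find_raw_matches are assumptions made for the example.

```python
import re

# Assumed pattern: start at http://, https:// or www., then take every
# character up to whitespace, an angle bracket, a straight double quote
# or an apostrophe.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")


def find_raw_matches(text: str) -> list[str]:
    """Return the raw, untrimmed matches in reading order."""
    return URL_PATTERN.findall(text)


sample = "Docs are at https://example.com/a?b=1&c=2#top. Also see (www.example.org)."
print(find_raw_matches(sample))
# ['https://example.com/a?b=1&c=2#top.', 'www.example.org).']
```

Both raw matches still carry sentence punctuation at the end, which is exactly what the trimming step is for.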
Trimming the punctuation sentences add
Prose wraps links in punctuation all the time, as in the phrase see (https://example.com). Left alone, that would yield a broken address ending in a stray parenthesis. The tool walks backward from the end of each match and removes trailing periods, commas, semicolons, colons, exclamation marks, question marks, quotes and brackets. Brackets get special treatment: a closing parenthesis, square bracket or brace is stripped only when the link contains no matching opening one. That balance check is why a path such as /wiki/Nirvana_(band) keeps its closing parenthesis while a link merely wrapped in parentheses loses it.
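A backward walk with a bracket check might look like the sketch below. The function name trim_trailing and the two lookup tables are hypothetical, and the balance test here is a simple "does the link contain the opening bracket at all" check, which is one reading of the rule described above rather than a confirmed implementation.

```python
# Assumed set of sentence punctuation that is always safe to strip.
TRAILING = set(".,;:!?\"'")
# Closing brackets mapped to the opener that would justify keeping them.
CLOSERS = {")": "(", "]": "[", "}": "{"}


def trim_trailing(url: str) -> str:
    """Strip trailing punctuation, keeping closers that have an opener."""
    while url:
        last = url[-1]
        if last in TRAILING:
            url = url[:-1]
        elif last in CLOSERS and CLOSERS[last] not in url:
            # No matching opener anywhere in the link: treat as sentence wrapping.
            url = url[:-1]
        else:
            break
    return url


print(trim_trailing("https://example.com)."))
# https://example.com
print(trim_trailing("https://en.wikipedia.org/wiki/Nirvana_(band)"))
# https://en.wikipedia.org/wiki/Nirvana_(band)
```

The loop repeats until the last character is something other than strippable punctuation, so stacked endings such as a parenthesis followed by a period are removed in one pass.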
De-duplicating and sorting the list
Long pages and email threads repeat the same link many times. With duplicate removal enabled, each address is compared in lowercase, so differences in capitalization do not create phantom extras. Only the first occurrence is kept, with the exact casing you pasted. Sorting is separate and off by default: switch on 'Sort A to Z' to reorder the survivors alphabetically with a case-insensitive comparison. Leaving sorting off preserves the original reading order, which is handy when the sequence of links carries meaning.
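The two list operations can be sketched as follows. The helper names dedupe and sort_a_to_z are invented for this example, and the tool's own comparison may differ in detail.

```python
def dedupe(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each address, compared in lowercase."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        key = url.lower()
        if key not in seen:
            seen.add(key)
            kept.append(url)  # original casing is preserved
    return kept


def sort_a_to_z(urls: list[str]) -> list[str]:
    """Return a new list sorted alphabetically, ignoring case."""
    return sorted(urls, key=str.lower)


links = ["https://Example.com/A", "https://example.com/a", "http://b.org"]
unique = dedupe(links)
print(unique)
# ['https://Example.com/A', 'http://b.org']
print(sort_a_to_z(unique))
# ['http://b.org', 'https://Example.com/A']
```

Keeping de-duplication and sorting as separate functions mirrors the separate switches: skipping the sort leaves the list in reading order.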
What the extractor will not catch
Because detection keys off http://, https:// and www., a bare domain like example.com with no prefix is not recognized, and neither are email addresses, ftp links or custom app schemes. Any address that contains a literal space is cut at that space, since whitespace marks the end of a match. Links that have been deliberately obfuscated, for example written as hxxp or with spaces inserted, are not recognized as the addresses they stand for. For those cases you may need to normalize the text first or reach for a purpose-built parser.
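As one example of normalizing text before extraction, the sketch below rewrites the hxxp spelling back to http. It is an illustration only: the function name refang_scheme is hypothetical, it handles just the scheme, and it makes no attempt to repair links with inserted spaces.

```python
import re


def refang_scheme(text: str) -> str:
    """Rewrite hxxp:// and hxxps:// to http:// and https:// (any casing)."""
    return re.sub(r"\bhxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)


print(refang_scheme("see hxxps://example.com and hxxp://example.org"))
# see https://example.com and http://example.org
```

Obfuscated links are often written that way to prevent accidental clicks, so it is worth restoring them only when you actually intend to work with the live addresses.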