Boneyard Tools

How a URL extractor finds and cleans links

How this tool recognizes web addresses in messy text, trims sentence punctuation, balances brackets, and de-duplicates the list.

What counts as a URL

The extractor scans your text for runs that begin with one of three markers: http://, https:// or a bare www. prefix. From that starting point it grabs every following character until it hits whitespace, an angle bracket, a straight double quote or an apostrophe. That permissive approach is deliberate, because a real link can contain slashes, question marks, ampersands, hashes and encoded characters that a stricter pattern would cut short. The trade-off is that the raw match sometimes swallows a closing bracket or a full stop that belongs to the surrounding sentence, which the next step cleans up.
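The matching rule described above can be sketched as a single regular expression. This is an illustrative approximation, not the tool's actual source: the names URL_PATTERN and find_raw_matches are hypothetical, and the sketch assumes the markers are matched case-sensitively, which the description does not specify.

```python
import re

# Assumed pattern: one of the three start markers, then every character
# up to whitespace, an angle bracket, a double quote or an apostrophe.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s<>"']+""")


def find_raw_matches(text: str) -> list[str]:
    """Return raw, untrimmed candidate URLs in reading order."""
    return URL_PATTERN.findall(text)


sample = 'See (https://example.com/a?b=1&c=2#top). Also <www.example.org> and "http://example.net/x".'
for match in find_raw_matches(sample):
    print(match)
# The first match still ends in ")." because trimming is a separate step.
```

Note how the query string and fragment survive intact, while the angle bracket and the double quote end their matches cleanly.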

Trimming the punctuation sentences add

Prose wraps links in punctuation all the time, as in the phrase "see (https://example.com)." Left alone, that would yield a broken address ending in a stray parenthesis. The tool walks backward from the end of each match and removes trailing periods, commas, semicolons, colons, exclamation marks, question marks, quotes and brackets. Brackets get special treatment: a closing parenthesis, square bracket or brace is only stripped when the link contains no matching opening one. That balance check is why a path such as /wiki/Nirvana_(band) keeps its closing parenthesis while a link merely wrapped in parentheses loses it.
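A minimal sketch of that backward walk follows, assuming the balance check is a simple test for whether the matching opening bracket appears anywhere in the link. The function name trim_trailing and both constants are hypothetical, chosen only for illustration.

```python
# Characters that are always stripped from the end of a match.
TRAILING_PUNCTUATION = set(".,;:!?\"'")
# Closing brackets mapped to the opening bracket that would justify keeping them.
CLOSING_TO_OPENING = {")": "(", "]": "[", "}": "{"}


def trim_trailing(url: str) -> str:
    """Strip sentence punctuation and unmatched closing brackets from the end."""
    while url:
        last = url[-1]
        if last in TRAILING_PUNCTUATION:
            url = url[:-1]
        elif last in CLOSING_TO_OPENING and CLOSING_TO_OPENING[last] not in url:
            # No opening bracket anywhere in the link, so this one is a wrapper.
            url = url[:-1]
        else:
            break
    return url


print(trim_trailing("https://example.com)."))
print(trim_trailing("https://en.wikipedia.org/wiki/Nirvana_(band)."))
```

Under this containment test, a link ending in Nirvana_(band)) would keep both closing parentheses. A version that counts opening and closing brackets would be stricter, and the description above does not say which variant the tool uses.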

De-duplicating and sorting the list

Long pages and email threads repeat the same link many times. With the remove duplicates option enabled, each address is compared in lowercase, so differences in capitalization do not create phantom extras, and only the first occurrence is kept along with the exact casing you pasted. Sorting is separate and off by default: switch on 'Sort A to Z' to reorder the survivors alphabetically with a case-insensitive comparison. Leaving sorting off preserves the original reading order, which is handy when the sequence of links carries meaning.
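The two steps can be sketched independently, which mirrors how the options are described. The function names dedupe and sort_a_to_z are hypothetical, and the sketch assumes plain lowercase comparison with no further URL normalization.

```python
def dedupe(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each address, compared in lowercase."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        key = url.lower()
        if key not in seen:
            seen.add(key)
            kept.append(url)  # preserve the casing of the first occurrence
    return kept


def sort_a_to_z(urls: list[str]) -> list[str]:
    """Return a new list ordered alphabetically, ignoring case."""
    return sorted(urls, key=str.lower)


links = ["www.Zeta.org", "https://alpha.example", "WWW.ZETA.ORG", "HTTPS://Beta.example"]
unique = dedupe(links)
print(unique)               # reading order, duplicates removed
print(sort_a_to_z(unique))  # optional alphabetical order
```

Because dedupe appends in input order, skipping the sort leaves the survivors in the order they first appeared.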

What the extractor will not catch

Because detection keys off http, https and www., a bare domain like example.com with no prefix is not recognized, and neither are email addresses, ftp links or custom app schemes. Any address that contains a literal space is cut at that space, since whitespace marks the end of a match. Links that have been deliberately obfuscated, for example written as hxxp or with spaces inserted, will not match their normal form. For those cases you may need to normalize the text first or reach for a purpose-built parser.
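As one example of normalizing text first, the sketch below rewrites the hxxp spelling back to http before extraction. It is a hypothetical pre-processing step, not part of the tool, and it deliberately leaves links with inserted spaces alone, since those cannot be rejoined safely without knowing where the address ends.

```python
import re


def restore_hxxp(text: str) -> str:
    """Rewrite hxxp:// and hxxps:// as http:// and https:// (any letter case)."""
    return re.sub(r"\bhxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)


print(restore_hxxp("Reported at hxxps://example.com/login and hxxp://example.org"))
```

After this pass, the rewritten addresses begin with a marker the extractor recognizes.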

Frequently asked questions

Can I extract only the domain names instead of full URLs?

Not directly. The tool returns complete addresses including the path and query string. To reduce them to domains, paste the results into a find-and-replace tool and strip everything after the host.
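If you would rather script that step, the host can be pulled out with Python's standard urllib.parse module. This is a sketch under the assumption that each input is an already-extracted address; the helper name host_of is hypothetical, and the scheme is added to bare www. links only so the parser can locate the host.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host of an extracted address, or '' if none is found."""
    if not url.lower().startswith(("http://", "https://")):
        # Bare www. links have no scheme, so urlsplit would treat them as a path.
        url = "http://" + url
    return urlsplit(url).hostname or ""


for link in ["https://Example.com:8080/a?b=1", "www.example.org/path#top"]:
    print(host_of(link))
```

The hostname attribute drops the port and lowercases the result, so the output is ready for de-duplication by domain.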

Why did a link inside angle brackets come out without the closing bracket?

A match stops at the first angle bracket, so a link written as <https://example.com> is captured up to, but not including, the closing bracket. The bracket never becomes part of the match, and the address itself is preserved intact.