Boneyard Tools

HTML tags, entities and plain text

How HTML tags, script blocks and character entities work, and the safe order for turning marked-up text into clean plain text.

Tags, elements and why they clutter copied text

HTML wraps content in tags: angle-bracket markers such as <p> to open a paragraph and </p> to close it. Inline tags such as <b> or <a> mark up a run of text without breaking the line, while block tags such as <div>, <p> and <li> define separate blocks that display on their own lines. When you copy from a page source, a CMS field or an email, those tags travel with the words and turn readable prose into a wall of markup. Stripping them leaves only the content a reader actually sees.

Character entities and how they decode

Some characters have special meaning in HTML, and others are hard to type or to see, so they are written as entities. An ampersand becomes &amp;, a less-than sign becomes &lt;, and a non-breaking space becomes &nbsp;. There are named entities such as &pound; for the pound sign, decimal numeric references such as &#39; for an apostrophe, and hexadecimal references such as &#x27;, which can express any Unicode character. Decoding turns these codes back into the real symbols, which is why the result reads naturally instead of showing raw ampersand sequences.
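
As a small illustration of decoding, the sketch below uses Python's standard html.unescape function on the kinds of entities described above. The sample strings are made up for the example, and a given tool may decode entities by a different route.

```python
import html

# Made-up sample strings covering named, decimal and hexadecimal references.
samples = ["Fish &amp; chips", "1 &lt; 2", "&pound;5", "It&#39;s", "It&#x27;s"]

# html.unescape turns each entity back into the character it stands for.
decoded = [html.unescape(s) for s in samples]

# &nbsp; decodes to the non-breaking space character U+00A0,
# not to an ordinary space.
nbsp_decoded = html.unescape("a&nbsp;b")
```

Note that the decoded non-breaking space is still a distinct character from a regular space, which is why a later whitespace pass is needed.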

Why script and style come out first

A page often includes <script> and <style> elements whose contents are code, not readable text. If you only deleted the tags you would be left with loose JavaScript and CSS sitting in the middle of your prose. To avoid that, these elements are removed wholesale first, opening tag, closing tag and everything in between, before any other cleanup runs. That single early step is what keeps tracking snippets and stylesheet rules out of the plain text.
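
One way to picture that early step is a single regular expression that matches a whole script or style element. This is only a sketch: the names SCRIPT_STYLE_RE and drop_script_and_style are hypothetical, and it assumes reasonably well-formed markup rather than reproducing how any particular tool does it.

```python
import re

# Matches <script ...>...</script> or <style ...>...</style>, including the
# contents. DOTALL lets "." span newlines; the non-greedy ".*?" stops at the
# first matching closing tag.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

def drop_script_and_style(markup: str) -> str:
    """Remove script and style elements wholesale, tags and contents."""
    return SCRIPT_STYLE_RE.sub("", markup)

page = "<p>Hi</p><script>track(1);</script><style>p{color:red}</style><p>Bye</p>"
cleaned = drop_script_and_style(page)
```

After this pass the remaining markup contains only the two paragraphs, so later tag removal cannot leave code behind.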

The order of operations that keeps text clean

Order matters when stripping HTML. Script and style blocks are removed first, then block-level tags are mapped to newlines or spaces so neighboring words never run together, then any remaining inline tags and comments are deleted. Only after the markup is gone are entities decoded; decoding earlier would turn an escaped &lt;p&gt; into something that looks like a real tag and gets stripped along with the rest. Whitespace is tidied last so that a decoded non-breaking space collapses cleanly instead of leaving a double gap. Following this sequence is what turns messy source into predictable, readable output.
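
The sequence above can be sketched in a few lines of Python. Everything here is illustrative: the function name html_to_text, the list of block tags and the regular expressions are assumptions made for the example, and regex-based stripping only copes with reasonably well-formed markup.

```python
import html
import re

SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
# An assumed, incomplete list of block-level tags for the example.
BLOCK_TAG_RE = re.compile(
    r"</?(?:p|div|li|ul|ol|h[1-6]|tr|table|blockquote|br)\b[^>]*>", re.I
)
COMMENT_RE = re.compile(r"<!--.*?-->", re.S)
ANY_TAG_RE = re.compile(r"<[^>]+>")

def html_to_text(markup: str) -> str:
    text = SCRIPT_STYLE_RE.sub("", markup)      # 1. scripts and styles, wholesale
    text = BLOCK_TAG_RE.sub("\n", text)         # 2. block tags become newlines
    text = COMMENT_RE.sub("", text)             # 3. comments...
    text = ANY_TAG_RE.sub("", text)             #    ...and remaining inline tags
    text = html.unescape(text)                  # 4. decode entities
    text = text.replace("\u00a0", " ")          # 5. tidy whitespace last:
    text = re.sub(r"[ \t]+", " ", text)         #    collapse runs of spaces
    text = re.sub(r" ?\n[ \n]*", "\n", text)    #    collapse blank lines
    return text.strip()

sample = (
    "<div><p>Fish &amp; chips</p><script>track();</script>"
    "<p>Cost:&nbsp; <b>&pound;5</b><!-- note --></p></div>"
)
result = html_to_text(sample)
```

Because entities are decoded only in step 4, text such as &lt;b&gt; survives as the literal characters <b> in the output rather than being deleted as a tag.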

Frequently asked questions

Why not just delete everything between angle brackets?

That naive approach would leave script and style contents behind and could merge adjacent words where a block tag used to be. Handling scripts, block tags and entities in the right order avoids both problems.
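
A quick way to see both problems is to run the naive approach on a tiny, made-up snippet. The function name naive_strip is hypothetical and exists only for this demonstration.

```python
import re

def naive_strip(markup: str) -> str:
    """Delete everything between angle brackets and nothing else."""
    return re.sub(r"<[^>]*>", "", markup)

sample = "<p>First</p><p>Second</p><script>track();</script>"
naive_result = naive_strip(sample)
```

Here naive_result is "FirstSecondtrack();": the two paragraphs have merged into one word and the script contents are left in the text.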

Does stripping tags make text safe to embed?

It removes markup for reading, but it is not a security sanitizer. If you are inserting untrusted content into a page, use a dedicated HTML sanitizing library rather than a plain-text stripper.

What happens to a non-breaking space?

With decoding on, &nbsp; becomes a normal space, and the final whitespace pass collapses any run of spaces down to one so the output does not carry hidden gaps.
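
A minimal sketch of that behavior, assuming decoding is done with Python's html.unescape and that the whitespace pass treats the non-breaking space (U+00A0) like any other space. The helper name tidy_spaces is hypothetical.

```python
import html
import re

def tidy_spaces(text: str) -> str:
    """Decode entities, then collapse runs of spaces, tabs and
    non-breaking spaces into a single ordinary space."""
    decoded = html.unescape(text)
    return re.sub(r"[ \t\u00a0]+", " ", decoded).strip()

tidied = tidy_spaces("Price:&nbsp; 5")
```

The entity and the regular space next to it end up as one space, so the output carries no hidden gap.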