HTML tags, entities and plain text
How HTML tags, script blocks and character entities work, and the safe order for turning marked-up text into clean plain text.
Tags, elements and why they clutter copied text
HTML wraps content in tags: angle-bracket markers like <p> to open a paragraph and </p> to close it. Inline tags such as <b> or <a> style a run of text without breaking the line, while block tags such as <div>, <p> and <li> define separate blocks that display on their own lines. When you copy from a page source, a CMS field or an email, those tags travel with the words and turn readable prose into a wall of markup. Stripping them leaves only the content a reader actually sees.
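As a rough illustration of the split between markup and visible text, the sketch below feeds a short fragment to Python's standard html.parser module and collects tag names and text runs separately. The TagLister class and the sample fragment are hypothetical names invented for this example; the article does not prescribe a particular tool.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects opening tag names and text runs in separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, e.g. <p> or <a href="...">.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags, which is what a reader sees.
        self.text.append(data)


lister = TagLister()
lister.feed('<p>Read the <b>full</b> <a href="/guide">guide</a>.</p>')
lister.close()

print(lister.tags)           # ['p', 'b', 'a']
print("".join(lister.text))  # Read the full guide.
```

The three tags carry structure and styling only; the joined text runs are the sentence a reader would actually see.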
Character entities and how they decode
Some characters cannot appear literally in HTML because they have special meaning, and others are hard to type or to see, so they are written as entities. An ampersand becomes &amp;, a less-than sign becomes &lt;, and a non-breaking space becomes &nbsp;. There are named entities like &pound; for the pound sign, and numeric references like &#39; for an apostrophe, or its hex form &#x27;, which can encode any Unicode character. Decoding turns these codes back into the real symbols, which is why the result reads naturally instead of showing raw ampersand sequences.
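A minimal sketch of decoding, assuming Python and its standard html.unescape function (the article itself does not name a decoder). The sample strings are made up for illustration and cover named, decimal and hex references.

```python
import html

# Hypothetical sample strings: named, decimal and hexadecimal references.
samples = ["Fish &amp; chips", "1 &lt; 2", "&pound;5", "It&#39;s", "caf&#xE9;"]

decoded = [html.unescape(s) for s in samples]
print(decoded)  # ['Fish & chips', '1 < 2', '£5', "It's", 'café']

# A non-breaking space decodes to U+00A0, which looks like a normal space
# but is a different character, so whitespace cleanup has to handle it.
nbsp = html.unescape("&nbsp;")
print(nbsp == "\u00a0")  # True
```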
Why script and style come out first
A page often includes <script> and <style> elements whose contents are code, not readable text. If you only deleted the tags you would be left with loose JavaScript and CSS sitting in the middle of your prose. To avoid that, these elements are removed wholesale first, opening tag, closing tag and everything in between, before any other cleanup runs. That single early step is what keeps tracking snippets and stylesheet rules out of the plain text.
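The difference between deleting tags and removing whole elements can be sketched with a regular expression. This is an assumption-laden illustration, not a full HTML parser: the SCRIPT_STYLE pattern and drop_script_and_style helper are hypothetical names, and unusual markup (for example a "</script>" string inside the script itself) would need a real parser.

```python
import re

# Matches <script ...>...</script> or <style ...>...</style>, including
# everything in between. DOTALL lets "." span line breaks.
SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def drop_script_and_style(source: str) -> str:
    """Remove script and style elements wholesale (illustrative helper)."""
    return SCRIPT_STYLE.sub("", source)


page = '<style>p { color: red; }</style><p>Hello</p><script>track("view");</script>'

# Deleting only the tags leaves CSS and JavaScript mixed into the text.
naive = re.sub(r"<[^>]+>", "", page)
print(naive)  # p { color: red; }Hellotrack("view");

# Removing the elements first leaves only real content behind.
clean = drop_script_and_style(page)
print(clean)  # <p>Hello</p>
```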
The order of operations that keeps text clean
Order matters when stripping HTML. Script and style blocks are removed first, then block-level tags are mapped to newlines or spaces so neighboring words never run together, then any remaining inline tags and comments are deleted. Only after the markup is gone are entities decoded, so that escaped text such as &lt;b&gt; becomes the literal characters <b> instead of being mistaken for a tag and deleted. Whitespace is tidied last so that a decoded non-breaking space collapses cleanly instead of leaving a double gap. Following this sequence is what turns messy source into predictable, readable output.
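The sequence above can be sketched as a small Python function. Everything here is an assumption made for illustration: the html_to_text name, the regex patterns, the short list of block-level tags and the choice to map every block tag to a newline are not taken from a specific tool, and a regex-based approach will not cope with every malformed page.

```python
import html
import re

SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
# A deliberately short, illustrative list of block-level tags.
BLOCK_TAG = re.compile(
    r"</?(?:p|div|li|ul|ol|br|h[1-6]|tr|table|blockquote)\b[^>]*>", re.I
)
COMMENT = re.compile(r"<!--.*?-->", re.S)
ANY_TAG = re.compile(r"<[^>]+>")


def html_to_text(source: str) -> str:
    """Strip markup in the order described above (illustrative sketch)."""
    text = SCRIPT_STYLE.sub("", source)       # 1. script and style, wholesale
    text = BLOCK_TAG.sub("\n", text)          # 2. block tags become newlines
    text = COMMENT.sub("", text)              # 3. comments, then inline tags
    text = ANY_TAG.sub("", text)
    text = html.unescape(text)                # 4. decode entities
    text = text.replace("\u00a0", " ")        # 5. tidy whitespace last
    text = re.sub(r"[ \t]+", " ", text)       #    collapse runs of spaces
    text = re.sub(r" ?\n[ \n]*", "\n", text)  #    collapse blank lines
    return text.strip()


sample = (
    "<style>p{color:red}</style>"
    "<div>Fish &amp;&nbsp; <b>chips</b></div>"
    "<p>1 &lt; 2<!-- note --></p>"
    "<script>track()</script>"
)
result = html_to_text(sample)
print(repr(result))  # 'Fish & chips\n1 < 2'
```

Because the non-breaking space is decoded before whitespace is collapsed, "&amp;&nbsp; " ends up as a single space after the ampersand. And because decoding runs after tag removal, the decoded "<" in "1 < 2" is never treated as the start of a tag.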