PHP Class: HTML to Plain Text Conversion

This class converts HTML to plain, formatted ASCII text. By default, the text is wrapped to 70 characters, and basic formatting is applied to preserve some of the HTML structure. For example:

  • Paragraphs are indented
  • Heading tags <h1> through <h3> are converted to all caps
  • Horizontal lines, <hr>, are converted to hyphens
  • Links are preserved as a footnoted list at the end
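
To make the rules above concrete, here is a minimal sketch of how such a conversion could be done with regular expressions. It is not the class's actual code: the function name sketch_html_to_text(), the tab indent, the 25-hyphen rule, and the "Links:" heading are all assumptions chosen for illustration.

```php
<?php

// Hypothetical helper, NOT part of the html2text class. It applies a
// few of the formatting rules listed above using plain regular
// expressions.
function sketch_html_to_text($html, $width = 70)
{
    $links = array();

    // Replace each link with its text plus a footnote marker like [1],
    // and remember the URL for the list at the end.
    $text = preg_replace_callback(
        '/<a\s[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/is',
        function ($m) use (&$links) {
            $links[] = $m[1];
            return $m[2] . ' [' . count($links) . ']';
        },
        $html
    );

    // Headings <h1> through <h3> become upper-case lines.
    $text = preg_replace_callback(
        '/<h[123][^>]*>(.*?)<\/h[123]>/is',
        function ($m) {
            return "\n\n" . strtoupper(strip_tags($m[1])) . "\n\n";
        },
        $text
    );

    // Horizontal rules become a row of hyphens (length is an assumption).
    $text = preg_replace('/<hr[^>]*>/i', "\n" . str_repeat('-', 25) . "\n", $text);

    // Paragraphs start on a new line with a tab indent (an assumption).
    $text = preg_replace('/<p[^>]*>/i', "\n\n\t", $text);

    // Drop any remaining tags, decode entities, and wrap the result.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    $text = wordwrap($text, $width);

    // Append the collected links as a footnoted list.
    if ($links) {
        $text .= "\n\nLinks:\n";
        foreach ($links as $i => $url) {
            $text .= '[' . ($i + 1) . '] ' . $url . "\n";
        }
    }

    return trim($text, "\n") . "\n";
}
```

A real converter has to deal with many more tags, entities, and whitespace cases than this sketch does, which is what most of the fixes credited below address.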

I’ve found this class extremely useful for things like sending out HTML-formatted rich email, by converting the HTML to text for the plain-text alternate format some email readers require.
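
As an illustration of that email use case, the sketch below assembles a multipart/alternative message with a plain-text part followed by an HTML part. The helper name build_alternative_message() and the boundary string are assumptions, and strip_tags() only stands in for the text that get_text() would return from this class.

```php
<?php

// Hypothetical helper: builds the headers and body of a
// multipart/alternative message. The plain-text part goes first and
// the HTML part last, since the last part is the preferred one.
function build_alternative_message($text, $html, $boundary)
{
    $headers = "MIME-Version: 1.0\r\n"
             . 'Content-Type: multipart/alternative; boundary="' . $boundary . '"';

    $body = '--' . $boundary . "\r\n"
          . "Content-Type: text/plain; charset=UTF-8\r\n"
          . "Content-Transfer-Encoding: 8bit\r\n\r\n"
          . $text . "\r\n"
          . '--' . $boundary . "\r\n"
          . "Content-Type: text/html; charset=UTF-8\r\n"
          . "Content-Transfer-Encoding: 8bit\r\n\r\n"
          . $html . "\r\n"
          . '--' . $boundary . "--\r\n";

    return array($headers, $body);
}

$html = '<p>Hello, <b>world</b>!</p>';

// Stand-in for the converted text; with the class loaded this would
// be the result of $h2t->get_text().
$text = strip_tags($html);

list($headers, $body) = build_alternative_message($text, $html, 'b_' . md5($html));

// Sending is left commented out so the sketch has no side effects:
// mail('user@example.com', 'Subject', $body, $headers);
```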

Thanks to Alexander Krug for finding an error in the search array of regular expressions, and pointing out the solution. I’ve updated the class accordingly.

Thanks to Joss Sanglier for adding several more HTML entity codes to the search and replace arrays. I’ve updated the class accordingly.

Thanks to Darius Kasperavicius for suggesting the addition of $allowed_tags and its supporting function (which I slightly modified).

Thanks to Justin Dearing for pointing out that a replacement for the <TH> tag was missing, and suggesting an appropriate fix.

Thanks to Mathieu Collas for finding a display/formatting bug in the _build_link_list() function: email readers would show the left bracket and number (“[1”) as part of the rendered email address.

Thanks to Wojciech Bajon for submitting code to handle relative links, which I hadn’t considered. I modified his code a bit to handle normal HTTP links and MAILTO links. Thanks to him also for suggesting three additional HTML entity codes to search for.

Thanks to Jacob Chandler for pointing out another link condition for the _build_link_list() function: “https”.
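
The link conditions mentioned in the last few notes (relative, “http”, “https”, and “mailto” links) can be sketched as follows. This is only an illustration under simple assumptions, not the body of _build_link_list(); the name resolve_link() is hypothetical, and relative paths are simply appended to the base URL.

```php
<?php

// Hypothetical helper, NOT the class's own code. Absolute http, https
// and mailto links pass through unchanged; anything else is treated as
// relative and joined to $base_url with exactly one slash.
function resolve_link($href, $base_url)
{
    if (preg_match('/^(https?:\/\/|mailto:)/i', $href)) {
        return $href;
    }
    return rtrim($base_url, '/') . '/' . ltrim($href, '/');
}

echo resolve_link('about.html', 'http://www.example.com'), "\n";
echo resolve_link('mailto:me@example.com', 'http://www.example.com'), "\n";
```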

Thanks to Marc Bertrand for suggesting a revision to the word wrapping functionality; if you specify a $width of 0 or less, word wrapping will be ignored.
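
The width rule described above amounts to a simple guard around PHP’s wordwrap() function. Here is a minimal sketch; the function name wrap_text() is an assumption and the class’s own code may be organized differently.

```php
<?php

// Hypothetical helper: wrap only when a positive width is given.
// A width of 0 or less returns the text untouched.
function wrap_text($text, $width = 70)
{
    if ($width > 0) {
        return wordwrap($text, $width);
    }
    return $text;
}
```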

August 8, 2008: Big housecleaning update:

Thanks to Colin Brown for suggesting the fix to handle </li> and blank lines (whitespace). Christian Basedau also suggested the blank lines fix.

Special thanks to Marcus Bointon, Christian Basedau, Norbert Laposa, Bas van de Weijer, and Marijn van Butselaar for pointing out my glaring error in the <th> handling. Marcus also supplied a host of fixes.

Thanks to Jeffrey Silverman for pointing out that extra spaces should be compressed, a problem addressed by Marcus Bointon’s fixes that I had not yet incorporated.

Thanks to Daniel Schledermann for suggesting a valuable fix with <a> tag handling.

Thanks to Wojciech Bajon (again!) for suggesting fixes and additions, including the <a> tag handling that Daniel Schledermann pointed out but that I had not yet incorporated. I haven’t (yet) incorporated all of Wojciech’s changes, though I may at some future time.

End of the housecleaning updates. Thanks to the help of everyone who has contacted me, I’ve updated enough of this class to advance it to version 1.0.0.

You can view the direct source of the class here, or download it in one of the following formats:

Here are some examples of how to use this class:

First, the basic, easiest method. Use this if you have the HTML you want to convert stored in a string variable:

<?php

// Include the class definition file.
require_once('class.html2text.inc');

// The “source” HTML you want to convert.
$html = 'Sample string with HTML code in it';

// Instantiate a new instance of the class. Passing the string
// variable automatically loads the HTML for you.
$h2t = new html2text($html);

// Simply call the get_text() method for the class to convert
// the HTML to the plain text. Store it into the variable.
$text = $h2t->get_text();

// Or, alternatively, you can print it out directly:
$h2t->print_text();

?>

This next example shows the two areas of variation in the class: where it pulls the HTML from (in this case, an external file) and how it handles relative links when you assign a base URL.

<?php

// Include the class definition file.
require_once('class.html2text.inc');

// The “source” HTML you want to convert, stored in a file.
$filename = '/path/to/file.html';

// Instantiate a new instance of the class. Passing the filename
// followed by the “true” flag tells the class to find the HTML
// in the specified file. Should work on remote files, too.
$h2t = new html2text($filename, true);

// The HTML is likely full of relative links, so let’s specify
// an absolute source.
$h2t->set_base_url('http://www.example.com');

// Simply call the get_text() method for the class to convert
// the HTML to the plain text. Store it into the variable.
$text = $h2t->get_text();

// Or, alternatively, you can print it out directly:
$h2t->print_text();

?>