PHP Cookbook: Web Automation
Converting ASCII to HTML
Problem
You want to turn plaintext into reasonably formatted HTML.
Solution
First, encode entities with htmlentities( ); then transform the text into various HTML structures. The pc_ascii2html( ) function shown in Example 11-3 has basic transformations for links and paragraph breaks.
Example 11-3: pc_ascii2html( )
function pc_ascii2html($s) {
    $s = htmlentities($s);
    // explode() replaces split(), which was removed in PHP 7
    $grafs = explode("\n\n", $s);
    for ($i = 0, $j = count($grafs); $i < $j; $i++) {
        // Link to what seem to be http or ftp URLs
        $grafs[$i] = preg_replace('/((ht|f)tp:\/\/[^\s&]+)/',
                                  '<a href="$1">$1</a>', $grafs[$i]);
        // Link to email addresses; $0 is the whole matched address
        $grafs[$i] = preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/i',
                                  '<a href="mailto:$0">$0</a>', $grafs[$i]);
        // Begin with a new paragraph
        $grafs[$i] = '<p>'.$grafs[$i].'</p>';
    }
    return join("\n\n", $grafs);
}
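To make the order of operations concrete, the following sketch runs the encode, link, and wrap steps by hand on one paragraph. The sample text and variable names are illustrative assumptions, not part of the book's example.

```php
<?php
// Sample paragraph (hypothetical input, chosen for illustration)
$text = 'See http://www.php.net/ for docs & more.';

// Step 1: escape HTML-significant characters before adding any markup
$encoded = htmlentities($text);

// Step 2: wrap anything that looks like an http or ftp URL in a link
$linked = preg_replace('/((ht|f)tp:\/\/[^\s&]+)/',
                       '<a href="$1">$1</a>', $encoded);

// Step 3: wrap the paragraph in <p> tags
$html = '<p>' . $linked . '</p>';

echo $html, "\n";
// <p>See <a href="http://www.php.net/">http://www.php.net/</a> for docs &amp; more.</p>
```

Encoding first matters: if the links were added before htmlentities( ) ran, the new <a> tags would themselves be escaped. Note also that the URL pattern stops at the first & character, so a URL with several query parameters would be linked only up to its first encoded ampersand.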
Discussion
The more you know about what the ASCII text looks like, the better your HTML conversion can be. For example, if emphasis is indicated with *asterisks* or /slashes/ around words, you can add rules that take care of that, as follows:
$grafs[$i] = preg_replace('/(\A|\s)\*([^*]+)\*(\s|\z)/', '$1<b>$2</b>$3', $grafs[$i]);
$grafs[$i] = preg_replace('{(\A|\s)/([^/]+)/(\s|\z)}', '$1<i>$2</i>$3', $grafs[$i]);
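As a quick check of how these two rules behave, the sketch below applies them to a pair of sample strings. The sample strings and the apply_emphasis( ) helper name are assumptions made for this illustration.

```php
<?php
// Hypothetical helper that applies the two emphasis rules from the text
function apply_emphasis(string $s): string {
    // *word* becomes <b>word</b>
    $s = preg_replace('/(\A|\s)\*([^*]+)\*(\s|\z)/', '$1<b>$2</b>$3', $s);
    // /word/ becomes <i>word</i>
    $s = preg_replace('{(\A|\s)/([^/]+)/(\s|\z)}', '$1<i>$2</i>$3', $s);
    return $s;
}

$first = apply_emphasis('This is *very* important and /really/ urgent.');
echo $first, "\n";
// This is <b>very</b> important and <i>really</i> urgent.

// Two marked words separated by a single space: the first match consumes
// the space, so the second word is left unconverted.
$second = apply_emphasis('*a* *b*');
echo $second, "\n";
// <b>a</b> *b*
```

Because each rule requires whitespace (or the start or end of the string) on both sides of the markers, closing tags such as </b> are not mistaken for slash emphasis, but adjacent emphasized words can be missed, as the second call shows.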
See Also
Documentation on preg_replace( ) at http://www.php.net/preg-replace.
Converting HTML to ASCII
Problem
You need to convert HTML to readable, formatted ASCII text.
Solution
If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:
$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;
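The same call can be wrapped so that a missing lynx binary or an empty result is easier to detect. This is only a sketch: the build_lynx_command( ) and html_file_to_ascii( ) names are hypothetical, and it assumes a Unix-like shell where escapeshellarg( ) quotes with single quotes.

```php
<?php
// Hypothetical helper: build the lynx command line for one file or URL.
function build_lynx_command(string $file): string {
    // escapeshellarg() keeps spaces and shell metacharacters in the
    // argument from being interpreted by the shell
    return 'lynx -dump ' . escapeshellarg($file);
}

// Hypothetical helper: return the formatted text, or null on failure.
function html_file_to_ascii(string $file): ?string {
    // shell_exec() is the function form of the backtick operator; it
    // returns null or false when the command fails or prints nothing
    $output = shell_exec(build_lynx_command($file));
    return is_string($output) ? $output : null;
}

$command = build_lynx_command('my page.html');
echo $command, "\n";
// lynx -dump 'my page.html'
```

A caller would then test the result of html_file_to_ascii( ) against null before using it, and could fall back to a pure-PHP conversion when lynx is unavailable.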
Discussion
If you can't use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).
Example 11-4: pc_html2ascii( )
function pc_html2ascii($s) {
    // convert links
    $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                      '$2 ($1)', $s);
    // convert <br>, <hr>, <p>, <div> to line breaks
    $s = preg_replace('@<(b|h)r[^>]*>@i', "\n", $s);
    $s = preg_replace('@<p[^>]*>@i', "\n\n", $s);
    $s = preg_replace('@<div[^>]*>(.*)</div>@i', "\n".'$1'."\n", $s);
    // convert bold and italic
    $s = preg_replace('@<b[^>]*>(.*?)</b>@i', '*$1*', $s);
    $s = preg_replace('@<i[^>]*>(.*?)</i>@i', '/$1/', $s);
    // decode named entities
    $s = strtr($s, array_flip(get_html_translation_table(HTML_ENTITIES)));
    // decode numbered entities such as &#65; (a callback replaces the
    // /e modifier, which was removed in PHP 7)
    $s = preg_replace_callback('/&#(\d+);/',
                               function ($m) { return chr((int) $m[1]); }, $s);
    // remove any remaining tags
    $s = strip_tags($s);
    return $s;
}
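For the entity-decoding steps alone, PHP's built-in html_entity_decode( ) is an alternative worth knowing about. The sketch below is not the book's method; it swaps in that built-in and assumes UTF-8 output, with a sample string invented for illustration.

```php
<?php
// Sample input mixing named and numbered entities (hypothetical)
$html = 'Caf&eacute; &amp; bar &#169; 2003 &lt;ok&gt;';

// Decode named and numeric entities in one call, producing UTF-8 text
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $text, "\n";
// Café & bar © 2003 <ok>
```

One difference from the chr( ) approach is that chr( ) yields a single byte, so it cannot represent numbered entities above 255, whereas html_entity_decode( ) emits the corresponding multi-byte character in the requested encoding.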
See Also
Recipe 9.8 for more on get_html_translation_table( ); documentation on preg_replace( ) at http://www.php.net/preg-replace, get_html_translation_table( ) at http://www.php.net/get-html-translation-table, and strip_tags( ) at http://www.php.net/strip-tags.
Created: March 11, 2003
Revised: March 11, 2003
URL: http://webreference.com/programming/php/chap11/1