PHP Cookbook: Web Automation

Converting ASCII to HTML

Problem

You want to turn plaintext into reasonably formatted HTML.

Solution

First, encode entities with htmlentities( ); then, transform the text into various HTML structures. The pc_ascii2html( ) function shown in Example 11-3 has basic transformations for links and paragraph breaks.

Example 11-3: pc_ascii2html( )

function pc_ascii2html($s) {
  $s = htmlentities($s);
  // split() was removed in PHP 7; explode() splits on a literal string
  $grafs = explode("\n\n",$s);
  for ($i = 0, $j = count($grafs); $i < $j; $i++) {
    // Link to what seem to be http or ftp URLs
    $grafs[$i] = preg_replace('/((ht|f)tp:\/\/[^\s&]+)/',
                              '<a href="$1">$1</a>',$grafs[$i]);
 
    // Link to email addresses ($0 is the entire matched address;
    // $1 would hold only the last domain label)
    $grafs[$i] = preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/i',
        '<a href="mailto:$0">$0</a>',$grafs[$i]);
 
    // Wrap each paragraph in <p> tags
    $grafs[$i] = '<p>'.$grafs[$i].'</p>';
  }
  return join("\n\n",$grafs);
}
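
Example 11-3 breaks paragraphs on exactly two newline characters. If the input might use Windows line endings or contain extra blank lines, a more tolerant split is one option. The following is a minimal sketch, not part of the original recipe; the helper name split_paragraphs( ) is hypothetical.

```php
<?php
// Hypothetical helper (not from the recipe): split text into paragraphs
// on runs of two or more line breaks of any style.
function split_paragraphs($s) {
    // \R matches \n, \r\n, or \r as a single line break;
    // PREG_SPLIT_NO_EMPTY drops any empty pieces.
    return preg_split('/\R{2,}/', trim($s), -1, PREG_SPLIT_NO_EMPTY);
}

$text = "First paragraph.\r\n\r\nSecond paragraph.\n\n\nThird paragraph.";
echo implode(' | ', split_paragraphs($text)), "\n";
// First paragraph. | Second paragraph. | Third paragraph.
```

The resulting array could stand in for $grafs inside pc_ascii2html( ), assuming the rest of the function is left unchanged.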

Discussion

The more you know about what the ASCII text looks like, the better your HTML conversion can be. For example, if emphasis is indicated with *asterisks* or /slashes/ around words, you can add rules that take care of that, as follows:

$grafs[$i] = preg_replace('/(\A|\s)\*([^*]+)\*(\s|\z)/',
                          '$1<b>$2</b>$3',$grafs[$i]);
$grafs[$i] = preg_replace('{(\A|\s)/([^/]+)/(\s|\z)}',
                          '$1<i>$2</i>$3',$grafs[$i]);
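
To see the two emphasis rules in isolation, here is a small sketch that applies them to a single paragraph. The wrapper function add_emphasis( ) and the sample text are assumptions made for illustration; only the two patterns come from the recipe.

```php
<?php
// Hypothetical wrapper around the two emphasis rules shown above.
function add_emphasis($graf) {
    // *word* becomes <b>word</b>
    $graf = preg_replace('/(\A|\s)\*([^*]+)\*(\s|\z)/',
                         '$1<b>$2</b>$3', $graf);
    // /word/ becomes <i>word</i>; {} delimiters avoid escaping the slashes
    $graf = preg_replace('{(\A|\s)/([^/]+)/(\s|\z)}',
                         '$1<i>$2</i>$3', $graf);
    return $graf;
}

echo add_emphasis('This is *important* and /subtle/ text.'), "\n";
// This is <b>important</b> and <i>subtle</i> text.
```

Because each match consumes the whitespace that follows it, two emphasized words separated by a single space (such as *one* *two*) have only the first converted. Lookahead and lookbehind assertions in place of the captured whitespace would be one way around that.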

See Also

Documentation on preg_replace( ) at http://www.php.net/preg-replace.

Converting HTML to ASCII

Problem

You need to convert HTML to readable, formatted ASCII text.

Solution

If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:

$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;

Discussion

If you can't use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).

Example 11-4: pc_html2ascii( )

function pc_html2ascii($s) {
  // convert links
  $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                    '$2 ($1)', $s);
 
  // convert <br>, <hr>, <p>, <div> to line breaks
  $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
  $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
  $s = preg_replace('@<div[^>]*>(.*)</div>@i',"\n".'$1'."\n",$s);
  
  // convert bold and italic
  $s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
  $s = preg_replace('@<i[^>]*>(.*?)</i>@i','/$1/',$s);
 
  // decode named entities
  $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));
 
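  // decode numbered entities (the /e modifier was removed in PHP 7,
  // so use a callback instead)
  $s = preg_replace_callback('/&#(\d+);/',
           function ($m) { return chr((int) $m[1]); }, $s);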
  
  // remove any remaining tags
  $s = strip_tags($s);
  
  return $s;
}
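
On newer PHP versions, the two entity-decoding steps could likely be replaced by the built-in html_entity_decode( ), which handles named, decimal, and hexadecimal entities in one call. This is a sketch of that alternative rather than the recipe's own method, and it assumes PHP 5.4 or later and UTF-8 output; the sample string is made up.

```php
<?php
// Assumption: the decoded text should be UTF-8.
$html = 'Caf&eacute; &amp; bar &#169; 2003, &#x20AC;5 &quot;special&quot;';

// ENT_QUOTES also decodes &quot; and &#039;
// ENT_HTML5 selects the HTML5 table of named entities
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n";
```

Unlike chr( ), which produces a single byte, this approach can emit characters above code point 255 as multibyte UTF-8 sequences.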

See Also

Recipe 9.8 for more on get_html_translation_table( ); documentation on preg_replace( ) at http://www.php.net/preg-replace, get_html_translation_table( ) at http://www.php.net/get-html-translation-table, and strip_tags( ) at http://www.php.net/strip-tags.


Created: March 11, 2003
Revised: March 11, 2003

URL: http://webreference.com/programming/php/chap11/1