How To Transform HTML To Textile Markup - The CakePHP TextileHelper Revisited
Posted by Tim Koschützki, on Aug 23, 2007 - in PHP & CakePHP » Views & Helpers
Hi folks. For a current project of mine I had to find a way to convert HTML into Textile markup. Why? Because we are using TinyMCE to turn our textareas into WYSIWYG editors, which generate HTML. However, we want all output to go through Textile so that only the tags Textile supports are allowed. Yes, we could do it with strip_tags(), but Textile is much more elegant. Plus, it was a requirement by the client. Read on to find out how to detextile HTML.
How Does It Work?
Well, the code is not entirely trivial, but it looks like what you would expect: a bunch of regex processing. Here is the basic detextile() method, which takes some (HTML) text:
    function detextile($text) {
        // ...
        'table','tr','td','u','del','sup','sub','blockquote');

        foreach($oktags as $tag){
            // ...
        }

        $text = $this->detextile_process_glyphs($text);
        $text = $this->detextile_process_lists($text);
        // ...
        return $this->decode_high($text);
    }
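To see the moving parts of that tag loop in isolation, here is a minimal standalone sketch. It is not the helper's code: the function names sketch_detextile() and sketch_process_tag(), the three-tag whitelist and the regular expression are all assumptions made for illustration. The only idea taken from the article is running a callback over every allowed tag.

```php
<?php
// Hypothetical stand-in for processTag(): handles only a few inline tags.
function sketch_process_tag(array $matches): string
{
    // $matches holds: whole match, tag name, attribute string, inner content.
    list($all, $tag, $atts, $content) = $matches;
    $phrase = array('em' => '_', 'strong' => '*', 'del' => '-');
    $tag = strtolower($tag);
    if (isset($phrase[$tag])) {
        // Wrap the content in the Textile marker (attributes are ignored here).
        return $phrase[$tag] . $content . $phrase[$tag];
    }
    return $all; // anything else is returned unchanged
}

function sketch_detextile(string $html): string
{
    $okTags = array('em', 'strong', 'del'); // assumed whitelist
    foreach ($okTags as $tag) {
        // Assumed pattern: <tag attrs>content</tag>, non-greedy, case-insensitive.
        $pattern = '/<(' . $tag . ')\b([^>]*)>(.*?)<\/\1>/si';
        $html = preg_replace_callback($pattern, 'sketch_process_tag', $html);
    }
    return $html;
}

echo sketch_detextile('<strong>bold</strong> and <em class="x">italic</em>');
// prints: *bold* and _italic_
```

Tags outside the whitelist are never matched, so they pass through untouched. That matches the underlined text surviving as HTML in the example output further down.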
Okay, so we run processTag() on every HTML tag we want to cover, process glyphs (we will get to that in a minute) and lists, strip out all tabs and paragraphs, and return the decoded text. The decoding assumes UTF-8 as the charset and makes use of the mb_decode_numericentity() function. So what does processTag do?
    function processTag($matches) {
        // ...
        $a = $this->splat($atts);

        $phr = array(
            'em'=>'_',
            'i'=>'__',
            'b'=>'**',
            'strong'=>'*',
            'cite'=>'??',
            'del'=>'-',
            'ins'=>'+',
            'sup'=>'^',
            'sub'=>'~',
            'span'=>'%');

        // ...
            return $phr[$tag].$this->sci($a).$content.$phr[$tag];
        } elseif($tag=='blockquote') {
            return 'bq.'.$this->sci($a).' '.$content;
        // ...
            return $tag.$this->sci($a).'. '.$content;
        } elseif ($tag=='a') {
            // ...
            $out = '"'.$content;
            $out.= '":'.$t['href'];
            return $out;
        } else {
            return $all;
        }
    }

    function sci($a)
    {
        $out = '';
        foreach($a as $t){
            $out.= ($t['name']=='class') ? '(='.$t['att'].')' : '';
            $out.= ($t['name']=='id') ? '[='.$t['att'].']' : '';
            $out.= ($t['name']=='style') ? '{='.$t['att'].'}' : '';
            $out.= ($t['name']=='cite') ? ':'.$t['att'] : '';
        }
        return $out;
    }
Here is where much of the conversion takes place. We have a map from HTML tags to their Textile markers and apply it here. Classes, ids and styles are preserved using the splat and sci methods: splat parses the attribute string into an array, and sci turns that array into the Textile attribute string. The splat method is quite sophisticated and long to explain, but it should become clear when you look at it below.
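The hand-off from attribute parsing to attribute output can be sketched on its own. This is a simplification under stated assumptions: sketch_splat() is a hypothetical regex-based stand-in that only understands quoted attributes, whereas the real splat() is a small state machine. sketch_sci() mirrors the output format of sci() for class, id and style only.

```php
<?php
// Simplified, hypothetical stand-in for splat(): quoted attributes only.
function sketch_splat(string $attr): array
{
    $pairs = array();
    // Matches name="value" or name='value'.
    preg_match_all('/([\w-]+)\s*=\s*(["\'])(.*?)\2/s', $attr, $found, PREG_SET_ORDER);
    foreach ($found as $m) {
        // Same array shape that sci() reads: 'name' and 'att'.
        $pairs[] = array('name' => strtolower($m[1]), 'att' => $m[3]);
    }
    return $pairs;
}

// Mirrors the article's sci() output for class, id and style.
function sketch_sci(array $pairs): string
{
    $out = '';
    foreach ($pairs as $t) {
        if ($t['name'] === 'class') { $out .= '(=' . $t['att'] . ')'; }
        if ($t['name'] === 'id')    { $out .= '[=' . $t['att'] . ']'; }
        if ($t['name'] === 'style') { $out .= '{=' . $t['att'] . '}'; }
    }
    return $out;
}

echo sketch_sci(sketch_splat(' class="note" id="intro"'));
// prints: (=note)[=intro]
```

Attributes that sci() does not know about, such as title, are parsed but produce no output.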
Now on to the glyphs and the list methods:
    function detextile_process_glyphs($text) {
        $glyphs = array(
            '&#8217;'=>'\'', # single closing
            '&#8216;'=>'\'', # single opening
            '&#8221;'=>'"', # double closing
            '&#8220;'=>'"', # double opening
            '&#8212;'=>'--', # em dash
            '&#8211;'=>' - ', # en dash
            '&#215;' =>'x', # dimension sign
            '&#8482;'=>'(TM)', # trademark
            '&#174;' =>'(R)', # registered
            '&#169;' =>'(C)', # copyright
            '&#8230;'=>'...' # ellipsis
        );

        foreach($glyphs as $f=>$r){
            // ...
        }
        return $text;
    }
Easy. It simply converts the HTML entities for some glyphs into their plain-text Textile equivalents.
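A table-driven replacement like this fits in a few lines. The sketch below is one way to do it, assuming the input carries numeric entities such as &#8217;. The function name sketch_glyphs() and the choice of strtr() are mine, not necessarily what the helper uses.

```php
<?php
// Hypothetical helper: replace a handful of numeric entities in one pass.
function sketch_glyphs(string $text): string
{
    $glyphs = array(
        '&#8217;' => "'",    // single closing quote
        '&#8220;' => '"',    // double opening quote
        '&#8221;' => '"',    // double closing quote
        '&#8212;' => '--',   // em dash
        '&#8230;' => '...',  // ellipsis
        '&#169;'  => '(C)',  // copyright
    );
    // strtr() with an array replaces every key and never rescans its own output.
    return strtr($text, $glyphs);
}

echo sketch_glyphs('&#8220;Done&#8221; &#8212; &#169; 2007&#8230;');
// prints: "Done" -- (C) 2007...
```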
The list method:
    function detextile_process_lists($text) {
        $list = false;
        // ...
        foreach($text as $line){
            // ...
                $line = "";
                $list = "o";
            // ...
                $line = "";
                $list = false;
            // ...
                $line = "";
                $list = "u";
            // ...
                $line = "";
                $list = false;
            } else if ($list == 'o'){
                // ...
            } else if ($list == 'u'){
                // ...
            }
            $glyph_out[] = $line;
        }
        // ...
    }
This method is a bit trickier. It removes the opening and closing list tags (ul, ol) and converts all li tags into their Textile equivalents: either "# " (for ordered lists) or "* " (for unordered lists).
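The list handling boils down to a small state machine: remember which kind of list you are in and pick the marker accordingly. Here is a hedged standalone sketch of that idea. The name sketch_lists(), the regular expressions and the newline emitted for each closing li tag are assumptions. Nested lists are not handled.

```php
<?php
// Hypothetical sketch: track the current list type while walking the tags.
function sketch_lists(string $html): string
{
    // Split on tags but keep them, so every tag becomes its own chunk.
    $chunks = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $list = false; // false, 'o' (ordered) or 'u' (unordered)
    $out = array();
    foreach ($chunks as $chunk) {
        if (preg_match('/^<ol\b/i', $chunk)) {
            $list = 'o'; $chunk = '';          // drop the opening tag
        } elseif (preg_match('/^<ul\b/i', $chunk)) {
            $list = 'u'; $chunk = '';
        } elseif (preg_match('/^<\/(ol|ul)\b/i', $chunk)) {
            $list = false; $chunk = '';        // drop the closing tag
        } elseif (preg_match('/^<li\b/i', $chunk)) {
            $chunk = ($list === 'o') ? '# ' : '* ';
        } elseif (preg_match('/^<\/li\b/i', $chunk)) {
            $chunk = "\n";                     // one list item per line
        }
        $out[] = $chunk;
    }
    return implode('', $out);
}

echo sketch_lists('<ul><li>one</li><li>two</li></ul><ol><li>first</li></ol>');
// prints three lines: "* one", "* two", "# first"
```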
How Do You Use The Code?
Using the code is darn easy. You just invoke the detextile method on your HTML code:
An Example
Here is some example HTML code we want to convert:
Detextile output:
    *This is some bold text*p. _This is italic text_p. p. <u>Underline text man</u>p. * ul list item1 * ul list item2* ul list item3# ol list item1# ol list item2# ol list item3
Cool! And it was easy as well!
Get The Code
Here are all the methods for your CakePHP Textile helper. You can of course plug them into a Textile helper for any other framework:
    // -------------------------------------------------------------
    // The following functions are used to detextile html, a process
    // still in development.
    // By Tim Koschützki
    //
    // Based on code from http://www.aquarionics.com
    //
    // -------------------------------------------------------------
    function detextile($text) {
        // ...
        'table','tr','td','u','del','sup','sub','blockquote');

        foreach($oktags as $tag){
            // ...
        }

        $text = $this->detextile_process_glyphs($text);
        $text = $this->detextile_process_lists($text);
        // ...
        return $this->decode_high($text);
    }

    function detextile_process_glyphs($text) {
        $glyphs = array(
            '&#8217;'=>'\'', # single closing
            '&#8216;'=>'\'', # single opening
            '&#8221;'=>'"', # double closing
            '&#8220;'=>'"', # double opening
            '&#8212;'=>'--', # em dash
            '&#8211;'=>' - ', # en dash
            '&#215;' =>'x', # dimension sign
            '&#8482;'=>'(TM)', # trademark
            '&#174;' =>'(R)', # registered
            '&#169;' =>'(C)', # copyright
            '&#8230;'=>'...' # ellipsis
        );

        foreach($glyphs as $f=>$r){
            // ...
        }
        return $text;
    }

    function detextile_process_lists($text) {
        $list = false;
        // ...
        foreach($text as $line){
            // ...
                $line = "";
                $list = "o";
            // ...
                $line = "";
                $list = false;
            // ...
                $line = "";
                $list = "u";
            // ...
                $line = "";
                $list = false;
            } else if ($list == 'o'){
                // ...
            } else if ($list == 'u'){
                // ...
            }
            $glyph_out[] = $line;
        }
        // ...
    }

    function processTag($matches) {
        // ...
        $a = $this->splat($atts);

        $phr = array(
            'em'=>'_',
            'i'=>'__',
            'b'=>'**',
            'strong'=>'*',
            'cite'=>'??',
            'del'=>'-',
            'ins'=>'+',
            'sup'=>'^',
            'sub'=>'~',
            'span'=>'%');

        // ...
            return $phr[$tag].$this->sci($a).$content.$phr[$tag];
        } elseif($tag=='blockquote') {
            return 'bq.'.$this->sci($a).' '.$content;
        // ...
            return $tag.$this->sci($a).'. '.$content;
        } elseif ($tag=='a') {
            // ...
            $out = '"'.$content;
            $out.= '":'.$t['href'];
            return $out;
        } else {
            return $all;
        }
    }

    // -------------------------------------------------------------
    function filterAtts($atts,$ok)
    {
        foreach($atts as $a) {
            // ...
                if($a['att']!='') {
                    $out[$a['name']] = $a['att'];
                }
            }
        }
        # dump($out);
        return $out;
    }

    // -------------------------------------------------------------
    function sci($a)
    {
        $out = '';
        foreach($a as $t){
            $out.= ($t['name']=='class') ? '(='.$t['att'].')' : '';
            $out.= ($t['name']=='id') ? '[='.$t['att'].']' : '';
            $out.= ($t['name']=='style') ? '{='.$t['att'].'}' : '';
            $out.= ($t['name']=='cite') ? ':'.$t['att'] : '';
        }
        return $out;
    }

    // -------------------------------------------------------------
    function splat($attr) // returns attributes as an array
    {
        $atnm = '';
        $mode = 0;
        // ...
            $ok = 0;
            switch ($mode) {
                case 0: // name
                    // ...
                        $atnm = $match[1]; $ok = $mode = 1;
                    }
                    break;

                case 1: // =
                    // ...
                        $ok = 1; $mode = 2;
                        break;
                    }
                    // ...
                        $ok = 1; $mode = 0;
                    }
                    break;

                case 2: // value
                    // ...
                        'att'=>str_replace('"','',$match[1]));
                        $ok = 1; $mode = 0;
                        break;
                    }
                    // ...
                        'att'=>str_replace("'",'',$match[1]));
                        $ok = 1; $mode = 0;
                        break;
                    }
                    // ...
                        $arr[]=
                        // ...
                        'att'=>$match[1]);
                        $ok = 1; $mode = 0;
                    }
                    break;
            }
            if ($ok == 0){
                // ...
                $mode = 0;
            }
        }
        if ($mode == 1) $arr[] =
        // ...

        return $arr;
    }
The code is based on an unfinished start from http://www.aquarionics.com. Thanks to the guys over there!
Have fun!