Parsing XML With The DOM Library
Posted by Tim Koschützki, on Jun 05, 2007 - in PHP & CakePHP » Other
The PHP 4 DOMXML extension was transformed substantially for PHP 5 and is now a lot easier to use. Compared to SimpleXML, DOM can at times be cumbersome and unwieldy. However, it is often the better choice. Please join me and find out why.
Since SimpleXML and DOM objects are interoperable you can use the former for simplicity and the latter for power. How you can exchange data between the two extensions is explained at the bottom of the article.
The DOM extension is especially useful when you want to modify XML documents, as SimpleXML, for example, offers no method for removing nodes from an XML document. For this article's code examples we will use the same foundation that we used in the Parsing XML with SimpleXML post.
We will use this very site's Google sitemap file, which can be downloaded here. The sitemap.xml file contains an XML list of the pages of php-coding-practices.com for easy indexing by Google.
Loading and Saving XML Documents
The DOM extension, just like SimpleXML, provides two ways to load xml documents - either by string or by filename:
$source = 'sitemap.xml';

// load from a file
$dom = new DomDocument();
$dom->load($source);

// load as string
$dom2 = new DomDocument();
$dom2->loadXML(file_get_contents($source));
In addition to that, the DomDocument object provides two methods for loading HTML, DomDocument::loadHTML() and DomDocument::loadHTMLFile(). The advantage is that the HTML does not have to be well-formed to load. Here is an example:
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
The good news is that malformed HTML is automatically converted into well-formed HTML. Look at this script:
$doc = new DOMDocument();
$doc->loadHTML("<html><body><p>Test<br></body></html>");
echo $doc->saveHTML();
The DomDocument::loadHTML() method will automatically add a DTD (Document Type Definition) and add the missing end-tag for the opened p-tag. Cool, isn't it?
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Test<br></p></body></html>
Saving XML data with the DOM library is just as easy. Call DomDocument::saveXML() or DomDocument::saveHTML() with no parameters; they serialize the document as XML or HTML and return the result as a string. DomDocument::save() and DomDocument::saveHTMLFile() write the XML or HTML to a file instead. Both expect a file path as a string parameter.
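As a minimal sketch of the four saving methods, the following script builds a tiny stand-in document instead of loading sitemap.xml. The temporary file paths are assumptions made purely for illustration.

```php
<?php
// Build a tiny document so the example does not depend on sitemap.xml.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->appendChild($dom->createElement('urlset'));

// saveXML() needs no arguments and returns the XML as a string.
$xmlString = $dom->saveXML();

// save() expects a file path and returns the number of bytes written,
// or false on failure. The temp file is only an illustrative target.
$xmlPath = tempnam(sys_get_temp_dir(), 'dom');
$bytesWritten = $dom->save($xmlPath);

// The HTML counterparts work the same way.
$html = new DOMDocument();
$html->loadHTML('<html><body><p>Test</p></body></html>');
$htmlString = $html->saveHTML();
$htmlPath = tempnam(sys_get_temp_dir(), 'dom');
$htmlBytes = $html->saveHTMLFile($htmlPath);

echo $xmlString;
```

Checking the return value of save() and saveHTMLFile() against false is a cheap way to notice an unwritable path.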
XPath Queries
One of the most powerful features of the DOM extension is the way in which it integrates with XPath queries. In fact, DomXpath is much more powerful than its SimpleXML equivalent:
$source = 'sitemap.xml';
$dom = new DomDocument();
$dom->load($source);

$xpath = new DomXPath($dom);
$xpath->registerNamespace('c', 'http://www.google.com/schemas/sitemap/0.84');
$result = $xpath->query("//c:loc/text()");

//echo $result->item(3)->data;
foreach($result as $b) {
    // print the text content of each loc node
    echo $b->data . '<br />';
}
Notice that the sitemap xml file contains a namespace already, which we register using DomXPath::registerNamespace():
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
We really have to register that namespace with the DomXPath object. The elements of the sitemap live in its default namespace, and XPath has no notion of a default namespace, so an unprefixed query would not match anything. ;) You can also register multiple namespaces, but more on that later. Notice that we use text() within the XPath query to get the actual text contents of the nodes.
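To see the effect of the registration in isolation, here is a small self-contained sketch. It uses an inline two-entry stand-in for the sitemap (an assumption, so the script runs without the real file) and compares an unprefixed query with a prefixed one.

```php
<?php
// Inline stand-in for sitemap.xml, using the same default namespace.
$xml = <<<XML
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://php-coding-practices.com/2007/04/</loc></url>
  <url><loc>http://php-coding-practices.com/2007/03/</loc></url>
</urlset>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Without a prefix the query looks for elements in no namespace.
$without = $xpath->query('//loc');

// After registering a prefix for the sitemap namespace, the query matches.
$xpath->registerNamespace('c', 'http://www.google.com/schemas/sitemap/0.84');
$with = $xpath->query('//c:loc/text()');

$locations = array();
foreach ($with as $textNode) {
    $locations[] = $textNode->data;
}

echo $without->length . ' match(es) without prefix, ' . count($locations) . " with prefix\n";
```

The prefix c is arbitrary. Only the namespace URI has to match the one declared in the document.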
If you want to learn the ins and outs of the xpath language, I recommend reading the W3C XPath Reference.
Modifying XML Documents
Adding New Nodes
To add new data to a loaded DOM document, we need to create new nodes by using the DomDocument::createElement(), DomDocument::createElementNS() and DomDocument::createTextNode() methods.
In the following we will add a new url to our urlset:
$source = 'sitemap.xml';
$dom = new DomDocument();
$dom->load($source);

// url element
$url = $dom->createElement('url');

// location
$loc = $dom->createElement('loc');
$text = $dom->createTextNode('http://php-coding-practices.com/article/');
$loc->appendChild($text);

// last modification
$lastmod = $dom->createElement('lastmod');
$text = $dom->createTextNode('2007-04-20T10:24:32+00:00');
$lastmod->appendChild($text);

// change frequency
$changefreq = $dom->createElement('changefreq');
$text = $dom->createTextNode('weekly');
$changefreq->appendChild($text);

// priority
$priority = $dom->createElement('priority');
$text = $dom->createTextNode('0.3');
$priority->appendChild($text);

// add the elements to the url
$url->appendChild($loc);
$url->appendChild($lastmod);
$url->appendChild($changefreq);
$url->appendChild($priority);

// add the new url to the root element (urlset)
$dom->documentElement->appendChild($url);
The code is pretty self-explanatory. First we create a new url element as well as some sub-elements. Then we append those sub-elements to the url element, which we in turn append to the document's root element. Note that the root element can be accessed via the $dom->documentElement property. The output:
....
<url>
<loc>http://php-coding-practices.com/2007/04/</loc>
<lastmod>2007-04-30T16:54:58+00:00</lastmod>
<changefreq>yearly</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>http://php-coding-practices.com/2007/03/</loc>
<lastmod>2007-03-29T20:04:51+00:00</lastmod>
<changefreq>yearly</changefreq>
<priority>0.5</priority>
</url>
Now, that was certainly not as easy as it would have been with SimpleXML. In return, the DOM extension provides many more methods and thus more power. For example, you can associate a namespace with an element at creation time using DomDocument::createElementNS(). I will provide some example code on that later in the article.
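Much of the verbosity when adding nodes comes from repeating the create/append pattern for every child. One way to tame it is a small helper. The function appendTextElement() below is hypothetical (it is not part of the DOM extension), and the bare <urlset/> root without a namespace is an assumption that keeps the sketch self-contained.

```php
<?php
/**
 * Hypothetical helper, not part of the DOM extension: creates an element
 * holding a single text node and appends it to $parent.
 * $parent must already belong to a document (ownerDocument is not null).
 */
function appendTextElement(DOMNode $parent, $name, $text)
{
    $dom = $parent->ownerDocument;
    $element = $dom->createElement($name);
    $element->appendChild($dom->createTextNode($text));
    return $parent->appendChild($element);
}

// Stand-in for the loaded sitemap.
$dom = new DOMDocument();
$dom->loadXML('<urlset/>');

$url = $dom->documentElement->appendChild($dom->createElement('url'));
appendTextElement($url, 'loc', 'http://php-coding-practices.com/article/');
appendTextElement($url, 'lastmod', '2007-04-20T10:24:32+00:00');
appendTextElement($url, 'changefreq', 'weekly');
appendTextElement($url, 'priority', '0.3');

echo $dom->saveXML();
```

Going through createTextNode() rather than passing a value straight to createElement() means special characters such as & in URLs are escaped when the document is serialized.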
Adding Attributes To Nodes
Via DomElement::setAttribute() we can easily add an attribute to an element. Example:
$url = $dom->createElement('url');
...
$url->setAttribute('meta:level','3');
Here we set a fictitious meta:level attribute with the value 3 on our url element from above.
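For reference, here is a short sketch of setting and reading attributes back. It uses an unprefixed attribute name, level, which is an assumption made to keep namespaces out of the picture.

```php
<?php
$dom = new DOMDocument();
$url = $dom->appendChild($dom->createElement('url'));

// setAttribute() adds the attribute, or overwrites it if it already exists.
$url->setAttribute('level', '3');
$url->setAttribute('level', '4');

$hasLevel = $url->hasAttribute('level');   // true
$level    = $url->getAttribute('level');   // the latest value, as a string
$missing  = $url->getAttribute('nope');    // empty string for unknown attributes

echo $dom->saveXML();
```

Because getAttribute() returns an empty string for a missing attribute, hasAttribute() is the safer check when an empty value is legitimate.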
Moving Data
Moving data is not as obvious as you might expect, as the DOM extension does not provide a dedicated method for it. Instead we have to combine an XPath query with DomNode::insertBefore(). As an example, suppose we want to move the second url of the sitemap just before the very first one:
$xpath = new DomXPath($dom);
$xpath->registerNamespace("c","http://www.google.com/schemas/sitemap/0.84");
$result = $xpath->query("//c:url");
$result->item(1)->parentNode->insertBefore($result->item(1),$result->item(0));
DomNode::insertBefore() takes two parameters, the new node and the reference node. It inserts the new node before the reference node. In our example, we insert the second url ($result->item(1)) before the first one ($result->item(0)).
I hear you asking why we call insertBefore() on $result->item(1)->parentNode. Couldn't we simply call it on $result->item(0)? No, of course not: insertBefore() must be called on the parent of the reference node, here the root element urlset, and not on a specific url (look at our XPath query).
We could use the following code, which is perfectly valid and gets us the same result, though:
$result->item(0)->parentNode->insertBefore($result->item(1),$result->item(0));
If we wanted to append the first url at the bottom of the sitemap, the following code is the way to go:
$result->item(0)->parentNode->appendChild($result->item(0));

// or $dom->documentElement->appendChild($result->item(0)); respectively
Easy, is it not? :) DomNode::insertBefore() and DomNode::appendChild() automatically move (rather than copy) nodes that are already part of the document. If you wish to keep the original in place, clone the node with DomNode::cloneNode() first:
$source = 'sitemap.xml';
$dom = new DomDocument();
$dom->load($source);

$xpath = new DomXPath($dom);
$xpath->registerNamespace("c","http://www.google.com/schemas/sitemap/0.84");
$result = $xpath->query("//c:url");
$clone = $result->item(0)->cloneNode(true);
$result->item(4)->parentNode->appendChild($clone);
The important thing here is that you have to pass true to DomNode::cloneNode() (the default is false), so that it copies all descendant nodes as well. If we had left it at false, we would have gotten an empty <url/> node, which is not desirable. ;)
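The difference between moving, shallow cloning and deep cloning is easiest to see on a tiny document. The following sketch assumes a namespace-free two-entry urlset so that the XPath queries need no prefix.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<urlset><url><loc>a</loc></url><url><loc>b</loc></url></urlset>');
$root  = $dom->documentElement;
$xpath = new DOMXPath($dom);

$first = $xpath->query('/urlset/url')->item(0);

// appendChild() on a node that is already in the tree moves it.
$root->appendChild($first);
$countAfterMove = $xpath->query('/urlset/url')->length; // still two urls
$orderAfterMove = $root->textContent;                   // text of all urls, in document order

// A shallow clone copies the element only, without its children.
$shallow = $first->cloneNode(false);

// A deep clone copies the whole subtree; appending it adds a third url.
$deep = $first->cloneNode(true);
$root->appendChild($deep);
$countAfterClone = $xpath->query('/urlset/url')->length;

echo $dom->saveXML();
```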
Modifying Node Data
When modifying node data, you want to change the character data (the text) within a node. You can use XPath again to find the text node you want to edit and then simply assign a new value to its data property:
$source = 'sitemap.xml';
$dom = new DomDocument();
$dom->load($source);

$xpath = new DomXPath($dom);
$xpath->registerNamespace("c","http://www.google.com/schemas/sitemap/0.84");
$result = $xpath->query("//c:loc/text()");

$node = $result->item(1);
// overwrite the text node's data with its uppercase version
$node->data = strtoupper($node->data);
-
This code transforms the location data of the second url to uppercase letters.
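The same idea as a self-contained sketch, using an inline document with placeholder example.com URLs (an assumption) instead of the sitemap. It also shows nodeValue as an alternative when you hold the element rather than its text node.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<urlset>'
    . '<url><loc>http://example.com/first/</loc></url>'
    . '<url><loc>http://example.com/second/</loc></url>'
    . '</urlset>'
);
$xpath = new DOMXPath($dom);

// Variant 1: edit the text node through its data property.
$textNode = $xpath->query('//loc/text()')->item(1);
$textNode->data = strtoupper($textNode->data);

// Variant 2: assign to nodeValue on the element itself,
// which replaces its content with the given text.
$firstLoc = $xpath->query('//loc')->item(0);
$firstLoc->nodeValue = 'http://example.com/changed/';

echo $dom->saveXML();
```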
Removing Data From XML Documents
There are three types of data that you might want to remove from XML documents: elements, attributes and character data. The DOM extension provides a method for each of them: DomNode::removeChild(), DomElement::removeAttribute() and DomCharacterData::deleteData(). We will use a custom XML document and not our sitemap to demonstrate their behavior. This makes it easier for you to come back to this article and see at first glance how these methods work. Thank Nikos if you want to. ;)
$xml = <<<XML
<xml>
<text type="input">This is some really cool text!</text>
<text type="input">This is some other really cool text!</text>
<text type="misc">This is some cool text!</text>
<text type="output">This is text!</text>
</xml>
XML;

$dom = new DomDocument();
$dom->loadXML($xml);
$xpath = new DomXPath($dom);

$result = $xpath->query("//text");

// remove first node
$result->item(0)->parentNode->removeChild($result->item(0));

// remove attribute from second node
$result->item(1)->removeAttribute('type');

// delete data from third element
$result = $xpath->query('text()',$result->item(2));
$result->item(0)->deleteData(0, $result->item(0)->length);
-
The output of this is:
<?xml version="1.0"?>
<xml>

<text>This is some other really cool text!</text>
<text type="misc"></text>
<text type="output">This is text!</text>
</xml>
In this example we start by retrieving all text elements from the document. Then we remove some data from that document. Simple.
In fact, we remove the first element altogether as well as the attribute of the second element. Finally we truncate the character data of the third element, using XPath to query the corresponding text() node.
Note that DomCharacterData::deleteData() requires a starting offset and a length parameter. Since we want to truncate the data in our example, we supply 0 and the length of the text node.
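To illustrate the offset and length parameters, this sketch first deletes a single word from the middle of a text node and then truncates it completely. The one-element document is an assumption made for brevity.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<text>This is some really cool text!</text>');

// The only child of the root element is its text node.
$textNode = $dom->documentElement->firstChild;

// Partial delete: remove the word "really " from the middle.
$word   = 'really ';
$offset = strpos($textNode->data, $word);
$textNode->deleteData($offset, strlen($word));
$afterPartial = $textNode->data;

// Full truncate: start at 0 and delete the node's whole length.
$textNode->deleteData(0, $textNode->length);
$afterTruncate = $textNode->data;

echo $afterPartial . "\n";
```

Computing the offset with strpos() works here because the text is plain ASCII; with multibyte text, byte offsets and DOM character offsets can differ.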
DOM And Working With Namespaces
DOM is very capable of handling namespaces on its own. Most of the time you can ignore them and pass attribute and element names with the appropriate prefix directly to most DOM functions.
$dom = new DomDocument();

$node = $dom->createElement('ns1:somenode');
$node->setAttribute('ns2:someattribute','somevalue');

$node2 = $dom->createElement('ns3:anothernode');
$node->appendChild($node2);

// Set xmlns attributes

$node->setAttribute('xmlns:ns1', 'http://php-coding-practices.com/');
$node->setAttribute('xmlns:ns2', 'http://php-coding-practices.com/articles/');
$node->setAttribute('xmlns:ns3', 'http://php-coding-practices.com/sitemap/');
$node->setAttribute('xmlns:ns4', 'http://php-coding-practices.com/about-the-author/');

$dom->appendChild($node);
-
The output of this script is:
<?xml version="1.0"?>
<ns1:somenode
ns2:someattribute="somevalue"
xmlns:ns1="http://php-coding-practices.com/"
xmlns:ns2="http://php-coding-practices.com/articles/"
xmlns:ns3="http://php-coding-practices.com/sitemap/"
xmlns:ns4="http://php-coding-practices.com/about-the-author/">
<ns3:anothernode/>
</ns1:somenode>
We can simplify the use of namespaces somewhat by using DomDocument::createElementNS() and DomElement::setAttributeNS(), which were specifically designed for this purpose:
$dom = new DomDocument();

$node = $dom->createElementNS('http://php-coding-practices.com/', 'ns1:somenode');
$node->setAttributeNS('http://somewebsite.com/ns2', 'ns2:someattribute', 'somevalue');

$node2 = $dom->createElementNS('http://php-coding-practices.com/articles/', 'ns3:anothernode');
$node3 = $dom->createElementNS('http://php-coding-practices.com/sitemap/', 'ns1:someothernode');

$node->appendChild($node2);
$node->appendChild($node3);

$dom->appendChild($node);
This results in the following output:
<?xml version="1.0"?>
<ns1:somenode
xmlns:ns1="http://php-coding-practices.com/"
xmlns:ns2="http://somewebsite.com/ns2"
xmlns:ns3="http://php-coding-practices.com/articles/"
xmlns:ns11="http://php-coding-practices.com/sitemap/"
ns2:someattribute="somevalue">
<ns3:anothernode xmlns:ns3="http://php-coding-practices.com/articles/"/>
<ns11:someothernode xmlns:ns1="http://php-coding-practices.com/sitemap/"/>
</ns1:somenode>
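Nodes created with the NS variants carry their namespace information with them, so it can be read back and used for lookups. A minimal sketch, reusing the namespace URIs from the example above:

```php
<?php
$dom = new DOMDocument();

$root = $dom->createElementNS('http://php-coding-practices.com/', 'ns1:somenode');
$root->setAttributeNS('http://somewebsite.com/ns2', 'ns2:someattribute', 'somevalue');
$dom->appendChild($root);

$child = $dom->createElementNS('http://php-coding-practices.com/articles/', 'ns3:anothernode');
$root->appendChild($child);

// Every namespaced node exposes its URI, prefix and local name.
$uri    = $child->namespaceURI;
$prefix = $child->prefix;
$local  = $child->localName;

// Lookups go by namespace URI plus local name, not by prefix.
$attr  = $root->getAttributeNS('http://somewebsite.com/ns2', 'someattribute');
$found = $dom->getElementsByTagNameNS('http://php-coding-practices.com/articles/', 'anothernode');

echo $prefix . ':' . $local . ' is in ' . $uri . "\n";
```

Looking things up by URI rather than by prefix keeps the code working even if a document uses a different prefix for the same namespace.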
Interfacing With SimpleXML
As I mentioned at the start of our little DOM journey, it is very easy to exchange loaded documents between SimpleXML and DOM. Therefore, you can take advantage of both systems' strengths: SimpleXML's simplicity and DOM's power.
You can import a SimpleXML object into DOM by using PHP's dom_import_simplexml() function:
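$sxml = simplexml_load_file('sitemap.xml');
$node = dom_import_simplexml($sxml);
$dom = new DomDocument();
// importNode() returns the copy that belongs to $dom; append that copy
$node = $dom->importNode($node, true);
$dom->appendChild($node);
DomDocument::importNode() creates a copy of the node and associates it with the current document. Its second parameter - a boolean value - determines whether the method recursively imports the subtree or not. The method returns the copy, and it is this copy that has to be appended; appending the original node, which still belongs to another document, fails with a "Wrong Document" error.
You can also import a DOM object into SimpleXML using simplexml_import_dom():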
$dom = new DomDocument();
$dom->load('sitemap.xml');
$sxe = simplexml_import_dom($dom);
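The two import functions can be sketched together in a small self-contained script. It uses an inline two-entry document (an assumption, to avoid depending on sitemap.xml) and relies on dom_import_simplexml() handing back the very node the SimpleXML object wraps, so a removal done through DOM is visible from SimpleXML afterwards.

```php
<?php
$xml = '<urlset><url><loc>a</loc></url><url><loc>b</loc></url></urlset>';

// SimpleXML -> DOM: get a DOMElement for the SimpleXML root.
$sxml = simplexml_load_string($xml);
$root = dom_import_simplexml($sxml);

// Use DOM's power to remove the first url ...
$root->removeChild($root->firstChild);

// ... and SimpleXML's simplicity to read what is left.
$remaining    = count($sxml->url);
$remainingLoc = (string) $sxml->url[0]->loc;

// DOM -> SimpleXML: load with DOM, read with SimpleXML.
$dom = new DOMDocument();
$dom->loadXML($xml);
$sxe = simplexml_import_dom($dom);
$firstLoc = (string) $sxe->url[0]->loc;

echo $remaining . ' url left: ' . $remainingLoc . "\n";
```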
Conclusion
DOM is certainly a very powerful way of dealing with XML documents. While it provides a good interface for basically every task one could dream of, it often takes quite a lot of lines of code to accomplish a task. SimpleXML's interface is of course a little easier, but less powerful.
In particular, the fact that SimpleXML is rather incapable of removing data makes DOM the way to go for more complicated XML document processing. DOM's power in dealing with namespaces makes it a valuable tool when dealing with large amounts of data where naming conflicts are likely.
In fact, we covered only a small portion of DOM's power. There are many other associated classes with several useful methods. For example, we have not covered how to append character data. Check the DOM function reference for more information.
Thanks for staying with me on the DOM boat till the end of our journey! I hope you enjoyed it - please mind the gap between the boat and the jetty when leaving.
18 Comments
good stuff on namespaces; didn't see anything about validating a document (dtd, schema) or error handling
worst site
I have been trying to run the namespaces example among others and I keep getting -
Parse error: syntax error, unexpected T_VARIABLE in /Library/WebServer/Documents/tester/dom/read.php on line 4
Any ideas would be much appreciated,
Cheers
Chris
Can you show me your code please?
Good,,,
Just a note to say thanks, a really useful article. Was going around the bend not understanding why addAttribute wasn't working; setAttribute fixed the problem. Thanks, Rob.
Very useful article, thx Tim!!
Hi Tim,
I'm wondering how do you format the xml output? Like in the 8th panel above (titled "XML") the output is nicely formatted with each node on a separate line. My output is always in one very very long line. Any tips are appreciated.
Thank you!
Mikhail: Unfortunately I had the same problem like you and just formatted it so it could actually be analyzed by people here reading the post. : /
I can't think of anything other than parsing the XML output again to make it work. However, the DOM library might provide some tool for this? I can't seem to find it though. : (
Thanks Tim! I guess in the big scheme of things it doesn't matter how it's stored in a file. As long as I can parse it ok :).
Absolutely :]
Ok, I found a nice solution to formatting the XML output - it is using the newline \n (and tab \t) characters. The idea is to append these before and after the text-node value, as shown in simple example below:
$dom = domxml_open_file($xmlpath);
$root = $dom->document_element(); //get root node
$dom->preserveWhiteSpace = true;
$dom->formatOutput = true;
//create entry container
$entry_container = $dom->create_element("entry");
$entry_container = $root->append_child($entry_container);
//create word within entry
$word = $dom->create_element("word");
$word = $entry_container->append_child($word);
//write to word container
$word_value = $dom->create_text_node("\n".$_POST['myword']."\n");
$word_value = $word->append_child($word_value);
$dom->dump_file($xmlpath, false, true); //save xml file
this outputs a word on a separate line, between the opening and closing tags. Much neater I suppose.
Perfect, thanks for sharing Mikhail! :)


