Parsing XML With The DOM Library
Posted on 5/6/07 by Tim Koschützki
The PHP 4 DOMXML extension has undergone a serious transformation in PHP 5 and is now a lot easier to use. Compared to SimpleXML, DOM can, at times, be cumbersome and unwieldy. However, it is often a better choice than SimpleXML. Please join me and find out why.
Since SimpleXML and DOM objects are interoperable you can use the former for simplicity and the latter for power. How you can exchange data between the two extensions is explained at the bottom of the article.
The DOM extension is especially useful when you want to modify XML documents, as SimpleXML, for example, does not allow you to remove nodes from an XML document. For this article's code examples we will use the same foundation that we used in the Parsing XML with SimpleXML post.
We will use this very site's Google sitemap file, which can be downloaded here. The sitemap.xml file features an XML list of the pages of php-coding-practices.com for easy indexing by Google.
Loading and Saving XML Documents
The DOM extension, just like SimpleXML, provides two ways to load xml documents - either by string or by filename:
// $source is the path to the XML file
// load from a file
$dom = new DomDocument();
$dom->load($source);
// load from a string
$dom2 = new DomDocument();
$dom2->loadXML(file_get_contents($source));
In addition to that, the DomDocument object provides two methods to load HTML: DomDocument::loadHTML() for strings and DomDocument::loadHTMLFile() for files. The advantage is that HTML does not have to be well-formed to load.
The cool news is that malformed HTML will automatically be turned into well-formed HTML. Look at this script:
$doc = new DomDocument();
// note the missing closing tag for the p element
$doc->loadHTML("<html><body><p>Test</body></html>");
echo $doc->saveHTML();
The DomDocument::loadHTML() method will automatically add a DTD (Document Type Definition) and add the missing end-tag for the opened p-tag. Cool, isn't it?
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Test</p></body></html>
Saving XML data with the DOM library is just as easy. Just call DomDocument::saveHTML() or DomDocument::saveXML() with no parameters. They serialize the document to HTML or XML and return it as a string. DomDocument::saveHTMLFile() and DomDocument::save() write HTML and XML to a file instead. They expect the file path as a string parameter.
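Here is a minimal sketch of both ways of saving, built on a tiny document created in memory. The example.com URL and the temporary file path are made up for illustration, and setting formatOutput is optional; it only asks the serializer to indent the result.

```php
<?php
// Build a very small document in memory (sample data, not the real sitemap).
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput = true; // optional: indent the serialized output

$root = $dom->createElement('urlset');
$dom->appendChild($root);
$url = $dom->createElement('url');
$root->appendChild($url);
$url->appendChild($dom->createElement('loc', 'http://example.com/'));

// saveXML() returns the whole document as a string.
$xml = $dom->saveXML();
echo $xml;

// save() writes the document to a file and returns the number of bytes written.
$path = tempnam(sys_get_temp_dir(), 'dom'); // hypothetical target file
$bytes = $dom->save($path);
```

Because formatOutput is a property of the document, it affects both saveXML() and save().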
XPath Queries
One of the most powerful features of the DOM extension is the way in which it integrates with XPath queries. In fact, DomXPath is much more powerful than its SimpleXML equivalent:
$dom = new DomDocument();
$dom->load($source);
$xpath = new DomXPath($dom);
$xpath->registerNamespace('c', 'http://www.google.com/schemas/sitemap/0.84');
$result = $xpath->query("//c:loc/text()");
echo $result->length . "\n";
//echo $result->item(3)->data;
foreach ($result as $b) {
    echo $b->data . "\n";
}
Notice that the sitemap XML file already declares a namespace, which we register using DomXPath::registerNamespace(). We really have to register that namespace with the DomXPath object or else it will not know where to search. ;) You can also register multiple namespaces, but more on that later. Notice that we use text() within the XPath query to get the actual text contents of the nodes.
If you want to learn the ins and outs of the xpath language, I recommend reading the W3C XPath Reference.
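To make the namespace point concrete, here is a small self-contained sketch. The two-entry sitemap and its example.com URLs are invented sample data; only the namespace URI is taken from the sitemap discussed above.

```php
<?php
// Sample data: a tiny sitemap that declares the namespace as its default namespace.
$xml = <<<XML
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://example.com/a/</loc><priority>0.5</priority></url>
  <url><loc>http://example.com/b/</loc><priority>0.8</priority></url>
</urlset>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('c', 'http://www.google.com/schemas/sitemap/0.84');

// Without a prefix nothing matches, because <loc> lives in the default namespace.
$unprefixed = $xpath->query('//loc');

// With the registered prefix both <loc> elements are found.
$prefixed = $xpath->query('//c:loc');
$locations = [];
foreach ($prefixed as $loc) {
    $locations[] = $loc->nodeValue;
}

// Predicates work as usual: only urls with a priority above 0.6.
$important = $xpath->query('//c:url[c:priority > 0.6]/c:loc');

echo $unprefixed->length, ' ', $prefixed->length, ' ', $important->length, "\n";
```

The prefix c is arbitrary. It only has to match between registerNamespace() and the query, not the prefix used in the document itself.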
Modifying XML Documents
Adding New Nodes
To add new data to a loaded DOM document, we need to create new nodes by using the DomDocument::createElement(), DomDocument::createElementNS() and DomDocument::createTextNode() methods.
In the following we will add a new url to our urlset:
$dom = new DomDocument();
$dom->load($source);
// url element
$url = $dom->createElement('url');
// location
$loc = $dom->createElement('loc');
$text = $dom->createTextNode('http://php-coding-practices.com/article/');
$loc->appendChild($text);
// last modification
$lastmod= $dom->createElement('lastmod');
$text = $dom->createTextNode('2007-04-20T10:24:32+00:00');
$lastmod->appendChild($text);
// change frequency
$changefreq= $dom->createElement('changefreq');
$text = $dom->createTextNode('weekly');
$changefreq->appendChild($text);
// priority
$priority= $dom->createElement('priority');
$text = $dom->createTextNode('0.3');
$priority->appendChild($text);
// add the elements to the url
$url->appendChild($loc);
$url->appendChild($lastmod);
$url->appendChild($changefreq);
$url->appendChild($priority);
// add the new url to the root element (urlset)
$dom->documentElement->appendChild($url);
echo $dom->saveXML();
The code is pretty self-explanatory. First we create a new url element as well as some sub-elements. Then we append those sub-elements to the url element, which we in turn append to the document's root element. Note that the root element can be accessed via the $dom->documentElement property. The output:
<url>
<loc>http://php-coding-practices.com/2007/04/</loc>
<lastmod>2007-04-30T16:54:58+00:00</lastmod>
<changefreq>yearly</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>http://php-coding-practices.com/2007/03/</loc>
<lastmod>2007-03-29T20:04:51+00:00</lastmod>
<changefreq>yearly</changefreq>
<priority>0.5</priority>
</url>
Now it was certainly not as easy as it would have been had we used SimpleXML. In return, the DOM extension provides many more methods and thus more power. For example, you can associate a namespace with an element on creation using DomDocument::createElementNS(). I will provide some example code on that later in the article.
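One detail is worth a quick sketch: in a document with a default namespace, an element made with createElement() belongs to no namespace, so a namespace-aware XPath query does not see it until the document is saved and reloaded. The snippet below shows the difference on invented sample data; the example.com URLs are assumptions.

```php
<?php
$ns = 'http://www.google.com/schemas/sitemap/0.84';

$dom = new DOMDocument();
$dom->loadXML('<urlset xmlns="' . $ns . '"><url><loc>http://example.com/a/</loc></url></urlset>');

// createElement(): the new element is in no namespace.
$plain = $dom->createElement('url');
$plain->appendChild($dom->createElement('loc', 'http://example.com/plain/'));
$dom->documentElement->appendChild($plain);

// createElementNS(): the new element is in the sitemap namespace.
$url = $dom->createElementNS($ns, 'url');
$url->appendChild($dom->createElementNS($ns, 'loc', 'http://example.com/namespaced/'));
$dom->documentElement->appendChild($url);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('c', $ns);

// Counts the original url and the namespaced one, but not $plain.
$found = $xpath->query('//c:url');
echo $found->length, "\n";
```

If you plan to query freshly added nodes with XPath before saving, createElementNS() is the safer choice.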
Adding Attributes To Nodes
Via DomElement::setAttribute() we can easily add an attribute to an element. Example:
...
$url->setAttribute('meta:level','3');
Here we set a fictitious meta:level attribute with the value 3 on our url DomElement from above.
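A short sketch of the attribute methods on a stand-alone element follows. The attribute name level is just an example, without the meta: prefix to keep things simple.

```php
<?php
$dom = new DOMDocument();
$url = $dom->createElement('url');
$dom->appendChild($url);

$url->setAttribute('level', '3');        // adds the attribute
$url->setAttribute('level', '4');        // setting it again overwrites the value

$hasLevel = $url->hasAttribute('level'); // true
$level    = $url->getAttribute('level'); // '4'
$missing  = $url->getAttribute('nope');  // '' for an attribute that does not exist

// saveXML() accepts a node, so we can print just this element.
echo $dom->saveXML($url), "\n";
```

Since getAttribute() returns an empty string for a missing attribute, hasAttribute() is the way to tell "empty" and "absent" apart.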
Moving Data
Moving data is not as obvious as you might expect, as the DOM extension does not provide a dedicated method that takes care of it explicitly. Instead we have to use DomNode::insertBefore() in combination with an XPath query that locates the nodes. As an example, suppose we want to move the second url in the sitemap just before the very first url:
$xpath = new DomXPath($dom);
$xpath->registerNamespace("c","http://www.google.com/schemas/sitemap/0.84");
$result = $xpath->query("//c:url");
$result->item(1)->parentNode->insertBefore($result->item(1),$result->item(0));
echo $dom->saveXML();
DomNode::insertBefore() takes two parameters, the new node and the reference node. It inserts the new node before the reference node. In our example, we insert the second url ($result->item(1)) before the first one ($result->item(0)).
I hear you asking why we call insertBefore() on the $result->item(1)->parentNode node. Couldn't we just as easily call it on $result->item(0)? No, of course not, as we need to execute DomNode::insertBefore() on the parent of the reference node, which is the root element urlset, and not on a specific url (look at our XPath query).
We could use the following code, which is perfectly valid and gets us the same results, though:
$dom->documentElement->insertBefore($result->item(1), $result->item(0));
If we wanted to append the first url at the bottom of the sitemap, the following code is the way to go:
$result->item(0)->parentNode->appendChild($result->item(0));
// or $dom->documentElement->appendChild($result->item(0)); respectively
Easy, is it not? :) DomNode::insertBefore() and DomNode::appendChild() automatically move the corresponding nodes rather than copy them. If you wish to keep the original in place, clone the node with DomNode::cloneNode() first:
$dom = new DomDocument();
$dom->load($source);
$xpath = new DomXPath($dom);
$xpath->registerNamespace("c","http://www.google.com/schemas/sitemap/0.84");
$result = $xpath->query("//c:url");
$clone = $result->item(0)->cloneNode(true);
$result->item(4)->parentNode->appendChild($clone);
echo $dom->saveXML();
The important thing here is that you have to supply DomNode::cloneNode() with a true parameter (default is false), so that it copies all descendant nodes as well. If we had left that at false, we would have gotten an empty <url></url> node, which is not desirable. ;)
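The difference between moving, deep cloning and shallow cloning is easy to try on a tiny list. The document and the childTexts() helper below are made up for this sketch and are not part of the DOM extension.

```php
<?php
// Hypothetical helper: returns the text of every child node, in document order.
function childTexts(DOMNode $parent)
{
    $texts = [];
    foreach ($parent->childNodes as $child) {
        $texts[] = $child->textContent;
    }
    return $texts;
}

$dom = new DOMDocument();
$dom->loadXML('<list><item>a</item><item>b</item><item>c</item></list>');
$root  = $dom->documentElement;
$items = $root->getElementsByTagName('item');

// Move: the third item leaves its old position and becomes the first child.
$root->insertBefore($items->item(2), $items->item(0));
$afterMove = childTexts($root);

// Copy: a deep clone keeps the text, so the list gains a second "c" at the end.
$deep = $root->firstChild->cloneNode(true);
$root->appendChild($deep);
$afterCopy = childTexts($root);

// A shallow clone copies the element itself but none of its children.
$shallow = $root->firstChild->cloneNode(false);
$shallowIsEmpty = !$shallow->hasChildNodes();

echo implode(',', $afterMove), ' | ', implode(',', $afterCopy), "\n";
```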
Modifying Node Data
When modifying node data, you want to modify the character data (the text) within a node. You can use XPath again to find the node you want to edit and then simply assign a new value to its data property:
$dom = new DomDocument();
$dom->load($source);
$xpath = new DomXPath($dom);
$xpath->registerNamespace("c","http://www.google.com/schemas/sitemap/0.84");
$result = $xpath->query("//c:loc/text()");
$node = $result->item(1);
$node->data = strtoupper($node->data);
echo $dom->saveXML();
This code transforms the location data of the second url to uppercase letters.
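The same idea works without XPath. The sketch below uses an invented one-entry document and shows two options: changing the data of the text node, and assigning to the element's nodeValue to replace its text content in one go.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<urlset><url><loc>http://example.com/a/</loc></url></urlset>');

$loc = $dom->getElementsByTagName('loc')->item(0);

// Option 1: the first child of <loc> is its text node; change its data.
$loc->firstChild->data = strtoupper($loc->firstChild->data);
$afterData = $loc->textContent;

// Option 2: assign to nodeValue on the element to replace its text content.
$loc->nodeValue = 'http://example.com/b/';
$afterNodeValue = $loc->textContent;

echo $afterData, "\n", $afterNodeValue, "\n";
```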
Removing Data From XML Documents
There are three types of data that you would possibly want to remove from XML documents: elements, attributes and character data. The DOM extension provides a method for each of them: DomNode::removeChild(), DomElement::removeAttribute() and DomCharacterData::deleteData(). We will use a custom XML document and not our sitemap to demonstrate their behavior. This makes it easier for you to come back to this article and see at first glance how these methods work. Thank Nikos if you want to. ;)
$xml = <<<XML
<xml>
<text type="input">This is some really cool text!</text>
<text type="input">This is some other really cool text!</text>
<text type="misc">This is some cool text!</text>
<text type="output">This is text!</text>
</xml>
XML;
$dom = new DomDocument();
$dom->loadXML($xml);
$xpath = new DomXPath($dom);
$result = $xpath->query("//text");
// remove first node
$result->item(0)->parentNode->removeChild($result->item(0));
// remove attribute from second node
$result->item(1)->removeAttribute('type');
//delete data from third element
$result = $xpath->query('text()',$result->item(2));
$result->item(0)->deleteData(0, $result->item(0)->length);
echo $dom->saveXML();
The output of this is:
<xml>
<text>This is some other really cool text!</text>
<text type="misc"></text>
<text type="output">This is text!</text>
</xml>
In this example we start by retrieving all text elements from the document. Then we remove some data from that document. Simple.
In fact we remove the first node altogether as well as the type attribute of the second node. Finally we truncate the character data of the third node, using XPath to query the corresponding text() node.
Note that DomCharacterData::deleteData() requires a starting offset and a length parameter. Since we want to remove all of the data in our example, we supply 0 and the length of the text node.
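When you remove several elements in a loop, keep in mind that the list returned by getElementsByTagName() is live, so it shrinks while you delete from it. A minimal sketch of one way around that is to copy the list into an array first. The document below is sample data.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<xml><text type="input">one</text><text type="misc">two</text><text type="input">three</text></xml>'
);

// Copy the live node list into a plain array before removing anything.
$texts = iterator_to_array($dom->getElementsByTagName('text'));

foreach ($texts as $text) {
    if ($text->getAttribute('type') === 'input') {
        $text->parentNode->removeChild($text);
    }
}

$remaining = $dom->saveXML($dom->documentElement);
echo $remaining, "\n";
```

Results of DomXPath::query() are not live in this way, which is why the example above could safely address the nodes by index after removing the first one.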
DOM And Working With Namespaces
DOM is very capable of handling namespaces on its own. Most of the time you can ignore them and pass attribute and element names with the appropriate prefix directly to most DOM functions.
$dom = new DomDocument();
$node = $dom->createElement('ns1:somenode');
$node->setAttribute('ns2:someattribute','somevalue');
$node2 = $dom->createElement('ns3:anothernode');
$node->appendChild($node2);
// Set xmlns attributes
$node->setAttribute('xmlns:ns1', 'http://php-coding-practices.com/');
$node->setAttribute('xmlns:ns2', 'http://php-coding-practices.com/articles/');
$node->setAttribute('xmlns:ns3', 'http://php-coding-practices.com/sitemap/');
$node->setAttribute('xmlns:ns4', 'http://php-coding-practices.com/about-the-author/');
$dom->appendChild($node);
echo $dom->saveXML();
The output of this script is:
<ns1:somenode
ns2:someattribute="somevalue"
xmlns:ns1="http://php-coding-practices.com/"
xmlns:ns2="http://php-coding-practices.com/articles/"
xmlns:ns3="http://php-coding-practices.com/sitemap/"
xmlns:ns4="http://php-coding-practices.com/about-the-author/">
<ns3:anothernode/>
</ns1:somenode>
We can simplify the use of namespaces somewhat by using DomDocument::createElementNS() and DomElement::setAttributeNS(), which were specifically designed for this purpose:
$dom = new DomDocument();
$node = $dom->createElementNS('http://php-coding-practices.com/', 'ns1:somenode');
$node->setAttributeNS('http://somewebsite.com/ns2', 'ns2:someattribute', 'somevalue');
$node2 = $dom->createElementNS('http://php-coding-practices.com/articles/', 'ns3:anothernode');
$node3 = $dom->createElementNS('http://php-coding-practices.com/sitemap/', 'ns1:someothernode');
$node->appendChild($node2);
$node->appendChild($node3);
$dom->appendChild($node);
echo $dom->saveXML();
This results in the following output:
<ns1:somenode
xmlns:ns1="http://php-coding-practices.com/"
xmlns:ns2="http://somewebsite.com/ns2"
xmlns:ns3="http://php-coding-practices.com/articles/"
xmlns:ns11="http://php-coding-practices.com/sitemap/"
ns2:someattribute="somevalue">
<ns3:anothernode xmlns:ns3="http://php-coding-practices.com/articles/"/>
<ns11:someothernode xmlns:ns1="http://php-coding-practices.com/sitemap/"/>
</ns1:somenode>
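Once nodes carry a real namespace, you can read it back and search by it. The following sketch uses made-up example.com namespace URIs to show the prefix, localName and namespaceURI properties together with getElementsByTagNameNS() and getAttributeNS().

```php
<?php
$dom = new DOMDocument();

$root = $dom->createElementNS('http://example.com/ns1', 'ns1:root');
$dom->appendChild($root);

$child = $dom->createElementNS('http://example.com/ns2', 'ns2:child', 'hello');
$root->appendChild($child);

$root->setAttributeNS('http://example.com/ns2', 'ns2:flag', 'yes');

// Every namespaced node knows its prefix, local name and namespace URI.
$info = [$child->prefix, $child->localName, $child->namespaceURI];

// Lookups use the namespace URI plus the local name, never the prefix.
$matches = $dom->getElementsByTagNameNS('http://example.com/ns2', 'child');
$flag    = $root->getAttributeNS('http://example.com/ns2', 'flag');

echo implode(' ', $info), ' ', $matches->length, ' ', $flag, "\n";
```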
Interfacing With SimpleXML
As I mentioned at the start of our little DOM journey, it is very easy to exchange loaded documents between SimpleXML and DOM. Therefore, you can take advantage of both systems' strengths - SimpleXML's simplicity and DOM's power.
You can import a SimpleXML object into DOM by using PHP's dom_import_simplexml() function:
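$sxml = simplexml_load_file('sitemap.xml');
$node = dom_import_simplexml($sxml);
$dom = new DomDocument();
// importNode() returns the copy that belongs to $dom; append that copy
$node = $dom->importNode($node, true);
$dom->appendChild($node);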
DomDocument::importNode() creates a copy of the node and associates it with the current document. Its second parameter - a boolean value - determines if the method will recursively import the subtree or not.
You can also import a DOM object into SimpleXML using simplexml_import_dom():
$dom = new DomDocument();
$dom->load('sitemap.xml');
$sxe = simplexml_import_dom($dom);
echo $sxe->url[0]->loc;
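A typical reason to switch sides is removing a node while otherwise staying in SimpleXML. The sketch below assumes a small invented two-entry document. dom_import_simplexml() hands back the DOM element behind the SimpleXML object, so a removal made through DOM is visible on the SimpleXML side as well.

```php
<?php
$sxml = simplexml_load_string(
    '<urlset>'
    . '<url><loc>http://example.com/a/</loc></url>'
    . '<url><loc>http://example.com/b/</loc></url>'
    . '</urlset>'
);

// SimpleXML -> DOM: get the DOM element behind the first <url> and remove it.
$first = dom_import_simplexml($sxml->url[0]);
$first->parentNode->removeChild($first);

// The SimpleXML object reflects the change.
$remaining = count($sxml->url);

// DOM -> SimpleXML: go back from the owning DOM document.
$dom  = dom_import_simplexml($sxml)->ownerDocument;
$back = simplexml_import_dom($dom);
$loc  = (string) $back->url[0]->loc;

echo $remaining, ' ', $loc, "\n";
```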
Conclusion
DOM is certainly a very powerful way of dealing with XML documents. While it provides a good interface for basically every task one could dream of, it often takes quite a lot of lines of code to accomplish a task. SimpleXML's interface is of course a little easier, but less powerful.
Especially the fact that SimpleXML is rather incapable of removing data makes DOM the way to go for more complicated XML document processing. DOM's power in dealing with namespaces makes it a valuable tool when dealing with large portions of data where naming conflicts are likely.
In fact we covered only a small portion of DOM's power. There are many other associated classes which have several useful methods. For example, we have not covered how to append character data. Check the DOM function reference for more information.
Thanks for staying with me on the DOM boat till the end of our journey! I hope you enjoyed it - please mind the gap between the boat and the footbridge when leaving.
good stuff on namespaces; didn't see anything about validating a document (dtd, schema) or error handling
worst site
I have been trying to run the namespaces example among others and I keep getting -
Parse error: syntax error, unexpected T_VARIABLE in /Library/WebServer/Documents/tester/dom/read.php on line 4
Any ideas would be much appreciated,
Cheers
Chris
Can you show me your code please?
Good,,,
Just a note to say thanks, a really useful article. Was going around the bend not understanding why addAttribute wasn't working; setAttribute fixed the problem. Thanks, Rob.
Very useful article, thx Tim!!
Hi Tim,
I'm wondering how do you format the xml output? Like in the 8th panel above (titled "XML") the output is nicely formatted with each node on a separate line. My output is always in one very very long line. Any tips are appreciated.
Thank you!
Mikhail: Unfortunately I had the same problem as you and just formatted it by hand so it could actually be analyzed by people here reading the post. : /
I can't think of anything other than parsing the XML output again to make it work. However, the DOM library might provide some tool for this? I can't seem to find it though. : (
Thanks Tim! I guess in the big scheme of things it doesn't matter how it's stored in a file. As long as I can parse it ok :).
Absolutely :]
Ok, I found a nice solution to formatting the XML output - it is using the newline \n (and tab \t) characters. The idea is to append these before and after the text-node value, as shown in simple example below:
$dom = domxml_open_file($xmlpath);
$root = $dom->document_element(); //get root node
$dom->preserveWhiteSpace = true;
$dom->formatOutput = true;
//create entry container
$entry_container = $dom->create_element("entry");
$entry_container = $root->append_child($entry_container);
//create word within entry
$word = $dom->create_element("word");
$word = $entry_container->append_child($word);
//write to word container
$word_value = $dom->create_text_node("\n".$_POST['myword']."\n");
$word_value = $word->append_child($word_value);
$dom->dump_file($xmlpath, false, true); //save xml file
this outputs a word on a separate line, between the opening and closing tags. Much neater I suppose.
Perfect, thanks for sharing Mikhail! :)
"The DomDocument::loadHTML() method will automatically add a DTD (Document Type Definition) and add the missing end-tag for the opened p-tag. Cool, isn't it?"
NO, IT IS NOT! It sucks big time, like anything that should be controlled but isn't! I'm still struggling to avoid JUST THIS automatic insertions, because I need to work on bits of HTML, without the DTD, HEAD, BODY tags.
What would be really cool is to find a way of using loadHTML, do your thing then output with saveHTML WITHOUT having to put up with these unwelcomed and uncontrollable additions by the PHP DOM. Jesus!
I agree with YoDaddy! I too am trying to work with HTML fragments, and I am finding the "wonderful" auto DTD insertion a blinding headache! I've been at this for over 17 hours now and I still can't find a solution. If anyone has any suggestions then I would greatly welcome them!
Still a pretty cool article though, I did use it to start learning manipulation of XML with PHPs Dom a few months ago :)
YoDaddy, Ben : No need to shout. -.-
Hmm, I don't see a solution either, other than manually editing the file later. The API obviously does not allow it. So one would need to write a wrapper around saveHtml() and that removes the unwelcomed additions.
Ben: You say you have been at this for 17 hours, did you not try to manually edit the file after?
Thanks for the post. It helped me a lot to remove a node from an XML document, as I was using SimpleXML and the delete possibilities are limited on an XML feed. I wrote a post about my little experience:
Here is the link: http://yougatech.com/2009/02/remove-an-xml-node-using-dom-document-and-attribute-name/
Hi Tim,
In your examples, the namespace declaration is repeated (in both the root node and in the element where it is used). Having the namespace declaration repeated in every element is a lot of overhead when there are many elements. Any idea how to make it only appear in the root node?
Actually for me, it only shows up in every node where it's used, but not in the root. Using PHP 5.1.3
thanks for the post, PHP4 and XML handling was a real pain!
PHP5 is a dream to manipulate XML.