PHP DOM: Working with XML

SimpleXML allows you to quickly and easily work with XML documents, and in the majority of cases SimpleXML is sufficient. But if you’re working with XML in any serious capacity, you’ll eventually need a feature that isn’t supported by SimpleXML, and that’s where the PHP DOM (Document Object Model) comes in.

PHP DOM is an implementation of the W3C DOM standard, and it adheres more closely to the object model than SimpleXML does. It may seem a little overwhelming at first, but if you’re willing to learn, you’ll find that this library for accessing and manipulating XML documents provides a great deal of control when working with XML documents in PHP. This is because DOM differentiates between the various constituents of an XML document, such as different node types.

To explore some of the basic functionality associated with PHP DOM, let’s create a class that can add and remove books in a library and query the catalog. It should offer the following functionality:

Query for a book by its ISBN

Add a book to the library

Remove a book from the library

Find all books of a specific genre

The DTD and XML

In this article, I’ll use the following DTD and XML that describe a library and its books. This should provide enough material to demonstrate how the extension can be used:

    <!ELEMENT library (book*)>
    <!ELEMENT book (title, author, genre, chapter*)>
    <!ATTLIST book isbn ID #REQUIRED>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT genre (#PCDATA)>
    <!ELEMENT chapter (chaptitle,text)>
    <!ATTLIST chapter position NMTOKEN #REQUIRED>
    <!ELEMENT chaptitle (#PCDATA)>
    <!ELEMENT text (#PCDATA)>

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE library SYSTEM "library.dtd">
    <library>
      <book isbn="isbn1234">
        <title>A Book</title>
        <author>An Author</author>
        <genre>Horror</genre>
        <chapter position="first">
          <chaptitle>chapter one</chaptitle>
          <text><![CDATA[Lorem Ipsum...]]></text>
        </chapter>
      </book>
      <book isbn="isbn1235">
        <title>Another Book</title>
        <author>Another Author</author>
        <genre>Science Fiction</genre>
        <chapter position="first">
          <chaptitle>chapter one</chaptitle>
          <text><![CDATA[<i>Sit Dolor Amet...</i>]]></text>
        </chapter>
      </book>
    </library>

One of the most important things in understanding DOM is the concept of a node. A node is essentially any conceptual item in the XML document. If it’s an element (such as chapter), then it’s a node. If it’s an attribute (such as isbn), then DOM views it as a node too. Nodes provide the atomic structure of an XML document.

PHP DOM subclasses DOMNode to provide child classes that represent different aspects of the document. So, DOMDocument actually inherits from DOMNode. DOMElement and DOMAttr also inherit from DOMNode. Having a common parent class gives all nodes common methods and properties, such as those used to determine a node’s type or value, or to add children to it.
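
The shared parent class can be checked directly. The following is a minimal sketch, assuming a tiny inline document invented for the illustration (it is not the library.xml file used in the article); it inspects a document, an element and an attribute through the properties they inherit from DOMNode.

```php
<?php
// Tiny document invented for this sketch (not the article's library.xml).
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="isbn1234"><title>A Book</title></book></library>');

$book = $doc->documentElement->firstChild;   // a DOMElement
$isbn = $book->getAttributeNode('isbn');     // a DOMAttr

// The document, the element and the attribute all share the DOMNode parent.
$allNodes = $doc instanceof DOMNode
    && $book instanceof DOMNode
    && $isbn instanceof DOMNode;

// nodeName, nodeType and nodeValue are inherited from DOMNode.
printf("%s has node type %d\n", $book->nodeName, $book->nodeType);
printf("%s has node type %d and value %s\n", $isbn->nodeName, $isbn->nodeType, $isbn->nodeValue);
```

The nodeType values can be compared against the predefined constants XML_ELEMENT_NODE and XML_ATTRIBUTE_NODE rather than against bare integers.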

The Library Class

A class called Library offers methods for the required functionality that was outlined in the introduction. It also has a constructor and destructor, and internal properties to store the DOM Document and path to the XML file. The various operations are performed on the DOM Document reference, and the path is used when saving the tree back to the file system as XML.

    <?php
    class Library
    {
        private $xmlPath;
        private $domDocument;

        public function __construct($xmlPath) {
            // TODO: instantiate the private variable representing
            // the DOMDocument
        }

        public function __destruct() {
            // TODO: free memory associated with the DOMDocument
        }

        public function getBookByISBN($isbn) {
            // TODO: return an array with properties of a book
        }

        public function addBook($isbn, $title, $author, $genre, $chapters) {
            // TODO: add a book to the library
        }

        public function deleteBook($isbn) {
            // TODO: Delete a book from the library
        }

        public function findBooksByGenre($genre) {
            // TODO: Return an array of books
        }
    }

I’ll deliberately keep things simple as the example only serves to demonstrate what DOM can do. In a real-world application, perhaps you’d instantiate book objects to encapsulate the problem more fully, and you’d probably want to handle errors more gracefully as well. You don’t need to do this at this stage, though. We can just assume that values passed and returned are strings or arrays, and errors can be handled by throwing a generic exception.

Handling Object Construction and Destruction

The constructor takes the path to the XML document that you want to use as its argument. It performs a few tests to ensure that the document is valid.

The first test determines whether the document being loaded uses the “library” doctype. Each DOMDocument has the public property doctype, which returns an object describing the doctype declared by the XML document. So for this example, you should see that the name property of doctype is set to “library” once you’ve loaded the document.

The second test ensures that the DTD is referenced in the expected manner, using the doctype’s public systemId or publicId properties. The XML used here is defined by a DTD specified on the system as library.dtd, so the constructor compares that name against the systemId property.

The third test is to ensure that the document itself is valid according to the DTD. Validation fails if the document is not well-formed (e.g. it has mismatched tags) or if it does not adhere to the DTD on which it is based.

Once all of these conditions are met, it stores a reference to the loaded document and path to the XML file as internal properties to be used later by other methods. But if at any point one of the tests fails, an exception is thrown.

    <?php
    public function __construct($xmlPath) {
        // loads the document
        $doc = new DOMDocument();
        $doc->load($xmlPath);

        // is this a library xml file?
        // (doctype is null when loading failed or no DOCTYPE is declared)
        if ($doc->doctype === null
            || $doc->doctype->name != "library"
            || $doc->doctype->systemId != "library.dtd") {
            throw new Exception("Incorrect document type");
        }

        // is the document valid and well-formed?
        if ($doc->validate()) {
            $this->domDocument = $doc;
            $this->xmlPath = $xmlPath;
        } else {
            throw new Exception("Document did not validate");
        }
    }
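
To experiment with the doctype and validation checks without any files on disk, the DTD can be embedded in the document itself. The following is a rough sketch under that assumption: the DTD is trimmed (no chapters) and declared as an internal subset, so there is no systemId to compare and only the doctype name and validate() are exercised. The call to libxml_use_internal_errors() is added here only to keep validation warnings quiet.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Assumption: a trimmed DTD embedded as an internal subset, so the sketch
// needs no library.dtd file. The article references the DTD externally.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title, author, genre)>
  <!ATTLIST book isbn ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT genre (#PCDATA)>
]>
<library>
  <book isbn="isbn1234">
    <title>A Book</title>
    <author>An Author</author>
    <genre>Horror</genre>
  </book>
</library>
XML;

$doc = new DOMDocument();
$loaded = $doc->loadXML($xml);

// The doctype property is an object; its name is the declared root element.
$doctypeName = $doc->doctype->name;
$isValid = $doc->validate();

// Removing the required author element makes the same document invalid.
$broken = new DOMDocument();
$broken->loadXML(str_replace('<author>An Author</author>', '', $xml));
$brokenIsValid = $broken->validate();
```

When validation fails, libxml_get_errors() returns the collected messages, which can be more useful to report than a generic exception text.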

The destructor method releases any memory used by the $domDocument. This is really just a simple call to unset the property.

    <?php
    public function __destruct() {
        unset($this->domDocument);
    }

Return a Book by its ISBN

Now on to the main methods for reading and manipulating the underlying XML document.

The first method obtains details of a book from a provided ISBN. You can provide the ISBN as a string and the method returns an array detailing the properties of the book.

PHP DOM provides a very simple function to return a specific element based on its ID: getElementById(), which returns a DOMElement object. For this to work, you must have declared an ID attribute in your DTD, as I did:

<!ATTLIST book isbn ID #REQUIRED>

It’s important to know that getElementById() only works if the document has been validated against a DTD. If not, then the function will simply not pick up the fact that the element has an ID.
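
For documents that have no DTD at all, PHP DOM also lets you flag an attribute as an ID by hand with DOMElement::setIdAttribute(). This is an alternative that the Library class does not use; the sketch below assumes a small inline document invented for the illustration and shows the lookup failing before the flag is set and succeeding afterwards.

```php
<?php
// Hypothetical document with no DTD at all.
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="isbn1234"><title>A Book</title></book></library>');

// Without a DTD, nothing tells the parser that isbn is an ID.
$before = $doc->getElementById('isbn1234');

// Flag the isbn attribute as an ID on every book element by hand.
foreach ($doc->getElementsByTagName('book') as $book) {
    $book->setIdAttribute('isbn', true);
}

// The same lookup now finds the book element.
$after = $doc->getElementById('isbn1234');
```

The flag lives on the in-memory element only, so it has to be set again whenever the document is reloaded.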

Another way of obtaining elements from a document is to use getElementsByTagName(). This method returns a collection of nodes which have been found with the specified tag name. The collection returned is a DOMNodeList, which is traversable.

Items in the DOMNodeList can also be picked out by their position in the list with item(). Because the DTD specifies that a book can have only one author, we know that the DOMNodeList will contain one node, which can be accessed with item(0). The DTD enforces this fact, and if the document differed, you would have received a validation error when the Library object was created.

Once you have found the particular node you want, you can find its value using the public property nodeValue.
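
The three pieces (getElementsByTagName(), item() and nodeValue) fit together as in the short sketch below. It assumes an inline two-book document without chapters, made up for the example, rather than the validated library.xml.

```php
<?php
// Inline two-book document invented for this sketch.
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book isbn="isbn1234"><title>A Book</title><author>An Author</author></book>'
    . '<book isbn="isbn1235"><title>Another Book</title><author>Another Author</author></book>'
    . '</library>'
);

// Searching the whole document gives one DOMNodeList entry per <title>.
$titles = [];
foreach ($doc->getElementsByTagName('title') as $titleNode) {
    $titles[] = $titleNode->nodeValue;
}

// Searching below a single book narrows the list to that book's children.
$firstBook = $doc->getElementsByTagName('book')->item(0);
$authors = $firstBook->getElementsByTagName('author');

// Only one author per book here, so item(0) is the one we want.
$author = $authors->item(0)->nodeValue;
```

Note that item() returns null for a position that does not exist, so code that cannot rely on a DTD guarantee should check the list's length property first.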

To access attributes, you can make use of DOMNode’s public property attributes, which returns a DOMNamedNodeMap. This is similar to the DOMNodeList in that it is traversable, but you can also pick out a specific attribute using the getNamedItem() method and just pass the name of the attribute as a string. The return value is a DOMNode.
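
Both ways of reading a DOMNamedNodeMap are sketched below. The chapter element is inline, and its pages attribute is invented purely so that the map holds more than one entry; it is not part of the article's DTD.

```php
<?php
// The "pages" attribute is hypothetical; it only makes the map longer.
$doc = new DOMDocument();
$doc->loadXML('<chapter position="first" pages="12"><chaptitle>chapter one</chaptitle></chapter>');
$chapter = $doc->documentElement;

// Traverse every attribute in the DOMNamedNodeMap.
$found = [];
foreach ($chapter->attributes as $attribute) {
    $found[$attribute->nodeName] = $attribute->nodeValue;
}

// Pick a single attribute by name; the result is a DOMNode (a DOMAttr here).
$position = $chapter->attributes->getNamedItem('position')->nodeValue;
```

For elements, $chapter->getAttribute('position') returns the same string in one call; the map is mainly useful when the attribute names are not known in advance.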

The implementation of the method to retrieve a book and its information thus looks like this:

    <?php
    public function getBookByISBN($isbn) {
        // get a book element from the isbn ID
        $book = $this->domDocument->getElementById($isbn);

        // if a book was not returned...
        if (!$book) {
            throw new Exception("No book found with ISBN " . $isbn);
        }

        $arrBook = array();
        $arrBook["isbn"] = $isbn;

        // get the data from the elements based on their tag names
        //
        // we know these DOMNodeLists will only return one
        // item since the DTD states this
        $arrBook["author"] = $book->getElementsByTagName("author")
            ->item(0)->nodeValue;
        $arrBook["title"] = $book->getElementsByTagName("title")
            ->item(0)->nodeValue;
        $arrBook["genre"] = $book->getElementsByTagName("genre")
            ->item(0)->nodeValue;

        $chapters = $book->getElementsByTagName("chapter");
        $arrChapters = array();

        // iterate over the chapter elements
        foreach ($chapters as $chapter) {
            $chapterId = $chapter->attributes
                ->getNamedItem("position")->nodeValue;
            $chapterTitle = $chapter
                ->getElementsByTagName("chaptitle")->item(0)
                ->nodeValue;
            $chapterText = $chapter
                ->getElementsByTagName("text")->item(0)
                ->nodeValue;

            $arrChapter["title"] = $chapterTitle;
            $arrChapter["text"] = $chapterText;
            $arrChapters[$chapterId] = $arrChapter;
        }

        $arrBook["chapters"] = $arrChapters;
        return $arrBook;
    }

Identifying and pulling data from an XML document is relatively simple. The main hurdle to overcome is understanding the node concept; once you understand that, you’ll find that obtaining the data you want is a straightforward process.

Adding a Book to the Library

The next method to define adds a book to the XML database. The method takes the book’s properties and an array of its chapters.

One way of performing such a task is to use the createElement() method, add the new node to the document, and keep a reference to it so you can operate on the object from that point forward. When you create an element you must also add it to the document; createElement() does not do this for you. It associates the element with the document, but that’s as far as it goes. It’s good practice to add elements you intend to be part of the document as soon as they are instantiated so that they are not forgotten!

You can use the documentElement property to identify the root element of the XML document. If we didn’t do this and just appended directly to the document, we would in fact be adding a child to the very end of the document (i.e. outside of the library element). This would result in a validation error. If you think about it, this behaviour of DOM is totally reasonable; treating the document as the root element and adding a child to it would place it after the library element, which is already a child of the document.
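
The difference between creating an element and attaching it can be observed through the parentNode property. The sketch below assumes a one-book inline document made up for the example; it creates a book element, checks that it is still detached, and then appends it to documentElement so that it lands inside library.

```php
<?php
// One-book document invented for this sketch.
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="isbn1234"/></library>');

// createElement() only builds the node; it is not part of the tree yet.
$newBook = $doc->createElement('book');
$attachedBefore = $newBook->parentNode !== null;

// Append to the root element so the book ends up inside <library>.
$doc->documentElement->appendChild($newBook);
$attachedAfter = $newBook->parentNode->isSameNode($doc->documentElement);

// The library now holds both the original and the new book.
$bookCount = $doc->documentElement->getElementsByTagName('book')->length;
```

appendChild() returns the node it appended, so the create and append steps can also be written as a single expression when that reads better.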

Of course, the book element must contain an ISBN, so an attribute must be added to the newly created element. There are two ways of doing this. The simplest is to use setAttribute() which takes the name of the attribute and the value of the attribute as arguments. The second way is to create a DOMAttr object and then append that to the element. DOMAttr is a subclass of DOMNode, so it benefits from all the inherited methods and properties its parent offers.

setAttribute() and setAttributeNode() are responsible for adding and updating attributes associated with an element. If the attribute does not exist, it will be created. If it does exist, it will be updated.
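
The create-or-update behaviour is easy to confirm on a throwaway element. In this sketch the ISBN values are placeholders, and the attribute node is built with DOMDocument::createAttribute() rather than with new DOMAttr as in the article's method; either route ends in setAttributeNode().

```php
<?php
$doc = new DOMDocument();
$book = $doc->appendChild($doc->createElement('book'));

// The first call creates the attribute.
$book->setAttribute('isbn', 'isbn1234');

// A second call with the same name updates it instead of adding another.
$book->setAttribute('isbn', 'isbn9999');

// setAttributeNode() with an attribute of the same name replaces it as well.
$replacement = $doc->createAttribute('isbn');
$replacement->value = 'isbn0000';
$book->setAttributeNode($replacement);

$isbn = $book->getAttribute('isbn');
$attributeCount = $book->attributes->length;
```

After all three calls the element still carries a single isbn attribute, holding the value that was set last.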

To supply the value for a text element, it is advisable to use a DOMCdataSection. The chapter text is declared as PCDATA and not CDATA in the DTD. This is because an element cannot be declared as containing CDATA directly; we have to declare it as PCDATA and then wrap the content in <![CDATA[...]]>. It sounds counter-intuitive, as we need to be able to put unparsed character data in the text element for use later, but this is why we have to create a specific DOMCdataSection; this will safely wrap our text in <![CDATA[...]]>. If you were to add HTML directly to a node as plain text, you’d find that special characters such as < or & would be converted to their relevant entities (i.e. &lt; and &amp;). This is because these characters have special meaning in XML: the ampersand introduces an entity reference, and the less-than symbol starts a tag. DOM substitutes these so as not to cause any parsing issues when the document is loaded or validated.
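
To see the escaping in action, the same string can be stored once as a plain text node and once as a CDATA section and then serialized. This is only a sketch: the markup string is a placeholder, and it uses the factory method createCDATASection() where the article's method instantiates DOMCdataSection directly.

```php
<?php
$doc = new DOMDocument();
$root = $doc->appendChild($doc->createElement('chapter'));
$markup = '<i>Sit Dolor & Amet</i>';   // placeholder chapter text

// Plain text node: special characters are escaped when serialized.
$asText = $root->appendChild($doc->createElement('text'));
$asText->appendChild($doc->createTextNode($markup));

// CDATA section: the string is written out untouched inside the wrapper.
$asCdata = $root->appendChild($doc->createElement('text'));
$asCdata->appendChild($doc->createCDATASection($markup));

// Serialize each element on its own to compare the two forms.
$escaped = $doc->saveXML($asText);
$wrapped = $doc->saveXML($asCdata);

// Reading nodeValue gives back the original string in both cases.
$roundTrip = $asText->nodeValue === $markup && $asCdata->nodeValue === $markup;

echo $escaped, "\n", $wrapped, "\n";
```

The two forms carry the same data once parsed; the CDATA version simply keeps embedded HTML readable in the XML file itself.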

The last step in adding a book is to save the new document back into the file, which is done with the document’s save() method.
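
A rough sketch of the save step on its own follows. The target file is a hypothetical temporary file created for the example, and the return value of save() is checked because it reports the number of bytes written, or false on failure.

```php
<?php
$doc = new DOMDocument();
$doc->formatOutput = true;   // optional: indent the serialized XML
$library = $doc->appendChild($doc->createElement('library'));
$library->appendChild($doc->createElement('book'))->setAttribute('isbn', 'isbn1234');

// Hypothetical target: a temporary file instead of the real library.xml.
$path = tempnam(sys_get_temp_dir(), 'library');

// save() returns the number of bytes written, or false on failure.
$bytes = $doc->save($path);
if ($bytes === false) {
    throw new Exception("Could not write " . $path);
}

// Reload from disk to confirm that the change was persisted.
$reloaded = new DOMDocument();
$reloaded->load($path);
$savedIsbn = $reloaded->getElementsByTagName('book')->item(0)->getAttribute('isbn');

unlink($path);   // clean up the temporary file
```

When you only need the XML as a string, for example to send it in a response, saveXML() returns it without touching the file system.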

The method altogether looks like this:

    <?php
    public function addBook($isbn, $title, $author, $genre, $chapters) {
        // create a new element representing the new book
        $newbook = $this->domDocument->createElement("book");

        // append the newly created element
        $this->domDocument->documentElement
            ->appendChild($newbook);

        // setting the attribute can be done in one of two ways
        // Method One:
        // $newbook->setAttribute("isbn", $isbn);
        // Method Two:
        $idAttribute = new DOMAttr("isbn", $isbn);
        $newbook->setAttributeNode($idAttribute);

        // note: the value passed to createElement() is not escaped,
        // so a raw "&" in a title, author or genre triggers a warning
        $title = $this->domDocument
            ->createElement("title", $title);
        $newbook->appendChild($title);

        $author = $this->domDocument
            ->createElement("author", $author);
        $newbook->appendChild($author);

        $genre = $this->domDocument
            ->createElement("genre", $genre);
        $newbook->appendChild($genre);

        foreach ($chapters as $position => $chapter) {
            $newchapter = $this->domDocument
                ->createElement("chapter");
            $newbook->appendChild($newchapter);

            $newchapter->setAttribute("position", $position);

            $newchaptitle = $this->domDocument
                ->createElement("chaptitle", $chapter["title"]);
            $newchapter->appendChild($newchaptitle);

            $newtext = $this->domDocument->createElement("text");
            $newchapter->appendChild($newtext);

            // Rather than creating a new element, create a
            // DOMCdataSection which ensures our text is
            // wrapped in <![CDATA[ and ]]>
            $cdata = new DOMCdataSection($chapter["text"]);
            $newtext->appendChild($cdata);
        }

        // save the document
        $this->domDocument->save($this->xmlPath);
    }

Deleting a Book from the Library

The next method to tackle is deleting a book. This is just a case of identifying which element in the XML document you want to delete and then using the removeChild() method to remove it. There are two important things to understand, however.

First, you are unable to remove a child from an instance of DOMDocument directly. You have to access the documentElement and remove the child from there. This is for the same reason that you had to refer to documentElement when adding a book to the library.

Second, removing the element from the document only removes it from the tree held in memory. If you want the change to persist, you should save the document back to a file.
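
Both points can be tried on a small in-memory document. In this sketch the two-book document is invented and has no DTD, so the book is located by scanning its isbn attribute instead of by getElementById(); the removal goes through the node's own parentNode, which for a book is the library root element.

```php
<?php
// Two-book document invented for this sketch; no DTD, so no ID lookup.
$doc = new DOMDocument();
$doc->loadXML('<library><book isbn="isbn1234"/><book isbn="isbn1235"/></library>');

// Locate the book by scanning the isbn attribute.
$target = null;
foreach ($doc->getElementsByTagName('book') as $book) {
    if ($book->getAttribute('isbn') === 'isbn1234') {
        $target = $book;
        break;
    }
}

// Guard against a missing book, then remove it from its parent element.
if ($target !== null) {
    $target->parentNode->removeChild($target);
}

// Only the in-memory tree has changed; nothing was written to disk.
$remaining = [];
foreach ($doc->getElementsByTagName('book') as $book) {
    $remaining[] = $book->getAttribute('isbn');
}
```

Going through parentNode avoids having to know where in the tree the element sits, and the null check keeps an unknown ISBN from turning into an error inside removeChild().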

Here’s what the deleteBook() method looks like:

    <?php
    public function deleteBook($isbn) {
        // get the book element based on its ID
        $book = $this->domDocument->getElementById($isbn);

        // nothing to remove if no book has this ISBN
        if (!$book) {
            throw new Exception("No book found with ISBN " . $isbn);
        }

        // simply remove the child from the document's
        // documentElement
        $this->domDocument->documentElement->removeChild($book);

        // save back to disk
        $this->domDocument->save($this->xmlPath);
    }

Find Books by Genre

The method to find specific books based on a genre employs XPath to obtain the results we need. getElementById(), as you saw before, is a convenient way of picking items out of the DOM when we have declared an ID within a DTD. But what can we do if we need to query against some other data in the XML? We can use a DOMXPath object. XPath itself is beyond the scope of this article, but I do advise you to look at some resources explaining the syntax. The XPath query to find any book item in the XML which has a genre of a specific type is:

//library/book/genre[text() = "some genre"]/..

This query first says that we want to access a genre element in the path //library/book. The two leading forward slashes match library elements anywhere in the document (here, library is the root element), and the single slashes indicate that book is a child of library and genre is a child of book. [text() = "some genre"] indicates that we are looking for a genre element whose text is “some genre”. On its own, the result would just be the genre element, which is why /.. is added at the end to indicate that we actually need genre’s parent.

XPath is a great way to locate nodes in a structure. If you find yourself iterating over a few DOMNodeLists and testing nodeValues for certain values, then you’d probably be better off looking at an equivalent XPath query, which will usually be shorter, quicker and easier to read.
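
The query can be tried against a small in-memory document. The sketch below assumes three inline books (the third is invented so that the genre matches twice) and also shows an equivalent formulation that selects the book directly with a predicate, which avoids the step back up to the parent.

```php
<?php
// Three inline books; "Third Book" is invented so that Horror matches twice.
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<book isbn="isbn1234"><title>A Book</title><genre>Horror</genre></book>'
    . '<book isbn="isbn1235"><title>Another Book</title><genre>Science Fiction</genre></book>'
    . '<book isbn="isbn1236"><title>Third Book</title><genre>Horror</genre></book>'
    . '</library>'
);
$xpath = new DOMXPath($doc);

// Select the matching genre elements, then step up to each parent with "/..".
$viaParent = $xpath->query('//library/book/genre[text() = "Horror"]/..');

// Equivalent form: select the book itself, filtered on its genre child.
$viaPredicate = $xpath->query('/library/book[genre = "Horror"]');

// Collect the titles of the matching books.
$titles = [];
foreach ($viaParent as $book) {
    $titles[] = $book->getElementsByTagName('title')->item(0)->nodeValue;
}
```

Both queries return a DOMNodeList of book elements, so the rest of the code does not depend on which form is used.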

Here’s what the search method looks like:

    <?php
    public function findBooksByGenre($genre) {
        // use XPath to find the books we're looking for
        $query = '//library/book/genre[text() = "' . $genre . '"]/..';

        // create a new XPath object and associate it with the
        // document we want to query against
        $xpath = new DOMXPath($this->domDocument);
        $result = $xpath->query($query);

        $arrBooks = array();

        // iterate over the results
        foreach ($result as $book) {
            // add the title of the book to an array
            $arrBooks[] = $book->getElementsByTagName("title")
                ->item(0)->nodeValue;
        }

        return $arrBooks;
    }
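
One caveat with building the query by string concatenation is that a genre containing a double quote would break the expression, and XPath 1.0 has no escape character for string literals. A possible workaround is sketched below; xpathLiteral() is a hypothetical helper written for this illustration, not part of PHP DOM, and the sample book is invented.

```php
<?php
// Hypothetical helper: turn any string into a valid XPath 1.0 string literal.
function xpathLiteral(string $value): string
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';      // no double quotes: wrap in them
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";      // no single quotes: wrap in those
    }
    // Both quote types present: split on " and glue the pieces with concat().
    $parts = explode('"', $value);
    return 'concat("' . implode('", \'"\', "', $parts) . '")';
}

// Invented sample data: a genre that contains both kinds of quote.
$genre = 'Sci-Fi "New Wave" (author\'s cut)';

$doc = new DOMDocument();
$library = $doc->appendChild($doc->createElement('library'));
$book = $library->appendChild($doc->createElement('book'));
$book->appendChild($doc->createElement('title'))
    ->appendChild($doc->createTextNode('Odd Book'));
$book->appendChild($doc->createElement('genre'))
    ->appendChild($doc->createTextNode($genre));

// Same query shape as findBooksByGenre(), with the literal built safely.
$xpath = new DOMXPath($doc);
$query = '//library/book/genre[text() = ' . xpathLiteral($genre) . ']/..';
$matches = $xpath->query($query);
```

With such a helper in place, the query string in findBooksByGenre() could be assembled from xpathLiteral($genre) instead of hard-coded double quotes.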

Summary

This article was just a taster to show you how you can use DOM to manipulate and report back from XML data. PHP DOM is not as scary as it looks, and you may find that you prefer it over SimpleXML in certain circumstances.

One of the most important things you learned was the concept of the node, the basic building block of an XML document as far as DOM is concerned. You saw how to load an XML document into memory and validate it, pull data from it using getElementById() and getElementsByTagName(), add and remove elements, work with attributes, and use the DOMNodeList and DOMNamedNodeMap collections to pull collections of data.

While a lot of things you saw today are things that you can probably do easily in SimpleXML already, I hope this article showed you how the same things can be achieved with DOM and what some of the benefits of DOM are.

Image via Fotolia

Translated from: https://www.sitepoint.com/php-dom-working-with-xml/
