How can one parse HTML/XML and extract information from it?
This is a General Reference question for the php tag
---
Native XML Extensions

I prefer using one of the native XML extensions, since they come bundled with PHP, are usually faster than all the 3rd party libs, and give me all the control I need over the markup.

DOM
DOM is capable of parsing and modifying real-world (broken) HTML and it can do XPath queries. It is based on libxml. It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API. A basic usage example can be found in Grabbing the href attribute of an A element, and a general conceptual overview is available at DOMDocument in php. How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
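For instance, a minimal sketch along those lines, with hypothetical markup, that grabs the href of an a element via XPath:

```php
$html = '<p>See <a href="https://example.com">the example</a>.</p>';

libxml_use_internal_errors(true);      // collect parse warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a') as $a) {
    echo $a->getAttribute('href'), PHP_EOL;   // https://example.com
}
```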
XMLReader

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM, where you can explicitly tell it to use libxml's HTML Parser Module. A basic usage example can be found at getting all values from h1 tags using php.
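A minimal sketch of the pull-parser style, with hypothetical markup:

```php
$xml = '<feed><item>first</item><item>second</item></feed>';   // hypothetical markup

$reader = new XMLReader();
$reader->XML($xml);                    // or $reader->open('file.xml') for large files
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        echo $reader->readString(), PHP_EOL;   // text content of the current node
    }
}
$reader->close();
```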
XML Parser

The XML Parser library is also based on libxml, and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
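A minimal push-parser sketch, registering callbacks that fire as the parser walks the document:

```php
$parser = xml_parser_create();

// Handlers are invoked by the parser (push style) rather than polled by us
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) { echo "open:  $name\n"; },
    function ($parser, $name)         { echo "close: $name\n"; }
);
xml_set_character_data_handler($parser, function ($parser, $data) {
    if (trim($data) !== '') {
        echo "text:  ", trim($data), "\n";
    }
});

xml_parse($parser, '<root><item>hi</item></root>', true);   // hypothetical markup
xml_parser_free($parser);
```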
SimpleXML

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML, because it will choke. A basic usage example can be found at A simple program to CRUD node and node values of xml file, and there are lots of additional examples in the PHP Manual.
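A minimal sketch with hypothetical, well-formed XML (SimpleXML will not tolerate broken markup):

```php
$xml = simplexml_load_string(
    '<books><book><title>PHP Basics</title></book></books>'   // hypothetical markup
);

echo $xml->book[0]->title, PHP_EOL;   // elements become object properties
foreach ($xml->book as $book) {
    echo $book->title, PHP_EOL;
}
```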
3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing:

- phpQuery
- Zend_Dom
- QueryPath
- FluentDom
- fDOMDocument
3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box, because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

SimpleHtmlDom
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Any of the libxml-based libraries should outperform it easily.

Ganon
Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like html5lib.
We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled How-To for html 5 parsing that is worth checking out.

WebServices

If you don't feel like programming PHP, you can also use web services. In general, I found very little utility for these, but that's just me and my use cases.

- YQL
- ScraperWiki
Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged. Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the regex fail when it's not properly written. You should know what you are doing before using regex on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules for each new regex you write. Regex are fine in some cases, but it really depends on your use case. You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job.

Also see Parsing Html The Cthulhu Way.

Books

If you want to spend some money, have a look at PHP Architect's Guide to Webscraping. Disclaimer: I am not affiliated with PHP Architect or the authors.
---
How to get HTML elements:
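A minimal sketch using the native DOM extension on hypothetical markup:

```php
$dom = new DOMDocument();
$dom->loadHTML('<div><h1>Title</h1><p>First</p><p>Second</p></div>');

// Every <p> element in the document
foreach ($dom->getElementsByTagName('p') as $p) {
    echo $p->textContent, PHP_EOL;
}
```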
---
Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast and, contrary to popular belief, does not choke on malformed HTML.
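For example, a minimal sketch on deliberately malformed markup:

```php
// Deliberately malformed: unquoted attribute, unclosed <p> and <b>
$broken = '<div class=main><p>Hello <b>world</div>';

libxml_use_internal_errors(true);   // silence the recoverable parse-error warnings
$dom = new DOMDocument();
$dom->loadHTML($broken);
libxml_clear_errors();

echo $dom->getElementsByTagName('b')->item(0)->textContent;   // world
```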
---
phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from an HTML string:
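A minimal sketch, assuming QueryPath was installed via Composer and using hypothetical markup:

```php
require 'vendor/autoload.php';   // assuming QueryPath was installed via Composer

$html = '<div id="nav"><a class="first" href="/home">Home</a></div>';   // hypothetical
$qp = htmlqp($html);             // htmlqp() uses the lenient HTML parser
```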
The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:
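For instance (continuing the sketch above):

```php
$link = $qp->find('#nav a.first');    // CSS selector, jQuery style
echo $link->attr('href'), PHP_EOL;    // /home
echo $link->text(), PHP_EOL;          // Home
```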
Mostly you want to use simple #id and .class selectors.
QueryPath also allows injecting new tags into the stream (for example via ->append()) and later outputting the updated document (->writeHTML()).
phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features). For further information on the differences see this comparison on the Wayback Machine from tagbyte.org. (The original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.) And here's a comprehensive QueryPath introduction.
---
Why you shouldn't, and when you should, use regular expressions

First off, HTML cannot be properly parsed using regular expressions. Regexes can, however, extract data. Extracting is what they're made for. The major drawback of regex HTML extraction over proper SGML toolkits or basic XML parsers is their syntactic cumbersomeness and meager reliability. Consider that making a somewhat reliable HTML extraction regex:
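(A hedged sketch; $html is hypothetical, and even this still breaks on comments, scripts, and CDATA.)

```php
// Tolerates single quotes, double quotes, unquoted values and extra attributes,
// yet remains brittle against comments, <script> blocks and exotic markup
$regex = '/<a\s[^>]*href\s*=\s*("([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>/i';
preg_match_all($regex, $html, $matches);
```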
is way less readable than a simple phpQuery or QueryPath equivalent:
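(A sketch, assuming the respective library is already loaded; $html is hypothetical.)

```php
// phpQuery
phpQuery::newDocumentHTML($html);
$href = pq('a')->attr('href');

// QueryPath
$href = htmlqp($html, 'a')->attr('href');
```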
There are, however, specific use cases where they can help:

- Most XML parsers cannot see HTML document comments, which can nevertheless serve as useful anchors for extraction.
- It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions before handing it to a parser.
---
I suggest using phpQuery.
---
Simple HTML DOM is a great open-source parser: it treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name. I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.
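A minimal sketch, assuming the library file is in place and using hypothetical markup:

```php
require 'simple_html_dom.php';   // assuming the library file is available

$dom = str_get_html('<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>');

// find() returns all matching elements, jQuery-style
foreach ($dom->find('a') as $a) {
    echo $a->href, ' => ', $a->plaintext, PHP_EOL;
}
```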
---
One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it. But for your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.
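A minimal sketch of the Tidy approach, assuming the tidy extension is enabled and $brokenHtml is hypothetical:

```php
$config = ['output-xhtml' => true, 'numeric-entities' => true];

$tidy = new tidy();
$tidy->parseString($brokenHtml, $config, 'utf8');
$tidy->cleanRepair();

$xhtml = (string) $tidy;                  // well-formed XHTML
$xml   = simplexml_load_string($xhtml);   // any XML tool can take over from here
```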
---
For 1a and 2: I would vote for the new Symfony Component class DomCrawler. This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world. The component is designed to work standalone and can be used without Symfony. The only drawback is that it will only work with PHP 5.3 or newer.
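A minimal sketch with hypothetical markup (filter() needs the CssSelector component; filterXPath() works without it):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<div class="post"><h2>Hello</h2></div>');

// filter() takes CSS selectors; filterXPath() is the XPath alternative
echo $crawler->filter('div.post h2')->text();   // Hello
```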
---
This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.
---
This sounds like a good task description of W3C XPath technology. It's easy to express queries like "return all href attributes of anchor elements".
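For instance, a minimal sketch (assuming $dom is a DOMDocument that has already loaded the page):

```php
$xpath = new DOMXPath($dom);

// "all href attributes of anchor elements" as an XPath expression
foreach ($xpath->query('//a/@href') as $attr) {
    echo $attr->value, PHP_EOL;
}
```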
---
PHP Simple DOM Parser looks good. I haven't tried using it yet though.
---
We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good for what they were created for, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle invalid HTML/XHTML structures that would fail to load in most of the parsers.
---
I recommend PHP Simple HTML DOM Parser
---
There is also Goutte (a PHP web scraper), which is now available: https://github.com/fabpot/Goutte/
---
Third-party alternatives to SimpleHtmlDom that use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
---
With PHP, I would advise you to use the Simple HTML DOM Parser; the best way to learn more about it is to look for samples on the ScraperWiki website.
---
I've used HTML Purifier with a lot of success on a couple of different projects.
---
html5lib has a PHP version. (I don't know how up-to-date it is.)
---
Yes, you can use simple_html_dom for the purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too fragile. It does the basic job, but I won't recommend it anyway. I have never used curl for the purpose, but what I have learned is that curl can do the job much more efficiently and is much more solid. Kindly check out this link: scraping-websites-with-curl
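A hedged sketch of that pattern: curl only fetches the page, and the parsing is still a separate step (the URL is hypothetical).

```php
$ch = curl_init('https://example.com/');          // hypothetical URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
$body = curl_exec($ch);
curl_close($ch);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($body);   // hand the fetched markup to a real parser
```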
---
QueryPath is good, but be careful of "tracking state", because if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.

What it means is that each call on the result set modifies the result set in the object; it's not chainable like in jQuery, where each link is a new set. You have a single set, which is the results from your query, and each function call modifies that single set. To get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that way it mirrors what happens in jQuery much more closely, as in the sketch below.
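A minimal sketch of that branching pattern, with a hypothetical document in $html:

```php
$qp = htmlqp($html);   // hypothetical document

// Without branch(), each find() narrows $qp itself.
// branch() works on a copy, so the original set stays intact:
$links      = $qp->branch()->find('a');
$paragraphs = $qp->branch()->find('p');
```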
Then each branch holds its own filtered results, while the original query set stays intact.
---
Goutte is a simple but awesome web scraper which you can just drop into your code. It's used heavily by some other libraries, such as Behat, and is stable and well featured.
---
For HTML5, html5lib has been abandoned for years now. The only HTML5 library I can find with recent updates and maintenance records is html5-php, which was just brought to beta 1.0 a little over a week ago.
---
There is Wiseparser. It requires PHP 5 and works in a manner close to real browsers.
---
You could try using something like HTML Tidy to clean up any "broken" HTML and convert it to XHTML, which you can then parse with an XML parser.
---
I have written a general-purpose XML parser that can easily handle GB-sized files. It's based on XMLReader and it's very easy to use.
Here's the github repo: XmlExtractor
---
Another option you can try is QueryPath. It's inspired by jQuery, but it runs on the server in PHP and is used in Drupal.
---
The Symfony framework has components which can parse HTML, and you can use CSS selectors to select DOM elements instead of using XPath.
---
JSON & array from XML in 3 lines:
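Presumably along these lines (a sketch; $xmlString is assumed to hold valid XML):

```php
$xml   = simplexml_load_string($xmlString);   // assumes $xmlString holds valid XML
$json  = json_encode($xml);
$array = json_decode($json, true);
```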
Ta da!