How can one parse HTML/XML and extract information from it?
This is a General Reference question for the php tag
---
Native XML Extensions

I prefer using one of the native XML extensions, since they come bundled with PHP, are usually faster than all the 3rd party libs, and give me all the control I need over the markup.

DOM
DOM is capable of parsing and modifying real-world (broken) HTML and it can do XPath queries. It is based on libxml. It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API. A basic usage example can be found in Grabbing the href attribute of an A element, and a general conceptual overview is available at DOMDocument in php. How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
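For instance, a minimal sketch along those lines, with hypothetical markup, that grabs the href of an a element via XPath:

```php
$html = '<p>See <a href="https://example.com">the example</a>.</p>';

libxml_use_internal_errors(true);      // collect parse warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a') as $a) {
    echo $a->getAttribute('href'), PHP_EOL;   // https://example.com
}
```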
XMLReader

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM, where you can explicitly tell it to use libxml's HTML Parser Module. A basic usage example can be found at getting all values from h1 tags using php.
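A minimal sketch of the pull-parser style, with hypothetical markup:

```php
$xml = '<feed><item>first</item><item>second</item></feed>';   // hypothetical markup

$reader = new XMLReader();
$reader->XML($xml);                    // or $reader->open('file.xml') for large files
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        echo $reader->readString(), PHP_EOL;   // text content of the current node
    }
}
$reader->close();
```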
XML Parser

The XML Parser library is also based on libxml, and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
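A minimal push-parser sketch, registering callbacks that fire as the parser walks the document:

```php
$parser = xml_parser_create();

// Handlers are invoked by the parser (push style) rather than polled by us
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) { echo "open:  $name\n"; },
    function ($parser, $name)         { echo "close: $name\n"; }
);
xml_set_character_data_handler($parser, function ($parser, $data) {
    if (trim($data) !== '') {
        echo "text:  ", trim($data), "\n";
    }
});

xml_parse($parser, '<root><item>hi</item></root>', true);   // hypothetical markup
xml_parser_free($parser);
```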
SimpleXML

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML, because it will choke. A basic usage example can be found at A simple program to CRUD node and node values of xml file, and there are lots of additional examples in the PHP Manual.
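A minimal sketch with hypothetical, well-formed XML (SimpleXML will not tolerate broken markup):

```php
$xml = simplexml_load_string(
    '<books><book><title>PHP Basics</title></book></books>'   // hypothetical markup
);

echo $xml->book[0]->title, PHP_EOL;   // elements become object properties
foreach ($xml->book as $book) {
    echo $book->title, PHP_EOL;
}
```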
3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing:

- phpQuery
- Zend_Dom
- QueryPath
- FluentDom
- fDOMDocument
3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box, because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

SimpleHtmlDom
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Any of the libxml-based libraries should outperform it easily.

Ganon
Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like html5lib.
We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled How-To for html 5 parsing that is worth checking out.

WebServices

If you don't feel like programming PHP, you can also use web services. In general, I found very little utility for these, but that's just me and my use cases.

- YQL
- ScraperWiki
Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged. Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the regex fail when it's not properly written. You should know what you are doing before using regex on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules for each new regex you write. Regex are fine in some cases, but it really depends on your use case. You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job.

Also see Parsing Html The Cthulhu Way.

Books

If you want to spend some money, have a look at PHP Architect's Guide to Webscraping. Disclaimer: I am not affiliated with PHP Architect or the authors.
---
How to get HTML elements:
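A minimal sketch using the native DOM extension on hypothetical markup:

```php
$dom = new DOMDocument();
$dom->loadHTML('<div><h1>Title</h1><p>First</p><p>Second</p></div>');

// Every <p> element in the document
foreach ($dom->getElementsByTagName('p') as $p) {
    echo $p->textContent, PHP_EOL;
}
```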
---
Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast and, contrary to popular belief, does not choke on malformed HTML.
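For example, a minimal sketch on deliberately malformed markup:

```php
// Deliberately malformed: unquoted attribute, unclosed <p> and <b>
$broken = '<div class=main><p>Hello <b>world</div>';

libxml_use_internal_errors(true);   // silence the recoverable parse-error warnings
$dom = new DOMDocument();
$dom->loadHTML($broken);
libxml_clear_errors();

echo $dom->getElementsByTagName('b')->item(0)->textContent;   // world
```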
---
phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from an HTML string:
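A minimal sketch, assuming QueryPath was installed via Composer and using hypothetical markup:

```php
require 'vendor/autoload.php';   // assuming QueryPath was installed via Composer

$html = '<div id="nav"><a class="first" href="/home">Home</a></div>';   // hypothetical
$qp = htmlqp($html);             // htmlqp() uses the lenient HTML parser
```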
The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:
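For instance (continuing the sketch above):

```php
$link = $qp->find('#nav a.first');    // CSS selector, jQuery style
echo $link->attr('href'), PHP_EOL;    // /home
echo $link->text(), PHP_EOL;          // Home
```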
Mostly you want to use simple #id and .class selectors.
QueryPath also allows injecting new tags into the stream (for example via ->append()) and later outputting the updated document (->writeHTML()).
phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features). For further information on the differences see this comparison on the Wayback Machine from tagbyte.org. (The original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.) And here's a comprehensive QueryPath introduction.
---
Why you shouldn't, and when you should, use regular expressions

First off, HTML cannot be properly parsed using regular expressions. Regexes can, however, extract data. Extracting is what they're made for. The major drawback of regex HTML extraction over proper SGML toolkits or basic XML parsers is their syntactic cumbersomeness and meager reliability. Consider that making a somewhat reliable HTML extraction regex:
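(A hedged sketch; $html is hypothetical, and even this still breaks on comments, scripts, and CDATA.)

```php
// Tolerates single quotes, double quotes, unquoted values and extra attributes,
// yet remains brittle against comments, <script> blocks and exotic markup
$regex = '/<a\s[^>]*href\s*=\s*("([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>/i';
preg_match_all($regex, $html, $matches);
```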
is way less readable than a simple phpQuery or QueryPath equivalent:
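(A sketch, assuming the respective library is already loaded; $html is hypothetical.)

```php
// phpQuery
phpQuery::newDocumentHTML($html);
$href = pq('a')->attr('href');

// QueryPath
$href = htmlqp($html, 'a')->attr('href');
```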
There are, however, specific use cases where they can help:

- Most XML parsers cannot see HTML document comments, which can nevertheless serve as useful anchors for extraction.
- It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions before handing it to a parser.
---
I suggest using phpQuery.
---
Simple HTML DOM is a great open-source parser: it treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name. I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.
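A minimal sketch, assuming the library file is in place and using hypothetical markup:

```php
require 'simple_html_dom.php';   // assuming the library file is available

$dom = str_get_html('<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>');

// find() returns all matching elements, jQuery-style
foreach ($dom->find('a') as $a) {
    echo $a->href, ' => ', $a->plaintext, PHP_EOL;
}
```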
---
One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it. But for your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.
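A minimal sketch of the Tidy approach, assuming the tidy extension is enabled and $brokenHtml is hypothetical:

```php
$config = ['output-xhtml' => true, 'numeric-entities' => true];

$tidy = new tidy();
$tidy->parseString($brokenHtml, $config, 'utf8');
$tidy->cleanRepair();

$xhtml = (string) $tidy;                  // well-formed XHTML
$xml   = simplexml_load_string($xhtml);   // any XML tool can take over from here
```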
---
For 1a and 2: I would vote for the new Symfony Component class DomCrawler. This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world. The component is designed to work standalone and can be used without Symfony. The only drawback is that it will only work with PHP 5.3 or newer.
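A minimal sketch with hypothetical markup (filter() needs the CssSelector component; filterXPath() works without it):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<div class="post"><h2>Hello</h2></div>');

// filter() takes CSS selectors; filterXPath() is the XPath alternative
echo $crawler->filter('div.post h2')->text();   // Hello
```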
---
This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.
---
This sounds like a good task description of W3C XPath technology. It's easy to express queries like "return all href attributes of anchor elements".
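For instance, a minimal sketch (assuming $dom is a DOMDocument that has already loaded the page):

```php
$xpath = new DOMXPath($dom);

// "all href attributes of anchor elements" as an XPath expression
foreach ($xpath->query('//a/@href') as $attr) {
    echo $attr->value, PHP_EOL;
}
```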
---
PHP Simple DOM Parser looks good. I haven't tried using it yet though.
---
We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good for what they were created for, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle invalid HTML/XHTML structures that would fail to load in most of the parsers.
---
I recommend PHP Simple HTML DOM Parser
---
There is also Goutte (a PHP web scraper), which is now available: https://github.com/fabpot/Goutte/
---
Third-party alternatives to SimpleHtmlDom that use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
---
With PHP, I would advise you to use the Simple HTML DOM Parser; the best way to learn more about it is to look for samples on the ScraperWiki website.
---
I've used HTML Purifier with a lot of success on a couple of different projects.
---
html5lib has a PHP version. (I don't know how up-to-date it is.)
---
Yes, you can use simple_html_dom for the purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too fragile. It does the basic job, but I won't recommend it anyway. I have never used curl for the purpose, but what I have learned is that curl can do the job much more efficiently and is much more solid. Kindly check out this link: scraping-websites-with-curl
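A hedged sketch of that pattern: curl only fetches the page, and the parsing is still a separate step (the URL is hypothetical).

```php
$ch = curl_init('https://example.com/');          // hypothetical URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
$body = curl_exec($ch);
curl_close($ch);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($body);   // hand the fetched markup to a real parser
```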
---
QueryPath is good, but be careful of "tracking state", because if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.

What it means is that each call on the result set modifies the result set in the object; it's not chainable like in jQuery, where each link is a new set. You have a single set, which is the results from your query, and each function call modifies that single set. To get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that way it mirrors what happens in jQuery much more closely, as in the sketch below.
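A minimal sketch of that branching pattern, with a hypothetical document in $html:

```php
$qp = htmlqp($html);   // hypothetical document

// Without branch(), each find() narrows $qp itself.
// branch() works on a copy, so the original set stays intact:
$links      = $qp->branch()->find('a');
$paragraphs = $qp->branch()->find('p');
```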
Then each branch holds its own filtered results, while the original query set stays intact.
---
Goutte is a simple but awesome web scraper which you can just drop into your code. It's used heavily by some other libraries, such as Behat, and is stable and well featured.
---
For HTML5, html5lib has been abandoned for years now. The only HTML5 library I can find with recent updates and maintenance records is html5-php, which was just brought to beta 1.0 a little over a week ago.
---
There is Wiseparser. It requires PHP 5 and works in a manner close to real browsers.
---
You could try using something like HTML Tidy to clean up any "broken" HTML and convert it to XHTML, which you can then parse with an XML parser.
---
I have written a general-purpose XML parser that can easily handle GB-sized files. It's based on XMLReader and it's very easy to use.
Here's the github repo: XmlExtractor
---
Another option you can try is QueryPath. It's inspired by jQuery, but it runs on the server in PHP and is used in Drupal.
---
The Symfony framework has components which can parse HTML, and you can use CSS selectors to select DOM elements instead of using XPath.
---
JSON & array from XML in 3 lines:
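Presumably along these lines (a sketch; $xmlString is assumed to hold valid XML):

```php
$xml   = simplexml_load_string($xmlString);   // assumes $xmlString holds valid XML
$json  = json_encode($xml);
$array = json_decode($json, true);
```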
Ta da!