Fetching pages with PHP: two different approaches

I experimented a bit on my local machine with some content downloaded from Html.it. Basically, I used two different approaches for fetching a page: a stream-based approach, which reads the page into a string and picks it apart with regular expressions, and a DOM-driven one. Here's the first:

// Read the whole author page into a string.
$edit_posts = file_get_contents('http://blog.html.it/author/gabroman');

// Collect every post paragraph; the full matches end up in $matches[0].
$posts = preg_match_all('#<p id="post-\d+"><a\shref="http://blog\.html\.it/\d+/\d+/\d+/.+">.+</a></p>#', $edit_posts, $matches);

foreach ($matches[0] as $match) {
    // Strip the wrapping paragraph and the title attribute, keeping only the link.
    $match = preg_replace('#<p id="post-\d+">#', '', $match);
    $match = preg_replace('/title="[^"]*"/', '', $match);
    $match = preg_replace('#</p>#', '', $match);
    echo '<li>' . $match . '</li>';
}
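
As a variation on the snippet above, capture groups can pull the URL and the link text out in a single pass, so the three preg_replace() calls are not needed. This is only a sketch: the $sample markup is hypothetical, shaped like what the pattern above expects, and is not taken from the real page.

```php
<?php
// Hypothetical sample markup (an assumption, not the real Html.it page).
$sample = <<<HTML
<p id="post-101"><a href="http://blog.html.it/2010/05/12/first-post/" title="First post">First post</a></p>
<p id="post-102"><a href="http://blog.html.it/2010/06/03/second-post/" title="Second post">Second post</a></p>
HTML;

// Group 1 captures the URL, group 2 the link text.
// [^>]* skips any extra attributes such as title.
$pattern = '#<p id="post-\d+"><a\s+href="(http://blog\.html\.it/\d+/\d+/\d+/[^"]+)"[^>]*>([^<]+)</a></p>#';

preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER);

$items = [];
foreach ($matches as $m) {
    // Escape both values before putting them back into HTML.
    $items[] = '<li><a href="' . htmlspecialchars($m[1]) . '">'
        . htmlspecialchars($m[2]) . '</a></li>';
}

echo implode("\n", $items), "\n";
```

Using negated character classes such as [^"]+ instead of .+ keeps each match from running past the end of its own attribute or element.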

The DOM approach is quite different:

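$html = new DOMDocument();
// Load the page before querying it. Assumption: the same author page as in the first example.
// The @ suppresses the warnings libxml raises on real-world markup.
@$html->loadHTMLFile('http://blog.html.it/author/gabroman');

$content = $html->getElementById('content');
$links = $content->getElementsByTagName('a');

foreach ($links as $link) {
    $url = $link->getAttribute('href');
    if (preg_match('#^http://blog\.html\.it/\d#', $url)) {
        $text = $link->firstChild->nodeValue;
        // The element after the link's parent (skipping a whitespace text node) holds the "posted" text.
        $posted = $link->parentNode->nextSibling->nextSibling->firstChild->nodeValue;
        echo '  <li><a href="' . $url . '">' . $text . '</a>' . "\n"
            . '    <div>' . $posted . '</div>' . "  </li>\n";
    }
}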
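
The chain of nextSibling calls above depends on the exact whitespace between elements. A slightly more tolerant variation, sketched here with DOMXPath, asks for the first following element sibling instead. The $sample markup (an h2 holding the link, followed by a paragraph) is purely hypothetical; the real page may be structured differently.

```php
<?php
// Hypothetical sample markup (an assumption, not the real Html.it page).
$sample = <<<HTML
<div id="content">
  <h2><a href="http://blog.html.it/2010/05/12/first-post/">First post</a></h2>
  <p class="date">12 May 2010</p>
  <h2><a href="http://example.com/other">Elsewhere</a></h2>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];

// Only links inside #content that point at the blog itself.
$query = '//*[@id="content"]//a[starts-with(@href, "http://blog.html.it/")]';
foreach ($xpath->query($query) as $link) {
    // First element sibling after the link's parent, whatever whitespace sits in between.
    $next = $xpath->query('following-sibling::*[1]', $link->parentNode)->item(0);
    $rows[] = [
        'url'    => $link->getAttribute('href'),
        'text'   => trim($link->textContent),
        'posted' => $next ? trim($next->textContent) : '',
    ];
}

foreach ($rows as $row) {
    echo $row['text'], ' (', $row['posted'], "): ", $row['url'], "\n";
}
```

The null check on $next means a missing sibling yields an empty string rather than a fatal error, which helps a little when the markup shifts.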

To handle special characters properly, both pages (the fetched one and the one that displays the results) must use the same encoding; the target encoding here is UTF-8. Both approaches also rely on the page structure never changing, and obviously this is not the case on the real web!
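
When the fetched page is not already in UTF-8, one option is to convert it before parsing or printing it. The following is a minimal sketch, assuming the mbstring extension is available and that the source charset is known in advance; to_utf8() is a hypothetical helper name, and ISO-8859-1 is just an example charset.

```php
<?php
// Hypothetical helper: returns UTF-8 text given raw bytes and their charset.
function to_utf8(string $raw, string $sourceCharset): string
{
    // Nothing to do if the source is already UTF-8.
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $sourceCharset);
}

// "Città" encoded as ISO-8859-1: the final byte 0xE0 is "à".
$latin1 = "Citt\xE0";
$utf8 = to_utf8($latin1, 'ISO-8859-1');

echo $utf8, "\n";
```

In practice the charset would come from the response headers or from the page's meta tag rather than being hard-coded.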

Posted by Gabriele Romanato.
