I played around a little on my local machine with some content downloaded from Html.it. Basically, I used two different approaches to fetch and parse a page: a stream-based one and a DOM-driven one. Here's the first:
$edit_posts = file_get_contents('http://blog.html.it/author/gabroman');

// Grab every paragraph that wraps a post link.
$posts = preg_match_all('#<p id="post-\d+"><a\shref="http://blog.html.it/\d+/\d+/\d+/.+">.+</a></p>#', $edit_posts, $matches);

foreach ($matches[0] as $match) {
    // Strip the wrapping <p> tag, the title attribute and the closing </p>,
    // so that only the bare <a> element is left.
    $match = preg_replace('#<p id="post-\d+">#', '', $match);
    $match = preg_replace('/title=".+"/', '', $match);
    $match = preg_replace('#</p>#', '', $match);
    echo '<li>' . $match . '</li>';
}
The DOM approach is quite different:
$html = new DOMDocument();

// Silence the warnings that loadHTMLFile() raises on real-world markup.
libxml_use_internal_errors(true);
$html->loadHTMLFile('http://blog.html.it/author/gabroman');
libxml_clear_errors();

$content = $html->getElementById('content');
$links   = $content->getElementsByTagName('a');

foreach ($links as $link) {
    $url    = $link->getAttribute('href');
    $text   = $link->firstChild->nodeValue;
    // The posting date lives in the element that follows the link's parent.
    $posted = $link->parentNode->nextSibling->nextSibling->firstChild->nodeValue;

    // Keep only the links that actually point to a blog post.
    if (preg_match('#^http://blog.html.it/\d#', $url)) {
        echo ' <li><a href="' . $url . '">' . $text . '</a>' . "\n" .
             ' <div>' . $posted . '</div>' .
             " </li>\n";
    }
}
Note that, to handle special characters properly, both pages must share the same encoding (here, UTF-8). More importantly, both approaches assume that the page structure will stay the same forever, and obviously this is not the case on the real web!
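The post doesn't show the encoding step, but a minimal sketch might look like this, assuming the remote page could be served as ISO-8859-1 and that the mbstring extension is available:

// A minimal sketch (not from the original code): normalise the fetched
// markup to UTF-8 before parsing it with either approach.
$raw = file_get_contents('http://blog.html.it/author/gabroman');

// Guess the source encoding; any 8-bit page falls back to ISO-8859-1 here.
$encoding = mb_detect_encoding($raw, array('UTF-8', 'ISO-8859-1'), true);
if ($encoding !== false && $encoding !== 'UTF-8') {
    $raw = mb_convert_encoding($raw, 'UTF-8', $encoding);
}

// Declare the embedding page as UTF-8 as well, so both sides match.
header('Content-Type: text/html; charset=utf-8');

Normalising the markup once, before handing it to preg_match_all() or DOMDocument, keeps accented characters from being mangled in the generated list.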