I experimented a bit on my local machine with some content downloaded from Html.it, using two different approaches to fetch and parse a page: a stream-based one and a DOM-driven one. Here's the first:
// Fetch the author page as a plain string
$edit_posts = file_get_contents('http://blog.html.it/author/gabroman');
// Grab every post title paragraph (non-greedy quantifiers keep each match
// confined to a single paragraph)
preg_match_all('#<p id="post-\d+"><a\shref="http://blog.html.it/\d+/\d+/\d+/.+?">.+?</a></p>#',
    $edit_posts, $matches);
foreach ($matches[0] as $match) {
    // Strip the wrapping paragraph and the title attribute, keeping only the link
    $match = preg_replace('#<p id="post-\d+">#', '', $match);
    $match = preg_replace('/title=".+?"/', '', $match);
    $match = preg_replace('#</p>#', '', $match);
    echo '<li>' . $match . '</li>';
}
The DOM approach is quite different:
// Keep libxml from flooding the output with warnings about real-world HTML
libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTMLFile('http://blog.html.it/author/gabroman');
// Work only inside the #content container
$content = $html->getElementById('content');
$links = $content->getElementsByTagName('a');
foreach ($links as $link) {
    $url = $link->getAttribute('href');
    $text = $link->firstChild->nodeValue;
    // Skip the whitespace text node after the title paragraph to reach
    // the element holding the post date
    $posted = $link->parentNode->nextSibling->nextSibling->firstChild->nodeValue;
    // Keep only links that point to actual posts
    if (preg_match('#^http://blog\.html\.it/\d#', $url)) {
        echo '<li><a href="' . $url . '">' . $text . "</a>\n" .
             '<div>' . $posted . "</div></li>\n";
    }
}
To handle special characters properly, both the fetched page and the output page must share the same encoding (UTF-8 in this case). However, both approaches assume that the page structure will stay the same forever, and that's obviously not how the real web works!
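For what it's worth, one way to keep the encodings aligned is to normalize the fetched markup to UTF-8 before parsing it and to declare the same charset on the output page. Here's a minimal sketch, assuming the mbstring extension is available and that the remote page might be served as ISO-8859-1 (that source charset is an assumption, not something the examples above guarantee):

// Minimal sketch: normalize the fetched markup to UTF-8 before parsing or printing it.
// The fallback source charset (ISO-8859-1) is an assumption; adjust it to whatever
// the remote page actually declares.
$raw = file_get_contents('http://blog.html.it/author/gabroman');
$charset = mb_detect_encoding($raw, array('UTF-8', 'ISO-8859-1'), true);
if ($charset !== 'UTF-8') {
    $raw = mb_convert_encoding($raw, 'UTF-8', $charset ? $charset : 'ISO-8859-1');
}
// Make sure the output page declares the same encoding
header('Content-Type: text/html; charset=utf-8');

With the string normalized this way, both the regex pass and DOMDocument work on UTF-8 input, and the echoed markup matches the charset declared to the browser.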