PHP: fetching only images from a Flickr feed

A Flickr feed is very useful when we want to retrieve the images we've uploaded to a Flickr album. Within this feed, we're only interested in the description element contained inside each item element. By default, the content of this element is served as HTML, and it's somewhat redundant because it also contains unnecessary information, such as "Gabriele Romanato posted a photo" repeated several times. Let's say that we want only the images, with no links or descriptions. How can we achieve this result?

PHP allows us to use Perl-Compatible Regular Expressions (PCRE) to select some portions inside strings or, more generally, textual content. So we're going to use them together with the SimpleXML library. Example:

$flickr_rss = simplexml_load_file('your/feed/url');

$entries = $flickr_rss->channel->item;
$i = -1;
$html = '';
$html .= '<div class="pics">' . "\n";

do {

  $i++;

  $entry = $entries[$i];
  // Cast to a string: SimpleXML returns an object, not plain text
  $content = (string) $entry->description;
  $pre_content = preg_replace('/[\r\n]+/', '', $content);
  // [^"]+ stops before the closing quote, so only src="URL is matched
  preg_match_all('/src="[^"]+/', $pre_content, $matches);
  $src = str_replace('src="', '', $matches[0][0]);

  $html .= '<img src="' . $src . '" />' . "\n";

} while($i < 4);

$html .= '</div>' . "\n";

echo $html;

We're retrieving only the first five images of our album using a do...while loop, which assumes that the feed contains at least five items. Since we know that the URL of each image is contained within a src attribute, we use a regular expression to find it. Then we remove the src=" characters from the string using str_replace(). Note that since we've used preg_match_all(), the result stored in $matches is an array whose first value is in turn another array, so our value is the first item of this second array: $matches[0][0].
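The extraction step can also be sketched with a capture group, which hands back the URL directly and makes a missing image easy to detect. This is only an illustration: the helper name first_image_src() and the sample markup are made up, not part of the original code.

```php
<?php
// Hypothetical helper: returns the first image URL found in a chunk of HTML,
// or null when there is none.
function first_image_src(string $html): ?string
{
    // [^"]+ cannot cross a double quote, so group 1 ends with the URL
    if (preg_match('/<img[^>]*\ssrc="([^"]+)"/', $html, $matches)) {
        return $matches[1];
    }
    return null;
}

// Made-up description markup, assumed to resemble a Flickr item
$description = '<p><a href="http://example.com/photo/1">'
             . '<img src="http://example.com/a.jpg" width="240" height="160" alt="A" />'
             . '</a></p>';

$src = first_image_src($description);
if ($src !== null) {
    echo '<img src="' . htmlspecialchars($src) . '" />' . "\n";
}
```

A pattern such as /src=".+"/ is greedy and would run on to the last double quote of the line, swallowing the width, height and alt attributes as well, which is why the character class is used here.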

Parsing a Flickr feed with SimpleXML

Parsing a Flickr RSS feed requires some preparatory steps. First of all, I tried to download a static copy of my feeds using wget just to study the file structure. Wrong choice! In fact, the format returned was Atom, not RSS. So I used the following approach:

$file = file_get_contents('http://api.flickr.com/services/feeds/photos_public.gne?id=31968388@N02&lang=it-it&format=rss_200');
echo $file;

So I got the correct format. Parsing the feed is quite a simple task:

$rss = simplexml_load_file('http://api.flickr.com/services/feeds/photos_public.gne?id=31968388@N02&lang=it-it&format=rss_200');

foreach($rss->channel->item as $entry) {

    $title = $entry->title;
    $raw_published = $entry->pubDate;
    // Strip the timezone offset from the publication date
    $published = str_replace('-0700', '', (string) $raw_published);
    $raw_content = $entry->description;
    // The second argument is the flags; the encoding comes third and must be a quoted string
    $content = html_entity_decode((string) $raw_content, ENT_QUOTES, 'UTF-8');

    echo '<li>' . "\n" . '<h2>' . $title . '</h2>' . "\n" . '<p class="pubdate">' . $published . "</p>\n" .
    $content . "</li>\n";

}

Two notes here:

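  1. the object returned by simplexml_load_file() represents the root rss element, so the items must be reached through its channel child ($rss->channel->item), not directly from the object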
  2. the description element contains markup that needs to be expanded using the html_entity_decode() function
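The first note can be checked with a small experiment. The feed below is a made-up two-item document, used only to show which element the SimpleXML object stands for and where the items are found.

```php
<?php
// Made-up two-item feed used only for illustration
$xml = '<rss version="2.0"><channel><title>Demo</title>'
     . '<item><title>One</title></item>'
     . '<item><title>Two</title></item>'
     . '</channel></rss>';

$rss = simplexml_load_string($xml);

echo $rss->getName(), "\n";            // name of the element the object stands for
echo count($rss->item), "\n";          // items looked up directly under the root
echo count($rss->channel->item), "\n"; // items looked up under channel
```

The direct lookup finds no items because item is a grandchild of the root, not a child.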

Limiting results with PHP SimpleXML

While parsing an RSS feed with SimpleXML, it is often useful to limit the number of items we process after loading the feed with the simplexml_load_file() function. Suppose that we want to parse a Twitter feed and we want only the first five items. We could write the following:

  $rss = simplexml_load_file('http://twitter.com/statuses/user_timeline/120345723.rss');

  $items = $rss->channel->item;
  $i = -1;

  do {

    $i++;

    $raw_title = $items[$i]->title;
    $title = str_replace('gabromanato:', '', $raw_title);
    $title = preg_replace('/http.+/', '', $title);

    $raw_date = $items[$i]->pubDate;
    $date = str_replace('+0000', '', $raw_date);

    $link = $items[$i]->link;

    echo '<li><a href="' . $link . '">' . $title . '</a>' . "\n" . '<div class="pubdate">' . $date . '</div>' . "\n" . "</li>\n";

  // $i runs from 0 to 4, so exactly five items are printed
  } while($i < 4);

We're using a do...while loop here. Every time we increment our internal counter, we move to the next item in the array. Since the index of the returned items starts from 0, we set our counter to -1 so that the first increment points to the very first item.
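A do...while loop with a fixed bound breaks if the feed holds fewer items than expected. As an alternative sketch, a foreach loop with an early break never reads past the end of the feed. The helper name first_titles() and the generated feed are assumptions made for this example.

```php
<?php
// Hypothetical helper: collects the titles of the first $limit items.
function first_titles(SimpleXMLElement $rss, int $limit): array
{
    $titles = [];
    foreach ($rss->channel->item as $item) {
        if (count($titles) >= $limit) {
            break; // enough items collected
        }
        $titles[] = (string) $item->title;
    }
    return $titles;
}

// Made-up feed with seven items
$xml = '<rss version="2.0"><channel>';
for ($n = 1; $n <= 7; $n++) {
    $xml .= '<item><title>Item ' . $n . '</title></item>';
}
$xml .= '</channel></rss>';

$titles = first_titles(simplexml_load_string($xml), 5);
echo implode(', ', $titles), "\n";
```

With a feed shorter than the limit, the function simply returns every title it finds.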

Parsing a FeedBurner feed with SimpleXML

Parsing a FeedBurner feed with SimpleXML involves a single gotcha: the object returned by the simplexml_load_file() function represents the root element of the feed, and the item elements are not its direct children. For example, the following code returns nothing:

$feed = simplexml_load_file('http://feeds.feedburner.com/blogspot/onwebdev/');

foreach($feed->item as $item) {

    //... nothing here

}

Instead, the following code works as expected:

 $feed = simplexml_load_file('http://feeds.feedburner.com/blogspot/onwebdev/');
    
    foreach($feed->channel->item as $item) {
    
        $title = $item->title;
        $raw_author = $item->author;
        $author = str_replace('gabriele.romanato@gmail.com', '', $raw_author);
        $author = str_replace('(', '', $author);
        $author = str_replace(')', '', $author);
        $links = $item->children('http://rssnamespace.org/feedburner/ext/1.0');
        $link = $links->origLink;
        $raw_date = $item->pubDate;
        $pubdate = str_replace('+0000', '', $raw_date);
        
        echo '<li><a href="' . $link . '">' . $title . '</a>' . "\n" . '<div class="author">' . $author . '</div>' . "\n" .
        '<div class="pubdate">' . $pubdate . '</div>' . "</li>\n";
    
    
    }

Note that the loop now starts from the channel element, which is where the item elements live.
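The call to children() with a namespace URI is what makes the feedburner:origLink element reachable. A minimal sketch on a made-up single-item feed (the example.com URLs are placeholders) shows the difference between the two lookups:

```php
<?php
// Made-up single-item feed declaring the FeedBurner namespace
$xml = '<rss version="2.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">'
     . '<channel><item>'
     . '<title>Post</title>'
     . '<link>http://feeds.example.com/~r/post</link>'
     . '<feedburner:origLink>http://example.com/post</feedburner:origLink>'
     . '</item></channel></rss>';

$feed = simplexml_load_string($xml);
$item = $feed->channel->item[0];

// Plain property access only sees elements without a namespace prefix
$plain = (string) $item->origLink;

// children() with the namespace URI exposes the prefixed elements
$ns = $item->children('http://rssnamespace.org/feedburner/ext/1.0');
$orig = (string) $ns->origLink;

echo $orig, "\n";
```

Here $plain ends up empty, while $orig holds the original link.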

Parsing Twitter feeds with SimpleXML

Parsing Twitter feeds with SimpleXML is quite a simple task. First of all, you need the URL of your Twitter RSS/Atom feed. Then you can use SimpleXML as follows:

$tweets = simplexml_load_file('http://twitter.com/statuses/user_timeline/120345723.rss');
        
    
    foreach($tweets->channel->item as $tweet) {
    
        $raw_title = $tweet->title;
        $title = str_replace('gabromanato:', '', $raw_title);
        $title = preg_replace('/http.+/', '', $title);
        
        $raw_date = $tweet->pubDate;
        $date = str_replace('+0000', '', $raw_date);
        
        $link = $tweet->link;
        
        echo '<li><a href="' . $link . '">' . $title . '</a>' . "\n" . '<div class="pubdate">' . $date . '</div>' . "\n" . "</li>\n";
    
    
    }

I've only removed some unnecessary strings using str_replace() and preg_replace().
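The two replacements can be wrapped in a small function so that the user name is not hard-coded. This is just a sketch: clean_tweet_title() is a hypothetical name and the sample title is invented.

```php
<?php
// Hypothetical helper mirroring the two replacements used above
function clean_tweet_title(string $raw, string $user): string
{
    // Remove the "user:" prefix that Twitter puts before each title
    $title = str_replace($user . ':', '', $raw);
    // Drop everything from the first URL to the end of the line
    $title = preg_replace('/http.+/', '', $title);
    return trim($title);
}

echo clean_tweet_title('gabromanato: New post on SimpleXML http://example.com/abc', 'gabromanato'), "\n";
```

Note that /http.+/ removes any text that follows the first URL too, so a title with words after the link loses them.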

Parsing XHTML with PHP SimpleXML

Again, XHTML is really powerful when served as application/xhtml+xml. We can even parse it with the PHP SimpleXML library. For example, given the following markup:

<body>
  <p>Test</p>
</body>

we can write the following PHP code:

$xhtml_file = 'test.xhtml';
$xhtml_doc = simplexml_load_file($xhtml_file);

$p = $xhtml_doc->body->p;
echo $p; // 'Test'

Very simple, isn't it? The fact is that XHTML is XML: browsers treat it as such when it is served with its proper content type, and SimpleXML can parse it as long as the document is well-formed.
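The same lookup can be tried without a file by loading a string. The document below is a made-up minimal page; the html root element and the head are assumptions, since the markup shown earlier contains only the body.

```php
<?php
// Made-up minimal XHTML document; the html root and head are assumed
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml">'
       . '<head><title>Test page</title></head>'
       . '<body><p>Test</p></body>'
       . '</html>';

$doc = simplexml_load_string($xhtml);

echo $doc->getName(), "\n";          // the object stands for the html element
echo (string) $doc->body->p, "\n";   // text of the paragraph inside body
```

If the markup is not well-formed (an unclosed p element, for instance), simplexml_load_string() returns false instead of an object, so the result is worth checking before use.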