Overuse of PCRE in PHP considered harmful

PCRE (Perl-Compatible Regular Expressions) can be really useful in PHP, but only when used properly. A common mistake is to use regular expressions to extract complex markup snippets from a web document. Doing so silently assumes that the document has well-formed, regular markup, which is not always the case. Take, for example, a document that contains the following two paragraphs:

<p>First paragraph on a single line.</p>

<p>Second paragraph on two lines.
</p>

We could use the following PHP code to extract the content of both paragraphs:

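    $test_file = file_get_contents('test.html');

    // Without the "s" modifier the dot does not match a newline, so this
    // pattern only matches a paragraph that sits on a single line.
    $single_line_para = '#<p>.+</p>#';

    if (preg_match_all($single_line_para, $test_file, $matches)) {
        foreach ($matches[0] as $match) {
            // Print inside the loop so that every match is reported,
            // not just the last one.
            $text = strip_tags($match);
            echo '<p>The content of the first paragraph is: <em>' . $text . '</em></p>' . "\n";
        }
    }

    // Matches a paragraph whose closing tag sits on the following line.
    $two_line_para = '#<p>.+\n</p>#';

    if (preg_match_all($two_line_para, $test_file, $matches)) {
        foreach ($matches[0] as $match) {
            $text = strip_tags($match);
            echo '<p>The content of the second paragraph is: <em>' . $text . '</em></p>' . "\n";
        }
    }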

So far so good. But what happens if the paragraphs have no closing tags, or if a paragraph contains more newline characters than the pattern expects? Simply put, the patterns stop matching and nothing is extracted. In short, PCRE should be used only in really predictable scenarios, such as form validation, file and directory handling, and URL management. Everything else should be considered carefully.
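To make the failure concrete, here is a minimal sketch that runs the single-line pattern against the same two paragraphs with their closing tags removed. The post does not name an alternative, so the second half is an assumption on my part: it uses PHP's bundled DOM extension (DOMDocument), whose HTML parser is generally tolerant of unclosed tags. The input string and the helper name extract_paragraphs are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical input: the two paragraphs from above, without closing tags.
$html = "<p>First paragraph on a single line.\n\n<p>Second paragraph on two lines.\n";

// The single-line pattern finds nothing, because no </p> exists at all.
$count = preg_match_all('#<p>.+</p>#', $html, $matches);
echo 'Regex matches: ' . $count . "\n"; // prints "Regex matches: 0"

/**
 * Assumed alternative, not taken from the post: let an HTML parser
 * recover the structure and read the text of each <p> element.
 */
function extract_paragraphs(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings about the broken markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent is the element's text with all markup removed.
        $paragraphs[] = trim($p->textContent);
    }

    return $paragraphs;
}

foreach (extract_paragraphs($html) as $i => $text) {
    echo 'Paragraph ' . ($i + 1) . ': ' . $text . "\n";
}
```

A parser of this kind decides where each paragraph ends by applying HTML's rules rather than by looking for a literal closing tag, which is why it can cope with input that defeats the patterns above. It is still worth checking the result against real documents, since heavily malformed markup may be repaired in unexpected ways.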
