Creating a word filter in PHP

You are no doubt familiar with the concept of a word filter – a function that replaces "naughty" words with asterisks. There are many cases when a word filter is useful. A profanity filter can keep a site "G-rated" by removing profanity, while a word filter can also be used simply to remove any words that shouldn't appear in a block of text (such as removing spammy terms or technical data).

A good word filter should support regular expressions, which allow you to define various forms of a word without typing all of them out. The regular expression /\bass(es|holes?)?\b/i will match most instances of the A-word without generating false positives. Nothing is more annoying than a word filter censoring out words like "classical" for no apparent reason.
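As a quick check of that claim, the sketch below runs the pattern from the paragraph above against a few sample words. The sample list is an assumption chosen for illustration; only the pattern itself comes from this article.

```php
<?php
// Pattern from the paragraph above; the sample words are illustrative only.
$pattern = '/\bass(es|holes?)?\b/i';
$samples = array('ass', 'Asses', 'asshole', 'classical', 'assertive', 'bass');

$results = array();
foreach ($samples as $word) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    $results[$word] = preg_match($pattern, $word) === 1;
}

foreach ($results as $word => $matched) {
    echo $word . ': ' . ($matched ? 'filtered' : 'left alone') . "\n";
}
```

The first three words are filtered; "classical", "assertive" and "bass" are left alone because the \b word boundaries do not line up around the offending letters.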

A basic word filter

The most basic of all word filters simply uses the str_replace() function to replace every occurrence of a word. You can create a simple function as follows:

function wordFilter($text)
{
    $filtered_text = $text;
    $filtered_text = str_replace('ass', '***', $filtered_text);
    $filtered_text = str_replace('shit', '****', $filtered_text);
    // ... and so on
    return $filtered_text;
}

This method is far from perfect, however. As you can see, this code:

echo wordFilter('My ass Ass asshole classroom shit shithead');

will output:

My *** Ass ***hole cl***room **** ****head

Talk about false positives! This code failed to filter out a form of a "naughty" word which happened to be capitalized, and also caught a perfectly innocent word (classroom).
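It is tempting to patch the capitalization problem by switching to str_ireplace(), the case-insensitive sibling of str_replace(). The sketch below (the function name wordFilterIgnoreCase is hypothetical) suggests why that is not enough: the capitalized form is now caught, but the substring matches remain.

```php
<?php
// Hypothetical variant of the basic filter; not part of the original article.
function wordFilterIgnoreCase($text)
{
    // str_ireplace() accepts parallel arrays of search terms and replacements.
    return str_ireplace(array('ass', 'shit'), array('***', '****'), $text);
}

echo wordFilterIgnoreCase('My ass Ass asshole classroom shit shithead');
// prints: My *** *** ***hole cl***room **** ****head
```

"Ass" is now filtered, but "classroom" is still mangled, which is the problem regular expressions address next.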

Regular expressions

Regular expressions greatly facilitate the matching of various forms of a single word, and can be used to restrict the search to only whole words (to avoid false positives). A basic filter that makes use of regular expressions can be implemented as follows:

function wordFilter($text)
{
    // The trailing ? makes the suffix group optional, so the bare word is matched too.
    $filter_terms = array('/\bass(es|holes?)?\b/i', '/\bshit(e|ted|ting|ty|head)?\b/i');
    $filtered_text = preg_replace($filter_terms, '***', $text);
    return $filtered_text;
}

Notice that now it's possible to restrict the filter to catch words like "ass", "asses", and "asshole" while ignoring words like "classroom". The \b anchor matches a word boundary, so the pattern only matches whole words, and the /i modifier tells it to ignore case. Passing the same string of text into the filter, we get:

My *** *** *** classroom *** ***

It works... but is far from perfect. Although this filter successfully catches every blacklisted word (and variant) that it is programmed to, it replaces each one with the same three asterisks, regardless of the word's length. This is not suitable for most applications.

The ultimate filter

How does one create a filter that not only allows regular expressions, but also replaces each filtered term with the required number of asterisks?

//Word filter by Nookkin, http://nookkin.com/
//You may use this code in any project, commercial or non-commercial in nature,
//provided that this notice is present wherever this code is used.

function wordFilter($text)
{
    // The trailing ? makes the suffix group optional, so the bare word is matched too.
    $filter_terms = array('\bass(es|holes?)?\b', '\bshit(e|ted|ting|ty|head)?\b');
    $filtered_text = $text;
    foreach($filter_terms as $word)
    {
        // Replace each match in place with one asterisk per character.
        // Using a callback avoids re-inserting the matched text into a second
        // pattern, so terms without \b and matches in any letter case both work.
        $filtered_text = preg_replace_callback(
            '/' . $word . '/i',
            function ($m) { return str_repeat('*', strlen($m[0])); },
            $filtered_text
        );
    }
    return $filtered_text;
}

Again, passing the same string of text into this filter, we get:

My *** *** ******* classroom **** ********

You can easily add words to the $filter_terms array; note that the leading / and trailing /i are added automatically inside the loop, so there is no need to type them every time. You can also add "greedy" matches that catch a word anywhere inside a longer one: simply remove the \b from both ends of the pattern. A good idea would be to load all of the terms into the array from an external file.
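One possible way to load the terms from an external file is sketched below. The helper name loadFilterTerms and the file format (one pattern per line, with lines starting with # treated as comments) are assumptions made for this example, not part of the filter above.

```php
<?php
// Hypothetical helper: reads one regular-expression term per line.
// The name and the file format are assumptions for illustration.
function loadFilterTerms($path)
{
    // Return an empty list rather than raising a warning on a missing file.
    if (!is_readable($path)) {
        return array();
    }

    // file() returns the file as an array of lines.
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return array();
    }

    $terms = array();
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip whitespace-only lines and lines marked as comments with #.
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        $terms[] = $line;
    }
    return $terms;
}
```

Inside wordFilter(), the hard-coded array could then be replaced with something like loadFilterTerms('filter_terms.txt'), where the file name is again just a placeholder. Terms in the file are written without the surrounding / and /i, exactly as in the array.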

Depending on the level of "cleanliness" desired, you can modify the way that filtered terms are displayed. For example, you can keep the first and last letters of a word, if you want the user to be able to guess at the original term; you can also add inline HTML that displays the original word as a tooltip.
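Both display styles can be sketched with a small masking callback. Everything named here (maskMiddle, maskWithTooltip, wordFilterWith) is hypothetical, and the mild sample term is chosen only for illustration. The terms are combined into a single pattern so that the text is scanned once; otherwise a later term could match the original word sitting inside a tooltip added by an earlier one.

```php
<?php
// Hypothetical mask: keep the first and last letters, star out the rest.
function maskMiddle($word)
{
    $length = strlen($word);
    if ($length <= 2) {
        // Too short to leave a hint; mask the whole word.
        return str_repeat('*', $length);
    }
    return $word[0] . str_repeat('*', $length - 2) . $word[$length - 1];
}

// Hypothetical mask: full asterisks, with the original word as an HTML tooltip.
function maskWithTooltip($word)
{
    return '<span title="' . htmlspecialchars($word, ENT_QUOTES) . '">'
        . str_repeat('*', strlen($word)) . '</span>';
}

// Hypothetical filter variant that takes the terms and the mask as arguments.
function wordFilterWith($text, $filter_terms, $mask)
{
    if (count($filter_terms) === 0) {
        return $text;
    }
    // Wrap each term in a non-capturing group and join them into one pattern.
    $pattern = '/(?:' . implode(')|(?:', $filter_terms) . ')/i';
    return preg_replace_callback(
        $pattern,
        function ($m) use ($mask) { return call_user_func($mask, $m[0]); },
        $text
    );
}

$terms = array('\bdarn(ed)?\b');
echo wordFilterWith('That darned classroom', $terms, 'maskMiddle') . "\n";
// prints: That d****d classroom
echo wordFilterWith('That darned classroom', $terms, 'maskWithTooltip') . "\n";
// prints: That <span title="darned">******</span> classroom
```

The tooltip variant returns HTML, so it only makes sense when the surrounding text is already escaped for output as HTML.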

Final note

This is a versatile word filter that can be used just about anywhere; I use a slightly modified version of this on Nookkin.com's comment script. At the same time, it isn't perfect. I welcome any suggestions or criticisms of the code. Please don't hesitate to leave a comment below.

Posted on 7/13/2009 5:07 PM

Comments (4)

Posted by Jillian on Sunday, April 11, 2010 @ 1:50 PM
I've spent the last 5 days looking for a profanity filter to work with my app, and yours is the one that finally worked without much tweaking. Thanks a bunch!

Posted by nookkin on Sunday, April 11, 2010 @ 1:56 PM
@Jillian: I'm glad it worked out for you. The nice thing about it is that not only does it catch words, but it also adds the correct number of asterisks.

Posted by Irwan on Wednesday, May 19, 2010 @ 7:38 AM
I need help. There are other words to filter,
such as ones with repeated letters, like:

fuuuuck youuuuu iiiiiiiiiii aaaaaaaaaa ??????? !!!!!!!!! etc

Thanks

Posted by nookkin on Wednesday, May 19, 2010 @ 12:12 PM
@Irwan Since my function supports regular expressions, all you need to do is allow it to match multiple repeated letters. Add this to $filter_terms to let it catch "f*ck" (the real F word), "fuuuuuuck", and "fffuuuuuuuuccccckkk":

\bf+u+c+k+\b

Since there is no legitimate English word containing the sequence of characters F-U-C-K, it's safe to omit the \b at the end. However, don't do it with A-S-S, since many legitimate English words contain that sequence of characters (assertive, classroom, etc).

Another thing you might want to do is to match "garbage" characters in order to also catch "f u ck" and "f-u-c-k". You must be very careful here though, since it's VERY easy to accidentally get loads of false positives. Look at this:

\bf.*u.*c.*k.*\b

This will successfully match "f u ck" and "f-u-c-k", but it will also incorrectly match "for all of us, we can keep..."
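One way to tighten this, offered as a sketch rather than as part of the original filter, is to allow only non-letter "garbage" (spaces, punctuation, underscores) between the letters instead of .*, which matches anything. The helper name spacedPattern is hypothetical, the mild sample word is used only for illustration, and the helper assumes plain ASCII terms.

```php
<?php
// Hypothetical pattern builder: each letter may repeat, and only non-word
// characters or underscores may appear between letters. Assumes ASCII input.
function spacedPattern($word)
{
    $parts = array();
    foreach (str_split($word) as $letter) {
        // preg_quote() escapes any character that is special in a pattern.
        $parts[] = preg_quote($letter, '/') . '+';
    }
    // [\W_]* matches whitespace, punctuation and underscores, but no letters.
    return '\b' . implode('[\W_]*', $parts) . '\b';
}

$strict = '/' . spacedPattern('darn') . '/i';  // \bd+[\W_]*a+[\W_]*r+[\W_]*n+\b
$loose  = '/\bd.*a.*r.*n.*\b/i';               // the .* style shown above

foreach (array('d a rn', 'd-a-r-n', 'daaaarn', 'do all rabbits nap') as $sample) {
    echo $sample . ' => strict: ' . preg_match($strict, $sample)
        . ', loose: ' . preg_match($loose, $sample) . "\n";
}
```

The strict pattern matches the first three samples and rejects the innocent sentence, while the loose pattern matches all four. It can still produce false positives when single letters happen to be separated only by spaces or punctuation, so test any pattern built this way against real text before relying on it.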
