Creating a word filter in PHP

You are no doubt familiar with the concept of a word filter – a function that replaces "naughty" words with asterisks. There are many cases when a word filter is useful. A profanity filter can keep a site "G-rated" by removing profanity, while a word filter can also be used simply to remove any words that shouldn't appear in a block of text (such as removing spammy terms or technical data).

A good word filter should support regular expressions, which allow you to define various forms of a word without typing all of them out. The regular expression /\bass(es|holes?)?\b/i will match most instances of the A-word without generating false positives. Nothing is more annoying than a word filter censoring out words like "classical" for no apparent reason.
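To see the word boundaries at work, here is a quick check of that pattern against a few sample words. This is only a sketch: the helper name matchesFilter() and the sample words are made up for illustration.

```php
<?php
// Hypothetical helper: reports whether the pattern matches a single word.
function matchesFilter($word)
{
    return preg_match('/\bass(es|holes?)?\b/i', $word) === 1;
}

// Sample words chosen for illustration only
foreach (array('ass', 'Asses', 'assholes', 'classical', 'assertive') as $word) {
    echo $word . ': ' . (matchesFilter($word) ? 'filtered' : 'left alone') . "\n";
}
```

The first three words should be reported as filtered, while "classical" and "assertive" should be left alone, because in both of them the letters A-S-S are followed (or preceded) by another letter rather than a word boundary.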

A basic word filter

The most basic of all word filters simply uses the str_replace() function to replace every occurrence of a word. You can create a simple function as follows:

function wordFilter($text)
{
    $filtered_text = $text;
    $filtered_text = str_replace('ass', '***', $filtered_text);
    $filtered_text = str_replace('shit', '****', $filtered_text);
    // ... and so on
    return $filtered_text;
}

This method is far from perfect, however. As you can see, this code:

echo wordFilter('My ass Ass asshole classroom shit shithead');

will output:

My *** Ass ***hole cl***room **** ****head

Talk about false positives! This code failed to filter out a form of a "naughty" word which happened to be capitalized, and also caught a perfectly innocent word (classroom).
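Swapping in PHP's case-insensitive str_ireplace() would take care of the capitalization problem, but not the false positive. A quick sketch, with a function name made up for the comparison:

```php
<?php
// Hypothetical variant of the basic filter using case-insensitive replacement.
function wordFilterIgnoreCase($text)
{
    return str_ireplace(array('ass', 'shit'), array('***', '****'), $text);
}

echo wordFilterIgnoreCase('My ass Ass asshole classroom shit shithead');
// My *** *** ***hole cl***room **** ****head
```

"Ass" is now caught, but "classroom" is still mangled, which is why plain substring replacement is not enough.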

Regular expressions

Regular expressions greatly facilitate the matching of various forms of a single word, and can be used to restrict the search to only whole words (to avoid false positives). A basic filter that makes use of regular expressions can be implemented as follows:

function wordFilter($text)
{
    // \b anchors each pattern to whole words; the trailing /i ignores case
    $filter_terms = array('/\bass(es|holes?)?\b/i', '/\bshit(e|ted|ting|ty|head)?\b/i');
    $filtered_text = preg_replace($filter_terms, '***', $text);
    return $filtered_text;
}

Notice that now it's possible to restrict the filter to catch words like "ass", "asses", and "asshole" while ignoring words like "classroom". The \b assertion matches only at a word boundary, which restricts each pattern to whole words, and the /i modifier tells it to ignore case. Passing the same string of text into the filter, we get:

My *** *** *** classroom *** ***

It works... but is far from perfect. Although this filter successfully catches any blacklisted words (and their variants) that it is programmed to, it replaces every term with the same three asterisks, regardless of the word's length. This is not suitable for most applications.

The ultimate filter

How does one create a filter that not only allows regular expressions, but also replaces each filtered term with the required number of asterisks?

//Word filter by Nookkin, http://nookkin.com/
//You may use this code in any project, commercial or non-commercial in nature,
//provided that this notice is present wherever this code is used.

function wordFilter($text)
{
    $filter_terms = array('\bass(es|holes?)?\b', '\bshit(e|ted|ting|ty|head)?\b');
    $filtered_text = $text;
    foreach($filter_terms as $word)
    {
        // The leading / and trailing /i are added here, so the terms above omit them
        $filtered_text = preg_replace_callback('/' . $word . '/i', function ($matches) {
            // $bwstr is the "bad word" exactly as it appeared in the text
            $bwstr = $matches[0];
            // Replace it with one asterisk per character
            return str_repeat("*", strlen($bwstr));
        }, $filtered_text);
    }
    return $filtered_text;
}

Again, passing the same string of text into this filter, we get:

My *** *** ******* classroom **** ********

You can easily add words into the $filter_terms array; note that the beginning / and trailing /i are automatically included in the loop, eliminating the need to type them every time. You can also add looser substring matches; simply remove the \b from both ends of the word, and the filter will catch anything that contains that word within it. A good idea would be to load all of the terms into the array from an external file.
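Loading the terms from a file could look something like the following. This is a minimal sketch under assumptions of my own: the function name loadFilterTerms(), the file name filter_terms.txt, and the convention of one pattern per line with "#" marking a comment line are all hypothetical.

```php
<?php
// Hypothetical loader: expects one pattern per line, e.g. \bass(es|holes?)?\b
// Blank lines and lines starting with "#" are skipped (a convention assumed here).
function loadFilterTerms($path)
{
    if (!is_readable($path)) {
        return array(); // missing or unreadable file: fall back to an empty list
    }
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return array();
    }
    $terms = array();
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line !== '' && $line[0] !== '#') {
            $terms[] = $line;
        }
    }
    return $terms;
}

// The file name is only an example; an empty array comes back if it does not exist
$filter_terms = loadFilterTerms(__DIR__ . '/filter_terms.txt');
```

Keeping the patterns out of the source code means the list can be updated without touching the filter itself.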

Depending on the desired level of "cleanliness", you can modify the way that filtered terms are displayed. For example, you can keep the first and last letters of a word, if you want the user to be able to guess at the original term; you can also add inline HTML that displays the original word as a tooltip.
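As a rough sketch of the tooltip idea, the variant below wraps each filtered word in a span whose title attribute holds the original. The function name wordFilterWithTooltip() and the choice of a span with a title attribute are assumptions for illustration; the text is escaped first because the result is meant to be output as HTML.

```php
<?php
// Hypothetical variant: asterisks on screen, original word in a tooltip.
function wordFilterWithTooltip($text)
{
    $filter_terms = array('\bass(es|holes?)?\b', '\bshit(e|ted|ting|ty|head)?\b');
    // Combine all terms into one pattern so the text is scanned in a single pass
    // and later terms never match inside a title attribute added earlier
    $pattern = '/(?:' . implode('|', $filter_terms) . ')/i';
    // Escape the input before adding markup of our own
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return preg_replace_callback($pattern, function ($matches) {
        $bwstr = $matches[0];
        return '<span title="' . $bwstr . '">' . str_repeat('*', strlen($bwstr)) . '</span>';
    }, $escaped);
}

echo wordFilterWithTooltip('My ass Ass');
// My <span title="ass">***</span> <span title="Ass">***</span>
```

Whether showing the original word on hover is acceptable depends entirely on how strict the site needs to be.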

Final note

This is a versatile word filter that can be used just about anywhere; I use a slightly modified version of this on Nookkin.com's comment script. At the same time, it isn't perfect. I welcome any suggestions or criticisms of the code. Please don't hesitate to leave a comment below.

Posted on Monday, July 13, 2009 at 5:07 PM

Comments (8)

Jillian
Sunday, April 11, 2010 at 1:50 PM
I've spent the last 5 days looking for a profanity filter to work with my app, and yours is the one that finally worked without much tweaking. Thanks a bunch!

Sunday, April 11, 2010 at 1:56 PM
@Jillian: I'm glad it worked out for you. The nice thing about it is that not only does it catch words, but it also adds the correct number of asterisks.

Wednesday, May 19, 2010 at 7:38 AM
I need help. There are other words to filter, with repeated letters, like:

fuuuuck youuuuu iiiiiiiiiii aaaaaaaaaa ??????? !!!!!!!!! etc

Thanks

Wednesday, May 19, 2010 at 12:12 PM
@Irwan Since my function supports regular expressions, all you need to do is allow it to match multiple repeated letters. Add this to $filter_terms to let it catch "f*ck" (the real F word), "fuuuuuuck", and "fffuuuuuuuuccccckkk":

\bf+u+c+k+\b

Since there is no legitimate English word containing the sequence of characters F-U-C-K, it's safe to omit the \b at the end. However, don't do it with A-S-S, since many legitimate English words contain that sequence of characters (assertive, classroom, etc).

Another thing you might want to do is to match "garbage" characters in order to also catch "f u ck" and "f-u-c-k". You must be very careful here though, since it's VERY easy to accidentally get loads of false positives. Look at this:

\bf.*u.*c.*k.*\b

This will successfully match "f u ck" and "f-u-c-k", but it will also incorrectly match "for all of us, we can keep..."
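One way to tighten this up, offered as a sketch rather than a tested recommendation, is to allow only non-letter "garbage" between the letters instead of any character at all. The pattern below is an assumption of this example: [\W_]* permits spaces, punctuation, and underscores as separators, but not other letters or digits.

```php
<?php
// Assumed pattern: only non-word characters (or underscores) may separate the letters.
$pattern = '/\bf+[\W_]*u+[\W_]*c+[\W_]*k+\b/i';

// Sample strings chosen for illustration
$samples = array('f u ck', 'f-u-c-k', 'fuuuuck', 'for all of us, we can keep');
foreach ($samples as $sample) {
    echo $sample . ' => ' . (preg_match($pattern, $sample) ? 'match' : 'no match') . "\n";
}
```

The first three samples should match and the last one should not. To use it in the filter, add the pattern without the surrounding / and /i to $filter_terms. False positives are still possible (for example, single letters typed with spaces between them), so test it against real text before relying on it.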

Monday, May 30, 2011 at 10:47 AM
I used it as a spam filter in my message-sending process, but it didn't work out: it blocked my PMs throughout, and not a single PM could be sent. Please help me.
It also blocks messages and words which are not in filter_terms.

Suhan
Wednesday, October 12, 2011 at 7:05 AM
I have an article directory, and I need this code very badly. How and where do I add this code?

I don't know PHP coding, so please help me.

Joe
Thursday, May 31, 2012 at 7:17 PM
First, let me start by saying this is amazing, except I am having some issues. I want to leave the first and last letters 'un-starred' so the user can somewhat guess what was said. I have tried stripping the first character using 'substr' on the string $bwstr. The problem is that it takes the first character of the entire string, not just the word. What string would I have to pass through 'substr' in order to get this to work? Also, am I even doing this correctly, or overthinking it? Thank you!

Thursday, May 31, 2012 at 7:24 PM
@Joe You want to modify the "output" of the regular expression. I don't have time to test right now but you'd basically want to replace this:

str_repeat("*", strlen($bwstr))

with:

$bwstr[0] . str_repeat("*", strlen($bwstr)-2) . $bwstr[strlen($bwstr)-1]

Notice that this takes the first letter of the "bad word" string, adds on enough *'s to fill in the remaining n-2 letters, and then tacks on the last letter at the end.
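One caveat with that expression: for a match shorter than two characters, strlen($bwstr)-2 goes negative and str_repeat() will complain. A small helper that guards against this might look as follows; the name maskMiddle() is made up for this sketch, and starring very short words entirely is just one possible choice.

```php
<?php
// Hypothetical helper: keep the first and last characters, star the rest.
function maskMiddle($bwstr)
{
    $length = strlen($bwstr);
    if ($length <= 2) {
        // Too short to keep both ends and still hide anything; star it all
        return str_repeat('*', $length);
    }
    return $bwstr[0] . str_repeat('*', $length - 2) . $bwstr[$length - 1];
}

echo maskMiddle('asshole');
// a*****e
```

Calling maskMiddle($bwstr) in place of str_repeat("*", strlen($bwstr)) inside the filter would then give the partially masked output.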
