Creating a word filter in PHP

You are no doubt familiar with the concept of a word filter – a function that replaces "naughty" words with asterisks. There are many cases when a word filter is useful. A profanity filter can keep a site "G-rated" by removing profanity, while a word filter can also be used simply to remove any words that shouldn't appear in a block of text (such as removing spammy terms or technical data).

A good word filter should support regular expressions, which allow you to define various forms of a word without typing all of them out. The regular expression /\bass(es|holes?)?\b/i will match most instances of the A-word without generating false positives. Nothing is more annoying than a word filter censoring out words like "classical" for no apparent reason.

A basic word filter

The most basic of all word filters simply uses the str_replace() function to replace every occurrence of a word. You can create a simple function as follows:

function wordFilter($text)
{
    $filtered_text = $text;
    $filtered_text = str_replace('ass', '***', $filtered_text);
    $filtered_text = str_replace('shit', '****', $filtered_text);
    // ... and so on
    return $filtered_text;
}

This method is far from perfect, however. As you can see, this code:

echo wordFilter('My ass Ass asshole classroom shit shithead');

will output:

My *** Ass ***hole cl***room **** ****head

Talk about false positives! This code missed a form of a "naughty" word that happened to be capitalized (Ass), only partially masked the longer forms (asshole, shithead), and also caught a perfectly innocent word (classroom).

Regular expressions

Regular expressions greatly facilitate the matching of various forms of a single word, and can be used to restrict the search to only whole words (to avoid false positives). A basic filter that makes use of regular expressions can be implemented as follows:

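function wordFilter($text)
{
    // The trailing ? makes each suffix group optional, so the bare words match too
    $filter_terms = array('/\bass(es|holes?)?\b/i', '/\bshit(e|ted|ting|ty|head)?\b/i');
    // Run every pattern against the input text
    $filtered_text = preg_replace($filter_terms, '***', $text);
    return $filtered_text;
}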

Notice that it's now possible to restrict the filter to catch words like "ass", "asses", and "asshole" while ignoring words like "classroom". The \b assertion marks a word boundary, so the pattern only matches whole words, and the /i modifier tells it to ignore case. Passing the same string of text into the filter, we get:

My *** *** *** classroom *** ***

It works... but is far from perfect. Although this filter successfully catches any blacklisted words (and their variants) that it is programmed to, it replaces every term with the same three asterisks, no matter how long the word is. This is not suitable for most applications.
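To see what the word boundary and the case-insensitive modifier each contribute, the patterns can be tried on their own with preg_match(). This is only an illustrative sketch; the helper name matchesTerm() is made up for this example and is not part of the filter.

```php
<?php
// Hypothetical helper for illustration: true if $pattern matches anywhere in $text.
function matchesTerm($pattern, $text)
{
    return preg_match($pattern, $text) === 1;
}

// Without \b the pattern matches inside longer words.
var_dump(matchesTerm('/ass/', 'classroom'));      // bool(true)
// With \b on both sides only the whole word matches.
var_dump(matchesTerm('/\bass\b/', 'classroom'));  // bool(false)
// Without /i the capitalized form slips through...
var_dump(matchesTerm('/\bass\b/', 'Ass'));        // bool(false)
// ...and with /i it is caught.
var_dump(matchesTerm('/\bass\b/i', 'Ass'));       // bool(true)
```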

The ultimate filter

How does one create a filter that not only allows regular expressions, but also replaces each filtered term with the required number of asterisks?

//Word filter by Nookkin, http://nookkin.com/
//You may use this code in any project, commercial or non-commercial in nature,
//provided that this notice is present wherever this code is used.

function wordFilter($text)
{
    $filter_terms = array('\bass(es|holes?)?\b', '\bshit(e|ted|ting|ty|head)?\b');
    $filtered_text = $text;
    foreach ($filter_terms as $word)
    {
        // Wrap the term in / ... /i and mask every match in a single pass
        $filtered_text = preg_replace_callback('/' . $word . '/i', function ($matches) {
            // Ignore any whitespace the pattern matched around the word itself
            $bwstr = trim($matches[0]);
            // One asterisk per character of the matched word
            return str_replace($bwstr, str_repeat("*", strlen($bwstr)), $matches[0]);
        }, $filtered_text);
    }
    return $filtered_text;
}

Again, passing the same string of text into this filter, we get:

My *** *** ******* classroom **** ********

You can easily add words to the $filter_terms array; note that the leading / and trailing /i are added automatically in the loop, so there is no need to type them every time. You can also add "greedy" (substring) matches: simply remove the \b from both ends of the pattern, and the filter will catch that word anywhere it appears, even inside a longer word. A good idea would be to load all of the terms into the array from an external file.
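Loading the terms from an external file could look roughly like the sketch below. The function name loadFilterTerms(), the one-pattern-per-line layout, and the # comment convention are all assumptions made for this example, not something the filter requires.

```php
<?php
// Hypothetical loader: reads one pattern per line from a plain-text file.
// Blank lines and lines starting with # are skipped (a convention assumed here).
function loadFilterTerms($path)
{
    if (!is_readable($path)) {
        return array(); // missing or unreadable file: fall back to an empty list
    }
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return array();
    }
    $terms = array();
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        $terms[] = $line;
    }
    return $terms;
}
```

Inside wordFilter(), the hard-coded array would then become something like $filter_terms = loadFilterTerms('filter_terms.txt'); where the file name is just a placeholder. Since the loop wraps each term in / delimiters, a pattern in the file that contains a literal / needs it escaped as \/.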

Depending on the desired level of "cleanliness", you can modify the way that filtered terms are displayed. For example, you can keep the first and last letters of a word if you want the user to be able to guess at the original term; you can also add inline HTML that displays the original word as a tooltip.
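Both display variations can be sketched as small helpers that take the matched word and return its replacement. The names maskKeepEnds() and maskWithTooltip() are hypothetical, and the code assumes single-byte text, since strlen() counts bytes rather than characters.

```php
<?php
// Hypothetical helper: keeps the first and last characters and stars out the middle.
// Words of two characters or fewer are fully masked, since nothing would be hidden otherwise.
function maskKeepEnds($word)
{
    $len = strlen($word);
    if ($len <= 2) {
        return str_repeat('*', $len);
    }
    return $word[0] . str_repeat('*', $len - 2) . $word[$len - 1];
}

// Hypothetical helper: wraps the masked word in a span whose title attribute holds the original.
// htmlspecialchars() keeps the original text from breaking out of the attribute.
function maskWithTooltip($word)
{
    $original = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');
    return '<span title="' . $original . '">' . str_repeat('*', strlen($word)) . '</span>';
}

echo maskKeepEnds('classroom'), "\n";   // c*******m
echo maskWithTooltip('darn'), "\n";     // <span title="darn">****</span>
```

Either helper would be called in place of the plain str_repeat() expression in the filter. The tooltip variant only makes sense when the filtered text is output as HTML.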

Final note

This is a versatile word filter that can be used just about anywhere; I use a slightly modified version of this on Nookkin.com's comment script. At the same time, it isn't perfect. I welcome any suggestions or criticisms of the code. Please don't hesitate to leave a comment below.

Posted on Monday, July 13, 2009 at 5:07 PM

Comments (16)

Jillian
Sunday, April 11, 2010 at 1:50 PM
I've spent the last 5 days looking for a profanity filter to work with my app, and yours is the one that finally worked without much tweaking. Thanks a bunch!

Sunday, April 11, 2010 at 1:56 PM
@Jillian: I'm glad it worked out for you. The nice thing about it is that not only does it catch words, but it also adds the correct number of asterisks.

Wednesday, May 19, 2010 at 7:38 AM
I need help. There are other words to filter, ones with repeated letters, like:

fuuuuck youuuuu iiiiiiiiiii aaaaaaaaaa ??????? !!!!!!!!! etc

Thanks

Wednesday, May 19, 2010 at 12:12 PM
@Irwan Since my function supports regular expressions, all you need to do is allow it to match multiple repeated letters. Add this to $filter_terms to let it catch "f*ck" (the real F word), "fuuuuuuck", and "fffuuuuuuuuccccckkk":

\bf+u+c+k+\b

Since there is no legitimate English word containing the sequence of characters F-U-C-K, it's safe to omit the \b at the end. However, don't do it with A-S-S, since many legitimate English words contain that sequence of characters (assertive, classroom, etc).

Another thing you might want to do is to match "garbage" characters in order to also catch "f u ck" and "f-u-c-k". You must be very careful here though, since it's VERY easy to accidentally get loads of false positives. Look at this:

\bf.*u.*c.*k.*\b

This will successfully match "f u ck" and "f-u-c-k", but it will also incorrectly match "for all of us, we can keep..."
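A narrower alternative, offered here as an untested-in-production sketch rather than a recommendation, is to allow only non-letter separators between the letters instead of .*. The $tight pattern below is an assumption added for this example; the $loose pattern is the one shown above.

```php
<?php
// Loose pattern from above: .* lets anything, including whole words, sit between the letters.
$loose = '/\bf.*u.*c.*k.*\b/i';
// Tighter sketch: [\W_]* allows only spaces, punctuation and underscores between the letters.
$tight = '/\bf[\W_]*u[\W_]*c[\W_]*k\b/i';

$samples = array('f u ck', 'f-u-c-k', 'for all of us, we can keep');
foreach ($samples as $sample) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not
    printf("%-28s loose=%d tight=%d\n", $sample, preg_match($loose, $sample), preg_match($tight, $sample));
}
```

The tighter pattern still matches the two obfuscated spellings but no longer fires on the innocent sentence. To use it with the filter, drop the surrounding / and /i, since the loop adds them.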

Monday, May 30, 2011 at 10:47 AM
I used it as a spam filter in my message-sending process, but it didn't work out: it blocked all of my PMs, and not even a single PM could be sent. Please help me.
It also blocks messages and words that are not in filter_terms.

Suhan
Wednesday, October 12, 2011 at 7:05 AM
I have a article directory, I need this code for me very badly. How and where to add this code ?

I dont know php coding please help me.

Joe
Thursday, May 31, 2012 at 7:17 PM
First, let me say this is amazing, except I am having some issues. I want to leave the first and last letters 'un-starred' so the user can somewhat guess what was said. I have tried stripping the first character using 'substr' on the string $bwstr. The problem is that it takes the first character of the entire string, not just the word. What string would I have to pass through 'substr' in order to get this to work? Also, am I even doing this correctly, or am I overthinking it? Thank you!

Thursday, May 31, 2012 at 7:24 PM
@Joe You want to modify the "output" of the regular expression. I don't have time to test right now but you'd basically want to replace this:

str_repeat("*", strlen($bwstr))

with:

$bwstr[0] . str_repeat("*", strlen($bwstr)-2) . $bwstr[strlen($bwstr)-1]

Notice that this takes the first letter of the "bad word" string, adds on enough *'s to fill in the remaining n-2 letters, and then tacks on the last letter at the end.

Alan
Thursday, May 15, 2014 at 11:40 AM
I need help in where to put this in my code? I'm trying to use it for a chat room, and I can't find a place to put it that it works.

Friday, July 25, 2014 at 2:21 PM
Is there a way to make a personal filter? One the users could add in the words they choose not to see? Like an open typing area they could add the words they do not want to see.

Friday, July 25, 2014 at 7:41 PM
@Cynthia My code just does the word filtering part given some text and a filter array.

If you want users to have a personal filter, you'll need to load their terms into the filter array yourself -- for example, you can split the contents of the text field on spaces and add each token to the array.

If you want users to enter custom patterns in a user-friendly way (as most users don't know regex and it's possible to really mess up the filter with a malformed pattern), you can look into writing a simple translator that translates, say, simple wildcards like * into proper regex patterns.
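A minimal sketch of such a translator follows. The function names wildcardToPattern() and termsFromUserInput() are hypothetical, and the choice to expand * into "any run of word characters" is just one possible reading of a wildcard. The important part is preg_quote(), which escapes whatever the user typed so that it cannot break the pattern.

```php
<?php
// Hypothetical translator: turns a user-entered word with optional * wildcards
// into a pattern body suitable for the $filter_terms array.
function wildcardToPattern($term)
{
    // Escape everything that has a special meaning in a regex, including the / delimiter.
    $quoted = preg_quote(trim($term), '/');
    // preg_quote() turned each * into \*; swap that for "any run of word characters".
    $quoted = str_replace('\*', '\w*', $quoted);
    return '\b' . $quoted . '\b';
}

// Hypothetical: builds the filter array from a space-separated text field.
function termsFromUserInput($input)
{
    $tokens = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    return array_map('wildcardToPattern', $tokens);
}

print_r(termsFromUserInput('darn heck* a.b'));
```

Terms that begin or end with punctuation are not handled well by the surrounding \b, so a real version would probably want to reject or special-case them.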

Alternatively, just have a predefined list of known bad words (say, in a database table) with the proper patterns, and allow users to check/uncheck them on a form. Or maybe just store levels of offensiveness with the words and allow the user to select, e.g. allow "soft" profanity like sh*t but block "hard" profanity like f*ck.

Ronald
Saturday, July 26, 2014 at 7:59 PM
Is it possible to add another replace option so that spaces are removed ?

Say i block the word fuck and the user types f u c k. Of course this isn't going to be filtered. If it's possible, how do i do this ?

I tried it, and removes spaces, but then does not filter the word.

Thanks!

Sunday, July 27, 2014 at 2:03 PM
@Ronald Yep, just modify the pattern to include spaces. I haven't tested but this (or similar) should do the trick.

\bf[ ]*u[ ]*c[ ]*k[ ]*\b


Prakash Nanda
Wednesday, September 3, 2014 at 2:11 AM
nice, need to try it...

ryan
Thursday, March 9, 2017 at 5:24 PM
Is there a way to make it so say if you type in ass1

it will be blanked out? Because it doesnt blank out ass1 but only ass

Thursday, March 9, 2017 at 9:20 PM
@ryan Add it to your pattern.

\bass[0-9]?(es|holes?)?\b

