You are no doubt familiar with the concept of a word filter – a function that replaces "naughty" words with asterisks. There are many cases when a word filter is useful. A profanity filter can keep a site "G-rated" by removing profanity, while a word filter can also be used simply to remove any words that shouldn't appear in a block of text (such as removing spammy terms or technical data).
A good word filter should support regular expressions, which allow you to define various forms of a word without typing all of them out. The regular expression /\bass(es|holes?)?\b/i will match most instances of the A-word without generating false positives. Nothing is more annoying than a word filter censoring words like "classical" for no apparent reason.
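As a quick check of that claim, the sketch below runs the pattern against a handful of words. The sample list is made up for illustration; only the pattern itself comes from the text above.

```php
<?php
// Pattern quoted in the paragraph above.
$pattern = '/\bass(es|holes?)?\b/i';

// Hypothetical sample words, chosen only for illustration.
$samples = array('ass', 'Asses', 'asshole', 'classical', 'assertive', 'bass');

$results = array();
foreach ($samples as $word) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$word] = preg_match($pattern, $word) === 1;
    echo $word, ': ', $results[$word] ? 'filtered' : 'left alone', "\n";
}
```

Only the first three words should be reported as filtered; the word boundaries keep "classical", "assertive" and "bass" from matching.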
A basic word filter
The most basic of all word filters simply uses the str_replace() function to replace every occurrence of a word. You can create a simple function as follows:
function wordFilter($text)
{
    $filtered_text = $text;
    $filtered_text = str_replace('ass', '***', $filtered_text);
    $filtered_text = str_replace('shit', '****', $filtered_text);
    // ... and so on
    return $filtered_text;
}
This method is far from perfect, however. As you can see, this code:
echo wordFilter('My ass Ass asshole classroom shit shithead');
will output:
My *** Ass ***hole cl***room **** ****head
Talk about false positives! This code failed to filter out a form of a "naughty" word which happened to be capitalized, and also caught a perfectly innocent word (classroom).
Regular expressions
Regular expressions greatly facilitate the matching of various forms of a single word, and can be used to restrict the search to only whole words (to avoid false positives). A basic filter that makes use of regular expressions can be implemented as follows:
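function wordFilter($text)
{
    // The trailing group in the second pattern is optional so that the bare word matches too.
    $filter_terms = array('/\bass(es|holes?)?\b/i', '/\bshit(e|ted|ting|ty|head)?\b/i');
    // Run the replacement on the input text; each pattern in the array is applied in turn.
    $filtered_text = preg_replace($filter_terms, '***', $text);
    return $filtered_text;
}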
Notice that now it's possible to restrict the filter to only catch words like "ass", "asses", "asshole", but ignore words like "classroom". The \b assertion matches a word boundary, so the pattern only matches whole words, and the /i modifier tells it to ignore case. Passing the same string of text into the filter, we get:
My *** *** *** classroom *** ***
It works... but is far from perfect. Although this filter successfully catches any blacklisted words (and their variants) that it is programmed to, it replaces every term with the same three asterisks, regardless of the word's length. This is not suitable for most applications.
The ultimate filter
How does one create a filter that not only allows regular expressions, but also replaces each filtered term with the required number of asterisks?
//Word filter by Nookkin, http://nookkin.com/
//You may use this code in any project, commercial or non-commercial in nature,
//provided that this notice is present wherever this code is used.
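function wordFilter($text)
{
    // The trailing group in the second term is optional so that the bare word matches too.
    $filter_terms = array('\bass(es|holes?)?\b', '\bshit(e|ted|ting|ty|head)?\b');
    $filtered_text = $text;
    foreach ($filter_terms as $word)
    {
        // Collect every occurrence of this term, ignoring case.
        $match_count = preg_match_all('/' . $word . '/i', $text, $matches);
        for ($i = 0; $i < $match_count; $i++)
        {
            $bwstr = trim($matches[0][$i]);
            // Escape the matched text so it is treated literally, then replace it
            // with one asterisk per character.
            $filtered_text = preg_replace('/\b' . preg_quote($bwstr, '/') . '\b/', str_repeat("*", strlen($bwstr)), $filtered_text);
        }
    }
    return $filtered_text;
}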
Again, passing the same string of text into this filter, we get:
My *** *** ******* classroom **** ********
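The same output can also be produced in a single pass with preg_replace_callback(), which hands each match to a function that decides the replacement. This is only a sketch of an alternative, not the filter used on this site: the name wordFilterCallback is made up, and the second term's trailing group is marked optional here so that the bare word is caught.

```php
<?php
// Single-pass variant; wordFilterCallback is a hypothetical name.
function wordFilterCallback($text)
{
    // Terms from the article; the second group is optional so the bare word matches.
    $filter_terms = array('\bass(es|holes?)?\b', '\bshit(e|ted|ting|ty|head)?\b');

    // Join the terms into one alternation so the text is scanned only once.
    $pattern = '/(?:' . implode('|', $filter_terms) . ')/i';

    return preg_replace_callback($pattern, function ($match) {
        // $match[0] is the exact text that matched; mask it character for character.
        return str_repeat('*', strlen($match[0]));
    }, $text);
}

echo wordFilterCallback('My ass Ass asshole classroom shit shithead');
```

Because the callback receives the matched text directly, there is no second search-and-replace step, so terms written without \b are masked wherever they occur.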
You can easily add words to the $filter_terms array; note that the leading / and trailing /i are added automatically inside the loop, eliminating the need to type them every time. You can also add "greedy" (substring) matches: simply remove the \b from both ends of the term, and the pattern will match anything that contains that word. Note, however, that the replacement step wraps each matched string in \b again, so that line must be adjusted as well before a match inside a longer word is actually masked. A good idea would be to load all of the terms into the array from an external file.
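Loading the terms from a file could look roughly like this. The helper name loadFilterTerms, the one-pattern-per-line format and the # comment convention are all assumptions made for this sketch.

```php
<?php
// loadFilterTerms is a hypothetical helper; the file format (one pattern per
// line, lines starting with # ignored) is an assumption of this sketch.
function loadFilterTerms($path)
{
    // file() returns the lines as an array, or false if the file cannot be read.
    $lines = @file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return array();
    }

    $terms = array();
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip whitespace-only lines and comment lines.
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        $terms[] = $line;
    }
    return $terms;
}
```

Inside wordFilter(), the hard-coded array could then be replaced by something like $filter_terms = loadFilterTerms(__DIR__ . '/filter_terms.txt'); where the file name is again only an example.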
Depending on the level of "cleanliness" desired, you can modify the way that filtered terms are displayed. For example, you can keep the first and last letters of a word, if you want the user to be able to guess at the original term; you can also add inline HTML that displays the original word as a tooltip.
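Both display options can be sketched as small helpers that take the matched word and return its replacement. The names maskKeepEnds and maskWithTooltip are made up, and the code assumes single-byte text, since strlen() counts bytes.

```php
<?php
// maskKeepEnds and maskWithTooltip are hypothetical helper names.

// Keep the first and last characters and mask everything in between.
function maskKeepEnds($word)
{
    $len = strlen($word);
    if ($len <= 2) {
        // Too short to leave any hint; mask the whole word.
        return str_repeat('*', $len);
    }
    return $word[0] . str_repeat('*', $len - 2) . $word[$len - 1];
}

// Wrap the fully masked word in a span whose title attribute holds the original.
function maskWithTooltip($word)
{
    // Escape the original so it cannot break out of the attribute.
    $original = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');
    return '<span title="' . $original . '">' . str_repeat('*', strlen($word)) . '</span>';
}

echo maskKeepEnds('classroom'), "\n";
echo maskWithTooltip('darn'), "\n";
```

Either helper can stand in for the str_repeat() call wherever the filter builds its replacement string.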
Final note
This is a versatile word filter that can be used just about anywhere; I use a slightly modified version of this on Nookkin.com's comment script. At the same time, it isn't perfect. I welcome any suggestions or criticisms of the code. Please don't hesitate to leave a comment below.
Comments (18)
repeat filter like :
fuuuuck youuuuu iiiiiiiiiii aaaaaaaaaa ??????? !!!!!!!!! etc
thank
Add the following to $filter_terms to let it catch "f*ck" (the real F word), "fuuuuuuck", and "fffuuuuuuuuccccckkk":
\bf+u+c+k+\b
Since there is no legitimate English word containing the sequence of characters F-U-C-K, it's safe to omit the \b at the end. However, don't do it with A-S-S, since many legitimate English words contain that sequence of characters (assertive, classroom, etc).
Another thing you might want to do is to match "garbage" characters in order to also catch "f u ck" and "f-u-c-k". You must be very careful here though, since it's VERY easy to accidentally get loads of false positives. Look at this:
\bf.*u.*c.*k.*\b
This will successfully match "f u ck" and "f-u-c-k", but it will also incorrectly match "for all of us, we can keep..."
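To see the difference between these styles of pattern side by side, here is a small comparison. It uses "darn" as a harmless stand-in word, and the sample strings are made up; the third pattern, which only allows spaces or hyphens between letters, is one possible tighter alternative rather than a recommendation.

```php
<?php
// "darn" is a hypothetical stand-in word used only for this comparison.
$patterns = array(
    'stretched' => '/\bd+a+r+n+\b/i',                    // repeated letters
    'anything'  => '/\bd.*a.*r.*n.*\b/i',                // any characters in between
    'spaced'    => '/\bd[ -]*a[ -]*r[ -]*n[ -]*\b/i',    // only spaces or hyphens in between
);

// Made-up sample inputs; the last one is an innocent phrase.
$samples = array('daaaarn', 'd-a-r-n', 'd a r n', 'dogs are running');

$hits = array();
foreach ($patterns as $name => $pattern) {
    foreach ($samples as $sample) {
        $hits[$name][$sample] = preg_match($pattern, $sample) === 1;
    }
}

// List the samples each pattern matched.
foreach ($hits as $name => $row) {
    echo $name, ': ', implode(', ', array_keys(array_filter($row))), "\n";
}
```

The "anything" pattern matches every sample, including the innocent phrase, while the "spaced" pattern catches the separated spellings without touching it.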
it also blocks the message and words which are not placed in filter_terms
I dont know php coding please help me.
Replace:
str_repeat("*", strlen($bwstr))
with:
$bwstr[0] . str_repeat("*", strlen($bwstr)-2) . $bwstr[strlen($bwstr)-1]
Notice that this takes the first letter of the "bad word" string, adds on enough *'s to fill in the remaining n-2 letters, and then tacks on the last letter at the end.
If you want users to have a personal filter, you'll need to load their terms into the filter array yourself -- for example, you can split the contents of the text field on spaces and add each token to the array.
If you want users to enter custom patterns in a user-friendly way (as most users don't know regex and it's possible to really mess up the filter with a malformed pattern), you can look into writing a simple translator that translates, say, simple wildcards like * into proper regex patterns.
Alternatively, just have a predefined list of known bad words (say, in a database table) with the proper patterns, and allow users to check/uncheck them on a form. Or maybe just store levels of offensiveness with the words and allow the user to select, e.g. allow "soft" profanity like sh*t but block "hard" profanity like f*ck.
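A wildcard translator along those lines might look like the sketch below. The function names are made up, and the rule that * stands for any run of letters, digits or underscores is just one possible convention.

```php
<?php
// wildcardToPattern and userTermsToPatterns are hypothetical helper names.
// Assumption: * in a user term means "any run of word characters".
function wildcardToPattern($term)
{
    // Escape every regex metacharacter so user input cannot break the pattern.
    $escaped = preg_quote(trim($term), '/');
    // preg_quote() turns * into \*; swap that for a word-character wildcard.
    $escaped = str_replace('\\*', '\\w*', $escaped);
    return '\b' . $escaped . '\b';
}

// Split a text field on whitespace and translate each token.
function userTermsToPatterns($input)
{
    $tokens = preg_split('/\s+/', $input, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('wildcardToPattern', $tokens);
}

print_r(userTermsToPatterns('darn*  heck'));
```

The resulting strings have the same shape as the entries in $filter_terms, so they can be merged into that array before the filter runs.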
Say i block the word fuck and the user types f u c k. Of course this isn't going to be filtered. If it's possible, how do i do this ?
I tried it, and removes spaces, but then does not filter the word.
Thanks!
\bf[ ]*u[ ]*c[ ]*k[ ]*\b
it will be blanked out? Because it doesnt blank out ass1 but only ass
\bass[0-9]?(es|holes?)?\b
Good post, PHP have some in built filter function and can implement it easily. Check about PHP filters here: https://www.codefixup.com/php-filters-to-sanitize-and-validate/