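There are many approaches to sensitive-word filtering, but one of the most widely used is the DFA (Deterministic Finite Automaton). This article implements sensitive-word filtering with the Hutool toolkit.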
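I. Wiring WordTree into Spring Boot
WordTree is a class from the Hutool toolkit, so we cannot annotate it ourselves, and every sensitive word has to be added to the WordTree object when it is initialized. For both reasons we register it with @Bean.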
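    import java.util.List;

    import cn.hutool.dfa.WordTree;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    // WordDao and OneWord are the project's own classes; import them from your package.
    @Configuration
    public class SensitiveConfig {

        // DAO that loads the sensitive words from the database
        @Autowired
        WordDao wordDao;

        // Build the WordTree once at startup and expose it as a singleton bean
        @Bean
        public WordTree wordTree() {
            WordTree wordTree = new WordTree();
            List<OneWord> allWords = wordDao.getAllWords();
            for (OneWord word : allWords) {
                wordTree.addWord(word.getWord());
            }
            return wordTree;
        }
    }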
Program run time: 2352ms
Program run time: 2339ms
Program run time: 1991ms
Program run time: 2642ms
Program run time: 2488ms
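The database holds nearly 50,000 sensitive words. Over several runs on my local machine, the whole process of creating the WordTree object and adding the words from the database took about 2,500 ms (between roughly 2,000 and 2,650 ms in the runs above).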
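The article does not show how these timings were taken. One possible way to measure a build step like this is sketched below with System.nanoTime, using only the JDK. BuildTimer and timed are hypothetical names, and the list built in main is only a stand-in for loading the words and filling the WordTree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper that times how long it takes to build an object. */
class BuildTimer {

    /** Runs the builder once, prints the elapsed wall-clock time in ms, and returns the result. */
    static <T> T timed(String label, Supplier<T> builder) {
        long start = System.nanoTime();
        T result = builder.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + elapsedMs + "ms");
        return result;
    }

    public static void main(String[] args) {
        // Stand-in workload (an assumption): in the real application the supplier
        // would query the database and add every word to the WordTree.
        List<String> words = timed("Program run time", () -> {
            List<String> list = new ArrayList<>();
            for (int i = 0; i < 50_000; i++) {
                list.add("word" + i);
            }
            return list;
        });
        System.out.println("Built " + words.size() + " entries");
    }
}
```

A single run is noisy (JIT warm-up, database latency), which is why repeating the measurement several times, as done above, gives a more reliable picture.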
II. Implementation
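Keywords in real text are often interleaved with special characters, for example "〓关键☆字". To handle this case, Hutool provides the StopChar class, which is dedicated to skipping special characters. The skipping happens automatically while the match or matchAll method runs.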
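To make the skipping behaviour more concrete, here is a minimal character-trie sketch that ignores a configurable set of stop characters while matching. It illustrates the general technique only and is not Hutool's WordTree implementation. SimpleWordTrie and its methods are hypothetical names, and the stop-character set is passed in by the caller rather than taken from StopChar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative character trie that skips stop characters; not Hutool's WordTree. */
class SimpleWordTrie {

    /** One trie node: children keyed by character, plus an end-of-word flag. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd;
    }

    private final Node root = new Node();
    private final Set<Character> stopChars;

    SimpleWordTrie(Set<Character> stopChars) {
        this.stopChars = stopChars;
    }

    /** Adds a word, ignoring any stop characters inside it. */
    void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            if (stopChars.contains(c)) {
                continue;
            }
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        if (node != root) {
            node.wordEnd = true;
        }
    }

    /** Returns shortest, non-overlapping matches; skipped stop characters stay in the matched text. */
    List<String> matchAll(String text) {
        List<String> found = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            Node node = root;
            int end = -1;
            for (int i = start; i < text.length(); i++) {
                char c = text.charAt(i);
                if (stopChars.contains(c)) {
                    if (node == root) {
                        break; // a match may not begin with a stop character
                    }
                    continue; // inside a candidate match: skip the stop character
                }
                node = node.children.get(c);
                if (node == null) {
                    break; // no keyword continues with this character
                }
                if (node.wordEnd) {
                    end = i; // stop at the shortest keyword
                    break;
                }
            }
            if (end >= 0) {
                found.add(text.substring(start, end + 1));
                start = end + 1; // continue after the matched text
            } else {
                start++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        SimpleWordTrie trie = new SimpleWordTrie(Set.of('@', '*'));
        trie.addWord("bad");
        trie.addWord("evil");
        System.out.println(trie.matchAll("b@ad and e*v*il, plain")); // prints [b@ad, e*v*il]
    }
}
```

As in the test results later in this article, the returned strings keep the stop characters that were skipped, so they can be located in the original text afterwards.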
1. Get all sensitive words in a string. This uses matchAll, which here matches the shortest keyword and skips text that has already been matched.
    public List<String> getAllSensitiveWords(String text) {
        // limit = -1: no cap on the number of matches
        // isDensityMatch = false, isGreedMatch = false: shortest keyword, no overlapping matches
        return wordTree.matchAll(text, -1, false, false);
    }
2. Get the first keyword
    public String getFirstSensitiveWord(String text) {
        // Returns the first keyword found, or null when the text contains none
        return wordTree.match(text);
    }
3. Replace the sensitive words in a string with a given character
    public String sensitiveWordReplacedByChar(String text, char ch) {
        // Every matched word is replaced by two copies of ch, whatever its length
        String mask = "" + ch + ch;
        String result = text;
        List<String> allSensitiveWords = getAllSensitiveWords(text);
        for (String word : allSensitiveWords) {
            result = result.replace(word, mask);
        }
        return result;
    }
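The method above replaces every matched word with exactly two copies of ch, so the masked text no longer reflects the length of the original words. If the length should be preserved, a variant along the following lines may help. It is a sketch that uses only the JDK (String.repeat requires Java 11 or later). SensitiveWordMasker and maskPreservingLength are hypothetical names, and the list of matched words is assumed to come from something like getAllSensitiveWords.

```java
import java.util.List;

/** Hypothetical helper that masks matched words with one mask character per original character. */
class SensitiveWordMasker {

    /**
     * Replaces every occurrence of each matched word with a run of mask characters
     * of the same length (counted in UTF-16 code units, like String.length()).
     */
    static String maskPreservingLength(String text, List<String> matchedWords, char mask) {
        String result = text;
        for (String word : matchedWords) {
            if (word.isEmpty()) {
                continue; // nothing to replace
            }
            String replacement = String.valueOf(mask).repeat(word.length());
            result = result.replace(word, replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        String masked = maskPreservingLength("b@ad and evil", List.of("b@ad", "evil"), '*');
        System.out.println(masked); // prints **** and ****
    }
}
```

Because the matched strings returned by matchAll include any skipped special characters, those characters are masked as well.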
III. Test results
1. Getting all sensitive words
    @Test
    void getAllSensitiveWordsTest() {
        String text1 = "傻逼123赌博哈哈AV";
        String text2 = "哈哈哈哈";
        String text3 = "傻@@@逼123赌%%博哈哈A‘’V";
        List<String> allSensitiveWords1 = sensitiveWordFilter.getAllSensitiveWords(text1);
        List<String> allSensitiveWords2 = sensitiveWordFilter.getAllSensitiveWords(text2);
        List<String> allSensitiveWords3 = sensitiveWordFilter.getAllSensitiveWords(text3);
        System.out.println(allSensitiveWords1);
        System.out.println(allSensitiveWords2);
        System.out.println(allSensitiveWords3);
    }
Result:
[傻逼, 赌博, AV]
[]
[傻@@@逼, 赌%%博, A‘’V]
2. Getting the first keyword
    @Test
    void getFirstSensitiveWordTest() {
        String text1 = "傻逼123赌博哈哈AV";
        String text2 = "哈哈哈哈";
        String text3 = "傻@@@逼123赌%%博哈哈A‘’V";
        String firstSensitiveWord1 = sensitiveWordFilter.getFirstSensitiveWord(text1);
        String firstSensitiveWord2 = sensitiveWordFilter.getFirstSensitiveWord(text2);
        String firstSensitiveWord3 = sensitiveWordFilter.getFirstSensitiveWord(text3);
        System.out.println(firstSensitiveWord1);
        System.out.println(firstSensitiveWord2);
        System.out.println(firstSensitiveWord3);
    }
Result:
傻逼
null
傻@@@逼
3. Replacing sensitive words
    @Test
    void sensitiveWordReplacedByStarTest() {
        String text1 = "傻逼123赌博哈哈AV";
        String text2 = "哈哈哈哈";
        String text3 = "傻@@@逼123赌%%博哈哈A‘’V";
        String s1 = sensitiveWordFilter.sensitiveWordReplacedByChar(text1, '*');
        String s2 = sensitiveWordFilter.sensitiveWordReplacedByChar(text2, '*');
        String s3 = sensitiveWordFilter.sensitiveWordReplacedByChar(text3, '*');
        System.out.println(s1);
        System.out.println(s2);
        System.out.println(s3);
    }
Result:
**123**哈哈**
哈哈哈哈
**123**哈哈**