Li Hengxun, Zhang Huaping. Internet Hot Topic Detection Based on Topic Words. The 5th China Conference on Information Retrieval (CCIR2009), Shanghai, 2009-11
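Abstract: New topics emerge on the Internet constantly and often trigger major public-opinion crises, so
discovering hot topics quickly and efficiently in massive volumes of information is a major challenge.
This paper proposes a topic-word-based algorithm for detecting hot topics on the Internet. The basic idea
is as follows. First, a candidate set of topic words is generated by combining a topic-word thesaurus with
the results of meaningful-string recognition. Next, the candidate set passes through multiple filters, and
heuristic rules are used to compute the weight of each topic word. Finally, hot topics are extracted with
the topic words as clues, using a multi-feature topic model that fuses the respective characteristics of
news, forums, and blogs. Experiments on the TDT4 evaluation corpus and on the Golaxy (天玑) public-opinion
monitoring system platform of the Institute of Computing Technology, Chinese Academy of Sciences, achieved
a minimum detection cost of 0.282 and a user satisfaction rate of 93.3%, respectively, and the algorithm
runs more efficiently than traditional methods. The experiments show that the algorithm is effective for
detecting hot topics on the Internet.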
Keywords: topic word extraction; hot topic detection; clustering; public opinion; Golaxy (天玑)
Internet Hot Topic Detection Based on Topic Words
Abstract: The Internet produces a massive amount of information every day. To find the hot topics in this
mass, this paper presents a fast and effective strategy for Internet hot topic detection based on topic
word extraction. The approach can be summarized as follows. First, we preprocess the corpus with ICTCLAS
for Chinese word segmentation, then combine a scan based on a topic-word dictionary with a meaningful-string
recognition algorithm to obtain the candidate topic-word set. Next, we filter the candidate topic words
according to certain heuristic rules and calculate their weights. Finally, because BBS, news, and blog pages
have different characteristics, we make selective use of the meta information of the web pages to cluster
hot events quickly. The approach obtains relatively good results in the experiments.
Keywords: topic word extraction; hot topic detection; clustering; public opinion; Golaxy
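
The three stages outlined in the abstract (candidate generation, filtering and weighting, topic extraction) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration, not the authors' implementation: the stop-word list, the `min_repeat` threshold that stands in for meaningful-string recognition, the `title_boost` heuristic, the `SOURCE_WEIGHT` values, and the function names are all hypothetical. Documents are assumed to be already segmented into tokens (the paper uses ICTCLAS for this step), and grouping each document under its highest-weighted topic word is a simplification that stands in for the paper's multi-feature topic model.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list and per-source weights (not from the paper).
STOPWORDS = {"the", "of", "and", "in", "on", "to"}
SOURCE_WEIGHT = {"news": 1.0, "bbs": 0.8, "blog": 0.6}


def candidate_topic_words(tokens, topic_dictionary, min_repeat=2):
    """Stage 1: union of dictionary hits and repeated tokens.

    Repetition is a crude stand-in for meaningful-string recognition.
    """
    counts = Counter(tokens)
    dictionary_hits = {t for t in tokens if t in topic_dictionary}
    repeated = {t for t, c in counts.items() if c >= min_repeat}
    return dictionary_hits | repeated


def filter_candidates(candidates):
    """Stage 2a: drop stop words and single-character tokens."""
    return {w for w in candidates if len(w) > 1 and w not in STOPWORDS}


def weight_topic_words(doc, candidates, title_boost=2.0):
    """Stage 2b: frequency weight, boosted when the word occurs in the title."""
    counts = Counter(doc["title"] + doc["body"])
    title = set(doc["title"])
    return {w: counts[w] * (title_boost if w in title else 1.0)
            for w in candidates}


def detect_hot_topics(docs, topic_dictionary, top_k=3):
    """Stage 3: group documents by their leading topic word and rank by heat."""
    heat = defaultdict(float)
    for doc in docs:
        tokens = doc["title"] + doc["body"]
        candidates = filter_candidates(
            candidate_topic_words(tokens, topic_dictionary))
        weights = weight_topic_words(doc, candidates)
        if not weights:
            continue  # no usable topic word in this document
        # Alphabetical order breaks ties so the result is deterministic.
        lead_word = max(sorted(weights), key=weights.get)
        heat[lead_word] += weights[lead_word] * SOURCE_WEIGHT.get(doc["source"], 0.5)
    return sorted(heat.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy, made-up input: pre-tokenized documents from three source types.
docs = [
    {"source": "news", "title": ["earthquake", "relief"],
     "body": ["earthquake", "hits", "city"]},
    {"source": "bbs", "title": ["earthquake", "donations"],
     "body": ["earthquake", "victims", "need", "help"]},
    {"source": "blog", "title": ["football", "final"],
     "body": ["football", "fans", "celebrate"]},
]
topic_dictionary = {"earthquake", "football"}
topics = detect_hot_topics(docs, topic_dictionary)
for word, score in topics:
    print(word, round(score, 2))
```

Weighting each document by its source type is one simple way to let news, BBS, and blog pages contribute differently to a topic's heat; a fuller treatment would use the per-source features the abstract mentions.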