ITA
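Qian Aibing¹, Jiang Lan²
(1. School of Economics and Trade Management, Nanjing University of Chinese Medicine, Nanjing, Jiangsu 210046; 2. Department of Information Management, Nanjing University, Nanjing, Jiangsu 210093)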
Chinese Web Page Keyword Extraction Based on Improved TF-IDF: A Case Study of News Web Pages
Abstract: Drawing on the content characteristics of news Web pages, this paper describes how the keywords of Chinese Web pages are composed. It improves the classic TF-IDF weighting formula to build a candidate-keyword scoring formula that takes multiple influencing factors into account, and it improves the SharpICTCLAS word segmenter by adding position tags. Words with high scores are selected as candidate keywords; their position tags are then used to optimize extraction by recombining candidate keywords that segmentation has "chopped up" into the keywords that are finally extracted. Experimental results show that the proposed method significantly outperforms the baseline method and extracts satisfactory keywords.
Keywords: term frequency; inverse document frequency; Web news; keyword extraction
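As a rough illustration of the two ideas named in the abstract, the sketch below scores words with the classic (unimproved) TF-IDF weight and then joins candidate words that sit next to each other in the text. It is not the authors' improved formula, which this section does not give. The function names `tf_idf` and `merge_adjacent`, the toy corpus, and the choice of the top two words are all assumptions made for the example, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Classic TF-IDF: (count / doc length) * log(N / df) for each term."""
    n_docs = len(corpus)
    # Document frequency: number of corpus documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    # Terms that never occur in the corpus are skipped (df would be zero).
    return {t: (c / total) * math.log(n_docs / df[t])
            for t, c in tf.items() if df[t] > 0}

def merge_adjacent(doc_tokens, candidates):
    """Join runs of consecutive candidate tokens into single phrases."""
    phrases, run = [], []
    for tok in doc_tokens:
        if tok in candidates:
            run.append(tok)
        else:
            if run:
                phrases.append("".join(run))
            run = []
    if run:
        phrases.append("".join(run))
    # Drop duplicates while keeping first-occurrence order.
    return list(dict.fromkeys(phrases))

# Hypothetical, pre-segmented toy corpus (assumption for illustration only).
corpus = [
    ["南京", "大学", "发布", "新闻"],
    ["今天", "发布", "新闻"],
    ["今天", "天气", "新闻"],
]
doc = corpus[0]
scores = tf_idf(doc, corpus)
top = set(sorted(scores, key=scores.get, reverse=True)[:2])
print(merge_adjacent(doc, top))  # ['南京大学']
```

Here the segmenter has split 南京大学 into 南京 and 大学; both score highly, and because they are adjacent they are recombined into one keyword. A word such as 新闻 that appears in every document gets a weight of zero.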
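Many researchers in China and abroad have studied keyword extraction extensively and have proposed a number of representative methods. Chien Lee-Feng (简立峰) used a PAT-tree structure together with the mutual information between words to extract Chinese keywords [1]. Experimental results show that the keywords extracted by this method
... directly affects the results of keyword extraction. In summary, compared with research on English keyword extraction, research on Chinese keyword extraction faces two main challenges: (1) the lack of a standard corpus; (2) its dependence on word segmentation.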