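This package contains four datasets extracted from two parent databases, english20newsgroup and reuters. Each dataset has 500 records in five classes, and the number of documents per class varies. All records are labeled, so the data can be used for classification or clustering. The datasets are stored in SQL Server 2008 format and can be attached directly. The table structure is as follows: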
CREATE TABLE [dbo].[reutersdataset5lau](
[ID] [int] NULL,
[Title] [nvarchar](50) NULL,
[ActualClass] [nvarchar](20) NULL,
[TextContent] [nvarchar](max) NULL,
[ShowClass] [nvarchar](20) NULL,
[Note] [nvarchar](50) NULL,
[HtmlCode] [nvarchar](4000) NULL,
[SegResult] [nvarchar](max) NULL,
[SegResultMark] [nvarchar](max) NULL,
[Author] [nvarchar](50) NULL,
[CreateTime] [nvarchar](20) NULL,
[TrainSetID] [int] NULL,
[SubClass] [nvarchar](50) NULL,
[SegResultSentence] [nvarchar](max) NULL
) ON [PRIMARY]
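Since every record is labeled and the class sizes differ, a first step before classification or clustering is usually to check the label distribution and pull out (text, label) pairs. The following is a minimal sketch, not part of the corpus itself. It assumes the rows have already been fetched from the table as dictionaries keyed by column name; the helper names `class_distribution` and `labeled_texts` are hypothetical. Only the `ActualClass` and `TextContent` columns from the table above are used, and because every column is declared NULL, empty values are skipped.

```python
from collections import Counter


def class_distribution(rows):
    """Count documents per ActualClass label, ignoring rows without a label."""
    return Counter(
        row["ActualClass"] for row in rows if row.get("ActualClass")
    )


def labeled_texts(rows):
    """Return (texts, labels) for rows that have both TextContent and ActualClass.

    Assumption: each row is a dict keyed by column name. Rows with a NULL
    (None) or empty value in either column are dropped, since the schema
    allows NULL in every column.
    """
    texts, labels = [], []
    for row in rows:
        text = row.get("TextContent")
        label = row.get("ActualClass")
        if text and label:
            texts.append(text)
            labels.append(label)
    return texts, labels
```

The rows could come from any DB-API cursor connected to the attached database (for example through a driver such as pyodbc), converted to dictionaries using the column names in the cursor description. The resulting lists can then be passed to whichever classifier or clustering routine is being evaluated.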
Corpus download: http://www.nlpir.org/wordpress/download/English-Cluster-Corpus.rar