Learning Named Entity Recognition: Dataset Introduction: CoNLL-03

CoNLL-2003 is one of the most commonly used public datasets for named entity recognition. Its official site, https://www.clips.uantwerpen.be/conll2003/ner/, describes it in detail.

1 Number of Categories

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:

[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task will be offered training and test data for two languages. They will use the data for developing a named-entity recognition system that includes a machine learning component. For each language, additional information (lists of names and non-annotated data) will be supplied as well. The challenge for the participants is to find ways of incorporating this information in their system.

The passage above is from the official site; it states the categories to be recognized. There are four in total: persons, locations, organizations, and miscellaneous entities.
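To make the four categories concrete, here is a minimal sketch of how a label vocabulary could be built from them. It assumes the B-/I-/O tag prefixes discussed later in this post and the abbreviations PER, LOC, ORG and MISC; the function name `build_tag_vocab` and the index order are illustrative choices, not something prescribed by the dataset.

```python
# Abbreviations for the four entity categories (assumed naming).
ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]


def build_tag_vocab(entity_types):
    """Map each tag string to an integer index, with "O" first."""
    tags = ["O"]
    for entity_type in entity_types:
        tags.append("B-" + entity_type)
        tags.append("I-" + entity_type)
    return {tag: index for index, tag in enumerate(tags)}


tag_to_ix = build_tag_vocab(ENTITY_TYPES)
print(tag_to_ix)
```

With four categories this gives nine labels: one "O" plus a B- and an I- tag per category.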

2 Dataset Sample

[Image: an excerpt from the training set.]

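According to the official description, the first column of this dataset is the word, the second is the part-of-speech tag, the third is the syntactic chunk tag, and the fourth is the named entity tag. For the NER task only the first and fourth columns matter. Entity labels follow the BIO family of tagging schemes, which an earlier post on this blog introduced. Note that the official description quoted below uses a variant in which a B- tag appears only when two phrases of the same type immediately follow each other.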

The official description follows:

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Here is an example:

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

The data consists of three files per language: one training file and two test files testa and testb. The first test file will be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation. There are data files available for English and German. The German files contain an extra column (the second) which holds the lemma of each word.
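The tagging rule quoted above (B-TYPE only between two adjacent phrases of the same type) is commonly called IOB1, while many sequence-labeling setups expect every phrase to start with B-TYPE (commonly called IOB2 or BIO). The sketch below shows one way to convert between them; the function name `iob1_to_iob2` is hypothetical, and whether the files you downloaded need this conversion is something to check against the data itself.

```python
def iob1_to_iob2(tags):
    """Rewrite IOB1 tags so that every phrase starts with a B- tag."""
    converted = []
    previous = "O"
    for tag in tags:
        # An I- tag opens a new phrase when it follows "O" or a different type.
        if tag.startswith("I-") and (previous == "O" or previous[2:] != tag[2:]):
            converted.append("B-" + tag[2:])
        else:
            converted.append(tag)
        previous = tag
    return converted


# Entity tags from the example sentence above.
example_tags = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]
print(iob1_to_iob2(example_tags))
```

Tags that are already B- or O pass through unchanged, so running the function on data that is already in IOB2 form leaves it as it is.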

3 Preprocessing

With the dataset structure covered, the goal is to turn the input data into the structure below, so the text needs some processing:

[(
    "the wall street journal reported today that apple corporation made money".split(),
    "B I I I O O O B I O O".split()
), (
    "georgia tech is a university in georgia".split(),
    "B I O O O O B".split()
)]

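The code above was introduced in the earlier post "Learning Named Entity Recognition: Starting from Basic Algorithms (02): LSTM+CRF Sequence Labeling".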
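To see what this (tokens, tags) structure encodes, the following sketch recovers the tagged phrases from one pair. It assumes the untyped B/I/O tags of the toy example above; the helper name `extract_spans` is made up for illustration.

```python
def extract_spans(tokens, tags):
    """Collect the phrases marked by untyped B/I/O tags."""
    spans = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; close the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" (or a stray "I") ends any open phrase.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = "georgia tech is a university in georgia".split()
tags = "B I O O O O B".split()
print(extract_spans(tokens, tags))
```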

Preprocessing idea: collect a tokens array and a tags array for each sentence, where sentences are separated by an empty line.

With that idea, the processing is straightforward:

import os

from tqdm import tqdm


class Conll03Reader:
    """Read CoNLL-2003 files into lists of (tokens, tags) pairs."""

    def read(self, data_path):
        # Expects train.txt, valid.txt and test.txt under data_path.
        data_parts = ['train', 'valid', 'test']
        extension = '.txt'
        dataset = {}
        for data_part in tqdm(data_parts):
            file_path = os.path.join(data_path, data_part + extension)
            dataset[data_part] = self.read_file(file_path)
        return dataset

    def read_file(self, file_path):
        samples = []
        tokens = []
        tags = []
        with open(file_path, 'r', encoding='utf-8') as fb:
            for line in fb:
                line = line.strip()
                if line.startswith('-DOCSTART-'):
                    # Skip the document header line.
                    continue
                if line == '':
                    # An empty line ends the current sentence.
                    if tokens:
                        samples.append((tokens, tags))
                        tokens = []
                        tags = []
                else:
                    # Keep only the first column (word) and the last column (entity tag).
                    contents = line.split()
                    tokens.append(contents[0])
                    tags.append(contents[-1])
        # Flush the last sentence if the file does not end with an empty line.
        if tokens:
            samples.append((tokens, tags))
        return samples


if __name__ == "__main__":
    ds_rd = Conll03Reader()
    data = ds_rd.read("./conll2003_v2")
    for sample in data['train'][:10]:
        print(sample)
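Once the data is in (tokens, tags) form, a quick sanity check is to count how often each tag occurs. The sketch below does this on a single hand-written sample taken from the example sentence earlier in this post; the function name `tag_distribution` is an illustrative choice, and in practice the input would be one split of the parsed dataset.

```python
from collections import Counter


def tag_distribution(samples):
    """Count tag occurrences over a list of (tokens, tags) pairs."""
    counts = Counter()
    for _tokens, tags in samples:
        counts.update(tags)
    return counts


# One hand-written sample in the same shape the reader produces.
samples = [
    (
        ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."],
        ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"],
    ),
]
print(tag_distribution(samples))
```

A tag that never appears, or an unexpected tag string, usually points to a wrong column being read or a separator problem.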

4 Summary

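Turning the dataset into a standardized input that a model can read is the first step of the whole modeling process. The following posts will try to process this dataset with a model and introduce a general-purpose data processing pipeline.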