ACE 2005 File Formats | 易学教程

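Because my relation extraction work uses the ACE 2005 corpus, I am recording the relevant information here, including the contents and format of each file, so that newcomers can get to know the corpus more quickly.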

The ACE 2005 dataset is annotated for the following basic tasks: the recognition of entities, values, temporal expressions, relations and events. For more detail on the ACE05 evaluation, see The ACE 2005 (ACE05) Evaluation Plan.

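The dataset is available from the Linguistic Data Consortium; if that is too much trouble, you can go straight to the ACE2005 dataset that I use. Its data come from a variety of sources and can be used for tasks in three languages: Arabic, Chinese and English.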

Detailed statistics for the training portion of the ACE 2005 corpus are shown in the figure below:

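The English source categories in the figure above should correspond to the bn, bc, nw, wl, un and cts folders inside the corpus's English folder; the Arabic sources correspond to the bn, nw and wl folders inside Arabic; and the Chinese sources correspond to the bn, nw and wl folders inside Chinese.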

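Each of these folders in turn contains the adj, fp1, fp2 and timex2norm folders plus a Filelist file. (The Arabic and Chinese folders have no timex2norm folder. Since I only use the English data, I have not looked into why the other two languages lack it; if you know, please tell me.)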

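The adj, fp1, fp2 and timex2norm folders correspond to different stages of the annotation process. ACE corpora are annotated for all tasks by two annotators working independently. The first-pass annotation is called 1P, and the independent dual first pass is called DUAL. For both 1P and DUAL, a single annotator completes all tasks for a file. Files are assigned through the Annotation Work-flow System (AWS), and file assignment is double-blind. (I translated this paragraph rather loosely and am not sure I got it right.)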

Note: 1P and DUAL are stored in the folders named 'fp1' and 'fp2'; that is, 1P corresponds to fp1 and DUAL corresponds to fp2.

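Discrepancies between the 1P and DUAL versions of each file are adjudicated by a senior annotator or team leader, which yields a high-quality gold standard file. The adjudicated gold standard files are called ADJ (the adj folder mentioned above). After adjudication, the TIMEX2 values are normalized to produce NORM. All data in this corpus have received NORM annotation.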

The whole annotation process can be pictured as follows:

    1P: entities         DUAL: entities
        TIMEX2 extents         TIMEX2 extents
         |                    |
         |                    |
         |____________________|
                   |
                   |
                   |
                   V
              ADJ: entities
                   TIMEX2 extents
                   |
                   |
                   |
                   V
              NORM: TIMEX2 normalization

In each of the fp1, fp2, adj and timex2norm folders, a given document appears as its .sgm source file together with its .ag.xml and .apf.xml annotation files.

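In other words, for each news story, every one of these folders contains an identical copy of the source text (the .sgm file) plus a different version of the associated annotations (the .ag.xml, .apf.xml and .tab files). Note that in many cases two annotation stages produce identical output for a given source text, because nothing was changed in the later of the two stages.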
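To see how often a later stage left a file untouched, one option is to compare the annotation files of two stage folders directly. The sketch below is only an illustration: the function name `unchanged_between` and the folder paths in the usage note are hypothetical, and byte-for-byte equality is a stricter test than "same annotation content", so files that differ only in incidental details would be reported as changed.

```python
import filecmp
from pathlib import Path


def unchanged_between(stage_a, stage_b, pattern="*.apf.xml"):
    """Return names of files matching `pattern` that are byte-identical
    in the two stage folders (for example fp1 and adj)."""
    stage_a, stage_b = Path(stage_a), Path(stage_b)
    same = []
    for file_a in sorted(stage_a.glob(pattern)):
        file_b = stage_b / file_a.name
        # shallow=False compares file contents rather than os.stat() data.
        if file_b.exists() and filecmp.cmp(file_a, file_b, shallow=False):
            same.append(file_a.name)
    return same
```

A call such as `unchanged_between("English/bn/adj", "English/bn/timex2norm")` would list the documents whose APF files did not change between adjudication and normalization, assuming those relative paths match your local copy of the corpus.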

The Filelist file contains word counts and the annotation status for every file.

The fully annotated files and their corresponding source files are found at these paths:

    */timex2norm/*sgm
    */timex2norm/*apf.xml
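The two path patterns above can be turned into a small helper that pairs each source file with its annotation file. This is a minimal sketch, not part of the corpus tooling: `paired_files` is a hypothetical name, and it assumes that `X.sgm` and `X.apf.xml` share the same base name inside one timex2norm folder.

```python
from pathlib import Path


def paired_files(genre_dir):
    """Yield (sgm_path, apf_path) pairs found in <genre_dir>/timex2norm."""
    norm_dir = Path(genre_dir) / "timex2norm"
    for sgm_path in sorted(norm_dir.glob("*.sgm")):
        # Assumption: the annotation file has the same base name as the
        # source file, with ".apf.xml" in place of ".sgm".
        apf_path = sgm_path.with_name(sgm_path.stem + ".apf.xml")
        if apf_path.exists():
            yield sgm_path, apf_path
```

Calling `list(paired_files("English/nw"))` would then give the source/annotation pairs for one genre folder, provided that relative path matches where the corpus is unpacked.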

Next come the contents and format of each file type. For most users, the most important files are the .sgm files and the .apf.xml files.

    Source Text (.sgm) Files
       - These files contain the source text data in an SGML format; they
         use UTF-8 encoding and UNIX-style line termination.

    AG (.ag.xml) Files
       - These are annotation files created with the LDC's annotation
         toolkit.  These files have been converted to the corresponding
         .apf.xml files.

    ACE Program Format (APF) (.apf.xml) Files
       - These files are in the official ACE annotation file format. ACE
         format is derived by means of a routine format conversion process,
         so that the underlying annotation content of the two files is
         equivalent.  See section 8 for more details.

    ID table (.tab) Files
       - These files store mapping tables between the IDs used in the
         ag.xml files and their corresponding apf.xml files.

Some notes on APF (left in the original English):

- Offsets

  APF uses the offset counting method traditionally used in previous ACE evaluation programs:

  1) Each (UTF-8) character, not byte, is counted as one.
  2) Each newline character is counted as one. (The .sgm files use the UNIX-style end of line characters.)
  3) SGML tags are *not* counted towards offsets. (Please note that the AG files included in this release do count SGML tags in offsets.)
  4) SGML entities are counted in terms of each character in the entities. For example, "&amp;" is counted as five characters, not as one character.

- TIMEX2

  The timex2 element represents TIMEX2 time expression annotations. Its optional attributes, such as "VAL" and "MOD", represent the TIMEX2 normalization values.

- TYPE, LDCTYPE and LDCATR in entity_mention

  The TYPE attribute in entity_mention stores the official ACE entity mention type, and the LDCTYPE and LDCATR attributes store the attributes used in the LDC's annotation process.

- Name in entity_attributes

  The "name" element in entity_attributes stores the heads of "NAM"-type mentions as in the previous years. In response to George Doddington's request, we have added the NAME attribute to the "name" element. The NAME attribute stores slightly normalized versions of the names where:

  - \n is replaced with a space
  - multiple spaces are reduced to one space
  - " (double quote) is removed

  Example:

        <entity_attributes>
          <name NAME="United States">
            <charseq START="4242" END="4254">United States</charseq>
          </name>
        </entity_attributes>

- Nickname metonymy

  Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in entity_mentions. "NAM"-type entity mentions marked as nickname metonymy do not give rise to name elements.

- Cross-type metonymy

  "Cross-type" metonyms are represented with relations of the type METONYMY. The METONYMY type relations do not have relation_mentions. The METONYMY type relations are automatically generated after the annotation process.

- For more details, please refer to the APF V5.1.2 DTD.
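The offset rules above can be checked against a source file with a few lines of Python. This is a sketch under stated assumptions, not the official tooling: the helper names `apf_text` and `check_charseqs` and the toy document are made up for illustration, offsets are assumed to be 0-based, and END is treated as inclusive (which matches the 13-character "United States" example spanning 4242 to 4254). The tag pattern also assumes that no literal "<" occurs outside SGML tags.

```python
import re
import xml.etree.ElementTree as ET

# Matches one SGML tag such as <DOC> or </TEXT>.
TAG_RE = re.compile(r"<[^>]+>")


def apf_text(sgm_source):
    """Return the text that APF offsets index into: the source minus tags."""
    # Newlines and entities such as "&amp;" are kept verbatim, so "&amp;"
    # still occupies five character positions (rule 4).
    return TAG_RE.sub("", sgm_source)


def check_charseqs(sgm_source, apf_xml):
    """Yield (start, end, annotated_text, source_slice) per charseq element."""
    text = apf_text(sgm_source)
    for cs in ET.fromstring(apf_xml).iter("charseq"):
        start, end = int(cs.get("START")), int(cs.get("END"))
        # END is assumed inclusive, hence the +1 in the slice.
        yield start, end, cs.text, text[start:end + 1]


# Toy document (not taken from the corpus).
sgm = "<DOC>\n<TEXT>\nThe United States &amp; Canada\n</TEXT>\n</DOC>\n"
apf = (
    '<entity_attributes><name NAME="United States">'
    '<charseq START="6" END="18">United States</charseq>'
    "</name></entity_attributes>"
)
for row in check_charseqs(sgm, apf):
    print(row)
```

Note that the XML parser decodes entities inside charseq text while the stripped source keeps them, so a mention that contains "&amp;" or a line break would need extra normalization before the two strings are compared.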

Source: ACE 2005 File Formats
