
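Training Classifier-Based Chunkers

Both the regular-expression-based chunkers and the n-gram chunkers decide what chunks to create entirely from part-of-speech tags.
However, part-of-speech tags are sometimes insufficient to determine how a sentence should be chunked.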

(3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
    b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

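Consider the example above: the two sentences have the same part-of-speech tags but are chunked differently.

In the first sentence, the farmer and rice are separate chunks.
In the second sentence, my computer monitor is a single chunk.
To get the best possible chunking performance, we need to use the content of the words in addition to their part-of-speech tags.

One way to incorporate information about word content is to chunk the sentence with a classifier-based tagger.
Like the n-gram chunker, this classifier-based chunker assigns IOB tags to the words in the sentence,
and then converts those tags to chunks.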
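To make the IOB step concrete, here is a small sketch that assigns IOB tags to the two example sentences by hand and converts them to chunks with nltk.chunk.conlltags2tree. The tags are written manually for illustration (an assumption); they are not the output of a trained classifier, and the helper name np_chunks is hypothetical.

```python
from nltk.chunk import conlltags2tree

# Hand-assigned IOB tags (an assumption for illustration, not classifier output).
sent_a = [("Joey", "NN", "B-NP"), ("sold", "VBD", "O"),
          ("the", "DT", "B-NP"), ("farmer", "NN", "I-NP"),
          ("rice", "NN", "B-NP"), (".", ".", "O")]
sent_b = [("Nick", "NN", "B-NP"), ("broke", "VBD", "O"),
          ("my", "DT", "B-NP"), ("computer", "NN", "I-NP"),
          ("monitor", "NN", "I-NP"), (".", ".", "O")]


def np_chunks(iob_triples):
    """Convert (word, pos, iob) triples to a tree and list the words of each NP chunk."""
    tree = conlltags2tree(iob_triples)
    return [" ".join(word for word, pos in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]


chunks_a = np_chunks(sent_a)
chunks_b = np_chunks(sent_b)
print(chunks_a)
print(chunks_b)
```

The part-of-speech sequences are identical, so only the IOB tags (which a classifier would have to predict from the words themselves) separate the two analyses.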

The example in this subsection requires OCaml to be installed.

Noun phrase chunking with a consecutive classifier
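As a minimal sketch of the feature-extraction step such a chunker relies on, each token can be described by its own word and tag plus those of the preceding token. The function and feature names here are illustrative assumptions that loosely follow the book's approach; this is not the full classifier and needs neither OCaml nor a corpus.

```python
def npchunk_features(sentence, i, history):
    """Features for the token at position i of a POS-tagged sentence.

    `history` holds the IOB tags already assigned to earlier tokens; this
    sketch ignores it, but a consecutive classifier could use it.
    """
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    return {"pos": pos, "word": word, "prevpos": prevpos, "prevword": prevword}


sent_a = [("Joey", "NN"), ("sold", "VBD"), ("the", "DT"),
          ("farmer", "NN"), ("rice", "NN"), (".", ".")]
sent_b = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
          ("computer", "NN"), ("monitor", "NN"), (".", ".")]

# Features of the fifth token in each sentence (rice vs. monitor).
feats_a = npchunk_features(sent_a, 4, [])
feats_b = npchunk_features(sent_b, 4, [])
print(feats_a)
print(feats_b)
```

The tag features of the two tokens are identical, while the word features differ; that difference is the extra signal a classifier can use to choose between starting a new chunk and continuing the current one.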

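7.4 Recursion in Linguistic Structure

Building nested structure with cascaded chunkers

Chunk structures of arbitrary depth can be built simply by writing a multi-stage chunk grammar that contains recursive rules.

The example shows patterns for noun phrases, prepositional phrases, verb phrases, and sentences.
It is a four-stage chunk grammar and can be used to create structures with a depth of at most four.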

import nltk

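grammar = r"""
   NP: {<DT|JJ|NN.*>+}           # chunk sequences of DT, JJ, NN
   PP: {<IN><NP>}                # chunk prepositions followed by NP
   VP: {<VB.*><NP|PP|CLAUSE>+$}  # chunk verbs and their arguments
   CLAUSE: {<NP><VP>}            # chunk NP, VP
"""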
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"), ("on", "IN"), ("the", "DT"),
            ("mat", "NN")]
print(cp.parse(sentence))
"""
(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
"""

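The result misses the VP headed by saw.
When this chunker is applied to a sentence with deeper nesting, it also fails to identify the VP chunk starting at thinks, as the following example shows: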

sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))
"""
(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
"""

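Solution: pass the loop argument, which specifies how many times the set of patterns should be run, so that the chunker loops over its patterns.

cp = nltk.RegexpParser(grammar, loop=2)  # run the chunk patterns twice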
print(cp.parse(sentence))
"""
(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))
"""
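The effect of loop can also be checked programmatically rather than by reading the printed trees. The sketch below reuses the grammar and the longer sentence from above and compares the number of CLAUSE chunks and the tree height with and without looping; count_label is a hypothetical helper introduced only for this illustration.

```python
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><NP|PP|CLAUSE>+$}
  CLAUSE: {<NP><VP>}
"""
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]


def count_label(tree, label):
    """Count the subtrees of `tree` whose node label equals `label`."""
    return sum(1 for subtree in tree.subtrees() if subtree.label() == label)


once = nltk.RegexpParser(grammar).parse(sentence)            # single pass
twice = nltk.RegexpParser(grammar, loop=2).parse(sentence)   # two passes

print("CLAUSE chunks:", count_label(once, "CLAUSE"), "->", count_label(twice, "CLAUSE"))
print("tree height:  ", once.height(), "->", twice.height())
```

The second pass lets the VP and CLAUSE rules see the CLAUSE built in the first pass, which is why the looping parser produces the additional nested clause.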

Trees

In NLTK, a tree is created by giving a node label and a list of children.

tree = nltk.Tree('NP',['the','rabbit'])

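Some Tree methods:

print(tree[1])        # rabbit
print(tree.label())   # NP (in NLTK 3, label() replaces the old tree.node attribute)
print(tree.leaves())  # ['the', 'rabbit']
tree.draw()           # opens a window that displays the tree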

Tree traversal

A recursive function can be used to traverse a tree.

The example from the book raises the error TypeError: Tree: Expected a node value and child list
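One likely cause (an assumption, since the failing listing is not shown here) is that the book builds the tree by passing a bracketed string directly to the nltk.Tree constructor, which NLTK 3 does not accept. In NLTK 3, Tree.fromstring parses such strings and label() replaces the old node attribute. A minimal traversal sketch under that assumption, returning the tokens instead of printing them so the result is easy to inspect:

```python
from nltk import Tree


def traverse(t):
    """Return the tokens of a bracketed rendering of t, visiting nodes depth-first."""
    if not isinstance(t, Tree):
        # A leaf: just the word itself.
        return [str(t)]
    tokens = ["(", t.label()]
    for child in t:
        tokens.extend(traverse(child))
    tokens.append(")")
    return tokens


t = Tree.fromstring("(S (NP Alice) (VP chased (NP the rabbit)))")
rendered = " ".join(traverse(t))
print(rendered)
```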

7.5 Named Entity Recognition

Table 7-3. Commonly used types of named entity

NE type	Examples
ORGANIZATION	Georgia-Pacific Corp., WHO
PERSON	Eddy Bonte, President Obama
LOCATION	Murray River, Mount Everest
DATE	June, 2008-06-29
TIME	two fifty a m, 1:30 p.m.
MONEY	175 million Canadian Dollars, GBP 10.40
PERCENT	twenty pct, 18.75 %
FACILITY	Washington Monument, Stonehenge
GPE (geo-political entity)	South East Asia, Midlothian

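The goal of a named entity recognition (NER) system is to identify all textual mentions of named entities.

This can be broken down into two subtasks: identifying the boundaries of the NE, and identifying its type.

Named entity recognition is often a prelude to identifying relations in information extraction, and it also helps with other tasks. For example, in question answering (QA) we try to improve the precision of information retrieval by returning not whole pages but only the parts that contain an answer to the user's question. Most QA systems take the documents returned by standard information retrieval and then try to isolate the smallest text snippet in the document that contains the answer.

Suppose the question is: Who was the first President of the US?
One of the retrieved documents contains the answer, as follows: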

(5) The Washington Monument is the most prominent structure in Washington,
D.C. and one of the city’s early attractions. It was built in honor of George
Washington, who led the country to independence and then became its first
President.

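The answer we want should have the form X was the first President of the US, where X is not only a noun phrase but also a named entity of type PER.

Named entities can be identified by looking them up in an appropriate list of names (for locations, for example, a gazetteer can be used), but doing this blindly runs into problems: lists of person or organization names can never have complete coverage, and many entity terms are ambiguous. May and North, for instance, are likely to be part of a date and a location respectively, but both could also be part of a person's name.
A further challenge comes from multi-word names such as 'Stanford University' and from names that contain other names, so we need to be able to identify the beginning and end of multi-token sequences.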
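Both problems are easy to reproduce with a toy word-by-word lookup. The gazetteer entries and the sentence below are made up for illustration, and naive_lookup is a hypothetical helper, not an NLTK function.

```python
# Toy gazetteer; the entries are invented for this illustration.
GAZETTEER = {"May": "DATE", "North": "LOCATION", "Stanford": "LOCATION"}


def naive_lookup(tokens):
    """Tag each token independently by dictionary lookup; 'O' means no entity."""
    return [(token, GAZETTEER.get(token, "O")) for token in tokens]


tagged = naive_lookup("Mrs. May met Colonel North at Stanford University".split())
for token, label in tagged:
    print(token, label)
```

Here May and North are labelled as a date and a location even though both are surnames in this sentence, and Stanford University is split into a location followed by an untagged word instead of being recognized as one multi-token name.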

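NER is a task well suited to a classifier-based approach.

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk().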

import nltk

sent = nltk.corpus.treebank.tagged_sents()[22]
# With binary=True, named entities are only tagged as NE; otherwise the classifier adds category labels such as PERSON, ORGANIZATION and GPE
print(nltk.ne_chunk(sent, binary=True))
# (S
#   The/DT
#   (NE U.S./NNP)
#   is/VBZ
#   one/CD
#   ....
#   according/VBG
#   to/TO
#   (NE Brooke/NNP)
#   ...)
print(nltk.ne_chunk(sent))  # PERSON, ORGANIZATION and GPE
# (S
#   The/DT
#   (GPE U.S./NNP)
#   is/VBZ
#   one/CD
#   ......
#   according/VBG
#   to/TO
#   (PERSON Brooke/NNP T./NNP Mossman/NNP)
#   ....)
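The chunked result is an ordinary tree, so the entities can be collected by walking its subtrees. The sketch below uses a hand-built, abridged stand-in for the tree printed above (an assumption, so that it runs without the corpus or the trained model); named_entities is a hypothetical helper.

```python
from nltk import Tree

# Hand-built, abridged stand-in for an ne_chunk() result; a real tree would
# come from nltk.ne_chunk() applied to a POS-tagged sentence.
tree = Tree("S", [
    ("The", "DT"),
    Tree("GPE", [("U.S.", "NNP")]),
    ("is", "VBZ"),
    ("one", "CD"),
    ("according", "VBG"),
    ("to", "TO"),
    Tree("PERSON", [("Brooke", "NNP"), ("T.", "NNP"), ("Mossman", "NNP")]),
])


def named_entities(chunked):
    """Return (type, text) pairs for every entity subtree below the root."""
    return [(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
            for subtree in chunked.subtrees() if subtree is not chunked]


entities = named_entities(tree)
print(entities)
```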

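7.6 Relation Extraction

Once the named entities in a text have been identified, the relations that hold between them can be extracted.

The approach is to first look for all triples of the form (X, $\alpha$, Y), where X and Y are named entities of the required types and $\alpha$ is the string that expresses the relation between X and Y.

The example below searches for strings that contain the word in. The negative lookahead assertion in the regular expression discards strings in which in is followed by a gerund.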

import re
import nltk

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))

# [ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
# [ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
# [ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
# [ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
# [ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
# [ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
# [ORG: 'WGBH'] 'in' [LOC: 'Boston']
# [ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
# [ORG: 'Omnicom'] 'in' [LOC: 'New York']
# [ORG: 'DDB Needham'] 'in' [LOC: 'New York']
# [ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
# [ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
# [ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']

The Dutch named-entity corpus

import re
import nltk
from nltk.corpus import conll2002

vnv = """
(
is/V| # 3rd sing present and
was/V| # past forms of the verb zijn ('be')
werd/V| # past and
wordt/V # present forms of worden ('become')
)
.* # followed by anything
van/Prep # followed by van ('of')
"""

VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        print(nltk.sem.relextract.clause(r, relsym="VAN"))
# VAN("cornet_d'elzius", 'buitenlandse_handel')
# VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
# VAN('annie_lennox', 'eurythmics')

for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        print(nltk.sem.relextract.rtuple(r, lcon=True, rcon=True))
# ...'')[PER: "Cornet/V d'Elzius/N"] 'is/V op/Prep dit/Pron ogenblik/N kabinetsadviseur/N van/Prep staatssecretaris/N voor/Prep' [ORG: 'Buitenlandse/N Handel/N'](''...
# ...'')[PER: 'Johan/N Rottiers/N'] 'is/V informaticacoördinator/N van/Prep het/Art' [ORG: 'Kardinaal/N Van/N Roey/N Instituut/N']('in/Prep'...
# ...'Door/Prep rugproblemen/N van/Prep zangeres/N')[PER: 'Annie/N Lennox/N'] 'wordt/V het/Art concert/N van/Prep' [ORG: 'Eurythmics/N']('vandaag/Adv in/Prep'...
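Because the pattern is compiled with re.VERBOSE, the whitespace and the comments inside it are ignored, leaving a verb alternation followed by anything and then van/Prep. A small sketch that applies the same pattern to filler strings copied from the output above, without loading the corpus; the last string is a made-up non-matching example.

```python
import re

vnv = r"""
(
is/V|      # 3rd sing present and
was/V|     # past forms of the verb zijn ('be')
werd/V|    # past and
wordt/V    # present forms of worden ('become')
)
.*         # followed by anything
van/Prep   # followed by van ('of')
"""
VAN = re.compile(vnv, re.VERBOSE)

fillers = [
    "is/V informaticacoördinator/N van/Prep het/Art",
    "wordt/V het/Art concert/N van/Prep",
    "het/Art concert/N van/Prep",  # does not start with one of the verbs
]

# match() anchors at the start of the filler, so the verb must come first.
van_results = [VAN.match(filler) is not None for filler in fillers]
print(van_results)
```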