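Natural Language Processing with Python 03: Processing Raw Text

Preface

3.1 Accessing Text from the Web and from Disk

Electronic Books

NLTK includes a small sample of texts from Project Gutenberg.
For other texts, browse the catalog at:
https://www.gutenberg.org/catalog/

The site hosts 25,000 free online books (ASCII text files).
About 90% of the texts are in English, but it also includes material in more than 50 other languages.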

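Processing HTML

nltk.clean_html(html) # takes an HTML string and returns the raw text (NLTK 2; NLTK 3 no longer implements it and recommends BeautifulSoup's get_text() instead)

For more sophisticated HTML processing, install the Beautiful Soup package:
https://www.crummy.com/software/BeautifulSoup/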
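
As a rough illustration of what "HTML string in, raw text out" involves, here is a minimal sketch built only on the standard library's html.parser module. It is not how NLTK or Beautiful Soup do it; the class name TextExtractor and the helper html_to_text are made up for this example, and real pages usually need a more robust tool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: strip the markup from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><h1>Title</h1><p>Hello, <b>world</b>!</p></body></html>")
print(html_to_text(sample))
```

Joining the text nodes with spaces keeps words from adjacent tags apart, at the cost of an occasional stray space before punctuation.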

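Processing Search Engine Results

Advantages: large volume of data; easy to use
Disadvantages: the allowable range of search patterns is restricted; results differ by time and location; the returned content may change unpredictably

Processing RSS Feeds

The Python library Universal Feed Parser gives access to the content of a blog.

Install: pip install feedparser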

import feedparser

llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")

print(llog['feed']['title'])
print(len(llog.entries))
post = llog.entries[2]
"""
# A blog entry has the following attributes; print them to inspect
'authors', 'author_detail', 'href', 'author', 'title', 'title_detail', 'links', 
'link', 'id', 'guidislink', 'updated', 'updated_parsed', 'published', 'published_parsed',
'tags', 'summary', 'summary_detail', 'content', 'thr_total'
"""
print(post.title)
content = post.content[0].value
print(content[:40])

"""Output:
Language Log
13
Sino-English neologisms
<p>As I've mentioned before, Chinese fee
"""

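Reading Local Files

Listing the current directory in Python:

import os
os.listdir('.')

open(filepath, 'r') # 'r' means read-only; Python 3 handles the different newline conventions by default (the Python 2 'U' "universal" flag was removed in Python 3.11)

Using open to access a corpus file in NLTK:

# The result depends on where the corpora are installed
path = nltk.data.find("corpora/gutenberg/melville-moby_dick.txt")
raw = open(path, 'r').read()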

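Extracting Text from PDF, MS Word, and Other Binary Formats

ASCII text and HTML are human-readable formats.
Text stored in binary formats (PDF, MS Word, ...) can only be opened with specialized software
- pypdf, pywin32, etc.

Capturing User Input

Python 3 has no raw_input; use input directly.

s = input("Enter some text:")
print(nltk.word_tokenize(s))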

The NLP Pipeline

Processing pipeline: HTML -> ASCII -> Text -> Vocab
Tokenizing a string produces a list of words (type <list>);
normalizing and sorting that list produces further lists.
text = nltk.word_tokenize(s)
words = [w.lower() for w in text]
vocab = sorted(set(words))
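
The same three steps can be tried without NLTK installed. The sketch below swaps nltk.word_tokenize for a crude regular-expression tokenizer (an assumption made for this example, not the book's method), and the sample string stands in for text that has already been extracted from HTML.

```python
import re

# Stand-in for text that has already been extracted from HTML.
text_from_html = "The cat sat. The CAT sat on the mat!"

# Crude tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text_from_html)
words = [t.lower() for t in tokens]   # normalize case
vocab = sorted(set(words))            # unique word types, sorted

print(tokens)
print(vocab)
```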

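3.2 Strings: Text Processing at the Lowest Level

Basic operations on strings (omitted)

Printing strings (omitted)

Accessing individual characters (omitted)

Accessing substrings (omitted)

Differences between lists and strings (omitted)

3.3 Text Processing with Unicode (abridged)

What is Unicode? (omitted; see book p. 101)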

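Extracting Encoded Text from Files

Python's unicode_escape codec is a dummy encoding.
It converts every non-ASCII character into the form \uXXXX.
Code points outside the range 0-127 but below 256 are written in the two-digit form \xXX.

s.encode('unicode_escape')

Python can look up the integer code point of a character with ord():

ord('a') # = 97
print('\u0061') # prints a, because 97 written as a four-digit hexadecimal number is 0061

In Python 2, the print statement assumed that the default encoding of Unicode characters was ASCII,
so characters outside the ASCII range could not be printed unless an encoding was specified. In Python 3, print() uses the encoding of the output stream, which is usually UTF-8.

repr(s) # in Python 2 this shows the UTF-8 escape sequences; in Python 3, use ascii(s) or s.encode('utf-8') to see escaped forms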
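
To make the difference between a code point, its unicode_escape form, and its UTF-8 bytes concrete, here is a small Python 3 sketch. The helper name describe is made up for this example; the characters are written as escapes and printed through ascii() so the snippet also runs on terminals that cannot display them.

```python
def describe(ch):
    """Return (code point, unicode_escape bytes, UTF-8 bytes) for one character."""
    return ord(ch), ch.encode("unicode_escape"), ch.encode("utf-8")


# 'a' is ASCII, U+00E9 is below 256, U+4E2D is a CJK character.
for ch in ["a", "\u00e9", "\u4e2d"]:
    print(ascii(ch), describe(ch))
```

The character below 256 gets a \xXX escape and the CJK character a \uXXXX escape, while their UTF-8 encodings take two and three bytes respectively.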

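Using Local Encodings in Python (omitted)

3.4 Using Regular Expressions to Detect Word Patterns

Using basic metacharacters (abridged)

$ end of string; ^ start of string; ? the previous item is optional

Ranges and closures (abridged; see Table 3-3, book p. 109)

[] a set or range of characters, e.g. [abc] or [a-z]
+ one or more instances of the previous item
* zero or more instances of the previous item
\ escapes the following character so that it is treated literally
{} a specified number of repetitions
() scope of operators
| pipe (disjunction)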
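
The metacharacters listed above can be tried out with re.search over a word list. The sketch below uses a tiny made-up list (the book filters nltk.corpus.words instead), so the patterns and their matches are purely illustrative.

```python
import re

# Tiny stand-in word list (hypothetical; the book uses a full English wordlist).
wordlist = ["email", "e-mail", "gold", "golf", "hold", "hole",
            "mine", "mined", "aaah", "ah"]


def matching(pattern):
    """Return the words in wordlist that the pattern matches somewhere."""
    return [w for w in wordlist if re.search(pattern, w)]


ends_in_ed = matching(r"ed$")                       # $ anchors at the end
t9_4653 = matching(r"^[ghi][mno][jkl][def]$")       # [] sets, ^ and $ anchors
optional_hyphen = matching(r"^e-?mail$")            # ? makes the hyphen optional
repeated_a = matching(r"^a{2,}h$")                  # {} at least two repetitions
either_stem = matching(r"^(gold|mine)d?$")          # () scope and | disjunction

for name in ("ends_in_ed", "t9_4653", "optional_hyphen", "repeated_a", "either_stem"):
    print(name, globals()[name])
```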
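3.5 Useful Applications of Regular Expressions
What regular expressions are good for:
Checking whether a word matches a pattern
Extracting features from words
Modifying words in particular ways
Extracting word pieces
[int(n) for n in re.findall(r'\d+', '2009-12-31')] # [2009, 12, 31]
Doing more with word pieces

Removing word-internal vowels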

import re
import nltk

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'


def compress(word):
	pieces = re.findall(regexp, word)
	return ''.join(pieces)


english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap([compress(w) for w in english_udhr[:75]]))
"""
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
"""

Building a conditional frequency distribution

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
"""
    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 
"""

Inspecting the words behind the numbers in the table above

cv_word_pairs = [(cv, w) for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
print(cv_index['su'])
"""
['kasuari']
"""

Finding Word Stems

Approach 1: simply strip off anything that looks like a suffix
- endswith
  def stem(word):
      for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
          if word.endswith(suffix):
              return word[:-len(suffix)]
      return word
- re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') # ['ing'] (only the captured group is returned)
- re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') # ['processing'] (non-capturing group)
- re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') # [('process', 'ing')]
- re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') # [('processe', 's')] (* is greedy)
- re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language') # [('language', '')] (empty suffix)
- re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') # [('process', 'es')] (*? is non-greedy)
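
Putting the non-greedy stem and the optional suffix together gives a small stemming function. This is only a sketch of the suffix-stripping idea from the list above, with the suffix list copied from it; as the last line of output shows, it over-strips words such as "lying" and "basis".

```python
import re

SUFFIXES = r"ing|ly|ed|ious|ies|ive|es|s|ment"


def stem(word):
    """Strip one suffix using a non-greedy stem and an optional suffix group."""
    stem_part, suffix = re.findall(r"^(.*?)(" + SUFFIXES + r")?$", word)[0]
    return stem_part


for w in ["processing", "processes", "language", "lying", "basis"]:
    print(w, "->", stem(w))
```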

Searching Tokenized Text

from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r'<a>(<.*>)<man>')  # the words that appear between "a" and "man"

chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*><.*><bro>")  # three-word phrases ending in "bro"
chat.findall(r"<l.*>{3,}")  # runs of three or more words starting with "l"

Searching for "x and other ys" to discover hypernyms

from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r'<\w*><and><other><\w*s>')

3.6 Normalizing Text

Preparing the data

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

Stemmers

Porter

porter = nltk.PorterStemmer()
[porter.stem(t) for t in tokens]

Lancaster

lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in tokens]

IndexedText: indexing a text using a stemmer

class IndexedText(object):
    def __init__(self,stemmer,text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word),i)
                                for (i,word) in enumerate(text))
    
    def concordance(self,word,width=40):
        key = self._stem(word)
        wc = width // 4  # number of context words; must be an int to be used in slices
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '%*s' % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print(ldisplay,rdisplay)
    
    def _stem(self,word):
        return self._stemmer.stem(word).lower()

porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter,grail)
text.concordance('lie')

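Lemmatization

women -> woman (but "lying" is left unchanged)

wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

Identifying non-standard words (mapping numbers, abbreviations, dates, and similar tokens to a special vocabulary)

e.g.:
Every decimal number can be mapped to a single token, 0.0
Every acronym can be mapped to AAA
[Advantage] The vocabulary becomes smaller, which improves the accuracy of language modeling tasks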
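
The mapping described above might be sketched as follows. The two rules (digits with a decimal point count as a decimal number; two or more capital letters, with or without periods, count as an acronym) are assumptions chosen for this example, and the function name normalize_token is hypothetical.

```python
import re


def normalize_token(token):
    """Map decimal numbers to '0.0' and acronyms to 'AAA'; leave other tokens alone."""
    if re.fullmatch(r"\d+\.\d+", token):
        return "0.0"
    if re.fullmatch(r"(?:[A-Z]\.?){2,}", token):
        return "AAA"
    return token


tokens = ["NATO", "paid", "12.40", "and", "3.14", "to", "the", "U.S.A.", "I"]
normalized = [normalize_token(t) for t in tokens]
print(normalized)
print(len(set(tokens)), "types before,", len(set(normalized)), "types after")
```

The single-letter word "I" is left alone because the acronym rule requires at least two capitals.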

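3.7 Regular Expressions for Tokenizing Text

Simple approaches to tokenization

Split on any whitespace: raw.split()
Split on single spaces: re.split(r' ', raw)
Split on spaces, tabs, and newlines: re.split(r'[ \t\n]+', raw)
Split on any whitespace character: re.split(r'\s+', raw)
Python's character class "\w" corresponds to [a-zA-Z0-9_] (for ASCII text); "\W" is its complement (every character other than letters, digits, and underscore)
No empty strings: re.findall(r'\w+', raw)
re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw) # also keeps word-internal hyphens and apostrophes, and punctuation as separate tokens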
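
To compare these approaches side by side, the sketch below runs them on one short made-up sentence containing an apostrophe, a hyphen, a tab, and punctuation.

```python
import re

raw = "That's a well-known fact,\tisn't it?"

by_space = re.split(r" ", raw)                # the tab is not a separator here
by_whitespace = re.split(r"[ \t\n]+", raw)    # punctuation stays glued to words
words_only = re.findall(r"\w+", raw)          # breaks at apostrophes and hyphens
refined = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)

for name, result in [("by_space", by_space), ("by_whitespace", by_whitespace),
                     ("words_only", words_only), ("refined", refined)]:
    print(name, result)
```

Only the last pattern keeps "That's" and "well-known" whole while still splitting off the comma and the question mark.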

Regular expression symbols: see Table 3-4 of the book

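NLTK's Regular Expression Tokenizer

nltk.regexp_tokenize() (similar to re.findall())

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(text, pattern)

The groups are written as non-capturing (?:...) because NLTK 3 returns the captured groups instead of whole tokens when capturing groups are used.

With the verbose flag, a literal ' ' in the pattern no longer matches a space character; use '\s' instead

regexp_tokenize() has an optional gaps parameter; when it is set to True, the regular expression describes the gaps between tokens (as with re.split())

[Note] A tokenizer can be evaluated by comparing its output with a wordlist and reporting
any tokens that do not appear in the wordlist, using set(tokens).difference(wordlist)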
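
The set-difference check from the note can be sketched as below. The wordlist is a tiny hypothetical one and the tokenizer is a bare \w+ pattern, chosen so that the check has something to report.

```python
import re


def unknown_tokens(tokens, wordlist):
    """Return the sorted, lowercased tokens that are missing from the wordlist."""
    return sorted(set(t.lower() for t in tokens).difference(wordlist))


wordlist = {"the", "cat", "did", "not", "sit", "on", "mat"}  # hypothetical
raw = "The cat didn't sit on the mat."
tokens = re.findall(r"\w+", raw)

print(unknown_tokens(tokens, wordlist))
```

The reported fragments point straight at the weakness of the pattern: it splits the contraction at the apostrophe.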

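Further Issues with Tokenization

No single solution works well across all domains
What counts as a token must be decided according to the needs of the application domain
A tokenizer's output can be compared with high-quality annotations
- nltk.corpus.treebank_raw.raw() # the raw Wall Street Journal text
- nltk.corpus.treebank.words() # the tokenized version
The problem of contractions, e.g. "didn't"
- they can be normalized into two separate forms, "did" and "n't"
- this can be done with a lookup table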
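
One way to combine both ideas is a small lookup table for irregular contractions plus a regular-expression rule for the regular "n't" case. The table contents and the function name split_contraction are assumptions made for this sketch.

```python
import re

# Hypothetical lookup table for contractions the regular rule gets wrong.
IRREGULAR = {"won't": ["will", "n't"], "can't": ["can", "n't"]}


def split_contraction(token):
    """Split a negative contraction into its two normalized forms."""
    lower = token.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    match = re.fullmatch(r"(\w+)(n't)", token, flags=re.IGNORECASE)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


for t in ["didn't", "can't", "won't", "cat"]:
    print(t, "->", split_contraction(t))
```

Without the table, the regular rule would split "can't" into "ca" and "n't".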

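3.8 Segmentation

Sentence Segmentation

NLTK's Punkt sentence segmenter
Word Segmentation
Some writing systems have no visual representation of word boundaries:

e.g. 爱国人 (ai4 "love" [verb], guo2 "country", ren2 "person")
- "爱国/人", "country-loving person"
- "爱/国人", "love country-person"

Annotate each character with a boolean value that indicates whether a word break follows it (see Chapter 7, "Chunking", for details)

Example 3-2 (book p. 123)
One possible representation:
a. do you see the kitty
b. see the doggy
c. do you like the kitty
d. like the doggy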
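
The boolean annotation can be derived mechanically from text that is already segmented. The sketch below builds the boundary string for the four utterances above; the helper name to_segs is made up for this example. Every character except the very last one gets a "1" if a chunk ends right after it and a "0" otherwise.

```python
def to_segs(chunks):
    """Build the boundary string for a list of chunks (words or utterances)."""
    bits = "".join("0" * (len(chunk) - 1) + "1" for chunk in chunks)
    return bits[:-1]  # the final character has no boundary bit


utterances = ["do you see the kitty", "see the doggy",
              "do you like the kitty", "like the doggy"]

words = [w for u in utterances for w in u.split()]
text = "".join(words)                     # the text with all boundaries removed

word_segs = to_segs(words)                                     # every word boundary
utterance_segs = to_segs(u.replace(" ", "") for u in utterances)  # utterance boundaries only

print(text)
print(utterance_segs)
print(word_segs)
```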
The text with word boundaries removed:

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"


def segment(t, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(t[last:i + 1])
            last = i + 1
    words.append(t[last:])
    return words


s1, s2 = segment(text, seg1), segment(text, seg2)
print(s1)
print(s2)
"""
['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy'] 
['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
"""

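Once the text is annotated, segmentation becomes a search problem

Given a suitable lexicon, the source text can be reconstructed as a sequence of lexical items. Define an objective (scoring) function and optimize its value, based on the size of the lexicon and the amount of information needed to reconstruct the source text from the lexicon (Brent & Cartwright (1995))
The lower the segmentation score, the better
Example 3-3 of the book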

seg3 = "0000100100000011001000000110000100010000001100010000001"


# Compute the cost of storing the lexicon plus reconstructing the source text
def evaluate(t, segs):
	words = segment(t, segs)
	text_size = len(words)
	lexicon_size = len(' '.join(list(set(words))))
	return text_size + lexicon_size


s3 = segment(text, seg3)
print(s3)
e3 = evaluate(text, seg3)
e2 = evaluate(text, seg2)
e1 = evaluate(text, seg1)
print(e3, e2, e1, sep='\n')
"""
['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']
46
47
63

"""

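Find the pattern of 0s and 1s that optimizes (here, minimizes) the objective function (if the data does not provide enough evidence to split further, the resulting words still count as the best segmentation)

Example 3-4 (book p. 125): non-deterministic search using simulated annealing (the 0s and 1s are perturbed at random, in proportion to the temperature; the temperature drops with each iteration, so fewer boundaries are perturbed: broad exploration early on, fine-tuning of a good solution later)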

from random import randint


def flip(segs, pos):
	return segs[:pos] + str(1 - int(segs[pos])) + segs[pos + 1:]


def flip_n(segs, n):
	for i in range(n):
		segs = flip(segs, randint(0, len(segs) - 1))
	return segs


def anneal(txt, segs, iterations, cooling_rate):
	temperature = float(len(segs))
	while temperature > 0.5:
		best_segs, best = segs, evaluate(txt, segs)
		for i in range(iterations):
			guess = flip_n(segs, int(round(temperature)))
			score = evaluate(txt, guess)
			if score < best:
				best, best_segs = score, guess
		score, segs = best, best_segs
		temperature = temperature / cooling_rate
		print(evaluate(txt, segs), segment(txt, segs))
	print()
	return segs


anneal(text, seg1, 5000, 1.2)
"""
63 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
60 ['doyo', 'useet', 'hekitty', 'seethed', 'oggy', 'doyouliket', 'hekitty', 'likethed', 'oggy']
59 ['doyo', 'useet', 'hekitty', 'seethed', 'oggy', 'doyo', 'uliket', 'hekitty', 'lik', 'ethed', 'oggy']
54 ['doyo', 'usee', 't', 'hekitty', 'se', 'ethedoggy', 'doyo', 'ulike', 't', 'hekitty', 'lik', 'ethedoggy']
54 ['doyo', 'usee', 't', 'hekitty', 'se', 'ethedoggy', 'doyo', 'ulike', 't', 'hekitty', 'lik', 'ethedoggy']
52 ['doyo', 'useet', 'hekitty', 'se', 'ethedoggy', 'doyo', 'uliket', 'hekitty', 'lik', 'ethedoggy']
"""

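With enough data, it is possible to segment text into words automatically with reasonable accuracy (this can be used to tokenize writing systems that have no visual representation of word boundaries).

3.9 Formatting: From Lists to Strings
From lists to strings (abridged)
''.join(list)

Strings and formats (abridged)
%s and %d conversion specifiers

Lining things up

A few notes:

A format string can specify a width, e.g. %6s, %5d; the default is right-justified
Left justification can be requested, e.g. '%-6s' % 'dog' # 'dog   '
The width can be given by a variable, e.g. '%-*s' % (width, 'dog') # 'dog   ' when width is 6
A literal percent sign is written %%, e.g. '%2.4f%%' % (100 * count / total) # 34.1867% (the parentheses are required, because % binds before the multiplication)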
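
The four notes above can be checked with a short sketch. The values of count and total are sample numbers chosen to reproduce the 34.1867% figure; the other names are only for illustration.

```python
width = 6

right = "%6s" % "dog"                 # right-justified in 6 columns (the default)
left = "%-6s" % "dog"                 # left-justified in 6 columns
variable = "%-*s" % (width, "dog")    # width taken from a variable

count, total = 3205, 9375             # sample values
percent = "%2.4f%%" % (100 * count / total)

# repr() makes the padding spaces visible.
print(repr(right), repr(left), repr(variable), percent)
```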
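Writing results to a file
A few tips
When writing the total word count to a file, convert the number to a string first, since write() only accepts strings
Avoid filenames that contain space characters

Avoid filenames that are identical except for upper and lower case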
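
A minimal sketch of these tips, assuming a toy word list and a made-up file name: it writes the sorted vocabulary and then the total word count, converting the number with str() before writing. A temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

words = ["the", "cat", "sat", "on", "the", "mat"]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name: lowercase, no spaces.
    path = os.path.join(tmp, "word_counts.txt")
    with open(path, "w", encoding="utf-8") as out:
        for word in sorted(set(words)):
            out.write(word + "\n")
        out.write(str(len(words)) + "\n")   # write() needs a string, not an int
    with open(path, encoding="utf-8") as f:
        written = f.read()

print(written)
```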

Text Wrapping

from textwrap import fill

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']
fmt = '%s (%d),'  # named fmt to avoid shadowing the built-in format()
pieces = [fmt % (word, len(word)) for word in saying]
output = ' '.join(pieces)
print(output)
wrapped = fill(output)
print(wrapped)

Last updated: 2019-05-09 12:35

Original link: https://ice-melt.github.io/2019/04/16/Python_NLP_03/
