python自动化拼写检查器

2022-09-24 13:16  
真正精彩的代码是能够做到使用简短的代码,实现有意义的功能。
 ——Guido鲁迅
今天分享的代码是谷歌Norvig大神的杰作,他用很简洁的方式实现了单词的自动化拼写检查。
当我们使用谷歌搜索内容的时候,如果你拼错一个单词,网页会提醒你可能的正确拼法,这就是所谓的“拼写检查”(spelling corrector)。谷歌使用的是基于贝叶斯推断的统计学习方法。这种方法的特点就是快,很快的时间内处理大量文本,并且有很高的精确度(90%以上)。
效果大概是这个样子:

当用户输入一个单词的时候,分为了两种情况:
  • 拼写正确,记为c(correct)
  • 拼写错误,记为w(wrong)

所谓的“拼写检查”,从概率论的角度看,就是已知w,然后在若干个备选方案中,找出可能性最大的那个c,也就是求式(1)的最大值。

根据贝叶斯定理可得:

对于所有的备选项c来说,w都是相同的,因此可将上式简化为:

所以,实际上可以看成是求式(3)的最大值

P(c)的含义是,某个正确的词的出现“概率”,它可以用“频率”代替。如果我们有一个足够大的文本库,那么这个文本库中每个单词的出现频率,就相当于它的发生概率。某个词的出现频率越高,P(c)就越大。

P(w|c)的含义是,在试图拼写c的情况下,出现拼写错误w的概率。这需要统计数据的支持,但是为了简化问题,我们假设两个单词在字形上越接近,就越有可能拼错,P(w|c)就越大。

举例来说,相差一个字母的拼法,就比相差两个字母的拼法,发生概率更高。

例如,如果你想拼写单词Serendipity,那么错误拼写成Serendipitu(相差一个字母)的可能性,就比拼成Serendipituu要高(相差两个字母)。

所以,我们只要找到与输入单词在字形上最相近的那些词,再在其中挑出出现频率最高的一个,就能实现P(w|c)*P(c)的最大值。

实现代码如下:

import refrom collections import Counter
def words(text): return re.findall(r'\w+', text.lower())
WORDS = Counter(words(open('big.txt').read()))
def P(word, N=sum(WORDS.values())): "Probability of `word`." return WORDS[word] / N
def correction(word): "Most probable spelling correction for word." return max(candidates(word), key=P)
def candidates(word): "Generate possible spelling corrections for word." return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])
def known(words): "The subset of `words` that appear in the dictionary of WORDS." return set(w for w in words if w in WORDS)
def edits1(word): "All edits that are one edit away from `word`." letters = 'abcdefghijklmnopqrstuvwxyz' splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [L + R[1:] for L, R in splits if R] transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1] replaces = [L + c + R[1:] for L, R in splits if R for c in letters] inserts = [L + c + R for L, R in splits for c in letters] return set(deletes + transposes + replaces + inserts)
def edits2(word): "All edits that are two edits away from `word`." return (e2 for e1 in edits1(word) for e2 in edits1(e1))
################ Test Code
def unit_tests(): assert correction('speling') == 'spelling' # insert assert correction('korrectud') == 'corrected' # replace 2 assert correction('bycycle') == 'bicycle' # replace assert correction('inconvient') == 'inconvenient' # insert 2 assert correction('arrainged') == 'arranged' # delete assert correction('peotry') =='poetry' # transpose assert correction('peotryy') =='poetry' # transpose + delete assert correction('word') == 'word' # known assert correction('quintessential') == 'quintessential' # unknown assert words('This is a TEST.') == ['this', 'is', 'a', 'test'] assert Counter(words('This is a test. 123; A TEST this is.')) == ( Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2})) assert len(WORDS) == 32192 assert sum(WORDS.values()) == 1115504 assert WORDS.most_common(10) == [ ('the', 79808), ('of', 40024), ('and', 38311), ('to', 28765), ('in', 22020), ('a', 21124), ('that', 12512), ('he', 12401), ('was', 11410), ('it', 10681)] assert WORDS['the'] == 79808 assert P('quintessential') == 0 assert 0.07 < P('the') < 0.08 return 'unit_tests pass'
def spelltest(tests, verbose=False): "Run correction(wrong) on all (right, wrong) pairs; report results." import time start = time.clock() good, unknown = 0, 0 n = len(tests) for right, wrong in tests: w = correction(wrong) good += (w == right) if w != right: unknown += (right not in WORDS) if verbose: print('correction({}) => {} ({}); expected {} ({})' .format(wrong, w, WORDS[w], right, WORDS[right])) dt = time.clock() - start print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second ' .format(good / n, n, unknown / n, n / dt))
def Testset(lines): "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs." return [(right, wrong) for (right, wrongs) in (line.split(':') for line in lines) for wrong in wrongs.split()]
if __name__ == '__main__': print(unit_tests()) spelltest(Testset(open('spell-testset1.txt'))) spelltest(Testset(open('spell-testset2.txt')))


小马过河啊
要好好学习呀!
 最新文章