NLP (4): Regular Expressions

Original article: http://www.one2know.cn/nlp4/

  • * + ?

    * : zero or more

    + : one or more

    ? : zero or one

    re.search() matches a pattern against a string; it returns a match object (truthy) on success, or None if there is no match

    import re

    # matcher function; arguments: text and a pattern (the regular expression)
    def text_match(text, patterns):
        if re.search(patterns, text):
            return 'Found a match!'
        else:
            return 'Not matched!'

    # tests
    print(text_match('ac', 'ab?'))
    print(text_match('abc', 'ab?'))
    print(text_match('abbc', 'ab?'))

    print(text_match('ac', 'ab*'))
    print(text_match('abc', 'ab*'))
    print(text_match('abbc', 'ab*'))

    print(text_match('ac', 'ab+'))
    print(text_match('abc', 'ab+'))
    print(text_match('abbc', 'ab+'))

    print(text_match('abbc', 'ab{2}'))

    print(text_match('aabbbbc', 'ab{3,5}?'))

Output:

Found a match!
Found a match!
Found a match!
Found a match!
Found a match!
Found a match!
Not matched!
Found a match!
Found a match!
Found a match!
Found a match!
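As an aside (my addition, not from the original article): `{m,n}` is greedy by default, and appending `?` makes it lazy. A quick sketch of the difference on the last test string:

```python
import re

# Greedy vs. lazy: 'ab{3,5}' grabs as many b's as it can (up to 5),
# while 'ab{3,5}?' stops as soon as the minimum of three is satisfied.
greedy = re.search(r'ab{3,5}', 'aabbbbc')
lazy = re.search(r'ab{3,5}?', 'aabbbbc')
print(greedy.group())  # abbbb
print(lazy.group())    # abbb
```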
  • $ ^ .

    $ : end of string

    ^ : start of string

    . : any character except a newline

    \w : letter, digit, or underscore

    \s : whitespace character

    \S : non-whitespace character

    \b : word boundary

    \B : non-word boundary

    import re

    def text_match(text, patterns):
        if re.search(patterns, text):
            return 'Found a match!'
        else:
            return 'Not matched!'

    # anything that starts with a and ends with c
    print(text_match('abbc', r'^a.*c$'))

    # one or more word characters at the start of the text
    print(text_match('Tuffy eats pie, Loki eats peas!', r'^\w+'))

    # one or more word characters at the end of the text, followed by zero or
    # more non-whitespace characters; the \S* after \w+ covers trailing punctuation
    print(text_match('Tuffy eats pie, Loki eats peas!', r'\w+\S*$'))

    # a 'u' in the middle of a word (not at a word boundary on either side)
    print(text_match('Tuffy eats pie, Loki eats peas!', r'\Bu\B'))

Output:

Found a match!
Found a match!
Found a match!
Found a match!
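A small supplementary sketch (not from the original): since re.search returns a match object, you can inspect exactly which character `\Bu\B` matched and where it sits in the string:

```python
import re

# '\Bu\B' finds the 'u' inside 'Tuffy': the characters on both sides of it
# are word characters, so neither side is a word boundary.
m = re.search(r'\Bu\B', 'Tuffy eats pie, Loki eats peas!')
print(m.group(), m.start(), m.end())  # u 1 2
```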
  • String matching

    re.search(pattern,text) : test whether pattern occurs anywhere in text

    re.finditer(pattern,text) : iterate over every occurrence of pattern in text

    import re

    patterns = ['Tuffy', 'Pie', 'Loki']
    text = 'Tuffy eats pie, Loki eats peas!'

    # match each pattern against the text
    for pattern in patterns:
        print('Searching for "%s" in "%s" ->' % (pattern, text))
        if re.search(pattern, text):
            # for a case-insensitive search, pass flags=re.IGNORECASE
            print('Found!')
        else:
            print('Not Found!')

    # find a string and report where it occurs
    pattern = 'eats'
    for match in re.finditer(pattern, text):
        s = match.start()
        e = match.end()
        print('Found "%s" at %d:%d' % (text[s:e], s, e))

Output:

Searching for "Tuffy" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Searching for "Pie" in "Tuffy eats pie, Loki eats peas!" ->
Not Found!
Searching for "Loki" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Found "eats" at 6:10
Found "eats" at 21:25
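The comment in the loop mentions flags=re.IGNORECASE; a quick supplementary sketch (my addition) of the difference it makes for the 'Pie' pattern:

```python
import re

# 'Pie' is not found by default because matching is case-sensitive;
# with flags=re.IGNORECASE it matches the lowercase 'pie' in the text.
text = 'Tuffy eats pie, Loki eats peas!'
print(bool(re.search('Pie', text)))                       # False
print(bool(re.search('Pie', text, flags=re.IGNORECASE)))  # True
```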
  • Dates and character classes (or character ranges)

    \d : digit

    re.compile() : compiles a pattern string into a regex object

    everything inside square brackets [] is an OR over single characters

    import re

    url = 'http://www.awdawd.com/da/wda/2019/7/2/wda.html'

    # extract a YYYY/MM/DD date
    date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})'
    print('Data found in the URL :', re.findall(date_regex, url))

    # return False if the string contains a disallowed special character
    def is_allowed_specific_char(string):
        charRe = re.compile(r'[^a-zA-Z0-9.]')
        string = charRe.search(string)
        return not bool(string)

    print(is_allowed_specific_char('adIDHihdHDIh.'))
    print(is_allowed_specific_char('*#$%^&!{}'))

Output:

Data found in the URL : [('2019', '7', '2')]
True
False
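As a variation (my addition, not from the original), the same date pattern can use named groups, so the captured parts are read back by name rather than by position:

```python
import re

url = 'http://www.awdawd.com/da/wda/2019/7/2/wda.html'

# Same YYYY/MM/DD pattern, but with (?P<name>...) named groups.
date_regex = r'/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})'
m = re.search(date_regex, url)
print(m.group('year'), m.group('month'), m.group('day'))  # 2019 7 2
```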
  • Finding all five-letter words; replacing a word with its abbreviation

    import re

    # replace with an abbreviation
    street = '21 Ramkrishna Road'
    print(re.sub('Road', 'Rd', street))

    # find all words exactly five letters long
    text = 'Tuffy eats pie, Loki eats bread!'
    print(re.findall(r'\b\w{5}\b', text))

Output:

21 Ramkrishna Rd
['Tuffy', 'bread']
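A related sketch (not in the original): re.subn behaves like re.sub but also returns the number of substitutions it performed, which is handy when you need to know whether anything changed:

```python
import re

# re.subn returns a (new_string, substitution_count) tuple.
street = '21 Ramkrishna Road'
result, count = re.subn('Road', 'Rd', street)
print(result, count)  # 21 Ramkrishna Rd 1
```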
  • An RE-based tokenizer

    import re

    raw = 'I am big! It\'s the pictures that got small.'

    # split on one or more spaces
    print(re.split(r' +', raw))

    # split on runs of characters other than letters, digits, and underscores
    print(re.split(r'\W+', raw))

    # tokenize with findall: a word, or a non-space character followed by
    # optional word characters (this keeps punctuation as its own token)
    print(re.findall(r'\w+|\S\w*', raw))

Output:

['I', 'am', 'big!', "It's", 'the', 'pictures', 'that', 'got', 'small.']
['I', 'am', 'big', 'It', 's', 'the', 'pictures', 'that', 'got', 'small', '']
['I', 'am', 'big', '!', 'It', "'s", 'the', 'pictures', 'that', 'got', 'small', '.']
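Note the trailing empty string in the second output line: re.split(r'\W+', ...) produces it when the text ends in punctuation. A small sketch (my addition) of filtering it out:

```python
import re

raw = "I am big! It's the pictures that got small."

# Drop the empty strings that re.split leaves at the edges of the text.
tokens = [t for t in re.split(r'\W+', raw) if t]
print(tokens)
```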
  • An RE-based stemmer

    import re

    # a homemade stemmer: split each word into a stem and an optional suffix
    def stem(word):
        split = re.findall(r'^(.*?)(ing|ly|ed|ies|ive|es|s|ment)?$', word)
        stem = split[0][0]
        return stem

    # tokenize with the RE tokenizer from the previous section
    raw = 'Keep your friends close, but your enemies closer.'
    tokens = re.findall(r'\w+|\S\w*', raw)
    print(tokens)

    # test
    for t in tokens:
        print("'", stem(t), "'")

Output:

['Keep', 'your', 'friends', 'close', ',', 'but', 'your', 'enemies', 'closer', '.']
' Keep '
' your '
' friend '
' close '
' , '
' but '
' your '
' enem '
' closer '
' . '
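The output shows a weakness of the naive suffix list: 'enemies' stems to 'enem'. One possible tweak (my own sketch, not the author's) is to map the 'ies' suffix back to 'y':

```python
import re

# Variant stemmer: keep the same suffix list, but restore the 'y'
# that English spelling turned into 'ies'.
def stem(word):
    base, suffix = re.findall(r'^(.*?)(ing|ly|ed|ies|ive|es|s|ment)?$', word)[0]
    if suffix == 'ies':
        return base + 'y'
    return base

print(stem('enemies'))  # enemy
print(stem('friends'))  # friend
```

This is still far from a real stemmer (it would also turn 'pies' into 'py'); rule-based stemmers such as the Porter algorithm handle these cases with ordered, condition-guarded rules.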