Original article: http://www.one2know.cn/nlp4/
* + ?
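* : zero or more occurrences
+ : one or more occurrences
? : zero or one occurrence
re.search() scans a string for a regular expression; it returns a match object (which is truthy) if the pattern occurs anywhere in the string, and None otherwise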
import re
def text_match(text,patterns):
    if re.search(patterns,text):
        return 'Found a match!'
    else:
        return 'Not matched!'
print(text_match('ac','ab?'))
print(text_match('abc','ab?'))
print(text_match('abbc','ab?'))
print(text_match('ac','ab*'))
print(text_match('abc','ab*'))
print(text_match('abbc','ab*'))
print(text_match('ac','ab+'))
print(text_match('abc','ab+'))
print(text_match('abbc','ab+'))
print(text_match('abbc','ab{2}'))
print(text_match('aabbbbc','ab{3,5}?'))
Output:
Found a match!
Found a match!
Found a match!
Found a match!
Found a match!
Found a match!
Not matched!
Found a match!
Found a match!
Found a match!
Found a match!
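The last call above uses the lazy form `{3,5}?`, but a yes/no helper cannot show what that changes. As a small sketch (the helper name `matched_text` is made up for illustration, it is not part of the original article), the following prints the text that was actually matched, which makes the difference between greedy and lazy quantifiers visible:

```python
import re

def matched_text(pattern, text):
    """Return the substring matched by pattern, or None if nothing matches."""
    m = re.search(pattern, text)   # a match object, or None
    return m.group() if m else None

greedy = matched_text(r'ab{3,5}', 'aabbbbc')    # takes as many b's as allowed
lazy = matched_text(r'ab{3,5}?', 'aabbbbc')     # takes as few b's as allowed
missing = matched_text(r'ab+', 'ac')            # no b after the a, so no match

print(greedy, lazy, missing)
```

Both quantifiers report "Found a match!" in the example above; they only differ in how much of the string the match covers.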
$ ^ .
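$ : end of the string
^ : start of the string
. : any character except a newline
\w : a word character (letter, digit, or underscore)
\s : a whitespace character
\S : a non-whitespace character
\b : a word boundary (the position between a \w character and a non-\w character or the edge of the string)
\B : any position that is not a word boundary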
import re
def text_match(text,patterns):
    if re.search(patterns,text):
        return 'Found a match!'
    else:
        return 'Not matched!'
# Raw strings (r'...') keep the backslashes in \w, \S and \B intact
print(text_match('abbc',r'^a.*c$'))
print(text_match('Tuffy eats pie, Loki eats peas!',r'^\w+'))
print(text_match('Tuffy eats pie, Loki eats peas!',r'\w+\S*$'))
print(text_match('Tuffy eats pie, Loki eats peas!',r'\Bu\B'))
Output:
Found a match!
Found a match!
Found a match!
Found a match!
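Since \b and \B match positions rather than characters, they are easiest to understand side by side. Here is a minimal sketch, using a made-up sample sentence, that counts how often "pie" appears as a whole word, as the start of a longer word, and as the end of a longer word:

```python
import re

text = 'pie piece magpie pie!'

whole = re.findall(r'\bpie\b', text)    # "pie" as a standalone word
prefix = re.findall(r'\bpie\B', text)   # "pie" at the start of a longer word
suffix = re.findall(r'\Bpie\b', text)   # "pie" at the end of a longer word

print(len(whole), len(prefix), len(suffix))
```

The punctuation in "pie!" still counts as a boundary, because "!" is not a \w character.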
String matching
re.search(pattern,text) : checks whether pattern occurs anywhere in text
re.finditer(pattern,text) : iterates over every non-overlapping match of pattern in text
import re
patterns = ['Tuffy','Pie','Loki']
text = 'Tuffy eats pie, Loki eats peas!'
for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern,text))
    if re.search(pattern,text):
        # For a case-insensitive search, pass flags=re.IGNORECASE
        print('Found!')
    else:
        print('Not Found!')
pattern = 'eats'
for match in re.finditer(pattern,text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d'%(text[s:e],s,e))
Output:
Searching for "Tuffy" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Searching for "Pie" in "Tuffy eats pie, Loki eats peas!" ->
Not Found!
Searching for "Loki" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Found "eats" at 6:10
Found "eats" at 21:25
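"Pie" is reported as not found only because the search is case-sensitive by default. A short sketch of the same search with the `re.IGNORECASE` flag, reusing the sentence from the example:

```python
import re

text = 'Tuffy eats pie, Loki eats peas!'

case_sensitive = re.search('Pie', text)
case_insensitive = re.search('Pie', text, flags=re.IGNORECASE)

print(case_sensitive)             # None: "Pie" never occurs with a capital P
print(case_insensitive.group())   # the text exactly as it appears in the string
print(case_insensitive.span())    # (start, end) offsets of the match
```

Note that `group()` returns the text as written in the input, not as written in the pattern.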
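Dates and character sets (or character ranges)
\d : a digit
re.compile() : compiles a pattern string into a reusable compiled pattern object (re.Pattern)
Any one of the characters listed inside square brackets [] may match (an OR relationship); a leading ^ inside the brackets negates the set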
import re
url = 'http://www.awdawd.com/da/wda/2019/7/2/wda.html'
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})'
print('Data found in the URL :',re.findall(date_regex,url))
def is_allowed_specific_char(string):
    # [^a-zA-Z0-9.] matches any character outside the allowed set
    charRe = re.compile(r'[^a-zA-Z0-9.]')
    match = charRe.search(string)
    # The string is allowed only if no forbidden character was found
    return match is None
print(is_allowed_specific_char('adIDHihdHDIh.'))
print(is_allowed_specific_char('*#$%^&!{}'))
Output:
Data found in the URL : [('2019', '7', '2')]
True
False
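The tuple of strings returned by re.findall() usually has to be converted before it is useful. One possible way to do that, sketched here with named groups and a hypothetical helper called `extract_date` (neither appears in the original article), is to turn the match into a `datetime.date`:

```python
import re
from datetime import date

# Same pattern as above, with named groups so the parts can be read by name.
DATE_RE = re.compile(r'/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})')

def extract_date(url):
    """Return the first /YYYY/M/D date in url as a date object, or None."""
    m = DATE_RE.search(url)
    if m is None:
        return None
    # date() raises ValueError for impossible values such as month 13.
    return date(int(m.group('year')), int(m.group('month')), int(m.group('day')))

found = extract_date('http://www.awdawd.com/da/wda/2019/7/2/wda.html')
print(found)
```

The regex only checks the shape of the date; whether the day and month are valid is left to `date()`.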
Finding all five-letter words; replacing words with abbreviations
import re
street = '21 Ramkrishna Road'
print(re.sub('Road','Rd',street))
text = 'Tuffy eats pie, Loki eats bread!'
print(re.findall(r'\b\w{5}\b',text))
Output:
21 Ramkrishna Rd
['Tuffy', 'bread']
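re.sub('Road','Rd',...) replaces the substring wherever it occurs, including inside longer words. A sketch of a slightly more careful version follows; the abbreviation table and the `abbreviate` helper are assumptions made up for this illustration. It combines \b with a replacement function so that several abbreviations can be applied in one pass:

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {'Road': 'Rd', 'Street': 'St', 'Avenue': 'Ave'}

# \b on both sides keeps words such as "Roadside" from being touched.
ABBREV_RE = re.compile(r'\b(' + '|'.join(map(re.escape, ABBREVIATIONS)) + r')\b')

def abbreviate(address):
    """Replace each whole word found in ABBREVIATIONS with its short form."""
    return ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], address)

print(abbreviate('21 Ramkrishna Road'))
print(abbreviate('5 Roadside Avenue'))
```

Passing a function as the replacement argument lets re.sub() look up each matched word individually.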
RE-based tokenizer
import re
raw = 'I am big! It\'s the pictures that got small.'
print(re.split(r' +',raw))
print(re.split(r'\W+',raw))
print(re.findall(r'\w+|\S\w*',raw))
Output:
['I', 'am', 'big!', "It's", 'the', 'pictures', 'that', 'got', 'small.']
['I', 'am', 'big', 'It', 's', 'the', 'pictures', 'that', 'got', 'small', '']
['I', 'am', 'big', '!', 'It', "'s", 'the', 'pictures', 'that', 'got', 'small', '.']
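All three patterns above treat "It's" differently, and none of them keeps it as one token while still splitting off punctuation. If that is the behavior wanted, one possible pattern (an assumption for illustration, not taken from the original article) allows an apostrophe inside a word:

```python
import re

raw = "I am big! It's the pictures that got small."

# A word, optionally followed by 'suffix parts, or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

tokens = TOKEN_RE.findall(raw)
print(tokens)
```

Which split is "right" depends on the downstream task; some pipelines prefer "It" and "'s" as separate tokens.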
RE-based stemmer
import re
def stem(word):
    # Lazily capture the stem, then an optional known suffix anchored at the end
    split = re.findall(r'^(.*?)(ing|ly|ed|ies|ive|es|s|ment)?$',word)
    return split[0][0]
raw = 'Keep your friends close, but your enemies closer.'
tokens = re.findall(r'\w+|\S\w*',raw)
print(tokens)
for t in tokens:
    print("'",stem(t),"'")
Output:
['Keep', 'your', 'friends', 'close', ',', 'but', 'your', 'enemies', 'closer', '.']
' Keep '
' your '
' friend '
' close '
' , '
' but '
' your '
' enem '
' closer '
' . '
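This stemmer strips a suffix no matter how little of the word is left, so a short word such as "is" is reduced to "i". A rough sketch of one way to limit that, assuming a made-up `min_stem` parameter that is not part of the original code, is to require a minimum stem length before a suffix may be removed:

```python
import re

SUFFIXES = r'(?:ing|ly|ed|ies|ive|es|s|ment)'

def stem(word, min_stem=3):
    """Strip one known suffix, but only if at least min_stem characters remain."""
    m = re.fullmatch(r'(.{%d,}?)%s' % (min_stem, SUFFIXES), word)
    return m.group(1) if m else word

for w in ['friends', 'enemies', 'is', 'closer']:
    print(w, '->', stem(w))
```

This is still a crude heuristic: a word like "this" has three characters before its final "s" and would still be cut to "thi".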