Original link: http://www.one2know.cn/nlp2/
Why we do this
Convert the collected data into a uniform format to obtain normalized, structured data.
String operations
namesList = ['Tuffy','Ali','Nysha','Tim']
sentence = 'My dog sleeps on sofa'
names = ';'.join(namesList) # join all items into a single string, separated by ';'
print(type(names),':',names)
wordList = sentence.split(' ') # split one string into a list of strings at each ' '
print(type(wordList),':',wordList)
print('a'+'a'+'a')
print('b' * 3)
text = 'Python NLTK' # named text rather than str, which would shadow the built-in
print(text[1])
print(text[-3])
Output:
<class 'str'> : Tuffy;Ali;Nysha;Tim
<class 'list'> : ['My', 'dog', 'sleeps', 'on', 'sofa']
aaa
bbb
y
L
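Note that `split(' ')` treats every single space as a separator, so repeated spaces produce empty strings. A small sketch of how this differs from `split()` with no argument, which splits on any run of whitespace (the irregularly spaced sentence is made up for illustration):

```python
# Hypothetical input with irregular spacing, not from the original post
sentence = 'My  dog sleeps   on sofa'

# split(' ') treats every single space as a separator,
# so consecutive spaces produce empty strings
by_space = sentence.split(' ')

# split() with no argument splits on runs of whitespace
# and never yields empty strings
by_whitespace = sentence.split()

print(by_space)
print(by_whitespace)

# join() is the inverse operation: it rebuilds one string from the list
normalized = ' '.join(by_whitespace)
print(normalized)
```

For text scraped from PDFs or web pages, where spacing is often uneven, the no-argument form is usually the safer default.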
String operations in depth
text = 'NLTK Dolly Python'
print(text[:4])
print(text[11:])
print(text[5:10])
print(text[-12:-7])
if 'NLTK' in text:
    print('found NLTK')
replaced = text.replace('Dolly','Dorothy')
print('Replaced String:',replaced)
for s in replaced:
    print(s,end='/')
Output:
NLTK
Python
Dolly
Dolly
found NLTK
Replaced String: NLTK Dorothy Python
N/L/T/K/ /D/o/r/o/t/h/y/ /P/y/t/h/o/n/
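The `[5:10]` and `[-12:-7]` slices both print `Dolly` because a negative index counts back from the end of the string. A short sketch of that equivalence, using the same sample string (the variable names here are chosen for illustration):

```python
text = 'NLTK Dolly Python'
n = len(text)

# A negative index i refers to position n + i
start, stop = -12, -7
positive_start, positive_stop = n + start, n + stop

print(positive_start, positive_stop)
print(text[start:stop])
print(text[positive_start:positive_stop])

# The same rule applies to single characters
print(text[-1] == text[n - 1])
```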
Reading a PDF with Python
from PyPDF2 import PdfFileReader
# Note: PdfFileReader, getNumPages, getPage and extractText belong to the older PyPDF2 API;
# PyPDF2 3.0 and later use PdfReader, reader.pages and page.extract_text() instead
def get_text_pdf(pdf_filename,password=''):
    with open(pdf_filename,'rb') as pdf_file:
        read_pdf = PdfFileReader(pdf_file)
        # If a password was given, use it to decrypt the document
        if password != '':
            read_pdf.decrypt(password)
        # Read the text: build a list of strings, one per page
        text = []
        for i in range(0,read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    return '\n'.join(text)
if __name__ == "__main__":
    pdfFile = 'sample-one-line.pdf'
    pdfFileEncrypted = 'sample-one-line.protected.pdf'
    print('PDF 1:\n',get_text_pdf(pdfFile))
    print('PDF 2:\n',get_text_pdf(pdfFileEncrypted,'tuffy'))
Output:
PDF 1:
This is a sample PDF document I am using to demonstrate in the tutorial.
PDF 2:
This is a sample PDF document
password protected.
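Reading Word documents with Python
Each document contains multiple paragraphs, and each paragraph contains multiple Run objects. A Run marks a change in formatting: font, size, color, or other style elements (underline, bold, italic, and so on). Each time one of these changes, a new Run object is created.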
import docx # provided by the python-docx package
def get_text_word(word_filename):
    doc = docx.Document(word_filename)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)
if __name__ == "__main__":
    docFile = 'sample-one-line.docx'
    print('Document in full :\n',get_text_word(docFile))
    # Other features
    doc = docx.Document(docFile)
    print('Number of paragraphs:',len(doc.paragraphs))
    print('Text of paragraph 2:',doc.paragraphs[1].text)
    print('Style of paragraph 2:',doc.paragraphs[1].style)
    # Print all the run objects of the first paragraph;
    # the runs reflect where the text formatting changes
    print('Paragraph 1:',doc.paragraphs[0].text)
    print('Number of runs in paragraph 1 :',len(doc.paragraphs[0].runs))
    for idx,run in enumerate(doc.paragraphs[0].runs):
        print('Run %s : %s' % (idx,run.text))
    # Check the formatting of individual runs: underline, bold, italic
    print('is Run 5 underlined:',doc.paragraphs[0].runs[5].underline)
    print('is Run 1 bold:',doc.paragraphs[0].runs[1].bold)
    print('is Run 3 italic',doc.paragraphs[0].runs[3].italic)
Output:
Document in full :
This is a sample PDF document with some text in BOLD, some in ITALIC and some underlined. We are also embedding a Title down below.
This is my TITLE.
This is my third paragraph.
Number of paragraphs: 3
Text of paragraph 2: This is my TITLE.
Style of paragraph 2: _ParagraphStyle('Title') id: 2046137402144
Paragraph 1: This is a sample PDF document with some text in BOLD, some in ITALIC and some underlined. We are also embedding a Title down below.
Number of runs in paragraph 1 : 8
Run 0 : This is a sample PDF document with
Run 1 : some text in BOLD
Run 2 : ,
Run 3 : some in ITALIC
Run 4 : and
Run 5 : some underlined.
Run 6 : We are also embedding a Title down below
Run 7 : .
is Run 5 underlined: True
is Run 1 bold: True
is Run 3 italic True
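The run listing above shows the paragraph being cut wherever the formatting changes. A minimal pure-Python sketch of that idea follows. It uses a made-up list of (character, style) pairs, which is only an illustration and not the python-docx data model:

```python
from itertools import groupby

# Hypothetical stand-in for a formatted paragraph: one (character, style)
# pair per character. Illustration only, not how python-docx stores text.
chars = (
    [(c, 'normal') for c in 'plain '] +
    [(c, 'bold') for c in 'BOLD'] +
    [(c, 'normal') for c in ', then '] +
    [(c, 'italic') for c in 'ITALIC']
)

# Start a new run whenever the style changes
runs = [
    (style, ''.join(c for c, _ in group))
    for style, group in groupby(chars, key=lambda pair: pair[1])
]

for idx, (style, text) in enumerate(runs):
    print('Run %s [%s]: %r' % (idx, style, text))

# Concatenating the run texts gives back the paragraph text
paragraph_text = ''.join(text for _, text in runs)
print(paragraph_text)
```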
Creating a custom corpus
Built from txt, PDF and Word files:
# pdf and word are assumed to be local modules (pdf.py, word.py)
# holding get_text_pdf and get_text_word from the sections above
import pdf,word
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
def get_text(text_filename):
    with open(text_filename,'r') as file: # read-only
        return file.read() # file contents as a string
newCorpusDir = 'mycorpus/'
if not os.path.isdir(newCorpusDir):
    os.mkdir(newCorpusDir)
txt1 = get_text('sample_feed.txt')
txt2 = pdf.get_text_pdf('sample-pdf.pdf')
txt3 = word.get_text_word('sample-one-line.docx')
files = [txt1,txt2,txt3]
for idx,f in enumerate(files):
    with open(newCorpusDir+str(idx) + '.txt','w') as fout:
        fout.write(f)
newCorpus = PlaintextCorpusReader(newCorpusDir,'.*')
print(newCorpus.words()) # print all the words in the corpus
print(newCorpus.sents(newCorpus.fileids()[1])) # print the sentences in 1.txt
print(newCorpus.paras(newCorpus.fileids()[0])) # print the paragraphs in 0.txt
Output:
['i', 'want', 'to', 'eat', 'dinner', 'i', 'want', 'to', ...]
[['A', 'generic', 'NLP'], ['(', 'Natural', 'Language', 'Processing', ')', 'toolset'], ...]
[[['i', 'want', 'to', 'eat', 'dinner']], [['i', 'want', 'to', 'run']]]
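The file-writing step relies on the platform's default text encoding. `PlaintextCorpusReader` decodes files as UTF-8 unless told otherwise, so writing with an explicit `encoding='utf-8'` keeps both sides consistent. A sketch of that step using only the standard library (the placeholder texts and the temporary directory are assumptions made so the example runs on its own):

```python
import tempfile
from pathlib import Path

# Placeholder texts standing in for the strings read from txt, PDF and Word
texts = ['i want to eat dinner', 'A generic NLP toolset', 'naïve café']

# A temporary directory keeps the sketch self-contained;
# the code above writes to 'mycorpus/' instead
corpus_dir = Path(tempfile.mkdtemp()) / 'mycorpus'
corpus_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

for idx, text in enumerate(texts):
    # An explicit encoding avoids depending on the platform default
    (corpus_dir / f'{idx}.txt').write_text(text, encoding='utf-8')

file_ids = sorted(p.name for p in corpus_dir.glob('*.txt'))
print(file_ids)
```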
Reading the contents of an RSS feed
RSS = Rich Site Summary
Using the Mashable feed as an example (url=http://feeds.mashable.com/Mashable):
import feedparser
myFeed = feedparser.parse('http://feeds.mashable.com/Mashable')
print('Feed Title :',myFeed['feed']['title'])
print('Number of posts :',len(myFeed.entries)) # entries is a list of all posts
post = myFeed.entries[0]
print('Post Title :',post.title)
content = post.content[0].value
print('Raw content :\n',content)
with open('sample-html.html','w') as fout:
    fout.write(content)
Output:
Feed Title : Mashable
Number of posts : 30
Post Title : Revolut launches new, effortless way to donate to charities
Raw content :
<img alt="" src="https://mondrian.mashable.com/uploads%252Fcard%252Fimage%252F1007924%252F37167fff-e81c-446d-849a-37d0b625b7a7.jpg%252F575x323__filters%253Aquality%252880%2529.jpg?signature=ReYfFvy3gpD2t0oTt7_Z4kd7NQo=&source=https%3A%2F%2Fblueprint-api-production.s3.amazonaws.com" /><div style="float: right; width: 50px;"><a href="https://twitter.com/share?via=Mashable&text=Revolut+launches+new%2C+effortless+way+to+donate+to+charities&url=https%3A%2F%2Fmashable.com%2Farticle%2Frevolut-donations" style="margin: 10px;"><img alt="Twitter" border="0" src="https://a.amz.mshcdn.com/assets/feed-tw-e71baf64f2ec58d01cd28f4e9ef6b2ce0370b42fbd965068e9e7b58be198fb13.jpg" /></a><a href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fmashable.com%2Farticle%2Frevolut-donations&src=sp" style="margin: 10px;"><img alt="Facebook" border="0" src="https://a.amz.mshcdn.com/assets/feed-fb-8e3bd31e201ea65385a524ef67519d031e6851071807055648790d6a4ca77139.jpg" /></a></div><p><a href="https://www.revolut.com/">Revolut</a> is a UK-based financial services company that offers clients a bank account and a pre-paid card, with many of its services free or incurring a lower fee than you'd get from a typical bank. It's now also offering a new feature that makes it really easy to donate to charities — every time you make a payment. </p>
<p>The feature, called Donations, lets you round up your Revolut card payments and donate the spare change to a charity of your choice. The service is kicking off with three charities: <a href="https://www.ilga-europe.org/">ILGA-Europe</a>, <a href="https://www.savethechildren.net/">Save the Children</a> and <a href="https://www.worldwildlife.org/">WWF</a>. </p>
<div><p>SEE ALSO: <a href="http://mashable.com/article/instagram-stories-donation-sticker-causes?utm_campaign&utm_cid=a-seealso&utm_context=textlink&utm_medium=rss&utm_source">You can now donate through stickers in Instagram Stories</a> <a href="https://mashable.com/article/revolut-donations">Read more...</a></p></div>More about <a href="https://mashable.com/category/donations/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Donations</a>, <a href="https://mashable.com/category/revolut/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Revolut</a>, <a href="https://mashable.com/tech/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Tech</a>, and <a href="https://mashable.com/category/big-tech-companies/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Big Tech Companies</a><img src="http://feeds.feedburner.com/~r/Mashable/~4/s9f4V3jFdyg" height="1" width="1" alt=""/>
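The expression `post.content[0].value` assumes the entry carries a `content` element, which not every feed provides. A defensive sketch follows, relying on the fact that feedparser entries are dict-like. The helper name `entry_html` and the sample dicts are hypothetical:

```python
def entry_html(entry):
    """Return the HTML body of a feed entry, or '' if none is present.

    `entry` is assumed to be dict-like, as feedparser entries are.
    """
    content = entry.get('content')
    if content:
        # 'content' is a list of dict-like parts; take the first one
        return content[0].get('value', '')
    # Fall back to the summary when there is no full content
    return entry.get('summary', '')


# Plain dicts standing in for feedparser entries (made-up sample data)
with_content = {'title': 'A', 'content': [{'value': '<p>full text</p>'}]}
summary_only = {'title': 'B', 'summary': '<p>short text</p>'}
empty = {'title': 'C'}

for e in (with_content, summary_only, empty):
    print(e['title'], '->', repr(entry_html(e)))
```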
Parsing HTML with BeautifulSoup
BeautifulSoup can parse any HTML or XML content.
The HTML to parse:
Google might finally deliver a viable version of AirDrop for Android phones.
The company is testing a new Android feature called "Fast Share" that would allow phone owners to wirelessly transmit photos, text, and other files to nearby devices using Bluetooth. The currently unreleased feature was uncovered by two separate publications, 9to5Google and XDA Developers.
According to screenshots posted by the publications, Fast Share allows you to share photos, text, and URLs with devices that are nearby even if you don't have an internet connection. Interestingly, the list of devices in the screenshots includes an iPhone as well as a Chromebook and Pixel 3 phone, suggesting the intention is for Fast Share to enable cross-platform sharing. Read more...
Parsing code:
from bs4 import BeautifulSoup
# Pass the HTML file to the BeautifulSoup object as a str
html_doc = open('sample-html.html','r').read()
soup = BeautifulSoup(html_doc,'html.parser')
# Strip the tags and get the text
print('Full text HTML Stripped:')
print(soup.get_text())
# Get the first occurrence of a given tag
print('Accessing the <img> tag :',end=' ')
print(soup.img)
# Get the text of the first occurrence of a given tag
print('Accessing the text of <p> tag :',end=' ')
print(soup.p.string)
# Access an attribute of the first occurrence of a given tag
print('Accessing property of <img> tag :',end=' ')
print(soup.img['src'])
# Get the text of every occurrence of a given tag
print('Accessing all occurences of the <p> tag :')
for p in soup.find_all('p'):
    print(p.string)
Output:
Full text HTML Stripped:
Google might finally deliver a viable version of AirDrop for Android phones.
The company is testing a new Android feature called "Fast Share" that would allow phone owners to wirelessly transmit photos, text, and other files to nearby devices using Bluetooth. The currently unreleased feature was uncovered by two separate publications, 9to5Google and XDA Developers.
According to screenshots posted by the publications, Fast Share allows you to share photos, text, and URLs with devices that are nearby even if you don't have an internet connection. Interestingly, the list of devices in the screenshots includes an iPhone as well as a Chromebook and Pixel 3 phone, suggesting the intention is for Fast Share to enable cross-platform sharing. Read more...More about Tech, Google, Airdrop, Android Q, and Tech
Accessing the <img> tag : <img alt="" src="https://mondrian.mashable.com/uploads%252Fcard%252Fimage%252F1008631%252F256dd624-5852-4df0-81b3-e686a3ac5fd2.jpg%252F575x323__filters%253Aquality%252880%2529.jpg?signature=o6SwiPnemiiF5QUbmAb8lh89GJw=&source=https%3A%2F%2Fblueprint-api-production.s3.amazonaws.com"/>
Accessing the text of <p> tag : Google might finally deliver a viable version of AirDrop for Android phones.
Accessing property of <img> tag : https://mondrian.mashable.com/uploads%252Fcard%252Fimage%252F1008631%252F256dd624-5852-4df0-81b3-e686a3ac5fd2.jpg%252F575x323__filters%253Aquality%252880%2529.jpg?signature=o6SwiPnemiiF5QUbmAb8lh89GJw=&source=https%3A%2F%2Fblueprint-api-production.s3.amazonaws.com
Accessing all occurences of the <p> tag :
Google might finally deliver a viable version of AirDrop for Android phones.
None
None
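The two `None` lines appear because `.string` only returns text when a tag has a single child, and the second and third `<p>` elements contain nested `<a>` tags. In BeautifulSoup, `p.get_text()` returns the combined text instead. The same idea can be sketched with only the standard library (the class name `ParagraphText` and the sample HTML are made up for illustration):

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of every <p> element, including text in nested tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._parts = None     # text pieces of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'p' and self._parts is not None:
            self.paragraphs.append(''.join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Keep text only while inside a <p>, whatever tags are nested in it
        if self._parts is not None:
            self._parts.append(data)


# Made-up HTML: the second <p> has a nested <a>, the case that yields None above
html_doc = '<p>Plain paragraph.</p><p>Text with a <a href="#">link</a> inside.</p>'

parser = ParagraphText()
parser.feed(html_doc)
parser.close()
print(parser.paragraphs)
```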