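You have an HTML document that you want to extract data from, and you know how the document is structured.
Once the HTML has been parsed into a [Document](http://jsoup.org/apidocs/org/jsoup/nodes/Document.html),
you can work with it using DOM-like methods. Example code: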
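```java
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
// Parse with an explicit charset; the base URI is used to resolve relative links
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content"); // null if no element has this id
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
```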
Element objects provide a range of DOM-like methods for finding elements and for extracting and manipulating their data:
- [getElementById(String id)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#getElementById(java.lang.String))
- [getElementsByTag(String tag)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#getElementsByTag(java.lang.String))
- [getElementsByClass(String className)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#getElementsByClass(java.lang.String))
- [getElementsByAttribute(String key)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#getElementsByAttribute(java.lang.String)) (and related methods)
- [siblingElements()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#siblingElements()), [firstElementSibling()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#firstElementSibling()), [lastElementSibling()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#lastElementSibling()); [nextElementSibling()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#nextElementSibling()), [previousElementSibling()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#previousElementSibling())
- [parent()](http://jsoup.org/apidocs/org/jsoup/nodes/Node.html#parent()), [children()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#children()), [child(int index)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#child(int))
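
The lookup and traversal methods above can be tried on a small in-memory document. The following is a minimal sketch, not a definitive recipe: the class name `FindingElementsSketch`, the helper `content()`, and the sample markup are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class FindingElementsSketch {
    // Made-up sample markup for this illustration
    static final String HTML =
        "<div id='content'>"
      + "<p class='intro'>First</p>"
      + "<p><a href='/one'>One</a></p>"
      + "<p lang='en'>Last</p>"
      + "</div>";

    // Hypothetical helper: parse the sample and return the #content element
    static Element content() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementById("content");
    }

    public static void main(String[] args) {
        Element content = content();

        // Lookups search the element they are called on and its descendants
        Elements paragraphs = content.getElementsByTag("p");
        Elements intro = content.getElementsByClass("intro");
        Elements withLang = content.getElementsByAttribute("lang");
        System.out.println(paragraphs.size() + " " + intro.size() + " " + withLang.size());

        // Move between siblings, then up and down the tree
        Element first = content.child(0);
        Element second = first.nextElementSibling();
        Element last = first.lastElementSibling();
        System.out.println(second.text() + " / " + last.text());
        System.out.println(first.siblingElements().size());
        System.out.println(first.parent().id());
        System.out.println(content.children().size());
    }
}
```

Note that `previousElementSibling()` and `nextElementSibling()` return `null` at either end of the sibling list, so real code should check for that before chaining calls.
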
- [attr(String key)](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#attr(java.lang.String)) to get and [attr(String key, String value)](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#attr(java.lang.String,%20java.lang.String)) to set an attribute
- [attributes()](http://jsoup.org/apidocs/org/jsoup/nodes/TextNode.html#attributes()) to get all attributes
- [id()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#id()), [className()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#className()) and [classNames()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#classNames())
- [text()](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#text()) to get and [text(String value)](http://jsoup.org/apidocs/org/jsoup/nodes/TextNode.html#text(java.lang.String)) to set the text content
- [html()](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#html()) to get and [html(String value)](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#html(java.lang.String)) to set the inner HTML
- [outerHtml()](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#outerHtml()) to get the outer HTML
- [data()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#data()) to get data content (for example, of script and style tags)
- [tag()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#tag()) and [tagName()](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#tagName())
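
To make the getters and setters above concrete, here is a small sketch that reads and changes the data of a single element. The class name `ElementDataSketch`, the helper `parse()`, and the sample markup are assumptions made for this example only.

```java
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataSketch {
    // Made-up sample markup for this illustration
    static final String HTML =
        "<div id='box' class='note wide' title='old'>Hello <b>world</b></div>"
      + "<script>var x = 1;</script>";

    // Hypothetical helper: parse the sample markup
    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element box = doc.getElementById("box");

        // Attributes: read one, overwrite it, then count them all
        System.out.println(box.attr("title"));
        box.attr("title", "new");
        System.out.println(box.attributes().size());

        // Identity: id, raw class attribute, class names as a set, tag name
        Set<String> classes = box.classNames();
        System.out.println(box.id() + " | " + box.className() + " | " + classes);
        System.out.println(box.tagName() + " | " + box.tag().getName());

        // Content: text only, inner HTML, and HTML including the element itself
        System.out.println(box.text());
        System.out.println(box.html());
        System.out.println(box.outerHtml());

        // script and style bodies are data, not text, so data() is the getter to use
        Element script = doc.getElementsByTag("script").first();
        System.out.println(script.data());

        // text(String) replaces all children with a single text node
        box.text("Replaced");
        System.out.println(box.html());
    }
}
```
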
- [append(String html)](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#append(java.lang.String)), [prepend(String html)](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#prepend(java.lang.String))
- [appendText(String text)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#appendText(java.lang.String)), [prependText(String text)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#prependText(java.lang.String))
- [appendElement(String tagName)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#appendElement(java.lang.String)), [prependElement(String tagName)](http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#prependElement(java.lang.String))
- [html(String value)](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#html(java.lang.String))
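
The modification methods above can be combined to edit a tree in place. The sketch below is one possible illustration under stated assumptions: the class name `ManipulationSketch`, its helpers `edited()` and `replaced()`, and the markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulationSketch {
    // Hypothetical helper: builds a small tree from made-up markup and edits it in place
    static Element edited() {
        Document doc = Jsoup.parse("<div id='box'><p>Middle</p></div>");
        Element box = doc.getElementById("box");

        box.prepend("<p>First</p>");            // parsed as HTML, becomes the first child
        box.append("<p>Last</p>");              // parsed as HTML, becomes the last child
        box.prependElement("h2").text("Title"); // creates an empty element, then sets its text
        box.appendElement("span").text("tail");
        box.appendText(" & done");              // added as plain text, escaped on output
        return box;
    }

    // html(String) drops the current children and parses the replacement markup
    static Element replaced() {
        return edited().html("<em>fresh</em>");
    }

    public static void main(String[] args) {
        System.out.println(edited().outerHtml());
        System.out.println(replaced().outerHtml());
    }
}
```

The difference between `append` and `appendText` matters for untrusted input: the former parses its argument as markup, while the latter inserts it as a text node.
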