Building a Short-Video E-Commerce App from Scratch: Crawling, Parsing, and Manipulating HTML with Jsoup

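Contents

- Introduction
- How it works
- Dependency
- Basic example
- Features
  - Parsing and traversing a document
  - Input
    - Parsing a document from a string
    - Loading a document from a URL
    - Loading a document from a file
  - Data extraction
    - Navigating a document with DOM methods
    - Finding elements with CSS selectors
    - Finding elements and nodes with XPath selectors
    - Extracting attributes, text, and HTML from elements
  - Cleaning HTML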

Official site: https://jsoup.org/

Documentation: https://jsoup.org/cookbook/

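Introduction

jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS selectors, and XPath selectors.

jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.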

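Scrape: scrape and parse HTML from a URL, file, or string
Find: find and extract data using DOM traversal or CSS selectors
Manipulate: manipulate HTML elements, attributes, and text
Clean: clean user-submitted content against a safelist to prevent XSS attacks
Output: output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup; in every case jsoup will create a sensible parse tree.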

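How it works

jsoup exposes a jQuery-like API. It parses an HTML document into a DOM tree, and you then locate and manipulate elements in that tree with CSS-selector-style queries. This model lets developers extract data and manipulate the DOM in an intuitive way.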
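To make the jQuery comparison concrete, here is a minimal sketch of the parse, select, and manipulate flow. The class name, method names, and sample HTML are made up for this illustration; only the jsoup calls (`Jsoup.parse`, `select`, `eachText`, `addClass`, `attr`) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical class name, used only for this illustration.
class JqueryStyleSketch {

    // Parse the HTML into a DOM tree, then locate elements with a CSS selector.
    static String navLinkTexts(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div.nav a");
        return String.join(",", links.eachText());
    }

    // Chain operations over the whole match set, jQuery style, and
    // report how many links ended up marked.
    static int markAbsoluteLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href^=http]").addClass("external").attr("rel", "nofollow");
        return doc.select("a.external[rel=nofollow]").size();
    }

    public static void main(String[] args) {
        // Sample markup invented for the example.
        String html = "<div class=\"nav\"><a href=\"/home\">Home</a>"
                + "<a href=\"https://example.com/\">Partner</a></div>"
                + "<a href=\"/other\">Other</a>";
        System.out.println(navLinkTexts(html));
        System.out.println(markAbsoluteLinks(html));
    }
}
```

Methods such as `addClass` and `attr` on an `Elements` collection apply to every matched element and return the collection again, which is what makes the chained style possible.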

Dependency

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.17.1</version>
</dependency>

Basic example

Fetch the Wikipedia homepage, parse it into a DOM, and select the headlines from the "In the news" section into a list of elements:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * A simple example, used on the jsoup website.
 */
public class Wikipedia {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the page in one call
        Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
        log(doc.title());

        // Links in bold text inside the element with id "mp-itn"
        Elements newsHeadlines = doc.select("#mp-itn b a");
        for (Element headline : newsHeadlines) {
            // absUrl resolves the href against the page URL
            log("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
        }
    }

    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}

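XPath is the more powerful of the two query languages, while CSS selectors usually have a more concise syntax and tend to run somewhat faster.

Chrome extensions:

Ranorex

SelectorGadget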

Features

Parsing and traversing a document

Parsing an HTML document

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

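The parser makes every attempt to create a clean parse from the HTML you provide, whether or not that HTML is well-formed. It handles:

unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)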
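The first and third points can be checked with a small sketch. The class and method names below are hypothetical, and the input strings are deliberately malformed samples invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class name, used only for this illustration.
class LenientParseSketch {

    // Counts <p> elements after parsing; an unclosed <p> is closed
    // implicitly when the next <p> starts.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Reports whether the parser placed the <title> in the head or the body.
    static String titleLocation(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head().select("title").isEmpty() ? "body" : "head";
    }

    public static void main(String[] args) {
        // No <html>, <head>, or <body> tags, and no closing </p> tags.
        System.out.println(paragraphCount("<p>Lorem <p>Ipsum"));
        System.out.println(titleLocation("<title>Demo</title><p>Hello"));
    }
}
```

Even though neither input contains `html`, `head`, or `body` tags, `doc.head()` and `doc.body()` are available after parsing because jsoup builds the document shell itself.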

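The object model of a document

Documents consist of Elements and TextNodes (plus a few other miscellaneous nodes).
The inheritance chain is: Document extends Element extends Node. TextNode extends LeafNode extends Node.
An Element contains a list of child nodes and has one parent Element. It also provides a filtered list of child elements only.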
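The difference between child nodes and child elements is easiest to see on a paragraph that mixes text and markup. This is a small sketch with a made-up class name and sample HTML:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical class name, used only for this illustration.
class NodeModelSketch {

    // Returns "<child nodes>/<child elements>/<text nodes>" for the first <p>,
    // or "0/0/0" when the document has no <p>.
    static String describeFirstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.selectFirst("p");
        if (p == null) {
            return "0/0/0";
        }
        int textNodes = 0;
        for (Node child : p.childNodes()) {   // every child: text and elements
            if (child instanceof TextNode) {
                textNodes++;
            }
        }
        // children() is the filtered list that holds elements only
        return p.childNodeSize() + "/" + p.children().size() + "/" + textNodes;
    }

    public static void main(String[] args) {
        System.out.println(describeFirstParagraph("<p>An <b>example</b> link.</p>"));
    }
}
```

For the sample paragraph, the child nodes are the text "An ", the `b` element, and the text " link.", so only one of the three children is an element.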

Input

Parsing a document from a string

// Parse a full document from a string
String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// Parse a body fragment; the result is still a full Document
String fragment = "<div><p>Lorem ipsum.</p>";
Document fragmentDoc = Jsoup.parseBodyFragment(fragment);
Element body = fragmentDoc.body();

Loading a document from a URL

// Simple GET request
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

// POST with form data, a user agent, a cookie, and a 3-second timeout
Document postDoc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();
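Network calls can fail, so real crawling code usually wraps the fetch. The following is a rough sketch, not a recommended production setup: the class and method names are invented, and returning null on failure is just one possible policy. `Jsoup.connect` only builds the request; nothing is sent until `get()` or `post()` is called.

```java
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

// Hypothetical class name, used only for this illustration.
class FetchSketch {

    // Builds the request without sending it; no network traffic happens here.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .timeout(3000);   // milliseconds
    }

    // Sends a GET and returns the page title, or null when the fetch fails.
    static String fetchTitle(String url) {
        try {
            Document doc = prepare(url).get();
            return doc.title();
        } catch (HttpStatusException e) {
            // Thrown for non-2xx responses unless ignoreHttpErrors(true) is set
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
            return null;
        } catch (IOException e) {
            // Timeouts, DNS failures, unsupported content types, and so on
            System.err.println("Fetch failed: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://example.com/"));
    }
}
```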

Loading a document from a file

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
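The third argument is the base URI, which jsoup uses to resolve relative links in the file. Here is a minimal sketch of that effect; the class name, the temp-file handling, and the sample HTML are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class name, used only for this illustration.
class FileLoadSketch {

    // Writes the HTML to a temp file, parses the file with the given base URI,
    // and returns the first link resolved to an absolute URL ("" if no link).
    static String firstAbsoluteLink(String html, String baseUri) {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                Document doc = Jsoup.parse(tmp.toFile(), "UTF-8", baseUri);
                Element link = doc.selectFirst("a[href]");
                return link == null ? "" : link.absUrl("href");
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                firstAbsoluteLink("<a href=\"/about\">About</a>", "http://example.com/"));
    }
}
```

By contrast, `attr("href")` would return the raw value `/about` as written in the file.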

Data extraction

Navigating a document with DOM methods

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

// getElementById returns null when no element has that id
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Finding elements

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key)
Sibling elements: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)
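As a quick illustration of the sibling and graph methods, the sketch below walks outward from one list item. The class name, the `active` class, and the sample list are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this illustration.
class DomNavigationSketch {

    // Returns "<previous text>|<next text>|<parent id>|<sibling count>" for the
    // first element with class "active", or "" when there is none.
    static String describeActiveItem(String html) {
        Document doc = Jsoup.parse(html);
        Element active = doc.getElementsByClass("active").first();
        if (active == null) {
            return "";
        }
        Element prev = active.previousElementSibling();   // null at the start
        Element next = active.nextElementSibling();       // null at the end
        return (prev == null ? "-" : prev.text())
                + "|" + (next == null ? "-" : next.text())
                + "|" + active.parent().id()
                + "|" + active.siblingElements().size();  // excludes active itself
    }

    public static void main(String[] args) {
        String html = "<ul id=\"menu\"><li>One</li>"
                + "<li class=\"active\">Two</li><li>Three</li></ul>";
        System.out.println(describeActiveItem(html));
    }
}
```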

Element data

attr(String key) gets an attribute value
attributes() gets all attributes
id(), className(), and classNames()
text() gets the text content; text(String value) sets it
html() gets the inner HTML; html(String value) sets it
outerHtml() gets the outer HTML
data() gets data content (e.g. of script and style tags)
tag() and tagName()
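Here is a short sketch that exercises several of these accessors on one element. The class name, the `box` id, and the sample markup are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this illustration.
class ElementDataSketch {

    // Joins a few accessor results for the element with id "box" and the
    // first <script>; returns "" if either is missing.
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        Element box = doc.getElementById("box");
        Element script = doc.selectFirst("script");
        if (box == null || script == null) {
            return "";
        }
        return box.tagName()                    // element name
                + "|" + box.id()                // id attribute
                + "|" + box.className()         // raw class attribute
                + "|" + box.classNames().size() // class names as a set
                + "|" + box.attr("data-sku")    // any attribute by key
                + "|" + box.text()              // combined text of all descendants
                + "|" + script.data();          // script body is data, not text
    }

    public static void main(String[] args) {
        String html = "<div id=\"box\" class=\"card big\" data-sku=\"A1\">"
                + "Hi <b>there</b></div><script>var x = 1;</script>";
        System.out.println(summarize(html));
    }
}
```

Note that `text()` on a script element returns an empty string, which is why `data()` exists as a separate accessor.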

Finding elements with CSS selectors

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "https://example.com/");

// a elements that have an href attribute
Elements links = doc.select("a[href]");
// img elements whose src ends with .png
Elements pngs = doc.select("img[src$=.png]");

// first div with class=masthead (null if there is none)
Element masthead = doc.select("div.masthead").first();

// div elements that are direct children of h3.r
Elements resultDivs = doc.select("h3.r > div");
// a elements within resultDivs
Elements resultAs = resultDivs.select("a");

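Selector overview

tagname: find elements by tag, e.g. div
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements with the attribute, e.g. [href]
[^attrPrefix]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements with the attribute value, e.g. [width=500] (the value can also be quoted, as in [data-name='launch sequence'])
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
[attr~=regex]: elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: all elements, e.g. *
ns|tag: find elements by tag in a namespace prefix, e.g. fb|name finds <fb:name> elements
*|tag: find elements by tag in any namespace prefix, e.g. *|name finds <fb:name> and <name> elements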
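A quick way to get a feel for these selectors is to count matches against a small document. The helper below is a sketch with a made-up class name and sample markup; note that the backslash in the regex selector has to be doubled inside a Java string literal.

```java
import org.jsoup.Jsoup;

// Hypothetical class name, used only for this illustration.
class BasicSelectorSketch {

    // Parses the HTML and returns how many elements match the CSS query.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Sample markup invented for the example.
        String html = "<div id=\"logo\" class=\"masthead\">"
                + "<img src=\"a.png\" width=\"500\">"
                + "<img src=\"b.jpg\" data-id=\"7\">"
                + "<img src=\"c.gif\"></div>";
        String[] queries = {
            "img", "#logo", ".masthead", "[^data-]", "[width=500]",
            "img[src$=.png]", "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(html, query));
        }
    }
}
```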

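Selector combinations

el#id: elements with the ID, e.g. div#logo
el.class: elements with the class, e.g. div.masthead
el[attr]: elements with the attribute, e.g. a[href]
Any combination of the above, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements, and body > * finds the direct children of the body tag
siblingA + siblingB: finds sibling B elements immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X elements preceded by sibling A, e.g. h1 ~ p
el, el, el: groups multiple selectors and finds unique elements that match any of them, e.g. div.masthead, div.logo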
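The difference between the descendant, child, and sibling combinators shows up clearly when the same document is queried several ways. This is a sketch with an invented class name and sample markup; it returns the text of each match so the results are easy to compare.

```java
import org.jsoup.Jsoup;

import java.util.List;

// Hypothetical class name, used only for this illustration.
class CombinatorSketch {

    // Parses the HTML and returns the text of each element matching the query,
    // in document order.
    static List<String> texts(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        // Sample markup invented for the example.
        String html = "<div class=\"content\"><h1>Title</h1><p>one</p>"
                + "<section><p>two</p></section><p>three</p></div>"
                + "<div class=\"head\">head</div><div class=\"next\">next</div>";
        String[] queries = {
            "div.content p",      // any depth below div.content
            "div.content > p",    // direct children only
            "h1 + p",             // the p immediately after h1
            "h1 ~ p",             // every later sibling p of h1
            "div.head + div",
            "div.head, div.next"
        };
        for (String query : queries) {
            System.out.println(query + " -> " + texts(html, query));
        }
    }
}
```

The paragraph nested in the section is matched by `div.content p` but not by `div.content > p` or `h1 ~ p`, because it is neither a direct child of the div nor a sibling of the heading.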

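Pseudo selectors

:has(selector): finds elements that contain elements matching the selector, e.g. div:has(p)
:is(selector): finds elements that match any selector in the selector list, e.g. :is(h1, h2, h3, h4, h5, h6) finds any heading element
:not(selector): finds elements that do not match the selector, e.g. div:not(.logo)
:contains(text): finds elements that contain the given text. The search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): finds elements that directly contain the given text
:matches(regex): finds elements whose text matches the given regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): finds elements whose own text matches the given regular expression
:lt(n): finds elements whose sibling index (i.e. their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3)
:gt(n): finds elements whose sibling index is greater than n, e.g. div p:gt(2)
:eq(n): finds elements whose sibling index is equal to n, e.g. form input:eq(1)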
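The pseudo selectors can be tried out the same way. The sketch below uses a made-up class name and sample markup; it highlights the difference between :contains and :containsOwn, and the fact that sibling indexes start at 0.

```java
import org.jsoup.Jsoup;

// Hypothetical class name, used only for this illustration.
class PseudoSelectorSketch {

    // Parses the HTML and returns how many elements match the CSS query.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Sample markup invented for the example.
        String html = "<div><p>Login with jsoup</p></div>"
                + "<div class=\"logo\"><span>brand</span></div>"
                + "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>";
        String[] queries = {
            "div:has(p)",
            "div:not(.logo)",
            "p:contains(JSOUP)",        // case-insensitive text search
            "div:contains(login)",      // text anywhere below the div
            "div:containsOwn(login)",   // text directly inside the div only
            "div:matches((?i)login)",
            "td:lt(2)",                 // sibling indexes 0 and 1
            "td:gt(2)",                 // sibling index 3
            "td:eq(1)"                  // the second cell
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(html, query));
        }
    }
}
```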

Finding elements and nodes with XPath selectors

Document doc = Jsoup.connect("https://jsoup.org/").get();

// Each p element in div.col1
Elements elements = doc.selectXpath("//div[@class='col1']/p");

// Each TextNode in every a element
List<TextNode> textNodes = doc.selectXpath("//a/text()", TextNode.class);
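The same two queries can be tried offline against a parsed string, which avoids depending on the live page layout. This is a sketch with an invented class name and sample markup; `selectXpath` requires a jsoup version that includes XPath support, such as the 1.17.1 release used in this article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class XpathSketch {

    // Text of every <p> that is a direct child of a div with class "col1".
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        Elements paragraphs = doc.selectXpath("//div[@class='col1']/p");
        return paragraphs.eachText();
    }

    // Text of every text node that sits directly inside an <a> element.
    static List<String> linkTextNodes(String html) {
        Document doc = Jsoup.parse(html);
        List<TextNode> nodes = doc.selectXpath("//a/text()", TextNode.class);
        List<String> out = new ArrayList<>();
        for (TextNode node : nodes) {
            out.add(node.text());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample markup invented for the example.
        String html = "<div class=\"col1\"><p>One</p>"
                + "<p>Two <a href=\"/x\">link</a></p></div>"
                + "<div class=\"col2\"><p>Three</p></div>";
        System.out.println(paragraphTexts(html));
        System.out.println(linkTextNodes(html));
    }
}
```

The second overload takes a node class because XPath can select non-element nodes such as text nodes, which CSS selectors cannot return.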

Extracting attributes, text, and HTML from elements

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

Cleaning HTML

Suppose you want to let untrusted users supply HTML that is output on your website (for example, as comment submissions). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.

String unsafe =
  "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Safelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
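Besides cleaning, jsoup can also report whether input already conforms to a safelist, which is handy for rejecting a submission instead of silently rewriting it. Here is a minimal sketch; the class and method names are invented, and `Safelist.basic()` is just one of the predefined safelists.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical class name, used only for this illustration.
class CleanSketch {

    // Strips every tag and attribute that the basic safelist does not allow.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // True when the input contains only tags and attributes the safelist allows.
    static boolean isAlreadySafe(String untrustedHtml) {
        return Jsoup.isValid(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe =
            "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(sanitize(unsafe));
        System.out.println(isAlreadySafe(unsafe));
        System.out.println(isAlreadySafe("<p>Hello <b>world</b></p>"));
    }
}
```

Cleaning works on a parsed DOM rather than on raw string replacement, so the output is rebuilt from allowed nodes only.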
