How to Extract the Data You Need from an HTML File

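Extracting data from an HTML file usually means parsing the HTML structure and reading the elements and attributes you need. Python's BeautifulSoup library is a powerful tool for working with HTML and XML files, and it makes this kind of extraction straightforward.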

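1. Problem background

We need to extract information from an HTML file that describes a person: name, date of birth, current age, major teams, hobbies, style, and position. We tried to extract the data with the Beautiful Soup library but ran into a problem: the detail values were not extracted correctly.

The sample HTML is as follows: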

<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>

The original Python code, which uses Beautiful Soup, is as follows:

def get_info(p_tags):
    """Returns brief information."""

    head_list = []
    detail_list = []
    # This works fine
    for head in p_tags.findAll('font', 'test-proof'):
        head_list.append(head.contents[0])

    # Some problem with this?
    for index in xrange(2, 30, 4):
        detail_list.append(p_tags.contents[index])


    return dict([(l, detail_list[head_list.index(l)]) for l in head_list])

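With this code, the list of labels is extracted correctly, but the list of details is not. The loop assumes that every row occupies exactly four child nodes of the p tag, and the "Major teams" row, whose values are wrapped in several span tags, breaks that assumption.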
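To see why a fixed stride breaks down, it can help to print the child nodes of the p tag together with their indices. The sketch below is a diagnostic aid and not part of the original code; it assumes the html.parser backend and uses a shortened copy of the sample markup (three rows, spans without the style attribute) to keep the output small.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the sample markup (an assumption, for brevity).
html_data = """
<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Major teams</font> <span>Japan,</span> <span>Jakarta,</span><br />
  <font class="test-proof">Also</font> bar<br />
</p>
"""

p_tag = BeautifulSoup(html_data, "html.parser").find("p", "foo-body")

# List every direct child of <p>: its index, whether it is a tag or a
# text node, and its raw content.
for index, node in enumerate(p_tag.contents):
    kind = node.name if isinstance(node, Tag) else "text"
    print(index, kind, repr(str(node)))
```

In this shortened sample, index 2 holds the text " Foobar", but index 6 holds only the whitespace between the "Major teams" label and its first span, so stepping by four no longer lands on the detail values.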

2. Solutions

We present three different ways to solve this problem.

Solution 1:

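This solution uses BeautifulSoup to parse the HTML. It calls find_all() to locate every font tag with the class test-proof; each of these is a label. For each label it walks next_siblings and collects the text of every node, whether a NavigableString or a nested tag such as span, until it reaches the next br tag. The label and the joined text are then stored as a key-value pair in a dictionary. Simply collecting every NavigableString in p_tags.contents would not line up with the labels, because the whitespace between rows is also a NavigableString and the "Major teams" values sit inside span tags.

from bs4 import BeautifulSoup, NavigableString

def get_info(p_tags):
    """Returns brief information as a {label: detail} dictionary."""

    info_dict = {}

    # Find all <font> tags with the class "test-proof"; each one is a label
    for head in p_tags.find_all('font', 'test-proof'):
        parts = []

        # Collect the text that follows the label, up to the next <br />
        for node in head.next_siblings:
            if getattr(node, 'name', None) == 'br':
                break
            # Plain strings are used directly; nested tags such as <span>
            # contribute their text content
            text = node if isinstance(node, NavigableString) else node.get_text()
            parts.append(text.strip())

        # Join the non-empty pieces into one detail string
        info_dict[head.get_text(strip=True)] = ' '.join(p for p in parts if p)

    return info_dict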

# Get the HTML data
html_data = """
<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>
"""

# Parse the HTML data
soup = BeautifulSoup(html_data, 'html.parser')

# Get the <p> tag with the class "foo-body"
p_tags = soup.find('p', 'foo-body')

# Get the information from the <p> tag
info = get_info(p_tags)

# Print the information
print(info)

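Solution 2:

This solution uses HTMLParser from the Python standard library (the html.parser module in Python 3). A subclass of HTMLParser keeps track of the current label while the document is parsed and stores each label and its detail text in the results dictionary.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.results = {}
        self.key = None        # label whose detail is currently being collected
        self.in_label = False  # True while inside a label <font> tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so convert it to a dict
        if tag == "font" and dict(attrs).get("class") == "test-proof":
            self.in_label = True
            self.key = ""
        elif tag == "br":
            # A <br /> ends the current row
            self.key = None

    def handle_endtag(self, tag):
        if tag == "font" and self.in_label:
            self.in_label = False
            self.key = self.key.strip()
            self.results[self.key] = ""

    def handle_data(self, data):
        data = data.strip()
        if not data or self.key is None:
            return

        if self.in_label:
            # Text inside the <font> tag is the label itself
            self.key += data
        else:
            # Text after the label (including text inside <span>) is the detail
            self.results[self.key] = (self.results[self.key] + " " + data).strip()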

# Get the HTML data
html_data = """
<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>
"""

# Create an instance of the HTML parser
parser = MyHTMLParser()

# Parse the HTML data
parser.feed(html_data)

# Print the results
print(parser.results)

Solution 3:

This solution uses regular expressions to pull the required data out of the HTML. It compiles a pattern with re.compile() and then uses re.findall() to find the strings that match it.

import re

# Get the HTML data
html_data = """
<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
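  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>
"""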
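The regular-expression approach described for this solution can be sketched as follows. This is a minimal illustration under stated assumptions, not the original author's code: the pattern names ROW_RE and TAG_RE and the function get_info are hypothetical, the sample is a shortened copy of the markup above, and the pattern assumes every row has exactly the form label in a font tag with class test-proof, then the detail, then a br tag. Regular expressions are fragile against markup changes, so a parser is usually the safer choice for less regular HTML.

```python
import re

# Shortened copy of the sample markup (an assumption, for brevity).
html_data = """
<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">foo style</font> hand <br />
  <br class="bar" />
</p>
"""

# One row: the label inside <font class="test-proof">, then the detail
# up to the next plain <br> or <br />.
ROW_RE = re.compile(
    r'<font class="test-proof">(.*?)</font>(.*?)<br\s*/?>',
    re.DOTALL,
)

# Matches any tag; used to drop nested markup such as <span ...>.
TAG_RE = re.compile(r"<[^>]+>")

def get_info(html):
    """Return a {label: detail} dictionary built from regex matches."""
    info = {}
    for label, detail in ROW_RE.findall(html):
        text = TAG_RE.sub("", detail)                 # strip nested tags
        info[label.strip()] = " ".join(text.split())  # collapse whitespace
    return info

info = get_info(html_data)
print(info)
```

Calling findall() on the compiled pattern is equivalent to passing the pattern to re.findall(); with two capture groups, each match is returned as a (label, detail) tuple.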

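With these approaches, we can reliably extract the required data from an HTML file for use in data analysis or automation tasks. Given a specific HTML file and extraction requirement, the same techniques can be adapted into more targeted code.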
