多种方法从pdf文件中提取表格数据

当前位置：点晴教程→知识管理交流 →『技术文档交流』

admin

2025年8月28日 1:38 本文热度 959

我们经常遇到一些发布的pdf文件，需要获取其中表格中的数据，比如如下的表格：

提取数据有多种方法，我们采用最简单的python来实现。

建立python项目，建立文件readpdf.py如下

import 
tabula
# 检查本地的java环境是否正确
tabula.environment_info()
#jpype.startJVM(jpype.getDefaultJVMPath())
# 从PDF文件中读取表格数据到DataFrame列表
dfs = tabula.read_pdf("D:\\ai-hos\\doc\\my.pdf", pages='all', 
multiple_tables=True,force_subprocess=True)


for i, df in enumerate(dfs):
# 将每个DataFrame保存为CSV文件
df.to_csv(f"table_{i}.csv", index=False)

引用支持库

pip install tabula-py

pip install jpype1 --no-cache-dir

应该确保本地的jdk安装正确

java -version

这样运行程序就可以解析pdf文件中的数据了。

Python提取PDF表格数据还可以使用以下几种方法：

1. 使用 pdfplumber 库

安装：pip install pdfplumber

示例代码：

python

import pdfplumber
import pandas as pd


def extract_pdf_tables(pdf_path, start_page, end_page):
    with pdfplumber.open(pdf_path) as pdf:
        all_dfs = []
        for page in pdf.pages[start_page-1:end_page]:
            tables = page.extract_tables()
            for table in tables:
                if table:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    all_dfs.append(df)
        combined_df = pd.concat(all_dfs, ignore_index=True)
        return combined_df


# 使用示例
pdf_path = "your_file.pdf"
start_page = 1 # 开始页码
end_page = 10 # 结束页码
data = extract_pdf_tables(pdf_path, start_page, end_page)
data.to_excel("output.xlsx", index=False)

2. 使用 camelot 库

安装：pip install camelot-py[cv]

示例代码：

python

import camelot


def extract_pdf_tables(pdf_path):
    tables = camelot.read_pdf(pdf_path)
    combined_df = pd.concat([table.df for table in tables], ignore_index=True)
    return combined_df


# 使用示例
pdf_path = "your_file.pdf"
data = extract_pdf_tables(pdf_path)
data.to_csv("output.csv", index=False)

3. 使用 tabula-py 库

安装：pip install tabula-py jpype1

示例代码：

python

import tabula


def extract_pdf_tables(pdf_path):
    dfs = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
    combined_df = pd.concat(dfs, ignore_index=True)
    return combined_df


# 使用示例
pdf_path = "your_file.pdf"
data = extract_pdf_tables(pdf_path)
data.to_excel("output.xlsx", index=False)

注意事项

文件路径：确保PDF文件路径正确，可使用绝对路径或相对路径。

表格格式：不同库对表格格式的兼容性不同，若提取结果不理想，可尝试更换库或调整参数。

性能优化：对于大型PDF文件，可分页处理或使用多线程提高效率。

该文章在 2025/8/28 16:17:25 编辑过

关键字查询

正在查询...

点晴ERP是一款针对中小制造业的专业生产管理软件系统,系统成熟度和易用性得到了国内大量中小企业的青睐。

点晴PMS码头管理系统主要针对港口码头集装箱与散货日常运作、调度、堆场、车队、财务费用、相关报表等业务管理，结合码头的业务特点，围绕调度、堆场作业而开发的。集技术的先进性、管理的有效性于一体，是物流码头及其他港口类企业的高效ERP管理信息系统。

点晴WMS仓储管理系统提供了货物产品管理,销售管理,采购管理,仓储管理,仓库管理,保质期管理,货位管理,库位管理,生产管理,WMS管理系统,标签打印,条形码,二维码管理,批号管理软件。

点晴免费OA是一款软件和通用服务都免费，不限功能、不限时间、不限用户的免费OA协同办公管理系统。