P.S. There is also a small bonus for readers:
Today I am offering 100 free document-parsing trial credits. Scan the QR code below to claim them!
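- Handles many kinds of scanned content: works well on images and scanned documents of all sorts, including phone photos and screenshots.
- Supports many languages: Simplified Chinese, Traditional Chinese, English, digits, and the major Western and Eastern European languages, 50+ languages in total.
- Strong table recognition: accurately recognizes tables in many formats, including bordered, borderless, and dense tables, and recognizes and restores all kinds of merged cells.
- Accurate reading-order recovery: understands and restores the document's structure and element layout so the reading order stays correct, including multi-column papers, annual reports, and business reports.
- In-house document tree engine: works from semantics, extracting paragraph embeddings and predicting heading-level relationships, then builds a document tree to improve retrieval recall.
TextIn ParseX is a standard, cross-platform Python SDK that helps developers parse the results returned by the pdf_to_markdown RESTful API and obtain the data structures of the corresponding layout elements. Developers only need to install the dependency from a terminal to start using it.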
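To make the document-tree idea concrete, here is a minimal sketch that nests Markdown headings into a tree by heading level. It is not TextIn's engine: the source describes predicting heading levels from paragraph embeddings, whereas this example simply assumes the levels are already present as `#` markers. The names `Node`, `build_heading_tree`, and `outline` are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One heading in the tree (hypothetical structure, not an SDK type)."""
    title: str
    level: int
    children: list[Node] = field(default_factory=list)


def build_heading_tree(markdown: str) -> Node:
    """Nest Markdown headings under a synthetic root, using a stack of open headings."""
    root = Node(title="ROOT", level=0)
    stack = [root]  # stack[-1] is the most recently opened heading
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            continue
        hashes = len(stripped) - len(stripped.lstrip("#"))
        title = stripped[hashes:].strip()
        if not title or hashes > 6:
            continue  # not a valid heading line
        # Close every open heading at the same or a deeper level.
        while stack[-1].level >= hashes:
            stack.pop()
        node = Node(title=title, level=hashes)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node: Node, depth: int = 0) -> list[str]:
    """Flatten the tree into indented outline lines (the root itself is skipped)."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.title)
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    sample = "# Report\n## Revenue\n### Q1\n## Risks\n# Appendix"
    print("\n".join(outline(build_heading_tree(sample))))
```

A tree like this lets a retrieval step return a whole section together with its parent headings instead of an isolated paragraph. Note that this sketch does not skip `#` lines inside code blocks, which a real implementation would have to handle.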
pip install TextInParseX
pip3 install TextInParseX -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
import TextInParseX as px

# Initialize the parser
app_id = "#############################"  # fill in your TextIn app_id and secret_code
secret_code = "#############################"
parseX_client = px.ParseXClient(app_id, secret_code)

pdf_file_path = "example.pdf"  # path to your local file
# Call the API through ParseX to get the parsed result object
result = parseX_client.begin_analyze_document_from_url(pdf_file_path)
import TextInParseX as px
import json

# Load a previously saved API response from disk
json_file = 'test_json/example.json'
with open(json_file, 'r') as fr:
    json_result = json.load(fr)

parseX_client = px.ParseXClient()
result = parseX_client.begin_analyze_document_from_json(json_result)
# Or pass the JSON file path directly
result = parseX_client.begin_analyze_document_from_file(json_file)
# Markdown for the whole document
print('Markdown:')
print(result.all_markdown)
print("\n")

print("All text in document:")
# For readability, print only characters 0-1000
parseX_client.print_all_elements(result.all_text, 0, 1000)
print("\n")

# All tables in the document
print(f"Total tables in document: {len(result.all_tables)}")
for index, table in enumerate(result.all_tables):
    print(f"Table {index}:")
    parseX_client.print_all_elements(table)
    print("\n")

# All paragraphs, with the position and text of each line
print(f"Total paragraphs in document: {len(result.all_paragraphs)}")
for p_idx, each_paragraph in enumerate(result.all_paragraphs):
    print(f"\n--- Paragraph {p_idx}/{len(result.all_paragraphs)} ---")
    print(f"Paragraph position: {each_paragraph.pos}")
    for l_idx, each_line in enumerate(each_paragraph.lines):
        print(f"  Line {l_idx}/{len(each_paragraph.lines)}")
        print(f"  Line positions: {each_line.pos}")
        print(f"  Line text: {each_line.text}")

# All images in the document
print(f"Total images in document: {len(result.all_images)}")
for index, image in enumerate(result.all_images):
    print(f"Image {index}:")
    parseX_client.print_all_elements(image)
    print("\n")

# The same images as OpenCV matrices
all_images_cv_mat = result.get_all_images_cv_mat()
print(f"Total images (as cv::Mat) in document: {len(all_images_cv_mat)}")
for idx, mat in enumerate(all_images_cv_mat):
    print(f"Image {idx} (cv::Mat) shape: {mat.shape}")
# Page indices refer to the pages of the PDF or document and follow page numbering, starting at 1;
# indices of layout elements such as tables follow the program's enumeration order, starting at 0.
for page in result.pages:
    print(f"=== Page {page.page_id} ===")
    print("\n")

    for index, table in enumerate(page.tables):
        print(f"Table {index}:")
        parseX_client.print_all_elements(table)
        print("\n")

    for index, image in enumerate(page.images):
        print(f"Image {index}:")
        parseX_client.print_all_elements(image)
        print("\n")

    images_cv_mat = page.get_images_cv_mat()
    print(f"Total images (as cv::Mat) in page {page.page_id}: {len(images_cv_mat)}")
    for idx, mat in enumerate(images_cv_mat):
        print(f"Image {idx} (cv::Mat) shape: {mat.shape}")
    print("\n")

    print("Text:")
    # Print only the first 1000 characters
    parseX_client.print_all_elements(page.paragraph_text, 0, 1000)
    print("\n")

    # Paragraphs on the current page
    print(f"Total paragraphs: {len(page.paragraphs)}")
    for p_idx, each_paragraph in enumerate(page.paragraphs):
        print(f"\n--- Paragraph {p_idx}/{len(page.paragraphs)} ---")
        print(f"Paragraph position: {each_paragraph.pos}")
        for l_idx, each_line in enumerate(each_paragraph.lines):
            print(f"  Line {l_idx}/{len(each_paragraph.lines)}")
            print(f"  Line positions: {each_line.pos}")
            print(f"  Line text: {each_line.text}")
    print('Finished getting paragraphs')
    print("\n\n")
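The indexing convention in the comment above (pages numbered from 1, layout elements enumerated from 0) is easy to mix up when building locators for later lookup. The following is a small sketch of that bookkeeping. It does not call the SDK: the `SimpleNamespace` objects are stand-ins that only mimic the `page_id` and `tables` attributes used in the snippet above, and `locate_tables` is a hypothetical helper.

```python
from types import SimpleNamespace


def locate_tables(pages):
    """Return (page_id, table_index) pairs.

    page_id is taken as reported by the page object (1-based per the note above);
    table_index is the 0-based position from enumerate().
    """
    return [
        (page.page_id, index)
        for page in pages
        for index, _table in enumerate(page.tables)
    ]


# Stand-in pages for illustration only; real pages would come from result.pages.
pages = [
    SimpleNamespace(page_id=1, tables=["t0", "t1"]),
    SimpleNamespace(page_id=2, tables=[]),
    SimpleNamespace(page_id=3, tables=["t0"]),
]

print(locate_tables(pages))  # [(1, 0), (1, 1), (3, 0)]
```

Keeping the page number and the element index as a pair avoids off-by-one mistakes when a locator is shown to users (who think in page numbers) and also used to index a Python list.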
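- Preview and render mainstream image formats and PDF files, with zoom and rotation.
- Render the Markdown result, including headings at every level, images, and formulas.
- Extract and display each kind of parsed element: view tables, formulas, images, and the raw JSON result.
- Trace parsed elements back to their position in the document: each element is outlined with a box on the original, and you can click a box to jump to the parsed result or click a parsed result to jump to its box.
- Restore and display the table-of-contents tree at every level, with click-to-jump to the corresponding section.
- Configure API call parameters: try different parameter combinations and get the corresponding parsing results.
- Copy and export Markdown files.
- Copy parsed tables and images, which can be pasted directly into an Excel spreadsheet.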
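The last item mentions pasting parsed tables into Excel. Spreadsheet applications accept tab-separated text on paste, so one way to get a similar result from Markdown output is to convert a pipe table to TSV. This is a sketch under assumptions, not how the tool itself does it: it handles only simple pipe tables (no escaped `|` inside cells, no merged cells), and `markdown_table_to_tsv` is a hypothetical helper.

```python
import re


def markdown_table_to_tsv(table_md: str) -> str:
    """Convert a simple Markdown pipe table to tab-separated text."""
    rows = []
    for line in table_md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        # Drop exactly one leading and one trailing pipe so empty cells survive.
        line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header separator row, e.g. | --- | :---: |
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue
        rows.append("\t".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    sample = """
| Name  | Qty  |
| ----- | ---: |
| apple | 3    |
| pear  |      |
"""
    print(markdown_table_to_tsv(sample))
```

The resulting text can be placed on the clipboard or written to a `.tsv` file. Tables with merged cells would need a richer representation than this flat conversion.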