When I use tesseract to recognize the table, I can't get any output, how can I recognize the table - ocr

Recently, I want to make a tools for Table Recognition. I have tried tesseract ocr, but I can't get any output, can anyone give me the answer?

Highly recommand paddleocr for table recognition! It can output textfile and excel file using just a few lines of code.
import os
import cv2
from paddleocr import PPStructure,save_structure_res
table_engine = PPStructure(layout=False, show_log=True, use_gpu=False)
save_folder = './output'
img_path = 'PaddleOCR_pub/ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
for line in result:
line.pop('img')
print(line
The output files are as follows, which can help you more.
you can experience it here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/quickstart_en.md#214-table-recognition

Related

Python: Creating PDF from PNG images and CSV tables using reportlab

I am trying to create a PDF document using a series of PDF images and a series of CSV tables using the python package reportlab. The tables are giving me a little bit of grief.
This is my code so far:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import *
from reportlab.platypus.tables import Table
from PIL import Image
from matplotlib.backends.backend_pdf import PdfPages
# Set the path to the folder containing the images and tables
folder_path = 'Files'
# Create a new PDF document
pdf_filename = 'testassessment.pdf'
canvas = Canvas(pdf_filename)
# Iterate through the files in the folder
for file in os.listdir(folder_path):
file_path = os.path.join(folder_path, file)
# If the file is an image, draw it on the PDF
if file.endswith('.png'):
canvas.drawImage(file_path, 105, 148.5, width=450, height=400)
canvas.showPage() #ends page
# If the file is a table, draw it on the PDF
elif file.endswith('.csv'):
df = pd.read_csv(file_path)
table = df.to_html()
canvas.drawString(10, 10, table)
canvas.showPage()
# Save the PDF
canvas.save()
The tables are not working. When I use .drawString it ends up looking like this:
Does anyone know how I can get the table to be properly inserted into the PDF?
According to the reportlab docs, page 14, "The draw string methods draw single lines of text on the canvas.". You might want to have a look at "The text object methods" on the same page.
You might want to consider using PyMuPDF with Stories it allows for more flexibility of layout from a data input. For an example of something very similar to what you are trying to achieve see: https://pymupdf.readthedocs.io/en/latest/recipes-stories.html#how-to-display-a-list-from-json-data

Detect language/script from pdf with python

I am trying to create a python script to detect the language(s)/script(s) inside a not yet OCRed pdf with the help of pytesseract before doing the 'real' ocr by passing the correct detected language(s)
I have like 10000 pdf's not always standard english and sometimes 1000 pages long. In order to do the real OCR I need to autodetect the language first.
So a sort of two step OCR as you will that tesseract both can preform
Detecting the language/script on some centered pages
Preforming the real OCR with the found language/script over all pages
Any tips to fix/improve this script? All I want is language(s) on the given pages detected returned.
#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz
pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2
with Image(filename=pdffilename + '[' + str(center_page - surround) + '-' + str(center_page + surround) + ']') as im:
print(pytesseract.image_to_osd(im, lang='osd',config='psm=0 pandas_config=None', nice =0, timeout=0))
I run the script as follows:
script_detect.py myunknown.pdf
I am getting the following error atm:
TypeError: Unsupported image object
Assuming that you have converted your pdf-file using some tool (OCR or other) you can use langdetect. Sample your text and feed it detect
from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)
```output fr````
or
from langdetect import detect
lang = detect("我是法国人")
print(lang)
output ch
There are other libraries, such as polyglot, useful if you have mixed languages.

How to get CSV to work in in RevitPythonShell?

Has anyone figured out how to get CSV (or any other package) to work in RevitPythonShell? I've only been able to get Excel from Interop to work.
When I try running csv in RPS, the terminal executes and shows no error or any kind of feed back, and the file is not created either.
This is the basic code I'm trying to run which comes from a tutorial on CSV I believe.
with open('mycsv2.csv', 'w') as f:
fieldnames = ['column1', 'column2', 'column3']
thewriter = csv.DictWriter(f, fieldnames=fieldnames)
thewriter.writeheader()
for i in range(1, 10):
thewriter.writerow({'column1':'one', 'column2':'two', 'column3':'three'})
I find CSV much more user friendly and easier to understand than Interop Excel. I believe I've read its doable somewhere but of course I cant find the source now.
All help, tips, or tricks are appreciated.
I can get it to work by supplying the full path name to the open function, so it looks like (showing full path to my Documents Folder):
import csv
with open(r'C:\Users\callum\Documents\mycsv2.csv', 'w') as f:
fieldnames = ['column1', 'column2', 'column3']
thewriter = csv.DictWriter(f, fieldnames=fieldnames)
thewriter.writeheader()
for i in range(1, 10):
thewriter.writerow({'column1':'one', 'column2':'two', 'column3':'three'})
Let me know if that does the trick!

Extracting text from plain HTML and write to new file

I'm extracting a certain part of a HTML document (to be fair: basis for this is an iXBRL document which means I do have a lot of written formatting code inside) and write my output, the original file without the extracted part, to a .txt file. My aim is to measure the difference in document size (how much KB of the original document refers to the extracted part). As far as I know there shouldn't be any difference in HTML to text format, so my difference should be reliable although I am comparing two different document formats. My code so far is:
import glob
import os
import contextlib
import re
#contextlib.contextmanager
def stdout2file(fname):
import sys
f = open(fname, 'w')
sys.stdout = f
yield
sys.stdout = sys.__stdout__
f.close()
def extractor():
os.chdir(r"F:\Test")
with stdout2file("FileShortened.txt"):
for file in glob.iglob('*.html', recursive=True):
with open(file) as f:
contents = f.read()
extract = re.compile(r'(This is the beginning of).*?Until the End', re.I | re.S)
cut = extract.sub('', contents)
print(file.split(os.path.sep)[-1], end="| ")
print(cut, end="\n")
extractor()
Note: I am NOT using BS4 or lxml because I am not only interested in HTML text but actually in ALL lines between my start and end-RegEx incl. all formatting code lines.
My code is working without problems, however as I have a lot of files my FileShortened.txt document is quickly going to be massive in size. My problem is not with the file or the extraction, but with redirecting my output to various txt-file. For now, I am getting everything into one file, what I would need is some kind of a "for each file searched, create new txt-file with the same name as the original document" condition (arcpy module?!)?
Somehting like:
File1.html --> File1Short.txt
File2.html --> File2Short.txt
...
Is there an easy way (without changing my code too much) to invert my code in the sense of printing the "RegEx Match" to a new .txt file instead of "everything except my RegEx match"?
Any help appreciated!
Ok, I figured it out.
Final Code is:
import glob
import os
import re
from os import path
def extractor():
os.chdir(r"F:\Test") # the directory containing my html
for file in glob.glob("*.html"): # iterates over all files in the directory ending in .html
with open(file) as f, open((file.rsplit(".", 1)[0]) + ".txt", "w") as out:
contents = f.read()
extract = re.compile(r'Start.*?End', re.I | re.S)
cut = extract.sub('', contents)
out.write(cut)
out.close()
extractor()

Search and Replace Text in CSV file using Python

I just started with Python 3.4.2 and trying to find and replace text in csv file.
In Details, Input.csv file contain below line:
0,0,0,13,.\New_Path-1.1.12\Impl\Appli\Library\Module_RM\Code\src\Exception.cpp
0,0,0,98,.\Old_Path-1.1.12\Impl\Appli\Library\Prof_bus\Code\src\Wrapper.cpp
0,0,0,26,.\New_Path-1.1.12\Impl\Support\Custom\Vital\Code\src\Interface.cpp
0,0,0,114,.\Old_Path-1.1.12\Impl\Support\Custom\Cust\Code\src\Config.cpp
I maintained my strings to be searched in other file named list.csv
Module_RM
Prof_bus
Vital
Cust
Now I need to go through each line of Input.csvand replace the last column with the matched string.
So my end result should be like this:
0,0,0,13,Module_RM
0,0,0,98,Prof_bus
0,0,0,26,Vital
0,0,0,114,Cust
I read the input files first line as a list. So text which i need to replace came in line[4]. I am reading each module name in the list.csv file and checking if there is any match of text in line[4]. I am not able to make that if condition true. Please let me know if it is not a proper search.
import csv
import re
with open("D:\\My_Python\\New_Python_Test\\Input.csv") as source, open("D:\\My_Python\\New_Python_Test\\List.csv") as module_names, open("D:\\My_Python\\New_Python_Test\\Final_File.csv","w",newline="") as result:
reader=csv.reader(source)
module=csv.reader(module_names)
writer=csv.writer(result)
#lines=source.readlines()
for line in reader:
for mod in module_names:
if any([mod in s for s in line]):
line.replace(reader[4],mod)
print ("YES")
writer.writerow("OUT")
print (mod)
module_names.seek(0)
lines=reader
Please guide me to complete this task.
Thanks for your support!
At-last i succeeded in solving this problem!
The below code works well!
import csv
with open("D:\\My_Python\\New_Python_Test\\Input.csv") as source, open("D:\\My_Python\\New_Python_Test\\List.csv") as module_names, open("D:\\My_Python\\New_Python_Test\\Final_File.csv","w",newline="") as result:
reader=csv.reader(source)
module=csv.reader(module_names)
writer=csv.writer(result)
flag=False
for row in reader:
i=row[4]
for s in module_names:
k=s.strip()
if i.find(k)!=-1 and flag==False:
row[4]=k
writer.writerow(row)
flag=True
module_names.seek(0)
flag=False
Thanks for people who tried to solve! If you have any better coding practices please do share!
Good Luck!