Convert and parse digitally created PDF file to tesseract tsv file directly from PDF source code - csv

I am trying to convert and parse a digitally created PDF file into a tesseract tsv file, so that I can use it as ground truth for testing tesseract OCR performance on my PDF to TIF to TSV pipeline.
Any idea how I can achieve this task, i.e. convert and parse a digitally created PDF file to a tesseract tsv file without OCR or anything?
So far, I can use packages such as fitz (PyMuPDF) or pdfminer to extract the text. PyMuPDF can give me sentences with their bounding box location in the PDF file (x_1,y_1,x_2,y_2). However, I don't see any way yet to turn that into tesseract-like tsv output. See https://blog.tomrochette.com/tesseract-tsv-format.
Any advice would be greatly appreciated. Thanks! :)
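One possible starting point, as a rough sketch rather than a tested solution: PyMuPDF's page.get_text("words") returns tuples of the form (x0, y0, x1, y1, word, block_no, line_no, word_no), which can be mapped onto the tesseract TSV columns. The function names below are made up for illustration. The sketch assumes that only word-level rows (level 5) are needed, that par_num can be fixed at 1 because PyMuPDF reports no paragraphs, that conf can be set to 100 for ground truth, and that the TIF is rendered at a known DPI so PDF points can be scaled to pixels.

```python
TSV_COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
               "word_num", "left", "top", "width", "height", "conf", "text"]


def words_to_tsv_rows(words, page_num=1, dpi=300):
    """Map PyMuPDF word tuples to tesseract-style word rows (level 5 only)."""
    scale = dpi / 72.0  # PDF points -> pixels; dpi is an assumption about the TIF
    rows = []
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        left, top = round(x0 * scale), round(y0 * scale)
        width, height = round((x1 - x0) * scale), round((y1 - y0) * scale)
        # PyMuPDF numbers blocks, lines and words from 0; tesseract from 1.
        # par_num = 1 and conf = 100 are placeholders, not extracted values.
        rows.append([5, page_num, block_no + 1, 1, line_no + 1, word_no + 1,
                     left, top, width, height, 100, text])
    return rows


def rows_to_tsv(rows):
    """Serialize the rows with a header line, tab-separated."""
    lines = ["\t".join(TSV_COLUMNS)]
    lines += ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


def pdf_to_tsv(pdf_path, dpi=300):
    """Read every page of the PDF and return one TSV string."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it
    rows = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            rows += words_to_tsv_rows(page.get_text("words"), page_index + 1, dpi)
    return rows_to_tsv(rows)
```

Tesseract also emits page, block, paragraph and line rows (levels 1 to 4), and it may group words into blocks and lines differently from PyMuPDF, so a comparison on word text and boxes is probably more reliable than one on the numbering columns.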

Related

How would I save a doc/docx/docm file into a directory or S3 bucket using PySpark

I am trying to save a data frame into a document, but it fails with the error below:
java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html
My code is below:
#f_data is my dataframe with data
f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
display(f_data)
Note that I can save files in CSV, text and JSON format, but is there any way to save a docx file using PySpark?
My question is: is there support for saving data in doc/docx format?
If not, is there any way to store the file, for example by writing a file stream object into a particular folder or S3 bucket?
In short: no, Spark does not support the DOCX format out of the box. You can still collect the data onto the driver node (e.g. as a pandas DataFrame) and work from there.
Long answer:
A document format like DOCX is meant for presenting information in small tables with style metadata. Spark focuses on processing large amounts of data at scale, and it does not support the DOCX format out of the box.
If you want to write DOCX files programmatically, you can:
Collect the data into a pandas DataFrame: pd_f_data = f_data.toPandas()
Import python package to create the DOCX document and save it into a stream. See question: Writing a Python Pandas DataFrame to Word document
Upload the stream to a S3 blob using for example boto: Can you upload to S3 using a stream rather than a local file?
Note: if your data has more than one hundred rows, ask the receivers how they are going to use the data. Use docx only for reporting, not as a file transfer format.
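The three steps above could be sketched roughly as follows. This assumes python-docx and boto3 are installed on the driver; the helper names (table_cells, docx_stream, upload_docx) and the bucket and key arguments are hypothetical, and the whole DataFrame is rendered as one plain Word table.

```python
import io


def table_cells(columns, rows):
    """Header plus data rows, all as strings (None becomes an empty cell)."""
    header = [str(c) for c in columns]
    body = [["" if v is None else str(v) for v in row] for row in rows]
    return [header] + body


def docx_stream(cells):
    """Render the cells as one Word table and return an in-memory stream."""
    from docx import Document  # python-docx, assumed to be installed
    doc = Document()
    table = doc.add_table(rows=len(cells), cols=len(cells[0]))
    for i, row in enumerate(cells):
        for j, value in enumerate(row):
            table.cell(i, j).text = value
    stream = io.BytesIO()
    doc.save(stream)
    stream.seek(0)  # rewind so the upload reads from the beginning
    return stream


def upload_docx(f_data, bucket, key):
    """Collect a Spark DataFrame on the driver and upload it as DOCX to S3."""
    import boto3  # assumed to be installed and configured with credentials
    pd_f_data = f_data.toPandas()
    cells = table_cells(pd_f_data.columns, pd_f_data.itertuples(index=False))
    boto3.client("s3").upload_fileobj(docx_stream(cells), bucket, key)
```

A call would look like upload_docx(f_data, "my-bucket", "reports/test.docx"), where both names are placeholders. Because toPandas() pulls every row onto the driver, this only makes sense for small result sets.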

How to convert a json file to binary image for semantic segmentation

I have some images and labeled them using the labelimg tool for Python. I have individual JSON files for all the images. Now I need to convert them into binary mask images. Can anybody help me with that?
TIA

Unable to extract csv file from mat file in octave

I need help extracting this particular file, marked with an arrow, in csv format while keeping the original structure. Please help me out with the proper lines of code.
Thank You.
Octave has a save command that can save a matrix in a number of different formats. You can try a binary format, or save as HDF5, which both R and Python can read.
https://octave.org/doc/v4.0.0/Simple-File-I_002fO.html

Reading Arabic text from a text file and saving the output in JSON

I performed OCR on images to extract Arabic content. I stored the output in a text file using
f=open(filename,'w',encoding='utf-8')
f.write(text)
f.close()
The output in the txt file is readable. But when I read the txt file using
file=open(filename,'r',encoding='utf-8')
json[name]=file.read()
I get this weird encoding that I couldn't solve.
It turned out that the problem came from json: when calling dump, I set ensure_ascii=False and the text was kept as it is.
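A small illustration of that behaviour, using only the standard library (the sample text and the file name are made up): by default json escapes every non-ASCII character as \uXXXX, which is still valid JSON but unreadable, while ensure_ascii=False writes the characters as they are.

```python
import json
import os
import tempfile

record = {"page_1": "مرحبا بالعالم"}  # sample Arabic text, for illustration only

# Default: non-ASCII characters are written as \uXXXX escape sequences.
escaped = json.dumps(record)

# With ensure_ascii=False the Arabic characters are kept as they are.
readable = json.dumps(record, ensure_ascii=False)

# When writing to a file, open it as UTF-8 as well.
path = os.path.join(tempfile.mkdtemp(), "ocr_output.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)
```

Both forms load back to the same text, so the escaped output is not corrupted; it is just harder to read in an editor.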

How to convert HTML file into Framemaker Interchange Format(.mif) file?

I want to mark index entries and cross-references like FrameMaker does.
FrameMaker can export a .fm file to .htm and .mif files.
I have analyzed how the index and cross-references appear in the .htm and .mif files after exporting them from FrameMaker.
Now my system will produce a .htm file, and I can manage to mark the index and cross-references like FrameMaker does.
I want FrameMaker to retain the index and cross-references marked by my system.
But there is no way to import or open HTML files directly in FrameMaker.
We can import a .mif file into FrameMaker.
So is there any way to convert HTML files into .mif (FrameMaker Interchange Format)?
There is one option. I know it's not a foolproof solution for this problem,
but it can save you some effort:
Save the HTML file in RTF format (using MS Word/OpenOffice)
Open that RTF file in FM
FM accepts the RTF file and converts it into a .fm file
Save the .fm file in .mif format
Note: some data loss may happen in this conversion. I have tried it for Markers and it works, but it is not a complete solution.
All the best!!
You can open the .htm files in Structured FrameMaker and then save them as .mif. This will certainly cause less loss in graphics.