Getting the number of pages in a fax TIFF

How can I get a page count from a TIFF file used for faxes, in the "G4" (Group 4) format? My preferred language is C++.

Two options come to mind.
Have a look at LibTIFF (www.libtiff.org).
This library is used by many other open-source projects and can do much more than just count the pages of a multi-page Group 4 TIFF. It is probably the easiest, quickest and cheapest approach. The tiff2pdf program that ships with it contains a section of code that counts TIFF pages, which you can probably adapt to suit your needs. Versions are available for both Linux and Windows.
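For illustration, here is a minimal sketch of the LibTIFF approach (assuming the libtiff headers and library are installed and you link with -ltiff). TIFFReadDirectory() steps through the pages one at a time; LibTIFF also has a TIFFNumberOfDirectories() helper that returns the count directly.

    // count_pages.cpp - count the pages (directories) in a multi-page G4 TIFF
    #include <tiffio.h>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s file.tif\n", argv[0]);
            return 1;
        }

        TIFF* tif = TIFFOpen(argv[1], "r");   // opening also reads the first directory
        if (!tif) {
            std::fprintf(stderr, "could not open %s\n", argv[1]);
            return 1;
        }

        int pages = 0;                        // each TIFF directory is one page/image
        do {
            ++pages;
        } while (TIFFReadDirectory(tif));     // returns 0 when there are no more pages

        TIFFClose(tif);
        std::printf("%d pages\n", pages);
        return 0;
    }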
Download the full TIFF specification at http://partners.adobe.com/asn/developer/PDFS/TN/TIFF6.pdf and write your own code.
Basically, a multi-page TIFF is a series of single-page TIFFs merged together. Each page has a directory header that includes the offset of the next page. To get the page count you traverse these headers and keep counting until you hit the end of the chain. The code is quite simple once you have the correct header structures; use fread() and fseek() to follow the chain.
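If you would rather roll your own, the traversal described above boils down to something like this. It is a rough sketch for classic (non-BigTIFF) files with minimal error handling: the header is a 2-byte byte-order mark ("II" or "MM"), the value 42, and the offset of the first IFD, and each IFD ends with the offset of the next one (0 marks the end of the chain).

    // ifd_count.cpp - walk the IFD chain with fread()/fseek() and count pages
    #include <cstdio>
    #include <cstdint>

    static uint16_t rd16(FILE* f, bool bigEndian)
    {
        unsigned char b[2];
        std::fread(b, 1, 2, f);
        return bigEndian ? (b[0] << 8) | b[1] : (b[1] << 8) | b[0];
    }

    static uint32_t rd32(FILE* f, bool bigEndian)
    {
        unsigned char b[4];
        std::fread(b, 1, 4, f);
        return bigEndian
            ? (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) | (uint32_t(b[2]) << 8) | b[3]
            : (uint32_t(b[3]) << 24) | (uint32_t(b[2]) << 16) | (uint32_t(b[1]) << 8) | b[0];
    }

    int main(int argc, char* argv[])
    {
        if (argc < 2) return 1;
        FILE* f = std::fopen(argv[1], "rb");
        if (!f) return 1;

        char order[2];
        std::fread(order, 1, 2, f);                // "II" = little-endian, "MM" = big-endian
        bool bigEndian = (order[0] == 'M');
        rd16(f, bigEndian);                        // magic number 42, not checked here
        uint32_t ifdOffset = rd32(f, bigEndian);   // offset of the first IFD

        int pages = 0;
        while (ifdOffset != 0) {
            ++pages;
            std::fseek(f, long(ifdOffset), SEEK_SET);
            uint16_t entries = rd16(f, bigEndian);         // number of 12-byte tag entries
            std::fseek(f, long(entries) * 12, SEEK_CUR);   // skip the entries
            ifdOffset = rd32(f, bigEndian);                // offset of the next IFD, 0 = end
        }

        std::fclose(f);
        std::printf("%d pages\n", pages);
        return 0;
    }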

Related

Merge JSON data inside dynamic PDF templates

My problem is rather simple:
I need a tool to merge moderately complex JSON input data into a template PDF.
Then based on the data:
Some sections of the template could be replicated.
Some sections may be deleted and the gap created should disappear.
Tables could be filled with N elements without messing up the formatting.
Tables could have merged cells.
Also, templates should be easily adjustable without rewriting code, which means they could be editable PDFs, Word files, spreadsheets, some tool's templates... (HTML?)
In every tool I tried (a lot of them!), one or more of the points above was always a nightmare.
So far I have tried:
latex -> pdflatex -> pdf: this probably shaved a year or a year and a half off my life expectancy. Unfortunately it is the most powerful tool (because... LaTeX), but it is not maintainable at all.
pdfminer / pdf.js / whatever npm package: coding always ends up in a low-level mess or huge workarounds.
Google Sheets -> pdf: the APIs are somewhat hard to use, and in any case they are cell/row based, so it is difficult to manage dynamic sections.
pdfgeneratorapi.com: basically what I needed, with a rich editor built in, but formatting and aligning tables is buggy, so the results are always ugly. Also, things like merged cells are not possible.
So the question is: is there a tool or package out there in 2022 capable of handling all these requirements at once?
I would recommend using LaTeX for your PDF generation requirements. LaTeX comes with a huge number of available packages and chances are that someone has already solved your issue in LaTeX.
Specifically, the DynamicDocs API by ADVICEment might be a good option for your requirements. DynamicDocs is a JSON-to-PDF API based on LaTeX. Here are some of its features:
Ready-made JSON to PDF templates (no need to understand LaTeX)
Ability to write your own template in LaTeX and merge JSON to PDF using the R language layer, making the templates and their content dynamic
Excel to PDF Add-in (currently in Beta) to generate PDFs based on data in Excel
Each account is given a FREE plan with a limited number of monthly API calls
Disclaimer: I am involved in developing DynamicDocs API.

Extract text from PDF into JSON or XML or whatever?

I am trying to extract data (price, information, and number) from PDFs. I have more than 10,000 PDFs, so the free trial of the website won't work.
Here is one example of PDF I get :
I tried it in Python (I am a beginner at this kind of task, and at Python too) with several packages such as PyPDF2 and pdfx, but all I get back is plain text (this is what PyPDF2 gives me).
So it is possible to extract the price, the number, and the information, but the PDFs come in different formats, so extracting the information from the plain text with a few ad-hoc rules is not workable.
What I want to do is possible, because a lot of websites do it and charge people for it: I want to read the document vertically and convert the extracted data into XML/JSON, or simply a dataset.
I want to read the document column by column, not line by line.
Is there a way to do this in Python or another language?
First, let me tell you that this is not an easy problem to solve, since PDF files in the wild tend to be quite diverse in layout. I can suggest trying an open-source project that works really well for extracting information from tables in PDF files. It is called Tabula; you can get it at https://tabula.technology.
Tabula detects the tables on each page and exports their content in CSV format. Once you have the data in CSV, it should be easier to get the information using Python. Note that the CSV layout depends on the table layout in the PDF, which means you may need several functions to extract the information correctly.
Tabula is not perfect but it should work with most PDF files, for those that do not work you may need to extract the information manually.
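Since the question allows for languages other than Python: once Tabula has produced a CSV, pulling a column out of it is easy in most languages. Here is a rough C++ sketch; the file name table.csv and the index of the price column are assumptions about your layout, and the naive split does not handle commas inside quoted fields.

    // read_table.cpp - print one column from a Tabula-exported CSV
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main()
    {
        std::ifstream in("table.csv");                 // file produced by Tabula (assumed name)
        std::string line;
        while (std::getline(in, line)) {
            std::vector<std::string> fields;
            std::stringstream ss(line);
            std::string field;
            while (std::getline(ss, field, ','))       // naive split, no quoted-comma handling
                fields.push_back(field);
            if (fields.size() > 2)
                std::cout << "price: " << fields[2] << '\n';   // column index is a guess
        }
        return 0;
    }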

Objective-C - Parsing a .csv, extracting and inserting information, then displaying the .csv as an interface for editing

This question has been troubling me for the past week. Below, I will list my issue, and the research I have put into it.
The scenario: I was given a .csv file with 5000 rows and three columns. The three columns are defined as:
Site ID|Site Name|Site URL
My task: To create an HTML interface for the designers of the company to rate each site on a scale of 1-5.
My plan of action: I am a new hire. I am getting accustomed to the language I was hired for, which was Objective-C.
My algorithm for the project was to:
Parse the .csv
Remove the "Site Name" variable
Create a new .csv that contains the below variables: Site ID|Site URL|Rating|Image
Display the new .csv (with all the aforementioned columns) as an HTML page with a rating toggle for each site; when a toggle is pressed, the rating is written back into the .csv it was loaded from.
The "Image" section I will be using a piece of software by the name of Paparazzi (on the Mac OS X operating system) which takes a fully formatted screenshot of the main page and saves it as a PNG file. I plan on using the file extension URL (which is stored locally) and load it into the "Image" column, thus when the designer clicks on the image, he is able to load the image that is stored locally.
My issue: As Objective-C is not entirely a scripting language, I am confused about which libraries I may need and/or how I can implement this. I have the algorithm, but I am wholly unsure about the implementation.
My questions: If you have done a project similar to this with Objective-C, what tips can you give me? How does one load the .csv into an HTML interface where, upon edit, the edit is saved back into the .csv? Will I need a server for this, or is everything executable from a single machine? How do you grab an image (stored locally), extract its file path, and load it into the .csv?
The most important question: Is this achievable in Objective-C? My reasoning is that I want to advance my knowledge of Objective-C through a task like this. Yes, using Python would be easier, but is it possible to do this with Objective-C?
Thank you.
It certainly is achievable, but I doubt you'd really want to go this way. If I understand correctly, you want to serve the HTML page to others via a web browser; that would mean either writing a (simple) HTTP daemon that runs on the server, or writing a CGI script that communicates with a standard HTTP daemon. Python/PHP/Ruby do this for you readily, so there is much less room for errors.
As for
As Objective-C is not entirely a scripting language
I would perhaps rephrase it as
As Objective-C is entirely not a scripting language

CHM Creator with ability to parse html meta keywords

I have lots of scanned images of a magazine (published monthly), and I have to organize them in a searchable manner.
Users should be able to browse the magazine by issue or search for predefined categories/keywords.
What I have in mind for now is to create a CHM, as that will take less effort than building custom software.
For that, I will programmatically create a separate HTML page for each image, with the image embedded in it along with the keywords (stored in an Excel sheet together with the image path) under which that image should show up in search results.
So I want a CHM creator that can parse HTML meta tags and add the keywords to the CHM keyword list.
One such program I have found is Abee CHM Maker, but I need a free alternative.
If you have any other idea for organizing this with minimal effort, that is also welcome.
The standard (free) way to create CHM files is to use Microsoft's HTML Help Workshop:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms670169(v=vs.85).aspx
Kind regards,
Bo
Free Pascal has a CHM creator package, an HTML DOM implementation, and a basic command-line compiler for CHM projects (.hhp). The creator package is independent of the MS tools and of any other binary blob, and is available in source form. It is portable as far as FPC is portable (not as portable as gcc on paper, but enough in practice, with all major architectures and OSes supported).
You could build something like that. I made something similar, but instead of using meta tags I folded the titles back into the TOC and index, cleaned up the HTML (TeX4ht output), and fixed the links before turning it into a CHM.
But it will require some work, and if you are not familiar with Object Pascal/Delphi (the language), it might be a bridge too far (the hours required would not compare favorably with the cost of the Abee tool, if that would suit your goals).
On the other hand, with a freely programmable system you can decide for yourself how far to automate things. I put in a lot of work once, and now all new tex4ht output (with a certain fixed set of settings) formats nicely into CHMs.
See if this helps you (it certainly does what you need):
KEL CHM Creator: http://dumah7.wordpress.com/2009/02/17/kel-chm-creator-v-1-4-0-0/
Alternatively, I think you could add tags to each picture (right-click on it -> Properties -> Details -> Tags) and use Windows Explorer to search for them. I have never done this myself, but it is supposed to work.

Reverse engineering a custom data file

At my place of work we have a legacy document management system that for various reasons is now unsupported by the developers. I have been asked to look into extracting the documents contained in this system to eventually be imported into a new 3rd party system.
From tracing and process monitoring I have determined that the document images (mainly TIFF files) are stored inside a number of 1.5 GB files. Each image seems to be read from a specific offset, written to a temporary file that is served to the client via a web app, and then deleted.
I am looking for suggestions on how to inspect these large files that contain the TIFF images, and eventually extract the images and write them to individual files.
Are the TIFFs compressed or wrapped in some way? If not, your job may be pretty easy: pull the TIFFs straight out of the 1.5 GB files.
Can you see the output for a particular 1.5 GB file (or a series of them)? If so, you should be able to work out what the bytes should look like for that TIFF if it is stored uncompressed.
If the bytes don't appear to be there, try some standard archive/compression formats (zip, tar, etc.) to see if you get a match.
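A quick way to check whether raw TIFF data is sitting in the containers is to scan for the classic TIFF signatures: "II*\0" (little-endian) and "MM\0*" (big-endian). Here is a rough C++ sketch; it reads in fixed-size chunks and, as the comment notes, can miss a signature that straddles a chunk boundary.

    // tiff_scan.cpp - print the offsets of possible TIFF headers in a large container file
    #include <cstdio>
    #include <cstdint>
    #include <vector>

    int main(int argc, char* argv[])
    {
        if (argc < 2) return 1;
        FILE* f = std::fopen(argv[1], "rb");
        if (!f) return 1;

        std::vector<unsigned char> buf(1 << 20);   // 1 MB chunks
        uint64_t base = 0;
        size_t got;
        // A signature split across a chunk boundary is missed by this simple
        // version; overlap the reads by 3 bytes if that matters.
        while ((got = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
            for (size_t i = 0; i + 3 < got; ++i) {
                bool ii = buf[i] == 'I' && buf[i+1] == 'I' && buf[i+2] == 0x2A && buf[i+3] == 0x00;
                bool mm = buf[i] == 'M' && buf[i+1] == 'M' && buf[i+2] == 0x00 && buf[i+3] == 0x2A;
                if (ii || mm)
                    std::printf("possible TIFF header at offset %llu\n",
                                (unsigned long long)(base + i));
            }
            base += got;
        }
        std::fclose(f);
        return 0;
    }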
I'd open the file, seek to the required offset, and then stream the bytes into a TIFF object (ideally one that supports streaming from memory or a file). Then you've got it. Poke around in some of the other bytes too, as there is likely metadata about each document that may be useful to the next system.
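As a sketch of that seek-and-carve step in C++ (the offset and length are assumptions: they would come from the application's own index, from your process monitoring, or from the distance between two signatures found by a scan like the one above):

    // carve.cpp - copy `length` bytes starting at `offset` out of the container
    // into an individual .tif file
    #include <cstdio>
    #include <cstdint>
    #include <vector>

    bool extractOne(const char* container, uint64_t offset, uint64_t length, const char* outPath)
    {
        FILE* in = std::fopen(container, "rb");
        if (!in) return false;
        FILE* out = std::fopen(outPath, "wb");
        if (!out) { std::fclose(in); return false; }

        // 1.5 GB still fits in a 32-bit long; for containers over 2 GB use
        // fseeko()/_fseeki64() instead.
        std::fseek(in, long(offset), SEEK_SET);

        std::vector<unsigned char> buf(1 << 16);
        uint64_t remaining = length;
        while (remaining > 0) {
            size_t want = remaining < buf.size() ? size_t(remaining) : buf.size();
            size_t got = std::fread(buf.data(), 1, want, in);
            if (got == 0) break;                       // hit EOF early
            std::fwrite(buf.data(), 1, got, out);
            remaining -= got;
        }

        std::fclose(out);
        std::fclose(in);
        return remaining == 0;
    }

Once the slices are on disk, you can open each one with LibTIFF or any image viewer to confirm it really is a self-contained TIFF.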