Convert a PDF into small chunks of data (many chunks per page)?

I have a PDF file and I need to get small pieces of data from it.
It is structured like this :
Page1:
Question 1
......................................
......................................
Question 2
......................................
......................................
Page End
I want to get Question 1 and Question 2 as separate html files, which contain text and image.
I've tried
pdftohtml -c pdffile.pdf output.html
And I got files with PNG images, but how do I cut the image into smaller chunks to fit the size of each question (I want to separate each question into individual files)?
P.S. I have a lot of PDF files, so a command-line tool would be nice.

I'll try to describe how I would go about it. You mention that every page in your PDF document might have multiple questions, and you basically want to have one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
OK, so assuming you have an HTML file converted from your original PDF, you could use csplit or awk to split it into multiple files at the delimiter 'Question'. (Side note: csplit and awk are standard on Linux, but I'm sure there are alternatives if you are on Windows or a Mac. I haven't actually tried the following code.)
From a relevant SO Post :
csplit input.txt '/^Question$/' '{*}'
awk '/Question/{filename=NR".txt"}; {print >filename}' input.txt
So, assuming this works, you will have a couple of broken html files. Broken because they'll be unsanitized due to dangling < or > or some other stray HTML elements after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements, and studying how the converter structures its HTML output. I'm sure you'll see a pattern in how the string 'Question' is wrapped in an element, which you can then handle. That is why I mention .txt files in the code snippets.
You will basically have a bunch of text files with just the content html and not the usual starting tags for an html file because we removed that initially. Then it's only a matter of reading each file, just taking care of the element that surrounds the string 'Question' and adding the html, head and body elements around the content and saving them as .html files. You could do this in any programming language of your choice that supports file reading and writing (would be a fun exercise)
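The split-and-rewrap step described above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested solution for pdftohtml output: it assumes each question starts on a line containing the string 'Question', that the html, head and body wrappers have already been removed from the input, and the function names and the question_N.html file naming are made up for illustration. It only handles the text and does not crop the PNG images.

```python
from pathlib import Path


def split_questions(body_html: str, marker: str = "Question") -> list[str]:
    """Return one chunk per marker line; text before the first marker is dropped."""
    chunks: list[list[str]] = []
    for line in body_html.splitlines():
        if marker in line:
            chunks.append([])  # a marker line starts a new chunk
        if chunks:
            chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks]


def wrap_html(fragment: str, title: str) -> str:
    """Put the html, head and body elements back around a fragment."""
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n"
        f"<title>{title}</title>\n</head>\n<body>\n{fragment}\n</body>\n</html>\n"
    )


def write_question_files(body_html: str, out_dir: str) -> list[Path]:
    """Write each chunk to out_dir/question_N.html (hypothetical naming)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(split_questions(body_html), start=1):
        path = out / f"question_{number}.html"
        path.write_text(wrap_html(chunk, f"Question {number}"), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling write_question_files(text, "out") on the stripped content of one converted page would then produce one standalone HTML file per question.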
I hope this gets you started in the right direction.

Related

Lua filter for pandoc to append html

I'm currently compiling markdown to html using pandoc:
pandoc in.md -o out.html
and would like to include the same piece of html code in each of the output files, without having to write it into my markdown file.
I was hoping that a lua filter would do the job. However, the docs seem to indicate the filters will only respond to a sequence of characters within my markdown file, rather than appending something to each file.
I've played around with CSS (I've never used it before), but it doesn't look like I can just add arbitrary html code like this (correct me if I'm wrong).
To summarize, I'd like to find a way to add html code to my output.
A Lua filter is likely to be overkill here. Pandoc has an option --include-after-body (or --include-before-body) which will do what you need:
-A FILE, --include-after-body=FILE|URL
Include contents of FILE, verbatim, at the end of the document body (before the </body> tag in HTML, or the \end{document} command in LaTeX). This option can be used repeatedly to include multiple files. They will be included in the order specified. Implies --standalone.

Bookdown: Single html output file

If I add a line below the first in _output.yml:
bookdown::gitbook:
  split_by: none
  css: ...
in the bookdown-demo the output becomes a single .html file which looks kind of plain ugly. Is it somehow possible to retain the nice style which is produced by the default settings but in a single file? If I want to send the book to someone else sending a stack of files is not great, especially if the person who receives it is not familiar with HTML as a document format.
This turns out to be a bug in bookdown, and I just fixed it on GitHub. You can install and test the development version (>= 0.3.3):
devtools::install_github('rstudio/bookdown')

Displaying a .txt file as pure text without editing the file contents

I have a .txt file containing code that I cannot change the contents of. And I need to display it in two ways.
One way is inside a div as selectable, copy-able type (currently done with:
<pre><?php include '/file_location.txt';?></pre> ).
The other way is as a direct link to the .txt file, so that the link can have its address copied and emailed to someone, saved as..., or used for anything else one might want a direct link for. (So just like <a href="/file_location.txt"> basically.)
The issue is that when PHP includes the text file into a div, any <%> strings interfere with the original source text. I need to preserve the integrity of the original .txt files (so I can't go changing all the left carets into &lt;).
So is there a good way to display the contents of the text file without issues with < > and still maintain its original integrity for the sake of direct linking?
EDIT:
I currently have two separate files performing this function, one with html encodings and the raw unedited .txt file. I'd really like to get these two displays working with just one file so that each new bit of source code doesn't need to be converted to an html-friendly version and adding just its .txt file will grant both view options.
EDIT 2:
Using <textarea> instead of <pre> will not interfere with the < characters and I could use CSS to make it look how I want, but I don't like the idea of the user being able to resize it themselves.
You can use
<?php echo htmlspecialchars(file_get_contents("file.txt")) ?>
instead of
<?php include '/file_location.txt';?>
to display special HTML characters from a text file.
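This works because the escaping happens only on output, so the .txt file on disk is never modified and can still be linked to directly. For illustration only, here is the same idea as a minimal Python sketch; render_pre is a hypothetical helper name, and html.escape is a rough Python counterpart to PHP's htmlspecialchars.

```python
import html


def render_pre(raw_text: str) -> str:
    """Escape &, < and > (and quotes) so the text displays literally inside <pre>."""
    return "<pre>" + html.escape(raw_text) + "</pre>"


# The source text stays unchanged; only the rendered copy is escaped.
print(render_pre("<% if (a < b) %>"))
```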
I am using this for my PHP files. I think it will be useful for you too.
<?php
highlight_file("test.php");
?>
edit: I tried on a html file and it worked.
I would try to add <pre></pre> at the beginning/end of your .txt. I'm not sure if I fully understand your question, but I think this will not interfere with the <> tags.

convert docx with (ordered) list to html

I'm trying to convert a large docx document with multi-level ordered lists to HTML. (See an example of the document here: http://docdro.id/X1oyfBv You should download it.)
I tried the following things, including:
online converters such as html-cleaner and index.html (which only recognize one layer of the list)
save as HTML - which creates a horrendous file but still doesn't recognize the ol structure.
saved the file as zip and then opened the XML file, but I don't see an easy way to get the ol structure out of the w:... tags
saving it to google docs and running Omar Alzabir's script
http://omaralzabir.com/wp-content/uploads/2014/05/GoogleDocsEmail.jpg
By the way, if I create a Word file with a multi-level ordered list and convert it, the list is recognized as ol's. But the list in the existing file is not recognized as ol's, even if I 'un-list' and list it again. So possibly there is something wrong with how the original document was created (?)
Any suggestions are much appreciated :) Or indications as to why this problem occurs.
Are you asking how to save a Word-doc in HTML format, with multi-level ordered-lists?
Word-HTML has bugs in its multi-level ordered lists. For the list-items, the indentation tends to be incorrect and inconsistent. There's an example here.
Word-HTML has similar bugs in its multi-level unordered lists. An example is here.
I recently wrote a Python program that fixes these bugs, in Word's HTML. The program is part of WordWebNav (WWN), which is free and open-source.
WWN is an app that converts a Microsoft-Word document to a usable web-page. It adds some missing features in the Word-HTML web-page (e.g., a navigation pane), and it fixes bugs in the Word-HTML.
You can use pandoc : https://github.com/jgm/pandoc
This is an open source universal command line tool to convert markup source based document files.
You can use it like this:
pandoc -o output.html input.docx

HTML file to screenshot as .jpg or other image

This has nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot, so I can show the user a thumbnail of an HTML file. The HTML file I will be screenshotting will be an HTML part in a MIME email message. Ideally I would like to snapshot the entire MIME file, but if I can do this for an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the package ImageMagick:
html2ps index.html index.ps
convert index.ps index.png
The second program produces one PNG per page for a long HTML page; the page layout is done by html2ps.
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work in a simple first test.
If you like to combine multiple pages to a larger image, convert will help you surely.
I now see that convert operates on HTML directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the size of the images is nearly identical.
If you have a multipart mime-type email, you typically have a mail header, maybe some pre-html-text, the html and maybe attachments.
You can extract the HTML and format it separately, but rendering it embedded might not be that easy.
Here is a file I tested, from Apr. 14; first I extract that one mail from the mail folder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
then I extract just the html-part of that.
wkhtmltopdf needs - - for reading stdin/writing to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtml ... with
convert - sample.jpg
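As an alternative to the sed range match above, the HTML part could be pulled out with a MIME parser, which also copes with encoded parts and does not depend on literal <html> tags. This is a minimal sketch using Python's standard email package, not what the answer actually ran; extract_html is a hypothetical helper name, and the input is assumed to be a single raw message rather than a whole mbox folder.

```python
from email import message_from_string, policy
from typing import Optional


def extract_html(raw_message: str) -> Optional[str]:
    """Return the decoded text/html body of a raw MIME message, or None if absent."""
    msg = message_from_string(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None
```

The returned string can then be written to a file and passed to whichever renderer is used for the screenshot.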
I'm going with wkhtmltoimage. This worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.