Most Pythonic way to create PDF files from JSON with styling?

TL;DR: Looking for a Python library to create a PDF template with specific styling and fill it with information from a JSON file.
Full Context:
I have a long RPA pipeline that ends with 500+ JSON documents. Each JSON document represents an exam, and each exam might have 1000-4000 questions. The JSON file is simple; an example:
{
"AllQuestions": [
{
"QuestionText": "A 3-year old man did.....",
"Choices": ["Choice A", "Choice B", etc.]
}, another question, etc. ]
}
The only variable here is that sometimes I can have 5 Choices or 4 Choices and sometimes I have an image in the exam (However, I can handle those specs once I know what to use).
Well I have to create a style that's similar to this one:
"Without Key Info, attending labs, etc."
Now, I looked into PyPDF2 and FPDF, and the best I could reach is this style:
Now, for FPDF2, it is pretty straightforward: in just a few lines of code, I could create that by initializing a class, adding a page, and adding the question to it. However, the styling there is very limited. I tried to make use of "WriteHTML" and it still can't reach my desired styling at all.
I read that PDFKit or other alternatives are good. Do you think I should first create a full HTML document with 1000+ questions, then take that into PDFKit and convert it into a PDF? Or is there a way to treat each question as an object with default styling and append it to a PDF file object?
Thanks in advance :)

I don't know about the most Pythonic way, but I would do it like so:
Figure out what language you want to define the final output in. Since it has a lot of complex formatting, I'd say you want HTML (probably with CSS) or LaTeX.
Write a Jinja template in this target language, with variables in the appropriate places.
Plug the values from your JSON into Jinja to render the template and construct the HTML/LaTeX of every question.
Use pandoc to convert the HTML to PDF.
While this is quite a few technologies, they are all well suited to their task and easy to work with. The problem here is that you want to build PDFs with a very specific layout. However, PDF is a complex format and not all libraries implement it well, but pandoc does.
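A minimal sketch of the render step, assuming the JSON shape shown in the question. To stay dependency-free it uses the standard library's string.Template as a stand-in for a Jinja template; the template text, the CSS class names and the render_exam helper are all illustrative assumptions, not part of any library.

```python
import html
import json
from string import Template

# Illustrative templates; with Jinja these would live in a template file.
QUESTION_TEMPLATE = Template(
    '<div class="question">\n'
    '  <p class="text">$number. $text</p>\n'
    '  <ol type="A">\n$choices\n  </ol>\n'
    '</div>'
)

PAGE_TEMPLATE = Template(
    '<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n'
    '<style>\n.question { margin-bottom: 1em; page-break-inside: avoid; }\n</style>\n'
    '</head>\n<body>\n$body\n</body>\n</html>\n'
)


def render_exam(exam):
    """Render one exam dict (the JSON shape from the question) to an HTML string."""
    blocks = []
    for number, question in enumerate(exam["AllQuestions"], start=1):
        # Works for 4 or 5 choices alike; every value is HTML-escaped.
        choices = "\n".join(
            f"    <li>{html.escape(choice)}</li>" for choice in question["Choices"]
        )
        blocks.append(
            QUESTION_TEMPLATE.substitute(
                number=number,
                text=html.escape(question["QuestionText"]),
                choices=choices,
            )
        )
    return PAGE_TEMPLATE.substitute(body="\n".join(blocks))


# Made-up sample data in the same shape as the question's JSON.
exam = json.loads(
    '{"AllQuestions": [{"QuestionText": "Sample question <1>?",'
    ' "Choices": ["Choice A", "Choice B", "Choice C", "Choice D"]}]}'
)
page = render_exam(exam)
```

The resulting string can be written to an .html file and handed to pandoc (or another HTML-to-PDF tool) for the conversion step. With Jinja, a for loop inside the template would replace the Python loop here.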

Related

XML as complement to HTML

I'm having trouble wrapping my head around using XML as complement to HTML. I know what they are used for but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML-file?
I have a template form, which I want to be accessible on a server and for which I want to enable edits. Once edited I want to save the edits on a separate file, so that the template is still available.(Just so you guys have a little bit of background regarding what I need this for).
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML, as it uses the same magic characters for its markup and the same logic where to put the data.
The characters <> are used to separate the markups from the content.
The character & together with an entity code like &lt; is used to encode characters which would otherwise cause trouble
elements can contain attributes like <someElement someAttribute="attr value">
elements can contain text or sub elements
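The points above can be seen in a small sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (form, field, name, id) are made up for illustration, which is exactly the freedom XML gives you.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element and attribute names are free to choose.
doc = (
    '<form id="template-1">'
    '<field name="title">Tom &amp; Jerry &lt;draft&gt;</field>'
    '<field name="author">A. Writer</field>'
    '</form>'
)

root = ET.fromstring(doc)

# Attributes are read from the element itself.
form_id = root.attrib["id"]

# Sub-elements carry text; entities such as &amp; and &lt; are decoded on parsing.
fields = {field.get("name"): field.text for field in root.findall("field")}

print(form_id)
print(fields)
```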
The big difference is that XML is absolutely free in how you name your elements and attributes, while HTML relies on dedicated names (like <body>). On the other hand, XML is absolutely strict in structure, while HTML allows a lot (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the rules of HTML.
It is almost impossible to read HTML as XML, but you can easily create XML which is taken by any browser as a valid web page.
Your issue cries for XSLT. This is a method to transform a given XML into a new format. This allows for example, to export your data as XML and create a nice web page from it. Different XSLT will present the same data in different ways.
There are several online tools to test this feature. You might have a look here.
Your statement "After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data" is not entirely clear. How you send data (to a web application), and the way you send the (manipulated) data back, is not bound to XML. This is very often done with JSON, using JavaScript to read, edit and send it back.
XML -> XSLT -> HTML is often seen to create (rather static) reports for a web viewer.
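To make the data-to-presentation idea concrete, here is a rough sketch in Python. It is not XSLT: it uses the standard xml.etree.ElementTree module as a stand-in, and the input element names and the fields_to_html helper are hypothetical. The point is only that the same XML data can be turned into whatever HTML you like by a separate set of rules.

```python
import xml.etree.ElementTree as ET


def fields_to_html(xml_text):
    """Turn <field name="...">value</field> elements into an HTML table."""
    root = ET.fromstring(xml_text)
    table = ET.Element("table")
    for field in root.findall("field"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "th").text = field.get("name")
        ET.SubElement(row, "td").text = field.text
    # method="html" serializes the tree using HTML rules.
    return ET.tostring(table, encoding="unicode", method="html")


html_report = fields_to_html(
    '<form><field name="title">Q&amp;A</field></form>'
)
print(html_report)
```

A different function (or, with XSLT, a different stylesheet) could present the same data as a list or a form instead of a table.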

How to think about adding HTML in MongoDB?

I am new to MongoDB and JSON like data formats but I can see their combined potential in regards to being able to easily manipulate data with Javascript (jQuery) due to their similar syntax ie "key": "value" pairings.
There is a conceptual leap I am yet to make in regards to how I work with HTML content in this context, for example say I have a number of articles (with included HTML - <p>, <li>, <img> tags etc), how do I organise this content?
Do I add the HTML into Document 'content' values eg:
{ "title": "My First Article", "content": "<p>Welcome to this page</p><p>Today I would like to...</p><p>Etc <img src=\"cat.jpg\"></p>" }
This seems counter intuitive and messy in terms of keeping the cleanliness of the JSON data that would be coming back to a web interface. Plus it would make it difficult to 'read' the HTML in the Documents, as line spaces are not allowed etc.
What is this conceptual leap that I need to make in terms of how I think about adding HTML in MongoDB?
The question you should consider is the following:
Do I have to modify HTML content I store?
If you would need to insert, modify, remove elements with (for example) character data into, within, from the HTML and you need to do this differently on each request, the answer would be "maybe store it as a tree in MongoDB". But I'll just stick with "don't".
Every time you would want to print out your HTML as it is, you would need to construct the document, render as a string from the data stored in MongoDB. Also, you would need to parse and build the tree each time you wish to store it. It would be just a waste of resources and development time, just because your eye would like the view of an HTML document stored as a JSON tree.
Just start to implement it, and when you hear a shot, it will indicate a bullet in a leg.
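Storing the HTML as a plain string is less messy in practice than it looks, because the serializer takes care of the escaping. A small sketch with Python's json module (no MongoDB driver involved; the article content is made up):

```python
import json

# The HTML is kept as one ordinary string value, line breaks included.
article = {
    "title": "My First Article",
    "content": '<p>Welcome to this page</p>\n<p>Etc <img src="cat.jpg"></p>',
}

# Quotes and newlines inside the string are escaped automatically.
encoded = json.dumps(article)

# Decoding gives back exactly the HTML that went in.
decoded = json.loads(encoded)

print(encoded)
print(decoded["content"])
```

You never write or read the escaped form by hand; the application only sees the original HTML string.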

What are some good ways to parse HTML and CSS in Perl?

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of separated xml attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node\'s bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also a project that works with it, Marpa::HTML, which is part of the larger parser project Marpa. Marpa parses any language that can be described in BNF and is documented on the author's blog; it is very interesting, but much newer and more experimental.
I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.
If you need something even more generic (and evil) you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, not sure about tag properties though) or even Regexp::Grammars, but again this means reinventing the wheel somewhat, I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you: HTML::Tree seems like it might do what you need. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
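Whichever parser you pick, the inline style string still has to be split into properties before it can be read like attributes. A rough sketch of that idea, written in Python with the standard html.parser module rather than Perl; the parse_style helper and StyleCollector class are hypothetical names, and the class-based rules in the page's style block (such as .ft19) are not handled here.

```python
from html.parser import HTMLParser


def parse_style(style):
    """Split an inline CSS declaration list into a dict of property -> value."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip()] = value.strip()
    return props


class StyleCollector(HTMLParser):
    """Record (tag, attributes, parsed inline style) for every start tag."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        style = attr_map.get("style")
        self.nodes.append((tag, attr_map, parse_style(style) if style else {}))


collector = StyleCollector()
collector.feed(
    '<div o="9ka" style="position:absolute;top:145;left:89;">'
    '<span class="ft19">Some text</span></div>'
)
collector.close()

div_tag, div_attrs, div_style = collector.nodes[0]
print(div_style["top"])
```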
It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?

Using HTML/CSS, I would like to automatically generate a bibliography at the bottom of my website, akin to LaTeX's \bibliography command

I'll ask my question first, then give some background for those who are interested:
I would like to know if there is a command in HTML that will automatically generate a bibliography from a .bib file. This means that throughout the text, I would add something like <cite name="Jones2010">, and then at the bottom of the HTML (or CSS) file, I would write something like <makebib file="biblist.bib", format="APA">, and a bibliography would be generated using my .bib file, formatted according to the APA style. The functionality would be quite similar to footnotes, except that each footnote is populated according to some script that extracts the information from (essentially) an XML file and outputs the content in the desired format. It is not difficult to imagine somebody creating a tool to do just that; however, my Google search skills have not enabled me to find such a tool. It is easy to find tools that convert .bib files to HTML or XML, but that is not sufficient for my needs. I do not want to publish my entire .bib file online. Rather, for each document that I generate, I want several of the entries in the .bib file to be included as footnotes. Any pointers will be greatly appreciated.
Now, the reason behind the question:
I have recently begun switching from writing all my manuscripts in LaTeX to writing them in HTML/CSS. The advantages of this approach are vast: only 1 file for versioning (instead of .dvi, .ps, .aux, .blg, etc.), it is much smaller to share, other people can edit the HTML file and compile it much more easily, it is more configurable to my tastes, easier to read on screen, etc. The disadvantage for me, however, is that while I've been writing in LaTeX for years, I've only just begun using HTML and CSS for scientific document creation. The main impetus for the switch was MathJax, which enables me to embed LaTeX equations in my HTML files and therefore allows me to combine the advantages of LaTeX with the advantages of CSS. I imagine that nearly all my colleagues will switch away from LaTeX to this simpler format, assuming a few remaining issues get resolved, like ease of creating bibliographies.
Many thanks.
What you're asking isn't possible, unless when you specify html/css you really mean html/css/php or html/css/python or some other combination that includes an actual programming language, rather than just a markup language.
I understand your motivation, I'd love to switch to html instead of latex! However, I suspect an html-based solution would involve so much extra processing added on top to sort out bibliographies etc that the complexity would start approaching that of LaTeX by the time you got it all worked out.
I'd be pleased to be proven wrong on this!
I've done this, in the past, using XSLT and BibTeX. In outline, the steps are
Mark up your document using some convention or other: I used <span class='citation'>Smith99</span>
Write an XSLT script to transform that file into a .aux file with \citation commands in it
Use BibTeX along with a .bst file which spits out HTML rather than LaTeX
Use another XSLT script (or the same one, in a different mode) to pull the bibliography in
It's not quite as fiddly as it sounds, but you can look at how I did it on google code. In particular, see structure.xslt and plainhtml.bst.
If there's a more direct way, I'd be quite interested to hear about it.
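As an illustration of the second step, here is a rough Python sketch that collects the citation keys from the markup convention above and emits the lines BibTeX expects in a .aux file. It is a stand-in for the XSLT script, not the author's code; build_aux and CitationCollector are hypothetical names, and the file names passed in are only examples.

```python
from html.parser import HTMLParser


class CitationCollector(HTMLParser):
    """Collect the text of every <span class='citation'> in document order."""

    def __init__(self):
        super().__init__()
        self.keys = []
        self._in_citation = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "citation" in classes:
            self._in_citation = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_citation = False

    def handle_data(self, data):
        key = data.strip()
        # Keep each key once, in order of first appearance.
        if self._in_citation and key and key not in self.keys:
            self.keys.append(key)


def build_aux(html_text, bib_name, style_name):
    """Return .aux content: one \\citation per key, then \\bibstyle and \\bibdata."""
    collector = CitationCollector()
    collector.feed(html_text)
    collector.close()
    lines = [f"\\citation{{{key}}}" for key in collector.keys]
    lines.append(f"\\bibstyle{{{style_name}}}")
    lines.append(f"\\bibdata{{{bib_name}}}")
    return "\n".join(lines) + "\n"


aux = build_aux(
    "<p>See <span class='citation'>Smith99</span> and "
    "<span class='citation'>Jones2010</span>, also "
    "<span class='citation'>Smith99</span>.</p>",
    bib_name="biblist",
    style_name="plainhtml",
)
print(aux)
```

Written to a file such as paper.aux, this is what BibTeX would read to decide which entries to format and with which .bst style.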
Both answers so far are somewhat correct, although not quite what you were asking for. Part of the problem is that the question as it's phrased doesn't necessarily make sense.
HTML is just markup; you need something to process the markup, be it python, php, ruby, etc.
And you probably want to write in XML (or XHTML), not HTML.
XSLT may work for you (once it's in XML), but remember, an XSLT document only defines a set of rules. You would need an XSLT engine to apply your XSLT rules to your XML document.
You can create an html bibliography from a .bib file using bibtex2html. This package takes a series of command line arguments and extracts the info from the BibTeX source and outputs a file with html markup.
As far as I know you cannot get it to read and parse the html document like the LaTeX \cite command but there are several ways to indicate the references you want. I find that the easiest way is to just maintain a text file of the BibTeX keys I use in my manuscript and then call this using the --citefile option. There is also a tool called bib2bib included that will take search commands.
It is a very flexible package and there are a lot of options so it works in a lot of situations. For example you can get it to omit the <html> headers from the output file so that you can directly paste into an existing html document.
The documentation is useful but make sure you look at the pdf documentation file and the man pages.

How do you parse a poorly formatted HTML file?

I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is that the HTML code is poorly formatted, making it impossible to use an XML-like parser.
So far, the best strategy I can think of, is to define a template for each kind of page, like:
Template A:
<html>
...
<tr><td>Table column that is missing a td
<td> Another table column</td></tr>
<tr><td>$data_item_1$</td>
...
</html>
Template B:
<html>
...
<ul><li>Yet another poorly formatted page <li>$data_item_1$</td></tr>
...
</html>
This way I would only need one single parser for all the pages, which would compare each page with its template and retrieve $data_item_1$, $data_item_2$, etc. Still, it is going to be a lot of work. Can you think of any simpler solution? Any library that can help?
Thanks
You can pass the page's source through tidy to get a valid page. You can find tidy here. Tidy has bindings for a lot of programming languages. After you've done this, you can use your favorite parser/content extraction technique.
I'd recommend Html Agility Pack. It has the ability to work with poorly structured HTML while giving you Xml like selection using Xpath. You would still have to template items or select using different selections and analyze but it will get you past the poor structure hump.
As mentioned here and on other SO answers before, Beautiful Soup can parse weird HTML.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
Use an HTML5 parser like html5lib.
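The same tolerant approach can be sketched with Python's standard html.parser module, which Beautiful Soup can also use as a backend. This is not Beautiful Soup itself; the CellCollector class and extract_cells helper are hypothetical, and the broken markup is adapted from Template A in the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, even when closing tags are missing."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None

    def _flush(self):
        # Finish the cell being read, if any.
        if self._current is not None:
            self.cells.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()  # a new <td> implicitly closes the previous one
            self._current = []

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def extract_cells(markup):
    collector = CellCollector()
    collector.feed(markup)
    collector.close()
    collector._flush()  # pick up a cell left open at the end of the input
    return collector.cells


broken = (
    "<tr><td>Table column that is missing a td\n"
    "<td> Another table column</td></tr>\n"
    "<tr><td>42</td>"
)
cells = extract_cells(broken)
print(cells)
```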
Unlike HTML Tidy, this will give you error handling very close to what browsers do.
There's a couple C# specific threads on this, like Looking for C# HTML parser.
Depending on what data you need to extract, regular expressions might be an option. I know a lot of people will shudder at the thought of using RegExes on structured data, but the plain fact is (as you have discovered) that a lot of HTML isn't actually well structured and can be very hard to parse.
I had a similar problem to you, but in my case I only wanted one specific piece of data from the page which was easy to identify without parsing the HTML so a RegEx worked very nicely.
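For that kind of single-value case, a sketch in Python might look like the following. The markup, the id="price" marker and the extract_price helper are all made up for illustration; the pattern only works as long as the page keeps that marker.

```python
import re

# Capture the text that follows the tag carrying id="price" (hypothetical marker).
PRICE_PATTERN = re.compile(r'id="price"[^>]*>\s*([^<]+?)\s*<')


def extract_price(page):
    """Return the text right after the id="price" tag, or None if absent."""
    match = PRICE_PATTERN.search(page)
    return match.group(1) if match else None


# Badly formed page: unclosed <td>, unquoted attribute, missing </table>.
page = '<table><tr><td>Item<td id="price" class=x> 19.99 </td></tr>'
price = extract_price(page)
print(price)
```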