How to think about adding HTML in MongoDB?

I am new to MongoDB and JSON-like data formats, but I can see their combined potential in terms of easily manipulating data with JavaScript (jQuery), due to their similar syntax of "key": "value" pairings.
There is a conceptual leap I have yet to make regarding how I work with HTML content in this context. For example, say I have a number of articles (with included HTML: <p>, <li>, <img> tags, etc.); how do I organise this content?
Do I add the HTML into Document 'content' values eg:
{ "title": "My First Article", "content": "<p>Welcome to this page</p><p>Today I would like to...</p><p>Etc <img src="cat.jpg"></p>
This seems counterintuitive and messy in terms of keeping the JSON data that would be coming back to a web interface clean. Plus, it would make it difficult to 'read' the HTML in the documents, as line breaks are not allowed, etc.
What is this conceptual leap that I need to make in terms of how I think about adding HTML in MongoDB?

The question you should really be asking is the following:
Do I have to modify the HTML content I store?
If you needed to insert, modify, or remove elements (with character data, for example) into, within, or from the HTML, and you needed to do this differently on each request, the answer would be "maybe store it as a tree in MongoDB". But I'll just stick with "don't".
Every time you wanted to print out your HTML as it is, you would need to reconstruct the document and render it as a string from the data stored in MongoDB. Likewise, you would need to parse the HTML and build the tree every time you wanted to store it. That would be a waste of resources and development time, just because your eye prefers the look of an HTML document stored as a JSON tree.
Just start implementing it, and when you hear a shot, it will be the bullet landing in your own foot.
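To make the "just store it as a string" advice concrete, here is a minimal sketch using pymongo; the database, collection, and field names are only examples, and it assumes a MongoDB server running locally:

from pymongo import MongoClient  # pip install pymongo

# Connect to a local mongod and pick an example database/collection.
articles = MongoClient()["mysite"]["articles"]

# The HTML is stored as an ordinary string value, not as a parsed tree.
articles.insert_one({
    "title": "My First Article",
    "content": "<p>Welcome to this page</p>"
               "<p>Today I would like to...</p>"
               '<p>Etc <img src="cat.jpg"></p>',
})

doc = articles.find_one({"title": "My First Article"})
print(doc["content"])  # hand the string straight to the template / browser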

Related

Most pythonic way to create PDF Files from JSON with Styling?

TL;DR: Looking for a Python library to create a PDF template with specific styling and fill it with information from a JSON file.
Full Context:
I have a long RPA pipeline that ends with 500+ JSON documents. Each JSON document represents an exam, and each exam might have 1000-4000 questions. The JSON file is simple; an example:
{
  "AllQuestions": [
    {
      "QuestionText": "A 3-year old man did.....",
      "Choices": ["Choice A", "Choice B", etc.]
    },
    another question, etc.
  ]
}
The only variable here is that sometimes I have 5 choices, sometimes 4, and sometimes there is an image in the exam (however, I can handle those specifics once I know what to use).
Well, I have to create a style that's similar to this one:
"Without Key Info, attending labs, etc."
Now, I looked into PyPDF2 and FPDF, and the best I could reach is this style:
For FPDF2, it is pretty straightforward: in just a few lines of code, I could create that by initializing a class, adding a page, and adding the question to it. However, the styling there is very limited, and even making use of "WriteHTML" I still couldn't reach my desired styling at all.
I read that PDFKit and other alternatives are good. Do you think I should first create a full HTML document with the 1000+ questions and then take that into PDFKit to convert it into a PDF? Or is there a way to treat each question as an object with default styling and append it to a PDF file object?
Thanks in advance :)
I don't know about the most Pythonic way, but I would do it like so:
Figure out what language you want to define the final output in. Since it has a lot of complex formatting, I'd say you want HTML (probably with CSS) or LaTeX.
Write a Jinja template in this target language, with variables in the appropriate places.
Plug the values from your JSON into Jinja to render the template and construct the HTML/LaTeX for every question.
Use pandoc to convert the HTML to PDF.
While this is quite a few technologies, they are all well suited to their task and easy to work with. The real problem is that you want to build PDFs with a very specific layout; PDF is a very complex format and not all libraries implement it well - but pandoc does.
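As a rough sketch of steps 2-4 (assuming the JSON shape from the question, and that pandoc plus an HTML-capable PDF engine such as weasyprint are installed; the file names are made up):

import json
import subprocess
from jinja2 import Template  # pip install Jinja2

# A deliberately tiny template; the real one would carry the desired CSS.
TEMPLATE = Template("""
<html><body>
{% for q in questions %}
  <h3>{{ loop.index }}. {{ q["QuestionText"] }}</h3>
  <ol type="A">
  {% for choice in q["Choices"] %}<li>{{ choice }}</li>{% endfor %}
  </ol>
{% endfor %}
</body></html>
""")

with open("exam.json") as f:
    data = json.load(f)

with open("exam.html", "w") as f:
    f.write(TEMPLATE.render(questions=data["AllQuestions"]))

# pandoc delegates HTML-to-PDF conversion to an engine (weasyprint, wkhtmltopdf, prince).
subprocess.run(["pandoc", "exam.html", "--pdf-engine=weasyprint",
                "-o", "exam.pdf"], check=True)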

XML as complement to HTML

I'm having trouble wrapping my head around using XML as complement to HTML. I know what they are used for but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML-file?
I have a template form which I want to be accessible on a server and for which I want to enable edits. Once edited, I want to save the edits to a separate file, so that the template is still available. (Just so you guys have a little bit of background regarding what I need this for.)
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML in that it uses the same magic characters for its markup and the same logic for where the data goes:
The characters < and > are used to separate the markup from the content.
The character & together with an entity name like &lt; is used to encode characters that would otherwise lead to trouble.
Elements can contain attributes, like <someElement someAttribute="attr value">.
Elements can contain text or sub-elements.
The big difference is that XML leaves you completely free in how you name your elements and attributes, while HTML relies on dedicated names (like <body>); on the other hand, XML is absolutely strict about structure, while HTML allows a lot (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the rules of HTML.
It is almost impossible to read arbitrary HTML as XML, but you can easily create XML which any browser will accept as a valid web page.
Your issue cries out for XSLT. This is a method to transform a given XML document into a new format. It allows you, for example, to export your data as XML and create a nice web page from it. Different XSLT stylesheets will present the same data in different ways.
There are several online tools to test this feature; you might have a look here.
Your statement "After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data" is not at all clear... How you send data to a web application, and how you send the (manipulated) data back, is not bound to XML. This is very often done with JSON, using JavaScript to read, edit, and send it back.
XML -> XSLT -> HTML is often used to create (rather static) reports for a web viewer.
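For a concrete picture of that pipeline, here is a small sketch using Python's lxml; the element names and the stylesheet are invented purely for illustration:

from lxml import etree  # pip install lxml

xml_doc = etree.fromstring("""
<articles>
  <article><title>Hello</title><body>First post</body></article>
  <article><title>World</title><body>Second post</body></article>
</articles>
""")

xslt_doc = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/articles">
    <html><body>
      <xsl:for-each select="article">
        <h2><xsl:value-of select="title"/></h2>
        <p><xsl:value-of select="body"/></p>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>
""")

# Apply the XSLT stylesheet to the XML data to produce an HTML page.
transform = etree.XSLT(xslt_doc)
print(str(transform(xml_doc)))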

Extracting the same data from various HTML documents

Let's say I have several HTML pages from unrelated websites, but that contain the same overall information. I want to extract that information in a flexible manner, i.e. I want to only have to write a small number of data extractors for all of the pages (ideally, one). Say the fields are (to use a blog example) author, date, title, text. The classes of the HTML tags that denote these could be totally different for each page, but still display on the page in roughly the same way. For example, take this post from CNN and this post from Gawker. Both contain the same information - the information that I want - somewhere on the page when it is actually displayed. Is there a nice way to extract that data? Writing separate extractors is an option, but not a good one; there are about a thousand styles of documents in the dataset I want to use.
The only way you can do that is by finding a common element in all of those websites (e.g. they share the same DOM structure, or have the same ID, or are preceded by the same content in a previous tag like an <h1>).
Otherwise, you need to write different rules or regular expressions for each case.
Unless, of course, you write an algorithm so intelligent that it is capable of recognizing the content's intention/meaning even across different HTML - which is neither simple nor quick to write in any way.
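For the non-intelligent route, here is a sketch of the per-site rules approach with BeautifulSoup; the selectors are hypothetical and would have to be checked against each actual site:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# One small rule set per site; roughly one line of configuration per style of document.
SITE_RULES = {
    "cnn.com":    {"title": "h1.headline",    "body": "div.article__content"},
    "gawker.com": {"title": "h1.entry-title", "body": "div.entry-content"},
}

def extract(html, site):
    rules = SITE_RULES[site]
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one(rules["title"])
    body = soup.select_one(rules["body"])
    return {
        "title": title.get_text(strip=True) if title else None,
        "text": body.get_text(" ", strip=True) if body else None,
    }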

Django templatetag for rendering a subset of html

I have some HTML (in this case created via TinyMCE) that I would like to add to a page. However, for security reasons, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
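As one possible sketch of that parse, filter and re-serialise approach in Python, the bleach library (built on html5lib, not mentioned in the answer above) does the whitelisting for you; the allowed tags and attributes below are only examples:

import bleach  # pip install bleach

ALLOWED_TAGS = {"p", "br", "strong", "em", "ul", "ol", "li", "a", "img"}
ALLOWED_ATTRS = {"a": ["href", "title"], "img": ["src", "alt"]}

def sanitize(user_html):
    # Tags and attributes not on the whitelist (including event handlers) are stripped.
    return bleach.clean(user_html, tags=ALLOWED_TAGS,
                        attributes=ALLOWED_ATTRS, strip=True)

print(sanitize('<p onclick="evil()">Hello</p><script>alert(1)</script>'))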
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
You can use removetags to specify a list of tags to be removed:
{{ data|removetags:"script" }}

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: ideally, the method would work with both well-formed markup and terrible markup, whether somebody uses paragraph tags to make paragraphs or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
- open the URL
- read in all links to the same website from that page
- follow all links and build a DOM tree for each URL (HTML file)
  - this should help you come up with redundant content (included templates and such)
- compare the DOM trees for all documents on the same site (tree walking)
  - strip all redundant nodes (i.e. repeated navigational markup, ads and such things)
  - try to identify similar nodes and strip them if possible
- find the largest unique text blocks that are not to be found in the other DOMs on that website (i.e. the unique content)
  - add them as candidates for further processing
This approach seems pretty promising because it would be fairly simple to do, yet still have good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
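A much-simplified sketch of this idea (assuming BeautifulSoup and a list of HTML pages already fetched from the same site): text blocks that recur on several pages are treated as template chrome, and whatever remains is a candidate for the unique content.

from collections import Counter
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def text_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Keep only reasonably long strings; short fragments are mostly chrome.
    return [t for t in soup.stripped_strings if len(t) > 30]

def unique_content(pages):
    # Count how many pages each block appears on.
    counts = Counter(block for html in pages for block in set(text_blocks(html)))
    # For every page, keep only the blocks that no other page contains.
    return [[block for block in text_blocks(html) if counts[block] == 1]
            for html in pages]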
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip out a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, consider each structural element like a div as a document), gather some properties of each, and convert it to a vector. (As other people suggested, these could be the number of words, number of links, number of images, and so on; the more the better.)
First, start with a large set of documents (100-1000) for which you have already chosen which part is the main content. Then use this set to train your SVM.
Then, for each new document, you just need to convert it to a vector and pass it to the SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
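A toy sketch of that idea with scikit-learn; the features follow the suggestion above (word, link and image counts), and the training vectors are invented, so treat this only as the shape of the solution:

from bs4 import BeautifulSoup   # pip install beautifulsoup4
from sklearn.svm import SVC     # pip install scikit-learn

def features(section_html):
    soup = BeautifulSoup(section_html, "html.parser")
    text = soup.get_text(" ", strip=True)
    return [len(text.split()), len(soup.find_all("a")), len(soup.find_all("img"))]

# One labelled example per page section: 1 = main content, 0 = everything else.
X_train = [[850, 4, 2], [12, 9, 0], [40, 25, 1], [600, 2, 1]]   # made-up vectors
y_train = [1, 0, 0, 1]

clf = SVC(kernel="linear").fit(X_train, y_train)

new_section = "<div><p>Long article text ...</p></div>"
print(clf.predict([features(new_section)]))  # 1 if it looks like main content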
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to check whether the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which result is best. For example, still try my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
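A quick sketch of that heuristic (the 20% link-text threshold is an arbitrary choice):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def main_div(html, max_link_ratio=0.2):
    soup = BeautifulSoup(html, "html.parser")
    best, best_len = None, 0
    for div in soup.find_all("div"):
        text = div.get_text(" ", strip=True)
        if not text:
            continue
        link_text = sum(len(a.get_text(strip=True)) for a in div.find_all("a"))
        # Prefer the div with the most text, as long as little of it sits inside links.
        if link_text / len(text) <= max_link_ratio and len(text) > best_len:
            best, best_len = div, len(text)
    return best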
I would probably start with the title and anything else in a head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.)... Beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume the page title has an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last elements containing sentences with punctuation and take everything in between. Headers are a special case, since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
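A rough sketch of that punctuation test (the length threshold and tag list are guesses; headers would need the separate Hn handling described above):

import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# "Looks like a sentence": starts with a capital, runs for a while, ends in . ! or ?
SENTENCE = re.compile(r"[A-Z][^.!?]{20,}[.!?]")

def sentence_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for el in soup.find_all(["p", "div", "li"]):
        text = el.get_text(" ", strip=True)
        if SENTENCE.search(text):
            blocks.append(text)
    return blocks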
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today, most news/blog websites are using a blogging platform.
So I would create a set of rules by which to search for the content.
For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could turn to the other solutions, such as identifying the biggest chunk of text, and so on.
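A sketch of that rule-based first pass (the selector list is illustrative and would grow as you meet more platforms):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# CSS selectors for post bodies on platforms you have already identified.
KNOWN_POST_SELECTORS = ["div.entry", "div.post-body", "article", "div.entry-content"]

def find_post(html):
    soup = BeautifulSoup(html, "html.parser")
    for selector in KNOWN_POST_SELECTORS:
        match = soup.select_one(selector)
        if match:
            return match
    return None  # fall back to the biggest-text-block heuristics above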
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code showing how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or, if Kotlin is more your language, you can take a look at Readability4J, a port of the above Readability.js.