Repair Invalid HTML and XML - html

I have a document that contains invalid HTML and XML. I need to parse it so that whenever the parser encounters invalid HTML or XML, it fixes it rather than treating it as a plain string. So far I have tried:
Nokogiri::HTML(document)
Nokogiri::XML(document)
The two do not work together.
I also looked at this link, but it didn't help much. I also considered a hack, replacing the invalid HTML and XML with a regex, but my data is too large for that to be practical.
Calling Nokogiri::HTML(document) handles the HTML very well. The problem is that it skips the XML tags, which I don't want. I need those tags to be printed in the browser.
Some example tags:
</SESSION_CONFIG VERSION=“bgh:3”
<METHODS BASEURL=http://abc.hgd.com /servlet/IAMSERVER/>
<ADD_USER URL=“addUser/>
I know some of these tags are not legal, but I still need to print them in the browser.

You need to configure norecover so that Nokogiri does NOT try to fix the issues:
require 'nokogiri'
badly_formed = <<-EOXML
</SESSION_CONFIG VERSION=“bgh:3”
<METHODS BASEURL=http://abc.hgd.com /servlet/IAMSERVER/>
<ADD_USER URL=“addUser/>
EOXML
bad_doc = Nokogiri::HTML(badly_formed) { |config| config.norecover }
puts bad_doc
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>VERSION=“bgh:3”
<methods baseurl="http://abc.hgd.com"></methods>
<add_user url="“addUser/">
</add_user></p></body></html>
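If the goal is only to display the malformed tags literally in the browser, a separate option (not part of the Nokogiri approach above, and an assumption about what the asker wants) is to HTML-escape the raw text instead of parsing it. A minimal sketch using Ruby's standard CGI.escapeHTML; the variable names are illustrative:

```ruby
require 'cgi'

# The malformed tags from the question, kept as plain text.
raw_tags = [
  '</SESSION_CONFIG VERSION=“bgh:3”',
  '<METHODS BASEURL=http://abc.hgd.com /servlet/IAMSERVER/>',
  '<ADD_USER URL=“addUser/>'
]

# Escaping turns < and > into entities, so a browser shows the tags
# as text instead of trying to interpret them as markup.
escaped = raw_tags.map { |tag| CGI.escapeHTML(tag) }

puts escaped
```

The escaped strings can be embedded in a page as-is. Nothing is repaired or validated, so this only helps when showing the text is enough.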

Related

Can I safely replace "<ul>" tags within HTML using regexes?

I am trying to solve this issue, where users paste invalid HTML that we have to deal with, of the form <ol><ul><li>item</li></ul></ol>. We are currently parsing using lxml. In legal HTML, <ol> cannot have a (direct) child of a <ul> (it must be in an <li>) so lxml closes the ol tag too soon to try to "repair" the HTML, producing <div><ol/><ul><li>item</li></ul>.
The user-pasted text also might be invalid XML (e.g., bare <br> tag), so we can't just parse it as XML.
Thus, we can neither parse it as HTML nor XML, because it might be invalid.
To make this certain (common) case of invalid HTML into valid HTML, can we just replace all <ul> tags with <ol> tags using regexes?
If I use lxml to parse <ol><ol><li>item</li></ol></ol>, the output looks fine (does not close a tag too soon).
However, I don't want to break actual user-typed text, and I'm wondering if there are edge cases I haven't thought of (like "<ul>" within a <pre> tag or some other crazy thing that isn't actually a tag, though I've tested that particular case).
Yes, it would change unnumbered lists to numbered lists. I'm okay with that.
Yes, I have read this fun regex answer.
In general, there is no guarantee of a 'non-edge case' transform with HTML and regular expressions. HTML, more so than XML, has rules that make a direct text replacement of things that look like tags problematic.
The following text validates as HTML in the w3c.org validation checker without any warnings.
<!DOCTYPE html>
<html lang="en">
<head>
<title><!--<ul>--></title>
<style lang="css">s {content: "<ul>";}</style>
<script>"<ul>"</script>
</head>
<body data-ul="<ul>"></body>
</html>
That aside, some regular expression heuristics might solve the issue at hand, at least within a reasonable scope. A streaming HTML token parser that does not attempt to apply any validation or DOM/tree building might also be useful for the initial replacement stage.
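As a rough illustration of such a token-level heuristic, the following Ruby sketch walks the input token by token, passes comments and the contents of script, style, and title elements through untouched, and rewrites only real ul start and end tags. The question itself uses lxml in Python; Ruby is used here purely for illustration. The TOKEN pattern and the ul_to_ol name are hypothetical, and the pattern is an approximation, not a conforming HTML tokenizer:

```ruby
# Approximate HTML tokens: comments, raw-text elements, and ordinary tags.
TOKEN = %r{
    <!--.*?-->                                      # an HTML comment
  | <(?<raw>script|style|title)\b.*?</\k<raw>\s*>   # raw-text element with its content
  | <(?<slash>/?)(?<name>[a-zA-Z][^\s/>]*)          # start or end tag name
    (?<rest>(?:"[^"]*"|'[^']*'|[^>"'])*)>           # attributes, quoted values kept whole
}mix

# Rewrites <ul ...> and </ul> tags to <ol ...> and </ol>, leaving
# everything else (including text between tags) unchanged.
def ul_to_ol(html)
  html.gsub(TOKEN) do
    m = Regexp.last_match
    if m[:name]&.casecmp?('ul')
      "<#{m[:slash]}ol#{m[:rest]}>"
    else
      m[0] # comments, raw-text elements, and other tags pass through
    end
  end
end

puts ul_to_ol('<ol><ul><li>item</li></ul></ol>')
```

Because quoted attribute values are consumed as part of their tag, a value such as data-ul="<ul>" is not mistaken for a tag. Other contexts (for example a literal <ul> typed inside a textarea) are not handled, so this stays a heuristic.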

Why is XPath returning text outside the html tags?

I am working with a document which has some text outside the <html> tag. When I read the data inside body, it also returns the text that is not even inside the html tag.
page_text = Nokogiri::HTML(open(file_path)).xpath("//body").text
p page_text
Output:
"WARC/1.0\nWARC-Type: response\nWARC-Date: 2012-02-11T04:48:01Z\nWARC-TREC-ID: clueweb12-0000tw-13-04988\nWARC-IP-Address: 184.85.26.15\nWARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR\nWARC-Target-URI: http://www.allchocolate.com/health/basics/\nWARC-Record-ID: \nContent-Type: application/http; msgtype=response\nContent-Length: 14577\n\n\n\n\n sample document\n\n\n hello world\n\n"
Document:
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>sample document</title>
</head>
<body>
hello world
</body>
</html>
Nokogiri is trying to parse the file contents as an HTML document, but it isn’t a valid document. It is a text document that just happens to contain an HTML document. Of course Nokogiri doesn’t know this, and it isn’t able to pick out the part that is HTML by itself, so it tries to parse the whole thing. Since it is not valid HTML, this produces errors.
As it parses, Nokogiri attempts to fix these errors as best it can, but that doesn’t work in this case, and results in the strange looking output you see here.
In particular, when Nokogiri sees the text before the HTML, it assumes that it should be part of the HTML document body. So it creates and injects html and body elements into the document, before adding the text as a child of this body.
Later it sees the actual <body> tag, but since it knows it already has a body element, and that there can only be one such element, it ignores it.
You need to make sure that you only provide valid HTML (or as close as you can to valid — the error correction can fix small things). You will probably need to pre-process your files in some way to remove the extra text at the beginning.
Clearly leading text is a problem, but not trailing text. XML is a highly structured language, and applying an XML parser to HTML means at the very least that you have to have valid HTML. If you don't have valid HTML, then you get whatever Nokogiri spits out.
It looks to me like Nokogiri wraps the whole thing in a default root node, then returns all the text nodes therein, essentially ignoring the //body xpath. Interestingly, if you wrap your text in a div and search for the xpath //div, no problems, so that might suggest a solution.
It seems like Nokogiri considers //body to be equal to the root node. Ah! Maybe Nokogiri uses <body> for the root node. Nope: the xpath /body//body doesn't work.
Response to comment:
You could use a regex to search for the <body> tag then insert a div tag. But searching html with a simple regex will be a fragile solution, and it won't work in all cases.
By the way, you can see how Nokogiri handles text outside of tags by parsing a document that only has the text: hello world, then printing out all the nodes that Nokogiri finds:
require 'nokogiri'
nodes = Nokogiri::HTML(open('html.html')).xpath('//*')
nodes.each do |node|
  puts node.name
end
--output:--
html
body
p
So Nokogiri wraps the text in three tags.
Or, better yet, you can parse your document and print it out as html:
require 'nokogiri'
doc = Nokogiri::HTML(open('./html.html'))
puts doc.to_html
--output:--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><body><p>WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577
<title>sample document</title>
hello world
</uuid:ff32c863-5066-4f51-802a-f31d4af074d5></p></body></html>
That means you can get hello world like this:
require 'nokogiri'
doc = Nokogiri::HTML(open('./html.html'))
title = doc.at_xpath('//title')
puts title.next.text.strip
--output:--
hello world
Another approach is to get rid of the non-html content before parsing with Nokogiri:
require 'nokogiri'
infile = File.open('html.html')
non_html = infile.gets("\n\n") # Read everything up to and including the first blank line
html = infile.gets(nil) # Slurp the rest of the file
infile.close
doc = Nokogiri::HTML(html)
puts doc.at_xpath('//body').text.strip
--output:--
hello world
That assumes there's always a blank line separating the non-html content from the html content.
First of all, @7stud's answer is spot on that you can split the file on \n\n, but in my document collection there isn't always a \n\n before the actual HTML code.
So, using the same idea, I came up with another workaround: remove all the text before the opening html tag with a regex, then pass the result to Nokogiri to parse.
file = File.read(file_path).to_s
file = file.sub(/.*?(?=<html)/im,"")
page = Nokogiri::HTML(file)
Now it is working fine.
It's simple to preprocess the content before passing it to Nokogiri:
require 'nokogiri'
text = '
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>sample document</title>
</head>
<body>
hello world
</body>
</html>
'
doc = Nokogiri::HTML(text[/<!DOCTYPE.+/m])
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n <title>sample document</title>\n</head>\n<body>\n hello world\n</body>\n</html>\n"
The trick is:
text[/<!DOCTYPE.+/m]
which tells Ruby to look through the text and return all the text from <!DOCTYPE to the end of the string, which is valid HTML.
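Combining this with the <html-based regex from the earlier answer, a small helper can fall back when a record has no DOCTYPE. This is only a sketch: extract_html is a hypothetical name, the record below is a shortened stand-in for the real file, and the fallback order is an assumption about what the files look like.

```ruby
# Hypothetical helper: returns the part of a record that looks like HTML.
def extract_html(text)
  text[/<!DOCTYPE.+/mi] ||   # everything from the DOCTYPE onward
    text[/<html.+/mi] ||     # or from the opening <html tag
    text                     # or the input unchanged if neither is found
end

# Shortened sample record: header lines, a blank line, then HTML.
record = <<~RECORD
  WARC/1.0
  WARC-Type: response

  <!DOCTYPE html>
  <html><body>hello world</body></html>
RECORD

puts extract_html(record)
```

The returned string can then be handed to Nokogiri::HTML as in the answers above. Trailing non-HTML text is not removed by this helper.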

HTML to Jade error when the document contains <pre>

I have some static HTML documents that I want to convert into Jade. I tried html2jade from npm, and everything works except for one thing: the <pre> elements in the HTML come out empty. Can someone help me?
The html code looks like this:
<pre><code><p>Hello</p><span>Hello Again</span></code></pre>
The result is:
pre.
You can write that a couple different ways in Jade. Here are two different methods. The first takes advantage of Jade's automatic escaping while the second uses HTML entities instead.
Automatic escaping:
pre
  code= '<p>Hello</p><span>Hello Again</span>'
HTML entities:
pre
  code &lt;p&gt;Hello&lt;/p&gt;&lt;span&gt;Hello Again&lt;/span&gt;

How to set doctype in CQ5.6?

I've been working with CQ5.6 for about a month now and our test site is almost done in terms of components.
However when we try to validate the pages we run into problems because AEM puts <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> above all our pages.
Now, I can't find any place where the Doctype is explicitly declared in our code. Nor is the HTML tag in our code, so I presume AEM wraps these around everything.
First I tried deleting the import of the doctype in our page component and replacing it with <!DOCTYPE html> but then we ended up with 2 doctype declarations. First the XHTML one, then a wrapped <html> tag and then my HTML5 one.
I've read in the docs that you can set the doctype using the cq:doctype property, but I have no clue where I should add that property.
I also tried putting this above the head tags in our page components, but to no avail:
<% Doctype doc= Doctype.valueOf("HTML_5");
doc.toRequest(request);
%>
<%= Doctype.fromRequest(request).getDeclaration() %>
Could anyone explain to me how or where I could set the doctype to HTML5 for our project?
CQ doesn't wrap the page with a Doctype by default. Most likely your page component has foundation/components/page as its parent (via the sling:resourceSuperType property).
Due to the component hierarchy and inheritance, the Doctype included in the foundation page.jsp is getting included for your page, and hence it appears as if it is wrapping up your HTML.
The page.jsp includes the doctype as shown below:
<%= Doctype.fromRequest(request).getDeclaration() %>
You can avoid this by overriding the content of foundation page.jsp within your page component itself.
Under the path foundation/components/page/_NAME_ you can override the head.jsp file, which contains the DOCTYPE definition as well as the HEAD statements.
Next you can see an example of the original one:
head.jsp example
If you didn't define your own custom template, that's the reason. You may need to create a folder (/foundation/components/page/_NAME_/) with the following structure:
head.jsp
body.jsp
dialog.xml

XSLT string with HTML entities - How can I get it to render as HTML?

I'm completely new to using XSL, so if there's any information that I'm neglecting to include, just let me know.
I have a string in my XSLT file that I can display like this:
<xsl:value-of select="@Description"/>
and it shows up, rendered in a browser like:
<div>I can't do anything about the html entities existing in the text.</div> <div>This includes quotes, like &quot;Hello World&quot; and sometimes whitespaces.&nbsp;</div>
What can I do to get this string rendered as HTML, so that <div></div> results in newlines, &quot; gives me ", and &nbsp; gives me a space?
I could elaborate on things I've already tried that haven't worked, but I don't know if that's relevant.
I think you want to set the following attribute as so:
<xsl:value-of select="@Description" disable-output-escaping="yes"/>
Why do you need to have entities output? To the browser, &nbsp; is the same as the literal non-breaking space character (U+00A0) -- in both cases it will display a non-breaking space.
There is a feature in XSLT 2.0 called character maps that provides this functionality, if really needed. It is an XSLT best practice to avoid DOE (disable-output-escaping) unless absolutely necessary.
Also, DOE is not a mandatory feature of XSLT and some XSLT processors may not implement it. This means that an XSLT application that uses DOE is generally not portable across different XSLT processors.
The reason divs in HTML get a line break is completely different and is related to the CSS box model. Most browsers apply the style:
div {display:block;}
In lieu of the standard display:inline;. However, they only do that to divs in the XHTML namespace, so you need to output the divs in the XHTML namespace to facilitate that. Bind the XHTML namespace to the prefix xhtml at the top of your document like so:
<xsl:stylesheet xmlns:xhtml="http://www.w3.org/1999/xhtml" ... >
Then output the divs as <xhtml:div> ... </xhtml:div>. Most browsers will recognise the div as being in the XHTML namespace (http://www.w3.org/1999/xhtml) and apply the block style.