R XML package weird bug while parsing xml and html files

I am using R's XML package to extract all possible data from a wide variety of HTML and XML files. These files are basically documentation, build properties, or readme files. Here is an example of one of the XML files:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE chapter PUBLIC '-//OASIS//DTD DocBook XML V4.1.2//EN'
  'http://www.oasis-open.org/docbook/xml/4.0 docbookx.dtd'>
<chapter lang="en">
  <chapterinfo>
    <author>
      <firstname>Jirka</firstname>
      <surname>Kosek</surname>
    </author>
    <copyright>
      <year>2001</year>
      <holder>Ji&rcaron;í Kosek</holder>
    </copyright>
    <releaseinfo>$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp $</releaseinfo>
  </chapterinfo>
  <title>Using XSL stylesheets to generate HTML Help</title>
  <?dbhtml filename="htmlhelp.html"?>
  <para>HTML Help (HH) is help-format used in newer versions of MS
  Windows and applications written for this platform. This format allows
  to pack several HTML files together with images, table of contents and
  index into single file. Windows contains browser for this file-format
  and full-text search is also supported on HH files. If you want know
  more about HH and its capabilities look at <ulink
  url="http://msdn.microsoft.com/library/tools/htmlhelp/chm/HH1Start.htm">HTML
  Help pages</ulink>.</para>
  <section>
    <title>How to generate first HTML Help file from DocBook sources</title>
    <para>Working with HH stylesheets is same as with other XSL DocBook
    stylesheets. Simply run your favorite XSLT processor on your document
    with stylesheet suited for HH:</para>
  </section>
</chapter>
My goal is simply to use xmlValue after parsing the tree with htmlTreeParse or xmlTreeParse, using something like this (for XML files):
Text = xmlValue(xmlRoot(xmlTreeParse(XMLFileName)))
However, there is one problem when I do this for both XML and HTML files: if there are child nodes at level 2 or deeper, their text fields get pasted together without any space in between them.
For example, in the document above, xmlValue(chapterinfo) is
JirkaKosek2001JiKosek$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp
The xmlValue of each child node (recursively) is pasted together without adding a space between them. How can I get xmlValue to add whitespace while extracting this data?
Thanks a lot for your help in advance,
Shivani

According to the documentation, xmlValue only works on single text nodes,
or on "XML nodes that contain a single text node".
Whitespace in non-text nodes is apparently not kept.
However, even in the case of a single text node,
your code would strip the whitespace:
library(XML)
doc <- xmlTreeParse("<a> </a>")
xmlValue(xmlRoot(doc))
# [1] ""
You can add the ignoreBlanks=FALSE and useInternalNodes=TRUE
arguments to xmlTreeParse, to keep all the whitespace.
doc <- xmlTreeParse(
  "<a> </a>",
  ignoreBlanks = FALSE,
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] " "
# Spaces inside text nodes are preserved
doc <- xmlTreeParse(
  "<a>foo <b>bar</b></a>",
  ignoreBlanks = FALSE,
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foo bar"
# Spaces between text nodes (inside non-text nodes) are not preserved
doc <- xmlTreeParse(
  "<a><b>foo</b> <b>bar</b></a>",
  ignoreBlanks = FALSE,
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foobar"

Related

Ruby: Including raw HTML in a Nokogiri HTML builder

I'm writing code to convert a fixed XML schema to HTML. I'm trying to use Nokogiri, and it works for most tags, e.g.:
# doc is the Nokogiri html builder, text_inline is a TextInlineContent node
def consume_inline_content?(doc, text_inline)
  text = text_inline.text
  case text_inline.name
  when 'text'
    doc.text text
  when 'emphasized'
    doc.em {
      doc.text text
    }
  # ... and so on ...
  end
end
The problem is, this schema also includes a rawHTML text node. Here is some of my input:
<rawHTML><![CDATA[<h2>]]></rawHTML>
Stuff
<rawHTML><![CDATA[</h2>]]></rawHTML>
which should ideally be rendered as <h2>Stuff</h2>. But when I try the "obvious" thing:
...
when 'rawHTML'
  doc << text
...
Nokogiri produces <h2></h2>Stuff. It seems to be "fixing" the unbalanced open tag before I have a chance to insert its contents or closing tag.
I recognize that I'm asking about a feature that could produce malformed html, and maybe the builder doesn't want to allow that. Is there a right way to handle this situation?

Issues with parsing HTML with ragel

In my project I need to extract links from an HTML document.
For this purpose I've prepared a Ragel HTML grammar, primarily based on this work:
https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
(mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript )
Almost everything works well (thanks for the great tool!), except for one issue I haven't been able to overcome so far:
If I specify this text as an input:
bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
my parser correctly extracts the first link, but not the second one.
The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
In general, if any text other than spaces appears before the '<a' tag, the parser treats it as content and does not recognize the tag opening.
Please find in this repo an intentionally simplified sample with the grammar, meant to work as a C program (ngx_url_html_portion.rl): https://github.com/amdei/ragel_html_sample
There is also an input file, input-nbsp.html, which is expected to contain the input for the application.
To play with it, generate the .c file from the grammar:
ragel ngx_url_html_portion.rl
then compile the resulting .c file and run the program.
The input file should be in the same directory.
I will be sincerely grateful for any clue.
The issue with the defined FSM is that it includes in 'content' all characters up to a space. You should exclude the HTML tag opening '<' from the rule. Here is a diff for illustration:
$ git diff
diff --git a/ngx_url_html_portion.rl b/ngx_url_html_portion.rl
index ccef0ca..1f8dcf0 100644
--- a/ngx_url_html_portion.rl
+++ b/ngx_url_html_portion.rl
@@ -145,7 +145,7 @@ void copy2hrefbuf(par_t* par, u_char* p){
);
content = (
- any - (space )
+ any - (space ) - '<'
)+;
html_space = (

How can I strip HTML tags from a string in the model before I get to the view

I'm trying to determine how to strip the HTML tags from a string in Ruby. I need this to be done in the model before I get to the view. So using:
ActionView::Helpers::SanitizeHelper#strip_tags()
won't work. I was looking into using Nokogiri, but can't figure out how to do it.
If I have a string:
description = '<a href="http://www.google.com">google</a>'
I need it to be converted to plain text without the HTML tags, so it would just come out as "google".
Right now I have the following which will take care of HTML entities:
def simple_description
  simple_description = Nokogiri::HTML.parse(self.description)
  simple_description.text
end
You can call the sanitizer directly like this:
Rails::Html::FullSanitizer.new.sanitize('<b>bold</b>')
# => "bold"
There are also other sanitizer classes that may be useful: FullSanitizer, LinkSanitizer, Sanitizer, WhiteListSanitizer.
Nokogiri is a great choice if you don't own the HTML generator and you want to reduce your maintenance load:
require 'nokogiri'
description = '<a href="http://www.google.com">google</a>'
Nokogiri::HTML::DocumentFragment.parse(description).at('a').text
# => "google"
The good thing about a parser vs. using patterns is that the parser continues to work with changes to the tags or the format of the document, whereas patterns get tripped up by those things.
While a parser is a little slower, it more than makes up for that with ease of use and reduced maintenance.
The code above breaks down to:
Nokogiri::HTML(description).to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"http://www.google.com\">google</a></body></html>\n"
Rather than let Nokogiri add the normal HTML headers, I told it to parse only that one node into a document fragment:
Nokogiri::HTML::DocumentFragment.parse(description).to_html
# => "<a href=\"http://www.google.com\">google</a>"
at finds the first occurrence of that node:
Nokogiri::HTML::DocumentFragment.parse(description).at('a').to_html
# => "<a href=\"http://www.google.com\">google</a>"
text finds the text in the node.
Maybe you could use a regular expression in Ruby, like the following:
des = '<a href="http://www.google.com">google</a>'
p des[/<.*>(.*)\<\/.*>/,1]
The result will be "google".
Regular expressions are powerful.
You can customize this to fit your needs.

With R and XPath, how do you remove format elements such as \n and \t from the results?

Using XML I can scrape the URL I need, but when I use xpathSApply on it, R returns unwanted \n and \t indicators (new lines and tabs). Here is an example:
doc <- htmlTreeParse("http://www.milesstockbridge.com/offices/", useInternal = TRUE) # scrape and parse an HTML site
xpathSApply(doc, "//div[@class='info']//h3", xmlValue)
[1] "\n\t\t\t\t\t\tBaltimore\t\t\t\t\t" "\n\t\t\t\t\t\tCambridge\t\t\t\t\t" "\n\t\t\t\t\t\tEaston\t\t\t\t\t" "\n\t\t\t\t\t\tFrederick\t\t\t\t\t"
[5] "\n\t\t\t\t\t\tRockville\t\t\t\t\t" "\n\t\t\t\t\t\tTowson\t\t\t\t\t" "\n\t\t\t\t\t\tTysons Corner\t\t\t\t\t" "\n\t\t\t\t\t\tWashington\t\t\t\t\t"
As explained in this question: how to delete the \n\t\t\t in the result from website data collection?, regex functions can easily remove the unwanted format elements, but I would rather have XPath do the work first, if possible (I have hundreds of these to parse).
Also, there are apparently functions such as translate, as in this question: Using the Translate function to remove newline characters in xml, but how do I ignore certain tags?, as well as strip(), which I saw in a Python question. I do not know which of these are available when using R and XPath.
It may be that a text() function helps, but I do not know how to include it in my xpathSApply expression. Likewise with normalize-space().
You just want the trim = TRUE argument in your xmlValue() call.
> xpathSApply(doc, "//div[@class='info']//h3", xmlValue, trim = TRUE)
#[1] "Baltimore" "Cambridge" "Easton"
#[4] "Frederick" "Rockville" "Towson"
#[7] "Tysons Corner" "Washington"

Saving NSXMLDocument escaping HTML for certain NSXMLElements

For my application I have to save an XML document containing a few elements with HTML text.
An example of what the result should be:
<gpx>
  <wpt>
    <elementInHTML>
      &lt;p&gt;Sample text.&lt;/p&gt;
    </elementInHTML>
etc...
But when I add this HTML element to my NSXMLDocument, the '<' is correctly escaped automatically (to &lt;), but the '>' is not (to &gt;).
In code:
NSXMLElement *newWPT = [NSXMLElement elementWithName:@"wpt"];
NSXMLElement *htmlElement = [NSXMLElement elementWithName:@"elementInHTML"];
htmlElement.stringValue = @"<Sample text>";
[newWPT addChild:htmlElement];
But this results in an XML document like this:
<gpx>
  <wpt>
    <elementInHTML>
      &lt;p>Sample text.&lt;/p>
    </elementInHTML>
etc...
And this result is not valid for the device that has to process this XML file.
Does anybody have an idea how to enclose a correctly escaped HTML string in an NSXMLDocument?
The string is correctly escaped for XML; a greater-than sign is a valid character where it is: http://www.w3.org/TR/REC-xml/#syntax
It seems it's a device-implementation-specific problem.
Your easy option is to include your HTML markup in a CDATA section...
...and hope the device's client XML parser implementation understands it properly.
(If your HTML markup also includes CDATA sections, you'll have to find/replace ">" with "&gt;", as stated in the link before.)
P.D.: searching for NSXMLNode CDATA in any search engine will lead you to something closer to "copy-paste"
EDIT:
Knowing more now about the content of the string in the original question (see the question comments), and depending on the nature of your string, answers to this other question may also help: Objective-C and Swift URL encoding