Skipping DTD validation in DOM4J - dom4j

I have an XML file that references a DTD as follows:
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
I have put JATS-journalpublishing1.dtd in the project root, but it still fails with an org.dom4j.DocumentException. My code looks like this:
SAXReader dom = new SAXReader();
dom.setValidation(false);
Document document = dom.read("some.xml");
How do I tell the parser to ignore the DTD validation?
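For what it's worth, setValidation(false) is usually not enough on its own, because the parser still tries to resolve the external DTD. Below is a minimal sketch of the two common workarounds, assuming the underlying SAX parser is Xerces; it is an illustration, not a tested answer for this exact project:
SAXReader reader = new SAXReader();
reader.setValidation(false);
// assuming the underlying SAX parser is Xerces: do not load the external DTD at all
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
// alternative: resolve the DTD reference to an empty document
reader.setEntityResolver((publicId, systemId) ->
        new org.xml.sax.InputSource(new java.io.StringReader("")));
Document document = reader.read("some.xml");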

Related

How to parse HTML from JavaFX WebView and transfer this data to a Jsoup Document?

I am trying to parse the sidebar TOC (table of contents) of a documentation site.
Jsoup
I have tried Jsoup. I cannot get the TOC elements because the HTML content of this tag is not part of the initial HTML but is set by JavaScript after the page is loaded.
You can see my previous question here: JSoup cannot parse child elements after depth 2
The suggested solution was to manually examine, from the browser dev tools, which requests are made in order to find the final version of the page. Parsing the sidebar TOC of the documentation site is just one component of my Java program, so I cannot do this manually.
JavaFX WebView (not Android WebView)
I have tried JavaFX WebView because I need a browser that executes the JavaScript code and fills in the TOC elements.
WebView browser = new WebView();
WebEngine webEngine = browser.getEngine();
webEngine.load("https://learn.microsoft.com/en-us/ef/ef6/");
But I don't know how I can retrieve the HTML code of the loaded website and transfer this data to a Jsoup Document.
Any advice appreciated.
WebView browser = new WebView();
WebEngine webEngine = browser.getEngine();
String url = "https://learn.microsoft.com/en-us/ef/ef6/";
webEngine.load(url);
// once the page has finished loading, get the w3c document from the webEngine
org.w3c.dom.Document w3cDocument = webEngine.getDocument();
// use jsoup's W3CDom helper to convert it to an HTML string
String html = new org.jsoup.helper.W3CDom().asString(w3cDocument);
// create a jsoup document by parsing the html, using the url as base URI
Document doc = Jsoup.parse(html, url);
I can't promise this is the best way as I've not used Jsoup before and I'm not an expert on the XML API.
The org.jsoup.Jsoup class has a method for parsing HTML in String form: Jsoup.parse(String). This means we need to get the HTML from the WebView as a String. The WebEngine class has a document property that holds an org.w3c.dom.Document. This Document is the HTML content of the currently showing web page. We just need to convert this Document into a String, which we can do with a Transformer.
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.jsoup.Jsoup;
public class Utils {

    private static Transformer transformer;

    // not thread safe
    public static org.jsoup.nodes.Document convert(org.w3c.dom.Document doc)
            throws TransformerException {
        if (transformer == null) {
            transformer = TransformerFactory.newDefaultInstance().newTransformer();
        }
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return Jsoup.parse(writer.toString());
    }
}
You would call this every time the document property changes. I did some "tests" by browsing Google and printing the org.jsoup.nodes.Document to the console and everything seems to be working.
There is a caveat, though; as far as I understand it, the document property does not change when there are changes within the same web page (the Document itself may be updated, however). I'm not a web person, so pardon me if I don't make sense here, but I believe this includes things like a frame changing its content. There may be a way around this by interfacing with the JavaScript using WebEngine.executeScript(String), but I don't know how.
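As a small addition of my own (not part of the answer above), the conversion could be triggered from a listener on the document property, assuming a WebEngine named webEngine and the Utils class shown earlier:
webEngine.documentProperty().addListener((obs, oldDoc, newDoc) -> {
    if (newDoc != null) {
        try {
            // convert the freshly loaded w3c document and inspect it, e.g. print the title
            org.jsoup.nodes.Document jsoupDoc = Utils.convert(newDoc);
            System.out.println(jsoupDoc.title());
        } catch (TransformerException e) {
            e.printStackTrace();
        }
    }
});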

XSS with dynamic HTML input

My team is fixing security vulnerabilities in an old JSP application. It allows (permissioned) users to create a simple home page by putting their HTML into a textarea and having it render on the page. The problem is XSS. I have been doing some research and found that within the JSP pages I can use:
fn:escapeXml() from the JSTL library to escape any HTML/XML that is input. This is fine for simple form inputs, but for the home page creator I want to be able to keep simple HTML while getting rid of any harmful scripts or XSS vulnerabilities.
My teammate and I are fairly new to fixing XSS issues and have been relying on resources we find.
I have come across these resources and, after reading through them, am not sure they will work the way I would like:
-Which html sanitization library to use?
-https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet
If I use the OWASP sanitizer, will it reduce the HTML to basic rendering and prevent any scripting from being injected?
Here is what I currently have in my jsp:
<td class='caption'>
<c:set var="x"><%=system.getName()%></c:set>
Options for ${fn:escapeXml(x)}
</td>
This works and currently stops any HTML/XML/script from running, but I would still like to allow basic HTML (titles, paragraphs, fonts, colors, etc.) for a simple informational page.
According to OWASP
If your application handles markup -- untrusted input that is supposed to contain HTML -- it can be very difficult to validate. Encoding is also difficult, since it would break all the tags that are supposed to be in the input. Therefore, you need a library that can parse and clean HTML formatted text.
There are different HTML sanitizing libraries. The owasp-java-html-sanitizer library is probably a good choice.
You can use prepackaged policies:
PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.LINKS);
String safeHTML = policy.sanitize(untrustedHTML);
configure your own policy:
PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("a")
    .allowUrlProtocols("https")
    .allowAttributes("href").onElements("a")
    .requireRelNofollowOnLinks()
    .build();
String safeHTML = policy.sanitize(untrustedHTML);
or write custom policies:
PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("p")
    .allowElements(
        new ElementPolicy() {
            public String apply(String elementName, List<String> attrs) {
                attrs.add("class");
                attrs.add("header-" + elementName);
                return "div";
            }
        }, "h1", "h2", "h3", "h4", "h5", "h6")
    .build();
String safeHTML = policy.sanitize(untrustedHTML);
Read the documentation for full details.
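For the use case in the question (keeping titles, paragraphs, fonts and colors while stripping scripts), the prepackaged policies can also be combined; a rough sketch along those lines, not taken verbatim from the library documentation:
// combine prepackaged policies: inline formatting, common block elements
// (p, h1-h6, lists, ...), inline styles and links; anything not allowed by
// a policy -- including script tags and event-handler attributes -- is stripped
PolicyFactory policy = Sanitizers.FORMATTING
    .and(Sanitizers.BLOCKS)
    .and(Sanitizers.STYLES)
    .and(Sanitizers.LINKS);
String safeHTML = policy.sanitize(untrustedHTML);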

How to convert a string into HTML and loop through it in Windows Phone 8

I am using the following code
Deployment.Current.Dispatcher.BeginInvoke(() =>
{
string site = "http://www.nokia.com";
webBrowserControl.Navigate(new Uri(site, UriKind.Absolute));
webBrowserControl.LoadCompleted += webBrowserControl_LoadCompleted;
});
private void webBrowserControl_LoadCompleted(object sender, NavigationEventArgs e)
{
string s = webBrowserControl.SaveToString();
}
How do I loop through this result string to find elements like the following?
<div class="result-wrapper">
I tried to convert this string to an XML document but I am getting an error.
Please help me... thanks
You should not use an XML document parser to parse HTML, because the HTML schema is different from XML. You can use the HTML Agility Pack to parse HTML; below is a link on how you can use the Agility Pack:
HTML Agility Pack - Windows Phone 8
Hope this helps.
It will throw an exception when the input is not well-formed XML; every tag must be properly opened and closed. Check your HTML document with an online XML validator and then proceed.
If you are going to parse only a few tags, then identify the substring in your HTML document using string.IndexOf() and load your XML document from that substring.
Otherwise, you have to do it manually or by using the HTML Agility Pack. But the HTML Agility Pack needs some libraries from Silverlight 4.0, which is not recommended by Microsoft.
So doing it manually is my choice.

LOCAL HTML file to generate a text file

I am trying to generate a TEXT/XML file from a LOCAL HTML file. I know there are a lot of answers about generating a file locally, usually suggesting the use of an ActiveX object or HTML5.
I'm guessing there is a way to make it work on all browsers (in the end an HTML file is opened by a browser even if it is a LOCAL file), and easily, since this is a LOCAL file supplied by the user himself.
My HTML file will be on the client's local machine, not accessed via HTTP.
It is basically just a form written in HTML that, upon a "SAVE" command, should generate an XML file on the local disk (anywhere the user decides) and save the form's content in it.
Any good way?
One way that I can think of is to set the HTML form values into class variables and then use a JAXB context to create an XML file out of them.
Useful Link: http://www.vogella.com/tutorials/JAXB/article.html
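A rough sketch of that idea, assuming JAXB (javax.xml.bind) is available on the classpath; the FormData class and its fields are made up for illustration:
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class FormData {
    public String title;   // hypothetical form fields
    public String body;

    public static void main(String[] args) throws Exception {
        FormData data = new FormData();
        data.title = "My page";
        data.body = "Hello world";
        // marshal the object to an XML file on disk
        Marshaller m = JAXBContext.newInstance(FormData.class).createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        m.marshal(data, new File("form.xml"));
    }
}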
What you can do is use base64 data URLs (not supported in IE9 and below) to download the file:
First you need to create a temporary iframe element for your file to download in:
var ifrm = document.createElement('iframe');
ifrm.style.display = 'none';
document.body.appendChild(ifrm);
Then you need to define what you want the contents of the file to download to be, and convert it to a base64 data-url:
var html = '<!DOCTYPE html><html><head><title>Foo</title></head><body>Hello World</body></html>';
var htmlurl = btoa(html);
and set it as source for the iframe
ifrm.src = 'data:text/x-html;base64,'+htmlurl;

Parsing RDFa in HTML/XHTML?

I am using the RDF::RDFa::Parser module in Perl to parse RDF data out of websites.
On websites with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> it works, but on sites using XHTML with <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> there is no output...
test website -> http://www.filmstarts.de/kritiken/186918.html
use RDF::RDFa::Parser;
my $url = 'http://www.filmstarts.de/kritiken/186918.html';
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa = RDF::RDFa::Parser->new_from_url($url, $options);
print $rdfa->opengraph('image');
print $rdfa->opengraph('description');
(I'm the author of RDF::RDFa::Parser.)
It looks like the HTML parser used by the RDFa parser is failing on that page. (I'm also the maintainer of the HTML parser in question, so I can't shift the blame onto anyone else!) Thus, by the time the RDFa parsing starts, all it sees is an empty DOM tree.
The page is quite hideously invalid XHTML yet still I would have expected the HTML parser to do a reasonable job. I've filed a bug report for you.
In the meantime, a workaround might be to build the XML::LibXML DOM tree outside of RDF::RDFa::Parser (perhaps using libxml's built-in HTML parser?). You could pass that tree directly to the RDFa parser:
use RDF::RDFa::Parser;
use LWP::Simple qw(get);
my $url = 'http://www.filmstarts.de/kritiken/186918.html';
my $xhtml = get($url);
my $dom = somehow_build_a_dom_tree($xhtml); # hand-waving!!
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa = RDF::RDFa::Parser->new($dom, $url, $options);
print $rdfa->opengraph('image');
print $rdfa->opengraph('description');
I hope that helps!
Update: here's a possible implementation of somehow_build_a_dom_tree...
use XML::LibXML;

sub somehow_build_a_dom_tree {
    my $p = XML::LibXML->new;
    $p->recover_silently(1);
    $p->load_html( string => $_[0] );
}