Parsing RDFa in HTML/XHTML?

I'm using the RDF::RDFa::Parser module in Perl to parse RDFa data out of a website.
On a website with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> it works, but on sites using the XHTML doctype <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> there is no output...
Test website -> http://www.filmstarts.de/kritiken/186918.html
use RDF::RDFa::Parser;
my $url = 'http://www.filmstarts.de/kritiken/186918.html';
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa = RDF::RDFa::Parser->new_from_url($url, $options);
print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

(I'm the author of RDF::RDFa::Parser.)
It looks like the HTML parser used by the RDFa parser is failing on that page. (I'm also the maintainer of the HTML parser in question, so I can't shift the blame onto anyone else!) Thus, by the time the RDFa parsing starts, all it sees is an empty DOM tree.
The page is quite hideously invalid XHTML, but I would still have expected the HTML parser to do a reasonable job. I've filed a bug report for you.
In the meantime, a workaround might be to build the XML::LibXML DOM tree outside of RDF::RDFa::Parser (perhaps using libxml's built-in HTML parser?). You could then pass that tree directly to the RDFa parser:
use RDF::RDFa::Parser;
use LWP::Simple qw(get);
my $url = 'http://www.filmstarts.de/kritiken/186918.html';
my $xhtml = get($url);
my $dom = somehow_build_a_dom_tree($xhtml); # hand-waving!!
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa = RDF::RDFa::Parser->new($dom, $url, $options);
print $rdfa->opengraph('image');
print $rdfa->opengraph('description');
I hope that helps!
Update: here's a possible implementation of somehow_build_a_dom_tree...
sub somehow_build_a_dom_tree {
    my ($html) = @_;
    my $p = XML::LibXML->new;
    $p->recover_silently(1);
    return $p->load_html( string => $html );
}

Related

How to convert a string into HTML and loop through it in Windows Phone 8

I am using the following code
Deployment.Current.Dispatcher.BeginInvoke(() =>
{
string site = "http://www.nokia.com";
webBrowserControl.Navigate(new Uri(site, UriKind.Absolute));
webBrowserControl.LoadCompleted += webBrowserControl_LoadCompleted;
});
private void webBrowserControl_LoadCompleted(object sender, NavigationEventArgs e)
{
string s = webBrowserControl.SaveToString();
}
How do I loop through this result string to find elements like
<div class="result-wrapper">
I tried to convert this string to an XmlDocument but I get an error.
Please help me... thanks
You should not use an XML parser to parse HTML, because HTML is not necessarily well-formed XML. You can use the HTML Agility Pack to parse HTML instead; below is a link on how to use it:
HTML Agility Pack - Windows Phone 8
Hope this helps.
An XML parser will throw an exception when the input is not a well-formed XML document: every opening tag must have a matching closing tag. Check your HTML document with an online XML validator and then proceed.
If you are only going to parse a few tags, identify the substring in your HTML document using string.IndexOf() and load just that substring into your XmlDocument.
Otherwise, you have to do it manually or by using the HTML Agility Pack. But the HTML Agility Pack needs some libraries from Silverlight 4.0, which is not recommended by Microsoft.
So doing it manually is my choice.

Skipping DTD validation in DOM4J

I have an XML file with the following doctype:
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
I have put JATS-journalpublishing1.dtd in the project root, but it still fails with org.dom4j.DocumentException. My code looks like this:
SAXReader dom = new SAXReader();
dom.setValidation(false);
Document document = dom.read("some.xml");
How do I tell the project to ignore the validation?
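One thing to note: setValidation(false) only turns off validation; the SAX parser still tries to load the external DTD to resolve entities, which fails if the DTD is not where the parser looks for it. The usual fix is to disable the Xerces load-external-dtd feature. With dom4j that would be dom.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false) before calling read(). Here is a minimal sketch of the same idea using the JDK's built-in JAXP parser (the XML string and class name are invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class SkipDtd {
    // Parse XML that references a DTD we don't have, without fetching the DTD.
    static String rootTagOf(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Xerces feature: skip loading the external DTD entirely.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE article SYSTEM \"JATS-journalpublishing1.dtd\">"
                   + "<article><front/></article>";
        // Parses fine even though the DTD file is missing.
        System.out.println(rootTagOf(xml));
    }
}
```

With the feature left on, the same parse would throw a FileNotFoundException for the missing DTD.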

Display xml data into html page

I want to get data from an XML file and display it in HTML.
What is the best and easiest method to display XML data in an HTML page?
You should use XSLT for this job. XSLT is a language designed to transform documents from XML to XML. This is very useful, because XHTML is an XML language, which means you can convert XML to XHTML using XSLT.
XSLT can be used both server-side and client-side, but beware of the client-side solution: some browsers do not support it, and some only support older versions, which might lead to different results.
You can check out this tutorial: http://www.w3schools.com/xsl/
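As a minimal sketch (the element names here are invented for illustration): given an XML file whose root is books containing book elements, a stylesheet like this renders them as an XHTML list:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/books">
    <html>
      <body>
        <ul>
          <!-- one list item per book element -->
          <xsl:for-each select="book">
            <li><xsl:value-of select="title"/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Client-side, you attach it with an xml-stylesheet processing instruction in the XML file; server-side, any XSLT 1.0 processor can apply it.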
<?php
$xml = $your_xml_string;
try { //try to make it formated. DOMDocument class must be available.
$doc = new DOMDocument();
$doc->loadXML($xml);
$doc->formatOutput = true;
$xml = $doc->saveXML();
} catch (Exception $exc) { }
// escape it so the raw XML is displayed in the HTML page
echo htmlspecialchars($xml);
?>
I would use jQuery. You can easily parse XML files and display the content wherever you wish.
Have a look here > jQuery.parseXML
and here > Example

How to get the HTML source of a webpage in Ruby [duplicate]

This question already has answers here:
Equivalent of cURL for Ruby?
(13 answers)
Closed 7 years ago.
In browsers such as Firefox or Safari, with a website open, I can right click the page, and select something like: "View Page Source" or "View Source." This shows the HTML source for the page.
In Ruby, is there a function (maybe a library) that allows me to store this HTML source as a variable? Something like this:
source = view_source("http://stackoverflow.com")
where source would be this text:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
etc
Use Net::HTTP:
require 'net/http'
source = Net::HTTP.get('stackoverflow.com', '/index.html')
require "open-uri"
source = open(url){ |f| f.read }
UPD: Ruby >= 1.9 allows this syntax:
require "open-uri"
source = open(url, &:read)
UPD: Ruby >= 3.0 requires this syntax:
require "open-uri"
source = URI(url).open(&:read)
require 'open-uri'
source = open(url).read
Short, simple, sweet.
Yes, like this:
require 'open-uri'
open('http://stackoverflow.com') do |file|
#use the source Eric
#e.g. file.each_line { |line| puts line }
end
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com/')
puts page.body
you can then do a lot of other cool stuff with mechanize as well.
You could use the builtin Net::HTTP:
>> require 'net/http'
>> Net::HTTP.get 'stackoverflow.com', '/'
Or one of the several libraries suggested in "Equivalent of cURL for Ruby?".
Another thing you might be interested in is Nokogiri. It is an HTML, XML, etc. parser that is very easy to use. Their front page has some example code that should get you started and help you see if it's right for what you need.
If you have cURL installed, you could simply:
url = 'http://stackoverflow.com'
html = `curl #{url}`
If you want to use pure Ruby, look at the Net::HTTP library:
require 'net/http'
stack = Net::HTTP.new 'stackoverflow.com'
# ...later...
page = '/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby'
html = stack.get(page).body

make my file readable as either Perl or HTML

In the spirit of the "Perl Preamble" where a script works properly whether executed by a shell script interpreter or the Perl interpreter...
I have a Perl script which contains an embedded HTML document (as a "heredoc"), i.e.:
#!/usr/bin/perl
... some Perl code ...
my $html = <<'END' ;
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
... more HTML ...
</HTML>
END
... Perl code that processes $html ...
I would like to be able to work on the HTML that's inside the Perl script and check it out using a web browser, and only run the script when the HTML is the way I want. To accomplish this, I need the file to be openable both as an HTML file and as a Perl script.
I have tried various tricks with Perl comments and HTML comments but can't get it quite perfect. The file as a whole doesn't have to be "strictly legal" HTML (although the embedded document should be)... just displayable in a browser with no (or minimal) Perl garbage visible.
EDIT: Solved! See my own answer
Read it and weep, Mr. @Axeman... I now present to you the empty set:
</dev/fd/0 eval 'exec perl -x -S $0 ${1+"$@"}' #> <!--
#!perl
... some Perl code ...
my $html = << '<!-- END' ; # -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
... more HTML ...
</HTML>
<!-- END
... Perl code that processes $html ...
# -->
This sounds like a path to pain. Consider storing the HTML in a separate file and reading it in within the script.
Maybe this is a job for Markup::Perl:
# don't write this...
print "Content-type: text/html;\n\n";
print "<html>\n<body>\n";
print "<p>\nYour \"lucky number\" is\n";
print "<i>", int rand 10, "</i>\n</p>\n";
print "</body>\n</html>\n";
# write this instead...
use Markup::Perl;
<html><body><p>
Your "lucky number" is
<i><perl> print int rand 10 </perl></i>
</p></body></html>
You could also drop the use Markup::Perl line and run your script like
perl -MMarkup::Perl my_page_with_embedded_perl.html
Then the page should render pretty well.
Sounds to me like you want a templating solution, such as Template::Toolkit or HTML::Template. Embedding HTML in your code or embedding code in your HTML is a recipe for pain.
Have you considered putting Perl inside of HTML?
Like ASP4 does?
It's a lot easier that way - trust me ;-)