Converting HTML to RDF - html

I'm looking for a general-purpose API, web service, or tool that can convert a given HTML page to an RDF graph that is as specific as possible (most probably using a backbone ontology and/or mapper).

Have you tried GRDDL?
GRDDL is a technique for obtaining RDF data from XML documents and in particular XHTML pages.

I used XQuery to extract the data from a given set of web pages. I had to write custom queries for those pages. I think this is the most straightforward approach for a specific set of HTML files. However, it obviously does not work in the general case: a different set of web pages needs its own custom queries.

I used jsoup to scrape data from HTML. It queries the HTML DOM with jQuery-style selectors, which I was already familiar with, so it was a really simple tool for me to use. I also found it quite robust, but I only needed it to scrape 3 data sources, so I don't have much experience with this tool yet.
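To make the scrape-then-map idea from the answers above concrete, here is a minimal sketch in Python rather than XQuery or jsoup. It uses only the standard library's `html.parser`, and everything about the mapping is an assumption made for illustration: the page URI `http://example.org/page`, the choice of the Dublin Core terms `title` and `references` as predicates, and the decision to look only at `<title>` and `<a href>`. As with the custom queries described above, a different set of pages would need a different mapping.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed vocabulary for this sketch: Dublin Core terms.
DCT = "http://purl.org/dc/terms/"


class PageTripleExtractor(HTMLParser):
    """Collects (predicate, object, object_is_iri) tuples about one page."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.triples = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URI so the object is an IRI.
                target = urljoin(self.page_uri, href)
                self.triples.append((DCT + "references", target, True))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.triples.append((DCT + "title", data.strip(), False))


def html_to_ntriples(html, page_uri):
    """Return the extracted triples serialized as N-Triples lines."""
    parser = PageTripleExtractor(page_uri)
    parser.feed(html)
    parser.close()
    lines = []
    for predicate, obj, is_iri in parser.triples:
        if is_iri:
            rendered = f"<{obj}>"
        else:
            # Escape the characters that N-Triples string literals require.
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            rendered = f'"{escaped}"'
        lines.append(f"<{page_uri}> <{predicate}> {rendered} .")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo</title></head>"
        '<body><a href="/about">About</a></body></html>'
    )
    print(html_to_ntriples(sample, "http://example.org/page"))
```

The output is plain N-Triples, so it can be loaded into whatever RDF store or library is in use. A mapping to a richer ontology would replace the two hard-coded predicates.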

Related

Web comic aggregator RSS feed general questions?

I am trying to create a web comic aggregation website using HTML 5, CSS 3, and JavaScript. I would like users to be able to view comics of different dates from different websites all in one place. After some research, it seems like I'm probably going to need to use an RSS feed to accomplish this. However, I don't fully understand the capabilities and usage of an RSS feed.
First, would it be possible to pull images from comic websites in an automated and orderly fashion using an RSS feed? Or would I need to use something else? Or would it not be possible at all? If it is possible with an RSS feed, I'm also somewhat confused about the general implementation. Would the code for the feed be in HTML, or JavaScript, or both? Would I need to use existing libraries and/or APIs? Is there existing code with a similar enough function that I could use as a starting point?
Thanks
You are heading in the right direction - RSS is a standard format used to notify users/readers of newly published content.
I'm sure you've searched for it already, but its Wikipedia page is quite informative. Basically, it is a standardised XML-based format that allows material to be distributed and read uniformly in an automated fashion.
In the same way there are other formats, such as Atom.
So, for your purpose the main thing to understand is that you want to READ RSS feeds, rather than write one (although you could make one as well, combining the comics you've found). For example, at the bottom of xkcd you can see two links - one for an RSS feed and another for an Atom feed. You need to find websites like that, which publish RSS/Atom feeds of comic strips, and write your site to read their feeds and update itself with the new content. You may even be able to automate the way your site links to feeds by finding or creating a feed of comic feeds (your site would look up this feed, which would contain links to other feeds that are all appropriate for you).
You could also put up a backend on a server that fetches the feeds and updates a database, from which the front-end would fetch the content through a single access point, but let's stick with the technologies you've mentioned - a client-only website for now.
To read and parse the feeds you can look at the answer here, which recommends jFeed, a plugin for jQuery (jQuery is a very popular JavaScript library, if you don't know it).
I'm pretty sure that answers your questions, but let's address them again, breaking them down and going one by one:
would it be possible to pull images from comic websites in an automated and orderly fashion using an RSS feed?
Yes! As you can see in the feed of xkcd I've linked above, it is both possible and widely used to pull/distribute images using RSS (and Atom) feeds.
would I need to use something else?
You can use Atom, which is just a different standard but much the same idea (it is also XML-based, and you can still use jFeed).
would it not be possible at all?
It is possible. Do not worry. Stay calm and code away.
If it is possible with an RSS feed, I'm also confused somewhat about the general implementation. Would the code for the feed be in HTML, or JavaScript, or both?
Do not confuse the feed's code with yours. Your code should READ the feed, not be it. The feed itself, as explained above, is written in a standard XML format called RSS (or Atom if you go with that). That is what your code reads. For your code, see the next question/answer.
Would I need to use existing libraries and/or APIs? Is there existing code with a similar enough function that I could use as a starting point?
As mentioned above - you can use jQuery and the jFeed plugin for it.
Hope that helps and is not confusing.
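As a rough illustration of what "reading the feed" involves, here is a minimal sketch of the backend variant mentioned above, written in Python with only the standard library rather than with jQuery/jFeed. The sample feed, its URLs, and the assumption that each item's description carries the strip as an HTML `<img>` tag are all made up for the example. Real feeds differ in layout, so the image extraction step would need adjusting per site.

```python
import re
import xml.etree.ElementTree as ET

# Finds the first src="..." of an <img> tag inside an item's description HTML.
IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"')


def parse_comic_feed(rss_text):
    """Return one dict per <item> with its title, link, and first image URL (or None)."""
    root = ET.fromstring(rss_text)
    comics = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        match = IMG_SRC.search(description)
        comics.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "image": match.group(1) if match else None,
        })
    return comics


# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Comic</title>
    <item>
      <title>Strip 1</title>
      <link>http://example.org/comics/1</link>
      <description>&lt;img src="http://example.org/img/1.png" alt="first" /&gt;</description>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for comic in parse_comic_feed(SAMPLE_FEED):
        print(comic["title"], comic["image"])
```

In a real aggregator the feed text would be fetched from each comic's feed URL on a schedule, and the resulting records stored for the front-end to display.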

What else can HTML do besides determine page layout?

My friend and I were recently discussing HTML and web layout (he's just getting started with it) and we came upon an issue: is it possible to do anything with HTML besides determine page layout?
For example, addition
int x = 5 + 4;
is perfectly valid and easy to use in most languages (looking at you, Erlang). However, is it possible to somehow contort <html> to allow for similar functionality? In other words, can <html> be forced to be a more basic version of a scripting/interpreted language without any external help (javascript, etc.)? Why or why not?
Personally, until this conversation, I had never even considered the idea, but now it's got me intrigued and I need a definite answer. I figured it can't be possible because HTML is like XML, which is for data storage, not data manipulation.
HTML, as its name suggests, is a mark-up language for hypertext. In other words, it describes the elements of data that need to go on a web page.
If you need to do any calculations or other processing, you'll first need to decide WHERE you want it to happen. For example, if you want to do calculations in the browser itself, you should look at languages like JavaScript or Java. In some cases, software like Flash is also suitable through its scripting commands.
If you want the calculations to take place on a server using the data from the browser, you're looking at a server-side scripting language like PHP, ASP or JSP.
Take for example PHP... It's a powerful language with database capabilities. But you CANNOT expect to create a simple text box for user input using PHP as its role is on the server. So you shouldn't look at it like a restriction of PHP.
Likewise, HTML has a role, and that is to present data in the browser. Calculations should be done in a scripting language like JavaScript, and layouts are best done using CSS.

Store data from HTML to XML file

Hello, I'm trying to learn about XML. XML is a medium for storing data, while HTML is a medium for displaying data. How can I store data from HTML in XML?
I'd like to build a quiz maker that is built in HTML and stores its data in XML. Are there any tutorials/references for this?
thanks
XML is just a fancy way to store data for your application. It's a standard which means that you can easily export data from one application into another. If you are interested in this, take a look at this page: http://www.w3schools.com/xml/xml_parser.asp
You will need to use HTML and JavaScript to build a quiz. If you want, you can make your quiz load questions and answers from XML.
HTML is a specialized markup language that describes how a webpage is structured and rendered. It looks similar to XML, but only XHTML is guaranteed to be well-formed XML, and the two are very different things.
The question is very open ended, so it's hard to answer. One way is to post data from your html based website to your server and store it as xml.
However, it all depends on how you intend to use it.
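As one possible reading of "store it as XML" on the server side, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The element and attribute names (`quiz`, `question`, `text`, `choice`, `answer`) are invented for this example and are not any standard quiz format. The web form handling that would supply the data is left out.

```python
import xml.etree.ElementTree as ET


def quiz_to_xml(questions):
    """Serialize a list of {'text', 'choices', 'answer'} dicts into an XML string."""
    quiz = ET.Element("quiz")
    for q in questions:
        # The index of the correct choice is stored as an attribute.
        question = ET.SubElement(quiz, "question", answer=str(q["answer"]))
        ET.SubElement(question, "text").text = q["text"]
        for choice in q["choices"]:
            ET.SubElement(question, "choice").text = choice
    return ET.tostring(quiz, encoding="unicode")


def quiz_from_xml(xml_text):
    """Parse the XML produced by quiz_to_xml back into a list of dicts."""
    quiz = ET.fromstring(xml_text)
    return [
        {
            "text": question.findtext("text"),
            "choices": [choice.text for choice in question.findall("choice")],
            "answer": int(question.get("answer")),
        }
        for question in quiz.findall("question")
    ]


if __name__ == "__main__":
    questions = [{"text": "What is 2 + 2?", "choices": ["3", "4", "5"], "answer": 1}]
    xml_text = quiz_to_xml(questions)
    print(xml_text)
    print(quiz_from_xml(xml_text))
```

The same XML file could then be loaded by the quiz page itself, which is the direction the next answer covers.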
I assume you mean "How can I load data stored in XML into an HTML website?". The simplest answer I can think of right now would be to use jQuery/JavaScript.
http://think2loud.com/224-reading-xml-with-jquery/
https://stackoverflow.com/questions/10811511/jquery-how-to-get-xml-data
https://stackoverflow.com/questions/16113188/convert-xml-to-html-using-jquery-javascript

How do I create some HTML help pages, with the same content at the top and bottom, without php or ASP etc?

I want to create some html help pages, separate html pages.
However, I want to have the same content on the top and bottom of the pages.
In the past I've used PHP or ASP, with header and footer files.
I've then had to do view source and save these pages to get what I want.
I just wondered if there is an easier way to do this?
EDIT:
The pages are for use with software that uses a web object, not a normal browser, so there won't be a web server.
If your web server supports it, you could use server-side includes.
You could use frames, but it's not necessarily advisable (for one, it breaks navigation).
You could use XML files with an XSLT stylesheet to turn them into HTML documents that share similar elements.
You could use PHP or another server-side language to generate the pages, and then use a recursive download tool (such as wget) to turn them into HTML.
EDIT: you're basically asking whether the "standard-ish" subset of HTML supported by your component of choice provides a way of including data from a common file, just so you won't have to include the data in every HTML document.
The answer hovers somewhere between "no way" and "maybe your component has a few tricks to do that".
The sane thing to do here would be to have a tool generate the HTML documents from a common template. Could be XML + XSLT, PHP/ASP/whatever, or a fully-fledged CMS (this actually helps let non-technical users write the document contents).
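A minimal sketch of the "generate the HTML documents from a common template" approach, in Python with only the standard library. The header and footer text, the page names, and the output directory are placeholders invented for the example. A real setup might well use XSLT or a templating engine instead of `string.Template`.

```python
from pathlib import Path
from string import Template

# Shared page skeleton: the header and footer are written once, here.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head><title>$title</title></head>
<body>
<div id="header">Hypothetical help header</div>
$body
<div id="footer">Hypothetical help footer</div>
</body>
</html>
""")


def render_page(title, body):
    """Fill the shared template with one page's title and body HTML."""
    return PAGE_TEMPLATE.substitute(title=title, body=body)


def build_site(pages, out_dir):
    """Write one static .html file per entry in `pages` (name -> (title, body))."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, (title, body) in pages.items():
        path = out / f"{name}.html"
        path.write_text(render_page(title, body), encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    pages = {
        "install": ("Installing", "<p>How to install.</p>"),
        "usage": ("Usage", "<p>How to use it.</p>"),
    }
    for path in build_site(pages, "help_out"):
        print("wrote", path)
```

Because the output is plain static HTML, it needs no web server and should work in an embedded web component. Changing the header or footer means editing the template and regenerating.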
It's awful, but you could include a JS file that uses a bunch of document.write("...") to include common elements. Not SEO friendly.

Is XSLT worth investing time in and are there any actual alternatives?

I realize there have been a few other questions on this topic, and the general consensus is to use your language of choice to manipulate the XML. However, this solution does not quite fit my circumstances.
Firstly, the scope of the project:
We want to develop platform-independent e-learning. Currently it's a bunch of HTML pages, but as they grow and develop they become hard to maintain.
We already have about 30 modules, with 10-30 HTML pages each, and this is growing all the time.
The idea:
Have an XML file (or files) plus a schema per e-learning module, then produce some XSLT files that process the XML into the e-learning modules: XML to HTML via XSLT.
Why:
We would like the flexibility to be able to easily reformat the content.
I realize CSS is a viable alternative here, especially for visually altering the look and feel, but we may need a little more power than that and go as far as restructuring the pages.
If we decide to alter the pages' layout or functionality in any way, I'm guessing altering the "shared" XSLT files would be easier than updating the HTML files.
Depending on some "parameters" we could output drastically different page layouts/structures, above and beyond what CSS can do.
Can XSLT take QueryString parameters? Not sure..
Now, all this has to be platform independent and able to run "offline", i.e. without a server powering the HTML, so server-side technologies (C#, PHP) are out of the question.
Negatives I've read so far for XSLT:
Overhead? Not exactly sure why... is it the compute power needed to convert to HTML?
Difficult to learn
Better alternatives
Now, what I would like to know exactly is:
Are there actually any viable alternatives for this "offline"?
Am I going about it in the correct manner?
Do you guys have any advice or alternatives?
EDIT:
With or without XSL, CSS and jQuery will be a very prominent part of the solution we develop.
General tidy up (sloppy engrish!)
Using an XSLT scheme for this is legitimate. XSLT is powerful if you develop the expertise.
Overhead: Yes, for large documents, a transform can take some seconds. Running a transformation on a large document many times a minute can be a bad strategy. That won't be a big problem for you, since you won't be doing these transforms on demand, just when you want to revise.
Difficult to learn. You can be productive with XSLT pretty soon, but beware: just when it seems XSLT is getting easy, you'll be surprised by it getting tricky all of a sudden! What you think would be difficult can be easy, and vice versa. You might have to import or create some templates just to do some simple date formatting, for example. It's all doable though. Don't be afraid to learn how to do "templates".
Better alternatives. Yes, there are better alternatives, but they are platform specific. For example, I'm in .NET land, and I've dropped XSLTs in favor of manipulating our new XElements and such, and VB.NET embedded XML is very powerful and easy. But XSLT is still great when you want to avoid becoming dependent on a particular platform.
You're still going to use CSS as part of your strategy, right? Changing an XSLT to output styling consistently is better than doing it in 30 modules by hand, but a well-planned CSS stylesheet can still help simplify things (increase maintainability and flexibility).
In summary: for organizing the layout and revision of static HTML pages in a platform-independent way, for flexible distribution: yes, you have a good strategy, from what I can see. And the expertise you develop in XSLT will be useful in the future, too. And after mastering XSLT, you'll really understand XML, which will be helpful forever.
XSLT is an ideal tool to use for generating HTML from XML documents in the circumstances you've described. The common complaint about XSLT's processing overhead - that it requires the entire source XML document to be loaded into memory - is really not relevant if you're using XSLT to generate static HTML pages, unless maybe you're generating hundreds of thousands of them.
(And in fact that complaint is really only relevant in cases where the source XML document is large. If you've built an architecture around dynamically generating HTML from large XML documents, choosing XSLT as your technology may be a mistake, but it is not the big mistake.)
You should of course also use CSS.
Separate your data from your presentation.
Offload presentation rendering to the browsers, use CSS and CSS "enhancers" like SASS, Less, etc.
Generate strict XHTML - can format with CSS, can parse with XML parsers, etc
Use jQuery or the like for interactivity
XSLT is quite heavyweight and won't scale well, whereas XHTML+CSS+jQuery is very well understood and lots of tools exist.
If you already know C# or VB.NET, consider using LINQ to XML. The code will be longer, but it may be less painful to write and maintain for a non-XSLT expert.
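To show roughly what the "general-purpose language instead of XSLT" route looks like, here is a minimal sketch with Python standing in for C#/LINQ to XML, which is a substitution made purely for illustration. The module format (`module`, `page`, `para` elements with `title` attributes) is invented for the example and is not the asker's actual schema.

```python
import xml.etree.ElementTree as ET
from html import escape


def module_to_html(module_xml):
    """Turn one hypothetical e-learning module document into an HTML fragment."""
    module = ET.fromstring(module_xml)
    parts = [f"<h1>{escape(module.get('title', ''))}</h1>"]
    for page in module.findall("page"):
        parts.append(f"<h2>{escape(page.get('title', ''))}</h2>")
        for para in page.findall("para"):
            # escape() keeps characters such as & and < valid in the HTML output.
            parts.append(f"<p>{escape(para.text or '')}</p>")
    return "\n".join(parts)


SAMPLE_MODULE = """<module title="Safety Basics">
  <page title="Introduction">
    <para>Read this first.</para>
    <para>Gloves &amp; goggles are required.</para>
  </page>
</module>"""

if __name__ == "__main__":
    print(module_to_html(SAMPLE_MODULE))
```

The trade-off matches the one described above: the layout logic lives in ordinary code that is easy for a non-XSLT expert to follow, at the cost of tying the generation step to one language and runtime.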
It all comes down to how many XML transforms you will need. If it is just 1 or 2, then I would not spend the time learning XSLT.