Rails, HTML to JSON?

Given a static HTML page, is there an automated way to generate JSON?
For a large website that contains a lot of static HTML, I want to generate JSON for RSS feeds and search functionality, so I am looking for a way to convert HTML to JSON.
I could obviously write JSON templates for every page and every language, but that would be unmaintainable: it would double an 800-page website to 1600 pages, and that is not an option.
One approach I thought of is to write a bot that loops through the routes to index the pages and saves the data to a database, which would give me all the choices I could wish for in search backends: Solr, Elasticsearch, Thinking Sphinx, and so on.
I could use Capybara to help with this, visiting each path in a rake task run as a background job and extracting the text to save to a database, but I am not sure how that would work in a production environment. It also seems like such a common requirement that someone must have solved it already, yet for the life of me I can't find anything.
I would be far happier (I think) if I could find a way to convert HTML text content to JSON.
Any ideas? Has this already been done? Are there any gems that might help? Or is there built-in functionality I have not thought of, maybe a way to get HTML into a hash that could then be converted to JSON? Whatever the approach, it needs to be automated. I'm just stuck on the best approach.

Basically, HTML looks a lot like XML, just with fixed tag meanings, so if parsing leaves you with a tree of HTML tags nested inside each other, you can reuse an XML-to-JSON conversion.
Your question then effectively becomes an XML-to-JSON question. The catch is HTML's void tags, which have no closing tag: you would need to find all of those and close each one before trying to read the document as a hash from XML. By the way, for parsing text data in general you should look at regular expressions.
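If this is a Rails app anyway, ActiveSupport's Hash.from_xml gives you the XML-to-hash step for free. A minimal sketch of the whole idea, assuming Nokogiri is used to re-serialize the page as well-formed XHTML first (which takes care of closing the void tags):

require "nokogiri"
require "active_support/core_ext/hash/conversions"
require "json"

html = "<html><body><h1>Title</h1><p>Some text<br>more text</p></body></html>"

# Parse tolerantly as HTML, then emit XHTML so every tag is closed
# and the markup is valid XML.
xhtml = Nokogiri::HTML(html).to_xhtml

# Convert the now well-formed XML into a nested Hash, then into JSON.
puts Hash.from_xml(xhtml).to_json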

I chose to go with a Nokogiri solution in the end and wrote a parser to meet my needs.
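A rough sketch of what such a Nokogiri parser might look like (the selectors and field names here are illustrative assumptions, not the actual code):

require "nokogiri"
require "json"

def page_to_json(html, path)
  doc = Nokogiri::HTML(html)
  {
    path:  path,
    title: doc.at_css("title")&.text.to_s.strip,
    # Collapse runs of whitespace so the indexed text stays compact.
    body:  doc.at_css("body")&.text.to_s.gsub(/\s+/, " ").strip
  }.to_json
end

puts page_to_json(File.read("about.html"), "/about")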


Is there anything wrong with the YAML format that keeps it from joining the web standards?

Well, I think YAML is really fantastic...
It's beautiful and easy to read, with clever syntax... compared to any other data serialization format.
As a superset of JSON, we could say it's more elaborate; it reads as an evolution of the language.
But I see some different opinions out there, such as:
YAML is dead,
don't use YAML, and so on...
I simply can't understand what this is based on, because it seems so nice :)
If we take a few successful examples from the web, such as Ruby on Rails, we know they use YAML for simple configuration, but one thing that makes me curious is why YAML is not among the most-used formats on the web, like XML and JSON.
Take Twitter, for example... why not offer the data in YAML format from the API as well?
Is there something wrong with doing that?
We can see the evolution of NoSQL databases like CouchDB and MongoDB, all JSON-based; there's even a great project called jsondb which looks very lightweight and can definitely do the job.
But when writing data structures in JSON, I really can't understand why YAML is not used instead.
So one of my concerns is: is there something wrong with YAML?
People can say it's complex, but if you intend to use only the same features you would get in JSON, it's definitely not. You will get a more beautiful file for sure, with no hassle. It would indeed be more complex if you decided to use more of its features, but that's how things are; at least you have the option to use them if you want to.
Being able to choose whether or not to use double quotes for strings is fantastic; it makes everything cleaner and easier to read... well, you see my point :)
So my question would be: why is YAML not widely used in place of JSON?
Why does it seem that it won't be used for transferring data structures within the online community?
All I can see is people using it for simple configuration files and nothing else...
Please bear with me, since I might be completely wrong; very big YAML-based projects might be out there that my ignorance of the subject has kept me from knowing about :)
If there is any big project based on YAML out there, I would be very happy to know about it.
Thanks in advance
It's not that there's something wrong with YAML — it's just that it doesn't offer any compelling benefits in many cases. YAML is basically a superset of JSON. For most purposes, JSON is quite sufficient — people wouldn't be using advanced YAML features even if they had a full YAML parser — and its close ties to JavaScript make it fit in well with the technologies that Web developers are using anyway.
TLDR: People are already using as much YAML as they need. In most cases, that's JSON.
YAML uses more data than non-prettified JSON. It's great for files that humans might want to edit themselves, but when all you're doing is passing data around, you're wasting bandwidth if you use YAML.
If you need an explanation: each space, in UTF-16, is two bytes. YAML uses spaces for indentation and newline characters to express nesting.
Take this example:
foo:
    bar:
        - foo
        - bar
This requires 41 characters (including newline characters). The equivalent JSON would be only 29 characters:
{"foo":{"bar":["foo","bar"]}}
Then just imagine what happens if you URL-encode the YAML. It becomes 95 characters:
foo%3A%0A%20%20%20%20bar%3A%0A%20%20%20%20%20%20%20%20-%20foo%0A%20%20%20%20%20%20%20%20-%20bar
Meanwhile, the JSON becomes just 63 characters:
%7B%22foo%22%3A%7B%22bar%22%3A%5B%22foo%22%2C%22bar%22%5D%7D%7D
So in the example above, URL-encoding more than doubles YAML's overhead relative to JSON: 12 extra characters raw, 32 extra once encoded. And you can imagine that the longer your YAML file gets, the more this difference will grow.
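If you want to check that kind of overhead yourself, a quick Ruby sketch (exact sizes will vary with the YAML emitter's indentation style and document markers):

require "yaml"
require "json"
require "cgi"

data = { "foo" => { "bar" => ["foo", "bar"] } }

yaml = data.to_yaml   # Psych adds a leading "---" document marker
json = data.to_json

puts yaml.bytesize              # YAML payload
puts json.bytesize              # => 29
puts CGI.escape(yaml).bytesize  # URL-encoded YAML
puts CGI.escape(json).bytesize  # URL-encoded JSON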
Oh, and one other reason not to use YAML: stackoverflow.com does not support YAML syntax highlighting! (Of course, I would argue that YAML is so beautiful it doesn't need syntax highlighting. That's kind of the point of YAML, I think.)
In Ruby, many people argue that configuration should be Ruby rather than YAML. This saves the parsing stage, means you don't have to learn a new syntax, and keeps you from ending up with ERB tags everywhere when you generate YAML content dynamically (as with Rails fixtures).
Personally I have to agree, and can't see what YAML would offer to network transfers that would make it a worthwhile consideration over JSON.
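To make that concrete, here's the sort of thing the configuration-in-Ruby camp means; the file name and keys below are invented for the example. Instead of YAML plus ERB interpolation, values are computed in plain Ruby:

# config/settings.rb -- hypothetical example
SETTINGS = {
  host:     ENV.fetch("APP_HOST", "localhost"),
  pool:     ENV.fetch("DB_POOL", "5").to_i,
  log_path: File.join(Dir.pwd, "log", "app.log")
}.freeze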
YAML has a number of problems; there is a good article, YAML: probably not so great after all, on the subject.
Short summary (in addition to problems already listed in other answers):
Unreadable except for simple and short things
Insecure by default
Has portability issues
Very complex, with a number of surprising behaviors (one is sketched below)
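To make the surprising-behaviors point concrete, here is the classic "Norway problem": under the YAML 1.1 rules that Ruby's Psych follows, several unquoted scalars silently change type.

require "yaml"

YAML.load("country: no")    # => {"country"=>false}, not the string "no"
YAML.load("version: 3.10")  # => {"version"=>3.1}, the trailing zero is gone
YAML.load("country: 'no'")  # => {"country"=>"no"}; quoting avoids it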
I considered using YAML a few times and never did. The reason always had to do with the significant whitespace used for indentation. While I personally love it, even to me it sounded like asking for trouble, because:
Someone will inevitably make a mistake, not expecting that changing whitespace will break the file. Sometimes someone who has no idea about the language or format has to go into the file to change one number or string.
You can't guarantee that everybody everywhere will have their diff, merge, and source-control software configured properly to catch whitespace or empty-line differences.

Storing code snippets in a database

I want to make a code-snippet database web application. Would the best way to store the snippets in the database be to HTML-encode everything, to prevent XSS when displaying them on the web page?
Thanks for the help!
The database has nothing to do with this; you simply need to escape the snippets when they are rendered as HTML.
At minimum, you need to encode all & characters as &amp; and all < characters as &lt;.
However, your server-side language already has a built-in HTML encoding function; you should use it instead of re-inventing the wheel. For more details, please tell us what language your server-side code is in.
Based on your previous questions, I assume you're using PHP.
If so, you're looking for the htmlspecialchars or htmlentities functions.
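For illustration, the same escape-on-output idea in Ruby (CGI.escapeHTML is the counterpart of PHP's htmlspecialchars): store the snippet verbatim and escape it only when rendering.

require "cgi"

snippet = %q{<script>alert("xss")</script>}
puts CGI.escapeHTML(snippet)
# => &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;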
You either have to escape it when you store it or escape it when you display it. It's probably better to do it on display, so that if you need to edit the snippet later you don't have to decode it and then re-encode it.
Also, make sure you escape it properly when you store it in the database; otherwise you're leaving yourself open to SQL injection. Parameterized statements are the best method here, and with them you shouldn't have to change the raw data at all.
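A hedged sketch of that advice using Ruby's sqlite3 gem; the table and column names are made up for the example.

require "sqlite3"

db = SQLite3::Database.new("snippets.db")
db.execute("CREATE TABLE IF NOT EXISTS snippets (id INTEGER PRIMARY KEY, body TEXT)")

# The "?" placeholder keeps the raw snippet out of the SQL string,
# so the data itself never needs escaping.
db.execute("INSERT INTO snippets (body) VALUES (?)", [%q{<script>alert("xss")</script>}])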
The best thing to do is to not store it in the database at all. I have seen people store stored procedures in a database as rows. Just because you can doesn't mean you should.
It doesn't matter how you store it; what matters is how you render it in the HTML representation. I'd guess you'll need to do some sort of sanitization before rendering the bytes. Another option is to convert every character to an HTML entity, which might suffice to prevent any code or tags from actually being interpreted.
As an example, view the source of a Stack Overflow page with some example code, and see how they're representing the code in the HTML.

Simple HTML interface to XSD?

I'm writing an app that, at its heart, uses a hierarchical tree of nodes. In XML it looks like this:
<node>
  <name>Node1</name>
  <Attribute1>Something</Attribute1>
  <Attribute2>SomethingElse</Attribute2>
  <child>Node2</child>
  <child>Node4</child>
  <child>Node7</child>
</node>
And so on. (All child elements must refer to an existing node, though the node in question doesn't have to precede the first reference to it.)
For a simple structure like this, is there a simple tool to generate an HTML page that will allow a user to enter nodes and dynamically update a server-side XML file?
I'm basically writing a tool that will use such a file, but the people whose job it is to create the file aren't especially techno-literate, so creating the XML by hand is a no-no.
I could hand-crank one fairly quickly, but if I can get a tool to do it, even better (especially as the format may change in the future)...
Xopus is a browser based XML editor that you could use for this. It is designed for the non techno-literate people out there.
Disclaimer: I work at Xopus.
I am pretty sure there is nothing that will do this for you automagically, and you'll need to write that bit yourself.
Your options are to create a web-based interface, using an HTML POST and writing the output to a file or database (then reloading it on submission), or something more advanced with JavaScript (e.g., doing it dynamically with AJAX).
You can't do it in HTML alone: either way you need something to output the existing data and accept HTTP POST requests, but you don't mention what language or platform you are using to write this. Being clear on that will help people suggest appropriate solutions.
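Since the question doesn't name a platform, here is one possible shape of the web-based option, sketched in Ruby with Sinatra and Nokogiri; the file name, form fields, and element names are assumptions based on the XML shown above.

require "sinatra"
require "nokogiri"

post "/nodes" do
  doc  = Nokogiri::XML(File.read("nodes.xml"))
  node = Nokogiri::XML::Node.new("node", doc)

  name = Nokogiri::XML::Node.new("name", doc)
  name.content = params[:name]   # content= escapes markup for us
  node.add_child(name)

  # One comma-separated form field listing the child node names.
  params[:children].to_s.split(",").each do |child_name|
    child = Nokogiri::XML::Node.new("child", doc)
    child.content = child_name.strip
    node.add_child(child)
  end

  doc.root.add_child(node)
  File.write("nodes.xml", doc.to_xml)
  redirect "/"
end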
You might want to rethink the XML structure... elements called "Attribute{anything}" are ill-advised (as are elements named in the convention foo1, foo2, etc.). The whole <child>Node2</child> thing doesn't seem like a good way to go either. I suggest posting an actual example of the XML in question.
From what you've said, it sounds like there is no specific need for it to be in XML at all. Not that XML is bad (it isn't), but if putting it in an SQL database is a valid option and you have one anyway (e.g., you're using a LAMP stack), then that's something to consider.
Would an XML editor like http://www.oxygenxml.com/ suffice? I don't know of any HTML web-based ones, unless you write one yourself and use AJAX to send the data. At least an XML editor can generate a form that you can use to create and edit XML documents. Microsoft does InfoPath as well, which is designed more for questionnaires but might do what you need, if the non-technical people would prefer something more Office-like.

How can I extract addresses and phone numbers from HTML?

Is there a library that specializes in parsing such data?
You could use something like Google Maps. Geocode the address and, if successful, Google's API will return an XML representation of the address with all of the elements separated (and corrected or completed).
EDIT:
I'm being voted down and not sure why. Parsing addresses can be a little difficult. Here's an example of using Google to do this:
http://blog.nerdburn.com/entries/code/how-to-parse-google-maps-returned-address-data-a-simple-jquery-plugin
I'm not saying this is the only way or necessarily the best way. Just a way to parse addresses on a web site.
There are two parts to this: extracting the complete address from the page, and parsing that address into something you can use (storing the various parts in a DB, for example).
For the first part you will need a heuristic, most likely country-dependent: for US addresses, [A-Z][A-Z],?\s*\d\d\d\d\d should give you the end of an address, provided the two letters turn out to be a state. Finding the beginning of the string is left as an exercise.
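Here is that heuristic in Ruby form, matching only the end-of-address anchor described above; the sample text is invented.

text = "Visit us at 123 Main Street, Springfield, IL 62704 any weekday."

# Two capital letters (hopefully a state), optional comma, then a ZIP code.
if text =~ /([A-Z]{2}),?\s*(\d{5})/
  puts "state=#{$1} zip=#{$2}"   # => state=IL zip=62704
end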
The second part can be done either through a call to Google Maps or, as usual in Perl, using a CPAN module: Lingua::EN::AddressParse (test it on your data to see if it works well enough for you).
In any case this is a difficult task, and you will most likely never get it 100% right, so plan for manually checking the addresses before using them.
You don't need regular expressions (yet) or a general parser like pyparsing (at all). Look at something like Beautiful Soup, which will parse even bad HTML into something like a tree of tags. From there, you can look at the source of the page and find out which tags to drill down through to get to the data. Then, from Beautiful Soup's tree, you can search for those nodes using XPath (in recent versions) and loop directly over the tags you're interested in, getting to the actual data easily. From there, you can parse the data out with a quick regex or something. This will be more flexible and more future-proof, and also possibly less head-exploding, than trying to do it all in pure regular expressions.
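For a Ruby take on that same drill-down-then-regex approach (kept in Ruby to stay consistent with the other examples on this page), here is a sketch with Nokogiri; the selector and phone pattern are assumptions you would adapt to the real markup.

require "nokogiri"

doc = Nokogiri::HTML(File.read("contact.html"))

# Drill down to the tags that hold the data, then regex the text.
doc.css("div.contact").each do |node|
  phone = node.text[/\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/]
  puts phone if phone
end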

How can I extract HTML content efficiently with Perl?

I am writing a crawler in Perl, which has to extract contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found out it does not use any connection cache for LWP::UserAgent.
My last resort is to grab HTML::Extract's source code and modify it to use a cache, but I really want to avoid that if I can. Does anyone know any other module that can perform the same job better? I basically just need to grab all the text in the <body> element with the HTML tags removed.
I use pQuery for my web scraping. But I've also heard good things about Web::Scraper.
Both of these along with other modules have appeared in answers on SO for similar questions to yours:
How can I screen scrape with Perl?
How can I extract XML of a website and save it in a file using Perl's LWP?
How do I extract an HTML title with Perl?
Can you provide an example of parsing HTML with your favorite parser?
How do I extract content from an HTML file using Perl?
HTML::Extract's features look very basic and uninteresting. If the modules that draegfun mentioned don't interest you, you could do everything that HTML::Extract does using LWP::UserAgent and HTML::TreeBuilder yourself, without requiring very much code at all, and then you would be free to work in caching on your own terms.
I've been using Web::Scraper for my scraping needs. It's very nice indeed for extracting data, and because you can call ->scrape($html, $originating_uri) then it's very easy to cache the result you need as well.
Do you need to do this in real time? How does the inefficiency affect you? Are you doing the task serially, so that you have to extract one page before you move on to the next one? Why do you want to avoid a cache?
Can your crawler download the pages and pass them off to something else? Perhaps your crawler can even run in parallel, or in some distributed manner.