Objective
Scrape an HTML table from the Warframe wikia.
Background
I am trying to get the information from a table on the Warframe wikia, the Mods List table. To achieve this I read the HTML-parser on Node.js topic and concluded that using YQL was my best option.
Code
By using Google Chrome Dev Tools and two Chrome extensions, CSS and XPath checker and XPath Helper, I was able to pinpoint the exact location of the table I am looking for with the following XPath query:
//*[@id="mw-content-text"]/div[33]/div/div[1]/table/tbody
Now, Chrome says this is the correct path, and the plugins I am using suggest it as well.
Problem
The problem is that when I use YQL, the result in JSON is something utterly different from the table I am expecting. In fact, it returns a different table together with miscellaneous data.
I am baffled as to why this is happening. The wikia is a simple HTML page with little to no dynamic information whatsoever, so I really can't understand why I am getting erroneous results.
What could the problem be?
Unfortunately, YQL does not work properly with pages that are loaded over time, as is the case with the wikia.
So, even though the XPath is correct, when Yahoo makes its first (and only) request it receives incomplete HTML and never completes it.
To fix the issue, I decided instead to parse the HTML locally on my Node.js server using the npm request and cheerio packages.
The first package downloads the full page HTML, and the second parses it for the information I am looking for.
It is an effective solution that, instead of relying on a third-party tool, transfers all the work to my server.
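For reference, here is a rough sketch of that approach; the URL and the table selector are assumptions and would need to be pointed at the actual Mods List page:

var request = require('request');
var cheerio = require('cheerio');

// The URL and the table selector below are assumptions for illustration;
// point them at the real Mods List page and table.
request('http://warframe.wikia.com/wiki/Mods', function (error, response, html) {
  if (error || response.statusCode !== 200) {
    return console.error('Request failed:', error || response.statusCode);
  }

  var $ = cheerio.load(html);

  // Walk every row of the target table and collect the cell text.
  $('table.listtable tr').each(function () {
    var cells = $(this).find('td').map(function () {
      return $(this).text().trim();
    }).get();

    if (cells.length) {
      console.log(cells.join(' | '));
    }
  });
});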
Hope this helps someone in the future!
Related
I'm trying to automate a part of my daily work on a website I'm using (not developing). I have to crawl several lists on the website, dig into an item in the list to check some values, and do it again and again...
I'm used to developing in PowerShell and Python, not with a web browser. I have limited rights on the machine I'm working on. The only solution I can easily deploy is a Tampermonkey/Greasemonkey script on a portable Firefox. I'd like to use this to catch the JSON responses from the website (all of them), parse them for some values, and automate some kind of popup: "Hey, this item in the list is between 90 and 100!"
No code to show yet; that's what I'm looking for: a basic solution for JSON interception while I'm crawling.
I have a good knowledge of JSON parsing in Python; the hard part for me is catching the responses.
Thank you very much for any help you can give, and apologies for my average English.
I found out that Tampermonkey can intercept JSON on page load, but not afterwards. Closing the question.
Given a static HTML page, is there an automated way to generate JSON?
For a large website that contains a lot of static HTML, I want to generate JSON for RSS feeds and search functionality, and I am looking for a way to convert HTML to JSON.
I could obviously write JSON templates for every page and every language, but that would be unmaintainable: it would double an 800-page website to 1600 pages, and that is not an option.
One approach I thought of would be to write a bot that loops through the routes to index the pages and saves the data to a database, which would give me all the choices I could wish for in search backends, such as Solr, Elasticsearch, Thinking Sphinx, etc.
I could use Capybara to help with this by visiting each path and extracting the text to save to a database in a rake task as a background job, but I'm not sure how that would work in a production environment. It also seems that such a common requirement might already have been solved, but for the life of me I can't find anything.
I would be far happier (I think) if I could find a way to convert HTML text content to JSON.
Any ideas? Has this already been done? Are there any gems that might help? Or is there built-in functionality that I have not thought of, maybe a way to get HTML into a hash that could then be converted into JSON? Whatever the approach, it needs to be automated. I'm just stuck on the best approach.
Basically, HTML looks a lot like XML, but with fixed tag meanings, so you could use an XML-to-JSON conversion, since it all ends up as a tree of HTML tags embedded in each other.
And so your question becomes this question, except that you might run into problems with single tags that have no closing tag. You could find all of these and close each one before trying to get the XML as a hash. By the way, in general for parsing text data you should look at regular expressions.
I chose to go with a Nokogiri solution in the end and wrote a parser to meet my needs.
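For anyone who lands here later, a minimal sketch of what such a Nokogiri parser can look like; the file name and the selectors are placeholders rather than my actual code:

require 'nokogiri'
require 'json'

# The file name is a placeholder; in practice you would loop over your routes
# or static pages.
html = File.read('about.html')
doc  = Nokogiri::HTML(html)

title_node = doc.at_css('title')
record = {
  title:    title_node ? title_node.text : nil,
  headings: doc.css('h1, h2').map { |h| h.text.strip },
  body:     doc.css('p').map { |p| p.text.strip }.join("\n")
}

puts JSON.generate(record)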
I'm processing a variety of RSS feeds, which contain summaries, as well as the target page URL content, and trying to use a uniform transformation method.
XSLT was the first thing that occurred to me to try, as it would accomplish what I want, in a standard way, without a lot of fuss aside from adding new XSLT stylesheets to accommodate uniquely formatted sites and feed content.
Problem: XSLT libraries are considered "private" in iOS, and even linking statically against your own copy will get you rejected by the App Store analysis tools.
I've looked into the possibility of injecting the stylesheet and data into a UIWebView that isn't displayed, but this seems like a really roundabout and hackish way to get at the system's underlying XSLT processor in an "approved" fashion.
What alternative techniques/libraries exist that would let me do this in a standard fashion, i.e. without rolling my own?
I'm not sure I fully understand your requirements, but one possibility would be to use libxml (which is allowed in iOS) to parse the XML and, if necessary, manipulate the DOM. If you really need to do XML transformations this is going to be more effort than XSLT, but if you just need to extract data from the XML, that can be done fairly easily with XPath queries.
That said, I have read several people claiming they got XSLT working on iOS and had their apps approved in the App Store. In particular, I've seen this Stack Overflow answer claimed as a working solution by multiple people. And if that fails, another answer suggested building the libxslt library yourself with renamed symbols to bypass the App Store checks. I would only suggest that as a last resort, though.
You'll probably want to look into Hpple for something powerful but lightweight and native. See the getting-started tutorial here: http://www.raywenderlich.com/14172/how-to-parse-html-on-ios. Good luck!
I'm also going to recommend TFHpple, but I'll elaborate on the solution. I've explored an app that navigates a 3rd-party (well, I'm the 3rd party, they're the source, but that's semantics) website/data source, and there are some pitfalls. The biggest pitfall is obvious: if the data source DOM changes, you need to change your app and re-release. A creative way around this would be to publish/expose the expected DOM searches on a public server; that way the end user doesn't have to update their app every time the data source changes (as long as the change isn't radical).
For instance, if your expected DOM search in TFHpple is @"//figure[@class='figure']/a" and then a week from now the resource you're looking for in your data source is altered to @"//figure1[@class='figure1']/a", you've just opened yourself up to an App Store release... UNLESS you publish the expected DOM searches on a web server you control, in a data dictionary that your app can consume and serve out to the various DOM search points within your app. The only problem I foresee here is that if the data source adds or removes a data element you want to consume, you either have to release a build or handle the removal ahead of time (respectively).
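To make that concrete, here is a rough TFHpple sketch along those lines; the URL is a placeholder and the search string stands in for one fetched from the dictionary you control:

#import "TFHpple.h"

// xpathFromServer stands in for a search string fetched from the data
// dictionary you control; the URL is a placeholder.
NSString *xpathFromServer = @"//figure[@class='figure']/a";
NSData *pageData = [NSData dataWithContentsOfURL:[NSURL URLWithString:@"http://example.com/page"]];

TFHpple *parser = [TFHpple hppleWithHTMLData:pageData];
NSArray *nodes = [parser searchWithXPathQuery:xpathFromServer];

for (TFHppleElement *element in nodes) {
    NSLog(@"href: %@  text: %@", [element objectForKey:@"href"], [[element firstChild] content]);
}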
Lastly, if the data source DOM isn't well formed or consistent, you may be beating your head against a wall more often than not.
I have a client asking for a page similar to this - http://www.candlelighthomes.com/homedesigns.php - where you can use the search form to filter the type of results you see in the home plans. I've never done anything like that before and honestly am not sure where to start. Can anyone tell me what type of coding is used to do it: is it just JavaScript, or is it PHP? I've tried googling it but am not really sure how to phrase what I'm looking for. I also looked at the page source but couldn't tell entirely what was making it work.
Use PHP and SQL to fetch only the data matching the criteria from the database.
In the end your query should look something like this:
$sql = "SELECT * FROM properties WHERE price BETWEEN $minPrice AND $maxPrice";
I believe you'll want to look into Ajax, PHP, and JavaScript to solve this problem. You can use PHP to manipulate and format the data you want to display (PHP is a server-side scripting language). Ajax is used to send data between JavaScript and PHP scripts, so you can use that to grab your data. Finally, write JavaScript to put the data where it needs to be in your webpage.
I know that sounds complicated, but to accomplish this you'll need at least a basic understanding of how JavaScript, Ajax (search for Ajax with jQuery), and PHP work. I learned by using examples from W3Schools. Here are some helpful links to get you started, with a rough sketch after them:
http://www.w3schools.com/js/default.asp
http://www.w3schools.com/jquery/jquery_ajax_intro.asp
http://www.w3schools.com/php/default.asp
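And here is a very rough sketch of how the pieces fit together, assuming jQuery; the endpoint, form id, and field names are placeholders:

// The endpoint, form id, and field names are placeholders; search.php would
// run the SQL query above and echo the matching rows as JSON.
$('#search-form').on('submit', function (e) {
  e.preventDefault();
  $.ajax({
    url: 'search.php',
    type: 'GET',
    data: $(this).serialize(),   // e.g. minPrice=100000&maxPrice=250000
    dataType: 'json',
    success: function (homes) {
      var items = $.map(homes, function (home) {
        return '<li>' + home.name + ' - $' + home.price + '</li>';
      });
      $('#results').html(items.join(''));
    }
  });
});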
Is there a library that specializes in parsing such data?
You could use something like Google Maps. Geocode the address and, if successful, Google's API will return an XML representation of the address with all of the elements separated (and corrected or completed).
EDIT:
I'm being voted down and not sure why. Parsing addresses can be a little difficult. Here's an example of using Google to do this:
http://blog.nerdburn.com/entries/code/how-to-parse-google-maps-returned-address-data-a-simple-jquery-plugin
I'm not saying this is the only way or necessarily the best way. Just a way to parse addresses on a web site.
There are 2 parts to this: extracting the complete address from the page, and parsing that address into something you can use (storing the various parts in a DB, for example).
For the first part you will need a heuristic, most likely country-dependent: for US addresses, [A-Z][A-Z],?\s*\d\d\d\d\d should give you the end of an address, provided the 2 letters turn out to be a state. Finding the beginning of the string is left as an exercise.
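For example, a quick Perl sketch of that end-of-address heuristic (the sample text is made up):

use strict;
use warnings;

# The sample text is made up; in practice $text would be scraped page content.
my $text = "Our office: 123 Main St, Springfield, IL 62704. Stop by today!";

if ($text =~ /\b([A-Z][A-Z]),?\s*(\d\d\d\d\d)\b/) {
    my ($state, $zip) = ($1, $2);
    print "$state $zip\n";    # you still need to check that $state is a real state
}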
The second part can be done either through a call to Google Maps or, as usual in Perl, using a CPAN module: Lingua::EN::AddressParse (test it on your data to see if it works well enough for you).
In any case this is a difficult task, and you will most likely never get it 100% right, so plan for manually checking the addresses before using them.
You don't need regular expressions (yet) or a general parser like pyparsing (at all). Look at something like Beautiful Soup, which will parse even bad HTML into something like a tree of tags. From there, you can look at the source of the page and find out which tags to drill down through to get to the data. Then, from Beautiful Soup's tree, you can search for those nodes with its find methods or CSS selectors (in recent versions) and directly loop over the tags you're interested in, getting to the actual data easily. From there, you can parse the data with a quick regex or something. This will be more flexible and more future-proof, and also possibly less head-exploding, than trying to do it in pure regular expressions.
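A short sketch of that flow, assuming the bs4 package; the sample markup and tag names are invented for illustration:

import re
from bs4 import BeautifulSoup

# Invented sample markup; in practice the HTML would come from the page source.
html = """
<html><body>
  <table id="prices">
    <tr><td>Item A</td><td>$1,200</td></tr>
    <tr><td>Item B</td><td>$950</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drill down to the table we care about, then loop over its rows.
for row in soup.find("table", id="prices").find_all("tr"):
    name_cell, price_cell = row.find_all("td")
    # A quick regex pulls the numeric value out of text like "$1,200".
    price = int(re.sub(r"[^\d]", "", price_cell.get_text()))
    print(name_cell.get_text(strip=True), price)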