Scrape Data within HTML Tags in Perl

I'm writing a web scraper, and am a Perl novice. I'm using HTML::TreeBuilder to get the data I need, but I've run into a case I'm not sure how to handle. Here's some sample HTML:
<div class="anything" val="20" name="matchup">someUniqueData</div>
I want to extract the val attribute from this tag. I've been using findvalues() to do most of my work, but I don't know whether it can pull data from attributes inside tags. I've looked through the documentation without success. Is there a simple solution for this type of scrape?

You need (using HTML::TreeBuilder::XPath):
my ($val) = $tree->findvalues('//div[@class="anything"]/@val');
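For reference, here is a minimal, self-contained sketch of that approach; the HTML string is just the sample from the question, so swap in your own page content:
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup from the question; in practice this would be the fetched page.
my $html = '<div class="anything" val="20" name="matchup">someUniqueData</div>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findvalues() accepts attribute steps, so @val can be selected directly.
my ($val) = $tree->findvalues('//div[@class="anything"]/@val');
print "$val\n";    # prints 20

$tree->delete;     # free the parse tree when done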

Related

Extracting JSON data from html source for use with jsonlite in R

I have a background in data and have just been getting into scraping, so forgive me if my knowledge of web standards and languages is not up to scratch.
I am trying to scrape some data from a JavaScript component of a website I use. Viewing the page source, I can actually see the data I need already there, within JavaScript function calls, in JSON format. For example, it looks a little like this:
<script type="text/javascript">
$(document).ready(function () {
gameState = 4;
atView.init("/Data/FieldView/20152220150142207",{"a":[{"co":true,"col:"Red"}],"b":false,...)
meLine.init([{"c":100,"b":true,...)
</script>
Now, I only need the JSON data in meLine.init. If I physically copy/paste only that JSON data into a file, I can then convert it with jsonlite in R and have exactly what I need.
However, I don't want to have to copy/paste from multiple pages, so I need a way of extracting only this data and leaving everything else behind. I originally thought to read the HTML source into R, convert it to text, and try to regex-match "meLine.init(", but I'm not really getting anywhere with that. Could anyone offer some help?
Normally I'd use XML and XPath to parse an HTML page, but in this case (since you know the exact structure you're looking for) you might be able to do it directly with a bit of regular expression work (this is generally not a good idea, as emphasized here). Not sure if this gets you exactly to your goal, but
sub("[ ]+meLine.init\\((.+)\\)" , "\\1",
grep("meLine.init", readLines("file://test.html"), value=TRUE),
perl=TRUE)
will return the line you're looking for, and then you can work your magic with jsonlite. The idea is to read the page line by line, grep the (hopefully) single line that contains the string meLine.init, and then extract the JSON string from that. Replace file://test.html with the URL you want to use.

HTML parsing in Clojure

I'm looking for a good way to parse HTML in Clojure.
Exactly what I'm trying to do is get the content of a web page with a crawler and then get the content of some HTML tags or their attributes.
So I have the URL of the page, and I get the HTML as a String, but how do I get the data I need?
Use https://github.com/cgrand/enlive
It allows you to select and retrieve data with CSS-like selectors.
Or https://github.com/nathell/clj-tagsoup
I am not experienced with clj-tagsoup, but I can tell you that enlive works well for most scraping.

How to convert HTML into Prolog

How do I convert HTML into Prolog?
I need to extract the tags from an HTML page and describe them in Prolog.
For example, if my file contains this HTML code
<title>Prove</title>
<select id="data_nastere_zi" name="data_nastere_zi">
I should get
title(Prove),
select(id(data_nastere_zi)).
I tried looking at various libraries but couldn't find a way to do it.
Thanks.
You can parse well formed HTML using SWI-Prolog library(sgml), in particular load_html/2.
My experience scraping 'real world' websites isn't really pleasant, because of insufficient error handling.
Anyway, once you have loaded the page structure, you have library(xpath) available to inspect such complex data.
Edit: getting a table inside a div:
xpath(Page, //div, Div),
xpath(Div, //table, Table)...
SWI-Prolog has a package for SGML/XML parsing based on the SWI-Prolog interface to SP by Anjo Anjewierden: "SWI-Prolog SGML/XML parser".

Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA_1 and DATA_2 below) kept between sets of tags that appear one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud">*DATA_2*</td>
...
I would then like to store a mapping DATA_2 => DATA_1 in a hash.
Since it is HTML, I think this could work for you:
https://metacpan.org/pod/XML::XPath
XPath is the way.
Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
    my $data = $node->findvalue('.');   # the text content of the node
    print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
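To tie this back to the original question, here is a rough sketch (not tested against your actual page) of building the DATA_2 => DATA_1 hash from the class names shown in the sample, assuming each jumlah cell is paired with the ud cell that follows it:
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Stand-in for the real page; replace this with your $content variable.
my $content = <<'HTML';
<table>
  <tr><td class="jumlah">DATA_1</td><td class="ud">DATA_2</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

# Pull the two columns of values in document order.
my @jumlah = $tree->findvalues('//td[@class="jumlah"]');
my @ud     = $tree->findvalues('//td[@class="ud"]');

# Build the DATA_2 => DATA_1 mapping; a hash slice keeps the pairing by position.
my %map;
@map{@ud} = @jumlah;

print "$_ => $map{$_}\n" for keys %map;
$tree->delete;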
Use HTML parsing modules as described in answers to this question: HTML::TreeBuilder or HTML::Parser.
Purely theoretically you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with regexes is a Bad Idea with capital letters: too easy to get wrong, too hard to do well, and impossible to get 100% right, since HTML is not a regular language.
You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

Parse HTML in Adobe AIR

I am trying to load and parse HTML in Adobe AIR. The main purpose is to extract the title, meta tags, and links. I have been trying HTMLLoader, but I get all sorts of errors, mainly uncaught JavaScript exceptions.
I also tried loading the HTML content directly (using URLLoader) and pushing the text into HTMLLoader (using loadString(...)), but got the same errors. My last resort was to try to load the text into XML and then use E4X queries or XPath; no luck there because the HTML is not well formed.
My questions are:
Is there a simple and reliable (AIR/ActionScript) DOM component out there? (I do not need to display the page; headless mode will do.)
Is there any library to convert (crappy) HTML into well-formed XML so I can use XPath/E4X?
Any other suggestions on how to do this?
Thanks.
ActionScript is supposed to be a superset of JavaScript, and thankfully, there's...
Pure JavaScript/ActionScript HTML Parser
created by JavaScript guru and jQuery creator John Resig :-)
One approach is to run the HTML through HTMLtoXML() then use E4X as you please :)
Afaik:
No :-(
No :-(
I think the easiest way to grab the title and meta tags is to write some regular expressions. You can load the page's HTML code into a string and then read out whatever you need, like this:
var str:String = ""; // put HTML code in here
var pattern:RegExp = /<title>(.+)<\/title>/i;
trace(pattern.exec(str));