XPath collect multiple attributes (html events) - html

I have a section of html code, and am trying to parse it via Perl's XML::LibXML module. I am trying to collect all the events within the html (onclick, onchange, onsubmit, etc), and thought XPath would be useful for identifying them. I know I can do
'//@onclick|//@onchange|//@onsubmit|...'
but was wondering if there's a way to avoid listing them, to ensure that no events are missed. The only idea I had was
'//@on*'
but that doesn't work.

Try this:
'//@*[starts-with(name(), "on")]'
starts-with() and name() are XPath functions; see http://www.w3.org/TR/xpath-functions/ and http://www.w3schools.com/xpath/xpath_functions.asp
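For comparison, the same "attribute name starts with on" filter can be sketched in plain Python. This is only an illustration of the filter's logic using the standard library's html.parser, not the XML::LibXML approach from the question; the class name EventAttributeCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class EventAttributeCollector(HTMLParser):
    """Collects (tag, attribute, value) for every attribute whose name starts with "on"."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names, so "onClick" is matched too.
        for name, value in attrs:
            if name.startswith("on"):
                self.events.append((tag, name, value))


# Hypothetical sample markup, for illustration only.
sample = (
    '<form onsubmit="return check()">'
    '<input onchange="update()" name="q">'
    '<a href="#" onclick="go()">x</a>'
    "</form>"
)

collector = EventAttributeCollector()
collector.feed(sample)
collector.close()
```

As with the XPath expression, this matches any attribute whose name begins with "on", whether or not it is a real event handler.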

Related

Cannot collect all nodes of Google search result with goquery: some nodes are missing

I am trying to collect the results of a Google search page in Go using the goquery library. To do this, I collect all nodes of a goquery selection. The problem is that the selection returned by Find("*") does not seem to contain all the nodes of the HTML document. Question: does the method collect ALL nodes, with the whole tree structure, or not? If not, is there a method to collect them all?
I tried applying the goquery Find("*") method to the whole document selection. Nodes with certain attributes are not returned, although they are in the HTML document. For instance, some div nodes are missing from the selection:
alltags := doc.Find("*") //doc is the HTML doc with the Google search
The selection does not contain the div tags with class="srg". The same applies to other class values such as "bkWMgd", "rc" for example.
This has happened to me before. I was web scraping with Python's Beautiful Soup package and the same thing was happening.
It turned out that the HTML I fetched was the markup the server returns once it has detected a bot. I solved this by setting the User-Agent header to Mozilla/5.0.
You can start by updating the code for the fetch request you perform.
Hope this helps in your quest to solve this.
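Since the answer mentions Python, here is a minimal sketch of setting the User-Agent header on a request with the standard library's urllib. The helper name build_request and the example URL are assumptions for illustration; the same idea applies to the HTTP client used with goquery, and a server may still serve different markup to scripts.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Returns a Request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL, for illustration only.
request = build_request("https://www.google.com/search?q=example")

# To actually fetch the page (network access required):
# with urllib.request.urlopen(request) as response:
#     html = response.read().decode("utf-8", errors="replace")
```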

Openrefine cannot fetch html code inside accordion

I know that OpenRefine is not a perfect tool for web scraping, but I am looking for some help with the first step.
I cannot collect the full HTML in OpenRefine when I add a column by fetching URLs (https://profiles.health.ny.gov/hospital/view/103094). The fetched HTML does not include anything under the accordion sections, such as services, bed types, etc.
Is there any way to get the full HTML by fetching in OpenRefine?
I am trying to collect the information under "Administrative", whose XPath is "//div[4]/div/ul/li" ("div#AdministrativeBox.in.collapse").
This website loads its content dynamically using JavaScript. The information that interests you is not in the source code of the page, so OpenRefine cannot extract it.
However, there is a workaround. If you transform your URLs with the GREL formula value.replace('view', 'tab_overview'), you will get scrapable pages like this one.
Note that OpenRefine does not use XPath, but jsoup selectors. To get the elements of the "Administrative" block, you can use this GREL formula:
forEach(value.parseHtml().select('#AdministrativeBox li'), e, e.htmlText()).join(',')
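For readers outside OpenRefine, the two steps (rewrite the URL, then pull the li items out of the element with id AdministrativeBox) can be sketched in Python with the standard library. This is an illustration only: the names to_overview_url and ListItemCollector are made up, the sample markup is invented, and the real page structure is assumed rather than verified.

```python
from html.parser import HTMLParser


def to_overview_url(url):
    """Mirrors the GREL formula value.replace('view', 'tab_overview')."""
    return url.replace("view", "tab_overview")


class ListItemCollector(HTMLParser):
    """Collects the text of <li> elements nested inside the element with a given id.

    Sketch only: void elements such as <br> inside the box would upset the
    depth counter, and nested lists are not handled.
    """

    def __init__(self, box_id):
        super().__init__()
        self.box_id = box_id
        self.depth = 0  # nesting depth inside the target element (0 = outside)
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
            if tag == "li":
                self.in_li = True
                self.items.append("")
        elif dict(attrs).get("id") == self.box_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if tag == "li":
                self.in_li = False

    def handle_data(self, data):
        if self.depth and self.in_li:
            self.items[-1] += data


overview_url = to_overview_url("https://profiles.health.ny.gov/hospital/view/103094")

# Invented markup standing in for a fetched page.
sample = (
    '<div id="AdministrativeBox"><ul>'
    "<li>Operator: Example Hospital</li><li>Type: General</li>"
    "</ul></div><ul><li>Other</li></ul>"
)
collector = ListItemCollector("AdministrativeBox")
collector.feed(sample)
collector.close()
joined = ",".join(collector.items)  # same join as the GREL formula
```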

Curly brackets in HTML

I stumbled upon this code:
<a href="#" class="text1"{text2}>...</a>
What does the {text2} do? Later on, this HTML is replaced with:
<a href="#" class="text1" {text2} style>...</a>
Is there a way I can retrieve the text2 value with jQuery?
In some cases a marker like that is inserted so that scripts can easily identify the line. In other cases it tells a back end where to insert data once it has been retrieved from a database.
It could also simply be invalid markup, although that is doubtful if the author knows what they are doing.
Without any other information or variables it is hard to say. The most common use is as a hook for scripts written in PHP, JavaScript, or even C#, because they can parse the HTML document and manipulate it. If the braces are used incorrectly, they will cause a parse error.
Hopefully that clarifies it.
Update:
Yes, jQuery can find it. jQuery is a JavaScript library. You could implement something such as:
$(function() {
  // Note: :contains() matches text content, so this finds {text2} only where it appears as text.
  var foundString = $('*:contains("{text2}")');
});
The jQuery documentation covers selectors in more detail.
It does nothing in HTML. It's actually invalid markup. Looks like maybe you have a template system that finds and replaces that before it gets rendered to the browser.
I know that in Jinja2, a Python templating system, brackets contain commands to the template engine, either as:
Hello, {{varName}}
or:
<ol>
{%for l in varList%}
<li>{{l}}</li>
{%endfor%}
</ol>
That's in Jinja, but Jinja has similar syntax to Django templates, and many other template engines probably copy Django's syntax as well.
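To make the "find and replace before rendering" idea concrete, here is a minimal sketch of a single-brace placeholder substitution in Python. It is not how Jinja2 or Django work internally; the function render and the replacement value are hypothetical, and the template line is the one from the question.

```python
import re

# Matches single-brace placeholders such as {text2}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render(template, values):
    """Replaces each {name} with values[name]; unknown placeholders are left as-is."""

    def substitute(match):
        key = match.group(1)
        return values.get(key, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


template = '<a href="#" class="text1"{text2}>...</a>'
# Hypothetical value; a real system would decide what {text2} expands to.
rendered = render(template, {"text2": ' title="Details"'})
unrendered = render(template, {})
```

If the substitution step is skipped, the placeholder reaches the browser unchanged, which would explain seeing {text2} in the page source.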
It is also used in AngularJS, where these are called expressions: {{expression}}
AngularJS is a JavaScript framework. It can be added to an HTML page with a <script> tag.
AngularJS extends HTML attributes with Directives, and binds data to HTML with Expressions.

Mapping plain text back into HTML document

Situation: I have a group of strings that represent Named Entities (NEs) that were extracted from something that used to be an HTML doc. I also have the original HTML doc, the stripped-of-all-markup plain text that was fed to the NER engine, and the offset/length of the strings in the stripped file.
I need to annotate the original HTML doc with highlighted instances of the NEs. To do that I need to do the following:
Find the start / end points of the NE strings in the HTML doc. Something that resulted in a DOM Range Object would probably be ideal.
Given that Range object, apply a styling (probably using something like <span class="ne-person" data-ne="123">...</span>) to the range. This is tricky because there is no guarantee that the range won't include multiple DOM elements (<a>, <strong>, etc.) and the span needs to start/stop correctly within each containing element so I don't end up with totally bogus HTML.
Any solutions (full or partial) are welcome. The back-end is mostly Python/Django, and the front-end is using jQuery. We would rather do this on the back-end, but I'm open to anything.
(I was a bit iffy on how to tag this question, so feel free to re-tag it.)
Use a range utility method plus an annotation library such as one of the following:
artisan.js
annotator.js
vie.js
The free software Rangy JavaScript library is your friend. Regarding your two tasks:
Find the start / end points of the […] strings in the HTML doc. You can use Range#findText() from the TextRange extension. It indeed results in a DOM Level 2 Range compatible object [source].
Given that Range object, apply a styling […] to the range. This can be handled with the Rangy Highlighter module. If necessary, it will use multiple DOM elements for the highlighting to keep up a DOM tree structure.
Discussion: Rangy is a cross-browser implementation of the DOM Level 2 range utility methods proposed by @Paul Sweatte. Using an annotation library would be a further extension on range library functionality; for example, Rangy will be the basis of Annotator 2.0 [source]. It's just not required in your case, since you only want to render highlights, not allow users to add them.
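Since the question prefers a back-end solution, here is a rough Python sketch of the same idea: walk the HTML, keep a running offset into the plain text, and wrap each overlapping text run in its own span so that no span crosses an element boundary. It assumes the stripped text is exactly the concatenation of the text between tags (no entity decoding, no whitespace normalization, no script or style content), which may not hold for a real stripper. The function name highlight is made up, and the regex-based tag split is a simplification, not a full HTML parser.

```python
import re

# Splitting on a capturing group yields alternating pieces: text, tag, text, tag, ...
TAG_RE = re.compile(r"(<[^>]+>)")


def highlight(html, start, length, css_class="ne-person"):
    """Wraps plain-text range [start, start + length) in spans, one per text run."""
    end = start + length
    out = []
    pos = 0  # offset into the plain text seen so far
    for index, piece in enumerate(TAG_RE.split(html)):
        if index % 2 == 1:  # odd pieces are tags: copy them through untouched
            out.append(piece)
            continue
        seg_start, seg_end = pos, pos + len(piece)
        pos = seg_end
        lo, hi = max(start, seg_start), min(end, seg_end)
        if lo >= hi:  # this text run does not overlap the entity
            out.append(piece)
            continue
        a, b = lo - seg_start, hi - seg_start
        out.append(
            piece[:a] + f'<span class="{css_class}">' + piece[a:b] + "</span>" + piece[b:]
        )
    return "".join(out)


# "Barack Obama" starts at offset 0 and is 12 characters long in the stripped text.
marked = highlight("<p>Barack <strong>Obama</strong> spoke.</p>", 0, 12)
```

Because each text run gets its own span, an entity that straddles <strong> or <a> produces several sibling spans rather than one span that breaks the nesting.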

Getting started styling JSON search results from DocumentCloud

I'm looking to build a system that styles the search results from DocumentCloud (and allows me to link to a given document).
I know I can query DocumentCloud and return JSON results using a search string like this:
https://www.documentcloud.org/api/search.json?q=obama
I don't know how to:
Grab the output of the search and put it on my own page
Style the data once I have it on my page
I'd just like to know how to get started with this, I'm experienced with HTML and CSS but I've never worked with JSON before.
There's more info here but I just don't know where to get started: https://www.documentcloud.org/help/api
It sounds like you're not so familiar with JavaScript, correct? JSON stands for JavaScript Object Notation, so to work with it, you'll have to dive in a bit. I strongly recommend using a JavaScript framework/library, namely jQuery, to handle the heavy lifting. (There are other worthy libraries, but jQuery is by far the most popular, and it is very friendly, using CSS-like selectors to manipulate the document object model.)
Check this jQuery tutorial: How jQuery Works
Here's a primer on using jQuery's JSONP support to fetch remote results and use them in a page: http://www.ibm.com/developerworks/library/wa-aj-jsonp1/
You might end up with code in a javascript file, or a script tag (following a link to the jQuery library) that looks like this:
$(document).ready(function () {
  $.getJSON('https://www.documentcloud.org/api/search.json?q=obama&callback=?', null, function (results) {
    // this would append whatever the json returns for 'total'
    // inside an element on your page with an id of 'resultsCount':
    $('#resultsCount').append(results.total);
  });
});
As a result, extra text & markup can be added to elements you already have on your page in whatever form/position you need it, and regular CSS rules from any style block or CSS file linked on your page will apply to them.
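If you would rather build the markup on the server, the same "parse the JSON, then emit HTML that your CSS can style" step can be sketched in Python. Only the 'total' field comes from the answer above; the "documents", "title" and "canonical_url" field names and the sample payload are assumptions for illustration, so check the API help page for the actual response shape.

```python
import json
from html import escape


def render_results(payload):
    """Turns a search response (JSON text) into an HTML fragment ready for styling."""
    data = json.loads(payload)
    items = "".join(
        f'<li><a href="{escape(doc["canonical_url"])}">{escape(doc["title"])}</a></li>'
        for doc in data.get("documents", [])  # assumed field name
    )
    return (
        f'<p id="resultsCount">{data["total"]} results</p>'
        f'<ul class="results">{items}</ul>'
    )


# Made-up payload standing in for a real API response.
sample = (
    '{"total": 1, "documents": [{"title": "Budget & Plans", '
    '"canonical_url": "https://example.org/doc/1"}]}'
)
fragment = render_results(sample)
```

The id and class values in the fragment are ordinary hooks for CSS rules such as #resultsCount or ul.results li.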
Good luck.