OpenRefine cannot fetch HTML code inside accordion - html

I know that OpenRefine is not a perfect tool for web scraping, but I am looking for some help with the first step.
I cannot collect the full HTML in OpenRefine when I add a column by fetching the URL (https://profiles.health.ny.gov/hospital/view/103094). The fetched pages do not include any of the markup inside the accordions, such as services, bed types, etc.
Any idea how to get the full markup by fetching in OpenRefine?
I am trying to collect the information under "Administrative", whose XPath is "//div[4]/div/ul/li" ("div#AdministrativeBox.in.collapse").

This website loads its content dynamically using JavaScript. The information that interests you is not stored in the source code of the page, so OpenRefine cannot extract it.
However, there is a workaround. If you transform your URLs with the GREL formula value.replace('view', 'tab_overview'), you will get scrapable pages like this one.
Note that OpenRefine does not use XPath, but jsoup selectors. To get the elements of the "Administrative" block, you can use this GREL formula:
forEach(value.parseHtml().select('#AdministrativeBox li'), e, e.htmlText()).join(',')
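If you want to sanity-check that selector outside OpenRefine, here is a minimal Python sketch of the same scrape; Python, requests, and Beautiful Soup are my additions here, not part of the OpenRefine workflow:

import requests
from bs4 import BeautifulSoup

# Rewrite the 'view' URL to its scrapable 'tab_overview' variant,
# mirroring the GREL value.replace('view', 'tab_overview') above.
url = "https://profiles.health.ny.gov/hospital/view/103094".replace("view", "tab_overview")

soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Same selector as the jsoup expression: every <li> inside the
# element with id="AdministrativeBox".
items = [li.get_text(strip=True) for li in soup.select("#AdministrativeBox li")]
print(",".join(items))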

Related

Cannot collect all nodes of Google search result with goquery: some nodes are missing

I am trying to collect the results of a Google search page in Go using the goquery library. To achieve this, I am collecting all nodes of a goquery selection. The problem is that the selection returned by Find("*") does not seem to contain all the nodes of the HTML document. Question: does the method collect ALL nodes of the whole tree structure or not? If not, is there a method to collect them all?
I tried using the goquery Find("*") method applied to the whole document selection. Nodes with certain attributes are not returned, although they are in the HTML document. For instance, nodes with class="srg" are not recognized:
alltags := doc.Find("*") //doc is the HTML doc with the Google search
The selection does not contain the div tags with class="srg". The same applies to other class values, such as "bkWMgd" and "rc".
This has happened to me before. I was trying to web scrape with Python's Beautiful Soup package and the same thing was happening.
It later turned out that the HTML markup returned by the fetch was actually the markup the server returns after detecting a bot. I solved this by setting the User-Agent to Mozilla/5.0.
Hope this helps in your quest to solve this.
You can start by updating the fetch request you have been performing.
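Since that answer mentions Python and Beautiful Soup, here is a minimal sketch of the fix under those assumptions (the requests library and the exact User-Agent string are illustrative):

import requests
from bs4 import BeautifulSoup

# Many servers return a stripped-down page when they detect a bot-like
# default User-Agent. Sending a browser-style string often returns the
# full markup. The string below is illustrative.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

resp = requests.get("https://www.google.com/search?q=example", headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

# Class-based selectors should now find the nodes that were missing
# from the bot version of the page.
print(len(soup.select("div.srg")))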

Send and receive data to and from a website using the TWebbrowser component in Delphi

I'm creating a VCL application with Delphi 10.3 and want to support some web functionality by having the user enter the ISBN of a book into a TEdit component, and from there passing/sending this value to the search field on this website: https://isbnsearch.org, after which the website looks up the ISBN and displays the author of the book. I want to somehow access the information (i.e. the author) presented in the search result and use it again in my application.
This is my GUI, for a better idea of what I want to accomplish (screenshot not shown):
What code can I use for this? Any other feasible suggestions or approaches are acceptable.
When performing a search on that website, it simply loads a page with a specific URL query string...
https://isbnsearch.org/search?s=suess
The above example is when I search for "suess", so you can easily concatenate a search URL.
You can use any HTTP component, such as TIdHTTP, to load this search page, then use an HTML parser to scrape the page and read what you need. Much, much easier than trying to read through the TWebBrowser.
In the end, you won't actually display the HTML (I mean you can if you want to), but the idea is to read the data and display it in your own format.
On that specific page, start by locating the ul element with id searchresults. Then, each li element contains an individual result. Unfortunately, this website uses pagination and only shows 10 results per page. To get the remaining results, call the page again with an extra parameter: &p=2 for the 2nd page, &p=3 for the 3rd page, and so on.
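The Delphi specifics depend on which HTTP component and HTML parser you choose; as a language-neutral illustration of the same flow (build the query URL, fetch each page, walk ul#searchresults), here is a sketch in Python with requests and Beautiful Soup, which are stand-ins rather than a Delphi recommendation:

import requests
from bs4 import BeautifulSoup

# Sketch of the flow described above: build the search URL, fetch a
# few paginated result pages, and read the <li> entries inside
# <ul id="searchresults">. Selectors follow the description above and
# may need adjusting if the site's markup changes.
def search_isbn(term, pages=2):
    results = []
    for page in range(1, pages + 1):
        url = f"https://isbnsearch.org/search?s={term}&p={page}"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for li in soup.select("ul#searchresults li"):
            results.append(li.get_text(" ", strip=True))
    return results

for result in search_isbn("suess"):
    print(result)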
On the other hand, scraping is the worst way to acquire such information. What you should be doing is using a proper API which gives you machine-friendly data. The service you are referencing doesn't appear to offer one, but here's an example of one which does:
https://openlibrary.org/dev/docs/api/books - this also appears to provide you MUCH more information than the one you're using.
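For example, a minimal lookup against the Open Library Books API linked above (a sketch; check the current docs for the exact field names):

import requests

# Ask the Books API for one ISBN in the jscmd=data form and print
# the title and author names from the JSON response.
isbn = "0451526538"  # example ISBN
resp = requests.get(
    "https://openlibrary.org/api/books",
    params={"bibkeys": f"ISBN:{isbn}", "format": "json", "jscmd": "data"},
)
data = resp.json().get(f"ISBN:{isbn}", {})
print(data.get("title"))
for author in data.get("authors", []):
    print(author["name"])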

Automate Web Applications - Parsing HTML Data

I just want to automate a web application, where the application parses an HTML page and pulls the inner text of HTML tags based on some condition. For example, suppose we have a span tag whose class="spanclass_1":
This is span tag...
The app should parse the page and pull that span into it.
And here the main pain point is: I should not use the developer's code to automate that same HTML parsing.
I want to verify that the parsing is done correctly, simply by using the parsed data which is shown in the UI.
Any help would be great.
I appreciate you taking the time to read this.
(Note: the span tag markup is not shown.)
Thanks, buddies.
Not enough details.
Is this HTML page just a file in the local filesystem, or is it an internet web page?
Do you have access to the pages? Can you modify them? If yes, just add JavaScript to the page which will extract the data and post it to a server.
If not, then it depends on the language you use to program.
Find a good framework to parse HTML: load the page, parse it, and extract the data. Several situations can arise:
Worst scenario - the page is generated on the client side using JS.
Best scenario - the page is in XHTML mode (you are lucky; any XML parser will help you build a DOM and extract the data).
So-so - the page is in plain HTML format (try several HTML parsers to find the most suitable one for you; see the sketch below).
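For the plain-HTML case, a minimal sketch, assuming Python with Beautiful Soup and the spanclass_1 class from the question (the sample markup is a stand-in):

from bs4 import BeautifulSoup

# Pull the inner text of every <span class="spanclass_1">, as asked
# in the question. The HTML literal here is a stand-in for the page.
html = '<div><span class="spanclass_1">This is span tag...</span></div>'
soup = BeautifulSoup(html, "html.parser")
for span in soup.find_all("span", class_="spanclass_1"):
    print(span.get_text(strip=True))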

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull out certain elements from the HTML. In the sites I am scraping, there are script tags with a bunch of info in them however, there is one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. Now, this is different for each page source, except obviously for the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line and cutting out just the URL? I feel I would need to use regular expressions, as the URLs are dynamic.
The "gsub" method does something similar to what I want, with its ability to take a /regex/. But I am not wanting to replace anything; I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, this is what you're looking for, I guess:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
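If you are doing this in a script rather than in the browser, the same idea in Python with a slightly tighter pattern, anchored on the 'image' key from the question (the equivalent regex works in Ruby as well):

import re

# Grab just the URL from the 'image':'...' line in the page source.
# The pattern stops at the closing quote instead of matching to the
# end of the line.
source = "'image':'http://ut5.example.com/t/231/3_b_643435.jpg',"
match = re.search(r"'image':'(http[^']+)'", source)
if match:
    print(match.group(1))  # http://ut5.example.com/t/231/3_b_643435.jpg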

YQL link list to HTML

I'm using YQL to return a list of links from specific web pages. So far so good: it returns all the links that I intend it to; however, I don't know how to get that info into my web page.
Here's what I'm trying to do:
- YQL returns a list of links in the results.
- I want those links to appear in my webpage, inside a table, inside divs, etc., as if I had written them there.
I have been trying to find a way to do this, but I don't know much JS or JSON, so I'm here hoping to get an answer from those of you who might know a way.
There are a couple of ways to do this, depending on which approach you want to take.
First, and simplest, is server-side generation. This is what would result, for example, if a user hits a Submit button on a search form to send you his query, your script receives the query and generates the page, then sends that page to the user. In this case, your question is largely trivial. In pseudocode:
ASSIGN the list of results to list L
FOR EACH ITEM r IN L:
    PRINT a string containing an HTML template, substituting the value r where appropriate
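Concretely, that loop might look like this (a sketch in Python; the result list and HTML template are placeholders):

# Server-side rendering of the results: loop over the list and emit
# an HTML fragment per item. 'results' stands in for the list of
# links that YQL returned.
results = [
    "http://example.com/a",
    "http://example.com/b",
]
print("<ul class='SearchResults'>")
for r in results:
    # Substitute each result into a small HTML template.
    print(f"  <li><a href='{r}'>{r}</a></li>")
print("</ul>")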
It's trivial enough that I suspect you want to do this via DOM manipulation. This is what requires JavaScript: you get the query, send the request without refreshing the page, and want to add the results to the DOM. If you're receiving the list of results, you're already most of the way there. Using jQuery, you would do the same thing as in the pseudocode above, except that where it has the PRINT statement you would have:
$(".SearchResults").append("<li>" + r + "</li>");
I highly recommend reading through the jQuery tutorial. It's not as hard as you think.