How can iMacros extract data from elements that have no attributes?

I want to extract the following data: name, email, phone, date, city.
Here is the sample HTML:
<tbody>
<tr class="grid-row">
<td>Jimmy Shark</td>
<td>jshark@gmail.com</td>
<td>082166883333</td>
<td>07/13/15, 07:23 AM</td>
<td></td>
</tr>
<tr class="odd grid-row">
<td>Denny Large</td>
<td>large.denny@gmail.com</td>
<td>08575510121</td>
<td>07/09/16, 11:55 PM</td>
<td></td>
</tr>
<more and repeated>
</tbody>

Start with the following macro and correct it to meet your goal:
SET !ERRORIGNORE YES
SET !TIMEOUT_STEP 0
SET !EXTRACT_TEST_POPUP NO
TAG XPATH="//tr[@class='grid-row'][{{!LOOP}}]/td[1]" EXTRACT=TXT
TAG XPATH="//tr[@class='grid-row'][{{!LOOP}}]/td[2]" EXTRACT=TXT
TAG XPATH="//tr[@class='grid-row'][{{!LOOP}}]/td[3]" EXTRACT=TXT
TAG XPATH="//tr[@class='grid-row'][{{!LOOP}}]/td[4]" EXTRACT=TXT
TAG XPATH="//tr[@class='grid-row'][{{!LOOP}}]/td[5]" EXTRACT=TXT
PROMPT {{!EXTRACT}}
SET !EXTRACT NULL
TAG XPATH="//tr[@class='odd grid-row'][{{!LOOP}}]/td[1]" EXTRACT=TXT
TAG XPATH="//tr[@class='odd grid-row'][{{!LOOP}}]/td[2]" EXTRACT=TXT
TAG XPATH="//tr[@class='odd grid-row'][{{!LOOP}}]/td[3]" EXTRACT=TXT
TAG XPATH="//tr[@class='odd grid-row'][{{!LOOP}}]/td[4]" EXTRACT=TXT
TAG XPATH="//tr[@class='odd grid-row'][{{!LOOP}}]/td[5]" EXTRACT=TXT
PROMPT {{!EXTRACT}}
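The macro above addresses each row by its exact class attribute and then each cell by position. As a rough illustration of that idea outside iMacros, here is a minimal Python sketch using only the standard library. The function name extract_row and the SAMPLE markup are made up for this example (the sample is a well-formed copy of the two rows from the question), and the loop argument plays the role of {{!LOOP}}.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the two sample rows from the question.
SAMPLE = """
<tbody>
  <tr class="grid-row">
    <td>Jimmy Shark</td>
    <td>jshark@gmail.com</td>
    <td>082166883333</td>
    <td>07/13/15, 07:23 AM</td>
    <td></td>
  </tr>
  <tr class="odd grid-row">
    <td>Denny Large</td>
    <td>large.denny@gmail.com</td>
    <td>08575510121</td>
    <td>07/09/16, 11:55 PM</td>
    <td></td>
  </tr>
</tbody>
"""

root = ET.fromstring(SAMPLE)


def extract_row(tree, class_value, loop):
    """Return the cell texts of the loop-th row (1-based) whose class
    attribute equals class_value exactly, or None if there is no such row."""
    rows = tree.findall(f".//tr[@class='{class_value}']")
    if loop < 1 or loop > len(rows):
        return None
    return [(td.text or "").strip() for td in rows[loop - 1].findall("td")]


even_row = extract_row(root, "grid-row", 1)
odd_row = extract_row(root, "odd grid-row", 1)
print(even_row)
print(odd_row)
```

Note that the attribute test is an exact string comparison, so 'grid-row' does not match 'odd grid-row'. That is why the macro needs two separate blocks of TAG lines.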

Related

VBA scraping with same class name but different innertext

I am scraping a value from a website, but it turns out the value I need shares its class name with other elements.
HTML code:
<tr class="table_bdrow1_style">
<td></td>
<td style="text-align:center" class="table_bdtext_style">1.</td>
<td style="text-align:center" class="table_bdtext_style">
<div id="a">
"0.8948"
</div>
</td>
<td style="text-align:center" class="table_bdtext_style">December 19, 2016</td>
</tr>
I need the second value (0.8948) and the third value, the date (December 19, 2016), but the code I am using only returns the first value (1).
extract1 = IE.Document.getElementsByClassName("table_bdtext_style")(1).innerText
Cells(4, "A").Value = extract1
I am not sure how to extract the second and third values rather than the first. Can anyone help? Thanks a lot!
Just assign the respective index in your extract call:
' for second tag
IE.Document.getElementsByClassName("table_bdtext_style")(2).innerText
' for third tag
IE.Document.getElementsByClassName("table_bdtext_style")(3).innerText

Clicking link after search results

<tbody>
<tr>
<td>6</td>
<td>LICENSED CLINICAL SOCIAL WORKER</td>
<td><a href="/LicenseSearch/IndividualLicense/SearchResultsSummary?chid=666783">JOE L BLACK</a></td>
<td>ISLAND WI</td>
<td>08/03/1993</td>
<td>02/28/2017</td>
</tr>
Path on the website:
body > div.main > div:nth-child(3) > div.large-12.columns > table > tbody > tr > a
I need to click on the link that's in <a href="/LicenseSearch/IndividualLicense/SearchResultsSummary?chid=666783"> but I can't seem to get the loop to click on it. Can someone help me?
Tried these codes, none have worked so far:
TAG SELECTOR="HTML>body>div.main>div:nth-child(3)>div.large-12.columns>table>tbody>tr>td:nth-child(3)>a:nth-child(1)"
EVENTS TYPE=DBLCLICK SELECTOR="(/html/body/div[3]/div[3]/div[2]/table/tbody/tr/td[3]/a[contains(@href)" BUTTON=0
TAG XPATH="/html/body/div[3]/div[3]/div[2]/table/tbody/tr/td[3]/a"
TAG XPATH="(/html/body/div[3]/div[3]/div[2]/table/tbody/tr/td[3]/a[contains(@href))"

How to parse a date using Nokogiri in Ruby

I am trying to parse this page and pull the date that begins after
<p>From Date:
I get the error
Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)
The xpath from "inspect element" is
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
This is an example of the code:
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end
This is file://china.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <span class="bidi">Meeting in China</span></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
Trying Amadan's answer, saved as original.rb:
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()
puts date
formatted = date[/From Date: (.*)/, 1]
puts formatted
This gives an error: original.rb:5:in '<main>': undefined method 'text' for nil:NilClass (NoMethodError)
You can't use
noko = Nokogiri::HTML('china.html')
Nokogiri::HTML is a shortcut to Nokogiri::HTML::Document.parse. The documentation says:
.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object
... string_or_io may be a String, or any object that responds to read and close such as an IO, or StringIO. ...
While 'china.html' is a String, it's not HTML. It appears you're thinking that a filename will suffice; however, Nokogiri doesn't open anything. It only understands strings containing markup, either HTML or XML, or an IO-type object that responds to the read method. Compare these:
require 'nokogiri'
doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"
versus:
doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
and:
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.org'))
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset=\"utf-8\">\n <met"
The last works because OpenURI adds the ability to read URLs through URI.open (plain open in Ruby versions before 3.0), and the object it returns responds to read:
URI.open('http://www.example.org').respond_to?(:read) # => true
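The same distinction between a file name and the file's content applies to most parsers, not just Nokogiri. As a loose analogy, here is a minimal sketch in Python's standard library (not Ruby). The class ParagraphCollector and the helper paragraphs_in are hypothetical names invented for this illustration, and the one-line china.html written to a temporary directory is a stand-in for the real file.

```python
import os
import tempfile
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def paragraphs_in(markup):
    """Parse a string of markup and return the text of each <p>."""
    parser = ParagraphCollector()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "china.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("<html><body><p>From Date: 02/14/1936</p></body></html>")

    # Passing the path: the parser only sees the characters of the name.
    from_name = paragraphs_in(path)

    # Passing the content: read the file first, then parse what was read.
    with open(path, encoding="utf-8") as fh:
        from_content = paragraphs_in(fh.read())

print(from_name)
print(from_content)
```

In Ruby terms, the second call corresponds to handing Nokogiri the result of File.read or an open File object instead of the bare file name.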
Moving on to the question:
require 'nokogiri'
require 'open-uri'
html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <span class="bidi">Meeting in China</span></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
EOT
doc = Nokogiri::HTML(html)
Once the document is parsed, it's easy to find a particular <p> tag using the
<table cellspacing="0" cellpadding="0" class="resultsTypes">
as a placemarker:
from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"
It looks like it's going to be tougher pulling the title = "Meeting in China" and link = "bing.com", since they are on the same line.
I'm using CSS selectors to define the path to the desired text. CSS is more easily read than XPath, though XPath is more powerful and descriptive. Nokogiri allows us to use either, and lets us use search or at with either. at is equivalent to search('some selector').first. There are also CSS and XPath specific versions of search and at, described in Nokogiri::XML::Node.
title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"
You're trying to use the XPath:
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
however, it's not valid for the HTML you're working with.
Notice tbody in the selector. Look at the HTML: neither <table> tag is immediately followed by a <tbody> tag, so the XPath is wrong. I suspect the path was generated by your browser, which fixes up the HTML by adding <tbody> according to the specification. Nokogiri doesn't add <tbody>, so the path doesn't match the HTML and the search fails. So don't rely on the selector reported by the browser, and don't trust the browser's idea of the actual HTML source.
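The effect of the extra tbody step can be reproduced with any parser that does not fix up the markup. The following is a minimal sketch in Python's standard library rather than Nokogiri; SOURCE is a trimmed, well-formed stand-in for china.html, and the three path variables are names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the source file: no <tbody> anywhere.
SOURCE = """
<div id="timelineItems">
  <table>
    <tr><td>
      <table class="resultsTypes">
        <tr><td><p>From Date: 02/14/1936</p></td></tr>
      </table>
    </td></tr>
  </table>
</div>
"""

root = ET.fromstring(SOURCE)

# Path as a browser's inspector might report it, with <tbody> inserted.
browser_path = "./table/tbody/tr/td/table/tbody/tr/td/p"
# Path that matches the markup as it is actually written.
source_path = "./table/tr/td/table/tr/td/p"

with_tbody = root.findall(browser_path)
without_tbody = root.findall(source_path)

# Way-point approach: find the table by its class, then search inside it.
results_table = root.find(".//table[@class='resultsTypes']")
via_waypoint = results_table.findall(".//p")

print(len(with_tbody))
print([p.text for p in without_tbody])
print([p.text for p in via_waypoint])
```

The way-point lookup keeps working even if wrapper elements are added or removed around the table, which is the reason for preferring it over a full explicit path.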
Instead of using an explicit selector, it's better, easier, and smarter, to look for specific way-points in the markup, and use those to navigate to the node(s) you want. Here's an example of doing everything above, only using a placeholder, and a mix of XPath and CSS:
doc.at('//p[starts-with(., "Title:")]').text # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"
So, it's fine to mix and match CSS and XPath.
from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"
EDIT:
Explanation: Get the first node (#at_xpath) anywhere in the document (//) such that ([...]) text content (text()) starts with (starts-with(string, stringStart)) "From Date" ("From Date:"), and take its text content (#text()), storing it (=) into the variable from_date (from_date). Then, extract the first group (#[regexp, 1]) from that text (from_date) by using the regular expression (/.../) that matches the literal characters "From Date: ", followed by any number (*) of any characters (.), that will be captured ((...)) in the first capture group to be extracted by #[regexp, 1].
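The capture-group step described above is not specific to Ruby. Here is a minimal sketch of the same extraction in Python, assuming the paragraph text has already been pulled out of the document. The final strptime call goes one step beyond the answer and assumes the date is written month/day/year, as the sample value 02/14/1936 suggests.

```python
import re
from datetime import datetime

# Text of the matching <p> node, as in the sample document.
from_date = "From Date: 02/14/1936"

# Literal prefix, then capture everything after it in group 1.
match = re.search(r"From Date: (.*)", from_date)
date_text = match.group(1) if match else None

# Assumption: month/day/year ordering.
parsed = datetime.strptime(date_text, "%m/%d/%Y").date()

print(date_text)
print(parsed.isoformat())
```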
Also,
Amadan's answer [...] gives an error
I did not notice that your Nokogiri construction is broken, as explained by the Tin Man. The line noko = Nokogiri::HTML('china.html') (which was not a part of my answer) will give you a single node document that only has the text "china.html" in it, and no <p> nodes at all.

How to bypass HTML escape signs and extract only the text from an HTML file in Perl using Web::Scraper

I am trying to extract only the text from the HTML page and want to ignore or bypass the HTML escape signs "<" and ">". Here is the part of the HTML page that I used for extracting the text:
<table class="reference">
<tr>
<th align="left" width="25%">Tag</th>
<th align="left" width="75%">Description</th>
</tr>
<tr>
<td>&lt;!--...--&gt;</td>
<td>Defines a comment</td>
</tr>
<tr>
<td>&lt;!DOCTYPE&gt; </td>
<td>Defines the document type</td>
</tr>
<tr>
<td>&lt;a&gt;</td>
<td>Defines a hyperlink</td>
</tr>
<tr>
<td>&lt;abbr&gt;</td>
<td>Defines an abbreviation</td>
</tr>
<tr>
...
My perl code is:
use URI;
use Web::Scraper;
my $urlToScrape = "http://www.w3schools.com/tags/";
# prepare data
my $teamsdata = scraper {
process "table.reference > tr > td > a ", 'tags[]' => 'TEXT';
process "table.reference > tr > td > a ", 'urls[]' => '@href';
};
# scrape the data
my $res = $teamsdata->scrape(URI->new($urlToScrape));
print "<HTML_tags>\n";
for my $i ( 0 .. $#{$res->{urls}}) {
print FILE " <tag_Name> $res->{tags}[$i] </tag_Name>\n ";
}
print "</HTML_tags>\n";
The output I get is the following:
<HTML_tags>
<tag_Name> <!--...--> </tag_Name>
<tag_Name> <!DOCTYPE> </tag_Name>
<tag_Name> <a> </tag_Name>
<tag_Name> <abbr> </tag_Name>
</HTML_tags>
whereas I want output as:
<HTML_tags>
<tag_Name> !--...-- </tag_Name>
<tag_Name> !DOCTYPE </tag_Name>
<tag_Name> a </tag_Name>
<tag_Name> abbr </tag_Name>
</HTML_tags>
Can anyone tell me what I have to change in order to get the above output?
Many Thanks.
Brute Force:
$res->{tags}[$i] =~ s/[\<\>]//gs; ## Added line
print FILE " <tag_Name> $res->{tags}[$i] </tag_Name>\n ";
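The substitution above simply deletes every "<" and ">" from the extracted text. For comparison, here is a minimal Python sketch of the same brute-force idea. The tags list stands in for the scraped values and strip_angle_brackets is a name invented for this example; it is not part of Web::Scraper.

```python
import re

# Stand-in for the scraped tag names, as they appear in the output above.
tags = ["<!--...-->", "<!DOCTYPE>", "<a>", "<abbr>"]


def strip_angle_brackets(text):
    """Remove every '<' and '>' character, leaving the rest untouched."""
    return re.sub(r"[<>]", "", text)


cleaned = [strip_angle_brackets(tag) for tag in tags]

lines = ["<HTML_tags>"]
lines += [f" <tag_Name> {name} </tag_Name>" for name in cleaned]
lines.append("</HTML_tags>")
print("\n".join(lines))
```

Like the Perl one-liner, this removes angle brackets anywhere in the string, so it is only appropriate when the extracted text is known to be a bare tag name.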

How to embed links (anchor tags) into HTML content from UiBinder in GWT

I have an HTML widget in my ui.xml, which I am using in UiBinder to populate data as shown below:
ui.xml ->
<g:HTML ui:field="operationsDetailTableTemplate" visible="false">
<table class="{style.LAYOUT_STYLE}" width="100%" border="1">
<tr>
<td><img src="images/indent-blue.gif"/></td>
<td>
<table class="{style.DEFAULT_STYLE}">
<thead>
<tr>
<th>OperationUuid</th>
....
</tr>
</thead>
<tbody>
<tr>
<td>%s</td>
...
</tr>
</tbody>
</table>
</td>
</tr>
....
</g:HTML>
Uibinder.java--->
String htmlText = operationsDetailTableTemplate.getHTML()
.replaceFirst("%s", toSafeString(operation.getOperationUuid()))
....
HTML html = new HTML(htmlText);
operationsDetail.add(html);
The above is done in a for loop for each operation retrieved from the database.
My question is how I can embed a hyperlink or an anchor tag in one of the cells (e.g. the operation id) for each operation retrieved. I also wish to have a listener attached to it.
P.S. - It does not allow me to have an anchor tag in the HTML in ui.xml.
You'd better use the tools the way they were designed to be used: put ui:field="foo" on the <td>, declare @UiField Element foo, and call foo.setInnerHTML(toSafeString(...)), instead of extracting the HTML, modifying it, and re-injecting it elsewhere. You could also use a <g:Anchor> and attach a @UiHandler to handle ClickEvents.
Your way of using UiBinder makes me think of SafeHtmlTemplates, or the new UiRenderer aka UiBinder for Cells: https://developers.google.com/web-toolkit/doc/latest/DevGuideUiBinder#Rendering_HTML_for_Cells