How to populate an array with text from HTML web scraping in Ruby

I have used the Nokogiri Ruby gem to scrape an HTML file for only the text in cells with the tableData class. The HTML is set up like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to build an array of arrays, where each inner array holds only the 4 text values pertaining to one person. It should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to create a for loop that iterates over either the HTML itself or the vt variable to produce an array of arrays. I know some push statements are involved inside the for loop, but it's the actual structure of the loop that I am having trouble putting together. If you could explain in your answer how the for loop works in this situation, it would be much appreciated.

This is the basic structure you need. Use map for both the rows and the cells within each row:
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]

Try parsing the snippet as XML, finding all tr elements via XPath, and collecting the stripped text of each td child:
require 'nokogiri'
doc = Nokogiri::XML(get_html_snippet)
data = doc.xpath('//tr').map do |tr|
tr.xpath('td').map { |td| td.text.strip }
end
data # => [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]
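Since the question asks how the loop itself works, here is a minimal sketch in plain Ruby of what the nested map calls above do, written once with explicit each loops and << (push) and once with map. The rows variable is a stand-in for the parsed td text, copied from the first answer's output so that this runs without Nokogiri; all variable names are illustrative only.

```ruby
# Raw cell text per row, as printed by the first answer (not yet stripped).
rows = [
  [" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"],
  ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]
]

# Explicit loops: build the outer array one row at a time.
with_loops = []
rows.each do |row|
  cells = []              # inner array for this person
  row.each do |cell|
    cells << cell.strip   # push the cleaned-up text
  end
  with_loops << cells     # push the finished row onto the outer array
end

# The same thing with map, which collects each block's return value
# into a new array so no manual push is needed.
with_map = rows.map { |row| row.map(&:strip) }

p with_loops
p with_loops == with_map
```

With Nokogiri, the outer collection would be doc.xpath('//tr') and the inner one tr.xpath('td'), as in the answers above; the loop structure stays the same.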

How can I traverse HTML DOM with Swift

I have an http POST response which I receive in HTML. Now I want to display the results in my view Controller. How can I parse the DOM of the response to get the elements I want?
This is the response in raw html:
<tr>
<td style="text-align:center;">1</td>
<td>9.99</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">1.4</td>
<td>DE GRASSE, ANDRE</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="Speed Academy Athletics Club">SAAC</div>
</td>
<td>94</td>
<td>2</td>
<!--<td class="rankings_hide_992">UF Tom Jones Invitational (Olympic Development)</td>-->
<!--<td class="rankings_hide_768">Gainesville , FL</td>-->
<td>
<div data-tooltip="UF Tom Jones Invitational (Olympic Development)" style="cursor:default;">Gainesville , FL</div>
</td>
<td>17/04/2021</td>
</tr>
<tr>
<td style="text-align:center;">2</td>
<td>10.08</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">1.9</td>
<td>BROWN, AARON</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="Phoenix Athletics Assoc. of Ontario">PHNX</div>
</td>
<td>92</td>
<td>7</td>
<!--<td class="rankings_hide_992">World Athletics - Miramar</td>-->
<!--<td class="rankings_hide_768">Miramar, FL</td>-->
<td>
<div data-tooltip="World Athletics - Miramar" style="cursor:default;">Miramar, FL</div>
</td>
<td>10/04/2021</td>
</tr>
<tr>
<td style="text-align:center;">3</td>
<td>10.14</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">0.7</td>
<td>WARNER, DAMIAN</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="London Western T.F.C.">LWTF</div>
</td>
<td>89</td>
<td>1dec5</td>
<!--<td class="rankings_hide_992">Hypo-Meeting</td>-->
<!--<td class="rankings_hide_768">Götzis, AUT</td>-->
<td>
<div data-tooltip="Hypo-Meeting" style="cursor:default;">Götzis, AUT</div>
</td>
<td>29/05/2021</td>
</tr>
I'm currently trying to use HTMLKit based on a couple tutorials, but I can't truly traverse the DOM with this library. Any ideas?
HTMLKit Tutorial
HTMLKit Video Tutorial
You can try the SwiftSoup library, which supports HTML parsing and DOM traversal.
Usage
do {
    let html: String = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    let doc: Document = try SwiftSoup.parse(html)
    let link: Element = try doc.select("a").first()!
    let text: String = try doc.body()!.text(); // "An example link."
    let linkHref: String = try link.attr("href"); // "http://example.com/"
    let linkText: String = try link.text(); // "example"
    let linkOuterH: String = try link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
    let linkInnerH: String = try link.html(); // "<b>example</b>"
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

how to add class in table in python using Dominate library

I have created a table using the Dominate library, but now I want to set a class on the table. Can someone help me do that?
doc1 = dominate.document(title='Dominate your HTML')
with doc1:
    with div():
        attr(cls='body')
        h1('Survey Report : Survey Report')
oc = dominate.document(title="whatever")
with doc1:
    tags.style(".calendar_table{width:880px;}")
    tags.style("body{font-family:Helvetica}")
    tags.style("h1{font-size:x-large}")
    tags.style("h2{font-size:large}")
    tags.style("table{border-collapse:collapse}")
    tags.style("th{font-size:small;border:1px solid gray;padding:4px;background-color:#DDD}")
    tags.style("td{font-size:small;text-align:center;border:1px solid gray;padding:4px}")
    with tags.table():
        with tags.thead():
            tags.th("Nominee", style = "color:#ffffff;background-color:#6A75F2")
            tags.th("counts", style = "color:#ffffff;background-color:#6A75F2")
        with tags.tbody():
            for i in range(0,len(nom)):
                with tags.tr(): #Row 1
                    tags.td(nom[i], style = "font-size:small;text-align:center;padding:4px")
                    if int(count_nom[i]) > 1:
                        tags.td(count_nom[i], style = "font-size:small;text-align:center;padding:4px;background-color:#F4D8D2")
                    else:
                        tags.td(count_nom[i], style = "font-size:small;text-align:center;padding:4px")
            with tags.tr(): #Row 1
                tags.td(b("Grand Total"), style = "font-size:small;text-align:center;padding:4px")
                tags.td(b(sum(count_nom)), style = "font-size:small;text-align:center;padding:4px")
with open('/root/survey/'+'survey'+'.html', 'w') as f:
    f.write(doc1.render())
With this I am able to create the following table in HTML:
<div class="body">
<h1>Survey Report</h1>
</div>
<style>.calendar_table{width:880px;}</style>
<style>body{font-family:Helvetica}</style>
<style>h1{font-size:x-large}</style>
<style>h2{font-size:large}</style>
<style>table{border-collapse:collapse}</style>
<style>th{font-size:small;border:1px solid gray;padding:4px;background-color:#DDD}</style>
<style>td{font-size:small;text-align:center;border:1px solid gray;padding:4px}</style>
<table>
<thead>
<th style="color:#ffffff;background-color:#6A75F2">Nominee</th>
<th style="color:#ffffff;background-color:#6A75F2">counts</th>
</thead>
<tbody>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Deepesh Ahuja</td>
<td style="font-size:small;text-align:center;padding:4px">1</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Sabyasachi Mallick</td>
<td style="font-size:small;text-align:center;padding:4px">1</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Raju Singh</td>
<td style="font-size:small;text-align:center;padding:4px">1</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">Abarna Ravi</td>
<td style="font-size:small;text-align:center;padding:4px;background-color:#F4D8D2">2</td>
</tr>
<tr>
<td style="font-size:small;text-align:center;padding:4px">
<b>Grand Total</b>
</td>
<td style="font-size:small;text-align:center;padding:4px">
<b>5</b>
</td>
</tr>
</tbody>
</table><br><br><br>
Now, how do I set the table class in the Python code so that the output looks like this?
<table class='calendar_table'>
Can someone help me set the class of the table and other tags using the Python dominate library?
Using the example syntax from the dominate documentation on GitHub:
from dominate.tags import *
testTable = table(border = 1)
print(testTable)
which prints:
<table border="1"></table>
Because class is a reserved word in Python, you can't pass class=... as a keyword argument. One option is to set the attribute by name:
testTable.set_attribute('class','my_class_name')
Adding the above to the original instance of testTable will result in:
<table border="1" class="my_class_name"></table>
dominate also accepts cls as a stand-in keyword, as the attr(cls='body') call in your own code already does, so tags.table(cls='calendar_table') sets the class when the table is created.

Xpath grep elements

I'm using Scrapy (Python) to try to extract data from a site.
How can I extract this structure with XPath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I write the proper XPath selectors to get this structure?
I can select all the "h3" elements, all the "a" elements, and all the tags with a given class separately, but how can I match them up?
Grep, you say? Strictly speaking that's the wrong word for it; to keep the jargon clear, what you're doing is parsing, or extracting. Are you new to Scrapy, or to the web development side of things? Either way, there's no way I can teach you XPath and regex in one answer. The only way to learn is to keep at it, but here's my input.
First of all, XPath is amazingly useful when it comes to websites that aren't necessarily built to standard, which doesn't make them bad per se. The HTML snippet you gave is structured well enough, though, so I'd recommend CSS selectors instead. These extract the values:
year = response.css('td.tmp_year a::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
Pro tip: whenever you find it necessary, you can save a web page as an HTML file and open it in the Scrapy shell by passing the file path directly. So I saved your HTML snippet to a file on my desktop and ran:
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
Anyway, as far as XPath goes, since you asked: it's a piece of cake. Compare the XPath with the CSS and see if you can spot the pattern.
response.css('td.tmp_outcome b::text').extract()
So this matches a td tag whose class name is tmp_outcome; the next node is a bold (b) tag, which holds the text, and ::text declares that the text is what we want.
response.xpath('//td[@class="tmp_outcome"]/b/text()').extract()
The XPath says basically the same thing: start with a pattern that matches, anywhere in the page, a td tag with class="tmp_outcome", then the bold tag. In XPath, /text() selects the text, and /@href selects the href attribute.

HTML parsing with XPath: flattened hierarchical data

My target HTML is a flattened table of elements with 2 levels of data defined by class attribute:
<tr>
<td class="type">Type 1</td>
</tr>
<tr>
<td class="name">name1</td>
<td class="year">1970</td>
<td class="rank">1</td>
</tr>
<tr>
<td class="name">name2</td>
<td class="year">1982</td>
<td class="rank">3</td>
</tr>
The goal is to parse out lists of the name, year, and rank elements, which I accomplish with these XPath expressions:
//td[@class = 'name']/text()
//td[@class = 'year']/text()
//td[@class = 'rank']/text()
Each element belongs to the type row that most closely precedes it:
<tr>
<td class="type">Type 1</td>
</tr>
I would like to have "Type 1" assigned to each element parsed above. It could be separate list of the same length. Of course, my target HTML contains many such elements within the same 2-level hierarchy: type - element (name, year, rank).
The following rather clumsy XPath concatenates the closest preceding type td to the name td matched above.
concat(//td[@class = 'name']/preceding::td[@class='type'][1]/text(), '-',
//td[@class = 'name']/text())
This probably makes more sense when shown in the following XSL:
<xsl:for-each select="//td[@class='name']">
<Name>
<xsl:value-of select="concat(preceding::td[@class='type'][1]/text(),
'-', ./text())" />
</Name>
</xsl:for-each>
Applied to the following xml
<xml>
<tr>
<td class="type">Type 1</td>
</tr>
<tr>
<td class="name">name1</td>
<td class="year">1970</td>
<td class="rank">1</td>
</tr>
<tr>
<td class="name">name2</td>
<td class="year">1982</td>
<td class="rank">3</td>
</tr>
<tr>
<td class="type">Type 2</td>
</tr>
<tr>
<td class="name">name3</td>
<td class="year">1971</td>
<td class="rank">2</td>
</tr>
<tr>
<td class="name">name4</td>
<td class="year">1983</td>
<td class="rank">4</td>
</tr>
</xml>
With the result
<Name>Type 1-name1</Name>
<Name>Type 1-name2</Name>
<Name>Type 2-name3</Name>
<Name>Type 2-name4</Name>
Solution 1
First, find the td elements of interest. For example, the name tds with the following pseudo-code:
name_tds = doc.evalXPath("//td[@class = 'name']")
Then you can find the corresponding type td using a name td as context node like this:
type_td = name_td.evalXPath("../preceding-sibling::tr[td[@class = 'type']][1]/td")
Solution 2
Simply iterate all the tds and remember the last type you found. Pseudo-code:
foreach (td in doc.evalXPath("//td")) {
    class = td.getAttribute("class");
    if (class == "type") {
        // Remember the most recent type; it applies to the rows that follow.
        type = td.textContent();
    }
    else if (class == "name") {
        name = td.textContent();
        println("type: " + type + ", name: " + name);
    }
    // Same for year and rank.
}
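Solution 2 can be sketched in Ruby using REXML, the XML library bundled with Ruby (an assumption for illustration; the answer names no particular library). This version walks the rows rather than the individual cells, remembers the most recent type, and attaches it to each record that follows. The data is the sample from the earlier answer, wrapped here in a table root element; the names records and current_type are illustrative.

```ruby
require 'rexml/document'

xml = <<~XML
  <table>
    <tr><td class="type">Type 1</td></tr>
    <tr><td class="name">name1</td><td class="year">1970</td><td class="rank">1</td></tr>
    <tr><td class="name">name2</td><td class="year">1982</td><td class="rank">3</td></tr>
    <tr><td class="type">Type 2</td></tr>
    <tr><td class="name">name3</td><td class="year">1971</td><td class="rank">2</td></tr>
    <tr><td class="name">name4</td><td class="year">1983</td><td class="rank">4</td></tr>
  </table>
XML

doc = REXML::Document.new(xml)
records = []
current_type = nil

REXML::XPath.each(doc, '//tr') do |tr|
  # Map each cell's class attribute to its text, e.g. {"name" => "name1", ...}.
  cells = {}
  REXML::XPath.each(tr, 'td') { |td| cells[td.attributes['class']] = td.text }

  if cells.key?('type')
    current_type = cells['type']                      # header row: remember it
  else
    records << cells.merge('type' => current_type)    # data row: tag with last type
  end
end

records.each { |r| puts r.values_at('type', 'name', 'year', 'rank').join(', ') }
```

Each entry in records carries its own type, so the separate name, year, and rank lists can be derived from it with the type kept in step.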

Rails 3 string contains HTML code need to loop through the code in string

I am working on a Rails 3 application where I need to put HTML code into a string variable and pass it to a web service as a parameter.
I have the following code with a loop inside it, but because it is declared inside a string, the loop does not work with either the <% %> or the #{} tags.
@emaildata = "<H3>FLOOR VIEW ACTION REQUEST</H3>
<table border='0' cellspacing='4'>
<tr>
<td>Submitted On:</td>
<td align='left'><strong>#{Date.today}</strong></td>
</tr>
<tr>
<td> Originator: </td>
<td align='left'><strong>#{session[:user_name]}</strong></td>
</tr>
</table>
<table border=0 width=100%>
<tr bgcolor='##006699'>
<td align='center'><font color='##FFFFFF'><strong>ACTION CODE</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>PART<BR />NUMBER</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>LOCATION</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>BIN QTY</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>PACK QTY</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>UM</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>SCAN CODE</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>REASON / COMMENTS</strong></font></td>
</tr>
<% (1..PartNoListInEmail.length).each_index do |i|%>
<tr bgcolor='##E0E5E5'>
<td align='center'>#{@ActionCodeListInEmail[i]}</td>
<td align='center'>#{@PartNoListInEmail[i]}</td>
<td align='center'>#{@SendToListInEmail[i]}</td>
<td align='center'>#{@OrderQtyListInEmail[i]}</td>
<td align='center'>#{@PackQtyListInEmail[i]}</td>
<td align='center'>#{@UMListInEmail[i]}</td>
<td align='center'>#{@ScancodeListInEmail[i]}</td>
<td align='center'>#{@reasonForActionIn[i]}</td>
</tr>
<%end%>
</table>"
Please help me .
Save your HTML as a partial in an .html.erb file, then render that partial to a string in the controller:
@emaildata = render_to_string(:partial => 'some_partial_name', :locals => {:PartNoListInEmail => @PartNoListInEmail})
For combining strings with HTML, you want to use a template system like ERB or Haml. If you don't intend to immediately render the template back to a browser, you can still use ERB by calling it directly, having it process the HTML string and variables and return the result as a string.
Once you go down this road, be extra careful with user-provided content and escape anything untrustworthy. When you render ERB templates normally in Rails, Rails does a fair amount of work for you to avoid these sorts of problems. Once you do something like what your example showed, or use ERB directly, you no longer benefit from Rails' safety checks and will need to put in your own.
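As a minimal sketch of calling ERB directly, the following builds an HTML table string from a template with a loop, without rendering a view. The two data arrays are hypothetical stand-ins for the question's @PartNoListInEmail and @ActionCodeListInEmail, and ERB::Util.html_escape is applied by hand because plain ERB does no escaping on its own.

```ruby
require 'erb'

# Hypothetical sample data; in the question these come from instance variables.
part_numbers = ['P-100', 'P-200']
action_codes = ['ADD', 'A<B']

# <% ... %> runs Ruby code; <%= ... %> inserts the value of an expression.
template = ERB.new(<<~HTML)
  <table>
  <% part_numbers.each_index do |i| %>
    <tr>
      <td><%= ERB::Util.html_escape(action_codes[i]) %></td>
      <td><%= ERB::Util.html_escape(part_numbers[i]) %></td>
    </tr>
  <% end %>
  </table>
HTML

# result_with_hash exposes the hash keys as local variables inside the template.
email_data = template.result_with_hash(
  part_numbers: part_numbers,
  action_codes: action_codes
)

puts email_data
```

The resulting email_data is an ordinary string, so it can be passed to the web service as a parameter. Note that the second action code is escaped to A&lt;B in the output.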