Unable to select only first occurrence of multiple attributes with same name? - html

Here is my html code:
<table id="laptop_detail" class="table">
<tbody>
<tr>
<td style="padding-left:18px" class="ha">Camera Pixels</td>
<td class="val">8 megapixel camera</td>
</tr>
<tr>
<td style="padding-left:36px" class="ha">Camera Pixels</td>
<td class="val">8 megapixel camera</td>
</tr>
</tbody>
and my xpath:
$x('//*[@id="laptop_detail"]//tr/td[contains(., "Camera Pixels")]/following-sibling::td[1]/text()')
My problem is that I cannot find a working way to select only the first occurrence.

Enclose the part locating the "Camera Pixels" td element in parentheses, so that [1] applies to the whole result set rather than to each row:
(//*[@id="laptop_detail"]//tr/td[contains(., "Camera Pixels")])[1]/following-sibling::td
Demo:
$ xmllint index.html --xpath '(//*[@id="laptop_detail"]//tr/td[contains(., "Camera Pixels")])[1]/following-sibling::td'
<td class="val">8 megapixel camera</td>
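
For comparison, here is a minimal Python sketch of the same "collect every match, then keep only the first" idea. It uses only the standard library, whose ElementTree module does not support contains() or parenthesised expressions, so the filtering is done in plain Python. The TABLE string is a well-formed copy of the markup from the question, and the function name matching_values is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the table from the question (closing </table> added).
TABLE = """
<table id="laptop_detail" class="table">
  <tbody>
    <tr>
      <td style="padding-left:18px" class="ha">Camera Pixels</td>
      <td class="val">8 megapixel camera</td>
    </tr>
    <tr>
      <td style="padding-left:36px" class="ha">Camera Pixels</td>
      <td class="val">8 megapixel camera</td>
    </tr>
  </tbody>
</table>
"""

def matching_values(markup, label):
    """Return the text of the cell following every label cell, in document order."""
    root = ET.fromstring(markup)
    values = []
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and label in (cells[0].text or ""):
            values.append(cells[1].text)
    return values

# Every row matches, which is what the unparenthesised XPath returns.
all_matches = matching_values(TABLE, "Camera Pixels")
# Keeping only the first element mirrors the (...)[1] in the answer.
first_match = all_matches[0] if all_matches else None
print(len(all_matches))
print(first_match)
```

With a full XPath 1.0 engine such as lxml, the parenthesised expression from the answer can be passed to xpath() as written.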

Related

Extract weather values from app.weathercloud.net

Hi all, I would like to extract the value 25.8 from this HTML block using XPath.
The HTML code is from a weather website, https://app.weathercloud.net/
<div id="gauge-rainrate"><h3>Intensidad de lluvia</h3><canvas id="rainrate" width="200" height="200"></canvas><div class="summary">
<table>
<tbody><tr>
<th> mm/h</th>
<th class="max"><i class="icon-chevron-up icon-white"></i> Máx </th>
</tr>
<tr>
<td class="grey">Diaria</td>
<td><a id="gauge-rainrate-max-day" rel="tooltip" title="" data-original-title="22/04/2022 00:00">0.0</a></td>
</tr>
<tr>
<td class="grey">Mensual</td>
<td><a id="gauge-rainrate-max-month" rel="tooltip" title="21/04/2022 02:15">25.8</a></td>
</tr>
<tr>
<td class="grey">Anual</td>
<td><a id="gauge-rainrate-max-year" rel="tooltip" title="21/04/2022 02:15">25.8</a></td>
</tr>
</tbody></table>
</div></div>
I use this expression to extract the value into a Google Sheets cell:
=IMPORTXML("https://app.weathercloud.net/d5044837546#current";"//a[@id='gauge-rainrate-max-month']")
The expression looks correct, but my output is always
-
I don't understand why...

Count <tr> HTML tag grouped by their @class attribute name in XPath

I need an XPath expression that counts all the <tr> rows whose class attribute starts with the string room_loop_counter, grouped by the class name itself.
I have the following sample HTML code to extract data from:
<tbody id="container" >
<tr class="room_loop_counter1 maintr">
<td class="legibility " rowspan="6"></td>
<td colspan="4" style="padding:0;"></td>
</tr>
<tr class="room_loop_counter1">
<td ></td>
<td class=""></td>
</tr>
<tr class="room_loop_counter1"></tr>
<tr class="room_loop_counter2 maintr divider"></tr>
<tr class="room_loop_counter2"></tr>
<tr class="room_loop_counter3 maintr divider"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
</tbody>
Given the above HTML, I would want to get 2,1,4 as the result. Each count is the number of elements minus one, since I want to exclude the first <tr> (the one with maintr), which is the header.
There could be other <tr> elements in between, so the rows are not strictly one after the other, and we can't rely on following-sibling or preceding-sibling logic.
I've tried with the following XPath expression :
count(//table[@id="maxotel_rooms"]/tbody/tr[@class=distinct-values(//table[@id="maxotel_rooms"]/tbody/tr[starts-with(@class, "room_loop_counter") and not(contains(@class, "maintr"))]/@class)]/@class])
but it doesn't work in Chrome (evaluating it with $x('') in the console window), since Chrome doesn't recognize the distinct-values function.
Could you suggest a possible solution? What is the best approach ?
Try this XPath. For each distinct class value that starts with room_loop_counter, it selects the last <tr> carrying that class, skipping the maintr header rows:
//tbody/tr[starts-with(@class, "room_loop_counter") and not(contains(@class, "maintr"))]/following::tr[not(./@class=following::tr/@class) and not(contains(@class, "maintr"))]
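JavaScript (to count the nodes matched by a path):
// Placeholder path; substitute the expression you want to count.
var path = "//body/div";
// 0 is XPathResult.ANY_TYPE; wrapping the path in count(...) yields a number result.
var uniquePathCount = window.document.evaluate('count(' + path + ')', window.document, null, 0, null);
console.log( uniquePathCount );
console.log( uniquePathCount.numberValue );
Output of the XPath above: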
<tr class="room_loop_counter1"/>
<tr class="room_loop_counter2"/>
<tr class="room_loop_counter3"/>
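
The XPath above gives one row per class but not the per-class counts. XPath 1.0, which is what Chrome's $x evaluates, has no distinct-values or grouping, so the grouping has to happen in the host language. Below is a minimal Python sketch of that idea using only the standard library. It assumes well-formed markup; the wrapping table element and the function name count_rows_by_class are illustrative, not part of the original page.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample rows from the question, wrapped in a <table> so the string is well-formed.
HTML = """
<table id="maxotel_rooms">
<tbody id="container">
<tr class="room_loop_counter1 maintr"><td class="legibility " rowspan="6"></td></tr>
<tr class="room_loop_counter1"><td></td><td class=""></td></tr>
<tr class="room_loop_counter1"></tr>
<tr class="room_loop_counter2 maintr divider"></tr>
<tr class="room_loop_counter2"></tr>
<tr class="room_loop_counter3 maintr divider"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
<tr class="room_loop_counter3"></tr>
</tbody>
</table>
"""

def count_rows_by_class(markup, prefix="room_loop_counter"):
    """Count non-header <tr> rows per class name that starts with prefix."""
    root = ET.fromstring(markup)
    counts = Counter()
    for tr in root.iter("tr"):
        classes = tr.get("class", "").split()
        if "maintr" in classes:
            continue  # skip the header row of each group
        for name in classes:
            if name.startswith(prefix):
                counts[name] += 1
    return dict(counts)

counts = count_rows_by_class(HTML)
print(counts)
# {'room_loop_counter1': 2, 'room_loop_counter2': 1, 'room_loop_counter3': 4}
```

Because the rows are grouped by class name rather than by position, other <tr> elements in between do not affect the result.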

Xpath grep elements

I'm using Scrapy (Python) to try to grep data from a site.
How can I grep this structure with XPath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I write the proper XPath selector to capture this structure?
I can select all the "h3" tags, all the "a" tags, and all the tags with a class separately, but how can I match them up?
Grep, you say? Strictly speaking that is the wrong word: what you are doing is parsing, or extracting. Whether you are new to Scrapy or to the web development side of things, there is no way I can teach you XPath and regex in one answer here. The only way to learn is to keep at it, but here is my input.
First of all, XPath is especially useful on websites that are not necessarily built to standard, which doesn't make them bad per se. The HTML snippet you gave is reasonably well structured, though, so I'd recommend CSS selectors. These extract the values:
year = response.css('td.tmp_year a::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
Pro tip: whenever you find it necessary, you can save a web page as an HTML file and open it in scrapy shell by giving the file path directly. I saved your HTML snippet to a file on my desktop and then ran:
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
As for XPath, since you asked: it's a piece of cake. Compare the CSS version with the XPath version and you should see how they correspond.
response.css('td.tmp_outcome b::text').extract()
This selects a td tag whose class name is tmp_outcome; the next node is a b (bold) tag, which holds the text, and ::text says we want that text.
response.xpath('//td[@class="tmp_outcome"]/b/text()').extract()
The XPath says the same thing: anywhere in the document (//), find a td tag with class="tmp_outcome", then its b child. In XPath, /text() selects the text, and /@href would select the href attribute.
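
The selectors above return three flat lists, so they do not say which row belongs to which h3. One way to keep them matched is to walk the page section by section: remember the current h3, then read the rows of the table that follows it. Below is a minimal sketch of that loop using only the Python standard library. It assumes well-formed markup in which every table is a sibling directly after its h3; the SAMPLE string is a shortened, cleaned-up version of the snippet in the question, and group_sections is a made-up name.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the question's markup (tables closed, <br/> self-closed).
SAMPLE = """
<div class="foo">
  <h3>Need this text_1</h3>
  <table class="thesamename"><tbody>
    <tr>
      <td class="tmp_year"> 45767 </td>
      <td class="tmp_outcome"><b>Win_1</b><br/><span class="tmp_category">TEST_1</span></td>
    </tr>
    <tr>
      <td class="tmp_year"> 1232004 </td>
      <td class="tmp_outcome"><b>Win_2</b><br/><span class="tmp_category">TEST_2</span></td>
    </tr>
  </tbody></table>
  <h3>Need this text_2</h3>
  <table class="thesamename"><tbody>
    <tr>
      <td class="tmp_year"> 234 </td>
      <td class="tmp_outcome"><b>Win_E</b><br/><span class="tmp_category">TEST_E</span></td>
    </tr>
  </tbody></table>
</div>
"""

def group_sections(markup):
    """Map each h3 title to the (year, outcome, category) rows of the table after it."""
    root = ET.fromstring(markup)
    sections = {}
    current = None
    for child in root:  # direct children of the div, in document order
        if child.tag == "h3":
            current = child.text.strip()
            sections[current] = []
        elif child.tag == "table" and current is not None:
            for tr in child.iter("tr"):
                year = tr.find("td[@class='tmp_year']").text.strip()
                outcome = tr.find("td[@class='tmp_outcome']/b").text.strip()
                category = tr.find(".//span[@class='tmp_category']").text.strip()
                sections[current].append((year, outcome, category))
    return sections

sections = group_sections(SAMPLE)
for title, rows in sections.items():
    print(title, rows)
```

In a Scrapy spider the same structure could be built by looping over the h3 selectors and querying following-sibling::table[1] relative to each one, provided the real page keeps that sibling layout.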

Repeating an HTML snippet that contains elements with id attribute defined

I need to repeat an html snippet several times on a page but the problem is that the contained html elements have id defined that should be unique on page. So what is the proper way I could repeat this snippet without removing the id attribute from the elements ?
Could I use iframe as container for this snippet & let the duplicate ids exist on page?
You can use JavaScript to clone the snippet and at the same time change the IDs.
Example if your snippet is:
<div id="snippet">
<table id="list">
<tr id="0">
<td id="firstName0">John</td>
<td id="lastName0">Smith</td>
</tr>
<tr id="1">
<td id="firstName2">John</td>
<td id="lastName2">Doe</td>
</tr>
</table>
</div>
Using
$("#snippet").clone(false).find("*[id]").andSelf().each(function() { $(this).attr("id", $(this).attr("id") + "_1"); });
Will Produce
<div id="snippet_1">
<table id="list_1">
<tr id="0_1">
<td id="firstName0_1">John</td>
<td id="lastName0_1">Smith</td>
</tr>
<tr id="1_1">
<td id="firstName2_1">John</td>
<td id="lastName2_1">Doe</td>
</tr>
</table>
</div>
Which will solve the duplicate ID problem.
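
If the page is generated on the server, the same clone-and-rename idea can be applied before the HTML is sent. Here is a minimal Python sketch using only the standard library; it assumes the snippet is well-formed, and the function name clone_with_suffix is made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

def clone_with_suffix(element, suffix):
    """Deep-copy element and append suffix to every id attribute in the copy."""
    clone = copy.deepcopy(element)
    for node in clone.iter():  # iter() visits the root of the copy as well
        if "id" in node.attrib:
            node.set("id", node.get("id") + suffix)
    return clone

snippet = ET.fromstring(
    '<div id="snippet"><table id="list">'
    '<tr id="0"><td id="firstName0">John</td><td id="lastName0">Smith</td></tr>'
    '</table></div>'
)

copy_1 = clone_with_suffix(snippet, "_1")
ids = [node.get("id") for node in copy_1.iter()]
print(ids)
print(ET.tostring(copy_1, encoding="unicode"))
```

The original element is left untouched, so it can be cloned again with "_2", "_3" and so on.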

How do I awk a unix text file to a predefined html code?

I don't know HTML (HORRIBLY EMBARRASSED but didn't ever have the need to). I am pretty perspicacious when it comes to UNIX however I am horribly confused with this assignment I have. I know what I need to do but am having the hardest time ever getting started.
I have the following files in my hwk12 directory:
roster.html
roster.txt
sample.html
sample.txt
The following is the content of the roster.html file:
<html>
<body>
<table border=2>
<tr><th>Name</th><th>Username</th><th>Email</th></tr>
<tr>
<td>Nikhil Banerjee</td>
<td>nbanerje</td>
<td>zetapsi796#hotmail.com</td>
</tr>
<tr>
<td>Jeff Nazarian</td>
<td>jnazaria</td>
<td>jeff.nazarian#asu.edu</td>
</tr>
<tr>
<td>Anna Melzer</td>
<td>amelzer</td>
<td>anna.melzer#asu.edu</td>
</tr>
<tr>
<td>Jose Garcia</td>
<td>jgarcia</td>
<td>garcia-j#msn.com</td>
</tr>
<tr>
<td>Jillian Testa</td>
<td>jtesta</td>
<td>jillian.testa#asu.edu</td>
</tr>
<tr>
<td>Clayton Lengelzigich</td>
<td>clengelz</td>
<td><a href="mailto:clayton.lengel-zigich#asu.edu">clayton.lengel-
zigich#asu.edu</a></td>
</tr>
<tr>
<td>Ashley Bennett</td>
<td>abennett</td>
<td>ashley.bennett#asu.edu</td>
</tr>
<tr>
<td>Ann Frost</td>
<td>afrost</td>
<td>ann.frost#asu.edu</td>
</tr>
<tr>
<td>Timothy Whipple</td>
<td>twhipple</td>
<td>tweed#asu.edu</td>
</tr>
<tr>
<td>Wei Shen</td>
<td>wshen</td>
<td>shenwei58#hotmail.com</td>
</tr>
<tr>
<td>Cari Mahon</td>
<td>cmahon</td>
<td>cari.mahon#asu.edu</td>
</tr>
<tr>
<td>Alberto Salas</td>
<td>asalas</td>
<td>alberto2504#msn.com</td>
</tr>
<tr>
<td>Dorothy Haskett</td>
<td>dhaskett</td>
<td>dorothy.haskett#asu.edu</td>
</tr>
<tr>
<td>Criss Bradbury</td>
<td>cbradbur</td>
<td>crissbradbury#hotmaiil.com</td>
</tr>
<tr>
<td>Steve Ellermann</td>
<td>sellerma</td>
<td>cis494#ellermann.com</td>
</tr>
<tr>
<td>Zewdie Bekele</td>
<td>zbekele</td>
<td>zewdiea#aol.com</td>
</tr>
<tr>
<td>Frederic Diziere</td>
<td>fdiziere</td>
<td>fsd#asu.edu</td>
</tr>
<tr>
<td>Matt Bowes</td>
<td>mbowes</td>
<td>matt.bowes#asu.edu</td>
</tr>
<tr>
<td>Jasen Meece</td>
<td>jmeece</td>
<td>jasen.meece#sun.com</td>
</tr>
<tr>
<td>Aaron Carpenter</td>
<td>acarpent</td>
<td>aaron.carpenter#asu.edu</td>
</tr>
<tr>
<td>Binqin Xi</td>
<td>bxi</td>
<td>binqin.xi#asu.edu</td>
</tr>
<tr>
<td>Yinting Chan</td>
<td>ychan</td>
<td>yin.chen#asu.edu</td>
</tr>
<tr>
<td>Michael Evans</td>
<td>mevans</td>
<td>michael.evans#asu.edu</td>
</tr>
<tr>
<td>Herman Beringer</td>
<td>hberinge</td>
<td>jber#cox.net</td>
</tr>
<tr>
<td>Andrew Jolley</td>
<td>ajolley</td>
<td>andrew#andrewjolley.com</td>
</tr>
<tr>
<td>Michael Raby</td>
<td>mraby</td>
<td>mike1071#yahoo.com</td>
</tr>
<tr>
<td>Hajar Alaoui</td>
<td>halaoui</td>
<td>hajar6#hotmail.com</td>
</tr>
<tr>
<td>Anne Lemar</td>
<td>alemar</td>
<td>anne.lemar#asu.edu</td>
</tr>
<tr>
<td>Russell Crotts</td>
<td>rcrotts</td>
<td>Russell.Crotts#asu.edu</td>
</tr>
<tr>
<td>Dan Mazzola</td>
<td>dmazzola</td>
<td>dan.mazzola#sun.com</td>
</tr>
<tr>
<td>Bill Boyton</td>
<td>bboyton</td>
<td>boytonb#earthlink.net</td>
</tr>
</table>
</body>
</html>
The following is the content of the roster.txt file:
Whipple Timothy tweed#asu.edu Shen Wei shenwei58#hotmail.com
Mahon Cari cari.mahon#asu.edu Salas Alberto alberto2504#msn.com
Haskett Dorothy dorothy.haskett#asu.edu Bradbury Criss
crissbradbury#hotmaiil.com Ellermann Steve
cis494#ellermann.com Bekele Zewdie zewdiea#aol.com Diziere Frederic
fsd#asu.edu Bowes Matt matt.bowes#asu.edu Meece Jasen
jasen.meece#sun.com Carpenter Aaron aaron.carpenter#asu.edu
Xi Binqin binqin.xi#asu.edu Chan Yinting yin.chen#asu.edu
Evans Michael michael.evans#asu.edu Beringer Herman
jber#cox.net Jolley Andrew andrew#andrewjolley.com Raby Michael
mike1071#yahoo.com Alaoui Hajar hajar6#hotmail.com Lemar Anne
anne.lemar#asu.edu Crotts Russell Russell.Crotts#asu.edu Mazzola Dan
dan.mazzola#sun.com Boyton Bill boytonb#earthlink.net
The following is the content of the sample.html file:
<html>
<body>
<table border=2>
<tr><th>Name</th><th>Username</th><th>Email</th></tr>
<tr>
<td>Michael Raby</td>
<td>mraby</td>
<td>mike1071#yahoo.com</td>
</tr>
<tr>
<td>Hajar Alaoui</td>
<td>halaoui</td>
<td>hajar6#hotmail.com</td>
</tr>
<tr>
<td>Anne Lemar</td>
<td>alemar</td>
<td>anne.lemar#asu.edu</td>
</tr>
<tr>
<td>Russell Crotts</td>
<td>rcrotts</td>
<td>Russell.Crotts#asu.edu</td>
</tr>
<tr>
<td>Dan Mazzola</td>
<td>dmazzola</td>
<td>dan.mazzola#sun.com</td>
</tr>
<tr>
<td>Bill Boyton</td>
<td>bboyton</td>
<td>boytonb#earthlink.net</td>
</tr>
</table>
</body>
</html>
The following is the content of the sample.txt file:
Raby Michael mike1071#yahoo.com
Alaoui Hajar hajar6#hotmail.com
Lemar Anne anne.lemar#asu.edu
Crotts Russell Russell.Crotts#asu.edu
Mazzola Dan dan.mazzola#sun.com
Boyton Bill boytonb#earthlink.net
I'm not asking for someone to do this for me, because I LOVE UNIX and I want to learn it myself. Every time I look at this HTML code I confuse the #$$#& out of myself. I need help getting started.
The homework prompt is the following:
You are to write a nawk(1) script called ~/hwk12/mk_html.awk that converts a text file (sample.txt and roster.txt) to an html page that a web browser can read. I have given you the output in the file sample.html which is reproduced below (notice how each level of indentation is two spaces deep):
Again, I don't want someone to do this for me. I'm just confused as to how the data in the text file gets turned into the HTML table when the text file contains no HTML code. Can someone please help me get started?
Looks like you'll need to define the necessary HTML tags within your script. The meat of the html file will be these lines:
<tr>
<td>$first $last</td>
<td>$username</td>
<td>$email</td>
</tr>
These tags define a table row. You can parse the variables from the text files with awk and use them to fill in the html. The other html markup can be copy-pasted as static text into the output html file.
Edit: You can do this to grab the first and last name and print to the html file.
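# Main rule: runs once for every input record (line).
{
  last = $1    # first field is the last name
  first = $2   # second field is the first name
  print " <tr>"
  print " <td>" first " " last "</td>"
  print " </tr>"
}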
You just need to expand that to get the email and username.
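
To show the overall shape without writing the nawk script for you, here is a rough sketch of the same flow in Python: static header, one table row per person, static footer (in awk these would map to a BEGIN block, the main rule and an END block). Two things in it are guesses from the sample files, not from the assignment text: the username looks like the first initial plus the last name, lowercased and cut to 8 characters, and roster.txt lets a record wrap across lines, so the sketch reads whitespace-separated tokens in groups of three instead of line by line. The names, the example.com addresses and the exact indentation are made up.

```python
# Made-up input in the "Last First email" layout; the second record wraps across lines.
SAMPLE_TXT = """Doe Jane jane.doe@example.com Longlastname Sam
sam@example.com
"""

def rows_from_text(text):
    """Yield (full name, username, email) for every three whitespace-separated tokens."""
    tokens = text.split()
    for i in range(0, len(tokens) - 2, 3):
        last, first, email = tokens[i:i + 3]
        # Assumed rule: first initial + last name, lowercased, at most 8 characters.
        username = (first[0] + last).lower()[:8]
        yield f"{first} {last}", username, email

def to_html(text):
    """Wrap the rows in the static markup, indenting two spaces per level."""
    lines = [
        "<html>",
        "  <body>",
        "    <table border=2>",
        "      <tr><th>Name</th><th>Username</th><th>Email</th></tr>",
    ]
    for name, username, email in rows_from_text(text):
        lines += [
            "      <tr>",
            f"        <td>{name}</td>",
            f"        <td>{username}</td>",
            f"        <td>{email}</td>",
            "      </tr>",
        ]
    lines += ["    </table>", "  </body>", "</html>"]
    return "\n".join(lines)

rows = list(rows_from_text(SAMPLE_TXT))
page = to_html(SAMPLE_TXT)
print(page)
```

The point to carry over to awk is that only the rows depend on the input; everything else is fixed text printed before and after the records are processed.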