Watir/Ruby selecting next value - html

I am working with a table that has links in the first column:
html = Nokogiri::HTML(browser.html)
html.css('tr td a').each do |a|
  browser.link(:text => a.text).click
  puts a.text
end
How do I display the NEXT value for the link?
If the link name is abcd but the next one is efgh, how do I get it to write efgh?

You should be able to achieve this using the index of the array you are working with. Note the exclusive range (...), which stops one short of the end so the last element, which has no successor, is skipped:
thing = ['a', 'b', 'c', 'd']
(0...thing.length - 1).each do |index|
  puts thing[index + 1]
end
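An alternative sketch, if you only ever need each value paired with the one after it, is Enumerable#each_cons, which avoids the index arithmetic entirely (same sample array as above):

```ruby
thing = ['a', 'b', 'c', 'd']

# each_cons(2) yields every consecutive pair, so the "next" value is
# simply the second element of each pair; the last element, which has
# no successor, never appears as a current value.
next_values = thing.each_cons(2).map { |_current, next_value| next_value }

puts next_values.inspect #=> ["b", "c", "d"]
```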

I don't understand the use case here (not at all), but this contrived example might point you in the direction that you're looking to go.
Use the links method to create an array of link objects. Then, you can print the text for the element at the second position but click the element at the first position.
require 'watir-webdriver'
b = Watir::Browser.new
b.goto('http://www.iana.org/domains/reserved')
nav_links = b.div(:class => "navigation").links
puts nav_links[1].text #=> NUMBERS
nav_links[0].click
puts b.url #=> http://www.iana.org/domains
The Enumerable#each_with_index method might also be useful, since it cycles through each element of an array and additionally yields that element's position. For example:
b.div(:class => "navigation").links.each_with_index { |el, i| puts el.text, i }
#=> DOMAINS
#=> 0
#=> NUMBERS
#=> 1
#=> PROTOCOLS
#=> 2
...

Related

How to search for nodes at the uppermost levels only in Ruby-Nokogiri?

HTML (particularly MathML) can be heavily nested. With Ruby-Nokogiri, I want to search for a node at the uppermost levels, which are arbitrary, within a parent node. Here is an example HTML/MathML.
<math><semantics>… (arbitrary depth)
  <mrow> (call it (1))
    <mrow> (1-1)
  <mrow> (2)
    <mrow> (2-1)
    <mrow> (2-2)
For a Nokogiri::HTML object for it, page, page.css("math mrow") returns a NodeSet of all the <mrow> nodes, with a size of 5 in this case, the last node being "<mrow> (2-2)".
My goal is to identify the last <mrow> node at the upper-most level, i.e., "<mrow> (2)" in the example above (so that I can add another node after it).
In other words, I want to get the "last node of a certain kind at the shallowest depth among all the nodes of the kind". The depth of the uppermost level for the type of node is unknown and so I cannot limit the depth for the search.
If you want the uppermost mrow node in terms of depth, you could select among all :first-of-type the one with the least number of ancestors:
first_mrow = page.css('mrow:first-of-type').min_by.with_index { |node, index| [node.ancestors.size, index] }
Adding with_index ensures that for nodes with identical number of ancestors, the first one will be picked.
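To see that tie-breaking behaviour in isolation, here is a hedged sketch on a plain array (the names and depths are invented stand-ins for nodes and their ancestor counts):

```ruby
# Each pair stands in for [node, depth]; the last two entries tie on depth 1.
items = [['deep', 3], ['first_shallow', 1], ['second_shallow', 1]]

# min_by.with_index compares [depth, index], so among elements with equal
# depth the earlier (lower-index) one wins, exactly as in the answer above.
winner = items.min_by.with_index { |(_name, depth), index| [depth, index] }

puts winner.first #=> first_shallow
```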
To get the first mrow node from the start of the document (regardless of depth), you could simply use:
first_mrow = page.at_css('mrow')
With the first mrow node you can then select its parent node:
parent = first_mrow.parent
and finally retrieve the last element from the parent's (immediate) mrow nodes:
last_mrow = parent.css('> mrow').last
The latter can also be expressed via the :last-of-type CSS pseudo-class:
last_mrow = parent.at_css('> mrow:last-of-type')
Sounds like a breadth-first search problem at first glance: add a node to the queue; if it is not of the desired type, remove it and add its children to the queue in reversed order; repeat until you find the desired node. It will be the shallowest one because of BFS properties, and the last one at that depth because we add the children reversed.
Quick and dirty example:
require "nokogiri"

def find_last_shallowest(root)
  queue = [root]
  while queue.any?
    element = queue.shift
    break element if matching?(element)
    queue += element.children.reverse
  end
end

def matching?(element)
  # Put your matching logic here
  element.name == "m"
end
doc = <<~XML
  <foo>
    <bar>
      <m>
        <x></x>
      </m>
    </bar>
    <baz>
    </baz>
    <m>
      <y></y>
    </m>
    <m>
      <z></z>
    </m>
  </foo>
XML
xml = Nokogiri::XML(doc)
find_last_shallowest(xml.root) # => #(Element:0xf744 { name = "m", children = [ #(Text "\n "), #(Element:0xf758 { name = "z" }), #(Text "\n ")] })
It finds the m node that has z as a child, i.e. the last m at the shallowest depth.
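The BFS property the answer relies on can be illustrated without Nokogiri at all; this is a sketch on a plain nested Array tree (the [name, *children] node shape is an assumption made purely for illustration):

```ruby
# Each node is [name, *children]. BFS with children pushed in reverse
# order: the first match dequeued is the shallowest, and among nodes at
# that depth it is the last one in document order.
def find_last_shallowest(root, target)
  queue = [root]
  while (node = queue.shift)
    name, *children = node
    return node if name == target
    queue.concat(children.reverse)
  end
  nil
end

tree = ['foo',
        ['bar', ['m', ['x']]],  # a deeper "m"
        ['m', ['y']],           # shallow "m" (first)
        ['m', ['z']]]           # shallow "m" (last): the expected result

p find_last_shallowest(tree, 'm') #=> ["m", ["z"]]
```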

scraping - find the last 5 score of each match - in html

I would like your help to get the last 5 scores of each match; I can't get it, please help me.
from selenium import webdriver
import pandas as pd
from pandas import ExcelWriter
from openpyxl.workbook import Workbook
import time as t
import xlsxwriter

pd.set_option('display.max_rows', 5, 'display.max_columns', None, 'display.width', None)

browser = webdriver.Firefox()
browser.get('https://www.mismarcadores.com/futbol/espana/laliga/resultados/')
print("Current Page Title is : %s" % browser.title)

aux_ids = browser.find_elements_by_css_selector('.event__match.event__match--static.event__match--oneLine')

ids = []
i = 0
for aux in aux_ids:
    if i < 1:
        ids.append(aux.get_attribute('id'))
        i += 1

data = []
for idt in ids:
    id_clean = idt.split('_')[-1]
    browser.execute_script("window.open('');")
    browser.switch_to.window(browser.window_handles[1])
    browser.get(f'https://www.mismarcadores.com/partido/{id_clean}/#h2h;overall')
    t.sleep(5)
    p_ids = browser.find_elements_by_css_selector('h2h-wrapper')
    # here the code for the last 5 scores of each match
I believe you can use your Firefox browser, but I have not tested with it. I use Chrome, so if you want to use chromedriver, check your browser version, download the matching driver, and add it to your system path. The only drawback of this approach is that it keeps a browser window open until the page is loaded (because we are waiting for the JavaScript to generate the match data). If you need anything else, let me know. Good luck!
https://chromedriver.chromium.org/downloads
Known issues: Sometimes it will throw an index-out-of-range error when retrieving match data. This is something I am looking into, because it looks like the XPath on each link sometimes changes a little.
from selenium import webdriver
from lxml import html
from lxml.html import HtmlElement


def test():
    # Here we specify the urls, for testing purposes
    urls = ['https://www.mismarcadores.com/partido/noIPZ3Lj/#h2h;overall']
    # a loop to go over all the urls
    for url in urls:
        # We print a string formatted with the url we are currently checking, together with the
        # result of the function get_last_5(url), where url is the current url in the for loop.
        print("Scores after this match {u}".format(u=url), get_last_5(url))
def get_last_5(url):
    print("processing {u}, please wait...".format(u=url))
    # here we get an instance of the webdriver
    browser = webdriver.Chrome()
    # now we pass the url we want to get
    browser.get(url)
    # in this variable, we will "store" the html data as a string. We get it from here because we need to wait for
    # the page to load and execute its javascript code in order to generate the matches data.
    innerHTML = browser.execute_script("return document.body.innerHTML")
    # Now we assign this to a variable of type HtmlElement
    tree: HtmlElement = html.fromstring(innerHTML)
    # the following variables, first_team, second_team, match_date and rows, are obtained via the xpath() method.
    # To find an xpath, open one of the urls in Chrome and inspect the DOM: right click on the element -> Inspect ->
    # the inspect panel appears with the clicked element selected -> right click on it -> Copy -> Copy XPath.
    # first_team, second_team and match_date are obtained from the "title" section. Rows are obtained from the
    # table of last matches in the tbody content.
    # xpath() returns a list of HtmlElement objects, because it tries to find all the elements that match our
    # xpath; that is why we use [0] (to get the first element of the list). This gives us access to an
    # HtmlElement object, so now we can access its text attribute.
    first_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[1]/div[2]/div/div/a')[0].text
    print(type(first_team))
    second_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[3]/div[2]/div/div/a')[0].text
    # [0:8] is used to slice the string, because the title also contains the time of the match, e.g. (10.08.2020
    # 13:00). To compare it against each row we only need (10.08.20), so we take 8 characters from position 0 ([0:8]).
    match_date = tree.xpath('//*[#id="utime"]'.replace('#', '@'))[0].text[0:8]
    # when getting the first element with [0], we get an HtmlElement object (the "table" that holds all the matches
    # data), so we want all of its children, which are the "rows" (elements) inside it. getchildren() also
    # returns a list of objects of type HtmlElement. In this case we also slice the list with [:-1],
    # because the last element inside the "table" is the button "Mostrar mas partidos", which we want to drop.
    rows = tree.xpath('//*[@id="tab-h2h-overall"]/div[1]/table/tbody')[0].getchildren()[:-1]
    # we quit the browser since we do not need it anymore; we could do it right after assigning innerHTML, but there
    # is no harm doing it here, unless you wish to close it before all these variable assignments.
    browser.quit()
    # this match_position variable will be the position of the match we currently have in the title.
    match_position = None
    # Now we iterate over the rows and find the match. range(len(rows)) just gives us the row indices so we know
    # when to stop iterating.
    for i in range(len(rows)):
        # now we call the is_match function with the following parameters: first_team, second_team, match_date and
        # the current row, which is rows[i]. If the function returns True, we found the match position and assign
        # (i + 1) to the match_position variable; i + 1 because we iterate from 0.
        if is_match(first_team, second_team, match_date, rows[i]):
            match_position = i + 1
            # now we stop the loop; no need to go further once we find it.
            break
    # guard: if the title match was not found in the table, there are no following scores to return.
    if match_position is None:
        return []
    # Since we only want the following 5 match scores, we need to check whether we have 5 rows beneath our match.
    # If adding 5 to the match position is less than the number of rows, we can take 5; if not, we only get the
    # rows beneath it (maybe 0, 1, 2, 3 or 4 rows).
    if (match_position + 5) < len(rows):
        # Again we slice the list, twice: [match_position:] drops all the rows before the match position, then on
        # the list obtained from that we do [:5], which starts from position 0 and stops at 5 [start:stop]. We use
        # rows = rows because slicing returns a new list, so you cannot chain rows[match_position:][:5] without
        # assigning it to a variable. I am reusing the same variable, but you can assign it to a new one if you wish.
        rows = rows[match_position:][:5]
    else:
        # since we do not have enough rows, just get the rows beneath our position.
        rows = rows[match_position:]
    # Now, to get the list of scores: each row (<tr> element in html) has 6 td elements inside it, and the one at
    # index 4 holds the score of the match. Inside each "score element" we have a span element and then a strong
    # element, something like:
    # <tr>
    #     <td></td>
    #     <td></td>
    #     <td></td>
    #     <td></td>
    #     <td><span><strong>1:2</strong></span></td>
    #     <td></td>
    # </tr>
    # That being said, since each row is an HtmlElement object, we can loop as follows:
    scores = []
    for row in rows:
        data = row.getchildren()[4].getchildren()[0].text_content()
        # not the best way, but we get all the text content of the element, in this case the span element.
        # If the string has more than 5 characters, e.g. "1 : 2" became "1 : 2(0 : 1)", then we want to slice it
        # and keep the last part: from -6 (the whitespace before the 2 in our example) to -1 (just before the
        # final ')'). Ternary expression: if the length of the string is exactly 5, this is our score;
        # otherwise we slice.
        score = data if len(data) == 5 else data[-6:-1]
        scores.append(score)
    print("finished processing {u}.".format(u=url))
    # now we return the scores
    return scores
def is_match(t1, t2, match_date, row):
    # for each row we compare t1, t2 and match_date (obtained from the title) with the row's team1, team2 and
    # date. Each row has 6 elements inside it. Please read all the code in get_last_5 before reading this
    # explanation. For this row, the date is at position 0, team1 at 2, team2 at 3.
    # <td><span>10.03.20</span></td>
    date = row.getchildren()[0].getchildren()[0].text
    # <td><span>TeamName</span></td> (when the team lost) or
    # <td><span><strong>TeamName</strong></span></td> (when the team won)
    team1element = row.getchildren()[2].getchildren()[0]  # this is the span element
    # using a ternary expression (condition_if_true if condition else condition_if_false)
    # https://book.pythontips.com/en/latest/ternary_operators.html
    # if the span element has children (len(getchildren()) > 0), then the team name is
    # team1element.getchildren()[0].text, which is the text of the strong element; if not, just get the text
    # from the span element.
    mt1 = team1element.getchildren()[0].text if len(team1element.getchildren()) > 0 else team1element.text
    # repeat the same for team 2
    team2element = row.getchildren()[3].getchildren()[0]
    mt2 = team2element.getchildren()[0].text if len(team2element.getchildren()) > 0 else team2element.text
    # basically we could compare only the date, but just to be sure we compare the names as well. If the dates
    # and the names are the same, this is our match row.
    if match_date == date and t1 == mt1 and t2 == mt2:
        # we found it, so return True
        return True
    # if not the same, return False
    return False

How to get value of a hidden element? (Watir)

just wondering, how can I get the value of a hidden element using watir? This is the element:
<input type="hidden" value="randomstringhere" id="elementid" name="elementname" />
And this is my code atm:
require "rubygems"
require "watir-webdriver"
$browser = Watir::Browser.new :ff
$browser.goto("http://www.site.com")
$grabelement = $browser.hiddens(:id, "elementid")
$blah = $grabelement.attribute_value("value")
puts $blah
This gets stuck at the last line, where it returns
code.rb:6:in `<main>': undefined method `attribute_value' for #<Watir::HiddenCollection:0x8818adc> (NoMethodError)
Sorry for the basic question, I've had a search and couldn't find anything.
Thanks in advance!
Problem
Your code is quite close. The problem is the line:
$grabelement = $browser.hiddens(:id, "elementid")
This line says to get a collection (ie all) of hidden elements that have id "elementid". As the error message says, the collection does not have the attribute_value method. Only elements (ie the objects in the collection) have the method.
Solution (assuming single hidden with matching id)
Assuming that there is only one, you should just get the first match using the hidden instead of hiddens (ie drop the s):
$grabelement = $browser.hidden(:id, "elementid")
$blah = $grabelement.value
puts $blah
#=> "randomstringhere"
Note that for the value attribute, you can just do .value instead of .attribute_value('value').
Solution (if there are multiple hiddens with matching id)
If there actually are multiple, then you can iterate over the collection or just get the first, etc:
#Iterate over each hidden that matches
browser.hiddens(:id, "elementid").each{ |hidden| puts hidden.value }
#Get just the first hidden in the collection
browser.hiddens(:id, "elementid").first.value

Ruby: Change class based on array value

I'm looking to create an HTML structure with classes based on the values of arrays from Ruby.
I have 6 classes that will be applied to different elements on an 8x8 grid.
Each row will be a div with 8 span elements inside. In ruby, each nested array will be the div row and then each element will be a span assigned a class based on the value of the array element.
a = [[1,4,3,2,2,3,1,4],
     [4,5,6,6,3,2,3,5]]
So two rows will be created with 8 elements inside with the appropriate classes.
Is it possible to convert data structures to HTML like this in Ruby?
Maybe this is what you want:
a = [[1,4,3,2,2,3,1,4],
     [4,5,6,6,3,2,3,5]]

html = ''
a.each do |row|
  html << "<div>%s</div>" % row.map { |c| %{<span class="#{c}"></span>} }.join
end
# puts html
update
In other words:
html = a.map do |row|
  "<div>%s</div>" % row.map { |c| %{<span class="#{c}"></span>} }.join
end.join
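As a quick usage sketch, running the map version on a trimmed two-row grid (shortened here purely for readability) yields one div per row with one classed span per cell:

```ruby
a = [[1, 4], [2, 3]]  # trimmed sample grid

# Same transformation as the answer above: each row becomes a <div>,
# each cell a <span> whose class is the cell's value.
html = a.map do |row|
  "<div>%s</div>" % row.map { |c| %{<span class="#{c}"></span>} }.join
end.join

puts html
#=> <div><span class="1"></span><span class="4"></span></div><div><span class="2"></span><span class="3"></span></div>
```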
Umm, yeah, something along the lines of:
a.each do |sub_array|
  puts "<div>"
  sub_array.each do |element|
    puts "<span class=\"#{element}\">Some text</span>"
  end
  puts "</div>"
end
If this doesn't fit your needs please post a more specific question.

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a,b,c,etc..) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'

# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
  Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')

page_content.each do |i|
  puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?
When you require 'open-uri', you don't need to redefine open with Net::HTTP.
require 'open-uri'

page_content = open('http://www.stackoverflow.com').read

histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
See the section "Following Redirection" on the Net::HTTP Documentation here
Stripping html tags without Nokogiri
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615
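Putting the pieces together, here is a sketch of the letter histogram the question actually asks for: strip the tags with the regex above, then count only letters, case-folded. The sample HTML string is invented for illustration.

```ruby
# Invented sample input standing in for a fetched page.
html = "<html><body><h1>Object Moved</h1>This document may be found here</body></html>"

# Strip tags with the same regex as the answer above.
text = html.gsub(/<\/?[^>]*>/, "")

# Count only letters, case-folded, using a Hash with a default of 0.
histogram = Hash.new(0)
text.downcase.each_char { |c| histogram[c] += 1 if c =~ /[a-z]/ }

p histogram['o'] #=> 4
```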