Why is this XPath expression 'div[@class="jobsearch-SerpJobCard unifiedRow row result clickcard"]' not working? - html

I am developing a crawler for indeed.com. When I implemented the XPath lookup, it didn't work. Here is my expression, tested in the Chrome developer console; it only returns an empty list.
$x('div[@class="jobsearch-SerpJobCard unifiedRow row result clickcard"]')
And here is the original HTML code.
I want to crawl the things inside the clickcard, and I am confused about how to fix this problem.

You have to use the following XPath in Scrapy to output the desired list of elements (remove the clickcard part):
response.xpath('//div[@class="jobsearch-SerpJobCard unifiedRow row result"]').getall()
Always check what Scrapy returns to you (HTML code in response) before applying your XPath expression.
Output for the given link: 18 elements.
In the JS console of your browser you have to input:
$x("//div[@class='jobsearch-SerpJobCard unifiedRow row result clickcard']")
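The root cause can also be illustrated outside the browser: `$x('div[...]')` evaluates the path relative to the document node, whose only element child is `html`, so a bare `div` step matches nothing, while the `//` prefix searches all descendants. A minimal Python sketch using the standard library's ElementTree as a stand-in for the browser, with a simplified page:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the Indeed results page
page = ET.fromstring(
    '<html><body>'
    '<div class="jobsearch-SerpJobCard unifiedRow row result">Job 1</div>'
    '</body></html>'
)

# A relative path from the root looks only at direct children of <html>,
# so it finds nothing, just like $x('div[...]') in the console
print(page.findall("div[@class='jobsearch-SerpJobCard unifiedRow row result']"))  # []

# A descendant search (the // prefix in full XPath) finds the card
cards = page.findall(".//div[@class='jobsearch-SerpJobCard unifiedRow row result']")
print(len(cards))  # 1
```

Note that ElementTree only supports a small XPath subset, but it is enough to show the relative-vs-descendant distinction at play here.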

Related

ROBOTFRAMEWORK - Looping through all images on a page - pulling the link

I am working on a test that checks that all images on a page are visible. I'm running into an issue where it's only pulling the link from the first img on the page and logging it for the length of the loop. I'm currently getting a count of all the images, and in that count I loop through and pull the img source. There are no special classes or ids. The only thing I have to go off of is . I'm guessing I will somehow need to parse the entire HTML, since Robot Framework only looks at what is viewable on the screen?
My end goal is to pull all img sources on a page and confirm each one returns a 200 status code.
Here is what I have now:
@{all_image_sources}    Create List
${all_images}    Get Element Count    //body//img
FOR    ${image}    IN RANGE    ${all_images}
    ${img_src}    Get Element Attribute    tag:img    src
    Log    ${img_src}
    Append To List    ${all_image_sources}    ${img_src}
END
Log List    ${all_image_sources}
You might consider using Get WebElements; this will give you each image locator in a list. You can then loop through the list to get each src attribute.
example:
@{all_image_sources}    Create List
${all_images}    Get WebElements    //body//img
FOR    ${image}    IN    @{all_images}
    ${img_src}    Get Element Attribute    ${image}    src
    Append To List    ${all_image_sources}    ${img_src}
END
Log List    ${all_image_sources}
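For the stated end goal, the "collect every img source" half can also be sketched in plain Python with the standard library's html.parser, which sees the full HTML rather than only what is rendered on screen. The `ImgSrcCollector` name is just for illustration:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

collector = ImgSrcCollector()
collector.feed('<body><img src="a.png"/><p>text</p><img src="b.png"/></body>')
print(collector.sources)  # ['a.png', 'b.png']
```

Each collected src could then be requested (e.g. with urllib) to confirm it returns a 200 status code.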

Extract values from HTML when parent div contains a specific word (multi-nested divs)

I copy the HTML of a "multi-select" list from a page, which looks like this:
and then paste the HTML version (after beautifying it online) into a Notepad++ page.
I now want to use regex in order to extract the lines that are enabled in that list. In other words, I want to see which options I had selected from that dropdown. There are many lines and it is impossible to scroll and find them all. So, the best way in my mind is to use that HTML and search for the divs that contain "enabled". Then, the inner divs should have the values that I am looking for.
The HTML is shown below:
<div class="ui-multiselect-option-row" data-value="1221221111">
<div class="ui-multiselect-checkbox-wrapper">
<div class="ui-multiselect-checkbox"></div>
</div>
<div class="ui-multiselect-option-row-text">(BASE) OneOneOne (4222512512)</div>
</div>
<div class="ui-multiselect-option-row ui-multiselect-option-row-selected" data-value="343333434334">
<div class="ui-multiselect-checkbox-wrapper">
<div class="ui-multiselect-checkbox"></div>
<div class="ui-multiselect-checkbox-selected">✔</div>
</div>
<div class="ui-multiselect-option-row-text">(BASE) TwoTwoTwo (5684641230)</div>
</div>
The outcome should return the following value only (based on the above):
(BASE) TwoTwoTwo (5684641230)
So far, I have tried using the following regex in notepad++:
<div class="ui-multiselect-option-row ui-multiselect-option-row-selected"(.*?)(?=<div class="ui-multiselect-option-row")
but it is impossible to mark all the lines at the same time and remove the unmarked ones; Notepad++ only marks the first line of the entire selection. So I am wondering whether there is a better way, a more complex regex that can parse the value directly. In short:
a) I either want to make the above work with another regex in Notepad++ (I am open to Visual Studio if that makes it faster),
b) or find an easier way using the Chrome console to parse the selected values. I would still like to see the regex solution, but for the Chrome console I have an update below.
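As a sanity check for the regex route, the same idea can be tried in Python's `re` module with the DOTALL flag (the equivalent of ticking ". matches newline" in Notepad++): match from the `-selected` class marker, lazily, up to the option-row-text div, and capture its text. A sketch against a trimmed version of the sample HTML:

```python
import re

# Trimmed version of the sample: one unselected row, one selected row
html = '''
<div class="ui-multiselect-option-row" data-value="1221221111">
  <div class="ui-multiselect-option-row-text">(BASE) OneOneOne (4222512512)</div>
</div>
<div class="ui-multiselect-option-row ui-multiselect-option-row-selected" data-value="343333434334">
  <div class="ui-multiselect-option-row-text">(BASE) TwoTwoTwo (5684641230)</div>
</div>
'''

# DOTALL lets .*? cross line breaks; the lazy quantifier stops at the
# first option-row-text after each "-selected" marker
pattern = re.compile(
    r'ui-multiselect-option-row-selected.*?'
    r'ui-multiselect-option-row-text">([^<]+)<',
    re.DOTALL,
)
print(pattern.findall(html))  # ['(BASE) TwoTwoTwo (5684641230)']
```

The same pattern with ". matches newline" enabled should behave equivalently in Notepad++'s search dialog.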
Update 1:
I used this line $('div.ui-multiselect-option-row-selected > div:nth-child(2)')
and all I need now, as I am not that familiar with exporting from the Chrome console, is to get the innerHTML from the following lines:
Update 2:
for (var b in $('div.ui-multiselect-option-row-selected > div:nth-child(2)')){
    console.log($('div.ui-multiselect-option-row-selected > div:nth-child(2)')[b].innerHTML);
}
which works, and now I only have to export the outcome
Open up Chrome's Console tab and execute this:
$x('//div[contains(@class, "ui-multiselect-option-row-selected")]/div[@class="ui-multiselect-option-row-text"]/text()')
Here is how it should look using your limited HTML sample but duplicated.
If you have multiple multi-selects and no unique identifier then count which one you need to target (notice the [1]):
$x('//div[contains(@class, "ui-multiselect-option-row-selected")][1]/div[@class="ui-multiselect-option-row-text"]/text()')
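If you want to verify the XPath approach outside the browser, Python's standard-library ElementTree can run a close variant. It only supports a small XPath subset (no contains()), so this sketch matches the full class string exactly instead:

```python
import xml.etree.ElementTree as ET

# The sample rows, wrapped in a single root element so they parse as XML
root = ET.fromstring('''<root>
<div class="ui-multiselect-option-row" data-value="1221221111">
  <div class="ui-multiselect-option-row-text">(BASE) OneOneOne (4222512512)</div>
</div>
<div class="ui-multiselect-option-row ui-multiselect-option-row-selected" data-value="343333434334">
  <div class="ui-multiselect-option-row-text">(BASE) TwoTwoTwo (5684641230)</div>
</div>
</root>''')

# Exact class match stands in for contains(@class, "...-selected")
selected = root.findall(
    ".//div[@class='ui-multiselect-option-row ui-multiselect-option-row-selected']"
    "/div[@class='ui-multiselect-option-row-text']"
)
print([d.text for d in selected])  # ['(BASE) TwoTwoTwo (5684641230)']
```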
All you have to do is use a CSS selector followed by a .map to get all the elements' innerHTML in a list:
[...$('div.ui-multiselect-option-row-selected > div:nth-child(2)')].map(n => n.innerHTML)
The CSS selector is div.ui-multiselect-option-row-selected > div:nth-child(2), which, as I've already mentioned in my comment, selects the 2nd immediate child of every div with the ui-multiselect-option-row-selected class.
Then we just use some JavaScript to turn the result into an array and map over it to extract all the innerHTML, as you asked.
If the list is sufficiently big, you might consider storing the result of [...$('div.ui-multiselect-option-row-selected > div:nth-child(2)')].map(n => n.innerHTML) in a variable using
const e = [...$('div.ui-multiselect-option-row-selected > div:nth-child(2)')].map(n => n.innerHTML);
and then doing
copy(e);
This will copy the list to your clipboard; wherever you press Ctrl + V now, you'll end up pasting the list.

Xpath not getting content

I've tried looking through a bunch of answers already related to this, but I'm very unfamiliar with xpath and I'm a bit stuck.
I'm trying to just grab some information from a website, but I keep getting "Imported content is empty" when I try to use IMPORTXML.
Here's an example of the page I'm trying to read from (it's a college football simulator for running games. This call is Alabama vs Oklahoma using the 2019 teams):
http://www.ncaagamesim.com/FB_GameSimulator.asp?HomeTeam=Alabama&HomeYear=2019&AwayTeam=Oklahoma&AwayYear=2019&hs=1&hSchedule=0
I'm trying to grab the two teams' scores from the above link.
The first team's score's xpath is supposedly /html/body/div[3]/div/div/div[2]/div/div[1]/center/div[3]/div[1]/table/tbody/tr[1]/td[2]
but I keep getting an empty response.
I'm trying to use importxml in google sheets to get the data.
This returns quite a bit, but it doesn't appear to have the info I need. =importxml("http://www.ncaagamesim.com/FB_GameSimulator.asp?HomeTeam=Alabama&HomeYear=2019&AwayTeam=Oklahoma&AwayYear=2019&hs=1&hSchedule=0", "//div[contains(@class,gs_score)]")
If I quote the gs_score, it doesn't return anything.
Would appreciate any help with this. Thanks!
Edit: The xpath fails with /html/body/div[3]. If I change this to div[2], it returns some of the page data, but not the part I'm looking for.
According to an article I found -
Unfortunately, ImportXML doesn’t load JavaScript, so you won’t be able
to use this function if the content of the document is generated by
JavaScript (jQuery, etc.)
Not sure if this is relevant...
Edit 2:
I noticed the values I need are in an html table, so I tried using this
=IMPORTHTML("http://www.ncaagamesim.com/FB_GameSimulator.asp?HomeTeam=Alabama&HomeYear=2019&AwayTeam=Oklahoma&AwayYear=2019&hs=1&hSchedule=0", "table",1)
I'm still getting no content, no matter what table number I put in that formula.
If I copy the selector in the inspector, we get:
body > div.container > div > div > div.container > div > div.col-lg-9 > center > div:nth-child(3) > div.col-sm-6.col-xs-12.gs_score.gs_borderright.rightalign > table > tbody > tr:nth-child(1) > td:nth-child(2)
Which seems to be the same as the xpath.
Part of the answer: 'gs_score' needs to be in quotes - it's a string literal, not an element name. As an element name, it selects nothing, and everything contains nothing, so the predicate is always true.
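That behaviour can be mimicked in plain Python, where the `in` operator plays the role of XPath's contains() (a loose analogy, just to show why the unquoted version matches every div):

```python
# Unquoted gs_score is an element name in XPath; a child element that doesn't
# exist converts to the empty string, and every string contains the empty
# string, so contains(@class, gs_score) is true for every div on the page.
class_value = "col-sm-6 col-xs-12 gs_score gs_borderright rightalign"

assert "" in class_value          # contains(@class, <missing element>): always true
assert "gs_score" in class_value  # contains(@class, 'gs_score'): true only when present
assert "gs_score" not in "navbar navbar-default"
print("quoted: tests for the string literal; unquoted: matches everything")
```

So the unquoted expression returned "quite a bit" (every div), while the quoted one returned nothing only because the score content is generated by JavaScript, which IMPORTXML cannot see.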

Cannot create more than two c3js graphs on a page

We are using the following code (generated by PHP, but ultimately running on the client side):
c3.generate({'bindto':'#b65d3422__salestaffcommunication_xepan_base_view_chart_chart','data':{'keys':{'x':'name','value':['Email','Call','Meeting']},'groups':[['Email','Call','Meeting']],'json':[],'type':'bar'},'axis':{'x':{'type':'category'},'rotated':true},'onrendered':function(ev,ui){$(".widget-grid").masonry({'itemSelector':'.widget'})}});
c3.generate({'bindto':'#f67e14d8__t_masscommunication_xepan_base_view_chart_chart','data':{'keys':{'x':'name','value':['Newsletter','TeleMarketing']},'groups':[['Newsletter','TeleMarketing']],'json':[],'type':'bar'},'axis':{'x':{'type':'category'},'rotated':true},'onrendered':function(ev,ui){$(".widget-grid").masonry({'itemSelector':'.widget'})}});
c3.generate({'bindto':'#517df254__ableworkforce_xepan_base_view_chart_chart','data':{'columns':[['present',11.111111111111]],'type':'gauge'},'color':{'pattern':['#FF0000','#F97600','#F6C600','#60B044'],'threshold':{'values':[30,60,90,100]}},'onrendered':function(ev,ui){$(".widget-grid").masonry({'itemSelector':'.widget'})}});
And the last graph is not drawn, showing:
SyntaxError (DOM Exception 12): The string did not match the expected pattern.
However, I can run ANY two and it works fine, which means the code itself is fine; but once the second one is drawn (no matter in which order), the third one doesn't draw.
Is this a known bug, or is there a known workaround?
Using v0.4.11 of c3 from c3js.org
Here is my jsfiddle
https://jsfiddle.net/2yy2mjaf/1/
Thank you.
IDs in CSS selectors cannot start with a digit, which is the case for your third ID.
The simple solution is just adding a letter to it:
'bindto':'#a517df254_ //just put an "a" before the number here
Here is your fiddle: https://jsfiddle.net/7dkLdg32/
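The underlying rule is the CSS identifier grammar: an unescaped identifier cannot start with a digit, so a selector like `#517df254__ableworkforce_xepan_base_view_chart_chart` is rejected by the selector parser that c3's bindto ultimately goes through, hence the SyntaxError. A simplified check of that grammar in Python; the regex is an approximation of the spec, for illustration only:

```python
import re

# Simplified CSS identifier pattern: optional leading hyphen, then a letter
# or underscore, then letters, digits, hyphens, or underscores
CSS_IDENT = re.compile(r'^-?[_a-zA-Z][-_a-zA-Z0-9]*$')

broken = '517df254__ableworkforce_xepan_base_view_chart_chart'
fixed = 'a' + broken  # the workaround from the answer: prefix a letter

print(bool(CSS_IDENT.match(broken)))  # False -> '#...' selector throws
print(bool(CSS_IDENT.match(fixed)))   # True  -> selector parses fine
```

(The HTML id attribute itself may start with a digit; it is only the unescaped CSS selector syntax that forbids it, which is why prefixing a letter is the simplest fix.)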

How to find the index of HTML child tag in Selenium WebDriver?

I am trying to find a way to return the index of an HTML child tag based on its XPath.
For instance, on the right rail of a page, I have three elements:
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[1]/h4
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[3]/h4
Assume that I've found the first element, and I want to return the number inside the tag div, which is 1. How can I do it?
I referred to this previous post (How to count HTML child tag in Selenium WebDriver using Java) but still cannot figure it out.
You can get the number using regex:
var regExp = /div\[(\d+)\]/;
var matches = regExp.exec("//*[@id=\"ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880\"]/div[2]/h4");
console.log(matches[1]); // returns 2
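The same extraction is a one-liner in Python; `\d+` tightens the capture so only the digits inside `div[...]` are taken (a looser class like `[^)]+` also happens to work on this input, but would over-match if the path contained further brackets):

```python
import re

xpath = '//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4'

# \d+ captures only the digits between div[ and ]
match = re.search(r'div\[(\d+)\]', xpath)
print(match.group(1))  # 2
```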
You can select preceding siblings in XPath to get all the reports before your current one, like this:
//h4[contains(text(),'hello1')]/preceding-sibling::h4
Now you only have to count how many you found, plus the current one, and you have your index.
Another option would be to select all the reports at once and loop over them, checking their content. They always come in the same order as they appear in the DOM.
In Java it could look like this:
List<WebElement> reports = driver.findElements(By.xpath("//*[@id='ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880']/div/h4"));
for (WebElement element : reports) {
    if (element.getText().contains("report1")) {
        return reports.indexOf(element) + 1;
    }
}
Otherwise you will have to parse the xpath by yourself to extract the value (see LG3527118's answer for this).
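For quick experimentation outside Selenium, the find-all-then-index idea above can be sketched with the standard library's ElementTree on a cut-down version of the markup (the `report…` texts are placeholders, as in the Java answer):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<div id="rail">'
    '<div><h4>report1</h4></div>'
    '<div><h4>report2</h4></div>'
    '<div><h4>report3</h4></div>'
    '</div>'
)

# Same idea as findElements + indexOf: fetch all h4s in document order,
# then the 1-based position of the matching one is its div index
headings = root.findall('./div/h4')
index = next(i + 1 for i, h in enumerate(headings) if 'report2' in h.text)
print(index)  # 2
```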