Scrapy selector doesn't "see" an element that is present on the webpage - html

I want to parse the following webpage:
https://mafiaworldtour.com/tournaments/2653
And I need to find the following element:
//html/body/div[1]/div/section[2]/div/div/div/div[1]/div[1]/div/div[2]/div/div[1]/div[2]/span/text()
When I search it on the webpage via inspect, it is clearly present, but
city = response.xpath('//html/body/div[1]/div/section[2]/div/div/div/div[1]/div[1]/div/div[2]/div/div[1]/div[2]/span/text()').extract_first() returns None.
What is the reason for this?
I expect to get the city Хайфа, Израиль of the tournament via xpath.

Using my own project retrieveCssOrXpathSelectorFromTextOrNode to fetch the full xpath query:
x('Хайфа, Израиль');
//body/div[#class="site-wrapper"]/div[#class="main"][#role="main"]/section[#class="page-content"]/div[#class="container"]/div[#class="tabs"]/div[#class="tab-content"]/div[#class="tab-pane fade in active "][#id="general"]/div[#class="row"]/div[#class="col-md-12"]/div[#class="table-responsive"]/div[#class="responsive-info-table"]/div[#class="row with-top-border"]/div[#class="col-md-6"]/span[#class="small_content"]
It's always better to have these specific XPath query than the one with relative path like auto-generated by chrome dev tools:
//html/body/div[1]/div/section[2]/div/div......
But you can remove the useless part, should be like:
(From chrome dev tools, or firefox console):
$x('//span[#class="small_content"]')[0].innerText
or in your case:
response.xpath('//span[#class="small_content"]/text()').extract_first()
Output
" Хайфа, Израиль"

you can use both CSS selector orXPATH
CSS selector
response.css('.small_content::text').get()
XPATH
response.xpath('//span[#class="small_content"]/text()').get()

Related

Scraping HTML elements between ::before and ::after with scrapy and xpath

I am trying to scrape some links from a webpage in python with scrapy and xpath, but the elements I want to scrape are between ::before and ::after so xpath can't see them as they do not exist in the HTML but are dynamically created with javascript. Is there a way to scrape those elements?
::before
<div class="well-white">...</div>
<div class="well-white">...</div>
<div class="well-white">...</div>
::after
This is the actual page http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/amif/calls/amif-2018-ag-inte.html#c,topics=callIdentifier/t/AMIF-2018-AG-INTE/1/1/1/default-group&callStatus/t/Forthcoming/1/1/0/default-group&callStatus/t/Open/1/1/0/default-group&callStatus/t/Closed/1/1/0/default-group&+identifier/desc
I can't replicate your exact document state.
However if you load the page you can see some template language loaded in the same format your example data is:
Also if you check XHR network inpector you can see some AJAX requests for json data is being made:
So you can download the whole data you are looking for in handy json format over here:
http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json
scrapy shell "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
> import json
> data = json.loads(response.body_as_unicode())
> data['topicData']['Topics'][0]
{'topicId': 1259874, 'ccm2Id': 31081390, 'subCallId': 910867, ...
Very very easy!
you just use the "Absolute XPath" and "Relative XPath" (https://www.guru99.com/xpath-selenium.html) together.By this trick you can pass form ::before (and maybe ::after). For example in your case (I supposed that,:
//div[#id='"+FindField+"'] // following :: td[#class='KKKK'] is before your "div".
FindField='your "id" associated to the "div"'
driver.find_element_by_xpath ( "//div[#id='"+FindField+"'] // following :: td[#class='KKKK'] / div")
NOTE:only one "/" must be use.
Also you can use only "Absolute XPath" in all addressing (Note:must be use "//" at the first Address.

Convert to CSS Selector

Trying to convert the below given HTML tag of a Image Button which I want to click but not getting clicked while using Xpath.
HTML Script
<img src="../../../../imagepool/transparent%21tmlservicedesk?cid=1"
id="reg_img_304316340" aralttxt="1" artxt="Show Application List"
arimgcenter="1" alt="Show Application List" title="Show Application List"
class="btnimg" style="top:0px; left:0px; width:23px; height:140px;">
Xpath Generated for the same:
//div[#class='btnimgdiv']/img[#id='reg_img_304316340']/#src
Read some of the articles that for image buttons CSS selector is much better than xpath and wanted to know how to convert the html to CSS selector.
Image BUtton which i want to click but not getting clicked while using Xpath
This is because you are using id attribute value of the element which looks like dynamically generated.
Read some of the articles that for image buttons CSS selector is much better than xpath
Yes, you are right, using cssSeector is much faster than xpath to locate an element.
wanted to know how to convert the html to CSS selector.
You need to use that attribute value which is unique and unchangeable to locate element, you can use below cssSelector :-
img.btnimg[title='Show Application List']
Reference Link :-
http://www.w3schools.com/cssref/css_selectors.asp
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

Partial HTML Selection Using Jsoup

So I was wondering if there is a way to find the element that belongs to a specific String that you know exists on a HTML page as part of an attribute. The example is I know that "Apr-16-2015" is somewhere in an attribute on the HTML page. If I go look for it, it's part of the attribute title:
<a title="Apr-16-2015 5:04 AM"
However, I do not have the information about the exact time, i.e. the "5:04 AM". I was wondering if there is a way to partially search an attribute in order for it to return the full element.
This is my code:
org.jsoup.nodes.Element links = lastPage.select("[title=\"Apr-16-2015\"]").first();
Again, it doesn't work because I did not enter the full attribute title, as given above. My question: "Is there any way to make this selector work by not entering the full information, as I will be unable to have the latter part of the attribute to my disposition?"
You can use it in the following way:
lastPage.select("[title^=\"Apr-16-2015\"]").first();
As described on JSoup Documentation:
[attr^=value], [attr$=value], [attr*=value]: elements with attributes
that start with, end with, or contain the value, e.g. [href*=/path/]
References:
http://jsoup.org/cookbook/extracting-data/selector-syntax

How do I find a reliable XPath for this html element (type is text, class is known, no id present)?

The element is similar to:
<input type="text" class="information">
There is no id for the element.
There is only one text type element inside the information class. I want to be able to enter text into this html element by using casperjs which works on top of phantomjs.
The XPath obtained from chrome developer tools is similar to:
//*[#id="abcid"]/div/div[1]/input
abcdid is the id of the div element which comprises of the text box and a few other elements. But I need a more reliable XPath. I'm not very experienced with finding XPaths so forgive me if the answer is too obvious.
If you want to use XPath selectors for nearly all CasperJS functions, you need to provide it as an object. If the selector is provided as a string it will be automatically assumed that it is a CSS selector.
You can build the XPath selector object yourself:
{
type: 'xpath',
path: '//input[#class="information"]'
}
or just use a XPath utility by first requiring it at the beginning of your script and then using it:
var x = require('casper').selectXPath;
// later ...
var text = casper.fetchText(x('//input[#class="information"]'));
Regarding your selector:
If there is only one input with the information class then you can use the XPath
//input[#class="information"]
or the CSS selector
input.information[type='text']
If the input has other classes too, the CSS selector will work as is, but the XPath selector must be changed to
//input[contains(#class,"information")]

Simple Xpath puzzle

I'm trying to automate the Google Translate web interface with Selenium (but it's not necessary to understand Selenium to understand this question, just know that it finds elements and clicks them). I'm stuck on selecting the language to translate from.
I can't get to the point where the drop-down menu opens, as seen in the screenshot below.
Now, I want to select 'Japanese'.
This xpath expression works: $b.find_element(:xpath,"//*[#id=':13']/div").click But I would rather have one where I can just input the name of the language.
This xpath expression also works: $b.find_element(:xpath,"//*[contains(text(),'Japanese')]").click But only as long as there is no other 'Japanese' text on the page.
So I'm trying to narrow down the scope of my xpath, but when I try to specify the path to take to find the 'Japanese' text, the expression no longer works, I can't find the element: $b.find_element(:xpath,"//*div[#id='gt-sl-gms']/*[contains(text(),'Japanese')]").click
It also no longer works for the original xpath either: $b.find_element(:xpath,"//*div[#id='gt-sl-gms']/*[#id=':13']/div").click
Which is weird, because to bring down the drop-down menu, I use this xpath $b.find_element(:xpath,"//*[#id='gt-sl-gms']/*[contains(text(),'From:')]").click.
So it's not that I have two wildcards in my expression and it's not that my expression is too specific. There's something else that I'm missing and I'm sure it's really simple.
Any suggestions are appreciated.
Edit Other things I have tried unsuccessfully:
$b.find_element(:xpath,"//*/div[#id='gt-sl-gms']/*[#id=':13']/div").click
$b.find_element(:xpath,"//*[#id='gt-sl-gms']/*[#id=':13']/div").click
$b.find_element(:xpath,"//*[#id='gt-sl-gms']//*[#id=':13']/div").click
If the div with "#id=':13'" is an descendant of the div with "#id='gt-sl-gms" your xpaht "//*[#id='gt-sl-gms']//*[#id=':13']/div" would work.
The above xpaht expect that the html looks somehow like:
<div id="gt-sl-gms">
<div>
<div id=":13">
<div></div>
</div>
</div>
</div>
If <div id="gt-sl-gms"> in not an ancestor (as I expect) you have to look for an "real" ancestor, or you may use following (for nodes later in the document) or following-sibling (for nodes later in the document at the same level as the previous.
*div is incorrect, it should be just div. Also, depending on he structure of the HTML, you may need // instead of /.
Try selecting descendants (//) instead of (/*) which is really grandchildren or deeper.