Scraping HTML elements between ::before and ::after with Scrapy and XPath

I am trying to scrape some links from a webpage in Python with Scrapy and XPath, but the elements I want are between ::before and ::after, so XPath can't see them: they do not exist in the downloaded HTML and are created dynamically with JavaScript. Is there a way to scrape those elements?
::before
<div class="well-white">...</div>
<div class="well-white">...</div>
<div class="well-white">...</div>
::after
This is the actual page http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/amif/calls/amif-2018-ag-inte.html#c,topics=callIdentifier/t/AMIF-2018-AG-INTE/1/1/1/default-group&callStatus/t/Forthcoming/1/1/0/default-group&callStatus/t/Open/1/1/0/default-group&callStatus/t/Closed/1/1/0/default-group&+identifier/desc

I can't replicate your exact document state.
However, if you load the page you can see some template markup in the same format as your example data.
Also, if you check the XHR tab of the network inspector, you can see that AJAX requests for JSON data are being made.
So you can download all the data you are looking for in handy JSON format from here:
http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json
scrapy shell "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
> import json
> data = json.loads(response.body_as_unicode())
> data['topicData']['Topics'][0]
{'topicId': 1259874, 'ccm2Id': 31081390, 'subCallId': 910867, ...
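If you want this in a full spider rather than the shell, a minimal sketch along these lines should work (the spider name and the yielded fields are my own choices; the JSON keys come from the sample output above):
import json

import scrapy

class AmifTopicsSpider(scrapy.Spider):
    name = "amif_topics"
    start_urls = [
        "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
    ]

    def parse(self, response):
        # The endpoint returns JSON, so parse the body instead of using XPath.
        data = json.loads(response.body_as_unicode())
        for topic in data["topicData"]["Topics"]:
            # Yield whichever fields you need; these two appear in the sample above.
            yield {
                "topicId": topic.get("topicId"),
                "ccm2Id": topic.get("ccm2Id"),
            }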

Very easy! You just use "absolute XPath" and "relative XPath" (https://www.guru99.com/xpath-selenium.html) together. With this trick you can get past ::before (and maybe ::after). For example, in your case (I assumed that td[@class='KKKK'] sits before your "div"):
FindField = 'the "id" associated with the "div"'
driver.find_element_by_xpath("//div[@id='" + FindField + "']//following::td[@class='KKKK']/div")
NOTE: only a single "/" must be used between the steps.
You can also use only "absolute XPath" for all of the addressing (note: the expression must then start with "//").
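Put together in a small Selenium script (a sketch only: the 'KKKK' class and the id placeholder come from the answer above, not from the actual page), it might look like this:
from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/amif/calls/amif-2018-ag-inte.html")
driver.implicitly_wait(10)  # give the page's JavaScript time to render the elements

FindField = "your-div-id"  # placeholder: the id of the div you anchor on
element = driver.find_element_by_xpath(
    "//div[@id='" + FindField + "']//following::td[@class='KKKK']/div")
print(element.text)
driver.quit()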

Related

Convert to CSS Selector

I'm trying to convert the HTML tag given below for an image button which I want to click, but it does not get clicked when using XPath.
HTML Script
<img src="../../../../imagepool/transparent%21tmlservicedesk?cid=1"
id="reg_img_304316340" aralttxt="1" artxt="Show Application List"
arimgcenter="1" alt="Show Application List" title="Show Application List"
class="btnimg" style="top:0px; left:0px; width:23px; height:140px;">
XPath generated for the same:
//div[@class='btnimgdiv']/img[@id='reg_img_304316340']/@src
I have read in some articles that for image buttons a CSS selector works much better than XPath, and I want to know how to convert the HTML to a CSS selector.
Image button which I want to click but not getting clicked while using XPath
This is because you are using the element's id attribute value, which looks dynamically generated.
Read in some articles that for image buttons a CSS selector is much better than XPath
Yes, you are right: using a cssSelector is much faster than XPath for locating an element.
Wanted to know how to convert the HTML to a CSS selector.
You need to use an attribute value which is unique and unchangeable to locate the element. You can use the cssSelector below:
img.btnimg[title='Show Application List']
Reference Link :-
http://www.w3schools.com/cssref/css_selectors.asp
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
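For example, with the Selenium Python bindings (the question doesn't name a language, so this is only an illustration), clicking the button through that selector would look roughly like:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://your-application-url")  # placeholder URL

# Locate the image button by its stable class and title attribute, then click it.
driver.find_element_by_css_selector("img.btnimg[title='Show Application List']").click()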

Scraping one html element into another

I'm trying to collect a URL from a list in a table in R. But the table is an HTML element embedded in the web page, so XPath doesn't work properly. I obtain the following result:
> doc<-read_html(url("http://www.bibliotecanacional.gov.co/rnbp/directorio-de-bibliotecas-publicas"))
> v<-toString(xml_find_all(doc, xpath='//*[@id="ContentPlaceHolder1_Ejemplo2_GridviewConCSSFriendly1_GridViewJedis_LinkButton1_0"]'))
> v
[1] ""
In the image "Extraction of xpath" (not reproduced here), you can see how I extracted the XPath by inspecting the URL element.
I would be grateful for your help. Thanks.
That page contains an iframe, so you need to switch to the iframe first before you can get the element from it.
It has an iframe with the title: Libros digitales y aplicaciones producidas BNC
I'm not sure how to do that with the tools you're using, but you should be able to look that up easily.
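As a rough sketch of that idea (shown with the Selenium Python bindings rather than R, and assuming the iframe's title is exposed as a title attribute), switching into the iframe before querying could look like:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.bibliotecanacional.gov.co/rnbp/directorio-de-bibliotecas-publicas")

# Switch into the iframe first, then query elements inside it.
frame = driver.find_element_by_xpath(
    "//iframe[@title='Libros digitales y aplicaciones producidas BNC']")
driver.switch_to.frame(frame)

link = driver.find_element_by_id(
    "ContentPlaceHolder1_Ejemplo2_GridviewConCSSFriendly1_GridViewJedis_LinkButton1_0")
print(link.get_attribute("href"))  # whatever target the control exposes

driver.switch_to.default_content()  # switch back out when done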

Jsoup is not Selecting Script Tag

I am trying to select a script tag on the page whose text contains a given string:
Document doc=Jsoup.parse(somehtml);
Elements ele=doc.select("script:contains(accountIndex)");
The code for the script tag on the page is:
<script>(function() {var vm = ko.mapping.fromJS({
"accountIndex": 1,
"accountNumber": "*******",
"hideMoreDetailsText": "Hide More Details",
"viewAccountNumberText": "Show Account Number",
"hideAccountNumberText": "Hide Account Number",
});window.AccountDetails = vm;})();</script>
I am able to select this script tag if I pass the CSS locator of the script tag, like:
Elements ele=doc.select("body > script:nth-child(44)");
There are many script tags on the page, so the second approach is not generic; it may change in the future.
Can somebody please tell me what the issue is with the first approach? I am able to select other tags on the page with Jsoup's :contains.
The selector :contains(text) looks for an element that has that text value. A script doesn't have text, it has data (otherwise the JS would be visible in the browser). You can use the :containsData(data) selector instead.
E.g.:
Elements els = doc.select("script:containsData(accountIndex)");
Here's an example. The Selector documentation has all the handled query types (which is not just strict CSS).
jsoup only supports CSS selectors, and those only allow you to select based on CSS classes and properties of the DOM elements, not their text contents (CSS selector based on element text?). You could try using another framework for parsing and querying the HTML, for example XOM and TagSoup as described here: https://stackoverflow.com/a/11817487/7433999
Or you could add CSS classes to your script tags like this:
<script class="class1">
// script1
</script>
<script class="class2">
// script2
</script>
Then you can select the script tags again via CSS using jsoup:
Elements elements = document.select("script.class1");

How do I get Mithril.js v0.2.5 to render raw HTML extracted from json? [duplicate]

Suppose I have a string <span class="msg">Text goes here</span>. I need to use this string as an HTML element in my webpage. Any ideas on how to do it?
Mithril provides the m.trust method for this. At the place in your view where you want the HTML output, write m.trust( '<span class="msg">Text goes here</span>' ) and you should be sorted.
Mithril is powerful thanks to its virtual DOM. In the view, if you want to create an HTML element you use:
m("tagName.cssClass", "value");
So in your case:
m("span.msg", "Text goes here");
Try creating a container to hold your span in.
1. Use jQuery to select it.
2. On that selection, call the jQuery .html() method and pass in your HTML string
(for example, $('.container').html('<span class="msg">Text goes here</span>')).
You should then be able to assign the container's inner HTML from the string, resulting in the HTML element you want.
Docs here.

How to parse div> main>div

Image of the source code I want to parse
How do I parse div class="flight-selector-listing"?
How do I open "main[ui-view]" and go further down?
So far I only have
Element masthead = doc.select("div.FR>main[ui-view]").first();
and the output is:
<main ui-view="mainView"></main>
How do I parse div class="flight-selector-listing"?
Use this CSS query:
div.flight-selector-listing
How do I open "main[ui-view]" and go further down?
Jsoup is an HTML parser. It won't be able to "open" anything. If you want to open "main[ui-view]", use a tool like HtmlUnit, Selenium, or ui4j.
(...) and the output is:
I bet div.FR>main[ui-view] is populated by some JavaScript code running on the page. If that is the case, Jsoup can't help here.
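If that is the case, a rough sketch of the Selenium route (shown here with the Python bindings for illustration; the question itself uses Jsoup/Java) would be to wait for the JavaScript to populate the listing and only then read its HTML:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://the-page-you-are-scraping")  # placeholder URL

# Wait until the JavaScript has rendered the listing, then grab its HTML.
listing = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.flight-selector-listing")))
print(listing.get_attribute("outerHTML"))
driver.quit()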