href attribute empty using xpath (python3) - html

Using Chrome and XPath in Python 3, I am trying to extract the value of an "href" attribute on this web page. The "href" attribute contains the link to the movie's trailer ("bande-annonce" in French) that I am interested in.
First, using XPath, it appears that the "a" tag is actually a "span" tag. Indeed, using this code:
import urllib.request
from lxml import etree

response_main = urllib.request.urlopen("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
htmlparser = etree.HTMLParser()
tree_main = etree.parse(response_main, htmlparser)
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/*')
I get this result:
[<Element span at 0x111f70c08>]
So the "div" tag contains no "a" tag but just a "span" tag. I've read that html visualization in browsers doesn't always reflects the "real" html sent by the server. Thus I tried to use this command to extract the href:
response_main = urllib.request.urlopen("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
htmlparser = etree.HTMLParser()
tree_main = etree.parse(response_main, htmlparser)
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/span/@href')
Unfortunately, this returns nothing... And when I check the attributes within the "span" tag with this command:
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/span/@*')
I only get the value of the "class" attribute, nothing about "href":
['ACrL3ZACrpZGVvL3BsYXllcl9nZW5fY21lZGlhPTE5NTYwMDcyJmNmaWxtPTIzMTg3NC5odG1s meta-title-link']
I'd like some help understanding what's happening here. Why is the "a" tag a "span" tag? And, most importantly, how can I extract the value of the "href" attribute?
Thanks a lot for your help!

The required link is generated dynamically with JavaScript. With urllib.request you only get the initial HTML page source, while you need the HTML after all the JavaScript has been executed.
You can use Selenium + chromedriver to get the dynamically generated content:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait

driver = web.Chrome("/path/to/chromedriver")
driver.get("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
# Wait until the trailer link has been rendered by JavaScript, then read its href
link = wait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='meta-title']/a[@class='xXx meta-title-link']")))
print(link.get_attribute('href'))
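If you'd rather keep the lxml/XPath workflow from the question, one option (a minimal sketch, assuming the rendered markup matches the XPath used above) is to feed Selenium's rendered page source back into lxml:
from lxml import etree

# driver is the Selenium webdriver from the snippet above, after driver.get(...)
htmlparser = etree.HTMLParser()
tree = etree.fromstring(driver.page_source, htmlparser)
# Run the XPath against the JavaScript-rendered HTML instead of the initial source
hrefs = tree.xpath("//div[@class='meta-title']/a[@class='xXx meta-title-link']/@href")
print(hrefs)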

Related

How to grab text inside a script tag with scrapy?

I need to grab, as text, the contents of a script tag with a very specific attribute using the scrapy library. Essentially the BeautifulSoup equivalent of this:
js_content = soup.find("script",type="application/ld+json").get_text()
I tried this, but the result is not quite what I need.
response.css('script').attrib['type']
CSS:
response.css('script[type="application/ld+json"]::text').get()
XPath:
response.xpath('//script[#type="application/ld+json"]/text()').get()
Basically, we're finding a script tag that has a type attribute with a value of application/ld+json and grabbing its text.
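Since that text is JSON-LD, a natural follow-up (a small sketch, assuming the script body is valid JSON and that response is the scrapy response used above) is to parse it right away:
import json

raw = response.xpath('//script[@type="application/ld+json"]/text()').get()
data = json.loads(raw)  # now an ordinary Python dict/list you can index into
print(data)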

How to get the text from a webpage element using Selenium

I am trying to get a series of texts from a web element, but am unable to.
The HTML code is as follows:
<span class="versionInfo">
<span class="menu-highight">SoftFEPVis (GUI): </span> == $0
"1.6.4"
</span>
Where SoftFEPVis (GUI): and 1.6.4 are the texts which I would like to be able to extract.
I am able to locate the element, and print out its class (menu-highlight), but unable to extract SoftFEPVis (GUI): and 1.6.4.
I tried:
Version_Number = Browser.find_element(By.XPATH, '//*[@id="versionDropDown"]/div/span[3]/span').getText()
and got an error:
'WebElement' object has no attribute getText.
Please help.
Instead of using .getText() you could use:
.get_attribute('innerText')
or
.get_attribute('innerHTML')
or
.text
If it helps, here is a more in-depth discussion of the topic:
Given a (python) selenium WebElement can I get the innerText?
getText() is a Selenium Java client method, whereas from your code trials and the error message you are presumably using the Selenium Python client.
Solution
To print the text SoftFEPVis (GUI): and 1.6.4 you can use the text attribute and you can use either of the following locator strategies:
Using css_selector and the text attribute:
print(Browser.find_element(By.CSS_SELECTOR, "span.versionInfo").text)
Using xpath and text attribute:
print(Browser.find_element(By.XPATH, "//span[@class='versionInfo']").text)
Note: you have to add the following import:
from selenium.webdriver.common.by import By
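If you need the label and the version separately rather than the concatenated text, a minimal sketch (assuming the markup shown in the question) is to read the inner span's text and strip it from the container's text:
container = Browser.find_element(By.XPATH, "//span[@class='versionInfo']")
label = container.find_element(By.XPATH, "./span").text  # "SoftFEPVis (GUI):"
version = container.text.replace(label, "").strip()      # "1.6.4"
print(label, version)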

Printing text from a specific html tag, with only the tag's class name. PYTHON3

I need a code "snippet" (or what you call them) that prints out all of the words inside a specific html class, not tag but class.
<h1 class="example">Hello people!</h1>
Let's say for some reason the HTML of a website only looked like that; I would need code that could print out what's inside the H1 tag, selecting it only by the class. I have tried researching this but haven't found anything that helped (although I am bad at researching).
Thank you.
BeautifulSoup can do this for you
from bs4 import BeautifulSoup
import requests
html_doc = '<h1 class="example">Hello people!</h1>'
# or, if you need to get the content from an http endpoint
# html_doc = requests.get(url_to_source).text
soup = BeautifulSoup(html_doc, 'html.parser')
for heading in soup.find_all(attrs={"class": "example"}):
    print(heading.string)
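An equivalent approach (a small sketch using BeautifulSoup's CSS selector support) is to select by class directly:
for heading in soup.select(".example"):
    print(heading.get_text())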

How to extract all tags from within a html tag using beautifulsoup

I am writing a generic HTML parser and want to be able to extract all the tags from a given tag. Because it's a generic parser, outer tags may contain one or more inner tags, and they could be just about any HTML tag, hence I can't use methods like find. I have also tried using .contents, but it returns the result in the form of a list, whereas I just want the tags as they are, so that they can be parsed further as bs4 tags.
E.g.: Given the following html:
<tr><th>a</th><th>b</th></tr>
I need to extract the following, while ensuring that its still of type bs4 tag
<th>a</th><th>b</th>
Why not use the find_all() method with no arguments?
from bs4 import BeautifulSoup as soup
html = """<div><tr><th>a</th><th>b</th></tr></div>"""
page = soup(html,"html.parser")
div = page.find('div')
print('Get all tag occurences')
print(div.find_all())
print('Get only the inside tag, without duplicate')
print(div.find_all()[0])
OUTPUT:
Get all tag occurences
[<tr><th>a</th><th>b</th></tr>, <th>a</th>, <th>b</th>]
Get only the inside tag, without duplicate
<tr><th>a</th><th>b</th></tr>
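If you only want the direct inner tags of the tr (the two th elements, still as bs4 Tag objects), a minimal sketch is to restrict find_all() to direct children:
tr = page.find('tr')
cells = tr.find_all(recursive=False)  # immediate children only: [<th>a</th>, <th>b</th>]
print(cells)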

Scraping HTML elements between ::before and ::after with scrapy and xpath

I am trying to scrape some links from a webpage in Python with scrapy and XPath, but the elements I want to scrape sit between ::before and ::after, so XPath can't see them: they do not exist in the initial HTML but are dynamically created with JavaScript. Is there a way to scrape those elements?
::before
<div class="well-white">...</div>
<div class="well-white">...</div>
<div class="well-white">...</div>
::after
This is the actual page http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/amif/calls/amif-2018-ag-inte.html#c,topics=callIdentifier/t/AMIF-2018-AG-INTE/1/1/1/default-group&callStatus/t/Forthcoming/1/1/0/default-group&callStatus/t/Open/1/1/0/default-group&callStatus/t/Closed/1/1/0/default-group&+identifier/desc
I can't replicate your exact document state.
However, if you load the page you can see some template language loaded in the same format as your example data.
Also, if you check the XHR network inspector you can see that some AJAX requests for JSON data are being made.
So you can download the whole data you are looking for in handy JSON format over here:
http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json
scrapy shell "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
> import json
> data = json.loads(response.body_as_unicode())
> data['topicData']['Topics'][0]
{'topicId': 1259874, 'ccm2Id': 31081390, 'subCallId': 910867, ...
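Outside the scrapy shell, the same data can be fetched with a plain HTTP client (a minimal sketch, assuming the JSON keeps the topicData/Topics structure shown above):
import requests

url = "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
data = requests.get(url).json()
for topic in data['topicData']['Topics']:
    print(topic['topicId'])  # or whichever fields you need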
Very, very easy!
You just use "Absolute XPath" and "Relative XPath" (https://www.guru99.com/xpath-selenium.html) together. With this trick you can get past ::before (and maybe ::after). For example, in your case (I supposed that td[@class='KKKK'] sits next to your "div"):
FindField = 'your "id" associated with the "div"'
driver.find_element_by_xpath("//div[@id='" + FindField + "']//following::td[@class='KKKK']/div")
NOTE: only one "/" must be used at the end.
You can also use only "Absolute XPath" for all of the addressing (note: "//" must be used for the first address).