So basically I am trying to work on web scraping. I need to scrape the work-life balance rating from the Indeed website. The challenge I am facing is that I do not know how to extract the text from the aria-label attribute, so that I can get the output 4.0 out of 5 stars.
<div role="img" aria-label="4.0 out of 5 stars."><div class="css-eub7j6 eu4oa1w0"><div data-testid="filledStar" style="width:42.68px" class="css-i84nrz eu4oa1w0"></div></div></div>
You need to identify the element and read its aria-label attribute with get_attribute() to get the value.
If you are using Python, the code will be:
print(driver.find_element(By.XPATH, "//div[@role='img']").get_attribute("aria-label"))
Update:
print(driver.find_element(By.XPATH, "//div[@role='img' and @aria-label]").get_attribute("aria-label"))
Or
print(driver.find_element(By.XPATH, "//div[@role='img' and @aria-label][.//div[@data-testid='filledStar']]").get_attribute("aria-label"))
In case you can locate that element, its attribute value can be retrieved in Selenium with the get_attribute() method.
Let's say you are using By.CSS_SELECTOR and the locator is css_selector.
Python syntax is:
aria_label_value = driver.find_element(By.CSS_SELECTOR, css_selector).get_attribute("aria-label")
The same can be done in other programming languages with slight syntax changes.
To retrieve the value of the aria-label attribute, i.e. "4.0 out of 5 stars.", you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR and role="img":
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[role='img'][aria-label]"))).get_attribute("aria-label"))
Using XPATH and data-testid="filledStar":
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-testid='filledStar']/ancestor::div[@role='img' and @aria-label]"))).get_attribute("aria-label"))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in Python Selenium - get href value
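If the numeric rating itself is needed rather than the whole label, it can be parsed out of the retrieved text with a small regex. A minimal sketch, assuming the label text has already been read via get_attribute("aria-label"):

```python
import re

# Hypothetical value, as returned by get_attribute("aria-label")
aria_label = "4.0 out of 5 stars."

# Capture the leading decimal rating and the scale
match = re.match(r"(\d+(?:\.\d+)?) out of (\d+) stars", aria_label)
rating = float(match.group(1))
scale = int(match.group(2))
```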
Related
I'm trying to scrape some data from LinkedIn, but I noticed that the element ids change each time I load the page with Selenium. So I tried using class names to find the elements, but the class names have newlines inside them, preventing me from scraping the website.
example of class with newlines here
Website link example
I tried doing the below:
job_test = "ember-view jobs-search-results__list-item occludable-update p0 relative scaffold-layout__list-item\n \n \n "
job_list = driver.find_elements(By.CLASS_NAME, job_test)
I even tried this:
job_test = '''ember-view jobs-search-results__list-item occludable-update p0 relative scaffold-layout__list-item
'''
job_list = driver.find_elements(By.CLASS_NAME, job_test)
But it does not show me any elements when I print job_list. What do I do here?
By.CLASS_NAME accepts only one class name, so you can't pass multiple. See: Invalid selector: Compound class names not permitted error using Selenium
Solution
To create the job list you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use any of the following locator strategies:
Using CLASS_NAME:
driver.get('https://www.linkedin.com/jobs/search/?currentJobId=3425809260&keywords=python')
job_list = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "jobs-search-results__list-item")))
Using CSS_SELECTOR:
driver.get('https://www.linkedin.com/jobs/search/?currentJobId=3425809260&keywords=python')
job_list = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.jobs-search-results__list-item")))
Using XPATH:
driver.get('https://www.linkedin.com/jobs/search/?currentJobId=3425809260&keywords=python')
job_list = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//li[contains(@class, 'jobs-search-results__list-item')]")))
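If you do want to reuse the full class string copied from the page source, one option is to normalise its whitespace (including the embedded newlines) and turn it into a compound CSS selector instead of passing it to By.CLASS_NAME. A sketch of the idea, using the raw string from the question:

```python
raw = ("ember-view jobs-search-results__list-item occludable-update "
       "p0 relative scaffold-layout__list-item\n \n \n ")

# str.split() with no arguments collapses every run of whitespace,
# including the trailing newlines, into clean class tokens
classes = raw.split()

# Join the tokens into a compound CSS selector: li.a.b.c ...
css_selector = "li." + ".".join(classes)
```

The resulting selector could then be passed to driver.find_elements(By.CSS_SELECTOR, css_selector).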
I am new to web scraping and a bit confused with my current situation. Is there a way to extract the links for all the sectors from this website (where I circled in red)? From the HTML inspector, it seems like it is under the "performance-section" class and also under the "heading" class. My idea was to start from the "performance-section" and then reach the "a" tag's href at the end to get the link.
I tried the following code, but it gives me "None" as a result. I stopped here because if I am already getting None before reaching the "a" tag, then I think there is no point in continuing.
import requests
import urllib.request
from bs4 import BeautifulSoup
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
results_page = BeautifulSoup(response.content,'lxml')
heading =results_page.find('performance-section',{'class':"heading"})
Thanks in advance!
You are on the right track with your line of thinking.
Problem
You should take another look at the documentation, because currently you don't select a tag at all; you pass a mix of class names. That is also possible, but to learn you should start step by step.
Solution to get the <a> and its href
This will select all <a> in <div> with class heading
whose parents are <div> with class performance-section:
soup.select('div.performance-section div.heading a')
import requests
from bs4 import BeautifulSoup
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
soup = BeautifulSoup(response.content,'lxml')
[link['href'] for link in soup.select('div.performance-section div.heading a')]
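Note that the hrefs collected this way may be relative paths rather than full URLs; urllib.parse.urljoin can resolve them against the page URL. A small sketch with a made-up relative path (the real values depend on the live page):

```python
from urllib.parse import urljoin

base_url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"

# Hypothetical relative href, as a sector link might appear in the markup
relative_href = "/eresearch/markets_sectors/sectors/si_performance.jhtml?tab=siperformance"

absolute_href = urljoin(base_url, relative_href)
```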
I am new to web scraping, so I need your help.
I have this HTML code (see the picture) and I want to get this specific value --> "275,47".
I wrote the code below, but something is going wrong... Please help me! :)
import requests
from bs4 import BeautifulSoup as bs
url="https://www.skroutz.gr/s/11504255/Apple-iPhone-SE-32GB.html"
page=requests.get(url)
soup=bs(page.text,"html.parser")
D={"href": "/products/show/30871132", "rel": "nofollow",
   "class": "js-product-link", "data-func": "trigger_shop_uservoice",
   "data-uservoice-pid": "30871132", "data-append-element": ".shop-details",
   "data-uservoice-shopid": "1913", "data-type": "final_price"}
value=soup.find_all("a",attrs=D)
print(value.string)
So, you are close! The error is thrown because the variable value does not have an attribute called string: value is currently a list of items. You want to iterate over all of the anchors and find the one you are looking for.
My suggestion:
import requests
from bs4 import BeautifulSoup as bs
url="https://www.skroutz.gr/s/11504255/Apple-iPhone-SE-32GB.html"
page=requests.get(url)
soup=bs(page.text,"html.parser")
value=soup.find_all("a")
for item in value:
    if item.get('href') and '30871132' in item.get('href'):
        print(item.text)
item will be the current anchor tag we are iterating over in the loop.
We can get its href attribute (or any attribute) by using the .get method; the guard also skips anchors that have no href at all.
We then check if '30871132' is in the href and, if so, print out its text.
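The text found this way is a price string with a European decimal comma, such as "275,47". To do arithmetic with it, the comma has to be converted before calling float(). A minimal sketch, assuming the text has already been scraped:

```python
# Hypothetical scraped text; a currency symbol may also be present
price_text = "275,47 €"

# Drop the currency symbol and whitespace, remove any thousands
# separators, then swap the decimal comma for a dot
cleaned = price_text.replace("€", "").strip()
price = float(cleaned.replace(".", "").replace(",", "."))
```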
In my Java code I want to programmatically create a <fieldset> tag that I can use in my JSF form.
The setup of my form looks like this:
Application app = FacesContext.getCurrentInstance().getApplication();
HtmlForm form = (HtmlForm) app.createComponent(HtmlForm.COMPONENT_TYPE);
form.setStyleClass("pure-form pure-form-stacked");
As you can see I use HtmlForm.COMPONENT_TYPE as an identifier for the JSF UI component but I haven't found an identifier for a fieldset so I tried:
UIComponent fieldset = app.createComponent("fieldset");
form.getChildren().add(fieldset);
Unfortunately this is not working so I have to come up with another solution. Do you have any ideas?
Is there a general approach how HTML tags (which are unknown in the JSF context) can be created?
You can try the following:
There's a component called <f:verbatim> which you would use in XHTML like this:
<f:verbatim escape="false">
<fieldset id="blah"></fieldset>
</f:verbatim>
To achieve that programmatically you can add this component like this:
String fieldsetHTMLText ="<fieldset id=\"blah\"></fieldset>";
UIOutput verbatim = new UIOutput();
verbatim.setRendererType("javax.faces.Text");
verbatim.getAttributes().put("escape", false);
verbatim.setValue(fieldsetHTMLText);
I found three solutions to my problem: the first is to use PrimeFaces, the second is to use MyFaces Tomahawk, and the third is to use a JSF Verbatim UI component with string input. I will briefly list code samples and the differences between the solutions.
1 PrimeFaces
With an include of the PrimeFaces component suite (and its Apache Commons FileUpload dependency) one can use the Fieldset class to programmatically create a fieldset on the fly. The bad thing about that is that the PrimeFaces Fieldset component depends on a PrimeFaces JavaScript file, so instead of the plain fieldset one will get a fieldset plus a JavaScript include, which is way too much.
import org.primefaces.component.fieldset.Fieldset;
...
form.getChildren().add(new Fieldset());
2 MyFaces Tomahawk
The UI component set Tomahawk also comes with a Fieldset component that can be used to create an HTML fieldset programmatically. If the Fieldset of Tomahawk is used, one will get a plain, nice-looking fieldset tag. The bad thing here is that Tomahawk is an extension to MyFaces, and MyFaces itself is a whole JavaServer Faces implementation which should not be used alongside standard JSF.
import org.apache.myfaces.custom.fieldset.Fieldset;
...
form.getChildren().add(new Fieldset());
3 JSF Verbatim UI Component
The standardized but hacky way is to use a JSF Verbatim UI component. Within a verbatim component you are allowed to put any HTML needed. With this little trick we can create a verbatim tag:
UIOutput fieldset = new UIOutput();
fieldset.setRendererType("javax.faces.Text");
fieldset.getAttributes().put("escape", false);
fieldset.setValue("<fieldset></fieldset>");
The code shown above renders a fieldset HTML element, but because it is a string and the tag inside the string is already closed, you cannot programmatically append anything to that tag, so this won't work:
form.getChildren().add(fieldset);
To generate an HTML tag that can be used for nesting elements, each opening and closing tag must be put in its own Verbatim component, which makes this solution very text-heavy:
UIOutput fieldsetStart = new UIOutput();
fieldsetStart.setRendererType("javax.faces.Text");
fieldsetStart.getAttributes().put("escape", false);
fieldsetStart.setValue("<fieldset>");
UIOutput fieldsetClose = new UIOutput();
fieldsetClose.setRendererType("javax.faces.Text");
fieldsetClose.getAttributes().put("escape", false);
fieldsetClose.setValue("</fieldset>");
HtmlInputText inputText = (HtmlInputText) app.createComponent(HtmlInputText.COMPONENT_TYPE);
form.getChildren().add(fieldsetStart);
form.getChildren().add(inputText);
form.getChildren().add(fieldsetClose);
Conclusion:
None of the solutions shown is really elegant. PrimeFaces and MyFaces have large dependencies, and the standard JEE way requires a lot of writing effort. I had hoped to find a nice solution to produce unknown/custom HTML elements, something like document.createElement("fieldset");.
If anyone knows a way to do that, please post the solution.
I have a simple extension for the Sphinx documentation utility (my version in use is Sphinx-1.1.3-py2.6). It is very much like this excellent example by Doug Hellmann. How can I add a rel='bar' attribute to the final HTML for the <a ...> tag?
The reference nodes are created in this fashion:
node = nodes.reference(rawtext, utils.unescape(text),
                       internal=False,
                       refuri=ref,
                       classes=['foocss'],
                       rel='bar',
                       **options)
However, the rel='bar' attribute gets stripped out from the final HTML markup. Hunting through the source got me to sphinx/writers/html.py and the HTMLTranslator class. Here is part of the visit_reference method:
# overwritten
def visit_reference(self, node):
    atts = {'class': 'reference'}
    <snip>
    if 'reftitle' in node:
        atts['title'] = node['reftitle']
    self.body.append(self.starttag(node, 'a', '', **atts))
Additional attributes are not handled. Maybe they could be handled in other parts, but I couldn't find anything useful in that respect.
So, I could:
create a custom node which re-implements all the functionality of the reference node. A fair bit of work for a small addition.
Overwrite the visit_reference method in sphinx/writers/html.py. Quicker, but bad in terms of future Sphinx updates.
Add the rel attribute with jQuery to the link tag after the fact. Well, not pretty either.
I managed to do this with download_reference. Using app.add_node I override the visit_... method:
import posixpath
from sphinx.writers.html import HTMLTranslator
from sphinx.addnodes import download_reference
def visit_download_reference(self, node):
    if node.hasattr('filename'):
        self.body.append(
            '<a class="reference download internal" href="%s" %s>' %
            (posixpath.join(self.builder.dlpath, node['filename']),
             'rel="%s"' % node['rel'] if node.get('rel', None) else ''))
        self.context.append('</a>')
    else:
        self.context.append('')

def setup(app):
    app.add_node(download_reference,
                 html=(visit_download_reference,
                       HTMLTranslator.depart_download_reference))
full extension is here
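The conditional expression buried in the format string above is easy to misread; its effect can be checked in isolation with plain strings. A sketch with hypothetical attribute values standing in for node['filename'] and node['rel']:

```python
# Hypothetical node attributes, mirroring node['filename'] and node['rel']
node = {'filename': 'example.zip', 'rel': 'bar'}

# Same conditional as in visit_download_reference: emit rel="..." only
# when the node actually carries a rel attribute
rel_attr = 'rel="%s"' % node['rel'] if node.get('rel', None) else ''
tag = '<a class="reference download internal" href="%s" %s>' % (
    '_downloads/' + node['filename'], rel_attr)
```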