XPath for href links based on anchor text substring - html

I have the HTML below and need an XPath that finds every a element whose text contains "A1" and returns its href. "A1" appears several times on the page, and I need all of the hrefs.
I can't crack it.
<a href="./leitor.do?numero=20090&keyword=ministro&anchor=5975889&origem=busca" class="edition" title="Folha de S.Paulo">
<figure>
<img src="https://acervo.folha.uol.com.br/files/flip/11/89/58/97/5975889/140/5975889.jpg" width="180" height="312.4">
</figure>
<h3>31.dez.2014</h3>
<p>
país. Poder Novo <b>ministro</b> diz que Congresso irá ?expurgar? culpados futuro articulador polí
</p>
<small>
Folha de S.Paulo, Ano 94 - N° 20.090<br>
A1 - 1 ocorrência
</small>
</a>

This XPath,
//a[contains(.,"A1")]/@href
returns the href attribute of every a element whose string value contains the substring "A1".
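The "string value" of an element is the concatenation of all text nodes beneath it, which is why contains(., "A1") matches even though "A1" sits inside the nested small element. Here is a minimal sketch of that idea using only the standard library. It is a stand-in, not the XPath itself: ElementTree's XPath subset has no contains(), so the substring test is done in Python. The markup is a trimmed, well-formed version of the question's HTML, and the second link (numero=20091, "B2") is a made-up counterexample.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the page; the second <a> is hypothetical.
page = ET.fromstring("""
<body>
  <a href="./leitor.do?numero=20090&amp;anchor=5975889">
    <h3>31.dez.2014</h3>
    <small>Folha de S.Paulo, Ano 94<br/>A1 - 1 ocorrência</small>
  </a>
  <a href="./leitor.do?numero=20091&amp;anchor=5975890">
    <h3>01.jan.2015</h3>
    <small>Folha de S.Paulo, Ano 94<br/>B2 - 1 ocorrência</small>
  </a>
</body>
""")

# "".join(a.itertext()) plays the role of the XPath string value of <a>.
hrefs = [a.get("href") for a in page.iter("a") if "A1" in "".join(a.itertext())]
print(hrefs)
```

With a real XPath engine the whole filter collapses into the single expression shown in the answer.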

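You don't have to use XPath for that. You can call driver.find_elements_by_partial_link_text("A1") and then call element.get_attribute("href") on each returned element. (Selenium 4 deprecated and later removed the find_elements_by_* helpers; the equivalent call there is driver.find_elements(By.PARTIAL_LINK_TEXT, "A1"), with By imported from selenium.webdriver.common.by.)
You can combine this into one line:
all_hrefs = [el.get_attribute("href") for el in driver.find_elements_by_partial_link_text("A1")]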


How to find a specific tagged interval within a find_all interval using beautifulsoup

I am trying to first find all the spans, then keep the groups whose first span has "Número de documento" as its get_text(), and finally extract from those groups only the spans that have the attribute nombcampo="it_ndoc_tom".
This is the structure of the span groups that contain "Número de documento". The second span in each group carries a nombcampo attribute with different values, and I only want the ones where it is "it_ndoc_tom":
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152723\"' nombcampo="it_ndoc_tom" paso="" vidc0='\"890102999\"'>890102999</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152725\"' nombcampo="ia_ndoc_ase" paso="" vidc0='\"52865608\"'>52865608</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152726\"' nombcampo="ib_ndoc_ben" paso="" vidc0='\"52865608\"'>52865608</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1120863\"' nombcampo="ib_ndoc_ben" paso="" vidc0='\"860002082\"'>860002082</span>]
I can already reach the spans that contain "Número de documento"; the problem is filtering, among the ones found, those that have the attribute nombcampo="it_ndoc_tom".
Here is the relevant part of the code:
dataArray = soup.find_all('div', {'class': '\\\"rDivDatosAseg'})
for data in dataArray:
    data_container = data
    ndoc_tom = data_container.find_all('span')
    if ndoc_tom[0].get_text() == "Número de documento":
        for span in ndoc_tom:
            filt_ndoc_tom = span.find_all('span', {'nombcampo': 'it_ndoc_tom'})
            print(filt_ndoc_tom)
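One likely reason the inner find_all comes back empty is that it searches inside each span for nested spans, while the span carrying nombcampo is a sibling of the title span, not a child of it. Below is a hedged sketch that searches from the container instead. It assumes cleaned-up markup (plain class names without the escaped quotes seen in the output above) and that each rDivDatosAseg div wraps one title/value pair; both are assumptions, since the real page is not shown.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical version of the page structure.
html = """
<div class="rDivDatosAseg">
  <span class="rSpnTitulo">Número de documento</span>
  <span class="rSpnValor" nombcampo="it_ndoc_tom">890102999</span>
</div>
<div class="rDivDatosAseg">
  <span class="rSpnTitulo">Número de documento</span>
  <span class="rSpnValor" nombcampo="ia_ndoc_ase">52865608</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
values = []
for container in soup.find_all("div", class_="rDivDatosAseg"):
    spans = container.find_all("span")
    # Keep only groups whose first span is the "Número de documento" title.
    if spans and spans[0].get_text(strip=True) == "Número de documento":
        # Search from the container, not from inside each individual span.
        for span in container.find_all("span", attrs={"nombcampo": "it_ndoc_tom"}):
            values.append(span.get_text(strip=True))

print(values)
```

If the real class values do contain the escaped quotes, the class filter would need to match that literal string instead.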

web data scraping : split html content

I'm scraping a website and I was able to reduce a variable called "gender" to this :
[<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>, <span style="text-decoration: none;">associé gérant </span>]
Now I'd like to keep only "associé" in the variable, but I can't find a way to split this HTML.
The reason is that I want to know whether it says "associé" (male) or "associée" (female).
Does anyone have any ideas?
Cheers
----- edit ----
here my code which gets me the html output
url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
table = soup.select_one("#adm").find_next("table") #select_one finds only the first tag that matches a selector:
table2 = soup.select_one("#adm").find_all_next("table")
output = table.select("td span[style^=text-decoration:]", limit=2) #.text.split(",", 1)[0].strip()
print(output)
Whatever the parent of the two elements is, you can select span:nth-of-type(2) to get the second span and then just check its text:
html = """<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>
<span style="text-decoration: none;">associé gérant </span>"""
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
text = soup.select_one("span:nth-of-type(2)").text
Or, if it is not always the second span, you can search for the span by the partial text associé:
import re
text = soup.find("span", string=re.compile(r"associé")).text
For your edit, all you need is to take the text of the last element and use .split(None, 1)[0] to get the first word, which carries the gender:
text = table.select("td span[style^=text-decoration:]", limit=2)[-1].text
gender = text.split(None, 1)[0] # > associé
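As a quick check of what split(None, 1) returns for the string from the question, here is a small standalone sketch. The variable names first_word, rest and is_female are only illustrative.

```python
text = "associé gérant "

# split(None, 1) splits on the first run of whitespace only.
parts = text.split(None, 1)
first_word = parts[0]          # the word that carries the gender
rest = parts[1].strip()        # remainder of the role description

# "associée" (with the extra e) is the feminine form.
is_female = first_word == "associée"
print(first_word, rest, is_female)
```

Index 0 holds "associé" (or "associée"), and index 1 holds the remainder, here "gérant".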

Parsing <sub> tag inside <td>

I'm trying to xpath parse an HTML document containing the following line:
<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>
I'm using scrapy, and the result is:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'<sub>2</sub>'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u' (gr/km)'>]
So I get three items instead of just one. I don't care about the tag itself, so how would I get a single item containing:
Emisiones de CO2 (gr/km)
This is not an isolated case; I have several items containing the tag, so I need a programmatic solution.
Any clue?
thanks!!
NOTE: Using text() instead of node() does not help:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u' (gr/km)'>]
This xpath should work //td[contains(text(),'Emisiones de CO')]/node()
Use w3lib.html.remove_tags. You can use it with an ItemLoader.
In [1]: html = '<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>'
In [2]: sel = Selector(text=html)
In [3]: map(remove_tags, sel.xpath('//td').extract())
Out[3]: [u'Emisiones de CO2 (gr/km)']
Alternatives using XPath or CSS selectors:
In [4]: u''.join(sel.xpath('//td[contains(@class,"ficha_izq")]//text()').extract())
Out[4]: u'Emisiones de CO2 (gr/km)'
In [5]: u''.join(sel.css('td.ficha_izq ::text').extract())
Out[5]: u'Emisiones de CO2 (gr/km)'
Notice the space between td.ficha_izq and ::text, and that ::text CSS pseudo element is a Scrapy extension to CSS selectors.
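The //text() and ::text approaches both work by collecting every descendant text node and joining them. The sketch below shows that joining step without Scrapy, using the standard library's ElementTree as a stand-in, where itertext() walks the same text nodes in document order.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring(
    '<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>'
)

# itertext() yields "Emisiones de CO", "2", " (gr/km)" in document order.
label = "".join(td.itertext())
print(label)
```

In a Scrapy project the equivalent is still the join over sel.xpath(...).extract() shown above.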

Jsoup css selector

I have this html code:
<div class="last-minute">
<span>Modulo:</span>4-3-3<p>Mandorlini durante questa sosta confida di recuperare
Juanito Gomez e Cirigliano, attualmente fermi ai box. Non preoccupa Hallfredsson
sostituito a Genova per un taglio al capo. </p><div class="squalificati">
<span>Squalificati :</span>-</div><div class="indisponibili"><span>Indisponibili :
</span>
<div><strong><a title="Cirigliano" href="../../../../calciatore/VERONA
HELLAS/Cirigliano">Cirigliano</a></strong>: Lesione distrattiva al flessore destro</div>
<div><strong><a title="Juanito " href="../../../../calciatore/VERONA HELLAS/Juanito
">Juanito </a></strong>: Lesione distrattiva al bicipite femorale destro</div> </div>
<div class="dubbio"><span>In dubbio :</span>-</div><div class="diffidati">
<span>Ballottaggi :</span>Jankovic 60% - Martinho 40%</div><div style='float:
left;margin-bottom: 8px;font-style: italic;color: #929292;line-height: 14px;width:
168px;'>Aggiornamento:12/11/2013 12:09:36</div>
I would like to get the "4-3-3" that comes right after <span>Modulo:</span> (on the second line).
How can I get it using a CSS selector in jsoup? Thank you.
You should use the ownText() method of the Element class (see docs), which selects only the text owned directly by the element and ignores its child tags.
For example:
String html = "<div class='last-minute'><span>Modulo:</span>4-3-3<p>Mandorlini....";
Document doc = Jsoup.parse(html);
System.out.println(doc.select("div.last-minute").first().ownText());
Will output:
4-3-3

XPath select all text content for a <div> except for a specific tag <h5>

I searched and tried several solutions for this problem but none of them worked:
I have this HTML
<div class="detalhes_colunadados">
<div class="detalhescolunadados_blocos">
<h5>Descrição completa</h5>
Sala de estar/jantar,2 vagas de garagem cobertas.<br>
</div>
<div class="detalhescolunadados_blocos">
<h5>Valores</h5>
Venda: R$ 600.000,00<br>
Condomínio: R$ 660,00<br>
</div>
</div>
I want to extract, with XPath, only the text content of the first div with class="detalhescolunadados_blocos" that is not inside an h5 tag.
I tried:
//div[@class='detalhescolunadados_blocos']/[1]/*[not(self::h5)]
Try the following XPath expression:
//div[@class='detalhescolunadados_blocos'][1]//text()[not(ancestor::h5)]
This will return:
$ xmllint --html --shell so.html
/ > xpath //div[@class='detalhescolunadados_blocos'][1]//text()[not(ancestor::h5)]
Object is a Node Set :
Set contains 2 nodes:
1 TEXT
content=
2 TEXT
content= Sala de estar/jantar,2 vagas de gar...
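To make it clearer which text nodes text()[not(ancestor::h5)] keeps, here is a rough hand-rolled equivalent using only the standard library. It is a sketch under assumptions, not the XPath itself: ElementTree cannot evaluate ancestor::, so the h5 content is skipped manually; the markup is made well-formed (br written as <br/>); and only h5 elements that are direct children of the div are handled.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<div class="detalhes_colunadados">
  <div class="detalhescolunadados_blocos">
    <h5>Descrição completa</h5>
    Sala de estar/jantar,2 vagas de garagem cobertas.<br/>
  </div>
  <div class="detalhescolunadados_blocos">
    <h5>Valores</h5>
    Venda: R$ 600.000,00<br/>
    Condomínio: R$ 660,00<br/>
  </div>
</div>
""")

# find() returns the first matching child, like the [1] in the XPath.
first = doc.find("div[@class='detalhescolunadados_blocos']")

parts = [first.text or ""]
for child in first:
    if child.tag != "h5":
        parts.extend(child.itertext())  # text inside non-h5 children
    parts.append(child.tail or "")      # text that follows the child

# Collapse the whitespace-only nodes that xmllint also reports.
description = " ".join("".join(parts).split())
print(description)
```

The whitespace-only node in the xmllint output corresponds to first.text here, which is why the result is normalised before printing.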
It seems to me that this works:
//div[@class="detalhescolunadados_blocos"]/text()
Try doing this:
//div[@class="detalhes_colunadados"]/div/text()