Jsoup css selector - html

I have this HTML code:
<div class="last-minute">
<span>Modulo:</span>4-3-3<p>Mandorlini durante questa sosta confida di recuperare
Juanito Gomez e Cirigliano, attualmente fermi ai box. Non preoccupa Hallfredsson
sostituito a Genova per un taglio al capo. </p><div class="squalificati">
<span>Squalificati :</span>-</div><div class="indisponibili"><span>Indisponibili :
</span>
<div><strong><a title="Cirigliano" href="../../../../calciatore/VERONA
HELLAS/Cirigliano">Cirigliano</a></strong>: Lesione distrattiva al flessore destro</div>
<div><strong><a title="Juanito " href="../../../../calciatore/VERONA HELLAS/Juanito
">Juanito </a></strong>: Lesione distrattiva al bicipite femorale destro</div> </div>
<div class="dubbio"><span>In dubbio :</span>-</div><div class="diffidati">
<span>Ballottaggi :</span>Jankovic 60% - Martinho 40%</div><div style='float:
left;margin-bottom: 8px;font-style: italic;color: #929292;line-height: 14px;width:
168px;'>Aggiornamento:12/11/2013 12:09:36</div>
I would like to get the "4-3-3" that comes right after <span>Modulo:</span> (2nd line).
How can I get it using a CSS selector in jsoup? Thank you.

You should use the ownText() method of the Element class (see docs), which returns only the text owned directly by the element and ignores the text inside its child elements.
For example:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<div class='last-minute'><span>Modulo:</span>4-3-3<p>Mandorlini....";
Document doc = Jsoup.parse(html);
System.out.println(doc.select("div.last-minute").first().ownText());
Will output:
4-3-3
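To make the idea of "own text" concrete, here is a minimal sketch in Python's standard library rather than jsoup. It assumes a shortened, well-formed version of the markup, and the helper name own_text is hypothetical. It only mimics what ownText() returns for this snippet and does not reproduce jsoup's whitespace normalization.

```python
import xml.etree.ElementTree as ET

def own_text(element):
    """Return the text that sits directly inside `element`, skipping child elements' text."""
    # element.text is the text before the first child; each child's .tail is
    # the text that follows that child but still belongs to the parent.
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()

# Shortened, well-formed stand-in for the markup in the question.
html = "<div class='last-minute'><span>Modulo:</span>4-3-3<p>Mandorlini...</p></div>"
div = ET.fromstring(html)
print(own_text(div))  # 4-3-3
```

The text inside <span> and <p> is left out because it belongs to those children, not to the div itself.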


How to find a specific tagged interval within a find_all interval using beautifulsoup

I am trying to first find all the spans, then keep those whose get_text() contains "Número de documento", and finally extract from them the spans that have the attribute nombcampo="it_ndoc_tom".
This is the structure of the spans that have "Número de documento", together with their sibling spans, whose nombcampo attribute takes different values. I only want to extract the ones with "it_ndoc_tom":
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152723\"' nombcampo="it_ndoc_tom" paso="" vidc0='\"890102999\"'>890102999</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152725\"' nombcampo="ia_ndoc_ase" paso="" vidc0='\"52865608\"'>52865608</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152726\"' nombcampo="ib_ndoc_ben" paso="" vidc0='\"52865608\"'>52865608</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1120863\"' nombcampo="ib_ndoc_ben" paso="" vidc0='\"860002082\"'>860002082</span>]
I can already reach the spans that have "Número de documento", but the problem is searching, within the ones found, for those that have the attribute nombcampo="it_ndoc_tom".
Here is part of the code:
dataArray = soup.find_all('div', {'class': '\\\"rDivDatosAseg'})
for data in dataArray:
    data_container = data
    ndoc_tom = data_container.find_all('span')
    if ndoc_tom[0].get_text() == "Número de documento":
        for span in ndoc_tom:
            filt_ndoc_tom = span.find_all('span', {'nombcampo': 'it_ndoc_tom'})
            print(filt_ndoc_tom)
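One likely reason the inner search comes back empty is that span.find_all('span', ...) looks only at descendants of each span, and the value spans contain no nested span. Filtering on the container instead, for example data_container.find_all('span', {'nombcampo': 'it_ndoc_tom'}), should avoid that. The sketch below shows the same attribute filter using only Python's standard library so it runs without extra packages. The class name FieldCollector and the cleaned-up sample markup are assumptions made for illustration, and the check on the "Número de documento" title is left out for brevity.

```python
from html.parser import HTMLParser

class FieldCollector(HTMLParser):
    """Collect the text of every <span> whose nombcampo attribute equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.inside = False  # True while we are inside a matching span
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("nombcampo") == self.wanted:
            self.inside = True
            self.values.append("")

    def handle_endtag(self, tag):
        # Nested spans are not handled; the sample markup has none.
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.values[-1] += data

# Simplified stand-in for two of the containers shown above (quotes cleaned up).
sample = (
    '<div class="rDivDatosAseg">'
    '<span class="rSpnTitulo">Número de documento</span>'
    '<span class="rSpnValor" nombcampo="it_ndoc_tom">890102999</span>'
    '</div>'
    '<div class="rDivDatosAseg">'
    '<span class="rSpnTitulo">Número de documento</span>'
    '<span class="rSpnValor" nombcampo="ia_ndoc_ase">52865608</span>'
    '</div>'
)

collector = FieldCollector("it_ndoc_tom")
collector.feed(sample)
print(collector.values)  # ['890102999']
```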

XPath for href links based on anchor text substring

I have this HTML and I need to make an XPath to find all the "A1" text and get the href of all those elements of the page. It has multiple A1s in the page but I need all the hrefs.
I can't crack it.
<a href="./leitor.do?numero=20090&keyword=ministro&anchor=5975889&origem=busca" class="edition" title="Folha de S.Paulo">
<figure>
<img src="https://acervo.folha.uol.com.br/files/flip/11/89/58/97/5975889/140/5975889.jpg" width="180" height="312.4">
</figure>
<h3>31.dez.2014</h3>
<p>
país. Poder Novo <b>ministro</b> diz que Congresso irá ?expurgar? culpados futuro articulador polí
</p>
<small>
Folha de S.Paulo, Ano 94 - N° 20.090<br>
A1 - 1 ocorrência
</small>
</a>
This XPath,
//a[contains(.,"A1")]/@href
will return the href attribute of every a element whose string value contains the substring "A1".
You don't have to use XPath for that. You can use driver.find_elements_by_partial_link_text("A1") (in Selenium 4, driver.find_elements(By.PARTIAL_LINK_TEXT, "A1")) and call element.get_attribute("href") on each of the returned elements.
You can combine this into one line as follows:
all_hrefs=[el.get_attribute("href") for el in driver.find_elements_by_partial_link_text("A1")]
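The XPath test contains(., "A1") works on the element's string value, which is the concatenation of all its descendant text. As a rough illustration without a browser, the sketch below applies the same test with Python's standard library. ElementTree's XPath subset has no contains(), so the check is done in Python. The markup is a simplified, well-formed stand-in for the page, and the second link is made up for contrast.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page; the second <a> is hypothetical.
page = """
<div>
  <a href="./leitor.do?numero=20090" class="edition">
    <h3>31.dez.2014</h3>
    <small>Folha de S.Paulo, Ano 94 - N° 20.090 A1 - 1 ocorrência</small>
  </a>
  <a href="./leitor.do?numero=20091" class="edition">
    <h3>01.jan.2015</h3>
    <small>B2 - 1 ocorrência</small>
  </a>
</div>
""".strip()

root = ET.fromstring(page)
# "".join(a.itertext()) plays the role of the XPath string value of each <a>.
hrefs = [a.get("href") for a in root.iter("a") if "A1" in "".join(a.itertext())]
print(hrefs)  # ['./leitor.do?numero=20090']
```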

I want to parse JSON to HTML with AngularJS; I am converting XML to JSON with npm xml2json

I am using AngularJS http to get an XML document from the server, and the Node xml2json plugin to convert it to a JSON object. The document is returned properly, but the result doesn't look right because the XML document contains HTML tags. If anyone knows how to convert the returned JSON properly, that would be great.
var x2js = new X2JS();
var dom = x2js.xml_str2json(chr);
<my_data>
<p class="justify">
El
<strong>
<em>
Content Tour 2015
</em>
</strong>
<strong>
Distributors
</strong>
</p>
</my_data>
I am still learning to program and have only been coding for a few months, so please excuse my bad code and basic questions. If anyone knows how to turn the JSON or XML into HTML, I would be very grateful.
// Add your javascript here
$(function(){
$("body").append('<notas> <copete>Editorial</copete> <titulo>grande</titulo> <seccion>Opinion</seccion> <cuerpo> <p class="rtejustify"> El <strong> <em> Foro Infochannel Tour 2015 </em> </strong> concluyó en días pasados en la hermosa ciudad de Oaxaca, Oaxaca. El escenario para concluir el recorrido no pudo ser mejor. </p> </cuerpo> </notas>')
});
See the Plunker:
plnkr.co/edit/nlEzvOeqwyGx7FwlFIEI?p=preview

Parsing <sub> tag inside <td>

I'm trying to xpath parse an HTML document containing the following line:
<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>
I'm using scrapy, and the result is:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'<sub>2</sub>'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u' (gr/km)'>]
So I get three items instead of just one. I don't care about the tag, so how would I get a single item containing:
Emisiones de CO2 (gr/km)
This is not an isolated case; I have several items containing the tag, so I need a programmatic solution.
Any clue?
thanks!!
NOTE: Using text() instead of node() does not help:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u' (gr/km)'>]
This XPath should work: //td[contains(text(),'Emisiones de CO')]/node()
Use w3lib.html.remove_tags. You can use it with an ItemLoader.
In [1]: html = '<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>'
In [2]: sel = Selector(text=html)
In [3]: map(remove_tags, sel.xpath('//td').extract())
Out[3]: [u'Emisiones de CO2 (gr/km)']
Alternatives using XPath or CSS selectors:
In [4]: u''.join(sel.xpath('//td[contains(@class,"ficha_izq")]//text()').extract())
Out[4]: u'Emisiones de CO2 (gr/km)'
In [5]: u''.join(sel.css('td.ficha_izq ::text').extract())
Out[5]: u'Emisiones de CO2 (gr/km)'
Note the space between td.ficha_izq and ::text: it acts as a descendant combinator, so text nodes of nested elements such as <sub> are matched too. The ::text CSS pseudo-element is a Scrapy extension to CSS selectors.
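The difference between text() and //text() can be seen without Scrapy. The following is a minimal standard-library sketch, not the Scrapy API: it treats the cell as well-formed XML and compares the direct text nodes with all descendant text nodes.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring('<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>')

# Direct text nodes only, like text(): the "2" inside <sub> is missing.
direct = [td.text] + [child.tail for child in td]
print(direct)  # ['Emisiones de CO', ' (gr/km)']

# All descendant text nodes in document order, like //text(), then joined.
full = "".join(td.itertext())
print(full)  # Emisiones de CO2 (gr/km)
```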

XPath select all text content for a <div> except for a specific tag <h5>

I searched and tried several solutions for this problem but none of them worked:
I have this HTML
<div class="detalhes_colunadados">
<div class="detalhescolunadados_blocos">
<h5>Descrição completa</h5>
Sala de estar/jantar,2 vagas de garagem cobertas.<br>
</div>
<div class="detalhescolunadados_blocos">
<h5>Valores</h5>
Venda: R$ 600.000,00<br>
Condomínio: R$ 660,00<br>
</div>
</div>
I want to extract, using XPath, only the text content of the first div with class="detalhescolunadados_blocos", excluding the h5 element.
I tried:
//div[@class='detalhescolunadados_blocos']/[1]/*[not(self::h5)]
Try the following XPath expression:
//div[@class='detalhescolunadados_blocos'][1]//text()[not(ancestor::h5)]
This will return:
$ xmllint --html --shell so.html
/ > xpath //div[@class='detalhescolunadados_blocos'][1]//text()[not(ancestor::h5)]
Object is a Node Set :
Set contains 2 nodes:
1 TEXT
content=
2 TEXT
content= Sala de estar/jantar,2 vagas de gar...
It seems to me that this works:
//div[@class="detalhescolunadados_blocos"]/text()
Try this:
//div[@class="detalhes_colunadados"]/div/text()
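For readers who want to see what //text()[not(ancestor::h5)] selects, here is a small sketch using Python's standard library. ElementTree does not support the ancestor axis, so the walk is written by hand. The helper name text_outside is hypothetical, and the markup is assumed to be well-formed (<br/> instead of <br>).

```python
import xml.etree.ElementTree as ET

def text_outside(element, skip_tag):
    """Yield text nodes under `element`, leaving out text inside `skip_tag` elements."""
    if element.text:
        yield element.text
    for child in element:
        if child.tag != skip_tag:
            yield from text_outside(child, skip_tag)
        # The tail follows the child's closing tag, so it belongs to `element`.
        if child.tail:
            yield child.tail

doc = ET.fromstring("""
<div class="detalhes_colunadados">
  <div class="detalhescolunadados_blocos">
    <h5>Descrição completa</h5>
    Sala de estar/jantar,2 vagas de garagem cobertas.<br/>
  </div>
  <div class="detalhescolunadados_blocos">
    <h5>Valores</h5>
    Venda: R$ 600.000,00<br/>
    Condomínio: R$ 660,00<br/>
  </div>
</div>
""".strip())

# find() returns the first matching child, like the [1] predicate in the XPath.
first_block = doc.find("div[@class='detalhescolunadados_blocos']")
texts = [t.strip() for t in text_outside(first_block, "h5") if t.strip()]
print(texts)  # ['Sala de estar/jantar,2 vagas de garagem cobertas.']
```

Whitespace-only text nodes are dropped here, which is why the result has one entry while the xmllint output above lists two nodes.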