I'm trying to parse an HTML document with XPath. It contains the following line:
<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>
I'm using scrapy, and the result is:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'<sub>2</sub>'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u' (gr/km)'>]
So I get three items instead of just one. I don't care about the <sub> tag, so how would I get a single item containing:
Emisiones de CO2 (gr/km)
This is not an isolated case: several items contain the tag, so I need a programmatic solution.
Any clue?
thanks!!
NOTE: Using text() instead of node() does not help:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u' (gr/km)'>]
This XPath should work: //td[contains(text(),'Emisiones de CO')]/node()
Use w3lib.html.remove_tags (from w3lib.html import remove_tags); Selector below comes from scrapy.selector. You can also use remove_tags with an ItemLoader.
In [1]: html = '<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>'
In [2]: sel = Selector(text=html)
In [3]: map(remove_tags, sel.xpath('//td').extract())
Out[3]: [u'Emisiones de CO2 (gr/km)']
Alternatives using XPath or CSS selectors:
In [4]: u''.join(sel.xpath('//td[contains(@class,"ficha_izq")]//text()').extract())
Out[4]: u'Emisiones de CO2 (gr/km)'
In [5]: u''.join(sel.css('td.ficha_izq ::text').extract())
Out[5]: u'Emisiones de CO2 (gr/km)'
Notice the space between td.ficha_izq and ::text, and that ::text CSS pseudo element is a Scrapy extension to CSS selectors.
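For readers without Scrapy at hand, the same idea (concatenate every descendant text node of the cell) can be sketched with the Python 3 standard library alone. This is only an illustration, not part of the answer above: it assumes the fragment is well-formed enough for xml.etree.ElementTree to parse, and the variable names are made up.

```python
from xml.etree import ElementTree as ET

# The cell from the question; it happens to be well-formed XML,
# which is an assumption this sketch relies on.
html = '<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>'

td = ET.fromstring(html)

# itertext() walks the element and all of its descendants in document
# order and yields every text node, much like //text() in XPath.
joined = "".join(td.itertext())
print(joined)
```

Real-world pages are rarely valid XML, so a tolerant HTML parser such as the one behind Scrapy's Selector remains the safer choice for whole documents.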
Related
I have html code like:
<form class="variations_form cart" action="https://example.com/name-of-product" method="post" enctype='multipart/form-data' data-product_id="386" data-product_variations="[{"attributes":{"attribute_pa_czas-realizacji":"24h"},"availability_html":"<p class=\"stock out-of-stock\">Brak w magazynie<\/p>\n","backorders_allowed":false,"dimensions":{"length":"","width":""}]">
I would like to extract "Brak w magazynie".
I have tried xpath:
//*[text() = 'Brak w magazynie']
but it doesn't work. Any idea how to do it? :)
You can use the following XPath expressions to locate this element:
//form[@class='variations_form cart']
Or
//form[@action='https://example.com/name-of-product']
Or
//form[@action='https://example.com/name-of-product' and @class='variations_form cart']
Then extract the text of the matched element.
UPD
If you want to select such elements containing Brak w magazynie in their data-product_variations attribute you can use XPath like this:
//form[@class='variations_form cart' and contains(@data-product_variations,'Brak w magazynie')]
Or
//form[@action='https://example.com/name-of-product' and contains(@data-product_variations,'Brak w magazynie')]
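The expressions above only locate the form; the wanted string itself sits inside the attribute value, which looks like JSON with an HTML fragment in it. A rough Python 3 sketch of pulling it out follows. It assumes the attribute is entity-encoded JSON in the real page, uses a shortened, hypothetical version of the attribute value (the one in the question is truncated), and VariationsParser is a made-up helper name.

```python
import html
import json
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

# Shortened, hypothetical attribute value modelled on the question.
raw_json = r'[{"availability_html":"<p class=\"stock out-of-stock\">Brak w magazynie<\/p>\n","backorders_allowed":false}]'

# Rebuild a minimal page; html.escape() stands in for the entity
# encoding a real page would apply to the attribute value.
page = '<form class="variations_form cart" data-product_variations="%s"></form>' % html.escape(raw_json, quote=True)


class VariationsParser(HTMLParser):
    """Collects the decoded data-product_variations of a <form>."""

    def __init__(self):
        super().__init__()
        self.variations = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            value = dict(attrs).get("data-product_variations")
            if value is not None:
                # HTMLParser has already unescaped the entities.
                self.variations = json.loads(value)


parser = VariationsParser()
parser.feed(page)

# The JSON field is itself an HTML fragment; strip its tags.
fragment = parser.variations[0]["availability_html"].strip()
availability_text = "".join(ET.fromstring(fragment).itertext())
print(availability_text)
```

The point of the two-step approach is that text() cannot see the string at all: it is attribute data, so it has to be decoded as JSON first and only then treated as markup.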
I have this HTML and I need to make an XPath to find all the "A1" text and get the href of all those elements of the page. It has multiple A1s in the page but I need all the hrefs.
I can't crack it.
<a href="./leitor.do?numero=20090&keyword=ministro&anchor=5975889&origem=busca" class="edition" title="Folha de S.Paulo">
<figure>
<img src="https://acervo.folha.uol.com.br/files/flip/11/89/58/97/5975889/140/5975889.jpg" width="180" height="312.4">
</figure>
<h3>31.dez.2014</h3>
<p>
país. Poder Novo <b>ministro</b> diz que Congresso irá ?expurgar? culpados futuro articulador polí
</p>
<small>
Folha de S.Paulo, Ano 94 - N° 20.090<br>
A1 - 1 ocorrência
</small>
</a>
This XPath,
//a[contains(.,"A1")]/@href
will return the href attributes of all a elements whose string value contains the substring "A1".
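To make the matching rule concrete, here is a small standard-library Python 3 sketch that mimics what this XPath does: gather all the text inside each a element and keep its href when that text contains "A1". It is an illustration only. The class name A1LinkParser is invented, and the sample markup is a shortened, partly hypothetical version of the page (the second link and its href are made up).

```python
from html.parser import HTMLParser


class A1LinkParser(HTMLParser):
    """Keeps the href of every <a> whose text contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.hrefs = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Same test as contains(., "A1") on the string value.
            if self.needle in "".join(self._text):
                self.hrefs.append(self._href)
            self._href = None


# Shortened sample; the second link is hypothetical.
sample = (
    '<a href="./leitor.do?numero=20090" class="edition">'
    '<h3>31.dez.2014</h3><small>Folha de S.Paulo<br>A1 - 1 ocorrência</small></a>'
    '<a href="./leitor.do?numero=20091" class="edition">'
    '<h3>01.jan.2015</h3><small>Folha de S.Paulo<br>B2 - 1 ocorrência</small></a>'
)

link_parser = A1LinkParser("A1")
link_parser.feed(sample)
print(link_parser.hrefs)
```

Note that a plain substring test also matches text such as "A10", which is equally true of the XPath contains() function.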
You don't have to use XPath for that. You can use driver.find_elements_by_partial_link_text("A1") and, on each of the returned elements, call element.get_attribute("href").
You can combine it to one line as follows:
all_hrefs=[el.get_attribute("href") for el in driver.find_elements_by_partial_link_text("A1")]
I am using AngularJS http to get an XML document from the server, and the xml2json plugin to convert it to a JSON object. The document is returned properly, but the result doesn't look right because the XML document contains HTML tags. If anyone knows how to convert the returned JSON properly, that would be great.
var x2js = new X2JS();
var dom = x2js.xml_str2json(chr);
<my_data>
<p class="justify">
El
<strong>
<em>
Content Tour 2015
</em>
</strong>
<strong>
Distributors
</strong>
</p>
</my_data>
I am still trying to figure this programming stuff out and have only been coding for a few months, so please excuse my bad code and basic questions. If anyone knows how to parse the JSON or XML into HTML, I will be very grateful.
// Add your javascript here
$(function(){
$("body").append('<notas> <copete>Editorial</copete> <titulo>grande</titulo> <seccion>Opinion</seccion> <cuerpo> <p class="rtejustify"> El <strong> <em> Foro Infochannel Tour 2015 </em> </strong> concluyó en días pasados en la hermosa ciudad de Oaxaca, Oaxaca. El escenario para concluir el recorrido no pudo ser mejor. </p> </cuerpo> </notas>')
});
check plunkr --
plnkr.co/edit/nlEzvOeqwyGx7FwlFIEI?p=preview
I have this html code:
<div class="last-minute">
<span>Modulo:</span>4-3-3<p>Mandorlini durante questa sosta confida di recuperare
Juanito Gomez e Cirigliano, attualmente fermi ai box. Non preoccupa Hallfredsson
sostituito a Genova per un taglio al capo. </p><div class="squalificati">
<span>Squalificati :</span>-</div><div class="indisponibili"><span>Indisponibili :
</span>
<div><strong><a title="Cirigliano" href="../../../../calciatore/VERONA
HELLAS/Cirigliano">Cirigliano</a></strong>: Lesione distrattiva al flessore destro</div>
<div><strong><a title="Juanito " href="../../../../calciatore/VERONA HELLAS/Juanito
">Juanito </a></strong>: Lesione distrattiva al bicipite femorale destro</div> </div>
<div class="dubbio"><span>In dubbio :</span>-</div><div class="diffidati">
<span>Ballottaggi :</span>Jankovic 60% - Martinho 40%</div><div style='float:
left;margin-bottom: 8px;font-style: italic;color: #929292;line-height: 14px;width:
168px;'>Aggiornamento:12/11/2013 12:09:36</div>
I would like to get the "4-3-3" that comes right after <span>Modulo:</span> (2nd line).
How can I get it using a CSS selector in jsoup? Thank you.
You should use the ownText() method of the Element class (see docs), which selects only the text owned directly by the element and ignores its child tags.
For example:
String html = "<div class='last-minute'><span>Modulo:</span>4-3-3<p>Mandorlini....";
Document doc = Jsoup.parse(html);
System.out.println(doc.select("div.last-minute").first().ownText());
Will output:
4-3-3
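For comparison, the "own text" idea can be sketched outside jsoup as well. The Python 3 snippet below is not jsoup and not part of the answer; it is a rough analogue built on xml.etree.ElementTree, where an element's own text is its leading text plus the tail text that follows each child. The helper name own_text and the closed-off sample fragment are assumptions made for the illustration.

```python
from xml.etree import ElementTree as ET


def own_text(elem):
    """Text that belongs directly to elem, ignoring its child tags."""
    # elem.text is the text before the first child; each child's
    # tail is the text between that child and the next sibling.
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts).strip()


# Shortened fragment from the question, closed so that it parses as XML.
snippet = '<div class="last-minute"><span>Modulo:</span>4-3-3<p>Mandorlini....</p></div>'

div = ET.fromstring(snippet)
modulo = own_text(div)
print(modulo)
```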
I searched and tried several solutions for this problem but none of them worked:
I have this HTML
<div class="detalhes_colunadados">
<div class="detalhescolunadados_blocos">
<h5>Descrição completa</h5>
Sala de estar/jantar,2 vagas de garagem cobertas.<br>
</div>
<div class="detalhescolunadados_blocos">
<h5>Valores</h5>
Venda: R$ 600.000,00<br>
Condomínio: R$ 660,00<br>
</div>
</div>
I want to use XPath to extract only the text content of the first div class="detalhescolunadados_blocos", excluding the h5 tags.
I tried:
//div[@class='detalhescolunadados_blocos']/[1]/*[not(self::h5)]
Try the following XPath expression:
//div[@class='detalhescolunadados_blocos'][1]//text()[not(ancestor::h5)]
This will return:
$ xmllint --html --shell so.html
/ > xpath //div[@class='detalhescolunadados_blocos'][1]//text()[not(ancestor::h5)]
Object is a Node Set :
Set contains 2 nodes:
1 TEXT
content=
2 TEXT
content= Sala de estar/jantar,2 vagas de gar...
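The filter in that expression (every text node under the first block, except those inside an h5) can also be spelled out by hand. The Python 3 sketch below does so with the standard library only. It is an approximation for illustration: the markup is rewritten as well-formed XML (br becomes <br/>), and text_outside is a hypothetical helper, not a library function.

```python
from xml.etree import ElementTree as ET

# The question's markup, rewritten as well-formed XML for this sketch.
doc = ET.fromstring("""<div class="detalhes_colunadados">
<div class="detalhescolunadados_blocos"><h5>Descrição completa</h5>
Sala de estar/jantar,2 vagas de garagem cobertas.<br/>
</div>
<div class="detalhescolunadados_blocos"><h5>Valores</h5>
Venda: R$ 600.000,00<br/>
Condomínio: R$ 660,00<br/>
</div>
</div>""")


def text_outside(elem, skip_tag):
    """Yield text nodes under elem that are not inside a skip_tag element."""
    if elem.tag != skip_tag and elem.text:
        yield elem.text
    for child in elem:
        if child.tag != skip_tag:
            yield from text_outside(child, skip_tag)
        # A child's tail sits after its closing tag, so it belongs to
        # the parent and is kept even when the child itself is skipped.
        if child.tail:
            yield child.tail


# find() returns the first matching element in document order.
first_block = doc.find(".//div[@class='detalhescolunadados_blocos']")
texts = [t.strip() for t in text_outside(first_block, "h5") if t.strip()]
print(texts)
```

As in the xmllint output, the raw result includes whitespace-only nodes; the final list comprehension drops them.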
It seems to me that this works:
//div[@class="detalhescolunadados_blocos"]/text()
Try doing this :
//div[@class="detalhes_colunadados"]/div/text()