Cannot get CSS item - html

I ran into difficulty getting a radio schedule from http://www.franceculture.fr/programmes#/2014-01-26
My code is:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
file=File.open('/Users/hubertus/Dropbox/Apps/Drafts/radio/franceculture-2014-01-29-17-10- 34.txt', 'r')
url=file.readline
target=file.readline
f = File.open('/Users/hubertus/Desktop/output.txt', 'w')
doc = Nokogiri::HTML(open(url))
doc.css(".actionnable").each do |item|
  #puts item
  heure = item.at_css(".plage").text
  text = item.at_css("a").text
  description = item.at_css(':nth-child(3)').text
  link = item.at_css('a')[:href]
  f.puts "#{heure} - #{text} -#{description} - #{link}"
end
f.close
namesarr=File.read('/Users/hubertus/Desktop/output.txt').split(/\n/)
puts namesarr.select{ |i| i < target }.max
file.close
I want to get the later CSS items. In the following HTML, I would like to get the second link (href: /emission-un-autre-jour-est-possible-pixar-%C2%AB-25-ans- d%E2%80%99animation-exposition-metamorphoses-de-la-foret-)
and its title (Pixar « 25 ans d’animation "Exposition / Métamorphoses de la forêt guyanaise : série).
<li class="actionnable">
  <span class="plage">06:00</span>
  <h2>Un autre jour est possible </h2>
  <p> Production : Tewfik Hakem. Réalisation : Thomas Dutter. </p>
  <img src="/sites/all/themes/franceculture/images/down.png" width="20" height="20" alt="déplier/replier" class="action"><ul>
    <li>
      <span>06:00</span><a href="/emission-un-autre-jour-est-possible-pixar-%C2%AB-25-ans- d%E2%80%99animation-exposition-metamorphoses-de-la-foret-" title="Pixar « 25 ans d’animation " exposition m de la for guyanaise : s>Pixar « 25 ans d’animation "Exposition / Métamorphoses de la forêt guyanaise : série</a>
      <p>
      </p>
    </li>
  </ul>
  <div class="clearer"></div>
</li>
Does anyone have an idea how to get this with CSS selectors?
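One way to reach the nested link is to scope the search to the inner list of each .actionnable entry. In Nokogiri that would presumably be a descendant selector such as item.at_css('ul li a'). The sketch below illustrates the same idea in Python with the standard library's xml.etree.ElementTree. The SAMPLE markup is a simplified, well-formed stand-in for the page (the real page is HTML and would need a proper HTML parser), and nested_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for one schedule entry (hypothetical data).
SAMPLE = """
<ul>
  <li class="actionnable">
    <span class="plage">06:00</span>
    <h2>Un autre jour est possible</h2>
    <p>Production : Tewfik Hakem.</p>
    <ul>
      <li>
        <span>06:00</span>
        <a href="/emission-exemple">Pixar, 25 ans d'animation</a>
      </li>
    </ul>
  </li>
</ul>
"""


def nested_links(markup):
    """Return (time, link text, href) for every link nested in an entry."""
    root = ET.fromstring(markup.strip())
    rows = []
    for item in root.findall(".//li[@class='actionnable']"):
        heure = item.findtext("span[@class='plage']")
        # Only look inside the inner <ul>, not at the entry's own children.
        for link in item.findall("ul/li/a"):
            rows.append((heure, link.text, link.get("href")))
    return rows


print(nested_links(SAMPLE))
```

The path ul/li/a plays the role of the CSS descendant selector: it skips the entry's own heading and paragraph and returns only the links of the inner list.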

Related

Split HTML in a certain form of list in Python

For a Django project I need to generate a fairly complex Word document from the Django website.
I started using Docx-Template, which does the job well, but I encountered a problem:
For certain "spots" in the Word template I need to break Django rich text (HTML) into something usable by Docx-Template.
So I started transforming my rich text into a list that can hold two types of elements (to keep the order of the blocks): ["some paragraphs", ('list', ['first elt of the list for bullet', 'second', 'etc...'])]
For now I have two functions: one that breaks up the HTML and one that transforms it.
My "HTML breaking function" looks like this:
from bs4 import BeautifulSoup
def decoupe_html(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    arbre = []
    # split the document into top-level HTML blocks
    for elt in soup:
        arbre.append(elt)
    print(arbre)
    # walk each element and turn it into something Word can use, kept in a list
    for elt in arbre:
        # get the opening tag of the "chunk"
        tag = elt.name
        # handle text paragraphs
        if tag == "p":
            texte = elt.text
            place = arbre.index(elt)
            arbre[place] = texte
        # handle lists
        elif tag == "ul":
            list_elt = []
            enfants = elt.findChildren()
            # collect every element of the list
            for chld in enfants:
                list_elt.append(chld.text)
            place = arbre.index(elt)
            arbre[place] = ("list", list_elt)
    return arbre
But I have trouble "breaking" more complex, multi-level lists, for example this HTML:
<p>pfoizepfkjze</p>
<ul>
  <li>blabla
    <ul>
      <li>bla2
        <ul>
          <li>bla3
            <ul>
              <li>bla4</li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </li>
  <li>rebla</li>
</ul>
What should I change in my code to keep my data structure and get, for example:
arbre = ['pfoizepfkjze',('list',['blabla',('list',['bla2',('list',['bla3',('list',['bla4'])])]),'rebla'])]
Thanks all for your help :)
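One possible approach is to keep a stack of the lists currently being filled, so that each nested <ul> is stored in place inside its parent. The sketch below is only an illustration of that idea: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, and the names ListTreeParser, decoupe_html_nested and SAMPLE_HTML are hypothetical.

```python
from html.parser import HTMLParser


class ListTreeParser(HTMLParser):
    """Collect paragraph text and nested <ul> lists, preserving order."""

    def __init__(self):
        super().__init__()
        self.tree = []            # final top-level structure
        self.stack = [self.tree]  # last entry is the list being filled

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            items = []
            # Store the nested list in place, right after its parent's text.
            self.stack[-1].append(("list", items))
            self.stack.append(items)

    def handle_endtag(self, tag):
        if tag == "ul" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only chunks between tags
            self.stack[-1].append(text)


def decoupe_html_nested(raw_html):
    parser = ListTreeParser()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    return parser.tree


SAMPLE_HTML = """<p>pfoizepfkjze</p>
<ul>
<li>blabla
<ul>
<li>bla2
<ul>
<li>bla3
<ul>
<li>bla4</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>rebla</li>
</ul>"""

print(decoupe_html_nested(SAMPLE_HTML))
```

This sketch treats every run of text as its own entry, so inline tags such as <b> inside a paragraph would split that paragraph into several strings, and only <ul> is recognised as a list.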

How to write a filter to nanoc that adds a ref to a toc after each header

I'm trying to create a filter that adds a link back to the TOC after each h2 heading.
Here is what I've done, but it doesn't work, and each time I modify the code the relevant pages are not recompiled.
{Project dir}/Rules:
compile '/**/*.md' do
  filter :kramdown
  layout '/default.*'
  filter :add_ref_to_toc
  filter :add_toc_french
  write item.identifier.without_ext + '/index.html'
end
{Project dir}/lib/filters/add_ref_to_toc.rb:
require 'nokogiri'
class AddRefToTocFilter < Nanoc::Filter
  identifier :add_ref_to_toc
  def run(content, params={})
    doc = Nokogiri::HTML.fragment(content)
    doc.css('#contents h2') do |header|
      header.add_next_sibling "<br/><a href='#toccontent'>Aller à la table des matières</a><br/><br/>"
    end
    doc.to_s
  end
end
Page after processing:
<div id="contents">
  <h2>title 1</h2>
  <br/><a href='#toccontent'>Aller à la table des matières</a><br/><br/>
  <h2>title 2</h2>
  <br/><a href='#toccontent'>Aller à la table des matières</a><br/><br/>
  <h2>title 3</h2>
  <br/><a href='#toccontent'>Aller à la table des matières</a><br/><br/>
</div>
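Here is the code that works. In the original, the block was passed directly to doc.css, which does not iterate over the matches; calling .each on the result runs the block for every header.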
require 'nokogiri'
class AddRefToTocFilter < Nanoc::Filter
  identifier :add_ref_to_toc
  def run(content, params={})
    doc = Nokogiri::HTML(content)
    headers = doc.css('#contents h2')
    headers.each do |header|
      header.add_next_sibling(Nokogiri::HTML.fragment("<br/><a href='#toccontent'>|Aller à la table des matières|</a><br/><br/>"))
    end
    doc.to_s
  end
end

Getting the next item in Beautiful Soup

With Beautiful Soup I need to extract the meaning of one expression. Of the many definitions on the page, only one is needed. The content of the soup looks like this:
<strong>occhio della testa</strong><br/>
loc.s.m <br/>
<span class="mu"title="">CO</span><span style="color:#000"> </span><br/>
follia<br/>
<strong>pagare un occhio della testa</strong><br/>
loc.v.<br/>
<span class="mu"title="">CO</span><span style="color:#000"> </span><br/>
strapagare<br/>
<strong>passare per la testa</strong><br/>
loc.v.<br/>
<span class="mu" title="">CO</span><span style="color:#000"> </span><br/>
passare per la mente<br/>
<strong>perdere la testa</strong><br/>
loc.v.<br/>
<span class="mu" title="">CO</span><span style="color:#000"> </span><br/>
entrare in uno stato di confusione mentale; impazzire, spec. fig.: ha perso la testa per quella donna, se ne è perdutamente innamorato<br/>
<strong>
What I need from the above text is :
pagare un occhio della testa:strapagare
I tried this
import urllib.request
from bs4 import BeautifulSoup
# list of expressions whose meaning I need
myitems = ['pagare un occhio della testa', '....' , '....']
for ex in myitems:
    ws = ex.split()
    li = ""
    url = "https://mydictionary/" + ws[-1] + ""
    if urllib.request.urlopen(url):
        htmlfile = urllib.request.urlopen(url)
        soup = BeautifulSoup(htmlfile, 'lxml')
        txt = soup.text
        if ex in txt:
            li = '%s = %r' % (ex, soup.next_siblings)
            print(li)
This code prints only the expression itself (ex), not its meaning. Can someone help?
I don't know how regular the structure is but for the above you can use the following (bs4 4.7.1):
soup.select_one('strong:contains("pagare un occhio della testa") ~ span + span').next_sibling.next_sibling.strip()
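If the structure is regular enough that the meaning is always the last line of text before the next <strong>, the same lookup can be done for every expression in one pass. The sketch below is only an illustration under that assumption; it uses the standard library's html.parser rather than bs4 so that it runs without extra dependencies, and DefinitionParser, definitions_by_expression and SAMPLE are hypothetical names. As far as I know, newer soupsieve releases spell the :contains() pseudo-class used above as :-soup-contains().

```python
from html.parser import HTMLParser

# Two blocks copied from the page excerpt above.
SAMPLE = """<strong>occhio della testa</strong><br/>
loc.s.m <br/>
<span class="mu" title="">CO</span><span style="color:#000"> </span><br/>
follia<br/>
<strong>pagare un occhio della testa</strong><br/>
loc.v.<br/>
<span class="mu" title="">CO</span><span style="color:#000"> </span><br/>
strapagare<br/>
"""


class DefinitionParser(HTMLParser):
    """Map each <strong> expression to the last text chunk that follows it."""

    def __init__(self):
        super().__init__()
        self.definitions = {}
        self.in_strong = False
        self.current = None  # expression whose block is being read

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self.current = None

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_strong:
            self.current = text
        elif self.current is not None:
            # Later chunks overwrite earlier ones, so the last one wins.
            self.definitions[self.current] = text


def definitions_by_expression(markup):
    parser = DefinitionParser()
    parser.feed(markup)
    parser.close()
    return parser.definitions


found = definitions_by_expression(SAMPLE)
print("pagare un occhio della testa:" + found["pagare un occhio della testa"])
```

Building the whole mapping once avoids re-querying the document for each entry in myitems, at the cost of relying on the "last line is the meaning" assumption.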

web data scraping : split html content

I'm scraping a website and I was able to reduce a variable called "gender" to this :
[<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>, <span style="text-decoration: none;">associé gérant </span>]
Now I'd like to have only "associé" in the variable, but I can't find a way to split this HTML.
The reason is that I want to know whether it's "associé" (male) or "associée" (female).
Does anyone have any ideas?
Cheers
----- edit ----
Here is my code, which produces the HTML output above:
url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
table = soup.select_one("#adm").find_next("table") #select_one finds only the first tag that matches a selector:
table2 = soup.select_one("#adm").find_all_next("table")
output = table.select("td span[style^=text-decoration:]", limit=2) #.text.split(",", 1)[0].strip()
print(output)
Whatever the parent of the two elements is you can call span:nth-of-type(2) to get the second span, then just check the text:
html = """<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>
<span style="text-decoration: none;">associé gérant </span>"""
soup = BeautifulSoup(html)
text = soup.select_one("span:nth-of-type(2)").text
Or, if it is not always the second span, you can search for the span by the partial text associé:
import re
text = soup.find("span", text=re.compile(r"associé")).text
For your edit, all you need is to take the text of the last element and use .split(None, 1)[0] to get its first word, which carries the gender:
text = table.select("td span[style^=text-decoration:]", limit=2)[-1].text
gender = text.split(None, 1)[0] # > associé
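Once the span text has been extracted, deciding between the two forms is a plain string comparison on its first word. A minimal sketch, assuming the role always starts with "associé" or "associée"; gender_from_role is a hypothetical helper name:

```python
def gender_from_role(role_text):
    """Return 'male', 'female' or None based on the first word of the role."""
    words = role_text.split()
    first = words[0] if words else ""
    # Compare whole words: "associé" is a prefix of "associée",
    # so a substring test would not tell them apart.
    if first == "associée":
        return "female"
    if first == "associé":
        return "male"
    return None


print(gender_from_role("associé gérant "))
print(gender_from_role("associée gérante"))
```

Comparing the whole first word rather than searching for a substring matters here, because a search for "associé" would also match "associée".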

Loop incrementation for url

I want to build a dynamic URL for each title. It works, but it displays 6 × 6 results because of the two different loops, and I don't know how to do it differently.
I tried using a global variable, but the result is the same (I have 6 recipes).
Thanks!
XML code:
<recette id="r1">
<titre>Cake au chocolat</titre>
<cat>dessert</cat>
<type/>
<nombre>6</nombre>
<listeingredients>
<ingredient q="150" u="g">chocolat pâtissier</ingredient>
<ingredient q="3" u="pièce">oeufs</ingredient>
<ingredient q="100" u="g">sucre en poudre</ingredient>
<ingredient q="60" u="g">farine</ingredient>
<ingredient q="1" u="cuillère à café">levure</ingredient>
<ingredient q="80" u="g">beurre</ingredient>
<ingredient q="50" u="g">poudre d'amandes</ingredient>
</listeingredients>
<cuisson>
<temps type="preparation">15</temps>
<temps type="cuisson">30</temps>
<temperature u="C">180</temperature>
</cuisson>
</recette>
<recette id="r2">
<titre>Brownies aux noix de pécan</titre>
<cat>dessert</cat>
<type/>
<nombre>6</nombre>
<listeingredients>
<ingredient q="200" u="g">chocolat à cuire</ingredient>
<cuisson>
<temps type="preparation">10</temps>
<temps type="cuisson">25</temps>
<temperature u="C">180</temperature>
</cuisson>
<instruction>
<etape>Faire fondre le chocolat avec le beurre, soit au bain-marie à feu doux, soit au micro-ondes sur programme 'décongélation'.</etape>
<etape>Quand c'est bien fondu, mélanger et ajouter le sucre, les oeufs un par un, la farine, puis les noix de pécan hachées grossièrement.</etape>
<etape>Bien mélanger et verser dans un moule carré de 20 cm (ou rectangulaire pas trop grand), chemisé de papier sulfurisé.</etape>
<etape>Mettre au four préchauffé à 180°C pendant 25 min.</etape>
<etape>Laisser refroidir et couper en carrés.</etape>
</instruction>
</recette>
</listerecettes>
XQuery code:
{
for $titre in db:open("recettes")//recette//titre, $j in (6,2,1,5,4,3)
order by $titre
return
  <a href="http://127.0.0.1:8984/rest/recettes?query=//recette[@id='r{$j}']">
    <ul>
      <li>
        {
          $titre
        }
      </li>
    </ul>
  </a>
}
</body>
</html>
After your comment I still don't really know what you want to achieve with $j; it is not useful here and can simply be removed. Because the query iterates over both $titre and $j, it returns every combination of the two, which is why you see 6 × 6 results. If you want something other than a simple list ordered by title, please clarify what you want to achieve. Otherwise, simply removing $j and using the id attribute will work fine.
for $recette in db:open("recettes")//recette
let $titre := $recette/titre
order by $titre
return
  <a href="http://127.0.0.1:8984/rest/recettes?query=//recette[@id='{$recette/@id}']">
    <ul>
      <li>
        {
          $titre
        }
      </li>
    </ul>
  </a>
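The difference between the two queries can be reproduced outside XQuery. The sketch below is only an illustration in Python with the standard library's xml.etree.ElementTree, using a hypothetical two-recipe document: iterating over titles and ids independently yields every combination, while iterating over each recette keeps the title and its own id together.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-recipe excerpt of the recipe database.
XML = """<listerecettes>
  <recette id="r1"><titre>Cake au chocolat</titre></recette>
  <recette id="r2"><titre>Brownies aux noix de pécan</titre></recette>
</listerecettes>"""

root = ET.fromstring(XML)
recettes = root.findall("recette")

# Two independent loops: every title is paired with every id (2 x 2 here).
cross = [(r.findtext("titre"), j) for r in recettes for j in ("r1", "r2")]

# One loop over the recipes: each title comes with its own id, sorted by title.
paired = sorted((r.findtext("titre"), r.get("id")) for r in recettes)

print(len(cross))
print(paired)
```

With six recipes the first form gives 36 pairs, which matches the 6 × 6 output described in the question.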
Thank you, but I had just found another way:
for $titre in db:open("recettes")//recette//titre
order by $titre
return
  <a href="http://127.0.0.1:8984/rest/recettes?query=//recette[@id='{$titre/..//@id}']">
    <ul>
      <li>
        {
          $titre
        }
      </li>
    </ul>
  </a>
I tested your code and it's correct, but you forgot "$titre" after order by!