Does ocamldoc generate documentation for functions?


I'm using ocamldoc to generate the documentation for my program. The code is not big yet (it has just one function), but when I open the HTML, the function's documentation doesn't appear in any of the files generated by ocamldoc.
I use ocamldoc -all-params arbol\ binario.ml to generate the HTML.
I read the ocamldoc documentation and tried the -all-params flag, but it didn't help. I also tried a simple non-recursive function, and the output is the same.
(** @author Roldan Rivera Luis Ricardo
@author Foo*)
(**Este modulo contiene la implementacion de una arbol binario
de busqueda BST (acrónimo del inglés Binary Search Tree)
con sus funciones basicas.
{b funciones}
- {! Crear}
- {! Insertar}
- {! Buscar}
- {! Recorrer}*)
(** Tipo de dato llamado Tree, la notacion 'a (alfa) indica que es un
tipo de dato polimorfico, es decir que puede soportar
cualquier tipo de dato. *)
type 'a tree =
| Branch of 'a * 'a tree * 'a tree (** Un elemento * sub-arbol izquierdo * sub-arbol derecho *)
| Leaf (** El fin de una rama, significa que ya no hay mas sub-arboles, equivalente al Nil *)
(** Busca el dato deseado en el arbol
    @param tree Arbol donde se va a realizar la busqueda
    @param x El valor a buscar
    @return None Si no se encuentra el dato en el arbol*)
let rec buscar tree x =
  match tree with
  | Leaf -> None
  | Branch(k,left,right) ->
    if k = x then Some x
    else if x < k then buscar left x
    else buscar right x

Did you forget to specify the HTML backend? (Also, you should not put a space in the file name, because the module name is derived from it.)
Running ocamldoc with
ocamldoc -html -all-params filename.ml
should produce the following documentation for the function:
<pre><span id="VALbuscar"><span class="keyword">val</span> buscar</span> : <code class="type">'a tree -> 'a -> 'a option</code></pre><div class="info ">
<div class="info-desc">
<p>Busca el dato deseado en el arbol</p>
</div>
<ul class="info-attributes">
<li><b>Returns</b> None Si no se encuentra el dato en el arbol</li>
</ul>
</div>
<div class="param_info"><table border="0" cellpadding="3" width="100%">
<tr>
<td align="left" valign="top" width="1%"><b>Parameters: </b></td>
<td>
<table class="paramstable">
<tr>
<td align="center" valign="top" width="15%" class="code">
tree</td>
<td align="center" valign="top">:</td>
<td><div class="paramer-type">
<code class="type">'a tree</code><div>
Arbol donde se va a realizar la busqueda
</tr>
<tr>
<td align="center" valign="top" width="15%" class="code">
x</td>
<td align="center" valign="top">:</td>
<td><div class="paramer-type">
<code class="type">'a</code><div>
El valor a buscar
</tr>
</table>
</td>
</tr>
</table></div>

Related

I have a text with html and css and need formatted text

I am rebuilding a website for taking tests. It was written in PHP, and we are now moving it to TypeScript.
I am using React and Material UI to display the feedback (the correct answer and an explanation) after the user submits an answer.
The feedback comes from an old database that we will keep using. The old code built it by creating a string variable called html and concatenating each piece of raw HTML onto it.
Something like this (in PHP, .= and . are the concatenation operators):
$html = '';
$html = '<div>';
$html .= '<p class="maintext"><span>' . $some_variable_they_want_in_span . '</span>';
$html .= $some_other_variable_that_is_empty_or_not_depending_on_if_clause . '</p>';
$html .= '</div>';
It is a pain, but the real problem is that the feedback now arrives in a format that I can't simply show in a Typography element (Material UI), and I couldn't show it in a plain p tag either.
The text looks like this, for example:
" <p>Artículo séptimo.\n</p><p>1. En el ejercicio de sus funciones, los miembros de las Fuerzas y Cuerpos de Seguridad tendrán a todos los efectos legales el carácter de AGENTES DE LA AUTORIDAD.\n</p><p>2. Cuando se cometa delito de atentado, empleando en su ejecución armas de fuego, explosivos u otros medios de agresión de análoga peligrosidad, que puedan poner en peligro grave la integridad física de los miembros de las Fuerzas y Cuerpos de Seguridad, tendrán al efecto de su protección penal LA CONSIDERACIÓN DE AUTORIDAD.\n</p><p>3. La Guardia Civil sólo tendrá consideración de fuerza armada en el cumplimiento de las misiones de carácter militar que se le encomienden, de acuerdo CON EL ORDENAMIENTO JURÍDICO.\n</p><p>\n</p> "
How can I show this as readable, nicely formatted text? I tried some npm packages and passing it through decodeURI, but nothing changed. Any ideas? Some of the text even comes in like this:
"Artículo 2. Secretaría de Estado de Seguridad.<div>Corresponde a la persona titular de la Secretaría de Estado de Seguridad el ejercicio de las funciones a las que se refiere el artículo 62 de la Ley 40/2015, de 1 de octubre, y en particular, la dirección, coordinación y supervisión de los órganos directivos dependientes de la Secretaría de Estado, bajo la inmediata autoridad de la persona titular del Ministerio, para el ejercicio de las siguientes funciones:&nbsp;</div><ul><li><span style=\"letter-spacing: -0.015rem;\">El ejercicio del mando de las Fuerzas y Cuerpos de Seguridad DEL ESTADO, y la COORDINACIÓN Y SUPERVISIÓN DE LOS SERVICIOS Y MISIONES QUE LES CORRESPONDEN.</span></li></ul>"

OK, one of my coworkers found this solution:
import textParserTest from "~/utils/textparser"; // internally uses: import parse from "html-react-parser";
...
<div
  contentEditable="true"
  dangerouslySetInnerHTML={{
    __html: textParserTest(props.question.answer, 0)
      .toString()
      .replaceAll("\\n", ""),
  }}
></div>
So basically, the html-react-parser package did the trick.
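If only readable plain text is needed (for example, to pre-clean the feedback before it reaches the front end), the markup can also be reduced to text lines. The following is a minimal sketch in Python using only the standard library, not the approach from the answer above. The names `TextExtractor`, `html_to_text` and `BLOCK_TAGS` are made up for illustration, and the sketch assumes that the `\n` sequences in the stored strings are real newline characters.

```python
from html.parser import HTMLParser

# Tags treated as line boundaries (an assumption; extend as needed).
BLOCK_TAGS = {"p", "div", "ul", "li", "br"}


class TextExtractor(HTMLParser):
    """Collects text and turns block-level tag boundaries into newlines."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.parts).replace("\xa0", " ")
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = " <p>Artículo séptimo.\n</p><p>1. En el ejercicio de sus funciones.\n</p><p>\n</p> "
    print(html_to_text(sample))
```

Inline styles such as `letter-spacing` are discarded by this sketch, so it only fits cases where the formatting itself does not need to survive.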

ITextRenderer - TextBox out of generated Pdf mediabox

I'm trying to convert an HTML document to PDF using ITextRenderer, with this version of the library:
implementation group: 'org.xhtmlrenderer', name: 'flying-saucer-pdf-itext5', version: '9.1.22'
This is how I convert my HTML to PDF:
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString(getXHTMLFromHTML(out.toString()));
renderer.layout();
renderer.createPDF(byteArrayOutputStream);
However, I have a problem with one specific part of my HTML: a table that is supposed to be a list. This is the result in Adobe Acrobat:
As you can see, the text box extends beyond the PDF's mediabox. There is no cropbox present, so the text is not fully visible.
Here is the HTML that causes the issue:
<div id="4a9dc6ae-3d2e-448b-8e70-3422659f87bd" class="alinea">
<p>Ne sont pas déductibles les dépenses suivantes:</p>
<table id="9c582e44-1932-4bee-b93e-6fb75b725f4b" style="width: 1127px;" class="custom-list">
<tbody class="body number-style dot-separator">
<tr class="row">
<td class="cell">1.</td>
<td class="cell">les dépenses faites en vue de remplir des obligations imposées à la collectivité par ses statuts ou son pacte social;</td>
</tr>
<tr class="row">
<td class="cell">2.</td>
<td class="cell">l'impôt sur le revenu des collectivités, l'impôt sur la fortune et l'impôt commercial communal;</td>
</tr>
<tr class="row">
<td class="cell">3.</td>
<td class="cell">les rémunérations imposables en vertu du premier alinéa, numéro 2 de l'<a id="e48d16e9-3eeb-4d99-91c3-f97b8d206a5e" href="https://00f74ba44b4a6fc879f193e8a262c50fb07e0f7d81-apidata.googleusercontent.com/#/o/link/1/a0ea877e-3736-4ce9-b09f-fc1be79b787a" class="tech_content LINK_INTERN">article 91</a>;</td>
</tr>
<tr class="row">
<td class="cell">4.</td>
<td class="cell">les dépenses faites dans un but cultuel, charitable ou d'intérêt général sans préjudice de la disposition prévue au premier alinéa, numéro 3 de l'<a id="3e74194c-2ae3-4b41-9efc-c6a1c23102cb" href="https://00f74ba44b4a6fc879f193e8a262c50fb07e0f7d81-apidata.googleusercontent.com/#/o/link/1/a0eb1cc1-3e9c-4175-9a52-be2d0b6cc455" class="tech_content LINK_INTERN">article 109;</a></td>
</tr>
<tr>
<td>5.</td>
<td>les intérêts ou redevances dus lorsque les conditions suivantes sont simultanément remplies :
<table class="custom-list" id="02812a11-d2d0-92cb-9c78-e9fa30f8fa1d" style="width: 955px;">
<tbody class="body lower-alpha-style one-bracket-separator">
<tr class="row">
<td class="cell">a)</td>
<td class="cell">le bénéficiaire des intérêts ou redevances est un organisme à caractère collectif au sens de l’article 159. Si le bénéficiaire n’est pas le bénéficiaire effectif, il y a lieu de prendre en considération le bénéficiaire effectif ;</td>
</tr>
<tr class="row">
<td class="cell">b)</td>
<td class="cell">l’organisme à caractère collectif qui est le bénéficiaire des intérêts ou redevances est une entreprise liée au sens de l’article 56 ;</td>
</tr>
<tr class="row">
<td class="cell">c)</td>
<td class="cell">l’organisme à caractère collectif qui est le bénéficiaire des intérêts ou redevances est établi dans un pays ou territoire figurant à l’annexe I des conclusions du Conseil de l’Union européenne relatives à la liste révisée de l’Union européenne des pays et territoires non coopératifs à des fins fiscales (ci-après « annexe I »), dans les conditions spécifiées ci-après.</td>
</tr>
</tbody>
</table><p></p><p>Toutefois, la disposition du présent numéro n’est pas applicable si le contribuable apporte la preuve que l’opération à laquelle correspondent les intérêts ou redevances dus est utilisée pour des motifs commerciaux valables qui reflètent la réalité économique.</p><p></p><p>Le terme « intérêts » employé dans le présent numéro désigne les intérêts et arrérages dus qui se rapportent à des créances de toute nature, assorties ou non de garanties hypothécaires ou d’une clause de participation aux bénéfices du débiteur, et notamment les intérêts et arrérages d’obligations d’emprunts, y compris les primes et lots attachés à ces titres. Les pénalisations pour paiement tardif ne sont pas considérées comme des intérêts au sens du présent numéro.</p><p></p><p>Le terme « redevances » employé dans le présent numéro désigne les rémunérations de toute nature dues pour l’usage ou la concession de l’usage d’un droit d’auteur sur une œuvre littéraire, artistique ou scientifique, y compris les films cinématographiques, d’un brevet, d’une marque de fabrique ou de commerce, d’un dessin ou d’un modèle, d’un plan, d’une formule ou d’un procédé secrets et pour des informations ayant trait à une expérience acquise dans le domaine industriel, commercial ou scientifique.</p><p></p><p>À partir du 1er mars 2021, la disposition du présent numéro s’applique concernant les pays et territoires qui figurent à l’annexe I, dans sa dernière version, telle que publiée au Journal officiel de l’Union européenne à cette date. 
À partir du 1er janvier de chaque année qui suit, elle s’applique concernant les pays et territoires qui figurent à l’annexe I, dans sa dernière version au 1er janvier de l’année subséquente en question, telle que publiée au Journal officiel de l’Union européenne à cette date.</p><p></p><p>Toutefois, lorsque des pays et territoires ne figurent plus à l’annexe I, dans sa dernière version au 1er janvier d’une année subséquente, telle que publiée au Journal officiel de l’Union européenne à cette date, la disposition du présent numéro cesse de s’appliquer concernant ces pays et territoires dès la date de publication au Journal officiel de l’Union européenne de l’annexe I dans sa dernière version mentionnée ci-avant. En cas de version antérieure de l’annexe I au cours de la même année opérant pour la première fois le retrait du pays ou territoire en question, la disposition du présent numéro cesse de s’appliquer déjà dès la date de publication au Journal officiel de l’Union européenne de l’annexe I, dans une telle version antérieure opérant le retrait du pays ou territoire en question.</p></td>
</tr>
</tbody>
</table>
</div>
Thank you very much for your help. I have looked everywhere, but I haven't found anybody else with this issue.

It turns out that iText wraps the content of a table (or any element) within the PDF's margins only if the element does not specify a fixed width or height. So the solution in my case was to remove the width and height from my HTML table.
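When the HTML comes from elsewhere and cannot be edited by hand, the fixed sizes could be stripped in a pre-processing step before the document is handed to the renderer. The sketch below shows one way to do that in Python with regular expressions. The function name `drop_fixed_table_sizes` is hypothetical, and the sketch assumes sizes are set through a double-quoted inline `style` attribute on the `table` tag, as in the HTML from the question.

```python
import re

# A width/height declaration inside an inline style, e.g. "width: 1127px;".
# The lookbehind keeps properties such as max-width or border-width intact.
_SIZE_DECL = re.compile(r"(?<![\w-])(?:width|height)\s*:[^;]*;?\s*", re.IGNORECASE)
_STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"', re.IGNORECASE)
_TABLE_TAG = re.compile(r"<table\b[^>]*>", re.IGNORECASE)


def _clean_style(match):
    """Remove width/height from one style attribute; drop it if nothing is left."""
    remaining = _SIZE_DECL.sub("", match.group(1)).strip()
    return ' style="%s"' % remaining if remaining else ""


def drop_fixed_table_sizes(html):
    """Strip fixed width/height declarations from the style of every <table> tag."""
    return _TABLE_TAG.sub(lambda m: _STYLE_ATTR.sub(_clean_style, m.group(0)), html)


if __name__ == "__main__":
    tag = '<table id="9c582e44" style="width: 1127px;" class="custom-list">'
    print(drop_fixed_table_sizes(tag))
```

Regular expressions are enough for attributes this regular. For arbitrary HTML, a real parser would be the safer choice.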

How to enter Carriage Return in XSL

I am programming a simple web app. I have the following XML:
<pelis>
<nombre id="avengers">Valoracion de Avengers Engame
<Valoracion>Calificacion general: 7.5</Valoracion>
<Valorpro>Calificacion de Cartelera Tinajo: 8</Valorpro>
<ComentariosP>Positivo: Buena pelicula, con un argumento interesante y envolvente. Las escenas estan bien rodadas y las coreografias bien ejecutadas. La banda sonora es buena, asi como el vestuario y sobre todo el maquillaje y los efectos especiales.</ComentariosP>
<ComentariosN>Negativo: Demasiado larga para nuestro gusto. Algunos fallos de guion y de realizacion. El sonido en algunas escenas es demasiado elevado y poco claro. El argumento es predecible desde muy temprano.</ComentariosN>
</nombre>
<nombre id="mamma">Valoracion de Mamma Mia!
<Valoracion>Calificacion general: 8</Valoracion>
<Valorpro>Calificacion de Cartelera Tinajo: 8</Valorpro>
<ComentariosP>Positivo: Pelicula muy recomendable. Unifica una comedia elaborada con una trama interesante y adictiva. La banda sonora de ABBA es impecable, sumado a un escenario espectacular, asi como unas coreografias muy elaboradas.</ComentariosP>
<ComentariosN>Negativo: Guion predecible. En algunas partes se hace un poco aburrida y no consigue captar al espectador. </ComentariosN>
</nombre>
<nombre id="hobbit">Valoracion de El Hobbit
<Valoracion>Calificacion general: 8</Valoracion>
<Valorpro>Calificacion de Cartelera Tinajo: 10</Valorpro>
<ComentariosP>Positivo: Gran película. Al igual que sus predecesoras (El señor de los Anillos I,II y III), es una obra de arte. Música impresionante, universo envolvente, escenas increibles, trama interesante, vestuario peculiar y una infinidad de aspectos positivos.</ComentariosP>
<ComentariosN>Negativo: El unico punto negativo es que puede ser un poco larga</ComentariosN>
</nombre>
<nombre id="hotel">Valoracion de Hotel Transilvania
<Valoracion>Calificacion general: 8.5</Valoracion>
<Valorpro>Calificacion de Cartelera Tinajo: 8</Valorpro>
<ComentariosP>Positivo: Una comedia para toda la familia. Peculiar en gran medida, dado que enfoca a los monstruos como seres humanizados. Buena trama, actores conocidos. Muy entretenida.</ComentariosP>
<ComentariosN>Negativo: No tiene ningun punto negativo destacable</ComentariosN>
</nombre>
<nombre id="lobo">Valoracion de El Lobo de Wall Street
<Valoracion>Calificacion general: 8</Valoracion>
<Valorpro>Calificacion de Cartelera Tinajo: 9</Valorpro>
<ComentariosP>Positivo: Una pelicula muy curiosa. Cuenta con grandes actores, un guion elaborado, unas escenas muy bien rodadas. Se sigue con gran interes y tiene giros que la convierten en un peliculon</ComentariosP>
<ComentariosN>Negativo: Demasiado larga (3 horas de duracion)</ComentariosN>
</nombre>
</pelis>
and I also have the following XSL to transform it to HTML:
What I want is for all the text shown in the HTML to have a carriage return after every line. I have tried adding and also what you can see in the code, but it doesn't work.

Because the XSLT output is HTML, you can use <br/> for that, or style the <p> tag with CSS.

A NL or CR character in an HTML document displays as an ordinary space in the browser. If you want text to start on a new line when rendered in the browser, you typically want to output a <br> element.
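The same idea, written out in Python for illustration rather than XSLT: each text line becomes its own line in the browser only once an explicit break element separates it from the next. This is a minimal sketch, and the function name `lines_to_html` is hypothetical.

```python
from html import escape


def lines_to_html(text):
    """Join the non-empty lines of `text` with <br/> so each renders on its own line."""
    lines = [line.strip() for line in text.splitlines()]
    # escape() keeps characters such as < and & from being read as markup.
    return "<br/>\n".join(escape(line) for line in lines if line)


if __name__ == "__main__":
    review = "Valoracion de El Hobbit\n  Calificacion general: 8\n  Calificacion de Cartelera Tinajo: 10\n"
    print(lines_to_html(review))
```

The newline after each `<br/>` only keeps the generated source readable. The browser ignores it.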

Delete spaces around specific tags in an html file in shell

I'm trying to write a shell script that minifies HTML files, but I've run into a problem.
I would like to delete the spaces on each side of specific HTML tags, which are read from a file. With perl nothing happens at all; with sed, in two commands, I almost get what I want. In the example below, the spaces around some tags are removed, but not all of them: there is a problem with the "section" tags, and with "h2" too, even though the pattern matches...
for tag in $tag_file ; do
# perl -e '$comHtml=<>; $comHtml=~s/ *(<${tag} *.* *>) */\1/g; print $comHtml' < tmp_html
sed -i -r -e "s: *(<${tag} *.* *>) *:\1:gI" ./tmp_html
sed -i -r -e "s: *(</${tag} *.* *>) *:\1:gI" ./tmp_html
done
Here, $tag_file contains the tags read from a file, for example $tag_file = html \n head \n section \n ...
Input HTML:
<!doctype html> <html lang="fr"> <head> <meta charset="UTF-8"> <title>La gazette de L-INFO</title> <link rel="stylesheet" type="text/css" href="./styles/gazette.css"> </head> <body> <nav> <ul> <li>Accueil</li> <li>Toute l'actu</li> <li>Recherche</li> <li>La rédac'</li> <li>jbigoude <ul> <li>Mon profil</li> <li>Nouvel article</li> <li>Se déconnecter</li> </ul> </li> </ul> </nav> <header> <img src="./images/titre.png" alt="La gazette de L-INFO" width="780" height="83"> <h1>Le site de désinformation n°1 des étudiants en Licence Info</h1> </header> <main> <section class="centre"> <h2>À la Une</h2> <img src="images/hacker.jpg" alt="Un mouchard dans un corrigé de Langages du Web"><br> Un mouchard dans un corrigé de Langages du Web <img src="images/hymne.jpg" alt="Votez pour l'hymne de la Licence"><br> Votez pour l'hymne de la Licence <img src="images/melenchon.jpg" alt="L'amphi Sciences Naturelles bientôt renommé amphi Mélenchon"><br> L'amphi Sciences Naturelles bientôt renommé amphi Mélenchon </section> <section class="centre"> <h2>L'info brûlante</h2> <img src="images/walkingdead.jpg" alt="Il avait annoncé 'Je vais vous défoncer' l'enseignant relaxé"><br> Il leur avait annoncé "Je vais vous défoncer" l'enseignant relaxé <img src="images/pingouins.jpg" alt="Des pinguoins dans l'amphi B"><br> Toute une famille de pingouins découverte dans l'amphi B <img src="images/macron.jpg" alt="Emmanuel Macron obtient sa Licence d'Info en EAD"><br> Emmanuel Macron obtient sa Licence Info en EAD </section> <section class="centre"> <h2>Les incontournables</h2> <img src="images/arnaque.jpg" alt="Arnaque au devoir corrigé de TLSP"><br> Une arnarque au corrigé de TL mise à jour <img src="images/calendrier.jpg" alt="Le calendier des Dieux de la Licence bientôt disponible"><br> Le calendier des Dieux de la Licence bientôt disponible <img src="images/sondage.jpg" alt="Allez-vous réussir votre année ?"><br> Résultat de notre sondage : allez-vous réussir votre année ? 
</section> <section> <h2>Horoscope de la semaine</h2> <p>Vous l'attendiez tous, voici l'horoscope du semestre pair de l'année 2019-2020. Sans surprise, il n'est pas terrible...</p> <table id="horoscope"> <tr> <td>Signe</td> <td>Date</td> <td>Votre horoscope</td> </tr> <tr> <td>♈ Bélier</td> <td>du 21 mars<br>au 19 avril</td> <td rowspan="4"> <p>Après des vacances bien méritées, l'année reprend sur les chapeaux de roues. Tous les signes sont concernés. </p> <p>Jupiter s'aligne avec Saturne, péremptoirement à Venus, et nous promet un semestre qui ne sera pas de tout repos. Février sera le mois le plus tranquille puisqu'il ne comporte que 29 jours.</p> <p>Les fins de mois seront douloureuses pour les natifs du 2e décan au moment où tomberont les tant-attendus résultats du module d'<em>Algorithmique et Structures de Données</em> du semestre 3.</p> </td> </tr> <tr> <td>♉ Taureau</td> <td>du 20 avril<br>au 20 mai</td> </tr> <tr> <td>...</td> <td>...</td> </tr> <tr> <td>♓ Poisson</td> <td>du 20 février<br>au 20 mars</td> </tr> </table> <p>Malgré cela, notre équipe d'astrologues de choc vous souhaite à tous un bon semestre, et bon courage pour le module de <em>Système et Programmation Système</em>.</p> </section> </main> <footer>© Licence Informatique - Janvier 2020 - Tous droits réservés</footer> </body> </html>
Output HTML:
<!doctype html><html lang="fr"><head><meta charset="UTF-8"><title>La gazette de L-INFO</title><link rel="stylesheet" type="text/css" href="./styles/gazette.css"></head><body><nav><ul> <li>Accueil</li> <li>Toute l'actu</li> <li>Recherche</li> <li>La rédac'</li> <li>jbigoude <ul> <li>Mon profil</li> <li>Nouvel article</li> <li>Se déconnecter</li></ul> </li> </ul></nav><header> <img src="./images/titre.png" alt="La gazette de L-INFO" width="780" height="83"><h1>Le site de désinformation n°1 des étudiants en Licence Info</h1></header><main><section class="centre"><h2>À la Une</h2> <img src="images/hacker.jpg" alt="Un mouchard dans un corrigé de Langages du Web"><br> Un mouchard dans un corrigé de Langages du Web <img src="images/hymne.jpg" alt="Votez pour l'hymne de la Licence"><br> Votez pour l'hymne de la Licence <img src="images/melenchon.jpg" alt="L'amphi Sciences Naturelles bientôt renommé amphi Mélenchon"><br> L'amphi Sciences Naturelles bientôt renommé amphi Mélenchon </section> <section class="centre"> <h2>L'info brûlante</h2> <img src="images/walkingdead.jpg" alt="Il avait annoncé 'Je vais vous défoncer' l'enseignant relaxé"><br> Il leur avait annoncé "Je vais vous défoncer" l'enseignant relaxé <img src="images/pingouins.jpg" alt="Des pinguoins dans l'amphi B"><br> Toute une famille de pingouins découverte dans l'amphi B <img src="images/macron.jpg" alt="Emmanuel Macron obtient sa Licence d'Info en EAD"><br> Emmanuel Macron obtient sa Licence Info en EAD </section> <section class="centre"> <h2>Les incontournables</h2> <img src="images/arnaque.jpg" alt="Arnaque au devoir corrigé de TLSP"><br> Une arnarque au corrigé de TL mise à jour <img src="images/calendrier.jpg" alt="Le calendier des Dieux de la Licence bientôt disponible"><br> Le calendier des Dieux de la Licence bientôt disponible <img src="images/sondage.jpg" alt="Allez-vous réussir votre année ?"><br> Résultat de notre sondage : allez-vous réussir votre année ? 
</section> <section> <h2>Horoscope de la semaine</h2><p>Vous l'attendiez tous, voici l'horoscope du semestre pair de l'année 2019-2020. Sans surprise, il n'est pas terrible...</p> <table id="horoscope"> <tr> <td>Signe</td> <td>Date</td> <td>Votre horoscope</td> </tr> <tr> <td>♈ Bélier</td> <td>du 21 mars<br>au 19 avril</td> <td rowspan="4"> <p>Après des vacances bien méritées, l'année reprend sur les chapeaux de roues. Tous les signes sont concernés. </p> <p>Jupiter s'aligne avec Saturne, péremptoirement à Venus, et nous promet un semestre qui ne sera pas de tout repos. Février sera le mois le plus tranquille puisqu'il ne comporte que 29 jours.</p> <p>Les fins de mois seront douloureuses pour les natifs du 2e décan au moment où tomberont les tant-attendus résultats du module d'<em>Algorithmique et Structures de Données</em> du semestre 3.</p> </td> </tr> <tr> <td>♉ Taureau</td> <td>du 20 avril<br>au 20 mai</td> </tr> <tr> <td>...</td> <td>...</td> </tr> <tr> <td>♓ Poisson</td> <td>du 20 février<br>au 20 mars</td> </tr> </table> <p>Malgré cela, notre équipe d'astrologues de choc vous souhaite à tous un bon semestre, et bon courage pour le module de <em>Système et Programmation Système</em>.</p> </section></main><footer>© Licence Informatique - Janvier 2020 - Tous droits réservés</footer></body></html>

The main issue with your perl line is your quoting; you used single quotes, which means you don't have to escape the Perl variables, but it also means the shell variable ${tag} will be interpreted by Perl (where it's empty) not the shell. You can access shell variables more easily from Perl by either passing them as arguments or environment variables. You also didn't use the -i switch for in-place editing, so you just printed the changes to STDOUT.
With ojo installed, you can do this with a proper HTML parser and thus not be susceptible to edge cases:
env tag=$tag perl -0777 -pi -CS -Mojo -e '$_ = x($_);
$_->find($ENV{tag})->each(sub {
$_->content($_->content =~ s/\A *//r =~ s/ *\z//r);
my ($p, $n) = ($_->previous_node, $_->next_node);
$p->content($p->content =~ s/ *\z//r) if defined $p and ($p->type eq "text" or $p->type eq "raw");
$n->content($n->content =~ s/\A *//r) if defined $n and ($n->type eq "text" or $n->type eq "raw");
})' tmp_html
The -0777 switch ensures the file will be operated on in one step rather than by line, -pi wraps the code in a loop which will assign the input to $_ and then update that file in-place with the resulting value of $_, and -CS ensures it will be decoded from UTF-8 to parse and encoded back after.
The x function from ojo creates a Mojo::DOM object, which can then find every instance of the requested tag and operate on it (which includes its contents and closing tag).
The substitution operations: s/\A *//r and s/ *\z//r remove all space characters from the beginning or end of the string respectively, and return the modified string (/r prevents it from operating in place, so you can use this with Mojo::DOM's content method). To instead remove any whitespace characters (including newlines), use s/\A\s*//r and s/\s*\z//r.

After some communication with the OP, I hope I have understood the problem properly.
The HTML tags are stored in a separate file, one per line (tag_file.txt), and the HTML page is in another file (file.html).
The code should strip the whitespace around the opening and closing tags listed in the tag file (tag_file.txt) from the HTML page (file.html).
NOTE: the processing is done entirely in a Perl script, without the shell's assistance (which shortens processing time).
use strict;
use warnings;
use feature 'say';

my $tag_file  = 'tag_file.txt';
my $html_file = 'file.html';

open my $fh_tag, '<', $tag_file          # open tag file
    or die "Couldn't open $tag_file: $!";
my @tags = <$fh_tag>;                    # read tags into array
chomp @tags;                             # remove eol from tag lines
close $fh_tag;                           # close tag file

open my $fh_html, '<', $html_file        # open html file
    or die "Couldn't open $html_file: $!";
my $html = do { local $/; <$fh_html> };  # read whole file into variable
close $fh_html;                          # close html file

# now make substitution for each read tag
# (\b keeps a tag such as "head" from also matching <header>)
for my $tag (@tags) { $html =~ s!\s*(</?\Q$tag\E\b[^>]*>)\s*!$1!g; }
say $html;
Content of tag_file.txt
html
head
body
section
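For comparison, the same substitution can be sketched in Python. This is an illustrative port of the regex approach above, not a replacement for a real HTML parser. The function name `strip_around_tags` is hypothetical, and the word boundary after the tag name is there so that a tag such as `head` does not also match `<header>`.

```python
import re


def strip_around_tags(html, tags):
    """Remove whitespace on both sides of the opening and closing forms of each tag."""
    for tag in tags:
        # </? matches opening and closing tags, \b ends the tag name,
        # and [^>]* allows attributes up to the closing angle bracket.
        pattern = re.compile(r"\s*(</?%s\b[^>]*>)\s*" % re.escape(tag), re.IGNORECASE)
        html = pattern.sub(r"\1", html)
    return html


if __name__ == "__main__":
    page = '<body> <header> <h2>Title</h2> </header> <section class="centre"> text </section> </body>'
    print(strip_around_tags(page, ["body", "section"]))
```

In a script, the tag list would be read from tag_file.txt, one name per line, just as in the Perl version.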

XPath: How do I extract the text from almost equivalent html structure?

I'm trying to extract some news articles from a webpage, but the html structure changes for some of these articles.
In one case I have this:
and in another case I have this:
Is there a XPath I can use that could extract the full text on both cases?
EDIT
For the first case I have tried this (using scrapy shell):
response.xpath('//*[@class="mioloNoticia"]/article/text()').extract()
which results in:
[u'\r\n ',
u'\r\n ',
u'\r\n ',
u'\r\n ',
u'\r\n ',
u'\r\n ',
u'Entendo as raz\xf5es dos defensores do acordo ortogr\xe1fico. Tamb\xe9m entendo as raz\xf5es pelas quais uma criatura acredita que \xe9 Napole\xe3o Bonaparte. ',
u'Mais dif\xedcil de compreender \xe9 o motivo que leva uma classe pol\xedtica inteira a seguir com respeito Napole\xe3o Bonaparte. O problema do acordo nem sequer \xe9 t\xe9cnico ou jur\xeddico. Isso \xe9 \xf3bvio: qualquer um sabe que aquilo \xe9 uma aberra\xe7\xe3o lingu\xedstica (a grafia como mera transcri\xe7\xe3o fon\xe9tica?) e uma ilegalidade completa (lembrar as acrobacias jur\xeddicas que se fizeram sobre o texto original). Sem falar da ambi\xe7\xe3o autorit\xe1ria de submeter 300 milh\xf5es de falantes a um capricho racionalista. ',
u'O problema do acordo \xe9 termos tido v\xe1rios governos que, reverentes e analfabetos, foram ratificando, modificando e legislando como se o acordo fosse mesmo para levar a s\xe9rio. Se Marcelo ajudar a acabar com esta farsa, a sua Presid\xeancia j\xe1 ter\xe1 valido a pena.']
and for the second case I have tried:
response.xpath('//*[@class="mioloNoticia"]/article/p/text()').extract()
which results in a better extraction:
[u'Os c\xe1lculos na ves\xedcula, os sintomas de um reumatismo que o atacava quando o Outono se aproximava ou a certeza de que o fim das coisas era inevit\xe1vel abriam-lhe a porta ao pessimismo em geral e \xe0 descren\xe7a no futuro \u2013 mas a vis\xe3o de um mundo encavalitado \xe0s costas do "progresso" era o aspecto mais penoso da exist\xeancia. A esta dist\xe2ncia, compreendo-o; ser "contra o progresso" \xe9 nos nossos dias um pecado capital, e resmungar contra "a criatividade" tornou-se uma apostasia definitiva e dram\xe1tica.',
u'O "ser humano" est\xe1 condenado a acreditar na criatividade sem limites, na originalidade, no progresso, na mudan\xe7a e, finalmente, na ideia de que as coisas novas s\xe3o sempre superiores \xe0s antigas. Isto pode fazer confus\xe3o a um velho do Alto Minho, educado pela vida (e pelos desaires) a apreciar as coisas que permanecem e a desconfiar das inven\xe7\xf5es em que n\xe3o v\xea grande utilidade.',
u'A minha sobrinha Maria Lu\xedsa \u2013 a eleitora esquerdista da fam\xedlia \u2013 j\xe1 foi uma sacerdotisa do Progresso (com mai\xfascula). Hoje, desconfia bastante da direc\xe7\xe3o que as coisas tomam, e o seu optimismo em rela\xe7\xe3o \xe0 esp\xe9cie humana \xe9 morigerado. Alimento a esperan\xe7a, dissimulada por muita cautela e certo tom de ironia, de v\xea-la feliz como Dona Ester, minha m\xe3e, gostava de ver felizes os seus filhos, espalhados sobre o areal da praia de Afife, respirando o iodo da tarde e abrigando-se do vento galego que descia pelo litoral. Os sucessos e insucessos dos \xfaltimos setenta anos ensinaram-me a desejar pouco, a aceitar a grandeza das coisas desconhecidas, a reler os livros que j\xe1 foram belos algum dia, a manter alguma f\xe9 numa ordem que comanda os planetas ou a solid\xe3o das dunas de Moledo. Ao mesmo tempo, esse ego\xedsmo n\xe3o faz mal aos outros. N\xe3o exige muito deles. N\xe3o lhes oferece demasiadas desilus\xf5es, nem utopias, nem promessas v\xe3s de um mundo perfeito. N\xe3o lhes alimenta a f\xe9 nas coisas imposs\xedveis que exigem que os outros mudem para que n\xf3s possamos satisfazer os desejos pessoais.',
u'Esse mundo perfeito existe, sim \u2013 mas terminou h\xe1 muito, antes do progresso, da democracia e dos d\xe9fices da economia. Tamb\xe9m \xe9 preciso lembrar que n\xe3o se pode voltar atr\xe1s nem \xe9 poss\xedvel recuperar o tempo perdido. O que est\xe1 perdido, est\xe1 perdido. O que passou, passou h\xe1 muito.']
But I'm after a single XPath that could work for both cases.

If the only difference between the articles is whether the text sits directly inside the article element or inside <p> tags, the XPath is quite simple:
//article/text() | //article/p/text()
This extracts the text whether only one of the two forms is present or both.
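To make it concrete which text nodes that union selects, here is a sketch using only Python's standard library. ElementTree's limited XPath subset supports neither text() nor the | operator, so the function below emulates the expression by hand instead of evaluating it. The function name `article_text` and the two sample documents are made up for illustration; in Scrapy the expression itself can be passed to response.xpath as in the question.

```python
import xml.etree.ElementTree as ET


def article_text(root):
    """Emulate //article/text() | //article/p/text() in document order."""
    chunks = []
    for article in root.iter("article"):
        if article.text:
            chunks.append(article.text)  # article text before its first child
        for child in article:
            if child.tag == "p":
                if child.text:
                    chunks.append(child.text)  # p text before its first child
                for grandchild in child:
                    if grandchild.tail:
                        chunks.append(grandchild.tail)  # p text after an inline child
            if child.tail:
                chunks.append(child.tail)  # article text after this child
    # Drop whitespace-only nodes such as the '\r\n ' entries in the question.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    plain = ET.fromstring('<div class="mioloNoticia"><article><br/>First.<br/>Second.</article></div>')
    wrapped = ET.fromstring('<div class="mioloNoticia"><article><p>One.</p><p>Two.</p></article></div>')
    print(article_text(plain))
    print(article_text(wrapped))
```

As with the XPath itself, text nested deeper than a direct p child (for example inside a b element within the p) is not selected.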
