perl XML::Feed error reading html content inside rss feed - html

I'm trying to parse and read the content of an RSS feed and am getting an error.
This is my RSS feed file (for test purposes):
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:Test="http://www.Test.com">
<channel>
<title>Deportes - Test.com</title>
<link>http://www.Test.com</link>
<description>Últimas noticias de deportes</description>
<item>
<title><![CDATA[El 'Chacho' Coudet es el nuevo entrenador de Rosario Central]]></title>
<link>http://442.Test.com/2014-12-15-326653-coudet-fue-presentado-como-nuevo-dt-de-central/</link>
<description><![CDATA[El Chacho, Tengo mucha alegría y ganas de empezar a trabajar. No esperaba que sea acá”, reconoció.]]></description>
<category><![CDATA[Deportes]]></category>
<pubDate>15 12 2014 06:15:0 +0000</pubDate>
<enclosure url="http://www.Test.com/__export/1418678333348/sites/diarioTest/img/2014/12/15/deportes/1215_coudet_g_fb.jpg" type="image/jpeg"><![CDATA[El Chacho Coudet]]></enclosure>
<author><![CDATA[]]></author>
<content><![CDATA[<p>Eduardo Coudet fue presentado como nuevo entrenador deRosario Central&nbsp.</p>
]]></content>
</item>
</channel>
</rss>
This is my test.pl script file:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Feed;
my $feed = XML::Feed->parse("test.xml");
for my $entry ($feed->entries) {
    print $entry->content;
}
When I run this code I get this error:
Can't use string ("<p>Eduardo Coudet fue prese"...) as a HASH ref while "strict refs" in use at C:/Strawberry/perl/site/lib/XML/Feed/Entry/Format/RSS.pm line 91.
I think it is a bug inside XML::Feed.
Reference: https://github.com/davorg/xml-feed/blob/master/lib/XML/Feed/Format/RSS.pm
Thanks

SOLVED
The developer of this library fixed the bug in version 0.53:
https://github.com/davorg/xml-feed/issues/16
Thanks, I hope this helps others with the same issue.

Related

Get different attributes of a single TEI-Tags with XSLT

I have XML (TEI) code like this:
<pb n="19"/> <lb n="1"/><rs type="author" xml:id="MH"><rs type="patient" xml:id="BavoilMr">Mr. Bavoil</rs> - 56 ans - clincailler au quai au<supplied reason="omitted">x</supplied> fleur<supplied reason="omitted">s</supplied> - <lb n="2"/>100 toujours l'ouïe dure <lb n="3"/>26 mai<note>"mai" korrigiert aus "mars".</note>- l'oreille droite jette du pus depuis 6 ou 8 mois - ce mois<supplied reason="omitted">-</supplied> <lb n="4"/>ci encore plus
- surdité de cette oreille depuis 2 mois <lb n="5"/>il a eu un coup d'air en route - depuis 15 ans il a eu <lb n="6"/>l'oreille dure alternativement l'une et l'autre - <lb n="7"/>maintenant alternativement aussi <lb n="8"/>douleur <del rend="crossout">dans</del> sur l'os externe du coude il ne peut rien lever en
and want to translate it into an HTML file where the rs tag <rs type="author" xml:id="MH"> ... </rs> becomes an anchor like <a id="MH"> ... </a>.
My XSL code can translate one rs tag with a specific attribute value:
<xsl:template match="//tei:rs[@xml:id='MH']">
<a id="MH">
<xsl:apply-templates/>
</a>
</xsl:template>
but I cannot iterate through all the rs tags to take each xml:id and write it as the id of the <a> element,
like:
<a id="Bavoil"> ... </a>
<a id="xml_id_of_person2"> ... </a>
<a id="xml_id_of_person3"> ... </a>
Can someone help me?
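You can use the following to match all the <rs> nodes that have an @id attribute. (In the TEI document itself, with the tei prefix bound as in the question, the equivalent pattern would be tei:rs[@xml:id], with {@xml:id} as the attribute value.)
<xsl:template match="rs[@id]">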
Sample XML
<root>
<rs type="author" id="MH"></rs>
<rs type="patient" id="BavoilMr"></rs>
</root>
XSLT 1.0
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="rs[@id]">
<a id="{@id}"></a>
</xsl:template>
</xsl:stylesheet>
Output
<a id="MH"/>
<a id="BavoilMr"/>
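For comparison, the same mapping can be sketched outside XSLT. The snippet below is a minimal Python illustration and not part of the original answer: the helper name `rs_to_anchors` and the sample fragment are hypothetical, and it assumes the usual TEI namespace `http://www.tei-c.org/ns/1.0`. It shows the detail the simplified sample above sidesteps, namely that `xml:id` lives in the XML namespace.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"                # assumed TEI namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # expanded name of xml:id


def rs_to_anchors(root):
    """Turn every tei:rs element that carries xml:id into <a id="...">, in place."""
    # Collect first, so renaming tags cannot interfere with the traversal.
    for el in list(root.iter("{%s}rs" % TEI_NS)):
        xml_id = el.get(XML_ID)
        if xml_id is None:
            continue                  # leave rs elements without xml:id untouched
        el.attrib.clear()             # drop type, xml:id, ...
        el.set("id", xml_id)
        el.tag = "a"


# Hypothetical fragment modelled on the question's data.
sample = (
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    '<rs type="author" xml:id="MH">'
    '<rs type="patient" xml:id="BavoilMr">Mr. Bavoil</rs> - 56 ans'
    '</rs></div>'
)
root = ET.fromstring(sample)
rs_to_anchors(root)
ids = [a.get("id") for a in root.iter("a")]
print(ids)
```

Nested rs elements are handled because every match is renamed individually; the children of a renamed element stay where they are.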

Apache Camel wrong encoding after marshalling xml to json from http

I'm performing an HTTP call to get an RSS feed from a Latin American newspaper and then transforming the response body to JSON.
Latin American newspapers commonly contain accented characters that need to be encoded, such as á é í ó ú.
The problem is that the response is not decoded properly, so I get descriptions like this one:
Las lluvias llegar��an a la ciudad de C��rdoba jueves y viernes seg��n prev�� el Servicio Meteorol��gico Nacional (SMN)
I've tried setting encoding parameters for the http component and for the xmljson marshaller, and neither works. I also tried forcing Content-Type headers of application/rss+xml; charset=utf-8 and application/json; charset=utf-8, but neither helped.
I'm using the following DataFormat:
<dataFormats>
<xmljson id="xmljson"/>
</dataFormats>
And my route is as follows:
<route id="rss">
<from uri="direct:rss"/>
<setHeader headerName="CamelHttpUri">
<simple>"http://srvc.lavoz.com.ar/rss.xml"</simple>
</setHeader>
<setHeader headerName="CamelHttpMethod">
<constant>GET</constant>
</setHeader>
<to uri="http://rss"/>
<marshal ref="xmljson"/>
</route>
An example response would be:
{
"channel": {
"title": "LaVoz",
"link": "http://srvc.lavoz.com.ar/rss.xml",
"description": [],
"language": "en",
"item": [
{
"title": "��Se vienen las lluvias a C��rdoba?",
"link": "http://srvc.lavoz.com.ar/ciudadanos/se-vienen-las-lluvias-cordoba",
"description": "Las lluvias llegar��an a la ciudad de C��rdoba jueves y viernes seg��n prev�� el Servicio Meteorol��gico Nacional (SMN) aunque se mantendr�� el promedio de las temperaturas.�� Este martes estuvo cielo algo nublado con una temperatura m��nima de 14�� registrada a las 6.10 y una m��xima de 29,5�� a las 15.30, seg��n indic�� el Observatorio Meteorol��gico C��rdoba.�� Pron��stico extendido Hay probabilidad de tormentas para jueves y viernes. Mir�� el pron��stico.�� Ciudadanos",
"pubDate": "Tue, 14 Feb 2017 21:19:21 +0000",
"dc:creator": {
"@xmlns:dc": "http://purl.org/dc/elements/1.1/",
"#text": "redaccionlavoz"
},
"guid": {
"@isPermaLink": "false",
"#text": "1099119 at http://srvc.lavoz.com.ar"
}
},...
Update:
- If the route returns the XML response (without marshalling it into JSON) the encoding works as expected.
- If, instead of marshalling, the route logs the XML response body to a logger, the problem also appears.
A friend was able to solve it by converting the body to String with convertBodyTo using UTF-8 before marshalling.
The end code looks like this:
<route id="rss">
<from uri="direct:rss"/>
<setHeader headerName="CamelHttpUri">
<simple>"http://srvc.lavoz.com.ar/rss.xml"</simple>
</setHeader>
<setHeader headerName="CamelHttpMethod">
<constant>GET</constant>
</setHeader>
<to uri="http://rss"/>
<convertBodyTo type="String" charset="UTF-8" />
<setProperty propertyName="CamelCharsetName">
<constant>utf-8</constant>
</setProperty>
<marshal ref="xmljson"/>
</route>
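The symptom in the output above, two replacement characters per accented letter, is what appears when UTF-8 bytes are decoded with a charset that cannot represent them. The snippet below only illustrates that mechanism in Python; it is not Camel code. The sample string is an assumption, and so is the ASCII codec, chosen because it reproduces the pattern; the post does not say which charset the route actually fell back to.

```python
# Sample text is an assumption; any string with accented characters behaves the same.
original = "Las lluvias llegarían a la ciudad de Córdoba"
raw = original.encode("utf-8")   # the bytes as they would arrive over HTTP

# Assumed failure mode: decoding with a charset that has no bytes >= 0x80.
# Each of the two UTF-8 bytes behind an accented letter becomes U+FFFD.
garbled = raw.decode("ascii", errors="replace")

# Decoding with the charset the bytes were written in restores the text,
# which is the role convertBodyTo with charset="UTF-8" plays in the route.
restored = raw.decode("utf-8")

print(garbled)
print(restored)
```

Once the body is a correctly decoded string, later steps no longer have to guess the charset from raw bytes.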

I want to parse JSON to HTML with AngularJS; I am converting XML to JSON with npm xml2json

I am using AngularJS $http to get an XML document from the server, and the node xml2json plugin to convert it to a JSON object. The document is returned properly, but the result doesn't look right because the XML contains HTML tags. If anyone knows how to convert the returned JSON properly, that would be great.
var x2js = new X2JS();
var dom = x2js.xml_str2json(chr);
<my_data>
<p class="justify">
El
<strong>
<em>
Content Tour 2015
</em>
</strong>
<strong>
Distributors
</strong>
</p>
</my_data>
I am still trying to figure this computer programming stuff out. I have only been coding for a few months, so please excuse my bad code and somewhat basic questions, but if anyone knows how to parse the JSON or XML to HTML, I will be very grateful.
// Add your javascript here
$(function(){
$("body").append('<notas> <copete>Editorial</copete> <titulo>grande</titulo> <seccion>Opinion</seccion> <cuerpo> <p class="rtejustify"> El <strong> <em> Foro Infochannel Tour 2015 </em> </strong> concluyó en días pasados en la hermosa ciudad de Oaxaca, Oaxaca. El escenario para concluir el recorrido no pudo ser mejor. </p> </cuerpo> </notas>')
});
check plunkr --
plnkr.co/edit/nlEzvOeqwyGx7FwlFIEI?p=preview

Adding Support for Multiple Language to a Web Page?

I know I could do this by simply copying the files over, changing the names (adding a language code like "about" versus "about_es", or "contact" versus "contact_es") and basically redirecting them to a different site altogether, but I was wondering how to go about doing this like the Elder Scrolls website does it (URL is the same). It seems like that method would be more elegant/professional.
Any ideas?
You can create a resource file in JSON or XML that holds the strings for each language.
ex1:
<languages>
<language id="EN">
<element name="page_head" >Hello world</element>
<element name="page_footer" >goodbye world</element>
</language>
<language id="FR">
<element name="page_head">Bonjour tous le monde</element>
<element name="page_footer">Heureux voir tous le monde</element>
</language>
</languages>
ex2:
a=[{
"language":"EN",
"elements":
{
"page_head":"Hello world",
"page_footer":"goodbye world"
}
},
{
"language":"FR",
"elements":
{
"page_head":"Bonjour tous le monde",
"page_footer":"Heureux voir tous le monde"
}
}];
Or you could take the lazy route and store the strings in a database.
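Whichever storage is used, the page then looks up each string by language code and key at render time, so the URL can stay the same. A minimal sketch of that lookup in Python follows; the names `RESOURCES` and `translate` are hypothetical, and how the visitor's language is chosen (cookie, Accept-Language header, a menu) is outside the sketch.

```python
# Same data as the examples above, keyed by language code.
RESOURCES = {
    "EN": {"page_head": "Hello world", "page_footer": "goodbye world"},
    "FR": {"page_head": "Bonjour tous le monde",
           "page_footer": "Heureux voir tous le monde"},
}


def translate(key, lang, default_lang="EN"):
    """Return the string for `key` in `lang`, falling back to `default_lang`.

    If the key is unknown even in the default language, the key itself is
    returned so that a missing translation is visible on the page.
    """
    default_table = RESOURCES[default_lang]
    table = RESOURCES.get(lang.upper(), default_table)
    return table.get(key, default_table.get(key, key))


print(translate("page_head", "fr"))
print(translate("page_head", "es"))   # no Spanish table, falls back to English
```

Falling back to a default language keeps the page usable while a translation is still incomplete.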

Parsing <sub> tag inside <td>

I'm trying to xpath parse an HTML document containing the following line:
<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>
I'm using scrapy, and the result is:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u'<sub>2</sub>'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/node()' data=u' (gr/km)'>]
so, three items instead of just one. I don't mind about the tag, so how would I get a single item containing:
Emisiones de CO2 (gr/km)
This is not an isolated case: I have several items containing the tag, so I need a programmatic solution.
Any clue?
thanks!!
NOTE: Using text() instead of node() does not help:
[<Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u'Emisiones de CO'>, <Selector xpath='//td[contains(@class,"ficha_izq")]/text()' data=u' (gr/km)'>]
This xpath should work //td[contains(text(),'Emisiones de CO')]/node()
Use w3lib.html.remove_tags. You can use it with an ItemLoader.
In [1]: html = '<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>'
In [2]: sel = Selector(text=html)
In [3]: map(remove_tags, sel.xpath('//td').extract())
Out[3]: [u'Emisiones de CO2 (gr/km)']
Alternatives using XPath or CSS selectors:
In [4]: u''.join(sel.xpath('//td[contains(@class,"ficha_izq")]//text()').extract())
Out[4]: u'Emisiones de CO2 (gr/km)'
In [5]: u''.join(sel.css('td.ficha_izq ::text').extract())
Out[5]: u'Emisiones de CO2 (gr/km)'
Notice the space between td.ficha_izq and ::text, and that ::text CSS pseudo element is a Scrapy extension to CSS selectors.
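The idea behind the `//text()` variant, joining the element's own text with the text of all its descendants, can also be sketched with the standard library alone. This is only an illustration and not a replacement for the Scrapy selectors above: `xml.etree.ElementTree` is used here on the assumption that the fragment is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

html = '<td class="ficha ficha_izq">Emisiones de CO<sub>2</sub> (gr/km)</td>'

# Parsing works only because this particular fragment is well-formed XML.
td = ET.fromstring(html)

# itertext() yields the element's text, each descendant's text and the
# tails that follow them, in document order; joining gives one string.
text = "".join(td.itertext())
print(text)
```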