Apache Camel wrong encoding after marshalling xml to json from http - json

I'm making an HTTP call to fetch the RSS (XML) feed of a Latin American newspaper and then transforming the response body to JSON.
Latin American newspapers commonly contain accented characters that need to be encoded correctly, such as á é í ó ú.
The problem is that the response is not decoded properly, so I get descriptions like this one:
Las lluvias llegar��an a la ciudad de C��rdoba jueves y viernes seg��n prev�� el Servicio Meteorol��gico Nacional (SMN)
I've tried setting encoding parameters for the http component and for the xmljson marshal, but neither works. I also tried forcing Content-Type headers of application/rss+xml; charset=utf-8 and application/json; charset=utf-8, but neither helped.
I'm using the following DataFormat:
<dataFormats>
<xmljson id="xmljson"/>
</dataFormats>
And my route is as follows:
<route id="rss">
<from uri="direct:rss"/>
<setHeader headerName="CamelHttpUri">
<simple>"http://srvc.lavoz.com.ar/rss.xml"</simple>
</setHeader>
<setHeader headerName="CamelHttpMethod">
<constant>GET</constant>
</setHeader>
<to uri="http://rss"/>
<marshal ref="xmljson"/>
</route>
An example response would be:
{
"channel": {
"title": "LaVoz",
"link": "http://srvc.lavoz.com.ar/rss.xml",
"description": [],
"language": "en",
"item": [
{
"title": "��Se vienen las lluvias a C��rdoba?",
"link": "http://srvc.lavoz.com.ar/ciudadanos/se-vienen-las-lluvias-cordoba",
"description": "Las lluvias llegar��an a la ciudad de C��rdoba jueves y viernes seg��n prev�� el Servicio Meteorol��gico Nacional (SMN) aunque se mantendr�� el promedio de las temperaturas.�� Este martes estuvo cielo algo nublado con una temperatura m��nima de 14�� registrada a las 6.10 y una m��xima de 29,5�� a las 15.30, seg��n indic�� el Observatorio Meteorol��gico C��rdoba.�� Pron��stico extendido Hay probabilidad de tormentas para jueves y viernes. Mir�� el pron��stico.�� Ciudadanos",
"pubDate": "Tue, 14 Feb 2017 21:19:21 +0000",
"dc:creator": {
"#xmlns:dc": "http://purl.org/dc/elements/1.1/",
"#text": "redaccionlavoz"
},
"guid": {
"#isPermaLink": "false",
"#text": "1099119 at http://srvc.lavoz.com.ar"
}
},...
Update:
- If the route returns the XML response (without marshalling it into JSON) the encoding works as expected.
- If, instead of marshalling, the route logs the XML response body to a logger, the problem also appears.

A friend was able to solve it by converting the body to String with convertBodyTo using UTF-8 before marshalling.
The end code looks like this:
<route id="rss">
<from uri="direct:rss"/>
<setHeader headerName="CamelHttpUri">
<simple>"http://srvc.lavoz.com.ar/rss.xml"</simple>
</setHeader>
<setHeader headerName="CamelHttpMethod">
<constant>GET</constant>
</setHeader>
<to uri="http://rss"/>
<convertBodyTo type="String" charset="UTF-8" />
<setProperty propertyName="CamelCharsetName">
<constant>utf-8</constant>
</setProperty>
<marshal ref="xmljson"/>
</route>
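The doubled replacement characters in the output above are what you typically get when UTF-8 bytes are decoded with a charset that cannot represent them. The Python sketch below only illustrates that general mechanism. It is not Camel code, and the sample string and the choice of ASCII as the "wrong" charset are assumptions for illustration.

```python
# Illustration only: the sample text and the "wrong" charset (ASCII) are assumptions.
raw = "Córdoba".encode("utf-8")  # bytes as a UTF-8 feed would send them

# Decoding with a charset that cannot represent the bytes: each invalid byte
# becomes U+FFFD, so a two-byte accented letter turns into two markers.
wrong = raw.decode("ascii", errors="replace")

# Decoding explicitly as UTF-8 keeps the accented characters intact.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

This is consistent with why converting the body to a String with an explicit UTF-8 charset before marshalling helps: the bytes are decoded once, with the right charset, before a later step has to guess.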


Microsoft Translator Text not raising error when wrong params passed with custom engine

I've noticed two cases in which Microsoft Translator Text should raise an error when using a custom engine, but instead behaves unexpectedly.
Normal Case
I send a request with the method POST.
URL = https://api.cognitive.microsofttranslator.com/translate
parameters :
- api-version=3.0
- textType=plain
- category=my_engine_id
- from=de
- to=en
Headers={"Ocp-Apim-Subscription-Key": my_key,'Content-type': 'application/json'}
Body (JSON format) :
[{"Text": "Klicken Sie in jedem Bildschirm auf das Textbeispiel, das am besten lesbar ist."},
{"Text": "Verschiedene Themen aus den Bereichen Wortschatz, Satzbau, Kohärenz, Textwiedergabe,
Kommasetzung und Orthographie werden anhand von Textbeispielen und Übungen vorgestellt."},
{"Text": "„Auch wenn zwei Staaten in Deutschland existieren, sind sie doch füreinander nicht Ausland; ihre Beziehungen zueinander können nur von besonderer Art sein.“"},
{"Text": "Mit dieser Formel bricht der neue Kanzler ein jahrzehntelanges Tabu."},
{"Text": "Bislang wird der östliche Teil Deutschlands von den bundesdeutschen Politikern als SBZ, Zone oder „sogenannte DDR“ bezeichnet."}]
The response is as expected:
[{"translations":
[{"text": "On each screen, click on the text sample that is most readable.",
"to": "en"}]},
{"translations":
[{"text": "Various topics from the fields of vocabulary, typesetting, coherence, text reproduction, comma setting and orthography are presented using text examples and exercises.",
"to": "en"}]},
etc.
Case 1
I change the parameter to=fr in the URL (instead of "en").
The response shows that the text has been translated into French, although the custom engine is only trained for DE>EN (so I think a generic engine was used instead, but the HTTP response gives no indication of this)!
[{"translations":[{"text":"Sur chaque écran, cliquez sur l’échantillon de texte le plus lisible.","to":"fr"}]},{"translations":[{"text":"Divers sujets des domaines du vocabulaire, du typage, de la cohérence, de la reproduction du texte, du décor de virgule et de l’orthographe sont présentés à l’aide d’exemples de textes et d’exercices.","to":"fr"}]},{"translations":[{"text":"\"Même si deux États existent en Allemagne, ils ne sont pas étrangers l’un à l’autre; Leurs relations les uns avec les autres ne peuvent être que d’un genre particulier.","to":"fr"}]},{"translations":[{"text":"Avec cette formule, le nouveau chancelier brise un tabou de dix ans.","to":"fr"}]},{"translations":[{"text":"Jusqu’à présent, la partie orientale de l’Allemagne est appelée par les politiciens fédéraux allemands SBZ, zone ou « soi-disant DDR ».","to":"fr"}]}]
Instead of this behaviour, I would have expected an error in the HTTP response:
400075 The language pair and category combination is not valid.
Case 2
I change the parameter from=da in the URL (instead of "de").
The response shows that the translation is simply a copy of the source text, although it indicates that it was translated into "en"!
[{"translations":[{"text":"Klicken Sie in jedem Bildschirm auf das Textbeispiel, das am besten lesbar ist.","to":"en"}]},{"translations":[{"text":"Verschiedene Themen aus den Bereichen Wortschatz, Satzbau, Kohärenz, Textwiedergabe, Kommasetzung und Orthographie werden anhand von Textbeispielen und Übungen vorgestellt.","to":"en"}]},{"translations":[{"text":"\"Auch wenn zwei Staaten in Deutschland existieren, sind sie doch füreinander nicht Ausland; ihre Beziehungen zueinander können nur von besonderer Art sein.\"","to":"en"}]},{"translations":[{"text":"Mit dieser Formula bricht der neue Kanzler ein jahrzehntelanges Tabu.","to":"en"}]},{"translations":[{"text":"Bislang wird der östliche Teil Deutschlands von den bundesdeutschen Politikern als SBZ, Zone oder \"sogenannte DDR\" bezeichnet.","to":"en"}]}]
As in Case 1, instead of this behaviour I would have expected an error in the HTTP response:
400075 The language pair and category combination is not valid.
Is it normal that I don't get an error in these two cases? Has anybody else encountered this behavior before?
Ideally I would like either to get these error codes, or to check before sending the translation request that the language pair matches the custom engine. Do you know of a way to do that?
For Case 1, the translation is done in two hops: de>en, then en>fr. The custom engine is used for the first hop, and our general model is used for the second. If you want the translation to fail in this case instead, you can set the parameter allowFallback to false.
allowFallback Optional parameter.
Specifies that the service is allowed to fallback to a general system when a custom system does not exist. Possible values are: true (default) or false.
docs
In the second case, the request contains German text but is labeled as Danish. There is nothing we can do here except return the input. If you want the API to detect the source language, omit the from parameter and the API will run auto-detect on the text.
If the from parameter is not specified, automatic language detection is applied to determine the source language.
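As a rough sketch of the two suggestions above, the following Python builds the query string with allowFallback=false, or without a from parameter so that auto-detection applies. The helper name build_translate_url and its arguments are hypothetical; only the parameter names and the URL come from the question and answer, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_url(category, to, source=None, allow_fallback=True):
    """Hypothetical helper: source=None omits `from` so the service auto-detects."""
    params = {
        "api-version": "3.0",
        "textType": "plain",
        "category": category,
        "to": to,
    }
    if source is not None:
        params["from"] = source
    if not allow_fallback:
        # Ask the service to fail instead of falling back to a general system.
        params["allowFallback"] = "false"
    return BASE_URL + "?" + urlencode(params)


# Case 1: de>fr with the custom engine, no fallback allowed.
strict_url = build_translate_url("my_engine_id", "fr", source="de", allow_fallback=False)
# Case 2: let the service detect the source language.
detect_url = build_translate_url("my_engine_id", "en")

print(strict_url)
print(detect_url)
```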

XSL to HTML - Insert Image

I am writing an XSL file (to be converted to HTML) from XML and I want to insert an image. My problem is that the link to the image is in the XML. I want the image from caixa id="102". How can I do it?
XML:
<loja xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="trabalhoXSD.xsd">
<componentesDisponiveis>
<caixa id="101">
<preco>23.90</preco>
<imagem>https://www.pcdiga.com/bizizi/img_upload/produtos_1/18677_1_gx.jpg?d=1443548409</imagem>
<descricao>A Nox introduz a Kore: uma solução com amplas possibilidades num formato semi-tower. A sua versatilidade converte-a numa opção perfeita para aqueles que
necessitam de uma caixa para hardware de alto desempenho, num formato mais compacto.
O design em preto com linhas angulares fornecem-lhe um aspecto implacável, juntamente com o efeito de alumínio escovado do painel frontal.</descricao>
<HDD>5</HDD>
<SDD>1</SDD>
<leitorDiscosOpticos>0</leitorDiscosOpticos>
</caixa>
<caixa id="102">
<preco>124.89</preco>
<imagem>https://www.pcdiga.com/bizizi/img_upload/produtos_1/8502_1_gx.png?d=1348685644</imagem>
<descricao>Quando você precisar de sair e levar seu jogo, a caixa Vengeance C70 é a opção perfeita. Ela é esculpida em aço sólido e feito para sobreviver a viagens com
menos desgaste, e as alças para transporte ergonómico acrescentam confiança ao transporte.</descricao>
<HDD>8</HDD>
<SDD>1</SDD>
<leitorDiscosOpticos>0</leitorDiscosOpticos>
</caixa></componentesDisponiveis></loja>
Image sources, like other HTML attributes, can be added using the <xsl:attribute> tag.
<img>
<xsl:attribute name="src">
<xsl:value-of select="componentesDisponiveis/caixa[@id = '102']/imagem"/>
</xsl:attribute>
</img>
As you can see, to select a specific id you add it between square brackets as a condition (a predicate) on the element.
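To check the path expression outside of an XSLT processor, a small Python sketch with the standard library's ElementTree can evaluate the same predicate. The trimmed-down XML and the example.com image URLs are stand-ins for the real document, not the original data.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the question's XML; the image URLs are placeholders.
xml_doc = """
<loja>
  <componentesDisponiveis>
    <caixa id="101"><imagem>https://example.com/101.jpg</imagem></caixa>
    <caixa id="102"><imagem>https://example.com/102.png</imagem></caixa>
  </componentesDisponiveis>
</loja>
"""

root = ET.fromstring(xml_doc)  # root is the <loja> element

# Same path and predicate as in the stylesheet, relative to <loja>.
image_url = root.findtext("componentesDisponiveis/caixa[@id='102']/imagem")

img_tag = '<img src="{}"/>'.format(image_url)
print(img_tag)
```

Note that the path is relative: it only matches when the current node is loja, which is also true of the select expression in the stylesheet.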

perl XML::Feed error reading html content inside rss feed

I'm trying to parse and read the content of an RSS feed and I'm getting an error.
This is my RSS feed file (for test purposes):
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:Test="http://www.Test.com">
<channel>
<title>Deportes - Test.com</title>
<link>http://www.Test.com</link>
<description>Últimas noticias de deportes</description>
<item>
<title><![CDATA[El 'Chacho' Coudet es el nuevo entrenador de Rosario Central]]></title>
<link>http://442.Test.com/2014-12-15-326653-coudet-fue-presentado-como-nuevo-dt-de-central/</link>
<description><![CDATA[El Chacho, Tengo mucha alegría y ganas de empezar a trabajar. No esperaba que sea acá”, reconoció.]]></description>
<category><![CDATA[Deportes]]></category>
<pubDate>15 12 2014 06:15:0 +0000</pubDate>
<enclosure url="http://www.Test.com/__export/1418678333348/sites/diarioTest/img/2014/12/15/deportes/1215_coudet_g_fb.jpg" type="image/jpeg"><![CDATA[El Chacho Coudet]]></enclosure>
<author><![CDATA[]]></author>
<content><![CDATA[<p>Eduardo Coudet fue presentado como nuevo entrenador deRosario Central&nbsp.</p>
]]></content>
</item>
</channel>
</rss>
This is my test.pl script file:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Feed;
my $feed = XML::Feed->parse("test.xml");
for my $entry ($feed->entries) {
print $entry->content;
}
When I run this code I get this error:
Can't use string ("<p>Eduardo Coudet fue prese"...) as a HASH ref while "strict refs" in use at C:/Strawberry/perl/site/lib/XML/Feed/Entry/Format/RSS.pm line 91.
I think this is a bug inside XML::Feed.
Reference: https://github.com/davorg/xml-feed/blob/master/lib/XML/Feed/Format/RSS.pm
Thanks
SOLVED
The developer of this library fixed the bug in version 0.53:
https://github.com/davorg/xml-feed/issues/16
Thanks, I hope this helps others with the same issue.

Regex html script tag with specific URL

I have HTML content with <script> tags in it. Those <script> tags contain a URL pointing to a video.
What I want is to replace those HTML tags with my own tag, which uses this pattern: [VIDEO]MY_URL_[/VIDEO]
I'm using hpple to parse the HTML content.
I'm using this XPath query: //script
When the parser finds a result for my query, I use this code to extract the video URL:
NSDataDetector* detector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:nil];
NSArray* matches = [detector matchesInString:raw options:0 range:NSMakeRange(0, [raw length])];
NSString *finalUrl = [self urlMatchingRegexResults:matches withExtensionArray:[self videosExtensionsArray]];
if (finalUrl) {
NSString *replacement = [NSString stringWithFormat:@"[%@]%@[/%@]",tag,finalUrl,tag];
NSString *pattern = [NSString stringWithFormat:@"<script.*>.*%@.*</script>",finalUrl];
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:nil];
NSArray *matches = [regex matchesInString:self.store options:0 range:NSMakeRange(0, self.store.length)];
modifiedString = [regex stringByReplacingMatchesInString: modifiedString options:0 range:NSMakeRange(0, modifiedString.length) withTemplate:replacement];
}
where "raw" is the result of [TFHppleElement raw]
where [self videosExtensionsArray] is an array of videos extensions :
- (NSArray *)videosExtensionsArray {
static NSArray *videosExtensionsArray;
static dispatch_once_t onceToken;
dispatch_once(&onceToken, ^{
videosExtensionsArray = @[@"mp4",@"mov",@"avi",@"flv",@"mkv"];
});
return videosExtensionsArray;
}
The problem is that if I have multiple <script> tags in my HTML content, my regex matches from the first opening tag to the last closing tag.
How can I modify my regex to avoid this issue?
NSString *pattern = [NSString stringWithFormat:@"<script.*>.*%@.*<\\/script>",finalUrl];
EDIT :
Content of the HTML :
<html><body><p style="text-align: center;"><a href="http://www.tuxboard.com/nba-jam-avec-gerald-green/gerald-green-nba-jam/" rel="attachment wp-att-171429">[IMG]http://www.tuxboard.com/photos/2014/03/Gerald-Green-NBA-Jam.jpg[/IMG]
</a>
</p>
<p><span id="more-171399"/><br/>
Si le jeu <strong>NBA Jam</strong> était édité cette année, le joueur des Phoenix Suns <strong>Gerald Green</strong> serait la star en couverture. L’arrière des Suns est à la fois un immense dunkeur avec une détente phénoménale, mais aussi une fine gâchette.</p>
<p style="text-align: center;"><a href="http://www.tuxboard.com/nba-jam-avec-gerald-green/video-nba-jam-gerald-green/" rel="attachment wp-att-171431">[IMG]http://www.tuxboard.com/photos/2014/03/Video-NBA-Jam-Gerald-Green.jpg[/IMG]
</a>
</p>
<p>L’équipe de Phoenix l’a intégré dans le jeu <strong>NBA Jam</strong>, suite à ses performances hors normes face au Thunder avec notamment 41 pts. </p>
<p>On vous laisse savourer cette vidéo, avec une jolie pépite à la fin (on n’en dit pas plus…)</p>
<div id="tuxplayer">Chargement du player …</div>
<p><script type="text/javascript"><![CDATA[jwplayer("tuxplayer").setup({ flashplayer: "http://medias.tuxboard.com/playerv2.swf", file: "http://medias2.tuxboard.com/NBA_Jam_Gerald_Green.mp4",image: "http://www.tuxboard.com/photos/2014/03/NBA-Jam-Gerald-Green-on-Fire-640x357.jpg", height: 370,width: '100%', 'plugins': 'sharing-3'});]]></script></p>
<p>
Les dernières actions du bonhomme qui devrait remporter le titre du joueur ayant le plus progressé !</p>
<p style="text-align: center;">[IMG]http://www.tuxboard.com/photos/2014/03/Gerald-Green-Poster-Mason-Plumlee.gif[/IMG]
</p>
<p style="text-align: center;">[IMG]http://www.tuxboard.com/photos/2013/11/Dunk-Gerald-Green.gif[/IMG]
</p>
<p style="text-align: center;">[IMG]http://www.tuxboard.com/photos/2014/01/gerald-green-windmill.gif[/IMG]
</p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/xnzQ3FWc7Oo?feature=oembed" frameborder="0" allowfullscreen=""/></p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/Yyr6mkAbCQw?feature=oembed" frameborder="0" allowfullscreen=""/></p>
<p>Et surement son plus beau dunk :</p>
<p style="text-align: center;">
</p><div id="Gerald">Chargement du player …</div>
<p><script type="text/javascript"><![CDATA[
jwplayer("Gerald").setup({ flashplayer: "http://medias.tuxboard.com/playerv2.swf", file: "http://medias2.tuxboard.com/Gerald_Green_Windmill_Alley-Oop.mp4",image: "http://www.tuxboard.com/photos/2012/03/Video-Gerald-GreenAlley-Oop.jpg", height: 390,width: 640, 'plugins': 'sharing-3'});]]></script></p>
</body></html>
Log of the pattern :
<script.*?>.*http://medias2.tuxboard.com/NBA_Jam_Gerald_Green.mp4.*?</script>
Matching is greedy by default and finds the longest match; you need the shortest, which is indicated by *? (shortest zero or more). See Regular Expressions - ICU User Guide, referenced by Apple's NSRegularExpression documentation.
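The difference between the greedy and the lazy quantifier can be seen in a small Python sketch. Python's re module is used here only for illustration (the *? syntax is the same as in ICU), and the HTML snippet and example.com URLs are made up for the demonstration.

```python
import re

# Made-up HTML with two <script> tags; the URLs are placeholders.
html = (
    '<p><script>play("http://example.com/a.mp4");</script></p>'
    '<p>text</p>'
    '<p><script>play("http://example.com/b.mp4");</script></p>'
)
url = re.escape("http://example.com/a.mp4")  # escape dots and slashes in the URL

greedy = re.compile(r"<script.*>.*" + url + r".*</script>", re.DOTALL)
lazy = re.compile(r"<script.*?>.*?" + url + r".*?</script>", re.DOTALL)

# Greedy: runs from the first opening tag to the LAST closing tag.
greedy_match = greedy.search(html).group(0)
# Lazy: stops at the first closing tag after the URL.
lazy_match = lazy.search(html).group(0)

replaced = lazy.sub("[VIDEO]http://example.com/a.mp4[/VIDEO]", html)
print(replaced)
```

One caveat: a regex always starts at the leftmost possible position, so if the URL sits in the second script tag, even the lazy pattern starts at the first opening tag and spans both. Escaping the URL before inserting it into the pattern also avoids its dots matching arbitrary characters.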

Adding Support for Multiple Language to a Web Page?

I know I could do this by simply copying the files over, changing the names (adding a language code like "about" versus "about_es", or "contact" versus "contact_es") and basically redirecting them to a different site altogether, but I was wondering how to go about doing this like the Elder Scrolls website does it (URL is the same). It seems like that method would be more elegant/professional.
Any ideas?
You can create a resource file in JSON or XML that holds the strings for each language.
ex1:
<languages>
<language id="EN">
<element name="page_head" >Hello world</element>
<element name="page_footer" >goodbye world</element>
</language>
<language id="FR">
<element name="page_head">Bonjour tous le monde</element>
<element name="page_footer">Heureux voir tous le monde</element>
</language>
</languages>
ex2:
a=[{
"language":"EN",
"elements":
{
"page_head":"Hello world",
"page_footer":"goodbye world"
}
},
{
"language":"FR",
"elements":
{
"page_head":"Bonjour tous le monde",
"page_footer":"Heureux voir tous le monde"
}
}];
Or you can take the lazy way and store the strings in a database!
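With such a resource table, the same URL can serve every language by choosing the strings at request time. The Python sketch below shows one possible lookup with a fallback to a default language; the names RESOURCES and text_for are hypothetical, and how the language is chosen (a cookie, a setting, a request header) is left open.

```python
# Hypothetical resource table mirroring the JSON example above.
RESOURCES = {
    "EN": {"page_head": "Hello world", "page_footer": "goodbye world"},
    "FR": {
        "page_head": "Bonjour tous le monde",
        "page_footer": "Heureux voir tous le monde",
    },
}


def text_for(language, name, default_language="EN"):
    """Return the string `name` for `language`, falling back to the default language."""
    default_strings = RESOURCES[default_language]
    strings = RESOURCES.get(language.upper(), default_strings)
    # Unknown keys fall back to the default language, then to the key itself.
    return strings.get(name, default_strings.get(name, name))


print(text_for("fr", "page_head"))
print(text_for("es", "page_footer"))  # no Spanish table, so English is used
```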