Tomcat + Wicket: UTF-8 chars not rendering properly

Tomcat + Wicket: UTF-8 chars not rendering properly - html

I have a Wicket app with some pages containing accented chars, entered as UTF-8, e.g. "résumé".
When I debug the app via the traditional Wicket Start.java class (which invokes an embedded Jetty server) all is good. However when I try deploying to a local Tomcat instance, it renders as "r√©sum√©".
My document looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US"
xmlns:wicket="http://wicket.apache.org/dtds.data/wicket-xhtml1.4-strict.dtd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
résumé
</body>
</html>
Here's what curl -I returns for the page when running on Jetty:
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Language: en-US
Pragma: no-cache
Cache-Control: no-cache, max-age=0, must-revalidate
Content-Length: 13545
Server: Jetty(6.1.25)
And here's what Tomcat returns:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Pragma: no-cache
Cache-Control: no-cache, max-age=0, must-revalidate
Content-Type: text/html;charset=UTF-8
Content-Language: en-US
Transfer-Encoding: chunked
Date: Sat, 23 Jul 2011 14:36:45 GMT

The problem is that Wicket doesn't detect the encoding of the markup files correctly. They are encoded as UTF-8, so non-ASCII chars are represented by two bytes. But Wicket doesn't know that and reads them as two separate characters. Those two characters are then encoded as UTF-8 again in the response. Since the "square root" characters is not ANSI itself you should actually see three bytes per é in the response.
Anyway, you need to fix this markup encoding interpretation. Checkout the Wicket source code for XMLReader#init().
It reads like Wicket tries three things the find out about the encoding of a markup file:
Evaluates the <?xml ... ?> declaration in beginning of the markup file. (Missing for you?)
Uses the default encoding specified by Application#getMarkupSettings().setDefaultMarkupEncoding(String)
Uses the OS default.
It looks like are missing 1 and 2 at the moment so Wicket falls back to 3 which doesn't work in your case. So try any of the other two.

I'm not sure why this is needed, but here's a workaround that solved this for me:
public class Application extends WebApplication
{
#Override
protected void init()
{
getRequestCycleSettings().setResponseRequestEncoding("UTF-8");
getMarkupSettings().setDefaultMarkupEncoding("UTF-8");
}
}
To give credit where it is due, I found this solution here.

Related

Why is Firefox showing me a cached version of a page

I have a page the includes an iframe.…
<!DOCTYPE html>
<html>
<body>
<!-- … -->
<iframe
src="/assets/js/pdfjs/web/viewer.html?file=2021-09-12_1200-file.pdf#zoom=page-width"
style="..."
></iframe>
<!-- … -->
</body>
</html>
That includes the following response headers…
HTTP/1.1 200 OK
Date: Tue, 26 Oct 2021 11:02:17 GMT
Server: Apache/2.4.38 (Debian)
X-Powered-By: PHP/7.3.27
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
X-DEBUGKIT-ID: 77761443-2882-4882-b0e1-01eea68deded
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 2349
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
If I change the file path in the iframe src attribute (e.g /assets/js/pdfjs/web/viewer.html?file=2021-10-26_1200-file.pdf#zoom=page-width - note the new timestamp), and reload the page, the old file is still returned, rather than the new one, despite the Cache-Control: no-store, no-cache, must-revalidate header.
Debugging the requests recevied by the server, I can see…
The parent page is requested and returned with the headers as above (with new Date & X-DEBUGKIT-ID header values), and the correct, updated iframe src value.
The iframe page is being requested with the original filename, rather than the new one (I'm assuming from the cached page).
If I reload using Cmd+Shift+R (to ignore the browser cache), then the correct iframe document is loaded.
What am I missing in this setup that is causing the page to be cached? I thought that the Cache-Control header we have should be sufficient here.
If I add a random query string to the parent page this correctly loads new documents, but I feel this is a hack that should not be needed.
I've also tried adding a Etag header containing a random string that's different for each request, but this seems to have no effect on the browser caching.

HTML email (from SendGrid) not being rendered properly

What Content-Type does SendGrid set in the header by default?
I have an issue where html email going thru Sendgrid is not being formatted properly.
In the email header I see
MIME-Version: 1.0
Content-Type: text/plain
and then the following is rendered in any email client, standalone or web based.
This is a multi-part message in MIME format.
--------------e21a5bffb444e61b8e8a30240210d506
Content-Type: text/html; charset=UTF-8; format=flowed
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
etc etc
Shouldn't the Content-type in the header be multipart/mixed or similar to properly render the html and display images?
How is this changed?
Can it be changed somehow by the actual html being sent to SendGrid's server?
Any feedback appreciated!

The library you are using is hardcoding the content-type to be text/plain. In the smtp/mailer/SMTPMailer.as source on line 135:
writeUTFBytes ("Content-Type: text/html; charset=UTF-8; format=flowed\r\n");
This library doesn't look like it's extremely robust, it's lacking documentation, and it's 6 years old. You may want to try to find a different solution.

Typo3 website. When surfing around the site sometimes the browser shows the html code instead of showing the site

I have a Typo3 website (version 4.5.32). Sometimes (randomly) when I'm surfing around my website the browser shows the html code instead of showing the webpage.
For example it shows:
HTTP/1.1 200 OK
Date: Thu, 12 Feb 2015 11:36:29 GMT
Server: Apache
Set-Cookie: fe_typo_user=f4b8445b0719bd7490dcde98e7d8ff5b;
path=/; domain=.<my_domain>
Vary: Accept-Encoding,User-Agent
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
bfee
<!DOCTYPE html>
<html lang="es-ES" xmlns="http://www.w3.org/1999/xhtml">
...
</html>
<!-- Cached page generated 12-02-15 12:35. Expires 13-02-15 12:35 -->
<!-- Parsetime: 0ms -->
0
when it should show the webpage.
Another example:
HTTP/1.1 200 OK
Date: Thu, 12 Feb 2015 11:41:19 GMT
Server: Apache
Set-Cookie: fe_typo_user=fd0199b1f48b719c097ef19418f18397; path=/; domain=.<my_domain>
Expires: 0
Last-Modified: Thu, 12 Feb 2015 11:41:19 GMT
Cache-Control: no-cache, must-revalidate
Pragma: no-cache
Set-Cookie: be_typo_user=71e6061cabf0d60a03739493561b67d9; path=/
Vary: Accept-Encoding,User-Agent
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
9395
<!DOCTYPE html>
<html lang="es-ES" xmlns="http://www.w3.org/1999/xhtml">
...
</html>
<!-- Cached page generated 12-02-15 12:37. Expires 13-02-15 12:37 -->
<!-- Parsetime: 111ms -->
0
Thanks.

The extra character before the actual output (bfee, 9395) are the lengths of the following block of data. The header Transfer-Encoding: chunked also indicates that the output is cut into blocks (chunks).
All user agents (browsers, etcetera) must support chunked transfer encoding. Perhaps there is some proxy in between that ruins the experience? Anyway, it's the webserver that decides to use this transfer encoding and not TYPO3.
The only thing inside TYPO3 that could ruin things is when the content is retrieved using t3lib_div::getUrl(). This function only supports chunked data if you have cURL activated in the installation.

Trying to keep it unicode all the way

Arabic user data that was submitted from a website form occasionally ends up Mojibake in our database. A user would type something like:
الإعلان العالمى لحقوق الإنسان
in an input form and the post is received by a server and stored in a database. When we retrieve the message from the database, it reads:
ï»¿Ø§Ù„Ø¥Ø¹Ù„Ø§Ù† Ø§Ù„Ø¹Ø§Ù„Ù…Ù‰ Ù„ØÙ‚ÙˆÙ‚ Ø§Ù„Ø¥Ù†Ø³Ø§Ù†
The form is in an embedded iframe page with these tags:
<!DOCTYPE HTML>
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type" />
<!-- other header elements -->
</head>
<body>
<form accept-charset="utf-8" action="https://www.salesforce.com/servlet/servlet.WebToLead?encoding=UTF-8" method="post">
<!-- other body elements -->
</body>
</html>
A post generate these request headers
Accept */*
Accept-Encoding gzip, deflate
Accept-Language en-US,en;q=0.5
Cache-Control no-cache
Connection keep-alive
Content-Length 543
Content-Type application/x-www-form-urlencoded; charset=UTF-8
Host www.salesforce.com
Origin [ -- redacted -- ]
Pragma no-cache
Referer [ -- redacted -- ]
User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 FirePHP/0.7.4
x-insight activate
And receives these response headers
HTTP/1.1 200 OK
Date: Fri, 25 Apr 2014 09:15:49 GMT
Cache-Control: private
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
I have no control over the server configuration of the machine serving the form or the server processing the form data.
Is there anything more I can do in the page markup that can prevent the problem? Are there known user agents which would ignore the accept-charset attribute?
Since the character scramble only happens occasionally, what is the best way to try and replicate / isolate the problem?
Thanks!

Getting latin1 instead of UTF-8 with CGI::Application

I am using CGI::Application with UTF-8 data.
In the HTML have I set encoding to UTF-8 like so
<!DOCTYPE html>
<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
but the output is treated as latin1, as special characters are displayed as 2 characters.
Page Info in Firefox says the page is encoded with ISO-8859-1 despite the HTML header.
I have only been able to find these two posts about the problem, but they are old and very complicated.
Anyone that have solved this problem?
Update: Here are the HTTP header from FireBug.
Response Headers
Date Tue, 26 Apr 2011 09:53:24 GMT
Server Apache/2.2.3 (CentOS)
Connection close
Transfer-Encoding chunked
Content-Type text/html; charset=ISO-8859-1
Request Headers
Host example.com
User-Agent Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20100101 Firefox/4.0
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-gb,en;q=0.5
Accept-Encoding gzip, deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 115
Connection keep-alive
I noticed that if I force UTF-8 by FireFox->Web Developer->Character Encoding->Unicode (UTF-8), if looks correct.

Your HTTP headers:
Content-Type text/html; charset=ISO-8859-1
… claim the document is encoded as Latin 1. Real HTTP headers take priority over HTML <meta> data.
$webapp->header_add(-type => 'text/html; charset=UTF-8');
… should do the job if I'm reading the documentation correctly.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Tomcat + Wicket: UTF-8 chars not rendering properly - html

Related

Why is Firefox showing me a cached version of a page

HTML email (from SendGrid) not being rendered properly

Typo3 website. When surfing around the site sometimes the browser shows the html code instead of showing the site

Trying to keep it unicode all the way

Getting latin1 instead of UTF-8 with CGI::Application

Categories

Resources