Trying to keep it Unicode all the way - HTML

Arabic user data submitted from a website form occasionally ends up as mojibake in our database. A user would type something like:
الإعلان العالمى لحقوق الإنسان
in an input form and the post is received by a server and stored in a database. When we retrieve the message from the database, it reads:
الإعلان العالمى لحقوق الإنسان
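For what it's worth, the garbled text above is exactly what you get when the UTF-8 bytes of the original string are decoded as Windows-1252 somewhere along the chain. A minimal sketch that reproduces (and reverses) the damage, assuming Windows-1252 is the wrong charset involved:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "الإعلان العالمى لحقوق الإنسان";

        // Encode as UTF-8, then decode those bytes as Windows-1252:
        // this reproduces the garbled string we see in the database.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(garbled); // الإعلان العالمى لحقوق الإنسان

        // As long as every byte survived, the damage is reversible.
        String repaired = new String(garbled.getBytes(Charset.forName("windows-1252")),
                StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // true
    }
}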
The form is in an embedded iframe page with these tags:
<!DOCTYPE HTML>
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type" />
<!-- other header elements -->
</head>
<body>
<form accept-charset="utf-8" action="https://www.salesforce.com/servlet/servlet.WebToLead?encoding=UTF-8" method="post">
<!-- other body elements -->
</body>
</html>
A post generates these request headers:
Accept */*
Accept-Encoding gzip, deflate
Accept-Language en-US,en;q=0.5
Cache-Control no-cache
Connection keep-alive
Content-Length 543
Content-Type application/x-www-form-urlencoded; charset=UTF-8
Host www.salesforce.com
Origin [ -- redacted -- ]
Pragma no-cache
Referer [ -- redacted -- ]
User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 FirePHP/0.7.4
x-insight activate
And receives these response headers
HTTP/1.1 200 OK
Date: Fri, 25 Apr 2014 09:15:49 GMT
Cache-Control: private
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
I have no control over the server configuration of the machine serving the form or the server processing the form data.
Is there anything more I can do in the page markup that can prevent the problem? Are there known user agents which would ignore the accept-charset attribute?
Since the character scramble only happens occasionally, what is the best way to try and replicate / isolate the problem?
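For isolating affected rows, one check I'm considering is a round-trip test on the stored values: if re-encoding a stored string as Windows-1252 yields valid UTF-8 containing non-ASCII bytes, it was probably double-encoded. A rough sketch (the Windows-1252 guess and the helper name are my own assumptions):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeCheck {

    // Heuristic: true if the stored string looks like UTF-8 that was decoded as Windows-1252.
    static boolean looksDoubleEncoded(String stored) {
        byte[] bytes = stored.getBytes(Charset.forName("windows-1252"));
        try {
            // Strict decode: only succeeds if the bytes form valid UTF-8.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false;
        }
        // Pure ASCII is valid UTF-8 too, so require at least one high byte.
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksDoubleEncoded("الإعلان")); // true  (mojibake)
        System.out.println(looksDoubleEncoded("الإعلان"));         // false (clean UTF-8 string)
    }
}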
Thanks!

Related

Why is Firefox showing me a cached version of a page

I have a page that includes an iframe…
<!DOCTYPE html>
<html>
<body>
<!-- … -->
<iframe
src="/assets/js/pdfjs/web/viewer.html?file=2021-09-12_1200-file.pdf#zoom=page-width"
style="..."
></iframe>
<!-- … -->
</body>
</html>
The page is returned with the following response headers…
HTTP/1.1 200 OK
Date: Tue, 26 Oct 2021 11:02:17 GMT
Server: Apache/2.4.38 (Debian)
X-Powered-By: PHP/7.3.27
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
X-DEBUGKIT-ID: 77761443-2882-4882-b0e1-01eea68deded
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 2349
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
If I change the file path in the iframe src attribute (e.g. /assets/js/pdfjs/web/viewer.html?file=2021-10-26_1200-file.pdf#zoom=page-width - note the new timestamp) and reload the page, the old file is still returned, rather than the new one, despite the Cache-Control: no-store, no-cache, must-revalidate header.
Debugging the requests received by the server, I can see…
The parent page is requested and returned with the headers as above (with new Date & X-DEBUGKIT-ID header values), and the correct, updated iframe src value.
The iframe page is being requested with the original filename, rather than the new one (I'm assuming from the cached page).
If I reload using Cmd+Shift+R (to ignore the browser cache), then the correct iframe document is loaded.
What am I missing in this setup that is causing the page to be cached? I thought that the Cache-Control header we have should be sufficient here.
If I add a random query string to the parent page this correctly loads new documents, but I feel this is a hack that should not be needed.
I've also tried adding an ETag header containing a random string that's different for each request, but this seems to have no effect on the browser caching.

Requesting CSS stylesheet as text/css gets a response as text/html?

Basically I have:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="main.css"/>
By doing this I get a console error saying: main.css was not loaded because its MIME type, “text/html”, is not “text/css”.
Looking at my browser's network tab, it appears that the request asks for text/css but the response comes back as text/html.
Request Headers (0.361 KB):
User-Agent: "Mozilla/5.0 (X11; linux x86_64; rv:52.0) gecko/20100101"
Accept: "text/css,*/*;q=0.1"
Accept-language: "en-US, e;q=0.5"
Accept-Encoding: "gzip, deflate"
Response Header (0.123 KB):
Content-Type: "text/html;charset=ISO-8859-1"
Date:"FRI 20 Oct 2017 ..."
Transfer-Encoding: "chunked"
FYI: This stylesheet is requested on multiple pages; on the other pages it works, but not here.
Much regards

Website is displayed as HTML code instead of rendered HTML - intermittent

I have a problem with a website where sometimes only the HTML source text is displayed in the browser window instead of the rendered HTML page. This happens intermittently, in all browsers.
Example URL:
http://www.starkl.at/view/p-1258/Newsletter---Gartentipp/
The HTTP request headers from IE9 are (Cookies are not shown):
GET http://www.starkl.at/view/p-1258/Newsletter---Gartentipp/ HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: http://www.starkl.at/view/p-1931/Service/
Accept-Language: de-AT
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
DNT: 1
Host: www.starkl.at
Pragma: no-cache
The HTTP response headers are:
HTTP/1.1 200 OK
Date: Wed, 19 Sep 2012 07:43:49 GMT
Cache-Control: no-cache, no-store, must-revalidate, proxy-revalidate
Content-Type: text/html;charset=UTF-8
Content-Length: 21160
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Also the content length (in bytes) seems to match.
It's a Java 6/7 application running on a Tomcat 6/7, with an additional httpd 2.2.x in front.
Any idea what the problem could be?
Thanks in advance!
If the browser shows the code instead of rendering it, it is because it's being told to do so; probably your app is returning HTML encoded in a way that makes the browser think it's plain text.
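One way that can happen with a Tomcat app is a code path that writes the body without explicitly telling the container (and thus the browser) that it is HTML. A generic servlet sketch of what the explicit version looks like; this is only an illustration, not your actual code:
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PageServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Set the content type (and charset) before obtaining the writer,
        // so the browser is told to render the body as HTML.
        resp.setContentType("text/html;charset=UTF-8");
        PrintWriter out = resp.getWriter();
        out.println("<!DOCTYPE HTML><html><body>…</body></html>");
    }
}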
Open tools, options, email, email options, then uncheck "Read all standard mail in plain text." This is for Outlook 2003 so your version, if not 2003, might be slightly different.

Tomcat + Wicket: UTF-8 chars not rendering properly

I have a Wicket app with some pages containing accented chars, entered as UTF-8, e.g. "résumé".
When I debug the app via the traditional Wicket Start.java class (which invokes an embedded Jetty server) all is good. However when I try deploying to a local Tomcat instance, it renders as "résumé".
My document looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US"
xmlns:wicket="http://wicket.apache.org/dtds.data/wicket-xhtml1.4-strict.dtd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
résumé
</body>
</html>
Here's what curl -I returns for the page when running on Jetty:
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Language: en-US
Pragma: no-cache
Cache-Control: no-cache, max-age=0, must-revalidate
Content-Length: 13545
Server: Jetty(6.1.25)
And here's what Tomcat returns:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Pragma: no-cache
Cache-Control: no-cache, max-age=0, must-revalidate
Content-Type: text/html;charset=UTF-8
Content-Language: en-US
Transfer-Encoding: chunked
Date: Sat, 23 Jul 2011 14:36:45 GMT
The problem is that Wicket doesn't detect the encoding of the markup files correctly. They are encoded as UTF-8, so non-ASCII chars are represented by two bytes. But Wicket doesn't know that and reads them as two separate characters. Those two characters are then encoded as UTF-8 again in the response. Since the substituted characters (the "Ã"/"√"-style garbage) are not ASCII themselves, you will actually see more than two bytes per é in the response.
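To make that concrete, here is a small sketch of the double encoding, assuming the markup is mis-read as ISO-8859-1 (with a different OS default the substituted characters, and the byte count, will differ):
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    public static void main(String[] args) {
        byte[] fileBytes = "é".getBytes(StandardCharsets.UTF_8);             // 2 bytes: C3 A9
        String misread = new String(fileBytes, StandardCharsets.ISO_8859_1); // "Ã©", two separate chars
        byte[] response = misread.getBytes(StandardCharsets.UTF_8);          // 4 bytes: C3 83 C2 A9
        System.out.println(misread + " -> " + response.length + " bytes per é in the response");
    }
}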
Anyway, you need to fix this markup encoding interpretation. Check out the Wicket source code for XMLReader#init().
It reads like Wicket tries three things to find out about the encoding of a markup file:
Evaluates the <?xml ... ?> declaration at the beginning of the markup file. (Missing for you?)
Uses the default encoding specified by Application#getMarkupSettings().setDefaultMarkupEncoding(String)
Uses the OS default.
It looks like you are missing 1 and 2 at the moment, so Wicket falls back to 3, which doesn't work in your case. So try either of the first two.
I'm not sure why this is needed, but here's a workaround that solved this for me:
import org.apache.wicket.protocol.http.WebApplication;

public class Application extends WebApplication
{
    @Override
    protected void init()
    {
        getRequestCycleSettings().setResponseRequestEncoding("UTF-8");
        getMarkupSettings().setDefaultMarkupEncoding("UTF-8");
    }
}
To give credit where it is due, I found this solution here.

Getting latin1 instead of UTF-8 with CGI::Application

I am using CGI::Application with UTF-8 data.
In the HTML I have set the encoding to UTF-8, like so:
<!DOCTYPE html>
<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
but the output is treated as latin1, as special characters are displayed as 2 characters.
Page Info in Firefox says the page is encoded with ISO-8859-1 despite the HTML header.
I have only been able to find these two posts about the problem, but they are old and very complicated.
Has anyone solved this problem?
Update: Here are the HTTP headers from Firebug.
Response Headers
Date Tue, 26 Apr 2011 09:53:24 GMT
Server Apache/2.2.3 (CentOS)
Connection close
Transfer-Encoding chunked
Content-Type text/html; charset=ISO-8859-1
Request Headers
Host example.com
User-Agent Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20100101 Firefox/4.0
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-gb,en;q=0.5
Accept-Encoding gzip, deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 115
Connection keep-alive
I noticed that if I force UTF-8 via Firefox -> Web Developer -> Character Encoding -> Unicode (UTF-8), it looks correct.
Your HTTP headers:
Content-Type text/html; charset=ISO-8859-1
… claim the document is encoded as Latin 1. Real HTTP headers take priority over HTML <meta> data.
$webapp->header_add(-type => 'text/html; charset=UTF-8');
… should do the job if I'm reading the documentation correctly.