Embedding and Displaying Chinese/Japanese - actionscript-3

I have been working on a subtitles engine for a Flash/FLV video player. On my Mac everything is great: nice anti-aliased glyphs, all the characters displaying, etc. Switch to Windows and it all goes out the window. Some machines with East Asian language support enabled display fine, but I can't guarantee all users will have this option selected.
I am using the TLFTextField, I am pulling in UTF-8 XML with Chinese/Japanese characters.
I have tried embedding the required fonts/glyphs, but that pushes the file size up massively.
I have also tried changing it to Unicode, with no joy. Has anyone got any experience with displaying these characters while keeping the file size low?

I'm not really offering a solution to your question, but if the user wants Chinese or Japanese subtitles, I'm pretty sure they will have the correct encoding.

Related

Is It Safe To Use Unicode Literals in HTML?

I am making an application, and I want to add a "HOME" button.
After much struggling with various icon libraries, I stumbled upon this site,
http://graphemica.com/%F0%9F%8F%A0, with this
🏠
A Unicode symbol, which is more akin to a letter than an image.
I pasted it into my HTML, and it just worked™.
All this seems a little too easy, though. Are unicode symbols widely supported? Is there some kind of problem with them that leads people to use icon libraries instead?
It depends on what you mean by "safe".
The user needs to have a font containing the glyph, so you should include the relevant font yourself, and in various formats: there is not yet a single format recognized by every widely used web browser.
Additionally, fonts with multiple colours are not fully supported on various systems, so you should think about what you expect users to do with the symbol (click, select, copy, etc.).
Additionally, every font has its own design, so things can look different between fonts (and therefore between browsers and operating systems). We do not yet have a "Helvetica Home" or a "Times New Roman Home".
All of these points can be addressed by using a web font with monochrome glyphs (but such a font could be huge if it had to include all Unicode code points plus the usual combinations).
It seems that various recent browsers crash if there are very many different glyphs, but usually this should not be a problem.
I also recommend adding ARIA attributes so that your page can also be used by, e.g., screen readers (and braille displays).
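For example, a minimal sketch (the aria-label text is just illustrative):

    <span role="img" aria-label="Home">&#127968;</span>

This tells assistive technology to announce the symbol as "Home" rather than reading out the raw character name.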
Note: on the plus side, the few people who use a text browser can still see the HOME symbol (which is not the case with an image), if anybody still cares about that use case.
Some things you want to make sure you’re doing:
Save your HTML file as UTF-8. In fact, save all text files as UTF-8 unless there’s some reason you can’t.
Put the line <meta charset="utf-8" /> near the top of your HTML file.
Make sure your server isn’t misconfigured to tell all browsers that webpages are in the wrong encoding.
If, somehow, it is and you can’t fix it, fall back on &entities;.
Specify a font stack for your emoji in CSS with a set of fonts that cover nearly every system, perhaps including Apple Color Emoji, Noto Color Emoji, Segoe UI Emoji and Twemoji (see the sketch after this list).
If a free font such as Noto or Symbola contains the emoji you use, you can package it as a WOFF to be sure it will always display the way you want. (As of 2018, Tor browser does not show most emoji correctly by default, but mainstream browsers do.)
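Putting the last two points together, a minimal sketch (the file names "emoji-subset.woff2"/"emoji-subset.woff" and the class name are assumptions; swap in whatever subset font you actually ship):

    <meta charset="utf-8">
    <style>
      /* Self-hosted subset so the glyph looks the same everywhere it is missing */
      @font-face {
        font-family: "Subset Emoji";
        src: url("fonts/emoji-subset.woff2") format("woff2"),
             url("fonts/emoji-subset.woff") format("woff");
      }
      /* Self-hosted font first, then the common system emoji fonts */
      .emoji {
        font-family: "Subset Emoji", "Apple Color Emoji", "Noto Color Emoji",
                     "Segoe UI Emoji", sans-serif;
      }
    </style>
    <span class="emoji">&#127968;</span>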
I think using Unicode is good practice for development, because Unicode characters are essentially part of your operating system: you don't need any special library or plugin, and you can treat them like regular text.
The only problem is that the code can be difficult to read or understand. It is not easy to see that &#127968; (🏠) prints a home icon.
Even 8-bit PNGs are faster than font icons.
Image icons can be lightweight but still slow down your site with another HTTP request and time for the image to load. With images you don’t have flexibility over the color and scaling. SVG vector image alternatives are still not faster than plain-text (Unicode characters). Unicode doesn’t require additional HTTP requests and can be made to scale nicely.
If you are developing a website that uses only simple shapes, you can use Unicode symbols as a replacement for font icons.
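For instance, a simple monochrome symbol can be recoloured and scaled with ordinary CSS and costs no extra HTTP request (a small sketch; the class name is made up, and note that full-colour emoji generally ignore the CSS color property, while plain text symbols like the star below respect it):

    <style>
      .icon-star { color: #0a7; font-size: 2em; }
    </style>
    <span class="icon-star">&#9733;</span> Favourites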
I think almost every developer uses icon libraries because of code readability, ease of use, and the wider range of options.
Safe or Not
I cannot say whether it is safe or not.
Because Unicode contains such a large number of characters and incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This is especially important as more and more products are internationalized. This document describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account, and provides specific recommendations to reduce the risk of problems.
Read about UNICODE SECURITY CONSIDERATIONS
Here are a few precautions to take while doing that. I did some research and found this to be the most helpful for your question. I don't know how you would do it otherwise, but credits go to Mr.GOY:
Displaying unicode symbols in HTML

Is it possible to sniff out what font a browser is currently using from server side?

Basically I want to echo back to the user what font they are using. Is it possible to do? If there is a WordPress plugin for it, that would be extra nice.
The page in question is going to show Chinese characters. My friend wants to have one column with a character as it looks in mainland China, and then a column with how it looks in Taiwan (there are slight, but distinct, differences - see the article at the link), both these columns will be pictures. And then a third column that displays the character using your font. It would be neat if there was a way to know which variant the user's font is displaying. But now that I write it out, it seems like a very hard problem.
Hacking Chinese on character differences

What are these strange characters in HTML source?

My friend runs a website and had an e-mail from Google Safesearch informing him he was hosting a phishing page. It turns out his cPanel was brute-forced (weak password) and the attackers uploaded some pages onto his server. He told me about it and I wanted to take a look at how sophisticated they are.
In many of the files, certain words/portions of text are strange. They display perfectly in a web browser, but are jumbled inside the HTML. I was wondering if anyone can tell me what this is?
Examples:
<title>WДlĂ‘ĂÂŸmД tĂÂŸ ДВаy: Sign in</title>
<span class="txtbox_title">РаsswĂÂŸrd</span>
<a class="three" href="#">FĂÂŸrgĂÂŸt yĂÂŸur
It's also worth noting that there is normal text throughout the page that displays perfectly also.
I assume this is to stop the detection of certain words in the page, but I'm not sure. Any information would be great.
Edit: This was originally tagged as PHP. I realised that it probably shouldn't be, so I removed the tag. Be nice, kids.
Edit edit: For clarity, it's a phishing page targeting eBay users.
The examples I posted in the original post are (in order):
eBay: Sign In
Your Password
Forgot your [password]
As such I don't believe it to be any sort of malware, but rather a method of obfuscating text to avoid detection in browsers such as Chrome (which I assume detect 'hot' words in their algorithm).
They are UTF-8 encoded Cyrillic letters, and possibly other characters, chosen for their visual similarity to common Latin letters. You are viewing the page in an editor that does not interpret the data as UTF-8 but as a Latin-1 encoding.
For example, what you see as "Ð¾" is actually two bytes, 0xD0 0xBE. When interpreted as UTF-8 data (which is what browsers do here), they represent "о" U+043E CYRILLIC SMALL LETTER O. It is identical to the common Latin letter "o" in visual appearance (in any font that contains both letters), but coded as a separate character due to belonging to a different writing system. To any program, they are quite distinct characters, unless the program has been separately coded to handle "confusables".
Such confusion is often intentionally created for various reasons. You are probably right in assuming that here the purpose was "to stop the detection of certain words in the page". When e.g. "Forgot" is written using Cyrillic o's (Fоrgоt), normal Find operations will not find it when searching for "Forgot".
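To illustrate (a small sketch; the second line writes U+043E as a numeric character reference so the difference is visible in the source):

    <!-- Both lines render as "Forgot", but a plain-text search for
         "Forgot" only matches the first one. -->
    <p>Forgot</p>
    <p>F&#1086;rg&#1086;t</p>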
My best guess is that it is a custom type of keylogger. The "WДlĂ‘ĂÂŸmД tĂÂŸ ДВаy" would be parsed by the keylogger to output some data into a database that can be mined later for important information.
My second guess is that it is a means to scare or mess with the person who owns the site.
My third guess is that the malware was written in Chinese or some other language, and when the code was translated back into UTF-8, some of the unused characters came out as this strange content.
EDIT
My fifth guess is that the phishing website was programmatically fetching the source code of the eBay site and parsing it into its own HTML file, and that eBay has its own countermeasure against this type of attack: scrambling the letters in the source code.
If that were the case, there would have to be some kind of JavaScript that undoes the scrambling from the original source code.

Odd HTML/XML encoding issue

I'm having some real issues with a site we're building on our bespoke content management system. The system renders all views via XSLT, which may be the problem.
The problem we're experiencing appears to be the result of character encoding mismatches, but I'm struggling to work out which part of the process is breaking down.
The issue does not occur in Firefox or Chrome, and IE is fine on the initial load of the page and when it is refreshed. However, when using the 'back' or 'forward' buttons in IE, any Unicode characters show as a white question mark in a black diamond, which implies that the wrong character set is being used. We've also seen odd results from this in how Google indexes the page (it appears to index the DOCTYPE declaration and the content of the head element rather than the page content, as would normally be the case).
All of the XSLT stylesheets are outputting UTF-16 and the XSLT files themselves are UTF-16 files (previously there was a mismatch). The site is serving the pages as UTF-16 and the HTML output has a meta tag setting the content type to use a charset of UTF-16.
I've checked the results using Fiddler to see what's coming from the server, however, Fiddler isn't logging a request/response when IE uses the back/forward buttons, so presumably it's got them cached somewhere.
Anyone got any ideas?
The site is serving the pages as UTF-16
Whoah! Don't do that.
There are several browser bugs to do with UTF-16 pages. I hadn't heard of this particular one before but it's common for UTF-16 to break form handling, for example. UTF-16 is very rarely used on the web, and as a consequence it turns up a lot of little-known bugs in browsers and other agents (like search engines and other tools written in one of the many scripting languages with poor Unicode support like PHP).
the HTML output has a meta tag setting the content type to use a charset of UTF-16
This has no effect. If the browser fails to detect UTF-16 then, because UTF-16 is not ASCII-compatible, it won't even be able to read the meta tag.
On the web, always use an ASCII-compatible encoding—usually UTF-8. UTF-8 is by far the best-supported encoding, and is almost always smaller in size than UTF-16. UTF-16 offers pretty much no advantage and I would avoid it in every case.
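As a rough sketch of what that looks like in practice (the header is shown as a comment because it is server configuration, not markup):

    <!-- The server should send:  Content-Type: text/html; charset=utf-8  -->
    <!DOCTYPE html>
    <html lang="en">
    <head>
      <!-- ASCII-compatible, so the browser can read this even before it
           has settled on an encoding -->
      <meta charset="utf-8">
      <title>Example page</title>
    </head>
    <body>
      <p>Body text, including any non-ASCII characters, saved as UTF-8.</p>
    </body>
    </html>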
Possibly IE is corrupting the files when they are read from the cache. It could be related to this (unfortunately unanswered) question:
Firefox & IE: Corrupted data when retrieved from cache
A few things you could check/try:
Make sure the encoding is specified both in the HTTP Content-Type: header and in the <?xml encoding="..."?> declaration at the top of the XML (see the sketch after this answer)
Are you specifying the endianness of your UTF-16 or relying on a byte order mark? If the latter, try specifying it explicitly. I think Windows is usually fond of UTF-16LE.
Are you able to try another encoding? Namely UTF-8?
Are you able to disable caching from the server end (if it's practical)? Pragma: no-cache, or its modern-day equivalent Cache-Control: no-cache? (Sorry, it's been a while since I played with this stuff.)
Sorry, no real answer here, but too much to write as a comment.
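For the first two points, a rough sketch (generic placeholders, not the asker's actual CMS views). In the source XML, name the endianness explicitly rather than relying only on a byte order mark:

    <?xml version="1.0" encoding="UTF-16LE"?>

And in the stylesheet, declare the output encoding so the serializer, the HTTP Content-Type header and the meta tag can all agree (or, per the answer above, switch the whole chain to UTF-8):

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html" encoding="UTF-8" indent="yes"/>
    </xsl:stylesheet>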

HTML: How do I debug why a language does not display correctly

I was recently asked why a Tumblr theme of mine does not display Vietnamese correctly on this site. How do I debug what the problem is?
I wonder if it's because of the use of a custom font or Cufón?
Maybe it's a character set issue? But UTF-8 should support most languages?
Debugging is difficult, especially if you don't read the language in question. There are some things you should check though:
1.) Fonts. This is the main cause of trouble. If you want to display a character, you must have that character in the selected font. Standard fonts may work on an internationalised Windows installation, but there are also "Unicode" fonts (e.g., Arial Unicode MS) that you may want to specify explicitly.
2.) Encoding. Make sure the page is served with an appropriate character set. Check the charset in both the HTTP headers and the HTML meta tag. UTF-8 is appropriate for most languages (see the sketch after this list).
3.) Browser and OS support. It's pretty much a given these days that browsers support non-Latin character sets; however, it's possible the client has a very old or unusual browser. It can't hurt to find out which browser/OS combination they are using and what their "Regional Settings" are.
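A minimal sketch combining points 1 and 2 (the font names are examples, not a guaranteed fix for the theme in question):

    <meta charset="utf-8">
    <style>
      /* Fall back to fonts known to cover Vietnamese if the custom
         theme font is missing the accented characters. */
      body { font-family: "Custom Theme Font", "Arial Unicode MS", Arial, sans-serif; }
    </style>
    <p>Tiếng Việt</p>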