How do I paste Chinese text into an HTML snippet with no UTF-8 meta tag?

I have some pages into which I'm copying Chinese text from a Word doc. I have two kinds of HTML documents. Some are parent pages, with a full head tag and a meta tag, where I have specified:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I also have some snippets that just start with <div> because they are include files.
When I paste Chinese into the docs with the meta tag, it pastes fine and I can see the Chinese.
When I paste Chinese into the HTML snippets that start with a <div>, I just get boxes.
How do I paste Chinese into a snippet so that I can see the characters?
In Dreamweaver, I can flip over to visual mode and paste them in, but when I flip back to code mode, it shows me that all the characters have been converted to their URL-encoded equivalents. On the UTF-8 page, I can paste the Chinese and read it in code mode as Chinese characters.
HOWEVER
The weird thing is I have several include files. One I opened up has no meta tag on it, and it already had Chinese in it from development I did a few weeks back. I can still paste new Chinese into it and it's fine.
So basically, I have regular old HTML files with no meta tag about UTF-8; some allow me to paste Chinese into them in code mode and it works fine, while others don't. The structures of these various HTML snippets are nearly identical.
Could this be a Dreamweaver bug, or is there some trick / setting?

You cannot do so reliably without some sort of metadata; in HTML, that is what the <meta> tag is for. (Otherwise the browser has to guess, and it uses the default that the user has chosen, which in most cases is anything but what one expects.)

Dreamweaver also has to determine the encoding of the files it opens, and in the absence of metadata it will presumably use the same techniques as a web browser (such as detecting byte sequences / character distributions that are only likely to appear in a specific encoding) to come up with a best guess.
This is the likely reason why one snippet opened with the correct encoding and another did not. Once open, Dreamweaver "knows" that a file is encoded with UTF-8, so it can continue to use UTF-8 for that file (and may even add a BOM when saving, to ensure it opens correctly in future), which is probably why saving over the bad one with the good one fixed the issue.
A good editor will let you specify the encoding to use when opening (and saving) a file. You can do this in Dreamweaver via Modify > Page Properties (see: UTF-8 Without BOM?).
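To make that concrete, here is a minimal sketch of the two kinds of file (the file names and the server-side include directive are only placeholders; the question doesn't say how the snippets actually get pulled in). The browser only ever sees the assembled page, so the parent's meta tag covers the snippet, but an editor opening the snippet on its own has no metadata to go on and has to guess:

<!-- parent.html, saved as UTF-8; its meta tag describes the whole served page -->
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Parent page</title>
</head>
<body>
<!--#include virtual="snippet.html" --> <!-- placeholder include mechanism -->
</body>
</html>

<!-- snippet.html, also saved as UTF-8, but starting with a bare div and carrying no encoding metadata of its own -->
<div>中文</div>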

Related

Strange validation behaviour of the charset with two html of the same template

I have got two HTML documents based on the same template. I built both exactly the same and then changed the contents inside the divs. I'm using the DTD HTML 4.01 Transitional and the charset ISO-8859-15 (for Spanish language, you know accents and so on) in a meta tag inside the head.
And when it comes to validation, one parses and the other doesn't, and I can't figure out why.
It complains about some accents in one of the documents that are also present in the other document which gets no complaints.
I find it funny, but there must be a reason.
I think I found the problem. I opened the file that was giving me trouble in plain Notepad, and once it was open there I could see that I had strange characters like “ or ‰ in my code. I just removed them, wrote the contents properly again and, of course, it then passed the parser. I could not see those characters with the file opened in Notepad++, which is why the parser error I was getting seemed so strange to me.
I hadn't set the encoding in my Notepad++ to ANSI, and maybe that was the reason I couldn't see those odd characters.

How to make the website show signs like "č" and "ć"?

I'm making a website that is in Croatian, and I need to use signs like: "č", "ć", "ž", "đ" and "š". They are currently displayed as little boxes.
Info:
I use Notepad++.
I set the encoding there to UTF-8.
I put the following line of HTML in: <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
However, it does not work. Even Notepad++ can't display my characters using UTF-8, so that would suggest that I should probably use something else...
http://webdesign.maratz.com/lab/utf_table/
Use HTML entities, for example:
č : &#269;
ž : &#382;
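Used in the markup, the entities keep the file itself plain ASCII; a tiny sketch (the paragraph is just an example):
<p>&#269; &#263; &#382; &#273; &#353;</p>
This renders as "č ć ž đ š" no matter what encoding the file is saved in.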
This sounds more like a font issue than a character encoding issue. If it were a character encoding issue, the characters would most likely be displayed as 2+ ASCII characters. The boxes, however, typically mean the character encoding is correct, but that specific character is not available in the font being used (which is especially common with lesser-used fonts). This would explain why it's behaving incorrectly in both the website and Notepad++.
To fix the issue, simply use a different font in your editor and website.
Note: I recommend a widely used font for the best chance of it working. Specifying a generic name in the website (e.g. serif or sans-serif) will probably have even better results, as the OS/browser would decide on the best font to use.
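As a quick way to test that in the page itself, you could fall back to a generic family and let the OS/browser pick the font (a throwaway test span, not something to ship):
<span style="font-family: sans-serif;">č ć ž đ š</span>
If the characters show up here but turn into boxes with the site's normal font, the font is the culprit rather than the encoding.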
In short, be consistent about your character encoding throughout.
Configure your editor to save in the encoding you want
If you use any server-side programming, make sure it isn't transcoding your data
If you use a database, make sure it is configured to use the same encoding
Configure your server to emit a Content-Type header that specifies that encoding
Use the meta tag in your question
The W3C provides useful material on encodings that starts here.
A useful site for special characters and their character codes: CopyPaste Character
To 'type' them, use the Alt codes.
However, to use them in your site, you'd be better off using the HTML codes, like the ones you can find on CPC.
As a test, try this:
<span style="font-family:Arial Unicode MS">
č ć ž đ š
</span>
You should be able to see your characters correctly.
I've just copied and pasted a line from your question, along with your meta tag, and placed it into a plain text file in vi.
It works just fine - all characters are displayed fine: http://www.dusystems.com/tmp/1.html
If you can't do the same with your editor then the problem is with the editor and not character sets and encodings.
If you're on Windows you can use its built-in Notepad to edit UTF-8 files. Open Notepad, type all of your special characters, add the meta tag. When doing Save As select UTF-8 from the Encoding drop-down in the dialog. Save as something.html and open in IE. It will 100% work.

Odd HTML/XML encoding issue

I'm having some real issues with a site we're building on our bespoke content management system. The system renders all views via XSLT, which may be the problem.
The problem we're experiencing appears to be the result of character encoding mismatches, but I'm struggling to work out which part of the process is breaking down.
The issue does not occur in Firefox or Chrome, and in IE the page is fine on the initial load and when it is refreshed. However, when using the 'back' or 'forward' button in IE, I find that any Unicode characters show as a white question mark in a black diamond, which implies that the wrong character set is being used. We've also seen odd results from this in the page as indexed by Google (it appears to index the DOCTYPE reference and the content of the head element rather than the content, as would normally be the case).
All of the XSLT stylesheets are outputting UTF-16 and the XSLT files themselves are UTF-16 files (previously there was a mismatch). The site is serving the pages as UTF-16 and the HTML output has a meta tag setting the content type to use a charset of UTF-16.
I've checked the results using Fiddler to see what's coming from the server; however, Fiddler isn't logging a request/response when IE uses the back/forward buttons, so presumably it's got them cached somewhere.
Anyone got any ideas?
The site is serving the pages as UTF-16
Whoah! Don't do that.
There are several browser bugs to do with UTF-16 pages. I hadn't heard of this particular one before but it's common for UTF-16 to break form handling, for example. UTF-16 is very rarely used on the web, and as a consequence it turns up a lot of little-known bugs in browsers and other agents (like search engines and other tools written in one of the many scripting languages with poor Unicode support like PHP).
the HTML output has a meta tag setting the content type to use a charset of UTF-16
This has no effect. If the browser fails to detect UTF-16 then, because UTF-16 is not ASCII-compatible, it won't even be able to read the meta tag.
On the web, always use an ASCII-compatible encoding—usually UTF-8. UTF-8 is by far the best-supported encoding, and is almost always smaller in size than UTF-16. UTF-16 offers pretty much no advantage and I would avoid it in every case.
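For what it's worth, if you do switch, the XSLT side of the change is usually small; a sketch, assuming a fairly ordinary stylesheet (the template below is a placeholder, and the server would also need to start sending charset=utf-8 in its Content-Type header):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- emit UTF-8 instead of UTF-16; with method="html" most processors also write a matching meta tag into the output's head -->
  <xsl:output method="html" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <html>
      <head><title>Example</title></head>
      <body><xsl:apply-templates/></body>
    </html>
  </xsl:template>
</xsl:stylesheet>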
Possibly IE is corrupting the files when they are read from the cache. Could be related to this (unfortunately unanswered) question:
Firefox & IE: Corrupted data when retrieved from cache
A few things you could check/try:
Make sure the encoding is specified in both the HTTP Content-Type header and the <?xml encoding=...?> declaration at the top of the XML.
Are you specifying the endianness of your UTF-16 or relying on the byte order mark? If the latter, try specifying it. I think Windows is usually fond of UTF-16LE.
Are you able to try another encoding? Namely UTF-8?
Are you able to disable caching from the server end (if it's practical)? Pragma: no-cache or whatever its modern-day equivalent is? (Sorry, it's been a while since I played with this stuff.)
Sorry, no real answer here, but too much to write as a comment.

HTML - Arabic Support

I have a website in which I have to put some lines in Arabic. How do I do it? Where do I get the Arabic text characters? How do I make the page support Arabic?
I have to put one line per page, and there are a lot of pages, so I can't go around making images and putting them in.
This is the answer that was required, but everybody answered only one part of many.
Step 1 - You cannot have multilingual characters in a non-Unicode document; convert the document to a UTF-8 document.
Advanced editors don't make this simple for you, so go low-level:
Use Notepad to save the document as meName.html and change the encoding type to UTF-8.
Step 2 - Declare in your HTML page that you are going to use such characters by adding:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Step 3 - When you put in the characters, make sure your container tags have the following two attributes set:
dir='rtl'
lang='ar'
Step 4 - Get the characters from a dedicated tool/editor or an online editor, as I did with Arabic-Keyboard.org.
Example:
<p dir="rtl" lang="ar" style="color:#e0e0e0;font-size:20px;">رَبٍّ زِدْنٍي عِلمًا</p>
NOTE: font type, font family, and font face settings will have no effect on these special characters.
The W3C has a good introduction.
In short:
HTML is a text markup language. Text means any characters, not just ones in ASCII.
Save your text using a character encoding that includes the characters you want (UTF-8 is a good bet). This will probably require configuring your editor in a way that is specific to the particular editor you are using. (Obviously it also requires that you have a way to input the characters you want)
Make sure your server sends the correct character encoding in the headers (how you do this depends on the server software you use)
If the document you serve over HTTP specifies its encoding internally, then make sure that is correct too
If anything happens to the document between you saving it and it being served up (e.g. being put in a database, being munged by a server-side script, etc.), then make sure that the encoding isn't mucked about with on the way.
You can also represent any Unicode character using only ASCII, via numeric character references.
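For example, a numeric character reference carries the code point in the markup itself, so the file can stay pure ASCII (the word here is just an arbitrary Arabic sample):
<p dir="rtl" lang="ar">&#1588;&#1594;&#1601;</p>
That renders the Arabic letters sheen, ghain and feh (شغف) regardless of the encoding the file itself is saved in.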
You not only have to put in the meta tag saying that the page is UTF-8, but you must also really make the document UTF-8. You can do that with good editors (like Notepad++) by converting the file to "Unicode" or "UTF-8 without BOM". Then you can simply use Arabic characters.
As this page is UTF-8, here are some examples (I hope I don't write anything rude here): شغف
If you use a server-side scripting language, make sure that it does not output the page in a different encoding. In PHP, for example, you can set it like this:
header('Content-Type: text/html; charset=utf-8');
If you don't even know where to get Arabic characters, but you want to display them, then you're doing something wrong.
Save files containing Arabic characters with encoding UTF-8. A good editor allows you to set the character encoding.
In the HTML page, place the following after <head>:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
If you're using XHTML:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
That's it.
An alternative way (without messing with the encoding of the file) is to use HTML escape sequences. This website does that job for you: http://www.htmlescape.net/
Won't you need to ensure the area where you display the Arabic is right-to-left oriented as well?
e.g.
<p dir="rtl">
I edited the HTML page with Notepad++, set the encoding to UTF-8, and it works.
As mentioned above, text editors will not use UTF-8 as the default encoding for documents.
However, most editors will allow you to change that in the settings, even for each specific document.
Check that you have <meta charset="utf-8"> inside the head block.

HTML Encoding Charset Problem I think?

I've been asked to add a testimonial to this page...
http://www.orchardkitchens.com/Showroom/testimonials.html
As you will see there are funny characters showing up all over the place, and it has thrown the structure of the page out.
I've since reloaded the backup and the funny chars are still appearing. Any ideas what I need to do??
Please ask if you need more info from me about the problem in hand.
Many thanks,
ETFairfax.
Looks to me as though some of the text was encoded as UTF-8 yet loaded as if it were an ANSI charset, and then an HTML encode was run over it, resulting in these extra characters. You will need to find the source text and rebuild the HTML, ensuring that whatever reads the source text understands that it's in UTF-8 encoding.
Valid HTML might be a start; an HTML document shouldn't start with a meta tag directly. Also, it seems that the charset problem is not with your web page but rather in the backend code. Look at the source; there are numerous things such as
“
appearing, which are HTML character entities for what UTF-8-encoded text yields when interpreted as Latin-1. So you should probably fix your code instead of the HTML (well, that too).
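To make the mechanism concrete (the testimonial text below is only a placeholder): the opening curly quote " is U+201C, stored in UTF-8 as the three bytes E2 80 9C. Read back as a single-byte Windows/ANSI charset, those bytes become three separate characters, and HTML-encoding the result afterwards bakes the damage into the markup:

<!-- what was typed, saved as UTF-8 -->
<p>“Fantastic service</p>

<!-- what you get once the bytes are misread and then HTML-encoded -->
<p>&acirc;&euro;&oelig;Fantastic service</p>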
Your HTML is syntactically invalid. The <!doctype> is missing, the <html> tag is missing, the <head> tag is missing, and the meta information cannot be parsed reasonably by the web browser.
Fix your HTML first and then retry.
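For reference, a minimal sketch of the structure the browser and validator expect (the doctype and title are just examples; use whatever the site actually targets):

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Testimonials</title>
</head>
<body>
<!-- the testimonials markup goes here -->
</body>
</html>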
As to the character encoding story, just ensure that you're using one and same character encoding everywhere. In the datastore, in the source files, in the response headers, etcetera. You may find the introductory text of this article useful to learn a bit more about character encodings. If you actually know/use Java, then you may find the proposed solutions useful as well.