Characters not appearing in my website (using utf-8) - html

I am just learning how to code, so this may have an obvious solution, but I have searched for answers for a few days now. Several characters at the end of the alphabet (x, y, z, etc.) are not appearing on the actual website despite appearing in my code. For example, in HTML I will type "Last year was my senior year" and my website will say "Last ear was m senior ear". I have ensured that I am using UTF-8 (Unicode) on both platforms and have rewritten the code many times. Any help would be appreciated, and an image of the problem will be posted as well.
html code on left, website on the right

In your software, you could try to look for a "Display invisible characters" preference, maybe there's one somewhere around the missing letter that is the cause of your issue?
I sometimes have such issues when I accidentally do a fine space (alt + space on Mac) rather than a normal space. That sounds impossible but actually happens when I write code too fast, when I insert a space after or before characters such as [] or {} that require using the alt key...

Oh, this seems to be a Google Chrome/font problem: https://support.google.com/chrome/thread/2892761/the-letter-y-is-missing-on-some-pages-i-can-look-up-the-pages-in-another-browser-and-they-display-fine

Related

When to use &nbsp

I have seen &nbsp in html and can't quite tell what it does other than create some whitespace. I am wondering what exactly it does and when it should be used?
&nbsp; (it should have a semicolon on the end) is an entity for a non-breaking space.
Use it between two words that should not have a line break inserted between them by word wrapping.
There is a good explanation about when this is appropriate grammar on the English StackExchange.
It is sometimes abused to create horizontal space between content in web pages (since it will not collapse like multiple regular spaces). Padding and margins should usually be used instead of this hack.
One reason for &nbsp; is to insert multiple spaces in a document.
In HTML, multiple whitespace characters are collapsed into one space. This includes tabs and newlines.
If you wanted to display the following:
three   spaces.
You could insert 3 &nbsp; entities instead of using spaces like so:
three&nbsp;&nbsp;&nbsp;spaces.
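The collapsing behaviour described above can be sketched in a few lines of Python. This is only a rough model, not how any browser is actually implemented: the function name `collapse_whitespace` is made up for illustration, and it assumes that only space, tab, newline, carriage return and form feed collapse, while U+00A0 (the character behind the non-breaking space entity) does not.

```python
import re

NBSP = "\u00a0"  # the character that the non-breaking space entity stands for


def collapse_whitespace(text: str) -> str:
    """Roughly mimic how runs of ordinary whitespace render as one space."""
    # Only space, tab, newline, carriage return and form feed are listed;
    # U+00A0 is deliberately left out of the character class.
    return re.sub(r"[ \t\n\r\f]+", " ", text)


plain = "three   spaces."                # three ordinary spaces
padded = "three" + NBSP * 3 + "spaces."  # three non-breaking spaces

print(repr(collapse_whitespace(plain)))
print(repr(collapse_whitespace(padded)))
```

The first string comes back with a single space, while the second keeps all three non-breaking spaces.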
Edit: It's worth mentioning that &nbsp; is more of a historical artifact than anything else. Just about every use for it that is mentioned in the answers to this question has a better alternative means to accomplish that goal. However, &nbsp; is still with us, and these are some of the things people have used it for.
See also: http://www.sightspecific.com/~mosh/www_faq/nbsp.html
I don't know if this answers your question or not and certainly this answer is not of the caliber already provided by others, but the beauty of a discussion thread or Q&A site is the diversity of experience that might be found in it. So, on that note, I'll share with you what I've used &nbsp; for. (To be perfectly honest, 24 hours ago, &nbsp; was something I had never even heard of.)
Here's how I used &nbsp;. I was posting something using Markdown and I had a very simple two-item bulleted list. For the life of me I could not get the spacing before this list and after to look symmetrical. So, I did a web search and somehow ended up taking a look at this thread.
Before using &nbsp;, the paragraph that followed bullet point #2 collapsed the spacing between the bulleted point and the text, making it look as if the paragraph had something to do with bullet #2 specifically (which was not the case). I tried a lot of different things that I can't even remember now, but the one thing that ultimately worked was insertion of &nbsp;.
Since then, I've been seeing all sorts of posts that indicate some controversy over its use, but for non-coders who need to wrangle out of an unsightly/misleading formatting issue, &nbsp; is a very quick and useful fix.

Why shouldn't I use weird Characters in code/HTML documents?

I'm wondering if it's a bad idea to use weird characters in my code. I recently tried using them to create little dots to indicate which slide you're on and to change slides easily:
There are tons of these types of characters, and it seems like they could be used in place of icons/images in many cases, they are style-able and scale-able, and screen readers would be able to make sense of them.
But I don't see anyone doing this, and I've got a feeling this is a bad idea; I just can't decide why. I guess it seems too easy to be true. Could someone tell me why this is or isn't okay? Here are some more examples of the characters I'm talking about:
↖ ↗ ↙ ↘ ㊣ ◎ ○ ● ⊕ ⊙ ○  △ ▲ ☆ ★ ◇ ◆ ■ □ ▽ ▼ § ¥ 〒 ¢ £ ※ ♀ ♂ &⁂ ℡ ↂ░ ▣ ▤ ▥ ▦ ▧ ✐✌✍✡✓✔✕✖ ♂ ♀ ♥ ♡ ☜ ☞ ☎ ☏ ⊙ ◎ ☺ ☻ ► ◄ ▧ ▨ ♨ ◐ ◑ ↔ ↕ ♥ ♡ ▪ ▫ ☼ ♦ ▀ ▄ █ ▌ ▐ ░ ▒ ▬ ♦ ◊
PS: I would also welcome general information about these characters, what they're called and stuff (ASCII, Unicode)?
There are three things to deal with:
1. As characters in a sentence/text:
The problem is that some fonts simply do not have them. However since CSS can control font use you probably will not run into this problem. As long as you use a web safe font, and know that that character is available in that font, you should probably be okay.
You can also use an embedded font, though be sure to fall back on a web safe font that contains the character you need, as many browsers will not support embedded fonts.
However sometimes certain devices will not have multiple fonts to choose from. If that font does not support your character you will run into problems. However depending on what your site does and the audience you are targeting this may not be a problem for you. Not to mention that devices like that are very old, and uncommon.
All in all it was probably not a good idea a handful of years ago, but now you are not likely to have problems as long as you cover all your bases.
It is important however to point out that you should never hard code those characters, instead use HTML entities. Just inserting those characters into your code can lead to unpredictable results. I recently copied some text from Word directly into my code, Word used smart quotes (quote marks that curve inwards properly). They showed up fine in Notepad++, but when I viewed the page I did not get quotes, I got some weird symbol.
I could have either replaced them with normal quotes " or with HTML entities to keep the style: &ldquo; and &rdquo; (“ and ”).
Any Unicode character can be inserted with a numeric character reference of the form &#NNNN; (even those without special names).
Wikipedia has a good reference:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
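As a rough illustration of the entity approach, here is a small Python sketch that turns every non-ASCII character in a string into a numeric character reference, so the markup itself stays plain ASCII. The helper name `to_char_refs` is made up for this example; it relies on the standard `xmlcharrefreplace` error handler, and `html.unescape` is used only to check the round trip.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # The "xmlcharrefreplace" error handler emits &#NNNN; for anything
    # that cannot be encoded as ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


quoted = "\u201cbest\u201d"   # the word best inside curly ("smart") quotes
escaped = to_char_refs(quoted)

print(escaped)                                 # &#8220;best&#8221;
print(html.unescape(escaped) == quoted)        # numeric references decode back
print(html.unescape("&ldquo;best&rdquo;") == quoted)  # named entities, same result
```

A file that contains only the escaped form displays the same no matter which encoding the editor or server assumes, which is the point being made about pasted Word text.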
2. As UI elements:
While it may be safe to use them in many cases, it is still better to use HTML elements where possible. You could simply style some div elements to be round and filled/not filled for your example.
As far as design goes they are really limiting, finding one that fits with the style of your page can be a hassle, and may mean that you will definitely need to embed a font, which is still only supported by the latest browsers.
Plus many devices do not support heavy font manipulation, and will often display them poorly. It works in the flow of your text, but as a vital part of the UI there can be major problems. Any possible issue one of those characters can bring will be multiplied by the fact that it is part of your UI.
From an artistic stand point they simply limit your abilities too much.
3. What are you doing?
Finally, you need to consider this:
Text is for telling
Image is for showing
HTML is for organizing
CSS is for making things look good while you show them
JavaScript is for functionality
Those characters are text; they are for telling someone something. So ask the question: "What am I doing?" and then use what was designed for that task. If you are telling, use them; if you are showing, use an image or CSS.
I've seen this done before (the stars) and I think it's an awesome idea! It's also becoming quite popular to use a font (with @font-face) full of icons, like this one: http://fortawesome.github.com/Font-Awesome/
I can't see any downside to using a font like "Font Awesome" (only the upsides you mention, like scalability and the ability to change color with CSS). Perhaps there's a downside to using the special characters you mention, but none that I know of.
The problem with using those characters is that not all of them are available in all fonts used by all users, which means your application may look strange, or in the worst case be unusable. That said, it is becoming more common to assume the characters available in certain common fonts (Apple/Microsoft's Arial, Bitstream Vera). You can't even assume that you can download a font, as some users may capture content for offline reading with a service like Instapaper or Read It Later.
There are a number of problems:
Portability: using anything other than the 7-bit ASCII characters in code can make your code less portable, as recipients may use the wrong encoding. You can do a lot to mitigate this (e.g. use UTF-16 or at least UTF-8 encoded files). Most languages allow you to specify characters in strings using some form of escape notation (e.g. "\u1234" in C#), which will avoid the problem, but loses some of the advantages.
Font-dependency: user interface elements that depend on special characters being available in a font may be harder to internationalize, since those glyphs might not be in the font that you want/need to use for a particular audience.
No color, limited choice of art: while font glyphs might seem useful to a coder, they probably look pretty poor to a UI designer.
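To make the portability point a bit more concrete, here is a minimal Python sketch (Python rather than C#, purely for illustration) of the escape notation mentioned above. The constant name `DOT` is arbitrary. The source line contains only ASCII, yet the string holds the filled circle, and the bytes that end up in a file depend on the encoding chosen.

```python
# Written with an ASCII-only escape, so the source file survives any
# encoding mix-up; the string itself still contains U+25CF BLACK CIRCLE.
DOT = "\u25cf"

print(len(DOT))                  # one character
print(DOT.encode("utf-8"))       # three bytes in UTF-8
print(DOT.encode("utf-16-le"))   # two bytes in UTF-16 (little endian)
print(ascii(DOT))                # ASCII-safe representation of the string
```

The cost is the one already noted: the escaped form is harder to read than the character itself.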
The question is very broad; it could be split into literally thousands of questions of the type “why shouldn’t I use character ... in HTML documents?” This seems to be what the question is about, not really about code. And it’s about characters, seen as “weird” or “uncommon” or “special” from some perspective, not about character encodings. (None of the characters mentioned are encoded in ASCII. Some are encoded in ISO-8859-1. All are encoded in Unicode.)
The characters are used in HTML documents. There is no general reason against using them, but there are loads of specific reasons why some specific characters might not be the best approach in a specific situation.
For example, the “little dots” you mention in your example (probably not dots at all but circles or bullets), when used as control elements as you describe, would mean poor usability and poor accessibility. Making them significantly larger would improve the situation, but this more or less proves that such text characters are not suitable for controls.
Screen readers could make sense of special characters if they used a database of various properties of characters. Well, they don’t, and they often fail to read properly even the most common special characters. Just reading the Unicode name of a character can be cryptic or outright misleading. The proper reading would generally depend on meaning and context.
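For anyone wondering what these characters are called, the Unicode names can be looked up with Python's standard `unicodedata` module. This is just a small sketch (the helper name `describe` is made up), but it also shows why reading the bare name aloud is not very helpful to a listener.

```python
import unicodedata


def describe(chars: str) -> list[str]:
    """Return 'U+XXXX NAME' for each character in the string."""
    return [
        f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}"
        for c in chars
    ]


for line in describe("●○⊕▲▼"):
    print(line)
```

Names such as BLACK CIRCLE or CIRCLED PLUS say what the glyph looks like, not what it means on a particular page (current slide, next slide, and so on).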
The main issue, however, is that people do not generally recognize characters in the meanings that you would assign to them. How many people know what the circled plus symbol “⊕” stands for? Maybe 1 out of 1,000, optimistically thinking. It might be all right to use it on a page about advanced mathematics or physics, especially if the notation is defined there. But used in general text, it would be just… a weird character, and people would read different meanings into it, or just get puzzled.
So using special characters just because they look cool isn’t a good idea. Even when there is time and place for a special character, there are technical issues with them. How many fonts do you expect to contain “⊕”? How many of those fonts do you expect Joe Q. Public to have in his computer? In this specific case, you would find the font coverage reasonably good, but you would still have to analyze it and write a longish list of font names in your CSS code to cover most platforms. In the pile of poo case (♨), it would be unrealistic to expect most people to see anything but a symbol for unrepresentable character. Regarding the methods of finding out such things, check out my Guide to using special characters in HTML.
I've run into problems using unusual characters: the tools (editor, compiler, interpreter, etc.) often complain and report errors. In the end, it wasn't worth the hassle. Darn western hegemony, or homogeneity, or, well, something!

Thai line breaking: how to break Thai text effectively

Situation
with Thai text on a client site is that we can't control where exactly particular words/sentences are going to break between lines (how the web browser will handle it). Local reviewers often flag the resulting content appearance as incorrect.
Workaround
to this is that the copywriter needs to deliver Thai content with breaking (&#8203;) and non-breaking (&#65279;) zero-width-space chars included.
In practice, rather than:
ของเพื่อนๆ ที่ออนไลน์อยู่
we should use something as ugly as:
ของเพื่อนๆ&#8203;ที่&#8203;ออนไลน์&#65279;อยู่
The above is just an example, I don't really know where exactly the breakpoints are allowed.
In fact, non-breaking zero-width spaces alone would do the trick as well ... it's just stricter and more correct to use breaking ones too, for better accuracy.
And while this is definitely doable, it is also a time-consuming and not very effective solution for managing the content of a large site. Simply put, the effort doesn't match the result.
Research
so far has led to the workaround mentioned; I am looking for a better way to handle this. Even the W3C doesn't have a solution yet and is just discussing whether it should be part of the CSS3 specification.
Thai language utilizes spaces very rarely, mostly to distinguish between sentences etc. Therefore, common appearance of a Thai sentence is one looong string.
Where to break such a string when it runs over several lines is determined by identifying the individual words. For word identification, local dictionaries are used, which are most probably part of the operating system or web browser; I'm not entirely sure about these.
Apparently, the more web browsers / operating systems you check on the more results you get! Moreover, there's not much you can do about this as it's system driven and there are no "where to break Thai" settings available.
Using <wbr/>, &#8203; or &shy; to indicate where the breakpoints really are won't prevent the web browser from thinking (even though wrongly) that breaks are also possible in places where you haven't defined them, e.g. in the middle of a word, which might be grammatically incorrect.
If such a word is placed at the end of a line (which depends on screen resolution, copy length and the CSS rules defined) and the browser applies its wrong line-breaking rule to it, you end up with a Thai line-breaking issue, no matter that you have defined other breakpoints before, after or somewhere else in the word. The browser will always use the breakpoint that it thinks is closest to the end of the line, not just the ones you have gently suggested by inserting one of the mentioned chars in your markup.
That's why you actually need to focus on where not to break your text (non-breaking zero-width space), not where it's allowed. And that's what leads us back to the ugly and long markup example in the "Workaround" section above. That way a line break can strictly only occur where you have allowed it, but it's messy.
Any other solution
how to handle this more effectively would be appreciated ... and who knows, it might even help W3C in their implementation?
THANK YOU!
I know this thread is quite old, but I have something to say as a native Thai. I read lots of Thai web pages every day, and I feel the quality of Thai line breaking in modern web browsers is perfectly acceptable.
As far as I know, Google Chrome uses ICU4C, Internet Explorer uses the Uniscribe API, and Firefox uses libthai to break Thai sentences into words. For the Thai people I know, how these web browsers handle line breaks in Thai is perfectly acceptable. (Actually, we used to have this problem with very early versions of Firefox (1.x), but that is resolved now.)
Thai line breaking and word breaking, unlike in western languages, is still considered an unsolved problem and is still actively tackled by many linguistics researchers. Currently there is no implementation that can perfectly break a sentence into Thai words. The IBM ICU Boundary Analysis page contains some analysis of this problem.
Many times, it has something to do with the context. For example, the phrase "ตากลม" can be correctly broken into "ตา","กลม" or "ตาก","ลม". Each way says a totally different thing, but Thai readers can still perfectly understand the intended meaning, given the context.
Given that your local reviewers are already familiar with reading Thai websites, I think maybe they are pushing you too hard to resolve this problem. This is a common unsolvable problem for all Thai websites, web browsers, and even Microsoft Word.
It is best to wait (or contribute to IBM ICU) until Thai sentence breaking implementations get better. Let the web browsers handle this. I don't think trying to work around this problem is worth your valuable time. As far as I know, even Thai website publishers here just don't care to get this one right.
Should you need to publish a document with a perfect line/word breaking, you may consider other medium, such as PDF document in which you should have more control over the line breaks.
Hope this helps :)
The ICU and ICU4J libraries have a dictionary based word break iterator for Thai that you could use on the server side to inject breaking zero width spaces where appropriate.
Or, you could use this to build a utility that could run at build time or on delivery of translations, if you knew the spacing requirements that far in advance.
See the ICU Boundary Analysis documentation for more info. These libraries are available for C, C++, and Java.
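The injection step could look roughly like the Python sketch below. It does not call ICU at all: it assumes the text has already been segmented into words by some dictionary-based breaker, and the function names and the hard-coded word list are purely hypothetical. It only shows how zero width spaces (U+200B) would be placed between the words and written out as character references.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible, but a line may break here


def join_with_break_opportunities(words: list[str]) -> str:
    """Join already-segmented words with zero width spaces."""
    return ZWSP.join(words)


def as_html(text: str) -> str:
    """Write the invisible character as a visible numeric reference."""
    return text.replace(ZWSP, "&#8203;")


# Hypothetical segmenter output; a real pipeline would get this from a
# word break iterator rather than a hard-coded list.
segmented = ["ตา", "กลม"]

marked = join_with_break_opportunities(segmented)
print(as_html(marked))                      # ตา&#8203;กลม
print(marked.replace(ZWSP, "") == "ตากลม")  # the visible text is unchanged
```

Running this at build time or on delivery of translations keeps the markup clutter out of the copywriter's hands, although the result is only as good as the segmentation fed into it.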
There is a W3C working group working exactly on this (for Thai and other Southeast Asian languages). Their layout requirement draft is quite recent, from last month:
Thai Layout Requirements (Draft) (10 Jan 2023)
https://www.w3.org/International/sealreq/thai/
Thai Gap Analysis (19 Jan 2022) https://www.w3.org/TR/thai-gap/
I hope this information can feed into the fruitful discussion here.
You can also follow/join the Southeast Asia Language Enablement (sealreq) activity on GitHub: https://github.com/w3c/sealreq

Paste from Outlook/Word/Office to Embedded Browser

So, we have a great application that is going well, but some of our users like to copy their text to Word before pasting into our application. When they do that, the HTML is parsed out somewhat properly, but usually contains tags from Outlook or Word that our XHTML engine just doesn't like or understand.
For example, a user types a note into Word with some minor formatting in it and pastes it into our HTML editor (it's just a basic web browser with designMode turned on); the subsequent source includes <_o3a_p> tags, among others.
Am I going to have to just write a stripper for every type of MSO HTML tag?
I have had good luck pasting Word content into LibreOffice, and then re-selecting and copying the text out of LibreOffice into a web form.
It keeps the formatting and links, and removes all the Microsoft formatting code.
As a user that sometimes copies data from Word to a web form (I sometimes like to spellcheck first), I've found great success by first pasting into Notepad, then copying from there and pasting into the web form.
However, Word still sometimes has the last laugh. If you have "smart quotes" enabled, it turns
This is the "best" way.
into
This is the “best” way.
(Note the quotes around the word "best").
The easy way to fix this is to turn off Smart Quotes before I begin to type; I can also use Notepad to find all of the "smart quote" symbols (“ ” ‘ ’) and replace them with "normal quote" symbols (" " ' ').
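If the text ends up in code anyway, the same find-and-replace can be done programmatically. A minimal Python sketch (the names `SMART_TO_PLAIN` and `straighten_quotes` are just for illustration) using a translation table:

```python
# Map the four common "smart" quote characters to their plain equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as apostrophe)
})


def straighten_quotes(text: str) -> str:
    """Replace curly quotes with straight ASCII quotes."""
    return text.translate(SMART_TO_PLAIN)


print(straighten_quotes("This is the \u201cbest\u201d way."))
# This is the "best" way.
```

Only these four characters are handled here; Word can insert others (dashes, ellipses) that would need their own entries.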
The consensus seems to be that while some of the available tools are somewhat successful at automatically parsing MS Word tags, none are 100% perfect. Methods to parse those tags depend on the framework you are using.
Regular expression would probably be a clean fix.
Some more information about this topic can be found on this blog post, which basically documents the same struggle you seem to be having.
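For what a regular-expression stripper might look like, here is a rough Python sketch. It is an assumption-laden example, not a complete cleaner: the list of namespace prefixes (o, w, v, st1) is a guess at what Office typically emits, the names `OFFICE_TAG` and `strip_office_tags` are made up, and regular expressions in general cannot handle every HTML edge case.

```python
import re

# Matches opening and closing tags whose name carries an Office-style
# namespace prefix, e.g. <o:p>, </o:p>, <w:WordDocument>.
# The prefix list is an assumption and would need extending in practice.
OFFICE_TAG = re.compile(
    r"</?(?:o|w|v|st1):[A-Za-z0-9]+\b[^>]*>",
    re.IGNORECASE,
)


def strip_office_tags(markup: str) -> str:
    """Remove namespaced Office tags but keep the text between them."""
    return OFFICE_TAG.sub("", markup)


pasted = '<p class="MsoNormal">Hello<o:p></o:p></p>'
print(strip_office_tags(pasted))   # <p class="MsoNormal">Hello</p>
```

Note that this leaves attributes such as class="MsoNormal" and any conditional comments untouched, which is part of why none of these approaches is 100% reliable.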

What are the HTML entities for up and down triangles?

I've found the outlined versions, but I want the solid up and down triangles.
Does anyone know these entities?
All named HTML entities are specified in chapter 24 of the HTML standard. The only thing missing from the page are rendered entities, but you can easily create your own copy with the additional information by applying a simple regexp:
s/<!ENTITY (\S+)/<!ENTITY \1 &\1;/
Not all entities are named. For many, you need to specify the Unicode code point, either in decimal (&#9650; ▲, &#9660; ▼) or hex (&#x25B2; ▲, &#x25BC; ▼).
A little bit late, but you can use &blacktriangledown; (▾) and &blacktriangle; (▴) to make both the up and down filled-in triangles. I was looking for it myself and the alt codes didn't help, so I decided to share this. The same thing works for both left and right as well.
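A quick way to check which character a given reference produces is Python's standard `html.unescape`, which knows both numeric references and the HTML5 named entities. A small sketch (the helper name `decode_ref` is made up):

```python
import html


def decode_ref(ref: str) -> str:
    """Decode a single character reference and report its code point."""
    ch = html.unescape(ref)
    return f"{ref} -> U+{ord(ch):04X} {ch}"


for ref in ("&#9650;", "&#x25B2;", "&#9660;", "&#x25BC;",
            "&blacktriangle;", "&blacktriangledown;"):
    print(decode_ref(ref))
```

The output shows that the numeric references give U+25B2 and U+25BC, while the two named entities resolve to U+25B4 and U+25BE, which are the small versions of the solid triangles.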
I don't know if I've ever seen what you're looking for. Maybe a better way of doing it would be to create the arrows in Photoshop on a transparent background (.gif or .png format), and then load up the images.
Check that, you can do it through alt characters.
http://www.tedmontgomery.com/tutorial/ALTchrc.html
▼ ▲
Using the alt characters on your computer keyboard is a big no-no if you are working on a web page, for many reasons: the encoding of the website, the encoding of the database driving the website (if any), the code page of the computer viewing the website, and the code page your own PC's keyboard is set to. Those are mostly factors you cannot control, so some people will see wonky letter combinations or squiggle characters instead of what you intend. For web pages, use the HTML codes for those characters whenever you can, or at least entity-encode them and make sure the encoding is declared in the HTML head of your site. That way people will see what you intend them to.
Now, if you are doing this in Word for a document that will be viewed in your own country, you are probably safe. But for online things (site coding or data entry) you should avoid this like the plague.
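The "wonky letter combos" can be reproduced in a couple of lines of Python. This is only a sketch of one common mismatch, assuming the page was saved as UTF-8 and then read as Windows-1252 (cp1252); the function name `misread` is made up.

```python
def misread(text: str, written_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode text one way and decode it another, as a confused reader would."""
    return text.encode(written_as).decode(read_as)


print(misread("\u25b2"))   # solid up triangle turns into three junk characters
print(misread("\u201c"))   # a curly opening quote does the same

# A numeric character reference is plain ASCII, so it survives unchanged.
print(misread("&#9650;"))
```

The triangle comes out as "â–²" and the curly quote as "â€œ", while the entity-encoded version is untouched, which is the argument for using the HTML codes.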