I have been trying to validate my web page for the last two hours. I have only one error remaining before it validates successfully, but I keep getting a character decoding problem that I cannot get around.
The whole document is fine except it says...
Sorry, I am unable to validate this document because on line 77 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\x85" does not map to Unicode
The only thing on line 77 is some text inside <p> tags. I have tried changing them to <a> or <span>, and removing the <p> so the text sits loose inside the div, but the error only goes away when I delete the text inside the tags.
I am using the utf-8 encoding:
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
I am sorry if this is simple to resolve; my knowledge is extremely basic, as I am only a first-year computing student.
EDIT: the text inside the <p> tags is as follows:
<p>Our team thrives on the latest
political news as we do you. We work
around the clock to bring you the
latest, most important news as soon as
it happens. What do we ask in return…
nothing! This site is funded by us!
Your satisfaction is as much a pay
packet to us then a wad of untraceable
counterfeit notes.<br/><br/>
Sign up
to our newsletter to get regular
updates on news as soon as it happens
without having to navigate to our
site. For your security we only sell
the details you input to our site to
companies who “pinky promise” they
won’t be naughty with
them.<br/><br/>
StudentPolitics.Now
– Trading in satisfying others since
2011</p>
The problem is that the document only claims to be UTF-8 but isn't really.
Configure your editor to save in that format (the W3C has a guide for a number of them).
If you modify the HTML programmatically, then check that the program (and/or the database, if one is in play) isn't munging the data or storing non-UTF-8 data.
If that doesn't work, then try deleting the text and retyping it. You might have a zero width character that can't be represented properly in there.
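To see which bytes the validator is objecting to, something like the following sketch may help. It assumes the file was saved as Windows-1252 (which would explain the `\x85` in the error message); `find_invalid_utf8` and the sample markup are made up for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, offset_in_line, offending_byte) for each line
    that is not valid UTF-8. Only the first bad byte per line is reported."""
    problems = []
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as error:
            problems.append((line_number, error.start, line[error.start]))
    return problems


# Hypothetical page content: the ellipsis is stored as the single
# Windows-1252 byte 0x85, which is not a valid UTF-8 sequence on its own.
page = "<p>What do we ask in return… nothing!</p>".encode("cp1252")

for line_number, offset, byte in find_invalid_utf8(page):
    print(f"line {line_number}, offset {offset}: byte 0x{byte:02X}")
```

Running the same check on a file that really is UTF-8 should report nothing.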
Save your document in UTF-8. If it already is, try copying and pasting the source code into a new file and saving that as UTF-8 (some programs can get stuck on an encoding during edits).
What editor are you using?
EDIT: There are some non-ASCII characters in your text: … (three dots in a single character), “ ” (curly quotes), ’ (curly apostrophe), and – (dash).
I guess you've copied your text from Word or a similar word processor; that happens to me often too. Either change those characters to their ASCII counterparts or to HTML entities, or make sure you save the file as UTF-8.
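Both workarounds can be sketched in a few lines of Python. The replacement table and the function names are my own and only cover the characters mentioned above; extend them as needed.

```python
# Typographic characters from the text above and their plain ASCII stand-ins.
REPLACEMENTS = {
    "\u2026": "...",  # horizontal ellipsis
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u2013": "-",    # en dash
}


def to_ascii_punctuation(text: str) -> str:
    """Swap the typographic characters for their ASCII counterparts."""
    return text.translate(str.maketrans(REPLACEMENTS))


def to_numeric_references(text: str) -> str:
    """Turn every non-ASCII character into an HTML numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "who “pinky promise” they won’t… – 2011"
print(to_ascii_punctuation(sample))
print(to_numeric_references(sample))
```

Either output is pure ASCII, so it survives being saved in almost any encoding.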
The validator did not like the three full stops after the word "return". Three full stops in a row must mean something else...
Thank you for all your help guys.
Sometimes when you query a database, the characters returned may not be encoded as UTF-8; in that case, make sure the values returned by the queries are UTF-8. Also, taking a substring can sometimes cut a multi-byte character in half, for example accented letters and the ñ in Spanish, so that the character is displayed incomplete.
For example, check the source code in your browser.
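The substring problem is easy to reproduce. Here is a small sketch, assuming the cut happens on the UTF-8 bytes rather than on characters; `truncate_utf8` is a hypothetical helper, not something from a library.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")[:max_bytes]
    # The input was valid UTF-8, so the only thing "ignore" can drop here
    # is an incomplete character at the very end of the slice.
    return encoded.decode("utf-8", errors="ignore")


word = "mañana"
raw = word.encode("utf-8")  # ñ takes two bytes: 0xC3 0xB1
broken = raw[:3]            # ends in the middle of ñ

# A naive byte cut leaves half a character, shown as U+FFFD when decoded.
print(repr(broken.decode("utf-8", errors="replace")))
# Cutting on a character boundary avoids the broken character.
print(repr(truncate_utf8(word, 3)))
```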
My friend runs a website and received an e-mail from Google Safesearch informing him that he was hosting a phishing page. It turns out his cPanel was brute-forced (weak password) and the attackers uploaded some pages onto his server. He told me about it, and I wanted to take a look at how sophisticated they are.
In many of the files, certain words/portions of text are strange. They display perfectly in a web browser, but are jumbled inside the HTML. I was wondering if anyone can tell me what this is?
Examples:
<title>WеlÑоmе tо еВаy: Sign in</title>
<span class="txtbox_title">Раsswоrd</span>
<a class="three" href="#">Fоrgоt yоur
It's also worth noting that there is normal text throughout the page that displays perfectly also.
I assume this is to stop the detection of certain words in the page, but I'm not sure. Any information would be great.
Edit: This was originally tagged as PHP. I realised it probably shouldn't be, so I removed the tag. Be nice, kids.
Edit edit: For clarity, it's a phishing page targeting eBay users.
The examples I posted in the original post are (in order):
eBay: Sign In
Your Password
Forgot your [password]
As such I don't believe it to be any sort of malware, but a method of encrypting text to fight detection in browsers such as Chrome (which I assume detect 'hot' words in their algorithm).
They used UTF-8-encoded Cyrillic letters, and possibly other characters, chosen for their visual similarity to common Latin letters. You are viewing the page in an editor that interprets the data not as UTF-8 but as Latin 1.
For example, what you see as “Ð¾” is actually two bytes, 0xD0 0xBE. When interpreted as UTF-8 data (which is what browsers do here), they represent “о” U+043E CYRILLIC SMALL LETTER O. It is identical with the common Latin letter “o” in visual appearance (in any font that contains both letters), but coded as a separate character due to belonging to a different writing system. To any program, they are quite distinct characters, unless the program has been separately coded to handle “confusables”.
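The two readings of those bytes can be checked directly. This is just an illustration using the byte values quoted above:

```python
import unicodedata

raw = bytes([0xD0, 0xBE])

as_utf8 = raw.decode("utf-8")      # what the browser shows
as_latin1 = raw.decode("latin-1")  # what a Latin 1 editor shows

print(as_utf8, unicodedata.name(as_utf8))
print(as_latin1, [unicodedata.name(ch) for ch in as_latin1])

# The Cyrillic letter only looks like the Latin one; the code points differ.
print(as_utf8 == "o")
print(f"U+{ord(as_utf8):04X} vs U+{ord('o'):04X}")
```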
Such confusion is often intentionally created for various reasons. You are probably right in assuming that here the purpose was “to stop the detection of certain words in the page”. When e.g. “Forgot” is written using Cyrillic o’s (Fоrgоt), normal Find operations will not find it when searching for “Forgot”.
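A quick check of how a plain substring search treats the look-alike letters, plus a rough way to flag such words. `scripts_in` is a hypothetical helper that guesses the script from the first word of each letter's Unicode name; it is a crude heuristic, not a real confusables check.

```python
import unicodedata


def scripts_in(word: str) -> set:
    """Rough script detection: the first word of each letter's Unicode name
    (for example LATIN or CYRILLIC)."""
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}


spoofed = "F\u043erg\u043et"  # "Forgot" written with Cyrillic o (U+043E)

# A plain search for the Latin spelling does not match the spoofed word.
print("Forgot" in spoofed)

# A word that mixes scripts is a hint that something odd is going on.
print(sorted(scripts_in(spoofed)))
print(sorted(scripts_in("Forgot")))
```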
My best guess is that it is a custom type of keylogger. The WеlÑоmе tо еВаy would be parsed by the keylogger to output some data into a database that can be mined later for important information.
My second guess is that it is a means to scare or mess with the person who owns the site.
My third guess is that the virus was written in Chinese or some other language, and when the code was converted back into UTF-8, some of the unused characters produced the strange content.
EDIT
My fifth guess is that the phishing website was programmatically fetching the source code of the eBay site and parsing it into its own HTML file, and that eBay has its own countermeasure against this type of attack: scrambling the letters in the source code.
If so, there must be some JavaScript that undoes the scrambling of the original source code.
I have a client using a CMS for a site. When they enter apostrophes, they render as periods within the HTML. I've checked the raw source, and an apostrophe (' - not a MS Word curly "smart" apostrophe) is indeed there but it renders as a period.
I've gone into the database and manually entered apostrophes thinking perhaps it was the CMS, but the problem persists. I've seen the "diamond question mark" unrecognizable character appear before, but never this... For example, the word "they're" displays as "they.re"
Any ideas? I thought it could be an encoding issue but I have
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
in place.
Any help appreciated!
As a first workaround, you could tell the content providers to use the “smart” apostrophe and, for single quotation marks, ‘smart’ single quotes (assuming they work OK; check that first, of course). After all, the ASCII "straight" apostrophe should only be used in programming and comparable contexts, not in any normal human-language content.
It sounds like a CMS oddity, but check first that the data sent by the server actually contains “.” U+002E and not something else that just gets rendered as a period by browsers. Then you could submit a bug report to the CMS provider. It might be a good idea to test the entire ASCII range of characters, and why not all of Windows Latin 1 (using a page containing them all and checking that they are rendered OK, naturally with the normal < and & precautions).
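One way to check what the server really sends is to list the code point of every character in the suspicious word. A minimal sketch, with `describe_characters` as a made-up helper and the sample strings standing in for text copied out of the response:

```python
import unicodedata


def describe_characters(text: str):
    """List each character with its code point and Unicode name."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Stand-ins for the word as it appears in the raw response.
for candidate in ("they're", "they.re", "they\u2019re"):
    punctuation = [row for row in describe_characters(candidate) if not row[0].isalnum()]
    print(candidate, punctuation)
```

If the output shows U+002E, the period really is in the data; anything else points at a rendering or font problem instead.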
The euro € symbols in the footer of this page are not displaying correctly
http://fundcentre.newireland.ie/
What is the best way to correct this?
Edit: this html is supplied by a 3rd party. We take it, wrap it around our content, and render the page
Edit Again: just looking at the code, I can see that we read the 3rd party HTML into our solution with the following:
wrapperHtml = System.IO.File.ReadAllText(sWrapperLocation, Encoding.GetEncoding("iso-8859-1")); .. So we're reading it as one encoding and rendering it as the other..
This looks like UTF-8 data that was somehow interpreted in an ISO-8859-1 context (or some other single-byte encoding). Whatever you use to read the 3rd party source may be incorrectly interpreting the data as single-byte while it is in fact UTF-8.
This is about everything that can be said without knowing more about your setup.
Edit: Why fixing this by using entities is a bad idea, copied from my comment:
The problem is not limited to the Euro character, but applies to all characters outside the ISO-8859-1 range. That means that while you can happily replace the € by &euro; without any real damage, the instant a Chinese or Cyrillic character comes up in your data, you'll have no entity to convert it to. You would have to convert perfectly healthy UTF-8 content into numeric entities in real time just to avoid having to fix the encoding problem. That is just insane.
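The misreading is easy to reproduce outside .NET. The following sketch assumes the file really is UTF-8 and is decoded as ISO-8859-1, as in the question; the function names are mine.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate reading UTF-8 bytes with a single-byte ISO-8859-1 decoder."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo the misreading, provided nothing else touched the text in between."""
    return garbled.encode("iso-8859-1").decode("utf-8")


euro = "€"
garbled = misread_as_latin1(euro)

print(euro.encode("utf-8"))  # the euro sign is three bytes in UTF-8
print(len(garbled))          # ...so it turns into three separate characters
print(repair(garbled) == euro)

# The same thing happens to any character outside ASCII, not just the euro.
print(len(misread_as_latin1("日本語")))
```

The real fix is still to read the file with the right encoding; the round trip above only shows where the stray characters come from.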
&euro; is the entity you are looking for
Use HTML encoding; to get a €, type &euro;
You are using:
wrapperHtml = System.IO.File.ReadAllText(sWrapperLocation, Encoding.GetEncoding("iso-8859-1"));
Try changing it to:
wrapperHtml = System.IO.File.ReadAllText(sWrapperLocation, System.Text.Encoding.UTF8);
That should preserve the multi-byte characters correctly.
Edit:
Also, you could remove the second argument altogether; ReadAllText then tries to detect the encoding from a byte order mark and otherwise assumes UTF-8.
Update:
I know it's evil, but try this. If it works, the encoding issue is on your end somewhere; if it doesn't, the encoding issue is with the file or wherever you get the file from.
wrapperHtml = HttpUtility.HtmlEncode(System.IO.File.ReadAllText(sWrapperLocation));
The above line HTML-encodes the characters, multi-byte and single-byte, that need encoding. For the moment it takes encoding issues off the plate if they arise in your code after this line, in the server, in the transport, in the browser, in the doc type, or in many other places. If it works, you know the file is in a valid format and your encoding issues lie somewhere after the point where you read the file.
Use the HTML code: &euro; or its numeric form &#8364;
I've been asked to add a testimonial to this page...
http://www.orchardkitchens.com/Showroom/testimonials.html
As you will see there are funny characters showing up all over the place, and it has thrown the structure of the page out.
I've since reloaded the backup and the funny chars are still appearing. Any ideas what I need to do??
Please ask if you need more info from me about the problem in hand.
Many thanks,
ETFairfax.
Looks to me as though some of the text was encoded as UTF-8 yet loaded as if it were in an ANSI charset, and then an HTML encode was run over it, resulting in these extra characters. You will need to find the source text and rebuild the HTML, ensuring that whatever reads the source text understands that it is UTF-8 encoded.
Valid HTML might be a start; an HTML document shouldn't start with a meta tag directly. Also, it seems that the charset problem is not with your web page but rather in the backend code. Look at the source; there are numerous things such as
“
appearing which are HTML character entities for things that UTF-8 encoding yields when interpreted as Latin 1. So you should probably fix your code instead of the HTML (well, that too).
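How such entity runs come about can be reproduced in a few lines. This is only a sketch of the suspected chain (UTF-8 text read as a Windows single-byte charset, then HTML-encoded), using a curly opening quote as the example; `named_entities` and `undo` are illustrative helpers, and the actual backend may differ.

```python
import html
from html.entities import codepoint2name


def named_entities(text: str) -> str:
    """Replace non-ASCII characters with named HTML entities where one exists."""
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) > 127 and ord(ch) in codepoint2name else ch
        for ch in text
    )


def undo(damaged: str) -> str:
    """Reverse the chain: decode the entities, then redo the charset mix-up backwards."""
    return html.unescape(damaged).encode("cp1252").decode("utf-8")


original = "\u201c"  # left curly double quote
misread = original.encode("utf-8").decode("cp1252")  # three stray characters
damaged = named_entities(misread)

print(damaged)
print(undo(damaged) == original)
```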
Your HTML is syntactically invalid. The <!doctype> is missing, the <html> tag is missing, the <head> tag is missing, and the meta information cannot be parsed reasonably by the web browser.
Fix your HTML first and then retry.
As to the character encoding story, just ensure that you're using one and the same character encoding everywhere: in the datastore, in the source files, in the response headers, and so on. You may find the introductory text of this article useful to learn a bit more about character encodings. If you actually know/use Java, then you may find the proposed solutions useful as well.
I would like to print some kind of ASCII "art" on a web page in pre-tags. These graphics use DOS characters to show a map like old maze games did. I didn't find anything in the HTML special character reference. Is there a way to use these characters in HTML ?
Thanks in advance.
With the right Unicode characters, the old character encodings shouldn't make much odds. The tricky bit may be converting existing ASCII art into Unicode - at which point you need to know the original encoding.
The relevant code charts will be listed on the Unicode "symbols" charts page. In particular, I suspect you'll find the box drawing and block elements charts useful.
You'll need to make sure that your page uses a font which contains the right characters, of course...
As an example, you can render this:
┌┐
└┘
With:
<pre>┌┐
└┘</pre>
Not quite a proper box, but getting there...
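If the existing art is stored as DOS bytes, the conversion could look roughly like this. The sketch assumes the original encoding is code page 437; `to_pre` is a hypothetical helper that emits numeric character references so the result works whatever encoding the page is saved in.

```python
import html


def to_pre(art: str) -> str:
    """Wrap the art in <pre>, escaping markup characters and writing
    non-ASCII characters as numeric character references."""
    body = html.escape(art).encode("ascii", errors="xmlcharrefreplace").decode("ascii")
    return f"<pre>{body}</pre>"


# The small box from above as it would be stored in code page 437.
dos_bytes = bytes([0xDA, 0xBF, 0x0A, 0xC0, 0xD9])
art = dos_bytes.decode("cp437")

print(art)
print(to_pre(art))
```

If the page is served as UTF-8, the decoded characters can also be written directly into the `<pre>` without the references.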
You can put them in <pre> tags, although in XHTML you may need to wrap the content in <![CDATA[ ... ]]>, I think. Be careful though: not all encodings render this correctly. For example, a lot of ASCII art designed for DOS code page 437 (US) fails over here in the UK (850). Eastern Europe suffers especially.
I think the best approach here would be to render images.
EDIT: Oh. You could try , but I'm not sure if that would work.