I have been tasked with cleaning up a very messy site, http://www.investravel.com/, built in Joomla. As a first step I copied the entire output source to a static HTML file, http://www.investravel.com/test.html, but the unknown-character symbol appears repeatedly throughout the static copy.
Does anybody have any idea why that might be? I find it quite curious, given that both pages should present the same source to the browser.
It might be worth noting that there are two
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
tags in the original, each spelt slightly differently. I have removed both and added the correct W3C version, but still to no avail.
Any help much appreciated.
I just tried saving it with firefox and it saved everything in UTF8.
The way I did it was:
Go to the "View" menu, select "Character Encoding", and make sure "Unicode (UTF-8)" is selected. After forcing the encoding, check that all characters display correctly (I tried with that encoding and at first glance everything looks right).
Then save the page as html and open it, all should be ok!
The reason your characters are wrong is probably that you had some other encoding forced; in your case I detected the Western (ISO-8859-1) encoding.
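To illustrate the mismatch, here is a minimal Python sketch. It assumes the page was saved as ISO-8859-1 while the browser was told to read UTF-8; the sample string is made up and not taken from the site.

```python
# Sketch: text saved as ISO-8859-1 but read back as UTF-8.
# The sample string is hypothetical, not copied from the site.
original = "café ©"

# What ends up on disk when the editor or browser saves as ISO-8859-1.
saved_bytes = original.encode("iso-8859-1")

# What a browser shows when the page claims charset=utf-8:
# bytes that are not valid UTF-8 become U+FFFD, the "unknown character" symbol.
shown = saved_bytes.decode("utf-8", errors="replace")

# Reading the same bytes with the encoding they were saved in is lossless.
recovered = saved_bytes.decode("iso-8859-1")

print(repr(shown))
print(recovered)
```

Saving the file as UTF-8, so that the bytes match the declared charset, avoids the replacement characters.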
Those characters are stored encoded in the database and show up as the actual symbol once they reach the browser. You will notice the same thing happens with things like the copyright symbol (in the database it is &copy;, but in the source it shows up as the actual symbol). You are not going to be able to make accurate copies of the pages as static HTML if they used a lot of smart quotes and other symbols.
Why would you want to take a dynamic site and make it static in the first place? That seems horribly inefficient.
This character:

shows up on my site 3 times, and in all 3 cases it appears after a closing div tag. I searched the web and SOF and found some solutions, but none of them worked for me, so I decided to post here. I am using .NET. I realize that this is not sufficient info, but I am new to programming, so I am not sure what other info you might need. Please let me know. Thanks!
Looks like a byte order mark. Please check your source and output encoding.
Yes it is the Byte Order Mark (BOM). It was driving me crazy too. I researched and started reading about BOM and tried adding charset="UTF-8" to some script tags but no go.
I use Dreamweaver and found that when I saved (save as) some recent html files, the option for "Include Unicode Signature (BOM)" was checked. I unchecked it and saved, and that resolved the unwanted characters (I guess it saves the file without the BOM)!!
Updating the meta tags charset to UTF-8 will resolve this too and is recommended (which means dozens of pages for me) but I needed this quick fix.
Also, saving with Notepad++ looks to do the trick as well. Here's a related article on Notepad++ and its BOM settings: notepad++ converting ansi encoded file to utf-8
I hope this helps someone!
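If you would rather check for the BOM outside an editor, here is a small Python sketch. The helper name `strip_utf8_bom` and the sample markup are my own, for illustration only.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file content: the three BOM bytes followed by ordinary markup.
sample = codecs.BOM_UTF8 + "<div>hello</div>".encode("utf-8")

cleaned = strip_utf8_bom(sample)

# Decoding with "utf-8-sig" also drops the BOM; plain "utf-8" keeps it as U+FEFF.
decoded = sample.decode("utf-8-sig")
with_bom = sample.decode("utf-8")

print(cleaned)
print(repr(with_bom[0]))
```

When a file with a BOM is included in the middle of a page, the leftover U+FEFF lands in the output, which would match the stray character appearing right after a closing div.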
I use Dreamweaver and found that when I saved (save as) some recent html files, the option for "Include Unicode Signature (BOM)" was checked. I unchecked it and saved, and that resolved the unwanted characters (I guess it saves the file without the BOM)!!
This is the perfect solution. It worked for me.
thx everyone
I've been searching and testing for a solution the last 3 hours now.
I wanna be able to like the following link. Please note that this is the only category implementing the Like button right now and I have hard coded a quick fix.
I have implemented the Like button and it works so far (it's hidden for now, however). The problem arises when I try to add the OG meta data specified by Facebook. I have used the Facebook debugger to find out what is wrong.
As you might notice, the query string includes slashes, which Facebook encodes. Obviously this was the first thing I tried to adjust, and believe me, I have tried everything here: replacing / with %2F, encoding other special chars like &, etc. My conclusion was that Facebook arrives at the address with the slashes and encodes the content of the og:url property, and the two therefore somehow end up mismatched. I found more people having problems with slashes in the URL, but none of their solutions have worked for me. I saw a note that a missing content-length header could be a problem for the spider, but adding it made no difference.
Changing the doctype, temporarily removing other meta tags, changing their order, etc. has not had any effect.
The only thing that makes a difference is if I input the encoded version of the link in the debugger (http://www.d-gear.se/?page=%2Fshop%2Fbcat&c=144). The error is then gone (warnings remain), but as you see it still can't find the og tags in the document.
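For reference, the encoded form of the link can be reproduced with Python's standard library. This sketch only shows how the slashes in the query string map to %2F; it says nothing about what Facebook's crawler does internally.

```python
from urllib.parse import quote, unquote, urlencode

# The query value from the category link.
path = "/shop/bcat"

# safe="" makes quote() encode "/" as well (it is left alone by default).
encoded_path = quote(path, safe="")

# urlencode() percent-encodes each value, including slashes, and joins pairs with "&".
query = urlencode({"page": path, "c": 144})
full_url = "http://www.d-gear.se/?" + query

print(encoded_path)
print(full_url)
print(unquote(encoded_path))
```

Both spellings decode to the same value on the server, so the encoded and unencoded links should reach the same page.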
As a final way to get any clue I tried the following while following the original category link.
<meta property="og:url" content="http://www.d-gear.se/" />
It made absolutely no difference at all. In the debugger the same error arises and the information under redirect path is:
original http://www.d-gear.se/?page=%2Fshop%2Fbcat&c=144
rel="canonical" http://www.d-gear.se/?page=%2Fshop%2Fbcat&c=144
I checked the source code of the page and it had been updated to http://www.d-gear.se/ there. (Now I have changed back to the intended canonical URL again)
There's probably a really easy solution to this, but I'm stuck here and don't wanna waste the rest of the evening in case someone here is able to just point out the error to me.
After a few more hours of testing I noticed that charset was set to LATIN-1. Changing it to ISO-8859-1 made the difference. (Somewhere deep inside my brain I think I read that those two are the same)
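On the "are those two the same" question: in Python at least, both names resolve to the same codec, as the sketch below shows. As far as I know, browsers and crawlers match charset labels against a fixed list, and the hyphenated spelling LATIN-1 is not guaranteed to be on it, which may be why only ISO-8859-1 worked here. Treat that last part as an assumption rather than a confirmed diagnosis.

```python
import codecs

# Python normalizes codec names, so both spellings map to one codec.
latin = codecs.lookup("LATIN-1")
iso = codecs.lookup("ISO-8859-1")

same_codec = latin.name == iso.name

# Both names produce identical bytes for the same text.
sample = "smörgåsbord"
same_bytes = sample.encode("LATIN-1") == sample.encode("ISO-8859-1")

print(latin.name, iso.name, same_codec, same_bytes)
```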
I have been trying to validate my web page for the last two hours. I only have one error remaining before it validates successfully, but I keep getting a character decoding problem and cannot get round it.
The whole document is fine except it says...
Sorry, I am unable to validate this document because on line 77 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\x85" does not map to Unicode
The only thing on line 77 is some text inside some <p> tags, I have tried changing them to <a>, or <span> and taking the <p> away so it is just loose inside the div but the error only goes away when I delete the text inside the tags.
I am using the utf-8 encoding:
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
I am sorry if this is simple to resolve, my knowledge is extremely basic, I am only a first year computing student.
EDIT: the text inside the <p> tags is as follows:
<p>Our team thrives on the latest
political news as we do you. We work
around the clock to bring you the
latest, most important news as soon as
it happens. What do we ask in return…
nothing! This site is funded by us!
Your satisfaction is as much a pay
packet to us then a wad of untraceable
counterfeit notes.<br/><br/>
Sign up
to our newsletter to get regular
updates on news as soon as it happens
without having to navigate to our
site. For your security we only sell
the details you input to our site to
companies who “pinky promise” they
won’t be naughty with
them.<br/><br/>
StudentPolitics.Now
– Trading in satisfying others since
2011</p>
The problem is that the document only claims to be UTF-8 but isn't really.
Configure your editor to save in that format (the W3C has a guide for a number of them).
If you modify the HTML programmatically, then check that the program (and/or the database, if one is in play) isn't munging the data or storing non-UTF-8 data.
If that doesn't work, then try deleting the text and retyping it. You might have a zero width character that can't be represented properly in there.
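To find the offending byte yourself, something like the following Python sketch can help. It assumes the file was saved as Windows-1252 (a common editor default, but a guess on my part), and the helper `find_invalid_utf8` is a made-up name.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Part of line 77 as it would look on disk if saved as Windows-1252 (assumed).
line = "What do we ask in return… nothing!".encode("cp1252")

problem = find_invalid_utf8(line)

# In Windows-1252 the byte 0x85 is the single-character ellipsis.
culprit = bytes([0x85]).decode("cp1252")

print(problem)
print(culprit)
```

The same text saved as UTF-8 passes the check, because the ellipsis is then stored as a valid three-byte sequence.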
Save your document in a UTF format. If it already is, try copy-pasting the source code into a new file and saving that in UTF format (sometimes the encoding can get stuck during edits in some programs).
What editor are you using?
EDIT: There are some non-standard characters in your text: … (three dots in a single character), “ ” (curly quotes), ’ (curly apostrophe), – (dash).
I guess you've copied your text from Word or a similar text processor, I get that often too. Either change those characters to their ASCII counterparts or HTML entities or be sure to save the file with UTF encoding.
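Both options can be sketched in Python. The mapping below covers only the characters spotted in your text, and the function names are my own.

```python
# Typographic characters found in the text, mapped to plain ASCII.
REPLACEMENTS = {
    "\u2026": "...",  # horizontal ellipsis
    "\u201c": '"',    # left curly double quote
    "\u201d": '"',    # right curly double quote
    "\u2018": "'",    # left curly single quote
    "\u2019": "'",    # right curly single quote / apostrophe
    "\u2013": "-",    # en dash
}
TABLE = str.maketrans(REPLACEMENTS)


def to_ascii_punctuation(text: str) -> str:
    """Swap the listed typographic characters for ASCII counterparts."""
    return text.translate(TABLE)


def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "return\u2026 \u201cpinky promise\u201d won\u2019t \u2013 2011"
print(to_ascii_punctuation(sample))
print(to_html_entities(sample))
```

Either output is pure ASCII, so it validates regardless of which encoding the editor saves in.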
The validator did not like the three full stops after the word "return"; three full stops typed one after another must have been turned into something else...
Thank you for all your help guys.
Sometimes when you run a query against a database, the characters come back in an encoding other than UTF-8; in that case, make sure the values returned by the queries are converted to UTF-8. Also, when taking a substring you can sometimes cut a character in half, for example Spanish letters with accents or the ñ, so that the character is shown incomplete.
For example check the source code in your browser
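The substring problem can be sketched in Python as follows. The helper `truncate_utf8` is a hypothetical name, and dropping the partial character is just one way to handle it.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without leaving half a character."""
    raw = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete sequence left at the end by the slice.
    return raw.decode("utf-8", errors="ignore")


word = "España"  # the ñ takes two bytes in UTF-8

# A naive byte slice stops in the middle of the ñ and no longer decodes cleanly.
naive = word.encode("utf-8")[:5]
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(naive_ok)
print(truncate_utf8(word, 5))
print(truncate_utf8(word, 6))
```

Slicing by characters instead of bytes avoids the issue entirely when the limit is not a byte limit.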
I have some pages I'm copying Chinese text into from a Word doc. I have 2 kind of HTML documents. Some are parent pages, with a full head tag, and meta tag, where I have spec'd:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I also have some snippets that just start with <div> because they are include files.
When I paste Chinese into the docs with the meta tag, it pastes fine and I can see the Chinese.
When I paste Chinese into the HTML snippets that start with a <div>, I just get boxes.
How do I paste Chinese into a snippet so that I can see the characters?
In Dreamweaver, I can flip over to visual mode and paste them in, but when I flip back to code mode, It shows me that all the characters have been converted to URL encoded equivalents. On the UTF-8 page, I can paste the Chinese and read it in code mode as Chinese characters.
HOWEVER
The weird thing is I have several include files. One I opened up has no meta tag on it, and it already had Chinese in it from development I did a few weeks back. I can still paste new Chinese into it and it's fine.
So basically, I have regular old HTML files, with no meta tag about UTF-8, and some allow me to paste Chinese into them in code mode and it works fine, and others don't allow it. The structure of these various HTML snippets is nearly identical.
Could this be a Dreamweaver bug or is there some trick / setting?
You cannot do so reliably without some sort of metadata; in HTML, that is what the <meta> tag is for. (Otherwise the browser will have to guess, and uses the default that the user has chosen. Which, in most cases, is anything but what one expects.)
Dreamweaver also has to determine the encoding of the files it opens, and in the absence of metadata it will presumably use the same techniques as a web browser (such as detecting byte sequences / character distributions that are only likely to appear in a specific encoding) to come up with a best guess.
This is the likely reason why one snippet opened with the correct encoding and another did not. Once open, Dreamweaver "knows" that a file is encoded with UTF-8, so it can continue to use UTF-8 for that file (and may even add a BOM when saving, to ensure it opens correctly in future) which is probably why saving over the bad one with the good one fixed the issue.
A good editor will let you specify the encoding to use when opening (and saving) a file. You can do this in Dreamweaver via Modify > Page Properties (see: UTF-8 Without BOM?).
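As a rough illustration of that kind of guessing, here is a Python sketch. It is not Dreamweaver's actual algorithm; the function name and the Windows-1252 fallback are assumptions of mine.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Best-guess encoding for a file that carries no <meta> charset."""
    # A BOM is the strongest hint an editor can find.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Text in a legacy encoding rarely happens to be valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


snippet = "<div>中文</div>".encode("utf-8")
print(guess_encoding(snippet))
print(guess_encoding(codecs.BOM_UTF8 + snippet))
print(guess_encoding("<div>café</div>".encode("cp1252")))
```

Note that a snippet containing only ASCII is valid in both encodings, so a guess made before any Chinese was pasted in tells the editor nothing; that could explain why near-identical snippets behave differently.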
I've been asked to add a testimonial to this page...
http://www.orchardkitchens.com/Showroom/testimonials.html
As you will see there are funny characters showing up all over the place, and it has thrown the structure of the page out.
I've since reloaded the backup and the funny chars are still appearing. Any ideas what I need to do??
Please ask if you need more info from me about the problem in hand.
Many thanks,
ETFairfax.
Looks to me as though some of the text was encoded as UTF-8 yet loaded as if it were in an ANSI charset, and then an HTML encode was run over it, resulting in these extra characters. You will need to find the source text and rebuild the HTML, ensuring that whatever reads the source text understands that it is in UTF-8 encoding.
Valid HTML might be a start; an HTML document shouldn't start with a meta tag directly. Also, it seems that the charset problem is not with your web page but rather in the backend code. Look at the source; there are numerous things such as
“
appearing which are HTML character entities for things that UTF-8 encoding yields when interpreted as Latin 1. So you should probably fix your code instead of the HTML (well, that too).
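If the original text cannot be found, the damage can sometimes be reversed. The Python sketch below assumes the chain was UTF-8 bytes read as Windows-1252 and then HTML-encoded; the entity string is an illustration of what that chain produces for a left curly quote, not something copied from the page, and `repair` is a made-up helper.

```python
import html


def repair(fragment: str) -> str:
    """Undo 'UTF-8 read as Windows-1252, then HTML-encoded' for one fragment."""
    # Step 1: turn the entities back into the characters they stand for.
    mojibake = html.unescape(fragment)
    # Step 2: recover the original bytes, then decode them as the UTF-8 they were.
    return mojibake.encode("cp1252").decode("utf-8")


# How the damage happens: a left curly quote's UTF-8 bytes read as Windows-1252.
broken = "\u201c".encode("utf-8").decode("cp1252")

# Illustrative entity-encoded form of the same three characters.
damaged = "&acirc;&euro;&oelig;Great kitchen"

print(repr(broken))
print(repair(damaged))
```

This only works when every damaged byte has a Windows-1252 character; otherwise the encode step raises an error, so fixing the backend remains the real solution.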
Your HTML is syntactically invalid. The <!doctype> is missing, the <html> tag is missing, and the <head> tag is missing, so the meta information cannot be parsed reasonably by the web browser.
Fix your HTML first and then retry.
As to the character encoding story, just ensure that you're using one and the same character encoding everywhere: in the datastore, in the source files, in the response headers, etcetera. You may find the introductory text of this article useful to learn a bit more about character encodings. If you actually know/use Java, then you may find the proposed solutions useful as well.