My application stores configuration data (including strings for the UI) in a text file containing JSON. For example, config.json might contain the following:
{
"CustomerName" : "Omni Consumer Products",
"SubmitButtonText": "Click here to submit",
// etc etc etc..
}
This file goes to our translation vendor, who produces translated copies of it for each supported language. They might be using their own tooling, or they might be editing it in a text editor. I don't know.
Since we're going to be using all manner of non-ASCII characters in some of our languages, I'd like to ensure everybody is clear on what character encoding we're using.
So if this were an XML file, I would stick the following declaration at the top of the file:
<?xml version="1.0" encoding="UTF-8"?>
Any reasonable text editor or XML parser will see this and know that the file is encoded in UTF-8.
Is there any similar standard I can put at the top of a JSON file, and be reasonably assured that consumers will play nicely with it?
JSON's default encoding is UTF-8:
http://www.ietf.org/rfc/rfc4627.txt
From section 3:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
This determination is unambiguous so there is no special place where an encoding is described in the format itself.
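For illustration, here is a minimal Python sketch of that RFC 4627 heuristic (the function name and sample inputs are my own; note that the newer RFC 8259 simply requires UTF-8 for JSON exchanged between systems):

def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON byte stream from the
    pattern of NUL bytes in its first four octets (RFC 4627, section 3)."""
    if len(data) >= 4:
        b = data[:4]
        if b[0] == 0 and b[1] == 0 and b[2] == 0:
            return "utf-32-be"   # 00 00 00 xx
        if b[0] == 0 and b[2] == 0:
            return "utf-16-be"   # 00 xx 00 xx
        if b[1] == 0 and b[2] == 0 and b[3] == 0:
            return "utf-32-le"   # xx 00 00 00
        if b[1] == 0 and b[3] == 0:
            return "utf-16-le"   # xx 00 xx 00
    return "utf-8"               # xx xx xx xx (no NULs)

print(detect_json_encoding('{"a":1}'.encode("utf-8")))      # utf-8
print(detect_json_encoding('{"a":1}'.encode("utf-16-le")))  # utf-16-le
print(detect_json_encoding('{"a":1}'.encode("utf-32-be")))  # utf-32-be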
I have a Visual Basic program that downloads individual files from the internet. Those files can be PDFs, they can be actual webpages, or they can be plain text. Normally I don't run into any other type of file (except maybe images).
It might seem easy to know what type of file I'm downloading, just test the extension of the URL.
For instance, a URL such as "http://microsoft.com/HowToUseAzure.pdf" is likely to be a PDF. But some URLs don't look like that. I encountered one that looked like this:
http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6VMC-4286N5V-6-18&_cdi=6147&_orig=search&_coverDate=12%2F01%2F2000&_qd=1&_sk=999059994&wchp=dGLSzV-lSzBV&_acct=C000000152&_version=1&_userid=4429&md5=d4d53f46bdf6fb8c7431f4a2e04876e7&ie=f.pdf
I can do some intelligent parsing of this URL, and I end up with a first part:
http://www.sciencedirect.com/science
and the second part, which is the question mark and everything after it. In this case, the first part doesn't tell me what type of file I have, though the second part does have a clue. But the second part could be arbitrary. So my question is: what do I do in this situation? Can I download the file as 'binary' and then test the 'binary' bytes I'm getting to see whether I have 1) text, 2) PDF, or 3) HTML?
If so, what is the test? What is the difference between 'binary' and 'pdf' and 'text' anyway? Are there some byte values in a binary file that would simply not occur in an HTML file, or in a Unicode file, or in a PDF file?
Thanks.
How to detect if a file is in the PDF format?
Allow me to quote ISO 32000-1:
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.
And ISO 32000-2:
The PDF file begins with the 5 characters “%PDF-” and offsets shall be calculated from the PERCENT SIGN (25h).
What's the difference? When you encounter a file that starts with %PDF-1.0 to %PDF-1.7, you have an ISO 32000-1 file; starting with ISO 32000-2, a PDF file can also start with %PDF-2.0.
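A minimal check of that header in Python might look like this (the function name and sample bytes are my own):

def looks_like_pdf(data: bytes) -> bool:
    """True if the data begins with a PDF header such as %PDF-1.7 or %PDF-2.0."""
    return data.startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n..."))  # True
print(looks_like_pdf(b"<!doctype html><html>..."))          # False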
How to detect if a file is a binary file?
This is also explained in ISO 32000:
If a PDF file contains binary data, as most do, the header line shall be immediately followed by a comment line containing at least four binary characters–that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.
If you open a PDF in a text editor instead of in a PDF viewer, you'll often see that the second line looks like this:
%âãÏÓ
There is no such thing as a "plain text file"; a file always has an encoding. However, when people talk about plain text files, they often mean ASCII files. ASCII files are files in which every byte has a value lower than 128 (binary 10000000).
Back in the old days, transfer protocols often treated PDF documents as if they were ASCII files. Instead of sending 8-bit bytes, they only sent the first 7 bits of each byte (this is sometimes referred to as "byte shaving"). When this happens, the ASCII bytes of a PDF file are preserved, but all the binary content gets corrupted. When you open such a PDF in a PDF viewer, you see the pages of the PDF file, but every page is empty.
To avoid this problem, four non-ASCII characters are added right after the PDF header. Transfer protocols check the first series of bytes, see that some of these bytes have a value higher than 127 (01111111), and therefore treat the file as a binary file.
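As a hedged sketch, the presence of that binary marker can be checked in Python roughly like this (the helper name and the assumption of LF line endings are mine):

def has_binary_marker(data: bytes) -> bool:
    """True if the comment line right after the %PDF- header contains
    at least four bytes with values of 128 or greater."""
    lines = data.split(b"\n", 2)   # header line, comment line, rest (assumes LF line endings)
    if len(lines) < 2:
        return False
    return sum(1 for byte in lines[1] if byte >= 128) >= 4

print(has_binary_marker(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj ..."))  # True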
How to detect if a file is in the HTML format?
That's trickier, as HTML allows people to be sloppy. You'd expect the first non-whitespace character of an HTML file to be a < character, but a file that starts with < can also be a plain XML file that is not in the HTML format.
You'd expect <!doctype html>, <html> or <body> somewhere in the file (with or without attributes inside the tag), but some people create HTML files without declaring the doctype, and even without an <html> or a <body> tag.
Note that HTML files can come in many different encodings. For instance: when they are encoded using UTF-8, they will contain bytes with a value higher than 127.
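A deliberately rough Python sketch of such a heuristic (the function name, the 1024-byte window and the markers chosen are my own, and it will miss e.g. UTF-16-encoded HTML):

def looks_like_html(data: bytes) -> bool:
    """Very rough heuristic: look for typical HTML markers near the start of the file."""
    head = data[:1024].lstrip().lower()
    return (head.startswith(b"<!doctype html")
            or b"<html" in head
            or b"<body" in head)

print(looks_like_html(b"  <!DOCTYPE html><html><body>hi</body></html>"))  # True
print(looks_like_html(b"%PDF-1.7\n..."))                                  # False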
How to detect if a file is an ASCII text file?
Just loop over all the bytes. If you find a byte with a value higher than 127, you have a file that is not in ASCII format.
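In Python, for example (my own sketch; newer Python versions also offer bytes.isascii() for the same test):

def is_ascii(data: bytes) -> bool:
    """True if every byte has a value below 128."""
    return all(byte < 128 for byte in data)

print(is_ascii(b"Omni Consumer Products"))                  # True
print(is_ascii("Omni Consumer Prôducts".encode("utf-8")))   # False -- the 'ô' needs bytes >= 128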
What about files in Unicode?
In that case, the file will usually start with a Byte Order Mark (BOM) that tells you which Unicode encoding was used. Note that a BOM is optional (especially for UTF-8 files), so its absence doesn't prove the file isn't Unicode.
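A minimal Python sketch of that BOM check (the function name is mine; the BOM constants come from the standard codecs module):

import codecs

def detect_bom(data: bytes):
    """Return the encoding indicated by a leading Byte Order Mark, or None."""
    # The UTF-32 BOMs must be tested before UTF-16, because the UTF-32-LE BOM
    # starts with the same two bytes (FF FE) as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8,     "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")))  # utf-16-le
print(detect_bom(b"no BOM here"))                                     # None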
Are there other encodings?
Of course there are! See for instance ISO/IEC 8859. In many cases you simply can't tell which encoding a text file uses, because the encoding isn't stored as a property of the file.
I am working on a Talend project where we are transforming data from thousands of XML files to CSV, and we are creating the CSV files with UTF-8 encoding from Talend itself.
But the issue is that some of the files are created as UTF-8 and some of them as ASCII. I am not sure why this is happening; the files should always be created as UTF-8.
As mentioned in the comments, UTF-8 is a superset of ASCII. This means that any ASCII character is encoded as exactly the same single byte in UTF-8 as in ASCII.
Any program inspecting a file that contains only ASCII characters will therefore simply report it as ASCII. Only when the file includes characters outside the ASCII range can it be recognised as UTF-8 by whatever heuristic the reading program uses.
The only exception to this is for file types that specifically state their encoding. This includes things like (X)HTML and XML which typically start with an encoding declaration.
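A quick Python illustration of why that happens (the sample strings are mine): encoding a pure-ASCII string with the ASCII codec and with the UTF-8 codec yields byte-for-byte identical output, so nothing in the file itself can tell the two apart.

print("plain text".encode("ascii") == "plain text".encode("utf-8"))  # True
print("héllo".encode("utf-8"))  # b'h\xc3\xa9llo' -- only now do bytes >= 128 appear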
You can go to the Advanced tab of the tFileOutputDelimited (or other kind of tFileOutxxx) you are using and select UTF-8 encoding.
I am quite sure the Unix file utility makes its guess based on the content of the file: whether byte values fall in certain ranges and/or whether the file starts with specific magic numbers. In your case, if you generate a perfectly valid UTF-8 file but only use the ASCII subset, the file utility will probably flag it as ASCII. In that event you are fine, as you still have a valid UTF-8 file. :)
To force Talend to produce the file as you wish, you can add an additional column to your output (for example in a tMap) and put a non-ASCII UTF-8 character in that column. The generated file will then be recognised as UTF-8, as the other answers mentioned.
Is there any way to stop PowerShell from converting HTML/XML character entities when it reads and writes a file?
If I just open and then save an xml file:
[xml]$xml = get-content "c:\file.xml"
$xml.save("c:\file.xml")
The following key gets changed
from:
<add key="TimeFormat" value="h:mm tt"/>
to:
<add key="TimeFormat" value="h:mm tt" />
The point is, I don't want it to change any text, I want the HTML entities to remain as they were originally.
Thanks
&#160; (0xA0) is the non-breaking space in ISO-8859-1, but a bare 0xA0 byte is not valid in UTF-8. One of two things could be happening:
Powershell is expecting (or your XML file claims to be) UTF-8, and Powershell is silently replacing the invalid character with a space. If this is the case, it really should be complaining with an error message.
(more likely) The XML file has the correct header specifying ISO-8859-1, and Powershell is replacing the entity with the actual '0xA0' byte. As far as the XML standard is concerned, this is a perfectly valid thing to do, and you'll have to look at Powershell's XML documentation to see if you can provide options to the XML serializer to prevent this conversion.
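Although the question is about PowerShell, any conforming XML parser behaves the same way; here is a small Python illustration (using xml.dom.minidom purely as a stand-in) showing that &#160; and a literal non-breaking space are the same attribute value to the parser, so the entity is not preserved when the document is written back out:

from xml.dom import minidom

doc = minidom.parseString('<add key="TimeFormat" value="h:mm&#160;tt"/>')
value = doc.documentElement.getAttribute("value")
print(repr(value))   # 'h:mm\xa0tt' -- the entity is already expanded in memory
print(doc.toxml())   # the attribute is written back with a literal U+00A0, not &#160;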
I ran my web page through the W3C HTML validator and received this error.
The encoding ascii is not the preferred name of the character
encoding in use. The preferred name is us-ascii. (Charmod C024)
Line 5, Column 70: Internal encoding declaration utf-8 disagrees with
the actual encoding of the document (us-ascii).
<meta http-equiv="content-type" content="text/html;charset=utf-8">
Apparently, I am not "actually" using UTF-8 even though I specified UTF-8 in my meta tag.
How do I, well, "actually" use UTF-8? What does that even mean?
The HTML5 mode of the validator treats a mismatch between encoding declarations as an error. In the message, “internal encoding declaration” refers to a meta tag such as <meta charset=utf-8>, and “actual encoding” (misleadingly) refers to the encoding declaration in the HTTP headers.
According to current HTML specifications (HTML5 is just a draft), the mismatch is not an error, and the HTTP headers win.
There is no real problem if your document only contains Ascii characters. Ascii-encoded data is trivially UTF-8 encoded too, because in UTF-8, any Ascii character is represented as a single byte, with the same value as in Ascii.
It depends on the software used server-side whether and how you can change the HTTP headers. If they now specify charset=ascii, as it seems, it is not a real problem except in validation, provided that you keep using Ascii characters only. But it is somewhat odd and outdated. Try to have the encoding information there changed to charset=utf-8. You need not change the actual encoding, but if you later add non-Ascii characters, make sure you save the file as UTF-8 encoded by selecting a suitable command or option in the authoring program.
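To see what the validator calls the "actual encoding", you can inspect the Content-Type header your server sends; for instance, a small Python sketch (the URL is a placeholder for your own page):

import urllib.request

with urllib.request.urlopen("http://example.com/page.html") as response:  # placeholder URL
    print(response.headers.get("Content-Type"))    # e.g. 'text/html; charset=us-ascii'
    print(response.headers.get_content_charset())  # e.g. 'us-ascii'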
Open your file in Notepad, then choose Save As and select UTF-8 in the Encoding drop-down (next to the Save button).
On Unix-like systems you can use the iconv tool to convert a file from one encoding to another.
It can also be used from within a programming language (e.g. PHP), where the function has the same name:
http://www.php.net/manual/en/function.iconv.php
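The same conversion that iconv performs (for example, iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt) can also be sketched in a few lines of Python; the file names here are placeholders:

# Decode the file as ISO-8859-1 and write it back out as UTF-8 (placeholder file names).
with open("in.txt", "r", encoding="iso-8859-1") as src:
    text = src.read()
with open("out.txt", "w", encoding="utf-8") as dst:
    dst.write(text)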
Specifying encoding is one thing. Saving documents in a proper encoding is another.
Edit your documents in editors supporting UTF-8 encoding. Preferably UTF-8 without BOM. Notepad++ may be a good start.
Have a read too: UTF-8 all the way through.
Suppose I have an input field in a web page whose charset is UTF-8, and suppose I open a text file encoded as ISO-8859-1.
Now I copy and paste a string with special characters (like, for example, ô) from the file into the input field: I see that the special character is displayed correctly in the input field.
Who does the conversion from ISO-8859-1 to UTF-8? The browser?
When you open the file and copy/paste it to the browser, it ends up in Unicode, as that is what the browser's UI controls use internally. Who actually performs the conversion from ISO-8859-1 to Unicode depends on a few factors (what OS you are using, whether your chosen text editor is compiled to use Ansi or Unicode, what clipboard format(s) - CF_TEXT for Ansi, CF_UNICODETEXT for Unicode - the app uses for the copy, etc). But either way, when the web browser submits the form, it then encodes its Unicode data to the charset of the HTML/form during transmission.
In all likelihood, it's not really converted to UTF-8, but instead to the internal representation of characters used by the browser, which is quite likely to be UTF-16 (no matter what the encoding of the web page is).