What are the Legal / Allowed characters for web server file names on? - html

What characters are allowed in filenames for HTML files on ALL servers (*nix, Windows, etc.) ?
I'm looking for the "lowest common denominator" that will work on all servers.
USE: I'm naming a file to be served up publicly (Mysite.com/My-Page.htm)
E.g., space? _ - , etc.
E.g., can I have File-Name.htm, File_Name.htm File Name.htm?
Obviously, this needs to work with all servers and browsers. (IIRC, the name is limited by the server not the browser, but I could be wrong).

What characters are allowed in filenames for HTML files on servers?
That totally depends on the server. HTTP itself allows any character at all, including control characters and non-ASCII characters, as long as they are suitably %-encoded when requested in a URL.
On a Unix server you cannot use ‘/’ or the zero byte. (If you could use them, they'd appear in the URL as ‘%2F’ and ‘%00’ respectively.) You also can't have the specific filenames ‘.’ or ‘..’, or the empty string.
On a Windows server you have all the limitations of a Unix server, plus you also can't use any of \/:*?"<>| or control characters 1-31 and you can't have leading or trailing dot or spaces, and you'll have difficulty using any of the legacy device filenames (CON, PRN, COM1 and many more).
This is nothing to do with HTTP; just how filenames work on Windows, which is complicated.
can I have File-Name.htm, File_Name.htm File Name.htm?
Certainly. But in the last case you should link to it by URL-encoding the space:
thingy
Browsers will usually let you get away with leaving the space in, but it's not really valid. If you want to avoid having to think about URL-escaping, HTML-escaping and case-sensitive issues, stick to a–z, 0–9 and underscore.

If you don't want your filenames to be encoded by the server, you should avoid reserved characters: $&+,/:;=?# and unsafe characters: space, quotation marks, <>#%{}|\^~[]`
But as the previous answers stated, the web servers should cope with whatever you want to use by encoding the chars.

Be sure to eliminate
* . " / \ [ ] : ; | = ,
which are never allowed, due to inconsistencies in file naming conventions standard practice is to use a-z and 0-9 and the underscore character. Space is needful for most users but if you can get away from using it there are parsing issues that improve reliability, you can read rfc's on mime ( multi-part internet mail extensions ) to get a taste of what is involved.
No matter what you do, something somewhere is likely to make life difficult - so much so that I now use cryptographic methods to generate random a-z lowercase strings and use those as filenames, embedding the useful info in the file source code.
Avoid the ampersand at any cost, ...

I would say a good rule of thumb for filenames for HTML files on ALL servers can be any combination of alphabet (lowercase preferred) and number characters (1 though 9), plus the underline(_), minus(-) or plus(+) characters but no spaces. Also, end the filename with dot html (e.g. filename.html). I personally avoid using underline and plus characters.

There isn't such a thing as an html filename.
Certain characters have to be encoded in html (eg if used in links) but the allowed characters in the document names will depend on the web server (and possibly the file system on the server).

Any file name will be URL-encoded so you should be fine. And for the record all three of your file names would work just fine.

Related

Hi, I need write question mark into filename on windows How do I do it ? Plz THX [duplicate]

I know that / is illegal in Linux, and the following are illegal in Windows
(I think) * . " / \ [ ] : ; | ,
What else am I missing?
I need a comprehensive guide, however, and one that takes into account
double-byte characters. Linking to outside resources is fine with me.
I need to first create a directory on the filesystem using a name that may
contain forbidden characters, so I plan to replace those characters with
underscores. I then need to write this directory and its contents to a zip file
(using Java), so any additional advice concerning the names of zip directories
would be appreciated.
The forbidden printable ASCII characters are:
Linux/Unix:
/ (forward slash)
Windows:
< (less than)
> (greater than)
: (colon - sometimes works, but is actually NTFS Alternate Data Streams)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Non-printable characters
If your data comes from a source that would permit non-printable characters then there is more to check for.
Linux/Unix:
0 (NULL byte)
Windows:
0-31 (ASCII control characters)
Note: While it is legal under Linux/Unix file systems to create files with control characters in the filename, it might be a nightmare for the users to deal with such files.
Reserved file names
The following filenames are reserved:
Windows:
CON, PRN, AUX, NUL
COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
(both on their own and with arbitrary file extensions, e.g. LPT1.txt).
Other rules
Windows:
Filenames cannot end in a space or dot.
macOS:
You didn't ask for it, but just in case: Colon : and forward slash / depending on context are not permitted (e.g. Finder supports slashes, terminal supports colons). (More details)
A “comprehensive guide” of forbidden filename characters is not going to work on Windows because it reserves filenames as well as characters. Yes, characters like
* " ? and others are forbidden, but there are a infinite number of names composed only of valid characters that are forbidden. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.
Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly-allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename valid in one folder may become invalid if moved to another folder. The rules for
naming files and folders
are on the Microsoft docs.
You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name anything they want, you have to create safe names like A, AB, A2 et al., store user-generated names and their path equivalents in an application data file, and perform path mapping in your application.
If you absolutely must allow user-generated folder names, the only way to tell if they are invalid is to catch exceptions and assume the name is invalid. Even that is fraught with peril, as the exceptions thrown for denied access, offline drives, and out of drive space overlap with those that can be thrown for invalid names. You are opening up one huge can of hurt.
Under Linux and other Unix-related systems, there were traditionally only two characters that could not appear in the name of a file or directory, and those are NUL '\0' and slash '/'. The slash, of course, can appear in a pathname, separating directory components.
Rumour1 has it that Steven Bourne (of 'shell' fame) had a directory containing 254 files, one for every single letter (character code) that can appear in a file name (excluding /, '\0'; the name . was the current directory, of course). It was used to test the Bourne shell and routinely wrought havoc on unwary programs such as backup programs.
Other people have covered the rules for Windows filenames, with links to Microsoft and Wikipedia on the topic.
Note that MacOS X has a case-insensitive file system. Current versions of it appear to allow colon : in file names, though historically that was not necessarily always the case:
$ echo a:b > a:b
$ ls -l a:b
-rw-r--r-- 1 jonathanleffler staff 4 Nov 12 07:38 a:b
$
However, at least with macOS Big Sur 11.7, the file system does not allow file names that are not valid UTF-8 strings. That means the file name cannot consist of the bytes that are always invalid in UTF-8 (0xC0, 0xC1, 0xF5-0xFF), and you can't use the continuation bytes 0x80..0xBF as the only byte in a file name. The error given is 92 Illegal byte sequence.
POSIX defines a Portable Filename Character Set consisting of:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
Sticking with names formed solely from those characters avoids most of the problems, though Windows still adds some complications.
1 It was Kernighan & Pike in ['The Practice of Programming'](http://www.cs.princeton.edu/~bwk/tpop.webpage/) who said as much in Chapter 6, Testing, §6.5 Stress Tests:
When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.
Note that the directory must have contained entries . and .., so it was arguably 253 files (and 2 directories), or 255 name entries, rather than 254 files. This doesn't affect the effectiveness of the anecdote, or the careful testing it describes.
TPOP was previously at
http://plan9.bell-labs.com/cm/cs/tpop and
http://cm.bell-labs.com/cm/cs/tpop but both are now (2021-11-12) broken.
See also Wikipedia on TPOP.
Instead of creating a blacklist of characters, you could use a whitelist. All things considered, the range of characters that make sense in a file or directory name context is quite short, and unless you have some very specific naming requirements your users will not hold it against your application if they cannot use the whole ASCII table.
It does not solve the problem of reserved names in the target file system, but with a whitelist it is easier to mitigate the risks at the source.
In that spirit, this is a range of characters that can be considered safe:
Letters (a-z A-Z) - Unicode characters as well, if needed
Digits (0-9)
Underscore (_)
Hyphen (-)
Space
Dot (.)
And any additional safe characters you wish to allow. Beyond this, you just have to enforce some additional rules regarding spaces and dots. This is usually sufficient:
Name must contain at least one letter or number (to avoid only dots/spaces)
Name must start with a letter or number (to avoid leading dots/spaces)
Name may not end with a dot or space (simply trim those if present, like Explorer does)
This already allows quite complex and nonsensical names. For example, these names would be possible with these rules, and be valid file names in Windows/Linux:
A...........ext
B -.- .ext
In essence, even with so few whitelisted characters you should still decide what actually makes sense, and validate/adjust the name accordingly. In one of my applications, I used the same rules as above but stripped any duplicate dots and spaces.
The easy way to get Windows to tell you the answer is to attempt to rename a file via Explorer and type in a backslash, /, for the new name. Windows will popup a message box telling you the list of illegal characters.
A filename cannot contain any of the following characters:
\ / : * ? " < > |
Microsoft Docs - Naming Files, Paths, and Namespaces - Naming Conventions
Well, if only for research purposes, then your best bet is to look at this Wikipedia entry on Filenames.
If you want to write a portable function to validate user input and create filenames based on that, the short answer is don't. Take a look at a portable module like Perl's File::Spec to have a glimpse to all the hops needed to accomplish such a "simple" task.
Discussing different possible approaches
Difficulties with defining, what's legal and not were already adressed and whitelists were suggested. But not only Windows, but also many unixoid OSes support more-than-8-bit characters such as Unicode. You could here also talk about encodings such as UTF-8. You can consider Jonathan Leffler's comment, where he gives info about modern Linux and describes details for MacOS. Wikipedia states, that (for example) the
modifier letter colon [(See 7. below) is] sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The [inherited ASCII] colon itself is not permitted.
Therefore, I want to present a much more liberal approach using Unicode Homoglyph characters to replace the "illegal" ones. I found the result in my comparable use-case by far more readable and it's only limited by the used font, which is very broad, 3903 characters for Windows default. Plus you can even restore the original content from the replacements.
Possible choices and research notes
To keep things organized, I will always give the character, it's name and the hexadecimal number representation. The latter is is not case sensitive and leading zeroes can be added or ommitted freely, so for example U+002A and u+2a are equivalent. If available, I'll try to point to more info or alternatives - feel free to show me more or better ones.
Instead of * (U+2A * ASTERISK), you can use one of the many listed, for example U+2217 ∗ (ASTERISK OPERATOR) or the Full Width Asterisk U+FF0A *. u+20f0 ⃰ combining asterisk above from combining diacritical marks for symbols might also be a valid choice. You can read 4. for more info about the combining characters.
Instead of . (U+2E . full stop), one of these could be a good option, for example ⋅ U+22C5 dot operator.
Instead of " (U+22 " quotation mark), you can use “ U+201C english leftdoublequotemark, more alternatives see here. I also included some of the good suggestions of Wally Brockway's answer, in this case u+2036 ‶ reversed double prime and u+2033 ″ double prime - I will from now on denote ideas from that source by ¹³.
Instead of / (U+2F / SOLIDUS), you can use ∕ DIVISION SLASH U+2215 (others here), ̸ U+0338 COMBINING LONG SOLIDUS OVERLAY, ̷ COMBINING SHORT SOLIDUS OVERLAY U+0337 or u+2044 ⁄ fraction slash¹³. Be aware about spacing for some characters, including the combining or overlay ones, as they have no width and can produce something like -> ̸th̷is which is ̸th̷is. With added spaces you get -> ̸ th ̷ is, which is ̸ th ̷ is. The second one (COMBINING SHORT SOLIDUS OVERLAY) looks bad in the stackoverflow-font.
Instead of \ (U+5C Reverse solidus), you can use ⧵ U+29F5 Reverse solidus operator (more) or u+20E5 ⃥ combining reverse solidus overlay¹³.
To replace [ (U+5B [ Left square bracket) and ] (U+005D ] Right square bracket), you can use for example U+FF3B[ FULLWIDTH LEFT SQUARE BRACKET and U+FF3D ]FULLWIDTH RIGHT SQUARE BRACKET (from here, more possibilities here).
Instead of : (u+3a : colon), you can use U+2236 ∶ RATIO (for mathematical usage) or U+A789 ꞉ MODIFIER LETTER COLON, (see colon (letter), sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The colon itself is not permitted ... source and more replacements see here). Another alternative is this one: u+1361 ፡ ethiopic wordspace¹³.
Instead of ; (u+3b ; semicolon), you can use U+037E ; GREEK QUESTION MARK (see here).
For | (u+7c | vertical line), there are some good substitutes such as: U+2223 ∣ DIVIDES, U+0964 । DEVANAGARI DANDA, U+01C0 ǀ LATIN LETTER DENTAL CLICK (the last ones from Wikipedia) or U+2D4F ⵏ Tifinagh Letter Yan. Also the box drawing characters contain various other options.
Instead of , (, U+002C COMMA), you can use for example ‚ U+201A SINGLE LOW-9 QUOTATION MARK (see here).
For ? (U+003F ? QUESTION MARK), these are good candidates: U+FF1F ? FULLWIDTH QUESTION MARK or U+FE56 ﹖ SMALL QUESTION MARK (from here and here). There are also two more from the Dingbats Block (search for "question") and the u+203d ‽ interrobang¹³.
While my machine seems to accept it unchanged, I still want to include > (u+3e greater-than sign) and < (u+3c less-than sign) for the sake of completeness. The best replacement here is probably also from the quotation block, such as u+203a › single right-pointing angle quotation mark and u+2039 ‹ single left-pointing angle quotation mark respectively. The tifinagh block only contains ⵦ (u+2D66)¹³ to replace <. The last notion is ⋖ less-than with dot u+22D6 and ⋗ greater-than with dot u+22D7.
For additional ideas, you can also look for example into this block. You still want more ideas? You can try to draw your desired character and look at the suggestions here.
How do you type these characters
Say you want to type ⵏ (Tifinagh Letter Yan). To get all of its information, you can always search for this character (ⵏ) on a suited platform such as this Unicode Lookup (please add 0x when you search for hex) or that Unicode Table (that only allows to search for the name, in this case "Tifinagh Letter Yan"). You should obtain its Unicode number U+2D4F and the HTML-code ⵏ (note that 2D4F is hexadecimal for 11599). With this knowledge, you have several options to produce these special characters including the use of
code points to unicode converter or again the Unicode Lookup to reversely convert the numerical representation into the unicode character (remember to set the code point base below to decimal or hexadecimal respectively)
a one-liner makro in Autohotkey: :?*:altpipe::{U+2D4F} to type ⵏ instead of the string altpipe - this is the way I input those special characters, my Autohotkey script can be shared if there is common interest
Alt Characters or alt-codes by pressing and holding alt, followed by the decimal number for the desired character (more info for example here, look at a table here or there). For the example, that would be Alt+11599. Be aware, that many programs do not fully support this windows feature for all of unicode (as of time writing). Microsoft Office is an exception where it usually works, some other OSes provide similar functionality. Typing these chars with Alt-combinations into MS Word is also the way Wally Brockway suggests in his answer¹³ that was already mentionted - if you don't want to transfer all the hexadecimal values to the decimal asc, you can find some of them there¹³.
in MS Office, you can also use ALT + X as described in this MS article to produce the chars
if you rarely need it, you can of course still just copy-paste the special character of your choice instead of typing it
For Windows you can check it using PowerShell
$PathInvalidChars = [System.IO.Path]::GetInvalidPathChars() #36 chars
To display UTF-8 codes you can convert
$enc = [system.Text.Encoding]::UTF8
$PathInvalidChars | foreach { $enc.GetBytes($_) }
$FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars() #41 chars
$FileOnlyInvalidChars = #(':', '*', '?', '\', '/') #5 chars - as a difference
For anyone looking for a regex:
const BLACKLIST = /[<>:"\/\\|?*]/g;
In Windows 10 (2019), the following characters are forbidden by an error when you try to type them:
A file name can't contain any of the following characters:
\ / : * ? " < > |
Here's a c# implementation for windows based on Christopher Oezbek's answer
It was made more complex by the containsFolder boolean, but hopefully covers everything
/// <summary>
/// This will replace invalid chars with underscores, there are also some reserved words that it adds underscore to
/// </summary>
/// <remarks>
/// https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names
/// </remarks>
/// <param name="containsFolder">Pass in true if filename represents a folder\file (passing true will allow slash)</param>
public static string EscapeFilename_Windows(string filename, bool containsFolder = false)
{
StringBuilder builder = new StringBuilder(filename.Length + 12);
int index = 0;
// Allow colon if it's part of the drive letter
if (containsFolder)
{
Match match = Regex.Match(filename, #"^\s*[A-Z]:\\", RegexOptions.IgnoreCase);
if (match.Success)
{
builder.Append(match.Value);
index = match.Length;
}
}
// Character substitutions
for (int cntr = index; cntr < filename.Length; cntr++)
{
char c = filename[cntr];
switch (c)
{
case '\u0000':
case '\u0001':
case '\u0002':
case '\u0003':
case '\u0004':
case '\u0005':
case '\u0006':
case '\u0007':
case '\u0008':
case '\u0009':
case '\u000A':
case '\u000B':
case '\u000C':
case '\u000D':
case '\u000E':
case '\u000F':
case '\u0010':
case '\u0011':
case '\u0012':
case '\u0013':
case '\u0014':
case '\u0015':
case '\u0016':
case '\u0017':
case '\u0018':
case '\u0019':
case '\u001A':
case '\u001B':
case '\u001C':
case '\u001D':
case '\u001E':
case '\u001F':
case '<':
case '>':
case ':':
case '"':
case '/':
case '|':
case '?':
case '*':
builder.Append('_');
break;
case '\\':
builder.Append(containsFolder ? c : '_');
break;
default:
builder.Append(c);
break;
}
}
string built = builder.ToString();
if (built == "")
{
return "_";
}
if (built.EndsWith(" ") || built.EndsWith("."))
{
built = built.Substring(0, built.Length - 1) + "_";
}
// These are reserved names, in either the folder or file name, but they are fine if following a dot
// CON, PRN, AUX, NUL, COM0 .. COM9, LPT0 .. LPT9
builder = new StringBuilder(built.Length + 12);
index = 0;
foreach (Match match in Regex.Matches(built, #"(^|\\)\s*(?<bad>CON|PRN|AUX|NUL|COM\d|LPT\d)\s*(\.|\\|$)", RegexOptions.IgnoreCase))
{
Group group = match.Groups["bad"];
if (group.Index > index)
{
builder.Append(built.Substring(index, match.Index - index + 1));
}
builder.Append(group.Value);
builder.Append("_"); // putting an underscore after this keyword is enough to make it acceptable
index = group.Index + group.Length;
}
if (index == 0)
{
return built;
}
if (index < built.Length - 1)
{
builder.Append(built.Substring(index));
}
return builder.ToString();
}
Though the only illegal Unix chars might be / and NULL, although some consideration for command line interpretation should be included.
For example, while it might be legal to name a file 1>&2 or 2>&1 in Unix, file names such as this might be misinterpreted when used on a command line.
Similarly it might be possible to name a file $PATH, but when trying to access it from the command line, the shell will translate $PATH to its variable value.
The .NET Framework System.IO provides the following functions for invalid file system characters:
Path.GetInvalidFileNameChars
Path.GetInvalidPathChars
Those functions should return appropriate results depending on the platform the .NET runtime is running in. That said, the Remarks in the documentation pages for those functions say:
The array returned from this method is not guaranteed to contain the
complete set of characters that are invalid in file and directory
names. The full set of invalid characters can vary by file system.
I always assumed that banned characters in Windows filenames meant that all exotic characters would also be outlawed. The inability to use ?, / and : in particular irked me. One day I discovered that it was virtually only those chars which were banned. Other Unicode characters may be used. So the nearest Unicode characters to the banned ones I could find were identified and MS Word macros were made for them as Alt+?, Alt+: etc. Now I form the filename in Word, using the substitute chars, and copy it to the Windows filename. So far I have had no problems.
Here are the substitute chars (Alt + the decimal Unicode) :
⃰ ⇔ Alt8432
⁄ ⇔ Alt8260
⃥ ⇔ Alt8421
∣ ⇔ Alt8739
ⵦ ⇔ Alt11622
⮚ ⇔ Alt11162
‽ ⇔ Alt8253
፡ ⇔ Alt4961
‶ ⇔ Alt8246
″ ⇔ Alt8243
As a test I formed a filename using all of those chars and Windows accepted it.
This is good enough for me in Python:
def fix_filename(name, max_length=255):
"""
Replace invalid characters on Linux/Windows/MacOS with underscores.
List from https://stackoverflow.com/a/31976060/819417
Trailing spaces & periods are ignored on Windows.
>>> fix_filename(" COM1 ")
'_ COM1 _'
>>> fix_filename("COM10")
'COM10'
>>> fix_filename("COM1,")
'COM1,'
>>> fix_filename("COM1.txt")
'_.txt'
>>> all('_' == fix_filename(chr(i)) for i in list(range(32)))
True
"""
return re.sub(r'[/\\:|<>"?*\0-\x1f]|^(AUX|COM[1-9]|CON|LPT[1-9]|NUL|PRN)(?![^.])|^\s|[\s.]$', "_", name[:max_length], flags=re.IGNORECASE)
See also this outdated list for additional legacy stuff like = in FAT32.
As of 18/04/2017, no simple black or white list of characters and filenames is evident among the answers to this topic - and there are many replies.
The best suggestion I could come up with was to let the user name the file however he likes. Using an error handler when the application tries to save the file, catch any exceptions, assume the filename is to blame (obviously after making sure the save path was ok as well), and prompt the user for a new file name. For best results, place this checking procedure within a loop that continues until either the user gets it right or gives up. Worked best for me (at least in VBA).
In Unix shells, you can quote almost every character in single quotes '. Except the single quote itself, and you can't express control characters, because \ is not expanded. Accessing the single quote itself from within a quoted string is possible, because you can concatenate strings with single and double quotes, like 'I'"'"'m' which can be used to access a file called "I'm" (double quote also possible here).
So you should avoid all control characters, because they are too difficult to enter in the shell. The rest still is funny, especially files starting with a dash, because most commands read those as options unless you have two dashes -- before, or you specify them with ./, which also hides the starting -.
If you want to be nice, don't use any of the characters the shell and typical commands use as syntactical elements, sometimes position dependent, so e.g. you can still use -, but not as first character; same with ., you can use it as first character only when you mean it ("hidden file"). When you are mean, your file names are VT100 escape sequences ;-), so that an ls garbles the output.
When creating internet shortcuts in Windows, to create the file name, it skips illegal characters, except for forward slash, which is converted to minus.
I had the same need and was looking for recommendation or standard references and came across this thread. My current blacklist of characters that should be avoided in file and directory names are:
$CharactersInvalidForFileName = {
"pound" -> "#",
"left angle bracket" -> "<",
"dollar sign" -> "$",
"plus sign" -> "+",
"percent" -> "%",
"right angle bracket" -> ">",
"exclamation point" -> "!",
"backtick" -> "`",
"ampersand" -> "&",
"asterisk" -> "*",
"single quotes" -> "“",
"pipe" -> "|",
"left bracket" -> "{",
"question mark" -> "?",
"double quotes" -> "”",
"equal sign" -> "=",
"right bracket" -> "}",
"forward slash" -> "/",
"colon" -> ":",
"back slash" -> "\\",
"lank spaces" -> "b",
"at sign" -> "#"
};

Delimiters in BIML

I am migrating to biml for SSIS integration of flat, excel and CSV files and I want to know all the possible delimeters and text qualifiers we can use since the documentation isn't telling that much.
Basically, there are 3 options to choose from:
The “official” ENUM
Allowed values here are:
– CRLF
– CR
– LF
– Semicolon
– Comma
– Tab
– VerticalBar
– UnitSeparator
Use the Hex Code
If you know the ASCII Code of your qualifier, you can use it starting with “x” and ending with “”.
A ” would be described by “x0022” for example.
Use the actual character (HTML encoded or escaped)
If you want to (for example) define a ” as your qualifier, you can do so. Just make sure, depending on how you use it, to either encode or escape it. When defining it as an actual Biml property, it has to be encoded.

Using spéciâl characters in URLs

I'm developing a website which lets people create their own translator. They can choose the name of the URL, and it is sent to a database and I use .htaccess to redirect website.com/nameoftheirtranslator
to:
website.com/translator.php?name=nameoftheirtranslator
Here's my problem:
Recently, I've noticed that someone has created a translator with special characters in the name -> "LAEFÊVËŠI".
But when it is processed (posted to a php file, then mysqli_real_escape_string) and added to the database it appears as simply "LAEFVI" - so you can see the special characters have been lost somewhere.
I'm not quite sure what to do here, but I think there are two paths:
Try to keep the characters and do some encoding (no idea where to start)
Ditch them and tell users to only use 'normal' characters in the names of their translators (not ideal)
I'm wondering whether it's even possible to have a url like website.com/LAEFÊVËŠI - can that be interpreted by the server?
EDIT1: I notice that stack overflow, on this very question, translates the special characters in my title to .../using-special-characters-in-urls! This seems like a great solution, I guess I could make a function that translates special characters like â to their normal equivalent (like â)? And I suppose I would just ignore other characters like /##"',&? Now that I think of it, there must be some fairly standard/good-practice strategies for getting around problems like this.
EDIT2: Actually, now that I think about it (more) - I really want this thing to be usable by people of any language (not just English), so I would really love to be able to have special characters in the urls. Having said this, I've just found that Google doesn't interpret â as a, so people may have a hard time finding the LAEFÊVËŠI translator if I don't translate the letters to normal characters. Ahh!
Okay, after that crazy episode, here's what happened:
Found out that I was removing all the non alpha-numeric characters with PHP preg_replace().
Altered preg_replace so it only removes spaces and used rawurlencode():
$name = mysqli_real_escape_string($con, rawurlencode( preg_replace("/\s/", '', $name) ));
Now everything is in the database encoded, safe and sound.
Used this rewrite rule RewriteRule ^([^/.]+)$ process.php?name=$1 [B]
Run around in circles for 2 hours thingking my rewrite was wrong because I was getting "page not found"
Realise that process.php didn't have a rawurlencode() to read in the name
$name = rawurlencode($_GET['name']);
Now it works.
WOO!
Sleep time.

how to get the real file contents using TFilestream?

i try to get the file contents using TFilestream:
procedure ShowFileCont(myfile : string);
var
tr : string;
fs : TFileStream;
Begin
Fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
SetLength(tr, Fs.Size);
Fs.Read(tr[1], Fs.Size);
Showmessage(tr);
Fs.Free;
end;
I do a little text file with contents only:
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
And save this file (using AkelPad) with 1251 (ansi) codepege
Save with 65001 (UTF8) codepage.
these to files has different size but there contents is equal - i oped them both in notepad and they both has the same contents
But when i run ShowFileCont proc it shows to me different results:
aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
Questions:
how to get the real file contents using TFilestream?
How to explain that these 2 files has different size but the content (in notepad) is equeal?
Add: Sorry, i didn't say that i use Lazarus FPC and string = utf8string
Why do the files have different size?
Because they use different encodings. The 1251 encoding maps each character to a single byte. But UTF-8 uses variable numbers of bytes for each character.
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded, which is the best choice, then you load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded. In which case the code in the question is what you need.
Loading an MBCS encoded file with a code page of 1251, say, is more tricky. You can load that into an AnsiString variable and so long as your system's locale is 1251 then any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you wanted to load text using different MBCS encodings, for example 1252, then you cannot use this approach. You would need to load into a byte array and then convert from 1252, say, to UTF-8 so that you could then store that UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
How can I determine from the file what encoding is used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.
It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text. And so you can rule out such files. But for encodings like 1251, 1252 etc. then any byte stream is valid. There's simply no way for you to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
The LConvEncoding unit has GuessEncoding which sounds like it may be of some use.
Their contents are obviously not equal. You can see for yourself that the file sizes are different. Things of different size are never equal.
Your files might appear equal in Notepad because Notepad knows how to recognize certain character encodings. You saved your file two different ways. One way used an encoding that assigns one byte to each of 256 possible values. The other way uses an encoding that assigns between one and six bytes to each of more than 10,000 possible values. Some of the characters you saved require more than one byte, which explains why one version of the file is bigger than the other.
TFileStream doesn't pay attention to any of that. It just deals with bytes. Depending on your Delphi version, your string variable may or may not pay attention to encodings. Prior to Delphi 2009, string stored one byte per character. As of Delphi 2009, string uses two bytes per character, so your SetLength call is wrong, and everything after that is pointless to investigate much further.
With one byte per character, your ShowMessage call is not going to interpret the string as UTF-8-encoded. Instead, it will interpret your string using whatever your system code page is. If you know that the string you've read is encoded with UTF-8, then you'll want to convert it to UTF-16 prior to display by calling UTF8Decode. That will return a WideString, and you can use any number of functions to display it, such as MessageBoxW. If you have Delphi 2009 or later, then the compiler will insert conversion code for you automatically, if you've used Utf8String instead of string.

What is the difference between plaintext and binary data?

Many languages have functions which only process "plaintext", not binary. Does this mean that only characters within the ASCII range will be allowed?
Binary is just a series of bytes, isn't it similar to plaintext which is just a series of bytes interpreted as characters? So, can plaintext store the same data formats / protocols as binary?
a plain text is human readable, a binary file is usually unreadable by a human, since it's composed of printable and non-printable characters.
Try to open a jpeg file with a text editor (e.g. notepad or vim) and you'll understand what I mean.
A binary file is usually constructed in a way that optimizes speed, since no parsing is needed.
A plain text file is editable by hand, a binary file not.
"Plaintext" can have several meanings.
The one most useful in this context is that it is merely a binary files which is organized in byte sequences that a particular computers system can translate into a finite set of what it considers "text" characters.
A second meaning, somewhat connected, is a restriction that said system should display these "text characters" as symbols readable by a human as members of a recognizable alphabet. Often, the unwritten implication is that the translation mechanism is ASCII.
A third, even more restrictive meaning, is that this system must be a "simple" text editor/viewer. Usually implying ASCII encoding. But, really, there is VERY little difference between you, the human, reading text encoded in some funky format and displayed by a proprietary program, vs. VI text editor reading ASCII encoded file.
Within programming context, your programming environment (comprized by OS + system APIs + your language capabilities) defines both a set of "text" characters, and a set of encodings it is able to read to convert to these "text" characters. Please note that this may not necessarily imply ASCII, English, or 8 bits - as an example, Perl can natively read and use the full Unicode set of "characters".
To answer your specific question, you can definitely use "character" strings to transmit arbitrary byte sequences, with the caveat that string termination conventions must apply.
The problem is that the functions that already exist to "process character data" would probably not have any useful functionality to deal with your binary data.
One thing it often means is that the language might feel free to interpret certian control characters, such as the values 10 or 13, as logical line terminators. In other words, an output operation might automagicly append these characters at the end, and an input operation might strip them from the input (and/or terminate reading there).
In contrast, language I/O operations that advertise working on "binary" data will usually include an input parameter for the length of data to operate on, since there is no other way (short of reading past end of file) to know when it is done.
Generally, it depends on the language/environment/functionality.
Binary data is always that: binary. It is transferred without modification.
"Plain text" mode may mean one or more of the following things:
the stream of bytes is split into lines. The line delimiters are \r, \n, or \r\n, or \n\r. Sometimes it is OS-dependent (like *nix likes \n, while windows likes \r\n). The line ending may be adjusted for the reading application
character encoding may be adjusted. The environment might detect and/or convert the source encoding into the encoding the application expects
probably some other conversions should be added to this list, but I can't think of any more at this moment
Technically nothing. Plain text is a form of binary data. However a major difference is how values are stored. Think of how an integer might be stored. In binary data it would use a two's complement format, probably taking 32 bits of space. In text format a number would be stored instead as a series of unicode digits. So the number 50 would be stored as 0x32 (padded to take up 32 bits) in binary but would be stored as '5' '0' in plain text.