Term for a "Special Identifier" Embedded in String Data - json

I'm mostly at a loss for how to describe this, so I'll start with a simple example that is similar to some JSON I'm working with:
"user_interface": {
username: "Hello, %USER.username%",
create_date: "Your account was created on %USER.create_date%",
favorite_color: "Your favorite color is: %USER.fav_color%"
}
The "special identifiers" located in the username, create_date and favorite_color fields start and end with % characters, and are supposed to be replaced with the correct information for that particular user. An example for the favorite_color field would be:
Your favorite color is: Orange
Is there a proper term for these identifiers? I'm trying to search google for best practices or libraries when parsing these before I reinvent the wheel, but everything I can think of results in a sea of false-positives.

Just some thoughts on the subject of %special identifier%. Let's look at a small set of examples that implement very similar string replacement.
WSH Shell ExpandEnvironmentStrings Method
Returns an environment variable's expanded value.
WSH .vbs code snippet
Set WshShell = WScript.CreateObject("WScript.Shell")
WScript.Echo WshShell.ExpandEnvironmentStrings("WinDir is %WinDir%")
' WinDir is C:\Windows
.NET Composite Formatting
The .NET Framework composite formatting feature takes a list of objects and a composite format string as input. A composite format string consists of fixed text intermixed with indexed placeholders, called format items, that correspond to the objects in the list. The formatting operation yields a result string that consists of the original fixed text intermixed with the string representation of the objects in the list.
VB.Net code snippet
Console.WriteLine(String.Format("Prime numbers less than 10: {0}, {1}, {2}, {3}, {4}", 1, 2, 3, 5, 7 ))
' Prime numbers less than 10: 1, 2, 3, 5, 7
JavaScript replace Method (with RegEx application)
... The match variables can be used in text replacement where the replacement string has to be determined dynamically... $n ... The nth captured submatch ...
Also called Format Flags, Substitution, Backreference and Format specifiers
JavaScript code snippet
console.log("Hello, World!".replace(/(\w+)\W+(\w+)/g, "$1, dear $2"))
// Hello, dear World!
Python Format strings
Format strings contain “replacement fields” surrounded by curly braces {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output...
Python code snippet
print("The sum of 1 + 2 is {0}".format(1+2))
# The sum of 1 + 2 is 3
Ruby String Interpolation
Double-quote strings allow interpolation of other values using #{...} ...
Ruby code snippet
res = 3
puts "The sum of 1 + 2 is #{res}"
# The sum of 1 + 2 is 3
TestComplete Custom String Generator
... A string of macros, text, format specifiers and regular expressions that will be used to generate values. The default value of this parameter is %INT(1, 2147483647, 1) %NAME(ANY, FULL) lives in %CITY. ... Also, you can format the generated values using special format specifiers. For instance, you can use the following macro to generate a sequence of integer values with the specified minimum length (3 characters) -- %0.3d%INT(1, 100, 3).
Angular Expression
Angular expressions are JavaScript-like code snippets that are mainly placed in interpolation bindings such as {{ textBinding }}...
Django Templates
Variables are surrounded by {{ and }} like this:
My first name is {{ first_name }}. My last name is {{ last_name }}.
With a context of {'first_name': 'John', 'last_name': 'Doe'}, this template renders to:
My first name is John. My last name is Doe.
Node.js v4 Template strings
... Template strings can contain place holders. These are indicated by the Dollar sign and curly braces (${expression}). The expressions in the place holders and the text between them get passed to a function...
JavaScript code snippet
var res = 3;
console.log(`The sum of 1 + 2 is ${res}`);
// The sum of 1 + 2 is 3
C/C++ Macros
Preprocessing expands macros in all lines that are not preprocessor directives...
Replacement in source code.
C++ code snippet
std::cout << __DATE__;
// Jan 8 2016
AutoIt Macros
AutoIt has a number of Macros that are special read-only variables used by AutoIt. Macros start with the @ character ...
Replacement in source code.
AutoIt code snippet
MsgBox(0, "", "CPU Architecture is " & @CPUArch)
; CPU Architecture is X64
SharePoint solution Replaceable Parameters
Replaceable parameters, or tokens, can be used inside project files to provide values for SharePoint solution items whose actual values are not known at design time. They are similar in function to the standard Visual Studio template tokens... Tokens begin and end with a dollar sign ($) character. Any tokens used are replaced with actual values when a project is packaged into a SharePoint solution package (.wsp) file at deployment time. For example, the token $SharePoint.Package.Name$ might resolve to the string "Test SharePoint Package."
Apache Ant Replace Task
Replace is a directory based task for replacing the occurrence of a given string with another string in selected file... token... the token which must be replaced...
So, based on functional context I would call it %token% (such a flavor of strings with an identified "meaning").
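Whatever name you settle on, the replacement step itself is small. Here is a minimal sketch in Python, assuming the %OBJECT.field% syntax from the question; the function name expand_tokens and the nested-dictionary layout of context are hypothetical choices made for illustration, not part of any library. Tokens that cannot be resolved are left untouched so that a stray % in ordinary text does no harm.

```python
import re

# Matches tokens such as %USER.username%: an object name, a dot, a field name.
TOKEN = re.compile(r"%(\w+)\.(\w+)%")


def expand_tokens(template, context):
    """Replace each %OBJECT.field% token with context[OBJECT][field].

    Tokens that cannot be resolved are left in place unchanged.
    """
    def lookup(match):
        obj, field = match.group(1), match.group(2)
        try:
            return str(context[obj][field])
        except KeyError:
            # Unknown object or field: keep the original token text.
            return match.group(0)

    return TOKEN.sub(lookup, template)


# Hypothetical user data, mirroring the field names used in the question.
user = {"USER": {"username": "alice", "fav_color": "Orange"}}
print(expand_tokens("Your favorite color is: %USER.fav_color%", user))
```

Using a callback with re.sub keeps the lookup logic in one place, so switching to a different delimiter only means changing the pattern.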

Related

Hi, I need write question mark into filename on windows How do I do it ? Plz THX [duplicate]

I know that / is illegal in Linux, and the following are illegal in Windows
(I think) * . " / \ [ ] : ; | ,
What else am I missing?
I need a comprehensive guide, however, and one that takes into account
double-byte characters. Linking to outside resources is fine with me.
I need to first create a directory on the filesystem using a name that may
contain forbidden characters, so I plan to replace those characters with
underscores. I then need to write this directory and its contents to a zip file
(using Java), so any additional advice concerning the names of zip directories
would be appreciated.
The forbidden printable ASCII characters are:
Linux/Unix:
/ (forward slash)
Windows:
< (less than)
> (greater than)
: (colon - sometimes works, but is actually NTFS Alternate Data Streams)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Non-printable characters
If your data comes from a source that would permit non-printable characters then there is more to check for.
Linux/Unix:
0 (NULL byte)
Windows:
0-31 (ASCII control characters)
Note: While it is legal under Linux/Unix file systems to create files with control characters in the filename, it might be a nightmare for the users to deal with such files.
Reserved file names
The following filenames are reserved:
Windows:
CON, PRN, AUX, NUL
COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
(both on their own and with arbitrary file extensions, e.g. LPT1.txt).
Other rules
Windows:
Filenames cannot end in a space or dot.
macOS:
You didn't ask for it, but just in case: Colon : and forward slash / depending on context are not permitted (e.g. Finder supports slashes, terminal supports colons). (More details)
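The Windows rules listed above can be collected into a single check. This is only a sketch in Python under the assumptions stated in this answer (printable forbidden characters, control characters 0-31, the reserved device names with or without an extension, and no trailing space or dot); the function name windows_name_problems is made up for illustration, and the check says nothing about path length or the file system actually in use.

```python
# Printable ASCII characters that Windows rejects in a file name.
FORBIDDEN = set('<>:"/\\|?*')

# Reserved device names, compared case-insensitively.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_name_problems(name):
    """Return a list of reasons why name is not a valid Windows file name."""
    if not name:
        return ["empty name"]
    problems = []
    if any(ch in FORBIDDEN for ch in name):
        problems.append("forbidden character")
    if any(ord(ch) < 32 for ch in name):
        problems.append("control character")
    # Reserved names apply on their own and with any extension (LPT1.txt).
    stem = name.split(".", 1)[0].upper()
    if stem in RESERVED:
        problems.append("reserved name")
    if name[-1] in " .":
        problems.append("trailing space or dot")
    return problems
```

An empty list means none of the rules above were violated; it does not guarantee that the operating system will accept the name.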
A “comprehensive guide” of forbidden filename characters is not going to work on Windows because it reserves filenames as well as characters. Yes, characters like * " ? and others are forbidden, but there are an infinite number of names composed only of valid characters that are forbidden. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.
Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly-allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename valid in one folder may become invalid if moved to another folder. The rules for naming files and folders are in the Microsoft docs.
You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name anything they want, you have to create safe names like A, AB, A2 et al., store user-generated names and their path equivalents in an application data file, and perform path mapping in your application.
If you absolutely must allow user-generated folder names, the only way to tell if they are invalid is to catch exceptions and assume the name is invalid. Even that is fraught with peril, as the exceptions thrown for denied access, offline drives, and out of drive space overlap with those that can be thrown for invalid names. You are opening up one huge can of hurt.
Under Linux and other Unix-related systems, there were traditionally only two characters that could not appear in the name of a file or directory, and those are NUL '\0' and slash '/'. The slash, of course, can appear in a pathname, separating directory components.
Rumour¹ has it that Steven Bourne (of 'shell' fame) had a directory containing 254 files, one for every single letter (character code) that can appear in a file name (excluding /, '\0'; the name . was the current directory, of course). It was used to test the Bourne shell and routinely wrought havoc on unwary programs such as backup programs.
Other people have covered the rules for Windows filenames, with links to Microsoft and Wikipedia on the topic.
Note that MacOS X has a case-insensitive file system. Current versions of it appear to allow colon : in file names, though historically that was not necessarily always the case:
$ echo a:b > a:b
$ ls -l a:b
-rw-r--r-- 1 jonathanleffler staff 4 Nov 12 07:38 a:b
$
However, at least with macOS Big Sur 11.7, the file system does not allow file names that are not valid UTF-8 strings. That means the file name cannot consist of the bytes that are always invalid in UTF-8 (0xC0, 0xC1, 0xF5-0xFF), and you can't use the continuation bytes 0x80..0xBF as the only byte in a file name. The error given is 92 Illegal byte sequence.
POSIX defines a Portable Filename Character Set consisting of:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
Sticking with names formed solely from those characters avoids most of the problems, though Windows still adds some complications.
¹ It was Kernighan & Pike in ['The Practice of Programming'](http://www.cs.princeton.edu/~bwk/tpop.webpage/) who said as much in Chapter 6, Testing, §6.5 Stress Tests:
When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.
Note that the directory must have contained entries . and .., so it was arguably 253 files (and 2 directories), or 255 name entries, rather than 254 files. This doesn't affect the effectiveness of the anecdote, or the careful testing it describes.
TPOP was previously at
http://plan9.bell-labs.com/cm/cs/tpop and
http://cm.bell-labs.com/cm/cs/tpop but both are now (2021-11-12) broken.
See also Wikipedia on TPOP.
Instead of creating a blacklist of characters, you could use a whitelist. All things considered, the range of characters that make sense in a file or directory name context is quite short, and unless you have some very specific naming requirements your users will not hold it against your application if they cannot use the whole ASCII table.
It does not solve the problem of reserved names in the target file system, but with a whitelist it is easier to mitigate the risks at the source.
In that spirit, this is a range of characters that can be considered safe:
Letters (a-z A-Z) - Unicode characters as well, if needed
Digits (0-9)
Underscore (_)
Hyphen (-)
Space
Dot (.)
And any additional safe characters you wish to allow. Beyond this, you just have to enforce some additional rules regarding spaces and dots. This is usually sufficient:
Name must contain at least one letter or number (to avoid only dots/spaces)
Name must start with a letter or number (to avoid leading dots/spaces)
Name may not end with a dot or space (simply trim those if present, like Explorer does)
This already allows quite complex and nonsensical names. For example, these names would be possible with these rules, and be valid file names in Windows/Linux:
A...........ext
B -.- .ext
In essence, even with so few whitelisted characters you should still decide what actually makes sense, and validate/adjust the name accordingly. In one of my applications, I used the same rules as above but stripped any duplicate dots and spaces.
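The whitelist rules above can be sketched in a few lines of Python. This assumes ASCII letters only, replaces every non-whitelisted character with an underscore, and collapses repeated dots and spaces as described; the function name sanitize_whitelist and the choice to return None when nothing usable remains are illustrative assumptions, and reserved names on the target file system are still not handled.

```python
import re


def sanitize_whitelist(name, replacement="_"):
    """Apply the whitelist rules; return None if no usable name remains."""
    # Keep only ASCII letters, digits, underscore, hyphen, space and dot.
    cleaned = re.sub(r"[^A-Za-z0-9_\- .]", replacement, name)
    # Collapse runs of dots and runs of spaces into a single character.
    cleaned = re.sub(r"\.{2,}", ".", cleaned)
    cleaned = re.sub(r" {2,}", " ", cleaned)
    # A name may not end with a dot or a space, so trim them.
    cleaned = cleaned.rstrip(" .")
    # A name must start with a letter or a digit.
    cleaned = re.sub(r"^[^A-Za-z0-9]+", "", cleaned)
    return cleaned or None


print(sanitize_whitelist("A...........ext"))
print(sanitize_whitelist("  ..report?.txt. "))
```

Because leading characters are stripped until the first letter or digit, a name made up only of dots, spaces or replaced characters ends up as None, which covers the "at least one letter or number" rule.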
The easy way to get Windows to tell you the answer is to attempt to rename a file via Explorer and type in a forward slash, /, for the new name. Windows will pop up a message box listing the illegal characters.
A filename cannot contain any of the following characters:
\ / : * ? " < > |
Microsoft Docs - Naming Files, Paths, and Namespaces - Naming Conventions
Well, if only for research purposes, then your best bet is to look at this Wikipedia entry on Filenames.
If you want to write a portable function to validate user input and create filenames based on that, the short answer is don't. Take a look at a portable module like Perl's File::Spec to get a glimpse of all the hoops needed to accomplish such a "simple" task.
Discussing different possible approaches
The difficulty of defining what is legal and what is not has already been addressed, and whitelists were suggested. But Windows is not alone: many Unix-like OSes also support characters beyond 8 bits, such as Unicode, and encodings such as UTF-8 come into play here as well. Consider Jonathan Leffler's comment, where he gives information about modern Linux and describes details for macOS. Wikipedia states that (for example) the
modifier letter colon [(See 7. below) is] sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The [inherited ASCII] colon itself is not permitted.
Therefore, I want to present a much more liberal approach using Unicode homoglyph characters to replace the "illegal" ones. In my comparable use case I found the result far more readable, and it is limited only by the font in use, which is very broad: 3903 characters for the Windows default. You can even restore the original content from the replacements.
Possible choices and research notes
To keep things organized, I will always give the character, its name and the hexadecimal number representation. The latter is not case sensitive and leading zeroes can be added or omitted freely, so for example U+002A and u+2a are equivalent. If available, I'll try to point to more info or alternatives - feel free to show me more or better ones.
Instead of * (U+2A * ASTERISK), you can use one of the many listed, for example U+2217 ∗ (ASTERISK OPERATOR) or the Full Width Asterisk U+FF0A *. u+20f0 ⃰ combining asterisk above from combining diacritical marks for symbols might also be a valid choice. You can read 4. for more info about the combining characters.
Instead of . (U+2E . full stop), one of these could be a good option, for example ⋅ U+22C5 dot operator.
Instead of " (U+22 " quotation mark), you can use “ U+201C english leftdoublequotemark, more alternatives see here. I also included some of the good suggestions of Wally Brockway's answer, in this case u+2036 ‶ reversed double prime and u+2033 ″ double prime - I will from now on denote ideas from that source by ¹³.
Instead of / (U+2F / SOLIDUS), you can use ∕ DIVISION SLASH U+2215 (others here), ̸ U+0338 COMBINING LONG SOLIDUS OVERLAY, ̷ COMBINING SHORT SOLIDUS OVERLAY U+0337 or u+2044 ⁄ fraction slash¹³. Be aware about spacing for some characters, including the combining or overlay ones, as they have no width and can produce something like -> ̸th̷is which is ̸th̷is. With added spaces you get -> ̸ th ̷ is, which is ̸ th ̷ is. The second one (COMBINING SHORT SOLIDUS OVERLAY) looks bad in the stackoverflow-font.
Instead of \ (U+5C Reverse solidus), you can use ⧵ U+29F5 Reverse solidus operator (more) or u+20E5 ⃥ combining reverse solidus overlay¹³.
To replace [ (U+5B [ Left square bracket) and ] (U+005D ] Right square bracket), you can use for example U+FF3B[ FULLWIDTH LEFT SQUARE BRACKET and U+FF3D ]FULLWIDTH RIGHT SQUARE BRACKET (from here, more possibilities here).
Instead of : (u+3a : colon), you can use U+2236 ∶ RATIO (for mathematical usage) or U+A789 ꞉ MODIFIER LETTER COLON, (see colon (letter), sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The colon itself is not permitted ... source and more replacements see here). Another alternative is this one: u+1361 ፡ ethiopic wordspace¹³.
Instead of ; (u+3b ; semicolon), you can use U+037E ; GREEK QUESTION MARK (see here).
For | (u+7c | vertical line), there are some good substitutes such as: U+2223 ∣ DIVIDES, U+0964 । DEVANAGARI DANDA, U+01C0 ǀ LATIN LETTER DENTAL CLICK (the last ones from Wikipedia) or U+2D4F ⵏ Tifinagh Letter Yan. Also the box drawing characters contain various other options.
Instead of , (, U+002C COMMA), you can use for example ‚ U+201A SINGLE LOW-9 QUOTATION MARK (see here).
For ? (U+003F ? QUESTION MARK), these are good candidates: U+FF1F ? FULLWIDTH QUESTION MARK or U+FE56 ﹖ SMALL QUESTION MARK (from here and here). There are also two more from the Dingbats Block (search for "question") and the u+203d ‽ interrobang¹³.
While my machine seems to accept it unchanged, I still want to include > (u+3e greater-than sign) and < (u+3c less-than sign) for the sake of completeness. The best replacement here is probably also from the quotation block, such as u+203a › single right-pointing angle quotation mark and u+2039 ‹ single left-pointing angle quotation mark respectively. The tifinagh block only contains ⵦ (u+2D66)¹³ to replace <. The last notion is ⋖ less-than with dot u+22D6 and ⋗ greater-than with dot u+22D7.
For additional ideas, you can also look for example into this block. You still want more ideas? You can try to draw your desired character and look at the suggestions here.
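The replacement idea can be sketched as a pair of translation tables in Python. The mapping below picks one substitute per forbidden character from the candidates listed above; which homoglyph to use is a matter of taste, and the names HOMOGLYPHS, to_safe and from_safe are made up for this illustration. Restoring the original only works if the original text did not already contain one of the substitute characters.

```python
# One substitute per character that Windows forbids in file names.
HOMOGLYPHS = {
    "*": "\u2217",   # ASTERISK OPERATOR
    '"': "\u201c",   # LEFT DOUBLE QUOTATION MARK
    "/": "\u2215",   # DIVISION SLASH
    "\\": "\u29f5",  # REVERSE SOLIDUS OPERATOR
    ":": "\ua789",   # MODIFIER LETTER COLON
    "|": "\u2223",   # DIVIDES
    "?": "\uff1f",   # FULLWIDTH QUESTION MARK
    "<": "\u2039",   # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    ">": "\u203a",   # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
}

ENCODE = str.maketrans(HOMOGLYPHS)
DECODE = str.maketrans({v: k for k, v in HOMOGLYPHS.items()})


def to_safe(name):
    """Swap forbidden characters for their look-alike substitutes."""
    return name.translate(ENCODE)


def from_safe(name):
    """Reverse to_safe, turning the substitutes back into the originals."""
    return name.translate(DECODE)
```

A one-to-one mapping is what makes the round trip possible; mapping several forbidden characters to the same substitute would lose that property.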
How do you type these characters
Say you want to type ⵏ (Tifinagh Letter Yan). To get all of its information, you can always search for this character (ⵏ) on a suitable platform such as this Unicode Lookup (please add 0x when you search for hex) or that Unicode Table (which only allows searching by name, in this case "Tifinagh Letter Yan"). You should obtain its Unicode number U+2D4F and the HTML code &#11599; (note that 2D4F is hexadecimal for 11599). With this knowledge, you have several options to produce these special characters including the use of
code points to unicode converter or again the Unicode Lookup to reversely convert the numerical representation into the unicode character (remember to set the code point base below to decimal or hexadecimal respectively)
a one-line macro in AutoHotkey: :?*:altpipe::{U+2D4F} to type ⵏ instead of the string altpipe - this is the way I input those special characters; my AutoHotkey script can be shared if there is common interest
Alt Characters or alt-codes by pressing and holding alt, followed by the decimal number for the desired character (more info for example here, look at a table here or there). For the example, that would be Alt+11599. Be aware that many programs do not fully support this Windows feature for all of Unicode (as of the time of writing). Microsoft Office is an exception where it usually works, and some other OSes provide similar functionality. Typing these chars with Alt-combinations into MS Word is also the way Wally Brockway suggests in his answer¹³ that was already mentioned - if you don't want to convert all the hexadecimal values to decimal, you can find some of them there¹³.
in MS Office, you can also use ALT + X as described in this MS article to produce the chars
if you rarely need it, you can of course still just copy-paste the special character of your choice instead of typing it
For Windows you can check it using PowerShell
$PathInvalidChars = [System.IO.Path]::GetInvalidPathChars() #36 chars
To display UTF-8 codes you can convert
$enc = [system.Text.Encoding]::UTF8
$PathInvalidChars | foreach { $enc.GetBytes($_) }
$FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars() #41 chars
$FileOnlyInvalidChars = @(':', '*', '?', '\', '/') #5 chars - as a difference
For anyone looking for a regex:
const BLACKLIST = /[<>:"\/\\|?*]/g;
In Windows 10 (2019), the following characters are forbidden by an error when you try to type them:
A file name can't contain any of the following characters:
\ / : * ? " < > |
Here's a c# implementation for windows based on Christopher Oezbek's answer
It was made more complex by the containsFolder boolean, but hopefully covers everything
/// <summary>
/// This will replace invalid chars with underscores, there are also some reserved words that it adds underscore to
/// </summary>
/// <remarks>
/// https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names
/// </remarks>
/// <param name="containsFolder">Pass in true if filename represents a folder\file (passing true will allow slash)</param>
public static string EscapeFilename_Windows(string filename, bool containsFolder = false)
{
StringBuilder builder = new StringBuilder(filename.Length + 12);
int index = 0;
// Allow colon if it's part of the drive letter
if (containsFolder)
{
Match match = Regex.Match(filename, @"^\s*[A-Z]:\\", RegexOptions.IgnoreCase);
if (match.Success)
{
builder.Append(match.Value);
index = match.Length;
}
}
// Character substitutions
for (int cntr = index; cntr < filename.Length; cntr++)
{
char c = filename[cntr];
switch (c)
{
case '\u0000':
case '\u0001':
case '\u0002':
case '\u0003':
case '\u0004':
case '\u0005':
case '\u0006':
case '\u0007':
case '\u0008':
case '\u0009':
case '\u000A':
case '\u000B':
case '\u000C':
case '\u000D':
case '\u000E':
case '\u000F':
case '\u0010':
case '\u0011':
case '\u0012':
case '\u0013':
case '\u0014':
case '\u0015':
case '\u0016':
case '\u0017':
case '\u0018':
case '\u0019':
case '\u001A':
case '\u001B':
case '\u001C':
case '\u001D':
case '\u001E':
case '\u001F':
case '<':
case '>':
case ':':
case '"':
case '/':
case '|':
case '?':
case '*':
builder.Append('_');
break;
case '\\':
builder.Append(containsFolder ? c : '_');
break;
default:
builder.Append(c);
break;
}
}
string built = builder.ToString();
if (built == "")
{
return "_";
}
if (built.EndsWith(" ") || built.EndsWith("."))
{
built = built.Substring(0, built.Length - 1) + "_";
}
// These are reserved names, in either the folder or file name, but they are fine if following a dot
// CON, PRN, AUX, NUL, COM0 .. COM9, LPT0 .. LPT9
builder = new StringBuilder(built.Length + 12);
index = 0;
foreach (Match match in Regex.Matches(built, @"(^|\\)\s*(?<bad>CON|PRN|AUX|NUL|COM\d|LPT\d)\s*(\.|\\|$)", RegexOptions.IgnoreCase))
{
Group group = match.Groups["bad"];
if (group.Index > index)
{
builder.Append(built.Substring(index, match.Index - index + 1));
}
builder.Append(group.Value);
builder.Append("_"); // putting an underscore after this keyword is enough to make it acceptable
index = group.Index + group.Length;
}
if (index == 0)
{
return built;
}
if (index < built.Length - 1)
{
builder.Append(built.Substring(index));
}
return builder.ToString();
}
The only illegal Unix characters may be / and NUL, but some consideration should also be given to how the command line interprets file names.
For example, while it might be legal to name a file 1>&2 or 2>&1 in Unix, file names such as this might be misinterpreted when used on a command line.
Similarly it might be possible to name a file $PATH, but when trying to access it from the command line, the shell will translate $PATH to its variable value.
The .NET Framework System.IO provides the following functions for invalid file system characters:
Path.GetInvalidFileNameChars
Path.GetInvalidPathChars
Those functions should return appropriate results depending on the platform the .NET runtime is running in. That said, the Remarks in the documentation pages for those functions say:
The array returned from this method is not guaranteed to contain the
complete set of characters that are invalid in file and directory
names. The full set of invalid characters can vary by file system.
I always assumed that banned characters in Windows filenames meant that all exotic characters would also be outlawed. The inability to use ?, / and : in particular irked me. One day I discovered that it was virtually only those chars which were banned. Other Unicode characters may be used. So the nearest Unicode characters to the banned ones I could find were identified and MS Word macros were made for them as Alt+?, Alt+: etc. Now I form the filename in Word, using the substitute chars, and copy it to the Windows filename. So far I have had no problems.
Here are the substitute chars (Alt + the decimal Unicode) :
⃰ ⇔ Alt8432
⁄ ⇔ Alt8260
⃥ ⇔ Alt8421
∣ ⇔ Alt8739
ⵦ ⇔ Alt11622
⮚ ⇔ Alt11162
‽ ⇔ Alt8253
፡ ⇔ Alt4961
‶ ⇔ Alt8246
″ ⇔ Alt8243
As a test I formed a filename using all of those chars and Windows accepted it.
This is good enough for me in Python:
import re

def fix_filename(name, max_length=255):
    """
    Replace invalid characters on Linux/Windows/MacOS with underscores.
    List from https://stackoverflow.com/a/31976060/819417
    Trailing spaces & periods are ignored on Windows.
    >>> fix_filename("  COM1  ")
    '_ COM1 _'
    >>> fix_filename("COM10")
    'COM10'
    >>> fix_filename("COM1,")
    'COM1,'
    >>> fix_filename("COM1.txt")
    '_.txt'
    >>> all('_' == fix_filename(chr(i)) for i in list(range(32)))
    True
    """
    # One pass: forbidden and control characters, reserved device names at the
    # start (alone or before a dot), a leading space, a trailing space or dot.
    return re.sub(r'[/\\:|<>"?*\0-\x1f]|^(AUX|COM[1-9]|CON|LPT[1-9]|NUL|PRN)(?![^.])|^\s|[\s.]$', "_", name[:max_length], flags=re.IGNORECASE)
See also this outdated list for additional legacy stuff like = in FAT32.
As of 18/04/2017, no simple black or white list of characters and filenames is evident among the answers to this topic - and there are many replies.
The best suggestion I could come up with was to let the user name the file however he likes. Using an error handler when the application tries to save the file, catch any exceptions, assume the filename is to blame (obviously after making sure the save path was ok as well), and prompt the user for a new file name. For best results, place this checking procedure within a loop that continues until either the user gets it right or gives up. Worked best for me (at least in VBA).
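The same try-and-ask-again loop can be sketched in Python rather than VBA. Everything here is an illustrative assumption: the function name save_with_retry, the ask_name callback that stands in for prompting the user, and the limit of three attempts. As noted in other answers, an exception does not prove the name was at fault, and a name containing a path separator could point outside the target directory, so real code would need more checks.

```python
import os
import tempfile


def save_with_retry(directory, data, ask_name, max_attempts=3):
    """Ask for names until one can be written; return the path or None."""
    for _ in range(max_attempts):
        name = ask_name()
        path = os.path.join(directory, name)
        try:
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(data)
            return path
        except (OSError, ValueError):
            # OSError: the operating system rejected the path.
            # ValueError: Python itself rejected it (e.g. an embedded NUL).
            continue
    return None


# Demonstration with scripted answers instead of a real prompt.
answers = iter(["bad\0name", "good.txt"])
target_dir = tempfile.mkdtemp()
print(save_with_retry(target_dir, "hello", lambda: next(answers)))
```

Passing the prompt in as a callback keeps the loop independent of how the user is actually asked, whether that is a dialog box or a command-line prompt.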
In Unix shells, you can quote almost every character in single quotes '. Except the single quote itself, and you can't express control characters, because \ is not expanded. Accessing the single quote itself from within a quoted string is possible, because you can concatenate strings with single and double quotes, like 'I'"'"'m' which can be used to access a file called "I'm" (double quote also possible here).
So you should avoid all control characters, because they are too difficult to enter in the shell. The rest still is funny, especially files starting with a dash, because most commands read those as options unless you have two dashes -- before, or you specify them with ./, which also hides the starting -.
If you want to be nice, don't use any of the characters the shell and typical commands use as syntactical elements, sometimes position dependent, so e.g. you can still use -, but not as first character; same with ., you can use it as first character only when you mean it ("hidden file"). When you are mean, your file names are VT100 escape sequences ;-), so that an ls garbles the output.
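When a program has to build a shell command around an awkward file name, the quoting rules above do not need to be applied by hand. A small sketch in Python using the standard library's shlex.quote; the wrapper name shell_safe_argument and the decision to prefix names starting with a dash with ./ are assumptions made for illustration.

```python
import shlex


def shell_safe_argument(filename):
    """Quote filename for a POSIX shell; guard against a leading dash."""
    if filename.startswith("-"):
        # ./-name is the same file, but commands no longer see an option.
        filename = "./" + filename
    return shlex.quote(filename)


print(shell_safe_argument("I'm"))
print(shell_safe_argument("$PATH"))
print(shell_safe_argument("-rf"))
```

shlex.quote wraps the name in single quotes only when it contains characters the shell would interpret, and it handles an embedded single quote with the same concatenation trick described above.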
When creating internet shortcuts in Windows, to create the file name, it skips illegal characters, except for forward slash, which is converted to minus.
I had the same need and was looking for recommendations or standard references and came across this thread. My current blacklist of characters that should be avoided in file and directory names is:
$CharactersInvalidForFileName = {
"pound" -> "#",
"left angle bracket" -> "<",
"dollar sign" -> "$",
"plus sign" -> "+",
"percent" -> "%",
"right angle bracket" -> ">",
"exclamation point" -> "!",
"backtick" -> "`",
"ampersand" -> "&",
"asterisk" -> "*",
"single quotes" -> "'",
"pipe" -> "|",
"left bracket" -> "{",
"question mark" -> "?",
"double quotes" -> "\"",
"equal sign" -> "=",
"right bracket" -> "}",
"forward slash" -> "/",
"colon" -> ":",
"back slash" -> "\\",
"blank spaces" -> " ",
"at sign" -> "@"
};

Julia Box plots, not reading columns where the csv file column that the name has spaces and parenthesis but has no problem reading 1word column title

So here's the code in Julia
using CSV
using DataFrames
using PlotlyJS
df= CSV.read("path", DataFrame)
plot(df, x=:Age, kind="box")
#I DO get the box plot for this one, because in the csv that column is headed with "Age"
plot(df, x=:Annual Income (k$), kind="box")
ERROR: syntax: missing comma or ) in argument list
Stacktrace:
[1] top-level scope
@ none:1
#here I get an error asking about syntax, but I don't understand since the x= part is exactly what the column is labeled. If I try 'x=:Annual' I get a box plot of nothing, but the column title is "Annual Income (k$)".
Help is greatly appreciated!
Reference: https://plotly.com/julia/box-plots/
Try:
plot(df, x=Symbol("Annual Income (k\$)"), kind="box")
The : syntax constructs a Symbol, but only up to the next space. So :Annual Income (k$) says to build the Symbol Symbol("Annual"), but then leaves the Income (k$) parts dangling. Instead you can explicitly construct the Symbol yourself like above.
The backslash before the $ symbol is because Julia uses $ usually for interpolation, and here we want to use the raw $ character itself. You can also do plot(df, x=Symbol(raw"Annual Income (k$)"), kind="box") instead, as no interpolation happens inside raw"" strings.

How to convert a string to UTF-16 format in google app script

I am writing a google app script which converts a string to UTF-16 unicode format. For example
Input:Hello World
Output:\u0048\u0065\u006c\u006c\u006f \u0057\u006f\u0072\u006c\u0064
I actually want the script to convert the column containing Arabic words in the Google Docs spreadsheet to UTF-16 format. Like-
Input: مرحبا بالعالم
Output: \u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645
Is there any way I could do this in Google Apps Script? If yes, please point me in the right direction.
SO is not the place to have a complete script written for you from scratch but a formula might help you get started:
=TEXTJOIN(,,ArrayFormula(lower("\u0"&DEC2HEX(CODE(SPLIT(regexreplace(A1,"(\D)","$1\"),"\"))))))
The above though recognises a space.
REGEXREPLACE here 'captures' (the ( )) each individual non digit character (the class \D) in A1 and appends to each element of the captured group ($1) a backslash. SPLIT parses the result of REGEXREPLACE at each \. CODE converts the characters into decimal map values which DEC2HEX then converts to signed hexadecimal format for appending to \u0 with the concatenation operator &. LOWER converts the alphabetic elements returned by DEC2HEX as capitals into lower case. SPLIT created an array so ARRAYFORMULA is required for the functions to process all the individual elements (eg DEC2HEX is a non-array function). TEXTJOIN then stitches all the pieces together and is used with defaults for the first two parameters.
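To make the target transformation concrete, here is a sketch of it in Python (not Apps Script, and not a drop-in replacement for the formula). It encodes each character as UTF-16 and writes every 16-bit code unit as a four-digit \u escape, leaving spaces untouched as in the question's examples; the function name to_utf16_escapes and the keep parameter are hypothetical. Characters outside the Basic Multilingual Plane come out as two escapes, a surrogate pair.

```python
def to_utf16_escapes(text, keep=" "):
    """Return text with every character written as \\uXXXX UTF-16 escapes.

    Characters listed in keep (by default the space) are copied unchanged.
    """
    pieces = []
    for ch in text:
        if ch in keep:
            pieces.append(ch)
            continue
        # Big-endian UTF-16 without a byte order mark: 2 bytes per code unit.
        data = ch.encode("utf-16-be")
        for i in range(0, len(data), 2):
            pieces.append("\\u{:02x}{:02x}".format(data[i], data[i + 1]))
    return "".join(pieces)


print(to_utf16_escapes("Hello World"))
```

Padding every code unit to four hexadecimal digits matters for ASCII input: H is 0x48, which must be written as \u0048.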

How to get values from JSON file using AppleScript?

In reference to this question,
How to download and get values from JSON file using VBScript or batch file?
how can I get the values from a JSON file that looks like this,
["AA-BB-CC-MAKE-SAME.json","SS-ED-SIXSIX-TENSE.json","FF-EE-EE-EE-WW.json","ZS-WE-AS-FOUR-MINE.json","DD-RF-LATERS-LATER.json","FG-ER-DC-ED-FG.json"]
using AppleScript on macOS?
Here is part of the VBScript code for Windows provided by Hackoo:
strJson = http.responseText
Result = Extract(strJson,"(\x22(.*)\x22)")
Arr = Split(Result,",")
For each Item in Arr
    wscript.echo Item
Next
'******************************************
Function Extract(Data,Pattern)
    Dim oRE,oMatches,Match,Line
    set oRE = New RegExp
    oRE.IgnoreCase = True
    oRE.Global = True
    oRE.Pattern = Pattern
    set oMatches = oRE.Execute(Data)
    If not isEmpty(oMatches) then
        For Each Match in oMatches
            Line = Line & Trim(Match.Value) & vbCrlf
        Next
        Extract = Line
    End if
End Function
'******************************************
In AppleScript on macOS, I only need the code to get the values of the JSON file into a single array of string values. The example shown above the VBScript is what the JSON file contents look like.
Short answer: Unfortunately, AppleScript doesn't provide a built-in feature to parse JSON which is analogous to JavaScript's JSON.parse() method.
Below are a couple of solutions:
Solution 1: Requires a third party plug-in to be installed, which may not always be feasible.
Solution 2: Does not require any third party plug-in to be installed, and instead utilizes tools/features built-in to macOS as standard.
Solution 1:
If you have the luxury of being able to install a third-party plug-in on your users' systems, then you can install JSON Helper for AppleScript (as suggested by @user3439894 in the comments).
Then use it in your AppleScript as follows:
set srcJson to read POSIX file (POSIX path of (path to home folder) & "Desktop/foobar.json")
tell application "JSON Helper" to set myList to read JSON from srcJson
Explanation:
On line 1 we read the contents of the .json file and assign it to the variable named srcJson.
Note: You'll need to change the path part (i.e. Desktop/foobar.json) as necessary.
On line 2 we parse the contents using the JSON Helper plug-in. This assigns each item of the source JSON Array to a new AppleScript list. The resultant AppleScript list is assigned to a variable named myList.
Solution 2:
By utilizing tools built-in to macOS as standard, you can also do the following via AppleScript. This assumes that your JSON file is valid and contains a single Array only:
set TID to AppleScript's text item delimiters
set AppleScript's text item delimiters to ","
set myList to text items of (do shell script "tr ''\\\\n\\\\r'' ' ' <~/Desktop/foobar.json | sed 's/^ *\\[ *\"//; s/ *\" *\\] *$//; s/\" *, *\"/,/g;'")
set AppleScript's text item delimiters to TID
Note: you'll need to change the path part (i.e. ~/Desktop/foobar.json) as necessary.
Also, if your .json filename includes a space(s) you'll need to escape them with \\. For instance ~/Desktop/foo\\ bar.json
Explanation:
On line 1 AppleScript's current text item delimiters are assigned to a variable named TID.
On line 2 AppleScript's text item delimiters are set to a comma - this will help when extracting each individual value from the source JSON Array and assigning each value to a new AppleScript list.
On line 3 a shell script is executed via the do shell script command, which performs the following:
Reads the content of the source .json file via the part which reads ~/Desktop/foobar.json. This path currently assumes the file is named foobar.json and resides in your Desktop folder (You'll need to change this path to wherever your actual file exists).
The content of foobar.json is redirected (note the < before the filepath) to tr (i.e. the part which reads: tr ''\\\\n\\\\r'' ' '). This translation replaces any newline characters which may exist in the contents of the source .json Array with space characters. This ensures the contents of foobar.json are transformed to one line.
Note: A JSON Array can contain newlines between each item and still be valid, so although the example JSON given in your question appears on one line - it is not a requirement of this solution as it will handle multi-line too.
The one line of text is then piped to sed's s command for further processing (i.e. the part which reads: | sed 's/^ *\\[ *\"//; s/ *\" *\\] *$//; s/\" *, *\"/,/g;').
The syntax of the s command is 's/regexp/replacement/flags'.
Let's break down each s command to further understand what is happening:
s/^ *\\[ *\"// removes the opening square bracket [, which may be preceded or followed by zero or more space characters, and the following double quote (i.e. the first occurrence) from the beginning of the string.
s/ *\" *\\] *$// removes the closing square bracket ], which may be preceded or followed by zero or more space characters, and the preceding double quote (i.e. the last occurrence) from the end of the string.
s/\" *, *\"/,/g replaces each closing double quote, comma, and opening double quote found between two items (the comma may be preceded and/or followed by zero or more spaces) with a single comma.
The initial part on line 3, which reads set myList to text items of ..., utilizes text items to read the String into an AppleScript list, using commas as delimiters to determine each item of the list. The resultant list is assigned to a variable named myList.
On line 4 AppleScript's text item delimiters are restored to their original value.
Utilizing a variable for the source JSON filepath.
If you want to utilize a variable for the filepath to the source .json file then you can do something like this instead:
set srcFilePath to quoted form of (POSIX path of (path to home folder) & "Desktop/foobar.json")
set TID to AppleScript's text item delimiters
set AppleScript's text item delimiters to ","
set myList to text items of (do shell script "tr ''\\\\n\\\\r'' ' ' <" & srcFilePath & " | sed 's/^ *\\[ *\"//; s/ *\" *\\] *$//; s/\" *, *\"/,/g;'")
set AppleScript's text item delimiters to TID
Note: This is very much the same as the first example. The notable differences are:
On the first line we assign the filepath to a variable named srcFilePath.
In the do shell script we reference the srcFilePath variable.
Additional note regarding JSON escaped special characters: Solution 2 preserves any JSON escaped special characters which may be present in the values of the source JSON array. However, Solution 1 will interpret them.
Caveats: Solution 2 produces unexpected results when an item in the source JSON array includes a comma, because the comma is used as the text item delimiter.
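To make the text processing in Solution 2 and its comma caveat easier to follow, here is a rough sketch of the same steps in Python. It is not part of the AppleScript solution; the function name split_json_string_array is hypothetical, and the regular expressions are simply the three sed expressions from above with the AppleScript escaping removed:

```python
import json
import re


def split_json_string_array(text):
    """Mimic the tr | sed pipeline: flatten, strip brackets/quotes, split on commas."""
    # tr step: turn newlines and carriage returns into spaces.
    one_line = text.replace("\n", " ").replace("\r", " ")
    # First s command: drop the opening bracket and the first double quote.
    one_line = re.sub(r'^ *\[ *"', "", one_line)
    # Second s command: drop the last double quote and the closing bracket.
    one_line = re.sub(r' *" *\] *$', "", one_line)
    # Third s command: collapse each quote-comma-quote boundary into one comma.
    one_line = re.sub(r'" *, *"', ",", one_line)
    # text items step: split on the comma delimiter.
    return one_line.split(",")


multi_line = '[\n "a.json",\n "b.json" ,"c.json"\n]\n'
print(split_json_string_array(multi_line))   # ['a.json', 'b.json', 'c.json']

# The caveat: a comma inside an item is indistinguishable from a delimiter.
tricky = '["a,b","c"]'
print(split_json_string_array(tricky))       # ['a', 'b', 'c']
print(json.loads(tricky))                    # ['a,b', 'c']
```

The last two lines show why a real JSON parser is the safer choice whenever the values might contain commas.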
How to get the values from JSON file that looks like this,
["AA-BB-CC-MAKE-SAME.json","SS-ED-SIXSIX-TENSE.json","FF-EE-EE-EE-WW.json","ZS-WE-AS-FOUR-MINE.json","DD-RF-LATERS-LATER.json","FG-ER-DC-ED-FG.json"]
If you actually mean what you wrote, and that the contents of the JSON file is that list of six strings in a single array, formatted on a single line, the simplest way is to treat it as text, trim the opening and closing square brackets, then delimit its fields at every occurrence of a ,. Finally, each individual text item can have the surrounding quotes trimmed as well.
Examining the VBScript, it looks like it uses a very similar process, albeit with regular expressions, which AppleScript doesn't feature but which aren't especially necessary in this simple situation.
Let's assume that the JSON array above is stored in a file on your desktop called "myfile.json". Then:
set home to the path to home folder
set f to the POSIX path of home & "Desktop/myfile.json"
set JSONstr to read POSIX file f
# Trim square brackets
set JSONstr to text 2 thru -2 of JSONstr
# Delimit text fields using comma
set the text item delimiters to ","
set Arr to the text items of JSONstr
# Trim quotes of each item in Arr
repeat with a in Arr
    set contents of a to text 2 thru -2 of a
end repeat
# The final array
Arr
I only need the code to get the values of the JSON file into a single array of string values. The example shown above the VBScript is what the JSON file contents look like.
The variable Arr now contains the array (referred to as lists in AppleScript) of string values. You can access a particular item in it like this:
item 2 of Arr --> "SS-ED-SIXSIX-TENSE.json"
A More General Solution
I've decided to include a more advanced way to handle JSON in an AppleScript, partly because I've been doing a lot of JSON processing quite recently and this is all fresh on my event horizon; but also to demonstrate that, using AppleScriptObjC, parsing even very complex JSON data is not only possible, but quite simple.
I don't think you'll need it in this specific case, but it could come in useful for some future situation.
The script has three sections: it starts off by importing the relevant Objective-C framework that gives AppleScript additional powers; then I define the actual handler itself, called JSONtoRecord, which I describe below. Lastly comes the bottom of the script, where you can enter your code and do whatever you like with it:
use framework "Foundation"
use scripting additions
--------------------------------------------------------------------------------
property ca : a reference to current application
property NSData : a reference to ca's NSData
property NSDictionary : a reference to ca's NSDictionary
property NSJSONSerialization : a reference to ca's NSJSONSerialization
property NSString : a reference to ca's NSString
property NSUTF8StringEncoding : a reference to 4
--------------------------------------------------------------------------------
on JSONtoRecord from fp
    local fp
    set JSONdata to NSData's dataWithContentsOfFile:fp
    set [x, E] to (NSJSONSerialization's ¬
        JSONObjectWithData:JSONdata ¬
        options:0 ¬
        |error|:(reference))
    if E ≠ missing value then error E
    tell x to if its isKindOfClass:NSDictionary then ¬
        return it as record
    x as list
end JSONtoRecord
--------------------------------------------------------------------------------
###YOUR CODE BELOW HERE
#
#
set home to the path to home folder
set f to the POSIX path of home & "Desktop/myfile.json"
JSONtoRecord from f
--> {"AA-BB-CC-MAKE-SAME.json", "SS-ED-SIXSIX-TENSE.json", ¬
--> "FF-EE-EE-EE-WW.json", "ZS-WE-AS-FOUR-MINE.json", ¬
--> "DD-RF-LATERS-LATER.json", "FG-ER-DC-ED-FG.json"}
At the bottom of the script, I've called the JSONtoRecord handler, passing it the location of myfile.json. One of the benefits of this handler is that it doesn't matter whether the file is formatted all on one line, or over many lines. It can also handle complex, nested JSON arrays.
In those instances, what it returns is a native AppleScript record object, with all the JSON variables stored as property values in the record. Accessing the variables then becomes very simple.
This is actually exactly what the JSON Helper application that a couple of people have already mentioned does under the hood.
The one criterion (other than the JSON file containing valid JSON data) is that the path to the file is a POSIX path written in full, e.g. /Users/CK/Desktop/myfile.json, and not ~/Desktop/myfile.json or, even worse, Macintosh HD:Users:CK:Desktop:myfile.json.
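The two points above (a top-level object versus a top-level array, and the need for a fully written path) are not specific to AppleScript. As a loose analogy only, here is how the same idea looks in Python; the helper name load_json and the temporary demo file are made up for this sketch:

```python
import json
import os
import tempfile


def load_json(path):
    """Parse a JSON file: a top-level object becomes a dict, an array a list."""
    # open() does not expand "~" by itself, so resolve it to a full path first.
    full_path = os.path.expanduser(path)
    with open(full_path, encoding="utf-8") as handle:
        return json.load(handle)


# Demo with a temporary multi-line file, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "myfile.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('[\n  "AA-BB-CC-MAKE-SAME.json",\n  "SS-ED-SIXSIX-TENSE.json"\n]\n')
    names = load_json(demo_path)

print(names)  # ['AA-BB-CC-MAKE-SAME.json', 'SS-ED-SIXSIX-TENSE.json']
```

As with the handler, it makes no difference whether the file is formatted on one line or many.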

What is the best value for "Unit Separator" in XML?

I use the Unit Separator (US/0x1f) in a database. When I export to an XML 1.0 file, it is not accepted and the attribute is left with an empty value.
I have data in database like this:
"option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"
I expect to export it to an XML 1.0 file like this:
<elementname attr1="option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"/>
However, the [US] is not accepted by XML 1.0. Any suggestions?
I can replace '\37' (oct 37, hex 1f) with something like "XXX", "$", "(0x1f)"... before writing to XML,
and replace it back when importing from XML and writing to the database. However, if I replace it with "&#x1F;", which is the HTML Entity for Unit Separator, I end up with "&amp;#x1F;", which is definitely not what I wanted.
If I manually modify the XML file to contain "&#x1F;", I cannot load it with MSXML, which gives the error "Invalid Unicode Character".
Any suggestions?
Thank you
Summary:
Let's make an analogy with how a compiler works. There are two phases: "Pre-compile" and "Compile".
XML file generation acts like the "Compile" phase, e.g. it converts "<" to "&lt;".
However, the Unit Separator is not supported by XML 1.0, so the "Compile" phase will not convert it to the HTML Entity "&#x1F;".
So we have to seek a solution in the "Pre-compile" phase, which is our own application's responsibility.
When writing:
Option1: <unit>aaa</unit><unit>bbb</unit>
Option2: simply use "_x241F_" to replace "\37" in the string, provided "_x241F_" does not conflict with any existing token in the string.
When reading:
According to Option1: load the elements and concatenate them into a single string with "\37" as the separator.
According to Option2: simply use "\37" to replace "_x241F_".
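A minimal sketch of both options, written in Python rather than against MSXML. The helper names and the wrapper element name option3 are assumptions made only for this illustration; the token "_x241F_" and the <unit> elements come from the options above:

```python
import xml.etree.ElementTree as ET

US = "\x1f"          # Unit Separator (octal 37)
TOKEN = "_x241F_"    # stand-in token from Option2


def encode_us(value):
    """Option2, writing: swap each Unit Separator for the token."""
    if TOKEN in value:
        # The token must not already occur in the data, or decoding is ambiguous.
        raise ValueError("token already present in the data")
    return value.replace(US, TOKEN)


def decode_us(value):
    """Option2, reading: swap the token back to the Unit Separator."""
    return value.replace(TOKEN, US)


def units_to_xml(value, tag="option3"):
    """Option1, writing: one <unit> child per separated piece."""
    parent = ET.Element(tag)  # hypothetical wrapper element
    for piece in value.split(US):
        ET.SubElement(parent, "unit").text = piece
    return parent


def xml_to_units(parent):
    """Option1, reading: join the <unit> children with the Unit Separator."""
    return US.join(unit.text or "" for unit in parent.findall("unit"))


raw = "aaa" + US + "bbb" + US + "ccc"
print(encode_us(raw))
# aaa_x241F_bbb_x241F_ccc
print(ET.tostring(units_to_xml(raw), encoding="unicode"))
# <option3><unit>aaa</unit><unit>bbb</unit><unit>ccc</unit></option3>
```

Both directions round-trip: decode_us(encode_us(raw)) and xml_to_units(units_to_xml(raw)) give back the original string.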
I've also found out that MSXML (even the latest version, MSXML6.dll) will not load XML 1.1.
So if we are unfortunately using MSXML, we have to write our own "Pre-compile" code to handle such Unicode characters before feeding the "Compile" phase.
Note: I borrowed the idea of "_x241F_" from here.
Thanks for everyone's help
There is no HTML entity for U+001F UNIT SEPARATOR. Besides, HTML entities would be irrelevant when dealing with generic XML.
The character references would be &#31; and &#x1F;, in HTML and in XML, but the character is not allowed in HTML or in XML. For XML 1.0, which this seems to be about, please refer to section 2.2 Characters, where the normative definition is the following production (the associated comment is misleading, and comments are non-normative):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
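The production can be turned into a small check. The following Python sketch (the function name is_xml10_char is made up here) tests a single character against exactly the ranges listed above:

```python
def is_xml10_char(ch):
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(is_xml10_char("\x1f"))  # False: UNIT SEPARATOR is below #x20 and not tab, LF or CR
print(is_xml10_char("\t"))    # True: #x9 is explicitly allowed
```

Such a check could be run over the data before export to find every character that needs special handling, not just U+001F.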
The conclusions to be drawn depend on the meaning and purpose of UNIT SEPARATOR in the text. It has no generally defined meaning; it is up to applications to assign a meaning to it and process it accordingly.
Usually UNIT SEPARATOR is used to separate units of some kind, so the natural approach would be to process the incoming data so that instead of such separators, the data, when converted to XML format, has units denoted by markup. So for data like aaa[US]bbb[US]ccc where [US] is UNIT SEPARATOR, you would generate something like <unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>.
This website
http://www.fileformat.info/info/unicode/char/1f/index.htm
suggests one of the following:
HTML Entity (decimal) &#31;
HTML Entity (hex) &#x1f;