PowerShell converting "&" to \u0026 when converting values to JSON

RoleFullPath (CSV column):
Applications\User Admin & Support-DEMO
PowerShell code:
$NewJSON.roleFullPath = $Line.RoleFullPath
.
.
.
.
$JSONPath = $RolePath + $FolderName + "-JSON.json"
ConvertTo-Json $NewJSON | Out-File -Encoding "UTF8" $JSONPath
Output:
"roleFullPath": "Applications\\User Admin \u0026 Support-DEMO"
While converting from CSV to JSON, the character '&' is getting converted to '\u0026'.
Any help?

In Windows PowerShell v5.1, ConvertTo-Json indeed unexpectedly encodes & characters as the Unicode escape sequence \u0026, in which 0026 is the hex form of the character's Unicode code point, U+0026.
(PowerShell Core, by contrast, preserves the & as-is.)
That said, JSON parsers should be able to interpret such escape sequences and, indeed, the complementary ConvertFrom-Json cmdlet does.
Note: The solutions below are general ones that can handle the Unicode escape sequences of any Unicode character; since ConvertTo-Json seemingly only uses these Unicode escape-sequence representations for the characters &, ', < and >, a simpler solution is possible, unless false positives must be ruled out - see this answer.
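For illustration, such a targeted solution might look like the following sketch (it assumes the JSON contains no false positives, i.e. no \u-style sequences preceded by an escaped backslash):
# Sketch: undo only the four escape sequences that Windows PowerShell's
# ConvertTo-Json emits for these literal characters.
$json = ConvertTo-Json $NewJSON
$json = $json -replace '\\u0026', '&' `
              -replace '\\u0027', "'" `
              -replace '\\u003c', '<' `
              -replace '\\u003e', '>'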
That said, if you do want to manually convert Unicode escape sequences into their character equivalents in JSON text, you can use the following limited solution:
# Sample JSON with Unicode escapes.
$json = '{ "roleFullPath": "Applications\\User Admin \u0026 Support-DEMO" }'
# Replace Unicode escapes with the chars. they represent,
# with limitations.
[regex]::replace($json, '\\u[0-9a-fA-F]{4}', {
    param($match) [char] [int] ('0x' + $match.Value.Substring(2))
})
The above yields:
{ "roleFullPath": "Applications\\User Admin & Support-DEMO" }
Note how \u0026 was converted to the char. it represents, &.
A robust solution requires more work:
There are characters that must be escaped in JSON and cannot be represented literally, so for the conversion to literal characters to work generically, these characters must be excluded.
Additionally, false positives must be avoided; e.g., \\u0026 is not a valid Unicode escape sequence, because a JSON parser interprets \\ as an escaped \ followed by verbatim u0026.
Finally, the Unicode sequences for " and \ must be translated into their escaped forms, \" and \\, and it is possible to represent a few ASCII-range control characters by C-style escape sequences, e.g., \t for a tab character (\u0009).
The following robust solution addresses all these issues:
# Sample JSON with Unicode escape sequences:
# \u0026 is &, which CAN be converted to the literal char.
# \u000a is a newline (LF) character, which CANNOT be converted, but can
# be translated to escape sequence "\n"
# \\u0026 is *not* a Unicode escape sequence and must be preserved as-is.
$json = '{
  "roleFullPath": "Applications\u000aUser Admin \u0026 Support-DEMO-\\u0026"
}'
[regex]::replace($json, '(?<=(?:^|[^\\])(?:\\\\)*)\\u([0-9a-fA-F]{4})', {
    param($match)
    $codePoint = [int] ('0x' + $match.Groups[1].Value)
    if ($codePoint -in 0x22, 0x5c) {
      # " or \ must be \-escaped.
      '\' + [char] $codePoint
    }
    elseif ($codePoint -in 0x8, 0x9, 0xa, 0xc, 0xd) {
      # Control chars. that can be represented as short, C-style escape sequences.
      ('\b', '\t', '\n', $null, '\f', '\r')[$codePoint - 0x8]
    }
    elseif ($codePoint -le 0x1f -or [char]::IsSurrogate([char] $codePoint)) {
      # Other control chars. and halves of surrogate pairs must be retained
      # as escape sequences.
      # (Converting surrogate pairs to a single char. would require much more effort.)
      $match.Value
    }
    else {
      # Translate to literal char.
      [char] $codePoint
    }
})
Output:
{
"roleFullPath": "Applications\nUser Admin & Support-DEMO-\\u0026"
}

To stop PowerShell from doing this, pipe your JSON output through this:
$jsonOutput | ForEach-Object { [System.Text.RegularExpressions.Regex]::Unescape($_) } | Set-Content $jsonPath -Encoding UTF8;
This will prevent the & being converted :)
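Applied to the question's code, that might look like this sketch ($NewJSON and $JSONPath are the question's variables). Note that [Regex]::Unescape() rewrites every backslash escape, not just \u0026, so this is only safe when the JSON contains no escape sequences, such as \n or \", that must be preserved:
# Sketch: unescape the ConvertTo-Json output before writing it to disk.
$jsonOutput = ConvertTo-Json $NewJSON
$jsonOutput |
    ForEach-Object { [System.Text.RegularExpressions.Regex]::Unescape($_) } |
    Set-Content $JSONPath -Encoding UTF8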


I need to extract data from a single line of JSON data which is in between two variables (PowerShell)
My variables:
in front of the data:
DeviceAddresses":[{"Id":
after the data:
,"
I tried this, but there must be some error because of all the special characters I'm using:
$devicepattern = {DeviceAddresses":[{"Id":{.*?},"}
#$deviceid = [regex]::match($changeduserdata, $devicepattern).Groups[1].Value
#$deviceid
As you've found, some character literals can't be used as-is in a regex pattern because they carry special meaning - we call these meta-characters.
In order to match the corresponding character literal in an input string, we need to escape it with \ -
to match a literal (, we use the escape sequence \(,
for a literal }, we use \}, and so on...
Fortunately, you don't need to know or remember which ones are meta-characters or escapable sequences - we can use Regex.Escape() to escape all the special character literals in a given pattern string:
$prefix = [regex]::Escape('DeviceAddresses":[{"Id":')
$capture = '(.*?)'
$suffix = [regex]::Escape(',"')
$devicePattern = "${prefix}${capture}${suffix}"
You also don't need to call [regex]::Match directly; PowerShell will populate the automatic $Matches variable with match groups whenever a scalar -match succeeds:
if ($changeduserdata -match $devicePattern) {
    $deviceid = $Matches[1]
} else {
    Write-Error 'DeviceID not found'
}
For reference, the following ASCII literals need to be escaped in .NET's regex grammar:
$ ( ) * + . ? [ \ ^ { |
Additionally, # and the regular space character need to be escaped, and a number of other whitespace characters have to be translated to their respective escape sequences to make patterns safe for use with the IgnorePatternWhitespace option (this is not applicable to your current scenario):
\u0009 => '\t' # Tab
\u000A => '\n' # Line Feed
\u000C => '\f' # Form Feed
\u000D => '\r' # Carriage Return
... all of which Regex.Escape() takes into account for you :)
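For instance, a quick sketch of what Regex.Escape() produces for the prefix used above:
# Sketch: only the regex metacharacters get \-escaped.
[regex]::Escape('DeviceAddresses":[{"Id":')
# -> DeviceAddresses":\[\{"Id":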
To complement Mathias R. Jessen's helpful answer:
Generally, note that JSON data is much easier to work with - and processed more robustly - if you parse it into objects whose properties you can access - see the bottom section.
As for your regex attempt:
Note: The following also applies to all PowerShell-native regex features, such as the -match, -replace, and -split operators, the switch statement, and the Select-String cmdlet.
Mathias' answer uses [regex]::Escape() to escape the parts of the regex pattern to be used verbatim by the regex engine.
This is unequivocally the best approach if those verbatim parts aren't known in advance - e.g., when provided via a variable or expression, or passed as an argument.
However, in a regex pattern that is specified as a string literal it is often easier to individually \-escape the regex metacharacters, i.e. those characters that would otherwise have special meaning to the regex engine.
The list of characters that need escaping is (it can be inferred from the .NET Regular-Expression Quick Reference):
\ ( ) | . * + ? ^ $ [ {
If you enable the IgnorePatternWhiteSpace option (which you can do inline with (?x) at the start of a pattern), you'll additionally have to \-escape:
#
significant whitespace characters (those you actually want matched) specified verbatim (e.g., ' ', or, via string interpolation, "`t"); this does not apply to those specified via escape sequences (e.g., \t or \n).
Therefore, the solution could be simplified to:
# Sample JSON
$changeduserdata = '{"DeviceAddresses":[{"Id": 42,"More": "stuff"}]}'
# Note how [ and { are \-escaped
$deviceId = if ($changeduserdata -match 'DeviceAddresses":\[\{"Id":(.*?),"') {
    $Matches[1]
}
Using ConvertFrom-Json to properly parse JSON into objects is both more robust and more convenient, as it allows property access (dot notation) to extract the value of interest:
# Sample JSON
$changeduserdata = '{"DeviceAddresses":[{"Id": 42,"More": "stuff"}]}'
# Convert to an object ([pscustomobject]) and drill down to the property
# of interest; note that the value of .DeviceAddresses is an *array* ([...]).
$deviceId = (ConvertFrom-Json $changeduserdata).DeviceAddresses[0].Id # -> 42

PowerShell convert to JSON removes special characters

I have a problem with saving a string to a JSON file.
$newY = "12313tytk1.xp1`F4i12313211ddsada;"
First I read the JSON file:
$a = Get-Content 'settings.json' -raw | ConvertFrom-Json
Then I update a field:
$a.X.y = $newY
And save the file:
$a | ConvertTo-Json -Depth 5 | Set-Content 'settings.json'
There are some problems:
After saving, the value of Y in the file is wrong:
"12313tytk1.xp1F4i12313211ddsada;"
The special character is missing: `.
The file is badly formatted: too many spaces.
"<" and ">" are changed to \u003c and \u003e.
How do I fix this?
The backtick ` is an escape character in PowerShell. Single-quoted strings (') are string literals, so their contents are not evaluated, escaped, or the like. Double-quoted strings (") are evaluated, so the backtick is interpreted as an escape character. See about_Quoting_Rules for more information.
Consider,
PS C:\> $newY = "12313tytk1.xp1`F4i12313211ddsada;"
PS C:\> $newY # Misses the backtick
12313tytk1.xp1F4i12313211ddsada;
PS C:\> $newY2 = '12313tytk1.xp1`F4i12313211ddsada;'
PS C:\> $newY2 # Contains the backtick
12313tytk1.xp1`F4i12313211ddsada;
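If the string genuinely needs to be double-quoted (e.g., for variable interpolation), a literal backtick can be kept by doubling it, as in this sketch ($newY3 is an illustrative name):
PS C:\> $newY3 = "12313tytk1.xp1``F4i12313211ddsada;" # `` escapes the backtick itself
PS C:\> $newY3 # Contains the backtick
12313tytk1.xp1`F4i12313211ddsada;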

JSON slashes and backslashes in a string on Bourne shell

I am trying to parse JSON files that contain sequences of slashes and backslashes in some of their strings, like this:
echo '{"tag_string":"/\/\/\ test"}' | jq
which gives me:
parse error: Invalid escape at line 1, column 27
I have tried escaping with backslashes at different positions, but I can't seem to find a correct way. How do I output the string as it is, without removing any character or getting errors?
This works in bash, but not in sh (or zsh):
echo '{"tag_string":"/\\/\\/\\ test"}' | jq -r '.tag_string'
/\/\/\ test
A forward slash character is legal, but a single backslash character is not. According to json.org char description, the valid chars are:
char
    any-Unicode-character-except-"-or-\-or-control-character
    \"
    \\
    \/
    \b
    \f
    \n
    \r
    \t
    \u four-hex-digits
So in your example, the single backslashes are not legal; you need "\\", which a JSON parser interprets as a single escaped backslash, or you need to remove them entirely.
If you are trying to include literal backslashes:
(bash)
echo '{"tag_string":"/\\/\\/\\ test"}' | jq
{
"tag_string": "/\\/\\/\\ test"
}
echo '{"tag_string":"/\\/\\/\\ test"}' | jq -r '.["tag_string"]'
/\/\/\ test
(sh)
echo '{"tag_string":"/\\\\/\\\\/\\\\ test"}' | jq -r '.["tag_string"]'
/\/\/\ test
printf "%s" '{"tag_string":"/\\/\\/\\ test"}' | jq -r '.["tag_string"]'
/\/\/\ test
If you are trying to convert a file with non-JSON strings, then consider a tool such as any-json. Using the "cson-to-json" mode, "\/" will be interpreted as "/":
$ any-json -format=cson
Input:
{"tag_string":"/\/\/\ test"}
Output:
{
"tag_string": "/// test"
}

Why is JSON::XS Not Generating Valid UTF-8?

I'm getting some corrupted JSON and I've reduced it down to this test case.
use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;
BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}
foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}
diag ord('»');
done_testing;
And this is the output:
utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
# Failed test 'JSON: {"value":"Deliver «French Bread»"}'
# at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap. Never could quite get that right with Test::Builder).
Switching to JSON::PP produces the same output.
This is Perl 5.18.1 running on OS X Yosemite.
is_sane_utf8 doesn't do what you think it does. You're supposed to pass it strings you've decoded. I'm not sure what the point of it is, but it's not the right tool. If you want to check whether a string is valid UTF-8, you could use
ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
   '$string is valid UTF-8');
To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.
        +--------------------- Start of two-byte sequence
        |    +---------------- Not zero (good)
        |    |     +---------- Continuation byte indicator (good)
        |    |     |
        v    v     v
C2 AB = [110]00010 [10]101011

00010 101011 = 000 1010 1011 = U+00AB = «
The following shows that JSON::XS produces the same output as Encode.pm:
use utf8;
use 5.18.0;
use JSON::XS;
use Encode;
foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));
    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}
Output (with some spaces added):
Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D
Input: U+65E5.672C.56FD
UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D
JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.
Issue 1: Test::utf8
Here are the two main situations when is_sane_utf8 will fail:
You have a miscoded character string that had been decoded from a UTF-8 byte string as if it were Latin-1, or from double-encoded UTF-8; or the character string is perfectly fine and merely looks like a potentially "dodgy" miscoding (using the terminology from its docs).
You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».
The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.
Issue 2: Output Encoding
All of your non-JSON strings are character strings while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are being implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. STDERR however does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look good in your warnings since they're already encoded and being passed straight out.
Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.

How can I regex ,, to ,\N, in my CSVs so that mysqlimport understands them?

Say I have a normal CSV like
# helloworld.csv
hello,world,,,"please don't replace quoted stuff like ,,",,
If I want mysqlimport to understand that some of those fields are NULL, then I need:
# helloworld.mysql.csv
hello,world,\N,\N,"please don't replace quoted stuff like ,,",\N,\N
I got some help from another question -- Why does sed not replace overlapping patterns -- but note the problem:
$ perl -pe 'while (s#,,#,\\N,#) {}' -pe 's/,$/,\\N/g' helloworld.csv
hello,world,\N,\N,"please don't replace quoted stuff like ,\N,",\N,\N
                                                           ^^
How can I write the regex so it doesn't replace ,, if they're between quotes?
FINAL ANSWER
Here's the final perl I used, thanks to the accepted answer below:
perl -pe 's/^,/\\N,/; while (s/,(?=,)(?=(?:[^"]*"[^"]*")*[^"]*$)/,\\N/g) {}; s/,$/,\\N/' helloworld.csv
That takes care of leading, trailing, and unquoted empty strings.
Why not use Text::CSV? You can parse the file with it and then use map to replace empty fields with '\N', e.g.
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag();
$csv->parse($line); # parse a CSV string into fields
my @fields = $csv->fields(); # get the parsed fields
@fields = map { $_ eq "" ? '\N' : $_ } @fields;
$csv->combine(@fields); # combine fields into a string
Assuming that you won't have escaped quotes, you can make sure that you only replace ,, if it's followed by an even number of quotes:
$subject =~
  s/,          # Match ,
    (?=,)      # only if followed by another ,
    (?=        # and only if followed by...
      (?:      # the following group:
        [^"]*" # any number of non-quote characters, followed by one quote
        [^"]*" # the same thing again (even number!)
      )*       # any number of times, followed by
      [^"]*    # any number of non-quotes until...
      $        # end of string.
    )          # End of lookahead assertion
   /,\\N/xg;
Input:
foo,,bar,,,baz,"foo,,,oof",zap,,zip
Output:
foo,\N,bar,\N,\N,baz,"foo,,,oof",zap,\N,zip