Does Ruby's gsub not treat \n as a whitespace character?

I'm trying to perform a simple operation - turn some input text into JSON, process it, and use it further.
require 'json'
aws_region = "us-east-1"
tag = `sudo aws ec2 describe-tags --region="#{aws_region}" --filters "Name=resource-type,Values=instance" "Name=key,Values=Group" "Name=resource-id,Values=$(ec2metadata --instance-id)"`
puts tag
tag_json = tag.to_json.gsub(/\s+/, "")
#tag_json = tag.gsub("\n", "")
puts tag_json
obj = JSON.parse(tag_json)
desired_value = obj["Tags"][0]["Value"]
puts desired_value
I expected the above to strip out all whitespace including newlines, but to my surprise, the output still has newlines in it. The JSON.parse fails with the below error because the newlines are still present. With the additional tag_json assignment above uncommented, it removes the newlines and succeeds.
JSON::ParserError
-----------------
746: unexpected token at '"{\n\"Tags\": [\n{\n\"ResourceType\":
\"instance\", \n\"ResourceId\": \"i-XXXXXX\", \n\"Value\":
\"groupA\", \n\"Key\": \"Group\"\n}\n]\n}\n"'
I end up having to have a separate case for newlines. Why does gsub treat newline characters as non-whitespace? Is there any other expression that will combine all of whitespace, tabs and newlines so I can strip them out?

Maybe it's an encoding issue. Try tag_json = tag.to_json.gsub(/[\s\p{]/, "")
You don't need the + in your expression because gsub removes all occurrences of a single character anyway.
Consider "aaaaaa".gsub(/a/, '') # => ""
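Another possible explanation, judging by the escaped quotes in the error message: the command output is already JSON text, so calling to_json on it encodes the whole thing as one JSON string, and every real newline becomes the two characters backslash and n. A whitespace regex has nothing to match at that point. Here is a minimal Python sketch of the same effect; the sample text in raw is made up for illustration, and json.dumps stands in for Ruby's String#to_json.

```python
import json
import re

# Made-up sample of what the CLI might print: JSON text containing real newlines.
raw = '{\n"Tags": [\n{"Key": "Group", "Value": "groupA"}\n]\n}\n'

# Encoding text that is already JSON wraps it in quotes and turns each real
# newline into the two characters backslash + n.
encoded = json.dumps(raw)
stripped_encoded = re.sub(r"\s+", "", encoded)

# Stripping whitespace from the raw text does remove the real newlines.
stripped_raw = re.sub(r"\s+", "", raw)
value = json.loads(stripped_raw)["Tags"][0]["Value"]

print(stripped_encoded)
print(value)
```

If that is what is happening, parsing the raw output directly (without to_json and without any gsub) should also work, since JSON parsers ignore whitespace between tokens.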

Related

Trying to dump information to a json, but getting double backslashes

I have some info store in a MySQL database, something like: AHmmgZq\n/+AH+G4
We get that using an API, so when I read it in my Python code I get: AHmmgZq\\n/+AH+G4 The backslash is doubled!
Now I need to put that into a JSON file, how can I remove the extra backslash?
EDIT: let me show my full code:
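import json

# file_name is assumed to be defined elsewhere in the asker's code
json_dict = {
    "private_key": "AHmmgZq\\n/+AH+G4"
}
print(json_dict)                 # dict repr: the backslash is shown doubled
print(json_dict['private_key'])  # the string itself: a single backslash
with open(file_name, "w", encoding="utf-8") as f:
    json.dump(json_dict, f, ensure_ascii=False, indent=2)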
In the first print I have the doubled backslash, but in the second one there's only one. When I dump it to the json file it gives me doubled.
"AHmmgZq\\n/+AH+G4" in python is equivalent to the literal string "AHmmgZq\n/+AH+G4". print("AHmmgZq\\n/+AH+G4") => "AHmmgZq\n/+AH+G4"
\n is the newline character in Python, so to represent a literal backslash followed by n, the backslash itself has to be escaped with another \. I would first try to convert to json as is and see if that works.
Otherwise, for removing extra backslashes:
string_to_json.replace("\\\\","\\")
Remember that \\ = escaped \ = \
But in the above string that will not help you, because python reads "AHmmgZq\\n/+AH+G4" as "AHmmgZq\n/+AH+G4" and so finds no double backslash.
What solved my problem was this:
string_to_json.replace("\\n","\n")
Thanks to everybody!
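To make the difference visible, here is a small sketch that compares what json.dumps writes for the value as read from the API (backslash followed by n) with what it writes after the replace above (a real newline). The variable names are only for illustration.

```python
import json

# Value as it arrives from the API: a literal backslash followed by 'n'.
from_api = "AHmmgZq\\n/+AH+G4"

# Same replace as in the accepted fix: backslash + 'n' becomes a real newline.
fixed = from_api.replace("\\n", "\n")

# JSON must escape a backslash, so the first form is written with two of them.
dumped_from_api = json.dumps(from_api)
# A real newline is written as the JSON escape \n, i.e. a single backslash.
dumped_fixed = json.dumps(fixed)

print(dumped_from_api)
print(dumped_fixed)
```

In both cases json.loads gives back exactly the string that was dumped, so the doubled backslash in the file is only JSON's escaping, not corruption. Which form is right depends on whether the consumer of the key expects a real newline.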

I need to extract Data from a single line of json-data which is inbetween two variables (Powershell)

my Variables:
in front of Data:
DeviceAddresses":[{"Id":
after Data:
,"
I tried this, but there needs to be some error because of all the special characters I'm using:
$devicepattern = {DeviceAddresses":[{"Id":{.*?},"}
#$deviceid = [regex]::match($changeduserdata, $devicepattern).Groups[1].Value
#$deviceid
As you've found, some character literals can't be used as-is in a regex pattern because they carry special meaning - we call these meta-characters.
In order to match the corresponding character literal in an input string, we need to escape it with \ -
to match a literal (, we use the escape sequence \(,
for a literal }, we use \}, and so on...
Fortunately, you don't need to know or remember which ones are meta-characters or escapable sequences - we can use Regex.Escape() to escape all the special character literals in a given pattern string:
$prefix = [regex]::Escape('DeviceAddresses":[{"Id":')
$capture = '(.*?)'
$suffix = [regex]::Escape(',"')
$devicePattern = "${prefix}${capture}${suffix}"
You also don't need to call [regex]::Match directly, PowerShell will populate the automatic $Matches variable with match groups whenever a scalar -match succeeds:
if ($changeduserdata -match $devicePattern) {
    $deviceid = $Matches[1]
} else {
    Write-Error 'DeviceID not found'
}
For reference, the following ASCII literals need to be escaped in .NET's regex grammar:
$ ( ) * + . ? [ \ ^ { |
Additionally, # and ' ' (the regular space character) need to be escaped, and a number of other whitespace characters have to be translated to their respective escape sequences, to make patterns safe for use with the IgnorePatternWhitespace option (this is not applicable to your current scenario):
\u0009 => '\t' # Tab
\u000A => '\n' # Line Feed
\u000C => '\f' # Form Feed
\u000D => '\r' # Carriage Return
... all of which Regex.Escape() takes into account for you :)
To complement Mathias R. Jessen's helpful answer:
Generally, note that JSON data is much easier to work with - and processed more robustly - if you parse it into objects whose properties you can access - see the bottom section.
As for your regex attempt:
Note: The following also applies to all PowerShell-native regex features, such as the -match, -replace, and -split operators, the switch statement, and the Select-String cmdlet.
Mathias' answer uses [regex]::Escape() to escape the parts of the regex pattern to be used verbatim by the regex engine.
This is unequivocally the best approach if those verbatim parts aren't known in advance - e.g., when provided via a variable or expression, or passed as an argument.
However, in a regex pattern that is specified as a string literal it is often easier to individually \-escape the regex metacharacters, i.e. those characters that would otherwise have special meaning to the regex engine.
The list of characters that need escaping is (it can be inferred from the .NET Regular-Expression Quick Reference):
\ ( ) | . * + ? ^ $ [ {
If you enable the IgnorePatternWhitespace option (which you can do inline with (?x) at the start of a pattern), you'll additionally have to \-escape:
#
significant whitespace characters (those you actually want matched) specified verbatim (e.g., ' ', or via string interpolation, "`t"); this does not apply to those specified via escape sequences (e.g., \t or \n).
Therefore, the solution could be simplified to:
# Sample JSON
$changeduserdata = '{"DeviceAddresses":[{"Id": 42,"More": "stuff"}]}'
# Note how [ and { are \-escaped
$deviceId = if ($changeduserdata -match 'DeviceAddresses":\[\{"Id":(.*?),"') {
    $Matches[1]
}
Using ConvertFrom-Json to properly parse JSON into objects is both more robust and more convenient, as it allows property access (dot notation) to extract the value of interest:
# Sample JSON
$changeduserdata = '{"DeviceAddresses":[{"Id": 42,"More": "stuff"}]}'
# Convert to an object ([pscustomobject]) and drill down to the property
# of interest; note that the value of .DeviceAddresses is an *array* ([...]).
$deviceId = (ConvertFrom-Json $changeduserdata).DeviceAddresses[0].Id # -> 42
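For readers outside PowerShell, the same two approaches can be sketched in Python. This is only an analogue: re.escape plays the role of [regex]::Escape() (the exact set of characters it escapes differs from .NET's), json.loads plays the role of ConvertFrom-Json, and the sample data is the one used above.

```python
import json
import re

data = '{"DeviceAddresses":[{"Id": 42,"More": "stuff"}]}'

# Regex approach: escape the verbatim parts, capture lazily in between.
prefix = re.escape('DeviceAddresses":[{"Id":')
suffix = re.escape(',"')
pattern = f"{prefix}(.*?){suffix}"

match = re.search(pattern, data)
# The captured text includes the space after the colon, so strip it.
device_id = match.group(1).strip() if match else None

# Parsing approach: let a JSON parser do the work and index into the result.
parsed_id = json.loads(data)["DeviceAddresses"][0]["Id"]

print(device_id, parsed_id)
```

Note that the regex yields the text "42" while the parser yields the number 42, which is one more reason to prefer parsing when the input is known to be valid JSON.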

How do I remove commas only from inside double quotes for every line in a comma delimited csv?

I have a comma delimited CSV file that encapsulates the fields in double quote that I am attempting to operate on in bash. I would like to remove commas from inside the double quoted field for each line. I've looked at other solutions for the question asked here, and they revolved around using external libraries for CSV parsing, which isn't an option for my limited environment where the majority of the work is being done in awk and sed.
"A","B","C D","E, F","G"
desired output
"A","B","C D","E F","G"
With sed, to remove all commas followed by a non-quote character and all commas preceded by a non-quote character:
sed 's/,*\([^"]\)/\1/g;s/\([^"]\),*/\1/g' file
Edit:
Added * quantifier to match subsequent commas.
Easy with Perl's Text::CSV_XS module:
perl -MText::CSV_XS=csv -we 'csv(
in => shift,
always_quote => 1,
on_in => sub { tr/,//d for @{ $_[1] } }
);' -- file.csv
in specifies the input, shift just takes one from the command line arguments
always_quote adds quotes even to fields that don't need them
on_in introduces code to run on each line, in this case, it iterates over all the cells in the row and removes commas using the transliteration operator tr.
With GNU awk and FPAT:
$ awk '
BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")" # field definition
OFS="," # output field separator
}
{
for(i=1;i<=NF;i++) # loop all fields
gsub(/,/,"",$i)} # replace all commas in fields
1' file # output
"A","B","C D","E F","G"
I like ruby for CSV one-liners:
ruby -rcsv -ne '
CSV.parse($_) {|row|
puts row.map {|field| field.delete(",")}
.to_csv(:force_quotes => true)
}
'
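If Python happens to be available in the environment (an assumption, since the question only mentions awk and sed), its standard-library csv module can do the same thing without any external package. A minimal sketch, with strip_commas as a made-up helper name:

```python
import csv
import io


def strip_commas(text: str) -> str:
    """Remove commas inside fields and re-quote every field."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Each field is already unquoted here, so any comma left is data.
        writer.writerow([field.replace(",", "") for field in row])
    return out.getvalue()


print(strip_commas('"A","B","C D","E, F","G"\n'), end="")
```

For a real file, the same reader and writer can be pointed at open file objects instead of io.StringIO.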

Regex to match spaces before comma but not after

I would like to have a regex that will not allow spaces AFTER comma but spaces before comma should be allowed. The comma should also be optional.
My current regex:
^[\w,]+$
I have tried to add \s in it and also tried ^[\w ,]+$ but that allows spaces after comma as well!
This should be the test case:
Hello World // true
Hello, World // false (space after comma)
Hello,World // true
Hello,World World // false
Any help would be appreciated!
The below regex won't allow space after a comma,
^[\w ]+(?:,[^ ]+)?$
Explanation:
^ start of a line.
[\w ]+ Matches a word character or a space, one or more times.
(?:) This is a non-capturing group. Anything matched inside it won't be captured.
(?:,[^ ]+)? A comma followed by any character not of space one or more times. By adding ? after the non-capturing group, this tells the regex engine that it would be an optional one.
$ End of a line
I guess it depends on what you want to do. If you are just testing for the presence of the grammar error, you can use something like this:
var patt = / ,/g; // or /\s,/g if you want
var str = 'Hello ,World ,World';
var str2 = 'Hello, World, World';
console.log( patt.test(str) ) // true, there are spaces before commas
console.log( patt.test(str2) ) // false, the string is OK!
Lookaheads are useful but can be hard to understand without knowing the basics.
You can use this regex.
^[\w ]+(?:,\S+)?$
Explanation:
^ # the beginning of the string
[\w ]+ # any character of: word characters, ' ' (1 or more times)
(?: # group, but do not capture (optional):
, # ','
\S+ # non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times)
)? # end of grouping
$ # before an optional \n, and the end of the string
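As a quick check, the pattern above can be run against the four test cases from the question. This sketch uses Python's re module, whose syntax for this particular pattern should behave the same way as in other common regex flavors.

```python
import re

PATTERN = re.compile(r"^[\w ]+(?:,\S+)?$")

cases = [
    "Hello World",
    "Hello, World",
    "Hello,World",
    "Hello,World World",
]

# Map each test string to whether the whole string matches the pattern.
results = {text: bool(PATTERN.match(text)) for text in cases}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

The results line up with the expected true/false values listed in the question.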

How can I regex ,, to ,\N, in my CSVs so that mysqlimport understands them?

Say I have a normal CSV like
# helloworld.csv
hello,world,,,"please don't replace quoted stuff like ,,",,
If I want mysqlimport to understand that some of those fields are NULL, then I need:
# helloworld.mysql.csv
hello,world,\N,\N,"please don't replace quoted stuff like ,,",\N,\N
I got some help from another question -- Why does sed not replace overlapping patterns -- but note the problem:
$ perl -pe 'while (s#,,#,\\N,#) {}' -pe 's/,$/,\\N/g' helloworld.csv
hello,world,\N,\N,"please don't replace quoted stuff like ,\N,",\N,\N
^^
How can I write the regex so it doesn't replace ,, if they're between quotes?
FINAL ANSWER
Here's the final perl I used, thanks to the accepted answer below:
perl -pe 's/^,/\\N,/; while (s/,(?=,)(?=(?:[^"]*"[^"]*")*[^"]*$)/,\\N/g) {}; s/,$/,\\N/' helloworld.csv
That takes care of leading, trailing, and unquoted empty strings.
Why not use Text::CSV? You can parse the file with it and then use map to replace empty fields with '\N', e.g.
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag();
$csv->parse($line); # parse a CSV string into fields
my @fields = $csv->fields(); # get the parsed fields
@fields = map { $_ eq "" ? '\N' : $_ } @fields;
$csv->combine(@fields); # combine fields into a string
Assuming that you won't have escaped quotes, you can make sure that you only replace ,, if it's followed by an even number of quotes:
$subject =~
s/, # Match ,
(?=,) # only if followed by another ,
(?= # and only if followed by...
(?: # the following group:
[^"]*" # any number of non-quote characters, followed by one quote
[^"]*" # the same thing again (even number!)
)* # any number of times, followed by
[^"]* # any number of non-quotes until...
$ # end of string.
) # End of lookahead assertion
/,\\N/xg;
Input:
foo,,bar,,,baz,"foo,,,oof",zap,,zip
Output:
foo,\N,bar,\N,\N,baz,"foo,,,oof",zap,\N,zip
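The same lookahead works unchanged in Python's re module, which makes it easy to check against both samples from this thread. A minimal sketch under the same assumptions (no escaped quotes, no fields spanning several lines); mark_nulls is a made-up helper name, and the leading and trailing substitutions mirror the final perl one-liner from the question.

```python
import re

# A comma that is followed by another comma and then by an even number of
# quotes, i.e. an empty field that sits outside any quoted field.
EMPTY_FIELD = re.compile(r',(?=,)(?=(?:[^"]*"[^"]*")*[^"]*$)')


def mark_nulls(line: str) -> str:
    """Replace unquoted empty CSV fields with the \\N marker."""
    line = line.rstrip("\n")
    # Function replacements avoid backslash handling in the replacement text.
    line = re.sub(r"^,", lambda m: "\\N,", line)     # leading empty field
    line = EMPTY_FIELD.sub(lambda m: ",\\N", line)   # empty fields in the middle
    line = re.sub(r",$", lambda m: ",\\N", line)     # trailing empty field
    return line


print(mark_nulls('foo,,bar,,,baz,"foo,,,oof",zap,,zip'))
print(mark_nulls('hello,world,,,"please don\'t replace quoted stuff like ,,",,'))
```

Because the match consumes only a single comma and the rest is lookahead, one pass handles runs of consecutive empty fields, so no loop is needed here.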