Repair Bad Json with Unescaped Quote in Field Name - json

Kissmetrics exports apparently produce invalid JSON when there is a quote in the field name. For example, the following is one of the events produced:
{
"ab test group native dialogs on mobile":"Control",
"ab test group "interested" button copy":"Interested",
"_t":1412633724,
"_p":"hk5yxuxcqe/935mkbj+pz8xi0a8="
}
(Newlines were added to clarify the issue; we can't rely on them to repair the JSON.)
I am looking for a mechanism for repairing such broken JSON.
There are some assumptions I believe we can take advantage of:
We can assume that the JSON being produced is flat (no nested objects or arrays).
I believe all fields are strings, except for _t, but not 100% sure.
I don't think we can assume the bad unescaped quotes will be balanced.
I believe KM removes commas and colons from field names, but not 100% sure -- they are not removed from values (though I believe values to be properly encoded).

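Solution I am using now, in Python, which I'm sure is imperfect (it relies on the third-party regex module, because the standard re module has no captures() method):
import regex
match = regex.match(r'^{("(?P<fieldName>([^:]*))":(?P<fieldValue>([0-9]*\.?[0-9]+)|("(([^"])|(\\"))*"))(,|}))*$', s)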
fieldNames = match.captures('fieldName')
fieldValues = match.captures('fieldValue')
newJson = "{%s}" % (
",".join(
"\"%s\":%s" % (
fieldName.replace("\"", "\\\""),
fieldValue,
)
for fieldName, fieldValue
in zip(fieldNames, fieldValues)
)
)
This assumes there are no colons in the keys.
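An alternative that avoids one large regex is a small left-to-right scanner. The sketch below is only an illustration under the assumptions listed in the question: the object is flat, every value is either a number or a correctly encoded JSON string, and no key contains the two-character sequence `":`. The function name `repair_flat_json` is hypothetical, and it uses only the standard library.

```python
import json
import re

# A value is assumed to be a well-formed JSON string or a plain number.
_VALUE = re.compile(r'"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?')


def repair_flat_json(s):
    """Escape stray quotes inside the keys of a flat JSON object."""
    if not (s.startswith("{") and s.endswith("}")):
        raise ValueError("expected a flat JSON object")
    pos, pairs = 1, []
    while pos < len(s) - 1:
        if s[pos] != '"':
            raise ValueError(f"expected a key at offset {pos}")
        # Assumption: the first '":' after the opening quote ends the key.
        end = s.index('":', pos + 1)
        value = _VALUE.match(s, end + 2)
        if value is None:
            raise ValueError(f"unparseable value at offset {end + 2}")
        # Escape only the quotes that are not already escaped.
        key = re.sub(r'(?<!\\)"', r'\\"', s[pos + 1:end])
        pairs.append(f'"{key}":{value.group(0)}')
        pos = value.end()
        if s[pos] not in ",}":
            raise ValueError(f"expected ',' or '}}' at offset {pos}")
        pos += 1
    return "{" + ",".join(pairs) + "}"
```

Passing the result to json.loads() is a cheap way to confirm that the repaired text really is valid JSON. If any of the assumptions does not hold (for example a key that contains `":`), the scanner will either raise or split the key in the wrong place.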

Related

Convert JSONB to minified (no spaces) String

If I convert a text value like {"a":"b"} to JSONB and then back to text, a space is added between the : and the ".
psql=> select '{"a":"b"}'::jsonb::text;
text
------------
{"a": "b"}
(1 row)
How can I convert a text to a jsonb, so I can use jsonb functions, and back to text to store it?
The JSON standard, RFC 8259, says "... Insignificant whitespace is allowed before or after any of the six structural characters". In other words, the cast from jsonb to text has no universal canonical form. The PostgreSQL cast convention (using spaces) is arbitrary.
So we have to accept PostgreSQL's convention for CAST(var_jsonb AS text). When you need another convention, for example for debugging or human-readable output, the built-in jsonb_pretty() function is a good choice.
Unfortunately, PostgreSQL does not offer other choices, such as a compact format. You can, however, overload jsonb_pretty() with a compact option (note that json_strip_nulls() also removes object fields whose value is null, so the compact branch is only lossless when the data contains none):
CREATE or replace FUNCTION jsonb_pretty(
jsonb, -- input
compact boolean -- true for compact format
) RETURNS text AS $$
SELECT CASE
WHEN $2=true THEN json_strip_nulls($1::json)::text
ELSE jsonb_pretty($1)
END
$$ LANGUAGE SQL IMMUTABLE;
SELECT jsonb_pretty( jsonb_build_object('a',1, 'bla','bla bla'), true );
-- results {"a":1,"bla":"bla bla"}
See a complete discussion at this similar question.
From the docs:
https://www.postgresql.org/docs/12/datatype-json.html
"Because the json type stores an exact copy of the input text, it will preserve semantically-insignificant white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object within the value contains the same key more than once, all the key/value pairs are kept. (The processing functions consider the last value as the operative one.) By contrast, jsonb does not preserve white space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate keys are specified in the input, only the last value is kept."
So:
create table json_test(fld_json json, fld_jsonb jsonb);
insert into json_test values('{"a":"b"}', '{"a":"b"}');
select * from json_test ;
fld_json | fld_jsonb
-----------+------------
{"a":"b"} | {"a": "b"}
(1 row)
If you want to maintain your white space, or lack of it, use json. Otherwise you will get a normalized version, with spaces added, on output with jsonb. You can use the json functions/operators on the json type, though not the jsonb operators/functions. More detail here:
https://www.postgresql.org/docs/12/functions-json.html
Modifying your example:
select '{"a":"b"}'::json::text;
text
-----------
{"a":"b"}
The way your question and comments are phrased, it really looks like you want replace().
We need to make the search as specific as possible to avoid messing with potentially embedded ': ' within the json payload, so it seems safer to match on the surrounding double quotes too, like:
replace('{"a":"b"}'::jsonb::text, '": "', '":"')
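If the minification can happen in application code rather than in SQL, a JSON library avoids string replacement altogether. The following is a minimal sketch in Python, assuming the jsonb text has already been fetched as a string; the helper name `minify` is hypothetical.

```python
import json


def minify(json_text):
    """Re-serialize JSON text without any insignificant whitespace."""
    # separators=(",", ":") removes the spaces json.dumps adds by default;
    # ensure_ascii=False keeps non-ASCII characters as they are.
    return json.dumps(json.loads(json_text), separators=(",", ":"), ensure_ascii=False)
```

Because the text is parsed first, a `": "` sequence inside a string value is left untouched, which is the case the replace() approach has to be careful about.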

Regex for matching with and without quotes for dynamic JSON

I have the following text strings:
"Name":"John"}]
"Age":36
"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"
"Gender":null
I need to get two groups (key value pair) from this such that the output would be only:
Key|Value
Name|John
Age|36
Address|ABC,PQR234[]/.,#ANYCHARACTERS
The requirement is to have a single regex to grab everything in the double quotes if the double quotes are present. If not, take the value without the quotes.
In our example above, 36 and null are the one without the quotes and they need to be captured as well.
I have tried a lot but have failed to do so.
UPDATE:
I don't know why I am getting down votes for this question. Yes this is JSON that I am trying to parse but there is a reason behind why I am doing this and not using any document parser.
I am supposed to use Talend for getting a dynamic JSON converted into Key Value Pair. What I mean by dynamic is the fields of the JSON can vary and hence I do not have a fixed schema and hence cannot use a document parser (which demands a fixed structure of JSON). I am devising a solution to get around this using Normalizer (on comma) and then extracting the key value pair which will be in double quotes using Regular Expressions. I tried many things on my own and since I am not an expert in Regular expressions, I have come here to get inputs.
If you know any better solution to this, I would be very happy to get your inputs.
How about this?
/"?([^\n"]*)"?:"?([^\n"]*)"?/
Explained in detail at:
https://regex101.com/r/UM0rl2/1/
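To see what that pattern captures on the sample lines from the question, here is a small Python sketch. The function name `extract_pair` is hypothetical; the pattern itself is the one given above, without the surrounding slashes.

```python
import re

# Optional quote, key, optional quote, colon, optional quote, value, optional quote.
PAIR = re.compile(r'"?([^\n"]*)"?:"?([^\n"]*)"?')


def extract_pair(line):
    """Return (key, value) for one fragment, or None if there is no colon."""
    match = PAIR.search(line)
    return (match.group(1), match.group(2)) if match else None


lines = [
    '"Name":"John"}]',
    '"Age":36',
    '"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"',
    '"Gender":null',
]
for line in lines:
    print("|".join(extract_pair(line)))
```

Note that this only behaves well on fragments shaped like the samples: an unquoted value is captured up to the end of the line, so trailing brackets after a bare number would be included, and a quoted value containing an escaped quote would be cut short.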

How can I force a property of an object to be output as a string when returned as JSON

I'm storing color values as HEX in my database, which is mapped via ORM settings in CF9. When my color values are entirely numeric (e.g. 000000), ColdFusion is serializing them as numbers (e.g. 0.0) when returned from my CFC as JSON. Is there a way to force these columns/properties to be serialized as strings?
1st option
You could try this:
<cfset finalValue = " " & yourValue >
OR
<cfset finalValue = " #yourValue#" >
Note that javaCast doesn't work, and neither does adding a trailing space.
http://www.mischefamily.com/nathan/index.cfm/2008/10/22/ColdFire-1295100-and-a-CF-to-JSON-Gotcha
http://www.ghidinelli.com/2008/12/19/tricking-serializejson-to-treat-numbers-as-strings
2nd option
Using custom method instead of serializeJSON, there's one on Ben Nadel's site which you could adjust to your needs http://www.bennadel.com/blog/100--CF-JSON-My-Own-ColdFusion-Version-For-AJAX.htm .
If you're not afraid of a little Java (~100 loc), you can pass your query (a coldfusion.sql.QueryTable; do a Google search) out to a Java class, and let Jackson convert it to JSON for you. This is very fast, and keeps your data types the same as what came from your database. So if you have a varchar with a 0 as the value, you get '0' back. If you have an int, you get an int. Nulls are nulls, and empty strings are empty strings (although you can override this if you want). Totally worth using Java to get around all these CF JSON issues.
A hacky quick fix would be simply to armor your values with, say, a trailing non-numeric character prior to serialization and transit. It's ugly, but 000000Z will not be implicitly converted to a numeric by CF. Trim before using, and then figure out a purer solution to CF's aggressive "helpfulness" at your leisure.
If these are colours, stick a hash on the front?
<cfset Value = "##" & Value />

Do the JSON keys have to be surrounded by quotes?

Example:
Is the following code valid against the JSON Spec?
{
precision: "zip"
}
Or should I always use the following syntax? (And if so, why?)
{
"precision": "zip"
}
I haven't really found something about this in the JSON specifications. Although they use quotes around their keys in their examples.
Yes, you need quotation marks. This is to make it simpler and to avoid having to have another escape method for JavaScript reserved keywords, e.g. {for:"foo"}.
You are correct to use strings as the key. Here is an excerpt from RFC 4627 - The application/json Media Type for JavaScript Object Notation (JSON)
2.2. Objects
An object structure is represented as a pair of curly brackets
surrounding zero or more name/value pairs (or members). A name is a
string. A single colon comes after each name, separating the name
from the value. A single comma separates a value from a following
name. The names within an object SHOULD be unique.
object = begin-object [ member *( value-separator member ) ] end-object
member = string name-separator value
[...]
2.5. Strings
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. [...]
string = quotation-mark *char quotation-mark
quotation-mark = %x22 ; "
Read the whole RFC here.
From 2.2. Objects
An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members). A name is a string.
and from 2.5. Strings
A string begins and ends with quotation marks.
So I would say that according to the standard: yes, you should always quote the key (although some parsers may be more forgiving)
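A quick way to check this against a strict parser is to feed it both forms. The sketch below uses Python's standard json module as one example of a strict parser; the helper name `is_valid_json` is hypothetical.

```python
import json


def is_valid_json(text):
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json('{precision: "zip"}'))    # unquoted key
print(is_valid_json('{"precision": "zip"}'))  # quoted key
```

Other parsers may be more lenient, so a result from one library says nothing about what the specification allows; it only shows how that library behaves.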
Yes, quotes are mandatory. http://json.org/ says:
string
""
" chars "
Not if you use JSON5
For regular JSON, yes, keys must be quoted. But if you need otherwise, check out the widely used JSON5, which is so named because it is a superset of JSON that allows ES5 syntax, including:
unquoted property keys
single-quoted, escaped and multi-line strings
alternate number formats
comments
extra whitespace
The JSON5 reference implementation (json5 npm package) provides a JSON5 object that has parse and stringify methods with the same args and semantics as the built-in JSON object.
widely used, and depended on by many high profile projects
JSON5 was started in 2012, and as of 2022, now gets >65M downloads/week, ranks in the top 0.1% of the most depended-upon packages on npm, and has been adopted by major projects like Chromium, Next.js, Babel, Retool, WebStorm, and more. It's also natively supported on Apple platforms like MacOS and iOS.
~ json5.org homepage
In your situation, both of them are valid as JavaScript object literals, meaning that both of them will work in JavaScript code. Strict JSON, however, only accepts the quoted form.
You should still use the one with quotation marks in the key names because it is more conventional, which leads to more simplicity and the ability to have key names with white space etc.
Therefore, use the one with the quotation marks.
edit// check this: What is the difference between JSON and Object Literal Notation?
Since you can put "parent.child" dotted notation and you don't have to put parent["child"] which is also valid and useful, I'd say both ways is technically acceptable. The parsers all should do both ways just fine. If your parser does not need quotes on keys then it's probably better not to put them (saves space). It makes sense to call them strings because that is what they are, and since the square brackets gives you the ability to use values for keys essentially it makes perfect sense not to.
In Json you can put...
>var keyName = "someKey";
>var obj = {[keyName]:"someValue"};
>obj
Object {someKey: "someValue"}
just fine without issues, if you need a value for a key and none quoted won't work, so if it doesn't, you can't, so you won't so "you don't need quotes on keys". Even if it's right to say they are technically strings. Logic and usage argue otherwise. Nor does it officially output Object {"someKey": "someValue"} for obj in our example run from the console of any browser.

Can you use a trailing comma in a JSON object?

When manually generating a JSON object or array, it's often easier to leave a trailing comma on the last item in the object or array. For example, code to output from an array of strings might look like (in a C++ like pseudocode):
s.append("[");
for (i = 0; i < 5; ++i) {
s.appendF("\"%d\",", i);
}
s.append("]");
giving you a string like
["0","1","2","3","4",]
Is this allowed?
Unfortunately the JSON specification does not allow a trailing comma. There are a few browsers that will allow it, but generally you need to worry about all browsers.
In general I try to turn the problem around and add the comma before the actual value, so you end up with code that looks like this:
s.append("[");
for (i = 0; i < 5; ++i) {
if (i) s.append(","); // add the comma only if this isn't the first entry
s.appendF("\"%d\"", i);
}
s.append("]");
That extra one line of code in your for loop is hardly expensive...
Another alternative I've used when outputting a structure to JSON from a dictionary of some form is to always append a comma after each entry (as you are doing above) and then add a dummy entry at the end that has no trailing comma (but that is just lazy ;->).
Doesn't work well with an array unfortunately.
No. The JSON spec, as maintained at http://json.org, does not allow trailing commas. From what I've seen, some parsers may silently allow them when reading a JSON string, while others will throw errors. For interoperability, you shouldn't include it.
The code above could be restructured, either to remove the trailing comma when adding the array terminator or to add the comma before items, skipping that for the first one.
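In languages with a join operation, the separator problem disappears because the separator is only placed between elements. A minimal Python sketch of the same array as in the question follows; the function name `to_json_array` is hypothetical, and the standard json module stands in for "a strict parser".

```python
import json


def to_json_array(items):
    """Build a JSON array of strings without a trailing comma."""
    # str.join puts "," only between elements, never after the last one.
    return "[" + ",".join(json.dumps(str(item)) for item in items) + "]"


text = to_json_array(range(5))
print(text)
print(json.loads(text))

# The same parser rejects the trailing-comma variant.
try:
    json.loads('["0","1",]')
except json.JSONDecodeError as exc:
    print("rejected:", exc)
```

In practice json.dumps(list_of_values) does the whole job in one call; the manual join is shown only because it mirrors the hand-written loop being discussed.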
Simple, cheap, easy to read, and always works regardless of the specs.
$delimiter = '';
for .... {
print $delimiter.$whatever;
$delimiter = ',';
}
The redundant assignment to $delimiter is a very small price to pay.
Also works just as well if there is no explicit loop but separate code fragments.
Trailing commas are allowed in JavaScript, but don't work in IE. Douglas Crockford's versionless JSON spec didn't allow them, and because it was versionless this wasn't supposed to change. The ES5 JSON spec allowed them as an extension, but Crockford's RFC 4627 didn't, and ES5 reverted to disallowing them. Firefox followed suit. Internet Explorer is why we can't have nice things.
As it's been already said, JSON spec (based on ECMAScript 3) doesn't allow trailing comma. ES >= 5 allows it, so you can actually use that notation in pure JS. It's been argued about, and some parsers did support it (http://bolinfest.com/essays/json.html, http://whereswalden.com/2010/09/08/spidermonkey-json-change-trailing-commas-no-longer-accepted/), but it's the spec fact (as shown on http://json.org/) that it shouldn't work in JSON. That thing said...
... I'm wondering why no-one pointed out that you can actually split the loop at 0th iteration and use leading comma instead of trailing one to get rid of the comparison code smell and any actual performance overhead in the loop, resulting in a code that's actually shorter, simpler and faster (due to no branching/conditionals in the loop) than other solutions proposed.
E.g. (in a C-style pseudocode similar to OP's proposed code):
s.append("[");
// MAX == 5 here. if it's constant, you can inline it below and get rid of the comparison
if ( MAX > 0 ) {
s.appendF("\"%d\"", 0); // 0-th iteration
for( int i = 1; i < MAX; ++i ) {
s.appendF(",\"%d\"", i); // i-th iteration
}
}
s.append("]");
PHP coders may want to check out implode(). This takes an array and joins it up using a string.
From the docs...
$array = array('lastname', 'email', 'phone');
echo implode(",", $array); // lastname,email,phone
Interestingly, both C and C++ (and I think C#, but I'm not sure) specifically allow the trailing comma, for exactly the reason given: it makes programmatically generating lists much easier. Not sure why JavaScript didn't follow their lead.
Rather than engage in a debating club, I would adhere to the principle of Defensive Programming by combining both simple techniques in order to simplify interfacing with others:
As a developer of an app that receives json data, I'd be relaxed and allow the trailing comma.
When developing an app that writes json, I'd be strict and use one of the clever techniques of the other answers to only add commas between items and avoid the trailing comma.
There are bigger problems to be solved...
Use JSON5. Don't use JSON.
Objects and arrays can have trailing commas
Object keys can be unquoted if they're valid identifiers
Strings can be single-quoted
Strings can be split across multiple lines
Numbers can be hexadecimal (base 16)
Numbers can begin or end with a (leading or trailing) decimal point.
Numbers can include Infinity and -Infinity.
Numbers can begin with an explicit plus (+) sign.
Both inline (single-line) and block (multi-line) comments are allowed.
http://json5.org/
https://github.com/aseemk/json5
No. The "railroad diagrams" in https://json.org are an exact translation of the spec and make it clear that a , always comes before a value, never directly before ] or }.
There is a possible way to avoid an if-branch in the loop.
s.append("[ "); // there is a space after the left bracket
for (i = 0; i < 5; ++i) {
s.appendF("\"%d\",", i); // always add comma
}
s.back() = ']'; // modify last comma (or the space) to right bracket
According to the Class JSONArray specification:
An extra , (comma) may appear just before the closing bracket.
The null value will be inserted when there is , (comma) elision.
So, as I understand it, it should be allowed to write:
[0,1,2,3,4,5,]
But it could happen that some parsers will return the 7 as item count (like IE8 as Daniel Earwicker pointed out) instead of the expected 6.
Edited:
I found this JSON Validator that validates a JSON string against RFC 4627 (The application/json media type for JavaScript Object Notation) and against the JavaScript language specification. Actually here an array with a trailing comma is considered valid just for JavaScript and not for the RFC 4627 specification.
However, in the RFC 4627 specification is stated that:
2.3. Arrays
An array structure is represented as square brackets surrounding zero
or more values (or elements). Elements are separated by commas.
array = begin-array [ value *( value-separator value ) ] end-array
To me this is again an interpretation problem. If you write that Elements are separated by commas (without stating something about special cases, like the last element), it could be understood in both ways.
P.S. RFC 4627 isn't a standard (as explicitly stated), and has already been obsoleted by RFC 7159 (which is a proposed standard).
It is not recommended, but you can still do something like this to parse it.
const jsonStr = '[0,1,2,3,4,5,]';
let data;
eval('data = ' + jsonStr);
console.log(data)
With Relaxed JSON, you can have trailing commas, or just leave the commas out. They are optional.
There is no reason at all commas need to be present to parse a JSON-like document.
Take a look at the Relaxed JSON spec and you will see how 'noisy' the original JSON spec is. Way too many commas and quotes...
http://www.relaxedjson.org
You can also try out your example using this online RJSON parser and see it get parsed correctly.
http://www.relaxedjson.org/docs/converter.html?source=%5B0%2C1%2C2%2C3%2C4%2C5%2C%5D
As stated it is not allowed. But in JavaScript this is:
var a = Array()
for(let i=1; i<=5; i++) {
a.push(i)
}
var s = "[" + a.join(",") + "]"
(works fine in Firefox, Chrome, Edge, IE11, and without the let in IE9, 8, 7, 5)
From my past experience, I found that different browsers deal with trailing commas in JSON differently.
Both Firefox and Chrome handle it just fine. But IE (all versions) seems to break. I mean really break and stop reading the rest of the script.
Keeping that in mind, and also the fact that it's always nice to write compliant code, I suggest spending the extra effort of making sure that there's no trailing comma.
:)
I keep a current count and compare it to a total count. If the current count is less than the total count, I display the comma.
May not work if you don't have a total count prior to executing the JSON generation.
Then again, if you're using PHP 5.2.0 or better, you can just format your response using the built-in JSON API.
Since a for-loop is used to iterate over an array, or similar iterable data structure, we can use the length of the array as shown,
awk -v header="FirstName,LastName,DOB" '
BEGIN {
FS = ",";
print("[");
columns = split(header, column_names, ",");
}
{ print(" {");
for (i = 1; i < columns; i++) {
printf(" \"%s\":\"%s\",\n", column_names[i], $(i));
}
printf(" \"%s\":\"%s\"\n", column_names[i], $(i));
print(" }");
}
END { print("]"); } ' datafile.txt
With datafile.txt containing,
Angela,Baker,2010-05-23
Betty,Crockett,1990-12-07
David,Done,2003-10-31
String l = "[" + List<int>.generate(5, (i) => i + 1).join(",") + "]";
Using a trailing comma is not allowed for json. A solution I like, which you could do if you're not writing for an external recipient but for your own project, is to just strip (or replace by whitespace) the trailing comma on the receiving end before feeding it to the json parser. I do this for the trailing comma in the outermost json object. The convenient thing is then if you add an object at the end, you don't have to add a comma to the now second last object. This also makes for cleaner diffs if your config file is in a version control system, since it will only show the lines of the stuff you actually added.
char* str = readFile("myConfig.json");
char* chr = strrchr(str, '}') - 1;
int i = 0;
while( chr[i] == ' ' || chr[i] == '\n' ){
i--;
}
if( chr[i] == ',' ) chr[i] = ' ';
JsonParser parser;
parser.parse(str);
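The same receiving-end idea can be generalized to every trailing comma in the document, not just the outermost one. The Python sketch below is only an illustration, with a hypothetical function name `strip_trailing_commas`; it tracks whether it is inside a string so that commas in string values are left alone, and it assumes the input is otherwise valid JSON.

```python
import json


def strip_trailing_commas(text):
    """Drop commas that are followed (ignoring whitespace) by ']' or '}'."""
    out = []
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False      # this character was escaped; ignore it
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
            continue
        if ch == '"':
            in_string = True
        elif ch == ",":
            rest = text[i + 1:].lstrip()
            if rest[:1] in ("]", "}"):
                continue             # skip the trailing comma
        out.append(ch)
    return "".join(out)


print(json.loads(strip_trailing_commas("[0,1,2,3,4,5,]")))
```

The look-ahead re-slices the remaining text at every comma, which is fine for small config files but would be worth replacing with an index-based scan for large inputs.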
I usually loop over the array and attach a comma after every entry in the string. After the loop I delete the last comma again.
Maybe not the best way, but less expensive than checking every time if it's the last object in the loop I guess.