I noticed that depending on the implementation, some JSON libraries escape the / character, others don't.
Example 1: Lua
local cjson = require 'cjson'
print(cjson.encode({ x = "/" }))
--> {"x":"\/"}
Example 2: JavaScript
console.log(JSON.stringify({ x: "/" }))
--> '{"x":"/"}'
I wonder if the escaping done by Lua's cjson library is a bug or a valid feature. If it is not, I'm concerned about base64 encoded strings that are sent over the network and should be processed by any language. I'm also concerned about possibly unintended side effects when Lua cjson changes strings after first decoding a JSON string and then encoding it again, for example:
local x = '{"x":"/"}'
print(x)
--> {"x":"/"}
print(cjson.encode(cjson.decode(x)))
--> {"x":"\/"}
I wonder if this is allowed. Is it still the same JSON data? I would have expected that the actual string contents should not be changed by applying a decode followed by an encode operation.
Is it allowed in JSON to escape a '/', or does it change the payload in a non-standard-conformant way?
From what I tested, assuming that "/" == "\/" holds is not portable across languages. In a small sample of languages I found mixed results: some accept it, some don't, and some accept it but issue warnings. Here is an overview:
+------------+-------------+----------------------------------+
| Language | "/" == "\/" | Notes |
+------------+-------------+----------------------------------+
| Lua | true | - |
| JavaScript | true | - |
| C++ | true | warning: unknown escape sequence |
| Python | false | - |
| Ruby | true | - |
+------------+-------------+----------------------------------+
The spec defines a string as:

char ::= any Unicode character except " or \ or control character
       | \" | \\ | \/ | \b | \f | \n | \r | \t
       | \u four-hex-digits

So the sequence \/ is clearly allowed. But it is not necessary, since / also falls into the "any Unicode character except " or \ or control character" range.
The "warning: unknown escape sequence" is not correct in this case.
If it is not, I'm concerned about base64 encoded strings that are sent over the network and should be processed by any language.
I'm not sure I understand. Base64 and JSON have nothing to do with each other.
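For what it's worth, the base64 alphabet does include /, but any conforming JSON decoder maps \/ back to /, so a round trip leaves the payload intact. A quick JavaScript sketch, with a hand-written wire string standing in for a \/-escaping encoder:

const b64 = "+/8=";               // base64 of the bytes 0xFB 0xFF
const wire = '{"data":"+\\/8="}'; // what a \/-escaping encoder would emit
console.log(JSON.parse(wire).data === b64)
--> true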
Going by the ECMA-404 spec, it should be allowed:
\/ represents the solidus character (U+002F).
The following four cases all produce the same result:
"\u002F"
"\u002f"
"\/"
"/"
I have a csv file with the following single line:
Some,Test,"is"thisvalid,or,not
according to csvlint.io it's not valid CSV.
However, according to https://www.toolkitbay.com/tkb/tool/csv-validator it is valid CSV. Which site is lying?
Whether it is "valid" or not depends on the definition you, and the websites you found, are using. If you asked about "well-formed XML", everyone would agree that should be based on the W3C standard; or "valid HTML" would now probably refer to the WHATWG Living Standard. "Valid CSV" has no such universal definition - although there are standards for CSV, they've been written after years of use, in the rather optimistic hope that existing implementations will be amended to follow them.
So neither tool is "lying", they just evidently disagree on what "valid" means.
A far more useful question than whether the CSV is "valid" is whether it is interpreted as you intend by whatever tool you process it with. From a practical point of view, the unusual positioning of the quote marks is likely to be interpreted differently by different tools, so it is probably best avoided if interoperability is relevant to your use case.
For the CSV format, the reference is RFC 4180: https://datatracker.ietf.org/doc/html/rfc4180
And you have:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
Then your CSV is not valid.
If your columns are these
+------+------+---------------+----+-----+
| 1 | 2 | 3 | 4 | 5 |
+------+------+---------------+----+-----+
| Some | Test | "is"thisvalid | or | not |
+------+------+---------------+----+-----+
then the valid version is this:
Some,Test,"""is""thisvalid",or,not
And it's also valid according to https://csvlint.io/
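A minimal JavaScript sketch of that quoting rule (quoteField is a hypothetical helper name) reproduces the line above:

function quoteField(field) {
  // Per RFC 4180: quote fields containing commas, quotes, or line
  // breaks, and double any embedded quote (section 2, rule 7).
  return /[",\r\n]/.test(field)
    ? '"' + field.replace(/"/g, '""') + '"'
    : field;
}
console.log(["Some", "Test", '"is"thisvalid', "or", "not"].map(quoteField).join(","))
--> Some,Test,"""is""thisvalid",or,not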
I want to export the value of a column in BigQuery so it looks like:
| NAME | JSON |
| abc | {"test": 1} |
However, when I export this to a gzipped CSV/TSV via Python code to Google Cloud Storage with field delimiter = '\t' (https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.extract_table.html), I always get something like:
| NAME | JSON |
| abc | "{""test"": 1}" |
I know about escaping, and I have been trying a lot of possibilities (such as using "" to escape the "), but I can't seem to get the export as:
{"test": 1}
Please help me?
The tool output is correct, but you'd need to read RFC 4180, the standard for CSV files, to see why.
Basically, the JSON spec says test needs to have double quotes, i.e. "test".
Double quotes around the entire field are allowed in CSV. But the CSV spec also says that in a CSV with quoted fields, an inner quote is doubled. This is rule 7 in section 2 of RFC 4180:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
So what's the solution?
Ideally, you use an RFC 4180 compliant CSV reader wherever the file is consumed, so you aren't writing the parsing code yourself.
You could replace the doubled double quotes with single double quotes, and the quotes at the braces with nothing, like this:
sed -e 's/"{/{/g; s/}"/}/g; s/""/"/g;' in.csv > out.csv
transforming
"{""test"": 1}"
to
{ "test": 1}
or using String.replace in JavaScript; but either way the resulting CSV file is NOT RFC 4180 compliant.
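For example, a JavaScript sketch of the same three substitutions:

const line = '"{""test"": 1}"';
const fixed = line
  .replace(/"\{/g, '{')   // drop the quote before {
  .replace(/\}"/g, '}')   // drop the quote after }
  .replace(/""/g, '"');   // undouble the inner quotes
console.log(fixed)
--> {"test": 1}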
I used the Unit Separator (US, 0x1F) in a database. When I export to an XML 1.0 file, it is not accepted and leaves the attribute with an empty value.
I have data in database like this:
"option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"
I'm trying to export to an XML 1.0 file like this:
<elementname attr1="option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"/>
However, the [US] is not accepted by XML 1.0. Any suggestions?
I can replace '\37' (octal 37, hex 1F) with something like "XXX", "$", "(0x1f)"... before writing to XML;
I can replace it when importing from XML and writing to the database. However, if I replace it with "&#x1F;", which is the HTML entity for Unit Separator, I end up with "&amp;#x1F;", which is definitely not what I wanted.
If I manually modify the XML file to contain "&#x1F;", I cannot use MSXML to load it; it gives the error "Invalid Unicode Character".
Any suggestions?
Thank you
Summary:
Let's make an analogy with how a compiler works: there are two phases, "Pre-compile" and "Compile".
For XML file generation, it acts like the "Compile" phase, e.g. converting "<" to "&lt;".
However, the Unit Separator is not supported by XML 1.0, so the "Compile" phase will not convert it to the entity "&#x1F;".
So we have to seek a solution in the "Pre-compile" phase, which is our own application's responsibility.
When writing:
Option 1: <unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>
Option 2: simply use "_x241F_" to replace "\37" in the string, provided "_x241F_" does not conflict with any existing token in the string.
When reading:
With Option 1: load the elements and concatenate them into a single string with "\37" as the separator.
With Option 2: simply replace "_x241F_" with "\37".
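A JavaScript sketch of Option 2's round trip (toXmlSafe and fromXmlSafe are hypothetical helper names):

const TOKEN = "_x241F_";
const toXmlSafe = s => s.split("\x1f").join(TOKEN);    // before writing XML
const fromXmlSafe = s => s.split(TOKEN).join("\x1f");  // after reading XML

const raw = "aaa\x1fbbb\x1fccc";
console.log(toXmlSafe(raw))
--> aaa_x241F_bbb_x241F_ccc
console.log(fromXmlSafe(toXmlSafe(raw)) === raw)
--> true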
I've also found out that MSXML (even the highest version, MSXML6.dll) will not load XML 1.1.
So if we are unfortunately using MSXML, we have to write our own "Pre-compile" code to handle these characters before feeding the "Compile" phase.
Note: I borrowed the idea of "_x241F_" from here.
Thanks for everyone's help
There is no HTML entity for U+001F UNIT SEPARATOR. Besides, HTML entities would be irrelevant when dealing with generic XML.
The character references would be &#31; (decimal) and &#x1F; (hex), in HTML and in XML, but the character is not allowed in HTML or in XML. For XML 1.0, which this seems to be about, please refer to section 2.2 Characters, where the normative definition is the following production (the associated comment is misleading, and comments are non-normative):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
The conclusions to be drawn depend on the meaning and purpose of UNIT SEPARATOR in the text. It has no generally defined meaning; it is up to applications to assign a meaning to it and process it accordingly.
Usually UNIT SEPARATOR is used to separate units of some kind, so the natural approach would be to process the incoming data so that instead of such separators, the data, when converted to XML format, has units denoted by markup. So for data like aaa[US]bbb[US]ccc where [US] is UNIT SEPARATOR, you would generate something like <unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>.
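A JavaScript sketch of that conversion (real data would additionally need XML escaping of <, &, etc., which is omitted here):

const data = "aaa\x1fbbb\x1fccc";   // \x1f is UNIT SEPARATOR
const xml = data.split("\x1f").map(u => "<unit>" + u + "</unit>").join("");
console.log(xml)
--> <unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>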
This website
http://www.fileformat.info/info/unicode/char/1f/index.htm
suggests one of the following:
HTML Entity (decimal): &#31;
HTML Entity (hex): &#x1f;
I notice these characters are all illegal in URLs:
#%<>?\/*+|:"
I notice these are encoded (%NN where NN is the hex value) but can be replaced without problems:
$,;=& #
(note the space which is typically encoded as + (but may be %20))
I understand #%?/+. But what do the following characters do: <>\*|":?
Note: I understand what : does in the domain part (it's the port) and @ is a login, but after the first /, why is : illegal? (# isn't.)
RFC 2396 (Uniform Resource Identifiers URI: Generic Syntax) says:
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose.
reserved = ";" | "/" | "?" | ":" | "#" | "&" | "=" | "+" |
"$" | ","
2.4.3. Excluded US-ASCII Characters
The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields. The character "#" is excluded
because it is used to delimit a URI from a fragment identifier in URI
references (Section 4). The percent character "%" is excluded because
it is used for the encoding of escaped characters.
delims = "<" | ">" | "#" | "%" | <">
Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
I think that covers all that you mentioned. The star "*" is not reserved and may be used. Paste this in a browser: http://en.wikipedia.org/wiki/*
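In JavaScript, encodeURIComponent shows the percent-encodings for the delims and unwise sets (a quick sketch; note it leaves the star unescaped, matching the point above):

console.log([...'<>#%"{}|\\^[]`'].map(encodeURIComponent).join(" "))
--> %3C %3E %23 %25 %22 %7B %7D %7C %5C %5E %5B %5D %60
console.log(encodeURIComponent("*"))
--> *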
I'm not sure about this, but could those characters be reserved so that URLs typed into a shell environment aren't split into pieces unnecessarily? For example, imagine I try executing
curl http://www.stackoverflow.com/this>that > myFile.txt
This might trip up the command prompt by making it fetch the incorrect URL http://www.stackoverflow.com/this, write the result to a file called that, and then choke when it hits the second >. This explanation does account for all of the characters you listed (they all mean something in a shell environment), but it's just my first guess as to why it could be.
What is the meaning of a ^ sign in a URL?
I needed to crawl some link data from a webpage and I was using a simple handwritten PHP crawler for it. The crawler usually works fine; then I came to a URL like this:
http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^
This URL works fine when typed into a browser, but my crawler is not able to retrieve this page; I am getting an "HTTP request failed" error.
^ characters should be encoded; see RFC 1738, Uniform Resource Locators (URL):
Other characters are unsafe because
gateways and other transport agents
are known to sometimes modify such
characters. These characters are "{",
"}", "|", "\", "^", "~", "[", "]",
and "`".
All unsafe characters must always
be encoded within a URL
You could try URL encoding the ^ character.
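For example, a JavaScript sketch that percent-encodes just the carets before making the request:

const url = "http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^";
console.log(url.replace(/\^/g, "%5E"))
--> http://www.example.com/example.asp?x7=3%5E%5E%5E%5E%5Eselect%20col1,col2%20from%20table%20where%20recordid%3E=20%5E%5E%5E%5E%5E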
Based on the context, I'd guess they're a homespun attempt to URL-encode quote-marks.
Caret (^) is not a reserved character in URLs, so it should be acceptable to use as-is. However, if you're having problems, just replace it with its hex encoding, %5E.
And yeah, putting raw SQL in the URL is like a big flashing neon sign reading "EXPLOIT ME PLEASE!".
Caret is neither reserved nor "unreserved", which makes it an "unsafe character" in URLs. It should never appear in a URL unencoded. From RFC 2396:
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
reserved = ";" | "/" | "?" | ":" | "#" | "&" | "=" | "+" |
"$" | ","
The "reserved" syntax class above refers to those characters that are
allowed within a URI, but which may not be allowed within a
particular component of the generic URI syntax; they are used as
delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts.
The set of characters actually reserved within any given URI
component is defined by that component. In general, a character is
reserved if the semantics of the URI changes if the character is
replaced with its escaped US-ASCII encoding.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and
symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
2.4. Escape Sequences
Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.
The crawler may be using regular expressions to parse the URL and is therefore falling over because the caret (^) means beginning-of-line. I think these URLs are really bad practice, since they expose the underlying database structure; whoever wrote this might want to consider serious refactoring!
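A JavaScript sketch of that failure mode: an unescaped ^ in a pattern is an anchor, so a regex built naively from the URL never matches it:

const part = "3^^^^^select";
console.log(new RegExp(part).test(part))
--> false
// the ^ after "3" asserts start-of-input, which can never hold there
const escaped = part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");  // escape regex metacharacters
console.log(new RegExp(escaped).test(part))
--> true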
HTH!