What is the correct syntax to fold a JSON string?

I am using Delphi 2009 to build up a string variable containing a simple JSON string from values I get from a database. This results in a string of the form below (although the real string could be much longer)
{"alice#example.com": {"first":"Alice", "id": 2},"bob#example.com": {"first":"Bob", "id":1},"cath#example.com": {"first":"Cath", "id":3},"derek#example.com": {"first":"Derek", "id": 4}}
This string gets sent as a header called Recipient-Variables in an email to a company.
The instructions I have for sending the emails to the company say
Note: The value of the “Recipient-Variables” header should be valid JSON string, otherwise we won’t be able to parse it. If your “Recipient-Variables” header exceeds 998 characters, you should use folding to spread the variables over multiple lines.
I have looked at these SO posts to try to understand what is meant by folding but cannot really understand the replies as they often seem to be referencing a particular editor.
notepad++ user defined regions with folding
Folding JSON at specific points
Can you customize code folding?
Please can somebody use my example to show me what syntax I should use or what characters I need to insert in my string to comply with the instruction and fold my JSON string, say in between the records for bob and cath?
(BTW I understand what is meant by folding when viewing JSON or other code in an editor but I don't understand how a simple JSON string needs to be formatted in order for the folding to happen at a specific place)

I finally found the answer myself, so I am posting it here to help others, just in case.
The answer is given in RFC 2822 (Internet Message Format), published in 2001 by the Network Working Group (P. Resnick, editor)
https://www.rfc-editor.org/rfc/rfc2822#page-11
The document ...
specifies a syntax for text messages that are sent between computer users, within the framework of "electronic mail" messages.
...and describes how emails are constructed, in particular how to deal with long headers.
Section 2.2.3 talks about long header fields (longer than 998 characters) and says such headers need to be folded by inserting CRLF followed immediately by some white space, e.g. a space character.
If the receiving server follows the same standard, it will unfold the header by stripping out the CRLFs before parsing it; the white space that follows each CRLF remains, but white space between JSON tokens does not affect parsing.
Though structured field bodies are defined in such a way that folding can take place between many of the lexical tokens (and even within some of the lexical tokens), folding SHOULD be limited to placing the CRLF at higher-level syntactic breaks. For instance, if a field body is defined as comma-separated values, it is recommended that folding occur after the comma separating the structured items in preference to other places where the field could be folded, even if it is allowed elsewhere.
Later, in section 3.2.3 it explains how comments may be combined with folding white space.
So it seems that, if generating the string through code, it is necessary to fold long header lines by detecting a higher-level syntactic boundary, such as a comma, that is less than 998 characters from the start of the header (or from the last fold point) and inserting the three bytes $0D $0A $20 (CR, LF, space) at that point. This could be done after the header has been constructed, or on the fly as it is generated.
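For illustration, here is a rough sketch of that logic in Delphi. It is not the ICS implementation; the routine name FoldJsonHeader and the 900-character soft limit are my own choices, and it assumes no single key/value pair between commas is longer than the remaining margin.

// Sketch only: fold the header value after commas so that no physical line
// gets anywhere near the 998-character limit. MaxLineLen is a soft limit.
function FoldJsonHeader(const Value: string; MaxLineLen: Integer = 900): string;
var
  Ch: Char;
  LineLen: Integer;
begin
  Result := '';
  LineLen := 0;
  for Ch in Value do
  begin
    if (Ch = ',') and (LineLen >= MaxLineLen) then
    begin
      // fold after the comma: CRLF followed by a single space
      // (the RFC 2822 "folding white space")
      Result := Result + Ch + #13#10 + ' ';
      LineLen := 1; // the folding space already counts towards the next line
    end
    else
    begin
      Result := Result + Ch;
      Inc(LineLen);
    end;
  end;
end;

A compliant receiver then unfolds the header by removing the CRLFs again before parsing; the extra space left behind sits between JSON tokens and does not bother a JSON parser.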
As a follow-up, I now notice that the Overbyte ICS component I am using (TSslSmtpCli) has a Boolean property FoldHeaders, so this might do all the work for me.

Related

Custom reason for word wrap in VS Code extension to enable working with multiline values in Json

I write JSON files for an API that often need multiline values, because scripts are embedded among the data in the attributes. I've written an extension for myself that can escape and unescape multiline values, so I can cycle between these two states:
{
"value": "
multiline
value
"
}
{
"value": "multiline\n value"
}
However, in the unescaped, formatted state I have invalid JSON, which just causes trouble. I have to switch between the escaped and unescaped states to do any JSON operation (like formatting), which I work around by replacing \n with \\n and back.
I have even considered switching to another format, but none of those I tried had a killer feature that would make me switch. Among them: JSONC (no multiline value support), XML (hard to write and read, but supports multiline values and indentation), YAML (would be an option, but does not support indentation in multiline values).
Can I force VS Code to render a specific sequence of characters as a line break (in this case, it would be \\n) without changing the document data? The intended functionality is like what Alt+Z word wrap does, just in a different place.
After some research, I've found the following:
* it is not possible, or at least hard, to hijack the default editor and make it render lines the way I want
* even if I managed to, it might have an underlying issue of not tokenizing long lines
I have decided to go down the route of defining a custom language, because:
* the API has a constant JSON structure
* I can define a new grammar for the API scripts and embed it in the JSON files
This approach seems to be the way to go for me, although it might only be a temporary solution. I'm losing JSON validation, so I no longer get the JSON "newline in value" error, but I'm also losing errors about missing commas. This is something I want to address with the following precautions:
* if a JSON file is valid and contains attributes of the API-defined classes, offer a button to switch to my defined language, which also un-escapes \\n to \n
* if the language is my custom language, escape the newlines and try to pass the result to JSON validation.

Is it possible to escape &amp; &#39; present in database?

The data retrieved from the database has &amp; or &#39; in it. How do I escape it and show it as & or ' without using the gsub method?
If you can't stop the data from being inserted like that, then there is code here to create a function in MySQL that you can use in your query in order to return the decoded data.
Or from within Ruby, not using a replace strategy, take a look at how-do-i-encode-decode-html-entities-in-ruby.
First of all, an escape sequence is something found in string parsing only, not in HTML or XML, where you would talk of masking. You escape a string, for example, for reasons of concatenation. HTML entities are specific entities that are substituted in place of special characters. It is absolutely wrong to save strings still containing HTML entities in a database table: the masked string has to be unmasked first, after you "re-get" it from the POST :). Otherwise you are trying to store HTML entities in a dedicated table, e.g. for programming reasons; a text file would do better (try dBase II), or simply search the web for a page with an entity listing.
The second point is that XML is intended, in general, to be a markup language you define yourself, for the sake of making your own code easier to read. That is why any non-standard tags within that specification have to be defined by you. (It was strange to read about regular entities being called "XML entities", as in the case of &apos;, explained on this entity page: http://www.madore.org/~david/computers/unicode/htmlent.html)
Standard XML tags (not entities) are mainly important for finalizing your HTML code so that it fits better with the programming languages used later on, but in my opinion the ones mentioned are still HTML entities!
This can and should be performed at the view level, i.e. the front end, since it's an HTML entity.
Assuming you use jQuery, you can do this to make &#39; appear as ' in the HTML:
$('<div/>').html('&#39;').text()
You can find the respective entity values in the link above.

How can I populate a query string variable to a text box which contains &,\ and $ in it

I have a variable, say A = drug & medicare $12/$15.
I need to assign it to a text box, but only 'drug' is posted to the server; the rest of the data gets truncated.
this.textBox.Text = Request.QueryString["A"].ToString();
The following is not valid for a="foo&bar$12":
http://example.com?a=foo&bar$12
The & symbol is a reserved character: it separates query string variables. You will need to percent-encode the value before sending it to that page.
Also, & is a reserved character in HTML/XML. I suggest reading up on percent encoding and HTML encoding.
I believe you are having problems with HTML entities. You need to read up on HTML escaping in your tool of choice. A bare & cannot stand in HTML, since it begins an entity sequence; it needs to be replaced with &amp;. Without knowing at least which toolchain you're using (as per @Richard's comment), we can't really suggest the best way to do it.
EDIT: Now that I reread your question, it seems A is not a variable but a query parameter :) Reading comprehension fail. Anyway, in this case a similar problem exists: & is not a valid character for a query parameter, and it needs URL escaping. Again, exactly how to do it is in the documentation for your toolchain, but in essence & will need to be replaced by %26. The plus sign is also not permitted (or rather, it has another meaning); the others are tolerated (though there are nicer ways to write them).
That looks more or less like ASP.NET pseudocode, so I'm going to diagnose your problem as the query string needing to be URL encoded. Key/value pairs in the query string are separated by an ampersand (&), and ASP.NET (along with other web platforms) automatically parses out the key/value pairs for you.
In this case, the ampersand terminates the value of the "A=..." key/value pair. The problem will be solved if you can URL encode the link that brings the user into your page. If actually using ASP.NET, you can use the HttpUtility.UrlEncode() method for that:
string myValue = Server.UrlEncode("drug & medicare $12/$15");
You'll end up with this querystring instead: A=drug%20%26%20medicare%20%2412%2F%2415

Best HTML encoder for Delphi?

Seems like my data is getting corrupted when using HTTPApp.HTMLEncode(string): String;
HTMLEncode( 'Jo&hn D<oe' ); // returns 'Jo&am'
This is not correct, and it is corrupting my data. Does anyone have suggestions for VCL components that work better, other than my spending time encoding all the cases listed here:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Update
After understanding more about HTML, I have found there is no need to encode the other characters referenced in my link. You only need to know about the four HTML reserved characters:
&, <, >, "
The issue with the VCL HTTPApp.HTMLEncode() function is caused by the buffer sizes not being adjusted for the default Unicode string type introduced in Delphi 2009/2010. It can be fixed the way @mason describes below, or by calling WideFormatBuf() instead of the FormatBuf() that is currently in use.
Replacing the <, >, &, and " characters in a string is trivial. You could thus easily write your own routine for this. (And if your HTML page is UTF-8, there is absolutely no reason to encode any other characters, such as U+222B (the integral sign).)
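For example, a minimal hand-rolled version covering just those four characters could look like this (a sketch of my own, not the RTL routine):

// Sketch only: encode the four HTML reserved characters and leave the rest alone.
// Simple per-character concatenation; not tuned for speed.
function SimpleHtmlEncode(const S: string): string;
var
  Ch: Char;
begin
  Result := '';
  for Ch in S do
    case Ch of
      '&': Result := Result + '&amp;';
      '<': Result := Result + '&lt;';
      '>': Result := Result + '&gt;';
      '"': Result := Result + '&quot;';
    else
      Result := Result + Ch;
    end;
end;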
But if you wish to stick to the Delphi RTL, then you can have a look at HTTPUtil.HTMLEscape, which has exactly the same signature as HTTPApp.HTMLEncode.
Or, have a look at this SO question.
You're probably using Delphi 2009 or 2010. It looks to me like they forgot to update HTMLEncode for Unicode. It's passing the wrong buffer lengths to FormatBuf.
The HTMLEncode routine is basically right, aside from that, and it's pretty short. You could probably just make your own copy. Everywhere it calls FormatBuf, it gives 5 parameters. The second and fourth are integer values. Double both of them in each call (there are only four such calls), and then it will work.
Also, you ought to open a QC report on this so it will get fixed.
Small hint: do not convert single quote (') to &apos; - some browsers do not understand this code because &apos; is not valid HTML
For details, see: "The Curse of &apos;" and "XHTML and '"
(Neither of the Delphi units mentioned converts single quotes.)

Which are the valid control characters in HTML/XHTML forms

I'm trying to create a form validation unit that, in addition to "regular" tests, checks encoding as well.
According to this article, http://www.w3.org/International/questions/qa-forms-utf-8, the allowed characters in the range 0-31 are CR, LF and TAB; DEL (127) is not allowed.
On the other hand, there are control characters in the range 0x80-0xA0. In different sources I have seen both that they are allowed and that they are not. I have also seen that this differs between XHTML, HTML and XML.
Some articles say that FF (form feed) is allowed as well?
Can someone provide a good answer, with sources, about what is allowed and what isn't?
EDIT: Even here, http://www.w3.org/International/questions/qa-controls, there is some ambiguity. It says "The C1 range is supported", but the table shows those characters as illegal, while the previously shown UTF-8 validation allows them?
I think you're looking at this the wrong way around. The resources you link to specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form, that is, the values of the various form controls as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but at what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 standard for <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
As for what character encoding is contained, (i.e. whether %A0 is a non-breaking space or an error), that's negotiated by the accept-charset attribute on your <FORM> element and the response's (well, really a GET or POST request) Content-Type header.
Postel's Law: Be conservative in what you do; be liberal in what you accept from others.
If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.
The Unicode characters in these ranges are valid in HTML 4.01:
0x09..0x0A
0x0D
0x20..0x7E
0x00A0..0xD7FF
0xE000..0x10FFFF
In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258
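As an illustration of the HTML 4.01 part of that list, a per-character check might look like this (a sketch in Delphi, matching the ranges above; characters above U+FFFF arrive as UTF-16 surrogate pairs and are only accepted loosely here, not validated as proper pairs):

// Sketch only: accepts exactly the HTML 4.01 ranges listed above.
function IsValidHtml401Char(Ch: Char): Boolean;
begin
  case Ord(Ch) of
    $09, $0A, $0D,       // HT, LF, CR
    $20..$7E,            // printable ASCII
    $00A0..$D7FF,
    $D800..$DFFF,        // surrogate halves for U+10000..U+10FFFF (not paired here)
    $E000..$FFFF:
      Result := True;
  else
    Result := False;
  end;
end;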
First of all, any octet is valid. The regular expression mentioned for UTF-8 sequences just omits some of them, as in practice they are rather uncommon for a user to enter. But that doesn't mean that they are invalid; they are just not expected to occur.
The first link you mention does not have anything to do with validating the allowed characters in XHTML... the example on that link is simply showing a common/generic pattern for detecting whether or not raw data is in UTF-8 encoding.
This is a quote from the second link:
HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).
The way I read this is:
Any control character in the C1 range is supported if you encode it directly (in the document's character encoding) or represent it as an NCR.
Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.
If the document is known to be XHTML, then you should just load it and validate it against the schema.
What programming language do you use? At least for Java there exist libraries to check the encoding of a string (or byte-array). I guess similar libraries would exist for other languages too.
Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?
If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.
You want to check that the submitted form data is correctly encoded (in UTF-8, I gather). As Archchancellor Ridcully says, that's easy to check in most languages.
Then, if the encoding is correct, you can check whether it's valid form data.
Then, if the form data is valid, you can check whether the data contains what you expect.