How can I decode HTML entities? - html

Here's a quick Perl question:
How can I convert HTML special characters like ü or ' to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and could write it for all HTML characters, but some function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.

Take a look at HTML::Entities:
use HTML::Entities;
my $html = "Snoopy & Charlie Brown";
print decode_entities($html), "\n";
You can guess the output.

The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want and you don't want all the unicode characters you can look at the Text::Unidecode module from CPAN to Zap all those odd characters back into a roughly similar collection of ASCII characters:
use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);
my $source = '北亰';
print unidecode(decode_entities($source));
# That prints: Bei Jing

Note that there are hex-specified characters too. They look like this: é (é).
Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (perl interface: Text::Iconv)
with the transliterate option on with some success in the past. But if you are dealing
with a limited set of entities, or you don't actually need it reduced to ASCII equivalents,
you may be better off limiting what decode_entities produces or providing it with custom
conversion maps. See the HTML::Entities doc.

There are a handful of predefined HTML entities - & " > and so on - that you could hard code.
However, the larger case of numberic entities - { - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.

I use this script. Save it as html2utf.py, and use it ala echo $some_html | html2utf.py.
#!/usr/bin/env python3
"""
An alternative for `perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)'` (which you can use by `cpanm HTML::Entities`) and `recode html..`.
"""
import fileinput
import html
for line in fileinput.input():
print(html.unescape(line.rstrip('\n')))

I have created a one-liner for bash, using Perl to decode the HTML entities that are passed to perl. My solution is a blend of this answer (see above) and something I found on commandlinefu.com last week.
Most of us who code in Bash aren't in the habit of using echo -n to strip out the \n newline character since it doesn't usually affect Bash text parsing. With Perl——and with this particular method——it's important to use echo -n or else perl will interpret the 'newline' \n character as a literal part of the response, adding an unwanted %0A to your results.
Here's my bash-perl one-liner hybrid:
encodedURL="$(echo -n "$entityURL" | perl -MHTML::Entities -MURI::Escape -ne 'print uri_escape(decode_entities($_))')"
Example:
Input: Seals \& Croft - Summer Breeze
Output: Seals%20%26%20Croft%20-%20Summer%20Breeze

Related

how can i make my code indented like this for better readability and ease [duplicate]

Is it possible to have multi-line strings in JSON?
It's mostly for visual comfort so I suppose I can just turn word wrap on in my editor, but I'm just kinda curious.
I'm writing some data files in JSON format and would like to have some really long string values split over multiple lines. Using python's JSON module I get a whole lot of errors, whether I use \ or \n as an escape.
JSON does not allow real line-breaks. You need to replace all the line breaks with \n.
eg:
"first line
second line"
can be saved with:
"first line\nsecond line"
Note:
for Python, this should be written as:
"first line\\nsecond line"
where \\ is for escaping the backslash, otherwise python will treat \n as
the control character "new line"
Unfortunately many of the answers here address the question of how to put a newline character in the string data. The question is how to make the code look nicer by splitting the string value across multiple lines of code. (And even the answers that recognize this provide "solutions" that assume one is free to change the data representation, which in many cases one is not.)
And the worse news is, there is no good answer.
In many programming languages, even if they don't explicitly support splitting strings across lines, you can still use string concatenation to get the desired effect; and as long as the compiler isn't awful this is fine.
But json is not a programming language; it's just a data representation. You can't tell it to concatenate strings. Nor does its (fairly small) grammar include any facility for representing a string on multiple lines.
Short of devising a pre-processor of some kind (and I, for one, don't feel like effectively making up my own language to solve this issue), there isn't a general solution to this problem. IF you can change the data format, then you can substitute an array of strings. Otherwise, this is one of the numerous ways that json isn't designed for human-readability.
I have had to do this for a small Node.js project and found this work-around to store multiline strings as array of lines to make it more human-readable (at a cost of extra code to convert them to string later):
{
"modify_head": [
"<script type='text/javascript'>",
"<!--",
" function drawSomeText(id) {",
" var pjs = Processing.getInstanceById(id);",
" var text = document.getElementById('inputtext').value;",
" pjs.drawText(text);}",
"-->",
"</script>"
],
"modify_body": [
"<input type='text' id='inputtext'></input>",
"<button onclick=drawSomeText('ExampleCanvas')></button>"
],
}
Once parsed, I just use myData.modify_head.join('\n') or myData.modify_head.join(), depending upon whether I want a line break after each string or not.
This looks quite neat to me, apart from that I have to use double quotes everywhere. Though otherwise, I could, perhaps, use YAML, but that has other pitfalls and is not supported natively.
Check out the specification! The JSON grammar's char production can take the following values:
any-Unicode-character-except-"-or-\-or-control-character
\"
\\
\/
\b
\f
\n
\r
\t
\u four-hex-digits
Newlines are "control characters" so, no, you may not have a literal newline within your string. However you may encode it using whatever combination of \n and \r you require.
JSON doesn't allow breaking lines for readability.
Your best bet is to use an IDE that will line-wrap for you.
This is a really old question, but I came across this on a search and I think I know the source of your problem.
JSON does not allow "real" newlines in its data; it can only have escaped newlines. See the answer from #YOU. According to the question, it looks like you attempted to escape line breaks in Python two ways: by using the line continuation character ("\") or by using "\n" as an escape.
But keep in mind: if you are using a string in python, special escaped characters ("\t", "\n") are translated into REAL control characters! The "\n" will be replaced with the ASCII control character representing a newline character, which is precisely the character that is illegal in JSON. (As for the line continuation character, it simply takes the newline out.)
So what you need to do is to prevent Python from escaping characters. You can do this by using a raw string (put r in front of the string, as in r"abc\ndef", or by including an extra slash in front of the newline ("abc\\ndef").
Both of the above will, instead of replacing "\n" with the real newline ASCII control character, will leave "\n" as two literal characters, which then JSON can interpret as a newline escape.
Write property value as a array of strings. Like example given over here https://gun.io/blog/multi-line-strings-in-json/. This will help.
We can always use array of strings for multiline strings like following.
{
"singleLine": "Some singleline String",
"multiline": ["Line one", "line Two", "Line Three"]
}
And we can easily iterate array to display content in multi line fashion.
While not standard, I found that some of the JSON libraries have options to support multiline Strings. I am saying this with the caveat, that this will hurt your interoperability.
However in the specific scenario I ran into, I needed to make a config file that was only ever used by one system readable and manageable by humans. And opted for this solution in the end.
Here is how this works out on Java with Jackson:
JsonMapper mapper = JsonMapper.builder()
.enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
.build()
This is a very old question, but I had the same question when I wanted to improve readability of our Vega JSON Specification code which uses complex conditoinal expressions. The code is like this.
As this answer says, JSON is not designed for human. I understand that is a historical decision and it makes sense for data exchange purposes. However, JSON is still used as source code for such cases. So I asked our engineers to use Hjson for source code and process it to JSON.
For example, in Git for Windows environment,
you can download the Hjson cli binary and put it in git/bin directory to use.
Then, convert (transpile) Hjson source to JSON. To use automation tools such as Make will be useful to generate JSON.
$ which hjson
/c/Program Files/git/bin/hjson
$ cat example.hjson
{
md:
'''
First line.
Second line.
This line is indented by two spaces.
'''
}
$ hjson -j example.hjson > example.json
$ cat example.json
{
"md": "First line.\nSecond line.\n This line is indented by two spaces."
}
In case of using the transformed JSON in programming languages, language-specific libraries like hjson-js will be useful.
I noticed the same idea was posted in a duplicated question but I would share a bit more information.
You can encode at client side and decode at server side. This will take care of \n and \t characters as well
e.g. I needed to send multiline xml through json
{
"xml": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiID8+CiAgPFN0cnVjdHVyZXM+CiAgICAgICA8aW5wdXRzPgogICAgICAgICAgICAgICAjIFRoaXMgcHJvZ3JhbSBhZGRzIHR3byBudW1iZXJzCgogICAgICAgICAgICAgICBudW0xID0gMS41CiAgICAgICAgICAgICAgIG51bTIgPSA2LjMKCiAgICAgICAgICAgICAgICMgQWRkIHR3byBudW1iZXJzCiAgICAgICAgICAgICAgIHN1bSA9IG51bTEgKyBudW0yCgogICAgICAgICAgICAgICAjIERpc3BsYXkgdGhlIHN1bQogICAgICAgICAgICAgICBwcmludCgnVGhlIHN1bSBvZiB7MH0gYW5kIHsxfSBpcyB7Mn0nLmZvcm1hdChudW0xLCBudW0yLCBzdW0pKQogICAgICAgPC9pbnB1dHM+CiAgPC9TdHJ1Y3R1cmVzPg=="
}
then decode it on server side
public class XMLInput
{
public string xml { get; set; }
public string DecodeBase64()
{
var valueBytes = System.Convert.FromBase64String(this.xml);
return Encoding.UTF8.GetString(valueBytes);
}
}
public async Task<string> PublishXMLAsync([FromBody] XMLInput xmlInput)
{
string data = xmlInput.DecodeBase64();
}
once decoded you'll get your original xml
<?xml version="1.0" encoding="utf-8" ?>
<Structures>
<inputs>
# This program adds two numbers
num1 = 1.5
num2 = 6.3
# Add two numbers
sum = num1 + num2
# Display the sum
print('The sum of {0} and {1} is {2}'.format(num1, num2, sum))
</inputs>
</Structures>
\n\r\n worked for me !!
\n for single line break and \n\r\n for double line break
I see many answers here that may not works in most cases but may be the easiest solution if let's say you wanna output what you wrote down inside a JSON file (for example: for language translations where you wanna have just one key with more than 1 line outputted on the client) can be just adding some special characters of your choice PS: allowed by the JSON files like \\ before the new line and use some JS to parse the text ... like:
Example:
File (text.json)
{"text": "some JSON text. \\ Next line of JSON text"}
import text from 'text.json'
{text.split('\\')
.map(line => {
return (
<div>
{line}
<br />
</div>
);
})}}
Assuming the question has to do with easily editing text files and then manually converting them to json, there are two solutions I found:
hjson (that was mentioned in this previous answer), in which case you can convert your existing json file to hjson format by executing hjson source.json > target.hjson, edit in your favorite editor, and convert back to json hjson -j target.hjson > source.json. You can download the binary here or use the online conversion here.
jsonnet, which does the same, but with a slightly different format (single and double quoted strings are simply allowed to span multiple lines). Conveniently, the homepage has editable input fields so you can simply insert your multiple line json/jsonnet files there and they will be converted online to standard json immediately. Note that jsonnet supports much more goodies for templating json files, so it may be useful to look into, depending on your needs.
If it's just for presentation in your editor you may use ` instead of " or '
const obj = {
myMultiLineString: `This is written in a \
multiline way. \
The backside of it is that you \
can't use indentation on every new \
line because is would be included in \
your string. \
The backslash after each line escapes the carriage return.
`
}
Examples:
console.log(`First line \
Second line`);
will put in console:
First line Second line
console.log(`First line
second line`);
will put in console:
First line
second line
Hope this answered your question.

Perl: how to convert unicode symbols to something that will survive trip into and out of database and render in HTML

TL;DR I have input that looks like this:
इस परीक्षण के लिए है
Something
Zürich
This data is then piped through a few programs and is ultimately inserted into a mongodb database.
But by the time I query it out and try to display it on a web page it's all garbage.
I've found a lot of questions on how to encode these things but all the answers assume you want everything encoded and do not discuss how to decode it for display.
I only want the "weird" stuff encoded, so for the above I'd like to get some output like this
0x1234;0x8737;0x838784; ...
Something
Z0x8387;rich
which would store fine in a database, and would survive a vim edit or whatever else, but then when I pull it out I want it to render correctly.
So how do I do that, encode in Perl and decode in Javascript?
PS: I don't know what that string of symbols means, just found it somewhere. Sorry if it's offensive or something. Thanks!
Edit:
choroba's answer is a very good start, let's see with an example of what the algorithm produces:
input: 株式会社イノ設計
output: 0x230;0x160;0x170;0x229;0x188;0x143;0x228;0x188;0x154;0x231;0x164;0x190;0x227;0x130;0x164;0x227;0x131;0x142;0x232;0x168;0x173;0x232;0x168;0x136;
Now how do I render that in Javascript? 0xNN was just an example of what I imagine the answer would be but if there's a better way by all means!
Thanks!
Here's an example that produces something similar to what you want:
#! /usr/bin/perl
use warnings;
use strict;
sub escape {
my ($in) = #_;
$in =~ s/([\x{80}-\x{ffff}])/sprintf '0x%d;', ord $1/ger
}
my $in = "Z\N{LATIN SMALL LETTER U WITH DIAERESIS}rich";
my $out = 'Z0x252;rich';
$out eq escape($in) or die escape($in) . "\n$out\n";
You seem to want decimal digits after 0x. That's confusing as 0x usually means hexadecimal. To get hexadecimal codes, change the sprintf template to 0x%x;.
Also note that once someone enters 0x123; into your data directly, the data will become corrupted.
If you use &# instead of 0x at the beginning of each replaced character, the browser will render the characters correctly: Zürich renders as "Zürich".

Decode or unescape \u00f0\u009f\u0091\u008d to 👍

We all know UTF-8 is hard. I exported my messages from Facebook and the resulting JSON file escaped all non-ascii characters to unicode code points.
I am looking for an easy way to unescape these unicode code points to regular old UTF-8. I also would love to use PowerShell.
I tried
$str = "\u00f0\u009f\u0091\u008d"
[Regex]::Replace($str, "\\[Uu]([0-9A-Fa-f]{4})", `
{[char]::ToString([Convert]::ToInt32($args[0].Groups[1].Value, 16))} )
but that only gives me ð as a result, not 👍.
I also tried using Notepad++ and I found this SO post: How to convert escaped Unicode (e.g. \u0432\u0441\u0435) to UTF-8 chars (все) in Notepad++. The accepted answer also results in exactly the same as the example above: ð.
I found the decoding solution here: the UTF8.js library that decodes the text perfectly and you can try it out here (with \u00f0\u009f\u0091\u008d as input).
Is there a way in PowerShell to decode \u00f0\u009f\u0091\u008d to receive 👍? I'd love to have real UTF-8 in my exported Facebook messages so I can actually read them.
Bonus points for helping me understand what \u00f0\u009f\u0091\u008d actually represents (besides it being some UTF-8 hex representation). Why is it the same as U+1F44D or \uD83D\uDC4D in C++?
The Unicode code point of the 👍character is U+1F44D.
Using the variable-length UTF-8 encoding, the following 4 bytes (expressed as hex. numbers) are needed to represent this code point: F0 9F 91 8D.
While these bytes are recognizable in your string,
$str = "\u00f0\u009f\u0091\u008d"
they shouldn't be represented as \u escape codes, because they're not Unicode code units / code point, they're bytes.
With a 4-hex-digit escape sequence (UTF-16), the proper representation would require 2 16-bit Unicode code units, a so-called surrogate pair, which together represent the single non-BMP code point U+1F44D:
$str = "\uD83D\uDC4D"
If your JSON input used such proper Unicode escapes, PowerShell would process the string correctly; e.g.:
'{ "str": "\uD83D\uDC4D" }' | ConvertFrom-Json > out.txt
If you examine file out.txt, you'll see something like:
str
---
👍
(The output was sent to a file, because console windows wouldn't render the 👍char. correctly, at least not without additional configuration; note that if you used PowerShell Core on Linux or macOS, however, terminal output would work.)
Therefore, the best solution would be to correct the problem at the source and use proper Unicode escapes (or even use the characters themselves, as long as the source supports any of the standard Unicode encodings).
If you really must parse the broken representation, try the following workaround (PSv4+), building on your own [regex]::Replace() technique:
$str = "A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."
[regex]::replace($str, '(?:\\u[0-9a-f]{4})+', { param($m)
$utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
[text.encoding]::utf8.GetString($utf8Bytes)
})
This should yield A 👍 for Motörhead.
The above translates sequences of \u... escapes into the byte values they represent and interprets the resulting byte array as UTF-8 text.
To save the decoded string to a UTF-8 file, use ... | Set-Content -Encoding utf8 out.txt
Alternatively, in PSv5+, as Dennis himself suggests, you can make Out-File and therefore it's virtual alias, >, default to UTF-8 via PowerShell's global parameter-defaults hashtable:
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
Note, however, that on Windows PowerShell (as opposed to PowerShell Core) you'll get an UTF-8 file with a BOM in both cases - avoiding that requires direct use of the .NET framework: see Using PowerShell to write a file in UTF-8 without the BOM
iso-8859-1 - very often - intermediate member in operations with Utf-8
$text=[regex]::Unescape("A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead.")
Write-Host "[regex]::Unescape(utf-8) = $text"
$encTo=[System.Text.Encoding]::GetEncoding('iso-8859-1') # Change it to yours (iso-8859-2) i suppose
$bytes = $encTo.GetBytes($Text)
$text=[System.Text.Encoding]::UTF8.GetString($bytes)
Write-Host "utf8_DecodedFrom_8859_1 = $text"
[regex]::Unescape(utf-8) = A ð for Motörhead.
utf8_DecodedFrom_8859_1 = A 👍 for Motörhead.
What pleases in mklement0 example - it is easy to get an encoded string of this type.
What is bad - the line will be huge. (First 2 nibbles '00' is a waste)
I must admit, the mklement0 example is charming.
The code for encoding - one line only!!!:
$emoji='A 👍 for Motörhead.'
[Reflection.Assembly]::LoadWithPartialName("System.Web") | Out-Null
$str=(([System.Web.HttpUtility]::UrlEncode($emoji)) -replace '%','\u00') -replace '\+',' '
$str
You can decode this by the standard url way:
$str="A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."
$str=$str -replace '\\u00','%'
[Reflection.Assembly]::LoadWithPartialName("System.Web") | Out-Null
[System.Web.HttpUtility]::UrlDecode($str)
A 👍 for Motörhead.

Perl: Why do i need to set the latin1 flag explicitly since JSON 2.xx?

Since JSON 2.xx i need to set the latin1 flag in order to get umlauts safe to the html document:
my $obj_with_umlauts = {
title => 'geändert',
}
my $json = JSON->new()->latin1(1)->encode($obj_with_umlauts);
This was not necessary using JSON 1.xx :
my $json = JSON->new()->objToJson($obj_with_umlauts);
The html document is in iso-8559-1 (meta-tag).
Can anybody explain to me why?
This is such a huge can of worms that you're opening here.
I suspect that the answer is something along the lines of "a bug was fixed in the character handling of JSON.pm". But it's hard to know what is going on without a lot more information about your situation.
How is $string_with_umlauts being set? How are you encoding the data that you write to the HTML document?
Do you want to handle utf8 data correctly (you really should) or are you happy assuming that you live in a Latin1 world?
It's important to realise that if you completely ignore Unicode considerations then it can often seem that your programs are working correctly as errors often cancel each other out. When you start to address Unicode issues, it can seem that your programs are getting worse until you address all of the issues.
The Perl Unicode Tutorial is a good place to start learning about these things.
P.S. It's "Perl", not "PERL".
What are you talking about?
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->objToJson(["\xE4"]);
say sprintf "%v02X", $json;
'
1.15
5B.22.E4.22.5D # Unicode code points for ["ä"]
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # Unicode code points for ["ä"]
Those two strings are identical! In fact, adding ->latin1() doesn't change anything because the iso-8859-1 encoding of Unicode code point U+00E4 is E4.
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->latin1()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # iso-8859-1 encoding of ["ä"]
There is one difference between the last two: it's stored differently in the scalar. That should make absolutely no difference. If code treats them differently, then that code is incorrectly reading the data in the scalar, and that code is buggy.
$string_with_umlauts definetly is a string in winLatin
Well, that's error number one.
JSON expects strings of decoded text (strings of Unicode code points), not encoded text.
That said, there happens to be no difference between a string encoded using iso-8859-1 and a string of Unicode code points. For example, when encoded using iso-8859-1, "ä" is byte E4, and it's Unicode code point U+00E4, two different notation for the same number.
If the string is encoded using cp1252, though, you'll have problems with characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ (the characters in cp1252 but not in iso-8859-1). For example, when encoded using cp1252, "€" is byte 80, but it's Unicode code point U+20AC. 0x80 != 0x20AC.
The html document is in iso-8559-1 (meta-tag).
Then at some point, you'll have to encode the output into iso-8859-1. You can do it using an :encoding layer, or using Encode's encode or using JSON's ->latin1 directive. The advantage of using this final option is that it will cause JSON to escape any character outside of the iso-8859-1 character set before attempting to encode it.
Can anybody explain to me why?
You have a code (an XS module) that reads the underlying string buffer of the scalar and incorrectly treats that as the content of the string. There is a bug is in that module.

Are multi-line strings allowed in JSON?

Is it possible to have multi-line strings in JSON?
It's mostly for visual comfort so I suppose I can just turn word wrap on in my editor, but I'm just kinda curious.
I'm writing some data files in JSON format and would like to have some really long string values split over multiple lines. Using python's JSON module I get a whole lot of errors, whether I use \ or \n as an escape.
JSON does not allow real line-breaks. You need to replace all the line breaks with \n.
eg:
"first line
second line"
can be saved with:
"first line\nsecond line"
Note:
for Python, this should be written as:
"first line\\nsecond line"
where \\ is for escaping the backslash, otherwise python will treat \n as
the control character "new line"
Unfortunately many of the answers here address the question of how to put a newline character in the string data. The question is how to make the code look nicer by splitting the string value across multiple lines of code. (And even the answers that recognize this provide "solutions" that assume one is free to change the data representation, which in many cases one is not.)
And the worse news is, there is no good answer.
In many programming languages, even if they don't explicitly support splitting strings across lines, you can still use string concatenation to get the desired effect; and as long as the compiler isn't awful this is fine.
But json is not a programming language; it's just a data representation. You can't tell it to concatenate strings. Nor does its (fairly small) grammar include any facility for representing a string on multiple lines.
Short of devising a pre-processor of some kind (and I, for one, don't feel like effectively making up my own language to solve this issue), there isn't a general solution to this problem. IF you can change the data format, then you can substitute an array of strings. Otherwise, this is one of the numerous ways that json isn't designed for human-readability.
I have had to do this for a small Node.js project and found this work-around to store multiline strings as array of lines to make it more human-readable (at a cost of extra code to convert them to string later):
{
"modify_head": [
"<script type='text/javascript'>",
"<!--",
" function drawSomeText(id) {",
" var pjs = Processing.getInstanceById(id);",
" var text = document.getElementById('inputtext').value;",
" pjs.drawText(text);}",
"-->",
"</script>"
],
"modify_body": [
"<input type='text' id='inputtext'></input>",
"<button onclick=drawSomeText('ExampleCanvas')></button>"
],
}
Once parsed, I just use myData.modify_head.join('\n') or myData.modify_head.join(), depending upon whether I want a line break after each string or not.
This looks quite neat to me, apart from that I have to use double quotes everywhere. Though otherwise, I could, perhaps, use YAML, but that has other pitfalls and is not supported natively.
Check out the specification! The JSON grammar's char production can take the following values:
any-Unicode-character-except-"-or-\-or-control-character
\"
\\
\/
\b
\f
\n
\r
\t
\u four-hex-digits
Newlines are "control characters" so, no, you may not have a literal newline within your string. However you may encode it using whatever combination of \n and \r you require.
JSON doesn't allow breaking lines for readability.
Your best bet is to use an IDE that will line-wrap for you.
This is a really old question, but I came across this on a search and I think I know the source of your problem.
JSON does not allow "real" newlines in its data; it can only have escaped newlines. See the answer from #YOU. According to the question, it looks like you attempted to escape line breaks in Python two ways: by using the line continuation character ("\") or by using "\n" as an escape.
But keep in mind: if you are using a string in python, special escaped characters ("\t", "\n") are translated into REAL control characters! The "\n" will be replaced with the ASCII control character representing a newline character, which is precisely the character that is illegal in JSON. (As for the line continuation character, it simply takes the newline out.)
So what you need to do is to prevent Python from escaping characters. You can do this by using a raw string (put r in front of the string, as in r"abc\ndef", or by including an extra slash in front of the newline ("abc\\ndef").
Both of the above will, instead of replacing "\n" with the real newline ASCII control character, will leave "\n" as two literal characters, which then JSON can interpret as a newline escape.
Write property value as a array of strings. Like example given over here https://gun.io/blog/multi-line-strings-in-json/. This will help.
We can always use array of strings for multiline strings like following.
{
"singleLine": "Some singleline String",
"multiline": ["Line one", "line Two", "Line Three"]
}
And we can easily iterate array to display content in multi line fashion.
While not standard, I found that some of the JSON libraries have options to support multiline Strings. I am saying this with the caveat, that this will hurt your interoperability.
However in the specific scenario I ran into, I needed to make a config file that was only ever used by one system readable and manageable by humans. And opted for this solution in the end.
Here is how this works out on Java with Jackson:
JsonMapper mapper = JsonMapper.builder()
.enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
.build()
This is a very old question, but I had the same question when I wanted to improve readability of our Vega JSON Specification code which uses complex conditoinal expressions. The code is like this.
As this answer says, JSON is not designed for human. I understand that is a historical decision and it makes sense for data exchange purposes. However, JSON is still used as source code for such cases. So I asked our engineers to use Hjson for source code and process it to JSON.
For example, in Git for Windows environment,
you can download the Hjson cli binary and put it in git/bin directory to use.
Then, convert (transpile) Hjson source to JSON. To use automation tools such as Make will be useful to generate JSON.
$ which hjson
/c/Program Files/git/bin/hjson
$ cat example.hjson
{
md:
'''
First line.
Second line.
This line is indented by two spaces.
'''
}
$ hjson -j example.hjson > example.json
$ cat example.json
{
"md": "First line.\nSecond line.\n This line is indented by two spaces."
}
In case of using the transformed JSON in programming languages, language-specific libraries like hjson-js will be useful.
I noticed the same idea was posted in a duplicated question but I would share a bit more information.
You can encode at client side and decode at server side. This will take care of \n and \t characters as well
e.g. I needed to send multiline xml through json
{
"xml": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiID8+CiAgPFN0cnVjdHVyZXM+CiAgICAgICA8aW5wdXRzPgogICAgICAgICAgICAgICAjIFRoaXMgcHJvZ3JhbSBhZGRzIHR3byBudW1iZXJzCgogICAgICAgICAgICAgICBudW0xID0gMS41CiAgICAgICAgICAgICAgIG51bTIgPSA2LjMKCiAgICAgICAgICAgICAgICMgQWRkIHR3byBudW1iZXJzCiAgICAgICAgICAgICAgIHN1bSA9IG51bTEgKyBudW0yCgogICAgICAgICAgICAgICAjIERpc3BsYXkgdGhlIHN1bQogICAgICAgICAgICAgICBwcmludCgnVGhlIHN1bSBvZiB7MH0gYW5kIHsxfSBpcyB7Mn0nLmZvcm1hdChudW0xLCBudW0yLCBzdW0pKQogICAgICAgPC9pbnB1dHM+CiAgPC9TdHJ1Y3R1cmVzPg=="
}
then decode it on server side
public class XMLInput
{
public string xml { get; set; }
public string DecodeBase64()
{
var valueBytes = System.Convert.FromBase64String(this.xml);
return Encoding.UTF8.GetString(valueBytes);
}
}
public async Task<string> PublishXMLAsync([FromBody] XMLInput xmlInput)
{
string data = xmlInput.DecodeBase64();
}
once decoded you'll get your original xml
<?xml version="1.0" encoding="utf-8" ?>
<Structures>
<inputs>
# This program adds two numbers
num1 = 1.5
num2 = 6.3
# Add two numbers
sum = num1 + num2
# Display the sum
print('The sum of {0} and {1} is {2}'.format(num1, num2, sum))
</inputs>
</Structures>
\n\r\n worked for me !!
\n for single line break and \n\r\n for double line break
I see many answers here that may not works in most cases but may be the easiest solution if let's say you wanna output what you wrote down inside a JSON file (for example: for language translations where you wanna have just one key with more than 1 line outputted on the client) can be just adding some special characters of your choice PS: allowed by the JSON files like \\ before the new line and use some JS to parse the text ... like:
Example:
File (text.json)
{"text": "some JSON text. \\ Next line of JSON text"}
import text from 'text.json'
{text.split('\\')
.map(line => {
return (
<div>
{line}
<br />
</div>
);
})}}
Assuming the question has to do with easily editing text files and then manually converting them to json, there are two solutions I found:
hjson (that was mentioned in this previous answer), in which case you can convert your existing json file to hjson format by executing hjson source.json > target.hjson, edit in your favorite editor, and convert back to json hjson -j target.hjson > source.json. You can download the binary here or use the online conversion here.
jsonnet, which does the same, but with a slightly different format (single and double quoted strings are simply allowed to span multiple lines). Conveniently, the homepage has editable input fields so you can simply insert your multiple line json/jsonnet files there and they will be converted online to standard json immediately. Note that jsonnet supports much more goodies for templating json files, so it may be useful to look into, depending on your needs.
The reason OP asked is the same reason I ended up here. Had a json file with long text.
In VS Code it's just ALT+Z to turn on word wrapping in a json file. Changing the actual data isn't what you want, if all you really want is to read the contents of the file as a developer.
If it's just for presentation in your editor you may use ` instead of " or '
const obj = {
myMultiLineString: `This is written in a \
multiline way. \
The backside of it is that you \
can't use indentation on every new \
line because is would be included in \
your string. \
The backslash after each line escapes the carriage return.
`
}
Examples:
console.log(`First line \
Second line`);
will put in console:
First line Second line
console.log(`First line
second line`);
will put in console:
First line
second line
Hope this answered your question.