I have this which works to declare a JSON string in a bash script:
local my_var="foobar"
local json=`cat <<EOF
{"quicklock":"${my_var}"}
EOF`
The above heredoc works, but I can't seem to format it any other way; it literally has to look exactly like that.
Is there any way to get the command to be on one line, something like this:
local json=`cat <<EOF{"quicklock":"${my_var}"}EOF`
That would be so much nicer, but it doesn't seem to work, presumably because that's not how EOF delimiters work.
I am looking for a shorthand way to declare JSON in a bash script that:
Does not require a ton of escape chars.
That allows for dynamic interpolation of variables.
Note: The actual JSON I want to use has multiple dynamic variables with many key/value pairs. Please extrapolate.
I'm not a JSON guy and don't really understand the "well-formed" arguments in the discussion above, but you can use a 'here-string' rather than a 'here-document', like this:
my_var="foobar"
json=`cat <<<{\"quicklock\":\"${my_var}\"}`
Why not use jq? It's pretty good at managing string interpolation and it lints your structure.
$ echo '{}' > foo.json
$ declare myvar="assigned-var"
$ jq --arg ql "$myvar" '.quicklock=$ql' foo.json
The text that comes out the other end of that call to jq can then be redirected into a file or whatever you want to do. It would look something like this:
{"quicklock": "assigned-var"}
You can do this with printf:
local json="$(printf '{"quicklock":"%s"}' "$my_var")"
(Never mind that SO's syntax highlighting looks odd here. POSIX shell command substitution allows nesting one level of quotes.)
A note (thanks to Charles Duffy's comment on the question): I'm assuming $my_var is not controlled by user input. If it is, you'll need to be careful to ensure it is legal for a JSON string. I highly recommend barring non-ASCII characters, double quotes, and backslashes. If you have jq available, you can use it as Charles noted in the comments to ensure you have well-formed output.
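A minimal sketch of that check, assuming jq is available: pipe the generated string through jq -e ., which exits nonzero when the input is not well-formed JSON.
json="$(printf '{"quicklock":"%s"}' "$my_var")"
# jq -e . parses stdin and fails if the string is malformed
if ! printf '%s' "$json" | jq -e . >/dev/null; then
  echo "warning: generated string is not valid JSON" >&2
fi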
You can define your own helper function to work around the missing bash syntax:
# Reads the line of the calling script that invoked begin(), extracts the text
# between "begin" and "end", escapes the double quotes, and evals the result.
function begin() { eval echo $(sed "${BASH_LINENO[0]}"'!d;s/.*begin \(.*\) end.*/\1/;s/"/\\\"/g' "${BASH_SOURCE[0]}"); }
Then you can use it as follows.
my_var="foobar"
json=$(begin { "quicklock" : "${my_var}" } end)
echo "$json"
This fragment displays the desired output:
{ "quicklock" : "foobar" }
This is just a proof of concept. You can define the syntax any way you want (such as terminating the input with a custom EOF string, or correctly escaping invalid characters). For example, since Bash allows function identifiers that use characters other than alphanumerics, it is possible to define syntax like this:
json=$(/ { "quicklock" : "${my_var}" } /)
Moreover, if you relax the first criterion (avoiding escape characters), an ordinary assignment nicely solves the problem:
json="{ \"quicklock\" : \"${my_var}\" }"
How about just using the shell's natural concatenation of strings? If you concatenate the variables rather than interpolating them inside one quoted string, you can avoid escapes and get everything on one line:
my_var1="foobar"
my_var2="quux"
json='{"quicklock":"'${my_var1}'","slowlock":"'$my_var2'"}'
That said, this is a pretty crude scheme, and as others have pointed out you'll have problems if the variables, say, contain quote characters.
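To make that concrete, here is the failure mode with a hypothetical value containing a double quote:
my_var1='foo"bar'
json='{"quicklock":"'${my_var1}'"}'
echo "$json"    # {"quicklock":"foo"bar"} -- the embedded quote breaks the JSON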
Since "no escape chars" is a strong requirement, here is a here-doc-based solution:
#!/bin/bash
my_var='foobar'
read -r -d '' json << EOF
{
"quicklock": "$my_var"
}
EOF
echo "$json"
It will give you the same output as the first solution I mentioned.
Just be careful: if you put the first EOF inside double quotes:
read -r -d '' json << "EOF"
$my_var would not be treated as a variable but as plain text, so you would get this output:
{
"quicklock": "$my_var"
}
Related
Is it possible to have multi-line strings in JSON?
It's mostly for visual comfort so I suppose I can just turn word wrap on in my editor, but I'm just kinda curious.
I'm writing some data files in JSON format and would like to have some really long string values split over multiple lines. Using Python's JSON module I get a whole lot of errors, whether I use \ or \n as an escape.
JSON does not allow real line-breaks. You need to replace all the line breaks with \n.
eg:
"first line
second line"
can be saved with:
"first line\nsecond line"
Note: for Python, this should be written as:
"first line\\nsecond line"
where \\ escapes the backslash; otherwise Python will treat \n as the control character "new line".
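If you want to sanity-check the escaped form, a hedged option (assuming jq is installed) is to parse it back and print the raw string; the \n comes out as a real line break:
printf '%s' '{"text":"first line\nsecond line"}' | jq -r .text
# first line
# second line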
Unfortunately many of the answers here address the question of how to put a newline character in the string data. The question is how to make the code look nicer by splitting the string value across multiple lines of code. (And even the answers that recognize this provide "solutions" that assume one is free to change the data representation, which in many cases one is not.)
And the worse news is, there is no good answer.
In many programming languages, even if they don't explicitly support splitting strings across lines, you can still use string concatenation to get the desired effect; and as long as the compiler isn't awful this is fine.
But json is not a programming language; it's just a data representation. You can't tell it to concatenate strings. Nor does its (fairly small) grammar include any facility for representing a string on multiple lines.
Short of devising a pre-processor of some kind (and I, for one, don't feel like effectively making up my own language to solve this issue), there isn't a general solution to this problem. If you can change the data format, then you can substitute an array of strings. Otherwise, this is one of the numerous ways that JSON isn't designed for human-readability.
I have had to do this for a small Node.js project and found this work-around: store multiline strings as an array of lines to make them more human-readable (at the cost of extra code to convert them to a string later):
{
"modify_head": [
"<script type='text/javascript'>",
"<!--",
" function drawSomeText(id) {",
" var pjs = Processing.getInstanceById(id);",
" var text = document.getElementById('inputtext').value;",
" pjs.drawText(text);}",
"-->",
"</script>"
],
"modify_body": [
"<input type='text' id='inputtext'></input>",
"<button onclick=drawSomeText('ExampleCanvas')></button>"
]
}
Once parsed, I just use myData.modify_head.join('\n') or myData.modify_head.join(), depending upon whether I want a line break after each string or not.
This looks quite neat to me, apart from having to use double quotes everywhere. Otherwise I could perhaps use YAML, but that has other pitfalls and is not supported natively.
Check out the specification! The JSON grammar's char production can take the following values:
any-Unicode-character-except-"-or-\-or-control-character
\"
\\
\/
\b
\f
\n
\r
\t
\u four-hex-digits
Newlines are "control characters" so, no, you may not have a literal newline within your string. However you may encode it using whatever combination of \n and \r you require.
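For instance, a hedged way to produce a correctly escaped JSON string from text containing real newlines (assuming jq is available) is jq's raw-input mode:
# -R reads raw text, -s slurps the whole input into one string,
# and . outputs it JSON-encoded
printf 'first line\nsecond line\n' | jq -Rs .
# "first line\nsecond line\n"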
JSON doesn't allow breaking lines for readability.
Your best bet is to use an IDE that will line-wrap for you.
This is a really old question, but I came across this on a search and I think I know the source of your problem.
JSON does not allow "real" newlines in its data; it can only have escaped newlines. See the answer from @YOU. According to the question, it looks like you attempted to escape line breaks in Python two ways: by using the line continuation character ("\") or by using "\n" as an escape.
But keep in mind: if you are using a string in python, special escaped characters ("\t", "\n") are translated into REAL control characters! The "\n" will be replaced with the ASCII control character representing a newline character, which is precisely the character that is illegal in JSON. (As for the line continuation character, it simply takes the newline out.)
So what you need to do is to prevent Python from escaping characters. You can do this by using a raw string (put r in front of the string, as in r"abc\ndef") or by including an extra backslash in front of the newline ("abc\\ndef").
Both of the above, instead of replacing "\n" with the real newline ASCII control character, will leave "\n" as two literal characters, which JSON can then interpret as a newline escape.
Write the property value as an array of strings, like the example given here: https://gun.io/blog/multi-line-strings-in-json/. This will help.
We can always use an array of strings for multiline strings, like the following.
{
"singleLine": "Some singleline String",
"multiline": ["Line one", "line Two", "Line Three"]
}
And we can easily iterate over the array to display the content in a multiline fashion.
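For instance, a hedged sketch of doing the same join outside JavaScript, assuming jq and a file data.json holding the object above:
# join the "multiline" array back into one newline-separated string
jq -r '.multiline | join("\n")' data.json
# Line one
# line Two
# Line Three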
While not standard, I found that some JSON libraries have options to support multiline strings. I am saying this with the caveat that it will hurt your interoperability.
However, in the specific scenario I ran into, I needed to make a config file that was only ever used by one system readable and manageable by humans, and I opted for this solution in the end.
Here is how this works out on Java with Jackson:
JsonMapper mapper = JsonMapper.builder()
        .enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
        .build();
This is a very old question, but I had the same one when I wanted to improve the readability of our Vega JSON specification code, which uses complex conditional expressions.
As this answer says, JSON is not designed to be human-readable. I understand that is a historical decision and it makes sense for data exchange purposes. However, JSON is still used as source code in cases like this, so I asked our engineers to use Hjson for the source code and transpile it to JSON.
For example, in a Git for Windows environment, you can download the Hjson CLI binary and put it in the git/bin directory. Then convert (transpile) the Hjson source to JSON. An automation tool such as Make is useful for regenerating the JSON.
$ which hjson
/c/Program Files/git/bin/hjson
$ cat example.hjson
{
  md:
    '''
    First line.
    Second line.
      This line is indented by two spaces.
    '''
}
$ hjson -j example.hjson > example.json
$ cat example.json
{
  "md": "First line.\nSecond line.\n  This line is indented by two spaces."
}
In case of using the transformed JSON in programming languages, language-specific libraries like hjson-js will be useful.
I noticed the same idea was posted on a duplicate question, but I'll share a bit more information here.
You can encode on the client side and decode on the server side. This will take care of \n and \t characters as well.
For example, I needed to send multiline XML through JSON:
{
"xml": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiID8+CiAgPFN0cnVjdHVyZXM+CiAgICAgICA8aW5wdXRzPgogICAgICAgICAgICAgICAjIFRoaXMgcHJvZ3JhbSBhZGRzIHR3byBudW1iZXJzCgogICAgICAgICAgICAgICBudW0xID0gMS41CiAgICAgICAgICAgICAgIG51bTIgPSA2LjMKCiAgICAgICAgICAgICAgICMgQWRkIHR3byBudW1iZXJzCiAgICAgICAgICAgICAgIHN1bSA9IG51bTEgKyBudW0yCgogICAgICAgICAgICAgICAjIERpc3BsYXkgdGhlIHN1bQogICAgICAgICAgICAgICBwcmludCgnVGhlIHN1bSBvZiB7MH0gYW5kIHsxfSBpcyB7Mn0nLmZvcm1hdChudW0xLCBudW0yLCBzdW0pKQogICAgICAgPC9pbnB1dHM+CiAgPC9TdHJ1Y3R1cmVzPg=="
}
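A hedged sketch of producing that payload on the client side with standard tools (assumes GNU base64 and jq; payload.xml is a hypothetical input file):
# -w0 disables line wrapping so the base64 stays one JSON-safe token
xml_b64=$(base64 -w0 payload.xml)
jq -n --arg xml "$xml_b64" '{xml: $xml}'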
Then decode it on the server side:
public class XMLInput
{
    public string xml { get; set; }

    public string DecodeBase64()
    {
        var valueBytes = System.Convert.FromBase64String(this.xml);
        return Encoding.UTF8.GetString(valueBytes);
    }
}

public async Task<string> PublishXMLAsync([FromBody] XMLInput xmlInput)
{
    string data = xmlInput.DecodeBase64();
    return data; // the original multiline XML
}
Once decoded, you'll get your original XML:
<?xml version="1.0" encoding="utf-8" ?>
  <Structures>
       <inputs>
               # This program adds two numbers
               num1 = 1.5
               num2 = 6.3

               # Add two numbers
               sum = num1 + num2

               # Display the sum
               print('The sum of {0} and {1} is {2}'.format(num1, num2, sum))
       </inputs>
  </Structures>
\n\r\n worked for me: \n for a single line break and \n\r\n for a double line break.
I see many answers here that may not work in most cases, but this may be the easiest solution if, say, you want to output what you wrote down inside a JSON file (for example, for language translations where you want just one key rendered as more than one line on the client). The idea is to add a special character of your choice (one that is allowed in JSON strings, like \\) before each new line, and use some JS to split the text on it:
Example:
File (text.json)
{"text": "some JSON text. \\ Next line of JSON text"}
import data from 'text.json';

{data.text.split('\\').map(line => {
  return (
    <div>
      {line}
      <br />
    </div>
  );
})}
Assuming the question has to do with easily editing text files and then manually converting them to json, there are two solutions I found:
hjson (which was mentioned in this previous answer), in which case you can convert your existing json file to hjson format by executing hjson source.json > target.hjson, edit it in your favorite editor, and convert back to json with hjson -j target.hjson > source.json. You can download the binary here or use the online conversion here.
jsonnet, which does the same, but with a slightly different format (single and double quoted strings are simply allowed to span multiple lines). Conveniently, the homepage has editable input fields so you can simply insert your multiple line json/jsonnet files there and they will be converted online to standard json immediately. Note that jsonnet supports much more goodies for templating json files, so it may be useful to look into, depending on your needs.
If it's just for presentation in your editor, you may use ` instead of " or ':
const obj = {
  myMultiLineString: `This is written in a \
multiline way. \
The backside of it is that you \
can't use indentation on every new \
line because it would be included in \
your string. \
The backslash after each line escapes the carriage return.
`
}
Examples:
console.log(`First line \
Second line`);
will put in console:
First line Second line
console.log(`First line
second line`);
will put in console:
First line
second line
Hope this answered your question.
Should I or should I not wrap quotes around variables in a shell script?
For example, is the following correct:
xdg-open $URL
[ $? -eq 2 ]
or
xdg-open "$URL"
[ "$?" -eq "2" ]
And if so, why?
General rule: quote it if it can either be empty or contain spaces (or any whitespace really) or special characters (wildcards). Not quoting strings with spaces often leads to the shell breaking apart a single argument into many.
$? doesn't need quotes since it's a numeric value. Whether $URL needs it depends on what you allow in there and whether you still want an argument if it's empty.
I tend to always quote strings just out of habit since it's safer that way.
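A quick illustration of that word-splitting failure, using a hypothetical filename with a space in it:
file="my file.txt"
touch "$file"
cat $file      # error: cat sees two arguments, "my" and "file.txt"
cat "$file"    # correct: one argument, "my file.txt"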
In short, quote everything where you do not require the shell to perform word splitting and wildcard expansion.
Single quotes protect the text between them verbatim. It is the proper tool when you need to ensure that the shell does not touch the string at all. Typically, it is the quoting mechanism of choice when you do not require variable interpolation.
$ echo 'Nothing \t in here $will change'
Nothing \t in here $will change
$ grep -F '#&$*!!' file /dev/null
file:I can't get this #&$*!! quoting right.
Double quotes are suitable when variable interpolation is required. With suitable adaptations, it is also a good workaround when you need single quotes in the string. (There is no straightforward way to escape a single quote between single quotes, because there is no escape mechanism inside single quotes -- if there was, they would not quote completely verbatim.)
$ echo "There is no place like '$HOME'"
There is no place like '/home/me'
No quotes are suitable when you specifically require the shell to perform word splitting and/or wildcard expansion.
Word splitting (aka token splitting):
$ words="foo bar baz"
$ for word in $words; do
> echo "$word"
> done
foo
bar
baz
By contrast:
$ for word in "$words"; do echo "$word"; done
foo bar baz
(The loop only runs once, over the single, quoted string.)
$ for word in '$words'; do echo "$word"; done
$words
(The loop only runs once, over the literal single-quoted string.)
Wildcard expansion:
$ pattern='file*.txt'
$ ls $pattern
file1.txt file_other.txt
By contrast:
$ ls "$pattern"
ls: cannot access file*.txt: No such file or directory
(There is no file named literally file*.txt.)
$ ls '$pattern'
ls: cannot access $pattern: No such file or directory
(There is no file named $pattern, either!)
In more concrete terms, anything containing a filename should usually be quoted (because filenames can contain whitespace and other shell metacharacters). Anything containing a URL should usually be quoted (because many URLs contain shell metacharacters like ? and &). Anything containing a regex should usually be quoted (ditto ditto). Anything containing significant whitespace other than single spaces between non-whitespace characters needs to be quoted (because otherwise, the shell will munge the whitespace into, effectively, single spaces, and trim any leading or trailing whitespace).
When you know that a variable can only contain a value which contains no shell metacharacters, quoting is optional. Thus, an unquoted $? is basically fine, because this variable can only ever contain a single number. However, "$?" is also correct, and recommended for general consistency and correctness (though this is my personal recommendation, not a widely recognized policy).
Values which are not variables basically follow the same rules, though you could then also escape any metacharacters instead of quoting them. For a common example, a URL with a & in it will be parsed by the shell as a background command unless the metacharacter is escaped or quoted:
$ wget http://example.com/q&uack
[1] wget http://example.com/q
-bash: uack: command not found
(Of course, this also happens if the URL is in an unquoted variable.) For a static string, single quotes make the most sense, although any form of quoting or escaping works here.
wget 'http://example.com/q&uack' # Single quotes preferred for a static string
wget "http://example.com/q&uack" # Double quotes work here, too (no $ or ` in the value)
wget http://example.com/q\&uack # Backslash escape
wget http://example.com/q'&'uack # Only the metacharacter really needs quoting
The last example also suggests another useful concept, which I like to call "seesaw quoting". If you need to mix single and double quotes, you can use them adjacent to each other. For example, the following quoted strings
'$HOME '
"isn't"
' where `<3'
"' is."
can be pasted together back to back, forming a single long string after tokenization and quote removal.
$ echo '$HOME '"isn't"' where `<3'"' is."
$HOME isn't where `<3' is.
This isn't awfully legible, but it's a common technique and thus good to know.
As an aside, scripts should usually not use ls for anything. To expand a wildcard, just ... use it.
$ printf '%s\n' $pattern # not ``ls -1 $pattern''
file1.txt
file_other.txt
$ for file in $pattern; do # definitely, definitely not ``for file in $(ls $pattern)''
> printf 'Found file: %s\n' "$file"
> done
Found file: file1.txt
Found file: file_other.txt
(The loop is completely superfluous in the latter example; printf specifically works fine with multiple arguments. stat too. But looping over a wildcard match is a common problem, and frequently done incorrectly.)
A variable containing a list of tokens to loop over or a wildcard to expand is less frequently seen, so we sometimes abbreviate to "quote everything unless you know precisely what you are doing".
Here is a three-point formula for quotes in general:
Double quotes
In contexts where we want to suppress word splitting and globbing. Also in contexts where we want the literal to be treated as a string, not a regex.
Single quotes
In string literals where we want to suppress interpolation and special treatment of backslashes. In other words, situations where using double quotes would be inappropriate.
No quotes
In contexts where we are absolutely sure that there are no word splitting or globbing issues or we do want word splitting and globbing.
Examples
Double quotes
literal strings with whitespace ("StackOverflow rocks!", "Steve's Apple")
variable expansions ("$var", "${arr[@]}")
command substitutions ("$(ls)", "`ls`")
globs where directory path or file name part includes spaces ("/my dir/"*)
to protect single quotes ("single'quote'delimited'string")
Bash parameter expansion ("${filename##*/}")
Single quotes
command names and arguments that have whitespace in them
literal strings that need interpolation to be suppressed ( 'Really costs $$!', 'just a backslash followed by a t: \t')
to protect double quotes ('The "crux"')
regex literals that need interpolation to be suppressed
use shell quoting for literals involving special characters ($'\n\t')
use shell quoting where we need to protect several single and double quotes ($'{"table": "users", "where": "first_name"=\'Steve\'}')
No quotes
around standard numeric variables ($$, $?, $# etc.)
in arithmetic contexts like ((count++)), "${arr[idx]}", "${string:start:length}"
inside [[ ]] expression which is free from word splitting and globbing issues (this is a matter of style and opinions can vary widely)
where we want word splitting (for word in $words)
where we want globbing (for txtfile in *.txt; do ...)
where we want ~ to be interpreted as $HOME (~/"some dir" but not "~/some dir")
See also:
Difference between single and double quotes in Bash
What are the special dollar sign shell variables?
Quotes and escaping - Bash Hackers' Wiki
When is double quoting necessary?
I generally use quoting like "$var" to be safe, unless I am sure that $var does not contain spaces.
I do use $var as a simple way to join lines:
lines="`cat multi-lines-text-file.txt`"
echo "$lines" ## multiple lines
echo $lines ## all spaces (including newlines) are zapped
Whenever the https://www.shellcheck.net/ plugin for your editor tells you to.
The code below doesn't do what I need it to. I want to pass $STR to Underscore and pluck out the "name" attribute from the JSON data.
#!/bin/bash
STR='['$@']';
RESULT=`underscore pluck --data '+$STR+' name`;
echo $RESULT;
JSON Data:
{"maxResults":1,"resultList":[{"#class":"com.sohnar.trafficlite.transfer.crm.refactor.ClientCRMEntryTO","id":331458,"version":2,"dateCreated":"2017-05-31T13:20:22.960+0000","dateModified":"2017-06-05T14:23:59.961+0000","lastUpdatedUserId":71954,"name":"ACME_CLIENT","website":null,"description":null,"billingLocation":null,"primaryLocation":null,"crmEntryType":"CLIENT","industryType":null,"accountManagerId":103049,"crmClientClassificationListItemId":{"id":12405},"companyProfile":{"id":486024,"version":1,"dateCreated":"2017-05-31T13:20:22.960+0000","dateModified":"2017-06-05T14:23:59.962+0000","sourceOfBusinessListItemId":null,"creditTermsListItemId":{"id":4215},"relationshipSince":"2017-05-30T23:00:00.000+0000","turnover":0,"employees":0,"taxNumber":null,"companyNumber":null,"nominalCode":null,"accountPackageId":null,"optOutMarketing":false,"optOutEmail":false,"optOutTelephone":false,"notes":null},"colorCode":0,"externalCode":"SAP-01","clientState":"CLIENT","defaultCustomRateSetId":null,"preferredCurrencyId":{"id":48},"freeTags":[]}],"windowSize":5,"currentPage":1}
Several problems here. First, you should use double quotes, not single. Double quotes allow the expansion of variables. Second, I am assuming that you don't want the + as part of the data; it is not a string concatenation operator in bash (there is no need for one).
str="[$#]"
result=$(underscore pluck --data "$str" name)
echo $result
I used the $( ) notation rather than back-ticks ( `` ). Back-ticks are considered deprecated and difficult to read.
I have replaced uppercase variable names with lower case. This is because there are many uppercase variable names used by the shell so using uppercase risks a name collision. Best to use lower or mixed-case variable names.
I have also removed the superfluous semi-colons ;. They are not doing any harm, but they are of no use either. In bash a semi-colon is a statement separator, whereas in languages like C and Java they are a statement terminator.
Edit: double-quoted $@
I figured it out...
#!/bin/bash
name=$(underscore select .name --outfmt text < get.crm.client.acme.json)
echo $name
I'm trying to pull variables from an API in JSON format and then put them back together with one variable changed and fire them back as a PUT.
The only issue is that every value has quote marks around it and must go back to the API separated by commas only.
Example of what it should see, with redacted information and the variables inside the **'s:
curl -skv -u redacted:redacted -H Content-Type: application/json -X PUT -d'{properties:{basic:{request_rules:[**"/(req) testrule","/test-body","/(req) test - Admin","test-Caching"**]}}}' https://x.x.x.x:9070/api/tm/1.0/config/active/vservers/xxx-xx
Obviously, if I fire them as a plain array I get spaces instead of commas. However, I tried outputting it as a plain string:
longstr=$(echo ${valuez[@]})
output=$(echo $longstr | sed -e 's/" /",/g')
And due to the way bash is interpreted, it seems to either interpret the quotes wrong or something else. I guess it might well be the single quotes wrapping the data after the PUT -d as well, but I'm not sure how I can throw a variable into something that is wrapped in single quotes.
If I put the raw data in manually it works, so it's either the way the variable is being sent or the single quotes. I don't get an error, and when I echo the line out it looks perfect.
Any ideas?
valuez=( "/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching" )
# Temporarily set IFS to some character which is known not to appear in the array.
oifs=$IFS
IFS=$'\014'
# Flatten the array with the * expansion giving a string containing the array's elements separated by the first character of $IFS.
d_arg="${valuez[*]}"
IFS=$oifs
# If necessary, quote or escape embedded quotation marks. (Implementation-specific, using doubled double quotes as an example.)
d_arg="${d_arg//\"/\"\"}"
# Substitute the known-to-be-absent character for the desired quote+separator+quote.
d_arg="${d_arg//$'\014'/\",\"}"
# Prepend and append quotes.
d_arg="\"$d_arg\""
# insert the prepared arg into the final string.
d_arg="{properties:{basic:{request_rules:[${d_arg}]}}}"
curl ... -d"$d_arg" ...
If you have GNU awk version 4 or above, which supports FPAT:
output=$(echo $longstr |awk '$1=$1' FPAT="(\"[^\"]+\")" OFS=",")
Explanation
FPAT
This is a regular expression (as a string) that tells gawk to create the fields based on text that matches the regular expression. Assigning a value to FPAT overrides the use of FS and FIELDWIDTHS for field splitting. See Splitting By Content, for more information.
If gawk is in compatibility mode (see Options), then FPAT has no special meaning, and field-splitting operations occur based exclusively on the value of FS.
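As a concrete sketch, assuming the flattened string still contains the double quotes around each element:
longstr='"/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching"'
echo "$longstr" | awk '$1=$1' FPAT='("[^"]+")' OFS=','
# "/(req) testrule","/test-body","/(req) test - Admin","test-Caching"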
valuez=( "/(req) testrule" "/test-body" "/(req) test - Admin" "test-Caching" )
csv="" sep=""
for v in "${valuez[#]}"; do csv+="$sep\"$v\""; sep=,; done
echo "$csv"
"/(req) testrule","/test-body","/(req) test - Admin","test-Caching"
If it's something you need to do repeatedly, put it into a function:
toCSV () {
local csv sep val
for val; do
csv+="$sep\"$val\""
sep=,
done
echo "$csv"
}
csv=$(toCSV "${valuez[@]}")
Here's a quick Perl question:
How can I convert HTML special characters like &uuml; or &#39; to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and could write it for all HTML characters, but some function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.
Take a look at HTML::Entities:
use HTML::Entities;
my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";
You can guess the output.
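If you just want it at the shell prompt, the same call also works as a one-liner (assuming HTML::Entities is installed):
echo 'Snoopy &amp; Charlie Brown' | perl -MHTML::Entities -pe '$_ = decode_entities($_)'
# Snoopy & Charlie Brown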
The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want, and you don't want all the Unicode characters, you can look at the Text::Unidecode module from CPAN to zap all those odd characters back into a roughly similar collection of ASCII characters:
use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);
my $source = '&#x5317;&#x4EB0;';
print unidecode(decode_entities($source));
# That prints: Bei Jing
Note that there are hex-specified characters too. They look like this: &#xE9; (é).
Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (perl interface: Text::Iconv) with the transliterate option on with some success in the past. But if you are dealing with a limited set of entities, or you don't actually need it reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities doc.
There are a handful of predefined HTML entities - &amp; &quot; &gt; and so on - that you could hard code.
However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.
I use this script. Save it as html2utf.py, and use it like echo $some_html | html2utf.py.
#!/usr/bin/env python3
"""
An alternative for `perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)'` (which you can use by `cpanm HTML::Entities`) and `recode html..`.
"""
import fileinput
import html
for line in fileinput.input():
print(html.unescape(line.rstrip('\n')))
I have created a one-liner for bash, using Perl to decode the HTML entities that are passed to perl. My solution is a blend of this answer (see above) and something I found on commandlinefu.com last week.
Most of us who code in Bash aren't in the habit of using echo -n to strip out the \n newline character since it doesn't usually affect Bash text parsing. With Perl, and with this particular method, it's important to use echo -n or else perl will interpret the 'newline' \n character as a literal part of the response, adding an unwanted %0A to your results.
Here's my bash-perl one-liner hybrid:
encodedURL="$(echo -n "$entityURL" | perl -MHTML::Entities -MURI::Escape -ne 'print uri_escape(decode_entities($_))')"
Example:
Input: Seals & Croft - Summer Breeze
Output: Seals%20%26%20Croft%20-%20Summer%20Breeze