How do I use jq to convert an arbitrary JSON array of objects to CSV, while objects in this array are nested?
StackOverflow has a sea of questions/answers where specific input or output fields are referenced, but I'd like to have a generic solution that
includes a header row,
works for any JSON input including nested arrays + objects,
allows records that have missing values for keys that are present in other records,
does not hard-code any field names,
allows converting the CSV back into the nested JSON structure if needed, and
uses key paths as header names (see the following description).
Dot notation
Many JSON-using products (like CouchDB, MongoDB, …) and libraries (like Lodash, …) use variations of a syntax that allows access to nested property values / subfields by joining key fragments with a separator character, often a dot ('dot notation').
An example of a key path like this would be "a.b.0.c" to refer to the deeply nested property in this JSON snippet:
{
"a": {
"b": [
{
"c": 123,
}
]
}
}
Caveat: Using this method is a pragmatic solution for most cases, but it means that either dot characters have to be banned in property names, or a more complex escaping scheme (one guaranteed never to collide with real property names) has to be invented for escaping dots in property names / accessing nested fields. MongoDB simply banned usage of "." in documents until v5.0; some libraries have workarounds for field access (Lodash example).
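To see the ambiguity in action, here is a quick sketch (using jq's leaf_paths builtin): a nested property and a top-level property containing a literal dot both collapse to the same dot-notation header.

echo '{"a": {"b": 1}} {"a.b": 2}' | jq -c '[leaf_paths] | map(map(tostring) | join("."))'

Both inputs yield ["a.b"], so the two shapes become indistinguishable in the CSV header.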
Despite this, for simplicity, a solution should use the described dot syntax in the CSV output’s header for nested properties. Bonus if there is a solution variant that solves this problem, e.g. with JSONPath.
Example JSON array as input
[
{
"a": {
"b": [
{
"c": 123
}
]
}
},
{
"a": {
"b": [
{
"c": "foo \" bar",
"d": "qux"
}
]
}
},
{
"a": {
"b": [
{
"d": 456
}
]
}
}
]
Example CSV output
The output should have a header that includes all fields (even if the first object in the array does not define values for all existing key paths).
To make the output intuitively editable by humans, each row should represent one object in the input array.
The expected output should look like this:
"a.b.0.c","a.b.0.d"
123,
"foo "" bar","qux"
,456
Command line
This is what I need:
cat example.json | jq <MISSING CODE HERE>
Solution 1, using dot notation
Here is the jq call to convert your array of nested JSON objects to CSV:
jq -r '(. | map(leaf_paths) | unique) as $cols | map(. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]'
The fastest way to try this solution out is to use JQPlay.
The CSV output will have a header row. It will contain all properties that exist anywhere in the input objects, including nested ones, in dot notation. Each input array element will be represented as a single row, properties that are missing will be represented as empty CSV fields.
Using solution 1 in bash or a similar shell
Create the JSON input file…
echo '[{"a": {"b": [{"c": 123}]}},{"a": {"b": [{"c": "foo \" bar","d": "qux"}]}},{"a": {"b": [{"d": 456}]}}]' > example.json
Then use this jq command to output the CSV on the standard output:
cat example.json | jq -r '(. | map(leaf_paths) | unique) as $cols | map(. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]'
…or write the output to example.csv:
cat example.json | jq -r '(. | map(leaf_paths) | unique) as $cols | map(. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]' > example.csv
Converting the data from solution 1 back to JSON
Here is a Node.js example that you can try on RunKit. It converts a CSV generated with the method in solution 1 back to an array of nested JSON objects.
Explanation for solution 1
Here is a longer, commented version of the jq filter.
# 1) Find all unique leaf paths of all objects in the input array. Each path is an array of key-path components, for example ["a", "b", 0, "c"].
(. | map(leaf_paths) | unique) as $cols |
# 2) Use the found key paths to determine all (nested) property values in the given input records.
map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows |
# 3) Create the raw output array of rows. Each row is represented as an array of values, one element per existing column.
(
# 3.1) This represents the header row. Key paths are generated here.
[($cols | map(. | map(tostring) | join(".")))]
+ # 3.2) concatenate the header row with all other rows
$rows
)
# 4) Convert each row to an escaped CSV string.
| map(@csv)
# 5) Output each array element directly. Without this, the result would be a JSON array of CSV strings.
| .[]
Solution 2: for input that does have dots in property names
If you do need to support dot characters in property names, you can either use a different separator string for the key-path syntax (replace the "." in join(".") with something else), or replace the map(tostring) | join(".") part with tostring - this encodes each key path as a JSON array string, so no dot notation is needed. Here is a JQPlay with this solution variant.
Full jq command:
jq -r '(. | map(leaf_paths) | unique) as $cols | map(. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(tostring))] + $rows) | map(@csv) | .[]'
The output CSV for this variant would then look like this; it's less readable and not useful for cases where you want humans to intuitively understand the CSV's header:
"[""a"",""b"",0,""c""]","[""a"",""b"",0,""d""]"
123,
"foo "" bar","qux"
,456
See below for an idea how to convert this format back to a representation in your programming language.
Bonus: Converting the generated CSV back to JSON
If the input's nested properties contain no ".", it’s simple to convert the CSV back to JSON, for example with a library that supports dot notation, or with JSONPath.
JavaScript: Use Lodash's _.set()
Other languages: Find a package/library that implements JSONPath and use selectors like $.a.b.0.c or $['a']['b'][0]['c'] to set each nested property of each record.
Solution 2 (with JSON arrays as headers) allows you to interpret the headers as JSON array strings. Then you can generate a JSON Path from each header, and re-create all records/objects:
"[""a"",""b"",0,""c""]" (CSV)
→ ["a","b",0,"c"] (array of key-path components after unescaping and parsing as JSON)
→ $["a"]["b"][0]["c"] (JSONPath)
→ { a: { b: [{c: … }] } } (Nested regenerated object)
I've written an example Node.js script to convert a CSV like this back to JSON. You can try solution 2 in RunKit.
The following tocsv and fromcsv functions provide a solution to the stated problem except for one complication regarding requirement (6) concerning the headers. Essentially, this requirement can be met using the functions given here by adding a matrix transposition step.
Whether or not a transposition step is added, the advantage of the approach taken here is that there are no restrictions on the JSON keys or values. In particular, they may
contain periods (dots), newlines and/or NUL characters.
In the example, an array of objects is given, but in fact any stream of valid JSON documents could be used as input to tocsv; thanks to the magic of jq, the original stream will be recreated by fromcsv (in the sense of entity-by-entity equality).
Of course, since there is no CSV standard, the CSV produced by the
tocsv function might not be understood by all CSV processors. In
particular, please note that the tocsv function defined here maps
embedded newlines in JSON strings or key names to the two-character
string "\n" (i.e., a literal backslash followed by the letter "n");
the inverse operation performs the inverse translation to meet the
"round-trip" requirement.
(The use of tail is just to simplify the presentation; it would be
trivial to modify the solution to make it an only-jq one.)
The CSV is generated on the assumption that any value can be
included in a field so long as (a) the field is quoted, and (b)
double-quotes within the field are doubled.
Any generic solution that supports "round-trips" is bound to be
somewhat complicated. The main reason why the solution presented here is
more complex than one might expect is because a third column is
added, partly to make it easy to distinguish between integers and
integer-valued strings, but mainly because it makes it easy to
distinguish between the size-1 and size-2 arrays produced by jq's
--stream option. Needless to say, there are other ways
these issues could be addressed; the number of calls to jq could
also be reduced.
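For reference, here is the --stream representation of a tiny document; the size-2 arrays hold a [path, leaf-value] pair, and the size-1 arrays mark the closing of an array or object:

$ echo '{"a":[1]}' | jq -c --stream .
[["a",0],1]
[["a",0]]
[["a"]]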
The solution is presented as a test script that checks the round-trip requirement on a telling test case:
#!/bin/bash
function json {
cat<<EOF
[
{
"a": 1,
"b": [
1,
2,
"1"
],
"c": "d\",ef",
"embed\"ed": "quote",
"null": null,
"string": "null",
"control characters": "a\u0000c",
"newline": "a\nb"
},
{
"x": 1
}
]
EOF
}
function tocsv {
jq -ncr --stream '
(["path", "value", "stringp"],
(inputs | . + [.[1]|type=="string"]))
| map( tostring|gsub("\"";"\"\"") | gsub("\n"; "\\n"))
| "\"\(.[0])\",\"\(.[1])\",\(.[2])"
'
}
function fromcsv {
tail -n +2 | # first duplicate backslashes and deduplicate double-quotes
jq -rR '"[\(gsub("\\\\";"\\\\") | gsub("\"\"";"\\\"") ) ]"' |
jq -c '.[2] as $s
| .[0] |= fromjson
| .[1] |= if $s then . else fromjson end
| if $s == null then [.[0]] else .[:-1] end
# handle newlines
| map(if type == "string" then gsub("\\\\n";"\n") else . end)' |
jq -n 'fromstream(inputs)'
}
# Check the roundtrip:
json | tocsv | fromcsv | jq -s '.[0] == .[1]' - <(json)
Here is the CSV that would be produced by json | tocsv, except that SO seems to disallow literal NULs, so I have replaced that by \0:
"path","value",stringp
"[0,""a""]","1",false
"[0,""b"",0]","1",false
"[0,""b"",1]","2",false
"[0,""b"",2]","1",true
"[0,""b"",2]","false",null
"[0,""c""]","d"",ef",true
"[0,""embed\""ed""]","quote",true
"[0,""null""]","null",false
"[0,""string""]","null",true
"[0,""control characters""]","a\0c",true
"[0,""newline""]","a\nb",true
"[0,""newline""]","false",null
"[1,""x""]","1",false
"[1,""x""]","false",null
"[1]","false",null
Let's say I have some JSON in a file. It's a subset of JSON data extracted from a larger JSON file (that's why I'll use --stream later in my attempted solution), and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with unique _id fields for each post. Both authors wrote 2 posts, where one says "Hello world" and the other says "Goodbye world".
I want to match on the word "Hello" and return the _id only for fields containing "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
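If you prefer substring matching, contains/1 can simply be substituted for test/1 in the filter above:

jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | contains("Hello"))
| ._id
'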
Handling extraneous commas
Assuming the input has commas between a stream of valid JSON documents exactly as shown, you could presumably use sed to remove them first.
Or, if you want an only-jq solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
fromstream(1|truncate_stream(inputs?))
| select(.body | test("Hello"))
| ._id,
iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
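For example, assuming the input has been fixed up into a single valid JSON array stored in posts.json (file name assumed for illustration):

jq -r '.[] | select( .body | test("Hello") )._id' posts.json

This prints the two ids (1 and 55) expected in the question.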
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.
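Putting it together as a single invocation (input file name assumed for illustration):

jq -nr '
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
' posts.txt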
I have the following JSON content in a file.json.
I need only a particular key field from all this overwhelming information.
Let us assume that I need web_url.
The problem here is that there are multiple key fields named "web_url".
How do I get only the web_url field I am after?
[{"id":196,"iid":1,"project_id":233,"title":"DEV to Master","description":"","state":"merged","created_at":"2019-12-04T14:14:35.424-06:00","updated_at":"2019-12-04T14:14:47.310-06:00","merged_by":{"id":122,"name":"Sengoku","username":"sengk","state":"active","avatar_url":"https://secure.gravatar.com/avatar/7cvffgfgfgfgf9eb1348d0ba7795a076?s=80\u0026d=identicon","web_url":"https://gitlaboo.tests.com/sengk"},"merged_at":"2019-12-04T14:14:47.468-06:00","closed_by":null,"closed_at":null,"target_branch":"master","source_branch":"DEV","upvotes":0,"downvotes":0,"author":{"id":122,"name":"Sengoku","username":"sengk","state":"active","avatar_url":"https://secure.gravatar.com/avatar/7fgdfdgdfgdvfg9eb1348d0ba7795a076?s=80\u0026d=identicon","web_url":"https://gitlaboo.tests.com/sengk"},"assignee":{"id":122,"name":"Sengoku","username":"sengk","state":"active","avatar_url":"https://secure.gravatar.com/avatar/7afsdfdvdfvfde24f89eb1348d0ba7795a076?s=80\u0026d=identicon","web_url":"https://gitlaboo.tests.com/sengk"},"source_project_id":233,"target_project_id":233,"labels":[],"work_in_progress":false,"milestone":null,"merge_when_pipeline_succeeds":false,"merge_status":"can_be_merged","sha":"6318e51ea8czfdfsdvdfvdfbc02988ba62c71e5774107e","merge_commit_sha":"6dc5vdfvdfgdfg5bf14e97dea949b8584c0c68d6","user_notes_count":0,"discussion_locked":null,"should_remove_source_branch":null,"force_remove_source_branch":false,"web_url":"https://gitlaboo.tests.com/demo/frog/merge_requests/1","time_stats":{"time_estimate":0,"total_time_spent":0,"human_time_estimate":null,"human_total_time_spent":null},"squash":false}]
You have 4 web_url keys in your JSON.
You can check the results below:
.[] | .web_url
.[] | .merged_by.web_url
.[] | .author.web_url
.[] | .assignee.web_url
If the question is essentially how to find the needle in the haystack, the answer is: use paths; more specifically, in your case:
jq -c 'paths(. == "https://gitlaboo.tests.com/demo/frog/merge_requests/1")
| select(.[-1] == "web_url")
' file.json
The output gives the path as a JSON array:
[0,"web_url"]
This can be used directly in jq (using getpath/1), or as the basis for a direct query:
.[0].web_url
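For example, a quick sketch using getpath/1 with the path emitted above:

jq 'getpath([0,"web_url"])' file.json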
Having difficulties converting this JSON. It is multi-line, similar to what is shown below. The example data at the bottom is what it reads as-is once unzipped.
An example of what has been tried:
jq -r '(([["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]]) + [(.table.All[] | [.user_id,.server_received_time,.app,.device_carrier,.$schema,.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.$insert_id,.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time])])[]|#csv' test.json > test.csv
As well as some other jq options. I need every column regardless of the value, and the values as-is. Does anyone have thoughts on why we are running into issues? One error we get is:
jq: error: try .["field"] instead of .field for unusually named fields at <top-level>, line 1:
Other jq lines have given the following error:
string (...) cannot be csv-formatted, only array
This is an excerpt from one of the JSON files:
{"groups":{},"country":"United States","device_id":"3d-88c-45-b6-ed81277eR","is_attribution_event":false,"server_received_time":"2019-12-17 17:29:11.113000","language":"English","event_time":"2019-12-17 17:27:49.047000","user_creation_time":"2019-11-08 13:15:32.919000","city":"Sure","uuid":"someID","device_model":"Windows","amplitude_event_type":null,"client_upload_time":"2019-12-17 17:29:21.958000","data":{},"library":"amplitude-js\/5.2.2","device_manufacturer":null,"dma":"Washington, DC (Townville, USA)","version_name":null,"region":"Virginia","group_properties":{},"location_lng":null,"device_family":"Windows","paying":null,"client_event_time":"2019-12-17 17:27:59.892000","$schema":12,"device_brand":null,"user_id":"email#gmail.com","event_properties":{"title":"Name","id":"1-253251","applicationName":"SomeName"},"os_version":"18","device_carrier":null,"server_upload_time":"2019-12-17 17:29:11.135000","session_id":1576603675620,"app":231165,"amplitude_attribution_ids":null,"event_type":"CHANGE_PERSPECTIVE","user_properties":{},"adid":null,"device_type":"Windows","$insert_id":"e308c923-d8eb-48c6-8ea5-600","event_id":24,"amplitude_id":515,"processed_time":"2019-12-17 17:29:12.760372","platform":"Web","idfa":null,"os_name":"Edge","location_lat":null,"ip_address":"123.456.78.90","sample_rate":null,"start_version":null}
Thank you!
There are several problems with your attempt.
First, the keys with "$" in their names cannot be specified using the abbreviated .foo syntax; you could use .["$foo"] instead.
Second, @csv expects an array of atomic values. Thus the keys with JSON objects as values must be handled specially.
Third, the "+" is incorrect. The relevant connector here is ",".
With your sample JSON, the following will work:
(["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]),
([.user_id,.server_received_time,.app,.device_carrier,.["$schema"],.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.["$insert_id"],.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time]
| map(if type=="object"
then to_entries
| map( "\(.key):\(.value)" )
| join(";")
else . end))
| @csv
A less error-prone solution
Specifying the long list of keys twice makes the above solution error-prone. It would be better to specify the keys just once, and then programmatically generate the rows.
Here's a utility function that can be used to this end:
def toa($headers):
. as $in | $headers | map($in[.]);
Or you could handle the object-valued keys inside toa:
def toa($headers):
def flat:
if type == "object" or type == "array"
then to_entries | map( "\(.key):\(.value)" ) | join(";")
else .
end;
. as $in | $headers | map($in[.] | flat);
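For example, here is a minimal sketch applying the second variant of toa to the sample object, with a deliberately abbreviated key list (in practice you would supply the full list from the question, exactly once):

jq -r '
def toa($headers):
  def flat:
    if type == "object" or type == "array"
    then to_entries | map( "\(.key):\(.value)" ) | join(";")
    else .
    end;
  . as $in | $headers | map($in[.] | flat);
["user_id", "city", "event_type", "event_properties"] as $headers
| $headers, toa($headers)
| @csv
' test.json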
JSONL
If the input is a stream of JSON objects of the type illustrated in the question, an efficient solution would use inputs with the -n command line option. This could be along the lines of:
print_header,
(inputs | print_row)
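A minimal sketch of such a program (print_header and print_row are the hypothetical helpers named in the skeleton above; the key list is abbreviated for illustration, and toa is the flattening variant defined earlier) could be saved as tocsv.jq and run with jq -nr -f tocsv.jq:

def headers: ["user_id", "city", "event_type"];  # abbreviated for illustration

def toa($h):
  def flat:
    if type == "object" or type == "array"
    then to_entries | map( "\(.key):\(.value)" ) | join(";")
    else .
    end;
  . as $in | $h | map($in[.] | flat);

def print_header: headers | @csv;
def print_row: toa(headers) | @csv;

print_header,
(inputs | print_row)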
I am trying to join() a relatively big array (20k elements) of objects with a character ('\n' in this particular case). I have a few operations upfront which complete in about 8 seconds (acceptable), but when I try to '| join("\n")' at the end, the runtime jumps to 3+ minutes.
Is there any reason for join() to be that slow? Is there another way of getting the same output without join()?
I am currently using jq 1.5 (latest stable).
Here is the jq file:
json2csv.jq
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | join("\n")
)] | join("\n")
;
json2csv
Considering:
$ jq 'length' test.json
23717
With the script as I want it (shown above):
$ time jq -rf json2csv.jq test.json > test.csv
real 3m46.721s
user 1m48.660s
sys 1m57.698s
With the same script, removing the join("\n")
$ time jq -rf json2csv.jq test.json > test.csv
real 0m8.564s
user 0m8.301s
sys 0m0.242s
(note: I removed the second join because otherwise jq cannot concatenate an array and a string, which makes sense (but that's only an array of 2 elements anyway, so the second join isn't the problem))
You don't need to use join at all. Rather than thinking of converting the whole file to a single string, think of it as converting each row to a string. The way jq outputs streams of results will give you the desired result in the end (assuming you take the raw output).
Try something more like this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers
# output headers followed by rows of values as arrays
| (
$headers
),
(
.[] | [ .[$headers[]] | tostring | tonull ]
)
# convert the arrays to tab separated values strings
| @tsv
;
After thinking about it, I remembered that jq automatically outputs a newline ('\n') after each result if you stream an array (.[]), which means that in this particular case I can just do this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | .[]
)] | .[]
;
json2csv
And this solved my problem
time jq -rf json2csv.jq test.json > test.csv
real 0m6.725s
user 0m6.454s
sys 0m0.245s
I'm leaving the question up because, had I wanted to use any character other than '\n', this wouldn't have solved the issue.
When producing output such as CSV or TSV, the idea is to stream the data as much as possible. The last thing you want to do is run join on an array containing all the data. If you did want to use a delimiter other than \n, you'd add it to each item in the stream, and then use the -j command-line option.
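For instance, here is a toy sketch that emits CRLF-delimited rows by appending the delimiter to each row string and letting -j suppress jq's own newlines:

$ echo '[["a","b"],["1","2"]]' | jq -j '.[] | (@tsv) + "\r\n"'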
Also, I think your diagnosis is probably not quite right as joining an array with a large number of small strings is quite fast. Below are timings comparing joining an array with two strings and one with 100,000 strings. In case you're wondering, my machine is rather slow.
$ ./join.sh 2
3
real 0.03
user 0.02
sys 0.00
1896448 maximum resident set size
$ ./join.sh 100000
588889
real 2.20
user 2.05
sys 0.13
21188608 maximum resident set size
$ cat join.sh
#!/bin/bash
/usr/bin/time -lp jq -n --argjson n "$1" '[range(0;$n)|tostring]|join(".")|length'
The above runs used jq 1.6, but using jq 1.5 produces very similar results.
On the other hand, joining a large number (20,000) of very long strings (1K) is noticeably slow, so evidently the current jq implementation is not designed for such operations.