Error while reading JSON file in chunks with Python

I have a large json file, so I want to read the file in chunks while testing. I have implemented the code below:
if fpath.endswith('.json'):
    with open(fpath, 'r') as f:
        read_query = pd.read_json(f, lines=True, chunksize=100)
        for chunk in read_query:
            print(chunk)
I get the error:
File "nameoffile.py", line 168, in read_queries_func
for chunk in read_query:
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 798, in __next__
obj = self._get_object_parser(lines_json)
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 770, in _get_object_parser
obj = FrameParser(json, **kwargs).parse()
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 885, in parse
self._parse_no_numpy()
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 1159, in _parse_no_numpy
loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value
Why am I getting an error?
The JSON file looks like this:
[
    {
        "a": "13",
        "b": "55"
    },
    {
        "a": "15",
        "b": "16"
    },
    {
        "a": "18",
        "b": "45"
    },
    {
        "a": "1650",
        "b": "26"
    },
    ...
    {
        "a": "214",
        "b": "23"
    }
]
Also, is there a way to extract just the 'a' attribute's values while reading the file? Or can that only be done after I've read the file?

Your JSON file contains just one top-level object (the whole array). As per the line-delimited JSON docs, to which the documentation of the chunksize argument points:
pandas is able to read and write line-delimited json files that are common in data processing pipelines using Hadoop or Spark.
For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or to read from a stream.
Using chunksize also implies lines=True, and the doc for lines says:
Read the file as a json object per line.
This means that files like this work:
{"a": 1, "b": 2}
{"a": 3, "b": 4}
{"a": 5, "b": 6}
{"a": 7, "b": 8}
{"a": 9, "b": 10}
These don’t:
[
{"a": 1, "b": 2},
{"a": 3, "b": 4},
{"a": 5, "b": 6},
{"a": 7, "b": 8},
{"a": 9, "b": 10}
]
So you have to read the file in one go, or modify it as you go to have one object per line.
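If rewriting the file is not an option, here is a minimal sketch of the read-in-one-go approach, assuming a hypothetical path data.json with the array layout shown above; it also pulls out just the 'a' values per chunk, which answers the second question:
import json

import pandas as pd

# Load the whole JSON array once (chunked reading is not possible for
# this layout), then iterate over it in fixed-size slices.
with open('data.json', 'r') as f:  # hypothetical path
    records = json.load(f)

chunk_size = 100
for start in range(0, len(records), chunk_size):
    chunk = pd.DataFrame(records[start:start + chunk_size])
    print(chunk['a'].tolist())  # just the 'a' attribute's values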

Related

Split a large JSON file into multiple files with sizes within a certain range

I have a large JSON file (~10 GB) like this:
{"id":1, "attributes":{"a": 1}}
{"id":2, "attributes":{"a": 4, "b": 5, "d": 6}}
{"id":2, "attributes":{"a": 4, "b": 5, "c": 6, "d": 5, "e": 1}}
{"id":2, "attributes":{"a": 4, "b": 5, "c": 6, "d": 5, "e": 1, h: "l"}}
I need to split this file into multiple files, each with a size within a certain range (300-350 MB).
I tried using the split command-line tool:
split -l 5000000 test.json
or
split -b 300MB test.json
Neither works as I expected, because each line of the JSON file has a different size. If I split by line count, the resulting files can be larger or smaller than the range I want; if I split by byte size, a line at a split boundary may be cut in half.
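A minimal sketch of one way to do this in Python (the output naming scheme and the 300 MB target are assumptions): stream the file line by line and start a new part whenever adding the next line would exceed the limit, so no line is ever split:
MAX_BYTES = 300 * 1024 * 1024  # assumed target size per part

part = 0
written = 0
out = open(f'part_{part:04d}.json', 'wb')  # hypothetical naming scheme
with open('test.json', 'rb') as src:
    for line in src:
        # Roll over to a new part if this line would push us past the limit.
        if written and written + len(line) > MAX_BYTES:
            out.close()
            part += 1
            written = 0
            out = open(f'part_{part:04d}.json', 'wb')
        out.write(line)
        written += len(line)
out.close()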

How can I emit delimited text (like CSV) from jq?

When using jq for data processing, it's often more convenient to emit the processed text in some kind of "delimited" form that other CLI tools can consume, such as awk, cut, and the read builtin in Bash.
Is there a straightforward way to achieve this?
Sample data:
[
{"a": 11, "b": 12, "c": 13},
{"a": 21, "b": 22, "c": 23},
{"a": 31, "b": 32, "c": 33},
{"a": 41, "b": 42, "c": 43}
]
Desired output:
a,c
11,13
21,23
31,33
41,43
jq --raw-output 'map({ a, c }) | (.[0] | keys_unsorted), (.[] | [.[]]) | @csv'
Will produce:
"a","c"
11,13
21,23
31,33
41,43
If you can assume that the attribute names are the same in all array elements, you can use the @csv formatter along with --raw-output:
Put this in a script like json-records-to-csv.jq, adjusting the shebang as needed:
#!/usr/bin/jq --raw-output -f
# Like `keys`; extracts object values as an array.
def values:
to_entries | map(.value)
;
# Get the column names from the first array element's keys
(.[0] | keys | @csv)
,
# Get the values from every array element
(.[] | values | @csv)
Usage example:
json-records-to-csv.jq <<'JSON'
[
{"a": 11, "b": 12, "c": 13},
{"a": 21, "b": 22, "c": 23},
{"a": 31, "b": 32, "c": 33},
{"a": 41, "b": 42, "c": 43}
]
JSON
Output:
"a","b","c"
11,12,13
21,22,23
31,32,33
41,42,43
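For comparison, a rough Python equivalent of the same idea (assuming the input arrives on stdin and every record has the same keys) could be:
import csv
import json
import sys

# Read a JSON array of flat records from stdin and emit CSV on stdout.
records = json.load(sys.stdin)
writer = csv.DictWriter(sys.stdout, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)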

Sorting a list of data in JSON format

So I am trying to sort through a good few lines of text and numbers in JSON format. An example of some of these lines would be
{ "a": "123456", "b": 16, "c": "Data" }
{ "a": "654321", "b": 30, "c": "Data" }
{ "a": "015864", "b": 18, "c": "Data" }
I am trying to sort these so that the value for "b" (16, 30, and 18, respectively) is ordered from largest to smallest. Example:
{ "a": "654321", "b": 30, "c": "Data" }
{ "a": "015864", "b": 18, "c": "Data" }
{ "a": "123456", "b": 16, "c": "Data" }
I know it is possible, but I have 0 idea how regex works, or if there is something else I could use. Preferably able to be done from a linux command line, but if it needs to be done with notepad++ or something I can make it work. Also please explain why your suggestion works so I can learn, not just copy-paste your suggestion. Thank you!!
Edit: I was informed Regex would not work. I am okay with any scripting language being used, as long as I can learn how to do these things myself. Python, Perl, php, any of them work.
If the fields are aligned, you can sort it as if it were a plain text file.
$ sort -k5,5nr file
{ "a": "654321", "b": 30, "c": "Data" }
{ "a": "015864", "b": 18, "c": "Data" }
{ "a": "123456", "b": 16, "c": "Data" }
Fields are space delimited; this sorts on field 5 (the b values) as a numeric reverse sort.
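Since the question allows any scripting language, here is a minimal Python sketch that parses each line as JSON rather than relying on column alignment (the input file name file is taken from the sort example above):
import json

# Parse each line as a JSON object and sort by the numeric "b" value,
# from largest to smallest.
with open('file') as f:
    rows = [json.loads(line) for line in f if line.strip()]

for row in sorted(rows, key=lambda r: r['b'], reverse=True):
    print(json.dumps(row))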

How do I transform this JSON data using JQ to extract each nested array element to the top level in turn?

Given input of the form
[
{"a": 1, "b": [{"c": 1}, {"c": 2}]},
{"a": 2, "b": [{"c": 4}, {"c": 5}]}
]
I'm trying to transform it to look like this:
[
{"a": 1, "b": [{"c": 1}]},
{"a": 1, "b": [{"c": 2}]},
{"a": 2, "b": [{"c": 4}]},
{"a": 2, "b": [{"c": 5}]}
]
I have [map(.b)] | flatten, but any further operation using the parent context does not seem to be possible. I'm really stuck and would appreciate any help.
Thanks
Here's a straightforward solution that makes no mention of any keys besides "b":
map(. + (.b[] | {b: [.]}))
You can try this filter:
jq 'map({a,"b":.b[]|[.]})' file
It pairs "a" with each element of "b" separately, wrapping each element back into its own single-item array.
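As a cross-check, a minimal Python sketch of the same transformation (the sample data is inlined):
import json

data = [
    {"a": 1, "b": [{"c": 1}, {"c": 2}]},
    {"a": 2, "b": [{"c": 4}, {"c": 5}]},
]

# Emit one copy of each parent object per element of its "b" array,
# keeping the parent's other keys intact.
result = [{**item, "b": [elem]} for item in data for elem in item["b"]]
print(json.dumps(result, indent=1))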

Assigning parent keys in the innermost object using jq

I would like to turn this:
{
    "a": 1,
    "b": [1,2,3,4]
}
into this
[
{"a": 1, "b": 1},
{"a": 1, "b": 2},
...
]
This is sort of like python's zip but with unequally shaped objects.
Thanks!
Here is a solution:
$ jq -Mc '[.b=.b[]]' data.json
If data.json contains the sample data the output is
[{"a":1,"b":1},{"a":1,"b":2},{"a":1,"b":3},{"a":1,"b":4}]
You can use cat ab.json | jq '[{"a": .a, "b": .b[]}]' to get the answer.
If minimizing keystrokes is the goal, then consider:
jq '.+{b:.b[]}' <<< "$j"
{
"a": 1,
"b": 1
}
{
"a": 1,
"b": 2
}
{
"a": 1,
"b": 3
}
{
"a": 1,
"b": 4
}
Using . here ensures that all keys other than "b" will be preserved. By contrast, if one wants to ignore all the keys other than "a" and "b", then one could use the jq filter:
{a,b:.b[]}
To turn the stream into an array, just wrap the expression in square brackets: [ ... ]
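For reference, a minimal Python sketch of the same zip-like expansion (the input object is inlined):
import json

obj = {"a": 1, "b": [1, 2, 3, 4]}

# Produce one object per element of "b", carrying every other key along.
result = [{**obj, "b": value} for value in obj["b"]]
print(json.dumps(result))
# [{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 1, "b": 4}]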