Error in loading data in JSON file to Python Pandas dataframe

Error in loading data in JSON file to Python Pandas dataframe - json

I have a JSON file with multiple 'records' that I can easily load into a MongoDB database and then extract certain records from MongoDB into a python Pandas Dataframe. This is working just fine. However I wish to avoid this MongoDB route and directly load all the records in the JSON file into a pandas DF. I thought that this would be easy, but somehow it is not working at all.
This is what I have done
import pandas as pd
!wget -O peopleData.json -q https://github.com/prithwis/parashar21/raw/main/data/peopleDataTest5.json
data = pd.read_json('/content/peopleData.json')
#data = pd.read_json('/content/peopleData.json', lines=True)
This is throwing errors. I am using Google Colab and the notebook is available at this link.
I have seen quite a few other questions in stackoverflow that seem to address the same problem, but somehow none of the answers seem to work in my case. Will be grateful if someone can help me fix this.

Placing a new-line character between two successive json objects solves the problem!
# Retrieve JSON file from Github
!wget -O peopleData.json -q https://github.com/prithwis/parashar21/raw/main/data/peopleDataTest5.json
!cat peopleData.json
!grep '}{' peopleData.json
!sed -i 's/}{/}\n{/g' peopleData.json
!cat peopleData.json
data = pd.read_json('./peopleData.json', lines=True)
data
Inserted a \n between }{ using sed. Prior to this, the file was one continuous line, now it has 5 separate lines and hence read_json() function works with lines=True option

Related

Using sys.stdout.write() to create multiple files in NiFi?

I have a pipeline in NiFi that pulls down some invalid JSON that I need to clean up. The best solution I've concocted is to run a Python script via ExecuteStreamCommand and simultaneously clean/split it up in one fell swoop. However, even though I use sys.stdout.write() in my for loop, only the original JSON comes out in the output stream in NiFi.
Am I misusing sys.stdout.write() or is this possible, but I've just done something wrong? My end goal is for each line of the json to be a new flow file, i.e. file 1 is {"fruit":"apple",..., file 2 is {"fruit":"cherry",..., and so on.
example JSON
{"fruit":"apple", "vegetable":"celery", "location":{"country":"nor\\way", "city":"oslo", }, "color":"blue"}
{"fruit":"cherry", "vegetable":"kale", "location":{"country":"france", "city":"calais", }, "color":"green"}
{"fruit":"peach", "vegetable":"peas", "location":{"country":"united\\kingdom", "city":"london", }, "color":"yellow"}
script
import json
import re
import sys
flow_file = sys.stdin.read()
try:
load = json.loads(flow_file)
sys.stdout.write(flow_file)
except:
flow_file_esc = re.sub(r"[(\\)]", "", flow_file)
for f in flow_file_esc.splitlines():
sys.stdout.write(str(f))

Can you clean the file first with ReplaceText and then split it with SplitJson, SplitRecord, or ForkRecord?
If you need to combine the two operations and want to script it, you could try ExecuteScript with Jython (since it doesn't look like you're using native CPython libraries), I have some simple examples in my cookbook and my blog.

split big json files into small pieces without breaking the format

I'm using spark.read() to read a big json file on databricks. And it failed due to: spark driver has stopped unexpectedly and is restarting after a long time of runing.I assumed it is because the file is too big, so i decided to split it. So I used command:
split -b 100m -a 1 test.json
This actually split my files into small pieces and I can now read that on databricks. But then I found what I got is a set of null values. I think that is because i splitted the file only by the size,and some files might become files that are not in json format. For example , i might get something like this in the end of a file.
{"id":aefae3,......
Then it can't be read by spark.read.format("json").So is there any way i can seperate the json file into small pieces without breaking the json format?

How to auto format JSON on save in Vim

To be honest go has spoiled me. With go I got used to having a strict formatting standard that is being enforced by my editor (vim) and is almost accepted and followed by everybody else on the team and around the world.
I wanted to format JSON files on save the same way.
Question: How to auto format/indent/lint json files on save in vim.

In one command, try this:
execute '%!python -m json.tool' | w
You could then add you own key binding to make it a simpler keystroke. Of course, for this to work, you need to have Python installed on your machine.

If you are keen on using external tool and you are doing some work with json, I would suggest using the jq:
https://stedolan.github.io/jq/
Then, you can execute :%!jq . inside vim which will replace the current buffer with the output of jq.

%!python -m json.tool
or
%!python -c "import json, sys, collections; print json.dumps(json.load(sys.stdin, object_pairs_hook=collections.OrderedDict), ensure_ascii=False, indent=4)"
you can add this to your vimrc:
com! FormatJSON %!python -m json.tool
than you can use :FormatJson format json files

Thanks mMontu and Jose B, this is what I ended up doing:
WARNING this will overwrite your buffer. So if you OPEN a json file that already has a syntax error, you will lose your whole file (or can lose it).
Add this line to your ~/.vimrc
" Ali: to indent json files on save
autocmd FileType json autocmd BufWritePre <buffer> %!python -m json.tool
you need to have python on your machine, of course.
EDIT: this next one should not overwrite your buffer if your json has error. Which makes it the correct answer, but since I don't have a good grasp of Vim script or shell for that matter, I present it as an experimental thing that you can try if you are feeling lucky. It may depend on your shell too. You are warned.
" Ali: to indent json files on save
autocmd FileType json autocmd BufWritePre <buffer> %!python -m json.tool 2>/dev/null || echo <buffer>

A search for JSON plugins on vim.org returned this:
jdaddy.vim : JSON manipulation and pretty printing
It has the following on description:
gqaj "pretty prints" (wraps/indents/sorts keys/otherwise cleans up)
the JSON construct under the cursor.
If it does the formatting you are expecting then you could create an autocmd BufWritePre to format when saving.

Here is my solution. It doesn't exactly address the question part of "on save" but if you perform this action before save it will output errors you can then fix before save.
Also, it depends on only one external tool -- jq -- which has become the gold standard of unix shell JSON processing tools. And which you probably already have installed (macOS and Linux/Unix only; idk how this would behave in Windows)
Basically, it's just:
ggVG!jq '.'
That will highlight the entire JSON document then run it through jq which will just parse it for correctness, reformat it (e.g. fix any indents, etc), and spit the output back into the Vim editor.
If you want to parse only part of the document, you can highlight that part manually by pressing v or V and then run
!jq '.'
The benefit here is that you can fix subsections of your document this way.

Vim Autoformat
https://github.com/Chiel92/vim-autoformat
There is this Vim plugin which supports multiple auto format and indent schemes as well as extending with custom formatters per filetype.
https://github.com/Chiel92/vim-autoformat#default-formatprograms
Note:
You will need to have nodejs and js-beautify installed as vim-autoformat uses these as the default external tool.
npm install -g js-beautify

Another solution is to use coc-format-json.

I did some organizing (though some of it had nothing to do with vim) and to write the script by yourself on the neovim!
solution1: neovim
1-1: write the script by yourself
Neovim allows Python3 plugins to be defined by placing python files or packages in rplugin/python3/ in a runtimepath folder)
in my case
- init.vim
- rplugin/python3/[your_py_file_set].py
- rplugin/python3/fmt_file.py
The fmt_file.py as following
# rplugin/python3/fmt_file.py
import pynvim
import json
#pynvim.plugin
class Plugin:
__slots__ = ('vim',)
def __init__(self, vim):
self.vim = vim
#pynvim.command('FormatJson', nargs='*', range='')
def format_json(self, args, rg):
"""
USAGE::
:FormatJson
"""
try:
buf = self.vim.current.buffer
json_content: str = '\n'.join(buf[:])
dict_content: dict = json.loads(json_content)
new_content: str = json.dumps(dict_content, indent=4, sort_keys=True)
buf[:] = new_content.split('\n')
except Exception as e:
self.vim.current.line = str(e)
afterwards run: :UpdateRemotePlugins from within Nvim once, to generate the necessary Vimscript to make your Plugin available. (and you best restart the neovim)
and then, you open the JSON file that one you want to format and typing: :FormatJson in the command. all done.
don't forget to tell vim where is your python
" init.vim
let g:python3_host_prog = '...\python.exe''
and pip install pynvim
1-2: use tool.py
where tool.py is located on the Lib/json/tool.py
:%!python -m json.tool
solution2: command line
If you already install the python, and you can open the command line:
python -m json.tool "test.json" >> "output.json"
solution3: python
I write a simple script for those things.
"""
USAGE::
python fmt_file.py fmt-json "C:\test\test.json"
python fmt_file.py fmt-json "C:\test\test.json" --out_path="abc.json"
python fmt_file.py fmt-json "test.json" --out_path="abc.json"
"""
import click # pip install click
from click.types import File
import json
from pathlib import Path
#click.group('json')
def gj():
...
#gj.command('fmt-json')
#click.argument('file_obj', type=click.File('r', encoding='utf-8'))
#click.option('--out_path', default=None, type=Path, help='output path')
def format_json(file_obj: File, out_path: Path):
new_content = ''
with file_obj as f:
buf_list = [_ for _ in f]
if buf_list:
json_content: str = '\n'.join(buf_list)
dict_content: dict = json.loads(json_content)
new_content: str = json.dumps(dict_content, indent=4, sort_keys=True)
if new_content:
with open(out_path if out_path else Path('./temp.temp_temp.json'),
'w', encoding='utf-8') as f:
f.write(new_content)
def main():
for register_group in (gj,):
register_group()
if __name__ == '__main__':
main()

you can search for 'vim-json-line-format' plugin, Open a file in Normal mode, move your cursor on the json line, use <leader>pj to show formated json by print it, use <leader>wj could change the text to formatted json.
Invalid json can not format!

Use ALE to auto-format on save
Configure ALE to format JSON
add the following to .vim/vimfiles/after/ftplugin/json.vim:
let b:ale_fix_on_save = 1 " Fix files when they are saved.

Neo4j jexp/batch-import weird error: java.lang.NumberFormatException

I'm trying to import around 6M nodes using Michael Hunger's batch importer but I'm getting this weird error:
java.lang.NumberFormatException: For input string: "78rftark42lp5f8nadc63l62r3" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
It is weird because 78rftark42lp5f8nadc63l62r3 is the very first value of the big CSV file that I'm trying to import and its datatype is set to string.
These are the first three lines of that file:
name:string:sessions labels:label timestamp:long:timestamps visitor_pid referrer_url
78rftark42lp5f8nadc63l62r3 Session 1401277353000 cd7b76ef09b498e95b35b49de2925c5f http://someurl.com/blah?t=123
dt2gshq5pao8fg7bka8fdri123 Session 1401277329000 4036ac507698e4daf2ada98664da6d58 http://enter.url.com/signup/signup.php
As you can see here name:string:session the datatype of that column is set to string, why is the importer trying to parse the value as long?
I'm completely new to Neo4j and its ecosystem so I'm sure I'm missing something here.
This is the command I ran to import a bunch of nodes and relations:
./import.sh \
-db-directory sessions.db \
-nodes "toImport/browser-nodes.csv.gz,toImport/country-nodes.csv.gz,toImport/device-nodes.csv.gz,toImport/ip-nodes.csv.gz,toImport/language-nodes.csv.gz,toImport/operatingSystem-nodes.csv.gz,toImport/referrerType-nodes.csv.gz,toImport/resolution-nodes.csv.gz,toImport/session-nodes.csv" \
-rels "toImport/rel-session-browser.csv.gz,toImport/rel-session-country.csv.gz,toImport/rel-session-device.csv.gz,toImport/rel-session-ip.csv.gz,toImport/rel-session-language.csv.gz,toImport/rel-session-operatingSystem.csv.gz,toImport/rel-session-referrerType.csv.gz,toImport/rel-session-resolution.csv.gz"
The file that fails is the last one in the list of nodes toImport/session-nodes.csv
The other files were successfully processed by the importer.
This is the content of the batch.properties file:
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=1G
neostore.propertystore.db.index.mapped_memory=3G
neostore.nodestore.db.mapped_memory=1G
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=1G
batch_import.node_index.sessions=exact
batch_import.node_index.browsers=exact
batch_import.node_index.operatingsystems=exact
batch_import.node_index.referrertypes=exact
batch_import.node_index.devices=exact
batch_import.node_index.resolutions=exact
batch_import.node_index.countries=exact
batch_import.node_index.languages=exact
batch_import.node_index.ips=exact
batch_import.node_index.timestamps=exact
Any thoughts?
I can't see what's the problem here so any help will be appreciated.
EDIT:
I'm using this binary:
https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip

Ways to parse JSON using KornShell

I have a working code for parsing a JSON output using KornShell by treating it as a string of characters. The issue I have is that the vendor keeps changing the position of the field that I am intersted in. I understand in JSON, we can parse it by key-value pairs.
Is there something out there that can do this? I am intersted in a specific field and I would like to use it to run the checks on the status of another RESTAPI call.
My sample json output is like this:
JSONDATA value :
{
"status": "success",
"job-execution-id": 396805,
"job-execution-user": "flexapp",
"job-execution-trigger": "RESTAPI"
}
I would need the job-execution-id value to monitor this job through the rest of the script.
I am using the following command to parse it:
RUNJOB=$(print ${DATA} |cut -f3 -d':'|cut -f1 -d','| tr -d [:blank:]) >> ${LOGDIR}/${LOGFILE}
The problem with this is, it is field delimited by :. The field position has been known to be changed by the vendors during releases.
So I am trying to see if I can use a utility out there that would always give me the key-value pair of "job-execution-id": 396805, no matter where it is in the json output.
I started looking at jsawk, and it requires the js interpreter to be installed on our machines which I don't want. Any hint on how to go about finding which RPM that I need to solve it?
I am using RHEL5.5.
Any help is greatly appreciated.

The ast-open project has libdss (and a dss wrapper) which supposedly could be used with ksh. Documentation is sparse and is limited to a few messages on the ast-user mailing list.
The regression tests for libdss contain some json and xml examples.
I'll try to find more info.

Python is included by default with CentOS so one thing you could do is pass your JSON string to a Python script and use Python's JSON parser. You can then grab the value written out by the script. An example you could modify to meet your needs is below.
Note that by specifying other dictionary keys in the Python script you can get any of the values you need without having to worry about the order changing.
Python script:
#get_job_execution_id.py
# The try/except is because you'll probably have Python 2.4 on CentOS 5.5,
# and the straight "import json" statement won't work unless you have Python 2.6+.
try:
import json
except:
import simplejson as json
import sys
json_data = sys.argv[1]
data = json.loads(json_data)
job_execution_id = data['job-execution-id']
sys.stdout.write(str(job_execution_id))
Kornshell script that executes it:
#get_job_execution_id.sh
#!/bin/ksh
JSON_DATA='{"status":"success","job-execution-id":396805,"job-execution-user":"flexapp","job-execution-trigger":"RESTAPI"}'
EXECUTION_ID=`python get_execution_id.py "$JSON_DATA"`
echo $EXECUTION_ID

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Error in loading data in JSON file to Python Pandas dataframe - json

Related

Using sys.stdout.write() to create multiple files in NiFi?

split big json files into small pieces without breaking the format

How to auto format JSON on save in Vim

Neo4j jexp/batch-import weird error: java.lang.NumberFormatException

Ways to parse JSON using KornShell

Categories

Resources