Edit a large JSON file

How can I edit a large JSON file manually?
I have a large JSON file, about 100 MB. I'd like to manually inspect some attributes, and then add more attributes to some of the objects.
I'd start off by looking at a subset of the file, say the first 100 objects, and gradually scale up to maybe 250, then a thousand, etc.
Can someone suggest a language or software (I'm running Windows) that excels at this task?
Some previous suggestions that aren't working or can't work:
Sublime - Could never load the file. Loading bar forever. Had to kill.
NotePad++ - Could never load. Froze. Had to kill.
Anything online - The data is confidential.
More Python and Jupyter information.
import json

with open(path, 'r') as f:
    data = json.load(f)

for i, (k, v) in enumerate(data.items()):
    print(i, k, v)
    if i == 2:
        break
Causes an error. I think it has to do with Jupyter, but I'm not sure.
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
That makes me wonder if going about it this way is just dumb.
Possible Solutions
Build a custom app using Tkinter
Just don't use a Jupyter Notebook

What you can do is write a simple GUI program using Tkinter: create a window with a text area to show the JSON, a text box where you input how many objects you want to see, a button named Next (or similar) to page forward, and one more button to save. The functionality for each item would be as follows (a minimal sketch is given after the list).
First, you will read the complete JSON file in Python and load it into a dict.
Next button - this keeps iterating based on the value in the text box. You could write a custom generator that yields the required number of values at a time.
Save button - this saves the current JSON into a new file, or, if you prefer, you can write a function to update the current file directly.
Text area - take the chunk of the dictionary yielded by the Next button's generator, convert it back to JSON, and show it here.
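A minimal sketch of that design, assuming the top level of the file is a JSON object; the path, the default page size, and the output file name are placeholders:

import json
import tkinter as tk
from itertools import islice

PATH = "data.json"  # hypothetical path to your large JSON file

with open(PATH, "r") as f:
    data = json.load(f)  # assumes the top level is a JSON object (dict)

items = iter(data.items())  # iterator the Next button pulls pages from

root = tk.Tk()
root.title("JSON pager")

count_box = tk.Entry(root)
count_box.insert(0, "100")  # default page size (placeholder)
count_box.pack()

text_area = tk.Text(root, width=100, height=40)
text_area.pack()

def show_next():
    # Pull the next N items from the iterator and display them as JSON.
    n = int(count_box.get())
    chunk = dict(islice(items, n))
    text_area.delete("1.0", tk.END)
    text_area.insert(tk.END, json.dumps(chunk, indent=2))

def save():
    # Write the (possibly modified) dict back out to a new file.
    with open("data_edited.json", "w") as out:
        json.dump(data, out, indent=2)

tk.Button(root, text="Next", command=show_next).pack()
tk.Button(root, text="Save", command=save).pack()

root.mainloop()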

If you are using Linux (or have an opportunity to transfer the file to *nix), you might wish to check the number of lines in the file via
wc -l myfile.json
Let's say, for simplicity, that your file has 2,530,000 lines and you wish to split it into chunks of 100k lines each. You can use any of the commands available on your distro (split -l 100000 myfile.json, for instance) to break the file into the desired chunks and then edit them one by one.
If you are comfortable with going the "Linux way", check out some of the hints given on other topics, e.g.
edit multi-GB file when vi editor doesn't work
I hope it helps!

The only viewer I have used that works on large files (I had files of up to 250 MB) is Dadroit. It is fast to view and comes with search.
Now, to edit, I use vi: I search for the location and make local edits. Vim or another simple editor should work on Windows. Have you tried VS Code? 100 MB shouldn't be too large for it.
The other awesome terminal tool for viewing and editing data is VisiData. I have had mixed luck getting it to work on JSON files.

Not the best answer, but the problem with reading the JSON seems limited to Jupyter Notebooks (or perhaps to the limitations of my laptop).
Working in Spyder or running from the command line circumvents the Jupyter error mentioned in the original question.
It'd be great if someone knew how to tweak Jupyter to avoid this problem (sorry, I'm not sure how yet).
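For reference, the error message itself names the setting to raise. A minimal sketch, assuming the standard config file at ~/.jupyter/jupyter_notebook_config.py (generate it with jupyter notebook --generate-config if it doesn't exist):

# Raise the IOPub data rate limit from the 1 MB/s default named in the error.
# The `c` object here is provided by Jupyter's config system.
c.NotebookApp.iopub_data_rate_limit = 10000000  # bytes/sec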

For an editor, try Notepad++.
For a language, try Python.
Since you haven't given your data structure, I can't give a more specific answer.

Related

Trying to parse a JSON file but it seems the format is different or something is wrong with the JSON file

Hi, I'm trying to parse any of the files from the link below. I've tried reaching out to the owner of the data dumps, but nothing we try will parse the files as proper JSON. No program we use (Power BI, Jupyter, Excel, anything really) wants to recognise the files as JSON, and we can't figure out why. I was wondering if anyone could help figure out what the issue is, as this dataset is very interesting to me and my co-students. I hope I'm using the word 'parsing' correctly.
The data dumps are linked below:
https://files.pushshift.io/reddit/comments/
The file I downloaded (I just tried one at random) was handled just fine by jq, my preferred command-line tool for processing JSON files.
jq accepts an input consisting of a sequence of JSON objects, which is what I found when I decompressed the test file. This format is commonly known as JSON lines, and many tools can handle it. The Wikipedia article on JSON streaming contains more information and a (possibly outdated) list of tools.
If your tools aren't capable of handling more than one JSON object in an input, you could turn the files into something you can handle by adding a comma to the end of every line except the last one (since each JSON object is on a single line) and then wrapping the whole input in a pair of square brackets to turn the sequence into a JSON list. Since JSON does not actually care about newlines, it would be sufficient to add a line containing [ at the beginning and a line containing ] at the end. I don't know what command-line tools you have available and are comfortable with, but the task shouldn't be too difficult; a small Python sketch follows.
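For instance, a minimal sketch of that wrapping trick, with hypothetical file names:

# Turn a JSON Lines file (one JSON object per line) into a single JSON array.
with open("comments.jsonl") as src, open("comments.json", "w") as dst:
    dst.write("[\n")
    first = True
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if not first:
            dst.write(",\n")  # comma after every object except the last
        dst.write(line)
        first = False
    dst.write("\n]\n")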

Managing a large SPSS (*.sav) file (4.2 GB)

I have received an SPSS file from a survey fielded by another company that allegedly contains only ~1,500 respondents, but the file size has somehow ballooned to 4.2 GB. My hunch is that the file is from a global survey, and the 1,500 records selected are US-only, so the file carries a series of blank variables and the metadata for those variables, possibly in multiple languages/alphabets.
I only need a subset of this data, and could likely work with it if I removed the metadata, but my issue has been that I can't get the damn thing open to cut down on the number of variables. I have been using the tools at my disposal to try the following workarounds, though I'm sure there are better options:
Opening the file using PSPP (freeware SPSS) - this causes the PSPP to stop responding
Using the R command read.spss (from the foreign package) to write a .csv - this claims that the file has a duplicate variable name and won't proceed further
Using the R command spss.system.file to write a .csv - when I tried this, R spent a long time thinking as it attempted to run, and it kept going for a couple of hours with no apparent success.
Using the PSPP text conversion tool (https://pspp.benpfaff.org/) to create either a dictionary or a .csv file - both of these options crash after the file has completed uploading.
I've gone back to the other company to have them work on reducing the file size; however, I wasn't sure if anyone else had ideas for doing either of the following:
Open the file using another program/converter that could turn it into a .csv or other similarly skinny file format
Use another program to at least read only the variable names included in the file so that I can provide the other company with the specific variables I need
The following command from PSPP should do what you need:
$ pspp-convert originalFile.sav output.csv
In case it doesn't, please provide the terminal error message.
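For the second ask (reading only the variable names), one hedged option is the Python package pyreadstat, which can load just the metadata; the file name here is hypothetical:

# pip install pyreadstat
# metadataonly=True skips the data rows, so even a 4.2 GB file opens quickly.
import pyreadstat

_, meta = pyreadstat.read_sav("survey.sav", metadataonly=True)
for name, label in zip(meta.column_names, meta.column_labels):
    print(name, "-", label)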

Puppet - CSV file header

I'm writing a Puppet (3.6.2) module that reads data fields from a CSV file via the extlookup function, and I cannot figure out how to tell extlookup that the first line is the header field. Does extlookup support this? If not, can anyone recommend an external function I could import and use?
thanks,
PS - Yes I know about hiera, and having the data in YAML or JSON files but my requirement is CSV files only.
Brandon
The behavior of extlookup() is pretty well documented. It makes no special provision for column headers, which are by no means an inherent feature of CSV format. Indeed, if your header line is not readable as a data line, then your file is not CSV at all.
Supposing that your file is indeed valid CSV, the absolute simplest solution would be to ignore the issue. It presents a problem only if the first column heading duplicates an actual or potential data name. If it does not, then you will never look up or use the pseudo-value represented by the first row.
If your file in fact is not CSV on account of its first line, or if the first column name conflicts with a real data name, then it seems the next best alternative would be to just remove that line, or to avoid creating it in the first place. I don't see any reason why one of these should not be possible.
I know about hiera, and having the data in YAML or JSON files, but my requirement is CSV files only.
How sad. Do be aware that extlookup() has long been deprecated, and it was removed from Puppet 4.
I'm inclined to suggest you implement a translator from CSV to Hiera-friendly YAML and use Hiera in your module; a sketch of such a translator follows. Alternatively, Hiera supports custom backends, and it's not too hard to write one. I am unaware of an existing CSV backend for Hiera, but you could write one. Ignoring a header line would then be under your control, and you would simultaneously achieve a measure of future-proofing.
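A minimal sketch of that translator, assuming a CSV whose first two columns are key and value, a header line to skip, and hypothetical file names (pip install pyyaml):

import csv
import yaml

data = {}
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header line; it's now under your control
    for key, value, *_ in reader:  # assumes at least two columns per row
        data[key] = value

# Write a Hiera-friendly YAML data file.
with open("common.yaml", "w") as f:
    yaml.safe_dump(data, f, default_flow_style=False)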

How to automate getting a CSV file from this website?

I've never worked with web pages before and I'd like to know how best to automate the following through programming/scripting:
go to http://financials.morningstar.com/ratios/r.html?t=GMCR&region=USA&culture=en_US
invoke the 'Export to CSV' button near the top right
save this file into a local directory
parse the file
Step 4 doesn't need to use the same language as steps 1-3, but ideally I would like to do everything in one shot using one language.
I noticed that if I hover my mouse over the button it says: javascript:exportKeyStat2CSV(); Is this a java function I could call somehow?
Any suggestions are appreciated.
It's a JavaScript function, which is not Java!
At first glance it may seem like you need to execute JavaScript to get this done, but if you look at the source of the document, you can see the function is simply implemented like this:
function exportKeyStat2CSV(){
    var orderby = SRT_keyStuts.getOrderFromCookie("order");
    var urlstr = "//financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XNAS:GMCR&region=usa&culture=en-US&cur=&order="+orderby;
    document.location = urlstr;
}
So, it builds a URL which is completely fixed except for the order-by part, which is taken from a cookie. Then it simply navigates to that URL by setting document.location. A small test shows you even get a CSV file if you leave the order-by part empty, so you can probably just download the CSV from the base URL that is in the code.
Downloading can be done using various tools, for instance Wget for Windows. See SuperUser for more possibilities. Anyway, steps 1 to 3 are actually just a single command.
After that, you just need to parse the file. Parsing CSV files can be done using batch, and there are several examples available. I won't get into details, since you didn't provide any in your question.
PS. I'd check their terms of use before you actually implement this.
The button directs me to this link:
http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XNAS:GMCR&region=usa&culture=en-US&cur=&order=asc
You could use the Python 3 module urllib to fetch the file, save it using the os or shutil modules, then parse it with one of the many CSV parsing modules, or make your own.
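A minimal sketch of that approach, assuming the endpoint still responds as described; the output file name is hypothetical:

import csv
import urllib.request

URL = ("http://financials.morningstar.com/ajax/exportKR2CSV.html"
       "?&callback=?&t=XNAS:GMCR&region=usa&culture=en-US&cur=&order=asc")

# Steps 1-3 in one call: fetch the export URL and save it locally.
urllib.request.urlretrieve(URL, "gmcr.csv")

# Step 4: parse it with the standard csv module.
with open("gmcr.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)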

export plots with netlogo

I am trying to export all the plots of my NetLogo model after simulation runs, in CSV format, with the primitive export-all-plots.
I haven't yet found a way to open this CSV file with an external reader in order to get clearer plots. I tried gnuplot, but it looks like it's not able to open the CSV format created by NetLogo:
"export-plots data (NetLogo 5.0.5)"
^
"C:\results\interface.csv", line 1: invalid command
How can I open csv plots with an external reader?
There are a few complicating factors in NetLogo's plot export format. First, there's a three-line header at the beginning (plus an empty line after) that just gives information about the model and when the data was generated. Next, there's data about the model settings and the plot state (pen colors and such). Finally, there's the data itself, which is itself somewhat complicated by the fact that you can have multiple pens per plot. So I'm not surprised gnuplot couldn't read it as is.
The tables are quite easy to use in a GUI spreadsheet application, like Excel, LibreOffice Calc, or Gnumeric. You can just select the data you want and generate the plots.
To do this at the command line, I'm afraid you might have to write a script to read it in. This should be pretty easy in something like Python or R. Just skip the metadata lines, and use a CSV parser to read in the rest.
You might also try using BehaviorSpace to generate the data, but make sure to use the table output. It lets you generate the data from many runs at once, and the format is a little more consistent. There are still 6 lines of metadata at the top, but you can just delete those. I believe this is more the standard practice in NetLogo.
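A minimal sketch of reading that table output in Python, assuming the 6 metadata lines mentioned above; the file name is hypothetical:

import csv

with open("experiment-table.csv", newline="") as f:
    for _ in range(6):
        next(f)  # drop the metadata header
    reader = csv.reader(f)
    header = next(reader)  # column names
    for row in reader:
        print(dict(zip(header, row)))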