I have a JSON file with a large array of JSON objects. I am using JsonTextReader on StreamReader to read data from the files. But, also, some attributes need to be updated as well.
Is it possible to use JsonTextWriter to find and update a particular JSON object?
Generally, to modify a file means reading the whole file to memory, making the change, then writing the whole thing back out to the file. (There are certain file formats that don't require this by virtue of having a static-size layout or other mechanisms designed to work around having to read in the whole file but JSON isn't one of those.)
Json.NET can read and write JSON streams as a series of tokens, so it should be possible to minimize the memory footprint this way. You will still have to read through the entire file and write all of it back out, though. Because you are reading and writing at the same time, you need to write to a temp file and then, once you're done, move/rename that temp file over the original.
Depending on how you've structured the JSON, you may also need to keep track of where you are in that structure. This can be done by tracking the tokens as they're received and using them to maintain a kind of "path" into the structure. That path can be used to determine when you're at a place that needs updating.
The general strategy is to read in tokens, alter them if required, then write them out again.
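As a rough illustration of that read, alter, write loop, here is a minimal sketch in Python rather than Json.NET. It assumes the file holds one top-level array, and it streams whole array elements instead of individual tokens, so only one element is in memory at a time. The names `iter_array_items` and `update_array_file` are made up for this example.

```python
import json
import os
import tempfile


def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    eof = False
    while not eof:
        chunk = fp.read(chunk_size)
        eof = chunk == ""
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            if not started:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, started = buf[1:], True
                continue
            if buf[0] == ",":  # separators are skipped, not strictly validated
                buf = buf[1:]
                continue
            if buf[0] == "]":
                return
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise
                break  # the element is incomplete, read more input
            if end == len(buf) and not eof:
                break  # the value might continue in the next chunk
            yield item
            buf = buf[end:]
    raise ValueError("unexpected end of input")


def update_array_file(path, update):
    """Rewrite the array in `path`, passing every element through `update`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as dst, \
                open(path, encoding="utf-8") as src:
            dst.write("[")
            for i, item in enumerate(iter_array_items(src)):
                dst.write(",\n" if i else "\n")
                json.dump(update(item), dst)
            dst.write("\n]\n")
        os.replace(tmp_path, path)  # move the temp file over the original
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The `update` callback receives each element and returns the value to write, so an object that needs no change is simply returned as it came in.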
I've inherited some code which makes calls to a web API and gets a deeply nested (up to eight levels) response.
I've written some code to flatten the structure so that it can be written to .csv files, and a SQL database, for people to consume more easily.
What I'd really like to do though is keep a version of the original response, so that there's a reference of the original structure if I ever want/need it.
I understand that HDF5 is primarily meant to store numerical data. Is there any reason not to use it to dump JSON blobs? It seems a lot easier than setting up a NoSQL database.
It should be fine. It sounds like you'd be storing each JSON response as an HDF5 variable-length string, which is fine: it's just a string to the library.
Do you plan to store each response as a separate dataset? That may be inefficient if you are talking about thousands of responses.
Alternatively, you can create a 1-d extensible dataset, and just append to it with each response.
Decided it was easier to set up a Mongo database.
I want to process a ~300 GB JSON file in Hadoop. As far as I understand, a JSON document consists of a single string with data nested in it. If I want to parse that string using Google's GSON, won't Hadoop have to put the entire load on a single node, since the JSON is not logically divisible for it?
How do I partition the file (I can make out the partitions logically by looking at the data) if I want it to be processed in parallel on different nodes? Do I have to break the file up before I load it onto HDFS? Is it absolutely necessary that the JSON is parsed by one machine (or node) at least once?
Assuming you can logically split the JSON into separate components, you can accomplish this just by writing your own InputFormat.
Conceptually you can think of each of the logically divisible JSON components as one "line" of data. Where each component contains the minimal amount of information that can be acted on independently.
Then you will need to make a class, a FileInputFormat, where you will have to return each of these JSON components.
public class JSONInputFormat extends FileInputFormat<Text, JSONComponent> {...}
If you can logically divide your giant JSON into parts, do it, and save these parts as separate lines in file (or records in sequence file). Then, if you feed this new file to Hadoop MapReduce, mappers will be able to process records in parallel.
So, yes, the JSON should be parsed by one machine at least once. This preprocessing phase doesn't need to be performed in Hadoop; a simple script can do the work. Use a streaming API to avoid loading a lot of data into memory.
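The preprocessing script could look something like the sketch below. It assumes the giant file is one top-level JSON array whose elements are the independent records, and it never parses the JSON: it only tracks string state and bracket depth to find the element boundaries, then writes each element on its own line. The function name `array_to_lines` is made up here, and a per-character Python loop will be slow on hundreds of gigabytes, so treat it as an outline of the idea rather than a tuned tool.

```python
def array_to_lines(src, dst, chunk_size=1 << 16):
    """Copy each element of a top-level JSON array to its own line.

    Returns the number of records written.
    """
    depth = 0          # bracket/brace nesting depth outside strings
    in_string = False
    escaped = False    # previous character inside a string was a backslash
    record = []
    count = 0

    def flush():
        nonlocal count
        text = "".join(record).strip()
        record.clear()
        if text:
            dst.write(text + "\n")
            count += 1

    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if in_string:
                record.append(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
                continue
            if ch in "[{":
                depth += 1
                if depth == 1:
                    continue  # opening bracket of the outer array
            elif ch in "]}":
                depth -= 1
                if depth == 0:
                    flush()   # closing bracket of the outer array
                    continue
            elif ch == '"':
                in_string = True
            elif ch == "," and depth == 1:
                flush()       # separator between two top-level elements
                continue
            elif ch in "\r\n":
                continue      # raw newlines only occur outside strings
            if depth >= 1:
                record.append(ch)
    return count
```

Both arguments are text-mode file objects, for example `open("big.json", encoding="utf-8")` and `open("records.jsonl", "w", encoding="utf-8")` (file names are placeholders). The resulting file can be split on line boundaries, which is what lets the mappers work on records independently.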
You might find this JSON SerDe useful. It allows Hive to read and write in JSON format. If it works for you, it'll be a lot more convenient to process your JSON data with Hive, as you don't have to worry about the custom InputFormat that is going to read your JSON data and create splits for you.
I am designing a system with 30,000 objects or so and can't decide between the two: either have a JSON file precomputed for each one and get data by pointing to the URL of the file (I think Twitter does something similar), or have a PHP/Perl/whatever else script that will produce the JSON object on the fly when requested, from let's say a database, and send it back. Is one better suited than the other? I guess if it takes a long time to generate the JSON data, it is better to have the JSON files already done. What if generating is as quick as accessing a database? Although I suppose one would have a dedicated table in the database specifically for that. The data doesn't change very often, so updating is not a constant thing. In that respect the data is static for all intents and purposes.
Anyways, any thought would be much appreciated!
Alex
You might want to try MongoDB, which retrieves the objects as JSON, is highly scalable and is easy to set up.
I have a JSON file with a lot of unneeded data and I wish to get rid of most of it.
It's a huge file, so I need an automated operation that will do that.
I tried regex, but most of the apps I tried seem to get stuck in the middle of the process.
What I need is simply to find objects by their key and delete them from the file.
Any Ideas?
If the file is too large to be read into memory, you might want to use something like yajl, which provides an event-driven, SAX-like interface. This allows you to make changes to the JSON as you read it (and, I suppose, write it to another file).
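This is not yajl, but the "drop objects by key" step itself can be sketched with Python's standard `json` module, which calls a hook for every object as it is decoded. The sketch assumes the text handed to it fits in memory, so for a huge file it would be applied to one record at a time by whatever streaming reader you use. `make_key_filter` and `strip_keys` are names invented for this example.

```python
import json


def make_key_filter(unwanted):
    """Build an object_pairs_hook that drops the unwanted keys."""
    unwanted = frozenset(unwanted)

    def keep_wanted(pairs):
        # Called once per JSON object, innermost objects first,
        # with that object's (key, value) pairs.
        return {key: value for key, value in pairs if key not in unwanted}

    return keep_wanted


def strip_keys(text, unwanted):
    """Parse `text` and remove the unwanted keys at every nesting level."""
    return json.loads(text, object_pairs_hook=make_key_filter(unwanted))
```

For example, `strip_keys('{"id": 1, "debug": {"x": 1}}', ["debug"])` gives back a dictionary containing only `id`; passing the result to `json.dumps` produces the cleaned text to write to the new file.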
I want to do a bulk load into MongoDB. I have about 200GB of files containing JSON objects which I want to load. The problem is I cannot use the mongoimport tool, as the objects contain objects (i.e. I'd need to use the --jsonArray param), which is limited to 4MB.
There is the Bulk Load API in CouchDB where I can just write a script and use cURL to send a POST request to insert the documents, no size limits...
Is there anything like this in MongoDB? I know there is Sleepy but I am wondering if this can cope with a JSON nest array insert..?
Thanks!
Ok, basically it appears there is no really good answer unless I write my own tool in something like Java or Ruby to pass the objects in (meh effort)... But that's a real pain, so instead I decided to simply split the files down to 4MB chunks... I just wrote a simple shell script using split (note that I had to split the files multiple times because of the limitations). I used the split command with -l (line count) so each file had x number of lines in it. In my case each JSON object was about 4 KB, so I just guessed the line counts.
For anyone wanting to do this, remember that split can only make 676 files (26*26) with its default two-letter suffixes, so you need to make sure each file has enough lines in it to avoid missing half the files. Anyway, I put all this in a good old bash script, used mongoimport and let it run overnight. Easiest solution IMO, and no need to cut and mash files and parse JSON in Ruby/Java or whatever else.
The scripts are a bit custom, but if anyone wants them just leave a comment and I'll post them.
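These are not the original bash scripts, but the same line-based splitting can be sketched in Python. The version below assumes one JSON object per line, as with `split -l`, and uses six-digit numeric suffixes so the number of output files is not capped at 676. The function name and the output naming scheme are invented for this example.

```python
import os


def split_by_lines(path, lines_per_file, out_dir):
    """Split a line-oriented file into parts of at most `lines_per_file` lines.

    Returns the list of part paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(path)
    parts = []
    out = None
    try:
        with open(path, encoding="utf-8") as src:
            for n, line in enumerate(src):
                if n % lines_per_file == 0:
                    # Start a new part: <name>.000000, <name>.000001, ...
                    if out is not None:
                        out.close()
                    part_path = os.path.join(out_dir, f"{base}.{len(parts):06d}")
                    out = open(part_path, "w", encoding="utf-8")
                    parts.append(part_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

With objects of roughly 4 KB each, a `lines_per_file` somewhat under 1000 would keep each part below the 4MB mentioned above, though that figure is still a guess in the same way the original line counts were.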
Without knowing anything about the structure of your data I would say that if you can't use mongoimport you're out of luck. There is no other standard utility that can be tweaked to interpret arbitrary JSON data.
When your data isn't a 1:1 fit to what the import utilities expect, it's almost always easiest to write a one-off import script in a language like Ruby or Python to do it. Batch inserts will speed up the import considerably, but don't make the batches too large or you will get errors (the max size of an insert in 1.8+ is 16 MB). In the Ruby driver a batch insert can be done by simply passing an array of hashes to the insert method, instead of a single hash.
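The batching part of such a one-off script might look like this in Python. It is only a sketch: the size of each document is estimated from its JSON text, which is a proxy for what actually goes over the wire rather than the exact figure, and the `max_docs` default of 1000 is an arbitrary value picked for the example. Each yielded list is what you would hand to the driver's batch insert (for instance PyMongo's `insert_many`).

```python
import json

MAX_BATCH_BYTES = 16 * 1024 * 1024  # the 16 MB insert limit quoted above


def batches(docs, max_bytes=MAX_BATCH_BYTES, max_docs=1000):
    """Group documents into lists that stay under a size and count limit."""
    batch, size = [], 0
    for doc in docs:
        # Estimated size only: the encoded JSON text, not the wire format.
        doc_size = len(json.dumps(doc).encode("utf-8"))
        if doc_size > max_bytes:
            raise ValueError("a single document exceeds the batch size limit")
        if batch and (size + doc_size > max_bytes or len(batch) >= max_docs):
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch
```

Because `docs` can be any iterable, it can be fed from a generator that reads one object per line, so the whole 200GB never has to sit in memory.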
If you add an example of your data to the question I might be able to help you further.