I've got a huge (>4GB) JSON file where all the information is written on one single line. Unfortunately, my scripts can't work with such a huge file, and I wasn't able to split it into multiple lines. Every script or program I've tried crashed because it ran out of memory.
Can you help me split the JSON file into several files, for example with a bash command?
Related
I am trying to merge multiple small JSON files (about 500,000 files of 400-500 bytes each, which are no longer subject to change) into one big CSV file, using AWS Lambda. I have a job that works something like this:
Use s3.listObjects() to fetch the keys
Use s3.getObject() to fetch each JSON file (is there a better way to do this?)
Create a CSV file in-memory (what's the best way to do this in nodejs?)
Upload that file to S3
I'd love to know if there's a better way to go about doing this. Thanks!
I would recommend using Amazon Athena.
It allows you to run SQL commands across multiple data files simultaneously (including JSON) and can create output files via CTAS (Creating a Table from Query Results) - see the Amazon Athena documentation.
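For example, here is a rough sketch of the kind of CTAS query that could produce delimited output, submitted from the shell with the AWS CLI. The table, column, database and bucket names are placeholders, and it assumes you have already defined an Athena table over the JSON files (e.g. with a JSON SerDe):

# Sketch only: all names and S3 paths below are made up for illustration.
QUERY=$(cat <<'SQL'
CREATE TABLE merged_csv
WITH (format = 'TEXTFILE', field_delimiter = ',',
      external_location = 's3://my-output-bucket/merged/')
AS SELECT id, name, created_at
FROM raw_json_table
SQL
)

# Writes the query result as comma-delimited text under external_location.
aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=my_database \
  --result-configuration OutputLocation=s3://my-output-bucket/athena-results/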
I’ve got a live firebase app with a database that’s about 5GB in size. The firebase dashboard refuses to show me the contents of my database and just fails to load every time, presumably because the thing is too big. I’ve been digging around for some time now in search of some tool that makes it possible for me to come up with an ERD of my data. Help?
Atom crashes, vim takes forever and doesn't load anything, jq simply spits out a formatted version of my data, I've tried a couple of Java tools to generate JSON schemas but they crash after a while, and most Python programs that do the same don't even start properly.
How would you explore 5GB of json data?
Most file editors paginate by line, so your file should load.
Unless it's a single-line file.
In that case, you can use sed or jq to reformat the file in order to have more than one line.
After that operation you should be able to open it.
In case you need to extract data, you could use grep "what you need to extract" file.json.
That should work even on a single-line 5GB file.
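For instance, a minimal sketch of both steps (the file names and the grep pattern are placeholders, and the reformatting lines assume the top level of the file is a JSON array of objects):

# Reformat to one JSON object per line. Plain jq parses the whole document
# into memory; the --stream variant below is the usual escape hatch for
# files too big for that.
jq -c '.[]' huge.json > one-per-line.json
jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' huge.json > one-per-line.json

# Extract: once it is one record per line, ordinary line tools work.
grep '"status":"error"' one-per-line.json | wc -l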
In Progress 4GL, I am exporting some values from multiple procedure (.p) files to a single CSV file. But when running the second procedure file, the values I got from the previous file get overwritten... How can I export the data from all the procedure files to a single CSV file? Thanks in advance.
The quick answer is to open the second and subsequent outputs to the file as
OUTPUT TO file.txt APPEND.
if that is all you need. If you are looking to do something more complex, then you could define and open a new shared stream in the calling program, and use that stream in each of the called programs, thus only opening and closing the stream once.
If you're using persistent procedures and functions, this answer may help, as it's a little more complex than normal shared streams.
I would really not suggest using a SHARED stream, especially with persistent procedures or OO. STREAM-HANDLEs provide a more flexible way of distributing the stream.
So, as was previously suggested:
On the first run you do:
OUTPUT TO file.txt.
On every run after that you do:
OUTPUT TO file.txt APPEND.
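For illustration, here is a rough sketch of the STREAM-HANDLE approach mentioned above. The stream, variable and procedure names are made up for the example:

/* caller.p - open the stream once and hand its handle to each called .p */
DEFINE STREAM sExport.
DEFINE VARIABLE hExport AS HANDLE NO-UNDO.

OUTPUT STREAM sExport TO VALUE("all-data.csv").
hExport = STREAM sExport:HANDLE.

RUN writepart1.p (INPUT hExport).
RUN writepart2.p (INPUT hExport).

OUTPUT STREAM sExport CLOSE.

/* writepart1.p - writes through whatever stream handle it was given */
DEFINE INPUT PARAMETER phOut AS HANDLE NO-UNDO.

PUT STREAM-HANDLE phOut UNFORMATTED "some,values,from,this,procedure" SKIP.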
Is it possible to perform any sort of indirection in SSIS?
I have a series of jobs that perform FTP and loop through the files before running another DTSX package on them. Currently this incurs a lot of repeated cruft for pulling down the files and for logging.
Is there any way of redesigning this so I only need one package rather than 6?
Based on your comment:
Effectively the 6 packages are really 2 x 3. The 1st for each "group" is the FTP pull-down and XML parsing to place the data into flat tables. The 2nd then transforms and loads that data.
Instead of downloading files using one package and inserting data into tables using another package, you can do that in a single package.
Here is a link containing an example which downloads files from FTP and saves them to local disk.
Here is a link containing an example that loops through CSV files in a given folder and inserts that data into a database.
Since you are using XML files, here is a link that shows how to loop through XML files.
You can effectively combine the above examples into a single package by placing the control flow tasks one after the other.
Let me know if this is not what you are looking for.
I want to do a bulk load into MongoDB. I have about 200GB of files containing JSON objects which I want to load. The problem is I cannot use the mongoimport tool, as the objects contain nested objects (i.e. I'd need to use the --jsonArray param), which is limited to 4MB.
There is the Bulk Load API in CouchDB where I can just write a script and use cURL to send a POST request to insert the documents, no size limits...
Is there anything like this in MongoDB? I know there is Sleepy, but I am wondering if it can cope with a nested JSON array insert...?
Thanks!
OK, it basically appears there is no really good answer unless I write my own tool in something like Java or Ruby to pass the objects in (meh, effort)... That's a real pain, so instead I decided to simply split the files down into 4MB chunks. I just wrote a simple shell script using split (note that I had to split the files multiple times because of the limitations). I used the split command with -l (lines per output file) so each file had a fixed number of lines in it. In my case each JSON object was about 4KB, so I just guessed at the line count.
For anyone wanting to do this, remember that split can only make 676 files (26*26) with its default two-character suffix, so you need to make sure each file has enough lines in it to avoid missing half the files. Anyway, I put all this in a good old bash script, used mongoimport, and let it run overnight. Easiest solution IMO, and no need to cut and mash files and parse JSON in Ruby/Java or whatever else.
The scripts are a bit custom, but if anyone wants them just leave a comment and I'll post them.
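For anyone who just wants the general shape of it, a rough sketch (the chunk size, file names, and database/collection names are placeholders, not my actual script):

# Split the dump into chunks of ~1000 lines (one JSON object per line).
# -a 3 uses three-character suffixes, allowing more than the default 676 files.
split -l 1000 -a 3 big-dump.json chunk_

# Import each chunk in turn.
for f in chunk_*; do
    mongoimport --db mydb --collection stuff --file "$f"
done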
Without knowing anything about the structure of your data I would say that if you can't use mongoimport you're out of luck. There is no other standard utility that can be tweaked to interpret arbitrary JSON data.
When your data isn't a 1:1 fit to what the import utilities expect, it's almost always easiest to write a one-off import script in a language like Ruby or Python to do it. Batch inserts will speed up the import considerably, but don't make the batches too large or you will get errors (the maximum size of an insert in 1.8+ is 16MB). In the Ruby driver a batch insert can be done by simply passing an array of hashes to the insert method, instead of a single hash.
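The above describes the Ruby driver; here is a rough sketch of the same batching idea using Python's pymongo driver (the file, database and collection names are placeholders, and it assumes one JSON object per line):

import json
from pymongo import MongoClient

client = MongoClient()              # assumes a local mongod on the default port
coll = client["mydb"]["stuff"]      # placeholder database / collection names

batch, BATCH_SIZE = [], 1000        # keep each batch well under the insert size limit
with open("big-dump.json") as fh:   # assumes one JSON object per line
    for line in fh:
        batch.append(json.loads(line))
        if len(batch) >= BATCH_SIZE:
            coll.insert_many(batch)  # batch insert, as described above
            batch = []
if batch:
    coll.insert_many(batch)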
If you add an example of your data to the question I might be able to help you further.