Decompressing huge JSON gzip file causing memory error in Python

I have the following problem.
I am working on a small machine with low memory (1 GB).
My program downloads a huge gzip file from a URL, and I need to decompress it into a dict. I know for sure that the file is in JSON format.
My problem is that after I run the following commands I get a memory error:
data = zlib.decompress(url, 16 + zlib.MAX_WBITS).decode('utf8')
results_list.append(json.loads(data))
For small files this works fine, but for large ones I get the error.
My intuition tells me I should split the file into chunks, but then, because I am expecting a JSON file, I won't be able to restore the chunks back to JSON (because each part won't be a valid JSON string).
What should I do?
Thanks a lot!

Create a decompression object with z = zlib.decompressobj(16 + zlib.MAX_WBITS) (the same wbits value you are already using, so it accepts the gzip header), and then call z.decompress(some_compressed_data, max), which will return no more than max bytes of uncompressed data. You then call z.decompress(z.unconsumed_tail, max) repeatedly until the rest of some_compressed_data is consumed, and then feed it more compressed data.
You will then need to process the resulting uncompressed data one chunk at a time.
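Here is a minimal sketch of that loop, assuming the stream really is gzip (hence 16 + zlib.MAX_WBITS) and, purely for illustration, that the decompressed payload is newline-delimited JSON so each complete line can be parsed on its own; stream_json_records and some_url are made-up names:

import json
import urllib.request
import zlib

CHUNK = 64 * 1024     # read 64 KB of compressed data per request
MAX_OUT = 256 * 1024  # cap each decompressed piece at 256 KB

def stream_json_records(some_url):
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buffer = b""
    with urllib.request.urlopen(some_url) as resp:
        while True:
            compressed = resp.read(CHUNK)
            if not compressed:
                break
            piece = d.decompress(compressed, MAX_OUT)
            while True:
                buffer += piece
                # hand off complete lines, keep the trailing partial line
                *lines, buffer = buffer.split(b"\n")
                for line in lines:
                    if line.strip():
                        yield json.loads(line)
                if not d.unconsumed_tail:
                    break
                piece = d.decompress(d.unconsumed_tail, MAX_OUT)
    buffer += d.flush()
    if buffer.strip():
        yield json.loads(buffer)

If the payload is one single large JSON document rather than one record per line, the same decompression loop can instead feed an incremental JSON parser such as the third-party ijson library, since json.loads always needs the whole document in memory.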

Related

Best Approach to read large number of JSON Files (1 JSON per file) in Databricks

Hope everyone is doing well.
I am trying to read a large number of JSON files, around 150,000 files in a folder, using Azure Databricks. Each file contains a single JSON object, i.e. 1 record per file. Currently the read process takes over an hour just to read all the files, despite having a large cluster. The files are read using a pattern as shown below.
val schema_variable = <schema>
val file_path = "src_folder/year/month/day/hour/*/*.json"
// e.g. src_folder/2022/09/01/10/*/*.json
val df = spark.read
.schema(schema_variable)
.json(file_path)
.withColumn("file_name", input_file_name())
Is there any approach or option we can try to make the reads faster?
We have already considered copying the file contents into a single file and then reading it, but then we lose the lineage of the content, i.e. which record came from which file.
I have also gone through various links on SO, but most of them seem to be about a single file or a few files of huge size, say 10 GB to 50 GB.
Environment - Azure Databricks 10.4 Runtime.
Thank you for all the help.

Reading a large JSON file (more than 1 GB) takes forever and RStudio crashes

I have a JSON file exported from MongoDB. The total size of this JSON is more than 3 GB.
I am trying to parse nested columns and build my own data frame to write out as a CSV file.
My code works when the file is small, around 10-20 MB. When the file size increases, it takes an unbounded amount of time and at some point RStudio crashes.
I am using the tidyjson and dplyr libraries to parse the JSON file.

Avoid splitting JSON output by PySpark (v2.1)

Using Spark v2.1 and Python, I load JSON files with
sqlContext.read.json("path/data.json")
I have a problem with the output JSON. Using the command below,
df.write.json("path/test.json")
the data is saved in a folder called test.json (not a file), which contains two files: one empty and the other with a strange name:
part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f
Is there any way to get a clean single JSON output file?
Thanks
Yes, Spark writes the output in multiple files when you save. Since the computation is distributed, the output is written in multiple part files like part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f. The number of files created equals the number of partitions.
If your data is small and fits in memory, you can save the output as a single file. But if your data is large, saving to a single file is not recommended.
Actually, test.json is a directory, not a JSON file. It contains multiple part files inside it. This does not create any problem; you can easily read it back later.
If you still want your output in a single file, you need to repartition to 1, which brings all your data to a single node before saving. This may cause issues if you have a lot of data.
df.repartition(1).write.json("path/test.json")
Or coalesce to a single partition, which avoids a full shuffle:
df.coalesce(1).write.json("path/test.json")
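Even with a single partition, Spark still writes a directory containing one part file (plus _SUCCESS markers). If you really need a plain file, one option is to rename that part file afterwards; a minimal sketch, assuming the output path is on a local filesystem (on DBFS or HDFS you would use dbutils.fs or the Hadoop FileSystem API instead) and using placeholder paths:

import glob
import shutil

# After df.repartition(1).write.json("path/test.json") the directory holds a
# single part file; its exact name pattern varies by Spark version.
part_file = glob.glob("path/test.json/part-*")[0]
shutil.move(part_file, "path/test_single.json")  # give it a clean name
shutil.rmtree("path/test.json")                  # drop the leftover directory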

"resourcesExceeded" error when creating a table from a .avro file in BigQuery

I have uploaded a .avro file of about 100 MB to Google Cloud Storage. It was converted from an 800 MB .csv file.
When trying to create a table from this file in the BigQuery web interface, I get the following error after a few seconds:
script: Resources exceeded during query execution: UDF out of memory. (error code: resourcesExceeded)
Job ID audiboxes:bquijob_4462680b_15607de51b9
I checked the BigQuery Quota Policy and I think my file does not exceed it.
Is there a workaround, or do I need to split my original .csv in order to get multiple, smaller .avro files?
Thanks in advance !
This error means that the parser used more memory than allowed. We are working on fixing this issue. In the meantime, if you used compression in the Avro files, try removing it. Using a smaller data block size will also help.
And yes, splitting into smaller Avro files of 10 MB or less will help too, but the two approaches above are easier if they work for you.
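As a concrete illustration of that advice, here is a minimal sketch that rewrites an Avro file locally with no compression codec and small data blocks before re-uploading it to Cloud Storage; it assumes the third-party fastavro library, and the file names are placeholders:

from fastavro import reader, writer

with open("input.avro", "rb") as src, open("output_uncompressed.avro", "wb") as dst:
    avro_in = reader(src)
    writer(
        dst,
        avro_in.writer_schema,  # reuse the schema of the original file
        avro_in,                # stream records instead of loading them all
        codec="null",           # no compression
        sync_interval=16000,    # keep each data block small (~16 KB)
    )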

Upload file in chunks with unknown length

Can you give me an example (HTML or Java) of how to implement uploading a file in chunks? How can I upload a file if the length is unknown at the time of the request? I found this approach, https://developers.google.com/drive/manage-uploads#resumable, but I can't understand how to tell the server that the stream has terminated.
Thank you in advance.
aGO!
I can't give you an example with the Google Drive SDK, but I can tell you how it's done with raw HTTP requests.
First you have to initiate a resumable upload session, after which you can make partial uploads of unknown total size by setting the Content-Range header to something like bytes 0-262143/* for the first 256 KB, or whatever the size of your chunks is (they have to be multiples of 256 KB as per the documentation).
You make requests like this to upload intermediate chunks until you get to the final chunk, which should be smaller than those fixed intermediate chunks. This last chunk completes the file upload by setting the Content-Range field to <byte interval of the chunk>/<final size of file>.
The final size of the file can be calculated before the last request: the number of full-size chunks already uploaded * 256 KB + the size of the last chunk.
The algorithm is illustrated in more detail here: http://pragmaticjoe.blogspot.ro/2015/10/uploading-files-with-unknown-size-using.html
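A minimal sketch of that request loop, assuming the resumable session has already been initiated and its upload URI is in session_uri; the function name and the use of the third-party requests library are my own choices, not part of the Drive SDK:

import requests

CHUNK = 256 * 1024  # intermediate chunks must be multiples of 256 KB

def upload_unknown_length(session_uri, stream):
    # Push a stream of unknown total length to an existing resumable
    # upload session, one chunk at a time.
    response = None
    offset = 0
    chunk = stream.read(CHUNK)
    while chunk:
        next_chunk = stream.read(CHUNK)
        is_last = not next_chunk
        end = offset + len(chunk) - 1
        # '*' tells the server the total size is still unknown; the final
        # chunk replaces it with the real size, which terminates the upload.
        total = str(end + 1) if is_last else "*"
        headers = {
            "Content-Length": str(len(chunk)),
            "Content-Range": f"bytes {offset}-{end}/{total}",
        }
        response = requests.put(session_uri, data=chunk, headers=headers)
        # Intermediate chunks are normally answered with 308 (Resume
        # Incomplete); the final chunk should return 200 or 201.
        offset = end + 1
        chunk = next_chunk
    return response

The server learns that the stream has terminated exactly when it receives a Content-Range whose total is a number instead of *, so no separate "end of stream" request is needed.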