With Spark 1.6.2, reading a gzip-compressed JSON file from a normal file system:
val df = sqlContext
  .read
  .json("file:///data/blablacar/transactions.json.gz")
  .count()
will use a single task on a single worker.
But if I save the file:
sc.textFile("file:///data/blablacar/transactions.json.gz")
  .saveAsTextFile("file:///user/blablacar/transactions")
sqlContext.read.json("file:///user/blablacar/transactions")
  .count()
the first job will execute as a single task, but the JSON decoding will run across several tasks (which is good!).
Why didn't Spark unzip the file in memory and distribute the JSON decoding across several tasks in the first case?
Because gzip compression is not splittable, the file has to be loaded as a whole on a single machine. If you want parallel reads:
Don't use gzip at all, or
Use gzip compression on smaller files comparable to split size, or
Unpack files yourself before you pass them to Spark (a minimal sketch follows below).
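A sketch of that last option, assuming plain Python on the machine that holds the file and reusing the path from the question; gzip and shutil copy the data in chunks, so the whole archive is never held in memory:

import gzip
import shutil

# Decompress the archive up front so Spark can split the plain-text JSON file
# across tasks; copyfileobj copies in chunks rather than loading everything.
with gzip.open("/data/blablacar/transactions.json.gz", "rb") as src, \
        open("/data/blablacar/transactions.json", "wb") as dst:
    shutil.copyfileobj(src, dst)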
Calling .repartition(8) did the trick!
val rdd = sc.textFile("file:///data/blablacar/transactions.json.gz")
  .repartition(8)
sqlContext.read.json(rdd)
  .count()
Related
I am trying to access a JSON object which is stored as a gzipped (.gz) file on a website. I would like to do this directly with urllib if possible.
This is what I have tried:
import urllib.request
import gzip
import json

# get the gzipped file
test = urllib.request.Request('http://files.tmdb.org/p/exports/movie_ids_01_27_2021.json.gz')

# unzip and read
with gzip.open(test, 'rt', encoding='UTF-8') as zipfile:
    my_object = json.loads(zipfile)
but this fails with:
TypeError: filename must be a str or bytes object, or a file
Is it possible to read the JSON directly like this (i.e. without downloading the file locally)?
Thank you.
Use the requests library. pip install requests if you don't have it.
Then use the following code:
import requests
r = requests.get('http://files.tmdb.org/p/exports/movie_ids_01_27_2021.json.gz')
print(r.content)
r.content will be the binary content of the gzip file, but it will consume 11352985 bytes of memory (10.8 MB) because the data needs to be kept somewhere.
Then you can use
gzip.decompress(r.content)
to decompress the gzip binary and get the data. Note that the decompressed data will consume considerably more memory.
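Putting the two steps together, here is a minimal sketch that fetches, decompresses, and parses the export in one go. It assumes the payload is newline-delimited JSON (one object per line), which is what the TMDB daily exports appear to use:

import gzip
import json

import requests

r = requests.get('http://files.tmdb.org/p/exports/movie_ids_01_27_2021.json.gz')
text = gzip.decompress(r.content).decode('utf-8')

# Assumption: one JSON object per line (newline-delimited JSON).
movies = [json.loads(line) for line in text.splitlines() if line.strip()]
print(len(movies), movies[0])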
I have a list containing millions of small records as dicts. Instead of serialising the entire thing to a single file as JSON, I would like to write each record to a separate file. Later I need to reconstitute the list from JSON deserialised from the files.
My goal isn't really minimising I/O so much as a general strategy for serialising individual collection elements to separate files concurrently or asynchronously. What's the most efficient way to accomplish this in Python 3.x or a similar high-level language?
For those looking for a modern Python-based solution supporting async/await, I found this neat package which does exactly what I'm looking for: https://pypi.org/project/aiofiles/. Specifically, I can do
import json
from typing import Iterable

import aiofiles

async def json_reader(files: Iterable):
    """An async generator that reads and parses JSON from a list of files."""
    for file in files:
        async with aiofiles.open(file) as f:
            data = await f.read()
            yield json.loads(data)
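For the write side of the question (one file per record), a hedged sketch in the same style; the output directory and file-naming scheme are made up for illustration, and asyncio.gather fans the writes out concurrently:

import asyncio
import json

import aiofiles

async def write_record(index, record, out_dir="records"):
    # One JSON file per record; the naming scheme is just an example.
    async with aiofiles.open(f"{out_dir}/{index}.json", "w") as f:
        await f.write(json.dumps(record))

async def write_all(records):
    await asyncio.gather(*(write_record(i, r) for i, r in enumerate(records)))

# Usage: asyncio.run(write_all(my_records))

With millions of records you would want to cap concurrency (for example with an asyncio.Semaphore, or by writing in batches) rather than scheduling one task per record at once.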
We need to implement a cron service in Node.js that follows this flow:
query lots of data (about 500 MB) from Postgres
transform the JSON data into a different JSON structure
convert the JSON to CSV
gzip it
upload it to S3 with the "upload" method
Obviously, we need to implement this procedure using streams, without generating memory overhead.
We have run into a lot of problems:
we are using Sequelize, an SQL ORM. With it, we can't stream the queries, so we are converting the JSON returned by the query into a readable stream
we can't find an elegant and clever way to implement a transform stream that transforms the JSON returned by the query (for example, input -> [{a:1,b:2}...] --> output -> [{a1:1,b1:2}...])
while logging and trying to write to the file system instead of S3 (using fs.createWriteStream), it seems that the file is created as soon as the pipeline starts, but it stays at about 10 bytes and only reaches its real size when the streaming process finishes. Furthermore, a lot of RAM is used, and streaming seems to bring no benefit in terms of memory usage.
How would you write this flow in Node.js?
I've used the following libraries during my experiments:
json2csv-stream
JSONStream
oboe
zlib
fs
aws-sdk
Since the Sequelize results are being read into memory anyway, I don't see the point of setting up a stream to transform the JSON (as opposed to directly manipulating the data that's already in memory). But say you were to port the Sequelize queries to mysql, which does provide streaming; then you could use something like this:
const es = require('event-stream');
const csv = require('fast-csv');
const gzip = require('zlib').createGzip();
const AWS = require('aws-sdk');
const s3Stream = require('s3-upload-stream')(new AWS.S3());
// Assume `connection` is a MySQL connection.
let sqlStream = connection.query(...).stream();
// Create the mapping/transforming stream.
let mapStream = es.map(function(data, cb) {
  // ...modify `data`...
  cb(null, data);
});
// Create the CSV outputting stream.
let csvStream = csv.createWriteStream();
// Create the S3 upload stream.
let upload = s3Stream.upload(...);
// Let the processing begin.
sqlStream.pipe(mapStream).pipe(csvStream).pipe(gzip).pipe(upload);
If the "input stream" were emitting JSON, you can replace sqlStream with something like this:
const JSONStream = require('JSONStream');
someJSONOutputtingStream.pipe(JSONStream.parse('*'))
(the rest of the pipeline would remain the same)
I'm trying to read a JSON log file and insert it into a Solr collection using Apache NiFi. The log file is in the following format (one JSON object per line):
{"#timestamp": "2017-02-18T02:16:50.496+04:00","message": "hello"}
{"#timestamp": "2017-02-18T02:16:50.496+04:00","message": "hello"}
{ "#timestamp": "2017-02-18T02:16:50.496+04:00","message": "hello"}
I was able to load the file and split it by lines using different processors. How can I proceed further?
You can use the PutSolrContentStream processor to write content to Solr from Apache NiFi. If each flowfile contains a single JSON record (and you should ensure you are splitting the JSON correctly even if it covers multiple lines, so examine SplitJSON vs. SplitText), each will be written to Solr as a different document. You can also use MergeContent to write in batches and be more efficient.
Bryan Bende wrote a good article on the Apache site on how to use this processor.
I am trying to parse JSON files and insert them into a SQL DB. My parser worked perfectly fine as long as the files were small (less than 5 MB).
I am getting an "Out of memory exception" when trying to read the larger (> 5 MB) files.
if (System.IO.Directory.Exists(jsonFilePath))
{
    string[] files = System.IO.Directory.GetFiles(jsonFilePath);
    foreach (string s in files)
    {
        var jsonString = File.ReadAllText(s);
        fileName = System.IO.Path.GetFileName(s);
        ParseJSON(jsonString, fileName);
    }
}
I tried the JSONReader approach, but no luck getting the entire JSON into a string or variable. Please advise.
Use 64-bit; check RredCat's answer on a similar question:
Newtonsoft.Json - Out of memory exception while deserializing big object
Newtonsoft Json Performance Tips
Read the article by David Cox about tokenizing:
"The basic approach is to use a JsonTextReader object, which is part of the Json.NET library. A JsonTextReader reads a JSON file one token at a time. It, therefore, avoids the overhead of reading the entire file into a string. As tokens are read from the file, objects are created and pushed onto and off of a stack. When the end of the file is reached, the top of the stack contains one object — the top of a very big tree of objects corresponding to the objects in the original JSON file"
Parsing Big Records with Json.NET
The json file is too large to fit in memory, in any form.
You must use a JSON reader that accepts a filename or stream as input. It's not clear from your question which JSON Reader you are using. From which library?
If your JSON reader builds the whole JSON tree, you will still run out of memory. As you read the JSON file, either cherry-pick the data you are looking for, or write data structures to another on-disk format that can be easily queried, for example, an SQLite database.