I have a very large JSON file (~30 GB, 65e6 lines) that I would like to process using some dataframe structure. The dataset does of course not fit into memory, so I ultimately want to use an out-of-memory solution like dask or vaex. I am aware that in order to do this I would first have to convert it into a memory-mappable format like HDF5 (if you have suggestions for the format, I'll happily take them; the dataset includes categorical features among other things).
Two important facts about the dataset:
The data is structured as a list, with each dict-style JSON object on its own line. This means I can easily convert it to line-delimited JSON by parsing it and removing the square brackets and commas, which is good.
The JSON objects are deeply nested, and there is variability in which keys are present. This means that if I use a reader for line-delimited JSON that processes chunks sequentially (like pandas.read_json() with lines=True and a chunksize), the resulting dataframes after flattening (pd.json_normalize) might not all have the same columns, which is bad for streaming them into an HDF5 file.
Before I spend an awful lot of time writing a script that extracts all possible keys, streams each column of a chunk one by one to the HDF5 file, and inserts NaNs wherever needed (roughly the kind of script sketched below):
Does anyone know a more elegant solution to this problem? Your help would be really appreciated.
P.S. Unfortunately I can't really share any data, but I hope that the explanations above describe the structure well enough. If not, I will try to provide similar examples.
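For concreteness, here is a rough two-pass sketch of the kind of script I mean (the file names, chunk size, and the dotted-key flattening helper are all made up):

import json
import pandas as pd

SRC, DST = "data.jsonl", "data.h5"   # illustrative paths

def flat_keys(obj, prefix=""):
    # Yield dotted key paths of a (possibly nested) dict, mirroring json_normalize.
    for k, v in obj.items():
        if isinstance(v, dict):
            yield from flat_keys(v, f"{prefix}{k}.")
        else:
            yield f"{prefix}{k}"

# First pass: collect every flattened column name that occurs anywhere.
all_columns = set()
with open(SRC) as f:
    for line in f:
        all_columns.update(flat_keys(json.loads(line)))
all_columns = sorted(all_columns)

# Second pass: flatten chunk by chunk, align to the full schema, append to HDF5.
with pd.HDFStore(DST, mode="w") as store:
    for chunk in pd.read_json(SRC, lines=True, chunksize=100_000):
        flat = pd.json_normalize(chunk.to_dict(orient="records"))
        flat = flat.reindex(columns=all_columns)      # missing keys become NaN
        store.append("records", flat, min_itemsize=200)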
As a general rule, what you need is a stream/event-oriented JSON parser. See for example json-stream. Such a parser can handle input of any size with a fixed amount of memory. Instead of loading the entire JSON to memory, the parser calls your functions in response to individual elements in the tree. You can write your processing in callback functions. If you need to do more complex or repeated processing of this data, it might make sense to store it in a database first.
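As a minimal sketch of that idea (using ijson, another stream-oriented parser, and assuming the top-level value is a JSON array of objects; process_record stands in for whatever callback you need):

import ijson

def process_record(record):
    # Placeholder: flatten, aggregate, or write the record somewhere.
    pass

with open("data.json", "rb") as f:
    # "item" matches each element of the top-level array; only one element
    # is materialized in memory at a time.
    for record in ijson.items(f, "item"):
        process_record(record)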
Related
In simple words: Is
{
  "diary": {
    "number": 100,
    "year": 2006
  },
  "case": {
    "number": 12345,
    "year": 2006
  }
}
or
{
  "diary_number": 100,
  "diary_year": 2006,
  "case_number": 12345,
  "case_year": 2006
}
better when using Elasticsearch?
In my case there are only a few keys in total (10-15). Which is better performance-wise?
The use case is displaying data from a NoSQL database (mostly DynamoDB) and also feeding it into Elasticsearch.
My rule of thumb: if you need to query or update nested fields, use the flat structure.
If you use the nested structure, Elasticsearch will flatten it internally anyway, but then carries the overhead of managing those relations. Performance-wise, flat is always better, since Elasticsearch doesn't need to relate and find nested documents.
Here's an excerpt from Managing Relations Inside Elasticsearch which lists some disadvantages you might want to consider.
Elasticsearch is still fundamentally flat, but it manages the nested relation internally to give the appearance of nested hierarchy. When you create a nested document, Elasticsearch actually indexes two separate documents (root object and nested object), then relates the two internally. Both docs are stored in the same Lucene block on the same Shard, so read performance is still very fast.

This arrangement does come with some disadvantages. Most obvious, you can only access these nested documents using a special nested query. Another big disadvantage comes when you need to update the document, either the root or any of the objects.

Since the docs are all stored in the same Lucene block, and Lucene never allows random write access to its segments, updating one field in the nested doc will force a reindex of the entire document. This includes the root and any other nested objects, even if they were not modified. Internally, ES will mark the old document as deleted, update the field and then reindex everything into a new Lucene block. If your data changes often, nested documents can have a non-negligible overhead associated with reindexing.

Lastly, it is not possible to "cross reference" between nested documents. One nested doc cannot "see" another nested doc's properties. For example, you are not able to filter on "A.name" but facet on "B.age". You can get around this by using include_in_root, which effectively copies the nested docs into the root, but this gets you back to the problems of inner objects.
Nested data is quite good. Unless you explicitly declare diary and case as nested fields, they will be indexed as object fields, so Elasticsearch will itself convert them to
{
  "diary.number": 100,
  "diary.year": 2006,
  "case.number": 12345,
  "case.year": 2006
}
Consider also that every field value in Elasticsearch can be an array. You need the nested datatype only if you have many diaries in a single document and need to "maintain the independence of each object in the array".
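As a small sketch of that difference (Python with the requests library against a local cluster; the index name, URL, and field types are made up), here "diary" is left as a plain object field while "case" is explicitly mapped as nested:

import requests

ES = "http://localhost:9200"

# Create an index where "diary" is an ordinary object field and "case" is nested.
requests.put(f"{ES}/filings", json={
    "mappings": {
        "properties": {
            "diary": {"properties": {"number": {"type": "integer"},
                                     "year":   {"type": "integer"}}},
            "case":  {"type": "nested",
                      "properties": {"number": {"type": "integer"},
                                     "year":   {"type": "integer"}}},
        }
    }
})

# Object fields are queried by their dotted names...
flat_query = {"query": {"match": {"diary.number": 100}}}

# ...while nested fields require the special nested query mentioned above.
nested_query = {"query": {"nested": {
    "path": "case",
    "query": {"bool": {"must": [{"match": {"case.number": 12345}},
                                {"match": {"case.year": 2006}}]}},
}}}

print(requests.post(f"{ES}/filings/_search", json=flat_query).json())
print(requests.post(f"{ES}/filings/_search", json=nested_query).json())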
The answer is a clear it-depends. JSON is famous for its nested structures. However, some tools can only deal with key-value structures and flat JSON, and I feel Elastic is more fun with flat JSON, in particular if you use Logstash; see e.g. https://discuss.elastic.co/t/what-is-the-best-way-of-getting-mongodb-data-into-elasticsearch/40840/5
I am happy to be proven wrong.
Why does PostgreSQL still offer the plain json type now that jsonb exists? The first reason that pops up in the mind is backward compatibility, but maybe much more significant reasons exist?
json has its uses:
it has less processing overhead when you store and retrieve the whole column, because it is stored as plain text and doesn't have to be parsed and converted to the internal binary representation.
it conserves the formatting and attribute order.
it does not remove duplicate attributes.
In short, json is better if you don't want to process the column inside the database, that is, when the column is just used to store application data.
If you want to process the JSON inside the database, jsonb is better.
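A quick sketch of that difference (Python with psycopg2 and a reachable database; the connection string is made up):

import psycopg2

conn = psycopg2.connect("dbname=test")   # hypothetical connection string
cur = conn.cursor()

doc = '{"b": 1,  "a": 2, "a": 3}'
cur.execute("SELECT %s::json::text, %s::jsonb::text", (doc, doc))
as_json, as_jsonb = cur.fetchone()

print(as_json)    # {"b": 1,  "a": 2, "a": 3}  -- spacing, order, duplicate key kept
print(as_jsonb)   # {"a": 3, "b": 1}           -- parsed, de-duplicated, reordered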
I need to create one JSON file for each row of the dataframe. I'm using partitionBy, which creates a subfolder for each file. Is there a way to avoid creating the subfolders and to rename the JSON files with the unique key?
Or are there any other alternatives? It's a huge dataframe with ~300K unique values, so repartitioning is eating up a lot of resources and taking a long time. Thanks.
df.select(Seq(col("UniqueField").as("UniqueField_Copy")) ++ df.columns.map(col): _*)
  .write
  .partitionBy("UniqueField")
  .mode("overwrite")
  .format("json")
  .save("c:/temp/json/")
Putting all the output in one directory
Your example code is calling partitionBy on a DataFrameWriter object. The documentation tells us that this function:
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/
year=2016/month=02/
This is the reason you're getting subdirectories. Simply removing the call to partitionBy will get all your output in one directory.
Getting one row per file
Spark SQL
You had the right idea partitioning your data by UniqueField, since Spark writes one file per partition. Rather than using DataFrameWriter's partitionBy, you can use
df.repartitionByRange(numberOfJson, $"UniqueField")
to get the desired number of partitions, with one JSON document per partition. Notice that this requires you to know the number of JSON files you will end up with in advance. You can compute it with
val numberOfJson = df.select(count($"UniqueField")).first.getAs[Long](0)
However, this adds an additional action to your query, which will cause your entire dataset to be computed again. It sounds like your dataset is too big to fit in memory, so you'll need to carefully consider whether caching (or checkpointing) with df.cache (or df.checkpoint) actually saves you computation time. (For large datasets that don't require intensive computation to create, recomputation can actually be faster.)
RDD
An alternative to using the Spark SQL API is to drop down to the lower-level RDD API. Partitioning an RDD by key (in PySpark) was discussed thoroughly in the answer to this question. In Scala, you'd have to specify a custom Partitioner, as described in this question.
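For reference, a rough PySpark sketch of that RDD route (the toy DataFrame and output path are made up): key each row by UniqueField, give every distinct key its own partition, and let saveAsTextFile write one part-file per partition.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["UniqueField", "value"])

keys = [r["UniqueField"] for r in df.select("UniqueField").distinct().collect()]
index = {k: i for i, k in enumerate(keys)}              # key -> partition id

(df.rdd
   .map(lambda row: (row["UniqueField"], json.dumps(row.asDict())))
   .partitionBy(len(keys), lambda k: index[k])          # one key per partition
   .values()
   .saveAsTextFile("/tmp/json_per_key"))                # part-00000, part-00001, ...

Note that collecting the distinct keys is itself an extra action, so the same caching caveat as above applies.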
Renaming Spark's output files
This is a fairly common question, and as far as I know the consensus is that it's not possible.
Hope this helps, and welcome to Stack Overflow!
I would like to insert data into MongoDB in Perl. I can insert Perl objects such as hash-refs, but I also want to attach prepared JSON documents to them.
I have these JSON documents in text files. I can transform them into hash-refs and then put them into the database, but I am looking for a more efficient way because of the amount of data I need to process.
Is it possible? I can do the inserts; I am just looking for an optimization.
Similar topic (but without an answer to this question):
Insert into mongodb with perl
Technical aspects:
For one insert, one file of 100 kB - 1 MB is processed. It contains 4 JSON strings among the rest of the text, each string about 2-15k characters. I extract some properties from the file and from the rest of the text and keep them in a hash-ref. I don't need any information from these JSON strings in the rest of my program; I only want to put everything together into the database.
There's no direct way to insert JSON into MongoDB. It always has to be processed into MongoDB's wire format. For Perl that means JSON decoding and then inserting with the driver, which, as you point out, has overhead.
If you just have JSON data, the best thing might be to use the mongoimport tool that comes with the database.
I am trying to understand why JSON is so widely used for data transfer between client and server. I understand that it offers a simple design which is easy to understand. However, consider the following:
A JSON string includes repeated data: e.g., in the case of a table, the column names (keys) are repeated in each object. Would it not be wiser to send the columns as a first object, and the rest of the objects as the data from the table (without the column/key information)?
Once we have a JSON object, searching based on keys is expensive (in time) compared to using indexes. Imagine a table with 20-30 columns: doing this search for each key of each object would cost a lot more time than directly using indexes.
There may be many more drawbacks and advantages; add them here if you know of one.
I think if you want data transfer, then you want a table-based format. JSON is not a table-based format like standard databases or Excel. This can complicate analyzing the data if there is a problem, because someone will usually use Excel for that (sorting, filtering, formulas). Also, building test files will be more difficult, because you can't simply use Excel to export to JSON.
But if you wanted to use JSON for data transfer, you could basically build a JSON version of a CSV file; you would only use arrays:
Columns: ["First_Name", "Last_Name"]
Rows: [
["Joe", "Master"],
["Alice", "Gooberg"]
.... etc
]
Seems messy to me though.
If you wanted to use objects, then you would have to embed the column names for every bit of data, which in my opinion indicates the wrong approach.
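For what it's worth, here is a rough sketch of the round trip between the two shapes discussed above (the field names are just examples):

# A list of objects repeats the keys in every row...
records = [
    {"First_Name": "Joe", "Last_Name": "Master"},
    {"First_Name": "Alice", "Last_Name": "Gooberg"},
]

# ...while a columns + rows layout sends the keys only once.
columns = sorted({k for r in records for k in r})
rows = [[r.get(c) for c in columns] for r in records]

# The receiving side can restore the objects with a zip.
restored = [dict(zip(columns, row)) for row in rows]
assert restored == records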