Efficient Portable Database for Hierarchical Dataset - Json, Sqlite or?

I need to make a file that contains a hierarchical dataset. The dataset in question is a file-system listing (directory names, file name/sizes in each directory, sub-directories, ...).
My first instinct was to use JSON and flatten the hierarchy using paths so the parser doesn't have to recurse so much. As seen in the example below, each entry is a path ("/", "/child01", "/child01/gchild01", ...) and its files.
{
  "entries":
  [
    {
      "path": "/",
      "files":
      [
        {"name": "File1", "size": 1024},
        {"name": "File2", "size": 1024}
      ]
    },
    {
      "path": "/child01",
      "files":
      [
        {"name": "File1", "size": 1024},
        {"name": "File2", "size": 1024}
      ]
    },
    {
      "path": "/child01/gchild01",
      "files":
      [
        {"name": "File1", "size": 1024},
        {"name": "File2", "size": 1024}
      ]
    },
    {
      "path": "/child02",
      "files":
      [
        {"name": "File1", "size": 1024},
        {"name": "File2", "size": 1024}
      ]
    }
  ]
}
Then I thought that repeating the keys over and over ("name", "size") for each file kind of sucks. So I found this article about how to use Json as if it were a database - http://peter.michaux.ca/articles/json-db-a-compressed-json-format
Using that technique I'd have a Json table like "Entry" with columns "Id", "ParentId", "EntryType", "Name", "FileSize" where "EntryType" would be 0 for Directory and 1 for File.
So, at this point, I'm wondering if sqlite would be a better choice. I'm thinking that the file size would be a LOT smaller than a Json file, but it might only be negligible if I use Json-DB-compressed format from the article. Besides size, are there any other advantages that you can think of?
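To make that concrete, here is a minimal sketch of the "Entry" table in SQLite, using Python's built-in sqlite3 module; the column names follow the description above, everything else is an assumption:

# Hypothetical sketch of the "Entry" table described above, using Python's
# built-in sqlite3 module. EntryType 0 = directory, 1 = file.
import sqlite3

conn = sqlite3.connect("listing.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS Entry (
        Id        INTEGER PRIMARY KEY,
        ParentId  INTEGER REFERENCES Entry(Id),
        EntryType INTEGER NOT NULL,   -- 0 = directory, 1 = file
        Name      TEXT NOT NULL,
        FileSize  INTEGER             -- NULL for directories
    )
""")

# Root directory, one child directory, and one file inside that child.
conn.execute("INSERT INTO Entry VALUES (1, NULL, 0, '/', NULL)")
conn.execute("INSERT INTO Entry VALUES (2, 1, 0, 'child01', NULL)")
conn.execute("INSERT INTO Entry VALUES (3, 2, 1, 'File1', 1024)")
conn.commit()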

I think a JavaScript object as the data source, loaded as a file into the browser and then consumed by JavaScript logic there, would take the least effort and perform well, but only up to a limited hierarchy size.
Also, keeping the hierarchy only as a JSON file, and not storing it anywhere else, limits your data source to client-side technologies or forces conversions to other technologies later.
If you are building a pure JavaScript application (an HTML/JS/CSS-only app), then you could keep it as a JSON object alone, limit your hierarchy sizes, and split bigger hierarchies into multiple files that link JSON objects together.
If your project will have server-side code such as PHP, then for maintainability and scaling you should ideally store the data in a SQLite DB and, at runtime, build your JSON hierarchies a few levels at a time as AJAX loads from your page.
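As a rough illustration of that server-side approach (Python here rather than PHP, and reusing the hypothetical Entry schema sketched earlier), an endpoint could return one directory level at a time as JSON:

# Rough sketch: fetch one directory level from the hypothetical Entry table
# and serialize it as JSON for an AJAX response. Schema and names are assumed.
import json
import sqlite3

def level_as_json(db_path, parent_id):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT Id, EntryType, Name, FileSize FROM Entry WHERE ParentId IS ?",
        (parent_id,),
    ).fetchall()
    conn.close()
    return json.dumps([
        {"id": r[0], "type": "dir" if r[1] == 0 else "file",
         "name": r[2], "size": r[3]}
        for r in rows
    ])

print(level_as_json("listing.db", 1))  # children of the root entry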

If this is the only data your application stores then you can do something really simple like just store the data in an easy to parse/read text file like this:
File1:1024
File2:1024
child01
    File1:1024
    File2:1024
    gchild01
        File1:1024
        File2:1024
child02
    File1:1024
    File2:1024
Files get a Name:Size line and directories get just their name; indentation gives the structure. For something slightly more standard but just as easy to read, use YAML.
http://www.yaml.org/
Both can benefit from decreased file size (but decreased user readability) by gzipping the file.
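For example, a quick Python sketch of that (file names are made up):

# Sketch: gzip the plain-text listing and read it back. File names are made up.
import gzip
import shutil

with open("listing.txt", "rb") as src, gzip.open("listing.txt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)      # compress

with gzip.open("listing.txt.gz", "rt") as f:
    print(f.read())                   # decompress and read back as text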
And if you have more data to store, then use SQLite. SQLite is great.
Don't use JSON for data persistence. It's wasteful.

Related

Data Factory Copy Data Source as Body of Sink Object

I've been trying to create an ADF pipeline to move data from one of our databases into an Azure storage folder, but I can't seem to get the transform to work correctly.
I'm using a Copy Data task and have the source and sink set up as datasets and data is flowing from one to the other, it's just the format that's bugging me.
In our Database we have a single field that contains a JSON object, this needs to be mapped into the sink object but doesn't have a Column name, it is simply the base object.
So for example the source looks like this
and my output needs to look like this
[
  {
    "ID": 123,
    "Field": "Hello World",
    "AnotherField": "Foo"
  },
  {
    "ID": 456,
    "Field": "Don't Panic",
    "AnotherField": "Bar"
  }
]
However, the Copy Data task only seems to accept a direct Source -> Sink mapping, and it is also treating the SQL Server field as VARCHAR (which I suppose it is). So as a result I'm getting this out the other side:
[
  {
    "Json": "{\"ID\": 123,\"Field\":\"Hello World\",\"AnotherField\":\"Foo\"}"
  },
  {
    "Json": "{\"ID\": 456,\"Field\":\"Don't Panic\",\"AnotherField\":\"Bar\"}"
  }
]
I've tried using the internal #json() parse function on the source field but this causes errors in the pipeline. I also can't get the sink to just map directly as an object inside the output array.
I have a feeling I just shouldn't be using Copy Data, or that Copy Data doesn't support the level of transformation I'm trying to do. Can anybody set me on the right path?
Using a JSON dataset as a source in your data flow allows you to set five additional settings. These settings can be found under the JSON settings accordion in the Source Options tab. For the Document Form setting, you can select one of the Single document, Document per line, or Array of documents types.
Select Document Form as Array of documents.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/format-json

How to store large JSON documents (>20MB) in MongoDB without using GridFS

I want to store a large document in MongoDB, however, these are the two ways I will interact with the document:
I do frequent reads of that data and need to get a part of that data using aggregations
When I need to write to the document, I will be building it from scratch again, i.e remove the document that exists and insert a new one.
Here is what a sample document looks like:
{
  "objects_1": [
    {
    }
  ],
  "objects_2": [
    {
    }
  ],
  "objects_3": [
    {
    }
  ],
  "policy_1": [
    {
    }
  ],
  "policy_2": [
    {
    }
  ],
  "policy_3": [
    {
    }
  ]
}
Here is how I want to access that data:
{
  "objects_1": [
    {
    }
  ]
}
If I was storing it in a conventional way, I would write a query like this:
db.getCollection('configuration').aggregate([
  { $match: { _id: "FAAAAAAAAAAAA" } },
  { $project: {
      "_id": 0,
      "a_objects": {
        $filter: {
          input: "$settings.a_objects",
          as: "arrayItem",
          cond: { $eq: [ "$$arrayItem.name", "objectName" ] }
        }
      }
  }}
])
However, since the size of the document is >16 MB, we can't save it directly in MongoDB. The size can be at most 50 MB.
Solutions I thought of:
I thought of storing the JSON data in GridFS format and reading it as described in the docs here: https://docs.mongodb.com/manual/core/gridfs/. However, I would then need to read the entire file every time I want to look up a single object inside the large JSON blob, and I need to do such reads frequently, on multiple large documents, which would lead to high memory usage.
I thought of splitting the JSON into parts and storing each object in its own separate collection; when I need to fetch the entire document, I can reassemble the JSON.
How should I approach this problem? Is there something obvious that I am missing here?
I think your problem is that you're not using the right tools for the job, or not using the tools you have in the way they were meant to be used.
If you want to persist large objects as JSON then I'd argue that a database isn't a natural choice for that - especially if the objects are large. I'd be looking at storage systems designed to do that well (say if your solution is on Azure/AWS/GCP see what specialist service they offer) or even just the file system if you run on a local server.
There's no reason why you can't have the JSON in a file and related data in a database - yes there are issues with that but the limitations of MongoDB won't be one of them.
I do frequent reads of that data and need to get a part of that data using aggregations
If you are doing frequent reads, and only for part of the data, then forcing your system to always read the whole record means you are just penalizing yourself. One option is to store the bits that are highly read in a way that doesn't incur the performance penalty of the full read.
Storing objects as JSON means you can change your program and data without having to worry about what the database looks like, which is convenient. But it also has its limitations. If you think you have hit those limitations, then now might be the time to consider a re-architecture.
I thought of splitting the JSON into parts and storing each object in its own separate collection; when I need to fetch the entire document, I can reassemble the JSON.
That's definitely worth looking into. You just need to make sure that the different parts are not stored in the same table/rows, otherwise there'll be no improvement. Think carefully about how you split the objects up; think about the key scenarios the objects deal with, e.g. you mention reads. Designing the sub-objects to align with key scenarios is the way to go.
For example, if you commonly show an object's summary in a list of object summaries (e.g. search results), then the summary text, object name, id are candidates for object data that you would split out.
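A hedged sketch of that splitting idea with pymongo (the collection and field names below are assumptions, not from the question):

# Sketch of the "split into parts" idea using pymongo. Collection and field
# names are assumptions; adapt them to the real document structure.
from pymongo import MongoClient

db = MongoClient()["configuration"]

def save_config(config_id, big_doc):
    # Store each top-level section (objects_1, policy_1, ...) as its own document.
    db.config_parts.delete_many({"config_id": config_id})
    db.config_parts.insert_many([
        {"config_id": config_id, "section": section, "items": items}
        for section, items in big_doc.items()
    ])

def load_section(config_id, section):
    # Read back just one section instead of the whole >16 MB blob.
    part = db.config_parts.find_one({"config_id": config_id, "section": section})
    return part["items"] if part else None

Each section document still has to stay under the 16 MB limit, so a single very large array may need further splitting, e.g. one document per array item.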

Manipulating (nested) JSON keys and their values, using NiFi

I am currently facing an issue where I have to read a JSON file that has mostly the same structure, has 10k+ lines, and is nested.
I thought about creating my own custom processor which reads the JSON and replaces several matching keys/values with the ones needed. As I am trying to use NiFi, I assume there should be a more comfortable way, since the JSON structure itself is mostly consistent.
I already tried the ReplaceText processor as well as the JoltTransformJson processor, but I could not figure it out. How can I transform both keys and values, if needed? For example, if there is something like this:
{
  "id": "test"
},
{
  "id": "14"
}
It might be necessary to turn "id" into "Number" and map "test" to "3", as I am using different keys/values in my JSON files/database, so they need to fit those. Is there a way of doing so without having to create my own processor?
Regards,
Steve

Parsing translatable messages from JSON file

I have a project that I want to be localizable. While most strings are in the source code, where xgettext/Poedit can easily find them when wrapped with the localization function call, some are in pure JSON files, that I'm using for data storage. Since it's just JSON, and not actually JS, I can't use function calls. For example, a little database:
somedb.txt
[
{ "id": 1, "name": "Xyz", "local": "AxWhyZzz", /*...*/ },
/*...*/
]
Is there a way to extract the "local" values from the JSON files with xgettext? And if there isn't, what are my options? Creating a source file that has all local values, wrapped with calls to _?
Alternatively I could write my own parser of course, or modify gettext, but I'd much rather use existing solutions if available.
No, there isn't a way. JSON is just a generic container format, the actual meaning of the values is domain/application specific — and xgettext must understand the meaning to know what to extract. How could it understand your homegrown format?
For XML files, this is solved by ITS (v2), which gettext (and thus Poedit) supports since 0.19.7. But for JSON, nothing exists… yet. There's some work being done (see here and here and here), though.
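If you go the route the question itself suggests (generating a throwaway source file that wraps every "local" value in _() so xgettext can see it), a minimal Python sketch could look like this, assuming somedb.txt is valid JSON (the /*...*/ placeholders aside) and with made-up output file names:

# Sketch: pull the "local" values out of the JSON database and write a dummy
# source file wrapping them in _() so xgettext can extract them.
import json

with open("somedb.txt", encoding="utf-8") as f:
    entries = json.load(f)

with open("somedb_messages.c", "w", encoding="utf-8") as out:
    for entry in entries:
        local = entry.get("local")
        if local:
            out.write('_("%s");\n' % local.replace('"', '\\"'))

# Then run, for example:  xgettext --keyword=_ somedb_messages.c -o messages.pot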
Here is a way to load them as JS arrays through XMLHttpRequest: http://codepen.io/KryptoniteDove/post/load-json-file-locally-using-pure-javascript
Also, there is a way to include somedb.txt as valid JS if you modify it by assigning the array to a variable (somevar here) to provide further access:
somevar = [
{ "id": 1, "name": "Xyz", "local": "AxWhyZzz", /*...*/ },
/*...*/
]

How to store JSON data on local machine?

This question has plagued me for months, now, and no matter how many articles and topics I read, I've gotten no good information...
I want to send a request to a server which returns a JSON file. I want to take those results and load them into tables on my local machine. Preferably Access or Excel so I can sort and manipulate the data.
Is there a way to do this...? Please help!!
Google comes up with this: json2excel.
Or write your own little application.
EDIT
I decided to be nice and write a python3 application for you. Use it on the command line like this: python jsontoxml.py infile1.json infile2.json, and it will output infile1.json.xml and infile2.json.xml.
#!/usr/bin/env python3
import json
import sys
import re
from xml.dom.minidom import parseString

if len(sys.argv) < 2:
    print("Need to specify at least one file.")
    sys.exit()

ident = " " * 4

for infile in sys.argv[1:]:
    with open(infile) as jsonfile:
        orig = json.load(jsonfile)

    def parseitem(item, document):
        # Dicts become nested elements, lists are flattened, anything else is text.
        if isinstance(item, dict):
            parsedict(item, document)
        elif isinstance(item, list):
            for listitem in item:
                parseitem(listitem, document)
        else:
            document.append(str(item))

    def parsedict(jsondict, document):
        for name, value in jsondict.items():
            document.append("<%s>" % name)
            parseitem(value, document)
            document.append("</%s>" % name)

    document = []
    parsedict(orig, document)

    xmlcontent = parseString("".join(document)).toprettyxml(ident)
    # Collapse text-only elements onto one line, per
    # http://stackoverflow.com/questions/749796/pretty-printing-xml-in-python/3367423#3367423
    xmlcontent = re.sub(r">\n\s+([^<>\s].*?)\n\s+</", r">\g<1></", xmlcontent, flags=re.DOTALL)

    with open(infile + ".xml", "w") as outfile:
        outfile.write(xmlcontent)
Sample input
{"widget": {
"debug": "on",
"window": {
"title": "Sample Konfabulator Widget",
"name": "main_window",
"width": 500,
"height": 500
},
"image": {
"src": "Images/Sun.png",
"name": "sun1",
"hOffset": 250,
"vOffset": 250,
"alignment": "center"
},
"text": {
"data": "Click Here",
"size": 36,
"style": "bold",
"name": "text1",
"hOffset": 250,
"vOffset": 100,
"alignment": "center",
"onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
}
}}
Sample output
<widget>
    <debug>on</debug>
    <window title="Sample Konfabulator Widget">
        <name>main_window</name>
        <width>500</width>
        <height>500</height>
    </window>
    <image src="Images/Sun.png" name="sun1">
        <hOffset>250</hOffset>
        <vOffset>250</vOffset>
        <alignment>center</alignment>
    </image>
    <text data="Click Here" size="36" style="bold">
        <name>text1</name>
        <hOffset>250</hOffset>
        <vOffset>100</vOffset>
        <alignment>center</alignment>
        <onMouseUp>
            sun1.opacity = (sun1.opacity / 100) * 90;
        </onMouseUp>
    </text>
</widget>
It's probably overkill, but MongoDB uses JSON-style documents as its native format. That means you can insert your JSON data directly with little or no modification. It can handle JSON data on its own, without you having to jump through hoops to force your data into a more RDBMS-friendly format.
It is open source software and available for most major platforms. It can also handle extreme amounts of data and multiple servers.
Its command shell is probably not as easy to use as Excel or Access, but it can do sorting etc on its own, and there are bindings for most programming languages (e.g. C, Python and Java) if you find that you need to do more tricky stuff.
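For instance, a minimal pymongo sketch for loading a JSON response and sorting it (database, collection, field and file names are made up):

# Sketch: load a JSON array of objects into MongoDB with pymongo and sort it there.
# Database, collection, field and file names are made up.
import json
from pymongo import MongoClient

with open("response.json", encoding="utf-8") as f:
    records = json.load(f)                 # expects [ {...}, {...}, ... ]

coll = MongoClient()["scratch"]["records"]
coll.insert_many(records)

for doc in coll.find().sort("size", -1).limit(10):
    print(doc)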
EDIT:
For importing/exporting data from/to other more common formats MongoDB has a couple of useful utilities. CSV is supported, although you should keep in mind that JSON uses structured objects and it is not easy to come up with a direct mapping to a table-based model like CSV, especially on a schema-free database like MongoDB.
Converting JSON to CSV or any other RDBMS-friendly format comes close to (if it does not outright enter) the field of Object-Relational Mapping, which in general is neither simple nor something that can be easily automated.
The MongoDB tools, for example, allow you to create CSV files, but you have to specify which field will be in each column, implicitly assuming that there is in fact some kind of schema in your data.
MongoDB allows you to store and manipulate structured JSON data without having to go through a cumbersome mapping process than can be very frustrating. You would have to modify your way of thinking, moving a bit away from the conventional tabular view of databases, but it allows you to work on the data as it is intended to be worked on, rather than try to force the tabular model on it.
JSON (like XML) is a tree rather than a literal table of elements. You will need to populate the table by hand (essentially doing a stack of SQL LEFT JOINs) or populate a bunch of tables and manipulate the joins by hand.
Or is the JSON flat packed? It MAY be possible to do what you're asking, I'm just pointing out that there's no guarantee.
If it's a quick kludge and the data is flat packed, then a quick script to read the JSON, dump it to CSV, and then open it in Excel will probably be easiest.
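Something like this Python sketch would do it, assuming the JSON really is a flat array of objects (file names are made up):

# Quick-and-dirty sketch: flatten a JSON array of objects into a CSV that
# Excel or Access can open. Only works if the JSON is already "flat packed".
import csv
import json

with open("data.json", encoding="utf-8") as f:
    rows = json.load(f)                    # expects [ {...}, {...}, ... ]

fieldnames = sorted({key for row in rows for key in row})

with open("data.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)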
Storing it in Access or Excel cannot be done easily, I guess. You would essentially have to parse the JSON string with any programming language that supports it (PHP, Node.js, Python, ... all have native support) and then use a library to output an Excel sheet with the data.
Something else that could be an option depending on how versed you are with programming languages is to use something like the ElasticSearch search engine or the CouchDB database that both support json input natively. You could then use them to query the content in various ways.
I've kinda done that before: turn JSON into an HTML table, which means you can also turn it into CSV.
However, here are some things you need to know:
1) The JSON data must be well formatted into a predefined structure, e.g.
[
  ['col1', 'col2', 'col3'],
  [data11, data12, data13],
  ...
]
2) You have to parse the data row by row and column by column, and you have to take care of missing data or mismatched columns, if possible. Of course, you have to be aware of data types.
3) In my experience, if you have ridiculously large data, then doing that will kill the client's browser. You have to progressively get formatted HTML or CSV data from the server.
As suggested by nightcracker above, try the Google tool. :)