Handling Incredibly large JSON Document in CouchDB - json

I'm new to NoSql databases and I'm having a hard time figuring how to handle a very large JSON Document that could amount to over 20MB on my local drive. This structure will definitely increase over time and I worry about the speed of queries and having to search deep though the returned JSON object nest just to get a string out. My JSON is deeply nested like so for example.
{
"exams": {
"exam1": {
"year": {
"math": {
"questions": [
{
"question_text": "first question",
"options": [
"option1",
"option2",
"option3",
"option4",
"option5"
],
"answer": 1,
"explaination": "explain the answer"
},
{
"question_text": "second question",
"options": [
"option1",
"option2",
"option3",
"option4",
"option5"
],
"answer": 1,
"explaination": "explain the answer"
},
{
"question_text": "third question",
"options": [
"option1",
"option2",
"option3",
"option4",
"option5"
],
"answer": 1,
"explaination": "explain the answer"
}
]
},
"english": {same structure as above}
},
"1961": {}
},
"exam2": {},
"exam3": {},
"exam4": {}
}
}
In the main application, question objects are created and appended based on type of exam, year, and subject making the JSON document huge over time. How can I re-model this so as to avoid slow queries in the future?

Dominic is right. You need to start dividing the documents and storing them as separate documents.
The next question is how to recompose the document after it's been split.
Considering you're using Couch, I would recommend doing this at the application layer. A good starting point would be to create exam documents and store them in their own database. Then have a document (exams) in another database that has pointers to the exam documents.
You can retrieve the exams document and get exams one by one as needed. This could be especially useful with paging since most people will only want to see the most recent exams.

Related

JSON Database doesn't work correctly as per REST technique

I created a JSON server and this is the data that I'm using. However, when I'm trying to query the examlist and relate it to the students (i'd like to receive the students based on their ID (the picture below shows the REST query - I'm using ?_expand=student )) it won't display them. The code shows correct as per JSON validators, but my goal is to have them working.
The way my data is organized (the examlist table) won't display the students, because apparently, it cannot read their IDs. This database will be used for HTTP requests, hence I need it fully working.
I'll upload another image so that you can visualize my code.
Momentarily instead of my studentIDs, it's showing some random 0,1 numbers and the student IDs are being pushed down along the arbitrary tree.
(Just the examlist "table")
It's M:M relationship (relational database) and how I want it structured is:
Table "students" that contains information about the students;
I have "table" exams that contains information about the exams;
And then I have another "table" examlist which contains information about the EXAM (ExamID) and the students enrolled in it (basically relates the two abovementioned tables)
When I try querying the students through the "examlist" table, it won't work. However, the other "table" -- exam, does work.
My assumption is the way I have organized the students in the examlist "table" is not good, however, given my very little experience I cannot seem to see where the issue is.
I hope I cleared it out for the most of you! Sorry for any inconvenience caused.
{
"students": [
{
"id": 3021,
"nume": "Ionut",
"prenume": "Grigorescu",
"an": 3,
"departament": "IE"
},
{
"id": 3061,
"nume": "Nadina",
"prenume": "Pop",
"an":3,
"departament": "MG"
},
{
"id": 3051,
"nume": "Ionut",
"prenume": "Casca",
"an": 3,
"departament": "IE"
}
],
"exams": [
{
"id": 1,
"subiect": "Web Semantic",
"profesor": {
"Nume": "Robert Buchman"
}
},
{
"id": 2,
"subiect": "Programare Web",
"profesor": {
"Nume": "Mario Cretu"
}
},
{
"id": 3,
"subiect": "Medii de Programare",
"profesor": {
"Nume": "Valentin Stinga"
}
}
],
"listaexamene": [
{
"examId":1,
"Data Examen":"02/06/2022 12:00",
"studentId":
[
{
"id":3021
},
{
"id":3051
}
]
},
{
"examId":2,
"Data Examen":"27/05/2022 10:00",
"studentId":
[
{
"id":3021
},
{
"id":3051
}
]
},
{
"examId":1,
"Data Examen":"04/06/2022 10:00",
"studentId":
[
{
"id":3021
},
{
"id":3051
},
{
"id":3061
}
]
}
]
}
I had to repost with more information after my first one got closed down
I think I finally got the answer. The problem lays in the JSON server. Apparently, it cannot obtain information from further down the arbitrary tree, only the first layer.
Thank you all for your input on the previous post!

replace "key" name in whole JSON python for bulk data in efficient way

Actually i am pushing data to other system but before pushing i have to change the "key" in the whole JSON. JSON may contain 200 or 10000 or 250000 data.
sample JSON:
{
"insert": "table",
"contacts": [
{
"testName": "testname",
"ContactID": 212121
},
{
"testName": "testname",
"ContactID": 2146354564
},
{
"testName": "testname",
"ContactID": 12312
},
{
"testName": "testname",
"ContactID": 211221
},
{
"testName": "testname",
"ContactID": 10218550
}
]
}
I need to change contacts array Keys. These contacts may be in bulk. So i need to work with this efficiently with minimal complexity.
The above JSON to be converted as below
{
"insert": "table",
"contacts": [
{
"name": "testname",
"phone": 212121
},
{
"name": "testname",
"phone": 2146354564
},
{
"name": "testname",
"phone": 12312
},
{
"name": "testname",
"phone": 211221
},
{
"name": "testname",
"phone": 10218550
}
]
}
here is my code trying by loop
ini_dict = request.data
contact_data = ini_dict['contacts']
for i in contact_data:
i['name'] = i.pop('testName')
print(contact_data)
Please suggest me how can i change the key names efficiently for bulk data. i mean for 50000 lists in contacts. "for loop" will be leading a performance issue. So please let me know the efficient way to achieve this
I dont know how fast you need it to be nor how you are choosing to store your json. One simple solution is just store it as a string and then replace all the instances of your attributes.
# Something like this using a jsonstring
jsonstring.replace("'testName':", "'name':")
jsonstring.replace("'ContactId':", "'phone':")
If you want to do this in bulk you, may need to create some batch process to be able to fetch multiple existing records and make changes at once. I have done this before with the java equivalent of https://pypi.org/project/JayDeBeApi/ but, that was more for modifying existing records in a database.

How to query deep nested json array from couchbase?

How to query deep nested json array from couchbase? I have the following documents in the couchbase bucket. I need to query to list all apps who have Permissions "android.permission.BATTERY_STATS"
How to query to list all apps with permissions from nested json array?
My Json Documents,
Document:1
{
"data": {
"com.facebook.katana": {
"studioId": "Facebook",
"screenshotUrls": [
"https://lh3.googleusercontent.com/JcPdPqplBxgG6dEQuxvuhO4jvE64AzxOCGWe8w55dMMeXU4rZs2MwpfGQTWvv6QR-g",
"https://lh3.googleusercontent.com/w0kSYY7jlPjGDd3KEVgDTpzUf4k67G7rfELOf6qj1SSC7n6Ege44vp8QkeX57ZM6bFU"
],
"primaryCategoryName": "Social",
"studioName": "Facebook",
"description": "Keeping up with friends is faster and easier than ever. Share updates and photos, engage with friends and Pages, and stay connected to communities important to you"
"starRatings": {
"1": 9706642,
"2": 3384344,
"3": 7224416,
"4": 12323358,
"5": 49438051
},
"numDownloads": "1,000,000,000+ downloads",
"price": 0,
"permissions": [
"android.permission.ACCESS_COARSE_LOCATION",
"android.permission.ACCESS_FINE_LOCATION",
"android.permission.ACCESS_NETWORK_STATE",
"android.permission.ACCESS_WIFI_STATE",
"android.permission.AUTHENTICATE_ACCOUNTS",
"android.permission.BATTERY_STATS",
"android.permission.BLUETOOTH",
"android.permission.READ_PHONE_STATE",
"android.permission.READ_PROFILE",
"android.permission.READ_SMS",
"android.permission.READ_SYNC_SETTINGS",
"com.nokia.pushnotifications.permission.RECEIVE",
"com.sec.android.provider.badge.permission.READ",
"com.sec.android.provider.badge.permission.WRITE",
"com.sonyericsson.home.permission.BROADCAST_BADGE"
],
"appId": "com.facebook.katana",
"userRatingCount": 82076811,
"currency": "USD",
"iconUrl": "https://lh3.googleusercontent.com/ZZPdzvlpK9r_Df9C3M7j1rNRi7hhHRvPhlklJ3lfi5jk86Jd1s0Y5wcQ1QgbVaAP5Q=w100",
"releaseDate": "Nov 14, 2018",
"appName": "Facebook",
"studioUrl": "https://www.facebook.com/facebook",
"hasInAppPurchases": 1,
"bundleId": "com.facebook.katana",
"version": "198.0.0.53.101",
"commentCount": 22211936,
"fileSizeBytes": 58044343,
"formattedPrice": "",
"categoryIds": [
"APPLICATION",
"SOCIAL"
],
"tagline": "Find friends, watch live videos, play games & save photos in your social network",
"averageUserRating": 4.0770621299744,
"primaryCategoryId": "SOCIAL",
"videoScreenUrl": "https://lh4.ggpht.com/3RG_Y8JPK0Hcyui9OcapiONP_aDWKTRZ50wqZW_wbyOF0FamAYEYZfMTW9Cs1OT1kA"
}
},
"response_msec": 11,
"status": 200
}
Document:2
{
"data": {
"com.whatsapp": {
"studioId": "WhatsApp Inc.",
"screenshotUrls": [
"https://lh3.googleusercontent.com/MMue08byixTw74ST_VkNQDUUJBgVEbjNHDYLhIuHmYhMIMJIp3KjVlnhhqZQOZUtNt8",
"https://lh3.googleusercontent.com/foFmwvVGIwWWXJIukN7png18lFjFgbw3K7BqIm8G-jsFgSTVtkCa-dDkFApUzbvzIvbe"
],
"primaryCategoryName": "Communication",
"studioName": "WhatsApp Inc.",
"description": "WhatsApp Messenger is a FREE messaging app available for Android and other smartphones.
"starRatings": {
"1": 4713598,
"2": 1917919,
"3": 4962745,
"4": 11307648,
"5": 55392894
},
"numDownloads": "1,000,000,000+ downloads",
"price": 0,
"permissions": [
"android.permission.ACCESS_COARSE_LOCATION",
"android.permission.ACCESS_FINE_LOCATION",
"android.permission.ACCESS_NETWORK_STATE",
"android.permission.ACCESS_WIFI_STATE",
"android.permission.AUTHENTICATE_ACCOUNTS",
"android.permission.BLUETOOTH",
"android.permission.BROADCAST_STICKY",
"android.permission.CAMERA",
"android.permission.CHANGE_WIFI_STATE",
"android.permission.GET_ACCOUNTS",
"android.permission.GET_TASKS",
"android.permission.INSTALL_SHORTCUT",
"android.permission.INTERNET",
"android.permission.MANAGE_ACCOUNTS",
"com.whatsapp.permission.REGISTRATION",
"com.whatsapp.permission.VOIP_CALL",
"com.whatsapp.sticker.READ"
],
"appId": "com.whatsapp",
"userRatingCount": 78294804,
"currency": "USD",
"iconUrl": "https://lh6.ggpht.com/mp86vbELnqLi2FzvhiKdPX31_oiTRLNyeK8x4IIrbF5eD1D5RdnVwjQP0hwMNR_JdA=w100",
"releaseDate": "Nov 5, 2018",
"appName": "WhatsApp Messenger",
"studioUrl": "http://www.whatsapp.com/",
"bundleId": "com.whatsapp",
"version": "2.18.341",
"commentCount": 19763316,
"fileSizeBytes": 23857699,
"formattedPrice": "",
"categoryIds": [
"APPLICATION",
"COMMUNICATION"
],
"tagline": "Simple. Personal. Secure.",
"averageUserRating": 4.4145045280457,
"primaryCategoryId": "COMMUNICATION",
"videoScreenUrl": "https://lh3.ggpht.com/aZrXAunkovhf0630Ykz1A7h2rzFX_dErd6fRiB7fNKU_DkNtetTquEra1bjc3sR2kLs"
}
},
"response_msec": 15,
"status": 200
}
As I say in the comment, this is a tricky one. I'm going to try to simplify your docs first, and then give an answer that I came up with.
You have two docs, which contain a nested object with a permissions array. Each nested object has a (potentially) different name. So, let's assume we have two simple docs like this:
id: doc1
{
"foo": {
"permissions": [
"android.permission.ACCESS_COARSE_LOCATION",
"android.permission.BATTERY_STATS"
]
}
}
id: doc2
{
"bar": {
"permissions": [
"android.permission.ACCESS_FINE_LOCATION"
]
}
}
The first one has a "foo" nested object, the second has a "bar" nested object, but both nested objects have a "permissions" array. You want to find all the documents that have a permission of "android.permission.BATTERY_STATS".
I checked out the N1QL docs for anything that might be helpful, and I especially checked out the Object Functions section. There's a function called OBJECT_UNWRAP that might do the trick. From the docs: "This function enables you to unwrap an object without knowing the name in the name-value pair."
So, if I simply unwrap the above documents, then I can basically discard the "foo" and the "bar" parts.
SELECT META(b).id, OBJECT_UNWRAP(b).permissions
FROM sstbucket b
You can put unwrap a deeper nested object if necessary, but I'm trying to keep this simple.
The results of that query would be:
[
{
"id": "doc1",
"permissions": [
"android.permission.ACCESS_COARSE_LOCATION",
"android.permission.BATTERY_STATS"
]
},
{
"id": "doc2",
"permissions": [
"android.permission.ACCESS_FINE_LOCATION"
]
}
]
And now, it's a simple ANY/SATISFIES statement to find the document:
SELECT META(b).id
FROM sstbucket b
WHERE ANY p IN OBJECT_UNWRAP(b).permissions SATISFIES p == 'android.permission.BATTERY_STATS' END;
Which would return
[
{
"id": "doc1"
}
]
So, that works. What I don't know for sure is how to create a proper index for this particular query. I created a primary index just to make it work (CREATE PRIMARY INDEX ON sstbucket), but that's not going to perform very well.
You can use OBJECT functions (https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/objectfun.html) and Array indexing.
If you need document ID or whole document.
CREATE INDEX ix1 ON default ( DISTINCT ARRAY (DISTINCT ARRAY permision
FOR permision IN app.permissions END)
FOR app IN OBJECT_VALUES(data) END);
SELECT META(d).id FROM default AS d
WHERE ANY app IN OBJECT_VALUES(d.data)
SATISFIES (ANY permision IN app.permissions
SATISFIES permision = "android.permission.BATTERY_STATS"
END)
END;
If you need only appId and see if it uses covering index.
CREATE INDEX ix2 ON default ( ALL ARRAY (ALL ARRAY [permision, app.appId]
FOR permision IN app.permissions END)
FOR app IN OBJECT_VALUES(data) END);
SELECT [permision, app.appId][1] AS appId FROM default AS d
UNNEST OBJECT_VALUES(d.data) AS app
UNNEST app.permissions AS permision
WWHERE [permision, app.appId] >= ["android.permission.BATTERY_STATS"] AND
[permision, app.appId] < [SUCCESSOR("android.permission.BATTERY_STATS")] ;

Grouping nested document

I'm new to elasticsearch querying and probably this question is not so smart, but I would appreciate any help. Is there any idea how I can query only events in the following sample json user session (field "l"):
{
"dvs_t": 103492673,
"l": [
{
"e": "SessionInfo",
"p": {
"Device": "samsung GT-P6800",
"SessionNumber": "36"
},
"ts": 103279627
},
{
"e": "InAppPurchaseCompleted",
"p": {
"ItemID": "sdbundle_stars_10",
"TimePlayed_Total": "3 - 3.25 Hours"
},
"ts": 103318595
}
],
"osv": "4.1.2",
"request": "ANME",
"srv_ver": "0.2"
}
For instance, can I somehow
count the number of InAppPurchaseCompleted events in the session?
count the number of InAppPurchaseCompleted events in the sessions grouped by session parameter request or any other parameter?
Filter in Elasticsearch allow you to search (really quickly) on a specific set of data. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-bool-filter.html.
If you only want to return count you can use search type count http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#count

denormalizing JSON for mongoDB

I think that's the word I'm looking for. I'm trying to get parent info into each of the cards. I think that's what I need to do, but chime in if you have any other ideas.
{
"LEA": {
"name": "Limited Edition Alpha",
"code": "LEA",
"releaseDate": "1993-08-05",
"border": "black",
"type": "core",
"cards": [
{"name": "Air Elemental"},
{"name": "Earth Elemental"},
{"name": "Fire Elemental"},
{"name": "Water Elemental"}
]
},
"LEB": {
"name": "Limited Edition Beta",
"code": "LEB",
"releaseDate": "1993-10-01",
"border": "black",
"type": "core",
"cards": [
{"name": "Armageddon"},
{"name": "Fireball"},
{"name": "Swords to Plowshares"},
{"name": "Wrath of God"}
]
}
}
This is a tiny subset of the data, obviously. LEA and LEB are sets of cards, and inside each set there are a bunch of cards. I'm thinking of denormalizing this into just the cards, with the set info added to each card. Something like this...
{
{
"name": "Air Elemental",
"set": {
"name": "Limited Edition Alpha",
"code": "LEA",
"releaseDate": "1993-08-05",
"border": "black",
"type": "core"
}
},
{
"name": "Earth Elemental",
"set": {
"name": "Limited Edition Alpha",
"code": "LEA",
"releaseDate": "1993-08-05",
"border": "black",
"type": "core"
}
},
{
"name": "Armageddon",
"set": {
"name": "Limited Edition Beta",
"code": "LEB",
"releaseDate": "1993-10-01",
"border": "black",
"type": "core"
}
},
{
"name": "Fireball",
"set": {
"name": "Limited Edition Beta",
"code": "LEB",
"releaseDate": "1993-10-01",
"border": "black",
"type": "core"
}
}
}
Is my thinking right, first and foremost? Would I want a giant collection of cards and have the set information flattened into each card? In SQL, I'd do a table for the sets, and and the cards would belong_to a set. I'm trying to wrap my head around 'document thinking'.
Second, if my thinking is correct, any ideas on how I could achieve this denormalizing?
Here you go =).
OK here is where I would start. Since we've said that cards will never change (since they're based on physical MTG cards), create one collection with all of your cards in it, this will be used for easily populating a user's deck later on. You can search on it by card name or some sort of card ID (like a physical one, stored on the card).
For the user's array of card objects, you shouldn't just store the _id field for a card, because that forces you to join. Since cards will never change, completely denormalize them and just shove them in that card array, so a user object, so far, resembles:
{
name: "Tom Hanks",
skill_level: 0,
decks: [
[
{
card_name: "Balance",
card_description: "LONG_BLOCK_OF_DESCRIP_TEXT",
card_creator: "Sugargirl14",
type: "Normal",
_id: $SOME_MONGO_ID_HERE,
... rest of card data...
}, {
...card 2 complete data...
}
],
[
{ ...another deck here... }
]
]
}
OK, back to set info, I will also assume set info is a constant (based on your SO post, I can't see how it would physically change). So, if that set info is always relevant to the card, I would denormalize and include it, changing our card object to:
{
card_name: "Balance",
card_description: "LONG_BLOCK_OF_DESCRIP_TEXT",
card_creator: "Sugargirl14",
type: "Normal",
_id: $SOME_MONGO_ID_HERE
set: {
"name": "Limited Edition Alpha",
"code": "LEA",
"releaseDate": "1993-08-05",
"border": "black",
"type": "core",
"_id": $SOME_MONGO_ID_HERE
},
... rest of card data...
}
I imagine that storing the other cards in the denormalized object for a given card isn't relevant, if it is, add them. If you'll note, the key that is given in your SO example is dropped, since it seems to always == the "code" field.
OK, now to properly answer your SO question about whether you should embed sets in cards, or vice versa. First off, both collections are relevant. So, even if we embed sets into cards, you'll want those sets in a collection so they can be fetched later and inserted into new cards.
Which gets embedded in which is really determined by business logic, how the data is used and which gets pulled more often. Are you frequently displaying sets and pulling cards from them (like for users to search)? You could embed all of the card data, or any relevant data, in each set's cards array. But with the above data model, each card stores its set ID in its set object. I assume cards belong to only one set, so to get all cards for a set you can query over your card collection where set.id == the Mongo ID of the set you want. Now sets need minimal updates, due to business logic, (hopefully none at all) and your queries are still fast (and you get complete card objects). I'd, honestly, do that latter one and keep my sets clean of cards. As such, a card owns the set it belongs to as opposed to a set owning a card. That's a more SQLy way to think that actually can work fine in Mongo (you'll never join).
So our final data model resembles:
Collection 1, Set:
//data model
{
"name": "Limited Edition Alpha",
"code": "LEA",
"releaseDate": "1993-08-05",
"border": "black",
"type": "core",
"_id": $SOME_MONGO_ID_HERE
}
Collection 2, cards:
//data model
{
_id: $SOME_MONGO_ID_HERE
card_name: "Balance",
card_description: "LONG_BLOCK_OF_DESCRIP_TEXT",
card_creator: "Sugargirl14",
type: "Normal",
set: {
"name": "Limited Edition Alpha",
"code": "LEA",
"releaseDate": "1993-08-05",
"border": "black",
"type": "core",
"_id": $SOME_MONGO_ID_HERE
... rest of card data...
},
}
Collection 3, users:
{
_id: $SOME_MONGO_ID_HERE,
name: "Tom Hanks",
skill_level: 0,
decks: [
[
{
card_name: "Balance",
card_description: "LONG_BLOCK_OF_DESCRIP_TEXT",
card_creator: "Sugargirl14",
type: "Normal",
_id: $SOME_MONGO_ID_HERE,
set: {
"name": "Limited Edition Alpha",
"code": "LEA",
"releaseDate": "1993-08-05",
"border": "black",
"type": "core",
"_id": $SOME_MONGO_ID_HERE
},
}, {
...card 2 complete data...
}
],
[
{ ...another deck here... }
]
]
}
This, obviously, assumes set data for each card is relevant to the user. Now your data is denormalized, sets and cards rarely need updates (according to business logic), so you'll never need cascading updates or deletes. Manipulating users is easy. When you remove a card from a user's deck you can do a $pull from Mongo (I think that's what it's called) on the relevant decks array where a contained item's _id field == the Mongo ID of the card you want to remove. All other updates are easier.
In retrospect, you might want to make the user's decks like so:
decks: {
"SOME_ID_HERE": [
{ ...card 1... },
{ ...card 2... }
]
}
This makes identifying the decks MUCH easier and will make your pulls easier (you'll have more data on the frontend and the pull query will be more precise). It can be a number, random string, anything really, since it gets passed back to the frontend. Or just use their Mongo ID, when looking at a deck, a user will have it's Mongo ID. Then when they pull a card out of it, or add one in, you have a direct identifier to easily grab the deck needed.
Obviously all values with text like: $MONGO_ID_HERE should really be MongoId() objects.
Whew, that was intense, 6800 characters. Hope it makes sense to you and I apologize if any verbiage is confusing or if any of my JSON objects' formatting is fucked up (just let me know if any prose is confusing, I'll reword). Does this make sense/solve your problem?