Reading huge json files - json

I have a huge (1.2 GB) JSON file that looks like this, except that it is written as a single line:
[
{ data: [{ (json11) }, { (json12) }, { (json13) }, ... },
{ data: [{ (json21) }, { (json22) }, { (json23) }, ... },
{ data: [{ (json31) }, { (json32) }, { (json33) }, ... },
...
{ data: [{ (jsonN1) }, { (jsonN2) }, { (jsonN3) }, ... }
]
It is too big to be opened in a text editor, and I couldn't read it via fs.readFile in Node.js (I got this error:
buffer.js:194
this.parent = new SlowBuffer(this.length);
^
RangeError: length > kMaxLength
)
I'd like to read it and write all the JSON items to a separate file, one item per line:
{ (json11) },
{ (json12) },
{ (json13) },
...
{ (json21) },
{ (json22) },
....
What is the easiest way to do this?
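One possible approach is to stream the file and decode one top-level array element at a time, so the whole document never has to sit in memory. The sketch below is in Python rather than Node.js and uses only the standard library. It assumes the real file is valid JSON (quoted keys), that every top-level element is an object with a "data" array, and the helper names iter_array_items and split_to_lines are made up for illustration. Each top-level element still has to fit in memory, and an element that spans many chunks is re-parsed on every retry, so this is a starting point rather than a tuned solution.

```python
import io
import json


def iter_array_items(stream, chunk_size=1 << 16):
    """Yield the elements of a top-level JSON array one at a time.

    Assumes the elements are objects or arrays, so a partially read
    element can never be mistaken for a complete one.
    """
    decoder = json.JSONDecoder()
    buf = stream.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        buf = buf.lstrip(" \t\r\n,")  # skip separators between elements
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # Element is incomplete: read more text and retry.
            chunk = stream.read(chunk_size)
            if not chunk:
                raise  # input is truncated or not valid JSON
            buf += chunk
            continue
        yield item
        buf = buf[end:]


def split_to_lines(src, dst, chunk_size=1 << 16):
    """Write every item of every "data" array to dst, one JSON item per line."""
    for group in iter_array_items(src, chunk_size):
        for item in group["data"]:
            dst.write(json.dumps(item) + "\n")


# Tiny in-memory demo with a deliberately small chunk size.
demo_src = io.StringIO('[{"data":[{"a":1},{"a":2}]},{"data":[{"b":3}]}]')
demo_dst = io.StringIO()
split_to_lines(demo_src, demo_dst, chunk_size=8)
```

For a real file, the two StringIO objects would be replaced by files opened in text mode, for example open("huge.json", encoding="utf-8") for reading and a second file opened with mode "w" (both file names are placeholders).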

Related

How to do custom window function on JSON object with pandas?

I have a rather nested JSON object below, and I am trying to find the user (i.e. 'profileId') with the most events (i.e. the length of the 'parameters' key).
I have the code below to get the length of the parameters, but I now need that calculation to be correct for each record; the way I have it set up now assigns the same value to every record. I looked into pandas window functions https://pandas.pydata.org/docs/user_guide/window.html but am having trouble getting to the correct outcome.
response = response.json()
df = pd.json_normalize(response['items'])
df['calcfield'] = len(df["events"].iloc[0][0].get('parameters'))
the output of df['arrayfield'] is below:
[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"admin"
},
{
"name":"method_name",
"value":"directory.users.list"
},
{
"name":"client_id",
"value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"7158"
},
{
"name":"product_bucket",
"value":"GSUITE_ADMIN"
},
{
"name":"app_name",
"value":"Untitled project"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
] }, {
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:58:48.914Z",
"uniqueQualifier":"-4002873813067783265",
"applicationName":"token",
"customerId":"C02f6wppb"
},
"etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
"actor":{
"email":"nancy.admin#hyenacapital.net",
"profileId":"100230688039070881323"
},
"ipAddress":"54.80.168.30",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"gmail"
},
{
"name":"method_name",
"value":"gmail.users.messages.list"
},
{
"name":"client_id",
"value":"927538837578.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"2"
},
{
"name":"product_bucket",
"value":"GMAIL"
},
{
"name":"app_name",
"value":"Zapier"
},
{
"name":"client_type",
"value":"WEB"
}
]
ORIGINAL JSON BLOB I READ IN
{
"kind":"admin#reports#activities",
"etag":"\"5g8\"",
"nextPageToken":"A:1651795128914034:-4002873813067783265:151219070090:C02f6wppb",
"items":[
{
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:59:39.421Z",
"uniqueQualifier":"5526793068617678141",
"applicationName":"token",
"customerId":"cds"
},
"etag":"\"jkYcURYoi8\"",
"actor":{
"email":"blah#blah.net",
"profileId":"1323"
},
"ipAddress":"107.178.193.87",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"admin"
},
{
"name":"method_name",
"value":"directory.users.list"
},
{
"name":"client_id",
"value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"7158"
},
{
"name":"product_bucket",
"value":"GSUITE_ADMIN"
},
{
"name":"app_name",
"value":"Untitled project"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
]
},
{
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:58:48.914Z",
"uniqueQualifier":"-4002873813067783265",
"applicationName":"token",
"customerId":"df"
},
"etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
"actor":{
"email":"blah.blah#bebe.net",
"profileId":"1324"
},
"ipAddress":"54.80.168.30",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"gmail"
},
{
"name":"method_name",
"value":"gmail.users.messages.list"
},
{
"name":"client_id",
"value":"927538837578.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"2"
},
{
"name":"product_bucket",
"value":"GMAIL"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
]
}
]
}
Use:
df.groupby('actor.profileId')['events'].apply(lambda x: [len(x.iloc[i][0]['parameters']) for i in range(len(x))])
which returns, for each profileId, a list of parameter counts (one per record). Output for the sample data:
actor.profileId
1323 [7]
1324 [7]
Name: events, dtype: object
It's not entirely clear what you are asking, and df['arrayfield'] isn't in the example you provided. However, if you look at the events column after json_normalize, you can use the following line to pull out the length of each parameters key. The blob you gave as an example was set to response...
df = pd.json_normalize(response['items'])
df['calcfield'] = df['events'].str[0].str.get('parameters').str.len()
Because each parameters key has 7 elements, it's tough to say whether this is what you really want.
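As a cross-check on the pandas results, the same count can be computed without pandas. The sketch below is plain Python under a few assumptions: items has the shape of response['items'] from the question, the function name parameters_per_profile is invented here, and the demo records are trimmed stand-ins rather than the real API response.

```python
from collections import Counter


def parameters_per_profile(items):
    """Sum the number of 'parameters' entries per actor.profileId."""
    totals = Counter()
    for item in items:
        profile_id = item["actor"]["profileId"]
        for event in item.get("events", []):
            totals[profile_id] += len(event.get("parameters", []))
    return totals


# Trimmed stand-in records (hypothetical data, same nesting as the question).
items = [
    {"actor": {"profileId": "1323"},
     "events": [{"parameters": [{"name": "api_name"}, {"name": "method_name"}]}]},
    {"actor": {"profileId": "1324"},
     "events": [{"parameters": [{"name": "api_name"}]}]},
    {"actor": {"profileId": "1324"},
     "events": [{"parameters": [{"name": "client_id"}, {"name": "app_name"}]}]},
]

totals = parameters_per_profile(items)
top_profile, top_count = totals.most_common(1)[0]
```

Unlike the per-record lists above, this adds up the counts across all records and events of a profile, which may or may not be the definition of "most events" that is wanted.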

Removing a specific attribute in an array of nested documents

Excuse my English, I'm from Russia.
I asked this question on the Russian version of SO, but it still hasn't been answered.
There is a record collection that stores archival files. Here is its simplified structure (I omitted most of the attributes):
{
"_id": 1,
"tomes": [
{
"number":1,
"archive_number":1
},
{
"number":2,
"archive_number":1
}
]
}
{
"_id": 2,
"tomes": [
{
"number":1,
"archive_number":1
},
{
"number":2,
"archive_number":1
},
{
"number":3,
"archive_number":1
}
]
}
I need to remove the archive_number attribute from each of the nested documents of the tomes array for all documents in the record collection.
After deletion, the structure should look like this:
{
"_id": 1,
"tomes": [
{
"number":1,
},
{
"number":2,
}
]
}
{
"_id": 2,
"tomes": [
{
"number":1,
},
{
"number":2,
},
{
"number":3,
}
]
}
I was able to write a query like this:
db.record.update(
{
"tomes": {
$elemMatch:{
"archive_number":{$exists:true}
}
}
},
{
$unset: {
"tomes.$.archive_number":1
}
},
false, true
)
But this query only removes the archive_number attribute from one volume per archive case. I.e., after running it, we see the following:
{
"_id": 1,
"tomes": [
{
"number":1,
},
{
"number":2,
"archive_number":1
}
]
}
{
"_id": 2,
"tomes": [
{
"number":1,
},
{
"number":2,
"archive_number":1
},
{
"number":3,
"archive_number":1
}
]
}
Can you please tell me how to remove the attribute from all volumes? I don't know how to fix the query, and I'm stuck.
Solution 1
With $[<identifier>] (the filtered positional operator) and arrayFilters to update the matching element(s) in the array.
db.collection.update({
"tomes": {
$elemMatch: {
"archive_number": {
$exists: true
}
}
}
},
{
$unset: {
"tomes.$[tome].archive_number": 1
}
},
{
arrayFilters: [
{
"tome.archive_number": {
$exists: true
}
}
],
multi: true
})
Sample Mongo Playground (Solution 1)
Solution 2
With $[] (all positional operator).
The all positional operator $[] indicates that the update operator should modify all elements in the specified array field.
db.collection.update({
"tomes": {
$elemMatch: {
"archive_number": {
$exists: true
}
}
}
},
{
$unset: {
"tomes.$[].archive_number": 1
}
},
{
multi: true
})
Sample Mongo Playground (Solution 2)
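To make the difference between the positional operators concrete, here is a small emulation on plain Python dictionaries. It is not MongoDB code and does not talk to a database; the function names are invented for illustration. It only mimics the observable effect: "tomes.$.archive_number" touches the first matching array element, while "tomes.$[].archive_number" touches every element.

```python
import copy


def unset_first_match(doc, array_field, attr):
    """Emulate $unset with the positional `$` operator: first matching element only."""
    for elem in doc[array_field]:
        if attr in elem:
            del elem[attr]
            break
    return doc


def unset_all(doc, array_field, attr):
    """Emulate $unset with the all positional `$[]` operator: every element."""
    for elem in doc[array_field]:
        elem.pop(attr, None)
    return doc


record = {
    "_id": 2,
    "tomes": [
        {"number": 1, "archive_number": 1},
        {"number": 2, "archive_number": 1},
        {"number": 3, "archive_number": 1},
    ],
}

# Work on copies so both behaviours can be compared against the same input.
after_positional = unset_first_match(copy.deepcopy(record), "tomes", "archive_number")
after_all = unset_all(copy.deepcopy(record), "tomes", "archive_number")
```

after_positional reproduces the partially updated document from the question, and after_all matches the desired result.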
References
How the arrayFilters Parameter Works in MongoDB

Type of "freeplay" (string) is not supported

I have a json file which looks like this
{
"language":[
{
"lang":"English"
},
{
"lang":"Polish"
},
{
"lang":"German"
},
{
"lang":"Swedish"
},
{
"lang":"Dutch"
},
{
"lang":"Finnish"
},
{
"lang":"Turkish"
}
],
"currency":[
{
"curr" : "dollar"
},
{
"curr" : "pound"
},
{
"curr" : "rupees"
},
{
"curr" : "euro"
},
{
"curr" : "euro"
}
],
"gamename":[
{
"gname":"poker"
},
{
"gname":"slot"
}
],
"freeplay": "false"
}
I installed json-server-init globally and then ran the watch command, which threw the following error:
Type of "freeplay" (string) in linkto.json is not supported. Use
objects or arrays of objects.
Can someone help me understand what is wrong, or what I did wrong?
From my understanding of json-server, the value of each key must be a valid JSON object, which is not the case for a simple string.
For example, change the value (contents of other keys omitted) to:
{
"language":[
...
],
"currency":[
...
],
"gamename":[
...
],
"freeplay": {
"enabled": "false"
}
}
if you'd like the request to:
http://localhost:3000/freeplay
to return:
{
"enabled": "false"
}
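To spot such keys before starting the server, a quick check along these lines may help. It is a Python sketch based only on the wording of the error message ("Use objects or arrays of objects"), not on json-server's actual validation code, and the function name unsupported_keys is made up.

```python
import json


def unsupported_keys(db):
    """Return top-level keys whose value is neither an object nor a list of objects."""
    bad = []
    for key, value in db.items():
        is_object = isinstance(value, dict)
        is_object_list = isinstance(value, list) and all(
            isinstance(entry, dict) for entry in value
        )
        if not (is_object or is_object_list):
            bad.append(key)
    return bad


# Shortened version of the file from the question.
db = json.loads("""
{
  "language": [{"lang": "English"}, {"lang": "Polish"}],
  "gamename": [{"gname": "poker"}],
  "freeplay": "false"
}
""")

problem_keys = unsupported_keys(db)
```

Here problem_keys contains only "freeplay"; after wrapping the value in an object as shown above, the list comes back empty.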

Using JSON API Serializer to create more complicated JSON

The examples here don't go nearly far enough in explaining how to produce a more complicated structure...
If I want to end up with something like:
{
"data": {
"type": "mobile_screens",
"id": "1",
"attributes": {
"title": "Watch"
},
"relationships": {
"mobile_screen_components": {
"data": [
{
"id": "1_1",
"type": "mobile_screen_components"
},
{
"id": "1_2",
"type": "mobile_screen_components"
},
...
]
}
}
},
"included": [
{
"id": "1_1",
"type": "mobile_screen_components",
"attributes": {
"title": "Featured Playlist",
"display_type": "shelf"
},
"relationships": {
"playlist": {
"data": {
"id": "938973798001",
"type": "playlists"
}
}
}
},
{
"id": "938973798001",
"type": "playlists",
"relationships": {
"videos": {
"data": [
{
"id": "5536725488001",
"type": "videos"
},
{
"id": "5535943875001",
"type": "videos"
}
]
}
}
},
{
"id": "5536725488001",
"type": "videos",
"attributes": {
"duration": 78321,
"live_stream": false,
"thumbnail": {
"width": 1280,
"url":
"http://xxx.jpg?pubId=694940094001",
"height": 720
},
"last_published_date": "2017-08-09T18:26:04.899Z",
"streams": [
{
"url":
"http://xxx.m3u8",
"mime_type": "MP4"
}
],
"last_modified_date": "2017-08-09T18:26:27.621Z",
"description": "xxx",
"fn__media_tags": [
"weather",
"personality"
],
"created_date": "2017-08-09T18:23:16.830Z",
"title": "NOAA predicts most active hurricane season since 2010",
"fn__tve_authentication_required": false
}
},
...,
]
}
what is the most simple data structure and serializer I can set up?
I get stumped after something like:
const mobile_screen_components = responses.map((currentValue, index) => {
id[`id_${index}`];
});
const dataSet = {
id: 1,
title: 'Watch',
mobile_screen_components,
};
const ScreenSerializer = new JSONAPISerializer('mobile_screens', {
attributes: ['title', 'mobile_screen_components'],
mobile_screen_components: {
ref: 'id',
}
});
Which only gives me:
{
"data": {
"type": "mobile_screens",
"id": "1",
"attributes": { "title": "Watch" },
"relationships": {
"mobile-screen-components": {
"data": [
{ "type": "mobile_screen_components", "id": "1_0" },
{ "type": "mobile_screen_components", "id": "1_1" },
{ "type": "mobile_screen_components", "id": "1_2" },
{ "type": "mobile_screen_components", "id": "1_3" },
{ "type": "mobile_screen_components", "id": "1_4" },
{ "type": "mobile_screen_components", "id": "1_5" }
]
}
}
}
}
I have no idea how to get the "included" sibling to "data." etc.
So, the question is:
what is the most simple data structure and serializer I can set up?
Below is the simplest object that can be converted to JSON similar to JSON in the question using jsonapi-serializer:
let dataSet = {
id: '1',
title: 'Watch',
mobile_screen_components: [
{
id: '1_1',
title: 'Featured Playlists',
display_type: 'shelf',
playlists: {
id: 938973798001,
videos: [
{
id: 5536725488001,
duration: 78321,
live_stream: false
},
{
id: 5535943875001,
duration: 52621,
live_stream: true
}
]
}
}
]
};
To serialize this object to JSON API, I used the following code:
let json = new JSONAPISerializer('mobile_screen', {
attributes: ['id', 'title', 'mobile_screen_components'],
mobile_screen_components: {
ref: 'id',
attributes: ['id', 'title', 'display_type', 'playlists'],
playlists: {
ref: 'id',
attributes: ['id', 'videos'],
videos: {
ref: 'id',
attributes: ['id', 'duration', 'live_stream']
}
}
}
}).serialize(dataSet);
console.log(JSON.stringify(json, null, 2));
The first parameter of JSONAPISerializer constructor is the resource type.
The second parameter is the serialization options.
Each level of the options corresponds to a level of nesting in the object being serialized.
ref - if present, the property is treated as a relationship.
attributes - an array of attributes to show.
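To see what the serializer has to produce for one level of nesting, the target shape can also be assembled by hand. The sketch below is plain Python and does not use jsonapi-serializer; the function name to_compound_document is invented, and it only handles the screen and its components, not playlists or videos. Each related object appears twice: as a type/id reference under relationships, and as a full resource under included.

```python
import json


def to_compound_document(screen):
    """Build a JSON:API-style compound document for one screen (one nesting level)."""
    references = []
    included = []
    for component in screen["mobile_screen_components"]:
        # Reference that goes under data.relationships.
        references.append({"type": "mobile_screen_components", "id": component["id"]})
        # Full resource that goes under the top-level "included" member.
        included.append({
            "type": "mobile_screen_components",
            "id": component["id"],
            "attributes": {
                "title": component["title"],
                "display_type": component["display_type"],
            },
        })
    return {
        "data": {
            "type": "mobile_screens",
            "id": screen["id"],
            "attributes": {"title": screen["title"]},
            "relationships": {"mobile_screen_components": {"data": references}},
        },
        "included": included,
    }


screen = {
    "id": "1",
    "title": "Watch",
    "mobile_screen_components": [
        {"id": "1_1", "title": "Featured Playlist", "display_type": "shelf"},
    ],
}

document = to_compound_document(screen)
print(json.dumps(document, indent=2))
```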
Introduction
First of all we have to understand the JSON API document data structure
[0.1] Referring to the top level (object root keys):
A document MUST contain at least one of the following top-level
members:
data: the document’s “primary data”
errors: an array of error objects
meta: a meta object that contains non-standard meta-information.
A document MAY contain any of these top-level members:
jsonapi: an object describing the server’s implementation
links: a links object related to the primary data.
included: an array of resource objects that are related to the primary data and/or each other (“included resources”).
[0.2]
The document’s “primary data” is a representation of the resource or
collection of resources targeted by a request.
Primary data MUST be either:
a single resource object, a single resource identifier object, or null, for requests that target single resources
an array of resource objects, an array of resource identifier objects, or an empty array ([]), for requests that target collections
Example
The following primary data is a single resource object:
{
"data": {
"type": "articles",
"id": "1",
"attributes": {
// ... this article's attributes
},
"relationships": {
// ... this article's relationships
}
}
}
In the (jsonapi-serializer) documentation : Available serialization option (opts argument)
So in order to add the included (top-level member) I performed the following test :
var JsonApiSerializer = require('jsonapi-serializer').Serializer;
const DATASET = {
id:23,title:'Lifestyle',slug:'lifestyle',
subcategories: [
{description:'Practices for becoming 31337.',id:1337,title:'Elite'},
{description:'Practices for health.',id:69,title:'Vitality'}
]
}
const TEMPLATE = {
topLevelLinks:{self:'http://example.com'},
dataLinks:{self:function(collection){return 'http://example.com/'+collection.id}},
attributes:['title','slug','subcategories'],
subcategories:{ref:'id',attributes:['id','title','description']}
}
let SERIALIZER = new JsonApiSerializer('pratices', DATASET, TEMPLATE)
console.log(SERIALIZER)
With the following output :
{ links: { self: 'http://example.com' },
included:
[ { type: 'subcategories', id: '1337', attributes: [Object] },
{ type: 'subcategories', id: '69', attributes: [Object] } ],
data:
{ type: 'pratices',
id: '23',
links: { self: 'http://example.com/23' },
attributes: { title: 'Lifestyle', slug: 'lifestyle' },
relationships: { subcategories: [Object] } } }
As you may observe, the included is correctly populated.
NOTE : If you need more help with your dataSet, edit your question with the original data.

Mongodb insert with multiple conditions

I have multiple documents in a collection; each document has this data structure:
{
_id: "some object id",
data1: [
{
data2_id : 13233,
data2: [
{
sub_data1: "text1",
sub_data2: "text2",
sub_data3: "text3",
},
{
sub_data1: "text4",
sub_data2: "text5",
sub_data3: "text6",
}
]
},
{
data2_id : 53233,
data2: [
{
sub_data1: "text4",
sub_data2: "text5",
sub_data3: "text6",
}
...
]
},
{
data2_id : 56233,
data2: [
{
sub_data1: "text7",
sub_data2: "text8",
sub_data3: "text9",
}
...
]
},
{
data2_id : 53236,
data2: [
{
sub_data1: "text10",
sub_data2: "text22",
sub_data3: "text33",
}
...
]
}
]
}
I'd like to update a set of documents, selected by id, that match some condition, and update only the sub-object within each document.
I've tried this:
db.collection.update({
"$and": [
{
"_id": {
"$in": [
{
"$id": "54369aca9bc25af3ca8b4568"
},
{
"$id": "54369aca9bc25af3ca8b4562"
}
]
}
},
{
"data1.data2": {
"$elemMatch": {
"sub_data1": "text4",
"sub_data2": "text5"
}
}
}
]
},
{
"data1.data2.$.sub_data3" : "text updated"
}
)
But I get the following error:
Update of data into MongoDB failed: dev.**.com:27017: cannot use the part (data2 of data1.data2.0.sub_data3) to traverse the element...
Any Ideas?
There is an open issue here that imposes a limitation when trying to update elements of an array nested within another array.
Besides, there are some improvements you can make here:
For your query you don't need the $and
db.collection.update(
{
"_id": {
"$in": [
{"$id": "54369aca9bc25af3ca8b4568"},
{"$id": "54369aca9bc25af3ca8b4562"}
]},
"data1.data2": {
"$elemMatch": {
"sub_data1": "text4",
"sub_data2": "text5"
}
}
},{..update...})
You might want to use $set:
db.collection.update(query,{ $set:{"name": "Mike"} })
Otherwise, you might lose the rest of the data within your document.
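Given the nested-array limitation, one workaround is to fetch each matching document, change the nested elements in application code, and write the modified data1 array back with $set. The sketch below shows only the in-memory step in plain Python; the function name update_nested is invented, and fetching and saving the document are left out because they depend on the driver in use.

```python
def update_nested(doc, match, new_value):
    """Set sub_data3 on every data1[].data2[] element whose fields equal `match`.

    Returns the number of elements changed; `doc` is modified in place.
    """
    changed = 0
    for outer in doc.get("data1", []):
        for inner in outer.get("data2", []):
            if all(inner.get(key) == value for key, value in match.items()):
                inner["sub_data3"] = new_value
                changed += 1
    return changed


doc = {
    "_id": "some object id",
    "data1": [
        {"data2_id": 13233, "data2": [
            {"sub_data1": "text1", "sub_data2": "text2", "sub_data3": "text3"},
            {"sub_data1": "text4", "sub_data2": "text5", "sub_data3": "text6"},
        ]},
        {"data2_id": 53233, "data2": [
            {"sub_data1": "text4", "sub_data2": "text5", "sub_data3": "text6"},
        ]},
    ],
}

changed = update_nested(doc, {"sub_data1": "text4", "sub_data2": "text5"}, "text updated")
```

After this step, the whole data1 array of the document would be written back in a single $set, which leaves the other top-level fields untouched.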