I have a number of JSON sources I wish to import into Power BI. The format is such that a record can have 0, 1, or many foreign keys, and each foreign key stores both the ID of the row in the other table and its name. An example of one entry in one of the JSON files is:
{
"ID": "5bb68fde9088104f8c2a85be",
"Name": "name here",
"Date": "2018-10-04T00:00:00Z",
"Account": {
"ID": "5bb683509088104f8c2a85bc",
"Name": "name here"
},
"Amount": 38.21,
"Received": true
}
Some tables are much more complex, but for the most part they always follow this sort of format for foreign keys. In Power BI, I pull in the JSON, convert it to a table, and expand the column to view the top level in the table, but any lower levels, such as these foreign keys, are represented as lists. How do I pull them out into columns on each row? I can extract values, but that duplicates rows.
I have googled multiple times for this and tried to follow what others have posted but can't seem to get anything to work.
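To make the shape I'm after concrete, this is roughly the flattening I mean, illustrated outside of Power BI with pandas (only to show the target columns, it is not part of my actual workflow):

# Illustration only: flatten the nested "Account" record into columns,
# so each row keeps Account.ID and Account.Name without duplicating rows.
import pandas as pd

records = [{
    "ID": "5bb68fde9088104f8c2a85be",
    "Name": "name here",
    "Date": "2018-10-04T00:00:00Z",
    "Account": {"ID": "5bb683509088104f8c2a85bc", "Name": "name here"},
    "Amount": 38.21,
    "Received": True,
}]

flat = pd.json_normalize(records)   # nested keys become "Account.ID", "Account.Name"
print(flat.columns.tolist())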
I want to use an ID as the primary key in a JSON object. This way all users in the list are unique.
Like so:
{
"user": [{
"id": 1,
"name": "bob"
}]
}
In an application, I have to search for the id in all elements of the list 'user'.
But I can also use the ID as an index to get easier access to a specific user.
Like so:
{
"user": {
"1": {
"name": "bob"
}
}
}
In an application, I can now simply write user["3"] to get the correct user.
What should I use? Are there any disadvantages to the second option? I'm sure there is a best practice.
It depends on what you want your objects to look like, how much processing you want to do on them, and how much data you have.
When dealing with web data you will often see the first format. If there is a lot of data, you will need to iterate through all the records to find a matching id, because your data is an array. Often that lookup is pushed down to the underlying data set, though, so it might already be indexed (e.g. if it is a database) and this may not be an issue. This format is clean and binds easily.
Your second option works best when you need efficient lookups, since a dictionary of key-value pairs allows significantly faster lookups in large datasets. A numeric key (even though you are forcing it to be a string) is not supported by all libraries, but you can prefix your id with an alphabetic value and then just add the prefix when doing a lookup. I have used k in this example, but you can choose a prefix that makes sense for your data. I use this format when storing objects as a JSON binary data type in databases.
{
"user": {
"k1": {
"name": "bob"
}
}
}
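To make the difference concrete, here is a small sketch of the two lookup styles in Python, using the prefixed key from the example above:

# Format 1: a list of objects - finding a user means scanning the array (O(n)).
data_v1 = {"user": [{"id": 1, "name": "bob"}]}
user = next((u for u in data_v1["user"] if u["id"] == 1), None)

# Format 2: a dictionary keyed by a prefixed id - a direct lookup (O(1) on average).
data_v2 = {"user": {"k1": {"name": "bob"}}}
user = data_v2["user"]["k" + str(1)]   # add the prefix when doing the lookup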
So I am importing a JSON file of (example) users into the Realtime Database and have noticed a slight issue.
When added, each user is keyed by its position in the JSON file rather than by its ID. Since the file starts with user 4, that user ends up under the key 1, as seen here:
[screenshot of the imported users in the Realtime Database]
The user JSON is formatted as such:
{
"instagram": "null",
"invited_by_user_profile": "null",
"name": "Rohan Seth",
"num_followers": 4187268,
"num_following": 599,
"photo_url": "https://clubhouseprod.s3.amazonaws.com:443/4_b471abef-7c14-43af-999a-6ecd1dd1709c",
"time_created": "2020-03-17T07:51:28.085566+00:00",
"twitter": "rohanseth",
"user_id": 4,
"username": "rohan"
}
Is there some easy way to make the keys in Firebase the user IDs of each user instead of the array indexes currently used?
When you import a JSON into the Firebase Realtime Database, it uses whatever keys exist in the JSON. There is no support to remap the keys during the import.
But of course you can do this with some code:
For example, you can change the JSON before you import it, to have the keys you want.
You can read the JSON in a small script, and then insert the data into Firebase through its API.
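For example, a rough sketch of such a script using the Python Admin SDK (the file names, database URL and credentials path are placeholders, and I'm assuming the JSON file holds an array of user objects like the one in the question):

import json

import firebase_admin
from firebase_admin import credentials, db

# Placeholders: point these at your own service account and database.
cred = credentials.Certificate("serviceAccount.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://your-project.firebaseio.com"})

with open("users.json") as f:
    users = json.load(f)   # assuming an array of user objects

ref = db.reference("users")
for user in users:
    # Use the user_id from each record as the key instead of the array index.
    ref.child(str(user["user_id"])).set(user)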
I'd recommend against using sequential numeric IDs as keys though. To learn why, have a look at this blog post: Best Practices: Arrays in Firebase.
I'll quickly explain the Data Flow Diagram above.
The main process in the bottom left corner and the MongoDB datastore next to it are the two main components of my system. Simply put, the main process gathers data from a MySQL system which serves as a datastore for other backend systems in our company. Those other systems, which are external to mine, are constantly changing data in their respective MySQL DBs. The main process transforms the data from those systems: it does not change the original schema, but adds more information to it and sometimes updates its values (the values, never the schema). The transformed data is used by our mobile apps, i.e. the external entity next to the MongoDB datastore in the DFD. Everything works fine when I use the system to create a new copy of the transformed data; at that moment it is synchronized with all the other systems in terms of data.
The problem is:
When I try to further transform the data later at some point, I want to be able to notify the user of changes and synchronize it (if the user wants to) with the original data, since other external systems, or even my own process, could have updated the data in the meantime.
{
"data_gathered_by_process": [
{
"id": "DB1",
"original data field 1": "original value 1"
},
{
"id": "DB2",
"original data field 2": "original value 2"
},
{
"id": "DB3",
"original data field 3": "original value 3"
}
]
}
This could be transformed into
{
"transformed data": [
{
"id": "DB1",
"original data field 1": "transformed value 1",
"additional field added by process": "value"
},
{
"id": "DB2",
"original data field 2": "original value 2",
"additional values": ["one", "two"]
},
{
"id": "DB3",
"original data field 3": "original value 3"
}
]
}
Now the original data could again be changed this way
{
"data_gathered_by_process": [
{
"id": "DB1",
"original data field 1": "changed to some other value"
},
{
"id": "DB2",
"original data field 2": "original value 2"
},
{
"id": "DB3",
"original data field 3": "original value 3"
}
]
}
I'm thinking of implementing something like this:
Add a last_updated timestamp to the entities in DB1, DB2 and DB3, and also store it in the transformed copies of the data. When working on already-transformed data, check the timestamps of all entities one by one and update any that are mismatched. I'll first notify the user that the original data has changed since the copy was made and, if they want to make changes, apply the same logic. But this would be a processing overhead, as there are more than ten entities, each with a different set of properties.
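Roughly what I have in mind, sketched with pymongo (the connection string, database and collection names are placeholders based on the examples above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection
database = client["mydb"]                           # placeholder database name

def find_stale_entities(transformed_doc):
    """Compare the last_updated timestamp stored in the transformed copy
    against the current timestamp of each original entity."""
    stale_ids = []
    for entity in transformed_doc["transformed data"]:
        original = database["data_gathered_by_process"].find_one({"id": entity["id"]})
        if original and original["last_updated"] > entity["last_updated"]:
            stale_ids.append(entity["id"])
    # Notify the user about these IDs; re-sync only if they agree.
    return stale_ids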
I think there's a better way to do this, if only someone could point me in the right direction.
(From a MySQL perspective...)
A single dataset (all the DBs), living on 3 machines. Those 3 machines in some kind of Replication.
If it is specifically "3", then Galera Cluster is an excellent choice. Each client can write to its nearby Galera "node"; that data will be immediately replicated to the other two nodes.
If the "3" is likely to grow over time, I would have 1 "Primary" with some number of Replicas. This topology requires that all writes go to the Primary. How far (ping time) is it from each client to where the Primary might live?
The number of Replicas could be zero -- everything is done on the Primary. Or it could be a growing number of servers; this is useful if the read traffic is too high to handle in a single server.
Both approaches (Galera vs Primary+Replicas) force every write to go to all servers, thereby eliminating the synchronization you describe.
(A 3rd approach is "InnoDB Cluster".)
We are currently using Pact-Broker in our Spring Boot application with really good results for our integration tests.
Our tests using Pact-Broker are based on a call to a REST API and comparing the response with the values from our provider, always using JSON format.
Our problem is that the values to compare are in a DB where the data changes quite often, which makes us update the tests really often.
Do you know if it is possible to just validate by the data type?
What we would like to try is to validate that the JSON is properly formed and the data types match. For example, if our REST API gives this output:
[
{
"action": "VIEW",
"id": 1,
"module": "A",
"section": "pendingList",
"state": null
},
{
"action": "VIEW",
"id": 2,
"module": "B",
"section": "finished",
"state": null
}
]
For example, what we would like to validate from the previous output is the following:
The JSON is well formed.
All the key/value pairs exist, based on the model.
The values match a specific data type, for example, that the key action exists in all the entries and contains a string.
Do you know if this can be accomplished with Pact-Broker? I was searching in the documentation but I did not find any example of how to do it.
Thanks a lot in advance.
Best regards.
Absolutely! The first 2 things Pact will always do without any extra work.
What you are talking about is referred to as flexible matching [1]. You don't want to match the value, but the type (or a regex). Given you are using Spring Boot, you may want to look at the various matchers available for Pact JVM [2].
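For example, a minimal sketch of the idea using the pact-python matchers (Python only for brevity; the Pact JVM DSL in [2] has equivalents such as stringType and integerType):

# Match on types rather than literal values, so the test does not break
# when the data in the DB changes.
from pact import EachLike, Like

expected_body = EachLike({
    "action": Like("VIEW"),          # matched by type: any string
    "id": Like(1),                   # matched by type: any number
    "module": Like("A"),             # matched by type: any string
    "section": Like("pendingList"),  # matched by type: any string
    "state": None,                   # literal null
}, minimum=1)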
I'm not sure if you meant it, but just for clarity, Pact and Pact Broker are separate things. Pact is the Open Source contract-testing framework, and Pact Broker [3] is a tool to help share and collaborate on those contracts with the team.
[1] https://docs.pact.io/getting_started/matching
[2] https://github.com/DiUS/pact-jvm/tree/master/consumer/pact-jvm-consumer#dsl-matching-methods
[3] https://github.com/pact-foundation/pact_broker/
I would like to keep a DB of nested groups (for example, a company hierarchy: a manager has managers, who manage other managers, who manage employees...).
How should one represent this structure as a JSON?
Should the name of each manager be a key and the managers below him be an object? (Assume each manager has a unique name)
{
"mamanger1": {
"sub_manager1": {
...
},
"sub_manager2": {
...
}
}
}
Or should the JSON consist of "recursive objects", i.e. a key-value object where the key is an identifier and the value is an array of the same kind of key-value objects?
In this case, the key-value pair would be called "name"-"employees".
{
"name": "mamanger1",
"employees": [
{
"name": "sub_manager1",
"employees": [ ... ]
},
{
"name": "sub_manager2",
"employees": [ ... ]
}
]
}
In the first example, each manager has a unique key (better performance on search?).
In the second example, all objects have the same keys (easier looping?).
In my view you should use the second approach:
Benefits:
It is more extensible. You can add more data to the manager entity later, if needed.
Easy looping, as you have field names against each value (see the sketch after this list).
This approach is more relevant to the real world, as manager names may or may not be unique.
You will not lose search performance, as you will have the key "name" and, as you say, the values are still unique. Even if they weren't unique, most NoSQL DBs store the range of values for a key on the same node.
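A quick sketch of that looping, as plain Python recursion over the name/employees structure from the question:

def find_manager(node, name):
    """Recursively search the name/employees structure (the second format)."""
    if node.get("name") == name:
        return node
    for employee in node.get("employees", []):
        found = find_manager(employee, name)
        if found is not None:
            return found
    return None

org = {
    "name": "manager1",
    "employees": [
        {"name": "sub_manager1", "employees": []},
        {"name": "sub_manager2", "employees": []},
    ],
}
print(find_manager(org, "sub_manager2"))   # {'name': 'sub_manager2', 'employees': []}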
When you ask for details about the manager(s) with the name "xyz", the search process is as follows:
You hit the API.
A node receives the request.
The request is forwarded to the node(s) owning the range that "xyz" belongs to.
Only the data on those nodes is scanned, and the matching entries are returned.
Also, as I see it, the first approach creates as many keys as there are managers. Considering the limited number of nodes, one node will be scanned if you try to get the details for "xyz" as a key.
You will get better performance from approach 2 if you are searching for "xyz" and "xyz1" in the same query: as the string values are close to each other, you will mostly get them from the same node. In the first approach, however, there is less chance of them being on the same node, as the two are not considered neighbours because the keys are different altogether.