I'll quickly explain the Data Flow Diagram above.
The main process in the bottom left corner and the MongoDB datastore next to it are the two main components of my system. Simply put, the main process gathers data from a MySQL system which serves as a datastore for other backend systems in our company. Those other systems, which are external to mine, are constantly changing data in their respective MySQL DBs. The main process transforms the data from those systems: it does not change the original schema, but adds more information to it and sometimes updates its values, NOT the schema. The transformed data is used by our mobile apps, i.e. the external entity next to the MongoDB datastore in the DFD. Now, everything works fine when I use the system to create a new copy of transformed data; at that moment it is synchronized with all the other systems in terms of data.
The problem is,
When I try to further transform the data later at some point, I want to be able to notify the user of changes and synchronize it (if the user wants to) with the original data, since other external systems or even my own process could have updated the data in the meantime.
{
"data_gathered_by_process": [
{
"id": "DB1",
"original data field 1": "original value 1"
},
{
"id": "DB2",
"original data field 2": "original value 2"
},
{
"id": "DB3",
"original data field 3": "original value 3"
}
]
}
This could be transformed into
{
"transformed data": [
{
"id": "DB1",
"original data field 1": "transformed value 1",
"additional field added by process": "value"
},
{
"id": "DB2",
"original data field 2": "original value 2",
"additional values": ["one", "two"]
},
{
"id": "DB3",
"original data field 3": "original value 3"
}
]
}
Now the original data could again be changed this way
{
"data_gathered_by_process": [
{
"id": "DB1",
"original data field 1": "changed to some other value"
},
{
"id": "DB2",
"original data field 2": "original value 2"
},
{
"id": "DB3",
"original data field 3": "original value 3"
}
]
}
I'm thinking of implementing something like this:
Add a last_updated timestamp to the entities of DB1, DB2, and DB3, and also store it in the transformed copies of the data. When working on already-transformed data, check the timestamps of all entities one by one and update them where they don't match. I would first notify the user that the original data has changed since the copy was made and, if they want to take the changes, apply the same logic. But this would be a processing overhead, as there are more than ten entities, each with a different set of properties.
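Roughly, the check I have in mind would look like this (just a sketch; the entity and field names are placeholders, not my real schema):

# Sketch of the timestamp comparison I have in mind (placeholder names).
def find_stale_entities(original_rows, transformed_rows):
    """Return the ids whose original last_updated is newer than the stored copy."""
    original_by_id = {row["id"]: row for row in original_rows}
    stale_ids = []
    for copy in transformed_rows:
        source = original_by_id.get(copy["id"])
        if source is not None and source["last_updated"] > copy["last_updated"]:
            stale_ids.append(copy["id"])
    return stale_ids

# The caller would notify the user about the ids in stale_ids and, only if the
# user agrees, re-run the transformation for just those entities.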
I think there's a better way to do this, if only someone could point me in the right direction.
(From a MySQL perspective...)
A single dataset (all the DBs), living on 3 machines. Those 3 machines in some kind of Replication.
If it is specifically "3", then Galera Cluster is an excellent choice. Each client can write to its nearby Galera "node"; that data will be immediately replicated to the other two nodes.
If the "3" is likely to grow over time, I would have 1 "Primary" with some number of Replicas. This topology requires that all writes go to the Primary. How far (ping time) is it from each client to where the Primary might live?
The number of Replicas could be zero -- everything is done on the Primary. Or it could be a growing number of servers; this is useful if the read traffic is too high to handle in a single server.
Both approaches (Galera vs Primary+Replicas) force every write to go to all servers, thereby eliminating the synchronization you describe.
(A 3rd approach is "InnoDB Cluster".)
I'm working on an app that will generate a potentially very big JSON. In my tests this was 8000 rows. This is because it is an aggregation of data for a year, and it is required to display details in the UI.
For example:
"voice1": {
"sum": 24000,
"items": [
{
"price": 2000,
"description": "desc1",
"date": "2021-11-01T00:00:00.000Z",
"info": {
"Id": "85fda619bbdc40369502ec3f792ae644",
"address": "add2",
"images": {
"icon": "img.png",
"banner": null
}
}
},
{
"price": 2000,
"description": "desc1",
"date": "2021-11-01T00:00:00.000Z",
"info": {
"Id": "85fda619bbdc40369502ec3f792ae644",
"address": "add2",
"images": {
"icon": "img.png",
"banner": null
}
}
}
]
},
The point is that I can potentially have 10 voices and, for each, dozens and dozens of items.
I was wondering if you can point to me some Best Practices or if you have some tips about them because I've got the feeling this can be done better.
It sounds like you are finding out that JSON is a rather verbose format (not as bad as XML, but still very verbose). If you are worried about the size of messages between server and client, you have a few options:
JSON compresses rather well. You can see how most tokens repeat many times. So make sure to compress with Gzip or Snappy before sending to clients. This will drastically reduce the size, but cost some performance for inflating / deflating.
The other alternative is to not use JSON for transfer, but a more optimized format. One of the best options here is FlatBuffers. It does require you to provide schemas of the data that you are sending, but it is an optimized binary format with minimal overhead. It will also drastically speed up your application because it will remove the need for serialization / deserialization, which takes a significant time for JSON. Another popular, but slightly slower, alternative is Protobuf.
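To make the first option (compression) concrete, here is a minimal sketch of gzipping a JSON payload before sending it, standard library only; the payload variable is just a stand-in for your aggregated structure:

# Minimal sketch: gzip a JSON payload before sending it to the client.
import gzip
import json

payload = {"voice1": {"sum": 24000, "items": [{"price": 2000, "description": "desc1"}]}}

raw = json.dumps(payload).encode("utf-8")
compressed = gzip.compress(raw)          # send this over the wire (e.g. with Content-Encoding: gzip)
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(len(raw), len(compressed))         # the size difference grows with repetitive data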
The only thing immediately obvious to me is that you would likely want to make a list of voices (like you have for items) rather than voice1, voice2, etc.
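For instance (a quick Python sketch with made-up values), the voice1/voice2 keys could be folded into a single list:

# Sketch: fold "voice1", "voice2", ... keys into a single "voices" list.
payload = {
    "voice1": {"sum": 24000, "items": [{"price": 2000, "description": "desc1"}]},
    "voice2": {"sum": 5000, "items": [{"price": 5000, "description": "desc2"}]},
}

voices = [{"name": name, **body} for name, body in payload.items()]
result = {"voices": voices}
# result -> {"voices": [{"name": "voice1", "sum": 24000, "items": [...]}, ...]}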
Beyond that it really just depends on the structure of the data you start with (to create the JSON) and the structure of the data or code at the destination (and possibly also the method of transferring data, if size is a concern). If you're doing a significant amount of processing on either end to encode/decode the JSON, that can suggest there's a simpler way to structure the data. Can you share some additional context or examples of the overall process?
I have a number of JSON sources I wish to import into Power BI. The format is such that for foreign keys there can be 0, 1, or many, and they store both the ID referencing another table as well as the name. An example of one entry in one of the JSON files is:
{
"ID": "5bb68fde9088104f8c2a85be",
"Name": "name here",
"Date": "2018-10-04T00:00:00Z",
"Account": {
"ID": "5bb683509088104f8c2a85bc",
"Name": "name here"
},
"Amount": 38.21,
"Received": true
}
Some tables are much more complex, etc., but for the most part they always follow this sort of format for foreign keys. In Power BI, I pull in the JSON, convert it to a table, and expand the column to view the top level in the table, but any lower levels, such as these foreign keys, are represented as lists. How do I pull them out into each row? I can extract values, but that duplicates rows, etc.
I have googled multiple times for this and tried to follow what others have posted but can't seem to get anything to work.
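For what it's worth, the flat shape I'm after is what pandas' json_normalize produces outside Power BI; this sketch only illustrates the target shape (Account.ID and Account.Name as ordinary columns), not a Power BI solution:

# Illustration only: the flat shape I want, produced here with pandas outside Power BI.
import pandas as pd

records = [{
    "ID": "5bb68fde9088104f8c2a85be",
    "Name": "name here",
    "Date": "2018-10-04T00:00:00Z",
    "Account": {"ID": "5bb683509088104f8c2a85bc", "Name": "name here"},
    "Amount": 38.21,
    "Received": True,
}]

df = pd.json_normalize(records)
print(sorted(df.columns))
# ['Account.ID', 'Account.Name', 'Amount', 'Date', 'ID', 'Name', 'Received']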
Background: We have all seen several ways to configure a distributed application. For my purposes, two of them stand out:
Have a massive database that all nodes have access to. Each node knows its own identity, and so can perform queries against said database to pull out the configuration information specific to itself.
Use tailored (i.e., specific to each node) configuration files (e.g., JSON) so that the nodes do not have to touch a database at all. They simply read the tailored config file and do what it says.
There are pros and cons to each. For my purposes, I would like to explore #2 a little further, but the problem I'm running into is that the JSON files can get pretty big. I'm wondering if anyone knows of a DSL that is well-suited for generating these JSON files.
Step-by-step examples to illustrate what I mean:
Suppose I make up this metalanguage that looks like this:
bike.%[0..3](Vehicle)
This would then output the following JSON:
{
"bike.0":
{
"type": "Vehicle"
},
"bike.1":
{
"type": "Vehicle"
},
"bike.2":
{
"type": "Vehicle"
},
"bike.3":
{
"type": "Vehicle"
}
}
The idea is that we've just created 4 bikes, each of which is of type Vehicle.
Going further:
bike[i=0..3](Vehicle)
label: "hello %i"
label.3: "last"
Now what this does is to name the index variable 'i' so that it can be used for the configuration information of each item. The JSON that would be output would be something like this:
{
"bike.0":
{
"type": "Vehicle",
"cfg":
{
"label": "hello 0"
}
},
"bike.1":
{
"type": "Vehicle",
"cfg":
{
"label": "hello 1"
}
},
"bike.2":
{
"type": "Vehicle",
"cfg":
{
"label": "hello.2"
}
},
"bike.3":
{
"type": "Vehicle",
"cfg":
{
"label": "last"
}
}
}
You can see how the last label was overridden, so this is a way to sparsely specify stuff. Is there anything already out there that lets one do this?
Thanks!
Rather than thinking of the metalanguage as a monolithic entity, it might be better to divide it into three parts:
An input specification. You can use a configuration file syntax to hold this specification.
A library or utility that can use print statements and for-loops to generate runtime configuration files. The Apache Velocity template engine comes to mind as something that is suitable for this purpose. I suggest you look at its user guide to get a feel for what it can do.
Some glue code to join together the above two items. In particular, the glue code reads name=value pairs from the input specification, and passes them to the template engine, which uses them as parameters to "instantiate" the template versions of the configuration files that you want to generate.
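Here is a rough sketch of that glue code in Python, using Jinja2 in place of Velocity (the spec values and the template are made-up examples):

# Glue-code sketch: read a tiny spec, feed it to a template engine, emit JSON.
import json
from jinja2 import Template

spec = {"name": "bike", "count": 4, "type": "Vehicle"}   # the "input specification"

template = Template(
    "{\n"
    "{% for i in range(count) %}"
    '  "{{ name }}.{{ i }}": { "type": "{{ type }}" }'
    "{% if not loop.last %},{% endif %}\n"
    "{% endfor %}"
    "}\n"
)

rendered = template.render(**spec)
config = json.loads(rendered)   # sanity-check that the output is valid JSON
print(json.dumps(config, indent=2))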
My answer to another StackOverflow question provides some more details of the above idea.
My application will be receiving a large JSON payload from an upstream system. This upstream system is essentially a UI that will collect business requirements from a user, format those questions and facts into a JSON payload, and transmit the JSON to my application, which will validate it against a schema defined by the json-schema standard. The conundrum is that this upstream system is being built by a different team that doesn't necessarily understand all of the business requirements that need to be captured.
Take the following schema:
schema = {
"$schema": "http://json-schema.org/draft-04/schema#",
"title":"Requirements",
"description": "A Business Requirements Payload",
"type": "object",
"properties": {
"full_name": {
"type": "string"
},
"sex": {
"enum": ["m", "f"]
},
"age": {
"type": "number"
},
"consents": {
"type": "boolean"
}
},
"required": ["full_name", "sex", "age", "consents"],
"additionalProperties": False
}
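(For context, the validation on my side is just standard json-schema validation, roughly like this with the Python jsonschema package; the payload below is a made-up example.)

# Roughly how my side validates an incoming payload (using the jsonschema package).
from jsonschema import validate, ValidationError

payload = {"full_name": "Jane Doe", "sex": "f", "age": 42, "consents": True}  # made-up example

try:
    validate(instance=payload, schema=schema)   # schema is the dict defined above
except ValidationError as err:
    print("Rejected payload:", err.message)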
Assume that the upstream system has no idea what a full_name, sex, or age is. Currently, I am having meetings to explain the nature of every field/question/fact that I require, the default values that should show up on the UI, the accompanying text labels that should appear next to each field, and so on.
In brainstorming a mechanism to make this easier for everyone, I thought of tightly coupling the json-schema I am creating to the UI that the upstream system is building. What if I include these details inside of the json-schema itself, hand the json-schema to the upstream system, and let the UI team generate the UI with the accompanying text labels, default values, and etc?
For example, the full_name and sex fields could instead be described like this:
"full_name": {
"type": "string",
"default": "\"John Smith\"",
"label": "Full Name",
"text": "Please include your full name.",
"description": "This field will be the primary key in the database"
},
"sex": {
"enum": ["m", "f"],
"default": "m",
"enum_labels": ["Male", "Female"],
"label": "Sex",
"text": "Please include your sex.",
"description": "We want to run analytics on this field"
}
The UI team and I could come to an agreement on certain things:
If the field is of type string, generate a text box.
If the field is an enum, generate a combo box.
Use the field's label property in front of the form entry.
If the field is of type enum, generate pretty labels for the enum values by comparing positionally against the enum_labels property.
Use the field's text property right below the form entry.
The Description field is only to help you, the UI guy, to know the business logic.
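A rough sketch of what that agreement could look like in code (Python; the widget names are invented, and label/text/enum_labels are the proposed non-standard keywords from above):

# Sketch of the schema-to-form mapping described above; widget names are invented.
def build_form(schema):
    form = []
    for name, spec in schema["properties"].items():
        if "enum" in spec:
            widget = {"widget": "combobox",
                      "options": spec["enum"],
                      "option_labels": spec.get("enum_labels", spec["enum"])}
        elif spec.get("type") == "string":
            widget = {"widget": "textbox"}
        else:
            widget = {"widget": "input"}
        widget.update({
            "field": name,
            "label": spec.get("label", name),
            "help_text": spec.get("text", ""),
            "default": spec.get("default"),
        })
        form.append(widget)
    return form

# build_form(schema) -> a list of widget descriptions the UI team could render from.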
Here are some negatives to this approach:
Tightly coupling the view in this manner may not be optimal
If json-schema v5 introduces a keyword that I am using, such as text, the schema would break if I upgraded to v5 and then I would have to change the contract with the UI team. (What could also be done to avoid this is to use the description field to hold all the form-related keywords, delimited by some character, but it wouldn't look as nice).
Is it appropriate to tightly couple a json-schema with a UI, and if it is, is there anything wrong with adding properties to the json-schema like I have described in order to accomplish this?
*I just stumbled across jsonform which is pretty much what I desire, but this question still applies to jsonform as well as a custom parser.
Just to be certain, you are aware there is an optional form object which is used to structure the form output? It allows custom grouping, custom ordering, conditional fields and more ...
https://github.com/joshfire/jsonform/wiki#fields
If your default schema object is satisfactory for both the form layout as well as how the data object gets stored, then there is nothing wrong with sticking to the schema for the layout of the form.
I am not sure if this answers your question, but the question is slightly unclear to me. Basically yes you can stick to the main schema, but if that is not sufficient for the form layout, you can populate the form object.
This is related to my original question here:
Elasticsearch Delete Mapping Property
From that post, assuming you are going to have to "reindex" your data: what is a safe strategy for doing this?
To summarize from the original post I am trying to take the mapping from:
{
"propVal1": {
"type": "double",
"index": "analyzed"
},
"propVal2": {
"type": "string",
"analyzer": "keyword"
},
"propVal3": {
"type": "string",
"analyzer": "keyword"
}
}
to this:
{
"propVal1": {
"type": "double",
"index": "analyzed"
},
"propVal2": {
"type": "string",
"analyzer": "keyword"
}
}
Removing all data for the property that was removed.
I have been contemplating using the REST API for this. This seems dangerous though since you are going to need to synchronize state with the client application making the REST calls, i.e. you need to send all of your documents to the client, modify them, and send them back.
What would be ideal is if there was a server side operation that could move and transform types around. Does something like this exist or am I missing something obvious with the "reindexing"?
Another approach would be to flag the data as no longer valid. Are there any built-in flags for this in terms of the mapping, or is it necessary to create an auxiliary type to define whether another type's property is valid?
You can have a look at the elasticsearch-reindex plugin.
A more manual option could be to use the scan & scroll API to get back your original content and use the bulk API to index it into a new index or type.
One last thought: how did you get your docs into Elasticsearch in the first place? If you already have a data source somewhere, just use the same process as before.
If you don't want any downtime, use an alias on top of your old index and once reindex is done, just move the alias to the new index.
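A rough sketch of the manual route with the Python client (index and alias names are placeholders; the removed property is dropped while copying):

# Sketch: copy docs from the old index to a new one, dropping the removed property,
# then flip an alias. Index/alias names and the dropped field are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def actions():
    for hit in helpers.scan(es, index="old_index", query={"query": {"match_all": {}}}):
        source = hit["_source"]
        source.pop("propVal3", None)            # drop the removed property while copying
        yield {"_index": "new_index", "_id": hit["_id"], "_source": source}

helpers.bulk(es, actions())

# Once reindexing is done, move the alias so clients see no downtime.
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "old_index", "alias": "my_alias"}},
    {"add": {"index": "new_index", "alias": "my_alias"}},
]})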