I need a private blockchain system in which I would store complex data structures such as JSON documents.
The idea is that every transaction is a JSON document (with potentially varying schemas).
Hyperledger Fabric seems to be a great fit since it can run using CouchDB. However, from my understanding (please correct me if I'm wrong), in Fabric CouchDB is supposed to be used as a state database that contains the latest state of the blockchain. Furthermore, the data stored in CouchDB is not actually part of the blockchain, which means it doesn't support Byzantine fault tolerance. So I could only use that system with a trusted consensus. If that is the case, then the use of a blockchain over a distributed database system becomes irrelevant altogether.
Am I missing something?
Could I store my heterogeneous JSON documents in the ledger via transactions to benefit from Byzantine fault tolerance? If that is the case, will it still be possible to query the blockchain?
A blockchain ledger consists of two distinct, though related, parts – a world state and a blockchain.
Firstly, there’s a world state – a database that holds the current values of a set of ledger states. The world state makes it easy for a program to get the current value of these states, rather than having to calculate them by traversing the entire transaction log. Ledger states are, by default, expressed as key-value pairs, though we’ll see later that Hyperledger Fabric provides flexibility in this regard. The world state can change frequently, as states can be created, updated and deleted.
Secondly, there’s a blockchain – a transaction log that records all the changes that determine the world state. Transactions are collected inside blocks that are appended to the blockchain – enabling you to understand the history of changes that have resulted in the current world state. The blockchain data structure is very different to the world state because once written, it cannot be modified. It is an immutable sequence of blocks, each of which contains a set of ordered transactions. (See the Hyperledger Fabric ledger documentation to read more.)
We use the world state to get the current state/data of the blockchain. Without it, we would have to traverse each block's transaction log and calculate the current state ourselves.
Could I store my heterogeneous JSON documents in the ledger via transactions
Yes, you can store JSON documents in the ledger and create composite keys.
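To make that a bit more concrete: chaincode (written in Go, Node.js or Java) stores each document as a value under a key, and the shim's CreateCompositeKey and, when CouchDB is the state database, GetQueryResult give you composite keys and rich queries over the JSON. Here is a conceptual sketch in Python of the key, value and selector shapes only; the names doc/docType and the selector fields are illustrative, not a real API:

```python
import json

# Illustrative only: real chaincode would call stub.CreateCompositeKey()
# and stub.PutState() from the Fabric shim (Go/Node.js/Java).

def make_composite_key(object_type: str, attributes: list) -> str:
    # Fabric joins the object type and attributes with a U+0000 separator;
    # shown here with '\x00' purely to illustrate the idea.
    return "\x00".join([object_type] + attributes)

# A heterogeneous JSON document stored as the state value.
doc = {
    "docType": "sensorReading",   # a discriminator field helps rich queries
    "id": "reading-001",
    "payload": {"temperature": 21.4, "unit": "C"},
}
key = make_composite_key("sensorReading", ["reading-001"])
value = json.dumps(doc).encode()   # what PutState(key, value) would store

# With CouchDB as the state database, a rich (Mango) selector like this
# could be passed to GetQueryResult to find documents by their content:
selector = {"selector": {"docType": "sensorReading", "payload.unit": "C"}}
print(key.replace("\x00", "~"), value, json.dumps(selector))
```

Every such write still goes through endorsement and ordering and is recorded in a block, so the JSON is part of the blockchain; CouchDB just keeps the latest, queryable copy of it.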
Is there a good way to surface when a dataset backing an object was last built in a Workshop module? This would be helpful for giving users of the module a view on data freshness.
The ideal situation is that your data encodes the relevant information about how fresh it is; for instance, if your object type represents "flights" then you can write a Function that sorts and returns the most recent flight and presents its departure timestamp as the "latest" update, since it represents the most recent data available.
The next best approach would be to have a last_updated column or similar that's either coming from the source system or added during the sync step. If the data connection is a JDBC or similar connection, this would be straightforward; something like select *, now() as last_updated_timestamp. If you have a file-based connection, you might need to get a bit more creative. This still falls short of accurately conveying the actual "latest data" available in the object type, but at least lets the user know when the last extract from the source system occurred.
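If the dataset also passes through a Python transform in Foundry, you could stamp the column there instead. A minimal sketch, assuming the standard transforms.api decorators and PySpark; the dataset paths are placeholders:

```python
# Sketch only: assumes a Foundry Python transform repository where
# transforms.api is available; dataset paths are placeholders.
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/project/datasets/tasks_with_freshness"),   # placeholder path
    source=Input("/project/datasets/tasks_raw"),         # placeholder path
)
def add_freshness_column(source):
    # Stamp each row with the time this build ran. Note this reflects the
    # build time, not the source system's own lag (see the caveat below).
    return source.withColumn("last_updated_timestamp", F.current_timestamp())
```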
There are API endpoints for various Foundry services related to executing schedules and builds, but metadata from these can be misleading if presented to users as an indication of data freshness, because they don't actually "know" anything about the data itself. For example, you might get the timestamp of when the latest pipeline build started, but if the source system has a 4-hour lag before data is synced to the relevant export tables, then you'll still be "off". So again, it's best to get this from inside your data wherever possible.
I have the task of migrating previously collected environmental datasets (weather, air quality, noise, etc.) from sensors deployed in different locations, currently stored in several tables of a MySQL database, to my instance of the FIWARE Orion CB, so the data is then persisted to the FIWARE backend.
The challenges are many:
the data isn't stored following FIWARE standards, so it must be transformed according to the FIWARE data models.
not all tables are good candidates for being transformed into an Entity.
some Entities need to take field values from several tables as attributes. For instance, an AirQualityObserved Entity type would have attributes from these tables: airquality, co, co2, no2 and deployment. So mapping these attributes to a particular Entity type is a challenge.
As this is a one-time upload (not live data), I am thinking of two possibilities to go about it.
Add an LwM2M client that keeps sending data to an IoT Agent, which eventually passes it to Orion CB, until the last record.
Create a Python script that "pretends" to be a contextProvider to the Orion instance, sending data (say every 5 seconds) until the last record.
I have not come across a case in my literature search that addresses such a situation. Are there any recommendations from the FIWARE Foundation for situations similar to this?
What would you suggest for the data fields --> Entity attributes mapping, where the attributes actually need to be combined from several tables?
IOTA usage makes sense when you have live data (I mean, a real device sending information to the FIWARE platform). However, you say this is a one-time upload, so the Python script option seems better in this case.
(A little terminological comment here: your script will take the role of a context producer. A context provider is a different actor, related to registrations and query/update forwarding. See this piece of documentation for additional detail.)
With regard to the mapping of data fields to Entity attributes, I don't have any particular suggestion. It is just a matter of analyzing the data model (i.e. the entity attributes) and finding out how to populate those attributes from the data in your tables.
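As an illustration of that producer role, here is a minimal sketch of such a script, assuming Orion's NGSI v2 API plus the requests and PyMySQL libraries; the hostnames, credentials, table and column names are all placeholders, and the join combining several tables into one entity is the part you would adapt:

```python
# Minimal sketch of a "context producer" script: reads rows from MySQL,
# maps them onto an AirQualityObserved-style entity and POSTs each one to
# Orion via NGSI v2. Hosts, credentials and column names are placeholders.
import time

import pymysql
import requests

ORION_URL = "http://localhost:1026/v2/entities"  # placeholder Orion endpoint

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="sensors",
                       cursorclass=pymysql.cursors.DictCursor)

QUERY = """
SELECT a.id, a.observed_at, a.no2, a.co, d.latitude, d.longitude
FROM airquality a JOIN deployment d ON d.id = a.deployment_id
"""  # illustrative join pulling attributes from several tables

with conn.cursor() as cur:
    cur.execute(QUERY)
    for row in cur:
        entity = {
            "id": f"AirQualityObserved:{row['id']}",
            "type": "AirQualityObserved",
            "dateObserved": {"type": "DateTime",
                             "value": row["observed_at"].isoformat()},
            "NO2": {"type": "Number", "value": row["no2"]},
            "CO": {"type": "Number", "value": row["co"]},
            "location": {"type": "geo:json",
                         "value": {"type": "Point",
                                   "coordinates": [row["longitude"],
                                                   row["latitude"]]}},
        }
        resp = requests.post(ORION_URL, json=entity)
        resp.raise_for_status()
        time.sleep(0.1)  # throttle; a fixed 5 s pause isn't needed for a one-off load
```

For a bulk one-off load you could also use Orion's batch endpoint (POST /v2/op/update with actionType "append") to send many entities per request instead of one at a time.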
I'll describe the application I'm trying to build and the technology stack I'm considering at the moment, to get your opinion.
Users should be able to work on a list of tasks. These tasks come from an API together with all the information about them: id, image URLs, description, etc. The API is only available in one datacenter, so to avoid the delay (for example in China) the tasks are stored in a queue.
So you'll have different queues depending on your country, and once you finish your task it will be sent to another queue, which will later write this information back to the original datacenter.
The list of tasks is quite huge; that's why there is an API call to get the tasks (~10k rows) and store them in a queue, and users work on them depending on the queue of the country they are in.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the task list requests (e.g. get me 5k rows for the China queue, write 500 rows to the write queue, etc.), as sketched below.
The API responses come as a list of JSON objects. These 10k rows, for example, need to be stored somewhere. Since I need to be able to filter within this queue, MySQL isn't an option unless I store every field of the JSON object as a new row. The first thought is a NoSQL DB, but I wasn't too happy with MongoDB in the past, and the API response doesn't change too much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database and it gives you the ability to store JSON and filter by it.
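To make the queue part concrete, here is a rough sketch of what I have in mind, assuming Redis lists via the redis-py client; the queue names and task fields are made up:

```python
# Rough sketch of the per-country queues, assuming redis-py.
# Queue names and task fields are made up for illustration.
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def load_tasks(country: str, tasks: list) -> None:
    """Push the ~10k tasks fetched from the API into that country's queue."""
    r.rpush(f"tasks:{country}", *(json.dumps(t) for t in tasks))

def take_tasks(country: str, n: int = 5000) -> list:
    """Pop up to n tasks for workers in the given country."""
    pipe = r.pipeline()                      # MULTI/EXEC keeps this atomic
    pipe.lrange(f"tasks:{country}", 0, n - 1)
    pipe.ltrim(f"tasks:{country}", n, -1)
    rows, _ = pipe.execute()
    return [json.loads(x) for x in rows]

def finish_task(country: str, result: dict) -> None:
    """Completed work goes to a write queue, drained later to the main datacenter."""
    r.rpush(f"done:{country}", json.dumps(result))
```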
What do you think? Ask me if something isn't clear.
You can use PostgreSQL's json/jsonb column types (or the HStore extension for flat key-value data) to store JSON, or dynamic columns from MariaDB (a MySQL fork).
If you can move your persistence stack to Java, then many interesting options are available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the engine behind H2).
All of these would allow you to store JSON with decent performance. I suggest you use a full-text search engine like Lucene to avoid searching JSON content in a slow way.
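If you go the PostgreSQL route, here is a minimal sketch of storing each API row in a jsonb column and filtering on it, assuming psycopg2; the connection string, table name and fields are made up:

```python
# Minimal sketch: store each API row as jsonb and filter by its fields.
# Assumes psycopg2 and placeholder connection details / table name.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=tasks user=app password=secret host=localhost")

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS task_queue (
            id      bigserial PRIMARY KEY,
            country text NOT NULL,
            payload jsonb NOT NULL
        )
    """)
    cur.execute(
        "INSERT INTO task_queue (country, payload) VALUES (%s, %s)",
        ("CN", Json({"task_id": 123, "status": "pending", "image_urls": []})),
    )
    # Containment (@>) can use a GIN index on the jsonb column if you create
    # one; ->> extracts a field as text for ordinary comparisons.
    cur.execute(
        "SELECT payload ->> 'task_id' FROM task_queue "
        "WHERE country = %s AND payload @> %s",
        ("CN", Json({"status": "pending"})),
    )
    print(cur.fetchall())
```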
I wanted to know if, in general, when integrating two or more systems via whatever means (e.g. web service, MQ, etc.), it is a best practice or a standard for your system to capture a snapshot of the data that it exchanges with another system. I am thinking of this as insurance for when reconciliation is required, in scenarios such as production incidents.
Secondly, I would think this data snapshot is different from an audit trail, in that the data being sent is itself saved (e.g. XML data, a CSV file) as a LOB column in a snapshot table. Is this redundant with the audit trail?
For your first question ...
I've done many, many integrations using queues, web services, etc. and I will usually store an audit trail (a high level set of data telling me what happened), but I've never actually stored the payload itself for each call.
A few reasons for that:
The storage of the payloads being sent back and forth can get quite large.
I can usually reconstruct the payload using the audit trail. "Oh entity XYZ with ID 123 was sent yesterday. Let's take a look at what that entity looks like."
If you do the integration really well and have good testing around it, having copies of the payloads becomes unnecessary.
Instead of storing a copy of the payload I would focus on these things for integration:
Good unit tests on both sides and integration testing for the entire process.
Audit logs as you mentioned.
Good retry policies when a message fails (specifically for queues and topics).
Focusing on idempotent messages. So if something fails, you just do it again and everything is ok.
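A minimal sketch of that last point: an idempotent consumer that remembers which message IDs it has already applied, so a retry of the same message is a no-op. The message shape, handler and storage are illustrative:

```python
# Sketch of an idempotent consumer: a processed-ID store makes retries safe,
# because re-delivering the same message ID becomes a no-op.
# The message shape and storage are illustrative.
processed_ids = set()   # in practice a durable store (DB table, Redis set, ...)

def handle(message: dict) -> None:
    msg_id = message["id"]          # assumes the producer attaches a stable ID
    if msg_id in processed_ids:
        return                      # already applied; retrying is harmless
    apply_business_change(message["payload"])
    processed_ids.add(msg_id)       # record only after the change succeeds

def apply_business_change(payload: dict) -> None:
    # Placeholder for the real work (update an entity, call a service, ...).
    print("applied", payload)

# A failed delivery can simply be retried:
msg = {"id": "evt-42", "payload": {"entity": "XYZ", "value": 123}}
handle(msg)
handle(msg)   # duplicate delivery, safely ignored
```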
I will be using Couchbase as the database for my website. I plan for the website to be international, so I will probably have datacenters in the USA, Europe and Australia to keep latency low. I also want to minimise bandwidth between datacenters, so I am planning to fire off parallel updates (AJAX) to all datacenters whenever the user stores data.
My question is then: if I insert the same data into all three clusters approximately simultaneously, is Couchbase smart enough to recognize that this data is identical and therefore does not need replicating between datacenters?
I watched this video and he explained that the CAS value is updated when a document is updated, and that this is used to determine which documents require replication. If the CAS value is updated when any document on the cluster is updated, then my guess is that the answer is "no", as it is very likely that I may be sending only some data to all 3 clusters at once, and any data which is sent to only one cluster will get the CAS temporarily out of sync for that cluster. However, if the CAS value is independent per document, then the answer may be "yes". Maybe there are some options which can be altered to make the CAS value independent per document?
Couchbase does not know anything about the body of the documents that you store. From its perspective, if you write the same document to 3 clusters (all linked bi-directionally with XDCR), it considers them 3 different mutations of the document with that ID. Couchbase will perform its normal conflict resolution process to choose which of the 3 is the "winner". This will result in the "winning" document being transferred to the other two clusters, despite the fact that it may have the exact same content as the "losing" revisions.
Anytime you write to the same document ID in different clusters, you have to be aware that conflict resolution will choose the winning revision. If you're not careful you can overwrite data you didn't mean to.
Typically a different approach is chosen for your use case. For each user, a "home" cluster is chosen, probably based on geography. All operations for that user are tied to this cluster. If that cluster is down, you can switch to another cluster. Using this approach you avoid writing to multiple clusters, and you would only change clusters under well-defined conditions.
The CAS value is just an opaque identifier of the revision. In your example above, all 3 document writes would end up with different CAS values (which is one of the reasons Couchbase sees them as different and has to choose a winner).
The conflict resolution process is documented in this section of the manual.
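To illustrate the "home cluster" approach described above, here is a minimal sketch of the routing logic. The endpoints, the geography mapping and the write_to_cluster helper are all placeholders standing in for whatever Couchbase SDK call you use (e.g. collection.upsert):

```python
# Sketch of the "home cluster" idea: each user's reads/writes go to one
# cluster chosen by geography, with a failover order if it is unreachable.
# write_to_cluster() is a hypothetical stand-in for the real SDK call.

CLUSTERS = {
    "us": "couchbase://us.example.com",
    "eu": "couchbase://eu.example.com",
    "au": "couchbase://au.example.com",
}
FAILOVER = {"us": ["eu", "au"], "eu": ["us", "au"], "au": ["us", "eu"]}

def home_cluster(country_code: str) -> str:
    # Illustrative geography mapping; real routing could use GeoIP, etc.
    if country_code in ("US", "CA", "MX"):
        return "us"
    if country_code in ("AU", "NZ"):
        return "au"
    return "eu"

def save_document(country_code: str, key: str, doc: dict) -> None:
    home = home_cluster(country_code)
    for region in [home] + FAILOVER[home]:
        try:
            write_to_cluster(CLUSTERS[region], key, doc)  # hypothetical helper
            return
        except ConnectionError:
            continue  # a well-defined condition for switching clusters
    raise RuntimeError("no cluster reachable")

def write_to_cluster(endpoint: str, key: str, doc: dict) -> None:
    # Placeholder: the real version would hold an SDK connection to
    # `endpoint` and call collection.upsert(key, doc).
    print(f"UPSERT {key} -> {endpoint}")
```

Because each document is only ever written to one cluster at a time, XDCR replicates it outward without triggering conflict resolution in the normal case.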