Suppose one is trying to save API responses like the one below for later analytics, i.e. a single response contains about 1,000 persons.
Each object has about 26 properties.
The API query is made every 5 minutes for example.
{person1 : {propertyA:a1, propertyB:b1 ....... propertyZ:z1}
person2 : {propertyA:a2, propertyB:b2 ....... propertyZ:z2}
....
....
person999: {propertyA:a999, propertyB:b999 ....... propertyZ:z999}
person1000: {propertyA:a1000, propertyB:b1000 ....... propertyZ:z1000}}
What is the best way to store this kind of data for later analytics? What kind of database? (the simpler the better)
Should each API response be stored as a single row, or should there be separate columns for each object? Or some other way, such as a JSON database?
Note: the set of persons might change over time, e.g. person100 might stop being updated or become inactive, so a future API response might not include person100. Separately, a record for person1001 might be added (unrelated to person100 becoming inactive).
Additional info :
Data would be updated, say, every 5 minutes for, say, 5 years (to give an idea of the usage/retention of data).
Queries would mostly be limited to how a personX is changing over a given time frame that is likely to range from a few hours to over 6 months.
Persons are likely to have the same or a similar set of attributes, although their values would obviously change over time.
the simpler the better
The simplest would presumably be to keep the results of each API query in a single file, though if you did so, it would probably be best to use a JSONLines format, with one line per person. In either case, I would almost certainly add an 'id' field to make it trivially easy to query for a particular person, and to migrate the data elsewhere should that become necessary.
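As a rough illustration of the JSONLines idea, the sketch below turns one response into one line per person, adding an 'id' and a timestamp to each record. The function names, the timestamp format and the sample data are assumptions made for the example, not anything prescribed by the API.

```python
import json


def response_to_jsonlines(response, timestamp):
    """Turn one API response (a dict keyed by person id) into JSON Lines text.

    Each output line is one person, with 'id' and 'timestamp' fields added.
    """
    lines = []
    for person_id, properties in response.items():
        record = {"id": person_id, "timestamp": timestamp, **properties}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def history_for(jsonlines_text, person_id):
    """Return all records for one person found in JSON Lines text."""
    records = (json.loads(line) for line in jsonlines_text.splitlines() if line)
    return [record for record in records if record["id"] == person_id]


# Hypothetical sample data with only two properties per person.
sample = {
    "person1": {"propertyA": "a1", "propertyB": "b1"},
    "person2": {"propertyA": "a2", "propertyB": "b2"},
}
text = response_to_jsonlines(sample, "2024-01-01T00:00:00Z")
```

Appending the returned text to a file per query (or per person) keeps every record self-describing, which is what makes a later migration straightforward.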
A variant of the above would be to have one file per person, again with a JSONLines format, but with the addition of some kind of timestamp.
Next up the ladder of complexity, you might want to consider a SQLite database. If you want to retain the JSON format, then you'd presumably want to add
indices, e.g. on the person id.
If the JSON object representation of each person is flat and the property list stable, then the conventional wisdom would be to store the data in columnar format. A reasonable compromise would be to move the properties of interest to columns, and to relegate all the other (relevant) details to JSON-valued columns.
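The compromise described above might look roughly like this with Python's built-in sqlite3 module. The table name, column names and the choice of propertyA as the "property of interest" are assumptions made for the sketch.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute(
    """
    CREATE TABLE snapshot (
        ts         TEXT NOT NULL,  -- time of the API query (ISO 8601)
        person_id  TEXT NOT NULL,
        property_a TEXT,           -- a property promoted to its own column
        details    TEXT            -- the remaining properties, as JSON text
    )
    """
)
# Index chosen for "how does personX change over a time frame" queries.
conn.execute("CREATE INDEX snapshot_person_ts ON snapshot (person_id, ts)")


def store_response(conn, ts, response):
    """Store one API response as one row per person."""
    rows = []
    for person_id, props in response.items():
        rest = {k: v for k, v in props.items() if k != "propertyA"}
        rows.append((ts, person_id, props.get("propertyA"), json.dumps(rest)))
    conn.executemany("INSERT INTO snapshot VALUES (?, ?, ?, ?)", rows)


store_response(conn, "2024-01-01T00:00:00Z",
               {"person1": {"propertyA": "a1", "propertyB": "b1"}})
store_response(conn, "2024-01-01T00:05:00Z",
               {"person1": {"propertyA": "a1x", "propertyB": "b1"}})

history = conn.execute(
    "SELECT ts, property_a FROM snapshot "
    "WHERE person_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
    ("person1", "2024-01-01T00:00:00Z", "2024-01-01T23:59:59Z"),
).fetchall()
```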
Of course there are umpteen other database options, and you can climb the complexity ladder as high as it goes. Likewise for cost. You might like to look at TimescaleDB for starters.
Managing Scale
If the data for an individual does not change very often, there will
presumably be various ways to reduce the redundancy.
At one end of the spectrum of possibilities, you could simply discard
an entire record if the prior retained record for that person is essentially the same.
Towards the other end of the spectrum, you could recast the data as a
series of events that would be easy to store as a table:
timestamp id propertyName value
This would have the advantage of giving you flexibility w.r.t. both
the universe of persons and the set of properties of interest.
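One possible way to produce such event rows, sketched under the assumption that only changed values are worth keeping: compare each response with the last value stored for every (id, propertyName) pair. The function name and the use of integer timestamps are illustrative only.

```python
def changes_as_events(timestamp, response, last_seen):
    """Return (timestamp, id, propertyName, value) rows for changed values only.

    `last_seen` maps (id, propertyName) to the last stored value and is
    updated in place, so it must be kept between calls.
    """
    events = []
    for person_id, props in response.items():
        for name, value in props.items():
            key = (person_id, name)
            if key not in last_seen or last_seen[key] != value:
                events.append((timestamp, person_id, name, value))
                last_seen[key] = value
    return events


last_seen = {}
first = changes_as_events(
    0, {"person1": {"propertyA": "a1", "propertyB": "b1"}}, last_seen)
second = changes_as_events(
    300, {"person1": {"propertyA": "a1", "propertyB": "b2"}}, last_seen)
```

Here the first call emits one event per property, while the second emits a single event because only propertyB changed.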
See also https://www.timescale.com/blog/time-series-compression-algorithms-explained/
Footnote: The PmWiki system https://en.m.wikipedia.org/wiki/PmWiki illustrates how a fairly complex “database” system can be constructed using the underlying file system.
My question relates to the minting process to create an NFT.
I might be wrong, but the tokenization function can be compared to a hashing function which takes the media as input and outputs the token.
This is actually already a question in itself, because otherwise the main question may not make sense.
Assuming the comparison to a hash function makes sense, and forgetting about collisions, let's assume the following scenario:
I create a digital artwork and the related NFT. It's published and sells somehow (hopefully :D).
Imagine Mr. XYZW is a well-known digital artist who gets huge revenues from NFTs. He sees my artwork and somehow likes it, but also thinks it would look better if, for example, the colors were simply inverted. Here I'm just mentioning one of all the possible changes he could make. The point is that such changes could easily be unnoticeable to the human eye, but not to the tokenizer, which would in the end clearly create a different token.
Now the problem should be clear.
If what I said makes sense, how is it usually tackled?
In case it doesn't, please help me understand.
Thank you
tokenization function can be compared to an hashing function which takes as input the media and outputs the token
This is an incorrect assumption.
You can compare an NFT collection (at least per the most widely used standard - ERC-721) to a key-value dictionary, where the key is an integer ID, and the value is a URL. The standard defines that the URL should lead to a JSON containing the token name, description, and image URL.
But there's no hashing function that would calculate the token parameters based on the image.
Each collection (holding several NFTs) is a smart contract deployed on a different address (e.g. 0x12345). Also, each NFT within its collection has a unique ID (e.g. 1).
The combination of the collection address and the token ID can be used as a unique identifier of each NFT (e.g. 0x12345 / 1).
It's technically possible for multiple different NFTs (no matter whether they're in the same or different collections) to lead to very similar images or even the same image. But the combination of collection address and token ID is always unique.
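The dictionary analogy can be pictured with a toy in-memory model. This is not an ERC-721 implementation; the second collection address and all URLs are made-up placeholders.

```python
# Each collection maps an integer token ID to a metadata URL.
# Collections themselves are keyed by their (placeholder) contract address.
collections = {
    "0x12345": {
        1: "https://example.com/meta/1.json",
        2: "https://example.com/meta/2.json",
    },
    # A different collection whose token 1 points at the same metadata.
    "0x6789a": {
        1: "https://example.com/meta/1.json",
    },
}


def nft_identifier(collection_address, token_id):
    """The pair (collection address, token ID) identifies one NFT."""
    return (collection_address, token_id)


def token_url(collection_address, token_id):
    """Look up the metadata URL; nothing here is derived from the image."""
    return collections[collection_address][token_id]
```

In this model, `("0x12345", 1)` and `("0x6789a", 1)` are distinct NFTs even though both resolve to the same URL, which mirrors the point that identical or near-identical images do not make two tokens the same.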
I am trying to execute 2 choices one after another. Both execute so fast that they have the same timestamp.
timestamp = 1607079031453,
This makes it difficult to sort them in ascending order in a table.
Can you suggest any workaround for this?
getTime in DAML does not give you "system time", as there is no notion of system time on a distributed system. It gives you something called "Ledger Time" documented here: https://docs.daml.com/concepts/time.html
Ledger Time is specified by the submitting node, and a property of the entire transaction. That means all calls to getTime within a single transaction will return the same time.
If you create two identical contracts in a single transaction, there are only two ways to distinguish them:
Position in the transaction tree
Contract Id
Contract Id is a hash so gives you no useful ordering properties other than some value to order by stably. If you want to order by the order in which contracts were created, you need to use the position in the transaction tree.
I don't know where you store your data, or which API you use to store it there. But suppose you use a subscription to the Transaction Service, which returns Create events in order, and store them in an SQL database: you can then just put an auto-incrementing integer column on your table and sort by that integer.
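A minimal sketch of the auto-incrementing column idea, using SQLite through Python's sqlite3 module. The `events` list stands in for whatever a Transaction Service subscription would deliver; no Ledger API calls are shown, and the table layout and contract IDs are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE created_contract (
        seq         INTEGER PRIMARY KEY AUTOINCREMENT,  -- arrival order
        contract_id TEXT NOT NULL,
        ledger_time INTEGER NOT NULL
    )
    """
)


def record_create_events(conn, events):
    """Insert events in the order received; `seq` records that order."""
    for contract_id, ledger_time in events:
        conn.execute(
            "INSERT INTO created_contract (contract_id, ledger_time) "
            "VALUES (?, ?)",
            (contract_id, ledger_time),
        )


# Two contracts with the same ledger time, in creation order.
record_create_events(conn, [("00ab", 1607079031453), ("009f", 1607079031453)])

ordered = [
    row[0]
    for row in conn.execute(
        "SELECT contract_id FROM created_contract ORDER BY seq"
    )
]
```

Sorting by `seq` keeps the two rows in the order they were received, even though their timestamps are identical and their contract IDs would sort the other way round.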
@bame's answer is mostly geared towards the DAML language; I'll explore it from the point of view of the Ledger API.
If your objective is to assess that one choice effectively occurred after the other and the two choices occur as part of different transactions you could use offsets for it.
Offsets are effectively an opaque binary blob from the client's perspective, but they must be lexicographically comparable: of two offsets, the transaction with the lower offset occurred before the one with the higher offset.
Note that this only applies if the two choices were taken as part of two different transactions. If they occurred in the same transaction, the choice that occurred before will appear before as you traverse the transaction tree in preorder.
My application, which currently uses MySQL, makes phone calls, fetching information about the dialed numbers and the caller ID from the DB. I want to have a group in which a list of caller IDs is defined in Redis, let's say 10 caller IDs. For each dialing, I want to SELECT/GET the caller ID from the Redis server in turn, not just a random number. Is that possible with Redis? It's like load balancing across the list of keys in Redis, to make sure all keys are given a fair chance to be used.
An example of the data set will be a phonebook which will be the key, and there will be say 10 phone numbers in that phonebook. I want to use those numbers for every unique dialing so all numbers in the phonebook are used evenly for dialing.
I can do that in MySQL by setting up an update field in the table, but that's going to increase the number of UPDATEs on MySQL. Is this something that can easily be done with Redis? I can't seem to think of the logic for how to do that.
Thanks.
There are two ways to do it in Redis:
ZSET
You can track the usage frequency with the score of a zset entry. Each time you fetch the entry with the lowest score from Redis, you increase its score by one.
The side benefit is you can easily see exactly how many times each element has been used.
LIST
If you're not bothered about tracking usage counts, you can also do it with a Redis list. Just use RPOPLPUSH source destination with the same list as both source and destination to achieve a round-robin load-balancing effect (since Redis 6.2, LMOVE is the recommended replacement for RPOPLPUSH). It takes an element from the tail of the list, puts it back at the head, and returns the value of the moved element.
The benefit is there is only one command to run and the operation is atomic.
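To make the two behaviours concrete, here is a pure-Python model of them. It does not talk to Redis and uses no Redis client calls: a deque stands in for the list and a dict of scores for the zset, and the phone numbers are placeholders.

```python
from collections import deque


def next_round_robin(numbers):
    """Model of RPOPLPUSH with the same list as source and destination:
    pop the tail element, push it onto the head, and return it."""
    value = numbers.pop()
    numbers.appendleft(value)
    return value


def next_least_used(usage):
    """Model of the ZSET approach: pick the member with the lowest score,
    then add one to its score. Ties are broken by member name, mirroring
    how Redis orders members that share a score."""
    member = min(usage, key=lambda m: (usage[m], m))
    usage[member] += 1
    return member


phonebook = deque(["555-0001", "555-0002", "555-0003"])
picked = [next_round_robin(phonebook) for _ in range(4)]

usage = {"555-0001": 0, "555-0002": 0, "555-0003": 0}
used = [next_least_used(usage) for _ in range(4)]
```

One design point worth noting: against a real Redis server the ZSET variant is two steps (read the lowest member, then increment it), so with concurrent callers it would presumably need to run inside a transaction or script to stay fair, whereas the list variant is a single atomic command.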
I have recently run across these terms a few times, but I am quite confused about how they work and when they are usually used.
Well, think of it this way.
If you use an array, a simple index-based data structure, and fill it up with random stuff, finding a particular entry gets to be a more and more expensive operation as you fill it with data, since you basically have to start searching from one end toward the other, until you find the one you want.
If you want to get faster access to data, you typically resort to sorting the array and using a binary search. While this speeds up looking up an existing value, it makes inserting new values slow, as you need to move existing elements around whenever you insert an element in the middle.
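The trade-off between the two array approaches can be sketched with Python's standard bisect module. The function names and sample values are just for illustration.

```python
import bisect


def linear_search(items, target):
    """Check entries one by one from the start; works on unsorted data."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1


def binary_search(sorted_items, target):
    """Halve the search range each step; requires sorted data."""
    index = bisect.bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1


values = [3, 8, 15, 23, 42]
# Inserting keeps the list sorted, but every later element has to shift
# one place to make room, which is the slow part mentioned above.
bisect.insort(values, 10)
```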
A hashtable, on the other hand, has an associated function that takes an entry, and reduces it to a number, a hash-key. This number is then used as an index into the array, and this is where you store the entry.
A hashtable revolves around an array, which initially starts out empty. Empty does not mean zero length: the array starts out with a size, but all the elements in the array contain nothing.
Each element has two properties, data, and a key that identifies the data. For instance, a list of zip-codes of the US would be a zip-code -> name type of association. The function reduces the key, but does not consider the data.
So when you insert something into the hashtable, the function reduces the key to a number, which is used as an index into this (empty) array, and this is where you store the data, both the key, and the associated data.
Then, later, you want to find a particular entry that you know the key for, so you run the key through the same function, get its hash-key, go to that particular place in the hashtable, and retrieve the data there.
The theory goes that the function that reduces your key to a hash-key, that number, is computationally much cheaper than the linear search.
A typical hashtable does not have an infinite number of elements available for storage, so the hash-key is typically reduced further to an index which fits into the size of the array. One way to do this is to simply take the hash-key modulo the size of the array. For an array with a size of 10, hash-keys 0-9 will map directly to an index, hash-keys 10-19 will map down to 0-9 again, and so on.
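The modulus step can be shown in a couple of lines. Python's built-in `hash` is used here only as a convenient stand-in for "the function"; any hash function would do.

```python
def bucket_index(key, array_size):
    """Reduce a key to a hash-key, then to an index that fits the array."""
    hash_key = hash(key)           # stand-in hash function
    return hash_key % array_size   # always within range(array_size)


# For an array of size 10, numbers 0-9 map to themselves
# and 10-19 wrap around to 0-9 again.
indexes = [n % 10 for n in (0, 9, 10, 19, 20)]
```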
Some keys will be reduced to the same index as an existing entry in the hashtable. At this point the actual keys are compared directly, with all the rules associated with comparing the data types of the key (i.e. normal string comparison, for instance). If there is a complete match, you either disregard the new data (it already exists), or you overwrite (you replace the old data for that key), or you add it (multi-valued hashtable). If there is no match, which means that though the hash-keys were identical the actual keys were not, you typically find a new location to store that key+data in.
Collision resolution has many implementations, and the simplest one is to just go to the next empty element in the array. This simple solution has other problems though, so finding the right resolution algorithm is also a good exercise for hashtables.
Hashtables can also grow, if they fill up completely (or close to), and this is usually done by creating a new array of the new size, and calculating all the indexes once more, and placing the items into the new array in their new locations.
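Putting the pieces together, the sketch below is a toy hashtable that resolves collisions by moving to the next slot (linear probing), overwrites on a key match, and grows by rehashing everything into a larger array. The class name, the 75% threshold and the doubling factor are choices made for this example, and deletion is left out to keep it short.

```python
class LinearProbingTable:
    """Toy hash table: open addressing with linear probing, no deletion."""

    def __init__(self, size=8):
        self._slots = [None] * size  # each slot is None or a (key, value) pair
        self._count = 0

    def _find_slot(self, key):
        """Return the index holding `key`, or the empty slot where it belongs."""
        index = hash(key) % len(self._slots)
        while self._slots[index] is not None and self._slots[index][0] != key:
            index = (index + 1) % len(self._slots)  # collision: try next slot
        return index

    def put(self, key, value):
        # Grow before the array gets more than 75% full, so that probing
        # always finds an empty slot eventually.
        if (self._count + 1) * 4 > len(self._slots) * 3:
            self._grow()
        index = self._find_slot(key)
        if self._slots[index] is None:
            self._count += 1
        self._slots[index] = (key, value)  # insert, or overwrite on key match

    def get(self, key, default=None):
        slot = self._slots[self._find_slot(key)]
        return default if slot is None else slot[1]

    def _grow(self):
        """Create a larger array and recompute every entry's index."""
        old_entries = [slot for slot in self._slots if slot is not None]
        self._slots = [None] * (len(self._slots) * 2)
        self._count = 0
        for key, value in old_entries:
            self.put(key, value)

    def __len__(self):
        return self._count
```

For instance, with the default size of 8 the integer keys 1 and 9 both reduce to index 1, so the second one lands in the next free slot, and both remain retrievable by key.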
The function that reduces the key to a number does not produce values that follow the order of the keys (it is not the case that "AAA" becomes 1 and "AAB" becomes 2, for instance), so the hashtable is not sorted in any meaningful way.
There is also a good Wikipedia article available on the subject.
lassevk's answer is very good, but might contain a little too much detail. Here is the executive summary. I am intentionally omitting certain relevant information which you can safely ignore 99% of the time.
There is no important difference between hash tables and hash maps 99% of the time.
Hash tables are magic
Seriously. It's a magic data structure which all but guarantees three things. (There are exceptions. You can largely ignore them, although learning them someday might be useful for you.)
1) Everything in the hash table is part of a pair -- there is a key and a value. You put in and get out data by specifying the key you are operating on.
2) If you are doing anything by a single key on a hash table, it is blazingly fast. This implies that put(key,value), get(key), contains(key), and remove(key) are all really fast.
3) Generic hash tables fail at doing anything not listed in #2! (By "fail", we mean they are blazingly slow.)
When do we use hash tables?
We use hash tables when their magic fits our problem.
For example, caching frequently ends up using a hash table. Let's say we have 45,000 students in a university and some process needs to hold on to records for all of them. If you routinely refer to students by ID number, then an ID => student cache makes excellent sense. The operation you are optimizing for with this cache is fast lookup.
Hashes are also extraordinarily useful for storing relationships between data when you don't want to go whole hog and alter the objects themselves. For example, during course registration, it might be a good idea to be able to relate students to the classes they are taking. However, for whatever reason you might not want the Student object itself to know about that. Use a studentToClassRegistration hash and keep it around while you do whatever it is you need to do.
They also make a fairly good first choice for a data structure except when you need to do one of the following:
When Not To Use Hash Tables
Iterate over the elements. Hash tables typically do not do iteration very well. (Generic ones, that is. Particular implementations sometimes contain linked lists which are used to make iterating over them suck less. For example, in Java, LinkedHashMap lets you iterate over keys or values quickly.)
Sorting. If you can't iterate, sorting is a royal pain, too.
Going from value to key. Use two hash tables. Trust me, I just saved you a lot of pain.
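The "use two hash tables" advice for going from value to key could be sketched like this, with Python dicts playing the role of hash tables. The class name is made up, and the sketch assumes each value belongs to at most one key.

```python
class TwoWayMap:
    """Keeps a forward and a reverse dict in sync.

    Assumes values are hashable and unique (one key per value).
    """

    def __init__(self):
        self._forward = {}  # key -> value
        self._reverse = {}  # value -> key

    def put(self, key, value):
        if value in self._reverse and self._reverse[value] != key:
            raise ValueError("value already mapped to another key")
        if key in self._forward:
            # Drop the stale reverse entry before overwriting.
            del self._reverse[self._forward[key]]
        self._forward[key] = value
        self._reverse[value] = key

    def value_for(self, key):
        return self._forward[key]

    def key_for(self, value):
        return self._reverse[value]


students = TwoWayMap()
students.put(1001, "alice")
students.put(1002, "bob")
```

Both directions are then single-key lookups, so `students.key_for("bob")` is as fast as `students.value_for(1002)`, at the cost of storing every pair twice.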
If you are talking in terms of Java, both are collections which allow objects to be added, deleted and updated, and both use hashing algorithms internally.
The significant difference, however, if we talk in reference to Java, is that Hashtable is inherently synchronized and hence thread-safe, while HashMap is not a thread-safe collection.
Apart from the synchronization, the internal mechanism to store and retrieve objects is hashing in both the cases.
If you need to see how hashing works, I would recommend a bit of googling on data structures and hashing techniques.
Hashtables/hashmaps associate a value (called 'key' for disambiguation purposes) with another value. You can think of them as a kind of dictionary (word: definition) or a database record (key: data).