Get maximum value of number of keys (hlen) for hashes that match given pattern in Redis - mysql

I have many records that start with the pattern as shown below:
user:8d6120be2e7247e49545502092c389fd and
user:000935dc3bb16bd2e0de50988751acfd
Though the hashes all represent user objects, one hash may have more keys than another. For example, if a user is a Manager, their hash may have a few additional keys such as Reportees, Benefits, etc. Without actually looking into all the records, is there a way to know the maximum number of keys in any hash? I am in the process of converting the Redis structure into a relational schema, and this would give me an idea of which columns should be present.

Just use HLEN if your user:<hash> keys are hashes (created with HSET). Most data structures in Redis have a command to get their length:
LLEN for a LIST
SCARD for a SET
ZCARD for a SORTED SET
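That said, there is no single command that returns the maximum HLEN over a key pattern; you have to visit each matching key. SCAN plus HLEN keeps that cheap and non-blocking. Below is a minimal sketch with the redis-py client (the connection details and the user:* pattern are assumptions taken from the question):

import redis

r = redis.Redis()                        # assumes a local Redis instance

max_fields = 0
for key in r.scan_iter(match="user:*"):  # iterate matching keys without blocking the server
    max_fields = max(max_fields, r.hlen(key))

print(max_fields)

Since the goal is a relational schema, it may be even more useful to collect the union of all field names with HKEYS inside the same loop, rather than just the maximum count.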

Related

How to store large numbers of Python dictionaries in a datastore and filter/query it?

I have a python dictionary with the following fields:
{
    "attribute_a": "3898801b-4595-4113-870b-ee5906457edf",  # UUID
    "attribute_b": "50df4979-7448-468a-994c-96797b0f958b",  # UUID
    "attribute_c": "0f6b2331-f86b-4e76-9efe-42ef8d843273",  # UUID
    "attribute_d": "blah1",  # string
    "attribute_e": "blah2",  # string
    "attribute_f": "72.154.80.0",  # IP Address
    "attribute_g": "blah3",  # string
    "created_timestamp": datetime.datetime.now()  # datetime
}
Now comes the hard part: about 3 million such records will be created daily, and I need to store at least the last 90 days' worth of them in some kind of datastore.
Once each record is stored, it will never need to be updated (but it can be deleted). I will need the ability to occasionally query this datastore to find all records matching any of the first 7 attributes and/or a date comparison on the last attribute, created_timestamp. Mostly I will be filtering only on attribute_a and will want the matching records sorted by created_timestamp.
Which datastore should I use? I fear that if I try to store this vast quantity of data in a MySQL table with even one index on it, insertions will become too slow. And if there are no indexes on it, querying it becomes impossible. So I'm leaning towards a NoSQL solution like MongoDB.
However, I have no experience with NoSQL and am not sure it can be used for this purpose. Will I be able to filter across multiple fields? Will it handle the created_timestamp field as an actual date instead of a string? Should I make attribute_a the primary key and the other attributes secondary keys? If I do so, will inserts become exceedingly slow? Can it return the data to me sorted by created_timestamp?
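For reference, the filtering and date-sorting described above maps onto a compound index in MongoDB. A rough sketch with pymongo follows (the database name, collection name, and example UUID are purely illustrative; this only shows that the query shape is expressible, it is not a datastore recommendation):

import datetime
from pymongo import MongoClient, ASCENDING, DESCENDING

coll = MongoClient()["eventsdb"]["records"]       # hypothetical database and collection names

# compound index: equality filter on attribute_a first, then range/sort on created_timestamp
coll.create_index([("attribute_a", ASCENDING), ("created_timestamp", DESCENDING)])

cutoff = datetime.datetime.now() - datetime.timedelta(days=90)
cursor = (coll.find({"attribute_a": "3898801b-4595-4113-870b-ee5906457edf",
                     "created_timestamp": {"$gte": cutoff}})
              .sort("created_timestamp", DESCENDING))

pymongo stores datetime objects as real BSON dates, so range comparisons and sorting on created_timestamp work without treating the value as a string.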

Anonymization of Account Numbers in 2TB of CSV's

I have ~2TB of CSVs where the first 2 columns contain two ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.
The Question:
Standard hashing algorithms produce really long strings, but I will have to do a lot of ID matching (i.e. 'for the subset of rows containing ID XXX, do...') to process the anonymized data, so this is not ideal. Is there a better way?
For example, if I know there are ~10 million unique account numbers, is there a standard way of using the set of integers [1:10 million] as replacement/anonymized IDs?
The computational constraint is that data will likely be anonymized on a 32-core ~500GB server machine.
I will assume that you want to make a single pass, one CSV with ID numbers as input, another CSV with anonymized numbers as output. I will also assume the number of unique IDs is somewhere on the order of 10 million or less.
It is my thought that it would be best to use some totally arbitrary one-to-one function from the set of ID numbers (N) to the set of de-identified numbers (D). This would be more secure. If you used some sort of hash function, and an adversary learned what the hash was, the numbers in N could be recovered without too much trouble with a dictionary attack. Instead I suggest a simple lookup table: ID 1234567 maps to de-identified number 4672592, etc. The correspondence would be stored in another file, and an adversary without that file would not be able to do much.
With 10 million or fewer records, on a machine such as you describe, this is not a big problem. A sketch program in Python:
import csv
import random

mapping = {}                                   # original ID -> de-identified number
unused_numbers = list(range(10_000_000))
random.shuffle(unused_numbers)                 # lets us pop() a random unused number in O(1)

# "input.csv", "output.csv" and "lookup_table.csv" are placeholder file names
with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for record in csv.reader(src):
        for i in (0, 1):                       # the first two columns hold the ID numbers
            n = record[i]
            if n not in mapping:
                mapping[n] = unused_numbers.pop()
            record[i] = mapping[n]
        writer.writerow(record)

with open("lookup_table.csv", "w", newline="") as f:
    csv.writer(f).writerows(mapping.items())   # keep this file secret; it is the only way back
It seems you don't care about the IDs being reversible, but if it helps, you can try one of the format-preserving encryption schemes. They are pretty much designed for this use case.
Otherwise, if the hashes are too long, you can always just truncate them. Even if you replace each digit of the original ID with a hex digit from the hash, collisions are unlikely. You could do a first pass over the file to check for collisions, though.
PS. If you end up using hashing, make sure you prepend a salt of a reasonable size. Hashes of IDs in the range [1:10M] would be trivial to brute-force otherwise.
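As a rough illustration of that salt-then-truncate idea (the hash function, salt size, and output length here are arbitrary choices, not a recommendation):

import hashlib
import secrets

salt = secrets.token_hex(16)   # keep this secret; without a salt the small ID space is trivial to brute-force

def anonymize(account_id: str, length: int = 12) -> str:
    digest = hashlib.sha256((salt + account_id).encode()).hexdigest()
    return digest[:length]     # truncated hash; scan the output for collisions afterwards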

Delete from Redis Sorted Set based on JSON Property

I have a large number of items stored in a Redis sorted set (on the order of 100,000) that get updated fairly frequently. These items are objects encoded as JSON strings, and the rank for sorting in the set is derived (on insert, by my code) from a date/time property on the object.
Each item in the set has an Id property (which is a Guid encoded as a string) which uniquely identifies the item within the system.
When these items are updated, I need to either update the item within the sorted set, or delete and reinsert the item. The problem I have is how to find that item to perform the operation.
What I'm currently doing is loading the entire contents of the sorted set into memory, operating on that collection in my code and then writing the complete collection back to Redis. Whilst this works, it's not particularly efficient and won't scale well if the lists start to grow very large.
Would anybody have any suggestions as to how to do this in a more efficient manner? The only unique identifier I have for the items is the Id property as encoded in the item.
Many Thanks,
Richard.
Your case is probably just a bad design choice.
You shouldn't store JSON strings in the sorted set: the sorted set should hold only identifiers, and the whole serialized JSON objects should be stored in a hash.
This way, when you need to update an object, you update the whole hash field using hset, and you can locate the whole object by its unique identifier.
On the other hand, every key in the hash must also be present in your sorted set. When you add an object to the sorted set, you're adding its unique identifier.
When you need to list your objects in a particular order, you do the following operations:
You get a page of identifiers from the sorted set (for example, using zrange).
You get all objects from the page giving their identifiers to a hmget command.
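A rough sketch of that layout with the redis-py client (the key names items:data and items:by-date are purely illustrative):

import json
import redis

r = redis.Redis()                                    # assumes a local Redis instance

def upsert(item, timestamp):
    # the full JSON lives in a hash, keyed by the item's unique Id...
    r.hset("items:data", item["Id"], json.dumps(item))
    # ...while the sorted set holds only the Id, scored by the date/time value
    r.zadd("items:by-date", {item["Id"]: timestamp})

def remove(item_id):
    r.hdel("items:data", item_id)
    r.zrem("items:by-date", item_id)

def page(start, stop):
    ids = r.zrange("items:by-date", start, stop)     # a page of identifiers, in date order
    if not ids:
        return []
    return [json.loads(raw) for raw in r.hmget("items:data", ids) if raw]

With this layout an update touches only the one hash field (and, if the date changed, one ZADD), instead of rewriting the whole collection.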

storing multiple values as binary in one field

I have a project where I need to store a large number of values.
The data is a dataset holding 1024 2-byte unsigned integer values. Currently I store one value per row, together with a timestamp and a unique ID.
This data is stored continuously, based on a time trigger.
What I would like to do is store all 1024 values in one field. Would it be possible to write a routine that stores all 1024 2-byte integer values in one field as binary, maybe a BLOB field?
Thanks.
Br.
Enghoej
Yes. You can serialize your data into a byte array, and store it in a BLOB. 2048 bytes will be supported in a BLOB in most databases.
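For example, a minimal sketch of the packing step in Python (the table and column names in the comment are only placeholders):

import struct

values = list(range(1024))                # placeholder data: 1024 unsigned 16-bit integers
blob = struct.pack("<1024H", *values)     # little-endian, 2 bytes per value -> a 2048-byte blob
restored = struct.unpack("<1024H", blob)  # unpacking reverses it on the way out

# the blob is then passed as an ordinary query parameter, e.g. with a MySQL driver:
# cursor.execute("INSERT INTO samples (sample_id, ts, data) VALUES (%s, %s, %s)",
#                (sample_id, ts, blob))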
One big question to ask yourself is "how will I need to retrieve this data?" Any reports or queries such as "what IDs have value X set to Y" will have to load all rows from the table and parse the data AFAIK. For instance, if this were user configuration data, you might need to know which users had a particular setting set incorrectly.
In SQL Server, I'd suggest considering using an XML data type and storing a known schema, since this can be queried with XPath. MySQL did not support this as of 2007, so that may not be an option for you.
I would definitely consider breaking out any data that you might possibly need to query in such a manner into separate columns.
Note also that you will be unable to interpret BLOB data without a client application.
You always want to consider reporting. Databases often end up with multiple clients over the years.

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001. How does the fast searching process work using a hash table?
Isn't it the same as using "SELECT * from .." in MySQL? I have read a lot; people say "SELECT *" searches from beginning to end, but a hash table does not. Why and how?
By using a hash table, are we reducing the number of records we are searching? How?
Can anyone demonstrate how to insert into and retrieve from a hash table in MySQL query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * from table WHERE value = S*
Isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we use to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the number of lists:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
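A tiny Python sketch of the same idea (17 buckets as above; the data is taken from the example table, and the bucket count is arbitrary):

NUM_BUCKETS = 17
buckets = [[] for _ in range(NUM_BUCKETS)]           # one plain list per bucket

def insert(key, record):
    buckets[hash(key) % NUM_BUCKETS].append((key, record))

def find(key):
    # only one bucket is scanned linearly, never the whole collection
    for k, record in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return record
    return None

insert("001", "Alex")
insert("002", "Micheal")
print(find("001"))                                   # -> Alex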
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table, it will break any old queries that relied on the columns having a certain number and order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want and jump directly to them.
Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element, so searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database use. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query, since the row will always have the same id (it is derived from the input you build the hash value from, and you can always recreate that id through the hash function).
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To deal with this, you need a deterministic strategy for the case where two inputs map to the same id. Check Wikipedia for more on this; it's called collision handling and should be mentioned in the same article as hash tables.
MySQL probably uses hash tables somewhere because of the O(1) lookup property norheim.se (up) mentioned.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of MySQL. If there is a structure in there called a "hash table", it is probably a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling I just mentioned.
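For instance, a toy version of that first-letter scheme (the field names and structure are purely illustrative, and only names starting with A-Z are handled):

# one slot per initial letter A-Z
table = {chr(c): [] for c in range(ord("A"), ord("Z") + 1)}

def add(person):
    table[person["first_name"][0].upper()].append(person)

def lookup(first_name):
    # only the entries sharing the initial letter are examined
    return [p for p in table[first_name[0].upper()]
            if p["first_name"] == first_name]

add({"first_name": "Alex"})
print(lookup("Alex"))                # -> [{'first_name': 'Alex'}]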