I have an object store in an IDB that has a simple (non-compound) index on a field X. This index is not unique (many items may have the same value for X).
I'd like to query the IDB to return all items that have an X value of either "foo", "bar", or "bat".
According to the documentation, IDBIndex.getAll() takes either a key (in my case a string) or an IDBKeyRange. However, it's not obvious to me how to construct an IDBKeyRange from an arbitrary set of keys and get the union of all results for those keys.
You cannot do this in a single request. IndexedDB does not currently support "OR"-style queries.
The alternative is to do one request per value: for each value, call getAll on the index, then concatenate the resulting arrays into one (merging duplicates if necessary). You don't actually incur that many round trips against the DB, since you are using getAll rather than iterating a cursor. For the index itself, you want a store of, say, "things", where each thing has a property such as "tags" holding an array of string values, and the index you create on the "tags" property should be flagged as multi-entry.
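A rough sketch of that approach (assuming an already-open IDBDatabase with a "things" store, a multi-entry "tags" index, and an "id" primary key; all names here are illustrative):

    // One getAll() per value against the "tags" index, results merged.
    function getThingsWithAnyTag(db: IDBDatabase, tags: string[]): Promise<any[]> {
      const index = db
        .transaction('things', 'readonly')
        .objectStore('things')
        .index('tags');

      const requests = tags.map(
        (tag) =>
          new Promise<any[]>((resolve, reject) => {
            const req = index.getAll(tag);
            req.onsuccess = () => resolve(req.result);
            req.onerror = () => reject(req.error);
          })
      );

      // Concatenate the per-value arrays and drop duplicates by primary key,
      // since a thing tagged both "foo" and "bar" shows up in both results.
      return Promise.all(requests).then((lists) => {
        const byId = new Map<unknown, any>();
        for (const item of lists.flat()) byId.set(item.id, item);
        return [...byId.values()];
      });
    }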
There are, of course, creative hacky solutions. Here is one. Keep in mind it only works for exact matches: it is useless if things can carry extra tags beyond the ones you query for and you still want them to match.

Consider each distinct set of values, ignoring order, and call each one a group: foo is 1, bar is 2, bat is 3, foo-bar is 4, foo-bat is 5, bar-bat is 6, and so on. Give each group a key, like the numeric counter in this example, and store the group key as a property on the object. Each time you go to store an object, calculate its group key first. You can precalculate all group keys, or develop a hash-style function that generates a particular key from an arbitrary set of string values. Sure, you pay a little more up front at storage time and when building the query, but you save a lot of processing because IndexedDB does all the work after that. So you want a simple, fast hash, modified so that the value set is sorted lexicographically before hashing (so that differences in value order do not cause differences in the hash value).

To explain more concretely: in the "things" object store, each thing gets a property called "tags-hash", and you create a basic index on it (not unique, not multi-entry). Each time you put a thing in the store, you calculate the value of "tags-hash" and set it before calling put. Then, each time you want to query, you calculate the hash of the array of tags you are querying by and call getAll(calculatedHashValue), and it gives you all things that have exactly those tags.
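A rough sketch of that scheme, assuming the "things" store and "tags-hash" index described above; the djb2-style string hash is just one arbitrary choice of a simple, fast hash:

    // Order-independent hash of a tag set: sort, join, then hash the string.
    function tagsHash(tags: string[]): number {
      const canonical = [...tags].sort().join('|');
      let h = 5381; // djb2-style hash
      for (let i = 0; i < canonical.length; i++) {
        h = ((h * 33) ^ canonical.charCodeAt(i)) >>> 0;
      }
      return h;
    }

    // Stamp the group key onto the object before every put().
    function putThing(store: IDBObjectStore, thing: { tags: string[] }): IDBRequest {
      return store.put({ ...thing, 'tags-hash': tagsHash(thing.tags) });
    }

    // Query: everything whose tag set is exactly `tags`,
    // via the basic index on the "tags-hash" property.
    function getThingsWithExactTags(tagsHashIndex: IDBIndex, tags: string[]): IDBRequest {
      return tagsHashIndex.getAll(tagsHash(tags));
    }

Since two different tag sets can collide on such a hash, it may be worth double-checking the tags on the returned objects.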
Related
I need to store a dynamic number of integer attributes (1-8). I'm storing them in individual columns in the database table, like:
attribute_1, attribute_2, ..., attribute_8
This makes for a fairly messy model when methods need to reference these, as well as an unwieldy database table and schema.
These are assigned default values (but are overridable on a form), and represent unique identifiers for the user.
For instance, a Brew is composed of up to eight batches before they are mixed together in a fermenter. The brewer might want to go back and refer to any one of these by its unique identifying number. I'm assigning these unique values based on the last highest value when creating a new Brew. However, the user may want to override these default values to some other numbers.
In most cases (smaller breweries), they'll probably only use the first two, but some larger breweries would use all eight.
There must be a better way to store these than having eight different attributes with the same name and a number at the end.
I'm using MySQL. Is there an easy/concise way to store an array or a JSON hash but still be able to edit these values on a form?
I would not store attributes like that; it will limit you in the future. Say you want to know which brews have used attribute_4: you would have to scan the entire brews table and pull apart the attributes field (or check each column) to see whether 4 is in there.
Much better to separate Brew and Attributes into two tables and link them with a foreign key (for example, an attributes table with brew_id, position, and value columns).
Another benefit is that you can add attributes easily.
Storing JSON is OK, as #max pointed out; I am just proposing the normalized database way of doing it.
How do I match a set against a big collection of sets stored in a database?
(The collection may have millions of sets.)
Detailed Statement
[Prerequisite] A cluster has a special property, which is a set of attributes.
I will get an entity that has a set of attributes.
If an existing cluster has exactly the same set of attributes (neither more nor less), I add the entity to that cluster. Otherwise I create a new cluster whose property is the attribute set of the new entity.
That is the clustering process.
The problem is how I should store the data so that the system runs smoothly on a very large dataset without performance issues.
What kind of database should I use for this: SQL or NoSQL?
Possible solutions I have thought of:
[MySQL] Store the attributes with the cluster in a table (cluster_attribute) so that clusterId to attributeId is an m:n relation.
Whenever an entity comes in, we run:
select clusterId, count(1)
from cluster_attribute
where attributeId in (<comma-separated attribute IDs>)
group by clusterId;
But this is not good, since we may get back a long list of clusterIds that satisfy the above query.
Against the same table, we could instead run a query like:
select a.clusterId, count(1) cnt from cluster_attribute a
inner join cluster_attribute b on a.clusterId = b.clusterId
where b.attributeId in (<comma-separated attribute IDs>)
group by a.clusterId
having cnt = #sizeOfEntityAttributeSet;
This will scan many rows, resulting in a slow query.
We store the attributes as a sorted concatenation, separated by some character such as |, and index that column. This lets us query for an exact set faster, but whenever I need to know which clusters have a certain single attribute (A1), the query becomes slow, since I would need a regexp search in MySQL.
Items in a set are unique; that is, [a1, b1, c1] is valid while [a1, b1, a1, c1] is not.
There are millions of sets, each with hundreds of items.
Have two columns in the table for searching. One is the exact, complete list of the values, sorted; it is a long string, probably TEXT. The other is a hash of that string. I might suggest MD5, then chop it to 32 bits and put it into an INT UNSIGNED (or BINARY(4)). INDEX this column, but not UNIQUE.
Now, to check for existence, do likewise with the incoming 'set' -- build the string, and compute the hash. Look up the hashed value in the table. It will give you only a few rows, including some duds. Double check with the long string.
WHERE hash = $hash
AND str = '$str'
The lookup will be quite fast. The prep work (building the sorted string and computing the hash) will not be too difficult. It will be quite easy to code in, say, PHP.
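For instance, the prep work might look something like this in TypeScript on Node (the answer suggests PHP; the table name in the comment below is a placeholder):

    import { createHash } from 'crypto';

    // Build the canonical sorted string plus a 32-bit hash chopped from MD5,
    // as described above.
    function canonicalize(values: string[]): { str: string; hash: number } {
      const str = [...values].sort().join(',');
      const md5 = createHash('md5').update(str).digest();
      return { str, hash: md5.readUInt32BE(0) }; // first 32 bits, unsigned
    }

    // The lookup is then the query shown below, with both values bound in:
    //   SELECT id FROM sets WHERE hash = ? AND str = ?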
Caveats:
This works only for an exact match of the set.
It scales quite well. If you have more than, say, a billion sets, then a 32-bit hash won't be big enough. (But BIGINT and a longer BINARY would work.)
I have a large number of items stored in a Redis Sorted Set (of the order 100,000) that fairly frequently get updated. These items are objects encoded as JSON strings, and the rank for sorting in the set is derived (on insert, by my code) from a date/time property on the object.
Each item in the set has an Id property (which is a Guid encoded as a string) which uniquely identifies the item within the system.
When these items are updated, I need to either update the item within the sorted set, or delete and reinsert the item. The problem I have is how to find that item to perform the operation.
What I'm currently doing is loading the entire contents of the sorted set into memory, operating on that collection in my code and then writing the complete collection back to Redis. Whilst this works, it's not particularly efficient and won't scale well if the lists start to grow very large.
Would anybody have any suggestions as to how to do this in a more efficient manner? The only unique identifier I have for the items is the Id property as encoded in the item.
Many Thanks,
Richard.
Your case is probably just a bad design choice.
You shouldn't store JSON strings in sorted sets: you should store identifiers there, and keep the whole JSON-serialized objects in a hash.
This way, when you need to update an object, you overwrite its field in the hash using hset, and you can locate the whole object by its unique identifier.
On the other hand, every identifier in the hash must also be present in your sorted set: when you add an object to the sorted set, you add its unique identifier.
When you need to list your objects in a particular order, you do the following operations:
You get a page of identifiers from the sorted set (for example, using zrange).
You get all the objects for that page by giving their identifiers to an hmget command.
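A sketch of that layout, assuming the ioredis client and illustrative key names ("items" for the hash, "items-by-date" for the sorted set):

    import Redis from 'ioredis';

    const redis = new Redis();

    // Store/refresh one item: the JSON goes in a hash keyed by id,
    // only the id (scored by the date) goes in the sorted set.
    async function saveItem(item: { id: string; updatedAt: Date }) {
      await redis.hset('items', item.id, JSON.stringify(item));
      await redis.zadd('items-by-date', item.updatedAt.getTime(), item.id);
    }

    // Fetch a page of items in date order: zrange for the ids, hmget for the bodies.
    async function getPage(offset: number, count: number) {
      const ids = await redis.zrange('items-by-date', offset, offset + count - 1);
      if (ids.length === 0) return [];
      const blobs = await redis.hmget('items', ...ids);
      return blobs
        .filter((b): b is string => b !== null)
        .map((b) => JSON.parse(b));
    }

    // Delete by id: remove it from both structures.
    async function deleteItem(id: string) {
      await redis.zrem('items-by-date', id);
      await redis.hdel('items', id);
    }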
I have a table where one of the columns is a sort of id string used to group several rows from the table. Let's say the column name is "map" and one of the values for map is e.g. "walmart". The column has an index on it, because I use it to filter the rows that belong to a certain map.
I have lots of such maps and I don't know how much space the different map values take up in the table. Does MySQL recognize that the same map value is stored in multiple rows and store it only once internally, referencing it with an internal numeric id?
Or do I have to replace the map string with a numeric id explicitly and use a different table to pair map strings to ids if I want to decrease the size of the table?
MySQL will store the whole data for every row, regardless of whether the data already exists in a different row.
If you have a limited set of options, you could use an ENUM field, else you could pull the names into another table and join on it.
I think MySQL will duplicate your content each time: it stores data row by row, unless you explicitly specify otherwise (putting the data in another table, like you suggested).
Using another table will mean you need to add a JOIN to some of your queries: you might want to think a bit about the size of your data (is it really that big?) compared to the (small?) performance loss you may encounter because of that join.
Another solution would be using an ENUM datatype, at least if you know in advance which strings you will have in your table, and there are only a few of them.
Finally, another solution might be to store an integer "code" corresponding to the strings, and have those codes translated to strings by your application, totally outside of the database (or use some table to store the correspondences, but have that table cached by your application, instead of using joins in SQL queries).
It would not be as "clean", but might be better for performance -- still, this may be some kind of micro-optimization that is not necessary in your case...
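For that last idea, the translation can be as simple as a lookup map loaded once and cached in the application; everything here (table and column names included) is illustrative:

    // In-memory cache mapping integer codes to map names, loaded once at startup.
    const mapNames = new Map<number, string>();

    async function loadMapNames(
      query: (sql: string) => Promise<Array<{ id: number; name: string }>>
    ): Promise<void> {
      for (const row of await query('SELECT id, name FROM map_names')) {
        mapNames.set(row.id, row.name);
      }
    }

    // Translate a code from the main table back to its display string.
    function mapName(code: number): string {
      return mapNames.get(code) ?? `unknown(${code})`;
    }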
If you are using the same values over and over again, then there is a good functional reason to move them to a separate table, totally aside from disk space considerations: to avoid problems with inconsistent data.
Suppose you have a table of Stores, which includes a column for StoreName. Among the values in StoreName "WalMart" occurs 300 times, and then there's a "BalMart". Is that just a typo for "WalMart", or is that a different store?
Also, if there's other data associated with a store that would be constant across the chain, you should store it just once and not repeatedly.
Of course, if you're just showing locations on a map and you really don't care what they are, it's just a name to display, then this would all be irrelevant.
And if that's the case, then buying a bigger disk is probably a simpler solution than redesigning your database just to save a few bytes per record. Because if we're talking arbitrary strings for place names here, then trying to find duplicates and have look-ups for them is probably a lot of work for very little gain.
Let's say I have:
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001. How does the fast search work using a hash table?
Isn't it the same as using "SELECT * FROM .." in MySQL? I have read a lot that says "SELECT *" searches from beginning to end, but a hash table does not. Why, and how?
By using a hash table, are we reducing the number of records we search? How?
Can anyone demonstrate how the hash table insert and retrieve process looks in MySQL query code? E.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexed values are like S0001, S0002, T0001, T0002, etc., in MySQL I could use something like:
SELECT * FROM table WHERE value LIKE 'S%'
Isn't that the same, and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we use to find the item again. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the number of lists:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
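A bare-bones sketch of that idea (17 buckets with separate chaining; the string hash below is just one possible choice):

    type Entry = { key: string; value: unknown };

    const BUCKETS = 17;
    const lists: Entry[][] = Array.from({ length: BUCKETS }, () => []);

    // Deterministic, well-spread hash for string keys (djb2-style).
    function hash(key: string): number {
      let h = 5381;
      for (let i = 0; i < key.length; i++) h = ((h * 33) ^ key.charCodeAt(i)) >>> 0;
      return h;
    }

    function insert(entry: Entry): void {
      lists[hash(entry.key) % BUCKETS].push(entry);
    }

    function find(key: string): Entry | undefined {
      // The slow, linear part, but only over one short list.
      return lists[hash(key) % BUCKETS].find((e) => e.key === key);
    }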
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since adding a column to the table later can break old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them.

Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
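As a concrete (and entirely illustrative) example of letting the database do the work, using the mysql2 client with made-up table, column, and index names:

    import mysql from 'mysql2/promise';

    // Select named columns and filter on an indexed column, letting MySQL
    // pick the query plan (e.g. a B-tree lookup on idx_users_last_name).
    // Assumes something like: CREATE INDEX idx_users_last_name ON users (last_name);
    async function findUsersByLastName(lastName: string) {
      const conn = await mysql.createConnection({
        host: 'localhost',
        user: 'app',
        database: 'appdb',
      });
      try {
        const [rows] = await conn.execute(
          'SELECT id, first_name, last_name FROM users WHERE last_name = ?',
          [lastName]
        );
        return rows;
      } finally {
        await conn.end();
      }
    }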
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can still do an exact query, since the row will always have the same id derived from the input you build the hash value from, and you can always recreate that id through the hash function.
However, this isn't always true; depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash), two different inputs can end up with the same id. To take care of this you need a deterministic strategy to use each time two values get the same id. Check Wikipedia for more info on this strategy; it's called collision handling and should be mentioned in the same article as hash tables.
MySQL probably uses hash tables somewhere because of the O(1) lookup that norheim.se mentioned above.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put each entry at an index based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries whose names start with the same letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling just mentioned.