Getting a list of unique hash key values from DynamoDB using boto

I want to get a list of unique hash key values for a DynamoDB table. The only way I currently know of is to scan the entire table and then iterate over the scan results. Is there a better way?

rs = list(table.scan(range__eq="rangevalue"))
for i in rs:
    print i['primarykey']
should do the trick. I'd love to hear cheaper ways to do the same thing.
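If all you need are the distinct hash key values, a slightly cheaper variation is to deduplicate client-side while iterating, instead of materializing the whole result list first. A minimal sketch, assuming a boto.dynamodb2 Table whose hash key attribute is named primarykey (an illustrative name); it is still a full table scan underneath:

# Collect distinct hash key values during the scan (still reads the whole table).
seen = set()
for item in table.scan():
    seen.add(item['primarykey'])   # 'primarykey' is the hash key attribute
print seen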

In IndexedDB, is it possible to make an efficient equivalent of a "where Column in (value1, value2, ..., valueN)" call?

I want to implement this search using IndexedDB:
where CustomerName in ('bob', 'fred', ..., 'nancy')
I can see two possibilities:
1) simply open a cursor on the object store, then loop through the entire table, manually checking whether each record is in ('bob', 'fred', ..., 'nancy')
2) using an index, issue multiple calls: openCursor('bob'), openCursor('fred'), ...
Both openCursor calls take an IDBKeyRange, which does not seem to allow searching for multiple, non-contiguous values.
Is there another, more efficient way?
The fastest way would probably be to sort the keys you're searching for, open a cursor at the first one, call IDBCursor.continue() until you have all the values for that key, and then call IDBCursor.continue(nextKey) to jump ahead to the next key you're searching for. Repeat until you've done all the keys. That way you get all the values with only one cursor, and you can quickly skip over values you don't care about.
Either of your suggestions would work.
A performance improvement to #1 would be to first sort the list of keys (e.g. using indexedDB.cmp() to implement a comparison function), and open the cursor on the first key (e.g. 'bob'). Then, as you iterate, once you see a key that's after 'bob' but before 'fred', you call continue('fred') to skip the intervening records. This avoids iterating over the records in the table you don't care about.
The latest Chrome/Firefox/Safari also support getAll(), which would be a variant on your #2 to get all the records for a given key at once (e.g. via getAll('nancy')) rather than having to iterate a cursor.
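The single-cursor skip pattern both answers describe is language-agnostic, so here is a small Python simulation of the idea over a sorted list of (key, value) records standing in for index order. All names and data are illustrative; the comments mark where the real IDBCursor calls would go:

# Minimal simulation of the single-cursor skip scan described above.
def skip_scan(records, wanted):
    targets = sorted(wanted)            # sort the keys you're searching for
    results = []
    i, t = 0, 0
    while i < len(records) and t < len(targets):
        key, value = records[i]
        if key == targets[t]:
            results.append(value)       # collect every value for this key
            i += 1
        elif key < targets[t]:
            i += 1                      # a real cursor would jump: continue(targets[t])
        else:
            t += 1                      # passed this target; try the next one
    return results

rows = [('alice', 1), ('bob', 2), ('bob', 3), ('carol', 4), ('fred', 5), ('nancy', 6)]
print skip_scan(rows, ['bob', 'fred', 'nancy'])   # [2, 3, 5, 6]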

How is data stored in a B-tree for a composite key in MySQL? I want to know about the tree representation of the data

What does the B-tree look like for the composite key below?
primary key(imei_no, data_received_time)
Both imei_no and data_received_time are BIGINT.
Please give an answer with an example.
The data would be stored as a B-tree automatically: with imei_no + data_received_time as the primary key, a B-tree index on that pair is created by the system, ordered by imei_no first and by data_received_time within each imei_no. So lookup is fast, which is what a B-tree is all about. The table itself, however, is just a common table, of course.
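To make that ordering concrete: entries in a composite-key B-tree sort lexicographically, exactly like sorted Python tuples. A small illustrative sketch (the values are made up):

# Composite-key index entries are kept in lexicographic order:
# by imei_no first, then by data_received_time as a tie-breaker.
entries = [
    (356938035643809, 1500000300),
    (356938035643809, 1500000100),
    (123456789012345, 1500000200),
]
for imei_no, data_received_time in sorted(entries):
    print imei_no, data_received_time
# 123456789012345 1500000200
# 356938035643809 1500000100
# 356938035643809 1500000300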
If you want to show the table data as a B-tree, then have your GUI layer apply a B-tree algorithm to the data and display it accordingly.
If you want to store the table data as a B-tree, then add an ID and a parent ID and link the records that way. This makes your data hierarchical, however, and you need recursion in SQL to retrieve it properly. This can be quite hard - with MySQL especially, lacking recursive WITH clauses. (Or you just retrieve the data raw and again have your GUI layer apply the ordering and presentation. But if your GUI layer is doing all this work, why store the data hierarchically in the first place?) There are certainly better ways than a database table to store B-trees. So if you want to do this, ask yourself first why you want to do it.

What is the best way to merge 2 MySQL data dumps?

We have built an application with MySQL as the database. Every week we export the data dump from the database, and delete all the data. Now we want to merge all these dumps together for some data-analysis tasks.
The problem we are facing is that the "id" field for all the tables is Auto-Increment, so it starts with 1 in all the data dumps, which causes duplicate IDs in the table. I am sure there must be better ways to do it since it should be a pretty common task in MySQL administration.
What would be the best way to go about it?
If you can easily identify your foreign key fields (e.g., they take the form *_id), then you can use the scripting language of your choice to modify the primary and foreign keys in the dump files by adding an "id space offset".
For example, suppose you have two dump files and you know their primary key ranges do not exceed 1,000,000; you would then increment the primary and foreign keys in the second dump file by 1,000,000.
This is not entirely trivial to implement, as you will have to detect the position of the foreign key fields in the statements and then modify values at the same column position elsewhere in the statement.
If your foreign keys are not easily identifiable by a common naming convention, then you must keep per-table information about which column positions hold keys.
Good luck.
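As an illustration, here is a minimal Python sketch of that offset rewrite. It assumes single-row INSERT statements and known key column positions; the table name, the positions, and the dump format are all simplified, and a real dump (multi-row inserts, strings containing commas) would need a proper SQL parser:

import re

OFFSET = 1000000   # the "id space offset" for the second dump file

def shift_ids(line, id_positions):
    # Match only the simplest form: INSERT INTO `tbl` VALUES (...);
    m = re.match(r"(INSERT INTO `\w+` VALUES \()(.*)(\);)$", line)
    if not m:
        return line
    values = m.group(2).split(', ')
    for pos in id_positions:           # positions of primary/foreign key columns
        values[pos] = str(int(values[pos]) + OFFSET)
    return m.group(1) + ', '.join(values) + m.group(3)

print shift_ids("INSERT INTO `orders` VALUES (7, 42, 'pending');", [0, 1])
# INSERT INTO `orders` VALUES (1000007, 1000042, 'pending');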
The best way would be to have another database that acts as a data warehouse, into which you copy the contents of your app's database. After that, you don't truncate all the tables; you simply use DELETE FROM tablename - that way, your auto_increments won't get reset.
It's an ugly solution to export something, then truncate the database, then expect a later import to proceed properly. Even if you work around the problem of clashing auto-increments (there's the ON DUPLICATE KEY clause, which lets you do something when a unique key constraint fails), nothing guarantees that relations between tables (foreign keys) will be preserved.
This is a broad topic, and the solution given is quick and not pretty; other people will probably suggest other methods. But if you are doing this to offload the DB your app uses, it's a bad design. Try googling MySQL's partitioning support if you're aiming for better performance with a larger data set.
For the data you've already dumped, load it into a table that doesn't use the ID column as a primary key. You don't have to define any primary key. You will have multiple rows with the same ID, but that won't impede your data analysis.
Going forward, you can set up a discipline where you dump and then DELETE the rows that are more than, say, one day old. That way your IDs will keep incrementing.
Or, you can copy this data to a table that uses the ARCHIVE storage engine. This is good for retaining data for analysis, because it compresses its contents.
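A minimal sketch of that analysis table, assuming the MySQLdb driver and made-up column names; the point is simply that there is no primary key on id, and the ARCHIVE engine keeps the accumulated dumps compressed:

import MySQLdb

conn = MySQLdb.connect(db="analysis")   # connection details omitted
cur = conn.cursor()
# No PRIMARY KEY on id: duplicate ids from different dumps can coexist.
cur.execute("""
    CREATE TABLE orders_merged (
        id INT NOT NULL,
        customer VARCHAR(100),
        created_at DATETIME
    ) ENGINE=ARCHIVE
""")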

MySQL Table primary keys

Greetings,
I have some MySQL tables that are currently using an md5 hash as a primary key. I normally generate the hash from the value of a column. For instance, let's imagine I have a table called "Artists" with the fields id, name, num_members, year. I tend to compute md5($name) and use it as the ID.
I would like to know what the downsides of doing this are. Is it just better to use integers with AUTO_INCREMENT? I tend to run away from that because it's just not worth the trouble of finding out what the last inserted id was, what the next will be, etc.
Can you shed some light on this?
Thank you.
If you need a surrogate primary key, using an AUTO_INCREMENT field is better than an md5 hash, because it is fewer bytes of data, and database backends optimize for integer primary keys.
mysql_insert_id can be used if you need the last inserted id.
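From Python, the equivalent is the cursor's lastrowid. A sketch, assuming the MySQLdb driver and the Artists table from the question (the data is illustrative):

import MySQLdb

conn = MySQLdb.connect(db="music")   # connection details omitted
cur = conn.cursor()
cur.execute("INSERT INTO Artists (name, num_members, year) VALUES (%s, %s, %s)",
            ("The Kinks", 4, 1963))
print cur.lastrowid   # the AUTO_INCREMENT id just generated, like mysql_insert_id()
conn.commit()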
If you are generating the primary key as a hash of other columns, why not just use those other columns as a unique key, then join on those?
Another question is, what are the upsides of using an md5 hash? I can't think of any.
The MD5 isn't a true key in this case because it functionally depends on the name. That means that if you have two artists with the same name, you have duplicate "keys" for different records. You could make it a real key by hashing all the attributes together (and hoping that the probability gods don't send you a collision), or you could just save yourself the trouble and use an autoincrementing ID.
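For illustration, hashing all the attributes together would look something like the sketch below (illustrative Python; note that two rows with identical attribute values still collide, which is exactly why the autoincrementing ID is simpler):

import hashlib

# Hash every attribute, not just the name, so two same-named artists
# with different data get different keys. Identical rows still collide.
def artist_key(name, num_members, year):
    raw = "|".join([name, str(num_members), str(year)])
    return hashlib.md5(raw).hexdigest()

print artist_key("Nirvana", 3, 1987)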
It seems like the way you're trying to use the MD5 isn't really buying you any benefit. If "$name" is unique, then why not just use "name" as the primary key? Calculating an MD5 hash and using it as a key for something that's already unique is redundant.
On the other hand, if "name" is not unique, then the MD5 hash won't be unique either and so it's pointless that way too.
Generally you use an MD5 hash when you don't want to store the actual value of the column. For instance, if you're storing passwords, you generally only store the MD5 hash of the password, not the password itself, so that you can't see people's passwords just by looking at the table contents.
If you don't have any unique fields, then you're stuck doing something like an auto-increment, because it's at least guaranteed to be unique. If you use the built-in SQL auto-increment, then you'll just have to fetch the last inserted id one way or another. Alternately, if you can get away with keeping a unique counter locally in your application, that avoids auto-increment, but it isn't viable for most applications.
The first approach has one obvious disadvantage: if there are two artists of the same name there will be a primary key collision. Using an INT column with an auto-increment will ensure uniqueness.
Furthermore, though very unlikely, there is a chance that MD5 hashes of different strings could collide (for any given pair of inputs the probability is about 1 in 2^128, since MD5 produces a 128-bit digest).
One benefit is that if you present the IDs to users (say, in a query string for a web form, though that is another no-no), a hash prevents them from guessing other valid IDs, whereas sequential integers make that trivial.
Personally, I use auto-increment without problems (I have moved DBs to new servers and everything still worked fine).

Which is faster: Many rows or many columns?

In MySQL, is it generally faster/more efficient/scalable to return 100 rows with 3 columns, or 1 row with 100 columns?
In other words, when storing many key => value pairs related to a record, is it better to store each key => value pair in a separate row with the record_id as a key, or to have one row per record_id with a column for each key?
Also, assume that keys will need to be added/removed fairly regularly, which I assume would affect the long-term maintainability of the many-column approach once the table gets sufficiently large.
Edit: to clarify, by "a regular basis" I mean the addition or removal of a key once a month or so.
You should never add or remove columns on a regular basis.
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
There are a lot of bad things about this model, and I would not use it if there were any other alternative. If you don't know the majority of the data columns your application needs (a few user-customizable fields aside), then you need to spend more time on design and figure it out.
If your keys are preset (known at design time), then yes, you should put each key into a separate column.
If they are not known at design time, then you have to return your data as a list of key-value pairs, which you then parse outside the RDBMS.
If you are storing key/value pairs, you should have a table with two columns, one for the key (make this the PK for the table) and one for the value (probably don't need this indexed at all). Remember, "The key, the whole key, and nothing but the key."
In the multi-column approach, you will find that your table grows without bound, because removing a column would nuke all its values and you won't want to do it. I speak from experience here, having worked on a legacy system that had one table with almost 1000 columns, most of which were bit fields. Eventually, you stop being able to make the case to delete any of the columns, because someone might be using one, and the last time you did it you had to work till 2 a.m. rolling back to backups.
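A minimal sketch of the key/value row layout, assuming the MySQLdb driver and made-up names; the record_id from the question is included so each pair stays attached to its record, and the composite primary key enforces one value per key per record:

import MySQLdb

conn = MySQLdb.connect(db="app")   # connection details omitted
cur = conn.cursor()
cur.execute("""
    CREATE TABLE record_attributes (
        record_id INT NOT NULL,
        attr_key  VARCHAR(64) NOT NULL,
        attr_val  VARCHAR(255),
        PRIMARY KEY (record_id, attr_key)
    )
""")
cur.execute("INSERT INTO record_attributes VALUES (%s, %s, %s)",
            (1, "color", "blue"))
conn.commit()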
First: determine how frequently your data needs to be accessed. If the data always needs to be retrieved in one shot, and most of it is used, then consider storing all the key pairs as a serialized value or as an XML value. If you need to do any sort of complex analysis on that data and you need the value pairs, then columns are OK, but limit them to values you know you will need to query on. It's generally easier to design queries that use one column per parameter than one row per parameter, and you will also find it easier to work with the returned values if they are all in one row rather than spread across many.
Second: separate your most frequently accessed data and put it in its own table, with the rest in another. 100 columns is a lot, by the way, so I recommend splitting your data into smaller, more manageable chunks.
Lastly: if you have keys that may change frequently, create the key names in one table and store each key's numeric id alongside its value in another. This assumes you will use the same key more than once, and it should speed up your search when you go to do your lookup.
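A sketch of that lookup-table arrangement, again assuming the MySQLdb driver and illustrative names: key strings are stored once, and values reference them by their small numeric id instead of repeating the string on every row:

import MySQLdb

conn = MySQLdb.connect(db="app")   # connection details omitted
cur = conn.cursor()
cur.execute("""CREATE TABLE attr_keys (
                   key_id INT AUTO_INCREMENT PRIMARY KEY,
                   name   VARCHAR(64) NOT NULL UNIQUE)""")
cur.execute("""CREATE TABLE attr_values (
                   record_id INT NOT NULL,
                   key_id    INT NOT NULL,
                   value     VARCHAR(255),
                   PRIMARY KEY (record_id, key_id))""")
cur.execute("INSERT INTO attr_keys (name) VALUES (%s)", ("color",))
cur.execute("INSERT INTO attr_values VALUES (%s, %s, %s)",
            (1, cur.lastrowid, "blue"))   # reuse the interned key's id
conn.commit()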