Multiple Mappers for the same class in different databases - sqlalchemy

I am currently working on a Wikipedia API, which means that we have a database for each language we want to use. The structure of each database is identical; they only differ in their language. The only place where this information is stored is the name of the database. When starting with one language, the straightforward approach of mapping the tables to the needed classes (e.g. Page) looked fine. We defined an engine and corresponding metadata. When we added a second database with its own engine and metadata setup, we ran into the following error:
ArgumentError: Class '<class 'wp.orm.types.pages.Page'>' already has a primary mapper defined. Use non_primary=True to create a non primary Mapper. clear_mappers() will remove *all* current mappers from all classes.
I found an email saying that there must be at least one primary mapper, so using this option for all databases doesn't seem feasible.
The next idea is to use sharding. For that we need a way to distinguish between the databases from the perspective of an instance, as noted in the docs:
"You need a function which can return a single shard id, given an instance to be saved; this is called 'shard_chooser'."
I am stuck here. Is there a way to get the database name from the object it was loaded from? Or a possibility to add a static attribute based on the engine? The alternative would be to add a language column to every table, which is just ugly.
Am I overlooking other possibilities? Any ideas how to define multiple mappers for the same class that map against tables in different databases?

I asked this question on a mailing list and got this answer by Michael Bayer:
if you'd like distinct classes to indicate that they "belong" in a different database, and you have very clear lines as to how this is performed, use the "entity_name" concept described at http://www.sqlalchemy.org/trac/wiki/UsageRecipes/EntityName . this sounds very much like your use case.
Quoting back the sharding idea and the shard_chooser passage from the question, he continued:
horizontal sharding is a method of storing many homogeneous instances across multiple databases, with the implication that you're creating one big "virtual" database among partitions - the main concept is that an individual instance gets placed in different partitions based on some ruleset. This is a little like your use case as well but since you have a very simple delineation i think the "entity name" approach is easier.
So the basic idea is to generate anonymous subclasses for each desired mapping, distinguished by their entity_name. The details can be found in Michael's link.
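To make the idea concrete: the entity_name recipe targets older SQLAlchemy releases (the option was removed later), but the "one mapped subclass per database" shape can be sketched roughly as follows with modern SQLAlchemy. This is only a sketch; the engine URLs and Page columns are placeholders, not the actual wp.orm schema.

from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.orm import Session, declarative_base

class PageMixin:
    # columns shared by every per-language Page class; copied into each subclass
    __tablename__ = "page"
    id = Column(Integer, primary_key=True)
    title = Column(Text)

languages = ["en", "de"]
page_classes, engines = {}, {}

for lang in languages:
    Base = declarative_base()  # a separate MetaData per database keeps the mappers independent
    page_classes[lang] = type(f"Page_{lang}", (PageMixin, Base), {})
    engines[lang] = create_engine(f"mysql://localhost/wikipedia_{lang}")

# route each per-language class to the engine of "its" database
session = Session(binds={page_classes[lang]: engines[lang] for lang in languages})

A shard_chooser is then unnecessary: the class an instance belongs to already identifies the database.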

Related

Relational Table Design For Single Object w/Multiple Types

I am creating a database for a web application and am looking for some suggestions to model a single entity that might have multiple types, with each type having differing attributes.
As an example assume that I want to create a relational model for a "Data Source" object. There will be some shared attributes of all data sources, such as a numerical identifier, a name, and a type. Each type will then have differing attributes based on the type. For the sake of argument let's say we have two types, "SFTP" and "S3".
For the S3 type we might have to store the bucket, AWSAccessKeyId, YourSecretAccessKeyID, etc. For SFTP we would have to store the address, username, password, potentially a key of some sort.
My first inclination would be to break out each type into their own table with any non-common fields being represented in that new table with a foreign key in the main "Data Source" table. What I don't like about that is that I would then have to know which table is associated with each type that is stored in the main table and rewrite the queries coming from the web app dynamically based on that type.
Is there a simple solution or best practices I'm missing here?
What you are describing is a situation where you want to implement table inheritance. There are three methods for doing this, all described in Martin Fowler's excellent book, Patterns of Enterprise Application Architecture.
What you describe as your first inclination is called Class Table Inheritance by Fowler. It is the method that I tend to use in my database designs, but doesn't always fit well. This method corresponds most closely to an OO view of the database, with a table representing an abstract class and other tables representing concrete implementations of the abstract class. Data must be queried and updated from multiple tables.
It sounds like what you actually want to use is called Single Table Inheritance by Fowler. In this method, you'd actually put columns for all of your data in one table, with a discriminator column to identify which fields are associated with the element type. Queries are generally simpler, although you do have to deal with the discriminator column.
Finally, the third type is called Concrete Table Inheritance by Fowler. In my mind, this is the least useful. In this method, you give up all concepts of having any kind of hierarchical data, and create a single table for each element type. Still, there are times when this might work for you.
All three methods have their pros and cons. You should consult Fowler's descriptions to see which might work best for your project.
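The question doesn't name a stack, so purely as an illustration, here is a minimal sketch of Fowler's Single Table Inheritance using SQLAlchemy's declarative inheritance with the DataSource/S3/SFTP example; all class and column names are invented.

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DataSource(Base):
    __tablename__ = "data_source"
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    type = Column(String(20))  # discriminator column identifying the concrete type
    __mapper_args__ = {"polymorphic_on": type, "polymorphic_identity": "generic"}

class S3Source(DataSource):
    # single table inheritance: these columns live in data_source and stay NULL for other types
    bucket = Column(String(255))
    aws_access_key_id = Column(String(128))
    __mapper_args__ = {"polymorphic_identity": "s3"}

class SftpSource(DataSource):
    address = Column(String(255))
    username = Column(String(64))
    __mapper_args__ = {"polymorphic_identity": "sftp"}

Class Table Inheritance would instead give S3Source and SftpSource their own tables whose primary keys reference data_source.id.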

Implementing inheritance in MySQL: alternatives and a table with only surrogate keys

This is a question that has probably been asked before, but I'm having some difficulty finding exactly my case, so I'll explain my situation in search of some feedback:
I have an application that will be registering locations. I have several types of locations, and each location type has a different set of attributes, but I need to associate notes to locations regardless of their type, and also other types of content (mostly multimedia entries and comments) to said notes. With this in mind, I came up with a couple of solutions:
Create a table for each location type, and a "notes" table for every location table with a foreign key. This is pretty troublesome because I would also have to create a multimedia and a comments table for every notes table, e.g.:
LocationTypeA
  ID
  Attr1
  Attr2
LocationTypeA_Notes
  ID
  Attr1
  ...
  LocationTypeA_fk
LocationTypeA_Notes_Multimedia
  ID
  Attr1
  ...
  LocationTypeA_Notes_fk
And so on. This would be quite annoying to do, but after it's done, developing on this structure should not be so troublesome.
Create a table with a unique identifier for the location and point content there, like so:
Location
  ID
LocationTypeA
  ID
  Attr1
  Attr2
  Location_fk
Notes
  ID
  Attr1
  ...
  Location_fk
Multimedia
  ID
  Attr1
  ...
  Notes_fk
As you see, this is far simpler and also easier to develop, but I just don't like the looks of that table with only IDs (yeah, that's truly the only objection I have to this; it's the option I like the most, to be honest).
Similar to option 2, but I would have an enormous table of attributes shaped like this:
Location
  ID
  Type
Attribute
  Name
  Value
And so on, or a table for each attribute, a la Drupal. This would be a pain to develop because it would then take several insert/update operations to do something on a location, and the Attribute table would be several times bigger than the location table (or I'd end up with an enormous number of attribute tables); it also has the same issue as the surrogate-keys-only table (except that it has a "type" now, which I would use to define the behavior of the location programmatically), but it's a pretty solution.
So, to the question: which would be a better solution performance- and scalability-wise? Which would you go with, or which alternatives would you propose? I don't have a problem implementing any of these; options 2 and 3 would be an interesting development, I've never done something like that, but I don't want to go with an option that will collapse on itself when the content grows a bit. You're probably thinking "why not just use Drupal if you know it works like you expect it to?", and I'm thinking "you obviously don't know how difficult it is to use Drupal; either that or you're an expert, which I'm most definitely not".
Also, now that I've written all of this, do you think option 2 is a good idea overall? Do you know of a better way to group entities / simulate inheritance? (Please don't say "just use inheritance!", I'm restricted to using MySQL.)
Thanks for your feedback, I'm sorry if I wrote too much and meant too little.
ORM systems usually use the following approaches, mostly the same solutions as you listed:
One table per hierarchy
Pros:
Simple approach.
Easy to add new classes, you just need to add new columns for the additional data.
Supports polymorphism by simply changing the type of the row.
Data access is fast because the data is in one table.
Ad-hoc reporting is very easy because all of the data is found in one table.
Cons:
Coupling within the class hierarchy is increased because all classes are directly coupled to the same table.
A change in one class can affect the table which can then affect the other classes in the hierarchy.
Space potentially wasted in the database.
Indicating the type becomes complex when significant overlap between types exists.
Table can grow quickly for large hierarchies.
When to use:
This is a good strategy for simple and/or shallow class hierarchies where there is little or no overlap between the types within the hierarchy.
One table per concrete class
Pros:
Easy to do ad-hoc reporting as all the data you need about a single class is stored in only one table.
Good performance to access a single object’s data.
Cons:
When you modify a class you need to modify its table and the table of any of its subclasses. For example if you were to add height and weight to the Person class you would need to add columns to the Customer, Employee, and Executive tables.
Whenever an object changes its role, perhaps you hire one of your customers, you need to copy the data into the appropriate table and assign it a new POID value (or perhaps you could reuse the existing POID value).
It is difficult to support multiple roles and still maintain data integrity. For example, where would you store the name of someone who is both a customer and an employee?
When to use:
When changing types and/or overlap between types is rare.
One table per class
Pros:
Easy to understand because of the one-to-one mapping.
Supports polymorphism very well as you merely have records in the appropriate tables for each type.
Very easy to modify superclasses and add new subclasses as you merely need to modify/add one table.
Data size grows in direct proportion to growth in the number of objects.
Cons:
There are many tables in the database, one for every class (plus tables to maintain relationships).
Potentially takes longer to read and write data using this technique because you need to access multiple tables. This problem can be alleviated if you organize your database intelligently by putting each table within a class hierarchy on different physical disk-drive platters (this assumes that the disk-drive heads all operate independently).
Ad-hoc reporting on your database is difficult, unless you add views to simulate the desired tables.
When to use:
When there is significant overlap between types or when changing types is common.
Generic Schema
Pros:
Works very well when database access is encapsulated by a robust persistence framework.
It can be extended to provide meta data to support a wide range of mappings, including relationship mappings. In short, it is the start at a mapping meta data engine.
It is incredibly flexible, enabling you to quickly change the way that you store objects because you merely need to update the meta data stored in the Class, Inheritance, Attribute, and AttributeType tables accordingly.
Cons:
Very advanced technique that can be difficult to implement at first.
It only works for small amounts of data because you need to access many database rows to build a single object.
You will likely want to build a small administration application to maintain the meta data.
Reporting against this data can be very difficult due to the need to access several rows to obtain the data for a single object.
When to use:
For complex applications that work with small amounts of data, or for applications where your data access isn't very common or you can pre-load data into caches.
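As an illustration of the "one table per class" strategy, here is a hedged sketch using SQLAlchemy's joined-table inheritance, applied to the Location example from the question; all names are invented, and the same schema can be written as plain CREATE TABLE statements.

from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Location(Base):
    __tablename__ = "location"
    id = Column(Integer, primary_key=True)
    type = Column(String(20))  # discriminator: which subtype table holds the rest of the row
    __mapper_args__ = {"polymorphic_on": type, "polymorphic_identity": "location"}

class LocationTypeA(Location):
    __tablename__ = "location_type_a"
    # shared primary key: also a foreign key to the parent table
    id = Column(Integer, ForeignKey("location.id"), primary_key=True)
    attr1 = Column(String(100))
    __mapper_args__ = {"polymorphic_identity": "type_a"}

class Note(Base):
    __tablename__ = "notes"
    id = Column(Integer, primary_key=True)
    location_id = Column(Integer, ForeignKey("location.id"))  # notes attach to any location type
    body = Column(Text)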

(Somewhat) complicated database structure vs. simple — with null fields

I'm currently choosing between two different database designs: a complicated one which separates the data better, and a simpler one. The more complicated design will require more complex queries, while the simpler one will have a couple of null fields.
Consider the examples below:
Complicated:
Simpler:
The above examples are for separating regular users and Facebook users (they will access the same data, eventually, but log in differently). In the first example, the data is clearly separated. The second example is way simpler, but will have at least one null field per row: facebookUserId will be null if it's a normal user, while username and password will be null if it's a Facebook user.
My question is: what's preferred? Pros/cons? Which one is easiest to maintain over time?
First, what Kirk said. It's a good summary of the likely consequences of each alternative design. Second, it's worth knowing what others have done with the same problem.
The case you outline is known in ER modeling circles as "ER specialization". ER specialization is just different wording for the concept of subclasses. The diagrams you present are two different ways of implementing subclasses in SQL tables. The first goes under the name "Class Table Inheritance". The second goes under the name "Single Table Inheritance".
If you do go with Class table inheritance, you will want to apply yet another technique, that goes under the name "shared primary key". In this technique, the id fields of facebookusers and normalusers will be copies of the id field from users. This has several advantages. It enforces the one-to-one nature of the relationship. It saves an extra foreign key in the subclass tables. It automatically provides the index needed to make the joins run faster. And it allows a simple easy join to put specialized data and generalized data together.
You can look up "ER specialization", "single-table-inheritance", "class-table-inheritance", and "shared-primary-key" as tags here in SO. Or you can search for the same topics out on the web. The first thing you will learn is what Kirk has summarized so well. Beyond that, you'll learn how to use each of the techniques.
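A minimal sketch of the shared primary key technique, written as SQLAlchemy table definitions purely for illustration (plain DDL expresses the same thing); column names other than id are invented.

from sqlalchemy import Column, ForeignKey, Integer, MetaData, String, Table

metadata = MetaData()

users = Table("users", metadata,
    Column("id", Integer, primary_key=True),
    Column("email", String(255)))

facebookusers = Table("facebookusers", metadata,
    # same value as users.id: enforces the 1:1 relationship, saves an extra
    # foreign key column, and provides the index that makes the join fast
    Column("id", Integer, ForeignKey("users.id"), primary_key=True),
    Column("facebook_user_id", String(64)))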
Great question.
This applies to any abstraction you might choose to implement, whether in code or database. Would you write a separate class for the Facebook user and the 'normal' user, or would you handle the two cases in a single class?
The first option is the more complicated. Why is it complicated? Because it's more extensible. You could easily include additional authentication methods (a table for Twitter IDs, for example), or extend the Facebook table to include... some other facebook specific information. You have extracted the information specific to each authentication method into its own table, allowing each to stand alone. This is great!
The trade off is that it will take more effort to query, it will take more effort to select and insert, and it's likely to be messier. You don't want a dozen tables for a dozen different authentication methods. And you don't really want two tables for two authentication methods unless you're getting some benefit from it. Are you going to need this flexibility? Authentication methods are all similar - they'll have a username and password. This abstraction lets you store more method-specific information, but does that information exist?
The second option is just the reverse of the first. Easier, but how will you handle future authentication methods, and what if you need to add some authentication-method-specific information?
Personally I'd try to evaluate how important this authentication component is to the system. Remember YAGNI - you aren't gonna need it - and don't overdesign. Unless you need that extensibility that the first option provides, go with the second. You can always extract it at a later date if necessary.
This depends on the database you are using. For example, Postgres has table inheritance, which would be great for your example; have a look here:
http://www.postgresql.org/docs/9.1/static/tutorial-inheritance.html
Now if you do not have table inheritance you could still create views to simplify your queries, so the "complicated" example is a viable choice here.
Now if you have infinite time, then I would go for the first one (for this one simple example, and preferably with table inheritance).
However, this makes things more complicated and so will cost you more time to implement and maintain. If you have many table hierarchies like this, it can also have a performance impact (as you have to join many tables). I once developed a database schema that made excessive use of such hierarchies (conceptually). We finally decided to keep the hierarchies conceptually but flatten them in the implementation, as it had gotten so complex that it was not maintainable anymore.
When you flatten the hierarchy you might consider not using null values, as this can also prove to make things a lot harder (alternatively you can use a -1 or something).
Hope these thoughts help you!
Warning bells are ringing loudly with the presence of the two very similar tables facebookusers and normalusers. What if you get a 3rd type? Or a 10th? This is insane.
There should be one user table with an attribute column to show the type of user. A user is a user.
Keep the data model as simple as you possibly can. Don't build too much kung fu into it via the data structure. Leave that for the application, which is far easier to alter than a database!
Let me dare suggest a third option. You could introduce one (or two) tables that cater for extensibility. I personally try to avoid designs that introduce (read: pollute) an entity model with non-uniformly applicable columns. Have the third table (after the fashion of the EAV model) hold a many-to-one relationship with your users table to cater for multiple/variable user-related fields.
I'm not sure what your current/short-term needs are, but re-engineering your app to cater for, say, Twitter or LinkedIn users might be painful. You can abstract the content of the facebookUserId column into an attribute table like so:
user_attr{
  id PK
  user_id FK
  login_id
}
Now, the above definition is ambiguous enough to handle your current needs. If done right, the EAV should look more like this:
user_attr{
  id PK
  user_id FK
  login_id
  login_id_type FK
  login_id_status // simple boolean flag to set the validity of a given login
}
Here login_id_type will be a foreign key to an attribute table listing the various login types you currently support. This gives you and your users flexibility, in that your users can have multiple logins using different external services without you having to change much of your existing system.
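For illustration, a hedged sketch of the lookup table that login_id_type would point at, again as SQLAlchemy-style definitions (plain CREATE TABLE is equivalent); everything beyond the names given in the answer is invented.

from sqlalchemy import Boolean, Column, ForeignKey, Integer, MetaData, String, Table

metadata = MetaData()

users = Table("users", metadata,
    Column("id", Integer, primary_key=True))

login_type = Table("login_type", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(32)))  # e.g. 'password', 'facebook', 'twitter'

user_attr = Table("user_attr", metadata,
    Column("id", Integer, primary_key=True),
    Column("user_id", Integer, ForeignKey("users.id")),
    Column("login_id", String(255)),
    Column("login_id_type", Integer, ForeignKey("login_type.id")),
    Column("login_id_status", Boolean))  # validity flag from the answer

Supporting a new external service then means inserting one login_type row rather than altering the users table.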

Implementing a database structure for generic objects

I'm building a PHP/MySQL website and I'm currently working on my database design. I do have some database and MySQL experience, but I've never structured a database from scratch for a real-world application which hopefully is going to get some good traffic, so I'd love to hear advice from people who've already done it, in order to avoid common mistakes. I hope my explanations are not too confusing.
What I need
In my application, the user should be able to write a post (title + text), then create an "object" (which can be anything, like a video, or a song, etc.) and attach it to the post. The site has a list of predefined object types the user can create, and I should be able to add new types in the future. The user should also have the ability to see the object's details in a dedicated page and add a comment to it - the same applies to posts.
What I tried
I created an objects table with these fields: oid, type, name and date. This table contains records for anything the user should be able to add comments to (i.e. posts and objects). Then I created a postmeta table which contains additional post data (such as text, author, last edit date, etc.), a videometa table for data about the "video" object (URL, description, etc.), and so on. A postobject table (pid,oid) links objects to posts. Additionally, there's a comments table which contains the comment text, the author and the ID of the object it refers to.
Since the list of object types is predefined and is probably not going to change (though I still need the ability to add a type easily at any time without changing the app's code structure or the database design), and it is relatively small, it's not a problem to create a "meta" table for each type and make a corresponding PHP class in my application to handle it.
Finally, a page on the site needs to show a list of all the posts including the objects attached to it, sorted by date. So I get all the records from the objects table with type "post" and join it with postmeta to get the post metadata. Then I query postobject to get all the objects attached to this post, and comments to get all the comments.
The questions
Does this make any sense? Is it any good to design a database in this way for a real world site? I need to join quite a few tables to get all the data I need, and the objects table is going to become huge since it contains almost every item (only the type, name and creation date, though) - this is to keep the database and the app code flexible, but does it work in the real world, or is it too expensive in the long term? Am I thinking about it in the wrong way with this kind of OOP approach?
More specifically: suppose I need to list all the posts, including their attached objects and metadata. I would need to join these tables, at least: posts, postmeta, postobject and {$objecttype}meta (not to mention an users table to get all posts by a specific user, for example). Would I get poor performance doing this, even if I'm using only numeric indexes?
Also, I considered using a NoSQL database (MongoDB) for this project (thanks to Stuart Ellis' advice). Apparently it seems much more suitable since I need some flexibility here. But my doubt is: metadata for my objects includes a lot of references to other records in the database. So how would I avoid data duplication if I can't use JOIN? Should I use DBRef and the techniques described here? How do they compare to MySQL JOINs used in the structure described above in terms of performance?
I hope these questions make sense. This is my first project of this kind and I just want to avoid making huge mistakes before I launch it and find out I need to rework the design completely.
I'm not a NoSQL person, but I wonder whether this particular case might actually be handled best with a document database (MongoDB or CouchDB). Various type of objects with metadata attached sounds like the kind of scenario that MongoDB is designed for.
FWIW, you've got a couple of issues with your table and field naming that might bite you later. For example, type and date are rather generic, and also reserved words. You've also mixed singular and plural table names, which will throw off any automatic object mapping.
Whichever database you use, it's a good idea to find an existing set of database naming conventions and apply it from the start - this will help you avoid subtle issues and ensure that your naming stays consistent. I tend to use the Rails naming conventions ATM, because they are well-known and fairly sensible.
Or you could store the object contents as a file, outside of the database, if you're concerned about the database space.
If you store anything in the database, you already have the object type in objects; so you could just add object_contents table with a long binary field to store the object. You don't need to create a new table for each new type.
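A hedged sketch of that suggestion, as a SQLAlchemy-style definition purely for illustration; only the objects.oid reference and the object_contents idea come from the answer, the rest is invented.

from sqlalchemy import Column, ForeignKey, Integer, LargeBinary, MetaData, Table

metadata = MetaData()

objects = Table("objects", metadata,
    Column("oid", Integer, primary_key=True))

object_contents = Table("object_contents", metadata,
    Column("oid", Integer, ForeignKey("objects.oid"), primary_key=True),
    Column("contents", LargeBinary))  # one binary payload per object, whatever its type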
I've seen a lot of JOINs in real-world web applications (5 to 10). The objects table may get large, but that's what indices are for. So far, I don't see anything wrong in your database. BTW, what felt strange to me: one post, one object, and separate comments for each? No ability to mix pictures with text?

Database schema - organise by object or data?

I'm refactoring a horribly interwoven db schema. It's not that it's overly normalised; it has just grown ugly over time and isn't terribly well laid out.
There are several tables (forum boards, forum posts, idea posts, blog entries) that share virtually identical data structures and composition, but are separated simply because they represent different "objects" from the application's perspective. My initial reaction is to put everything that has the same data structure into the same table, and use a "type" column to distinguish data when performing a select.
Am I setting myself up for a fall by adopting this "all into one" approach and allowing (potentially) so many parts of the application to access the same table? FYI, I can't see this database growing to more than ~20mb over the next year or so...
There are basically three ways to store an object inheritance hierarchy in a relational database. Each has its own pros and cons. See:
http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html
http://www.martinfowler.com/eaaCatalog/classTableInheritance.html
http://www.martinfowler.com/eaaCatalog/concreteTableInheritance.html
The book is great too. As luck would have it, chapter 3, "Mapping to Relational Databases", is available freely as a sample chapter. You can read more about the trade-offs in there.
I used to dislike this "all into one" approach, but after I was forced to use it on a complex project a few years ago, I became a fan. If you index the table correctly, performance should be OK. You'll want an index on the type column to speed up your sort by type operations, for instance.
I now usually recommend that you use a single table to store similar objects. The only question, then, is whether you want to use subtables to store data that's specific to a certain type of object. The answer really depends on how different the structure of each object type is, and how many object types you'll have. If you have 50 object types with vastly differing structures, you may want to consider storing just the consistent object parts in the main table and creating a sub-table for each object type.
In your example, however, I think you'd be fine just putting it all into a single table.
For more info, see here: http://www.agiledata.org/essays/mappingObjects.html
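As an illustration of the "all into one" approach with an indexed type column, a hedged SQLAlchemy-style sketch (the question's stack isn't stated, and plain CREATE TABLE works just as well); the entry types and columns are invented.

from sqlalchemy import Column, DateTime, Index, Integer, MetaData, String, Table, Text

metadata = MetaData()

entries = Table("entries", metadata,
    Column("id", Integer, primary_key=True),
    Column("type", String(20), nullable=False),  # e.g. 'forum_board', 'forum_post', 'idea_post', 'blog_entry'
    Column("title", String(255)),
    Column("body", Text),
    Column("created_at", DateTime),
    Index("ix_entries_type", "type"))  # the index on type recommended above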
Don't lean too much on the "application's perspective"; it tends to vary over time anyway. Often databases are accessed by different applications too, and the database usually outlives them all...
When similar objects are stored in different tables, the reason may be that they actually represent the same domain object, but in a different state, or in a different step in a workflow. Then it often makes sense to keep them in one table and add some simple attributes to flag the state. If the workflow, or whatever it is, changes, it's easier to change the database and application too; you may not need to add more tables or classes.