I'm working on a web site which will test applications or other web sites with test cases, and I don't know how to store the test cases that users create. Is it okay to create a separate table for each user, or should I store all the data in one table? My idea was to create 3 new tables per user: test_cases_x (all test cases the user has created), test_cases_history_x (references to all test cases which have been executed), and test_cases_exe_x (references to all test cases which are executing at this moment).
Is it okay to create a separate table for each user?
No, this defeats the whole idea of a relational database. You still want the three tables, but link their rows to users by a user_id column.
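For instance, a minimal sketch of that layout (assuming MySQL; the column names are illustrative, not from the question):

    CREATE TABLE test_cases (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id    INT UNSIGNED NOT NULL,   -- which user created the test case
        definition TEXT NOT NULL,
        INDEX (user_id)                     -- makes per-user lookups fast
    );

    CREATE TABLE test_cases_history (
        id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        test_case_id INT UNSIGNED NOT NULL, -- reference to the executed test case
        executed_at  DATETIME NOT NULL,
        FOREIGN KEY (test_case_id) REFERENCES test_cases (id)
    );

    CREATE TABLE test_cases_exe (
        test_case_id INT UNSIGNED NOT NULL, -- test cases running right now
        started_at   DATETIME NOT NULL,
        FOREIGN KEY (test_case_id) REFERENCES test_cases (id)
    );

All users share the same three tables, and fetching one user's data is just a WHERE user_id = ? filter.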
It's hard to say without knowing all the information; however, 99% of the time it's better not to create tables on a per-user basis, but to let the database perform the linkage (relationships).
If you're concerned your table will grow really large, you can look at partitioning / sharding / archiving data to reduce it (please don't go there until you need to, as premature optimization can actually make things slower).
I'm currently developing an API for a company that didn't do a very good job of maintaining a good test database with test data. The MySQL database structure is quite big and complicated, and the live database is around 160-200 GB.
And because I'm quite lazy and don't want to create test data for all the tables from scratch, I was wondering what would be the best way to turn such a big database into a smaller test database that keeps all data and relationships in correct form. Is there an easy way to do this with some kind of script that checks the database model and knows what data to keep or delete when reducing the database to a smaller size?
Or am I doomed and have to go through the tedious task of creating my own test data?
Take a look at Jailer which describes itself as a "Database Subsetting and Browsing Tool". It is specifically designed to select a subset of data, following the database relationships/constraints to include all related rows from linked tables. To limit the amount of data you export, you can set a WHERE clause on the table you are exporting.
The issue of scrubbing your test data to remove customer data is still there, but this will be easier once you have a smaller subset to work with.
In addition to Liath's recommendation:
Maybe it's a hard way, but you can just export your schema (no data) and then write a stored procedure that iterates over your (original) tables and runs a simple:
INSERT INTO dest_table (fields) SELECT fields FROM origin_table WHERE <foreign keys already inserted> LIMIT 100;
or something like that.
Thanks to @Liath: for the "foreign keys already inserted" condition, you have to build a filter to ensure that every foreign key used by the table already exists in your test database, so you also need to iterate over your tables in dependency order (parents before children).
Another way is to export your data and edit the sql.dump file to remove the unwanted data (a really hard way).
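To make the dependency-order idea concrete, here is a minimal sketch with two hypothetical tables (customers and orders are my names, not from the question), assuming both schemas live on the same MySQL server:

    -- copy a sample of parent rows first
    INSERT INTO test_db.customers
    SELECT * FROM live_db.customers
    LIMIT 100;

    -- then copy only child rows whose foreign keys already exist in the subset
    INSERT INTO test_db.orders
    SELECT o.*
    FROM live_db.orders AS o
    WHERE o.customer_id IN (SELECT id FROM test_db.customers);

Repeating this table by table, parents before children, keeps every foreign key in the subset valid.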
I would suggest that no matter how thorough you are, the risk of getting live customer details into a test database is too high. What happens if you accidentally email or charge a real customer for something you're testing!?
There are a number of products out there such as RedGate's Data Generator which will create test data for you based on your schema (there is a free trial I believe so you can check it meets your needs before committing).
Your other alternative? Hire a temp to enter data all day!
ETA: Sorry - I just saw that you're looking at MySQL rather than MSSQL, which probably rules out the tool I recommended. A quick google turns up similar tools for MySQL.
I'm programming a web application in which each user stores their tasks (like a to-do application). Tasks will be stored in a table (for example: userstasks).
Which one is better?
1- The userstasks table has a column named user_id that identifies who created each task?
2- A new table (e.g. usernametasks) is created for each registered user to store all their tasks?
P.S.: There are lots of users!
You always start with the simplest thing that works and stick with it until it's proven to be a performance problem. What you're talking about with #2 is termed "premature optimization". You only go down that road when #1 is having severe performance problems.
When you split data across per-user tables, your ability to query across all users is severely diminished. For all intents and purposes, users will be living in different worlds, and reporting across them is nearly impossible.
For most applications that have a lot of reads, millions of records is not an issue. It's write-heavy applications that need special attention like this, or those with massive scale, like Reddit or Twitter. Since you're not making one of those, stick with the properly normalized structure first.
"Lots of users" probably means tens or hundreds of thousands. On a properly tuned MySQL instance that's not a big deal. If you need more scale, spin up some read-only secondary servers to spread out the load or look at using MySQL cluster.
I would go with option 1 (a table called tasks with a user_id foreign key) in the short run, assuming that a task can't have more than one user. If a task can belong to multiple users, you'll need a join table instead. Check into setting up an actual foreign key constraint as well; this promotes referential integrity in the data itself.
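A minimal sketch of option 1 (assuming MySQL/InnoDB; names are illustrative):

    CREATE TABLE users (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE tasks (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id INT UNSIGNED NOT NULL,
        title   VARCHAR(255) NOT NULL,
        done    BOOLEAN NOT NULL DEFAULT FALSE,
        FOREIGN KEY (user_id) REFERENCES users (id)  -- InnoDB indexes this for you
    ) ENGINE=InnoDB;

    -- one user's tasks are then a simple indexed filter:
    SELECT * FROM tasks WHERE user_id = 42;

If a task could belong to several users, you would swap user_id for a users_tasks join table holding (user_id, task_id) pairs.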
I'm in the middle of redesigning an app that has 100,000s of records in a particular table (currently 250K and growing).
The table contains information of websites and domains.
For the sake of speed and resources, should I include all the data needed about either entity in the original table, or should I be using two lookup tables to store information not shared - for example one lookup table which stores all domain specific info and one which stores all site specific info?
Thanks
Ideally you should split them into 2 different tables, because a single domain can correspond to multiple sites. If the metadata of both the domain and the site were stored in a single table, the domain information would have to be stored redundantly in every site record. With 2 separate tables instead - one record per domain in the domain table, and a domain column in the site table to find the domain for a given site - you get organized storage and no redundancy of data. This is the major principle of traditional RDBMS systems, and it is why we have the concept of multiple tables.
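A sketch of that split (assuming MySQL; the metadata columns are hypothetical examples):

    CREATE TABLE domains (
        id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(255) NOT NULL UNIQUE,  -- e.g. 'example.com'
        registrar VARCHAR(255),                  -- domain-specific metadata
        hyphens   TINYINT UNSIGNED
    );

    CREATE TABLE sites (
        id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        domain_id INT UNSIGNED NOT NULL,         -- which domain the site lives on
        traffic   INT UNSIGNED,                  -- site-specific metadata
        revenue   DECIMAL(12,2),
        FOREIGN KEY (domain_id) REFERENCES domains (id)
    );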
Also, you may consider using a NoSQL data store if you really want to scale your database, since you said the data is continuously increasing. Apache HBase may be a good solution, as it has this concept of grouping related information together.
Edit: a clarification from the question:
Just to be clear, domains and sites are not linked. They're just different entities: a domain with no traffic or revenue would be classed as a domain and have domain-related data stored for it (like number of hyphens or registrar), while a domain with, say, a Wordpress install and existing traffic would be classed as a site - not a domain - and have site-specific information stored. Would this change your answer?
In the case where they are not inter-related, I don't think that splitting the data into multiple tables is going to help in any way, unless you are going for a distributed RDBMS. On a single-node hosted DB the rows are indexed by the site/domain id anyway, and a large number of rows in a single table is not going to degrade performance. But if you are looking at a humongous amount of data and wish to divide it over multiple nodes in a cluster, then having independent tables will help: each table can be hosted on an individual node and the DB is able to scale horizontally. That is the only benefit I see in this case.
The performance of your application largely depends on the type of queries the application runs. Storing all data in one table does not necessarily reduce performance and might very well enhance it. You are of course wasting disk space if your table records that example.com is owned by Mr XY a few thousand times.
Normalizing your database (splitting your data up) can be helpful, but one would have to know what you want to do with the data to answer that.
I have a multi-tenant MySQL database. For most things I use a tenant-id column in tables to discriminate but for a few purposes, it is much better to have a view that is already filtered by tenant id. So for example, I might have a view called 'quarterly_sales_view' but for tenant 30 I would have 'quarterly_sales_view_30' and for tenant 51 I would have 'quarterly_sales_view_51'. I create these views dynamically and everything is working great but we have just a few tenants right now and I realize this would never work for millions of tenants.
My question is, am I going to run into either performance problems or just hard limits with a few thousand, few hundred, or few dozen custom views?
ADDITIONAL INFO:
I am using a 3rd party tool (immature) that requires a table name (or view name, since it's read-only) and operates on that. In the context it's working, I can't let it have access to the entire view, so I create another view that is simply defined as SELECT * FROM MasterView WHERE TenantId = 30. I recognize this as a workaround for a poor limitation of having to have the tool work on the table directly. Luckily this tool is open source, so I can tweak it to use a different approach. I just wanted to have an idea of how long I had before the current approach blew up.
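For concreteness, each generated view is literally just a thin wrapper like:

    CREATE VIEW quarterly_sales_view_30 AS
    SELECT * FROM MasterView WHERE TenantId = 30;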
The primary concern in this question (IMO) should be less with performance and more with the design. First, the number of views should not affect performance; but why do you need a view per tenant? Is it not possible to simply filter for a tenant by ID on a more generic view? E.g.:
SELECT * FROM vwMyTenants WHERE TenantId = 30
Whatever the reason, you should reconsider your approach, because it is a design smell.
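In other words, keep one generic view and make the tenant a query-time filter rather than a new object in the catalog. A sketch (the sales table and its columns are hypothetical):

    -- one view shared by all tenants
    CREATE VIEW quarterly_sales_view AS
    SELECT TenantId, quarter, SUM(amount) AS total
    FROM sales
    GROUP BY TenantId, quarter;

    -- each tenant is just a parameter, not a separate view
    SELECT * FROM quarterly_sales_view WHERE TenantId = 30;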
I'm considering a design for a private messaging system and I need some input here, basically I have several questions regarding this. I've read most of the related questions and they've given me some thought already.
All of the basic messaging systems I've thus far looked into use a single table for all of the users' messages. With indexes etc this approach would seem fine.
What I wanted to know is if there would be any benefit to splitting the user messages into separate tables, so that when a new user is created, a new table is created (either in the same or a dedicated message database) which stores all of the messages - sent and received - for that user.
What are the pitfalls/benefits to approaching things that way?
I'm writing in PHP; would the required code be particularly more cumbersome than with the single large table option?
Would the eventual result, with a large amount of smaller tables be a more robust, trouble free design than one large table?
In the event of large amounts of concurrent users, how would the performance of the server compare where dealing with one large versus many small tables?
Any help with those questions or other input would be appreciated. I'm currently working through a smaller-scale design for my test site before rewriting the PM module, and would like to optimise it. My poor human brain handles separate tables far more easily, but the same isn't necessarily true for a computer.
You'll just get headaches from moving to numerous small tables. Databases are made for handling lots of data; let the database do its thing.
You'll likely end up using dynamic table names in queries (SELECT * FROM $username WHERE ...), making smart features like stored procedures and possibly parameterized queries a lot trickier if not outright impossible. Usually a really bad idea.
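To illustrate (a MySQL sketch; @username_table stands in for whatever per-user name you would have to build): values can be bound as parameters, but table names cannot, so per-user tables force you to splice identifiers into the SQL string yourself:

    -- with one shared table, the user is just a bound value:
    PREPARE stmt FROM 'SELECT * FROM messages WHERE authorID = ? ORDER BY date_posted DESC';
    SET @uid = 1;
    EXECUTE stmt USING @uid;

    -- with per-user tables the table name cannot be a parameter,
    -- so the SQL string has to be concatenated together:
    SET @sql = CONCAT('SELECT * FROM ', @username_table, ' ORDER BY date_posted DESC');
    PREPARE stmt2 FROM @sql;
    EXECUTE stmt2;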
Try rewriting SELECT * FROM messages WHERE authorID = 1 ORDER BY date_posted DESC when "messages" is instead anywhere between 1 and 30,000 different tables. Keeping a single, fixed set of tables also keeps the relations between them usable in both directions, which is way more useful.
If you think table size will really be a problem, set up an "archived messages" clone table and periodically move old, already-read messages there, where they won't get in the way. Also note how most forum software with private messaging limits user inbox sizes. There are a few ways to solve the problem while keeping things sane.
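A minimal sketch of that archiving idea (assuming MySQL and a hypothetical is_read flag; in practice wrap the move in a transaction):

    -- one-time setup: clone the structure of the live table
    CREATE TABLE messages_archive LIKE messages;

    -- periodically move old, already-read messages out of the hot table
    INSERT INTO messages_archive
    SELECT * FROM messages
    WHERE is_read = 1 AND date_posted < NOW() - INTERVAL 1 YEAR;

    DELETE FROM messages
    WHERE is_read = 1 AND date_posted < NOW() - INTERVAL 1 YEAR;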
I'm agreeing with @MarkR here: initially, one table for messages is definitely the way to proceed. As time progresses, should you end up with a very large table, you can then consider how best to partition it. That's counter to the way I'd normally advise design, but we're talking about one fairly simple table - not a huge enterprise system.
A very long time ago (pre availability of SQL databases) I built a system that stored private and public messages, and I can confirm that once you split a message base's logical entity into more than one, everything¹ becomes a lot more complicated; and I doubt that a file per user is the right approach - the overheads will be massive compared to the benefit.
Avoid auto-increment² - using natural keys is very important to future scalability. Designing well to ensure that you can insert and retrieve without locking will be of more benefit.
¹ Indexing, threading, searching, purging/archiving.
² Natural keys are better if you can find one for your data as the autoincremented ID does not describe the data at all and databases are good at locating based on the primary key, so a natural primary key can improve things. Autoincrement can cause problems with a distributed database; it also leaks data when presented externally (to see the number of users registered just create a new account and check your user ID). If you can't find a natural key then a UUID (or GUID) may still be a better option - providing that the database has good support for this as a primary key. See When to use an auto-incremented primary key and when not to
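For what it's worth, a sketch of the UUID option (my example, not the answerer's; assumes MySQL 8.0.13+, where expression defaults and UUID_TO_BIN are available - the swap flag keeps the primary key index reasonably local):

    CREATE TABLE messages (
        id           BINARY(16) NOT NULL DEFAULT (UUID_TO_BIN(UUID(), 1)),
        sender_id    BINARY(16) NOT NULL,
        recipient_id BINARY(16) NOT NULL,
        body         TEXT NOT NULL,
        created_at   DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (id)
    );

    -- convert back to the familiar text form when reading
    SELECT BIN_TO_UUID(id, 1) AS id, body FROM messages;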
Creating one table per user certainly won't scale well when there are a large number of users with a small number of messages. Because of the way MySQL handles table opening/closing, very large numbers of tables (> 10k, say) become quite inefficient, especially at server startup and shutdown, as well as when trying to back up non-transactional tables.
However, the way you've worded your question sounds like a case of premature optimisation. Make it work first, then fix performance problems. This is always the right way to do things.
Partitioning / sharding will become necessary once your scale gets high enough. But there are a lot of other things to worry about in the meantime. Sort them out first :)
One table is the right way to go from an RDBMS PoV. I recommend you use it until you know better.
Splitting large amounts of data into smaller sets makes sense if you're trying to avoid locking issues: for example, locking the messages table while doing big selects or updating huge amounts of data at once. In that case long-running queries could block the whole table and everyone would have to wait. You should ask yourself whether this is going to happen in your case. To me, at least, it looks like a messaging system won't have such problems, because all information is pushed into the table or retrieved from it in rather small sets. If this is a user-centric application, then getting all messages for a single user is quite easy and fast, and the same goes for creating new messages for one particular user - unless you have a really huge number of users/messages in your system.
Splitting data into multiple tables also has some drawbacks: you will need some kind of management system or logic for how everything is split, and giving each user a separate table could soon grow into hundreds or thousands of tables, which is, in my opinion, not that nice. You would therefore probably need some other criterion for splitting the data, and if you want the splitting logic to be dynamic and easily adjustable, you would probably also need to store it in the DB somehow. As you can see, the complexity grows...
An advantage of such data sharding would be scalability: you could easily put different sets of data on different machines once a single machine is no longer able to handle the whole load.
It depends how your message system works.
Are there concurrency issues?
Does it need to be scalable as the application accommodates more customers?
A one-table design will work perfectly well for a small, one-message-at-a-time, single-user system.
However, if you are considering a multi-user, concurrent messaging system, the tables should be split.
The data model for a real-time application is recommended to be "normalized" (split into tables) because of "locking & latching" and data redundancy issues.
Locking policy varies by database vendor. If you have tables that the application updates and selects from concurrently, "locking" issues arise (at page level, row level, or table level, depending on the vendor). Some bad DB & app designs lock the table completely, so messages never go through.
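To illustrate the difference (a MySQL sketch; the is_read column is hypothetical):

    -- InnoDB locks at row level: two users can touch different rows concurrently
    START TRANSACTION;
    SELECT * FROM messages WHERE id = 42 FOR UPDATE;  -- locks only this row
    UPDATE messages SET is_read = 1 WHERE id = 42;
    COMMIT;

    -- an explicit table-level lock blocks every other session:
    LOCK TABLES messages WRITE;  -- nobody else can read or write messages now
    -- ... any long-running work here stalls the whole messaging system ...
    UNLOCK TABLES;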
The redundancy issue is clearer: if you use only one table, some information (such as user details - one user can send multiple messages) is stored redundantly.
Try googling "normalization" and "locking".