A quick bit of background: we have a table "orders" which has about 10k records written into it per day. This is the most queried table in the database. To keep the table small we plan to move records that are more than a week or so old into a different table, via an automated job. While we understand it would make sense to pop the history off to a separate server, we currently have just a single DB server.
The orders table is in databaseA. Following are the approaches we are considering:
Create a new schema databaseB and create an orders table that contains the history?
Create a table ordershistory in databaseA.
It would be great if we could get pointers as to which design would give better performance.
EDIT:
Better performance for:
Querying the current orders, since they are not weighed down by the past data
Querying the history
You could either:
Have a separate archival table, possibly in another database. This can potentially complicate querying.
Use partitioning.
I'm not sure how effective MySQL's partitioning is. For alternatives, you may take a look at PostgreSQL partitioning. Most commercial databases support it too.
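If MySQL partitioning is used, a minimal sketch might look like the following. The created_at column, the weekly partition names, and the archive step are assumptions, not part of the original schema; note that MySQL requires every unique key (including the primary key) to contain the partitioning column.

-- Minimal sketch: orders partitioned by week of creation (column names are assumptions)
CREATE TABLE orders (
    order_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    created_at DATE NOT NULL,
    -- ... other order columns ...
    PRIMARY KEY (order_id, created_at)   -- the PK must include the partitioning column
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p_2024_w01 VALUES LESS THAN (TO_DAYS('2024-01-08')),
    PARTITION p_2024_w02 VALUES LESS THAN (TO_DAYS('2024-01-15')),
    PARTITION p_future   VALUES LESS THAN MAXVALUE
);

-- An old week can then be archived or dropped as a whole partition, e.g.:
-- ALTER TABLE orders DROP PARTITION p_2024_w01;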
I take it from your question that you only want to deal with current orders.
In the past, I have used 3 tables on busy sites:
new orders,
processing orders,
filled orders,
and a main orders table
orders
All these tables have a relation to the orders table and a primary key.
e.g. new_orders_id, orders_id
processing_orders_id, orders_id ....
Using a left join to find new and processing orders should be relatively efficient.
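As a rough sketch, assuming the naming convention above (the remaining orders columns are placeholders), that lookup might be:

-- Orders that are currently new or still processing
SELECT o.*
FROM orders AS o
LEFT JOIN new_orders        AS n ON n.orders_id = o.orders_id
LEFT JOIN processing_orders AS p ON p.orders_id = o.orders_id
WHERE n.orders_id IS NOT NULL
   OR p.orders_id IS NOT NULL;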
Related
There are two big (millions of records) one-to-one tables:
course
prerequisite with a foreign key reference to the course table
in a single-node relational MySQL database. A join is needed to list the full description of all the courses.
An alternative is to have only one single table to contain both the course and prerequisite data in the same database.
Question: is the join query still slower than a simple select query (without a join) on the single denormalized table, despite the fact that both are on the same single-node MySQL database?
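To make the comparison concrete, the two query shapes being compared might look roughly like this; the course_id join column and the course_with_prerequisite table name are assumptions, not from the original schema.

-- Normalized: course and its one-to-one prerequisite row are joined on the key
SELECT c.*, p.*
FROM course AS c
JOIN prerequisite AS p ON p.course_id = c.course_id;

-- Denormalized alternative: one wide table, no join needed
SELECT *
FROM course_with_prerequisite;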
It's true that denormalization is often done to shorten the work to look up one record with its associated details. This usually means the query responds in less time.
But denormalization improves one query at the expense of other queries against the same data. Making one query faster will often make other queries slower. For example, what if you want to query the set of courses that have a given prerequisite?
It's also a risk when you use denormalization that you create data anomalies. For example, if you change a course name, you would also need to update all the places where it is named as a prerequisite. If you forget one, then you'll have a weird scenario where the obsolete name for a course is still used in some places.
How will you know you found them all? How much work in the form of extra queries will you have to do to double-check that you have no anomalies? Do those types of extra queries count toward making your database slower on average?
The purpose of normalizing a database is not performance. It's avoiding data anomalies, which reduces your work in other ways.
I believe this question does not specifically concern MySQL (which is the database I'm using); it's one about best practices.
Up until now, my problems could be solved by creating tables and querying them (sometimes JOINing here and there). But there is something I'm doing that doesn't feel right, and it bothers me whenever I need denormalized data alongside my "common" queries.
Example Use-case
So that I can express myself better, let's create a superficial scenario where:
a user can buy a product, generating a purchase (let's ignore the fact that the purchase can only have a single product);
and we need to query the products with the total number of times each has been purchased;
To solve our use-case, we could define a simple structure made by:
product table:
product_id [INT PK]
user table:
user_id [INT PK]
purchase table:
purchase_id [INT PK]
product_id [INT FK NOT NULL]
user_id [INT FK NOT NULL]
Here is where it doesn't feel right: when we need to retrieve a list of products with the total number of times each has been purchased, I would create the query:
# There are probably faster queries than this to reach the same output
SELECT
product.product_id,
(SELECT COUNT(*) FROM purchase
WHERE purchase.product_id = product.product_id)
FROM
product
The origin of my concern is that I've read that COUNT does a full table scan, and it scares me to run the query above when scaled to thousands of products being purchased, even though I've created an INDEX on the product_id FK on purchase (MySQL does this by default).
Possible solutions
My knowledge of relational databases is pretty shallow, so I'm kind of lost when comparing what the (plausible) alternatives are for these kinds of problems. So that it isn't said that I haven't done my homework (searching before asking), I've found it plausible to:
Create Transactions:
When INSERTing a new purchase, it must always be inside a transaction that also updates the product row identified by purchase.product_id.
Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
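For reference, a minimal sketch of that transaction, assuming the product table carries a purchase counter such as the bought_amount column proposed in the next option (the literal ids are placeholders):

-- Record the purchase and bump the product counter atomically
START TRANSACTION;
INSERT INTO purchase (product_id, user_id) VALUES (42, 7);
UPDATE product
   SET bought_amount = bought_amount + 1
 WHERE product_id = 42;
COMMIT;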
Create Triggers:
Whenever I insert, delete or update some row in some specific table, I would update my products table with a new value (bought_amount). So the table would become:
product table:
product_id [INT PK]
bought_amount [INT NOT NULL];
Possible problems: are triggers expensive? Is there a way that the insertion succeeds but the trigger doesn't, thus leaving me with an inconsistency?
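A minimal AFTER INSERT trigger for that idea might look like the sketch below (an AFTER DELETE counterpart would also be needed to keep the counter honest). With InnoDB tables, the trigger's UPDATE runs as part of the same statement as the INSERT, so if the trigger fails the INSERT is rolled back rather than leaving the counter inconsistent.

-- Keep product.bought_amount in sync on every new purchase (sketch only)
DELIMITER //
CREATE TRIGGER purchase_after_insert
AFTER INSERT ON purchase
FOR EACH ROW
BEGIN
    UPDATE product
       SET bought_amount = bought_amount + 1
     WHERE product_id = NEW.product_id;
END//
DELIMITER ;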
Question
Is updating certain tables to store data that constantly changes a plausible approach with RDBMSs? Or is it safer and, in the long term, more beneficial to just keep joining and counting/summing the underlying rows?
I've found a couple of useful questions/answers regarding this matter, but none of them addressed this subject from a broad perspective.
Please take into consideration my ignorance about RDBMSs, as I may be suggesting nonsense in my possible solutions.
The usual way to get a count per key is
SELECT product_id, COUNT(*)
FROM purchase
GROUP BY product_id
You don't need to mention the product table, because all it contains is the key column. Now although that uses COUNT(*), it doesn't need a full table scan for every product_id because the SQL engine is smart enough to see the GROUP BY.
But this produces a different result to your query: for products that have never been purchased, my query simply won't show them; your query will show the product_id with count zero.
So before you start worrying about implementation and efficiency, what question(s) are you trying to answer? If you want to see all products whether purchased or not, then you must scan the whole product table and look up from that into purchase. I would write:
SELECT product.product_id, COALESCE(purch.count, 0) AS count
FROM product
LEFT JOIN (SELECT product_id, COUNT(*) AS count
           FROM purchase
           GROUP BY product_id) AS purch
    ON product.product_id = purch.product_id
As regards your wider questions (not sure I fully understand them), in the early days SQL was quite inefficient at this sort of joining and aggregating, and schemas were often denormalised with repeated columns in multiple tables. SQL engines are now much smarter, so that's not necessary. You might see that old-fashioned practice in older textbooks. I would ignore it and design your schema as normalised as possible.
This query:
SELECT p.product_id,
(SELECT COUNT(*)
FROM purchase pu
WHERE pu.product_id = p.product_id
)
FROM product p;
has to scan both product and purchase. I'm not sure why you are emotional about one table scan but not the other.
As for performance, this can take advantage of an index on purchase(product_id). In MySQL, this will probably be faster than the equivalent (left) join version.
You should not worry about performance of such queries until that becomes an issue. If you need to increase performance of such a query, first I would ask: Why? That is a lot of information being returned -- about all products over all time. More typically, I would expect someone to care about one product or a period of time or both. And, those concerns would suggest the development of a datamart.
If performance is an issue, you have many alternatives, such as:
Defining a data mart to periodically summarize the data into more efficient structures for such queries (a rough sketch follows this list).
Adding triggers to the database to summarize the data, if the results are needed in real-time.
Developing a methodology for maintaining the data that also maintains the summaries, either at the application-level or using stored procedures.
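As a rough illustration of the first alternative, a small summary table refreshed by a scheduled job might look like this; the table name and refresh cadence are assumptions.

-- Hypothetical summary table, rebuilt periodically (e.g. nightly)
CREATE TABLE product_purchase_summary (
    product_id     INT NOT NULL PRIMARY KEY,
    purchase_count INT NOT NULL
);

REPLACE INTO product_purchase_summary (product_id, purchase_count)
SELECT product_id, COUNT(*)
FROM purchase
GROUP BY product_id;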
What doesn't "feel right" to you is actually the tremendous strength of a relational database (with a reasonable data model). You can keep it up-to-date. And you can query it using a pretty concise language that meets business needs.
Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
--> Build a Stored Procedure that does both steps in a transaction, then force users to go through that.
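A rough sketch of such a procedure, reusing the purchase/product tables from the question (error handling omitted; the procedure name and parameters are placeholders):

-- Single entry point that performs both steps in one transaction
DELIMITER //
CREATE PROCEDURE record_purchase(IN p_product_id INT, IN p_user_id INT)
BEGIN
    START TRANSACTION;
    INSERT INTO purchase (product_id, user_id) VALUES (p_product_id, p_user_id);
    UPDATE product
       SET bought_amount = bought_amount + 1
     WHERE product_id = p_product_id;
    COMMIT;
END//
DELIMITER ;

-- Then grant users EXECUTE on the procedure while revoking direct INSERT/UPDATE on the tables.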
Possible problems: are triggers expensive? is there a way that the insertion succeeds but the trigger won't - thus leaving me with an inconsistency?
Triggers are not too bad. But, again, I would recommend forcing users through a Stored Procedure that does all the desired steps.
Note: Instead of Stored Procedures, you could have an application that does the necessary steps; then force users to go through the app and give them no direct access to the database.
A database is the "source of truth" on the data. It is the "persistent" repository for such. It should not be considered the entire engine for building an application.
As for performance:
Summing over a million rows may take a noticeable amount of time.
You can easily do a hundred single-row queries (select/insert/update) per second.
Please think through numbers like that.
I am creating a database for a project. In this project we will create many different companies. We have two options for creating the database.
Create a common table for all companies and save all the information in that single table, say company_daily_records, which will hold every company's data. Suppose a company has 100,000 records and we have 1,000 companies; then company_daily_records will have 100,000 * 1,000 records.
Create a separate table for each company, so there will be 1,000 company_daily_records tables and each table will have 100,000 records.
Which design will give better performance?
Also, which SQL database should we prefer?
1) If you create a separate database for each company, which is more likely, then your records will be well organized. But if your project deals with all companies at the same time, you will have to switch your connection frequently.
2) If you create one database for all companies, that is also possible; you just have to add an additional 'company' table listing all companies, which can be used as a foreign key in, e.g., an 'employee' table to separate the employees of a specific company.
But it adds complexity to the records, as they are not in a very organized form.
As you mention that the daily records can be in the billions, I suggest you go with separate databases; that will surely save searching and query time, which is the most important aspect...
--> I think you can use MySQL to manage your records.
Thank you.
I would not suggest creating a table for each company, because:
How do you know what/how many companies there will be?
When you have a new company, you would possibly need to create a new table in the database and update your application code manually. It could be made automatic, but that is not an easy task.
Because you are at an early stage now, it is fine to go with the traditional relational-database way: that is, a company table and a company_record table. You can worry about performance later, when it becomes a problem or when you have spare time for optimization.
Don't design the schema for a large dataset until you have some thoughts on how the data will be inserted and queried.
You need to avoid scanning 100 million (10 crore) rows to get an answer; it will be painfully slow. That implies indexing.
NoSQL implies no indexing, or you have to build the indexes yourself. You would be better off with a real RDBMS doing such heavy-lifting for you.
If you split by company into tables or databases or partitions or shards:
Today you have 1000 tables (etc), tomorrow you have 1123.
Any operation that goes across companies will be difficult and slow.
Working with 1,000 tables/dbs/partitions, or especially shards, has inefficiencies.
I vote for a single 'large' (but not 'huge') table with a SMALLINT UNSIGNED (2-byte) column for company_id.
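A minimal sketch of that single-table layout; everything except the SMALLINT UNSIGNED company_id is an assumption, including the extra columns and the index.

-- One shared table; each company's rows are found via the composite index, not a table per company
CREATE TABLE company_daily_records (
    record_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    company_id  SMALLINT UNSIGNED NOT NULL,          -- 2 bytes, up to 65,535 companies
    record_date DATE NOT NULL,
    -- ... measurement columns ...
    PRIMARY KEY (record_id),
    KEY idx_company_date (company_id, record_date)   -- avoids scanning other companies' rows
) ENGINE=InnoDB;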
Since you are into the "Data Warehouse" realm, Summary Tables come to mind.
Will you be deleting "old" data? That is another thing to worry about in large tables.
Inserting 1000 rows per day is no problem. (1000/second would be another story.)
I'm looking to create an order history for each client at work using MySQL.
I was wondering if it is best practice to create a separate table for each client, with each row identifying an order they've placed, or to have one table with all orders, with a column holding an identifier for each client, which would be used to populate their order history.
We're looking at around 50-100 clients, with 10-20 orders a year added for each of them, so I am trying to make this as efficient as I can, performance-wise.
Any help would be appreciated. Thanks.
It is never a good idea to create a separate table for specific data (e.g. per client) as this destroys relational integrity / flexibility within the RDBMS itself. You would have to have something external that adds/removes the tables, and the tables wouldn't have integrity between each other.
The answer is in your second sentence: One table for orders that has a column that points to the unique identifier for clients. This is as efficient as possible, especially for such small numbers.
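A minimal sketch of that layout; the column names and types are illustrative, not prescribed.

-- One clients table, one orders table; client_id ties each order to its client
CREATE TABLE clients (
    client_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE orders (
    order_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    client_id  INT UNSIGNED NOT NULL,
    ordered_at DATETIME NOT NULL,
    KEY idx_client (client_id),
    CONSTRAINT fk_orders_client FOREIGN KEY (client_id) REFERENCES clients (client_id)
) ENGINE=InnoDB;

-- A client's order history is then a single indexed lookup:
-- SELECT * FROM orders WHERE client_id = ? ORDER BY ordered_at DESC;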
I have to store 2 dates for almost every table in the database, e.g. tbl_clients, tbl_users, tbl_employers, tbl_position, tbl_payments, tbl_quiz, tbl_email_reminder, etc.
Most of the time I store "date_created" and "date_modified", and sometimes a few extra dates.
What would be the best approach to storing dates in a MySQL database, performance-wise (for a site that might have a lot of customers later, maybe 500,000+)?
Option 1: Add 2 columns for dates to each table.
Option 2: Create table "tbl_dates" exclusively for dates.
I was thinking option 2 would work faster, as I only need dates displayed on one specific page, e.g. "report.php". Am I right?
Also, how many columns at most should I put in "tbl_dates" without making it too slow?
For the general case (a row-creation and a row-modification timestamp) I would put them in the same table as the rows they relate to. Otherwise, you'll find that the joins you consequently need will slow down your queries more than the simple approach.
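For example, MySQL can maintain both timestamps for you in the same row; this is a sketch with assumed column names, and note that only from MySQL 5.6.5 onward can more than one TIMESTAMP column per table use CURRENT_TIMESTAMP defaults.

-- Creation and modification times live on the row itself; no join needed
CREATE TABLE tbl_clients (
    client_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -- ... client columns ...
    date_created  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    date_modified TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB;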
In any case, you don't want to get into the habit of building "general tables" to which many tables can JOIN - this is because ideally you would create foreign keys for each relationship, but this won't work if some rows belong to tbl_clients, some to tbl_users... (etc).
Admittedly your MySQL engine may prevent you from using foreign keys - depending on which one you're using - but (for me at least) the point stands.