One very large table or three large tables? MySQL performance - mysql

Assume a very large database. A table with 900 million records.
Method A:
Table: Posts
+----------+---------------+------------------+----------------+
| id (int) | item_id (int) | post_type (ENUM) | Content (TEXT) |
+----------+---------------+------------------+----------------+
| 1        | 1             | user             | some text ...  |
| 2        | 1             | page             | some text ...  |
| 3        | 1             | group            | some text ...  |
+----------+---------------+------------------+----------------+
// row 1 : User with ID 1 has a post with ID #1
// row 2 : Page with ID 1 has a post with ID #2
// row 3 : Group with ID 1 has a post with ID #3
The goal is to display 20 records from all three post_types on one page.
SELECT * FROM posts LIMIT 20
But I am worried about the number of records with this method.
Method B:
Split the 900 million records into 3 tables of 300 million records each.
Table: User Posts
+----------+---------------+----------------+
| id (int) | user_id (int) | Content (TEXT) |
+----------+---------------+----------------+
| 1        | 1             | some text ...  |
| 2        | 2             | some text ...  |
| 3        | 3             | some text ...  |
+----------+---------------+----------------+
Table: Page Posts
+----------+---------------+----------------+
| id (int) | page_id (int) | Content (TEXT) |
+----------+---------------+----------------+
| 1        | 1             | some text ...  |
| 2        | 2             | some text ...  |
| 3        | 3             | some text ...  |
+----------+---------------+----------------+
Table: Group Posts
+----------+----------------+----------------+
| id (int) | group_id (int) | Content (TEXT) |
+----------+----------------+----------------+
| 1        | 1              | some text ...  |
| 2        | 2              | some text ...  |
| 3        | 3              | some text ...  |
+----------+----------------+----------------+
Now, to get a list of 20 posts to display:
SELECT * FROM User_Posts LIMIT 10
SELECT * FROM Page_Posts LIMIT 10
SELECT * FROM Group_Posts LIMIT 10
// then build an array or object from the results and display it in the output.
In this method, I would have to sort the combined results in a PHP array and then send them to the page.
Which method is preferred?
Will splitting a 900-million-record table into three tables affect the speed of reading and writing in MySQL?
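For reference, Method B's read path doesn't have to be three separate queries sorted in PHP; a single UNION ALL can merge and sort server-side. A minimal sketch (SQLite used for portability; table names follow the question, the data is illustrative):

```python
import sqlite3

# Three per-type tables, as in Method B.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_posts  (id INTEGER PRIMARY KEY, user_id INTEGER,  content TEXT);
CREATE TABLE page_posts  (id INTEGER PRIMARY KEY, page_id INTEGER,  content TEXT);
CREATE TABLE group_posts (id INTEGER PRIMARY KEY, group_id INTEGER, content TEXT);
INSERT INTO user_posts  VALUES (1, 1, 'user post');
INSERT INTO page_posts  VALUES (1, 1, 'page post');
INSERT INTO group_posts VALUES (1, 1, 'group post');
""")

# One round trip: merge the three tables and let SQL do the sorting,
# instead of three LIMIT 10 queries plus a PHP-side sort.
rows = conn.execute("""
    SELECT 'user'  AS post_type, id, content FROM user_posts
    UNION ALL
    SELECT 'page'  AS post_type, id, content FROM page_posts
    UNION ALL
    SELECT 'group' AS post_type, id, content FROM group_posts
    ORDER BY id
    LIMIT 20
""").fetchall()
print(len(rows))  # 3 rows in this toy data set
```

Note the trade-off this illustrates: Method B forces every "all posts" page to pay for a multi-table merge that Method A gets for free.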

This is actually a discussion of Single Table Inheritance vs. Table Per Class Inheritance, leaving out joined inheritance. The former corresponds to your Method A, the latter to your Method B; a Method C would keep the IDs of all posts in one table and defer the attributes specific to group or user posts into different tables.
Whilst a big table always has its negative impacts, such as full table scans, the approach of splitting tables has its own, too. It depends on how often your application needs to access the whole list of posts vs. only retrieving certain post types.
Another consideration you should take into account is data partitioning, which is supported by MySQL and Oracle Database, for example. It is a way of organizing your data within tables that gives opportunities for information lifecycle management (which data is accessed when and how often; can part of it be moved and compressed, reducing database size and increasing the speed of access to the remaining data in the table). It is basically split into three major techniques:
range-based partitioning, list-based partitioning, and hash-based partitioning.
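As an illustration of the first technique, a hypothetical range-partitioning sketch for a posts table (the `created` column is an assumption added for the example; it must be part of the primary key for MySQL to accept the partitioning):

```sql
-- Sketch only: range-partition posts by year of creation.
CREATE TABLE posts (
  id        BIGINT NOT NULL,
  item_id   INT NOT NULL,
  post_type ENUM('user','page','group') NOT NULL,
  content   TEXT,
  created   DATE NOT NULL,
  PRIMARY KEY (id, created)
)
PARTITION BY RANGE (YEAR(created)) (
  PARTITION p2022 VALUES LESS THAN (2023),
  PARTITION p2023 VALUES LESS THAN (2024),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```

Old partitions can then be dropped or archived cheaply, which is the lifecycle benefit described above.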
Other, less commonly supported features for reducing table sizes are those that let inserts carry a timestamp which automatically invalidates the inserted data after a certain time period has expired.
What is indeed a major application design decision, and can boost performance, is to distinguish between read and write accesses to the database at the application level.
Consider a MySQL backend: because write accesses are obviously more critical to database performance than read accesses, you could set up one MySQL instance for writing to the database and another one, a replica of it, for the read accesses. This is also debatable, though, mainly when it comes to real-time decisions, where absolute consistency of the data at any given time is a must.
Using object pools as a layer between your application and the database is also a technique to improve application performance, though I don't know of existing solutions in the PHP world yet. Oracle Hot Cache is a pretty sophisticated example of it.
You could build your own, implemented on top of an in-memory database or using memcached, though.

Related

How to structure a MySQL query to join with an exclusion

In short: we are trying to return certain results from one table based on second-level criteria from another table.
I have a number of source data tables,
So:
Table DataA:
data_id | columns | stuff....
-----------------------------
1 | here | etc.
2 | here | poop
3 | here | etc.
Table DataB:
data_id | columnz | various....
-----------------------------
1 | there | you
2 | there | get
3 | there | the
4 | there | idea.
Table DataC:
data_id | column_s | others....
-----------------------------
1 | where | you
2 | where | get
3 | where | the
4 | where | idea.
Table DataD: etc. There are more, and more will be added on an ongoing basis.
And a relational table of visits, where there are "visits" to some of these other data rows in these other tables above.
Each of the above tables holds very different sets of data.
The way this is currently structured is like this:
Visits Table:
visit_id | reference | ref_id | visit_data | columns | notes
-------------------------------------------------------------
1 | DataC | 2 | some data | etc. | so this is a reference
| | | | | to a visit to row id
| | | | | 2 on table DataC
2 | DataC | 3 | some data | etc. | ...
3 | DataB | 4 | more data | etc. | so this is a reference
| | | | | to a visit to row id
| | | | | 4 on table DataB
4 | DataA | 1 | more data | etc. | etc. etc.
5 | DataA | 2 | more data | etc. | you get the idea
We currently list the visits by various user-given criteria, such as visit date.
However, the user can also choose which tables (i.e. data types) they want to view, so a user ticks a box to show they want data from the DataA and DataC tables but not DataB, for example.
The SQL we currently have works like this (the column list in the IN condition is generated dynamically from the user's choices):
SELECT visit_id,columns, visit_data, notes
FROM visits
WHERE visit_date < :maxDate AND visits.reference IN ('DataA','DataC')
The Issue:
Now, we need to go a step beyond this and list the visits by a sub-criteria of one of the "Data" tables,
So for example, DataA table has a reference to something else, so now the client wants to list all visits to numerous reference types, and IF the type is DataA then to only count the visits if the data in that table fits a value.
For example:
List all visits to DataB and all visits to DataA where DataA.stuff = poop
The way we currently handle this is a secondary SQL query on the results of the first visit listing, exampled above. This works, but it always returns the full DataA table when we only want a subset of DataA; we can't be exclusive about it outside of DataA.
We can't use LEFT JOIN because that doesn't trim the results as needed; we can't use exclusionary joins (RIGHT / INNER) because those remove everything from DataC or any other table.
We can't find a way to add the condition to the WHERE clause because, again, that would lose any data from any table that is not DataA.
What we kind of need is a JOIN within an IF/CASE clause.
Pseudo SQL:
SELECT visit_id,columns, visit_data, notes
FROM visits
IF(visits.reference = 'DataA')
INNER JOIN DataA ON visits.ref_id = DataA.id AND DataA.stuff = 'poop'
ENDIF
WHERE visit_date < '2020-12-06' AND visits.reference IN ('DataA','DataC')
All criteria in the WHERE clause are set by the user, none are static (This includes the DataA.stuff criteria too).
So with the above example the output would be:
visit_id | reference | ref_id | visit_data | columns | notes
-------------------------------------------------------------
1 | DataC | 2 | some data | etc. |
2 | DataC | 3 | some data | etc. |
5 | DataA | 1 | more data | etc. |
We can't use Union because the different Data tables contain lots of different details.
Questions:
There may be a very straightforward answer to this, but I can't see it.
How can we approach trying to achieve this sort of partial exclusivity?
I suspect that our overarching architecture structure here could be improved (the system complexity has grown organically over a number of years). If so, what could be a better way of building this?
What we kind of need is a JOIN within an IF/CASE clause.
Well, you should know that's not possible in SQL.
Think of this analogy to function calls in a conventional programming language. You're essentially asking for something like:
What we need is a function call that calls a different function depending on the value you pass as a parameter.
As if you could do this:
call $somefunction(argument);
And which $somefunction gets called would be determined by the value of argument at the time of the call. This doesn't make sense in any programming language.
It is similar in SQL: the tables and columns must be fixed at the time the query is parsed. Rows of data are not read until the query is executed, so the query cannot change its tables depending on the rows it reads.
The simplest answer would be that you must run more than one query:
SELECT visit_id,columns, visit_data, notes
FROM visits
INNER JOIN DataA ON visits.ref_id = DataA.id AND DataA.stuff = 'poop'
WHERE visit_date < '2020-12-06' AND visits.reference = 'DataA';
SELECT visit_id,columns, visit_data, notes
FROM visits
WHERE visit_date < '2020-12-06' AND visits.reference = 'DataC';
Not every task must be done in one SQL query. If it's too complex or difficult to combine two tasks into one query, then leave them separate and write code in the client application to combine the results.
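A minimal sketch of that two-query, combine-in-the-client approach (SQLite used for portability; schema and values follow the question's example, trimmed to the relevant columns):

```python
import sqlite3

# Toy versions of visits and DataA from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (visit_id INTEGER, reference TEXT, ref_id INTEGER,
                     visit_data TEXT, visit_date TEXT);
CREATE TABLE DataA (data_id INTEGER, stuff TEXT);
INSERT INTO visits VALUES (1,'DataC',2,'some data','2020-11-01'),
                          (4,'DataA',1,'more data','2020-11-02'),
                          (5,'DataA',2,'more data','2020-11-03');
INSERT INTO DataA VALUES (1,'etc.'), (2,'poop');
""")

params = {"maxDate": "2020-12-06", "stuff": "poop"}

# Query 1: DataA visits, filtered on the DataA sub-criterion.
data_a = conn.execute("""
    SELECT v.visit_id, v.visit_data FROM visits v
    JOIN DataA a ON v.ref_id = a.data_id AND a.stuff = :stuff
    WHERE v.visit_date < :maxDate AND v.reference = 'DataA'
""", params).fetchall()

# Query 2: DataC visits, taken as-is.
data_c = conn.execute("""
    SELECT visit_id, visit_data FROM visits
    WHERE visit_date < :maxDate AND reference = 'DataC'
""", params).fetchall()

# Combine and sort in application code, as the answer suggests.
combined = sorted(data_a + data_c)
print([row[0] for row in combined])  # visit ids 1 and 5
```

Visit 4 is dropped because its DataA row does not match `stuff = 'poop'`, which is exactly the partial exclusivity the question asks for.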

How to add index to a mysql database to improve performance?

I'm working on a project that synchronizes content across multiple clusters, using two programs. One program periodically updates the existence status of the content on each cluster; the other program copies a content item when it is not present on all clusters.
The database tables are designed in following way.
table contents -- stores content information.
+----+----------+
| id | name |
+----+----------+
| 1 | content_1|
| 2 | content_2|
| 3 | content_3|
+----+----------+
table clusters -- stores cluster information
+----+----------+
| id | name |
+----+----------+
| 1 | cluster_1|
| 2 | cluster_2|
+----+----------+
table content_cluster -- each record indicates that one content is on one cluster
+----------+----------+-------------------+
|content_id|cluster_id| last_update_date|
+----------+----------+-------------------+
| 1 | 1 |2020-10-01T11:30:00|
| 2 | 2 |2020-10-01T11:30:00|
| 3 | 1 |2020-10-01T10:30:00|
| 3 | 2 |2020-10-01T10:30:00|
+----------+----------+-------------------+
The first program updates these tables periodically (a small part may change; most of the table stays the same). The second program iteratively fetches one content record that is not on all clusters. The select query is something like the following.
SELECT content_id
FROM content_cluster
GROUP BY content_id
HAVING COUNT(cluster_id) < <cluster_number>
LIMIT 1;
It seems inefficient, as I have to group the whole table on every query.
As I am not familiar with databases, I'm wondering whether this is a good way to design the schema. Do I have to add some indexes? How should I write the select query to make it efficient?
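One commonly suggested setup for this shape of query (an assumption, not from the original thread) is to make (content_id, cluster_id) the table's primary key: the GROUP BY can then be answered by walking that index in order. A sketch using SQLite for portability, with the question's data:

```python
import sqlite3

# content_cluster with a composite primary key on (content_id, cluster_id);
# the index it creates covers the GROUP BY below.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content_cluster (
    content_id INTEGER, cluster_id INTEGER, last_update_date TEXT,
    PRIMARY KEY (content_id, cluster_id)
);
INSERT INTO content_cluster VALUES
    (1, 1, '2020-10-01T11:30:00'),
    (2, 2, '2020-10-01T11:30:00'),
    (3, 1, '2020-10-01T10:30:00'),
    (3, 2, '2020-10-01T10:30:00');
""")

cluster_number = 2  # total number of clusters
row = conn.execute("""
    SELECT content_id FROM content_cluster
    GROUP BY content_id
    HAVING COUNT(cluster_id) < ?
    LIMIT 1
""", (cluster_number,)).fetchone()
print(row[0])  # a content missing from at least one cluster
```

Contents 1 and 2 each appear on only one of the two clusters, so one of them is returned; content 3 is on both and is filtered out by the HAVING clause.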

How should I design my database for different user has different devices?

I was confused about database design.
This is my requirement: there are different users, and a user can have one or more devices. Some users will have thousands of devices, even tens of thousands. A device has many operation records; each day adds 10-20 operation records per device.
My plan was to create a user table recording userId and passWord, a separate device-list table for each user recording deviceNum and deviceState, and a separate operation table for each device recording its operation records.
So my database would have many, many tables. I expect more than one hundred thousand devices. Should I create one hundred thousand device tables?
From what you've said, you need only 3 tables:
Users - user_id, username, pass, data, etc...
Devices - device_id, user_id (owner), device_data....
OpRecords - record_id, device_id, record_data_stuff...
All users go into the first table, with whatever additional fields you need for user data such as address, phone, last login time, etc.
All devices are registered in the Devices table, each device associated with the user who owns it via user_id. You can add as many columns as needed: device type, device name, parameters, work hours, and so on.
All records go into a single table, each record associated with the device that did the job (device_id). You can add your record-specific columns here, such as start_time, end_time, and so on.
You don't need a separate table for each device's record list. When you need to show the records of a single device, just SELECT with device_id = X and you will receive the records for that device only.
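A minimal sketch of that three-table layout in action (SQLite used for portability; the column and sample values are illustrative, not from the original):

```python
import sqlite3

# One table per entity, never one table per device.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users      (user_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE devices    (device_id INTEGER PRIMARY KEY, user_id INTEGER,
                         device_name TEXT);
CREATE TABLE op_records (record_id INTEGER PRIMARY KEY, device_id INTEGER,
                         record_data TEXT);
INSERT INTO users VALUES (1, 'alice');
INSERT INTO devices VALUES (10, 1, 'sensor-a'), (11, 1, 'sensor-b');
INSERT INTO op_records VALUES (100, 10, 'on'), (101, 10, 'off'),
                              (102, 11, 'on');
""")

# Records for a single device: a plain filter, no per-device table needed.
rows = conn.execute(
    "SELECT record_id, record_data FROM op_records WHERE device_id = ?",
    (10,),
).fetchall()
print(rows)  # the two records belonging to device 10
```

With an index on op_records.device_id (the kind of index recommended below), this lookup stays fast regardless of how many devices share the table.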
UPDATE:
ten thousand devices, every one makes ten operation records a day. ten
days
That makes about 1 million records in 10 days. It is a challenge, but I don't think splitting this data into different tables will give you better performance.
You should keep the records table to as few columns as possible, holding only the minimum required information. Try not to use strings; where possible use only integer numbers. If you use fields to store date/time, use TIMESTAMP, as it is the closest to an integer. Create indexes on all columns you will use to search data: device_id, time (if you have such a column), and so on. This will increase disk usage, but I think it will raise performance.
Consider whether you can back up your data to an external file after a given period; for example, after a month you would free about 3 million records.
I would suggest using a layout like this:
User
+----+----------+----------+-------+
| id | username | password | email |
+----+----------+----------+-------+
| 1 | jon | xxxx | xxxx |
| 2 | melissa | xxxx | xxxx |
+----+----------+----------+-------+
Device
+----+---------+--------+
| id | user_id | name |
+----+---------+--------+
| 1 | 1 | iPad |
| 2 | 1 | iPhone |
| 3 | 1 | Laptop |
| 4 | 2 | Laptop |
+----+---------+--------+
Operation
+----+-----------+-----+-------+
| id | device_id | key | value |
+----+-----------+-----+-------+
| 1 | 1 | x | y |
| 2 | 1 | y | z |
| 3 | 2 | x | x |
| 4 | 2 | z | y |
| 5 | 3 | x | y |
+----+-----------+-----+-------+
I hope that I got your question right.

How can I implement a viewed system for my website's posts?

Here is my current structure:
// posts
+----+--------+----------+-----------+------------+
| id | title | content | author_id | date_time |
+----+--------+----------+-----------+------------+
| 1 | title1 | content1 | 435 | 1468111492 |
| 2 | title2 | content2 | 657 | 1468113910 |
| 3 | title3 | content3 | 712 | 1468113791 |
+----+--------+----------+-----------+------------+
// viewed
+----+---------------+---------+------------+
| id | user_id_or_ip | post_id | date_time |
+----+---------------+---------+------------+
| 1 | 324 | 1 | 1468111493 |
| 2 | 546 | 3 | 1468111661 |
| 3 | 135.54.12.1 | 1 | 1468111691 |
| 5 | 75 | 1 | 1468112342 |
| 6 | 56.26.32.1 | 2 | 1468113190 |
| 7 | 56.26.32.1 | 3 | 1468113194 |
| 5 | 75 | 2 | 1468112612 |
+----+---------------+---------+------------+
Here is my query:
SELECT p.*,
(SELECT count(*) FROM viewed WHERE post_id = :id) AS total_viewed
FROM posts p
WHERE id = :id
Currently I'm faced with a huge amount of data in the viewed table. What's wrong with my table structure (or database design)? In other words, how can I improve it?
A website like Stack Overflow has almost 12 million posts, each viewed (on average) 500 times. So the number of rows in viewed would be:
12,000,000 * 500 = 6,000,000,000 rows
Hah :-) .. Honestly I cannot even read that number (and it keeps growing every second). So how does Stack Overflow handle the view count for each post? Does it really calculate count(*) from viewed every time a post is shown?
You are not likely to need partitioning, redis, nosql, etc, until you have many millions of rows. Meanwhile, let's see what we can do with what you do have.
Let's start by dissecting your query. I see WHERE id=... but no LIMIT or ORDER BY. Let's add to your table
INDEX(id, timestamp)
and use
WHERE id = :id
ORDER BY timestamp DESC
LIMIT 10
Any index is sorted by what is indexed; that is, the 10 rows you are looking for are adjacent to each other. Even if the data has been pushed out of cache, there will probably be only one block to read to provide those 10 rows.
But a "row" in a secondary index in InnoDB does not contain the data to satisfy SELECT *. The index "row" contains a pointer to the actual 'data' row. So, there will be 10 lookups to get them.
As for view count, let's implement that a different way:
CREATE TABLE ViewCounts (
    post_id ...,
    ct MEDIUMINT UNSIGNED NOT NULL,
    PRIMARY KEY (post_id)
) ENGINE=InnoDB;
Now, given a post_id, it is very efficient to drill down the BTree to find the count. JOINing this table to the other, we get the individual counts with another 10 lookups.
So, you say, "why not put them in the same table"? The reason is that ViewCounts is changing so frequently that those actions will clash with other activity on Postings. It is better to keep them separate.
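Keeping ViewCounts current is typically a single upsert per view. A sketch (SQLite syntax shown so it runs anywhere; in MySQL the equivalent statement is INSERT ... ON DUPLICATE KEY UPDATE ct = ct + 1):

```python
import sqlite3

# Counter table keyed by post_id, as in the ViewCounts design above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ViewCounts (
        post_id INTEGER PRIMARY KEY,
        ct INTEGER NOT NULL
    )
""")

def record_view(post_id):
    # Insert the first view, or bump the existing counter: one statement,
    # no SELECT-then-UPDATE race.
    conn.execute("""
        INSERT INTO ViewCounts (post_id, ct) VALUES (?, 1)
        ON CONFLICT(post_id) DO UPDATE SET ct = ct + 1
    """, (post_id,))

for _ in range(3):
    record_view(42)

ct = conn.execute("SELECT ct FROM ViewCounts WHERE post_id = 42").fetchone()[0]
print(ct)  # 3
```

Because only this small table is written on every view, the hot write traffic stays off the posts table, which is the point of keeping them separate.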
Even though we hit a couple dozen blocks, that is not bad compared to scanning millions of rows. And, this kind of data is somewhat "cacheable". Recent postings are more frequently accessed. Popular users are more frequently accessed. So, 100GB of data can be adequately cached in 10GB of RAM. Scaling is all about "counting the disk hits".

MySQL - how to optimize query to count votes

Just after some opinions on the best way to achieve the following outcome:
I would like to store in my MySQL database products which can be voted on by users (each vote is worth +1). I also want to be able to see how many times in total a user has voted.
To my simple mind, the following table structure would be ideal:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
This way I can do a COUNT of the user_product_vote table for each product or user.
For example, when I want to look up bananas and the number of votes to show on a web page I could perform the following query:
SELECT p.product AS product, COUNT( v.id ) as votes
FROM product p
LEFT JOIN user_product_vote v ON p.id = v.product_id
WHERE p.id =1
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
A more simple approach would be to have a 'votes' column in the product table that is incremented each time a vote is added.
table: product
+----+-------------+-------+
| id | product | votes |
+----+-------------+-------+
| 1 | bananas | 2 |
| 2 | apples | 5 |
| .. | .. | .. |
While this is more resource-friendly, I lose data (e.g. I can no longer prevent a person from voting twice, as there is no record of their voting activity).
My questions are:
i) Am I being overly worried about server resources and should I just stick with the three-table option? (i.e. do I need to have more faith in the database's ability to handle large queries?)
ii) Is there a more efficient way of achieving the outcome without losing information?
You can never be too worried about resources. When you first start building an application you should always have resources, space, speed, etc. in mind; if your site's traffic grows dramatically and you never built with resources in mind, you start running into problems.
As for the vote system, personally I would keep the votes like so:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
Reasons:
Firstly, user_product_vote does not contain text, blobs, etc.; it's purely integers, so it takes up fewer resources anyway.
Secondly, it opens the door to new features in your application, such as total votes in the last 24 hours, highest-rated product over the past 24 hours, etc.
Take this example for instance:
table: user_product_vote
+----+------------+---------+-----------+------+
| id | product_id | user_id | vote_type | time |
+----+------------+---------+-----------+------+
| 1 | 1 | 2 | product |224.. |
| 2 | 2 | 2 | page |218.. |
| .. | .. | .. | .. | .. |
And a simple query:
SELECT COUNT(id) as total FROM user_product_vote WHERE vote_type = 'product' AND time BETWEEN(....) ORDER BY time DESC LIMIT 20
Another thing: if a user voted at 1 AM and then tried to vote again at 2 PM, you can easily check when they last voted and whether they should be allowed to vote again.
There are so many opportunities that you will be missing if you stick with your incremental example.
In regards to your count(), no matter how much you optimize your queries, it would not really make a difference at a large scale.
With an extremely large user base, your resource usage will be looked at from a different perspective: load balancers, server settings, Apache, caching, etc.; there's only so much you can do with your queries.
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
Don't waste your time solving imaginary problems. MySQL is perfectly able to process thousands of records in fractions of a second; this is what databases are for. A clean and simple database and code structure is far more important than the mythical "optimization" that no one needs.
Why not mix and match both? Simply keep the final counts in the product and user tables, so that you don't have to count every time, and keep the votes table, so that there is no double voting.
Edit:
To explain it a bit further: the product and user tables will each have a column called "votes". Every time an insert into user_product_vote succeeds, increment the relevant user and product records. This avoids duplicate votes, and you won't have to run the expensive count query every time either.
Edit:
Also, I am assuming that you have created a unique index on (product_id, user_id); in that case any duplication attempt will automatically fail and you won't have to check the table before inserting. You just need to make sure the insert query ran and that you got a valid value for the "id" in the form of insert_id.
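A minimal sketch of that unique-index approach (SQLite used for portability; in MySQL the failed insert raises a duplicate-key error you would catch the same way):

```python
import sqlite3

# The unique index on (product_id, user_id) is what enforces one vote
# per user per product; no pre-check SELECT is needed.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_product_vote (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        UNIQUE (product_id, user_id)
    )
""")

def vote(product_id, user_id):
    """Return True if the vote was recorded, False if it was a duplicate."""
    try:
        conn.execute(
            "INSERT INTO user_product_vote (product_id, user_id) VALUES (?, ?)",
            (product_id, user_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False

print(vote(1, 2))  # True  -- first vote is recorded
print(vote(1, 2))  # False -- duplicate is rejected by the unique index
```

The return value of vote() is also the natural place to trigger the counter increment described in the edits above, since it only fires on a successful insert.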
You have to balance the desire for your site to perform quickly (for which the second schema would be best) against the ability to count votes for specific users and prevent double voting (for which I would choose the first schema). Because you are only using integer columns in the user_product_vote table, I don't see how performance could suffer too much. Many-to-many relationships are common, as you have implemented with user_product_vote. If you do want to count votes for specific users and prevent double voting, a user_product_vote table is the only clean way I can think of implementing it; anything else could result in sparse records, duplicate records, and all kinds of bad things.
You don't want to update the product table directly with an aggregate every time someone votes: this will lock product rows, which will then affect other queries that are using products.
Assuming that not all product queries need to include the votes column, you could keep a separate productvotes table holding the running totals, and keep your userproductvote table as a means to enforce your per-user voting business rules and for auditing.