Break this Query into batches - mysql

I have a MySQL table contacts, with structure as follows
+--------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| contactee_id | int(11) | NO | MUL | 0 | |
| contacter_id | int(11) | NO | MUL | 0 | |
+--------------+----------+------+-----+---------+----------------+
contactee_id and contacter_id are both user ids, which together define a relationship between two users. To calculate the number of relations a user has, I have the following query:
INSERT INTO followers (id, followers)
SELECT contactee_id, 1
FROM contacts
ON DUPLICATE KEY
UPDATE followers = followers + 1
The problem with this query is that it locks the contacts table for too long (more than 16 minutes). I want to run it in batches, so that the SQL does not lock the contacts table for that long. I thought of a few ways, but they all need to lock the entire table. Is there a way this could be done?

If you just want the count of relations, use COUNT and GROUP BY together, like this:
SELECT contactee_id, COUNT(contacter_id) FROM contacts GROUP BY contactee_id;
This will give you every contactee_id and the number of contacter_ids for each contactee.

Run the query for a limited range of records, then save the id of the last record processed in a table or on the filesystem. Start the next query from that id, and update the saved id every cycle.
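The checkpointed batching described above can be sketched as follows. This is a minimal illustration, not a drop-in solution: it uses Python's built-in sqlite3 as a stand-in for MySQL, the function name count_followers_in_batches and the batch_size parameter are made up for the example, and the checkpoint is kept in a local variable rather than persisted. The same id-range predicate could be added to the original INSERT ... SELECT in MySQL.

```python
import sqlite3

def count_followers_in_batches(conn, batch_size=1000):
    """Aggregate contacts into followers one id range at a time."""
    last_id = 0  # checkpoint; a real job would persist this between runs
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM contacts").fetchone()[0]
    while last_id < max_id:
        upper = last_id + batch_size
        # Only rows in (last_id, upper] are read in this cycle.
        rows = conn.execute(
            "SELECT contactee_id, COUNT(*) FROM contacts "
            "WHERE id > ? AND id <= ? GROUP BY contactee_id",
            (last_id, upper),
        ).fetchall()
        for contactee_id, n in rows:
            cur = conn.execute(
                "UPDATE followers SET followers = followers + ? WHERE id = ?",
                (n, contactee_id),
            )
            if cur.rowcount == 0:  # no counter row yet for this user
                conn.execute(
                    "INSERT INTO followers (id, followers) VALUES (?, ?)",
                    (contactee_id, n),
                )
        conn.commit()  # one short transaction per batch
        last_id = upper
    return last_id

# Hypothetical sample data, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, "
    "contactee_id INTEGER NOT NULL, contacter_id INTEGER NOT NULL)"
)
conn.execute("CREATE TABLE followers (id INTEGER PRIMARY KEY, followers INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO contacts (contactee_id, contacter_id) VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 1), (1, 4), (3, 1)],
)
count_followers_in_batches(conn, batch_size=2)
result = dict(conn.execute("SELECT id, followers FROM followers"))
print(result)
```

Each batch touches only a bounded id range and commits on its own, so no single statement has to hold locks for the whole table.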

Related

Database design for dictionary site

For a single language dictionary with about 10k words on it, where some words are repeated but with different meaning, would it be ok to use a single table design?
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| word | varchar(128) | NO | | NULL | |
| definition | varchar(500) | NO | | NULL | |
| example | text | NO | | NULL | |
| date | datetime | NO | | NULL | |
| votes | int(4) | NO | | 0 | |
| name | varchar(30) | NO | | NULL | |
+------------+--------------+------+-----+---------+----------------+
Example queries I'm using:
SELECT * FROM definitions WHERE word = ? ORDER BY votes DESC LIMIT 10
SELECT word, definition FROM definitions ORDER BY date DESC LIMIT 4
SELECT DISTINCT word FROM definitions WHERE word LIKE ? LIMIT 100
Also, the votes column gets updated every time someone votes.
Would it be better to have a one-to-many design instead? My main goal is performance.
Your table looks like it would be stable, with mostly searches performed on it.
The only column that will cause frequent updates is votes, and that may affect your performance. You should move the votes to a separate table, stored along with the word id. Then, whenever a vote is cast, the write goes to the votes table rather than to your main table, which should help the main table's performance in the long term.
Select data from both tables using a join.
For only 10K words (or did you mean rows), and those queries, performance will be 'good enough'. However, these are needed:
INDEX(date)
INDEX(word, votes)
Hint: if new definitions come in often, then ORDER BY votes DESC LIMIT 10 will tend not to show them (when there are more than 10). So you should probably have some formula involving the date at which the definition was added and the number of votes. It might be something like votes / TIMESTAMPDIFF(DAY, date, NOW()) or, to temper it, (votes + 1) / TIMESTAMPDIFF(DAY, date, NOW() + INTERVAL 2 DAY). That would go in the ORDER BY.
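To see how the tempered formula behaves, here is a small arithmetic sketch in Python. The function name tempered_score and the dates are invented for illustration; the expression itself mirrors (votes + 1) divided by the age in whole days plus 2.

```python
from datetime import datetime

def tempered_score(votes, added, now):
    """(votes + 1) / (whole days since the definition was added + 2)."""
    age_days = (now - added).days
    return (votes + 1) / (age_days + 2)

now = datetime(2024, 1, 31)
# An old definition with many votes versus a fresh one with few votes.
old_popular = tempered_score(50, datetime(2023, 10, 23), now)  # 100 days old
new_entry = tempered_score(3, datetime(2024, 1, 30), now)      # 1 day old
print(old_popular, new_entry)
```

With these numbers the one-day-old definition outranks the 100-day-old one, which is the effect the hint is after. The "+ 1" and "+ 2" keep the score finite and non-zero for a brand-new definition with no votes.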

Restrict insertion based on a count

So, I need to safely restrict the insertion of entries in a table based on the count of other entries in that same table. Say we have the following table:
resource:(id, foreign_key)
I need to create up to a number of entries based on the foreign key. So, as soon as I reach a count, let's say 100 for our example, I want to restrict creating more entries.
The obvious answer would be something like this:
1. count the entries with the specified foreign key.
2. if count < limit, insert the new entry.
And in fact, that's what I have been using. The thing is, this approach is not fail-proof since between 1 and 2 there might occur another insertion. I considered the possibility of using transactions but (unless I'm completely misunderstanding transactions) this has the same issue:
1. start transaction
2. insert the new entry
3. if entries have exceeded the limit, rollback. otherwise commit
Now, say we already have 99/100 entries and two transactions run at the same time. They will both commit, since they don't see each other's entries.
Short of actually creating the entry and then deleting it if it's invalid (which feels kind of messy to me), I can't think of a way to solve this issue. Any ideas?
edit: upon request I'm providing sample data:
table1
+-------------+------------------+------+-----+----------------+
| Field | Type | Null | Key | Extra |
+-------------+------------------+------+-----+----------------+
| id | int(10) unsigned | NO | PRI | auto_increment |
| limit | int(10) unsigned | NO | MUL | |
+-------------+------------------+------+-----+----------------+
table2
+-------------+------------------+------+-----+----------------+
| Field | Type | Null | Key | Extra |
+-------------+------------------+------+-----+----------------+
| id | int(10) unsigned | NO | PRI | auto_increment |
| foreign_id | int(10) unsigned | NO | MUL | |
+-------------+------------------+------+-----+----------------+
and some sample data:
table1
+----+----------+
| id | limit |
+----+----------+
| 1 | 5 |
+----+----------+
table2
+----+---------------+
| id | foreign_id |
+----+---------------+
| 1 | 1 |
+----+---------------+
| 2 | 1 |
+----+---------------+
| 3 | 1 |
+----+---------------+
| 4 | 1 |
+----+---------------+
At this point, let's say that two users attempt to create table2 entries. The first one will have to be accepted and the 2nd rejected.
With the first approach, if both users go through step 1 (counting the old entries) and then through step 2 (insert the new entry) both entries will be created.
With the second approach, if both of them run at the same time, they will both count 4 existing entries before their own and commit, instead of one of them rolling back.
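Halo Mate, a Stored Procedure similar to this structure may help you:
DROP PROCEDURE IF EXISTS sp_insert_record;
DELIMITER //
CREATE PROCEDURE sp_insert_record(
    IN insert_value1 INT(9),
    IN chosen_id INT(9)
)
BEGIN
    -- Local variables; the v_ prefix avoids clashes with column names.
    DECLARE v_limit INT UNSIGNED;
    DECLARE v_count INT UNSIGNED;
    START TRANSACTION;
    -- Lock the parent row so that concurrent calls for the same id
    -- run one after another instead of counting at the same time.
    SELECT `limit`
    INTO v_limit
    FROM table1
    WHERE id = chosen_id
    FOR UPDATE;
    INSERT INTO table2 (id, foreign_id)
    VALUES (insert_value1, chosen_id);
    SELECT COUNT(id)
    INTO v_count
    FROM table2
    WHERE foreign_id = chosen_id;
    -- Keep the new row only if the limit is still respected.
    IF v_count <= v_limit THEN
        COMMIT;
    ELSE
        ROLLBACK;
    END IF;
END//
DELIMITER ;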
By using a Stored Procedure, you can also add any validation or process based on your requirements.
Hope this can be of help, cheers!
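The race described in the question goes away once writers are serialized before they count. A minimal sketch of that idea, using Python's built-in sqlite3 as a stand-in for MySQL: here BEGIN IMMEDIATE takes the write lock up front, which is roughly the role that locking the table1 row with SELECT ... FOR UPDATE would play in InnoDB. The function name insert_if_under_limit is made up for the example.

```python
import sqlite3

def insert_if_under_limit(conn, parent_id):
    """Insert a table2 row for parent_id unless its limit is already reached."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before counting
    try:
        limit_row = conn.execute(
            'SELECT "limit" FROM table1 WHERE id = ?', (parent_id,)
        ).fetchone()
        count = conn.execute(
            "SELECT COUNT(*) FROM table2 WHERE foreign_id = ?", (parent_id,)
        ).fetchone()[0]
        allowed = limit_row is not None and count < limit_row[0]
        if allowed:
            conn.execute("INSERT INTO table2 (foreign_id) VALUES (?)", (parent_id,))
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
    return allowed

# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute('CREATE TABLE table1 (id INTEGER PRIMARY KEY, "limit" INTEGER NOT NULL)')
conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY, foreign_id INTEGER NOT NULL)")
conn.execute('INSERT INTO table1 (id, "limit") VALUES (1, 5)')
conn.executemany("INSERT INTO table2 (foreign_id) VALUES (?)", [(1,)] * 4)

first = insert_if_under_limit(conn, 1)   # the fifth entry
second = insert_if_under_limit(conn, 1)  # would be the sixth entry
total = conn.execute("SELECT COUNT(*) FROM table2 WHERE foreign_id = 1").fetchone()[0]
print(first, second, total)
```

Because the count and the insert happen while the lock is held, a second caller cannot count until the first one has committed or rolled back.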

Constructing a DB for best performance

I'm working on "online streaming" project and I need some help in constructing a DB for best performance. Currently I have one table containing all relevant information for the player including file, poster image, post_id etc.
+---------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| post_id | int(11) | YES | | NULL | |
| file | mediumtext | NO | | NULL | |
| thumbs_img | mediumtext | YES | | NULL | |
| thumbs_size | mediumtext | YES | | NULL | |
| thumbs_points | mediumtext | YES | | NULL | |
| poster_img | mediumtext | YES | | NULL | |
| type | int(11) | NO | | NULL | |
| uuid | varchar(40) | YES | | NULL | |
| season | int(11) | YES | | NULL | |
| episode | int(11) | YES | | NULL | |
| comment | text | YES | | NULL | |
| playlistName | text | YES | | NULL | |
| time | varchar(40) | YES | | NULL | |
| mini_poster | mediumtext | YES | | NULL | |
+---------------+-------------+------+-----+---------+----------------+
With 100k records it takes around 0.5 sec for a query, and performance keeps degrading as I add more records.
+----------+------------+----------------------------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+----------------------------------------------------------------------+
| 1 | 0.04630675 | SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1' |
+----------+------------+----------------------------------------------------------------------+
explain SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1';
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | dle_playerFiles | ALL | NULL | NULL | NULL | NULL | 61777 | Using where |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
How can I improve the DB structure? How do big websites like YouTube construct their databases?
Generally when query time is directly proportional to the number of rows, that suggests a table scan, which means for a query like
SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1'
The database is executing that literally, as in, iterate over every single row and check if it meets criteria.
The typical solution to this is an index, which is a precomputed list of values for a column (or set of columns) and a list of rows which have said value.
If you create an index on the post_id column on dle_playerFiles, then the index would essentially say
1: <some row pointer>, <some row pointer>, <some row pointer>
2: <some row pointer>, <some row pointer>, <some row pointer>
...
100: <some row pointer>, <some row pointer>, <some row pointer>
...
7000: <some row pointer>, <some row pointer>, <some row pointer>
250000: <some row pointer>, <some row pointer>, <some row pointer>
Therefore, with such an index in place, the above query would simply look at node 7000 of the index and know which rows contain it.
Then the database only needs to read the rows where post_id is 7000 and check if their type is 1.
This will be much quicker because the database never needs to look at every row to handle a query. The costs of an index:
Storage space - this is more data and it has to be stored somewhere
Update time - databases keep indexes in sync with changes to the table automatically, which means that INSERT, UPDATE and DELETE statements will take longer because they need to update the index as well as the table. For small and efficient indexes, this tradeoff is usually worth it.
For your query, I recommend you create an index on 2 columns. Make them part of the same index, not 2 separate indexes:
create index ix_dle_playerFiles__post_id_type on dle_playerFiles (post_id, type)
Caveats to this working efficiently:
SELECT * is bad here. If you are returning every column, then the database must go to the table to read the columns because the index only contains the columns for filtering. If you really only need one or two of the columns, specify them explicitly in the SELECT clause and add them to your index. Do NOT do this for many columns as it just bloats the index.
Functions and type conversions tend to prevent index usage. Your SQL wraps the integer types post_id and type in quotes so they are interpreted as strings. The database may feel that an index can't be used because it has to convert everything. Remove the quotes for good measure.
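The effect of the composite index on the query plan can be observed with a small experiment. The sketch below uses Python's built-in sqlite3 rather than MySQL, with a cut-down version of the table, so the plan wording differs from MySQL's EXPLAIN output; the helper name query_plan is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced version of dle_playerFiles: only the columns relevant to the query.
conn.execute(
    "CREATE TABLE dle_playerFiles ("
    "id INTEGER PRIMARY KEY, post_id INTEGER, file TEXT NOT NULL, type INTEGER NOT NULL)"
)
QUERY = "SELECT * FROM dle_playerFiles WHERE post_id IN (7000) AND type = 1"

def query_plan(connection, sql):
    """Return the human-readable plan lines SQLite reports for sql."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = query_plan(conn, QUERY)
conn.execute(
    "CREATE INDEX ix_dle_playerFiles__post_id_type ON dle_playerFiles (post_id, type)"
)
plan_after = query_plan(conn, QUERY)

print(plan_before)  # a full scan of the table
print(plan_after)   # a search that names the new index
```

Running EXPLAIN again in MySQL after creating the index should show the same kind of change: the type column moves away from ALL and the key column names the index.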
If I read your Duration correctly, it appears to take 0.04630675 (seconds?) to run your query, not 0.5s.
Regardless, proper indexing can decrease the time required to return query results. Based on your query SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1', an index on post_id and type would be advisable.
Also, if you don't absolutely require all the fields to be returned, use individual column references of the fields you require instead of the *. The fewer fields, the quicker the query will return.
Another way to optimize a query is to ensure that you use the smallest data types possible - especially in primary/foreign key and index fields. Never use a bigint or an int when a mediumint, smallint or better still, a tinyint will do. Never, ever use a text field in a PK or FK unless you have no other choice (this one is a DB design sin that is committed far too often IMO, even by people with enough training and experience to know better) - you're far better off using the smallest exact numeric type possible. All this has positive impacts on storage size too.

MySQL: Count items by categories

I've created a table that holds items according to categories:
+------------+---------------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+-------------------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(30) | YES | | NULL | |
| category | varchar(30) | YES | MUL | NULL | |
| timestamp | timestamp | NO | | CURRENT_TIMESTAMP | |
| data | mediumblob | YES | | NULL | |
+------------+---------------------+------+-----+-------------------+----------------+
Old data is deleted using a sliding window technique, meaning that only the last N items in each category are kept in the table.
How can I keep track of the total number of items per category, and the timestamp of the first item in the category?
Edit - COUNT and MIN on the original table won't work, because this is a Sliding Window data structure meaning that the first items have already been deleted.
Clearly you need to keep a separate table when you delete the records. Your table should summarize the categories and include the fields:
Category first start time
Total number of items in the category
and so on.
When you go to delete, you need to update this table. In general, I prefer to use stored procedures to handle database maintenance, so this code could be added to the stored procedure. Others prefer triggers, so you could have a delete trigger that does the same thing.
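One possible shape for that summary table, sketched with Python's built-in sqlite3 as a stand-in for MySQL. The table name category_summary, its column deleted_items, and both function names are assumptions made for the example; timestamps are plain integers to keep it short.

```python
import sqlite3

def trim_category(conn, category, keep_last):
    """Delete all but the newest keep_last items, recording what is removed."""
    with conn:  # summary update and delete commit together or not at all
        # Capture the first timestamp before any row of this category is deleted.
        conn.execute(
            "INSERT OR IGNORE INTO category_summary (category, first_timestamp, deleted_items) "
            "SELECT category, MIN(timestamp), 0 FROM items WHERE category = ? GROUP BY category",
            (category,),
        )
        cutoff = conn.execute(
            "SELECT id FROM items WHERE category = ? ORDER BY id DESC LIMIT 1 OFFSET ?",
            (category, keep_last - 1),
        ).fetchone()
        if cutoff is None:
            return 0  # fewer than keep_last items, nothing to delete
        deleted = conn.execute(
            "DELETE FROM items WHERE category = ? AND id < ?", (category, cutoff[0])
        ).rowcount
        conn.execute(
            "UPDATE category_summary SET deleted_items = deleted_items + ? WHERE category = ?",
            (deleted, category),
        )
    return deleted

def category_stats(conn, category):
    """Return (total items ever seen, timestamp of the first item)."""
    live, live_first = conn.execute(
        "SELECT COUNT(*), MIN(timestamp) FROM items WHERE category = ?", (category,)
    ).fetchone()
    row = conn.execute(
        "SELECT first_timestamp, deleted_items FROM category_summary WHERE category = ?",
        (category,),
    ).fetchone()
    if row is None:  # never trimmed: the live rows are the whole history
        return live, live_first
    return live + row[1], row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, timestamp INTEGER)")
conn.execute(
    "CREATE TABLE category_summary (category TEXT PRIMARY KEY, "
    "first_timestamp INTEGER, deleted_items INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO items (category, timestamp) VALUES ('a', ?)", [(t,) for t in range(1, 6)])
first_trim = trim_category(conn, "a", keep_last=2)
conn.execute("INSERT INTO items (category, timestamp) VALUES ('a', 6)")
second_trim = trim_category(conn, "a", keep_last=2)
stats = category_stats(conn, "a")
print(first_trim, second_trim, stats)
```

The same logic could live in a stored procedure or a delete trigger, as the answer suggests; the point is only that the summary row is updated in the same transaction as the delete.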
Try SELECT COUNT(id) FROM table GROUP BY category

Delete duplicate rows GROUP BY with LIKE

I have a string in database (mysql) which is like:
{"StateId":73,"CallTime":"\/Date(1336365498912+0500)\/","CallId":"1336365489.14157","Target":"agi://127.0.0.1"}},"Profile":{"$type":"DataWriter.DbProfile, DataWriterObjects","Name":"DataService","Provider":"mssql","ConnectionString":"Data Source=localhost\\mydb; Database=mydb; User Id=sa; Password=admin;"}}
The string is a JSON object which contains multiple fields. The problem is that I have multiple duplicate rows which I want to remove from the database. A row is considered a duplicate if the CallId and StateId are the same but the CallTime is different. So first I want to get a list of the duplicates (GROUP BY), that is, the rows which have the same CallId, ignoring the difference in CallTime. The record below has a different CallTime from the first one but the same CallId, hence it is considered a duplicate (basically, CallTime should not be considered when checking for duplicates):
{"StateId":73,"CallTime":"\/Date(1336365498913+0500)\/","CallId":"1336365489.14157","Target":"agi://127.0.0.1"}},"Profile":{"$type":"DataWriter.DbProfile, DataWriterObjects","Name":"DataService","Provider":"mssql","ConnectionString":"Data Source=localhost\\mydb; Database=mydb; User Id=sa; Password=admin;"}}
So how do I do a GROUP BY? Basically everything in the GROUP BY should be matched ignoring the CallTime value.
The table structure is
mysql> describe Statements;
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| SequenceId | bigint(10) | NO | PRI | NULL | auto_increment |
| Profile | varchar(32) | YES | MUL | NULL | |
| CacheItem | text | NO | | NULL | |
+------------+-------------+------+-----+---------+----------------+
After that I want to delete the duplicates. Can anyone help me out?
I think your database is not atomic enough. You may have to split your JSON string out into separate fields.
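Until the fields are split out, the duplicates could be identified in application code by extracting just CallId and StateId from each CacheItem. The sketch below is one way to do that in Python; it assumes the rows have already been fetched as (SequenceId, CacheItem) pairs, uses regular expressions because the sample strings are not complete JSON documents, and keeps the lowest SequenceId of each group. The function name and the shortened sample strings are made up for the example.

```python
import re

CALL_ID = re.compile(r'"CallId":"([^"]*)"')
STATE_ID = re.compile(r'"StateId":(\d+)')

def duplicate_sequence_ids(rows):
    """Return the SequenceIds to delete, keeping the lowest one per (CallId, StateId).

    rows is an iterable of (SequenceId, CacheItem) pairs; CallTime is ignored.
    """
    seen = set()
    duplicates = []
    for sequence_id, cache_item in sorted(rows):
        call = CALL_ID.search(cache_item)
        state = STATE_ID.search(cache_item)
        if call is None or state is None:
            continue  # leave rows that cannot be parsed untouched
        key = (call.group(1), state.group(1))
        if key in seen:
            duplicates.append(sequence_id)
        else:
            seen.add(key)
    return duplicates

# Shortened, hypothetical CacheItem values in the same shape as the question's data.
rows = [
    (1, '{"StateId":73,"CallTime":"\\/Date(1336365498912+0500)\\/","CallId":"1336365489.14157"}'),
    (2, '{"StateId":73,"CallTime":"\\/Date(1336365498913+0500)\\/","CallId":"1336365489.14157"}'),
    (3, '{"StateId":74,"CallTime":"\\/Date(1336365498914+0500)\\/","CallId":"1336365489.14157"}'),
]
to_delete = duplicate_sequence_ids(rows)
print(to_delete)
```

The returned ids can then be passed to a DELETE FROM Statements WHERE SequenceId IN (...) statement.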