Database design for dictionary site - mysql

For a single-language dictionary with about 10k words, where some words are repeated but with different meanings, would it be OK to use a single-table design?
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| word | varchar(128) | NO | | NULL | |
| definition | varchar(500) | NO | | NULL | |
| example | text | NO | | NULL | |
| date | datetime | NO | | NULL | |
| votes | int(4) | NO | | 0 | |
| name | varchar(30) | NO | | NULL | |
+------------+--------------+------+-----+---------+----------------+
Example queries I'm using:
SELECT * FROM definitions WHERE word = ? ORDER BY votes DESC LIMIT 10
SELECT word, definition FROM definitions ORDER BY date DESC LIMIT 4
SELECT DISTINCT word FROM definitions WHERE word LIKE ? LIMIT 100
Also, the votes column gets updated every time someone votes.
Would it be better to have a one-to-many design instead? My main goal is performance.

Your table looks like it will be stable, with mostly searches performed on it.
The only column that causes insert/update activity - the vote count - is what may hurt your performance. Move the votes out to another table along with the word (definition) id; then, whenever a vote is cast, no write hits your main table, which will help its performance in the long run.
Select data from both tables using a join, as in the sketch below.
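A minimal sketch of that split, assuming an auxiliary definition_votes table (the table and column names here are illustrative, not from the question):
-- definitions: mostly read-only
CREATE TABLE definitions (
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  word       VARCHAR(128) NOT NULL,
  definition VARCHAR(500) NOT NULL,
  example    TEXT NOT NULL,
  date       DATETIME NOT NULL,
  name       VARCHAR(30) NOT NULL,
  INDEX (word)
);

-- definition_votes: the only table written to when someone votes
CREATE TABLE definition_votes (
  definition_id INT NOT NULL PRIMARY KEY,
  votes         INT NOT NULL DEFAULT 0,
  FOREIGN KEY (definition_id) REFERENCES definitions (id)
);

-- read path: join the two tables
SELECT d.*, v.votes
FROM definitions AS d
JOIN definition_votes AS v ON v.definition_id = d.id
WHERE d.word = ?
ORDER BY v.votes DESC
LIMIT 10;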

For only 10K words (or did you mean rows?) and those queries, performance will be 'good enough'. However, these indexes are needed:
INDEX(date)
INDEX(word, votes)
Hint: if new definitions come in often, then ORDER BY votes DESC LIMIT 10 will tend not to show them (once there are more than 10). So you should probably use a formula that combines the number of votes with the date the definition was added. It might be something like votes / TIMESTAMPDIFF(DAY, date, NOW()), or, to temper it and avoid dividing by zero for brand-new definitions, (votes + 1) / DATEDIFF(NOW() + INTERVAL 2 DAY, date). That expression would go in the ORDER BY, as sketched below.
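For example, something along these lines (the weighting is purely illustrative; tune the formula to taste):
-- votes-per-day score, dampened so brand-new definitions still surface
SELECT word, definition, votes
FROM definitions
WHERE word = ?
ORDER BY (votes + 1) / DATEDIFF(NOW() + INTERVAL 2 DAY, date) DESC
LIMIT 10;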

Related

MySQL query slow with OR in WHERE statement

I have a SQL query which looks simple but runs very slowly (~4 s):
SELECT tblbooks.*
FROM tblbooks LEFT JOIN
tblauthorships ON tblbooks.book_id = tblauthorships.book_id
WHERE (tblbooks.added_by=3 OR tblauthorships.author_id=3)
GROUP BY tblbooks.book_id
ORDER BY tblbooks.book_id DESC
LIMIT 10
EXPLAIN result:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
| 1 | SIMPLE | tblbooks | index | fk_books__users_1 | PRIMARY | 62 | NULL | 10 | Using where |
| 1 | SIMPLE | tblauthorships | ref | book_id | book_id | 62 | tblbooks.book_id | 1 | Using where |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
2 rows in set (0.000 sec)
If I run the above query separately for each part of the OR in the WHERE clause, both queries return results in less than 0.01 s.
Simplified schema:
tblbooks (~1 million rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+-----------------------+------+-----+---------------------+----------------+
| id | int(10) unsigned | NO | MUL | NULL | auto_increment |
| book_id | varchar(20) | NO | PRI | NULL | |
| added_by | int(11) unsigned | NO | MUL | NULL | |
+---------------+-----------------------+------+-----+---------------------+----------------+
tblauthorships (< 100 rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------------------+----------------+
| authorship_id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| book_id | varchar(20) | NO | MUL | NULL | |
| author_id | int(11) unsigned | NO | MUL | NULL | |
+---------------+------------------+------+-----+---------------------+----------------+
Both the book_id and author_id columns in tblauthorships have indexes.
Can anyone point me in the right direction?
Note: I'm aware of the book_id varchar issue.
My usual analogy for indexing is a telephone book. It's sorted by last name then by first name. If you look up a person by last name, you can find them efficiently. If you look up a person by last name AND first name, it's also efficient. But if you look up a person by first name only, the sort order of the book doesn't help, and you have to search every page the hard way.
Now what happens if you need to search a telephone book for a person by last name OR first name?
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas' OR first_name = 'Thomas';
This is just as bad as searching only by first name: since every entry matching the first name you searched for must be included in the result, you have to scan the whole book to find them all.
Conclusion: Using OR in an SQL search is hard to optimize, given that MySQL can use only one index per table in a given query.
Solution: Use two queries and UNION them:
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas'
UNION
SELECT * FROM TelephoneBook WHERE first_name = 'Thomas';
The two individual queries each use an index on the respective column, then the results of both queries are unified (by default UNION eliminates duplicates).
In your case you don't even need to do the join for one of the queries:
(SELECT b.*
FROM tblbooks AS b
WHERE b.added_by=3)
UNION
(SELECT b.*
FROM tblbooks AS b
INNER JOIN tblauthorships AS a USING (book_id)
WHERE a.author_id=3)
ORDER BY book_id DESC
LIMIT 10
The two answers so far are not very optimal. Since they have both UNION and LIMIT, let me further optimize their answers:
( SELECT ...
ORDER BY ...
LIMIT 10
) UNION DISTINCT
( SELECT ...
ORDER BY ...
LIMIT 10
)
ORDER BY ...
LIMIT 10
This gives each SELECT a chance to optimize the ORDER BY and LIMIT, making them faster. Then the UNION DISTINCT dedups. Finally, the first 10 are peeled off to make the resultset.
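Applied to the books query, that skeleton might fill in like this (a sketch reusing the aliases from the earlier answer):
( SELECT b.*
  FROM tblbooks AS b
  WHERE b.added_by = 3
  ORDER BY b.book_id DESC
  LIMIT 10
) UNION DISTINCT
( SELECT b.*
  FROM tblbooks AS b
  INNER JOIN tblauthorships AS a USING (book_id)
  WHERE a.author_id = 3
  ORDER BY b.book_id DESC
  LIMIT 10
)
ORDER BY book_id DESC
LIMIT 10;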
If there will be pagination via OFFSET, this optimization gets trickier. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
Also... your tables need these two indexes:
INDEX(added_by)    -- on tblbooks
INDEX(author_id)   -- on tblauthorships
(Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.)

MySQL slow query with some specific values (stock data)

I have some stocks data like this
+--------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------+---------------+------+-----+---------+-------+
| date | datetime | YES | MUL | NULL | |
| open | decimal(20,4) | YES | | NULL | |
| close | decimal(20,4) | YES | | NULL | |
| high | decimal(20,4) | YES | | NULL | |
| low | decimal(20,4) | YES | | NULL | |
| volume | decimal(20,4) | YES | | NULL | |
| code | varchar(6) | YES | MUL | NULL | |
+--------+---------------+------+-----+---------+-------+
with three indexes: a multi-column index on (date, code), an index on date, and an index on code.
The table is large, with 3000+ distinct stocks, and each stock has minute-level data covering nearly ten years.
I would like to fetch the latest date for a specific stock, so I run the following SQL:
SELECT date FROM tablename WHERE code = '000001' ORDER BY date DESC LIMIT 1;
However, this query works well for most stocks (<1 sec) but has very bad performance for some specific stocks (>1 hour). For example, just change the query to
SELECT date FROM tablename WHERE code = '000029' ORDER BY date DESC LIMIT 1;
and it just seems to freeze forever.
One thing I know is that the stock "000029" has no more data after 2016 and "good" stocks all have data until yesterday, but I'm not sure if all "bad" stocks have this characteristic.
First, let's shrink the table size; that will help speed things up somewhat.
decimal(20,4) takes 10 bytes. It allows 16 digits to the left of the decimal point; what stock price is that large? I don't know of one needing more than 6. On the other hand, are 4 digits to the right enough?
Normalize the 'code'. "3000+ distinct stocks" can be represented by a 2-byte SMALLINT UNSIGNED NOT NULL, instead of the current ~7 bytes.
'000029' smacks of ZEROFILL??
DESCRIBE is not as descriptive as SHOW CREATE TABLE. What is the PRIMARY KEY? It can make a big difference in this kind of table.
Do not make any columns NULL; make them all NOT NULL.
Use InnoDB and do have an explicit PRIMARY KEY.
I would expect these to be optimal, but I need to see some more typical queries in order to be sure.
PRIMARY KEY(code, date)
INDEX(date)
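Putting those suggestions together, the table might end up looking something like this (a sketch; the column sizes are guesses and the stocks lookup table is an assumption, not part of the original schema):
-- lookup table so each quote row carries a 2-byte id instead of a varchar code
CREATE TABLE stocks (
  stock_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  code     CHAR(6) NOT NULL,
  UNIQUE (code)
) ENGINE=InnoDB;

CREATE TABLE quotes (
  stock_id SMALLINT UNSIGNED NOT NULL,
  date     DATETIME NOT NULL,
  open     DECIMAL(10,4) NOT NULL,
  close    DECIMAL(10,4) NOT NULL,
  high     DECIMAL(10,4) NOT NULL,
  low      DECIMAL(10,4) NOT NULL,
  volume   DECIMAL(14,4) NOT NULL,
  PRIMARY KEY (stock_id, date),
  INDEX (date)
) ENGINE=InnoDB;

-- "latest date for a specific stock" should then be a short backward scan of the PK
SELECT date
FROM quotes
JOIN stocks USING (stock_id)
WHERE code = '000029'
ORDER BY date DESC
LIMIT 1;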

MySql - Index optimization

We have an analytics product. For each of our customers we provide a JavaScript snippet that they put on their websites. When a user visits a customer's site, the JavaScript code hits our server so that we can store the page visit on behalf of that customer. Each customer has a unique domain name, i.e. a customer is identified by their domain name.
Database server : MySql 5.6
Table rows : 400 million
Following is our table schema.
+---------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| domain | varchar(50) | NO | MUL | NULL | |
| guid | binary(16) | YES | | NULL | |
| sid | binary(16) | YES | | NULL | |
| url | varchar(2500) | YES | | NULL | |
| ip | varbinary(16) | YES | | NULL | |
| is_new | tinyint(1) | YES | | NULL | |
| ref | varchar(2500) | YES | | NULL | |
| user_agent | varchar(255) | YES | | NULL | |
| stats_time | datetime | YES | | NULL | |
| country | char(2) | YES | | NULL | |
| region | char(3) | YES | | NULL | |
| city | varchar(80) | YES | | NULL | |
| city_lat_long | varchar(50) | YES | | NULL | |
| email | varchar(100) | YES | | NULL | |
+---------------+------------------+------+-----+---------+----------------+
In the above table, guid represents a visitor to our customer's site and sid represents a visitor session on our customer's site. That means every sid should have an associated guid.
We need queries like the following:
Query 1 : Find unique,total visitors
SELECT count(DISTINCT guid) AS count,count(guid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-05 00:00:00' AND '2015-10-04 23:59:59'
composite index planning : domain,stats_time,sid
Query 2 : Find unique,total sessions
SELECT count(DISTINCT sid) AS count,count(sid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-05 00:00:00' AND '2015-10-04 23:59:59'
composite index planning : domain,stats_time,guid
Query 3: Find visitors,sessions by country ,by region, by city
composite index planning : domain,country
composite index planning : domain,region
Each combination requires a new composite index. That means a huge index file; we can't keep it in memory, so query performance is low.
Is there any way to optimize these index combinations to reduce index size and improve performance?
Just for grins, run this to see what type of spread you have...
select
country, region, city,
DATE_FORMAT(colName, '%Y-%m-%d') DATEONLY, count(*)
from
yourTable
group by
country, region, city,
DATE_FORMAT(colName, '%Y-%m-%d')
order by
count(*) desc
and then see how many rows it returns. Also, what sort of range does the COUNT column produce? Instead of just an index, does it make sense to create a separate aggregation table on the key elements you are trying to mine?
If so, I would recommend looking at a similar post here on Stack Overflow. It shows a SAMPLE of how, but I would first look at the counts before suggesting anything further. If you have the data broken down on a daily basis, how small might this get?
Additionally, you might want to create pre-aggregate tables ONCE to get started, then have a nightly procedure that adds new records for the day just completed. That way it never runs through all 400M records.
If your pre-aggregate tables store data keyed on just the date (y, m, d only), your queries rolled up per day shorten the querying requirements. The COUNT(*) is just an example basis; you could add count(distinct whateverColumn) as needed. Then you could query SUM(aggregateColumn) based on domain, date range, etc. If your 400M records get reduced down to 7M records, I would also have at minimum an index on (domain, dateOnlyField, and maybe country) to optimize your domain / date-range queries. Once you get something narrowed down to whatever level makes sense, you can always drill into the raw data for the granular detail. A sketch of such a roll-up follows.
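As a rough sketch (the page_views_daily table and its column names are invented for illustration, not part of the original schema), a daily roll-up might look like:
CREATE TABLE page_views_daily (
  domain     VARCHAR(50) NOT NULL,
  stats_date DATE NOT NULL,
  views      INT UNSIGNED NOT NULL,
  visitors   INT UNSIGNED NOT NULL,
  sessions   INT UNSIGNED NOT NULL,
  PRIMARY KEY (domain, stats_date)
);

-- nightly job: roll up only the day that just finished, never all 400M rows
INSERT INTO page_views_daily (domain, stats_date, views, visitors, sessions)
SELECT domain, DATE(stats_time),
       COUNT(*), COUNT(DISTINCT guid), COUNT(DISTINCT sid)
FROM page_views
WHERE stats_time >= CURDATE() - INTERVAL 1 DAY
  AND stats_time <  CURDATE()
GROUP BY domain, DATE(stats_time);

-- reporting then reads a handful of tiny rows per domain and date range
SELECT stats_date, views, visitors, sessions
FROM page_views_daily
WHERE domain = 'abc'
  AND stats_date BETWEEN '2015-10-01' AND '2015-10-07';
Similar roll-up tables keyed on (domain, stats_date, country) etc. would cover the by-country/region/city queries. Note that daily distinct counts cannot simply be summed to get uniques over a longer range; for that you would still go to the raw table or keep a separate visitor-level aggregate.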

Constructing a DB for best performance

I'm working on an "online streaming" project and I need some help constructing a DB for best performance. Currently I have one table containing all relevant information for the player, including file, poster image, post_id, etc.
+---------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| post_id | int(11) | YES | | NULL | |
| file | mediumtext | NO | | NULL | |
| thumbs_img | mediumtext | YES | | NULL | |
| thumbs_size | mediumtext | YES | | NULL | |
| thumbs_points | mediumtext | YES | | NULL | |
| poster_img | mediumtext | YES | | NULL | |
| type | int(11) | NO | | NULL | |
| uuid | varchar(40) | YES | | NULL | |
| season | int(11) | YES | | NULL | |
| episode | int(11) | YES | | NULL | |
| comment | text | YES | | NULL | |
| playlistName | text | YES | | NULL | |
| time | varchar(40) | YES | | NULL | |
| mini_poster | mediumtext | YES | | NULL | |
+---------------+-------------+------+-----+---------+----------------+
With 100k records it takes around 0.5 sec per query, and performance is constantly degrading as I add more records.
+----------+------------+----------------------------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+----------------------------------------------------------------------+
| 1 | 0.04630675 | SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1' |
+----------+------------+----------------------------------------------------------------------+
explain SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1';
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | dle_playerFiles | ALL | NULL | NULL | NULL | NULL | 61777 | Using where |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
How can I improve the DB structure? How do big websites like YouTube construct their databases?
Generally when query time is directly proportional to the number of rows, that suggests a table scan, which means for a query like
SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1'
The database executes that query literally, iterating over every single row and checking whether it meets the criteria.
The typical solution to this is an index, which is a precomputed list of values for a column (or set of columns) and a list of rows which have said value.
If you create an index on the post_id column on dle_playerFiles, then the index would essentially say
1: <some row pointer>, <some row pointer>, <some row pointer>
2: <some row pointer>, <some row pointer>, <some row pointer>
...
100: <some row pointer>, <some row pointer>, <some row pointer>
...
7000: <some row pointer>, <some row pointer>, <some row pointer>
250000: <some row pointer>, <some row pointer>, <some row pointer>
Therefore, with such an index in place, the above query would simply look up entry 7000 in the index and immediately know which rows match.
Then the database only needs to read the rows where post_id is 7000 and check if their type is 1.
This will be much quicker because the database never needs to look at every row to handle a query. The costs of an index:
Storage space - this is more data and it has to be stored somewhere
Update time - databases keep indexes in sync with changes to the table automatically, which means that INSERT, UPDATE and DELETE statements will take longer because they need to update the index as well. For small and efficient indexes, this tradeoff is usually worth it.
For your query, I recommend you create an index on 2 columns. Make them part of the same index, not 2 separate indexes:
create index ix_dle_playerFiles__post_id_type on dle_playerFiles (post_id, type)
Caveats to this working efficiently:
SELECT * is bad here. If you are returning every column, then the database must still go to the table to read the remaining columns, because the index only contains the filtering columns. If you really only need one or two of the columns, specify them explicitly in the SELECT clause and add them to your index. Do NOT do this for many columns, as it just bloats the index.
Functions and type conversions tend to prevent index usage. Your SQL wraps the integer types post_id and type in quotes so they are interpreted as strings. The database may feel that an index can't be used because it has to convert everything. Remove the quotes for good measure.
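With those two caveats applied, the query might look like this (assuming, purely for illustration, that only file and poster_img are actually needed):
-- integer literals (no quotes) and only the columns you need
SELECT file, poster_img
FROM dle_playerFiles
WHERE post_id = 7000 AND type = 1;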
If I read your Duration correctly, it appears to take 0.04630675 (seconds?) to run your query, not 0.5s.
Regardless, proper indexing can decrease the time required to return query results. Based on your query SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1', an index on post_id and type would be advisable.
Also, if you don't absolutely require all the fields to be returned, use individual column references of the fields you require instead of the *. The fewer fields, the quicker the query will return.
Another way to optimize a query is to ensure that you use the smallest data types possible - especially in primary/foreign key and index fields. Never use a bigint or an int when a mediumint, smallint or better still, a tinyint will do. Never, ever use a text field in a PK or FK unless you have no other choice (this one is a DB design sin that is committed far too often IMO, even by people with enough training and experience to know better) - you're far better off using the smallest exact numeric type possible. All this has positive impacts on storage size too.

Break this Query into batches

I have a MySQL table contacts, with structure as follows
+--------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| contactee_id | int(11) | NO | MUL | 0 | |
| contacter_id | int(11) | NO | MUL | 0 | |
+--------------+----------+------+-----+---------+----------------+
contactee_id and contacter_id are both user ids, which together define a relationship between two users. In order to calculate the number of relations a user has, I have the following query:
INSERT INTO followers (id, followers)
SELECT contactee_id, 1
FROM contacts
ON DUPLICATE KEY
UPDATE followers = followers + 1
The problem with this query is that it locks the contacts table for too long (more than 16 minutes). I want to run it in batches so that the SQL does not lock the contacts table for so long. The few approaches I thought of all need to lock the entire table. Is there a way this could be done?
If you just want the count of relations, use COUNT and GROUP BY together, like:
SELECT contactee_id,count(contacter_id) FROM contacts group by contactee_id;
This will give you all the contactee_ids and the number of contacter_ids for each contactee.
Run the query for a batch of records, then save the id of the last record processed in a table or a file; start the next query from that id and update it every cycle. A sketch of that approach follows.
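A minimal sketch of that idea, assuming contacts.id increases monotonically and using a helper table (batch_progress is invented here for illustration) to remember where the last batch stopped; it also aggregates each batch with GROUP BY instead of upserting one row at a time:
CREATE TABLE batch_progress (last_id INT NOT NULL);
INSERT INTO batch_progress VALUES (0);

-- one cycle: process the next 10,000 ids from contacts, then advance the marker
INSERT INTO followers (id, followers)
SELECT contactee_id, COUNT(*)
FROM contacts
WHERE id >  (SELECT last_id FROM batch_progress)
  AND id <= (SELECT last_id FROM batch_progress) + 10000
GROUP BY contactee_id
ON DUPLICATE KEY UPDATE followers = followers + VALUES(followers);

UPDATE batch_progress SET last_id = last_id + 10000;
Repeat the cycle until last_id passes MAX(id) in contacts; each run touches only a small id range, so locks on contacts are held only briefly.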