Optimizing searches for big MySQL table - mysql

I'm working with a MariaDB (MySQL) table that mainly contains information about cities around the world: their latitude/longitude and the two-character code of the country each city is in. The table is quite big, over 2.5 million rows.
show columns from Cities;
+---------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| city | varchar(255) | YES | | NULL | |
| lat | float | NO | | NULL | |
| lon | float | NO | | NULL | |
| country | varchar(255) | YES | | NULL | |
+---------+--------------+------+-----+---------+----------------+
I want to implement a city searcher, so I have to optimize the SELECTs, not the INSERTs or UPDATEs (it will always be the same information).
I thought that I should:
create an index (by city? by city and country?)
create partitions (by country?)
Should I do both? If so, how could I do them? Could anyone give me some advice? I'm a little bit lost.
PS. I tried this to create an index by city and country (I don't know if I'm doing it right...):
CREATE INDEX idx_cities ON Cities(city (30), country (2));

Do not use "prefix indexing". Simply use INDEX(city, country) This will work very well for either of these:
WHERE city = 'London' -- 26 results, half in the US
WHERE city = 'London' AND country = 'CA' -- one result
Do not use Partitions. The table is too small, and there is no performance benefit.
Since there are only 2.5M rows, use id MEDIUMINT UNSIGNED to save 2.5MB.
What other queries will you have? If you need to "find the 10 nearest cities to a given lat/lng", then see this.
Your table, including index(es), might be only 300MB.
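Putting the first two suggestions into concrete statements might look like this (a sketch; the index name is just an example, and the DROP is only needed if the prefix index from the question was actually created):
-- Replace the prefix index with a plain composite index.
DROP INDEX idx_cities ON Cities;
CREATE INDEX idx_city_country ON Cities (city, country);
-- Optional: 2.5M rows fit comfortably in MEDIUMINT UNSIGNED.
ALTER TABLE Cities MODIFY id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT;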

Related

How can I 'weight' a fulltext search on one table (podcasts_episodes) by a field (weight) in another table (podcasts) that corresponds to the id?
An example of both tables:
mysql> describe podcasts;
+--------------+--------------+
| Field | Type |
+--------------+--------------+
| id | int(12) |
| name | varchar(200) |
| weight | float |
+--------------+--------------+
mysql> describe podcasts_episodes;
+-------------+--------------+
| Field | Type |
+-------------+--------------+
| id | int(12) |
| name | varchar(200) |
| link | varchar(255) |
| description | text |
| podcast | int(12) |
| weight | float |
+-------------+--------------+
Rather than having to set a weight for each episode, it would be far simpler to use a common one in the podcasts table.
In the podcasts_episodes table, the field 'podcast' matches the id in the podcasts table for the podcast that the episode belongs to.
The SQL query I have been using, which needs to be updated as described, is:
SELECT
*, MATCH(name,description) AGAINST('industrial') AS relevance
FROM
podcasts_episodes
WHERE
MATCH(name,description) AGAINST('industrial')
ORDER BY
weight DESC, relevance DESC
LIMIT 10
The expected outcome should be that only relevant results show up, but results weighted higher should get a boost above lower weighted results.
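One way the query could be adapted to use the podcast-level weight is to join the parent table and order by its weight column instead; a sketch under the schema described above (the aliases are arbitrary):
SELECT
    e.*, MATCH(e.name, e.description) AGAINST('industrial') AS relevance
FROM
    podcasts_episodes e
    JOIN podcasts p ON p.id = e.podcast
WHERE
    MATCH(e.name, e.description) AGAINST('industrial')
ORDER BY
    p.weight DESC, relevance DESC
LIMIT 10
This assumes a FULLTEXT index exists on podcasts_episodes(name, description), as the original query already requires.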

MySQL: WHERE Condition against all Columns without specifying them

So long story short:
I have table A which might expand in columns in the future. I'd like to write a PHP PDO prepared SELECT statement with a WHERE clause that applies the condition to ALL columns of the table. To avoid having to update the query manually if columns are added to the table later on, I'd like to just tell the query to check ALL columns of the table.
Like so:
$fetch = $connection->prepare("SELECT product_name
FROM products_tbl
WHERE _ANYCOLUMN_ = ?
");
Is this possible with MySQL?
EDIT:
To clarify what I mean by "having to expand the table" in the future:
MariaDB [foundationtests]> SHOW COLUMNS FROM products_tbl;
+----------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+--------------+------+-----+---------+----------------+
| product_id | int(11) | NO | PRI | NULL | auto_increment |
| product_name | varchar(100) | NO | UNI | NULL | |
| product_manufacturer | varchar(100) | NO | MUL | diverse | |
| product_category | varchar(100) | NO | MUL | diverse | |
+----------------------+--------------+------+-----+---------+----------------+
4 rows in set (0.011 sec)
Here you can see the current table. Basically, products are listed here by their name, and they are accompanied by their manufacturers (say, Bosch) and category (say, drill hammer). Now I want to add another "attribute" to the products, like their price.
In such a case, I'd have to add another column, and then I'd have to specify this new column inside my MySQL queries.
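MySQL has no "match any column" predicate, but the column list can be discovered at runtime from information_schema and used to build the WHERE clause dynamically. A minimal sketch, using the schema and table names from the question:
SELECT COLUMN_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'foundationtests'
  AND TABLE_NAME = 'products_tbl';
The application would then generate one "column = ?" condition per returned column, OR them together, and bind the same search value to every placeholder.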

MySQL slow query with some specific values (stock data)

I have some stock data like this:
+--------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------+---------------+------+-----+---------+-------+
| date | datetime | YES | MUL | NULL | |
| open | decimal(20,4) | YES | | NULL | |
| close | decimal(20,4) | YES | | NULL | |
| high | decimal(20,4) | YES | | NULL | |
| low | decimal(20,4) | YES | | NULL | |
| volume | decimal(20,4) | YES | | NULL | |
| code | varchar(6) | YES | MUL | NULL | |
+--------+---------------+------+-----+---------+-------+
with three indexes: a multi-column index on (date, code), an index on date, and an index on code.
The table is large, with 3000+ distinct stocks, and each stock has minute-level data covering nearly ten years.
I would like to fetch the last date for a specific stock, so I run the following SQL:
SELECT date FROM tablename WHERE code = '000001' ORDER BY date DESC LIMIT 1;
However, this query works well for most stocks (<1 sec) but has very bad performance for some specific stocks (>1 hour). For example, just change the query to
SELECT date FROM tablename WHERE code = '000029' ORDER BY date DESC LIMIT 1;
and it just seems to freeze forever.
One thing I know is that the stock "000029" has no more data after 2016 and "good" stocks all have data until yesterday, but I'm not sure if all "bad" stocks have this characteristic.
First, let's shrink the table size. This will help speed things up somewhat.
decimal(20,4) takes 10 bytes. It allows 16 digits to the left of the decimal point; what stock price is that large? I don't know of one needing more than 6. On the other hand, are 4 digits to the right of the decimal point enough?
Normalize the 'code'. "3000+ distinct stocks" can be represented by a 2-byte SMALLINT UNSIGNED NOT NULL, instead of the current ~7 bytes.
'000029' smacks of ZEROFILL??
DESCRIBE is not as descriptive as SHOW CREATE TABLE. What is the PRIMARY KEY? It can make a big difference in this kind of table.
Do not make any columns NULL; make them all NOT NULL.
Use InnoDB and do have an explicit PRIMARY KEY.
I would expect these to be optimal, but I need to see some more typical queries in order to be sure.
PRIMARY KEY(code, date)
INDEX(date)
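Pulling those recommendations together, the revised table might look roughly like this (a sketch; the exact column sizes are assumptions to be checked against your real value ranges):
CREATE TABLE stock_minutes (
    code   SMALLINT UNSIGNED NOT NULL,  -- id from a separate stock-code lookup table (assumed)
    date   DATETIME NOT NULL,
    open   DECIMAL(10,4) NOT NULL,
    close  DECIMAL(10,4) NOT NULL,
    high   DECIMAL(10,4) NOT NULL,
    low    DECIMAL(10,4) NOT NULL,
    volume DECIMAL(20,4) NOT NULL,
    PRIMARY KEY (code, date),
    INDEX (date)
) ENGINE=InnoDB;
With PRIMARY KEY (code, date), a WHERE code = ... ORDER BY date DESC LIMIT 1 query becomes a single lookup at the end of that stock's range in the clustered index.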

MySql - Index optimization

We have an analytics product. For each of our customers we provide a JavaScript snippet that they put in their websites. When a user visits a customer's site, the JavaScript code hits our server so that we store that page visit on behalf of our customer. Each customer has a unique domain name, which means a customer is identified by their domain name.
Database server : MySql 5.6
Table rows : 400 million
Following is our table schema.
+---------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| domain | varchar(50) | NO | MUL | NULL | |
| guid | binary(16) | YES | | NULL | |
| sid | binary(16) | YES | | NULL | |
| url | varchar(2500) | YES | | NULL | |
| ip | varbinary(16) | YES | | NULL | |
| is_new | tinyint(1) | YES | | NULL | |
| ref | varchar(2500) | YES | | NULL | |
| user_agent | varchar(255) | YES | | NULL | |
| stats_time | datetime | YES | | NULL | |
| country | char(2) | YES | | NULL | |
| region | char(3) | YES | | NULL | |
| city | varchar(80) | YES | | NULL | |
| city_lat_long | varchar(50) | YES | | NULL | |
| email | varchar(100) | YES | | NULL | |
+---------------+------------------+------+-----+---------+----------------+
In the above table, guid represents a visitor of our customer's site and sid represents a visitor session on our customer's site. That means for every sid there should be an associated guid.
We need queries like the following:
Query 1: Find unique and total visitors
SELECT count(DISTINCT guid) AS count, count(guid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-04 00:00:00' AND '2015-10-04 23:59:59'
composite index planning: domain, stats_time, guid
Query 2: Find unique and total sessions
SELECT count(DISTINCT sid) AS count, count(sid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-04 00:00:00' AND '2015-10-04 23:59:59'
composite index planning: domain, stats_time, sid
Query 3: Find visitors and sessions by country, by region, by city
composite index planning: domain, country
composite index planning: domain, region
Each combination requires a new composite index. That means a huge index file; we can't keep it in memory, so query performance is low.
Is there any way to optimize these index combinations to reduce the index size and improve performance?
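For reference, a Query 3 along those lines might look roughly like this (a sketch; the actual statement is not shown above):
SELECT country, region, city, count(DISTINCT guid) AS visitors, count(DISTINCT sid) AS sessions FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-04 00:00:00' AND '2015-10-04 23:59:59' GROUP BY country, region, city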
Just for grins, run this to see what type of spread you have...
select
country, region, city,
DATE_FORMAT(colName, '%Y-%m-%d') DATEONLY, count(*)
from
yourTable
group by
country, region, city,
DATE_FORMAT(colName, '%Y-%m-%d')
order by
count(*) desc
and then see how many rows it returns, and what sort of range the COUNT column produces. Instead of just an index, it may make sense to create a separate aggregation table on the key elements you are trying to provide for data mining.
If so, I would recommend looking at a similar post here on Stack Overflow. It shows a SAMPLE of how, but I would first look at the counts before suggesting anything further. If you have it broken down on a daily basis, what MIGHT this be reduced to?
Additionally, you might want to create pre-aggregate tables ONCE to get started, then have a nightly procedure that builds any new records based on the day just completed. This way it never has to run through all 400M records.
If your pre-aggregate tables store based on just the date (y, m, d only), your queries rolled up per day would shorten the querying requirements. The COUNT(*) is just an example basis, but you could add count(distinct whateverColumn) as needed. Then you could query the SUM(aggregateColumn) based on domain, date range, etc. If your 400M records are reduced down to 7M records, I would also have at minimum an index on (domain, dateOnlyField, and maybe country) to optimize your domain, date-range queries. Once you get something narrowed down to whatever level makes sense, you can always drill into the raw data for the granular level.
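As a sketch of that pre-aggregate idea (the table and column names here are assumptions, and the rollup is by domain, day and country only):
CREATE TABLE page_views_daily (
    domain    VARCHAR(50)  NOT NULL,
    view_date DATE         NOT NULL,
    country   CHAR(2)      NOT NULL DEFAULT '',
    views     INT UNSIGNED NOT NULL,
    visitors  INT UNSIGNED NOT NULL,  -- count(DISTINCT guid) for that day
    sessions  INT UNSIGNED NOT NULL,  -- count(DISTINCT sid) for that day
    PRIMARY KEY (domain, view_date, country)
);
-- Nightly job: roll up the day just completed.
INSERT INTO page_views_daily (domain, view_date, country, views, visitors, sessions)
SELECT domain, DATE(stats_time), IFNULL(country, ''),
       COUNT(*), COUNT(DISTINCT guid), COUNT(DISTINCT sid)
FROM page_views
WHERE stats_time >= CURDATE() - INTERVAL 1 DAY
  AND stats_time < CURDATE()
GROUP BY domain, DATE(stats_time), IFNULL(country, '');
Note that distinct visitor/session counts summed across days over-count visitors who return on multiple days; the daily rollup answers per-day questions, and the raw table remains available for exact multi-day distinct counts.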

Constructing a DB for best performance

I'm working on "online streaming" project and I need some help in constructing a DB for best performance. Currently I have one table containing all relevant information for the player including file, poster image, post_id etc.
+---------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| post_id | int(11) | YES | | NULL | |
| file | mediumtext | NO | | NULL | |
| thumbs_img | mediumtext | YES | | NULL | |
| thumbs_size | mediumtext | YES | | NULL | |
| thumbs_points | mediumtext | YES | | NULL | |
| poster_img | mediumtext | YES | | NULL | |
| type | int(11) | NO | | NULL | |
| uuid | varchar(40) | YES | | NULL | |
| season | int(11) | YES | | NULL | |
| episode | int(11) | YES | | NULL | |
| comment | text | YES | | NULL | |
| playlistName | text | YES | | NULL | |
| time | varchar(40) | YES | | NULL | |
| mini_poster | mediumtext | YES | | NULL | |
+---------------+-------------+------+-----+---------+----------------+
With 100k records it takes around 0.5 sec per query, and performance constantly degrades as I add more records.
+----------+------------+----------------------------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+----------------------------------------------------------------------+
| 1 | 0.04630675 | SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1' |
+----------+------------+----------------------------------------------------------------------+
explain SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1';
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | dle_playerFiles | ALL | NULL | NULL | NULL | NULL | 61777 | Using where |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
How can I improve the DB structure? How do big websites like YouTube construct their databases?
Generally, when query time is directly proportional to the number of rows, that suggests a table scan, which means that for a query like
SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1'
the database is executing it literally, as in: iterate over every single row and check whether it meets the criteria.
The typical solution to this is an index, which is a precomputed list of values for a column (or set of columns) and a list of rows which have said value.
If you create an index on the post_id column on dle_playerFiles, then the index would essentially say
1: <some row pointer>, <some row pointer>, <some row pointer>
2: <some row pointer>, <some row pointer>, <some row pointer>
...
100: <some row pointer>, <some row pointer>, <some row pointer>
...
7000: <some row pointer>, <some row pointer>, <some row pointer>
250000: <some row pointer>, <some row pointer>, <some row pointer>
Therefore, with such an index in place, the above query would simply look at node 7000 of the index and know which rows contain it.
Then the database only needs to read the rows where post_id is 7000 and check if their type is 1.
This will be much quicker because the database never needs to look at every row to handle a query. The costs of an index:
Storage space - this is more data and it has to be stored somewhere
Update time - databases keep indexes in sync with changes to the table automatically, which means that INSERT, UPDATE and DELETE statements will take longer because they also need to update the index. For small and efficient indexes, this tradeoff is usually worth it.
For your query, I recommend you create an index on 2 columns. Make them part of the same index, not 2 separate indexes:
create index ix_dle_playerFiles__post_id_type on dle_playerFiles (post_id, type)
Caveats to this working efficiently:
SELECT * is bad here. If you are returning every column, then the database must go to the table to read the columns because the index only contains the columns for filtering. If you really only need one or two of the columns, specify them explicitly in the SELECT clause and add them to your index. Do NOT do this for many columns as it just bloats the index.
Functions and type conversions tend to prevent index usage. Your SQL wraps the integer types post_id and type in quotes so they are interpreted as strings. The database may feel that an index can't be used because it has to convert everything. Remove the quotes for good measure.
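Taken together, those two caveats might turn the query into something like this (the column list here is only an example; select whichever fields the player actually needs):
SELECT id, file, poster_img
FROM dle_playerFiles
WHERE post_id = 7000
  AND type = 1;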
If I read your Duration correctly, it appears to take 0.04630675 (seconds?) to run your query, not 0.5s.
Regardless, proper indexing can decrease the time required to return query results. Based on your query SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1', an index on post_id and type would be advisable.
Also, if you don't absolutely require all the fields to be returned, use individual column references of the fields you require instead of the *. The fewer fields, the quicker the query will return.
Another way to optimize a query is to ensure that you use the smallest data types possible - especially in primary/foreign key and index fields. Never use a bigint or an int when a mediumint, smallint or better still, a tinyint will do. Never, ever use a text field in a PK or FK unless you have no other choice (this one is a DB design sin that is committed far too often IMO, even by people with enough training and experience to know better) - you're far better off using the smallest exact numeric type possible. All this has positive impacts on storage size too.
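As a sketch of that data-type advice applied to the table above (assuming type, season and episode stay within small ranges; verify the actual values before changing anything):
ALTER TABLE dle_playerFiles
    MODIFY type TINYINT UNSIGNED NOT NULL,
    MODIFY season SMALLINT UNSIGNED NULL,
    MODIFY episode SMALLINT UNSIGNED NULL;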