mysql optimize data content: multi column or simple column hash data - mysql

I actually have a table with 30 columns. In one day this table can get around 3,000 new records!
The column data looks like:

IMG                                  | Name       | Phone     | etc.
-------------------------------------+------------+-----------+-----
http://www.site.com/images/image.jpg | John Smith | 123456789 | etc.
http://www.site.com/images/image.jpg | Smith John | 987654321 | etc.
I'm looking for a way to optimize both the size of the table and the response time of the SQL queries. I was thinking of doing something like:

Column1
http://www.site.com/images/image.jpg|John Smith|123456789|etc..

Then via PHP I would split each value back out into an array.
Would it be faster?
Edit
So to take an example of the structure, let's say I have two tables:
package
package_content
Here is the structure of the table package :
id | user_id | package_name | date
Here is the structure of the table package_content :
id | package_id | content_name | content_description | content_price | content_color | etc.. > 30columns
The thing is, for each package I can get up to 16 rows of content. For example:

id  | user_id | package_name | date
260 | 11      | Package 260  | 2013-07-30 10:05:00

id | package_id | content_name | content_description | content_price | content_color | etc. (> 30 columns)
1  | 260        | Content 1    | Content 1 desc      | 58            | white         | etc.
2  | 260        | Content 2    | Content 2 desc      | 75            | black         | etc.
3  | 260        | Content 3    | Content 3 desc      | 32            | blue          | etc.
etc.
Then with PHP I do something like:

select * from package
while not EOF {
    show package name, date, etc.
    select * from package_content where package_content.package_id = package.id
    while not EOF {
        show package_content name, desc, price, color, etc.
    }
}

Would it be faster? Definitely not. If you needed to search by Name or Phone or any other field, you'd have to pull those values out of Column1 every time. You'd never be able to optimize those queries, ever.
If you want to make the table smaller it's best to look at splitting some columns off into another table. If you'd like to pursue that option, post the entire structure. But note that the number of columns doesn't affect speed that much. I mean it can, but it's way down on the list of things that will slow you down.
Finally, 3,000 rows per day is about 1 million rows per year. If the database is tolerably well designed, MySQL can handle this easily.
Addendum: partial table structures plus sample query and pseudocode added to question.
The pseudocode shows the package table being queried all at once, then matching package_content rows being queried one at a time. This is a very slow way to go about things; better to use a JOIN:
SELECT
    package.id,
    user_id,
    package_name,
    date,
    package_content.*
FROM package
INNER JOIN package_content ON package.id = package_content.package_id
WHERE whatever
ORDER BY whatever
That will speed things up right away.
If you're displaying on a web page, be sure to limit results with a WHERE clause - nobody will want to see 1,000 or 3,000 or 1,000,000 packages on a single web page :)
Finally, as I mentioned before, the number of columns isn't a huge worry for query optimization, but...
Having a really wide result row means more data has to go across the wire from MySQL to PHP, and
It isn't likely you'll be able to display 30+ columns of information on a web page without it looking terrible, especially if you're reading lots of rows.
With that in mind, you'll be better off picking specific package_content columns in your query instead of grabbing them all with SELECT *.
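To make the single-JOIN approach concrete, here is a minimal sketch (not from the original answer) that uses Python's built-in sqlite3 module as a stand-in for MySQL + PHP, with invented sample data; the pattern of running one JOIN and then grouping the joined rows per package in a single pass carries over directly:

```python
import sqlite3

# Toy schema standing in for the package / package_content tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (id INTEGER PRIMARY KEY, user_id INTEGER,
                          package_name TEXT, date TEXT);
    CREATE TABLE package_content (id INTEGER PRIMARY KEY, package_id INTEGER,
                                  content_name TEXT, content_price REAL);
    INSERT INTO package VALUES (260, 11, 'Package 260', '2013-07-30 10:05:00');
    INSERT INTO package_content VALUES
        (1, 260, 'Content 1', 58), (2, 260, 'Content 2', 75);
""")

# One JOIN instead of one query per package; group rows by package while
# iterating the cursor, instead of issuing a nested SELECT per package.
rows = conn.execute("""
    SELECT package.id, package.package_name, package.date,
           package_content.content_name, package_content.content_price
    FROM package
    INNER JOIN package_content ON package.id = package_content.package_id
    ORDER BY package.id
""")

packages = {}
for pkg_id, name, date, content_name, price in rows:
    pkg = packages.setdefault(pkg_id, {"name": name, "date": date, "contents": []})
    pkg["contents"].append((content_name, price))
```

The same shape in PHP would be one query followed by one loop that notices when the package id changes.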

Don't combine any columns; it gains you nothing and might even end up slower.
You should put indexes on the columns you query against. I have a website with a table of about 30 columns that currently holds around 600,000 rows. If you put EXPLAIN in front of a query, you can see whether it uses any indexes. If you have a JOIN on two columns and a WHERE on the same table, you should create a combined index on those three columns, ordered JOIN columns first, then the WHERE column. If you join the same table a second time, treat that as a separate index.
For example:
SELECT p.name, p.id, c.name, c2.name
FROM product p
JOIN category c ON p.cat_id = c.id
JOIN category c2 ON c.parent_id = c2.id AND c2.name = 'Niels'
WHERE p.filterX = 'blaat'
You should have a combined index on category (parent_id, name), plus the index on id (probably the auto-increment primary key already), and a combined index on product (cat_id, filterX).
With this simple solution you can take queries from not doable at all to 0.10 seconds, or even faster.
If you use MySQL 5.6 you should switch over to InnoDB, because MySQL has gotten better at optimizing JOINs and subqueries there. MySQL will also try to run them in memory, which makes things a lot faster as well. Please keep in mind that backing up InnoDB tables may need some extra attention.
You might also think about using MEMORY tables for super-fast querying (you still need indexes).
You can also optimize by choosing right-sized integer types (a plain INT is 4 bytes; the (11) is only a display width) and by not defaulting every string column to VARCHAR(255).
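As a hedged illustration of the leading-column rule behind the combined-index advice above, the following sketch uses SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN (the table and index names are invented): the combined index is only seekable when its leading column is constrained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT,
                          cat_id INTEGER, filterX TEXT);
    -- Combined index, JOIN/filter column first, then the WHERE column.
    CREATE INDEX idx_cat_filter ON product (cat_id, filterX);
""")

def plan(sql):
    # Return the planner's one-line description of how the query would run.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " / ".join(r[3] for r in rows)

# Both columns constrained, leading column included: index seek.
indexed = plan("SELECT name FROM product WHERE cat_id = 3 AND filterX = 'blaat'")

# Only the trailing column constrained: the combined index cannot seek,
# so the planner falls back to a full table scan.
scanned = plan("SELECT name FROM product WHERE filterX = 'blaat'")

print(indexed)
print(scanned)
```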

Related

MySQL query performance with reference tables

For the following two table structures, assuming the data volume is really high:
cars table
Id | brand name | make year | purchase year | owner name
Is there any query performance benefit with structuring it this way and joining the 2 tables instead?
cars table
Id | brand_id | make year | purchase year | owner name
brands table
Id | name
Also, if all 4 columns fall in my where clause, does it make sense indexing any?
I would at least have INDEX(owner_name) since that is very selective. Having INDEX(owner_name, model_year) won't help enough to matter for this type of data. There are other cases where I would recommend a 4-column composite index.
"data volume is really high". If you are saying there are 100K rows, then it does not matter much. If you are saying a billion rows, then we need to get into a lot more details.
"data volume is really high". 10 queries/second -- Yawn. 1000/second -- more details, please.
2 tables vs 1.
Data integrity - someone could mess up the data either way
Speed -- a 1-byte TINYINT UNSIGNED (range 0..255) is smaller than an average of about 7 bytes for a VARCHAR(55) brand. But it is hardly enough smaller to matter for space or speed. (And if you goof and make brand_id a BIGINT, which is 8 bytes -- well, oops!)
Indexing all columns is different than having no indexes. But "indexing all" is ambiguous:
INDEX(user), INDEX(brand), INDEX(year), ... is likely to make it efficient to search or sort by any of those columns.
INDEX(user, brand, year), ... makes it especially efficient to search by all those columns (with =), or certain ORDER BYs.
No index implies scanning the entire table for any SELECT.
Another interpretation of what you said (plus a little reading between the lines): Might you be searching by any combination of columns? Perhaps non-= things like year >= 2016? Or make IN ('Toyota', 'Nissan')?
Study http://mysql.rjweb.org/doc.php/index_cookbook_mysql
An argument for 1 table
If you need to do
WHERE brand = 'Toyota'
AND year = 2017
Then INDEX(brand, year) (in either order) is possible and beneficial.
But... If those two columns are in different tables (as with your 2-table example), then you cannot have such an index, and performance will suffer.
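A small sketch of that contrast (SQLite's planner standing in for MySQL's EXPLAIN; table and index names invented): in the one-table layout a single combined index answers both predicates, while in the two-table layout the same predicates necessarily become a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One-table layout: brand and year live together, so one combined
    -- index can satisfy both predicates at once.
    CREATE TABLE cars1 (id INTEGER PRIMARY KEY, brand TEXT, year INTEGER);
    CREATE INDEX idx_brand_year ON cars1 (brand, year);

    -- Two-table layout: the same predicates now span two tables.
    CREATE TABLE cars2 (id INTEGER PRIMARY KEY, brand_id INTEGER, year INTEGER);
    CREATE TABLE brands (id INTEGER PRIMARY KEY, name TEXT);
""")

one_table = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT id FROM cars1 WHERE brand = 'Toyota' AND year = 2017
""").fetchall()

two_tables = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT c.id FROM cars2 c JOIN brands b ON c.brand_id = b.id
    WHERE b.name = 'Toyota' AND c.year = 2017
""").fetchall()

print(one_table)   # single step: a seek on idx_brand_year
print(two_tables)  # two steps: a join across both tables
```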

Calculating frequency of password hashes efficiently in MySQL

For my bachelor thesis I have to analyze a password leak. I have a table with two columns, MEMBER_EMAIL and MEMBER_HASH.
I want to calculate the frequency of each hash efficiently,
so that the output looks like:
Hash | Amount
----------------
2e3f.. | 345
2f2e.. | 288
b2be.. | 189
My query until now was straight forward:
SELECT MEMBER_HASH AS hashed, count(*) AS amount
FROM thesis.fulllist
GROUP BY hashed
ORDER BY amount DESC
While it works fine for smaller tables, I have problems computing the query on the whole list (112 million entries), where it takes over 2 days and ends in a weird connection-timeout error even though my settings for that are fine.
So I wonder if there is a better way to calculate this (as I can't really think of any); I would appreciate any help!
Your query can't be rewritten much, as it's quite simple. The only way I can think of to improve how it executes is to index MEMBER_HASH.
This is how you can do it :
ALTER TABLE `table` ADD INDEX `hashed` (`MEMBER_HASH`);
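For illustration, here is the whole recipe in miniature (SQLite standing in for MySQL, with made-up data): create the index, then run the unchanged GROUP BY query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fulllist (MEMBER_EMAIL TEXT, MEMBER_HASH TEXT)")
conn.executemany(
    "INSERT INTO fulllist VALUES (?, ?)",
    [("a@x.com", "2e3f"), ("b@x.com", "2e3f"),
     ("c@x.com", "2e3f"), ("d@x.com", "2f2e")],
)

# The suggested fix: an index on the grouped column, so the counts can be
# produced by walking the index instead of sorting the whole table.
conn.execute("CREATE INDEX hashed ON fulllist (MEMBER_HASH)")

freq = conn.execute("""
    SELECT MEMBER_HASH AS hashed, COUNT(*) AS amount
    FROM fulllist
    GROUP BY hashed
    ORDER BY amount DESC
""").fetchall()
print(freq)
```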

MYSQL slow duration or fetch time depending on "distinct" command

I have a pretty small, simple MySQL table holding precalculated financial data. The table looks like:

refDate | instrument | rate | startDate | maturityDate | carry1 | carry2 | carry3

with 3 indexes defined as:

unique unique_ID (refDate, instrument)
refDate (refDate)
instrument (instrument)

The table currently has about 10 million rows, though for each refDate there are only about 5,000 distinct instruments right now.
I have a query that self joins on this table to generate an output like:
refDate|rate instrument=X | rate instrument = Y| rate instrument=Z|....
basically returning time series data which I can then do my own analytics in.
Here is the problem: my original query looked like:
Select distinct AUDSpot1yFq.refDate,
       AUDSpot1yFq.rate as 'AUDSpot1yFq',
       AUD1y1yFq.rate as AUD1y1yFq
from audratedb AUDSpot1yFq
inner join audratedb AUD1y1yFq
      on AUDSpot1yFq.refDate = AUD1y1yFq.refDate
where AUDSpot1yFq.instrument = 'AUDSpot1yFq'
  and AUD1y1yFq.instrument = 'AUD1y1yFq'
order by AUDSpot1yFq.refDate
Note, in this particular query for timing below, I was actually getting 10 different instruments, which means the query was much longer but followed this same pattern of naming, inner joins, and where statements.
This was slow: in Workbench I timed it at 7-8 seconds duration (but near-zero fetch time, as I have Workbench on the machine running the server). When I stripped the DISTINCT, the duration dropped to 0.25-0.5 seconds (far more manageable), and when I stripped the ORDER BY it got even faster (<0.1 seconds, at which point I don't care). But my fetch time exploded to ~7 seconds. So in total I gained nothing; it has all become a fetch-time issue. When I run this query from the Python scripts that will be doing the actual work, I get roughly the same timing whether I include DISTINCT or not.
When I run an EXPLAIN on my cut-down query (which has the horrid fetch time), I get:
1 SIMPLE AUDSpot1yFq ref unique_ID,refDate,instrument instrument 39 const 1432 100.00 Using where
1 SIMPLE AUD1y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD2y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD3y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD4y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD5y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD6y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD7y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD8y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD9y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
I now realize distinct is not required, and order by is something I can throw out and sort in pandas when I get the output to a dataframe. That is great. But I don't know how to get the Fetch time down. I'm not going to win any competency competitions on this website, but I have searched as much as I can and can't find a solution for this issue. Any help is greatly appreciated.
~cocoa
(I had to simplify the table aliases in order to read it:)
Select distinct
s.refDate,
s.rate as AUDSpot1yFq,
y.rate as AUD1y1yFq
from audratedb AS s
join audratedb AS y on s.refDate = y.refDate
where s.instrument = 'AUDSpot1yFq'
and y.instrument = 'AUD1y1yFq'
order by s.refDate
Index needed:
INDEX(instrument, refDate) -- To filter and sort, or
INDEX(instrument, refDate, rate) -- to also "cover" the query.
That assumes the query is not more complex than you said. I see that the EXPLAIN already has many more tables. Please provide SHOW CREATE TABLE audratedb and the entire SELECT.
Back to your questions...
DISTINCT is done one of two ways: (1) sort the table, then dedup, or (2) dedup in a hash in memory. Keep in mind that you are dedupping all 3 columns (refDate, s.rate, y.rate).
ORDER BY is a sort after gathering all the data. However, with the suggested index (not the indexes you had), the sort is not needed, since the index will get the rows in the desired order.
But... Having both DISTINCT and ORDER BY may confuse the Optimizer to the point where it does something 'dumb'.
You say that (refDate,instrument) is UNIQUE, but you do not mention a PRIMARY KEY, nor have you mentioned which Engine you are using. If you are using InnoDB, then PRIMARY KEY(instrument, refDate), in that order, would further speed things up, and avoid the need for any new index.
Furthermore, it is redundant to have (a,b) and also (a). That is, your current schema does not need INDEX(refDate), but by changing the PK, you would not need INDEX(instrument), instead.
Bottom line: Only
PRIMARY KEY(instrument, refDate),
INDEX(refDate)
and no other indexes (unless you can show some query that needs it).
More on the EXPLAIN. Notice how the Rows column says 1432, 1, 1, ... That means that it scanned an estimated 1432 rows of the first table. This is probably far more than necessary because of lack of a proper index. Then it needed to look at only 1 row in each of the other tables. (Can't get better than that.)
How many rows in the SELECT without the DISTINCT or ORDER BY? That tells you how much work was needed after doing the fetching and JOINing. I suspect it is only a few. A "few" is really cheap for DISTINCT and ORDER BY; hence I think you were barking up the wrong tree. Even 1432 rows would be very fast to process.
As for the buffer_pool... How big is the table? Do SHOW TABLE STATUS. I suspect the table is more than 1GB, hence it cannot fit in the buffer_pool. Hence raising that cache size would let the query run in RAM, not hitting the disk (at least after it gets cached). Keep in mind that running a query on a cold cache will have lots of I/O. As the cache warms up, queries will run faster. But if the cache is too small, you will continue to need I/O. I/O is the slowest part of the processing.
I hope you have at least 6GB of RAM; otherwise, 2G could be dangerously large. Swapping is really bad for performance.
The question doesn't mention existing indexes, or show the output from an EXPLAIN for any of the queries.
The quick answer to improve performance is to add an index:
... ON audratedb (instrument,refdate,rate)
To answer why we'd want to add that index, we'd need to understand how MySQL processes SQL statements, what operations are possible, and which are required. To see how MySQL is actually processing your statement, you need to use EXPLAIN to see the query plan.
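To illustrate what that index buys, here is a hedged sketch (SQLite's planner as a stand-in for MySQL's EXPLAIN; the index name is invented): with (instrument, refDate, rate), the query is answered from the index alone ("covered"), and the rows come back already in refDate order, so no separate sort pass is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE audratedb (refDate TEXT, instrument TEXT, rate REAL);
    -- Covering index: filter column first, then the ORDER BY column,
    -- then the selected value.
    CREATE INDEX idx_instr_date_rate ON audratedb (instrument, refDate, rate);
""")

plan_rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT refDate, rate
    FROM audratedb
    WHERE instrument = 'AUDSpot1yFq'
    ORDER BY refDate
""").fetchall()
plan_text = " / ".join(r[3] for r in plan_rows)
print(plan_text)
```

The planner reports a covering-index search and, notably, no temporary sort step for the ORDER BY.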

database schema one column entry references many rows from another table

Let's say we have a table called Workorders and another table called Parts. I would like to have a column in Workorders called parts_required. This column would contain a single item that tells me what parts were required for that workorder. Ideally, this would contain the quantities as well, but a second column could contain the quantity information if needed.
Workorders looks like:

WorkorderID | date | parts_required
------------+------+---------------
1           | 2/24 | ?
2           | 2/25 | ?
3           | 3/16 | ?
4           | 4/20 | ?
5           | 5/13 | ?
6           | 5/14 | ?
7           | 7/8  | ?

Parts looks like:

PartID | name        | cost
-------+-------------+-----
1      | engine      | 100
2      | belt        | 5
3      | big bolt    | 1
4      | little bolt | 0.5
5      | quart oil   | 8
6      | Band-aid    | 0.1
Idea 1: create a string like '1-1:2-3:4-5:5-4'. My application would parse this string and show that I need --> 1 engine, 3 belts, 5 little bolts, and 4 quarts of oil.
Pros - simple enough to create and understand.
Cons - will make deep introspection into our data much more difficult. (costs over time, etc)
Idea 2: use a binary number. For example, to reference the above list (engine, belt, little bolts, oil) using an 8-bit integer would be 54, because 54 in binary representation is 110110.
Pros - datatype is optimal concerning size. Also, I am guessing there are tricky math tricks I could use in my queries to search for parts used (don't know what those are, correct me if I'm in the clouds here).
Cons - I do not know how to handle quantity using this method. Also, even a 64-bit BIGINT only gives me 64 parts that can be in my table, and I expect many hundreds.
Any ideas? I am using MySQL. I may be able to use PostgreSQL, and I understand it has more flexible datatypes like JSON and arrays, but I am not familiar with how querying those would perform. Also, it would be much easier to stay with MySQL.
Why not create a relationship table?
You can create a table named workorders_parts with the following columns:

workorderId | partId

So when you want to get all parts of a specific workorder, you just type:

select p.name
from parts p
inner join workorders_parts wp on wp.partId = p.partId
where wp.workorderId = x;

What the query says is: give me the names of the parts that belong to workorderId = x and are listed in the table workorders_parts.
Remember that INNER JOIN means "intersection"; in other words, the data you're looking for (generally the id) should exist in both tables.
It will give you all the part names that are used to build workorder x.
Let's say we have workorderId = 1 with partId = 1, 2, 3. It will be represented in our relationship table as:

workorderId | partId
------------+-------
1           | 1
1           | 2
1           | 3
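Since the question also asked about quantities, here is a minimal sketch (using Python's sqlite3 as a stand-in for MySQL; the quantity column is an addition not shown in the answer above) of the relationship table extended with a quantity per part:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts (PartID INTEGER PRIMARY KEY, name TEXT, cost REAL);
    CREATE TABLE workorders (WorkorderID INTEGER PRIMARY KEY, date TEXT);
    -- Relationship table, with a quantity column so '3 belts' is one row.
    CREATE TABLE workorders_parts (
        workorderId INTEGER REFERENCES workorders(WorkorderID),
        partId INTEGER REFERENCES parts(PartID),
        quantity INTEGER NOT NULL,
        PRIMARY KEY (workorderId, partId)
    );
    INSERT INTO parts VALUES (1, 'engine', 100), (2, 'belt', 5),
                             (5, 'quart oil', 8);
    INSERT INTO workorders VALUES (1, '2/24');
    INSERT INTO workorders_parts VALUES (1, 1, 1), (1, 2, 3), (1, 5, 4);
""")

# All parts (and how many of each) for workorder 1.
parts = conn.execute("""
    SELECT p.name, wp.quantity
    FROM parts p
    INNER JOIN workorders_parts wp ON wp.partId = p.PartID
    WHERE wp.workorderId = 1
    ORDER BY p.PartID
""").fetchall()
print(parts)
```

Unlike the delimited-string or bitmask ideas, this stays queryable: costs over time, most-used parts, etc. are all plain JOINs and aggregates.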

MySQL, how to repeat same line x times

I have a query that outputs address order data:
SELECT ordernumber
, article_description
, article_size_description
, concat(NumberPerBox,' pieces') as contents
, NumberOrdered
FROM customerorder
WHERE customerorder.id = 1;
I would like the above row to be output NumberOrdered (e.g. 50,000) divided by NumberPerBox (e.g. 2,000) = 25 times.
Is there a SQL query that can do this, I'm not against using temporary tables to join against if that's what it takes.
I checked out previous questions; however, the nearest one ("is to be posible in mysql repeat the same result") only gave answers that produce a fixed number of rows, and I need it to be dynamic depending on the value of (NumberOrdered div NumberPerBox).
The result I want is:
Boxnr Ordernr as_description contents NumberOrdered
------+--------------+----------------+-----------+---------------
1 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
2 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
....
25 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
First, let me say that I am more familiar with SQL Server, so my answer has a bit of a bias.
Second, I did not test my code sample; it should probably be used as a reference point to start from.
It would appear to me that this situation is a prime candidate for a numbers table. Simply put, it is a table (usually called "Numbers") that is nothing more than a single PK column of integers from 1 to n. Once you've used a Numbers table and are aware of how it's used, you'll start finding many uses for it, such as querying for time intervals, string splitting, etc.
That said, here is my untested response to your question:
SELECT IV.number AS Boxnr,
       ordernumber,
       article_description,
       article_size_description,
       concat(NumberPerBox, ' pieces') AS contents,
       NumberOrdered
FROM customerorder
INNER JOIN (
    SELECT Numbers.number,
           customerorder.ordernumber,
           customerorder.NumberPerBox
    FROM Numbers
    INNER JOIN customerorder
        ON Numbers.number BETWEEN 1 AND customerorder.NumberOrdered / customerorder.NumberPerBox
    WHERE customerorder.id = 1
) AS IV ON customerorder.ordernumber = IV.ordernumber
As I said, most of my experience is in SQL Server. I reference http://www.sqlservercentral.com/articles/Advanced+Querying/2547/ (registration required). However, there appear to be quite a few resources available when I search for "SQL numbers table".
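Here is the numbers-table idea as a small runnable sketch, using Python's sqlite3 in place of SQL Server/MySQL, with invented sample data: the join against Numbers fans each order row out into exactly NumberOrdered / NumberPerBox rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Numbers (number INTEGER PRIMARY KEY);
    CREATE TABLE customerorder (id INTEGER PRIMARY KEY, ordernumber TEXT,
                                NumberPerBox INTEGER, NumberOrdered INTEGER);
    INSERT INTO customerorder VALUES (1, 'CORDO1245', 2000, 50000);
""")
# Fill the Numbers table 1..1000 (in practice you'd size it generously once).
conn.executemany("INSERT INTO Numbers VALUES (?)", [(i,) for i in range(1, 1001)])

rows = conn.execute("""
    SELECT Numbers.number AS Boxnr, customerorder.ordernumber
    FROM Numbers
    INNER JOIN customerorder
        ON Numbers.number BETWEEN 1
           AND customerorder.NumberOrdered / customerorder.NumberPerBox
    WHERE customerorder.id = 1
    ORDER BY Boxnr
""").fetchall()
print(len(rows))  # 50000 / 2000 = 25 repeated rows, numbered 1..25
```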