MySQL speed of this query and optimized table structure

I need to build a new table system; it will store an id and 10 VARCHAR(255) fields per id. That is all it will need to store. Other than the obvious insert/delete/update on whole rows only, the only other query that will be run is SELECT * FROM table WHERE id='id'. There are 7 million records.
I have 2 structures I came up with, which are:
(1) - a single table: the id, then the 10 varchars; no joins, nothing fancy, id is the primary key, simple SELECT *.
(2) - two tables: the first has the id and 10 integer columns; the second has an auto-increment integer and the varchars. This would use joins; I would guess 10 joins per query.
Clearly 2 is better as a formal structure and for later table structural changes, BUT, in terms of SPEED of querying alone, which is better?

If you will always be storing exactly 10 VARCHAR fields and each VARCHAR has its own meaning (like it's always a name, or always an address etc.), then just create a table with 10 fields.
Your second solution is called EAV (entity-attribute-value), which is mostly used for sparse matrices (when you have lots of possible attributes with only a few of them set for a given entity). It is scalable and maintainable, but will be less efficient for a query like yours.
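To make the comparison concrete, here is a minimal sketch of both structures (the table and column names are mine, and only three of the ten columns are written out):

-- Structure 1: one wide table; the whole row comes back from a single primary-key lookup
CREATE TABLE wide_entity (
    id   INT UNSIGNED NOT NULL PRIMARY KEY,
    col1 VARCHAR(255),
    col2 VARCHAR(255),
    col3 VARCHAR(255)            -- ... through col10
) ENGINE=InnoDB;

SELECT * FROM wide_entity WHERE id = 42;

-- Structure 2: the main table holds integer references; the strings live in a lookup table
CREATE TABLE ref_entity (
    id    INT UNSIGNED NOT NULL PRIMARY KEY,
    v1_id INT UNSIGNED,
    v2_id INT UNSIGNED,
    v3_id INT UNSIGNED           -- ... through v10_id
) ENGINE=InnoDB;

CREATE TABLE string_value (
    value_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    value    VARCHAR(255)
) ENGINE=InnoDB;

-- One logical row now needs one join per varchar (ten in total):
SELECT e.id, s1.value, s2.value, s3.value    -- ... through s10.value
FROM ref_entity e
JOIN string_value s1 ON s1.value_id = e.v1_id
JOIN string_value s2 ON s2.value_id = e.v2_id
JOIN string_value s3 ON s3.value_id = e.v3_id  -- ... through v10_id
WHERE e.id = 42;

With the first structure the SELECT is a single primary-key lookup; with the second, MySQL must do that lookup plus ten more index lookups into string_value.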

Definitely, in terms of speed the first one is better. Joins are really slow on big tables.

Related

Performance: Why JOIN is faster than IN

I tried to optimize some PHP code that performs a lot of queries on different tables (that include data).
The logic was to take some fields from each table by neighborhood id(s), depending on whether it was a city (many neighborhood ids) or a specific neighborhood.
For example, assume that I have 10 tables of this format:
neighborhood_id | some_data_field
The queries were something like that:
SELECT `some_data_field`
FROM `table_name` AS `data_table`
LEFT JOIN `neighborhoods_table` AS `neighborhoods` ON `data_table`.`neighborhood_id` = `neighborhoods`.`neighborhood_id`
WHERE `neighborhoods`.`city_code` = SOME_ID
Because there were about 10 queries like that, I tried to optimize the code by removing the join from those 10 queries and performing one query against the neighborhoods table to get all the neighborhood ids.
Then, in each query, I used WHERE ... IN on those neighborhood ids.
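In other words, the rewritten code looked roughly like this (a sketch using the same table names as the query above; the literal ids stand in for whatever the first query returned):

-- One query fetches the neighborhood ids for the city...
SELECT `neighborhood_id`
FROM `neighborhoods_table`
WHERE `city_code` = SOME_ID;

-- ...and each of the 10 data queries then filters with IN instead of joining:
SELECT `some_data_field`
FROM `table_name` AS `data_table`
WHERE `data_table`.`neighborhood_id` IN (17, 23, 42);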
The expected result was better performance, but it turned out that it wasn't.
When I send a request to my server, the first query takes 20ms, the second takes more, the third more still, and so on (the second and third take something like 200ms), whereas with JOIN the first query takes 40ms and the rest take 20-30ms.
The first query in each request suggests that WHERE ... IN is faster, but I assume MySQL has some cache that helps when dealing with JOINs.
So I wanted to know: how can I improve my WHERE ... IN queries?
EDIT
I read the answer and comments and realized I didn't explain well why I have 10 tables: each table is categorized by a property.
For example, one table contains values by floor, one by rooms, and one by date,
so it isn't possible to UNION all the tables into one.
Second Edit
I'm still being misunderstood.
I don't have only one data column per table; every table has its own number of fields (it can be 5 fields for one table and 3 for another) and different data types or formatting, such as dates or money amounts. Additionally, in my queries I perform some calculations on those fields; sometimes it's an AVG or a weighted average, and in some tables it's only a pure SELECT.
I also GROUP BY some fields; in one table it can be by rooms, and in another by floor.
For example, assume that I have 10 tables of this format:
This is the basis of your problem. Don't store the same information in multiple tables. Store the results in a single table and let MySQL optimize the query.
If the original table had "information" -- say the month the data was generated -- then you may need to include this as an additional column.
Once the data is in a single table, you can use indexes and partitioning to speed the queries.
Note that storing the data in a single table may require changes to your ingestion processes -- namely, inserting the data rather than creating a new table. But your queries will be simpler and you can optimize the database.
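A rough sketch of what such a consolidated table could look like, under the assumption that the per-property data fields can be reconciled into shared columns (all names here are invented):

-- One table instead of ten, with a column recording which property a row describes
CREATE TABLE neighborhood_data (
    neighborhood_id INT UNSIGNED  NOT NULL,
    property        VARCHAR(32)   NOT NULL,   -- e.g. 'floor', 'rooms', 'date'
    some_data_field DECIMAL(12,2) NOT NULL,
    INDEX idx_property_nbhd (property, neighborhood_id)
) ENGINE=InnoDB;

-- The ten per-table queries collapse into one indexed query per property:
SELECT d.some_data_field
FROM neighborhood_data AS d
JOIN neighborhoods_table AS n ON n.neighborhood_id = d.neighborhood_id
WHERE n.city_code = SOME_ID
  AND d.property = 'rooms';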
As for which is faster, an IN or a JOIN: both do similar things under the hood. In some circumstances one or the other is faster, but both should make use of indexes and partitions if they are available.

Best database design for efficient analysis of some millions of records

I have a basic question about database designing.
I have a lot of files which I have to read and insert into a database. Each file has a few thousand lines and each line has about 30 fields (of these types: smallint, int, bigint, varchar, json). Of course I use multiple threads along with bulk inserting to increase insert speed (in the end I have 30-40 million records).
After inserting I want to run some sophisticated analysis, and performance is important to me.
Once I have each line's fields and am ready to insert, I have 3 approaches:
1 - One big table:
In this case I create one big table with 30 columns and store all of the files' fields in it. So there is a huge table on which I want to run a lot of analysis.
2 - A fairly large table (A) and some small tables (B):
In this case I create some small tables consisting of the columns whose values repeat a lot once separated from the other columns. These small tables have only a few hundred or thousand records instead of 30 million. In the fairly large table (A) I omit those columns and use foreign keys instead. In the end I have a table (A) with 20 columns and 30 million records, plus some tables (B) with 2-3 columns and 100-50,000 records each. So in order to analyze table A, I have to use some joins, for example in SELECTs and so on.
3 - Just a fairly large table:
In this case I create a fairly large table like table A in the case above (with 20 columns), but instead of using foreign keys I use a mapping between source columns and destination columns (something like foreign keys, with a small difference). For example, take 3 columns c1, c2, c3 that in case 2 I would put in another table B and access via a foreign key; now I assign a specific number to each distinct combination of c1, c2, c3 at insert time and keep the relation between the record and its assigned value in the application code. So this table is exactly like table A in case 2, but there is no need to use a join in the SELECTs.
While insert time is important, the analysis time matters more to me, so I'd like your opinion on which of these cases is better, and I'd also be glad to see other solutions.
From a design perspective, 30 to 40 million is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server then you could consider putting the large table on a separate database filegroup. I have worked on one case in a similar fashion where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire data set in one shot, you could consider partitioning the data. You could use a partition scheme based on your need; one example would be to split the data into yearly partitions, which helps if your analysis is limited to a year's worth of data (just an example).
The major thing would be denormalization/normalization based on your need, and of course clustered/non-clustered indexing of the data. Again, this will depend on what sort of analysis queries you will be running.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For doing grouping, summing, etc, the question is about SELECT. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "Summary tables" to make the selects really fast for your typical queries.
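As a hedged illustration of the raw-plus-summary idea (the column names are invented, since the real fields weren't given):

-- Raw rows go into one table...
CREATE TABLE raw_records (
    created_at DATETIME      NOT NULL,
    account_id INT UNSIGNED  NOT NULL,
    amount     DECIMAL(12,2) NOT NULL,
    INDEX idx_created (created_at)
) ENGINE=InnoDB;

-- ...and a summary table pre-aggregates them per day and account
CREATE TABLE daily_summary (
    the_day    DATE          NOT NULL,
    account_id INT UNSIGNED  NOT NULL,
    total      DECIMAL(14,2) NOT NULL,
    row_count  INT UNSIGNED  NOT NULL,
    PRIMARY KEY (the_day, account_id)
) ENGINE=InnoDB;

-- Refresh after each load; analysis queries then read daily_summary, not raw_records
INSERT INTO daily_summary (the_day, account_id, total, row_count)
SELECT DATE(created_at), account_id, SUM(amount), COUNT(*)
FROM raw_records
WHERE created_at >= CURDATE()
GROUP BY DATE(created_at), account_id
ON DUPLICATE KEY UPDATE total = VALUES(total), row_count = VALUES(row_count);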
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.

Long many-to-many database table: best performance practice

I have a question about the performance of my MySQL database design.
Table A has a lot of records, say a million, and table B also has a million. There is another table C in which every record id of A is connected to every row in B, and this connection has an additional value of 1 or 0. So, functionally speaking, every record in A has a boolean vector, where B contains the 'variables' of the vector and 1 or 0 is the value.
Table C will have a lot of write and read actions (select all values for a record of A), so the table is very actively used. And table C is really long: a million times a million rows.
My first question is: will the length of the table cause a performance issue? The database needs to be really fast.
My second question is, if this is badly designed, whether there is a better design to achieve what I want. For instance, I could store the entire B vector of each A record inside each row of A; then table C would not be necessary, but it would make selecting, reading, and writing much more difficult.
The table design is fine and shouldn't be a problem, because you access records via IDs which should be indexed. Depending on your typical queries you should also consider adding composite indexes (c(a_id,b_id), c(a_id,value), c(b_id,value), c(a_id,b_id,value)).
However, as there are only two states, 0 and 1, you may decide to store only one of them. That is, if you store only the state-1 records, all pairs not in the table implicitly have state 0. This pays off especially when the states are unevenly distributed (say 90% of the records have state 0 and only 10% have state 1) or when you usually access only one of the states (e.g. you always look for 1s).
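A minimal sketch of that sparse layout, assuming integer ids: only pairs with value 1 are stored, and a missing row is read as 0.

CREATE TABLE c (
    a_id INT UNSIGNED NOT NULL,
    b_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (a_id, b_id),    -- the row's existence means "value = 1"
    INDEX idx_b (b_id)
) ENGINE=InnoDB;

-- All B ids set to 1 for one A record:
SELECT b_id FROM c WHERE a_id = 123;

-- A single pair: an empty result means the value is 0
SELECT 1 FROM c WHERE a_id = 123 AND b_id = 456 LIMIT 1;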
Answer to your first question
Millions of records in a table with many reads and writes won't be a bottleneck if you follow MySQL best practices:
Your engine should be InnoDB.
Your SELECT queries should not involve a full table scan (see the EXPLAIN sketch below).
Your table should have the indexes those queries need.
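A quick way to check those last two points (a sketch; c here is the full a_id/b_id/value table from the question, and the index name is mine):

-- "type: ALL" in the output means a full table scan; const/ref/range mean an index is used
EXPLAIN SELECT value FROM c WHERE a_id = 123;

-- If the plan shows a full scan, add the index the WHERE clause needs:
ALTER TABLE c ADD INDEX idx_a_value (a_id, value);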
Answer to your second question
You should look at all your possible use cases; either way can be a good idea if a use case supports it.
If you split your data across multiple tables, a join has to be performed whenever you need the combined data.

What works faster: a "longer table with fewer columns" or a "shorter table with more columns"?

I have to decide how to design a table that will be used to store dates.
I have about 20 different dates for each user, and I guess 100,000 users right now, and growing.
So the question is: what will work faster for a SELECT query? If I make a table with 20 fields, e.g.
"user_dates"
userId, date_registered, date_paid, date_started_working, ..., date_reported, date_fired (20 fields total, with 100,000 records in the table)
or make 2 tables like this:
a first table "date_types" with 3 fields and 20 records for the column names above,
id, date_type_id, date_type_name
1 5 date_reported
2 3 date_registered
...
and a second table with 3 fields for the actual records:
"user_dates"
userId, date_type, date
201 2 2012-01-28
202 5 2012-06-14
...
but then with 2,000,000 records?
I think the second option is more universal: if I need to add more dates I can do it from the front end just by adding a record to the "date_types" table and then using it in "user_dates". However, I am now worried about performance with 2 million records in that table.
So which option do you think will work faster?
A longer table will have a larger index. A wider table will have a smaller index but take more physical space and probably have more overhead. You should carefully examine your schema to see if normalization is complete.
I would, however, go with your second option. This is because you don't necessarily need the fields to exist if they are empty. So if the user hasn't been fired, there is no need to create a record for them.
If the dates are pretty concrete and the users will have all (or most) of the dates filled in, then I would go with the wide table because it's easier to actually write the queries to get the data. Writing a query that asks for all the users that have date1 in a range and date2 in a range is much more difficult with a vertical table.
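For example, "date_registered in one range and date_reported in another" looks roughly like this in each layout (a sketch; the two blocks assume the wide and the vertical schema respectively, and the type ids 3 and 5 come from the question's date_types example):

-- Wide table: one row per user, a plain WHERE
SELECT userId
FROM user_dates
WHERE date_registered BETWEEN '2012-01-01' AND '2012-06-30'
  AND date_reported   BETWEEN '2012-07-01' AND '2012-12-31';

-- Vertical table: the same question needs a self-join (or GROUP BY ... HAVING)
SELECT d1.userId
FROM user_dates AS d1
JOIN user_dates AS d2 ON d2.userId = d1.userId
WHERE d1.date_type = 3 AND d1.date BETWEEN '2012-01-01' AND '2012-06-30'
  AND d2.date_type = 5 AND d2.date BETWEEN '2012-07-01' AND '2012-12-31';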
I would only go with the longer table if you know you need the option to create date types on the fly.
The best way to determine this is through testing. Generally, the sizes of data you are talking about (20 date columns by 100K records) are really pretty small for MySQL tables, so I would probably just use one table with multiple columns, unless you think you will be adding new types of date fields all the time and want a more flexible schema. You just need to make sure you index all the fields that will be used for filtering, ordering, joining, etc. in queries.
The design may also be informed by what type of queries you want to perform against the data. If for example you expect that you might want to query data based on a combination of fields (i.e. user has some certain date, but not another date), the querying will likely be much more optimal on the single table, as you would be able to use a simple SELECT ... WHERE query. With the separate tables, you might find yourself needing to do subselects, or odd join conditions, or HAVING clauses to perform the same kind of query.
As long as the user ID and the date-type ID are indexed on the main tables and the user_dates table, I doubt you will notice a problem when querying. If you were to query the entire table in either case, I'm sure it would take a pretty long time (mostly to send the data, though). A single user lookup will be instantaneous in either case.
Don't sacrifice the relation for some possible efficiency improvement; it's not worth it.
Usually I go both ways: put the basic and most often used attributes into one table, and make an additional-attributes table for rarely used attributes, which can then be fetched lazily from the application layer. This way you are not doing JOINs every time you fetch a user.

What are some optimization techniques for MySQL table with 300+ million records?

I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is: is there any way to optimize the table or query so these queries can be performed in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this internally from version 5.1.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
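A sketch of the partitioning and summary-table points for this kind of time-series data (all names and the daily partition ranges are assumptions):

-- Raw JMX samples, range-partitioned by day
CREATE TABLE jmx_stats (
    ts           DATETIME     NOT NULL,
    server_id    INT UNSIGNED NOT NULL,
    heap_used    BIGINT UNSIGNED,
    thread_count INT UNSIGNED,
    PRIMARY KEY (server_id, ts)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(ts)) (
    PARTITION p20120101 VALUES LESS THAN (TO_DAYS('2012-01-02')),
    PARTITION p20120102 VALUES LESS THAN (TO_DAYS('2012-01-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Hourly summary table that the graphs read instead of the raw data
CREATE TABLE jmx_stats_hourly (
    hour_start  DATETIME     NOT NULL,
    server_id   INT UNSIGNED NOT NULL,
    avg_heap    BIGINT UNSIGNED,
    max_threads INT UNSIGNED,
    PRIMARY KEY (server_id, hour_start)
) ENGINE=InnoDB;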
3 suggestions:
index
index
index
P.S. For timestamps you may run into performance issues; depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers (seconds since 1970 or whatever).
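Whether this actually helps depends on the MySQL version and engine, but the integer variant would look something like this (table and column names are mine):

-- Store the moment as seconds since the epoch...
CREATE TABLE samples (
    ts  INT UNSIGNED NOT NULL,   -- value of UNIX_TIMESTAMP()
    val INT NOT NULL,
    INDEX idx_ts (ts)
) ENGINE=InnoDB;

INSERT INTO samples (ts, val) VALUES (UNIX_TIMESTAMP('2012-06-14 10:00:00'), 1);

-- ...and convert back to a readable date only when displaying
SELECT FROM_UNIXTIME(ts) AS sampled_at, val
FROM samples
WHERE ts >= UNIX_TIMESTAMP('2012-06-01');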
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features, but be warned that they contain a lot of bugs.
First, you should use indexes.
If this is not enough, you can try splitting the tables using partitioning.
If that also won't work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate it: for instance, pre-compute totals by hour, by user, or by week (whatever, you get the idea) and store them in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, good for you!
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days?
Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete; partitioning is great for that (just drop the old partition). It also groups all records from the same time period close together on disk, so it's a lot more cache-efficient.
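With a range-partitioned table like the jmx_stats sketch above, purging a day of old data becomes a near-instant metadata operation instead of a huge DELETE:

-- Removes every row in that partition without touching rows one by one
ALTER TABLE jmx_stats DROP PARTITION p20120101;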
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?
That depends on the table and the queries; I can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash aggregates, or anything else particularly useful; basically the only thing it can do is a nested-loop index scan, which is good on a cached table and absolutely atrocious in other cases if some random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL   Postgres
58 s    100 s     INSERT
75 s    51 s      CREATE INDEX on (category, id) (useless)
9.3 s   5 s       SELECT category, SUM(counter) FROM t GROUP BY category;
1.7 s   0.5 s     SELECT category, SUM(counter) FROM t WHERE id > 15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama'          -- wrong
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1  -- right
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows from your results, there are much better ways of doing it. Granted, it takes additional code, but you will prevent a bottleneck that gets rapidly worse as your data grows. The problem is that MySQL has to perform the RAND() operation (which takes processing power) for every single row in the table before sorting them and giving you just 1 row.
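One common alternative is to pick a random id first and then fetch a single row by primary key. A sketch, assuming the user table has an auto-increment id column without large gaps (gaps skew the distribution slightly):

-- Pick a random point in the id range, then fetch the first row at or above it by primary key
SELECT u.*
FROM user AS u
JOIN (SELECT FLOOR(1 + RAND() * (SELECT MAX(id) FROM user)) AS rand_id) AS r
  ON u.id >= r.rand_id
ORDER BY u.id
LIMIT 1;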
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
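For instance (a sketch):

-- Stored internally as a small integer, but read and written as a string
CREATE TABLE orders (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    status ENUM('pending', 'paid', 'shipped', 'cancelled') NOT NULL DEFAULT 'pending'
) ENGINE=InnoDB;

SELECT id FROM orders WHERE status = 'shipped';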
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for the reverse. There are also similar functions in PHP called ip2long() and long2ip().
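For example (IPv4 only; the table and column names are mine):

CREATE TABLE visits (
    ip         INT UNSIGNED NOT NULL,   -- 4 bytes instead of a VARCHAR(15)
    visited_at DATETIME     NOT NULL
) ENGINE=InnoDB;

INSERT INTO visits (ip, visited_at) VALUES (INET_ATON('192.168.0.1'), NOW());

SELECT INET_NTOA(ip) AS ip_text, visited_at FROM visits;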