Performance: why is JOIN faster than IN? - mysql

I tried to optimize some PHP code that performs a lot of queries on different data tables.
The logic was to select some fields from each table by neighborhood id(s), depending on whether the request was for a whole city (many neighborhood ids) or a specific neighborhood.
For example, assume that I have 10 tables of this format:
neighborhood_id | some_data_field
The queries were something like that:
SELECT `some_data_field`
FROM `table_name` AS `data_table`
LEFT JOIN `neighborhoods_table` AS `neighborhoods` ON `data_table`.`neighborhood_id` = `neighborhoods`.`neighborhood_id`
WHERE `neighborhoods`.`city_code` = SOME_ID
Because there were about 10 queries like that, I tried to optimize the code by removing the join from all of them and running one query against the neighborhoods table to fetch all the neighborhood ids.
Then in each query I used WHERE ... IN on those neighborhood ids.
I expected better performance, but it turned out not to be better.
When I send a request to my server, the first query takes 20ms, the second takes more, the third more still, and so on (the second and third take something like 200ms). With JOIN the first query takes 40ms, but the rest take 20ms-30ms.
The first query in each request suggests that WHERE ... IN is faster, but I assume MySQL has some cache when dealing with JOINs.
So I wanted to know: how can I improve my WHERE ... IN queries?
EDIT
I read the answer and comments and realized I didn't explain well why I have 10 tables: each table is categorized by a different property.
For example, one table contains values by floor, one by rooms, and one by date,
so it isn't possible to union all the tables into one.
Second Edit
I'm still being misunderstood.
I don't have only one data column per table; every table has its own set of fields. It can be 5 fields for one table and 3 for another, with different data types or formatting (it can be a date or a money presentation). Additionally, I perform some calculations on those fields in my queries; sometimes it's an AVG or a weighted average, and in some tables it's only a pure SELECT.
I also GROUP BY different fields per table; in one table it can be by rooms and in another by floor.

For example, assume that I have 10 tables of this format:
This is the basis of your problem. Don't store the same information in multiple tables. Store the results in a single table and let MySQL optimize the query.
If the original table had "information" -- say the month the data was generated -- then you may need to include this as an additional column.
Once the data is in a single table, you can use indexes and partitioning to speed the queries.
Note that storing the data in a single table may require changes to your ingestion processes -- namely, inserting the data rather than creating a new table. But your queries will be simpler and you can optimize the database.
As for which is faster, an IN or a JOIN: both do similar things under the hood. In some circumstances one or the other is faster, but both should make use of indexes and partitions if they are available.
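As a sketch of that equivalence, here is a small demo (SQLite rather than MySQL, with hypothetical table and column names modeled on the question) showing that the JOIN form and the IN-subquery form return the same rows once an index on the id column exists:

```python
import sqlite3

# Minimal sketch: same filter expressed as a JOIN and as an IN subquery.
# All names here are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE neighborhoods (neighborhood_id INTEGER PRIMARY KEY, city_code INTEGER)")
cur.execute("CREATE TABLE data_table (neighborhood_id INTEGER, some_data_field TEXT)")
cur.execute("CREATE INDEX idx_data_nb ON data_table (neighborhood_id)")
cur.executemany("INSERT INTO neighborhoods VALUES (?, ?)",
                [(1, 10), (2, 10), (3, 20)])
cur.executemany("INSERT INTO data_table VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])

# JOIN form
join_rows = cur.execute("""
    SELECT d.some_data_field
    FROM data_table AS d
    JOIN neighborhoods AS n ON d.neighborhood_id = n.neighborhood_id
    WHERE n.city_code = 10
    ORDER BY d.some_data_field""").fetchall()

# IN form (subquery instead of fetching the ids in the application first)
in_rows = cur.execute("""
    SELECT some_data_field
    FROM data_table
    WHERE neighborhood_id IN (SELECT neighborhood_id
                              FROM neighborhoods
                              WHERE city_code = 10)
    ORDER BY some_data_field""").fetchall()

print(join_rows == in_rows)  # True: both return [('a',), ('b',)]
```

The point of the sketch: keeping the filter inside the query (either form) lets the optimizer use the index, whereas building a long literal IN list in PHP forces the server to parse and evaluate it fresh on every query.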

Related

How to (efficiently) get the start, end, and count of timeseries data from all SQL tables?

I have a massive number of SQL tables (50,000+), each with 100,000+ time-series data points. I'm just looking for the most efficient way to get the start, end, and count for each table.
I've tried the following in a loop, but it's very slow; I time out when I try to query just 500 tables. Is there any way to improve this?
SELECT
    MIN(timestamp) AS `start`,
    MAX(timestamp) AS `end`,
    COUNT(value) AS `count`
FROM
    table_NAME
Edit: To provide some context. Data is coming from a large number of sensors for engineering equipment. Each sensor has its own stream of data, including collection interval.
The type of SQL database is dependent on the building, there will be a few different types.
As for what the data will be used for, I need to know which trends are current and how old they are. If they are not current, I need to fix them. If there are very few data points, I need to check configuration of data collection.
(Note: The following applies to MySQL.)
Auto-generate query
Use information_schema.TABLES to list all the tables and generate the SELECT statements. Then copy/paste to run them.
Or write a Stored Procedure to do the above, including the execution. It might be better to have the SP build a giant UNION ALL to find all the results as one "table".
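A rough sketch of that generate-then-UNION-ALL pattern, using SQLite's catalog (sqlite_master) in place of information_schema.TABLES; table names and data are invented. Note the alias `end_ts`: in MySQL, `end` is a reserved word and would need backticks.

```python
import sqlite3

# Build one UNION ALL query covering every table in the catalog,
# instead of looping one query per table from the application.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for name in ("table_a", "table_b"):
    cur.execute(f"CREATE TABLE {name} (timestamp INTEGER, value REAL)")
cur.executemany("INSERT INTO table_a VALUES (?, ?)", [(1, 1.0), (5, 2.0)])
cur.executemany("INSERT INTO table_b VALUES (?, ?)", [(2, 3.0), (9, 4.0), (11, 5.0)])

# Read the table list from the catalog (MySQL: information_schema.TABLES)
tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Generate one SELECT per table and glue them into a single statement
union_sql = "\nUNION ALL\n".join(
    f"SELECT '{t}' AS tbl, MIN(timestamp) AS start_ts, MAX(timestamp) AS end_ts, "
    f"COUNT(*) AS n FROM {t}"
    for t in tables)
results = cur.execute(union_sql).fetchall()
print(results)  # [('table_a', 1, 5, 2), ('table_b', 2, 11, 3)]
```

In production you would not interpolate table names from untrusted input; here they come straight from the catalog.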
min/max
As already mentioned, if you don't have an index on timestamp, it will have to read all 5 billion rows -- which is a lot slower than fetching just the first and last of values from 50K indexes.
COUNT
Use COUNT(*) instead of COUNT(value) -- the latter goes to the extra effort of checking value for NOT NULL.
COUNT(*) will still need to read an entire index. That is, even if you do have INDEX(timestamp), the COUNT will be the slow part. Consider the following: don't do the COUNT; instead, run SHOW TABLE STATUS; it will find estimates of the number of rows for every table in the current database. That will be much faster.
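The COUNT(*) vs COUNT(value) distinction is easy to see in a tiny demo (SQLite here; MySQL agrees on these semantics): COUNT(value) skips NULLs, which is both extra work and a different answer when the column is nullable.

```python
import sqlite3

# COUNT(*) counts rows; COUNT(value) counts non-NULL values only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (timestamp INTEGER, value REAL)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1.5), (2, None), (3, 2.5)])

count_star = cur.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_value = cur.execute("SELECT COUNT(value) FROM t").fetchone()[0]
print(count_star, count_value)  # 3 2
```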

MYSQL Query Optimization- Joins/Unions vs. table query

This is somewhat of a conceptual question. In terms of query optimization and speed, I am wondering which route would have the best performance and be the fastest. Suppose I am using JFreeChart (this bit is somewhat irrelevant). The entire idea of using JFreeChart with a MySQL database is to query for two values, an X and a Y. Suppose the database is full of many different tables and, usually, the X and the Y come from two different tables. Would it be faster, in the query for the chart, to use joins and unions to get the two values... or... first create a table with the joined/unioned values, and then run queries on this new table (no joins or unions needed)? This would all be in one program, mind you. So, overall: joins and unions to get the X and Y values, or create a temporary table joining the values and then query the chart with those?
It would, of course, be faster to pre-join the data and select from a single table than to perform a join. This assumes that you're saving one lookup per row and are properly using indexes in the first place.
However, even though you can get performance improvements from denormalization in this manner, it's not commonly done. A few of the reasons why it's not common:
Redundant data takes up more space
With redundant data, you have to update both copies whenever something changes
JOINs are fast
JOIN performance on multiple rows can improve (they don't always require a lookup per row) with features such as the Batched Key Access join introduced in MySQL 5.6, but it only helps with some queries, so you have to tell MySQL which join type you want; it's not automatic.
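A minimal sketch of the pre-join idea from the answer, in SQLite with invented names: join the X and Y sources once into a materialized chart_data table, after which the chart query touches a single table.

```python
import sqlite3

# One-time denormalization: materialize the join, then query the result table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE xs (id INTEGER PRIMARY KEY, x REAL)")
cur.execute("CREATE TABLE ys (id INTEGER PRIMARY KEY, y REAL)")
cur.executemany("INSERT INTO xs VALUES (?, ?)", [(1, 0.1), (2, 0.2)])
cur.executemany("INSERT INTO ys VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# The denormalization step: run the join once, store the result
cur.execute("""
    CREATE TABLE chart_data AS
    SELECT xs.id, xs.x, ys.y
    FROM xs JOIN ys ON xs.id = ys.id""")

# Chart queries now read a single table, no join needed
rows = cur.execute("SELECT x, y FROM chart_data ORDER BY id").fetchall()
print(rows)  # [(0.1, 10.0), (0.2, 20.0)]
```

The trade-offs listed above still apply: chart_data is redundant storage and must be refreshed whenever xs or ys changes.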

Return number of related records for the results of a query

I have 2 related tables, Tapes & Titles related through the Fields TapeID
Tapes.TapeID & Titles.TapeID
I want to be able to query the Tapes Table on the Column Artist and then return the number of titles for each of the matching Artist records
My Query is as follows
SELECT Tapes.Artist,COUNT(Titles.TapeID)
FROM Tapes
INNER JOIN Titles on Titles.TapeID=Tapes.TapeID
GROUP BY Tapes.Artist
HAVING TAPES.Artist LIKE <ArtistName%>"
The query appears to run then seems to go into an indefinite loop
I get no syntax errors and no results
Please point out the error in my query
Here are two likely culprits for this poor performance. The first would be the lack of index on Tapes.TapeId. Based on the naming, I would expect this to be the primary key on the Tapes table. If there are no indexes, then you could get poor performance.
The second would involve the selectivity of the having clause. As written, MySQL is going to aggregate all the data for the group by and then filter out the groups. In many cases, this would not make much of a difference. But, if you have lots of data and the condition is selective (meaning few rows match), then moving it to a where clause would make a difference.
There are definitely other possibilities. For instance, the server could be processing other queries. An update query could be locking one of the tables. Or, the columns TapeId could have different types in the two tables.
You can modify your question to include the definitions of the two tables. Also, put EXPLAIN before the query and include the output in the question. This shows the execution plan chosen by MySQL.
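To illustrate the HAVING-to-WHERE suggestion, here is a small demo (SQLite, with data invented around the question's Tapes/Titles schema): both queries return the same groups, but the WHERE form filters rows before aggregation instead of aggregating everything and discarding groups afterwards.

```python
import sqlite3

# Same result, different evaluation order: HAVING filters groups after
# aggregation; WHERE filters rows before it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Tapes (TapeID INTEGER PRIMARY KEY, Artist TEXT)")
cur.execute("CREATE TABLE Titles (TitleID INTEGER PRIMARY KEY, TapeID INTEGER)")
cur.executemany("INSERT INTO Tapes VALUES (?, ?)",
                [(1, "Abba"), (2, "Abba Tribute"), (3, "Beatles")])
cur.executemany("INSERT INTO Titles (TapeID) VALUES (?)",
                [(1,), (1,), (2,), (3,)])

having_rows = cur.execute("""
    SELECT Tapes.Artist, COUNT(Titles.TapeID)
    FROM Tapes JOIN Titles ON Titles.TapeID = Tapes.TapeID
    GROUP BY Tapes.Artist
    HAVING Tapes.Artist LIKE 'Abba%'
    ORDER BY Tapes.Artist""").fetchall()

where_rows = cur.execute("""
    SELECT Tapes.Artist, COUNT(Titles.TapeID)
    FROM Tapes JOIN Titles ON Titles.TapeID = Tapes.TapeID
    WHERE Tapes.Artist LIKE 'Abba%'
    GROUP BY Tapes.Artist
    ORDER BY Tapes.Artist""").fetchall()

print(having_rows == where_rows)  # True: [('Abba', 2), ('Abba Tribute', 1)]
```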

What works faster "longer table with less columns" or "shorter table with more columns"?

I have to decide how to design a table that will be used to store dates.
I have about 20 different dates for each user, and roughly 100,000 users right now and growing.
So the question is: for SELECT queries, which will work faster? If I make a table with 20 fields, e.g.
"user_dates"
userId, date_registered, date_paid, date_started_working, ... date_reported, date_fired 20 total fields with 100 000 records in table
or make 2 tables it like
first table "date_types" with 3 fields and 20 records for above column names.
id, date_type_id, date_type_name
1 5 date_reported
2 3 date_registered
...
and second table with 3 fields actual records
"user_dates"
userId, date_type, date
201 2 2012-01-28
202 5 2012-06-14
...
but then with 2,000,000 records?
I think the second option is more flexible: if I need to add more dates, I can do it from the front end just by adding a record to the "date_types" table and then using it in "user_dates". However, I am now worried about performance with 2 million records in the table.
So which option do you think will work faster?
A longer table will have a larger index. A wider table will have a smaller index but take more physical space and probably have more overhead. You should carefully examine your schema to see whether normalization is complete.
I would, however, go with your second option. This is because you don't need to necessarily have the fields exist if they are empty. So if the user hasn't been fired, no need to create a record for them.
If the dates are pretty concrete and the users will have all (or most) of the dates filled in, then I would go with the wide table because it's easier to actually write the queries to get the data. Writing a query that asks for all the users that have date1 in a range and date2 in a range is much more difficult with a vertical table.
I would only go with the longer table if you know you need the option to create date types on the fly.
The best way to determine this is through testing. Generally the sizes of data you are talking about (20 date columns by 100K records) is really pretty small in regards to MySQL tables, so I would probably just use one table with multiple columns unless you think you will be adding new types of date fields all the time and desire a more flexible schema. You just need to make sure you index all the fields that will be used in for filtering, ordering, joining, etc. in queries.
The design may also be informed by what type of queries you want to perform against the data. If for example you expect that you might want to query data based on a combination of fields (i.e. user has some certain date, but not another date), the querying will likely be much more optimal on the single table, as you would be able to use a simple SELECT ... WHERE query. With the separate tables, you might find yourself needing to do subselects, or odd join conditions, or HAVING clauses to perform the same kind of query.
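A concrete sketch of that contrast (SQLite, with invented column names and dates): the wide table answers a two-date range question with a plain WHERE, while the tall (vertical) design needs a self-join to line up the two date types per user.

```python
import sqlite3

# Same question ("registered AND paid before March 2012") against both designs.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE user_dates_wide
               (userId INTEGER PRIMARY KEY, date_registered TEXT, date_paid TEXT)""")
cur.execute("""CREATE TABLE user_dates_tall
               (userId INTEGER, date_type TEXT, date TEXT)""")
cur.executemany("INSERT INTO user_dates_wide VALUES (?, ?, ?)",
                [(201, "2012-01-28", "2012-02-01"),
                 (202, "2012-06-14", "2012-07-01")])
cur.executemany("INSERT INTO user_dates_tall VALUES (?, ?, ?)",
                [(201, "registered", "2012-01-28"), (201, "paid", "2012-02-01"),
                 (202, "registered", "2012-06-14"), (202, "paid", "2012-07-01")])

# Wide design: one simple WHERE
wide = cur.execute("""
    SELECT userId FROM user_dates_wide
    WHERE date_registered < '2012-03-01' AND date_paid < '2012-03-01'""").fetchall()

# Tall design: self-join to combine two date types for the same user
tall = cur.execute("""
    SELECT a.userId
    FROM user_dates_tall AS a
    JOIN user_dates_tall AS b ON a.userId = b.userId
    WHERE a.date_type = 'registered' AND a.date < '2012-03-01'
      AND b.date_type = 'paid' AND b.date < '2012-03-01'""").fetchall()

print(wide == tall == [(201,)])  # True
```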
As long as the user ID and the date-type ID are indexed on the main tables and the user_dates table, I doubt you will notice a problem when querying. If you were to query the entire table in either case, I'm sure it would take a pretty long time (mostly to send the data, though). A single user lookup will be instantaneous in either case.
Don't sacrifice the relation for some possible efficiency improvement; it's not worth it.
Usually I go both ways: put the basic and most often used attributes into one table. Make an additional-attributes table for rarely used attributes, which can then be fetched lazily from the application layer. This way you are not doing JOINs every time you fetch a user.

single table vs multiple table for millions of records

Here's the scenario, the old database has this kind of design
dbo.Table1998
dbo.Table1999
dbo.Table2000
dbo.table2001
...
dbo.table2011
and I merged all the data from 1998 to 2011 into this table: dbo.TableAllYears
Now they're both indexed by "application number" and have the same number of columns (56 columns, actually).
now when i tried
select * from Table1998
and
select * from TableAllYears where Year=1998
the first query returns 139,669 rows in 13 seconds
while the second query returns the same number of rows in 30 seconds
So, am I just missing something, or are multiple tables better than a single table?
You should partition the table by year, this is almost equivalent to having different tables for each year. This way when you query by year it will query against a single partition and the performance will be better.
Try adding an index on each of the columns that you're searching on (in the WHERE clause). That should speed up querying dramatically.
So in this case, add a new index for the field Year.
I believe that you should use a single table. Inevitably, you'll need to query data across multiple years, and separating it into multiple tables is a problem. It's quite possible to optimize your query and your table structure such that you can have many millions of rows in a table and still have excellent performance. Be sure your year column is indexed, and included in your queries. If you really hit data size limitations, you can use partitioning functionality in MySQL 5 that allows it to store the table data in multiple files, as if it were multiple tables, while making it appear to be one table.
Regardless of that, 140k rows is nothing, and it's likely premature optimization to split it into multiple tables, and even a major performance detriment if you need to query data across multiple years.
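To sketch the single-table approach with an indexed year column (SQLite here; the payload column and data are invented), the query plan confirms that WHERE Year = 1998 uses the index rather than scanning every year's rows:

```python
import sqlite3

# Single table for all years, with an index on Year so per-year queries
# don't scan the whole table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE TableAllYears
               (application_number INTEGER, Year INTEGER, payload TEXT)""")
cur.execute("CREATE INDEX idx_year ON TableAllYears (Year)")
cur.executemany("INSERT INTO TableAllYears VALUES (?, ?, ?)",
                [(1, 1998, "a"), (2, 1999, "b"), (3, 1998, "c")])

rows = cur.execute(
    "SELECT application_number FROM TableAllYears WHERE Year = 1998 ORDER BY 1"
).fetchall()
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM TableAllYears WHERE Year = 1998"
).fetchall()

print(rows)         # [(1,), (3,)]
print(plan[0][-1])  # plan detail names idx_year: an index search, not a scan
```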
If you're looking for data from 1998, then having only 1998 data in one table is the way to go. This is because the database doesn't have to "search" for the records but knows that all of the records in this table are from 1998. Try adding the "WHERE Year=1998" clause to the query against Table1998 and you should get a fairer comparison.
Personally, I would keep the data in multiple tables, especially if it is a particularly large data set and you don't have to do queries on the old data frequently. Even if you do, you might want to look at creating a view with all of the table data and running the reports on that instead of having to query several tables.