How to write a fast counting query for a large table? - ms-access

I have two tables, Table1 with 100,000 rows and Table2 with 400,000 rows. Both tables have a field called Email. I need to insert a new field into Table1 which will indicate the number of times the Email from each row in Table1 appears in Table2.
I wrote a binary count function for Excel which performs this in a few seconds on this data sample. Is it possible to perform it this fast in Access?
Thank you.

Does this query express what you want to find from Table2?
SELECT Email, Count(*) AS number_matches
FROM Table2
GROUP BY Email;
If that is what you want, I don't understand why you would store number_matches in another table. Just use this query wherever/whenever you need number_matches.
You should have an index on Email for Table2.
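If the goal is simply to see the per-row count next to Table1 rather than store it, here is a rough sketch (the index name idx_email and the alias m are arbitrary; Nz() is available when the query runs inside Access, and depending on your Access version you may need to save the GROUP BY query separately and join to the saved query instead of an inline subquery):
CREATE INDEX idx_email ON Table2 (Email);

SELECT t1.Email, Nz(m.number_matches, 0) AS number_matches
FROM Table1 AS t1
LEFT JOIN (
    SELECT Email, Count(*) AS number_matches
    FROM Table2
    GROUP BY Email
) AS m
ON t1.Email = m.Email;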
Update: I offer this example to illustrate how fast Count() with GROUP BY can be for an indexed field.
SELECT really_big_table.just_some_text, Count(*) AS CountOfMatches
FROM really_big_table
GROUP BY really_big_table.just_some_text;
really_big_table contains 10,776,000 rows. That size is way beyond what you would ordinarily expect to find in a real-world Access (Jet/ACE) database. I keep it around for extreme stress testing of different database operations.
The field, just_some_text, is indexed. With that index, the query completes in well under a minute. I didn't bother to time it more precisely because I was only interested in a rough comparison with the several minutes the OP's similar query took on a table with less than 5% as many rows as mine.
I don't understand why the OP's query is so much slower by comparison. My intention here is to warn other readers not to dismiss this method. In my experience, the speed of operations like this ranges from acceptable to blazingly fast ... as long as the database engine has an appropriate index to work with. At least give it a try before you resort to copying values redundantly between tables.

Related

How to (efficiently) get the start, end, and count of timeseries data from all SQL tables?

I have a massive number of SQL tables (50,000+), each with 100,000+ time series data points. I'm just looking for the most efficient way to get the start, end, and count of each table.
I've tried the following in a loop, but it's very slow; I time out when I try to query just 500 tables. Is there any way to improve this?
SELECT
min(timestamp) as start,
max(timestamp) as end,
count(value) as count
FROM
table_NAME
Edit: To provide some context: the data comes from a large number of sensors on engineering equipment. Each sensor has its own stream of data, including its own collection interval.
The type of SQL database depends on the building; there will be a few different types.
As for what the data will be used for: I need to know which trends are current and how old they are. If they are not current, I need to fix them. If there are very few data points, I need to check the configuration of the data collection.
(Note: The following applies to MySQL.)
Auto-generate query
Use information_schema.TABLES to list all the tables and generate the SELECT statements. Then copy/paste to run them.
Or write a Stored Procedure to do the above, including the execution. It might be better to have the SP build a giant UNION ALL to find all the results as one "table".
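As a rough sketch of that auto-generation step (assuming MySQL, that every table has a timestamp column, and that your_database is a placeholder for the schema holding them), you can have information_schema build the statements for you, then paste the output back into the client after dropping the trailing UNION ALL from the last generated line:
SELECT CONCAT(
    'SELECT ''', TABLE_NAME, ''' AS table_name,',
    ' MIN(timestamp) AS `start`, MAX(timestamp) AS `end`, COUNT(*) AS `cnt`',
    ' FROM `', TABLE_NAME, '` UNION ALL'
) AS stmt
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database';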
min/max
As already mentioned, if you don't have an index on timestamp, it will have to read all 5 billion rows -- which is a lot slower than fetching just the first and last values from 50K indexes.
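A minimal sketch of adding such an index to one of the tables (the index name is arbitrary, and this would have to be repeated, or generated, for each table):
ALTER TABLE table_NAME ADD INDEX idx_timestamp (timestamp);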
COUNT
Use COUNT(*) instead of COUNT(value) -- The latter goes to the extra effort of checking value for NOT NULL.
The COUNT(*) will need to read an entire index. That is, even if you do have INDEX(timestamp), the COUNT will be the slow part. Consider the following: don't do the COUNT; instead, do SHOW TABLE STATUS; it will find estimates of the number of rows for every table in the current database. That will be much faster.
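A sketch of the same idea via information_schema (MySQL; TABLE_ROWS is an estimate for InnoDB, not an exact count, and your_database is again a placeholder):
SELECT TABLE_NAME, TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database';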

Does it make sense to have a column dependent on the total number of rows in another table to improve speed?

I have an application that requires the total number of interactions.
I have two tables. TABLE1 has a total column. TABLE2 holds all of the interactions, which update frequently, and has a column like didInteract. It is a 1:M relationship between TABLE1 and TABLE2.
Because my application uses the total number of interactions from TABLE2 where didInteract is true, I added the column total to TABLE1 so I wouldn't have to query all the rows that match my criteria, which could be costly. Therefore, when a user interacts, the application performs two operations in the database: first, it creates a new interaction in TABLE2 if one does not already exist, and then it increments the total in TABLE1.
Does this logic make sense, or should I query TABLE2 to get the total (even if it may take a little longer) and remove the total column from TABLE1? I'm not sure whether this passes 2NF, although to me it sounds like an exception.
Technically what you describe is denormalization. There's a risk of data anomalies with any denormalization. For example, if you add an interaction but forget to increment TABLE1.total, then the total will be inaccurate. But clients who query that total won't know that it's inaccurate unless they double-check by querying the aggregate count from TABLE2. If that double-check is necessary to be sure, then there's no point in storing the total.
There are legitimate cases where denormalization is helpful. If you can depend on it being accurate, or if you don't care if it's inaccurate from time to time, and you periodically re-initialize the total from the count of TABLE2, then it could be good enough.
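A minimal sketch of that periodic re-initialization, assuming hypothetical column names (id on TABLE1; table1_id and didInteract on TABLE2):
UPDATE TABLE1
SET total = (
    SELECT COUNT(*)
    FROM TABLE2
    WHERE TABLE2.table1_id = TABLE1.id
      AND TABLE2.didInteract = 1
);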
There's also the question of whether the slowness of querying TABLE2 directly is actually important. It's true that it's slower than querying the total as a precalculated count, but is the difference great enough that it makes your application fail its SLA?
These are tradeoffs, and which strategy is right for your app is up to you.

Performance: why is JOIN faster than IN?

I tried to optimize some PHP code that performs a lot of queries on different data tables.
The logic was to fetch some fields from each table by neighborhood id(s), depending on whether the request was for a whole city (many neighborhood ids) or a specific neighborhood.
For example, assume that I have 10 tables of this format:
neighborhood_id | some_data_field
The queries were something like this:
SELECT `some_data_field`
FROM `table_name` AS `data_table`
LEFT JOIN `neighborhoods_table` AS `neighborhoods` ON `data_table`.`neighborhood_id` = `neighborhoods`.`neighborhood_id`
WHERE `neighborhoods`.`city_code` = SOME_ID
Because there were about 10 queries like that, I tried to optimize the code by removing the join from each of them and running one query against the neighborhoods table to get all the neighborhood ids.
Then, in each query, I used WHERE ... IN on those neighborhood ids.
The expected result was better performance, but it turned out not to be.
When I send a request to my server, the first query takes 20ms, the second takes longer, the third longer still, and so on (the second and third take something like 200ms). With JOIN, the first query takes 40ms but the rest take 20ms-30ms.
The first query in each request suggests that WHERE ... IN is faster, but I assume MySQL has some cache that helps when dealing with JOINs.
So I wanted to know: how can I improve my WHERE ... IN queries?
EDIT
I read the answer and comments and realized I didn't explain well why I have 10 tables: each table is categorized by a different property.
For example, one table contains values by floor, one by rooms, and one by date,
so it isn't possible to union all the tables into one table.
Second Edit
I'm still being misunderstood.
I don't have only one data column per table; every table has its own number of fields (it can be 5 fields for one table and 3 for another) and its own data types or formatting, such as dates or money presentation. Additionally, I perform some calculations on those fields in my queries; sometimes it is an AVG or a weighted average, and in some tables it is only a pure SELECT.
I also GROUP BY different fields: in one table it can be by rooms, and in another by floor.
For example, assume that I have 10 tables of this format:
This is the basis of your problem. Don't store the same information in multiple tables. Store the results in a single table and let MySQL optimize the query.
If the original table had "information" -- say the month the data was generated -- then you may need to include this as an additional column.
Once the data is in a single table, you can use indexes and partitioning to speed the queries.
Note that storing the data in a single table may require changes to your ingestion processes -- namely, inserting the data rather than creating a new table. But your queries will be simpler and you can optimize the database.
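A rough sketch of what that single table could look like, using hypothetical names (neighborhood_data and a category column standing in for the 10 per-property tables):
CREATE TABLE neighborhood_data (
    neighborhood_id INT NOT NULL,
    category VARCHAR(20) NOT NULL,     -- e.g. 'floor', 'rooms', 'date'
    some_data_field DECIMAL(10,2),
    INDEX idx_nb_cat (neighborhood_id, category)
);

SELECT d.some_data_field
FROM neighborhood_data AS d
JOIN neighborhoods_table AS n ON d.neighborhood_id = n.neighborhood_id
WHERE n.city_code = SOME_ID
  AND d.category = 'floor';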
As for which is faster, an IN or a JOIN: both do similar things under the hood. In some circumstances one or the other is faster, but both should make use of indexes and partitions if they are available.

MySQL Performance

We have a data warehouse with denormalized tables ranging from 500K to 6+ million rows. I am developing a reporting solution, so we are utilizing database paging for performance reasons. Our reports have search criteria and we have created the necessary indexes; however, performance is poor when dealing with the multi-million-row tables. The client is set on always knowing the total record count, so I have to fetch the data as well as the record count.
Are there any other things I can do to help with performance? I'm not the MySQL dba and he has not really offered anything up, so I'm not sure what he can do configuration wise.
Thanks!
You should use "Partitioning".
Its main goal is to reduce the amount of data read for particular SQL operations so that overall response time is reduced.
Refer:
http://dev.mysql.com/tech-resources/articles/performance-partitioning.html
If you partition the large tables and store the parts on different servers, then your queries will run faster.
see: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Also note that using NDB tables you can use HASH keys that get looked up in O(1) time.
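As a sketch only (the table and column names are made up, and note that in MySQL every unique key on a partitioned table must include the partitioning column), range partitioning by year looks like this:
ALTER TABLE report_data
PARTITION BY RANGE (YEAR(report_date)) (
    PARTITION p2009 VALUES LESS THAN (2010),
    PARTITION p2010 VALUES LESS THAN (2011),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);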
For the row count you can keep a running total in a separate table and update that, for example in an AFTER INSERT and an AFTER DELETE trigger.
Although the triggers will slow down deletes/inserts, this cost is spread over time. Note that you don't have to keep all totals in one row; you can store totals per condition. Something like:
table     field     condition    row_count
-------------------------------------------
table1    field1    cond_x       10
table1    field1    cond_y       20
select sum(row_count) as count_cond_xy
from totals
where field = 'field1' and `table` = 'table1'
  and `condition` like 'cond_%';
-- just a silly example; you can come up with more efficient code,
-- but I hope you get the gist of it.
If you find yourself always counting along the same conditions, this can speed up your redesigned SELECT COUNT(x) FROM bigtable WHERE ... from minutes to nearly instant.
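A sketch of the trigger side, assuming the big table has a column (called `condition` here, backticked because CONDITION is a reserved word in MySQL) that determines which bucket of the totals table a row belongs to:
CREATE TRIGGER bigtable_after_insert
AFTER INSERT ON bigtable
FOR EACH ROW
    UPDATE totals
    SET row_count = row_count + 1
    WHERE `table` = 'bigtable'
      AND field = 'field1'
      AND `condition` = NEW.`condition`;

CREATE TRIGGER bigtable_after_delete
AFTER DELETE ON bigtable
FOR EACH ROW
    UPDATE totals
    SET row_count = row_count - 1
    WHERE `table` = 'bigtable'
      AND field = 'field1'
      AND `condition` = OLD.`condition`;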

MySQL SELECT efficiency

I am using PHP to interact with a MySQL database, and I was wondering if querying MySQL with a "SELECT * FROM..." is more or less efficient than a "SELECT id FROM...".
Less efficient.
If you think about it, SQL is going to send all the data for each row you select.
Imagine that you have a column called MassiveTextBlock - this column will be included when you SELECT * and so SQL will have to send all of that data over when you may not actually require it. In contrast, SELECT id is just grabbing a collection of numbers.
It is less efficient because you are fetching a lot more information than just SELECT id. Additionally, the second query is much more likely to be served using just an index.
It would depend on why you are selecting the data. If you are writing a raw database interface like phpMySQL, then it may make sense.
If you are doing multiple queries on the same table with the same conditions and concatenation operations, then a SELECT col1, col2 FROM table may make more sense to do than using two independent SELECT col1 FROM table and a SELECT col2 FROM table to get different portions of the same data, as this performs both operations in the same query.
But, in general, you should only select the columns you need from the table in order to minimize unnecessary data from being dredged up by the DBMS. The benefits of this increase greatly if your database is on a different server from the client server, or if your database server is old and/or slow.
There is almost no condition in which a SELECT * is unavoidable; if you find yourself in one, your data model probably has some serious design flaws.
It depends on your indexes.
Selecting fewer columns can sometimes save a lot of time, if you select only columns that exist in the index that MySQL has used to fetch the results.
For example, if you have an index on the column id, and you perform this query:
SELECT id FROM mytable WHERE id>5
Then MySQL only needs to read the index, and does not even need to read the table row. If, on the other hand, you select additional columns, as in:
SELECT id, name FROM mytable WHERE id>5
Then MySQL will need to read the id from the index, then go to the table row to read the name column.
If, however, you are already reading columns that aren't in the index MySQL is using, then reading more columns really won't make nearly as much difference, though it will make a small one. If some columns contain a large amount of data, such as large TEXT or BLOB columns, then it would be wise to exclude them if you don't need them.
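As a sketch (the index name is arbitrary), a composite index covering both columns would let the second query above be served from the index alone as well:
CREATE INDEX idx_id_name ON mytable (id, name);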