Fast way to check for duplicates in a large SQL table - MySQL

I have a large table with more than 200,000 rows, and before inserting a new row I only need to check the last few thousand rows for duplicates (not all of them). Currently I'm running this query for each row I want to add:
SELECT ID from table where date='' and time=''
Based on the result of that query, I write the row only if the result is empty.
The issue with doing this is that it takes a very long time, and as the database grows it only takes longer.
I tried using LIMIT and OFFSET by saying SELECT ID from table where date='' and time='' limit 200000,18446744073709551615, which I thought would only search through the rows after the first 200,000 to the end of the table; however, running this query doesn't seem to be any faster.
My question is this: Is there a more efficient way to "skip ahead" in the database and only search a portion of the rows instead of all of the rows?

You should be using INSERT IGNORE together with a UNIQUE constraint on the table, based on the columns that should be unique.
With INSERT IGNORE, MySQL will automatically detect whether the row would duplicate an existing one and silently skip the insert if so. See this question for more information.
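A minimal sketch of that approach, assuming the table is called readings and the (date, time) pair is what must be unique (the table and column names here are placeholders, not from the question):
ALTER TABLE readings ADD UNIQUE INDEX uniq_date_time (date, time);
-- a row duplicating an existing (date, time) pair is silently skipped
INSERT IGNORE INTO readings (date, time, value) VALUES ('2013-05-01', '12:00:00', 42);
If a row with the same date and time already exists, INSERT IGNORE simply reports 0 rows affected instead of raising a duplicate-key error, and the unique index also serves as the index that makes the duplicate check fast.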
Additionally, searching a multi-million row table should be fast as long as you have the correct indexes on it. You should not need to search a subset of the data (without keys, the database is forced to do a full row scan, which could cause the delays you're talking about).
See this post for some additional ideas.
See also Avoiding Full Table Scans.

Related

Does it make sense to split a large table into smaller ones to reduce the number of rows (not columns)? [duplicate]

In a Rails app, I have a table that already has hundreds of millions of records. I'm planning to split it into multiple tables, which should speed up reads and writes.
I found the octopus gem, but it is for master/slave replication; I just want to split the big table.
Or what else can I do when the table is too big?
Theoretically, a properly designed table with just the right indexes will be able to handle very large tables quite easily. As the table grows, the slowdown in queries and insertion of new records is supposed to be negligible. But in practice we find that it doesn't always work that way! However, the solution definitely isn't to split the table in two. The solution is to partition.
Partitioning takes this notion a step further, by enabling you to distribute portions of individual tables across a file system according to rules which you can set largely as needed. In effect, different portions of a table are stored as separate tables in different locations. The user-selected rule by which the division of data is accomplished is known as a partitioning function, which in MySQL can be the modulus, simple matching against a set of ranges or value lists, an internal hashing function, or a linear hashing function.
If you merely split a table, your code is going to become infinitely more complicated: each time you do an insert or a retrieval you need to figure out which split you should run that query on. When you use partitions, MySQL takes care of that detail for you, and as far as the application is concerned it's still one table.
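A rough sketch of range partitioning (the table name, columns, and partitioning scheme here are illustrative assumptions, not taken from the question):
CREATE TABLE measurements (
  id BIGINT NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id, created_at)   -- every unique key must include the partitioning column
)
PARTITION BY RANGE (YEAR(created_at)) (
  PARTITION p2012 VALUES LESS THAN (2013),
  PARTITION p2013 VALUES LESS THAN (2014),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);
Queries and inserts still target measurements as a single table; MySQL routes each row to the right partition based on YEAR(created_at).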
Do you have an ID on each row? If the answer is yes, you could do something like:
CREATE TABLE table2 AS (SELECT * FROM table1 WHERE id >= (SELECT COUNT(*) FROM table1)/2);
The above statement creates a new table with half of the records from table1.
I don't know if you've already tried it, but an index should help with speed on a big table.
CREATE INDEX index_name ON table1 (id)
Note: if you created the table with a unique constraint or primary key, there is already an index.

Does a SQL query (SELECT) continue or stop reading data from the table once it finds the value?

Greetings,
My question: does a SQL query (SELECT) continue or stop reading data (records) from the table once it finds the value I was looking for?
Reference: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
Link for reference: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: in general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a LIMIT 1 clause to make sure that your result set has only one row. But then again, if you have an ORDER BY clause, the database engine cannot stop after the first hit, because later rows might take priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
How the RDBMS reads data from disk is something you cannot know, should not care about, and must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, and a block can contain records that are not needed by the query at hand. If all the columns needed by the query are available in an index, the RDBMS won't even read the data file; it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns while running the same query several times a couple of minutes apart.
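To illustrate the index-only case mentioned above (the table and column names here are made up): if an index covers every column the query touches, the engine can answer from the index alone.
CREATE INDEX idx_cat_title ON articles (category, title);
SELECT title FROM articles WHERE category = 'news';
-- EXPLAIN shows "Using index" in the Extra column for this query, meaning the data file was not read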
Yes, it scans the entire file, unless you put something like
select * from user where id=100 limit 1
This of course will still scan all the rows if id 100 is the last record.
If id is a primary key it will automatically be indexed, and the search will be optimized.
I'm sorry... I thought the table.
I will revise the question and explain it with the following image.
I understand that in CASE 1 all columns must be read in each iteration.
My question is: is it the same in CASE 2, or are the columns that are not selected in the query excluded from reading in each iteration?
Also, are both queries the same from a performance perspective?
To clarify:
CASE 1: the SELECT returns all columns.
CASE 2: the SELECT returns only the columns first_name and last_name.
In CASE 2, does the MySQL server read only the columns first_name and last_name, or does it read the entire row to get that data?
I am interested in how the server reads a table row in CASE 1 and in CASE 2.

Displaying statistics on a website from a large record set

I have 4 databases with tables holding lots of data. My requirement is to show the count of all the records in these tables when hovering over the corresponding div in the UI (it is an ASP.NET website). Please note the count may change every minute or hour (new records can be added to or deleted from the table by another application). Now the issue is that it takes a lot of time to get the count (since there is so much data), and each mouse-over makes a call to the corresponding database to fetch the count. Is there any better approach to implement this?
I am thinking of implementing something like as below.
http://www.worldometers.info/world-population/
But to change the figures like that every second I need to make a call to the database each time, right? (To get the latest count.) Is there any better approach to showing statistics like this?
By the way, I am using MySQL.
Thanks
You need to give more details - what table engine you are using, what your count query looks like, etc.
But assuming that you are using InnoDB, and you are trying to run count(*) or count(primary_id_column), you have to remember that InnoDB has clustered primary keys, which are stored with the data pages of the rows themselves, not in separate index pages, so the count will do a full scan of the rows.
One thing you can try is to create an additional, separate, non-primary index on a unique column (like the row's id) and make sure (use an EXPLAIN statement) that your count uses this index.
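A sketch of that check (the table name events and column id are assumptions for the example):
CREATE UNIQUE INDEX idx_events_id ON events (id);
EXPLAIN SELECT COUNT(id) FROM events;
-- the "key" column of the EXPLAIN output should show idx_events_id if the secondary index is being used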
If this does not work for you, I would suggest creating a separate table (for example with the columns table_name and row_count) to store counters in, and creating triggers on insert and on delete on the other tables (the ones you need to count records in) to increment or decrement these values. From my experience (we monitor the number of records on a daily and hourly basis, on tables with hundreds of millions of records and a heavy write load: ~150 inserts/sec) this is the best solution I have come up with so far.
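A minimal sketch of that counter table (the monitored table orders and the trigger names are examples, not a full production setup):
CREATE TABLE row_counts (
  table_name VARCHAR(64) PRIMARY KEY,
  row_count  BIGINT NOT NULL
);
-- seed the counter once with the current count
INSERT INTO row_counts VALUES ('orders', (SELECT COUNT(*) FROM orders));

CREATE TRIGGER orders_after_insert AFTER INSERT ON orders FOR EACH ROW
  UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'orders';

CREATE TRIGGER orders_after_delete AFTER DELETE ON orders FOR EACH ROW
  UPDATE row_counts SET row_count = row_count - 1 WHERE table_name = 'orders';
The UI then only ever runs SELECT row_count FROM row_counts WHERE table_name = 'orders', which is a single-row primary-key lookup no matter how large the monitored table gets.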

MySQL database: insert row between records

I need to insert rows of data in a specific order. Sometimes I forget to insert a row on time and I have to insert it later. Other rows have taken up its place by then, and until now I have manually (programmatically, of course) changed the index of a varying number of rows - it could be a couple of rows or hundreds of rows. This is not very efficient and I was looking for another way to go. My thought was to order by date and create a "day's index" number to reorder only that day's records, but I was wondering... is there any MySQL way to reorder the rows? That is, to inform MySQL of the required row position and then let it update the primary keys?
I think you need to look at your table design. This is actually a non-problem for most applications because it would have been addressed at the start.
Now, you need to add a DateTime column to your table, and initialise it with some sensible values for the data that's already there. As new rows are added, set the DateTime column in each new row to the actual DateTime. If you have to add a row late, set the DateTime to the time the record should have been added.
When you query your table, use ORDER BY myDateTime (or whatever you decide to call it). Your rows should appear in the correct order.
For small tables (less than a few thousand rows) an index might not help much. For larger tables you should index your DateTime column. You'd have to run some tests to see what works best.
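A sketch of that approach (assuming the table is called entries; the names are placeholders):
ALTER TABLE entries ADD COLUMN created_at DATETIME;
-- backfill existing rows with sensible values, then index the column
CREATE INDEX idx_entries_created_at ON entries (created_at);

-- for a late insert, just supply the time the record should have been added
INSERT INTO entries (created_at, some_column) VALUES ('2013-04-01 09:30:00', '...');

SELECT * FROM entries ORDER BY created_at;
No existing rows need to be renumbered; the ORDER BY puts the late row in its logical place.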
What you are thinking of is actually the solution. Create a date column if you don't already have one, then create an index on that field, and use ORDER BY in your query. There is no way other than doing it manually, and even if there were, it is not recommended to play with MySQL's way of storing rows: row storage is handled by the DB engine, which stores rows in the most optimal way, so why hurt its efficiency for such a small thing.

MySQL (and MSSQL): using both indexed and non-indexed columns in a WHERE clause

The database I use is currently mySQL but maybe later MSSQL.
My question is about how MySQL and MSSQL handle indexed and non-indexed columns.
Lets say I have a simple table like this:
*table_ID - auto-increment, just an ID, indexed
*table_user_ID - every user has a unique ID, indexed
*table_someOtherID - some data...
*....
Let's say that I have A LOT of rows in this table, but the number of rows that every user adds to this table is very small (10-100).
And I want to find one or a few specific rows in this table: a row or rows from a specific user (indexed column).
If I use the following WHERE clause:
..... WHERE table_user_ID= 'someID' AND table_someOtherID='anotherValue'.
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
I guess the database will grow a lot if I have to index every column in all tables.
But what do you think: is it enough to index those columns that will narrow the number of rows down to just ten or maybe a hundred?
Database optimizers generally work on a cost basis when choosing indexes, by looking at all the possible indexes that could be used for the query. In your specific case it will see 2 columns - table_user_ID with an index and someOtherID without an index. If you really only have 10-100 rows per userID then the cost of using this index will be very low and it will be used. This is because the cardinality is high and the DB can read only the few rows it needs and not touch the other rows belonging to every other user it's not interested in. However, if the cost to use the index is very high (very few unique userIDs and many entries per user) it might actually be more efficient not to use the index and to scan the whole table, to avoid the random seeking that happens as it jumps around the table grabbing rows based on the index.
Once it picks the index, the DB just grabs the rows that match that index (10 to 100 in your case) and tries to match them against your other criteria, searching for rows where someOtherID='anotherValue'.
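You can see which plan the optimizer chose with EXPLAIN; for example (mytable is a placeholder for your table name):
EXPLAIN SELECT * FROM mytable
WHERE table_user_ID = 'someID' AND table_someOtherID = 'anotherValue';
-- the "key" column shows which index (if any) was chosen, and "rows" the estimated number of rows examined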
But the number of rows that every user add to this table is very small (10-100)
You only need to index the user_id. It should give you good performance regardless of your query, as long as it includes the user_id in the filter. Until you have identified other use cases, it will pretty much work as you state.
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
Yes (in layman terms that is close).
In regards to SQL Server:
The ordering of the index columns is important, depending on how you query and how the indexes are structured. If you create an index on the columns
-table_user_id
-table_someotherID
The index is ordered by the table_user_id first. Example:
1-2
1-5
1-6
2-3
2-5
2-6
In the first index entry, 1 is the table_user_id and 2 is the other value.
If you run a query with a WHERE on table_user_id = blah, it will be very fast to use this index, since the table_user_id values are indexed in order.
But if you run a query that only uses table_someotherID in the WHERE clause, it might not even use this index, as instead of doing a quick seek in the index for the matching value, it will do a rough scan of the index (which is less efficient than a seek).
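A sketch of such a composite index, using the column names from the question (mytable is a placeholder):
CREATE INDEX idx_user_other ON mytable (table_user_ID, table_someOtherID);

-- can seek: the leading column of the index is filtered
SELECT * FROM mytable WHERE table_user_ID = 'someID' AND table_someOtherID = 'anotherValue';

-- may fall back to scanning: only the second column is filtered
SELECT * FROM mytable WHERE table_someOtherID = 'anotherValue';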
Also, SQL Server has an INCLUDE feature that associates the columns you want in the SELECT clause with the index you create on the WHERE or JOIN columns.
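In SQL Server syntax, a covering index with INCLUDE might look like the sketch below, where some_selected_column stands in for a column that appears only in the SELECT list:
CREATE NONCLUSTERED INDEX idx_user_other
ON mytable (table_user_ID, table_someOtherID)
INCLUDE (some_selected_column);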
So to answer your question, it all depends on how you create the indexes and how you query them. You're right not to index every column, as indexes take up storage and incur a performance hit when you do inserts and updates on the table.