optimizing query and optimizing table - mysql

CREATE TABLE emailAddress (
  ID int NOT NULL AUTO_INCREMENT,
  EMAILID varchar(255),
  LastIDfetched int,
  PRIMARY KEY (ID)
);

SELECT LastIDfetched FROM emailAddress WHERE ID = 1; -- say this value is x
SELECT EMAILID FROM emailAddress WHERE ID > x AND ID < x + 100;
UPDATE emailAddress SET LastIDfetched = x + 100 WHERE ID = 1;
Basically I am trying to fetch all the email IDs from the database using multiple computers running in parallel, so that no email ID is fetched by two computers.
What is the best way to do this?
There are millions of email IDs.
In the example above one query fetches 100 email IDs; the batch size can vary depending on the need.
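For what it's worth, the bookkeeping in the question can be made atomic with MySQL's documented LAST_INSERT_ID(expr) trick, so that no two machines can read the same fence value. A minimal sketch, assuming the counter row ID = 1 exists:

UPDATE emailAddress
SET LastIDfetched = LAST_INSERT_ID(LastIDfetched + 100)
WHERE ID = 1;                       -- atomically claim the next batch of 100
SELECT LAST_INSERT_ID() INTO @x;    -- the new fence value, i.e. x + 100
SELECT EMAILID FROM emailAddress
WHERE ID > @x - 100 AND ID <= @x;   -- this machine's claimed range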

My suggestion would be to query by autoincrement IDs. You may not get an exact split of records across candidate computers if there are gaps in your autoincrement sequence, but this should be pretty good.
One approach is to simply look at the remainder of the autoincrement ID and grab all rows where the remainder equals a certain value.
SELECT `EMAILID`
FROM `emailAddress`
WHERE ID % X = Y
Here X would equal the number of computers you are using. Y would be an integer between 0 and X - 1 that is unique to each machine running the query.
The con here is that this query cannot use an index, so if you need to run it a lot, or on a production system taking traffic, it could be problematic.
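One hedged workaround, assuming MySQL 5.7+ and a fixed machine count (X = 4 here is an assumption): materialize the remainder in an indexed generated column so each machine's query becomes indexable.

ALTER TABLE emailAddress
  ADD COLUMN id_mod TINYINT AS (ID % 4) STORED, -- X = 4 machines
  ADD KEY idx_id_mod (id_mod);

-- Machine Y = 2 then runs an index-backed query:
SELECT EMAILID FROM emailAddress WHERE id_mod = 2;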
Another approach would be to determine the number of rows in the table and split the queries into groups
SELECT COUNT(`ID`) FROM `emailAddress`; -- get the row count; we will call it A below

SELECT `EMAILID`
FROM `emailAddress`
ORDER BY ID ASC
LIMIT (A/X) * Y, (A/X) -- placeholders: MySQL does not allow expressions in LIMIT, so compute the offset and count in application code
Here again X is the number of machines, and Y is a unique integer for each machine (from 0 to X - 1).
The benefit here is that you are able to use the index on ID. The downside is that you may miss some rows if the number of rows grows between the initial count query and the queries that retrieve data.
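A worked example with assumed numbers, A = 1,000,000 rows and X = 4 machines, where machine Y = 2 would issue:

SELECT `EMAILID`
FROM `emailAddress`
ORDER BY ID ASC
LIMIT 500000, 250000; -- offset = (A/X) * Y = 250000 * 2, count = A/X = 250000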
I don't understand your LastIDfetched field, but it looks like an unnecessary mechanism you were trying to use to achieve what can easily be achieved as noted above.

Related

How to check whether two tables' contents in two instances are the same?

The Problem Description
I have two similar databases (MySQL) on two machines, each with a table of exactly the same structure. The order of data insertion can be different, but the final table contents must be the same except for the order of rows.
Table structure
#   Column   Type      Info
1   old_id   int(11)   unsigned, primary key, not null
2   new_id   int(11)   unsigned, primary key, not null
The unsuccessful solution
I tried this query to get an MD5 checksum of all records so that I can compare the results from each database:
SELECT
MD5(GROUP_CONCAT(CONCAT(`new_id`,':', `old_id`)))
FROM `maptable`
ORDER BY `new_id`,`old_id`
But there is a problem here. The checksum does not change even when I filter some rows. For example:
SELECT
MD5(GROUP_CONCAT(CONCAT(`new_id`, `old_id`)))
FROM `maptable`
WHERE mod(`new_id`, 2) = 0
ORDER BY `new_id`,`old_id`
Questions
Why does it not work? Is that because of the size of the GROUP_CONCAT() result? (the tables have about one million rows)
Do you have any simple query to get a checksum of the table content to check if the table content is the same? (I do not want to use Python, PHP, or any other programs to fetch the rows and make checksum.)
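One hedged guess at the cause: GROUP_CONCAT silently truncates its result at group_concat_max_len (default 1024 characters), so the MD5 is computed over a truncated prefix in both cases; also, an ORDER BY outside the aggregate does not affect concatenation order (it would have to go inside GROUP_CONCAT). An order-independent alternative that avoids the truncation entirely is to XOR per-row hashes; since (old_id, new_id) is the primary key, no duplicate rows can cancel out:

SELECT
BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT(`new_id`, ':', `old_id`)), 1, 16), 16, 10) AS UNSIGNED)) AS table_checksum
FROM `maptable`;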

Unpredictable random number unique in table in MariaDB (not just incremental)

I have a table of Users in MariaDB and I need to generate a serviceNumber for each newly created user.
Example of such a number: 865165
The only two requirements are:
Unique in the User table
Unpredictable at user-creation time (so not simply based on AUTO_INCREMENT)
Is this possible with just the database, or do I need to implement it in the backend when creating the user?
The only (theoretically) collision-free solution would be to generate the serviceNumber with the UUID() function (or maybe UUID_SHORT()).
The disadvantage of this solution is that it can't be used with statement-based replication; in that case you should create the value in your backend.
Steps 1-3 are setup steps:
Decide on how big you want the numbers to be. 6-digit numbers would let you have a million numbers.
Build a temp table t with all 6-digit numbers (t_num); a sketch for this step follows the setup code below.
Build and populate a permanent table:
CREATE TABLE `u` (
id INT AUTO_INCREMENT NOT NULL,
t_num INT NOT NULL,
PRIMARY KEY(id)
);
INSERT INTO u (t_num)
SELECT t_num FROM t ORDER BY RAND();
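A minimal sketch for step 2, assuming a server with recursive CTEs (MySQL 8.0+ or MariaDB 10.2+); raise the recursion cap first if your server limits it (cte_max_recursion_depth on MySQL, max_recursive_iterations on MariaDB):

SET SESSION cte_max_recursion_depth = 1000001; -- MySQL 8.0; on MariaDB adjust max_recursive_iterations if needed
CREATE TABLE t (t_num INT NOT NULL PRIMARY KEY);
INSERT INTO t (t_num)
WITH RECURSIVE seq (n) AS (
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 999999
)
SELECT n FROM seq; -- one million values, 000000 through 999999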
4 -- Plan A: you have an auto_inc id in the real table; simply JOIN to u to get the random 6-digit number.
4 -- Plan B:
BEGIN;
SELECT t_num FROM u ORDER BY id LIMIT 1 FOR UPDATE; -- get the next available number
DELETE FROM u ORDER BY id LIMIT 1; -- remove it from the list of available numbers
COMMIT;
From the MySQL reference manual:
To obtain a random integer R in the range i <= R < j, use the
expression FLOOR(i + RAND() * (j - i)). For example, to obtain a
random integer in the range 7 <= R < 12, use the following
statement: SELECT FLOOR(7 + (RAND() * 5));
Make the field holding serviceNumber unique and use the above expression to generate serviceNumbers, retrying whenever the insert fails with a duplicate-key error.
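A minimal sketch of that approach (table and column names are assumptions):

CREATE TABLE Users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  serviceNumber INT NOT NULL,
  UNIQUE KEY uq_serviceNumber (serviceNumber)
);

-- Retry from the caller whenever this fails with error 1062 (duplicate key):
INSERT INTO Users (serviceNumber)
VALUES (FLOOR(100000 + RAND() * 900000)); -- random 6-digit number, 100000 <= R < 1000000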

Handling a very large table without index

I have a very large table (20-30 million rows) that is completely overwritten each time it is updated by the system supplying the data, over which I have no control.
The table is not sorted in any particular order.
The rows in the table are unique; there is no subset of columns that I can be assured to have unique values.
Is there a way I can run a SELECT query followed by a DELETE query on this table with a fixed limit, without triggering any expensive sorting/indexing/partitioning/comparison, while being certain that I do not delete a row not covered by the previous SELECT?
I think you're asking for:
SELECT * FROM MyTable WHERE x = 1 AND y = 3;
DELETE FROM MyTable WHERE NOT (x = 1 AND y = 3);
In other words, use NOT against the same search expression you used in the first query to get the complement of the set of rows. This should work for most expressions, unless some of your terms can be NULL (see the sketch below).
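For nullable columns, one hedged workaround is MySQL's NULL-safe equality operator <=>, which always returns 0 or 1, so the NOT() complement covers every row:

SELECT * FROM MyTable WHERE x <=> 1 AND y <=> 3;
DELETE FROM MyTable WHERE NOT (x <=> 1 AND y <=> 3);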
If there are no indexes, then both the SELECT and DELETE will incur a table-scan, but no sorting or temp tables.
Re your comment:
Right, unless you use ORDER BY, you aren't guaranteed anything about the order of the rows returned. Technically, the storage engine is free to return the rows in any arbitrary order.
In practice, you will find that InnoDB at least returns rows in a somewhat predictable order: it reads rows in some index order. Even if your table has no keys or indexes defined, every InnoDB table is stored as a clustered index, even if it has to generate an internal one called GEN_CLUST_INDEX behind the scenes. That will be the order in which InnoDB returns rows.
But you shouldn't rely on that. The internal implementation is not a contract, and it could change tomorrow.
Another suggestion I could offer:
CREATE TABLE MyTableBase (
id INT AUTO_INCREMENT PRIMARY KEY,
A INT,
B DATE,
C VARCHAR(10)
);
CREATE VIEW MyTable AS SELECT A, B, C FROM MyTableBase;
With a table and a view like above, your external process can believe it's overwriting the data in MyTable, but it will actually be stored in a base table that has an additional primary key column. This is what you can use to do your SELECT and DELETE statements, and order by the primary key column so you can control it properly.
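A sketch of the resulting fetch-then-delete cycle against the base table (the batch size of 1000 is an arbitrary assumption):

SELECT A, B, C FROM MyTableBase ORDER BY id LIMIT 1000;
DELETE FROM MyTableBase ORDER BY id LIMIT 1000; -- removes exactly the rows just selected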

Efficient index to locate unprocessed entries in MySQL

I have a MySQL table that contains millions of entries.
Each entry must be processed at some point by a cron job.
I need to be able to quickly locate unprocessed entries, using an index.
So far, I have used the following approach: I add a nullable, indexed processedOn column that contains the timestamp at which the entry has been processed:
CREATE TABLE Foo (
...
processedOn INT(10) UNSIGNED NULL,
KEY (processedOn)
);
And then retrieve an unprocessed entry using:
SELECT * FROM Foo WHERE processedOn IS NULL LIMIT 1;
Thanks to MySQL's IS NULL optimization, the query is very fast, as long as the number of unprocessed entries is small (which is almost always the case).
This approach is good enough: it does the job, but at the same time I feel like the index is wasted because it's only ever used for WHERE processedOn IS NULL queries, and never for locating a precise value or range of values for this field. So this has an inevitable impact on storage space and INSERT performance, as every single timestamp is indexed for nothing.
Is there a better approach? Ideally the index would just contain pointers to the unprocessed rows, and no pointer to any processed row.
I know I could split this table into 2 tables, but I'd like to keep it in a single table.
What comes to my mind is to create an isProcessed column with default value 'N', and set it to 'Y' when processed (at the same time you set the processedOn column). Then create an index on the isProcessed field. When you query with WHERE isProcessed = 'N', it will respond very fast. A sketch follows.
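A minimal sketch of that suggestion (column and index names are assumptions; Foo's primary key is assumed to be id):

ALTER TABLE Foo
  ADD COLUMN isProcessed CHAR(1) NOT NULL DEFAULT 'N',
  ADD KEY idx_isProcessed (isProcessed);

-- Fetch one unprocessed entry:
SELECT * FROM Foo WHERE isProcessed = 'N' LIMIT 1;

-- Mark it processed:
UPDATE Foo SET isProcessed = 'Y', processedOn = UNIX_TIMESTAMP() WHERE id = 42;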
UPDATE: ALTERNATIVE with partitioning:
Create your table with partitions, defined on a field that holds just two values, 1 or 0. This creates one partition for records with field = 1 and another for records with field = 0.
create table test (field1 int, field2 int DEFAULT 0)
PARTITION BY LIST(field2) (
PARTITION p0 VALUES IN (0),
PARTITION p1 VALUES IN (1)
);
This way, if you want to query only the records with the field equal to one of the values, just do this:
select * from test partition (p0);
The query above will show only records with field2 = 0.
And if you need to query all records together, you just query the table normally:
select * from test;
As far as I understand your need, this should help.
I have multiple answers and comments on others' answers.
First, let me assume that the PRIMARY KEY for Foo is id INT UNSIGNED AUTO_INCREMENT (4 bytes) and that the table is Engine=InnoDB.
Indexed Extra column
The index for the extra column would be, per row, the width of the extra column and the PRIMARY KEY, plus a bunch of overhead. With your processedOn, you are talking about 8 bytes (2 INTs). With a simple flag, 5 bytes.
Separate table
This table would hold only the ids of the unprocessed items. It would take extra code to populate it. Its size would stay at some "high-water mark": if there were a burst of unprocessed items, it would grow, but not shrink back. (Here's a rare case where OPTIMIZE TABLE is useful.) InnoDB requires a PRIMARY KEY, and id would work perfectly. So: one column, no extra index. It is a lot smaller than the extra index discussed above. Finding something to work on:
$id = SELECT id FROM tbl LIMIT 1; -- don't care which one
-- process it
DELETE FROM tbl WHERE id = $id;
2 PARTITIONs, one processed, one not
No. When a row changes from unprocessed to processed, it must be removed from one partition and inserted into the other; this happens behind the scenes on your UPDATE ... SET flag = 1. Also, both partitions have the "high-water" issue: they will grow but not shrink. And the space overhead of partitioning may be as much as that of the other solutions.
SELECT ... PARTITION (...) requires MySQL 5.6. Without that, you would need an INDEX, so you are back to the index issues.
Continual Scanning
This incurs zero extra disk space. (That's better than you had hoped for, correct?) And it is not too inefficient. Here is some pseudo-code to put into your cron job; but don't make it a cron job. Instead, let it run all the time. (The reason will become clear, I hope.)
SET @a := 0;
Loop:
# Get a clump
SELECT @z := id FROM Foo WHERE id > @a ORDER BY id LIMIT 1000,1;
if no results, set @z to MAX(id)
# Find something to work on in that clump:
SELECT @id := id FROM Foo
WHERE id > @a
AND id <= @z
AND not-processed
LIMIT 1;
if you found something, process it and set @z := @id
SET @a := @z;
if @a >= MAX(id), set @a := 0; # to start over
SLEEP 2 seconds # or some amount that is a compromise
Go Loop
Notes:
It walks through the table with minimal impact.
It works even with gaps in id. (It could be made simpler if there were no gaps.) (If the PK is not AUTO_INCREMENT, it is almost identical.)
The sleep is to be a 'nice guy'.
Selective Index
MariaDB's dynamic columns and MySQL 5.7's JSON type can be indexed (via generated columns), and I think they can act as "selective" indexes: one state would be to leave the column empty, the other would be to set the flag in the dynamic/JSON column. This will take some research to verify, and may require an upgrade.

Design of mysql database for large number of large matrix data

I am looking into storing a "large" amount of data and I am not sure what the best solution is, so any help would be most appreciated. The structure of the data is:
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable: columns will be added every month, but the number of rows is fixed.
My understanding is that it is often best to store this as:
id | row_number | column_number | value
but this would create 4,950,000,000 entries. I have tried storing it as plain rows and columns in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or only a non-unique INDEX on colNum, if you often access the matrix by column (because the PRIMARY KEY is on (`rowNum`, `colNum`), note the order, so it will be inefficient when it comes to selecting a whole column).
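A sketch of the column-access index (the index name is an assumption):

ALTER TABLE `stackoverflow`.`matrix`
  ADD UNIQUE INDEX idx_col_row (`colNum`, `rowNum`);

-- Whole-column reads can now use the index:
SELECT `rowNum`, `value`
FROM `stackoverflow`.`matrix`
WHERE `colNum` = 1000;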
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [one per row] to add when adding a column).
Edits should be very fast, as the indexes wouldn't change and the value is of fixed size.
If you often access the same subsets (rows + cols), maybe you can use PARTITIONing of the table, if you need something "faster" than what MySQL provides by default.
After years of experience (2020 edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it is MySQL's role to handle this sort of cache (the data should actually already be in the InnoDB buffer pool).
A better idea would be, if the matrix is full of zeroes, not to store the zero values, and to treat 0 as the "default" in the client code. That way you lighten the storage (if needed: MySQL should actually be pretty fast at responding to queries even on such a 5-billion-row table).
Another option, if storage is an issue, is to use a single ID to identify both row and col: since the number of rows is fixed (450,000), you may replace (row, col) with a single id = 450000 * col + row value (though it needs BIGINT, so maybe no better than two columns).
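A minimal sketch of that packed encoding (the table name is an assumption; 450,000 x 11,000 is about 4.95 billion, which exceeds the unsigned INT range, hence BIGINT):

CREATE TABLE `stackoverflow`.`matrixPacked` (
`cellId` BIGINT UNSIGNED NOT NULL , -- cellId = 450000 * colNum + rowNum
`value` INT NOT NULL ,
PRIMARY KEY ( `cellId` )
) ENGINE = MYISAM ;

-- Look up row 3, column 5:
SELECT `value` FROM `matrixPacked` WHERE `cellId` = 450000 * 5 + 3;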
Don't do what's below: don't reinvent the MySQL cache
Add a cache (actually no)
Since you said you add values and don't seem to edit matrix values, a cache could speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column/row:
SELECT the row/column from that caching table
If the SELECT returns an empty or partial result (no data returned, or not enough data to match the expected row/column count), then do the SELECT on the matrix table
Save the SELECT result from the matrix table into cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth row or Kth column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from cache entries because its lastDate is old enough, you may break the cache for other entries) or to regularly clear the cache and only keep the recently selected values.