How to set up MySQL tables for fast SELECT

The question is about *.FIT files (link to definition) from sports watches, speedometers and similar devices; there can be 1 to extremely many of them, with more arriving constantly. Each file always contains a timestamp (1 to n seconds) as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).
To perform data analysis, I need the data in the database so I can calculate, for example, heart rate in relation to altitude across several FIT files / training units / time periods.
Because the number of parameters in a FIT file changes (depending on the connected devices, the device that created the file, etc.) and more/new parameters may be integrated in the future, my idea was to have a separate table for each parameter instead of writing everything into one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).
Basic tables:
1 x tbl_file
id  filename  date
1   xyz.fit   2022-01-01
2   vwx.fit   2022-01-02
..  ..        ..
n x tbl_parameter_xy / tbl_parameter_yz / ....
id  timestamp/counter  file_id  value
1   0                  1        value
2   1                  1        value
3   0                  2        value
..  ..                 ..       ..
These parameter tables would then be linked to each other via the file_id, as well as to the FIT file.
I then set up a MySQL database on a test server to try this out and was shocked:
SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999
Took almost 30 seconds to give me the results.
In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with JOIN returns 3857269 rows, so there must be an error somewhere, and that is the reason why it takes 30 seconds.
In comparison, fetching from a "large complete" table was done in 0.5 seconds:
SELECT * FROM tbl_all_parameters
WHERE file_id = 999
After some research, I came across INDEX and thought I had the solution.
I created an index (file_id) for each of the parameter tables, but the result was the same or even slower.
Right now I'm thinking about building that big "all-in-one" table, which would be easier to handle and faster to select from, but I would have to alter it frequently to add new columns for new parameters. And I'm afraid it will grow so big that it kills itself.
I have 2 questions:
Which table setup is recommended, primarily with a focus on SELECT speed and secondarily on the size of the DB?
Do I have a basic bug in my SELECT that makes it so slow?

You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.
If you say
SELECT * FROM a LEFT JOIN b
with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this
SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id
which gives you a similarly bloated result set: every row in x with file_id 999 gets paired with every row in y with file_id 999, i.e. roughly 1964 * 1964 ≈ 3.86 million rows.
You need another ON condition... possibly try this.
SELECT *
FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
AND x.timestamp = y.timestamp
if the timestamps in the two tables are somehow in sync.
But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.
If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type), you have almost certainly misdesigned your database.
An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.
Your observations will go into a table like this.
file_id part of your primary key
parameter_id part of your primary key
timestamp part of your primary key
value
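As a concrete illustration, a minimal sketch of such an observation table (the column types are assumptions; pick whatever matches your FIT data):
CREATE TABLE observation_table (
    file_id      INT UNSIGNED      NOT NULL,  -- references tbl_file.id
    parameter_id SMALLINT UNSIGNED NOT NULL,  -- references a small lookup table of parameter names
    timestamp    INT UNSIGNED      NOT NULL,  -- or DATETIME, depending on how you store the FIT timestamps
    value        DOUBLE            NOT NULL,
    PRIMARY KEY (file_id, parameter_id, timestamp)
) ENGINE=InnoDB;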
Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do
SELECT timestamp, parameter_id, value
FROM observation_table
WHERE file_id = xxxx
AND parameter_id IN (2,3)
ORDER BY timestamp, parameter_id
The multicolumn primary key I suggested will optimize this particular query.
Once you have this working, read about denormalization.

Related

A SQL query searching for rows that satisfy Column1 <= X <= Column2 is very slow

I am using a MySQL DB, and have the following table:
CREATE TABLE SomeTable (
PrimaryKeyCol BIGINT(20) NOT NULL,
A BIGINT(20) NOT NULL,
FirstX INT(11) NOT NULL,
LastX INT(11) NOT NULL,
P INT(11) NOT NULL,
Y INT(11) NOT NULL,
Z INT(11) NOT NULL,
B BIGINT(20) DEFAULT NULL,
PRIMARY KEY (PrimaryKeyCol),
UNIQUE KEY FirstLastXPriority_Index (FirstX,LastX,P)
) ENGINE=InnoDB;
The table contains 4.3 million rows, and never changes once initialized.
The important columns of this table are FirstX, LastX, Y, Z and P.
As you can see, I have a unique index on the rows FirstX, LastX and P.
The columns FirstX and LastX define a range of integers.
The query I need to run on this table fetches for a given X all the rows having FirstX <= X <= LastX (i.e. all the rows whose range contains the input number X).
For example, if the table contains the rows (I'm including only the relevant columns):
FirstX  LastX   P  Y    Z
100000  500000  1  111  222
150000  220000  2  333  444
180000  190000  3  555  666
550000  660000  4  777  888
700000  900000  5  999  111
750000  850000  6  222  333
and I need, for example, the rows that contain the value 185000, the first 3 rows should be returned.
The query I tried, which should be using the index, is:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;
Even without the LIMIT, this query should return a small number of records (less than 50) for any given X.
This query was executed by a Java application for 120000 values of X. To my surprise, it took over 10 hours (!) and the average time per query was 0.3 seconds.
This is not acceptable, not even near acceptable. It should be much faster.
I examined a single query that took 0.563 seconds to make sure the index was being used. The query I tried (the same as the query above with a specific integer value instead of ?) returned 2 rows.
I used EXPLAIN to find out what was happening:
id 1
select_type SIMPLE
table SomeTable
type range
possible_keys FirstLastXPriority_Index
key FirstLastXPriority_Index
key_len 4
ref NULL
rows 2104820
Extra Using index condition
As you can see, the execution involved 2104820 rows (nearly 50% of the rows of the table), even though only 2 rows satisfy the conditions, so half of the index is examined in order to return just 2 rows.
Is there something wrong with the query or the index? Can you suggest an improvement to the query or the index?
EDIT:
Some answers suggested that I run the query in batches for multiple values of X. I can't do that, since I run this query in real time, as inputs arrive to my application. Each time an input X arrives, I must execute the query for X and perform some processing on the output of the query.
I found a solution that relies on properties of the data in the table. I would rather have a more general solution that doesn't depend on the current data, but for the time being that's the best I have.
The problem with the original query:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;
is that the execution may require scanning a large percentage of the entries in the FirstX,LastX,P index when the first condition FirstX <= ? is satisfied by a large percentage of the rows.
What I did to reduce the execution time is observe that LastX-FirstX is relatively small.
I ran the query:
SELECT MAX(LastX-FirstX) FROM SomeTable;
and got 4200000.
This means that FirstX >= LastX - 4200000 for all the rows in the table.
So in order to satisfy LastX >= ?, we must also satisfy FirstX >= ? - 4200000.
So we can add a condition to the query as follows:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - 4200000 AND LastX >= ? LIMIT 10;
In the example I tested in the question, the number of index entries processed was reduced from 2104820 to 18 and the running time was reduced from 0.563 seconds to 0.0003 seconds.
I tested the new query with the same 120000 values of X. The output was identical to the old query. The time went down from over 10 hours to 5.5 minutes, which is over 100 times faster.
WHERE col1 < ... AND ... < col2 is virtually impossible to optimize.
Any useful query will involve a "range" on either col1 or col2. Two ranges (on two different columns) cannot be used in a single INDEX.
Therefore, any index you try has the risk of checking a lot of the table:
INDEX(col1, ...) will be scanned from the start of the index up to where col1 hits the constant; similarly, an index starting with col2 is scanned from that point to the end.
To add to your woes, the ranges are overlapping. So, you can't pull a fast one and add ORDER BY ... LIMIT 1 to stop quickly. And if you say LIMIT 10, but there are only 9, it won't stop until the start/end of the table.
One simple thing you can do (but it won't speed things up by much) is to swap the PRIMARY KEY and the UNIQUE. This could help because InnoDB "clusters" the PK with the data.
If the ranges did not overlap, I would point you at http://mysql.rjweb.org/doc.php/ipranges .
So, what can be done?? How "even" and "small" are the ranges? If they are reasonably 'nice', then the following would take some code, but should be a lot faster. (In your example, 100000 500000 is pretty ugly, as you will see in a minute.)
Define buckets to be, say, floor(number/100). Then build a table that correlates buckets and ranges. Samples:
FirstX LastX Bucket
123411 123488 1234
222222 222444 2222
222222 222444 2223
222222 222444 2224
222411 222477 2224
Notice how some ranges 'belong' to multiple buckets.
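A sketch of how such a bucket table could be built, assuming a helper table numbers(n) holding consecutive integers starting at 0 (the table name range_buckets and the bucket width of 100 are only illustrative):
CREATE TABLE range_buckets (
    bucket INT NOT NULL,
    firstX INT NOT NULL,
    lastX  INT NOT NULL,
    INDEX (bucket, firstX)
);

-- one row per bucket that a range touches
INSERT INTO range_buckets (bucket, firstX, lastX)
SELECT FLOOR(s.FirstX / 100) + n.n, s.FirstX, s.LastX
FROM SomeTable s
JOIN numbers n
  ON FLOOR(s.FirstX / 100) + n.n <= FLOOR(s.LastX / 100);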
Then, the search is first on the bucket(s) in the query, then on the details. Looking for X=222433 would find two rows with bucket=2224, then decide that both are OK. But for X=222466, two rows have the bucket, but only one matches with firstX and lastX.
WHERE bucket = FLOOR(X/100)
AND firstX <= X
AND X <= lastX
with
INDEX(bucket, firstX)
But... with 100000 500000, there would be 4001 rows because this range is in that many 'buckets'.
Plan B (to tackle the wide ranges)
Segregate the ranges into wide and narrow. Handle the wide ranges with a simple table scan, and the narrow ranges via my bucket method. UNION ALL the results together. Hopefully the "wide" table would be much smaller than the "narrow" table.
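A rough sketch of what the combined lookup could look like (narrow_ranges and wide_ranges are assumed table names, with the narrow table bucketed and indexed as above):
SELECT P, Y, Z
FROM narrow_ranges
WHERE bucket = FLOOR(? / 100)
  AND firstX <= ?
  AND ? <= lastX
UNION ALL
SELECT P, Y, Z
FROM wide_ranges          -- hopefully small enough that a scan is acceptable
WHERE firstX <= ?
  AND ? <= lastX;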
You need to add another index on LastX.
The unique index FirstLastXPriority_Index (FirstX,LastX,P) represents the concatenation of these values, so it will be useless with the 'AND LastX >= ?' part of your WHERE clause.
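For example, a minimal sketch (the index name is arbitrary):
ALTER TABLE SomeTable ADD INDEX LastX_Index (LastX);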
It seems that the only way to make the query fast is to reduce the number of fetched and compared fields. Here is the idea.
We can declare a new indexed field (for instance UNSIGNED BIGINT) and store both values FirstX and LastX in it using an offset for one of the fields.
For example:
FirstX LastX CombinedX
100000 500000 100000500000
150000 220000 150000220000
180000 190000 180000190000
550000 660000 550000660000
70000 90000 070000090000
75 85 000075000085
An alternative is to declare the field as DECIMAL and store FirstX + LastX / MAX(LastX) in it.
Later look for the values satisfying the conditions comparing the values with a single field CombinedX.
APPENDED
And then you can fetch the rows checking only one field.
For example, with param1 = 160000:
SELECT * FROM new_table
WHERE
(CombinedX <= 160000*1000000) AND
(CombinedX % 1000000 >= 160000);
Here I assume that FirstX < LastX for all rows. Of course, you can calculate param1 * offset in advance and store it in a variable against which the further comparisons will be done. You can also consider bitwise shifts instead of decimal offsets; decimal offsets were chosen here because they are easier for a human to read in the sample.
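If you are on MySQL 5.7 or later, one way to maintain such a field (a sketch, using the decimal offset of 1000000 from the example above; the column and index names are assumptions) is a stored generated column:
ALTER TABLE SomeTable
    ADD COLUMN CombinedX BIGINT UNSIGNED
        GENERATED ALWAYS AS (FirstX * 1000000 + LastX) STORED,
    ADD INDEX CombinedX_Index (CombinedX);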
Eran, I believe the solution you found yourself is the best in terms of minimum cost. It is normal to take the distribution properties of the data in the DB into account during the optimization process. Moreover, in large systems it is usually impossible to achieve satisfactory performance if the nature of the data is not taken into account.
However, this solution also has drawbacks, and the need to change the configuration parameter with every data change is the least of them. More important may be the following. Let's suppose that one day a very large range appears in the table. For example, let its length cover half of all possible values. I do not know the nature of your data, so I cannot know for certain whether such a range can ever appear; this is just an assumption. From the point of view of the result, it's okay: it just means that about every second query will now return one more record. But even just one such interval will completely kill your optimization, because the condition FirstX <= ? AND FirstX >= ? - [MAX(LastX - FirstX)] will no longer cut off enough records.
Therefore, if you have no assurance that overly long ranges will never appear, I would suggest keeping the same idea but approaching it from the other side.
I propose that, when loading new data into the table, you break all long ranges into smaller ones whose length does not exceed a certain value. You wrote that the important columns of this table are FirstX, LastX, Y, Z and P. So you can choose some number N once, and every time you load data into the table, whenever you find a range with LastX - FirstX > N, replace it with several rows:
FirstX; FirstX + N
FirstX + N; FirstX + 2N
...
FirstX + kN; LastX
and for each of these rows, keep the same values of Y, Z and P.
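A sketch of how the split rows could be generated at load time, assuming a chunk size in @N, a staging table incoming_rows holding the new data, and a helper table numbers(n) with consecutive integers starting at 0 (all of these names are assumptions):
SET @N = 1000000;  -- example chunk size

SELECT s.FirstX + n.n * @N                        AS FirstX,
       LEAST(s.FirstX + (n.n + 1) * @N, s.LastX)  AS LastX,
       s.P, s.Y, s.Z
FROM incoming_rows s
JOIN numbers n
  ON n.n * @N < GREATEST(s.LastX - s.FirstX, 1);  -- one output row per N-sized slice of the range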
For the data prepared that way, your query will always be the same:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - N AND LastX >= ?
and will always be equally effective.
Now, how to choose the best value for N? I would experiment with different values and see which works best. It is quite possible that the optimum is less than the current maximum interval length of 4200000. At first this may be surprising, because decreasing N necessarily grows the table, so it can become much larger than 4.3 million rows. But in fact, the sheer size of the table is not a problem as long as your query uses the index well, and the smaller N is, the more efficiently the index is used.
Indexes will not help you in this scenario, except for a small percentage of all possible values of X.
Lets say for example that:
FirstX contains values from 1 to 1000 evenly distributed
LastX contains values from 1 to 1042 evenly distributed
And you have following indexes:
FirstX, LastX, <covering columns>
LastX, FirstX, <covering columns>
Now:
If X is 50 the clause FirstX <= 50 matches approximately 5% rows while LastX >= 50 matches approximately 95% rows. MySQL will use the first index.
If X is 990 the clause FirstX <= 990 matches approximately 99% rows while LastX >= 990 matches approximately 5% rows. MySQL will use the second index.
Any X between these two will cause MySQL to not use either index (I don't know the exact threshold but 5% worked in my tests). Even if MySQL uses the index, there are just too many matches and the index will most likely be used for covering instead of seeking.
Your solution is the best. What you are doing is defining upper and lower bound of "range" search:
WHERE FirstX <= 500 -- 500 is the middle (worst case) value
AND FirstX >= 500 - 42 -- range matches approximately 4.3% rows
AND ...
In theory, this should work even if you search FirstX for values in the middle. Having said that, you got lucky with 4200000 value; possibly because the maximum difference between first and last is a smaller percentage.
If it helps, you can do the following after loading the data:
ALTER TABLE testdata ADD COLUMN delta INT NOT NULL;
UPDATE testdata SET delta = LastX - FirstX;
ALTER TABLE testdata ADD INDEX delta (delta);
This makes selecting MAX(LastX - FirstX) easier.
I tested MySQL SPATIAL INDEXES which could be used in this scenario. Unfortunately I found that spatial indexes were slower and have many constraints.
Edit: Idea #2
Do you have control over the Java app? Because, honestly, 0.3 seconds for an index scan is not bad. Your problem is that you're trying to get a query, run 120,000 times, to have a reasonable end time.
If you do have control over the Java app, you could either have it submit all the X values at once - and let SQL not have to do an index scan 120k times. Or you could even just program the logic on the Java side, since it would be relatively easy to optimize.
Original Idea:
Have you tried creating a Multiple-Column index?
The problem with having multiple indexes is that each index is only going to narrow it down to ~50% of the records - it has to then match those ~2 million rows of Index A against ~2 million rows of Index B.
Instead, if you get both columns in the same index, the SQL engine can first do a Seek operation to get to the start of the records, and then do a single Index Scan to get the list of records it needs. No matching one index against another.
I'd suggest not making this the Clustered Index, though. The reason for that? You're not expecting many results, so matching the Index Scan's results against the table isn't going to be time consuming. Instead, you want to make the Index as small as possible, so that the Index Scan goes as fast as possible. Clustered Indexes are the table - so a Clustered Index is going to have the same Scan speed as the table itself. Along the same lines, you probably don't want any other fields other than FirstX and LastX in your index - make that Index as tiny as you can, so that the scan flies along.
Finally, like you're doing now, you're going to need to clue the engine in that you're not expecting a large set of data back from the search - you want to make sure it's using that compact index for its scan (instead of it saying, "Eh, I'd be better off just doing a full table scan.").
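A sketch of that idea: a narrow two-column index plus a hint so MySQL prefers it over a full table scan (the index name is an assumption, and FORCE INDEX is optional; test whether the optimizer picks the index on its own first):
CREATE INDEX FirstLastX_Narrow ON SomeTable (FirstX, LastX);

SELECT P, Y, Z
FROM SomeTable FORCE INDEX (FirstLastX_Narrow)
WHERE FirstX <= ? AND LastX >= ?
LIMIT 10;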
One way might be to partition the table by different ranges and then only query the partitions a given X can fall into, making the amount it needs to check much smaller. This might not help overall, since the Java side may become slower, but it might put less stress on the database.
There might also be a way to not query the database so many times and use a more inclusive SQL statement (you might be able to send a list of values and have the SQL write the results to a different table).
Suppose you got the execution time down to 0.1 seconds. Would the resulting 3 hours, twenty minutes be acceptable?
The simple fact is that thousands of calls to the same query is incredibly inefficient. Quite aside from what the database has to endure, there is network traffic to think of, disk seek times and all kinds of processing overhead.
Supposing that you don't already have the 120,000 values for x in a table, that's where I would start. I would insert them into a table in batches of 500 or so at a time:
insert into xvalues (x)
select 14 union all
select 18 union all
select 42 /* and so on */
Then, change your query to join to xvalues.
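That join could look something like this (a sketch, assuming the xvalues table above):
SELECT xv.x, s.P, s.Y, s.Z
FROM xvalues xv
JOIN SomeTable s
  ON s.FirstX <= xv.x
 AND s.LastX  >= xv.x;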
I reckon that optimisation alone will get your run-time down to minutes or seconds instead of hours (based on many such optimisations I have done through the years).
It also opens up the door for further optimisations. If the x values are likely to have at least some duplicates (say, at least 20% of values occur more than once) it may be worth investigating a solution where you only run the query for unique values and do the insert into SomeTable for every x with the matching value.
As a rule: anything you can do in bulk is likely to exponentially outperform anything you do row by row.
PS:
You referred to a query, but a stored procedure can also work with an input table. In some RDBMSs you can pass a table as parameter. I don't think that works in MySQL, but you can create a temporary table that the calling code fills in and the stored procedure joins to. Or a permanent table used in the same way. The major drawback of not using a temp table, is that you may need to concern yourself with session management or discarding stale data. Only you will know if that is applicable to your case.
So, I don't have enough data to be sure of the run time. This will only work if column P is unique? In order to get two indexes working, I created two indexes and the following query...
Index A - FirstX, P, Y, Z
Index B - P, LastX
This is the query
select A.P, A.Y, A.Z
from
(select P, Y, Z from asdf A where A.firstx <= 185000 ) A
join
(select P from asdf A where A.LastX >= 185000 ) B
ON A.P = B.P
For some reason this seemed faster than
select A.P, A.Y, A.Z
from asdf A join asdf B on A.P = B.P
where A.firstx <= 185000 and B.LastX >= 185000
To optimize this query:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;
Here are 2 resources you can use:
descending indexes
spatial indexes
Descending indexes:
One option is to use an index that is descending on FirstX and ascending on LastX.
https://dev.mysql.com/doc/refman/8.0/en/descending-indexes.html
something like:
CREATE INDEX SomeIndex on SomeTable (FirstX DESC, LastX);
Conversely, you could create instead the index (LastX, FirstX DESC).
Spatial indexes:
Another option is to use a SPATIAL INDEX with (FirstX, LastX). If you think of FirstX and LastX as 2D spatial coordinates, then what your search does is select the points in a contiguous geographic area delimited by the lines FirstX<=LastX, FirstX>=0, LastX>=X.
Here's a link on spatial indexes (not specific to MySQL, but with drawings):
https://learn.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview
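A rough sketch of the spatial variant for MySQL 5.7+ InnoDB, treating each row as the point (FirstX, LastX); the column name xrange is an assumption, and 2147483647 simply stands in for "larger than any LastX":
ALTER TABLE SomeTable ADD COLUMN xrange POINT;
UPDATE SomeTable SET xrange = POINT(FirstX, LastX);
ALTER TABLE SomeTable MODIFY xrange POINT NOT NULL, ADD SPATIAL INDEX (xrange);

-- rows with FirstX <= 185000 AND LastX >= 185000 are exactly the points inside
-- the rectangle 0 <= x <= 185000, 185000 <= y <= 2147483647
SELECT P, Y, Z
FROM SomeTable
WHERE MBRContains(
    ST_GeomFromText('POLYGON((0 185000, 185000 185000, 185000 2147483647, 0 2147483647, 0 185000))'),
    xrange);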
Another approach is to precalculate the solutions, if that number isn't too big.
CREATE TABLE SomeTableLookUp (
X INT NOT NULL,
PrimaryKeyCol BIGINT NOT NULL,
PRIMARY KEY(X, PrimaryKeyCol)
);
And now you just pre-populate your constant table.
INSERT INTO SomeTableLookUp
SELECT XS.X, SomeTable.PrimaryKeyCol
FROM SomeTable
JOIN (
SELECT DISTINCT X FROM SomeTable
) XS
WHERE XS.X BETWEEN SomeTable.FirstX AND SomeTable.LastX
And now you can SELECT your answers directly.
SELECT SomeTable.*
FROM SomeTableLookUp
JOIN SomeTable
ON SomeTableLookUp.PrimaryKeyCol = SomeTable.PrimaryKeyCol
WHERE SomeTableLookUp.X = ?
LIMIT 10

SSIS Flat File data column compare against table's column data range

I need to develop an SSIS package that imports a flat file (which has only 1 column) and compares each row against an existing table's 2 columns (start and end columns).
Flat File data -
110000
111112
111113
112222
112525
113222
113434
113453
114343
114545
And compare each row of the flat file against structure/data -
id start end
8 110000 119099
8 119200 119999
3 200000 209999
3 200000 209999
2 300000 300049
2 770000 779999
2 870000 879999
Now, if I needed to implement this in a simple stored procedure it would be fairly simple; however, I am not able to get my head around how to do it in an SSIS package.
Any ideas? Any help much appreciated.
At the core, you will need to use a Lookup Component. Write a query, SELECT T.id, T.start, T.end FROM dbo.MyTable AS T and use that as your source. Map the input column to the start column and select the id so that it will be added to the data flow.
If you hit run, it will perform an exact lookup and only find values of 110000 and 119200. To convert it to a range query, you will need to go into the Advanced tab. There should be 3 things you can check: amount of memory, rows and customize the query. When you check the last one, you should get a query like
SELECT * FROM
(SELECT T.id, T.start, T.end FROM dbo.MyTable AS T) AS ref
WHERE ref.start = ?
You will need to modify that to become
SELECT * FROM
(SELECT T.id, T.start, T.end FROM dbo.MyTable AS T) AS ref
WHERE ? BETWEEN ref.start AND ref.end
It's been my experience that the range queries can become rather inefficient as it seems to cache what's been seen already so if the source file had 110001, 110002, 110003 you would see 3 unique queries sent to the database. For small data sets, that may not be so bad but it led to some ugly load times for my DW.
An alternative to this is to explode the ranges. For me, I had a source system that only kept date ranges and I needed to know by day what certain counts were. The range lookups were not performing well so I crafted a query to convert the single row with a range of 2010-01-01 to 2013-07-07 to many rows, each with a single date 2013-01-01, 2013-01-02... While this approach lead to a longer pre-execute phase (it took a few minutes as the query had to generate ~30k rows per day for the past 5 years), once cached locally it was a simple seek to find a given transaction by day.
Preferably, I'd create a numbers table, fill it to the max of int and be done with it but you might get by with just using an inline table valued function to generate numbers. Your query would then look something like
SELECT
T.id
, GN.number
FROM
dbo.MyTable AS T
INNER JOIN
-- Make this big enough to satisfy your theoretical ranges
dbo.GenerateNumbers(1000000) AS GN
ON GN.number BETWEEN T.start and T.end;
That would get used in a "straight" lookup without the need for any of the advanced features. The lookup is going to get very memory hungry though so make the query as tight as possible. For example, cast the GN.number from a bigint to an int in the source query if you know your values will fit in an int.
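For reference, an inline table-valued function along those lines might look like this (a sketch for SQL Server; dbo.GenerateNumbers is the name the query above assumes):
CREATE FUNCTION dbo.GenerateNumbers (@maxNumber INT)
RETURNS TABLE
AS
RETURN
    -- Cross-joining two large system views yields plenty of rows;
    -- ROW_NUMBER turns them into the sequence 1..@maxNumber.
    SELECT TOP (@maxNumber)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b;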

SQL - Comparing text(combinations) on 100million table

I have a problem.
I have a table that has around 80-100 million records in it. In that table I have a varchar field that stores from 3 up to 16 different "combinations". A combination is a 4-digit number, a colon and a character (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (they can go up to 16). This field is in every row of the table and is never null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). What needs to happen is that when you send '0001:A/0002:A/0003:C' via a program (or as a parameter in SQL), it checks every row and sees whether they share the same "group". The catch is that it has to go both ways, it has to be done quickly, and the SQL needs to compare them somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' it has to find all rows that share 3-16 of the same combinations and return them. This 3-16 can be specified via a parameter, but the problem is that you would need to consider all possible combinations, because you can send '0002:A/0711:C/0713:A', and as you can see 0002:A can be the first parameter.
But you cannot rely on a plain index, because a combination can be anywhere in the string, and you can send combinations that are not adjacent (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all rows that share the same 3-16 combinations,
and it has to go both ways: if you send "0001:A/0002:A/0003:C" it has to find the row above plus similar rows (all that contain all the parameters).
Some things/options I tried:
Doing a LIKE for every sent combination is not practical and too slow
Giving the field a full-text index/search isn't an option (I don't know why exactly)
One of the few things that could work would be making some "hash" type of encoding for the field, calculating it via a program, and searching for all identical "hashes" (I don't know how you would do that, given that a hash would generate different values for similar texts; maybe some hash written exactly for this purpose)
Making a new field, calculating/writing (can be done on insert) all possible combinations and checking via SQL/program if they have the same % of combinations, but I don't know how you can store 10080 combinations (in the case of 16) into a "varchar" effectively, or via some hash code, and then know which of them are similar.
There is another catch: this table is in use almost 24/7, and computing the combinations for comparison in SQL is too slow because the table is too big. It could be done via a program or something, but I have no clue how you could store this in a new column in a way that tells you the rows are the same. One possibility is to calculate the combinations, store them via some hash code or similar on each row insert, calculate the "hash" via a program, and then check the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to be executed really fast, under 1 minute, because there could be new inserts into the table, that you would need to check.
The whole point of this would be to see if there are any similar combinations in SQL already and blocking any new combination that would be "similar" for inserting.
I have been dealing with this problem for 3 days now without finding a possible solution; the closest thing was some kind of insert/hash approach, but I don't know how that could work.
Thank you in advance for any possible help, or if this is even possible!
it checks every row and see if they have the same "group".
IMHO if the group is a basic element of your data structure, your database structure is flawed: it should have each group in its own cell to be normalized. The structure you described makes it clear that you store a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
ID BIGINT PRIMARY KEY,
DESCRIPTION TEXT
);
CREATE TABLE GRP (
ID BIGINT PRIMARY KEY,
GROUP_TXT CHAR(6)
);
CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
GROUP_ID BIGINT,
GROUP_SEQUENCE_HEADER_ID BIGINT,
GROUP_SEQUENCE_HEADER_ORDER INT, /* For storing the order in the sequence */
PRIMARY KEY(GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on the disk space too by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 --lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
EDIT
Rethinking it a bit more, you could even break each group up into its two parts, but as far as I understand you always have full groups to deal with, so it doesn't seem necessary.
EDIT2 If you want to speed the process up even more, I'd recommend translating the groups into numeric data using a bijection. For example, evaluate the first 4 digits as an integer, shift it 4 bits to the left (multiply by 16, but quicker), and add the hex value of the character in the last place.
Examples:
0001:A --> 1 as integer, A is 10, so 1*16+10 = 26
...
0002:B --> 2 as integer, B is 11, so 2*16+11 = 43
...
0343:D --> 343 as integer, D is 13, so 343*16+13 = 5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16+14 = 159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in an even better performance - of course with the new structure.
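A sketch of that encoding in SQL, assuming the values are stored exactly as 'NNNN:X' (four digits, a colon, one letter) as in the question:
SELECT CAST(SUBSTRING(sym, 1, 4) AS UNSIGNED) * 16
       + CONV(SUBSTRING(sym, 6, 1), 16, 10) AS encoded  -- CONV('A',16,10) = 10 ... CONV('E',16,10) = 14
FROM (SELECT '0343:D' AS sym) AS example;                -- yields 5501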
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece on getting Back to Basics to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into those discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns - you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a select joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
As people have commented, you would benefit immensely from normalizing your data, but can you not cheat and create a temp table with the key, exploding your column out on the "/", so you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested):
SELECT
    t1.key
    , t2.key
    , COUNT(*)
FROM
    temp_table t1
    JOIN temp_table t2
        ON t1.combination = t2.combination
        AND t1.key <> t2.key
    JOIN ( SELECT t3.key, COUNT(*) AS cnt FROM temp_table t3 GROUP BY t3.key) t4
        ON t4.key = t1.key
GROUP BY
    t1.key
    , t2.key
    , t4.cnt
HAVING
    COUNT(*) = t4.cnt
So return the two keys where key1 is a proper subset of key?
I guess I can recommend building a special "index".
It will be quite big but you will achieve superspeedy results.
Let's consider this task as searching a set of symbols.
There are design conditions.
The symbols follow the pattern "NNNN:X", where NNNN is a number [0001-9999] and X is a letter [A-E].
So we have 5 * 9999 = 49995 symbols in the alphabet.
The maximum length of a word over this alphabet is 16.
For each word we can build the set of combinations of its symbols.
For example, the word "abcd" will have the following combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As the symbols are sorted within words, we have only 2^N - 1 combinations (15 for 4 symbols).
For a 16-symbol word there are 2^16 - 1 = 65535 combinations.
So we build an additional index-organized table like this one:
create table spec_ndx (combination varchar(100), original_value varchar(100))
Performance will be excellent, at the price of overhead: in the worst case there will be 65535 "index" records for each record in the original table.
So for a 100-million-row table we would get roughly a 6-trillion-row index table.
But if we have short values size of "special index" reduces drastically.

How can I add a "group" of rows and increment their "group id" in MySQL?

I have the following table my_table with primary key id set to AUTO_INCREMENT.
id group_id data_column
1 1 'data_1a'
2 2 'data_2a'
3 2 'data_2b'
I am stuck trying to build a query that will take an array of data, say ['data_3a', 'data_3b'], and appropriately increment the group_id to yield:
id group_id data_column
1 1 'data_1a'
2 2 'data_2a'
3 2 'data_2b'
4 3 'data_3a'
5 3 'data_3b'
I think it would be easy to do using a WITH clause, but this is not supported in MySQL. I am very new to SQL, so maybe I am organizing my data the wrong way? (A group is supposed to represent a group of files that were uploaded together via a form. Each row is a single file and the data column stores its path.)
The "pseudo-SQL" code I had in mind was:
INSERT INTO my_table (group_id, data_column)
VALUES ($NEXT_GROUP_ID, 'data_3a'), ($NEXT_GROUP_ID, 'data_3b')
LETTING $NEXT_GROUP_ID = (SELECT MAX(group_id) + 1 FROM my_table)
where the made up 'LETTING' clause would only evaluate once at the beginning of the query.
You can start a transaction, do a SELECT MAX(group_id) + 1, and then do the inserts. Locking the table so others can't change (insert into) it would also be possible.
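A minimal sketch of the transactional variant (assuming InnoDB; the SELECT ... FOR UPDATE keeps two concurrent uploads from picking the same group_id, at the cost of locking broadly):
START TRANSACTION;
SELECT COALESCE(MAX(group_id), 0) + 1 INTO @next_group_id
FROM my_table FOR UPDATE;
INSERT INTO my_table (group_id, data_column)
VALUES (@next_group_id, 'data_3a'),
       (@next_group_id, 'data_3b');
COMMIT;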
I would rather have a separate table for the groups if a group represents files which belong together, especially if you may want to save metadata about the group (like the uploading user, the date, etc.). Otherwise (as in this case) you get redundant data, which is bad most of the time.
Alternatively, MySQL does have something like variables. Check out http://dev.mysql.com/doc/refman/5.1/en/set-statement.html

randomizing large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from: http://forums.mysql.com/read.php?24,163940,262235#msg-262235 But it's not exactly random, and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random IDs and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of data with records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which IDs I can use that have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey Status StatusPkey
1 A 1
2 A 2
3 B 1
4 B 2
5 C 1
... C ...
n C m (where m = # of C statuses)
When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit though, since calculating the StatusPkey could get expensive.
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT ROUND(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
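A sketch of how the two steps can be chained inside MySQL (status = 'X' stands in for your 5% condition; LIMIT/OFFSET cannot take a user variable directly, so the second step goes through a prepared statement):
SELECT ROUND(RAND() * (SELECT COUNT(*) FROM MyTable WHERE status = 'X'))
  INTO @rand_offset;

SET @sql = CONCAT('SELECT * FROM MyTable WHERE status = ''X'' LIMIT 1 OFFSET ', @rand_offset);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;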
Not every problem must be solved in a single SQL query.