Need MySQL Query to Chop String and Add Serial Number (Letter) to End, Where Duplicates

Let's say we have some 10-character skus like this:
AB1234ZYXW
AB1234ZYXN
AB1234ZYXP
AB1234ZYXR
ZZ1234ZYXR
But we need them to be 8 characters. Chopping them at 8 would make them non-unique (except the last one).
The non-unique ones would all look like: AB1234ZY
So my solution is to chop one more character off of all of them, giving AB1234Z, then adding a serial number (actually a serial letter): AB1234ZA, AB1234ZB, ...C, ...D.
My first thought was to query the DB, do all the processing in PHP arrays, then send queries back to update. But since there can be 30,000 to process at a time, this would mean 30,000 UPDATE queries (one for each chopped sku).
If it could be done with a single MySQL statement it would be much faster.
Any ideas?
EDIT:
To add more detail:
Total number of records could be 2,000 - 35,000 per batch. With the chopping, it will create groups of duplicates. If each group has fewer than 26 members, then 1 letter of serialization is enough. Otherwise 2 letters (26 x 26 = 676, and it's very unlikely a group would be larger than that). Ideally, the query would take into account the number of duplicates in each group and apply 1 or 2 letters of serialization accordingly. I know it's a lot to ask. jonstjohn's answer looks like a good start. I will test it tomorrow. I haven't used MySQL variables for anything yet but it looks promising.

Try the following:
SET @num = 64;
UPDATE skus SET sku = concat(left(sku, 7),
char(if((@num := @num + 1) <= 90, @num, @num := 65)));
Here's the explanation.
The first line assigns the integer 64 to the MySQL user variable @num. The character code for 'A' is 65 and 'Z' is 90, so we assign one before 'A' since it will be incremented in the update query.
The update query then updates each row of the skus table using the first 7 characters of the sku, plus an incrementing character (A-Z) from the @num variable. When it passes 'Z', it resets itself to 65 (and uses that value, 'A', for that row).
It should be very fast and efficient.
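One caveat: the query above cycles A-Z across the whole table, not within each group of duplicates. If the serial letter should restart for every distinct 7-character prefix (which is what the question's EDIT describes), a variation that tracks the previous prefix might look like the following. This is an untested sketch: it assumes the table is named skus with a unique sku column, handles at most 26 duplicates per group, and relies on the usual left-to-right evaluation of user variables:
SET @prefix = '', @n = -1;
UPDATE skus
JOIN (
    SELECT sku AS old_sku,
           @n := IF(LEFT(sku, 7) = @prefix, @n + 1, 0) AS n,
           @prefix := LEFT(sku, 7) AS prefix
    FROM skus
    ORDER BY sku
) AS numbered ON skus.sku = numbered.old_sku
SET skus.sku = CONCAT(numbered.prefix, CHAR(65 + numbered.n));
The derived table is materialized first, which is what lets the statement update skus while also reading from it.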


When comparing current row with previous row the query is too slow

When subtracting the previous row from the current row the query is too slow. Is there a more efficient way to do this?
I am trying to create a data filter which can distinguish events that occur sequentially from those that do not. I have a table of machine operational data, 'source', which is ordered chronologically. Using a WHERE clause I filter out the data which is of less relevance to this particular analysis. The remaining data is inserted into a new table, 'filtered'. Using the inserted ID numbers from 'source' I compare each row with its preceding row to find the difference in value: if the difference is 1 then the events have occurred in sequence, and if the difference is null then they have not.
My problem is the length of time it takes to compare a row with the previous row. I have reduced my data volume to just 2.5% (275,000 rows) of what its full volume will be, and the query takes 3012 seconds according to the MySQL Workbench action output. I have experimented with structuring the query differently but have ultimately reached dead ends. So my question is: is there a more efficient way to compare a row with its previous row?
OK – here are some more details.
/*First I create the table for the filtered data */
drop table if exists filtered_dta;
create table filtered_dta
(
ID int (11) not null auto_increment,
IDx1 int (11),
primary key (ID)
);
/* Then I insert the filtered data */
insert into filtered_dta (IDx1)
select seq from source
WHERE range_value < -1.75
and range_value > -5 ;
/* Then I compare each row with its previous */
select t1.ID, t1.IDx1,(t1.IDx1-t2.IDx1)
as seq_value
from filtered_dta t1
left outer join filtered_dta t2
on t1.IDx1 = t2.IDx1+1
order by IDx1
;
Here are sample tables.
Table - filtered_dta:
| ID | IDx1 |
|  1 |    3 |
|  2 |    4 |
|  3 |    7 |
|  4 |   12 |
|  5 |   13 |
|  6 |   14 |
Results:
| ID | IDx1 | seq_value |
|  1 |    3 | null      |
|  2 |    4 | 1         |
|  3 |    7 | null      |
|  4 |   12 | null      |
|  5 |   13 | 1         |
|  6 |   14 | 1         |
A full data set from the source table is expected to be between 3 and 10 million rows. The database will create and use about 50 tables. This database is being used as a back end engine for simulation software which does not have the capacity to process this amount of data and give an appropriate analysis of the system which the data represents.
I have spent some time on the issue and have come across the following:
It may be that the find_seq table is created with MyISAM and requires converting to an InnoDB table. I tried setting the default engine to InnoDB but saw no noticeable difference.
This question was similar in its problem of a slow query, 'MySQL query painfully slow on large data', but its issue lay in having a function in a WHERE clause; from my action output I can see my WHERE clause is not the slow part.
I would appreciate any input anyone may have on this. Also, I am not a proficient user of MySQL, so if possible please give details.
Kind regards.
You can use something like this template to identify sequential "islands" without a self-join:
SELECT @island := @island + IF(seqId <> @lastSeqId + 1, 1, 0) AS island
     , orderingQ.[fieldsYouWant]
     , @lastSeqId := seqId
FROM (
    SELECT [fieldsYouWant], [sequentialIdentifier] AS seqId
    FROM [theTable] AS t
       , (SELECT @island := 0, @lastSeqId := [somethingItCannotBe]) AS init_dnr -- Initializes variables, do not reference
    WHERE [filteringConditionsMet]
    ORDER BY [orderingCriteria]
) AS orderingQ
;
I tried keeping it as generic as possible, but you'll note I had to revert to the assumption that seqId was numeric and expected to increment by one. Conditions in the island calculation can be much more complicated if needed (for cases such as where (A, 1), (A, 2), (B, 3) should be two islands based on the sequence not being defined by a single value).
You can take this template further, to identify "island" boundaries and sizes, by simply making the above query a subquery of something like:
SELECT island, MIN(seqId), MAX(seqId), COUNT(seqId)
FROM ([above query]) AS islandQ
GROUP BY island
;
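Applied back to the question's own tables, a concrete instantiation might look like the following (column names seq and range_value are taken from the post; treat it as an untested sketch, and note that -999999 just stands in for a value seq can never take):
SELECT island, MIN(seqId) AS first_seq, MAX(seqId) AS last_seq, COUNT(seqId) AS island_size
FROM (
    SELECT @island := @island + IF(seq <> @lastSeqId + 1, 1, 0) AS island
         , seq AS seqId
         , @lastSeqId := seq
    FROM source
       , (SELECT @island := 0, @lastSeqId := -999999) AS init_dnr -- initializes variables, do not reference
    WHERE range_value < -1.75 AND range_value > -5
    ORDER BY seq
) AS islandQ
GROUP BY island
;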

MySQL - Counting rows vs setting up a counter

I have 2 tables: posts<id, user_id, text, votes_counter, created> and votes<id, post_id, user_id, vote>. Here vote can be either 1 (upvote) or -1 (downvote). Now if I need to fetch the total votes (upvotes - downvotes) on a post, I can do it in 2 ways.
Use count(*) to count the number of upvotes and downvotes on that post from the votes table and then do the maths.
Set up a counter column votes_counter and increment or decrement it every time a user upvotes or downvotes. Then simply extract that votes_counter.
My question is which one is better and under what conditions. By saying conditions, I mean factors like scalability, peak time, et cetera.
From what I know, if I use method 1, for a table with millions of rows, count(*) could be a heavy operation. To avoid that situation, if I use a counter, then during peak time the votes_counter column might get deadlocked with too many users trying to update the counter!
Is there a third way better than both and as simple to implement?
The two approaches represent a common tradeoff between complexity of implementation and speed.
The first approach is very simple to implement, because it does not require you to do any additional coding.
The second approach is potentially a lot faster, especially when you need to count a small percentage of items in a large table.
The first approach can be sped up by well-designed indexes. Rather than searching through the whole table, your RDBMS could retrieve a few records from the index and do the counts using them.
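As an illustration of that point, a composite index on the votes table lets the whole count be answered from the index alone. A sketch using the question's schema (the index name is made up, and this assumes vote is stored as +1/-1):
CREATE INDEX idx_votes_post_vote ON votes (post_id, vote);

SELECT COALESCE(SUM(vote), 0) AS score
FROM votes
WHERE post_id = ?;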
The second approach can become very complex very quickly:
You need to consider what happens to the counts when a user gets deleted
You should consider what happens when the table of votes is manipulated by tools outside your program. For example, merging records from two databases may prove a lot more complex when the current counts are stored along with the individual ones.
I would start with the first approach, and see how it performs. Then I would try optimizing it with indexing. Finally, I would consider going with the second approach, possibly writing triggers to update counts automatically.
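If you do end up at the trigger stage, a minimal sketch for the insert side might look like this (table and column names from the question; a matching AFTER DELETE trigger would be needed if votes can be removed or changed):
DELIMITER //
CREATE TRIGGER votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
BEGIN
    -- vote is +1 or -1, so it can be added to the counter directly
    UPDATE posts SET votes_counter = votes_counter + NEW.vote
    WHERE id = NEW.post_id;
END//
DELIMITER ;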
As this sounds a lot like StackExchange, I'll refer you to this answer on the meta about the database schema used on the site. The votes table looks like this:
Votes table:
Id
PostId
VoteTypeId, one of the following values:
1 - AcceptedByOriginator
2 - UpMod
3 - DownMod
4 - Offensive
5 - Favorite (if VoteTypeId = 5, UserId will be populated)
6 - Close
7 - Reopen
8 - BountyStart (if VoteTypeId = 8, UserId will be populated)
9 - BountyClose
10 - Deletion
11 - Undeletion
12 - Spam
15 - ModeratorReview
16 - ApproveEditSuggestion
UserId (only present if VoteTypeId is 5 or 8)
CreationDate
BountyAmount (only present if VoteTypeId is 8 or 9)
And so, based on that, it sounds like the query for a given post would be run as:
SELECT VoteTypeId FROM Votes WHERE PostId = ? AND VoteTypeId IN (2, 3)
And then based on the value, do the maths:
int score = 0;
for each vote in voteQueryResults:
    if (vote == 2) score++;   // UpMod
    if (vote == 3) score--;   // DownMod
Even with millions of results, this is probably going to be a very fast operation as it's so simple.
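That loop can also be pushed into the database, which saves transferring all the rows to the application. A hedged equivalent, using the schema above:
SELECT SUM(CASE VoteTypeId WHEN 2 THEN 1 WHEN 3 THEN -1 ELSE 0 END) AS score
FROM Votes
WHERE PostId = ? AND VoteTypeId IN (2, 3);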

MySQL - Search for entries with 3 decimal places

I have a table with 7.5 million entries, and while importing some of the data a few of the column breaks got messed up, so the first digit of a column ended up stuck onto the end of the previous column.
For example, a row should say ELWS=123.44 and t2=17.00, but instead it read in as ELWS=123.441 and t2=7.00.
This only happened in a few places.
Is there some way to search for the entries where ELWS ended up with 3 decimal places? Also, all fields are double type.
SELECT * FROM someTable
WHERE (ELWS * 1000) % 10 != 0
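One caveat: since ELWS is a double, ELWS * 1000 may not be exact for values that were meant to have two decimal places. Rounding before taking the modulus is a safer variation (same table and column names as above):
SELECT * FROM someTable
WHERE MOD(ROUND(ELWS * 1000), 10) <> 0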

Is there a possibility to change the order of a string with numeric value

I have some strings in my database. Some of them have numeric values (but in string format, of course). I am displaying those values ordered ascending.
As strings, '10' sorts before '2', for example, which is normal. I am asking if there is any solution to display 10 after 2, without changing the code or the database structure, only the data.
If for example I have to display values from 1 to 10, I will have:
1
10
2
3
4
5
6
7
8
9
What I would like to have is
1
2
3
4
5
6
7
8
9
10
Is there a possibility to add an "invisible" character or string which will be interpreted as greater than 9? If I put a10 instead of 10, the a10 will be at the end, but is there any invisible or less visible character for that?
So, I repeat, I am not looking for a programming or database-structure solution, but for a simple workaround.
You could try to cast the value as a number and then order by it:
select col
from yourtable
order by cast(col AS UNSIGNED)
See SQL Fiddle with demo
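A common MySQL shorthand for the same thing is adding zero to the column, which forces an implicit numeric conversion:
select col
from yourtable
order by col + 0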
You could try padding the front of the data with the correct number of zeroes:
01
02
03
..
10
11
..
99
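Since the question allows changing only the data, this padding can be applied once with LPAD. A sketch, assuming the column is named col as in the answer above and that only digit-only values should be padded (pad to the widest number you expect):
UPDATE yourtable
SET col = LPAD(col, 2, '0')
WHERE col REGEXP '^[0-9]+$';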
Since you have a mixture of numbers and letters in this column - even if not in a single row - what you're really trying to do is a natural sort. This is not something MySQL can do natively. There are some workarounds, however. The best I've come across are:
Sort by length then value.
SELECT
mixedColumn
FROM
tableName
ORDER BY
LENGTH(mixedColumn), mixedColumn;
For more examples see: http://www.copterlabs.com/blog/natural-sorting-in-mysql/
Use a secondary column to use as a sort key that would contain some sort of normalized data (i.e. only numbers or only letters).
CREATE TABLE tableName (mixedColumn VARCHAR(10), sortColumn INT);
INSERT INTO tableName VALUES ('1',1), ('2',2), ('10',3),
('a',4), ('a1',5), ('a2',6), ('b1',7);
SELECT
mixedColumn
FROM
tableName
ORDER BY
sortColumn;
This could get difficult to maintain unless you can figure out a good way to handle the ordering.
Of course if you were able to go outside of the database you'd be able to use natural sort functions from various programming languages.

How to randomly select multiple rows satisfying certain conditions from a MySQL table?

I'm looking for an efficient way of randomly selecting 100 rows satisfying certain conditions from a MySQL table with potentially millions of rows.
Almost everything I've found suggests avoiding the use of ORDER BY RAND(), because of poor performance and scalability.
However, this article suggests ORDER BY RAND() may still be used as a "nice and fast way" to fetch random data.
Based on this article, below is some example code showing what I'm trying to accomplish. My questions are:
Is this an efficient way of randomly selecting 100 (or up to several hundred) rows from a table with potentially millions of rows?
When will performance become an issue?
SELECT user.*
FROM (
SELECT id
FROM user
WHERE is_active = 1
AND deleted = 0
AND expiretime > '.time().'
AND id NOT IN (10, 13, 15)
AND id NOT IN (20, 30, 50)
AND id NOT IN (103, 140, 250)
ORDER BY RAND()
LIMIT 100
)
AS random_users
STRAIGHT_JOIN user
ON user.id = random_users.id
I strongly urge you to read this article. The last segment covers the selection of multiple random rows, and you should notice the SELECT statement in the PROCEDURE described there. That would be the spot where you add your specific WHERE conditions.
The problem with ORDER BY RAND() is that this operation has a complexity of n*log2(n), while the method described in the article I linked has almost constant complexity.
Let's assume that selecting a random row from a table which contains 10 entries, using ORDER BY RAND(), takes 1 time unit:
entries | time units
-------------------------
10 | 1 /* if this takes 0.001s */
100 | 20
1'000 | 300
10'000 | 4'000
100'000 | 50'000
1'000'000 | 600'000 /* then this will need 10 minutes */
And you wrote that you are dealing with a table on the scale of millions.
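For reference, the constant-complexity technique from that article boils down to picking a random number between the table's MIN(id) and MAX(id) and reading the first row at or above it. Roughly (a sketch that assumes fairly evenly distributed ids, since gaps skew the probabilities):
SELECT u.*
FROM user AS u
JOIN (SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id))) AS rid
      FROM user) AS r ON u.id >= r.rid
ORDER BY u.id
LIMIT 1;
Your specific WHERE conditions would then be added, as described in the article.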
I'm afraid no one is going to be able to answer your question with any accuracy. If you really want to know, you'll need to run some benchmarks against your system (ideally not the live one, but an exact copy). Benchmark this solution against a different solution (getting the random rows using PHP, for example) and compare the numbers to what you/your client consider "good performance". Then ramp up your data, trying to keep the distribution of column values as close to real as you can, and see where performance starts to drop off. To be honest, if it works for you now with a bit of headroom, I'd go for it. When (if!) it becomes a bottleneck you can look at it again - or just chuck extra iron at your database...
Preprocess as much as possible
Try something like this (VB-like example):
Dim excluded As New HashSet(Of Integer)({10, 13, 15, 20, 30, 50, 103, 140, 250})
Dim chosen As New HashSet(Of Integer)() ' a set avoids duplicate ids
Dim iRandom As New Random()
Dim iMaxID As Integer = 0 ' ** put your max id here **
While chosen.Count < 100
    Dim RndVal As Integer = iRandom.Next(1, iMaxID + 1)
    If Not excluded.Contains(RndVal) Then chosen.Add(RndVal)
End While
Dim sql As String = String.Format("SELECT * FROM user WHERE is_active = 1 AND deleted = 0 AND expiretime > {0} AND id IN ({1}) LIMIT 100", time(), String.Join(",", chosen))
I didn't check the syntax, but you'll get my drift, I hope.
This will make MySQL read only the records that fit the IN list, and stop when it reaches 100, without the need to preprocess all records first.
Please let me know the elapsed-time difference if you try it. (I'm curious.)