How do I run a query to convert a price column from text to numbers (BIGINT) in a database? Is it possible to do this kind of processing using only SQL queries?
Price
————-
2 millions 5 hundreds thousands
52 thousands
3 hundreds 25 thousands
10 millions 30 thousands
UPDATE
As you commented, I guess this task is better done in another language rather than in SQL.
Can you do it? Sure you can, but you really shouldn't. Here's an (abbreviated) example of what it might look like:
SELECT CAST(CONCAT(
    CASE WHEN priceData.millions IS NOT NULL
        THEN LPAD(priceData.millions, 3, '0')
        ELSE '000' END,
    CASE WHEN priceData.thousands IS NOT NULL
        THEN LPAD(priceData.thousands, 3, '0')
        ELSE '000' END
    -- additional branches for hundreds, tens, whatever else you want to process
) AS BIGINT) price
FROM (
    SELECT
        REPLACE(
            REGEXP_SUBSTR(t.price, '[0-9]* millions'),
            ' millions',
            '') AS millions,
        REPLACE(
            REGEXP_SUBSTR(t.price, '[0-9]* thousands'),
            ' thousands',
            '') AS thousands,
        REPLACE(
            REGEXP_SUBSTR(t.price, '[0-9]* hundreds'),
            ' hundreds',
            '') AS hundreds
        -- additional branches for tens, whatever else you want to process
    FROM your_table t
) AS priceData
This is a minimal proof of concept, but it will require a lot of building out before it works. It also makes a lot of assumptions about your data, will be insanely slow, and will make whoever comes to maintain your code want to gouge their eyes out. I mainly provided it to show you just how sad it will make you to put it all together.
The real solution would be to store the numbers as BIGINT before the data gets to the db at all. If you can't do that, I would do this kind of processing programmatically in whatever it is that's querying your db.
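As a rough illustration of doing the conversion in application code instead, here is a minimal Python sketch. It assumes the sample strings mean what they appear to mean (for example, that "2 millions 5 hundreds thousands" stands for 2,500,000), and the function name parse_price and the SCALES table are made up for this example.

```python
import re

# Multipliers for the unit words seen in the sample data (assumed spellings).
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def parse_price(text: str) -> int:
    """Convert strings such as '3 hundreds 25 thousands' to an integer."""
    total = 0    # value of the groups that are already complete
    current = 0  # value of the group currently being read
    for token in re.findall(r"[0-9]+|[a-z]+", text.lower()):
        if token.isdigit():
            current += int(token)
            continue
        word = token.rstrip("s")  # "millions" -> "million"
        if word == "hundred":
            current *= 100
        elif word in SCALES:
            total += current * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unrecognised token: {token!r}")
    return total + current


for sample in ("2 millions 5 hundreds thousands", "52 thousands",
               "3 hundreds 25 thousands", "10 millions 30 thousands"):
    print(sample, "->", parse_price(sample))
```

The result can then be written to a BIGINT column with an ordinary parameterised UPDATE. Unknown words raise an error rather than being silently ignored, so unexpected data shows up early.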
I want to count how many columns in a row are not NULL.
The table is quite big (more than 100 columns), so I would rather not do it manually or with PHP (which I don't use), as in the approach from "Counting how many MySQL fields in a row are filled (or empty)".
Is there a simple query I can use in a select like SELECT COUNT(NOT ISNULL(*)) FROM big_table;
Thanks in advance...
Agree with comments above:
There is something wrong in the data since there is a need for such analysis.
You can't completely make it automatic.
But I have a recipe for you for simplifying the process. There are only 2 steps needed to achieve your aim.
Step 0. In step 1 you'll need the name of your table's schema. Normally the devs know which schema the table resides in, but still... Here is how you can find it:
select *
from information_schema.tables
where table_name = 'test_table';
Step 1. First of all, you need the list of columns. The list by itself won't help you much, but it is all we need to build the SELECT statement, right? So let's make the database prepare the SELECT statement for us:
-- for wide tables, raise the limit first (the MySQL default of 1024 truncates the result):
-- set session group_concat_max_len = 1000000;
select concat('select (length(concat(',
              group_concat(concat('ifnull(', column_name, ', ''###'')') order by ordinal_position separator ','),
              ')) - length(replace(concat(',
              group_concat(concat('ifnull(', column_name, ', ''###'')') order by ordinal_position separator ','),
              '), ''###'', ''''))) / length(''###'')
              from test_table')
from information_schema.columns
where table_schema = 'test'
and table_name = 'test_table';
Step 2. Execute the statement you got in step 1.
select (length(concat(.. list of cols ..)) -
length(replace(concat(.. list of cols .. ), '###', ''))) / length('###')
from test_table
The select looks tricky but it's simple: first replace all nulls with some string that you're sure will never appear in those columns. I usually replace nulls with "###"; that is what all those "ifnull"s are for.
Next, count the characters with "length". In my case it was 14.
After that, replace all "###" with empty strings and count the length again. It's 11 now. That is what the "length(replace(" combination is for.
Last, divide the difference (14 - 11) by the length of the replacement string ("###", i.e. 3). You'll get 1, which is exactly the number of nulls in my test row.
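The arithmetic can be checked outside the database. The following Python sketch mirrors the same sentinel trick on a single row; the row values are invented so that the lengths come out as 14 and 11, and count_nulls is a hypothetical helper, not part of the recipe itself.

```python
SENTINEL = "###"  # must never occur in real column values


def count_nulls(row):
    """Count None values in a row using the length-difference trick."""
    # ifnull(col, '###') for every column, then concat(...)
    joined = "".join(SENTINEL if value is None else str(value) for value in row)
    # replace(..., '###', '')
    stripped = joined.replace(SENTINEL, "")
    return (len(joined) - len(stripped)) // len(SENTINEL)


row = ("abc", None, "defghijk")  # 11 characters of data plus one NULL
print(count_nulls(row))          # (14 - 11) / 3 = 1
```

In application code a direct count such as sum(v is None for v in row) is simpler; the sentinel version is shown only because it matches what the generated SQL does.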
Here's a test case you can play with
Do not hesitate to ask if needed
I have a table with 37 columns of various types: INTs, VARCHARs and 2 LONGTEXT columns. The client wants a search that scans the table and finds the rows that match.
However, I'm having trouble with this. Here is what I've done so far:
1) My initial idea was a massive set of OR conditions. However, I was put off by the fact that I would need to supply the search data ~30 times, which is massive repetition, and I'm sure there's a better way than this.
Code:
SELECT
MemberId,
MemNameTitle,
MemSName,
MemFName,
MemPostcode,
MemEmail
FROM
MemberData
WHERE
MemFName LIKE CONCAT('%',?,'%') OR
MemSName LIKE CONCAT('%',?,'%') OR
MemAddr LIKE CONCAT('%',?,'%') OR
MemPostcode LIKE CONCAT('%',?,'%') OR
MemEmail LIKE CONCAT('%',?,'%')
...etc...
Etc. Etc. That's a massive set of ORs and really unwieldy.
2) I thought I'd try to rework it by placing all the columns in brackets and supplying the search value only once. I saw a similar piece of code on SO; I'm not sure it was working correctly, but it was an inspiration, at least:
SELECT
MemberId,
MemNameTitle,
MemSName,
MemFName,
MemPostcode,
MemEmail
FROM
MemberData
WHERE
(MemNameTitle OR
MemFName OR
MemSName OR
MemAddr OR
MemPostcode OR
MemEmail OR
MemSkype OR
MemLinkedIn OR
MemFacebook OR
MemEmailTwo ...etc...) LIKE CONCAT('%',?,'%')
GROUP BY
MemberId
This code executes without apparent error but always returns no result, as in 0 rows returned. I can't see why at first glance.
3) So, with some research on SO I found a rearrangement using the IN keyword, but according to a previous question on here, "Is it possible to use LIKE and IN for a WHERE statment?", it appeared not to work.
What I wanted to get was something like:
SELECT
MemberId,
MemNameTitle,
MemSName,
MemFName,
MemPostcode,
MemEmail
FROM
MemberData
WHERE
MemNameTitle,
MemFName,
MemSName,
MemAddr,
MemPostcode,
MemEmail,
MemSkype,
MemLinkedIn,
...etc ...
MemFax,
MemberStatus,
CommitteeNotes,
SecondAddr,
SecondAddrPostcode IN (LIKE CONCAT('%',?,'%') )
This is crude syntax, but I hope you get the idea: I want to search many fields for the same value using a LIKE '%...%' clause. The fields are variously TEXT/VARCHAR types.
4) I then looked into MySQL full-text searches, but this quickly became useless as this is only applied to TEXT type rather than VARCHAR type searching. I considered changing each VARCHAR column to a TEXT column before each search, but figured that would also be relatively processor intensive, and it seemed illogical for a search that many people must want to do.
So, I'm out of ideas... Can you help me search this way? Or suggest why my code in attempt 2 always returns zero rows?
Cheers
Additional Work:
5) I have been looking at rearranging the IN clause statement and came up with this:
SELECT *(lazy typing!) WHERE
CONCAT('%',?,'%') IN
(MemNameTitle,
MemFName,
MemSName,
MemAddr,
MemPostcode,
MemEmail,
MemSkype,
...etc...
CommitteeNotes,
SecondAddr,
SecondAddrPostcode)
GROUP BY MemberId
This does return a result, but it is always the last row of the table, so it doesn't work.
Solution 1:
From Ravinder, using CONCAT_WS for all the fields - this works in my case, although something in my mind does find CONCATs somewhat ugly, but oh well.
SELECT * FROM MemberData WHERE
CONCAT_WS('<*!*>',
MemNameTitle, MemFName, MemSName,
MemAddr, MemPostcode, MemEmail,
MemSkype, MemLinkedIn,
...etc...
MemberStatus, CommitteeNotes, SecondAddr,
SecondAddrPostcode)
LIKE CONCAT('%',?,'%')
GROUP BY MemberId
The table will eventually have a few thousand rows, and I am a little worried that as this query will concat 24 columns for each row on the table for each search, that this could easily become quite expensive and inefficient (ie slow), so if anyone has any ways of either
i) searching without CONCAT columns or
ii) making this solution faster/ more efficient
please share!!
There is a workaround solution. But I feel this is too crude and performance may not be that good.
where
concat_ws( "<*!*>", col1, col2, col3, ... ) like concat( '%', ?, '%' )
Here, I used '<*!*>' just as an example separator.
You have to use a separator string which, you are sure,
is not part of the placeholder value, and
is not part of the generated string when 2 or more columns are concatenated.
Refer to Documentation:
MySQL: CONCAT_WS(separator,str1,str2,...)
CONCAT_WS does not skip empty strings, but it does skip NULL values.
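To see why the separator matters, here is a small Python sketch that imitates the documented CONCAT_WS behaviour (NULLs skipped, empty strings kept). The helper names and the sample row are invented for illustration.

```python
def concat_ws(separator, *values):
    """Imitate MySQL CONCAT_WS: skip None (NULL) values, keep empty strings."""
    return separator.join(str(v) for v in values if v is not None)


def row_matches(needle, row, separator):
    """Case-insensitive substring search over the concatenated row."""
    return needle.lower() in concat_ws(separator, *row).lower()


row = ("John", "Smith", None, "")

# Without a separator, "nS" is found across the John|Smith column boundary.
print(row_matches("nS", row, ""))        # True (a false positive)
# With an unlikely separator the boundary can no longer match.
print(row_matches("nS", row, "<*!*>"))   # False
print(row_matches("smi", row, "<*!*>"))  # True
```

The separator only helps as long as the search term itself can never contain it, which is why an unusual string such as '<*!*>' is a reasonable choice.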
One rather ugly way to do it would be
SELECT
MemberId,
MemNameTitle,
MemSName,
MemFName,
MemPostcode,
MemEmail
FROM
MemberData
WHERE
CONCAT(
MemNameTitle,
MemFName,
MemSName,
MemAddr,
MemPostcode,
MemEmail,
MemSkype,
MemLinkedIn,
...etc ...
MemFax,
MemberStatus,
CommitteeNotes,
SecondAddr,
SecondAddrPostcode) LIKE CONCAT('%',?,'%')
So you first concatenate all the columns you want to search and then look in the resulting big string for your text.
But I guess you can see that this is far from performant or optimal. Then again, since you are using the % sign at the beginning and end of your search term, you couldn't use any indexes anyway.
Warning:
Be aware that this CONCAT may fail in case one of your columns contains a null value, because then the whole CONCAT will return null!
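The NULL problem is easy to reproduce. The sketch below uses Python's sqlite3 module purely as a stand-in: SQLite's || operator, like MySQL's CONCAT, yields NULL as soon as one operand is NULL. The three-column table is a cut-down, hypothetical version of MemberData.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MemberData (MemberId INTEGER, MemFName TEXT, MemSkype TEXT)"
)
conn.executemany(
    "INSERT INTO MemberData VALUES (?, ?, ?)",
    [(1, "Alice", "alice.sk"), (2, "Bob", None)],  # Bob has no Skype name
)


def search(sql, term):
    """Run a one-parameter search and return the matching member ids."""
    return [row[0] for row in conn.execute(sql, (term,))]


# Plain concatenation: one NULL column makes the whole string NULL,
# so member 2 can never match.
naive = ("SELECT MemberId FROM MemberData "
         "WHERE (MemFName || MemSkype) LIKE '%' || ? || '%'")

# Wrapping each column in COALESCE(col, '') keeps the row searchable.
safe = ("SELECT MemberId FROM MemberData "
        "WHERE (COALESCE(MemFName, '') || COALESCE(MemSkype, '')) "
        "LIKE '%' || ? || '%'")

print(search(naive, "Bob"))  # []
print(search(safe, "Bob"))   # [2]
```

In MySQL the equivalent fix is IFNULL(col, '') or COALESCE(col, '') around each argument, or CONCAT_WS, which skips NULLs on its own.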
I have a table with 17.6 million rows in a MyISAM database.
I want to search for an article number in it, but the result must not depend on special characters such as dots, commas and others.
I'm using a query like this:
SELECT * FROM `table`
WHERE
replace(replace(replace( replace( `haystack` , ' ', '' ),
'/', '' ), '-', '' ), '.', '' )
LIKE 'needle'
This method is very, very slow. The table has an index on haystack, but EXPLAIN shows the query cannot use it. That means the query must scan 17.6 million rows, which takes 3.8 sec.
The query runs multiple times per page (10-15x), so the page loads extremely slowly.
What should I do? Is it a bad idea to use replace inside the query?
As you do the replace on the actual data in the table, MySQL can't use the index, as it doesn't have any indexed data of the result of the replace which it needs to compare to the needle.
That said, if your replace settings are static, it might be a good idea to denormalize the data and to add a new column like haystack_search which contains the data with all the replaces applied. This column could be filled during an INSERT or UPDATE. An index on this column can then effectively be used.
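A minimal sketch of that denormalisation, using Python with sqlite3 as a stand-in for MySQL. The table name articles, the index name and the sample article numbers are assumptions made for this example; only the haystack and haystack_search column names come from the text above.

```python
import re
import sqlite3


def normalize(value):
    """Remove the characters the question strips with nested REPLACE calls."""
    return re.sub(r"[ /.\-]", "", value)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (haystack TEXT, haystack_search TEXT)")
conn.execute("CREATE INDEX idx_haystack_search ON articles (haystack_search)")


def insert_article(haystack):
    # The normalised copy is written once, at INSERT time.
    conn.execute(
        "INSERT INTO articles VALUES (?, ?)", (haystack, normalize(haystack))
    )


for number in ("AB-12/34", "AB 12.35", "XY-99"):
    insert_article(number)


def find(needle):
    # The needle is normalised in application code, so the indexed column
    # is compared as-is and no function is applied to it in the query.
    cur = conn.execute(
        "SELECT haystack FROM articles WHERE haystack_search = ?",
        (normalize(needle),),
    )
    return [row[0] for row in cur]


print(find("AB1234"))    # ['AB-12/34']
print(find("AB.12 35"))  # ['AB 12.35']
```

An UPDATE of haystack has to refresh haystack_search as well; in MySQL that could be done in the application, in a trigger, or with a generated column where available.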
Note that you probably want to use % in your LIKE query as else it is effectively the same as a normal equal comparison. Now, if you use a searchterm like %needle% (that is with a variable start), MySQL again can't use the index and falls back to a table scan as it only can use the index if it sees a fixed start of the search term, i.e. something like needle%.
So in the end, you might end up having to tune your database engine so that it can hold the table in memory. Another alternative with MyISAM tables (or, with MySQL 5.6 and up, also with InnoDB tables) is to use a fulltext index on your data, which again allows rather efficient searching.
It's "bad" to apply functions to the column as it will force a scan of the column.
Perhaps this is a better method:
SELECT list
, of
, relevant
, columns
, only
FROM your_table
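WHERE haystack REGEXP '^two[ /.-]needles$'
In this scenario we are searching for "two needles", where the character between the words can be any of the characters within the square brackets, i.e. "two needles", "two/needles", "two-needles" or "two.needles". Note that MySQL's LIKE only understands the % and _ wildcards (bracketed character lists are a SQL Server feature), so REGEXP is used here, and a REGEXP comparison cannot use the index either.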
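The idea of accepting any one of several separator characters between the two words can be tried out quickly with a regular expression. This is a Python sketch with a made-up helper name; it only demonstrates the pattern, not how a database would evaluate it.

```python
import re

# One character from the set: space, slash, dot or hyphen.
SEPARATOR_PATTERN = re.compile(r"two[ /.\-]needles")


def matches(value):
    """True if the whole value is 'two', one allowed separator, 'needles'."""
    return SEPARATOR_PATTERN.fullmatch(value) is not None


for candidate in ("two needles", "two/needles", "two-needles",
                  "two.needles", "two_needles", "twoneedles"):
    print(candidate, matches(candidate))
```

Only the first four candidates match; a value with no separator at all is rejected, which differs from the REPLACE approach in the question, where separators are optional.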
You could try using LENGTH on the column; I'm not sure if it gives a better effect. Also, when using LIKE you should use the % wildcard.
SELECT * FROM `table`
WHERE
haystack LIKE 'needle%' AND
LENGTH(haystack) - LENGTH(REPLACE(haystack,'/','')) = 0 AND
LENGTH(haystack) - LENGTH(REPLACE(haystack,'-','')) = 0 AND
LENGTH(haystack) - LENGTH(REPLACE(haystack,'.','')) = 0;
If the haystack is exactly needle then do this
SELECT * FROM `table`
WHERE
haystack='needle';
I have a situation:
I have a database (MySQL) which contains products and their codes like this
BLACK SUGAR BS 709
HOT SAUCE AX889/9
TOMY 8861
I got an Excel spreadsheet, which I converted to CSV, that contains prices for the products. It consists of 2 columns, code and price, like this:
BS709 23.00
AX 889 /9 10.89
8861 1.69
I made a script to update the product prices by searching the database for the respective product code, using a FOREACH and a LIKE '%code%' query.
FOREACH row in CSV, search the database using "WHERE product_code LIKE '%code%'".
This is of course a primitive and not very successful way of updating the prices, because the codes in the CSV are not an exact match (in syntax) for those in the database, so if I have two products in the DB containing BS709 in their code (e.g. BS70923), I get multiple matches.
Is there a better way of doing this?
You could trim the columns of spaces and other characters using MySQL replace() before comparing. This will return all exact matches, regardless of any spaces contained.
SELECT * FROM table WHERE REPLACE( product_code, ' ', '' ) LIKE 'code'
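The same idea, stripping spaces from both sides and then requiring an exact match, can be sketched in the update script itself. This Python example assumes the product code is available on its own (without the product name); normalize_code and match_prices are hypothetical helpers.

```python
def normalize_code(code):
    """Drop all whitespace and upper-case the code."""
    return "".join(code.split()).upper()


def match_prices(db_codes, csv_rows):
    """Pair CSV prices with database codes by exact normalised match."""
    by_key = {}
    for code in db_codes:
        by_key.setdefault(normalize_code(code), []).append(code)

    updates, unmatched = {}, []
    for csv_code, price in csv_rows:
        hits = by_key.get(normalize_code(csv_code), [])
        if len(hits) == 1:
            updates[hits[0]] = price
        else:
            unmatched.append(csv_code)  # none or several: needs a human look
    return updates, unmatched


db_codes = ["BS 709", "AX889/9", "8861", "BS70923"]
csv_rows = [("BS709", "23.00"), ("AX 889 /9", "10.89"),
            ("8861", "1.69"), ("ZZ1", "5.00")]

updates, unmatched = match_prices(db_codes, csv_rows)
print(updates)    # {'BS 709': '23.00', 'AX889/9': '10.89', '8861': '1.69'}
print(unmatched)  # ['ZZ1']
```

Because the comparison is an exact match on the normalised code, BS709 no longer collides with BS70923, and anything that cannot be matched unambiguously is reported instead of being updated.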
Given your examples, I would recommend removing all spaces from both, and then just looking for when the beginning or end of a code matches exactly:
where replace(e.code, ' ', '') like concat(replace(db.code, ' ', ''), '%') or
replace(e.code, ' ', '') like concat('%', replace(db.code, ' ', '')) or
replace(db.code, ' ', '') like concat(replace(e.code, ' ', ''), '%') or
replace(db.code, ' ', '') like concat('%', replace(e.code, ' ', ''));
This may not work for the specific case when one code is a prefix of another.
In any case, if the product codes in a spreadsheet are different from the product codes in the database, I think you have bigger problems. If you cannot really fix the spreadsheets, I would recommend that you manually/semi-automatically create a synonyms table in the database. This would have the Excel product code in one column and the correct product code in the other. Then you can do the lookup just by joining this together.
Yes. That is work. But probably less work than struggling with this problem and getting poor results that have to be repeatedly updated.
First off, there seems to be no way to get an exact match using a full-text search. This seems to be a highly discussed issue when using full-text search, and there are lots of different solutions to achieve the desired result; however, most seem very inefficient. Since I'm forced to use full-text search due to the volume of my database, I recently had to implement one of these solutions to get more accurate results.
I could not use the ranking results from the full-text search because of how it works. For instance, if you searched for a movie called Toy Story and there was also a movie called The Story Behind Toy Story, that would come up instead of the exact match because it found the word Story twice as well as Toy.
I do track my own rankings, which I call "Popularity": each time a user accesses a record, the number goes up. I use this data point to weight my results to help determine what the user might be looking for.
I also have the issue that I sometimes need to fall back to a LIKE search and return something other than an exact match, i.e. searching Goonies should return The Goonies (most popular result).
So here is an example of my current stored procedure for achieving this:
DECLARE @Title varchar(255)
SET @Title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @Title2 varchar(255)
SET @Title2 = REPLACE(@Title, '"', '')
--get top 100 results using full-text search and sort them by popularity
SELECT TOP(100) id, title, popularity AS Weight INTO #TempTable FROM movies WHERE CONTAINS(title, @Title) ORDER BY [Weight] DESC
--check if exact match can be found
IF EXISTS(SELECT * FROM #TempTable WHERE Title = @Title2)
--return exact match
SELECT TOP(1) * FROM #TempTable WHERE Title = @Title2
ELSE
--no exact match found, try using like with wildcards
SELECT TOP(1) * FROM #TempTable WHERE Title LIKE '%' + @Title2 + '%'
DROP TABLE #TempTable
This stored procedure is executed about 5,000 times a minute, and crazily enough it's not bringing my server to its knees. But I really want to know whether there is a more efficient approach to this. Thanks.
You should use full text search CONTAINSTABLE to find the top 100 (possibly 200) candidate results and then order the results you found using your own criteria.
It sounds like you'd like to ORDER BY
exact match of the phrase (=)
the fully matched phrase (LIKE)
higher value for the Popularity column
the Rank from the CONTAINSTABLE
But you can toy around with the exact order you prefer.
In SQL that looks something like:
DECLARE @title varchar(255)
SET @title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @title2 varchar(255)
SET @title2 = REPLACE(@title, '"', '')
SELECT
    m.ID,
    m.title,
    m.Popularity,
    k.Rank
FROM Movies m
INNER JOIN CONTAINSTABLE(Movies, title, @title, 100) as [k]
    ON m.ID = k.[Key]
ORDER BY
    CASE WHEN m.title = @title2 THEN 0 ELSE 1 END,
    CASE WHEN m.title LIKE '%' + @title2 + '%' THEN 0 ELSE 1 END,
    m.popularity desc,
    k.rank desc
See SQLFiddle
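The ordering criteria listed in this answer can also be expressed as a sort key, which may make the priority of each rule easier to see. This is a Python sketch over invented candidate rows of (title, popularity, full-text rank); it is not how SQL Server evaluates the query.

```python
def rank_candidates(candidates, phrase):
    """Sort (title, popularity, fulltext_rank) tuples by the four criteria."""
    wanted = phrase.lower()

    def sort_key(candidate):
        title, popularity, fulltext_rank = candidate
        lowered = title.lower()
        return (
            lowered != wanted,      # 1. exact matches first
            wanted not in lowered,  # 2. then titles containing the phrase
            -popularity,            # 3. then higher popularity
            -fulltext_rank,         # 4. then higher full-text rank
        )

    return sorted(candidates, key=sort_key)


movies = [
    ("The Story Behind Toy Story", 900, 80),
    ("Toy Story", 500, 60),
    ("Toy Story 2", 700, 70),
]
for title, _, _ in rank_candidates(movies, "Toy Story"):
    print(title)
```

With these sample rows the exact title comes first even though it is the least popular, and the remaining two are ordered by popularity. Searching "Goonies" against a row titled "The Goonies" falls through to the second criterion, which is the fallback the question asks for.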
This will give you the movies that contain the exact phrase "Toy Story", ordered by their popularity.
SELECT
m.[ID],
m.[Popularity],
k.[Rank]
FROM [dbo].[Movies] m
INNER JOIN CONTAINSTABLE([dbo].[Movies], [Title], N'"Toy Story"') as [k]
ON m.[ID] = k.[Key]
ORDER BY m.[Popularity]
Note the above would also give you "The Goonies Return" if you searched "The Goonies".
I've got the feeling you don't really like the fuzzy part of the full-text search but you do like the performance part.
Maybe this is a path: if you insist on getting the EXACT match before a weighted match, you could try hashing the value. For example, take 'Toy Story', bring it to lowercase (toy story), hash it into something like 4de2gs5sa (with whatever hash you like), and perform the search on the hash.
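A minimal sketch of the hashing idea in Python. SHA-1 is an arbitrary choice here, collapsing repeated whitespace is an extra assumption on top of the lowercasing described above, and the in-memory dictionary merely stands in for an indexed hash column.

```python
import hashlib


def title_key(title):
    """Lower-case the title, collapse whitespace, and hash the result."""
    normalized = " ".join(title.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


movies = ["Toy Story", "The Story Behind Toy Story", "The Goonies"]

# Stand-in for a table with an indexed column holding title_key(title).
index = {title_key(title): title for title in movies}


def exact_lookup(query):
    """Return the stored title whose normalised form equals the query's."""
    return index.get(title_key(query))


print(exact_lookup("toy  STORY"))  # Toy Story
print(exact_lookup("Goonies"))     # None, so fall back to the fuzzy search
```

The lookup is an equality test on a fixed-length value, so it can be served by an ordinary index; anything that misses falls through to the full-text query.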
In Oracle I've used UTL_MATCH for similar purposes. (http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/u_match.htm)
Even though using the Jaro Winkler algorithm, for instance, might take a while if you compare the title column from table 1 and table 2, you can improve performance if you partially join the 2 tables. I have in some cases compared person names in table 1 with table 2 using Jaro Winkler, but limited the results not just to pairs above a certain Jaro Winkler threshold, but also to names from the 2 tables where the first letter is the same. For instance, I would compare Albert with Aden, Alfonzo and Alberto using Jaro Winkler, but not Albert and Frank (limiting the number of situations where the algorithm needs to be used).
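To make the first-letter pre-filter concrete, here is a self-contained Python sketch. The jaro and jaro_winkler functions follow the usual textbook definitions (prefix of up to 4 characters, scaling factor 0.1) and are not UTL_MATCH itself; similar_names and the 0.8 threshold are assumptions for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                break
    m = sum(used1)  # number of matching characters
    if m == 0:
        return 0.0
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def similar_names(name, candidates, threshold=0.8):
    """Score only candidates sharing the first letter, best match first."""
    if not name:
        return []
    key = name.lower()
    scored = [
        (jaro_winkler(key, c.lower()), c)
        for c in candidates
        if c and c[0].lower() == key[0]  # cheap pre-filter before scoring
    ]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


for score, candidate in similar_names("Albert", ["Aden", "Alfonzo", "Alberto", "Frank"]):
    print(candidate, round(score, 3))
```

Frank is never scored at all because of the first-letter check, and of the remaining names only Alberto clears the threshold. The trade-off is that a typo in the first letter will be missed.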
Jaro Winkler may actually be suitable for movie titles as well. Although you are using SQL server (can't use the utl_match package) it looks like there is a free library called "SimMetrics" which has the Jaro Winkler algorithm among other string comparison metrics. You can find detail on that and instructions here: http://anastasiosyal.com/POST/2009/01/11/18.ASPX?#simmetrics