How to make MySQL MATCH...AGAINST use various word separators? - mysql

I have a table with 300K string values. These values contain various word separators, so the data looks like this:
id value
1 A B C
2 A B_C
3 A_B-C
4 A-B-C
Let's say I want to find all four rows containing A and B. This query
SELECT * FROM table WHERE MATCH(value) AGAINST('+A +B' IN BOOLEAN MODE);
will return only one row with space separated values:
1 A B C
Is there a way to make MATCH...AGAINST use other word separators? I tried to use LIKE and it was too slow.

You will probably want to alter your app and schema just a little bit to solve this problem. You have two tasks:
Task 1: Transform your existing data
Assuming you need to keep the source data unchanged:
Step 1: Add a field to your schema, "searchFriendly", same datatype as the source data.
Step 2: Write a script to transform the data you already have. Read the whole data set and do string replaces to turn each separator into a space.
Step 3: Save that transformed data to the new searchFriendly field.
Task 2: Modify the app so that all future database saves/updates on this data also perform the transformation and store the result as well.
Step 1: Find the part of the app that saves these records.
Step 2: Before actually writing the data to the database, perform the transformation.
Step 3: Add the transformed data to your API call to save/update the record, under the searchFriendly field.
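A minimal sketch of both tasks in MySQL, assuming a table named products with the source column value (names here are hypothetical; note too that one-letter tokens like A and B fall below the default full-text minimum word length, which you would also need to lower):
-- Task 1: add the search-friendly column and backfill it.
ALTER TABLE products ADD COLUMN searchFriendly VARCHAR(255);
-- Normalize every separator to a space (add one REPLACE per separator in your data).
UPDATE products
SET searchFriendly = REPLACE(REPLACE(value, '_', ' '), '-', ' ');
-- Full-text index the new column.
ALTER TABLE products ADD FULLTEXT INDEX ft_searchFriendly (searchFriendly);
-- Searches now match regardless of the original separator.
SELECT * FROM products
WHERE MATCH(searchFriendly) AGAINST('+A +B' IN BOOLEAN MODE);
For Task 2, the app then writes the same REPLACE-transformed string to searchFriendly in every INSERT or UPDATE that touches value.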

Related

Having troubles creating a wildcard sql code with multiple inputs

I'm currently working on creating a query that will pull data from a table that is linked on a certain part #. The challenge is that the part #'s in the table have leading zeros. For example, the part number I have is 8456790, but it is stored in our table as 00000008456790. I'm able to get the desired results for one value using the following code:
select ZMATNR, ZLPN
FROM tblZMMGPNXREF
WHERE ZMATNR like ('%8456790%')
I have roughly 8,000 part #'s I want to run this code for but I know the syntax doesn't allow for me to paste all 8,000 parts at once.
Is there a quick way to run this code including all 8,000 part #s?
In most databases casting '00000008456790' to an integer should be enough:
select ZMATNR, ZLPN
FROM tblZMMGPNXREF
WHERE cast(ZMATNR as int) = 8456790
In MySQL it's even easier because of the implicit conversion of '00000008456790' to the integer 8456790 when they are compared:
select ZMATNR, ZLPN
FROM tblZMMGPNXREF
WHERE ZMATNR = 8456790
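If the 8,000 part #s sit in a file or spreadsheet, a hedged sketch (the helper table name is hypothetical) is to load them into a helper table and join, rather than pasting them into the query:
-- Helper table holding the part numbers to look up.
CREATE TABLE parts_to_find (part_no BIGINT PRIMARY KEY);
INSERT INTO parts_to_find (part_no) VALUES (8456790); -- repeat, or LOAD DATA INFILE the list
-- The implicit conversion strips the leading zeros during the comparison.
SELECT x.ZMATNR, x.ZLPN
FROM tblZMMGPNXREF x
JOIN parts_to_find p ON x.ZMATNR = p.part_no;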

Apache NiFi: Creating a new column by comparing multiple rows with various data

I have a csv which looks like this:
first,second,third,num1,num2,num3
12,312,433,0787388393,0783452323,01123124
12,124345,453,07821323,077213424,0123124421
33,2432,214,077213424,07821323,0234234211
I have to create another column according to the data stored in num1 and num2. There can be various values in the columns, but the new column should only contain one of two values: either original or fake. (I should only compare the first 3 digits in both num1 and num2.)
For the mapping part I have another csv which looks like this (I have many more rows):
078,078,fake
072,078,original
077,078,original
My Output csv should look like this after mapping:
first,second,third,num1,num2,num3,status
12,312,433,0787388393,0783452323,01123124,fake
12,124345,453,07821323,072213424,0123124421,original
33,2432,214,078213424,07821323,0234234211,fake
Hope you can suggest a NiFi workflow to get this done:
You can use LookupRecord for this, but due to the special logic you'll likely have to write your own ScriptedLookupService to read in the mapping file and compare the first 3 digits.

MySQL to CSV - separating multiple values

I have downloaded a MySQL table as CSV, which has over thousand entries of the following type:
id,gender,garment-color
1,male,white
2,"male,female",black
3,female,"red,pink"
Now, when I am trying to create a chart out of this data, it is taking "male" as one value, and "male,female" as a separate value.
So, for the above example, rather than counting 2 "male", and 3 "female", the chart is showing 3 separate categories ("male", "female", "male,female"), with one count each.
I want the output as follows, for chart to have the correct count:
id,gender,garment-color
1,male,white
2,male,black
2,female,black
3,female,red
3,female,pink
The only way I know is to copy the row in MS Excel and adjust the values manually, which is too tedious for 1000+ entries. Is there a better way?
From MySQL command line or whatever tool you are using to send queries to MySQL:
select * from the_table
into outfile '/tmp/out.txt' fields terminated by ',' enclosed by '"'
Then download /tmp/out.txt from the server and it should be good to go, assuming your data is good. If it is not, you might need to massage it with some SQL functions in the select.
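For example, a hedged sketch of that massaging with standard string functions, using the columns from the question:
-- Trim stray whitespace and lowercase values while exporting.
select id, lower(trim(gender)), lower(trim(`garment-color`))
from the_table
into outfile '/tmp/out.txt' fields terminated by ',' enclosed by '"';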
The csv likely came from a poorly designed/normalized database that had both those values in the same row. You could try using selects and updates, along with some built-in string functions, on such rows to spawn additional rows containing the additional values and to update the original rows to remove those values. But you would have to repeat this until all commas are removed (if some fields contain more than one), and you would have to decide whether a row with comma-separated lists in multiple fields needs to be multiplied out (i.e. should 2 genders and 4 colors mean 8 rows total?).
More likely, you'll probably want to create additional tables for X_garmentcolors and X_genders, where X is whatever the original table is supposed to be describing. These tables would have an X_id field referencing the original row and a [garmentcolor|gender] value field holding one of the values from the original row's list. Ideally, they should actually reference [gender|garmentcolor] lookup tables instead of holding actual values; but you'd have to do the grunt work of picking out all the unique colors and genders from your data first. Once that is done, you can do something like:
INSERT INTO X_[garmentcolor|gender] (X_id, Y_id)
SELECT X.X_id, Y.Y_id
FROM originalTable AS X
INNER JOIN valueTable AS Y
ON X.Y_valuelist LIKE CONCAT('%,', Y.value) -- Value at end of list
OR X.Y_valuelist LIKE CONCAT('%,', Y.value, ',%') -- Value in middle of list
OR X.Y_valuelist LIKE CONCAT(Y.value, ',%') -- Value at start of list
OR X.Y_valuelist = Y.value -- Value is entire list
;
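For concreteness, here is a hedged sketch of the supporting tables for the gender case (all names illustrative):
-- Lookup table of the unique values picked out of the data.
CREATE TABLE genders (
  gender_id INT AUTO_INCREMENT PRIMARY KEY,
  value VARCHAR(20) NOT NULL UNIQUE
);
-- Junction table: one row per (original row, gender) pair.
CREATE TABLE X_genders (
  X_id INT NOT NULL, -- id of the row in the original table
  gender_id INT NOT NULL, -- references genders.gender_id
  PRIMARY KEY (X_id, gender_id)
);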

Why does SSIS TOKEN function fail to count adjacent column delimiters?

I ran into a problem with SQL Server Integration Services 2012's new string function in the Expression Editor called TOKEN().
This is supposed to help you parse a delimited record. If the record comes out of a flat file, you can do this with the Flat File Source. In this case, I am dealing with old delimited import records that were stored as strings in a database VARCHAR field. Now they need to be extracted, massaged, and re-exported as delimited strings. For example:
1^Apple^0001^01/01/2010^Anteater^A1
2^Banana^0002^03/15/2010^Bear^B2
3^Cranberry^0003^4/15/2010^Crow^C3
If these strings are in a column called OldImportRecord, the delimiter is a caret (as shown), and we wish to put the fifth field into a Derived Column, we would use an expression like:
TOKEN(OldImportRecord,"^",5)
This returns Anteater, Bear, Crow, etc. In fact, we can create Derived Columns for each of the fields in this record (note that the index is one-based), change them as needed, and then build another delimited record for export.
Here's the problem. What if some of our data includes some empty strings (or Nulls rendered as empty strings)?
4^^0004^6/15/2010^Duck^D4
The TOKEN() function fails to count the adjacent column delimiters, which throws off the column count; now it sees only five columns instead of six. Our TOKEN(OldImportRecord,"^",5) returns "D4" instead of the intended "Duck". When we extract the fourth column, we wind up trying to put "Duck" into a Date column, and all sorts of fun ensues.
Here's a partial workaround:
TOKEN(REPLACE(OldImportRecord,"^^","^ ^"),"^",5)
Notice this misses every second pair of adjacent delimiters, so it will fail for a string like "5^^^^Emu^E5", which looks like "5^ ^^ ^Emu^E5" after the REPLACE(). The column count is still wrong.
So here's my full workaround. This includes two nested REPLACE() calls, an RTRIM() to remove the superfluous spaces, and a DT_STR cast because I would like to keep the result as VARCHAR:
(DT_STR,255,1252)RTRIM(TOKEN(REPLACE(REPLACE(OldImportRecord,"^^","^ ^"),"^^","^ ^"),"^",5))
I am posting this for information, since others may also run into this problem.
Does anyone have a better workaround, or even a real solution?
Reason for the issue:
The TOKEN function in SSIS uses the implementation of the strtok function in C++. I gathered this information while reading the book Microsoft® SQL Server® 2012 Integration Services; it is mentioned as a note on page 113 (I like this book! Lots of nice information.).
I searched for the implementation of the strtok function and found the following links.
INFO: strtok(): C Function -- Documentation Supplement - The code sample in this link shows that the function does ignore consecutive delimiter characters.
The answers to the following SO questions point out that the strtok function is designed to ignore consecutive delimiters.
Need to know when no data appears between two token separators using strtok()
strtok_s behaviour with consecutive delimiters
I think that the TOKEN and TOKENCOUNT functions are working as per design but whether that is how SSIS should behave might be a question for the Microsoft SSIS team.
Original Post - Above section is an update:
I created a simple package in SSIS 2012 based on your data inputs. As you described in your question, the TOKEN function does not behave as intended; I agree that it doesn't seem to work. This post is not an answer to your original issue.
Here is an alternative way to write the expression in a relatively simpler fashion. It will only work if the last segment in your input record always has a value (say A1, B2, C3, etc.).
The expression can be rewritten as:
(DT_STR,50,1252)TOKEN(OldImportRecord,"^",TOKENCOUNT(OldImportRecord,"^") - 1)
This takes the input record as the first parameter and the delimiter caret (^) as the second. The third argument uses TOKENCOUNT to compute the total number of segments the record splits into; since the last segment is guaranteed to hold data, subtracting 1 from that count fetches the penultimate segment.
I created a simple package with a data flow task. An OLE DB source retrieves the data, and the derived column transformation parses and splits it as per the screenshot below; the output is then inserted into the destination table. You can see the source and destination tables in the last screenshot. The destination table has two columns: the first stores the penultimate segment data, the second the segment count based on the delimiter (which, again, isn't correct). You can notice that the last record didn't fetch the correct results: if the last record didn't have the value 8, the expression would fail because it would evaluate to a zero index.
Hope that helps to simplify your expression.
If you don't hear from anyone else, I would recommend logging this issue in Microsoft Connect website.
Create table and populate scripts:
CREATE TABLE [dbo].[SourceTable](
[OldImportRecord] [varchar](50) NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[DestinationTable](
[NewImportRecord] [varchar](50) NOT NULL,
[CaretCount] [int] NOT NULL
) ON [PRIMARY]
GO
INSERT INTO dbo.SourceTable (OldImportRecord) VALUES
('1^Apple^0001^01/01/2010^Anteater^A1'),
('2^Banana^0002^03/15/2010^Bear^B2'),
('3^Cranberry^0003^4/15/2010^Crow^C3'),
('4^^0004^6/15/2010^Duck^D4'),
('5^^^^Emu^E5'),
('6^^^^Geese^F6'),
('^^^^Pheasant^G7'),
('8^^^^Sparrow^');
GO
Derived column transformation inside data flow task:
Data in source and destination tables:
Not only does TOKEN skip adjacent delimiters, it also skips leading and trailing delimiters. So, using your example, if you had a "good" field that looks like this:
1^Apple^0001^01/01/2010^Anteater^A1
Followed by one with adjacent and leading delimiters like this:
^^^0004^6/15/2010^Duck^
TOKENCOUNT would only find two delimiters and you'd end up with 0004 assigned to Token1, 6/15/2010 for Token2, and Duck for Token3.
I used a different kind of replace. Rather than placing spaces between adjacent delimiters, which wouldn't help with leading or trailing ones, I used REPLACE to surround the delimiters with characters I absolutely wouldn't find in my text. The following expression works well for me. It's wordy, but it is what it is.
(DT_STR,255,1252)REPLACE(TOKEN(REPLACE(OldImportRecord,"^","~^~"),"^",1),"~","")
Of course, you'd replace the number 1 with whatever Token you wanted and adjust the cast according to your needs. Hope that helps.

Load file into mysql

I need to load a tab-delimited file into a MySQL database. My database is set up with columns ID;A;B;C;D;E and my file is a dump of columns ID and D. How can I load this file into my db and replace just columns ID and D, without changing the values of columns C and E? When I load it in now, columns C and E are changed to DEFAULT:NULL.
I already answered a similar question like this here, but in your case you'd want to load the csv into a temporary table, then use a simple UPDATE statement to copy the specific columns from your temporary table to your production table.
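A minimal sketch of that approach, assuming the production table is named my_table and the dump is tab-delimited with just the ID and D columns (both names hypothetical):
-- Staging table with only the columns present in the file.
CREATE TEMPORARY TABLE staging (ID INT PRIMARY KEY, D VARCHAR(255));
-- Tab is LOAD DATA's default field separator.
LOAD DATA LOCAL INFILE '/path/to/dump.tsv' INTO TABLE staging (ID, D);
-- Update only column D; columns C and E are left untouched.
UPDATE my_table t
JOIN staging s ON s.ID = t.ID
SET t.D = s.D;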
You can update specific column using this command:
UPDATE the_table
SET D=<value-of-D>
WHERE ID=<value-of-ID>
Then run this command for each row in the tab-delimited file, replacing the D and ID values for each.
You can use a stored procedure or a PHP program to do this.
For MySQL, the SP would need to open the file using load_file() and store the contents in a variable. Then the program needs to loop through it by finding "\n", which stands for newline, to get one whole line at a time as a string.
Then the program needs to find the first tab position using locate() and, using substring(), extract the ID column. Next it needs to find the 4th tab, i.e. 3 more tabs, using locate() and its 3rd parameter; this is the starting position of your D column. Then find the next tab character, again using locate() and its 3rd parameter, which gives you the end of the D column. Using substring(), get the content of the D column. Finally, use an UPDATE command to update the row's D column, using ID as the search key in the WHERE clause.
Since the above loops through all the lines, it will update all the rows of data.
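A hedged sketch of such a procedure (it follows the description above, which assumes each line carries all six tab-separated columns; my_table and the file path are hypothetical, and load_file() needs the FILE privilege plus a secure_file_priv-permitted path):
DELIMITER //
CREATE PROCEDURE update_d_from_file(IN p_path VARCHAR(255))
BEGIN
  DECLARE v_data LONGTEXT;
  DECLARE v_line TEXT;
  DECLARE v_nl, v_pos, v_end, v_i INT;
  DECLARE v_id, v_d TEXT;
  SET v_data = LOAD_FILE(p_path); -- whole file into one variable
  WHILE v_data IS NOT NULL AND LENGTH(v_data) > 0 DO
    -- Peel off one line at a time by finding "\n".
    SET v_nl = LOCATE('\n', v_data);
    IF v_nl = 0 THEN
      SET v_line = v_data, v_data = '';
    ELSE
      SET v_line = SUBSTRING(v_data, 1, v_nl - 1), v_data = SUBSTRING(v_data, v_nl + 1);
    END IF;
    IF LENGTH(v_line) > 0 THEN
      -- The first tab ends the ID column.
      SET v_pos = LOCATE('\t', v_line);
      SET v_id = SUBSTRING(v_line, 1, v_pos - 1);
      -- Walk 3 more tabs; the D column starts after the 4th.
      SET v_i = 0;
      WHILE v_i < 3 DO
        SET v_pos = LOCATE('\t', v_line, v_pos + 1), v_i = v_i + 1;
      END WHILE;
      -- The next tab (or the end of the line) closes the D column.
      SET v_end = LOCATE('\t', v_line, v_pos + 1);
      IF v_end = 0 THEN
        SET v_end = LENGTH(v_line) + 1;
      END IF;
      SET v_d = SUBSTRING(v_line, v_pos + 1, v_end - v_pos - 1);
      UPDATE my_table SET D = v_d WHERE ID = v_id;
    END IF;
  END WHILE;
END//
DELIMITER ;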