MySQL - Escaping ampersand (&) in fulltext searches

We are using a fulltext search to search for the name of a company and all is going well until we have a company with an ampersand in its name, e.g. 'M&S'.
SELECT name FROM company WHERE MATCH (name) against ('M&S' IN BOOLEAN MODE);
This fails to return any results as MySQL is treating the ampersand as a boolean operator. The boolean mode is desired so it can't simply be turned off.
What I'm looking for is a way to escape the ampersand so that MySQL treats it correctly and finds the record.
Ditching fulltext search in favour of LIKEs isn't exactly an option either
Thanks for your help

It seems that & isn't considered a word character in the collation you use for your fulltext search.
So you have to create your own collation (or recompile your MySQL server) in which & is added to the list of word characters, as I found out in the MySQL docs
(http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html):
If you want to change the set of characters that are considered word
characters, you can do so in several ways, as described in the
following list. After making the modification, you must rebuild the
indexes for each table that contains any FULLTEXT indexes. Suppose
that you want to treat the hyphen character ('-') as a word character.
Use one of these methods:
Modify the MySQL source: In myisam/ftdefs.h, see the true_word_char()
and misc_word_char() macros. Add '-' to one of those macros and
recompile MySQL.
Modify a character set file: This requires no recompilation. The
true_word_char() macro uses a “character type” table to distinguish
letters and numbers from other characters. You can edit the contents
of the <ctype><map> array in one of the character set XML files to
specify that '-' is a “letter.” Then use the given character set for
your FULLTEXT indexes. For information about the <ctype><map> array
format, see Section 10.3.1, “Character Definition Arrays”.
Add a new collation for the character set used by the indexed columns,
and alter the columns to use that collation. For general information
about adding collations, see Section 10.4, “Adding a Collation to a
Character Set”. For an example specific to full-text indexing, see
Section 12.9.7, “Adding a Collation for Full-Text Indexing”.
UPDATE: If you are using a latin1 collation, open the XML file at mysql/share/charsets/latin1.xml and find the character's code in one of the maps. You can use either the lower-case or the upper-case map, because the ampersand maps to itself in both:
<lower>
<map>
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
70 71 72 73 74 75 76 77 78 79 7A 5B 5C 5D 5E 5F
60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
F0 F1 F2 F3 F4 F5 F6 D7 F8 F9 FA FB FC FD FE DF
E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
</map>
</lower>
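The ampersand is U+0026 in Unicode, and its single-byte code in latin1 (as in UTF-8) is 0x26, so look for 26 in the map: it is in the 3rd row, 7th column.
Then, in the ctype map, change the type of that character from 10, which means punctuation, to 01, which means uppercase letter (02 would be lowercase; either one makes it a letter). Note that the ctype map begins with one extra leading 00 entry, so the ampersand's row is the line starting with 48: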
<ctype>
<map>
00
20 20 20 20 20 20 20 20 20 28 28 28 28 28 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
48 10 10 10 10 10 01 10 10 10 10 10 10 10 10 10
84 84 84 84 84 84 84 84 84 84 10 10 10 10 10 10
10 81 81 81 81 81 81 01 01 01 01 01 01 01 01 01
01 01 01 01 01 01 01 01 01 01 01 10 10 10 10 10
10 82 82 82 82 82 82 02 02 02 02 02 02 02 02 02
02 02 02 02 02 02 02 02 02 02 02 10 10 10 10 20
10 00 10 02 10 10 10 10 10 10 01 10 01 00 01 00
00 10 10 10 10 10 10 10 10 10 02 10 02 00 02 01
48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
01 01 01 01 01 01 01 10 01 01 01 01 01 01 01 02
02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
02 02 02 02 02 02 02 10 02 02 02 02 02 02 02 02
</map>
</ctype>
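The index arithmetic behind that edit can be sketched in Python. This is only an illustration: the helper name set_ctype is made up, and CTYPE_EXCERPT is just the start of the map (the leading 00 entry plus the rows for bytes 0x00-0x3F), assumed to still hold 10 in the ampersand's slot.

```python
def set_ctype(map_text, byte_value, new_type):
    """Return map_text with the ctype entry for byte_value replaced."""
    entries = map_text.split()
    # The ctype map has one extra leading entry, so byte N sits at index N + 1.
    entries[byte_value + 1] = new_type
    # Re-layout: the leading entry on its own line, then 16 entries per row.
    rows = [entries[0]]
    rows += [" ".join(entries[i:i + 16]) for i in range(1, len(entries), 16)]
    return "\n".join(rows)


# Excerpt of the unmodified map (assumption: '&' is still punctuation, 10).
CTYPE_EXCERPT = """\
00
20 20 20 20 20 20 20 20 20 28 28 28 28 28 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
84 84 84 84 84 84 84 84 84 84 10 10 10 10 10 10
"""

patched = set_ctype(CTYPE_EXCERPT, 0x26, "01")
print(patched.splitlines()[3])
# prints: 48 10 10 10 10 10 01 10 10 10 10 10 10 10 10 10
```

The printed row is the one that changes in the edited map shown above.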
Restart your MySQL server, and the corresponding collation will treat & as if it were a letter. As the quoted documentation says, you must also rebuild the FULLTEXT indexes afterwards.
Of course, it is better to first copy the character set XML file under a new name, and to copy the corresponding lines in Index.xml as well (don't forget to use a new, unused id in the XML tags there), linking them to your new XML file, so that you don't lose your original collation.
The full documentation, where I got most of this information, is here:
http://dev.mysql.com/doc/refman/5.0/en/full-text-adding-collation.html
Note: if you are working with MySQL 5.7, use an unused collation id. The MySQL article http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html is for MySQL 5.5. To get the current maximum collation id, use the following query:
SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;

EDIT: So the & is splitting the name into two separate words, and since each of them is only 1 letter long, nothing is returned. I tested with "Ma&Sa" and ft_min_word_len = 4, and it didn't return anything. The whole string is longer than 4 characters, so if it isn't returned, it must be getting split into two words. It looks like the suggestion northkildonan made is what you have to do.
This may or may not be an answer, but I hope it helps in figuring this out. Try the following.
First: run SHOW VARIABLES LIKE 'ft_min_word_len'; and confirm that the length is actually 2.
If it is, I'm not sure how this is any different from a word that is longer than 4 characters.
Second: I did the following and got results.
SET UP:
I set up a sample table on my localhost database...
create table company(
`id` int,
`name` varchar(55)
);
insert into company
(`id`, `name`)
values
(1, 'oracle'),
(2, 'microsoft'),
(3, 'M&S'),
(4, 'dell');
TESTS:
Tested with ft_min_word_len = 4 and, as expected, it didn't return anything.
SELECT `name` FROM company WHERE MATCH (`name`) against ("M&S" IN BOOLEAN MODE);
I didn't want to restart my localhost database to reset the length to 2 (in case I accidentally mess something up, because I use it a lot),
but I got the idea of searching for a company name longer than 4 characters with the & in it.
MORE SETUP:
insert into company
(`id`, `name`)
values
(5, 'Mary&Sasha');
ANOTHER TEST:
SELECT `name` FROM company WHERE MATCH (`name`) against ("Mary&Sasha" IN BOOLEAN MODE);
this returned http://screencast.com/t/Rx8mh98OUp
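A rough Python model of what seems to be happening in these tests. It is not MySQL's parser: it simply assumes that words are runs of letters, digits and underscores, and that words shorter than ft_min_word_len are ignored. The function name indexable_words is made up for the illustration.

```python
import re


def indexable_words(text, ft_min_word_len=4):
    """Split text at non-word characters and drop words that are too short."""
    # Assumption: '&' is not a word character, so it acts as a separator.
    words = re.findall(r"[0-9A-Za-z_]+", text)
    return [w for w in words if len(w) >= ft_min_word_len]


for name in ("M&S", "Ma&Sa", "Mary&Sasha"):
    print(name, indexable_words(name))
# prints:
# M&S []
# Ma&Sa []
# Mary&Sasha ['Mary', 'Sasha']
```

Under this model only 'Mary&Sasha' leaves any words to match, which agrees with the results above.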
I also did this just in case the collation was messing it up, but I doubt that was the problem.
COLLATION STUFF:
ALTER TABLE company MODIFY
`name` VARCHAR(55)
CHARACTER SET latin1
COLLATE latin1_german2_ci;
You can also check your table's collation with:
SHOW TABLE STATUS;
hope this is at least some help :)

& is not a special character in MySQL, so you are able to store and search for the & character.
You can test that as follows:
SELECT name FROM `testing` WHERE name LIKE '%&%'
Also, please try something like the following to replace the &:
SET @searchstring = 'M&S';
SET @searchstring = REPLACE(@searchstring,'&','&');
SELECT name FROM company WHERE MATCH (name) against (@searchstring IN BOOLEAN MODE);
You may also take a look at REGEXP:
http://dev.mysql.com/doc/refman/5.1/en/regexp.html
There the & is used as follows:
mysql> SELECT '&' REGEXP '[[.ampersand.]]';
The following query also returns the result:
SELECT *
FROM `testing`
WHERE `name` REGEXP CONVERT( _utf8 'M&S'
USING latin1 ) COLLATE latin1_german2_ci
LIMIT 0 , 30
Please also read this thread; maybe you can understand it better than I do. It is not about MySQL, but they seem to have solved the problem:
http://forums.asp.net/t/1073707.aspx?Full+text+search+and+sepcial+characters+like+ampersand+
Sorry I couldn't help more.


Reverse CRC16 calculation

I'm trying to understand how the CRC at the end of a radio packet is calculated.
Here are a few examples:
11 00 01 0D 30 10 05 1F 11 ED 7E 01 00 01 B9 33
11 00 01 0D 30 10 05 1F 11 ED 7E 01 00 00 B9 32
11 00 01 1D 30 10 05 1F 11 ED 7E 01 00 00 EA CC
11 00 01 2D 30 10 05 1F 11 ED 7E 01 00 00 1E CE
The 4th byte is a sequence number. All other bytes are constant. The last 2 bytes definitely look like a CRC16, as these are the only ones changing when the sequence byte increases. The last 2 bytes are not related to the time, as I can reproduce that exact same sequence anytime.
Here are a few more examples, from the same device but with a different command:
16 00 01 60 20 10 05 1F 11 ED 7E 01 02 00 04 00 02 00 65 32 CC
16 00 01 CB 20 10 31 53 11 ED 7E 01 42 00 04 00 02 00 65 B4 B9
This time again, the last 2 bytes look like a CRC16.
I've tried many CRC calculations, using online calculators like crccalc.com.
I've also used the RevEng tool, but got no results.
I can't figure out the method of calculation, so I must be missing something.
Any help to determine the calculation would be welcome.
Thanks!
It is the CRC-16/XMODEM, computed on your examples with the first three bytes and the last two bytes before the CRC removed, and then, oddly, that CRC exclusive-or'ed with the two bytes that precede it (those that were excluded from the CRC calculation). The resulting 16-bit value is appended in big-endian order.
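The description above can be checked with a short Python sketch. The function names are made up; the CRC parameters are the standard CRC-16/XMODEM ones (polynomial 0x1021, initial value 0, no reflection, no final XOR), and the byte ranges follow the wording of the answer.

```python
def crc16_xmodem(data):
    """Bitwise CRC-16/XMODEM: poly 0x1021, init 0x0000, MSB first."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def packet_trailer(body):
    """body is the packet without its final two check bytes."""
    covered = body[3:-2]                     # skip first 3 and last 2 bytes
    mask = int.from_bytes(body[-2:], "big")  # the two excluded bytes
    return (crc16_xmodem(covered) ^ mask).to_bytes(2, "big")


packet = bytes.fromhex("11 00 01 0D 30 10 05 1F 11 ED 7E 01 00 01 B9 33")
print(packet_trailer(packet[:-2]).hex())  # b933
```

The result matches the last two bytes of the first example packet from the question.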

Converting a hexdump back to a rar

I have a plaintext file that I wish to convert to something I can extract.
00000000 52 61 72 21 1a 07 01 00 f3 e1 82 eb 0b 01 05 07 |Rar!............|
00000010 00 06 01 01 80 80 80 00 3b fd 42 9f 51 02 03 31 |........;.B.Q..1|
00000020 a0 02 06 82 03 80 83 02 20 15 d4 6e 5b 46 b6 57 |........ ..n[F.W|
00000030 80 03 01 09 69 6e 73 74 72 2e 74 78 74 30 01 00 |....instr.txt0..|
00000040 03 0f 44 a5 ce af b3 09 b9 96 44 22 f4 99 ef 04 |..D.......D"....|
This is part of the file which made me believe it is a rar file. I tried using xxd with the -r option to no avail.
I tried the solution from here but it also didn't work.
Any ideas?
To solve my own question, and for future reference.
Use vim's visual block select to copy just the hex values into 'justhexvalues.txt'.
Then use xxd:
xxd -r -p justhexvalues.txt answer.rar
That was it.
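As an alternative to copying the hex column by hand, the same conversion can be sketched in Python. This assumes every line of the dump has the layout shown in the question (offset, hex bytes, then an ASCII column between '|' characters); the function name is made up, and squeezed '*' lines from hexdump are not handled.

```python
def hexdump_to_bytes(dump):
    """Rebuild the binary data from 'offset  hex bytes  |ascii|' lines."""
    out = bytearray()
    for line in dump.splitlines():
        hex_part = line.split("|", 1)[0]   # drop the ASCII column
        fields = hex_part.split()
        if len(fields) < 2:                # blank line or offset-only line
            continue
        out += bytes.fromhex("".join(fields[1:]))  # fields[0] is the offset
    return bytes(out)


sample = """\
00000000 52 61 72 21 1a 07 01 00 f3 e1 82 eb 0b 01 05 07 |Rar!............|
00000010 00 06 01 01 80 80 80 00 3b fd 42 9f 51 02 03 31 |........;.B.Q..1|
"""
data = hexdump_to_bytes(sample)
print(data[:4])  # b'Rar!'
# To write the result out: open("answer.rar", "wb").write(data)
```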

How do you compress image data for LZW encoding for .GIF files?

I am having trouble understanding how to compress image data for the 89a specification for .gif files. Say for example I am trying to make a 3x2 .GIF. Let me construct a sample color code table and walk through an example [of what I think is correct].
Color code | Color
------------------
0 | Brown
1 | Red
2 | Green
3 | Black
The image I want to create is this.
3x2 pixels (6 pixels total)
----------
Br Br Br
Br R Br
Compressing with LZW walks me through this process. This is the final code table I get.
Code table
----------
# | code
0 | 0
1 | 1
2 | 2
3 | 3
4 | clear
5 | eoi // end of information
6 | 0 0
7 | 0 0 0
8 | 0 1
9 | 1 0
With an eventual value of 4 0 6 0 1 0 5 that are my codes. Because I wrote out a code 0 0 0, this code value equals 7, so I had to increase my code size from 3 > 4 bits for subsequent codes. So, here are the bytes of my image data (from my code table).
100 - 4
000 - 0
110 - 6
0000 - 0
0001 - 1
0000 - 0
0101 - 5
I end up encoding my image data as
10000100 - 132
00100001 - 33
10100000 - 160
00000000 - 0
Which ends up looking like this in my final .gif file (I've put brackets around the values that correspond to the image data)
47 49 46 38 39 61 03 00 02 00 f1 00 00 b9 7a 56
ff 00 00 00 ff 00 00 00 00 21 ff 0b 4e 45 54 53
43 41 50 45 32 2e 30 03 01 ff ff 00 21 f9 04 04
64 00 00 00 2c 00 00 00 00 03 00 02 00 00 [02 04
84 21 a0 00 00] 3b
// Explanation
02 - Minimum LZW code size
04 - Data sub-block of 4 bytes
84 - 132 in decimal
21 - 33 in decimal
a0 - 160 in decimal
00 - 0 in decimal
00 - Termination byte
My image looks something like this (why is there green in here instead of red?). I blew the image up since 2x3 pixels is a bit hard to read.
Is there something fundamental that I am missing? I appreciate your time to look at this with me.
Found the error: it lies in when the code size is incremented while compressing the LZW image data.
When you build the code table while compressing image data with LZW, you need to increment the code size once you've added a code equal to 2^(code size). So, instead of incrementing the code size by one after adding code 7 | 0 0 0 (as shown in the table above), I needed to increment it after adding 8 | 0 1 (because 8 = 2^3, with the code size being 3).
This is how the image data changes when the code size is incremented as described:
100 - 4
000 - 0
110 - 6
000 - 0
0001 - 1
0000 - 0
0101 - 5
And this is how the resulting image data bytes have changed:
10000100 - 132
00010001 - 17
01010000 - 80
I've put brackets around the data to show a comparison from the full .gif data to show what has changed (after applying the fix). This is the same .gif file from above.
47 49 46 38 39 61 03 00 02 00 f1 00 00 b9 7a 56
ff 00 00 00 ff 00 00 00 00 21 ff 0b 4e 45 54 53
43 41 50 45 32 2e 30 03 01 ff ff 00 21 f9 04 04
64 00 00 00 2c 00 00 00 00 03 00 02 00 00 [02 03
84 11 50 00] 3b
// Explanation
02 - Minimum LZW code size
03 - Data sub-block of 3 bytes
84 - 132 in decimal
11 - 17 in decimal
50 - 80 in decimal
00 - Termination byte
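The corrected procedure can be sketched in Python as follows. It is a minimal illustration for small images only: the function names are made up, and it does not handle the code table filling up (4096 entries), where a real encoder would have to emit a clear code and start over.

```python
def gif_lzw_codes(pixels, min_code_size):
    """Return (code, bit_width) pairs for a list of colour indices."""
    clear = 1 << min_code_size
    eoi = clear + 1
    table = {(i,): i for i in range(clear)}
    next_code = eoi + 1
    code_size = min_code_size + 1
    out = [(clear, code_size)]
    buffer = (pixels[0],)
    for pixel in pixels[1:]:
        candidate = buffer + (pixel,)
        if candidate in table:
            buffer = candidate
            continue
        out.append((table[buffer], code_size))
        table[candidate] = next_code
        # Grow only once the code just added equals 2^(code size).
        if next_code == (1 << code_size) and code_size < 12:
            code_size += 1
        next_code += 1
        buffer = (pixel,)
    out.append((table[buffer], code_size))
    out.append((eoi, code_size))
    return out


def pack_lsb_first(codes):
    """Pack variable-width codes into bytes, least significant bit first."""
    acc, nbits, result = 0, 0, bytearray()
    for code, width in codes:
        acc |= code << nbits
        nbits += width
        while nbits >= 8:
            result.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        result.append(acc & 0xFF)
    return bytes(result)


codes = gif_lzw_codes([0, 0, 0, 0, 1, 0], min_code_size=2)
packed = pack_lsb_first(codes)
print([c for c, _ in codes])  # [4, 0, 6, 0, 1, 0, 5]
print(packed.hex())           # 841150
```

The three packed bytes are the 84 11 50 that appear in the fixed file.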

Access 2013 - Sort irregular length strings to order

I'm trying to sort some strings containing hexadecimal numbers. My problem is that they are too irregular and too hard for my level of knowledge in Access, so I could really use some help!
In every file ("Files") each REQUEST string has a corresponding RESPONSE string. The two share the first 4 characters ("16xx") and always characters 8-9 ("xx"), sometimes more places as well, and at characters 5-6 the RESPONSE has +40 added, e.g. 19 -> 59. I took some examples from my table (the real table has 600 rows with different strings from 24 different files).
ID = primary key, Files = the file the string came from, Nr = the number the string had in the file, String = the string I would like to sort, Type = whether it was a REQUEST or a RESPONSE
I would like to make pairs of them in a new query like this...
...so that they are aligned in Files order with the REQUEST before the RESPONSE.
I have tried writing different queries all day to sort this out but can't get the syntax right. I tried sorting using the SQL Left, IIf, Mid and Len functions with update queries, but I either get a syntax error, nothing, or the wrong values. Is there a way of doing this, or are the strings too irregular to even sort?
Thanks
EDIT
How it looks now, from one file:
ID Files Nr String Type
1 1 1 1636 19 02 2F REQUEST
2 2 2 1637 19 02 2F REQUEST
3 2 3 1631 19 02 2F REQUEST
4 3 4 1637 19 04 0A 1B 47 FF REQUEST
28 1 10 1636 59 02 FF RESPONSE
29 2 11 1637 59 02 FF RESPONSE
30 2 12 1631 59 02 7F C1 00 00 28 C2 A4 RESPONSE
31 3 13 1637 59 04 0A 1B 47 00 RESPONSE
How I would want it:
ID Files Nr String Type
1 1 1 1636 19 02 2F REQUEST
28 1 10 1636 59 02 FF RESPONSE
2 2 2 1637 19 02 2F REQUEST
29 2 11 1637 59 02 FF RESPONSE
3 2 3 1631 19 02 2F REQUEST
30 2 12 1631 59 02 7F C1 00 00 28 C2 A4 RESPONSE
4 3 4 1637 19 04 0A 1B 47 FF REQUEST
31 3 13 1637 59 04 0A 1B 47 00 RESPONSE
You can try something like this (MySQL). It uses user-defined variables to generate a field for ordering. I assume FIL is the name of the table:
SELECT ID, FILES, NR, STRING, TYPE
FROM (
SELECT *
, @o:= CASE WHEN TYPE='REQUEST' THEN @o+2 ELSE 0 END ord
, @p:= CASE WHEN TYPE= 'RESPONSE' THEN @p+2 ELSE 0 END ord2
, @o+@p AS ord_tot
FROM FIL A
CROSS JOIN (SELECT @o:=-1,@p:=2 ) T1
ORDER BY TYPE, FILES, NR
) B
ORDER BY ord_tot;
Output:
ID FILES NR STRING TYPE
1 1 1 1636 19 02 2F REQUEST
28 1 10 1636 59 02 FF RESPONSE
2 2 2 1637 19 02 2F REQUEST
29 2 11 1637 59 02 FF RESPONSE
3 2 3 1631 19 02 2F REQUEST
30 2 12 1631 59 02 7F C1 00 00 28 C2 A4 RESPONSE
4 3 4 1637 19 04 0A 1B 47 FF REQUEST
31 3 13 1637 59 04 0A 1B 47 00 RESPONSE
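The same interleaving idea, sketched in Python on the sample rows from the question. It assumes, as the query does, that the k-th REQUEST in (Files, Nr) order belongs to the k-th RESPONSE. The helper names are made up, and looks_like_pair treats the "+40" rule as hexadecimal, which is an assumption.

```python
ROWS = [  # (ID, Files, Nr, String, Type)
    (1, 1, 1, "1636 19 02 2F", "REQUEST"),
    (2, 2, 2, "1637 19 02 2F", "REQUEST"),
    (3, 2, 3, "1631 19 02 2F", "REQUEST"),
    (4, 3, 4, "1637 19 04 0A 1B 47 FF", "REQUEST"),
    (28, 1, 10, "1636 59 02 FF", "RESPONSE"),
    (29, 2, 11, "1637 59 02 FF", "RESPONSE"),
    (30, 2, 12, "1631 59 02 7F C1 00 00 28 C2 A4", "RESPONSE"),
    (31, 3, 13, "1637 59 04 0A 1B 47 00", "RESPONSE"),
]


def interleave(rows):
    """Order rows as REQUEST, RESPONSE, REQUEST, RESPONSE, ..."""
    def by_file_and_nr(row):
        return (row[1], row[2])
    requests = sorted((r for r in rows if r[4] == "REQUEST"), key=by_file_and_nr)
    responses = sorted((r for r in rows if r[4] == "RESPONSE"), key=by_file_and_nr)
    ordered = []
    for request, response in zip(requests, responses):
        ordered += [request, response]
    return ordered


def looks_like_pair(request, response):
    """Sanity check: same first 4 characters, second field + 0x40."""
    req, resp = request[3], response[3]
    return req[:4] == resp[:4] and int(resp[5:7], 16) == int(req[5:7], 16) + 0x40


ordered = interleave(ROWS)
print([row[0] for row in ordered])  # [1, 28, 2, 29, 3, 30, 4, 31]
```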
The simplest way to accomplish this would be to use MySQL's built-in hexadecimal format and then just use ORDER BY on the correct fields. This of course assumes you can modify how your data is stored to conform to that hexadecimal format.

Dissecting MySQL InnoDB record format to restore from raw disk

I had a MySQL database stored on a USB thumb drive which has irreparably lost its file allocation table. Therefore, I cannot get to the ibdata1 file as a whole. I can, however, use a hex editor to locate the record pages that were in use.
All the data is there, but I have to read each record myself and play back new SQL statements to a database restored from a 6-month-old backup.
Because I have a backup, I know the table structure and can find a record in the new database that I know roughly equates to a small block of binary data. However, I am having trouble determining exactly where the record starts and decoding the record data.
The CREATE statement for the table is:
CREATE TABLE ExpenseTransactions (
idExpenseTransactions int(11) NOT NULL AUTO_INCREMENT,
TransactionDate datetime NOT NULL,
DollarAmount float DEFAULT NULL,
PoundAmount float DEFAULT NULL,
Location varchar(255) DEFAULT NULL,
MinorCategory int(11) NOT NULL,
Comment varchar(255) DEFAULT NULL,
Recurring bit(1) NOT NULL DEFAULT b'0',
Estimate bit(1) NOT NULL DEFAULT b'0',
PRIMARY KEY (idExpenseTransactions),
KEY MinorCategory (MinorCategory)
) ENGINE=InnoDB AUTO_INCREMENT=4687 DEFAULT CHARSET=utf8;
A clean record looks like this:
'2924', '2013-11-01 00:00:00', '60', NULL, 'George', '66', 'Lawn Maintenance', '1', '0'
The hex bytes associated with this record are next. I am pretty certain I have more bytes than necessary to recreate the record, but I have marked what I believe is the id field to try to give some reference point.
10 06 02 00 01 70 00 41 80 00 0B 6C 00 00 00 00 07 05 86 00 00 01 4A 0E B1 80 00 12 4F 23 1F C1 40 00 00 70 42 47 65 6F 72 67 65 80 00 00 42 4C 61 77 6E 20 4D 61 69 6E 74 65 6E 61 6E 63 65 01 00
I can fathom out the strings easily enough and I can pick out the 4 bytes making up the MinorCategory. The last 2 bytes should represent the 2 bit values. The rest is more difficult.
The record in question is correctly identified, and per my blog post The physical structure of records in InnoDB, here's how it decodes:
Header:
10                       Length of Comment = 16 bytes
06                       Length of Location = 6 bytes
02                       Nullable field bitmap (PoundAmount = NULL)
00                       Info flags and number of records owned
01 70                    Heap number and record type
00 41                    Offset to next record = +65 bytes
Record:
80 00 0B 6C              idExpenseTransactions = 2924
00 00 00 00 07 05        TRX_ID
86 00 00 01 4A 0E B1     ROLL_PTR
80 00 12 4F 23 1F C1 40  TransactionDate = "2013-11-01 00:00:00"
00 00 70 42              DollarAmount = 60.0
                         (No data, PoundAmount = NULL)
47 65 6F 72 67 65        Location = "George"
80 00 00 42              MinorCategory = 66
4C 61 77 6E 20 4D 61 69  Comment = "Lawn Maintenance"
6E 74 65 6E 61 6E 63 65  (Comment continues...)
01                       Recurring = 1
00                       Estimate = 0
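The breakdown above can be replayed with a small Python sketch. It is tied to this one table and to the layout of this particular record: it assumes both varchar columns are non-NULL with one-byte lengths, that integers and the datetime are stored big-endian with the sign bit flipped, and that floats are little-endian. The function name decode_record is made up.

```python
import struct

RAW = bytes.fromhex(
    "10 06 02 00 01 70 00 41 80 00 0B 6C 00 00 00 00 07 05 86 00 00 01 4A 0E B1 "
    "80 00 12 4F 23 1F C1 40 00 00 70 42 47 65 6F 72 67 65 80 00 00 42 "
    "4C 61 77 6E 20 4D 61 69 6E 74 65 6E 61 6E 63 65 01 00"
)


def decode_record(raw):
    comment_len, location_len, null_bitmap = raw[0], raw[1], raw[2]
    if null_bitmap & 0x0C:
        # A NULL varchar has no length byte, which would shift the header.
        raise ValueError("this sketch assumes Location and Comment are not NULL")
    pos = 8  # 2 length bytes + 1 NULL bitmap byte + 5 fixed header bytes

    def take(n):
        nonlocal pos
        chunk = raw[pos:pos + n]
        pos += n
        return chunk

    def signed(chunk):
        # Big-endian with the sign bit flipped: subtract the bias.
        return int.from_bytes(chunk, "big") - (1 << (8 * len(chunk) - 1))

    rec = {"idExpenseTransactions": signed(take(4))}
    rec["TRX_ID"] = int.from_bytes(take(6), "big")
    rec["ROLL_PTR"] = int.from_bytes(take(7), "big")
    s = str(signed(take(8))).zfill(14)  # packed as YYYYMMDDHHMMSS
    rec["TransactionDate"] = f"{s[0:4]}-{s[4:6]}-{s[6:8]} {s[8:10]}:{s[10:12]}:{s[12:14]}"
    # NULL columns occupy no bytes; bitmap bits follow the nullable column order.
    rec["DollarAmount"] = None if null_bitmap & 0x01 else struct.unpack("<f", take(4))[0]
    rec["PoundAmount"] = None if null_bitmap & 0x02 else struct.unpack("<f", take(4))[0]
    rec["Location"] = take(location_len).decode("utf-8")
    rec["MinorCategory"] = signed(take(4))
    rec["Comment"] = take(comment_len).decode("utf-8")
    rec["Recurring"] = take(1)[0]
    rec["Estimate"] = take(1)[0]
    return rec


record = decode_record(RAW)
print(record["idExpenseTransactions"], record["TransactionDate"], record["DollarAmount"])
# prints: 2924 2013-11-01 00:00:00 60.0
```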