I am trying to identify Spanish ID numbers using REGEX on MySQL. I took this regex and adapted it to my dataset, as the items are not isolated and might not start/end with those characters. The expressions are:
Original: ^(x?\d{8}|[xyz]\d{7})[trwagmyfpdxbnjzsqvhlcke]$
Mine: [0-9]{8,8}[A-Za-z]{1}
When I run the search using my REGEX, this is a sample of what I get:
GOOD --> 47099085T
GOOD --> D73654109H
NOT OK --> 8.30781719e-05
NOT OK --> 0113:11:19%2000:54:17.042828927Z
How can I modify [0-9]{8,8}[A-Za-z]{1} to exclude the "NOT OK" items?
Spanish ID syntax:
The number of the National Identity Document includes 8 digits and one letter for security. The letter is found by taking all 8 digits as a number and dividing it by 23. The remainder of this division, which is between 0 and 22, gives the letter used for security. The letters I, Ñ, O, U are not used. The letters I and O are not used to avoid confusion with the numbers 1 and 0, and the Ñ is not used to avoid confusion with N.
Remainder:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
Letter:     T  R  W  A  G  M  Y  F  P  D  X  B  N  J  Z  S  Q  V  H  L  C  K  E
-- EDIT II --
After running a test on a bigger data set, I have found other matches that should be excluded.
How can I modify (^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke]) so that it does NOT match:
70ce4827ce88530583ed5a1a40245f24
BE4-SGS-V2-00199982a5aa
2945a6bf-86b6-4ea0-94d9-aec84980762d
0x01010083B5627CCA663946A282DE573804AA85
xmp.iid:FE7F11740720681189A59382544B2855
Ok, according to documentation the Spanish ID system (DNI) is structured thus:
The number of the National Identity Document includes 8 digits and one letter for security. The letter is found by taking all 8 digits as a number and dividing it by 23. The remainder of this division, which is between 0 and 22, gives the letter used for security. The letters I, Ñ, O, U are not used. The letters I and O are not used to avoid confusion with the numbers 1 and 0, and the Ñ is not used to avoid confusion with N.
Remainder:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
Letter:     T  R  W  A  G  M  Y  F  P  D  X  B  N  J  Z  S  Q  V  H  L  C  K  E
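The check-letter rule quoted above can be sketched in Python as follows. This is only a minimal illustration: the names dni_letter and is_valid_dni are made up for this example, and it assumes the plain 8-digit form only, not the x/y/z-prefixed variants that the original expression also accepts.

```python
# Letters indexed by remainder 0..22, copied from the table above.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def dni_letter(digits: str) -> str:
    """Return the check letter for an 8-digit number (hypothetical helper)."""
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 8 ASCII digits")
    return DNI_LETTERS[int(digits) % 23]


def is_valid_dni(candidate: str) -> bool:
    """True if candidate is 8 digits followed by the matching check letter."""
    if len(candidate) != 9:
        return False
    digits, letter = candidate[:8], candidate[8].upper()
    if not (digits.isascii() and digits.isdigit()):
        return False
    return DNI_LETTERS[int(digits) % 23] == letter
```

For instance, 12345678 leaves a remainder of 14 when divided by 23, which maps to Z. A regex can only check the shape of the string; a check like this could be applied afterwards to discard candidates whose letter does not agree with the digits.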
After some exploration with negative lookaheads and completely failing to get them to work, we can use a more manual approach: explicitly check that the found "block" of 8 digits is not preceded by a digit or a decimal point:
/[^\.\d][\d]{8}[TRWAGMYFPDXBNJZSQVHLCKE]/gmi
MySQL safe/syntax version:
(^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])
Example usage using REGEXP_REPLACE to return rows where the id_column matches the ID syntax and returns those syntax strings:
SELECT REGEXP_REPLACE(`id_column`,
'(^|[^\\d.])(\\d{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$2') as id_output
FROM `table_name`
WHERE id_column REGEXP '(^|[^\\d.])(\\d{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])'
NOTE: Prior to MySQL 8.0.17, the result returned by this function used the UTF-16 character set; in MySQL 8.0.17 and later, the character set and collation of the expression searched for matches is used. (Bug #94203, Bug #29308212)
This matches the two correct matches on your example as well as checking that only one of the valid letters comes after the numerical match.
It is important to note that using the max value in the quantifier {min,max} is pretty irrelevant here, because it does not mean that no more than max digits may exist in the source string: an unanchored pattern can still match 8 digits inside a longer run. Please see here for further reading.
What does my Regex do:
Checks that a set of 8 digits is not preceded by either another digit or a decimal point (so 9 digits are never "captured").
Checks that the set of 8 found digits is immediately followed by one of the valid letters of either case.
You can see my Regex in action here and the corresponding MySQL demo here.
47099085T // matches
D73654109H // matches
8.30781719e-05 // unmatched
0113:11:19%2000:54:17.042828927Z // unmatched
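For anyone who wants to try the pattern outside MySQL, here is a rough Python sketch that applies the same expression to the four sample strings. The helper name find_id is made up for this example, and Python's re module is only standing in for MySQL's regex engine, so treat it as a quick sanity check rather than a guarantee of identical behaviour.

```python
import re

# Same pattern as the MySQL version above; group 2 holds the ID itself.
DNI_SHAPE = re.compile(
    r"(^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])"
)


def find_id(text: str):
    """Return the first ID-shaped substring, or None (hypothetical helper)."""
    match = DNI_SHAPE.search(text)
    return match.group(2) if match else None


samples = [
    "47099085T",
    "D73654109H",
    "8.30781719e-05",
    "0113:11:19%2000:54:17.042828927Z",
]
for sample in samples:
    print(sample, "->", find_id(sample))
```

Note that for D73654109H the captured group is 73654109H, without the leading D, because the D is only consumed as the "not a digit or dot" character in front of the block.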
For example, if n=9, then how many different values can be represented in 9 binary digits (bits)?
My thinking is that if I set each of those 9 bits to 1, I will make the highest number possible that those 9 digits are able to represent. Therefore, the highest value is 1 1111 1111 which equals 511 in decimal. I conclude that, therefore, 9 digits of binary can represent 511 different values.
Is my thought process correct? If not, could someone kindly explain what I'm missing? How can I generalize it to n bits?
2^9 = 512 values, because that's how many combinations of zeroes and ones you can have.
What those values represent however will depend on the system you are using. If it's an unsigned integer, you will have:
000000000 = 0 (min)
000000001 = 1
...
111111110 = 510
111111111 = 511 (max)
In two's complement, which is commonly used to represent integers in binary, you'll have:
000000000 = 0
000000001 = 1
...
011111110 = 254
011111111 = 255 (max)
100000000 = -256 (min) <- yay integer overflow
100000001 = -255
...
111111110 = -2
111111111 = -1
In general, with k bits you can represent 2^k values. Their range will depend on the system you are using:
Unsigned: 0 to 2^k - 1
Signed: -2^(k-1) to 2^(k-1) - 1
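The formulas above can be checked with a few lines of Python. This is just a sketch; the function names are made up for illustration, and "signed" here assumes two's complement as in the listing above.

```python
from itertools import product


def value_count(bits: int) -> int:
    """Number of distinct bit patterns with the given number of bits."""
    return 2 ** bits


def unsigned_range(bits: int) -> tuple:
    """(min, max) when the bits are read as an unsigned integer."""
    return (0, 2 ** bits - 1)


def signed_range(bits: int) -> tuple:
    """(min, max) when the bits are read as two's complement."""
    return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)


# Brute-force cross-check: enumerate every combination of nine 0/1 digits.
patterns = list(product("01", repeat=9))
print(len(patterns), value_count(9))
print(unsigned_range(9), signed_range(9))
```

Enumerating the patterns directly gives the same count as 2^9, which is the point of the answer: the all-zeros pattern counts as a value too.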
What you're missing: Zero is a value
A better way to solve it is to start small.
Let's start with 1 bit. Which can either be 1 or 0. That's 2 values, or 10 in binary.
Now 2 bits, which can either be 00, 01, 10 or 11 That's 4 values, or 100 in binary... See the pattern?
Okay, since it already "leaked": You're missing zero, so the correct answer is 512 (511 is the greatest one, but it's 0 to 511, not 1 to 511).
By the way, a good follow-up exercise would be to generalize this:
How many different values can be represented in n binary digits (bits)?
Without wanting to give you the answer, here is the logic.
You have 2 possible values in each digit, and you have 9 digits.
It works like base 10, where each digit has 10 possible values: with 2 digits you can write 0 to 99, which makes 100 numbers. If you do the calculation, you get an exponential function:
base^numberOfDigits:
10^2 = 100 ;
2^9 = 512
There's an easier way to think about this. Start with 1 bit. This can obviously represent 2 values (0 or 1). What happens when we add a bit? We can now represent twice as many values: the values we could represent before with a 0 appended and the values we could represent before with a 1 appended.
So the number of values we can represent with n bits is just 2^n (2 to the power n).
The thing you are missing is which encoding scheme is being used. There are different ways to encode binary numbers. Look into signed number representations. For 9 bits, the ranges and the amount of numbers that can be represented will differ depending on the system used.
I'm trying to do a query on a Japanese dictionary DB I created that identifies repeating words—words like ニコニコ (niko niko), ピカピカ (pika pika), etc. While I know how to do LIKE %% queries, I'm not certain how to get it to define a pattern off one part and see if the other part matches it.
Parameters:
All of the words I'm looking for are 4 double-byte characters long
Pattern A consists of the first two characters, Pattern B consists of the last two
The query is being run on a headwords table that is structured rather simply: It has two fields, id and headword
Collation on the table is set to utf8_bin
We want to filter to search only headwords that are 4 characters long, then identify Pattern A and see if Pattern B is identical. If so, return the id.
Bonus: If there is a way to run the search as straight utf8 instead of utf8_bin, that would be helpful for picking up some additional matches (e.g. つれづれ tsure dure). The headwords column has a UNIQUE index on it, and requires utf8_bin collation to enforce the index properly for normal operations.
Data & Result Example (added per Strawberry's suggestion)
id | headword
=============
1 | たべる
2 | あらわれる
3 | ばかばかしい
4 | ニコニコ
5 | じゅんびする
6 | ぴかぴか
7 | する
8 | つれづれ
9 | ひとびと
10 | ひと
Desired result would return ids 4 and 6; an optimal result would also return 8 and 9.
1 is too short by 1 character, and Pattern A (たべ) does not match Pattern B (る)
2 is too long by 1 character, and Pattern A (あら) does not match Pattern B (われ). Ditto for 5
3 has matches for Patterns A and B (ばか), however it's too long at 6 characters
7 and 10 are too short by 2 characters. While there's a possible Pattern A (e.g. ひと in 10 appears in ひとびと in 9), it's not long enough to provide a Pattern B to compare against
In PHP, this is what you are looking for: preg_match('/^(..)\1$/u', 'ニコニコ') will be true.
The u modifier says that the pattern and the target string are treated as UTF-8.
The .. finds any 2 characters.
The \1 is a back-reference to (..), hence matching a duplicate.
The ^ and $ 'anchor' the regexp to the start and end of the target string.
The 'ニコニコ' is merely one of the test cases.
So, start at the beginning, find 2 utf8 characters, make sure they are immediately repeated, and nothing else follows.
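The same back-reference idea can be tried against the sample table from the question. The sketch below is in Python rather than PHP or MySQL, purely to make it easy to run; the dictionary simply copies the ten sample rows, and is_repeated_word is a made-up helper name.

```python
import re

# Two characters, captured, followed by the same two characters again.
# fullmatch() plays the role of the ^ and $ anchors.
REPEATED = re.compile(r"(..)\1")


def is_repeated_word(headword: str) -> bool:
    """True for 4-character words whose second half repeats the first."""
    return REPEATED.fullmatch(headword) is not None


headwords = {
    1: "たべる",
    2: "あらわれる",
    3: "ばかばかしい",
    4: "ニコニコ",
    5: "じゅんびする",
    6: "ぴかぴか",
    7: "する",
    8: "つれづれ",
    9: "ひとびと",
    10: "ひと",
}

matching_ids = [row_id for row_id, word in headwords.items()
                if is_repeated_word(word)]
print(matching_ids)
```

This returns ids 4 and 6 only. The "bonus" rows 8 and 9 are not picked up, since つれ and づれ are different characters as far as the regex is concerned; catching those would need some form of normalization before comparing, which is not shown here.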
Here's a question I've come across:
Assume each X represents one bit, either 0 or 1. Consider the 8-bit unsigned binary numbers A = 1XXX XXXX and B = 0XXX XXXX. Which of the following are true (you may tick more than one answer):
A) B > A
B) A > 127
C) Can't tell which one A or B is larger
D) B < 127
E) A > B
Explanations needed (0 understanding on this). Thanks!
The key to the answer is in the word unsigned. This means that the MSB (leftmost bit) is not being used to indicate the result's sign. Processors perform mathematical operations such as add, subtract and comparison on numbers using two's complement, which means that to know the numeric value of a binary word we must know whether it is signed (can contain negative values) or unsigned (non-negative values only).
So in the above case the values are unsigned, which means A is always greater than B, and since A has the MSB of an 8-bit value set to 1 it must be at least 128.
In the same way that decimal digits count in powers of 10, binary digits count in powers of two:
Binary
128 64 32 16 8 4 2 1
Decimal
1000 100 10 1
However, if the binary value were signed, the leftmost bit would be used to express positive (0) or negative (1), and when it is negative we need to invert the bits and add one to get the magnitude of the (negative) result.
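To make this concrete, here is a small Python sketch that enumerates every possible A = 1XXX XXXX and B = 0XXX XXXX and also shows how the same bit pattern reads when treated as signed. The helper names are made up for this example, and the signed reading assumes 8-bit two's complement.

```python
def as_unsigned(bits: str) -> int:
    """Read a bit string as an unsigned integer."""
    return int(bits, 2)


def as_signed(bits: str) -> int:
    """Read a bit string as a two's complement integer of the same width."""
    value = int(bits, 2)
    if bits[0] == "1":           # sign bit set: subtract 2^width
        value -= 1 << len(bits)
    return value


# Every possible A has the top bit set; every possible B has it clear.
a_values = [0b10000000 | low for low in range(128)]
b_values = list(range(128))

print(min(a_values), max(a_values))   # range of A when unsigned
print(min(b_values), max(b_values))   # range of B when unsigned
print(as_unsigned("10000000"), as_signed("10000000"))
```

Since the smallest A (128) is larger than the largest B (127), "A > 127" and "A > B" hold for every choice of the X bits. "B < 127" does not always hold, because B can be exactly 127.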
Ok, so I know that the binary equivalent of 104 is 1101000.
10=1010
4=0100
so, 104=1101000 (how do I get this? How do these two mix together to give this binary?)
And from the example here...
the octets from "hellohello" are E8 32 9B FD 46 97 D9 EC 37.
This bit is inserted to the left which yields 1 + 1101000 = 11101000 ("E8").
I still follow this part, but how do I convert 11101000 to E8?
I'm so sorry for all these noob questions; I only started learning this yesterday, and I googled and searched for a whole day but still don't really understand the concept...
Thank you.
Ok, so I know that the binary equivalent of 104 is 1101000.
10=1010
4=0100
You can't break apart a number like 104 into 10 and 4 when changing bases. You need to look at the number 104 in its entirety. Start with a table of bit positions and their decimal equivalents:
1 1
2 10
4 100
8 1000
16 10000
32 100000
64 1000000
128 10000000
Look up the largest decimal number in the table that is not larger than your input number, 104: it is 64. Write that down:
1000000
Subtract 64 from 104: 104-64=40. Repeat the table lookup with 40 (32 in this case), and write down the corresponding bit pattern below the first one, aligning the lowest bit at the far right:
1000000
 100000
Repeat with 40-32=8:
1000000
 100000
   1000
Since there's nothing left over after the 8, you're finished here. Sum those three numbers:
1101000
That's the binary representation of 104.
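The table-lookup-and-subtract procedure above can be written as a short Python function. This is a sketch of that manual method only (the function name is made up), not how you would normally convert in practice, since Python already has bin().

```python
def to_binary_by_subtraction(number: int) -> str:
    """Build the binary digits by repeatedly subtracting powers of two."""
    if number < 0:
        raise ValueError("only non-negative integers are handled here")
    if number == 0:
        return "0"

    # Find the largest power of two that is not larger than the number.
    power = 1
    while power * 2 <= number:
        power *= 2

    bits = []
    while power >= 1:
        if power <= number:
            bits.append("1")     # this power of two fits: use it
            number -= power
        else:
            bits.append("0")     # it does not fit: leave the position empty
        power //= 2
    return "".join(bits)


print(to_binary_by_subtraction(104))
```

For 104 the function uses 64, 32 and 8, exactly the three rows summed above.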
To convert 1101000 into hexadecimal we can use a little trick, very similar to your attempt to use 10 and 4, to build the hex version from the binary version without much work -- look at groups of four bits at a time. This trick works because four bits of base 2 representation completely represent the range of options of base 16 representations:
Bin Dec Hex
0000 0 0
0001 1 1
0010 2 2
0011 3 3
0100 4 4
0101 5 5
0110 6 6
0111 7 7
1000 8 8
1001 9 9
1010 10 A
1011 11 B
1100 12 C
1101 13 D
1110 14 E
1111 15 F
The first group of four bits (insert enough leading 0s to pad it to four bits), 0110, is 6 decimal, 6 hex; the second group of four bits, 1000, is 8 decimal, 8 hexadecimal, so 0x68 is the hex representation of 104.
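The grouping trick can be sketched in Python like this. The function name binary_to_hex is made up for the example, and the input is assumed to be a string containing only 0 and 1.

```python
HEX_DIGITS = "0123456789ABCDEF"


def binary_to_hex(bits: str) -> str:
    """Convert a bit string to hex by reading it four bits at a time."""
    # Pad on the left with zeros so the length is a multiple of four.
    padded_length = -(-len(bits) // 4) * 4
    padded = bits.zfill(padded_length)

    # Each group of four bits is a value 0..15, i.e. exactly one hex digit.
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)


print(binary_to_hex("1101000"))    # 104, padded to 0110 1000
print(binary_to_hex("11101000"))   # the octet from the question
```

The second call covers the case from the question: 11101000 splits into 1110 and 1000, which are E and 8.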
I think you are confusing a few things:
104 decimal is 1101000, which is not formed by splitting 104 into two groups, 10 and 4.
The exception is hex: a hex number can be formed from groups of 4 binary digits, because 2^4 = 16.
So 11101000 = E8 splits into 1110 = E and 1000 = 8. 1110 (binary) is 14 (decimal), which is E (hex).
Hex digits go from 0 to 15 (decimal), where:
10 (decimal) = A (hex)
11 (decimal) = B (hex)
...
15 (decimal) = F (hex)
What you're missing here is the general formula for numbers written as digits (positional notation).
104 = 1*10^2 + 0*10^1 + 4*10^0
Similarly,
0100b = 0*2^3 + 1*2^2 + 0*2^1 + 0*2^0
And for a hexadecimal number, the letters A-F stand for the numbers 10-15. So,
E8 = 14*16^1 + 8*16^0
As you go from right to left, each digit represents the coefficient of the next higher power of the base (also called the radix).
In programming, if you have an integer value (in the internal format of the computer, probably binary, but it isn't relevant), you can extract the rightmost digit with the modulus operation.
x = 104
x % 10 #yields 4, the "ones" place
And then you can get "all but" the rightmost digit with integer division (integer division discards the remainder which we no longer need).
x = x / 10 #yields 10
x % 10 #now yields 0, the "tens" place
x = x / 10 #yields 1
x % 10 #now yields 1, the "hundreds" place
So if you do modulus and integer division in a loop (stopping when x == 0), you can output a number in any base.
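Here is one way that loop might look in Python. It is a minimal sketch: to_base is a made-up name, only bases 2 to 16 are handled, and // is used because in Python 3 a single / would give a float rather than integer division.

```python
DIGITS = "0123456789ABCDEF"


def to_base(x: int, base: int) -> str:
    """Write a non-negative integer in the given base (2..16)."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 16")
    if x < 0:
        raise ValueError("only non-negative integers are handled here")
    if x == 0:
        return "0"

    digits = []
    while x > 0:
        digits.append(DIGITS[x % base])  # rightmost digit in this base
        x //= base                       # drop that digit
    # Digits were produced right to left, so reverse them for printing.
    return "".join(reversed(digits))


print(to_base(104, 2), to_base(104, 16), to_base(232, 16))
```

The digits come out lowest place first, which is why the list is reversed at the end.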
This is basic arithmetic. See binary numeral system & radix wikipedia entries.