Finding numbers with a specific length from csv file - csv

I'm working with a csv file from a customer, which holds a large amount of data. The data is extracted from an SQL database and the commas therefore signify the different columns. In one of these columns there are 10 digit numbers. For some reason all 10 digit numbers starting with 0 have been converted to 9 digit numbers with the 0 removed. I need to find all these instances and insert a 0 at the beginning of the 9 digit number.
A complication in the data is that another column also contains 9 digit numbers, and these do not need to be modified. I can assume, however, that all those numbers start with 0 and all the numbers I need to find do not start with 0.
I'm currently using Notepad++ to try to fix the problem and found the regular expression \d{9}, which finds all numbers with 9 digits, but that is not what I'm looking for.
Below is an example of how the data could look. The column that needs all 9 digit numbers converted is on the left, and the other column with 9 digit numbers is on the right.
Column 1      Column 2
2323232323    002132413
231985313     004542435
In this example I need to find the second line of column 1 and insert a 0 in front of the number.

Ctrl+H
Find what: \b(?!0)\d{9}\b
Replace with: 0$0
TICK Wrap around
SELECT Regular expression
Replace all
Explanation:
\b # word boundary, make sure there is no digit (or other word character) before
(?!0) # negative lookahead, make sure the next character is not 0
\d{9} # 9 digits
\b # word boundary, make sure there is no digit (or other word character) after
Replacement:
0 # 0 to be inserted
$0 # the whole match (i.e. 9 digits)
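If the file is too large to edit comfortably in Notepad++, the same find-and-replace could be sketched in Python. This is only an illustration: the helper name pad_nine_digit_numbers and the sample text are made up for the example, and Python's re module writes the whole match as \g<0> in the replacement rather than $0.

```python
import re

# Same pattern as above: a 9 digit run that does not start with 0
# and has a word boundary on both sides.
NINE_DIGITS = re.compile(r"\b(?!0)\d{9}\b")


def pad_nine_digit_numbers(text: str) -> str:
    """Prefix every matching 9 digit number with a 0 (hypothetical helper)."""
    return NINE_DIGITS.sub(r"0\g<0>", text)


# Sample rows modelled on the example data in the question.
sample = "2323232323,002132413\n231985313,004542435"
print(pad_nine_digit_numbers(sample))
```

Only 231985313 is changed here: the 10 digit number has no word boundary after its ninth digit, and the two numbers in the right-hand column start with 0.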

Using Notepad++ do CTRL + H (search and replace utility).
Tick Regular Expression
Find what ? ([^0-9])(\d{9})([^0-9])
Replace with ? \10\2\3
Explanation :
([^0-9])(\d{9})([^0-9]) matches a 9 digit number surrounded by a non-digit on each side (including line return / comma, etc) :
Each (....) "captures" a group for later use (in "replace").
[^0-9] is a non-number character
\d{9} is a 9 digits number
\10\2\3 is a 0 right after the first captured group \1 (it was just one character here) followed by the 9 digit number (2nd captured group : \2) and the character that was after that number (3rd captured group : \3).
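Limit :
It won't match a number at the very beginning of the file (before any other character) or at the very end (after every character). Adding a newline at the end of the file is one workaround, or fixing the last number manually if there is no newline before EOF.
As written, \d{9} also matches 9 digit numbers that start with 0, which the question wants left alone; using [1-9]\d{8} in place of \d{9} skips them.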
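The capture-group approach and its limit can be tried out in Python as a rough sketch. Two things here are assumptions rather than part of the answer: the helper name pad_with_groups, and the use of [1-9]\d{8} instead of \d{9} so that numbers starting with 0 are skipped. In Python the replacement has to be written \g<1>0, because \10 would be read as a reference to group 10.

```python
import re

# A non-digit, then a 9 digit number not starting with 0, then a non-digit.
# Assumption: [1-9]\d{8} is used instead of \d{9} to skip leading-zero numbers.
SURROUNDED = re.compile(r"([^0-9])([1-9]\d{8})([^0-9])")


def pad_with_groups(text: str) -> str:
    """Insert a 0 between the first captured character and the number."""
    return SURROUNDED.sub(r"\g<1>0\g<2>\g<3>", text)


# The number on the second row is preceded by a newline, so it is matched.
print(pad_with_groups("2323232323,002132413\n231985313,004542435\n"))

# Here the number is the very first thing in the text, so nothing precedes it
# and the text comes back unchanged (the limit described above).
print(pad_with_groups("231985313,004542435"))
```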

Related

Allow only 1 letter and unlimited numbers in input

So, banging my head on the wall with a regex test site and can't seem to nail this one down.
Trying to make it so an HTML input is validated to allow at most 1 letter, but unlimited numbers and no other characters.
W123 = valid
124X = valid
1234 = valid
WW12 = invalid
Nothing incredible here, only two possible cases:
the string starts with a digit
the string doesn't start with a digit
In both scenarios only one letter is allowed.
To express that you need a group and an alternation:
^(?:\d+[A-Z]?|[A-Z]\d)\d*$
In the two branches stored inside the group, you can see that there's at least one digit (in the first because I used the + quantifier, in the second because of \d). The first branch matches any string that starts with a digit, with or without a letter (because of the optional letter [A-Z]?); the second one matches strings that start with a letter. \d* at the end of the pattern matches the remaining digits.
Obviously the pattern is enclosed between anchors for the start ^ and the end of the string $.
You look for zero or more digits followed by a letter followed by zero or more digits or alternatively at least one digit.
^\d*[a-zA-Z]\d*$|^\d+$
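Both patterns can be checked against the sample inputs from the question with a short Python sketch. The names GROUPED and ALTERNATION and the samples list are made up for this illustration; the patterns themselves are copied from the two answers.

```python
import re

# First answer: one group with two branches, uppercase letters only.
GROUPED = re.compile(r"^(?:\d+[A-Z]?|[A-Z]\d)\d*$")
# Second answer: two fully anchored alternatives, any ASCII letter.
ALTERNATION = re.compile(r"^\d*[a-zA-Z]\d*$|^\d+$")

samples = ["W123", "124X", "1234", "WW12"]
for value in samples:
    grouped_ok = GROUPED.match(value) is not None
    alternation_ok = ALTERNATION.match(value) is not None
    print(value, grouped_ok, alternation_ok)
```

The two patterns agree on the four samples but not everywhere: a lone letter such as W is accepted by the second pattern and rejected by the first, which requires at least one digit, and a lowercase letter is accepted only by the second.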

Regex of max 5 decimal separate by +

I'm trying to write a regex for an HTML input pattern to check whether the input is valid.
It should be valid only if the input is a string of decimal numbers (with 1 or 2 digits after the comma) or integers separated by a +,
with a maximum of 5 numbers.
It cannot start or end with a +, and a number cannot have a comma with no digits after it (I use a comma instead of a dot for decimals).
For example:
10+5,1+6,20 OK
10 OK
6+4+8,9+3+9+3 NO
10,2+4+6+ NO
10,+5 NO
I've tried something like this, but it doesn't work very well:
((\d{1,3}|(\d*,\d{1,2})*)+(\+)?){1,5}
I've also tried this:
^((\s*)|([0-9]\d{0,9}(\,\d{1,2})?%?))*(\+((\s*)|([0-9]\d{0,9}(\,\d{1,2})?%?))+){0,4}$
but it doesn't work very well with the 2 digit maximum for the decimals or with a trailing +.
Any suggestions?
I've made some tests here:
https://regexr.com/5jsfv
It should pass the first 3 and fail on the last 4.
Thanks
You can use
^\d+(?:,\d{1,2})?(?:\+\d+(?:,\d{1,2})?){0,4}$
In the HTML pattern attribute use it as
pattern="\d+(?:,\d{1,2})?(?:\+\d+(?:,\d{1,2})?){0,4}"
See the regex demo.
NOTE: If you want to limit the number of digits in the integer part to be max 3, replace the \d+ with \d{1,3}:
^\d{1,3}(?:,\d{1,2})?(?:\+\d{1,3}(?:,\d{1,2})?){0,4}$
Details:
^ - start of string (implicit in pattern regex)
\d+(?:,\d{1,2})? - one or more digits and then an optional sequence of a , and one or two digits
(?:\+\d+(?:,\d{1,2})?){0,4} - zero to four occurrences of a + char followed with one or more digits and then an optional sequence of a , and one or two digits
$ - end of string (implicit in pattern regex)
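As a quick sanity check outside the browser, the pattern can be run against the examples from the question in Python. This is a sketch only: the name is_valid is made up, and re.fullmatch is used to mimic the implicit anchoring of the HTML pattern attribute.

```python
import re

# Pattern as it would appear in the HTML pattern attribute (no ^ or $).
NUMBERS = re.compile(r"\d+(?:,\d{1,2})?(?:\+\d+(?:,\d{1,2})?){0,4}")


def is_valid(value: str) -> bool:
    """True if the whole string is 1 to 5 numbers separated by + (hypothetical helper)."""
    return NUMBERS.fullmatch(value) is not None


for value in ["10+5,1+6,20", "10", "6+4+8,9+3+9+3", "10,2+4+6+", "10,+5"]:
    print(value, "OK" if is_valid(value) else "NO")
```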

Validating using HTML5 pattern

I need to validate two possible patterns for the input using HTML5 pattern.
123456789a (first 9 digits should be exactly numbers and then an alphabetical character) nothing more nothing less
OR
123456789012 (exactly 12 digits nothing more nothing less)
I tried ^([0-9]{12,12})|([0-9]{9,9}[A-Za-z]{1,1}), ^([0-9]{12})|([0-9]{9,9}[A-Za-z])$, and many more, but the problem is that if the user enters an alphabetic character when the total length is between 9 and 12, it is accepted as valid input. It should not be.
Valid input is either 12 digits, or 9 digits with one char.
What have I done wrong?
You could check for 9 digits at the start of the string: (see ^, beginning of input assertion, \d, digit character class and the x{n} quantifier)
^\d{9}
followed by either an alphabetical character or 3 more digits, and the end of the string: (see the non capturing group (?: ... ), [ ... ], the character set, x|y and $, end of input assertion)
(?:[a-zA-Z]|\d{3})$
So the expression would be:
^\d{9}(?:[a-zA-Z]|\d{3})$
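The combined expression can be exercised with a few inputs in Python. This is a sketch: the name is_valid_id is hypothetical, and re.fullmatch stands in for the whole-value match that the HTML pattern attribute performs.

```python
import re

# 9 digits, then either one letter or 3 more digits, and nothing else.
ID_PATTERN = re.compile(r"\d{9}(?:[a-zA-Z]|\d{3})")


def is_valid_id(value: str) -> bool:
    """True for 9 digits plus a letter, or exactly 12 digits (hypothetical helper)."""
    return ID_PATTERN.fullmatch(value) is not None


for value in ["123456789a", "123456789012", "1234567890a", "12345678901", "123456789"]:
    print(value, is_valid_id(value))
```

The third input, 10 digits followed by a letter, is the kind of value the question reports being wrongly accepted; it is rejected here because after the ninth digit only a single letter or exactly three digits can follow.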

How to identify different patterns of numbers in a column?

I am trying to write one single formula to identify all the patterns in a column/field. For example, below are five different patterns:
AG 5643 895468 UWEB
7546 695321 IJJK
PE 45612384
8642567921
16724385
Formula for the first pattern (contains a 4 digit number followed by a 6 digit number):
'*[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] *'
This is not working. Can we specify the length, with something like [0-9]{4} for a 4 digit number?
The formula for the first pattern should also pick up the second one.
3rd one: the first 2 characters are letters, followed by an 8 or 10 digit number
4th one: 10 digit number
5th one: 8 digit number
Thanks in advance!
If you're working in MySQL you can use regular expressions with the RLIKE filter operator.
For example, WHERE text RLIKE '[0-9]{8}' finds all the rows with any consecutive sequence of eight digits in them anywhere. (http://sqlfiddle.com/#!9/44996/1/0)
WHERE text RLIKE '^[0-9]{8}$' finds the rows consisting of nothing but an eight-digit sequence. (http://sqlfiddle.com/#!9/44996/2/0)
WHERE text RLIKE '^[0-9A-Z]{2} ' finds the rows starting with two letters or digits and then a space. (http://sqlfiddle.com/#!9/44996/3/0)
You get the idea. Regular expressions have a lot of power to them, generally beyond the scope of a SO answer to explain. Beware, though. There is a common saying: if you solve a problem with a regular expression, now you have two problems. You need to be careful with them.
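To experiment with the patterns before putting them into a query, here is a rough Python sketch that sorts the five sample values from the question into groups. The labels, the PATTERNS list and the classify function are all made up for illustration, and the exact regular expressions are one possible reading of the question (for instance, the space after the two letters is treated as optional).

```python
import re

# Checked in order; the first pattern that matches decides the label.
PATTERNS = [
    ("4 digits + 6 digits", re.compile(r"\b[0-9]{4} [0-9]{6}\b")),
    ("2 letters + 8 or 10 digits", re.compile(r"^[A-Za-z]{2} ?(?:[0-9]{10}|[0-9]{8})$")),
    ("10 digits", re.compile(r"^[0-9]{10}$")),
    ("8 digits", re.compile(r"^[0-9]{8}$")),
]


def classify(value: str) -> str:
    """Return the label of the first matching pattern, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.search(value):
            return label
    return "unknown"


for value in ["AG 5643 895468 UWEB", "7546 695321 IJJK", "PE 45612384", "8642567921", "16724385"]:
    print(value, "->", classify(value))
```

The first pattern is deliberately unanchored so that it picks up both the first and the second sample, as the question asks.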

Regarding TCL format command

% format %2s 100
100
% format %.2s 100
10
%
%
% format %0.2s 100
10
%
I am not able to understand the difference between %2s and %.2s.
Can anyone explain it to me?
The Tcl format command manual page specifies that a conversion specifier in the format string can consist of six different parts. In this case the second, third and fourth portions are of interest.
If there is a character from the set [-+ 0#], it specifies the justification of the field, whether there should be padding, whether the sign of numbers is shown, etc. The 0 in the third example specifies that the number should be padded with zeros instead of spaces. However, in this example there is nothing to pad.
If there is some other number without a dot (2 in the first example), the number is interpreted as the minimum field width, and the value is padded with spaces if necessary.
If there is a dot, the number after it is interpreted as a precision indicator, and the way it behaves differs depending on the other format parameters. For strings it means the maximum number of characters.
With
format %4.2s foo
you then get
  fo
That is, at most two characters are printed, but the field width is at minimum 4 characters, so the result is padded with two leading spaces.
If you are actually trying to print a number instead of a string, then the sixth (the only mandatory) field is important. "s" means "print as is". For numbers you want to use, for example, "d", which means decimal (integer), or "f" for floating point. Check the manual for the whole list.
With
format %4.2d 100 ;# Print at least two digits in a field at least 4 characters wide
you get (padded with one leading space)
 100
With
format %08.2f 123.45678 ;# Field width 8, pad with zeros, print two decimals
you get
00123.46
In the last example notice that all digits and the dot count toward the field width and that the number has been rounded.
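The width and precision rules are easier to see when the padding is made visible. As an analogy only (this is Python, not Tcl), Python's printf-style % operator follows the same C conventions for these specifiers, so the examples above can be reproduced with repr() showing the spaces. The show helper is made up for this sketch.

```python
def show(spec: str, value) -> str:
    """Format value with a printf-style spec and return it (hypothetical helper)."""
    return spec % value


# repr() puts quotes around each result so padding spaces are visible.
print(repr(show("%2s", "100")))          # width 2 is only a minimum
print(repr(show("%.2s", "100")))         # precision 2 truncates the string
print(repr(show("%4.2s", "foo")))        # truncate to 2, then pad to width 4
print(repr(show("%4.2d", 100)))          # at least 2 digits, width 4
print(repr(show("%08.2f", 123.45678)))   # width 8, zero padded, 2 decimals
```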