I am trying to run a regular expression search on a string that holds the HTML content of a web search. The pattern I am trying to match has the format HQ 12345; the second fragment can also start with a letter, so HQ A12345 is also possible. As shown in the code below, the regex pattern I am using is "HQ .*[0-9]".
The problem is that when I run the regex search, the match is not just HQ 959693 but also includes the rest of the HTML content, as shown in the snapshot of the message box below.
Sub Test()
Dim mystring As String
mystring = getHTMLData("loratadine")
Dim rx As New RegExp
rx.IgnoreCase = True
rx.MultiLine = False
rx.Global = True
rx.Pattern = "HQ .*[0-9]"
Dim mtch As Variant
For Each mtch In rx.Execute(mystring)
Debug.Print mtch
MsgBox(mtch)
Next
End Sub
Public Function getHTMLData (ByVal name As String) As String
Dim XMLhttp: Set XMLhttp = CreateObject("MSXML2.ServerXMLHTTP")
XMLhttp.setTimeouts 2000, 2000, 2000, 2000
XMLhttp.Open "GET", "http://rulings.cbp.gov/results.asp?qu=" & name & "&p=1", False
XMLhttp.send
If XMLhttp.Status = 200 Then
getHTMLData = XMLhttp.responsetext
Else
getHTMLData = ""
End If
End Function
Use ? to make the quantifier non-greedy; otherwise the match will consume everything up to the last digit it can reach in the string. Also, [0-9] on its own matches only a single digit. Add a + to specify "one or more" so it matches the whole number:
HQ .*?[0-9]+
Alternatively, you can try to use a negated character class like so:
HQ [^0-9]*[0-9]+
Or you can even simplify it further:
HQ [^\d]*\d+
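The difference between the greedy and non-greedy forms can be seen in a minimal sketch like the one below. The sample string is made up for illustration (it is not the real page content), and the third pattern, HQ [A-Z]?[0-9]+, is only an assumption about the ruling-number format (an optional single letter followed by digits). Late binding is used so no library reference is required.

```vba
Sub CompareGreedyAndLazy()
    ' Hypothetical sample text standing in for the downloaded HTML
    Dim sample As String
    sample = "<td>HQ 959693</td><td>HQ A12345</td><td>page 2 of 17</td>"

    Dim rx As Object
    Set rx = CreateObject("VBScript.RegExp")   ' late binding, no reference needed
    rx.Global = True
    rx.IgnoreCase = True

    Dim patterns As Variant, p As Variant, m As Object
    ' 1) original greedy pattern, 2) lazy version, 3) assumed stricter format
    patterns = Array("HQ .*[0-9]", "HQ .*?[0-9]+", "HQ [A-Z]?[0-9]+")

    For Each p In patterns
        rx.Pattern = p
        Debug.Print "Pattern: " & p
        For Each m In rx.Execute(sample)
            Debug.Print "  " & m.Value
        Next m
    Next p
End Sub
```

With the greedy pattern the single match should run from the first "HQ " to the last digit of the sample, while the other two patterns stop at the end of each number.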
Regex matching is greedy by default. Unfortunately I didn't manage to reproduce your issue precisely, but I am pretty sure it happens because you have a long string in which '.*' matches everything up to a number near the end.
I find this link useful; see the explanation near the bottom about the greediness of *
http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
I suggest changing your Regex to:
HQ .*?[0-9]+
That will match "HQ " and any number of characters, followed by one or more digits. It will also consume the minimal amount in the ".*", because of the "?".
Please respond if this does not work and I will try getting your Regex running in Excel.
In MS Access I have a table with a Short Text field named txtPMTaskDesc in which some records contain numbers, at varying positions in the string. I would like to recover these numbers from the text string, if possible, for sorting purposes.
There are over 26000 records in the table, so I would rather handle it in a query over using VBA loops etc.
Sample Data
While the end goal is to recover the whole number, I was going to start with just identifying the position of the first numerical value in the string. I have tried a few things to no avail like:
InStr(1,[txtPMTaskDesc],"*[0-9]*")
Once I get that, I was going to use it as part of a Mid() function to pull out that digit and the character next to it, like below. (It's a bit dodgy, but there is never more than a two-digit number in the text string.)
IIf(InStr(1,[txtPMTaskDesc],"*[0-9]*")>0,Mid([txtPMTaskDesc],InStr(1,[txtPMTaskDesc],"*[0-9]*"),2)*1,0)
Any assistance appreciated.
If the data is truly representative and the number is always preceded by "- No ", then the expression in the query can be:
Val(Mid(txtPMTaskDesc, InStr(txtPMTaskDesc, "- No ") + 5))
If there is no match, the expression will usually return 0; however, if the field is Null, the expression will raise an error (wrapping the field in Nz() avoids that).
If the string does not have a consistent pattern (numbers always in the same position, or preceded by some distinct character combination that can be used to locate them), I don't think you can get what you want without VBA. Either loop through the string or explore regular expressions (RegEx). For RegEx, set a reference to the Microsoft VBScript Regular Expressions x.x library.
Function GetNum(strS AS String)
Dim re As RegExp, Match As Object
Set re = New RegExp
re.Pattern = "\d+"
Set Match = re.Execute(strS)
GetNum = Null
If Match.Count > 0 Then GetNum = Match(0)
End Function
Input of string "Fuel Injector - No 1 - R&I" returns 1.
Place function in a general module and call it from query.
SELECT table.*, GetNum(Nz(txtPMTaskDesc,"")) AS Num FROM table;
Function returns Null if there is no number match.
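For the "loop through the string" alternative mentioned above, a minimal sketch without any RegEx reference might look like this. The function name FirstNumber is hypothetical, and the sketch assumes the numbers are short enough to fit in a Long (the question says they never exceed two digits).

```vba
Public Function FirstNumber(ByVal s As Variant) As Variant
    ' Returns the first run of digits in s as a Long, or Null if there is none.
    Dim i As Long, digits As String, ch As String

    FirstNumber = Null
    If IsNull(s) Then Exit Function

    For i = 1 To Len(s)
        ch = Mid$(s, i, 1)
        If ch Like "[0-9]" Then
            digits = digits & ch        ' extend the current run of digits
        ElseIf Len(digits) > 0 Then
            Exit For                    ' the run has ended, stop scanning
        End If
    Next i

    If Len(digits) > 0 Then FirstNumber = CLng(digits)
End Function
```

Placed in a general module, it could be called from the query in the same way as GetNum, for example FirstNumber([txtPMTaskDesc]) AS Num. Because the parameter is a Variant, Null fields are handled without Nz().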
Well, does the number you want ALWAYS have a - No xxxx - format?
If yes, then you could have this global function in VBA like this:
Public Function GNUM(v As Variant) As Long
If IsNull(v) Then
GNUM = 0
Exit Function
End If
Dim vBuf As Variant
vBuf = Split(v, " - No ")
Dim strRes As String
If UBound(vBuf) > 0 Then
strRes = Split(vBuf(1), "-")(0)
GNUM = Val(strRes)
Else
GNUM = 0
End If
End Function
Then your sql will be like this:
SELECT BLA, BLA, txtPMTaskDesc, GNUM([txtPMTaskDesc]) AS TaskNum
FROM myTable
So you can create a public VBA function and use it in the SQL query.
It is just a question of whether " - No " is ALWAYS in that format, with the number following it.
So we have "space" "-" "space" "No" "space", then the number, and then the " -".
How well this will work depends on how consistent this text is.
I am trying to get three values from a large html file. I thought I could use the substring method, but was informed that the position of the data may change. Basically, in the following code I need to pick out "Total number of records: 106", "Number of records imported:106", and "Number of records rejected: 0"
<B>Total number of records : </B>106</Font><br><Font face="arial" size="2"><B>Number of records imported : </B>106</Font><br><Font face="arial" size="2"><B>Number of records rejected : </B>0</Font>
I hope this is clear enough. Thanks in advance!
Simple string operations like IndexOf() and Substring() should be plenty to do the job. Regular Expressions would be another approach that'd take less code (and may allow more flexibility if the HTML tags can vary), but as Mark Twain would say, I didn't have time for a short solution, so I wrote a long one instead.
In general you'll get better results around here by showing you've at least made a reasonable attempt first and showing where you got stuck. But for this time...here you go. :-)
Private Shared Function GetMatchingCount(allInputText As String, textBefore As String, textAfter As String) As Integer?
'Find the first occurrence of the text before the desired number
Dim startPosition As Integer = allInputText.IndexOf(textBefore)
'If text before was not found, return Nothing
If startPosition < 0 Then Return Nothing
'Move the start position to the end of the text before, rather than the beginning.
startPosition += textBefore.Length
'Find the first occurrence of text after the desired number
Dim endPosition As Integer = allInputText.IndexOf(textAfter, startPosition)
'If text after was not found, return Nothing
If endPosition < 0 Then Return Nothing
'Get the string found at the start and end positions
Dim textFound As String = allInputText.Substring(startPosition, endPosition - startPosition)
'Try converting the string found to an integer
Try
Return CInt(textFound)
Catch ex As Exception
Return Nothing
End Try
End Function
Of course, it'll only work if the text before and after is always the same. If you use that with a driver console app like this (but without the Shared, since it'd be in a Module then)...
Sub Main()
Dim allText As String = "<B>Total number of records : </B>106</Font><br><Font face=""arial"" size=""2""><B>Number of records imported : </B>106</Font><br><Font face=""arial"" size=""2""><B>Number of records rejected : </B>0</Font>"
Dim totalRecords As Integer? = GetMatchingCount(allText, "<B>Total number of records : </B>", "<")
Dim recordsImported As Integer? = GetMatchingCount(allText, "<B>Number of records imported : </B>", "<")
Dim recordsRejected As Integer? = GetMatchingCount(allText, "<B>Number of records rejected : </B>", "<")
Console.WriteLine("Total: {0}", totalRecords)
Console.WriteLine("Imported: {0}", recordsImported)
Console.WriteLine("Rejected: {0}", recordsRejected)
Console.ReadKey()
End Sub
...you'll get output like so:
Total: 106
Imported: 106
Rejected: 0
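The regular-expression approach mentioned at the start of this answer could be sketched roughly as follows. This is a VBA sketch using the VBScript RegExp object rather than VB.NET, the function name GetCount is hypothetical, and it assumes the label text contains no regex metacharacters.

```vba
Function GetCount(ByVal html As String, ByVal label As String) As Variant
    ' Returns the number that follows "<B>label : </B>", or Null if not found.
    Dim rx As Object, matches As Object
    Set rx = CreateObject("VBScript.RegExp")
    rx.IgnoreCase = True
    ' \s* tolerates varying whitespace around the colon and the closing tag
    rx.Pattern = label & "\s*:\s*</B>\s*(\d+)"

    Set matches = rx.Execute(html)
    If matches.Count > 0 Then
        GetCount = CLng(matches(0).SubMatches(0))
    Else
        GetCount = Null
    End If
End Function

Sub DemoGetCount()
    Dim html As String
    html = "<B>Total number of records : </B>106</Font><br>" & _
           "<Font face=""arial"" size=""2""><B>Number of records imported : </B>106</Font><br>" & _
           "<Font face=""arial"" size=""2""><B>Number of records rejected : </B>0</Font>"

    Debug.Print "Total: " & GetCount(html, "Total number of records")
    Debug.Print "Imported: " & GetCount(html, "Number of records imported")
    Debug.Print "Rejected: " & GetCount(html, "Number of records rejected")
End Sub
```

Because the pattern is anchored on the label rather than on a character position, it keeps working if the data moves around in the file, as long as the label and the closing </B> stay the same.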
I need to replace some HTML tags with others.
For example, I want to change the
<EM></EM>
tags to
<strong></strong>
tags, except when the word inside the
<EM>
tags is et al, ie.:
<EM>et al</EM>.
Is there a way to use a single replace operation that matches the EM word inside both the start and closing tags
<> </>
or is the only way to use 2 replacement operations, like
"(<EM>)(?!et al)", "<strong>"
Edit:
I'm using VBA inside MSAccess.
This is my UDF:
'--------------------------------------------------------------------
' Name: RegExpReplace
' Purpose: Replace text in a string using Regular Expressions.
' Requires: Microsoft VBScript Regular Expressions 5.5
' Author: Diego F.Pereira-Perdomo
' Date: Dec-27-2012
'--------------------------------------------------------------------
Public Function RegExpReplace(ByVal strInput As String, _
ByVal strPattern As String, _
ByVal strReplace As String, _
Optional booIgnCase As Boolean = False, _
Optional booGlobal As Boolean = True) As String
Dim oRegExp As RegExp
Dim strOutp As String
Set oRegExp = New RegExp
With oRegExp
.IgnoreCase = booIgnCase
.Global = booGlobal
.pattern = strPattern
strOutp = .Replace(strInput, strReplace)
RegExpReplace = strOutp
End With
Set oRegExp = Nothing
End Function
Edit:
After some research about regex capabilities with VBScript (and VBScript syntax), the simplest way seems to be:
Dim re: Set re = New RegExp
re.Pattern = "<em([^>]*)>(?!carmen</em>)([\s\S]*?)</em>"
re.Global = True
re.IgnoreCase = True
Dim str: str = "<em class=""truc"">where</em> in the <eM>world</em> is <em>carmen</em> sandiego?"
Dim rep: rep = "<strong$1>$2</strong>"
MsgBox re.Replace(str, rep)
Pattern description:
<em # literal: <em
([^>]*) # capture group 1: all characters except > zero or more times
> # literal: >
(?!carmen</em>) # lookahead assertion: not followed by "carmen</em>"
( # capture group 2:
[\s\S] # any whitespace character + any non-whitespace character
# = all possible characters (including newlines)
*? # repeat zero or more times (lazy)
) # close capture group 2
</em> # literal: </em>
The pattern is designed to exclude exactly "carmen". If you want to exclude substrings that contain "carmen", you must change the pattern and take care not to check the word outside the tags (<em>blah blah blah</em> carmen).
The simplest way:
<em([^>]*)>((?:(?!carmen)[\s\S])*?)</em>
Note that this way is particularly inefficient, since the regex engine must check (?!carmen) for each character.
Another way:
<em([^>]*)>((?:[^<c]+|c(?!armen)|<(?!/em>))*)</em>
This pattern seems to be a good idea, but there's a problem. All works fine when the string contains the closing tag </em>, but if the closing tag is missing your script will simply crash because of catastrophic backtracking. You can find more information about this here.
A way to solve the problem is to use an atomic group (?>..) (inside which the regex engine is not allowed to backtrack) in place of the non-capturing group (?:..), but VBS regexes (like JavaScript's) don't have this feature.
However, you can emulate this feature using a lookahead, a capturing group and a backreference: (?=(pattern))\1 is equivalent to (?>pattern), because a lookahead is naturally atomic.
If I rewrite the previous pattern with this trick, I obtain:
<em([^>]*)>((?:(?=([^<c]+|c(?!armen)|<(?!/em>)))\3)*)</em>
This expression works perfectly.
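A minimal sketch for trying the emulated atomic group on both a well-formed string and one with a missing closing tag follows. The second input is a made-up example, and late binding is used so no library reference is needed.

```vba
Sub DemoEmulatedAtomicGroup()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' (?=( ... ))\3 stands in for the unsupported atomic group (?> ... )
    re.Pattern = "<em([^>]*)>((?:(?=([^<c]+|c(?!armen)|<(?!/em>)))\3)*)</em>"

    Dim ok As String, broken As String
    ok = "<em class=""truc"">where</em> in the <eM>world</em> is <em>carmen</em> sandiego?"
    ' Hypothetical input whose closing </em> tag is missing
    broken = "<em>" & String(40, "a") & " no closing tag here"

    Debug.Print re.Replace(ok, "<strong$1>$2</strong>")
    Debug.Print re.Replace(broken, "<strong$1>$2</strong>")
End Sub
```

The second call is the interesting one: with no </em> to find, the pattern should fail quickly and return the input unchanged instead of backtracking through every way of splitting the run of letters.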
<(em)>((?!.*?et al).*?)</\1>
So essentially it captures
(em)
for using it in the end tag
</\1>
excludes the string even if there are characters before
(?!.*?et al)
or after
(?!.*?et al).*?
and captures the result
((?!.*?et al).*?)
Well, more or less that's what it does :)
For replacing using my function these are some examples:
Ex.1:
?RegExpReplace("<em>et al</em>", _
"<(em)>((?!.*?et al).*?)</\1>", _
"<strong>$2</strong>", _
True)
Result:
<em>et al</em>
Ex.2:
?RegExpReplace("<em>et al </em>", _
"<(em)>((?!.*?et al).*?)</\1>", _
"<strong>$2</strong>", _
True)
Result:
<em>et al </em>
Ex.3:
?RegExpReplace("<em> et al</em>", _
"<(em)>((?!.*?et al).*?)</\1>", _
"<strong>$2</strong>", _
True)
Result:
<em> et al</em>
Ex.4:
?RegExpReplace("<em>et a</em>", _
"<(em)>((?!.*?et al).*?)</\1>", _
"<strong>$2</strong>", _
True)
Result:
<strong>et a</strong>
Ex.5:
?RegExpReplace("<em>t al</em>", _
"<(em)>((?!.*?et al).*?)</\1>", _
"<strong>$2</strong>", _
True)
Result:
<strong>t al</strong>
Note the use of the backreferences in the searching pattern and in the replacing string. In the searching pattern one has to use the backslash and the reference number; in the replacing string one has to use the dollar sign and the reference number.
Finally, I have to disagree with the notion that RegExp is not useful, or is even dangerous, for editing HTML (docs or strings).
Parsing HTML is easily accomplished with the DOM, and that is without doubt the recommended tool.
So I use the DOM for parsing the HTML and extracting the different parts, and RegExp for modifying the details.
Hope this helps others.
Regards,
Diego
Sorry, another question about MsAccess.
I have data set:
Phone Number
444-514-9864
555-722-2273
333-553- 4535
000-000- 0000
550-322-6888
444-896-5371
322-533-1448
222.449.2931
222.314.5208
222.745.6001
I need it to look like (222) 896-5371.
How do I do it in Ms Access or MsExcel?
You can use the InStr, Mid, Left and Right functions to make this work. I have made one example; with MSDN you should be able to figure out the rest.
Dim OldPhoneNumber As String
Dim NewPhoneNumber As String
Dim PreFix As String
Dim PreFix2 As String
' You can replace this line in Access, just make sure the full phone number is stored in "OldPhoneNumber"
OldPhoneNumber = Worksheets(<worksheet name>).Range(<cell name>).Value
PreFix = Left(OldPhoneNumber, InStr(1, OldPhoneNumber, "-", 1))
PreFix2 = Left(OldPhoneNumber, InStr(1, OldPhoneNumber, "-", 1) - 1)
NewPhoneNumber = Replace(OldPhoneNumber, PreFix, "(" & PreFix2 & ") ")
Debug.Print (NewPhoneNumber)
Seeing as not all your phone numbers are formatted the same way, you would have to make a different rule for each format (you need one that checks for "-" and one that checks for "."). You also might want to filter out the spaces.
In Access you set the "Input mask" to : "("000") "000"-"0000;1;_
All the references http://office.microsoft.com/en-ca/access-help/input-mask-syntax-and-examples-HP005187550.aspx
An input mask will only work for new data. You will need to create a macro or function to update your existing data so that it is consistent with your desired format.
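One way to bring existing data into the desired format without a separate rule per separator is to strip everything that is not a digit and then rebuild the number. The sketch below is a minimal illustration of that idea; the function name FormatPhone is hypothetical, and it assumes every valid number has exactly ten digits.

```vba
Public Function FormatPhone(ByVal raw As Variant) As Variant
    ' Keeps only the digits, then rebuilds the number as (xxx) xxx-xxxx.
    Dim i As Long, ch As String, digits As String

    If IsNull(raw) Then
        FormatPhone = Null
        Exit Function
    End If

    For i = 1 To Len(raw)
        ch = Mid$(raw, i, 1)
        If ch Like "[0-9]" Then digits = digits & ch
    Next i

    If Len(digits) = 10 Then
        FormatPhone = "(" & Left$(digits, 3) & ") " & _
                      Mid$(digits, 4, 3) & "-" & Right$(digits, 4)
    Else
        FormatPhone = raw   ' unexpected length: leave the value untouched
    End If
End Function
```

For example, FormatPhone("333-553- 4535") gives (333) 553-4535 and FormatPhone("222.449.2931") gives (222) 449-2931. In Access the function could be called from an update query on the phone field; in Excel it could be used as a worksheet function.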
After applying the unpivot procedure, I have an Amount column that has blanks and other characters (like "-"). I would like to convert those non-numeric values to zero. I used the replace procedure, but it only converts one at a time.
Also, I tried to use the following script
/**
Public Overrides Sub Input()_ProcessInputRows(ByVal Row As Input()Buffer)
If Row.ColumnName_IsNull = False Or Row.ColumnName = "" Then
Dim pattern As String = String.Empty
Dim r As Regex = Nothing
pattern = "[^0-9]"
r = New Regex(pattern, RegexOptions.Compiled)
Row.ColumnName = Regex.Replace(Row.ColumnName, pattern, "")
End If
End Sub
**/
but I'm getting an error. I don't know much about scripting, so maybe I placed it in the wrong place. The bottom line is that I need to convert those non-numeric values.
Thank you in advance for your help.
I generally look at regular expressions as a great way to introduce another problem into an existing one.
What I did to simulate your problem was to write a select statement that added 5 rows. 2 with valid numbers, the rest were an empty string, string with spaces and one with a hyphen.
I then wired it up to a Script Component and set the column as read/write
The script I used is as follows. I verified there was a value there and if so, I attempted to convert the value to an integer. If that failed, then I assigned it zero. VB is not my strong suit so if this could have been done more elegantly, please edit my script.
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
' Ensure we have data to work with
If Not Row.ColumnName_IsNull Then
' Test whether it's a number or not
' TryCast doesn't work with value types so I'm going the lazy route
Try
' Cast to an integer and then back to string because
' my vb is weak
Row.ColumnName = CStr(CType(Row.ColumnName, Integer))
Catch ex As Exception
Row.ColumnName = "0"
End Try
End If
End Sub