Pulling out specific text in an html file using vb.net - html

I am trying to get three values from a large html file. I thought I could use the substring method, but was informed that the position of the data may change. Basically, in the following code I need to pick out "Total number of records: 106", "Number of records imported:106", and "Number of records rejected: 0"
<B>Total number of records : </B>106</Font><br><Font face="arial" size="2"><B>Number of records imported : </B>106</Font><br><Font face="arial" size="2"><B>Number of records rejected : </B>0</Font>
I hope this is clear enough. Thanks in advance!

Simple string operations like IndexOf() and Substring() should be plenty to do the job. Regular Expressions would be another approach that'd take less code (and may allow more flexibility if the HTML tags can vary), but as Mark Twain would say, I didn't have time for a short solution, so I wrote a long one instead.
In general you'll get better results around here by showing you've at least made a reasonable attempt first and showing where you got stuck. But for this time...here you go. :-)
Private Shared Function GetMatchingCount(allInputText As String, textBefore As String, textAfter As String) As Integer?
    'Find the first occurrence of the text before the desired number
    Dim startPosition As Integer = allInputText.IndexOf(textBefore)
    'If text before was not found, return Nothing
    If startPosition < 0 Then Return Nothing
    'Move the start position to the end of the text before, rather than the beginning.
    startPosition += textBefore.Length
    'Find the first occurrence of text after the desired number
    Dim endPosition As Integer = allInputText.IndexOf(textAfter, startPosition)
    'If text after was not found, return Nothing
    If endPosition < 0 Then Return Nothing
    'Get the string found at the start and end positions
    Dim textFound As String = allInputText.Substring(startPosition, endPosition - startPosition)
    'Try converting the string found to an integer
    Try
        Return CInt(textFound)
    Catch ex As Exception
        Return Nothing
    End Try
End Function
Of course, it'll only work if the text before and after is always the same. If you use that with a driver console app like this (but without the Shared, since it'd be in a Module then)...
Sub Main()
    Dim allText As String = "<B>Total number of records : </B>106</Font><br><Font face=""arial"" size=""2""><B>Number of records imported : </B>106</Font><br><Font face=""arial"" size=""2""><B>Number of records rejected : </B>0</Font>"
    Dim totalRecords As Integer? = GetMatchingCount(allText, "<B>Total number of records : </B>", "<")
    Dim recordsImported As Integer? = GetMatchingCount(allText, "<B>Number of records imported : </B>", "<")
    Dim recordsRejected As Integer? = GetMatchingCount(allText, "<B>Number of records rejected : </B>", "<")
    Console.WriteLine("Total: {0}", totalRecords)
    Console.WriteLine("Imported: {0}", recordsImported)
    Console.WriteLine("Rejected: {0}", recordsRejected)
    Console.ReadKey()
End Sub
...you'll get output like so:
Total: 106
Imported: 106
Rejected: 0
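For comparison, the regular-expression route mentioned above can be sketched in a few lines. This is a Python illustration rather than VB.NET, only to show the shape of the pattern. The function name get_count is hypothetical, and the sketch assumes each label sits inside <B>...</B> and is followed directly by the number, as in the sample from the question.

```python
import re

# Sample markup copied from the question.
SAMPLE = (
    '<B>Total number of records : </B>106</Font><br>'
    '<Font face="arial" size="2"><B>Number of records imported : </B>106</Font><br>'
    '<Font face="arial" size="2"><B>Number of records rejected : </B>0</Font>'
)


def get_count(html, label):
    """Return the integer that follows '<B>label : </B>', or None if absent."""
    # Tolerate varying whitespace around the colon and either tag case.
    pattern = r"<b>\s*" + re.escape(label) + r"\s*:\s*</b>\s*(\d+)"
    match = re.search(pattern, html, re.IGNORECASE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for label in ("Total number of records",
                  "Number of records imported",
                  "Number of records rejected"):
        print(label, get_count(SAMPLE, label))
```

Unlike the IndexOf version, this does not depend on the exact spacing around the colon, but it still breaks if the label text itself changes.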

Related

Find the position of the first occurrence of any number in string (if present) in MS Access

In MS Access I have a table with a Short Text field named txtPMTaskDesc. Some records contain numbers, at varying positions in the string. I would like to recover these numbers from the text string, if possible, for sorting purposes.
There are over 26000 records in the table, so I would rather handle it in a query over using VBA loops etc.
Sample Data
While the end goal is to recover the whole number, I was going to start with just identifying the position of the first numerical value in the string. I have tried a few things to no avail like:
InStr(1,[txtPMTaskDesc],"*[0-9]*")
Once I get that, I was going to use it as part of a Mid() function to pull out it and the character next to it, like below. (It's a bit dodgy, but there is never more than a two-digit number in the text string.)
IIf(InStr(1,[txtPMTaskDesc],"*[0-9]*")>0,Mid([txtPMTaskDesc],InStr(1,[txtPMTaskDesc],"*[0-9]*"),2)*1,0)
Any assistance appreciated.
If the data is truly representative and the number is always preceded by "- No ", then the expression in the query can be:
Val(Mid(txtPMTaskDesc, InStr(txtPMTaskDesc, "- No ") + 5))
If there is no match, 0 is returned; however, if the field is Null, the expression will error.
If the string does not have a consistent pattern (numbers always in the same position, or preceded by some distinct character combination that can be used to locate the position), I don't think you can get what you want without VBA. Either loop through the string or explore Regular Expressions, aka RegEx. Set a reference to the Microsoft VBScript Regular Expressions x.x library.
Function GetNum(strS As String) As Variant
    Dim re As RegExp, Match As Object
    Set re = New RegExp
    ' One or more consecutive digits
    re.Pattern = "\d+"
    Set Match = re.Execute(strS)
    GetNum = Null
    If Match.Count > 0 Then GetNum = Match(0).Value
End Function
Input of string "Fuel Injector - No 1 - R&I" returns 1.
Place function in a general module and call it from query.
SELECT table.*, GetNum(Nz(txtPMTaskDesc,"")) AS Num FROM table;
Function returns Null if there is no number match.
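To make the "first number and its position" idea concrete, here is a small sketch of the same regex search. It is written in Python purely as an illustration, not as something Access can call. The function name first_number is hypothetical, and the position is reported 1-based to match what InStr and Mid expect.

```python
import re


def first_number(text):
    """Return (position, value) of the first run of digits, or None."""
    if text is None:
        return None
    match = re.search(r"\d+", text)
    if match is None:
        return None
    # match.start() is 0-based; add 1 for an InStr-style position.
    return match.start() + 1, int(match.group())


if __name__ == "__main__":
    print(first_number("Fuel Injector - No 1 - R&I"))
```

Because the pattern is \d+ rather than a single digit, a multi-digit number is returned whole, so there is no need to guess how many characters to take with Mid().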
Well, does the number you want ALWAYS have a - No xxxx - format?
If yes, then you could have this global function in VBA like this:
Public Function GNUM(v As Variant) As Long
    If IsNull(v) Then
        GNUM = 0
        Exit Function
    End If
    Dim vBuf As Variant
    vBuf = Split(v, " - No ")
    Dim strRes As String
    If UBound(vBuf) > 0 Then
        strRes = Split(vBuf(1), "-")(0)
        ' Val returns 0 for non-numeric text instead of raising a type mismatch
        GNUM = Val(strRes)
    Else
        GNUM = 0
    End If
End Function
Then your sql will be like this:
SELECT BLA, BLA, txtPMTaskDesc, GNUM([txtPMTaskDesc]) AS TaskNum
FROM myTable
So you can create/have a public VBA function, and it can be used in the sql query.
It is just a question of whether " - No " is ALWAYS in that format, with the number following it.
So we have "space" "-" "space" "No" "space", then the number, and then the " -".
How well this will work depends on how consistent this text is.

Regular Expression Pattern Matching to HTML content

I am trying to do a Regular Expression search on a string holding the HTML content of a web search. The pattern I am trying to match has the format HQ 12345; the second fragment could also start with a letter, so HQ A12345 is also a possibility. As shown in the code below, the regex pattern I am using is "HQ .*[0-9]".
The problem is that when I run the regex search, the match is not just HQ 959693 but also includes the rest of the HTML file content, as shown in the snapshot of the message box below.
Sub Test()
    Dim mystring As String
    mystring = getHTMLData("loratadine")
    Dim rx As New RegExp
    rx.IgnoreCase = True
    rx.MultiLine = False
    rx.Global = True
    rx.Pattern = "HQ .*[0-9]"
    Dim mtch As Variant
    For Each mtch In rx.Execute(mystring)
        Debug.Print mtch
        MsgBox(mtch)
    Next
End Sub
Public Function getHTMLData(ByVal name As String) As String
    Dim XMLhttp: Set XMLhttp = CreateObject("MSXML2.ServerXMLHTTP")
    XMLhttp.setTimeouts 2000, 2000, 2000, 2000
    XMLhttp.Open "GET", "http://rulings.cbp.gov/results.asp?qu=" & name & "&p=1", False
    XMLhttp.send
    If XMLhttp.Status = 200 Then
        getHTMLData = XMLhttp.responsetext
    Else
        getHTMLData = ""
    End If
End Function
Use ? to make the quantifier non-greedy; otherwise the match will consume everything up to the last digit of the string (or of the line, since . does not match a newline). Also, you are only matching one digit occurrence. Add a + to specify "one or more" so it will match your goal:
HQ .*?[0-9]+
Alternatively, you can try to use a negated character class like so:
HQ [^0-9]*[0-9]+
Or you can even simplify it further:
HQ [^\d]*\d+
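The difference between these patterns is easy to see on a short string. The sketch below uses Python's re module instead of the VBScript RegExp object, and the html value is a made-up stand-in for one line of the downloaded page, so treat it as an illustration of greedy versus non-greedy matching only.

```python
import re

# Hypothetical stand-in for one line of the downloaded page.
html = '<a href="r1">HQ 959693</a> classified 2001 <a href="r2">HQ A12345</a> page 7'

# Greedy: .* runs to the end of the line, then backs up to the last digit.
greedy = re.findall(r"HQ .*[0-9]", html)

# Non-greedy: .*? stops at the first digit, and + takes the whole number.
lazy = re.findall(r"HQ .*?[0-9]+", html)

# Negated class: skip non-digits after "HQ ", then take the digits.
negated = re.findall(r"HQ [^0-9]*[0-9]+", html)

if __name__ == "__main__":
    print(greedy)
    print(lazy)
    print(negated)
```

The greedy pattern returns a single match that swallows both ruling numbers and everything between them, while the other two return each ruling number separately.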
Regex matching is greedy by default. Unfortunately I didn't manage to reproduce your issue precisely, but I am pretty sure it is because you have a long string in which '.*' matches everything up to a number at the end.
I find this link useful; see the explanation near the bottom about the greediness of *
http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
I suggest changing your Regex to:
HQ .*?[0-9]+
That will match "HQ ", then any number of characters, followed by one or more digits. Because of the "?", the ".*" will consume the minimal amount.
Please respond if this does not work and I will try getting your Regex running in Excel.

SSIS Convert Blank or other values to Zeros

After applying the unpivot procedure, I have an Amount column that has blanks and other characters (like "-"). I would like to convert those non-numeric values to zero. I used the replace procedure, but it only converts one value at a time.
Also, I tried to use the following script
/**
Public Overrides Sub Input()_ProcessInputRows(ByVal Row As Input()Buffer)
    If Row.ColumnName_IsNull = False Or Row.ColumnName = "" Then
        Dim pattern As String = String.Empty
        Dim r As Regex = Nothing
        pattern = "[^0-9]"
        r = New Regex(pattern, RegexOptions.Compiled)
        Row.ColumnName = Regex.Replace(Row.ColumnName, pattern, "")
    End If
End Sub
**/
but I'm getting an error. I don't know much about scripts, so maybe I placed it in the wrong place. The bottom line is that I need to convert those non-numeric values.
Thank you in advance for your help.
I generally look at regular expressions as a great way to introduce another problem into an existing one.
What I did to simulate your problem was to write a select statement that added 5 rows. 2 with valid numbers, the rest were an empty string, string with spaces and one with a hyphen.
I then wired it up to a Script Component and set the column as read/write.
The script I used is as follows. I verified there was a value there and if so, I attempted to convert the value to an integer. If that failed, then I assigned it zero. VB is not my strong suit so if this could have been done more elegantly, please edit my script.
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' Ensure we have data to work with
    If Not Row.ColumnName_IsNull Then
        ' Test whether it's a number or not
        ' TryCast doesn't work with value types so I'm going the lazy route
        Try
            ' Cast to an integer and then back to string because
            ' my vb is weak
            Row.ColumnName = CStr(CType(Row.ColumnName, Integer))
        Catch ex As Exception
            ' The column is a string, so assign the string "0"
            Row.ColumnName = "0"
        End Try
    End If
End Sub
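The rule the script applies (keep whole numbers, turn everything else into zero, leave NULLs alone) can be checked quickly outside SSIS. Below is a minimal Python sketch of that rule, not a port of the Script Component. The name to_int_or_zero is hypothetical, and Python's int() is stricter than VB's CType in some cases (for example, it rejects "12.5" instead of rounding it).

```python
def to_int_or_zero(value):
    """Return value as an integer string, or "0" when it is not a whole number."""
    if value is None:
        # Mirror the script above: NULL rows are left untouched.
        return None
    try:
        return str(int(value.strip()))
    except ValueError:
        # Blanks, whitespace, "-" and any other non-numeric text land here.
        return "0"


if __name__ == "__main__":
    rows = ["12", " 7 ", "", "   ", "-", None]
    print([to_int_or_zero(r) for r in rows])
```

The sample rows match the five test cases described above: two valid numbers, an empty string, a string of spaces, and a hyphen, plus a NULL.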

"String or binary data would be truncated." linq exception, cant find which field has exceeded max length

I get a "String or binary data would be truncated." LINQ exception and can't find which field has exceeded its max length.
I have around 350 fields. I checked each textbox's maxlength against the database field's max length, and everything seems to be correct, but I still get the exception.
Please help.
Troubleshooting this error with 350 fields can be extremely difficult, and SQL Server Profiler isn't much help in this case (finding the long string in the generated SQL is like finding a needle in a haystack).
So, here is an automated way to find the actual strings that are exceeding the database size limit. This is a solution that's out there on the internet, in various forms. You probably don't want to leave it in your production code, since the attribute/property searching is pretty inefficient, and it'll add extra overhead on every save. I'd just throw it in your code when you encounter this problem, and remove it when you're done.
How it works: it iterates over all properties on an object you're about to save, finding the properties with a LINQ to SQL ColumnAttribute. Then, if the ColumnAttribute.DbType contains "varchar", you know it's a string and you can parse that part of the attribute to find the maximum length.
Here's how to use it:
foreach (object update in context.GetChangeSet().Updates)
{
    FindLongStrings(update);
}
foreach (object insert in context.GetChangeSet().Inserts)
{
    FindLongStrings(insert);
}
context.SubmitChanges();
And here's the method:
// Requires: using System; using System.Reflection; using System.Data.Linq.Mapping;
public static void FindLongStrings(object testObject)
{
    foreach (PropertyInfo propInfo in testObject.GetType().GetProperties())
    {
        foreach (ColumnAttribute attribute in propInfo.GetCustomAttributes(typeof(ColumnAttribute), true))
        {
            // DbType is null when the mapping does not specify it
            if (attribute.DbType != null && attribute.DbType.ToLower().Contains("varchar"))
            {
                // Parse the N out of "varchar(N)"; "varchar(max)" fails to parse and leaves maxLength at 0
                string dbType = attribute.DbType.ToLower();
                int numberStartIndex = dbType.IndexOf("varchar(") + 8;
                int numberEndIndex = dbType.IndexOf(")", numberStartIndex);
                string lengthString = dbType.Substring(numberStartIndex, (numberEndIndex - numberStartIndex));
                int maxLength = 0;
                int.TryParse(lengthString, out maxLength);
                string currentValue = (string)propInfo.GetValue(testObject, null);
                // Report any value longer than the column allows
                if (!string.IsNullOrEmpty(currentValue) && maxLength != 0 && currentValue.Length > maxLength)
                    Console.WriteLine(testObject.GetType().Name + "." + propInfo.Name + " " + currentValue + " Max: " + maxLength);
            }
        }
    }
}
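The core of the method is the step that pulls the length out of a type string such as "VarChar(50) NOT NULL" and compares it with the value. The sketch below isolates just that step in Python, with no LINQ to SQL involved. The record and column_types dictionaries are hypothetical stand-ins for the entity and its ColumnAttribute.DbType values.

```python
import re


def find_long_strings(record, column_types):
    """Return (column, value, max_length) for each value longer than its varchar limit."""
    problems = []
    for column, db_type in column_types.items():
        # Matches varchar(50) and nvarchar(50); varchar(max) has no
        # numeric limit, so it does not match and is skipped.
        match = re.search(r"varchar\((\d+)\)", db_type.lower())
        if match is None:
            continue
        max_length = int(match.group(1))
        value = record.get(column)
        if isinstance(value, str) and len(value) > max_length:
            problems.append((column, value, max_length))
    return problems


if __name__ == "__main__":
    column_types = {"Name": "VarChar(5) NOT NULL", "Notes": "NVarChar(MAX)", "Age": "Int"}
    record = {"Name": "Alexander", "Notes": "x" * 100, "Age": 3}
    print(find_long_strings(record, column_types))
```

Returning a list instead of printing makes it easier to log the offending columns or fail a test on them.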
Update 12/03/2019 -
This answer is referenced on the Linqpad.net website for the same error. In Linqpad (version 5), which uses LinqToSql, the columns are no longer listed as properties; instead they are fields. Use the following to iterate through fields:
foreach (FieldInfo propInfo in testObject.GetType().GetFields())
...
...
string currentValue = (string)propInfo.GetValue(testObject);
...
...
If you checked the max length of every textbox to the max length of every field, it is entirely possible the error is happening through a trigger. Are there triggers on the table?
And, hey, why not have the same solution (@shaunmartin's) in VB.Net, too? Sometimes you just gotta debug someone else's code!
Use:
For Each update As Object In context.GetChangeSet().Updates
    FindLongStrings(update)
Next
For Each insert As Object In context.GetChangeSet().Inserts
    FindLongStrings(insert)
Next
And core
' Requires: Imports System.Reflection and Imports System.Data.Linq.Mapping
Public Shared Sub FindLongStrings(ByVal testObject As Object)
    Dim propInfo As PropertyInfo
    For Each propInfo In testObject.GetType().GetProperties()
        Dim attribute As ColumnAttribute
        For Each attribute In propInfo.GetCustomAttributes(GetType(ColumnAttribute), True)
            ' DbType is Nothing when the mapping does not specify it
            If attribute.DbType IsNot Nothing AndAlso attribute.DbType.ToLower().Contains("varchar") Then
                Dim dbType As String = attribute.DbType.ToLower()
                Dim numberStartIndex As Integer = dbType.IndexOf("varchar(") + 8
                Dim numberEndIndex As Integer = dbType.IndexOf(")", numberStartIndex)
                Dim lengthString As String = dbType.Substring(numberStartIndex, (numberEndIndex - numberStartIndex))
                Dim maxLength As Integer = 0
                Integer.TryParse(lengthString, maxLength)
                Dim currentValue As String = CType(propInfo.GetValue(testObject, Nothing), String)
                If Not String.IsNullOrEmpty(currentValue) AndAlso maxLength <> 0 AndAlso currentValue.Length > maxLength Then
                    Console.WriteLine(testObject.GetType().Name & "." & propInfo.Name & " " & currentValue & " Max: " & maxLength)
                End If
            End If
        Next
    Next
End Sub
But with a lot of data or a lot of fields this takes a lot of time for sure. The debugging side of linq is lacking - the exception thrown should tell you the field!
I set the max length for all 350 fields. I guess that's the only way. Thanks for your support.

Removing non-alphanumeric characters in an Access Field

I need to remove hyphens from a string in a large number of access fields. What's the best way to go about doing this?
Currently, the entries follow this general format:
2010-54-1
2010-56-1
etc.
I'm trying to run append queries off of this field, but I'm always getting validation errors causing the query to fail. I think the cause of this failure is the hyphens in the entries, which is why I need to remove them.
I've googled, and I see that there are a number of formatting guides using vbscript, but I'm not sure how I can integrate vb into Access. It's new to me :)
Thanks in advance,
Jacques
EDIT:
So, I've run a test case with some values that are simply text. They don't work either, so the issue isn't the hyphens.
I'm not sure that the hyphens are actually the problem without seeing sample data / query but if all you need to do is get rid of them, the Replace function should be sufficient (you can use this in the query)
example: http://www.techonthenet.com/access/functions/string/replace.php
If you need to do some more advanced string manipulation than this (or multiple calls to replace) you might want to create a VBA function you can call from your query, like this:
http://www.pcreview.co.uk/forums/thread-2596934.php
To do this you'd just need to add a module to your access project, and add the function there to be able to use it in your query.
I have a function I use when removing everything except Alphanumeric characters. Simply create a query and use the function in the query on whatever field you are trying to modify. Runs much faster than find and replace.
Public Function AlphaNumeric(inputStr As String)
    Dim ascVal As Integer, originalStr As String, newStr As String, counter As Integer, trimStr As String
    ' send errors to the message handler
    On Error GoTo Err_Stuff
    ' if nothing there, quit
    If inputStr = "" Then Exit Function
    ' trim out leading and trailing spaces
    trimStr = Trim(inputStr)
    ' initialize the string to return
    newStr = ""
    ' iterate over the length of the string
    For counter = 1 To Len(trimStr)
        ' find the ASCII value of the current character
        ascVal = Asc(Mid$(trimStr, counter, 1))
        Select Case ascVal
            Case 48 To 57, 65 To 90, 97 To 122
                ' digits and letters are acceptable to keep:
                ' add the character to the new string
                newStr = newStr & Chr(ascVal)
        End Select
    ' move to next character
    Next counter
    ' return the completed string
    AlphaNumeric = newStr
    Exit Function
Err_Stuff:
    ' handler for errors
    MsgBox Err.Number & " " & Err.Description
End Function
Just noticed the link to the code, looks similar to mine. Guess this is just another option.
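For anyone who wants to check expected results before running a query, the same filter is a one-liner with a regular expression. This is a Python sketch for comparison, not VBA. The function name alphanumeric is hypothetical, and the character class mirrors the ASCII ranges 48 To 57, 65 To 90 and 97 To 122 used above.

```python
import re


def alphanumeric(text):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


if __name__ == "__main__":
    for entry in ("2010-54-1", "2010-56-1"):
        print(entry, "->", alphanumeric(entry))
```

If only hyphens need to go, the built-in Replace function mentioned earlier is simpler; a filter like this is for the case where anything outside letters and digits should be removed.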