Extracing Non String Text HTML VBA

Extracing Non String Text HTML VBA - html

So, I am trying to get a date out of html using VBA in Excel, and I am having issues finding a way to extract the text that I want it appears as:
<SPAN id=ctl00_ContentPlaceHolder1_lblDateCreated2>5/22/2012 8:14:08 PM</SPAN>
I want extract the 5/22/2012 8:14:08, but as it is not a string and in between the carats, I don't know exactly how to do it. Any tips?

I figured out that I was using ".innerText" incorrectly, and I was able to get it working with the following snippet.
Doc.getElementById("ctl00_ContentPlaceHolder1_lblDateCreated2").innerText

You could do this in VBA with split:
theString = "<SPAN id=ctl00_ContentPlaceHolder1_lblDateCreated2>5/22/2012 8:14:08 PM</SPAN>"
Temp = Split(theString, "ContentPlaceHolder1_lblDateCreated2>")(1)
Final = Split(Temp, "</")(0)
The first Split will return an array of two parts:
Temp(0) = "<SPAN id=ctl00_"
Temp(1) = "5/22/2012 8:14:08 PM</SPAN>"
Next we Split Temp(1) to remove the closing SPAN tag and return just the date and time.

I think you're just looking for a Mid() formula. If that URL/Span part in A1, put this in A2 (or wherever):
=MID(A1,SEARCH(">",A1)+1,FIND("</",A1)-FIND(">",A1)-1)

Related

How to extract the hyperlink text from a <a> html tag?

Given a string containing 'blabla text blabla', I want to extract 'text' from it.
regexp doc suggests '<(\w+).*>.*</\1>' expression, but it extracts the whole <a> ... </a> thing.
Of course I can continue using strfind like this:
line = 'blabla text blabla';
atag = regexp(line,'<(\w+).*>.*</\1>','match', 'once');
from = strfind(atag, '>');
to = strfind(atag, '<');
text = atag((from(1)+1):(to(2)-1))
, but, can I use another expression to find text at once?

You can use the extractHTMLText function in Matlab, you can read about it in the following link.
Example that get the desired output:
line = 'blabla text blabla';
l = split(extractHTMLText(line), ' ');
l{2}
If you don't want to use a built in function you could use regex as Nick suggested.
line = 'blabla text blabla';
[atag,tok] = regexp(line,'<(\w+).*>(.*?)</\1>','match','tokens');
t = tok(1,1){1};
t{2}
and you'll get the desired output

You can simply use a Group.
Update of your pattern will be something like this:
<(\w+).*>(.*)<\/\1>
and this one include all tags:
<.*>(.*)<.*>
Regex101

If you are using JQuery try this. No Regex required. But this might negatively impact performance if the DOM is hefty.
$jqueryobj = $(line);
var text = $jqueryobj.find("a").text();

How to write regex expression for this type of text?

I'm trying to extract the price from the following HTML.
<td>$75.00/<span class='small font-weight-bold text-
danger'>Piece</span></small> *some more text here* </td>
What is the regex expression to get the number 75.00?
Is it something like:
<td>$*/<span class='small font-weight-bold text-danger'>

The dollar sign is a special character in regex, so you need to escape it with a backslash. Also, you only want to capture digits, so you should use character classes.
<td>\$(\d+[.]\d\d)<span
As the other respondent mentioned, regex changes a bit with each implementing language, so you may have to make some adjustments, but this should get you started.

I think you can go with /[0-9]+\.[0-9]+/.
[0-9] matches a single number. In this example you should get the number 7.
The + afterwards just says that it should look for more then just one number. So [0-9]+ will match with 75. It stops there because the character after 5 is a period.
Said so we will add a period to the regex and make sure it's escaped. A period usually means "every character". By escaping it will just look for a period. So we have /[0-9]+\./ so far.
Next we just to add [0-9]+ so it will find the other number(s) too.
It's important that you don't give it the global-flag like this /[0-9]+\.[0-9]+/g. Unless you want it to find more then just the first number/period-combination.
There is another regex you can use. It uses the parentheses to group the part you're looking for like this: /<td>\$(.+)<span/
It will match everything from <td>$ up to <span. From there you can filter out the group/part you're looking for. See the examples below.
// JavaScript
const text = "<td>$something<span class='small font-weight..."
const regex = /<td>\$(.+)<span/g
const match = regex.exec(text) // this will return an Array
console.log( match[1] ) // prints out "something"
// python
text = "<td>$something<span class='small font-weight..."
regex = re.compile(r"<td>\$(.+)<span")
print( regex.search(text).group(1) ) // prints out "something"

As an alternative you could use a DOMParser.
Wrap your <td> inside a table, use for example querySelector to get your element and get the first node from the childNodes.
That would give you $75.00/.
To remove the $ and the trailing forward slash you could use slice or use a regex like \$(\d+\.\d+) and get the value from capture group 1.
let html = `<table><tr><td>$75.00/<span class='small font-weight-bold text-
danger'>Piece</span></small> *some more text here* </td></tr></table>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
let result = doc.querySelector("td");
let textContent = result.childNodes.item(0).nodeValue;
console.log(textContent.slice(1, -1));
console.log(textContent.match(/\$(\d+\.\d+)/)[1]);

JSFL: convert text from a textfield to a HTML-format string

I've got a deceptively simple question: how can I get the text from a text field AND include the formatting? Going through the usual docs I found out it is possible to get the text only. It is also possible to get the text formatting, but this only works if the entire text field uses only one kind of formatting. I need the precise formatting so that I convert it to a string with html-tags.
Personally I need this so I can pass it to a custom-made text field component that uses HTML for formatting. But it could also be used to simply export the contents of any text field to any other format. This could be of interest to others out there, too.
Looking for a solution elsewhere I found this:
http://labs.thesedays.com/blog/2010/03/18/jsfl-rich-text/
Which seems to do the reverse of what I need, convert HTML to Flash Text. My own attempts to reverse this have not been successful thus far. Maybe someone else sees an easy way to reverse this that I’m missing? There might also be other solutions. One might be to get the EXACT data of the text field, which should include formatting tags of some kind(XML, when looking into the contents of the stored FLA file). Then remove/convert those tags. But I have no idea how to do this, if at all possible. Another option is to cycle through every character using start- and endIndex, and storing each formatting kind in an array. Then I could apply the formatting to each character. But this will result in excess tags. Especially for hyperlinks! So can anybody help me with this?

A bit late to the party but the following function takes a JSFL static text element as input and returns a HTML string (using the Flash-friendly <font> tag) based on the styles found it its TextRuns array. It's doing a bit of basic regex to clear up some tags and double spaces etc. and convert /r and /n to <br/> tags. It's probably not perfect but hopefully you can see what's going on easy enough to change or fix it.
function tfToHTML(p_tf)
{
var textRuns = p_tf.textRuns;
var html = "";
for ( var i=0; i<textRuns.length; i++ )
{
var textRun = textRuns[i];
var chars = textRun.characters;
chars = chars.replace(/\n/g,"<br/>");
chars = chars.replace(/\r/g,"<br/>");
chars = chars.replace(/ /g," ");
chars = chars.replace(/. <br\/>/g,".<br/>");
var attrs = textRun.textAttrs;
var font = attrs.face;
var size = attrs.size;
var bold = attrs.bold;
var italic = attrs.italic;
var colour = attrs.fillColor;
if ( bold )
{
chars = "<b>"+chars+"</b>";
}
if ( italic )
{
chars = "<i>"+chars+"</i>";
}
chars = "<font size=\""+size+"\" face=\""+font+"\" color=\""+colour+"\">"+chars+"</font>";
html += chars;
}
return html;
}

How to remove some html tags?

I'm trying to find a regex for VBScript to remove some html tags and their content from a string.
The string is,
<H2>Title</H2><SPAN class=tiny>Some
text here</SPAN><LI>Some list
here</LI><SCRITP>Some script
here</SCRITP><P>Some text here</P>
Now, I'd like to EXCLUDE <SPAN class=tiny>Some text here</SPAN> and <SCRITP>Some script here</SCRITP>
Maybe someone has a simple solution for this, thanks.

This should do the trick in VBScript:
Dim myRegExp, ResultString
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Global = True
myRegExp.Pattern = "<span class=tiny>[\s\S]*?</span>|<script>[\s\S]*?</script>"
ResultString = myRegExp.Replace(SubjectString, "")
SubjectString is the variable with your original HTML and ResultString receives the HTML with all occurrences of the two tags removed.
Note: I'm assuming scritp in your sample is a typo for script. If not, adjust my code sample accordingly.

You could do this a lot easier using css:
span.tiny {
display: none;
}
or using jQuery:
$("span.tiny").hide();

I think you want this
$(function(){
$('span.tiny').remove();
$('script').remove();
})

replace keyword within html string

I am looking for a way to replace keywords within a html string with a variable. At the moment i am using the following example.
returnString = Replace(message, "[CustomerName]", customerName, CompareMethod.Text)
The above will work fine if the html block is spread fully across the keyword.
eg.
<b>[CustomerName]</b>
However if the formatting of the keyword is split throughout the word, the string is not found and thus not replaced.
e.g.
<b>[Customer</b>Name]
The formatting of the string is out of my control and isn't foolproof. With this in mind what is the best approach to find a keyword within a html string?

Try using Regex expression. Create your expressions here, I used this and it works well.
http://regex-test.com/validate/javascript/js_match

Use the text property instead of innerHTML if you're using javascript to access the content. That should remove all tags from the content, you give back a clean text representation of the customer's name.
For example, if the content looks like this:
<div id="name">
<b>[Customer</b>Name]
</div>
Then accessing it's text property gives:
var name = document.getElementById("name").text;
// sets name to "[CustomerName]" without the tags
which should be easy to process. Do a regex search now if you need to.
Edit: Since you're doing this processing on the server-side, process the XML recursively and collect the text element's of each node. Since I'm not big on VB.Net, here's some pseudocode:
getNodeText(node) {
text = ""
for each node.children as child {
if child.type == TextNode {
text += child.text
}
else {
text += getNodeText(child);
}
}
return text
}
myXml = xml.load(<html>);
print getNodeText(myXml);
And then replace or whatever there is to be done!

I have found what I believe is a solution to this issue. Well in my scenario it is working.
The html input has been tweaked to place each custom field or keyword within a div with a set id. I have looped through all of the elements within the html string using mshtml and have set the inner text to the correct value when a match is found.
e.g.
Function ReplaceDetails(ByVal message As String, ByVal customerName As String) As String
Dim returnString As String = String.Empty
Dim doc As IHTMLDocument2 = New HTMLDocument
doc.write(message)
doc.close()
For Each el As IHTMLElement In doc.body.all
If (el.id = "Date") Then
el.innerText = Now.ToShortDateString
End If
If (el.id = "CustomerName") Then
el.innerText = customerName
End If
Next
returnString = doc.body.innerHTML
return returnString
Thanks for all of the input. I'm glad to have a solution to the problem.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008