Extract href value from html string using QRegExp

I am downloading a web page and I am trying to extract some values from it.
The places of the page that I am interested in are of this type:
<a data-track=\"something\" href=\"someurl\" title=\"Heaven\"><img src=\"somesource.jpg\" /></a>
and I need to extract the href (someurl) value. Note that there are multiple entries like the one above in the HTML string that I have and thus I will use a list to store all the URLs that I extract from the string.
This is what I've tried so far:
QString html_str=myfile();
QRegExp regex("<a data-track\\=\"something\" href\\=\".*(?=\" title)");
if(regex.indexIn(html_str) != -1){
QStringList list;
QString str;
list = regex.capturedTexts();
foreach(str,list)
qDebug() << str.remove("<a data-track=\"something\" href=\"");
}
With the above code I get only one occurrence (list.count() == 1) which contains the whole HTML string from the first occurrence of someurl till the end of the file, without the <a data-track="something" href="" in it, which have all been removed.

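I'd do it like this (make sure you double-check your regex). Note that cap() belongs to the QRegExp, not to the QString, and that the pattern needs minimal (non-greedy) matching so that .* stops at the first " title:
QRegExp regex("<a data-track=\"something\" href=\".*(?=\" title)");
regex.setMinimal(true); // non-greedy: stop at the first occurrence of " title
if (regex.indexIn(html_str) != -1) qDebug() << regex.cap(0).remove("<a data-track=\"something\" href=\"");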

You can use a while loop to step through the string, restarting the search after each match:
int pos = 0;
while ((pos = regex.indexIn(htmlContent, pos)) != -1) { // parentheses matter: assign first, then compare with -1
QStringList list = regex.capturedTexts();
foreach (const QString &url, list) {
// do something
}
pos += regex.matchedLength(); // continue after the current match
}
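
For comparison, here is a minimal sketch of the same extraction outside Qt, in Python. It assumes the anchors look exactly like the sample in the question (same attribute order, plain double quotes); the function name extract_hrefs and its track parameter are made up for illustration. Matching [^"]* instead of .* keeps each match inside a single href value, so one match can never run on to the end of the file:

```python
import re


def extract_hrefs(html, track="something"):
    """Return every href value from anchors carrying the given data-track value."""
    # [^"]* cannot cross a closing quote, so each match stays inside one attribute.
    pattern = re.compile(
        r'<a data-track="' + re.escape(track) + r'" href="([^"]*)" title'
    )
    # findall returns the contents of the single capture group for every match.
    return pattern.findall(html)


if __name__ == "__main__":
    sample = (
        '<a data-track="something" href="someurl" title="Heaven">'
        '<img src="somesource.jpg" /></a>'
    )
    print(extract_hrefs(sample))
```

A regular expression like this only works while the markup stays this regular; for arbitrary pages an HTML parser is the safer tool.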

Related

SSIS Script howto append text to end of each row in flat file?

I currently have a flat file with around 1 million rows.
I need to add a text string to the end of each row in the file.
I've been trying to adapt the following code, but without success:
public void Main()
{
// TODO: Add your code here
var lines = System.IO.File.ReadAllLines(@"E:\SSISSource\Source\Source.txt");
foreach (string item in lines)
{
var str = item.Replace("\n", "~20221214\n");
var subitems = str.Split('\n');
foreach (var subitem in subitems)
{
// write the data back to the file
}
}
Dts.TaskResult = (int)ScriptResults.Success;
}
I can't seem to get the code to recognise the line terminator "\n", and I am not sure how to write each row back so that it replaces the existing row rather than adding a new one. Or is the above code sending me down a rabbit hole, and is there an easier method?
Many thanks for any pointers and/or assistance.

ReadAllLines is likely stripping the \n from each record, so your Replace has nothing to match.
Simply append your string to each line; otherwise use @billinKC's solution.
BONUS:
I think DateTime.Now.ToString("yyyyMMdd"); is what you are trying to append to each line
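
To illustrate the "just append" idea independently of SSIS, here is a small Python sketch. The function name append_suffix and its parameters are hypothetical. It streams the file one line at a time, which should be comfortable for a file of around a million rows, and it writes to a second file instead of editing the source in place:

```python
def append_suffix(src_path, dst_path, suffix, encoding="utf-8"):
    """Copy src_path to dst_path, adding suffix to the end of every row."""
    # newline="" turns off newline translation, so each original line ending
    # (\n or \r\n) is read and written back exactly as it was.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            body = line.rstrip("\r\n")   # the row without its terminator
            ending = line[len(body):]    # the terminator ("" on an unterminated last row)
            dst.write(body + suffix + ending)
```

For example, append_suffix("Source.txt", "Source.fixed.txt", "~20221214") would write the suffixed rows to a new file and leave the original untouched.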

Thanks @billinKC & @KeithL
KeithL, you were correct that the \n was stripped off. So I used a slightly amended version of @billinKC's code to get what I wanted:
string origFile = @"E:\SSISSource\Source\Source.txt";
string fixedFile = @"E:\SSISSource\Source\Source.fixed.txt";
// Make a blank file
System.IO.File.WriteAllText(fixedFile, "");
var lines = System.IO.File.ReadAllLines(origFile);
foreach (string item in lines)
{
var str = item + "~20221214\n";
System.IO.File.AppendAllText(fixedFile, str);
}
As an aside KeithL - thanks for the DateTime code however the text that I am appending is obtained from a header row in the source file which is being read into a variable in an earlier step.

I read your code as
For each line in the file, replace the existing newline character with ~20221214 newline
At that point, the value of str is what you need, just write that! Instead, you split based on the new line which gets you an array of values which could be fine but why do the extra operations?
string origFile = @"E:\SSISSource\Source\Source.txt";
string fixedFile = @"E:\SSISSource\Source\Source.fixed.txt";
// Make a blank file
System.IO.File.WriteAllText(fixedFile, "");
var lines = System.IO.File.ReadAllLines(origFile);
foreach (string item in lines)
{
var str = item.Replace("\n", "~20221214\n");
System.IO.File.AppendAllText(fixedFile, str);
}
Something like this ought to be what you're looking for.

Parsing string retrieved with Jsoup in Android

I am writing an Android App that will read some info from a website and display it on the App's screen. I am using the Jsoup library to get the info in the form of a string. First, here's what the website html looks like:
<strong>
Now is the time<br />
For all good men<br />
To come to the aid<br />
Of their country<br />
</strong>
Here's how I'm retrieving and trying to parse the text:
Document document = Jsoup.connect(WEBSITE_URL).get();
resultAggregator = "";
Elements nodePhysDon = document.select("strong");
//check results
if (nodePhysDon.size()> 0) {
//get value
donateResult = nodePhysDon.get(0).text();
resultAggregator = donateResult;
}
if (resultAggregator != "") {
// split resultAggregator into an array breaking up with br /
String donateItems[] = resultAggregator.split("<br />");
}
But then donateItems[0] is not just "Now is the time"; it's all four strings put together. I have also tried without the space between "br" and "/", and get the same result. If I do resultAggregator.split("br"); then donateItems[0] is just the first word: "Now".
I suspect the problem is that the Jsoup select method is stripping the tags out?
Any suggestions? I can't change the website's html. I have to work with it as is.

Try this:
//check results
if (nodePhysDon.size()> 0) {
//use toString() to get the selected block with tags included
donateResult = nodePhysDon.get(0).toString();
resultAggregator = donateResult;
}
if (!resultAggregator.isEmpty()) { // compare string contents; != only compares references
// remove <strong> and </strong> tags
resultAggregator = resultAggregator.replace("<strong>", "");
resultAggregator = resultAggregator.replace("</strong>", "");
//then split with <br>
String donateItems[] = resultAggregator.split("<br>");
}
Make sure to split on <br> and not <br />: toString() returns the element's HTML as jsoup serializes it, and jsoup writes the tag as <br>.
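
The underlying idea, treating each br element as a separator instead of splitting on the literal tag text, can also be sketched without jsoup. The following uses Python's standard html.parser module; the class name BrSplitter and the helper split_on_br are invented for this illustration and are not part of any library:

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect the text inside a container element, split at each <br>."""

    def __init__(self, container="strong"):
        super().__init__()
        self.container = container
        self.depth = 0      # how many container elements we are currently inside
        self.items = []     # finished text segments
        self.current = []   # pieces of the segment being built

    def _flush(self):
        text = "".join(self.current).strip()
        if text:
            self.items.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing forms such as <br />.
        if tag == self.container:
            self.depth += 1
        elif tag == "br" and self.depth:
            self._flush()

    def handle_endtag(self, tag):
        if tag == self.container and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self._flush()

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def split_on_br(html, container="strong"):
    parser = BrSplitter(container)
    parser.feed(html)
    parser.close()
    return parser.items
```

Because the parser reports <br>, <br/> and <br /> through the same callback, the result does not depend on how the page happens to spell the tag.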

Difficulty splitting array and returning a value from within it; Javascript

I have an array that consists of strings, and I have written a function that searches the array based on a search term parameter. However, when I run the code it only ever outputs the string at index 0 of the array. I want it to return the corresponding URL in the array when a search is run.
Any help would be very much appreciated. Thanks in advance.

So you are trying to return the URL based on the string after the ~?
Does the line
arrayOfURL[i].toLowerCase().split('~')[i];
seem weird to you? Imagine i increasing, e.g. i = 4:
arrayOfURL[4].toLowerCase().split('~')[4];
Does that last [4] make sense?
I am guessing the reason it never gets past the first element is that the code is actually erroring out on that part.
I think what you want is the following (likewise, for the return line you'll want [0]):
arrayOfURL[i].toLowerCase().split('~')[1];
I would also take a look at
if (z >= searchtoLower)
What are you trying to compare there?

The problem may be the second i index:
var z = arrayOfURL[i].toLowerCase().split('~')[i];
The string is split into 2 parts (indices 0 and 1). Why did you select part i?

This is a correct version of your program:
var arrayOfURL = [
"http://www.google.co.uk~Google is a search engine.",
"http://www.yahoo.co.uk~Yahoo is another search engine.",
"http://bing.com~Bing is a decision engine."
];
function findURL(arrayOfURL,search)
{
var searchtoLower = search.toLowerCase();
for (var i = 0; i < arrayOfURL.length; i++)
{
var z = arrayOfURL[i].toLowerCase().split('~')[1];
if (z.indexOf(searchtoLower) != -1)
return arrayOfURL[i].split('~')[0]; // the URL is the part before the "~"
}
return "Nothing Found!";
}
findURL(arrayOfURL,"decision")
I hope it can help you.

I think you should be doing
var terms = arrayOfURL[i].toLowerCase().split('~');
// terms[1] is the part after the first "~";
// indexOf returns 0 or more when searchToLower is a substring of terms[1]
if (0 <= terms[1].indexOf(searchToLower))
and
return terms[0]; // terms[0] is the part before the first "~"
I would also consider returning null or the empty string "" in case of failure (instead of returning the arbitrary "Nothing Found!" message).

iTextSharp HTML to PDF preserving spaces

I am using the FreeTextBox.dll to get user input, and storing that information in HTML format in the database. A sample of the user's input is below:
                                                                     133 Peachtree St NE                                                                     Atlanta,  GA 30303                                                                     404-652-7777                                                                      Cindy Cooley                                                                     www.somecompany.com                                                                     Product Stewardship Mgr                                                                     9/9/2011Deidre's Company123 Test StAtlanta, GA 30303Test test.  
I want the HTMLWorker to preserve the white space the user enters, but it strips it out. Is there a way to preserve the user's white space? Below is an example of how I am creating my PDF document.
Public Shared Sub CreatePreviewPDF(ByVal vsHTML As String, ByVal vsFileName As String)
Dim output As New MemoryStream()
Dim oDocument As New Document(PageSize.LETTER)
Dim writer As PdfWriter = PdfWriter.GetInstance(oDocument, output)
Dim oFont As New Font(Font.FontFamily.TIMES_ROMAN, 8, Font.NORMAL, BaseColor.BLACK)
Using output
Using writer
Using oDocument
oDocument.Open()
Using sr As New StringReader(vsHTML)
Using worker As New html.simpleparser.HTMLWorker(oDocument)
worker.StartDocument()
worker.SetInsidePRE(True)
worker.Parse(sr)
worker.EndDocument()
worker.Close()
oDocument.Close()
End Using
End Using
HttpContext.Current.Response.ContentType = "application/pdf"
HttpContext.Current.Response.AddHeader("Content-Disposition", String.Format("attachment;filename={0}.pdf", vsFileName))
HttpContext.Current.Response.BinaryWrite(output.ToArray())
HttpContext.Current.Response.End()
End Using
End Using
output.Close()
End Using
End Sub

There's a glitch in iText and iTextSharp but you can fix it pretty easily if you don't mind downloading the source and recompiling it. You need to make a change to two files. Any changes I've made are commented inline in the code. Line numbers are based on the 5.1.2.0 code rev 240
The first is in iTextSharp.text.html.HtmlUtilities.cs. Look for the function EliminateWhiteSpace at line 249 and change it to:
public static String EliminateWhiteSpace(String content) {
// multiple spaces are reduced to one,
// newlines are treated as spaces,
// tabs, carriage returns are ignored.
StringBuilder buf = new StringBuilder();
int len = content.Length;
char character;
bool newline = false;
bool space = false;//Detect whether we have written at least one space already
for (int i = 0; i < len; i++) {
switch (character = content[i]) {
case ' ':
if (!newline && !space) {//If we are not at a new line AND ALSO did not just append a space
buf.Append(character);
space = true; //flag that we just wrote a space
}
break;
case '\n':
if (i > 0) {
newline = true;
buf.Append(' ');
}
break;
case '\r':
break;
case '\t':
break;
default:
newline = false;
space = false; //reset flag
buf.Append(character);
break;
}
}
return buf.ToString();
}
The second change is in iTextSharp.text.xml.simpleparser.SimpleXMLParser.cs. In the function Go at line 185 change line 248 to:
if (html /*&& nowhite*/) {//removed the nowhite check from here because that should be handled by the HTML parser later, not the XML parser

Thanks for the help, everyone. I was able to find a small workaround by doing the following:
vsHTML.Replace(" ", "&nbsp;").Replace(Chr(9), "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;").Replace(Chr(160), "&nbsp;").Replace(vbCrLf, "<br />")
That is, the first Replace swaps each space for &nbsp;, the second swaps Chr(9) (tab) for five &nbsp; entities, and the third swaps Chr(160) for &nbsp;.

I would recommend using wkhtmltopdf instead of iText. wkhtmltopdf will output the html exactly as rendered by webkit (Google Chrome, Safari) instead of iText's conversion. It is just a binary that you can call. That being said, I might check the html to ensure that there are paragraphs and/or line breaks in the user input. They might be stripped out before the conversion.

Trim string to length ignoring HTML

This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item.
For example, here is the full text we are displaying, including HTML
In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.
We want to trim the news item to 250 characters, but exclude HTML.
The method we are using for trimming currently includes the HTML, and this results in some news posts that are HTML heavy getting truncated considerably.
For instance, if the above example included tons of HTML, it could potentially look like this:
In an attempt to make a bit more space in the office, kitchen, I've pulled...
This is not what we want.
Does anyone have a way of tokenizing HTML tags in order to maintain position in the string, perform a length check and/or trim on the string, and restore the HTML inside the string at its old location?

Start at the first character of the post, stepping over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter gets to 250 is where you actually want to cut off.
Take note that this will have another problem that you'll have to deal with when an HTML tag is opened but not closed before the cutoff.
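
A minimal sketch of that counting walk, in Python. The function name cut_index is made up for this example, and it assumes that every < starts a tag and that the tag ends at the next >. It only finds where to cut; it does nothing about tags that are still open at that point:

```python
def cut_index(html, limit):
    """Return the index just past the limit-th character that lies outside any tag."""
    if limit <= 0:
        return 0
    count = 0        # visible characters seen so far
    in_tag = False   # True while we are between '<' and '>'
    for i, ch in enumerate(html):
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            count += 1
            if count == limit:
                return i + 1
    # Fewer than `limit` visible characters: keep the whole string.
    return len(html)
```

The truncated text is then html[:cut_index(html, 250)]. Entities such as &amp; are counted character by character here, which slightly overstates the visible length.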

Following the 2-state finite machine suggestion, I've just developed a simple HTML parser for this purpose, in Java:
http://pastebin.com/jCRqiwNH
and here a test case:
http://pastebin.com/37gCS4tV
And here the Java code:
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
public class HtmlShortener {
private static final String TAGS_TO_SKIP = "br,hr,img,link";
private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
private static final int STATUS_READY = 0;
private int cutPoint = -1;
private String htmlString = "";
final List<String> tags = new LinkedList<String>();
StringBuilder sb = new StringBuilder("");
StringBuilder tagSb = new StringBuilder("");
int charCount = 0;
int status = STATUS_READY;
public HtmlShortener(String htmlString, int cutPoint){
this.cutPoint = cutPoint;
this.htmlString = htmlString;
}
public String cut(){
// reset
tags.clear();
sb = new StringBuilder("");
tagSb = new StringBuilder("");
charCount = 0;
status = STATUS_READY;
String tag = "";
if (cutPoint < 0){
return htmlString;
}
if (null != htmlString){
if (cutPoint == 0){
return "";
}
for (int i = 0; i < htmlString.length(); i++){
String strC = htmlString.substring(i, i+1);
if (strC.equals("<")){
// new tag or tag closure
// previous tag reset
tagSb = new StringBuilder("");
tag = "";
// find tag type and name
for (int k = i; k < htmlString.length(); k++){
String tagC = htmlString.substring(k, k+1);
tagSb.append(tagC);
if (tagC.equals(">")){
tag = getTag(tagSb.toString());
if (tag.startsWith("/")){
// closure
if (!isToSkip(tag)){
sb.append("</").append(tags.get(tags.size() - 1)).append(">");
tags.remove((tags.size() - 1));
}
} else {
// new tag
sb.append(tagSb.toString());
if (!isToSkip(tag)){
tags.add(tag);
}
}
i = k;
break;
}
}
} else {
sb.append(strC);
charCount++;
}
// cut check
if (charCount >= cutPoint){
// close previously open tags
Collections.reverse(tags);
for (String t : tags){
sb.append("</").append(t).append(">");
}
break;
}
}
return sb.toString();
} else {
return null;
}
}
private boolean isToSkip(String tag) {
if (tag.startsWith("/")){
tag = tag.substring(1, tag.length());
}
for (String tagToSkip : tagsToSkip){
if (tagToSkip.equals(tag)){
return true;
}
}
return false;
}
private String getTag(String tagString) {
if (tagString.contains(" ")){
// tag with attributes
return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
} else {
// simple tag
return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
}
}
}

You can try the following npm package
trim-html
It cuts the text inside the HTML tags down to a given limit, preserves the original HTML structure, removes the HTML tags that come after the limit is reached, and closes any tags left open.

If I understand the problem correctly, you want to keep the HTML formatting, but you want to not count it as part of the length of the string you are keeping.
You can accomplish this with code that implements a simple finite state machine.
2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself if any other character is encountered
Your starting state will be OutOfTag.
You implement a finite state machine by processing 1 character at a time. The processing of each character brings you to a new state.
1. As you run your text through the finite state machine, you want to also keep an output buffer and a length-so-far variable (so you know when to stop).
2. Increment your length variable each time you are in the state OutOfTag and you process another character. You can optionally not increment this variable if you have a whitespace character.
3. You end the algorithm when you have no more characters or you have the desired length mentioned in #1.
4. In your output buffer, include characters you encounter up until the length mentioned in #1.
5. Keep a stack of unclosed tags. When you reach the length, for each element in the stack, add an end tag. As you run through your algorithm you can know when you encounter a tag by keeping a current_tag variable. This current_tag variable is started when you enter the InTag state, and it is ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag you put it in the stack. If you have an end tag, you pop it from the stack.
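
The steps above can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not a full HTML parser: the names truncate_html, _track and VOID_TAGS are invented here, the list of void tags is a small assumed subset, whitespace is counted as visible text, and comments or attribute values containing > are not handled:

```python
# Tags that never take a closing tag (an assumed, incomplete list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def _track(tag_text, open_tags):
    """Update the stack of unclosed tags for one complete tag like '<b>' or '</b>'."""
    inner = tag_text[1:-1].strip()
    if not inner or inner[0] in "!?":
        return                       # doctype, comment start, processing instruction
    if inner.startswith("/"):
        name = inner[1:].strip().lower()
        if open_tags and open_tags[-1] == name:
            open_tags.pop()          # end tag closes the most recent open tag
        return
    if inner.endswith("/"):
        return                       # self-closing form such as <br/>
    name = inner.split()[0].lower()
    if name not in VOID_TAGS:
        open_tags.append(name)       # start tag stays open until its end tag


def truncate_html(html, limit):
    """Keep at most `limit` characters of text, copying tags through uncounted."""
    out = []          # output buffer
    open_tags = []    # stack of unclosed tag names
    count = 0         # text characters emitted so far
    in_tag = False    # state: InTag (True) or OutOfTag (False)
    tag_buf = []
    for ch in html:
        if not in_tag and count >= limit:
            break                    # desired length reached
        if in_tag:
            tag_buf.append(ch)
            if ch == ">":            # InTag -> OutOfTag
                in_tag = False
                tag_text = "".join(tag_buf)
                out.append(tag_text)
                _track(tag_text, open_tags)
        elif ch == "<":              # OutOfTag -> InTag
            in_tag = True
            tag_buf = ["<"]
        else:
            out.append(ch)
            count += 1
    # Close whatever is still open, innermost first.
    for name in reversed(open_tags):
        out.append("</" + name + ">")
    return "".join(out)
```

For instance, truncate_html("<p>Hello <b>big</b> world</p>", 8) keeps eight text characters and then closes the b and p elements that were open at the cut.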

Here's the implementation that I came up with, in C#:
public static string TrimToLength(string input, int length)
{
if (string.IsNullOrEmpty(input))
return string.Empty;
if (input.Length <= length)
return input;
bool inTag = false;
int targetLength = 0;
for (int i = 0; i < input.Length; i++)
{
char c = input[i];
if (c == '>')
{
inTag = false;
continue;
}
if (c == '<')
{
inTag = true;
continue;
}
if (inTag || char.IsWhiteSpace(c))
{
continue;
}
targetLength++;
if (targetLength == length)
{
return ConvertToXhtml(input.Substring(0, i + 1));
}
}
return input;
}
And a few unit tests I used via TDD:
[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}
[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}
[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I";
Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}
[Test]
public void Html_TrimWellFormedHtml()
{
string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
"In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
"</div>";
string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}
[Test]
public void Html_TrimMalformedHtml()
{
string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
"In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";
string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

I'm aware this is quite a bit after the posted date, but I had a similar issue and this is how I ended up solving it. My concern would be the speed of regex versus iterating through an array.
Also, if there is a space both before and after an HTML tag, this doesn't fix that.
private string HtmlTrimmer(string input, int len)
{
if (string.IsNullOrEmpty(input))
return string.Empty;
if (input.Length <= len)
return input;
// this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
string inputCopy;
string tag;
string result = "";
int strLen = 0;
int strMarker = 0;
int inputLength = input.Length;
Stack stack = new Stack(10);
Regex text = new Regex("^[^<&]+");
Regex singleUseTag = new Regex("^<[^>]*?/>");
Regex specChar = new Regex("^&[^;]*?;");
Regex htmlTag = new Regex("^<.*?>");
while (strLen < len)
{
inputCopy = input.Substring(strMarker);
//If the marker is at the end of the string OR
//the sum of the remaining characters and those analyzed is less then the maxlength
if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
break;
//Match regular text
result += text.Match(inputCopy,0,len-strLen);
strLen += result.Length - strMarker;
strMarker = result.Length;
inputCopy = input.Substring(strMarker);
if (singleUseTag.IsMatch(inputCopy))
result += singleUseTag.Match(inputCopy);
else if (specChar.IsMatch(inputCopy))
{
//think of as 1 character instead of 5
result += specChar.Match(inputCopy);
++strLen;
}
else if (htmlTag.IsMatch(inputCopy))
{
tag = htmlTag.Match(inputCopy).ToString();
//This only works if this is valid Markup...
if(tag[1]=='/') //Closing tag
stack.Pop();
else //not a closing tag
stack.Push(tag);
result += tag;
}
else //Bad syntax
result += input[strMarker];
strMarker = result.Length;
}
while (stack.Count > 0)
{
tag = stack.Pop().ToString();
result += tag.Insert(1, "/");
}
if (strLen == len)
result += "...";
return result;
}

Wouldn't the fastest way be to use jQuery's text() method?
For example:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
var text = $('ul').text();
This puts the text content of the list (One, Two and Three, plus any whitespace between the elements) in the text variable, with no tags. That lets you measure the actual length of the text without the HTML included.
