How to write a regular expression to find the href of a tags? [duplicate] - html

This question already has answers here:
How to get one "a href" out of many in one html class with jSoup
(2 answers)
Closed 7 years ago.
I need to find the href of a tags in a string such as this:
<li>باغ بلور<span class="ur">bipardeh94.blogfa.com</span><span class="ds">فرهنگی-خبری-علمی</span></li>
<li>هزار نکته <span class="ur">avaejam.blogfa.com</span><span class="ds"> يك نكته از هزار نكته باشد تا بعد </span></li>
<li>روابط عمومی دانشگاه آزاداسلامی کنگاور<span class="ur">prkangavar.blogfa.com</span><span class="ds">اخبار دانشگاه</span></li>
I use this code :
string regex = "href=\"(.*)\"";
Match match = Regex.Match(codeHtml, regex);
if (match.Success)
{
textBox1.Text += match.Value +"\n";
}
This code finds the first href but then returns all of the markup that follows it.

Does this regex work?
string regex = "href=\"([^\"]*)\"";
[^\"]* matches any run of characters inside the href's quotes except a quote, so the match stops at the closing quote.
To match every tag, use Regex.Matches instead of Regex.Match.
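The difference between the greedy pattern and the negated character class can be sketched in Python, whose re.findall plays roughly the role of Regex.Matches here. The anchor markup and URLs below are assumptions modelled on the blog names in the question, since the original href values are not shown.

```python
import re

# Hypothetical markup: the href values are assumed, not taken from the question.
html = (
    '<li><a href="http://bipardeh94.blogfa.com">one</a></li>'
    '<li><a href="http://avaejam.blogfa.com">two</a></li>'
)

# Greedy: .* runs to the last quote in the string, so everything is one match.
greedy = re.findall(r'href="(.*)"', html)

# Negated class: [^"]* stops at the first closing quote, one match per href.
negated = re.findall(r'href="([^"]*)"', html)

print(len(greedy))   # 1
print(negated)       # ['http://bipardeh94.blogfa.com', 'http://avaejam.blogfa.com']
```

For anything beyond simple, well-formed markup, an HTML parser is usually a safer choice than a regex.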

Related

Is there a way to fix quotes that are inside of each other without them clashing? [duplicate]

This question already has answers here:
How to escape double quotes in a title attribute
(7 answers)
How do I properly escape quotes inside HTML attributes?
(6 answers)
Closed 2 days ago.
I'm making a list of links that have bookmarklets inside. The problem is that there are quotes in the bookmarklet that clash with the quotes. Is there a way to fix this, or otherwise is there a different way to do it?
Code:
<a href='javascript:(function() { var l = document.querySelector("link[rel*='icon']") || document.createElement('link'); l.type = 'image/x-icon'; l.rel = 'shortcut icon'; l.href = 'https://google.com/favicon.ico'; document.getElementsByTagName('head')[0].appendChild(l); document.title = 'Google';})();'>Code</a>
I tried changing the quote type, but that doesn't work. I want the javascript to be inside the link.
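One approach suggested by the linked duplicates is to encode the quotes inside the attribute value as HTML entities, so they no longer terminate the attribute. A minimal Python sketch, assuming the link is generated by a script; the shortened bookmarklet below is a hypothetical stand-in for the one in the question:

```python
import html

# Hypothetical, shortened bookmarklet containing both quote types.
js = "javascript:(function() { document.title = 'Google'; alert(\"done\"); })();"

# quote=True also encodes " as &quot; and ' as &#x27;.
attr = html.escape(js, quote=True)
anchor = f'<a href="{attr}">Code</a>'

print(anchor)
```

The browser decodes the entities before running the script, so the JavaScript itself is unchanged.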

BeautifulSoup Extract String in div tag [duplicate]

This question already has answers here:
how to get text from within a tag, but ignore other child tags
(2 answers)
Closed 2 years ago.
I have the following HTML:
<div class="interesting"><span>a</span> <span>b</span> c</div><div>d</div>
I am trying to use beautifulsoup to extract the string c.
However, soup.div.string is None. I could call get_text() to get a b c and then I parse the text again. But I feel it defeats the purpose of using beautifulsoup.
Any suggestion?
=====================
Update:
I added to my example string above because I noticed that it causes soup.div.find(text=True, recursive=False) to fail to return the text in the div. So this question isn't a duplicate anymore.
soup = BeautifulSoup('<div class="interesting"><span>a</span> <span>b</span> c</div><div>d</div>', 'html.parser')
div = soup.find('div', class_='interesting')
print(div.find_all_next(text=True)[-1])
The code above prints d.
This should help you:
div = soup.find('div',class_ = "interesting")
print(div.find_all(text=True)[-1].strip()) #Prints the last text present within the div tag
Output:
c
Here is the full code:
from bs4 import BeautifulSoup
html = '<div class="interesting"><span>a</span> <span>b</span> c</div><div>d</div>'
soup = BeautifulSoup(html,'html5lib')
div = soup.find('div',class_ = "interesting")
print(div.find_all(text=True)[-1].strip())
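Note that taking the last text node only works when the wanted text comes last inside the div. As a rough alternative, here is a sketch that uses only the standard library's html.parser (not BeautifulSoup) and keeps the text that is a direct child of the div. The class name DirectText is made up for this example, and the depth counting assumes every start tag has a matching end tag, so void elements such as a bare <br> would throw it off.

```python
from html.parser import HTMLParser

class DirectText(HTMLParser):
    """Collects text sitting directly inside the first <div class="interesting">."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # 0 = outside the target div, 1 = directly inside it
        self.done = False    # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1  # entering a child element
        elif tag == "div" and "interesting" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth == 1:  # ignore text nested inside child elements
            self.chunks.append(data)

parser = DirectText()
parser.feed('<div class="interesting"><span>a</span> <span>b</span> c</div><div>d</div>')
parser.close()
direct_text = "".join(parser.chunks).strip()
print(direct_text)  # c
```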

html5 - can't format `\n` as new line in rendered string [duplicate]

This question already has answers here:
Why does the browser renders a newline as space?
(6 answers)
Closed 3 years ago.
I have the following tag, but the '\n' inside item.value is not rendered as a line break.
<td ng-if="flag">{{item.value}}</td>
HTML collapses a newline into a space, so a line break needs a <br/> tag. Use this regex on your value.
item.value = item.value.replace(/(?:\r\n|\r|\n)/g, '<br>');
let item = {};
item.value= "Hi I am some text with a \n line break";
item.value = item.value.replace(/(?:\r\n|\r|\n)/g, '<br>');
document.write(item.value);
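For comparison, the same substitution can be sketched in Python. The helper name newlines_to_br is made up for this example, and escaping the value first is an added assumption (that the text may contain characters such as < or &), not something the answer above does.

```python
import html
import re

def newlines_to_br(value: str) -> str:
    # Escape first so that only the <br> tags added here are treated as markup.
    escaped = html.escape(value)
    # Same alternation as the JavaScript regex: \r\n, then lone \r, then lone \n.
    return re.sub(r"\r\n|\r|\n", "<br>", escaped)

print(newlines_to_br("Hi I am some text with a \n line break"))
# Hi I am some text with a <br> line break
```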

RegEx for capturing an attribute value in a HTML element [duplicate]

This question already has answers here:
Extract Title from html link
(2 answers)
Closed 3 years ago.
I am having trouble extracting text from an HTML tag using regex.
I want to extract the text from the following html code.
Google
The result:
TEXTDATA
I want to extract only the text TEXTDATA
I have tried but I have not succeeded.
Here we want to consume the string up to a left boundary, capture the desired data, and then, if we like, consume the rest of the string:
<.+title="(.+?)"(.*)
const regex = /<.+title="(.+?)"(.*)/gm;
const str = `Google`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression isn't what you need, it can be modified or tested on regex101.com.
RegEx Circuit
jex.im also helps to visualize the expressions.
PHP
$re = '/<.+title="(.+?)"(.*)/m';
$str = 'Google';
$subst = '$1';
$result = preg_replace($re, $subst, $str);
echo $result;
Use this regex:
title=\"([^\"]*)\"
See:
Regex
Google
Remove Title and try
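As a rough illustration of the negated-class pattern, here is a Python sketch. The anchor string is hypothetical: the question's markup is not shown, so the href and the exact tag layout are assumptions, with only the title value TEXTDATA taken from the question.

```python
import re

# Hypothetical input; only the title value comes from the question.
link = '<a href="https://www.google.com" title="TEXTDATA">Google</a>'

# [^"]* cannot run past the closing quote of the title attribute.
match = re.search(r'title="([^"]*)"', link)
title = match.group(1) if match else None

print(title)  # TEXTDATA
```

Checking for None before calling group(1) avoids an exception when the tag has no title attribute.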

Extract plain text from an html file in C [duplicate]

This question already exists:
Using Regex in C [closed]
Closed 9 years ago.
I am really desperate. I need to remove all HTML elements, including the tags themselves, and keep just the plain text. I am required to do this in C, and I have been discouraged from using regex. If I use string functions, only the delimiters are removed, not the text between them. I need to create a program that extracts plain text from an HTML file. Any guidance on how to do this would be appreciated. Thanks!
Here's a starting point for you:
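#include <stdio.h>

/* Strips everything between '<' and '>' in place. */
void remove_html(char* str) {
    char* html_str = str;  /* read position; str is the write position */
    while (*html_str) {
        if (*html_str == '<') {
            /* skip the tag, stopping after '>' or at the end of the string */
            while (*html_str && *html_str++ != '>');
        } else {
            *str++ = *html_str++;
        }
    }
    *str = '\0';  /* terminate the shortened string */
}

int main(void) {
    char foo[] = "hello <p>friends<b>!</b></p>";
    remove_html(foo);
    puts(foo);  /* prints "hello friends!" */
    return 0;
}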
It only strips the angle-bracket syntax; it doesn't do any real parsing. It also doesn't convert character entities such as &amp;.
If you open up a html file in notepad, you'll find it is plain text (no images or anything).
All tags start with < and end with >, everything else is text. In this way, you can read through the file only once, excluding the characters that appear between < > symbols.
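A minimal C version that filters standard input to standard output:
#include <stdbool.h>
#include <stdio.h>
int main(void) {
    bool intag = false;
    int c;  /* int, not char, so EOF can be detected */
    while ((c = getchar()) != EOF) {
        if (c == '<') intag = true;
        if (!intag) putchar(c);
        if (c == '>') intag = false;
    }
    return 0;
}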
This logic should work for most cases, though you may have to do some more work to deal with indented text and possibly any javascript on the page.
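The two gaps mentioned in these answers, character entities and JavaScript on the page, are what a real parser handles for you. The question requires C, so the following is only an illustrative sketch of the idea in Python using the standard library's html.parser; the names TextExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps text nodes, dropping tags and the contents of <script>/<style>."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so &amp; becomes &
        self.skip = 0       # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(html_to_text("hello <p>friends<b>!</b></p>"))  # hello friends!
```

A C program would need either an HTML parsing library or extra states in the loop above to get the same behaviour.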