regular expression to remove links [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I have an HTML page with
<a class="development" href="[variable content]">X</a>
The [variable content] is different in each place, the rest is the same.
What regexp will catch all of those links?
(Although I am not writing it here, I did try...)

What about the non-greedy version:
<a class="development" href="(.*?)">X</a>

Try this regular expression:
<a class="development" href="[^"]*">X</a>
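As a rough illustration of why the negated character class (or the non-greedy version) matters, here is a minimal JavaScript sketch. The sample page and its href values are made up for the example; only the link format comes from the question.

```javascript
// Hypothetical sample page; the href values are made up for illustration.
const page =
  '<p><a class="development" href="/a">X</a> and ' +
  '<a class="development" href="/b">X</a></p>';

// Negated character class: the href value cannot run past its closing quote.
const linkPattern = /<a class="development" href="[^"]*">X<\/a>/g;

// Greedy variant, shown only for contrast: .* runs on to the LAST '">X</a>'.
const greedyPattern = /<a class="development" href=".*">X<\/a>/g;

const matches = page.match(linkPattern);
const greedyMatches = page.match(greedyPattern);
const stripped = page.replace(linkPattern, "");

console.log(matches.length);       // 2
console.log(greedyMatches.length); // 1 (both links swallowed by one match)
console.log(stripped);             // <p> and </p>
```

This only holds while every link sits on one line and uses exactly this attribute order and quoting.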

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

Regex is generally a bad solution for HTML parsing, a topic which gets discussed every time a question like this is asked. For example, the element could wrap onto another line, either as
<a class="development"
href="[variable content]">X</a>
or
<a class="development" href="[variable content]">X
</a>
What are you trying to achieve?
Using jQuery you could disable the links with:
$("a.development").on("click", function() { return false; });
or
$("a.development").attr("href", "#");

Here's a version that'll allow all sorts of evil to be put in the href attribute.
/<a class="development" href=(?:"[^"]*"|'[^']*'|[^\s<>]+)>.*?<\/a>/m
I'm also assuming X is going to be variable, so I added a non-greedy match there to handle it. The /m modifier means . matches line breaks too (that is its meaning in Ruby; most other regex flavors use /s for this).
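For comparison, a sketch of the same idea in JavaScript, where the flag that lets . match line breaks is s (dotAll) rather than m. The wrapped input is a made-up sample, and the looser pattern is one possible variation, not the answerer's exact regex.

```javascript
// Hypothetical input: the same link, wrapped across lines in two ways.
const wrapped =
  '<a class="development"\nhref="/a">X</a>\n' +
  "<a class='development' href='/b'>X\n</a>";

// Single-line pattern from the earlier answers: misses both wrapped links.
const strict = /<a class="development" href="[^"]*">X<\/a>/g;

// Looser pattern: \s+ between attributes, either quote style, and the
// s (dotAll) flag so that . also matches the newline before </a>.
const loose =
  /<a\s+class=["']development["']\s+href=(?:"[^"]*"|'[^']*'|[^\s<>]+)>.*?<\/a>/gs;

const strictCount = (wrapped.match(strict) || []).length;
const looseCount = (wrapped.match(loose) || []).length;

console.log(strictCount); // 0
console.log(looseCount);  // 2
```

Each new variation in the markup needs another tweak to the pattern, which is the practical argument for a parser.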

Related

What is the difference between <P> and <p> in HTML? [duplicate]

This question already has answers here:
Is HTML case sensitive?
(6 answers)
Closed 8 years ago.
So the thing is, I am currently analyzing HTML documents by reading them through Java, and I see that the p tag is one of the most commonly used tags. I know that it's there to provide a new line, but what I don't know is why in some documents I see
<P>Hello world!</P>
and in others
<p>Hello world!</p>
Sometimes both are even used in the same document.
It seems to have exactly the same effect but I am just wondering if there is any reason these two variations exist.

There is no difference.
In HTML, elements are case-insensitive.
However, in XHTML, you must use lowercase.
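When analyzing documents, one practical consequence is to normalize tag names before comparing or counting them. A minimal JavaScript sketch, assuming a rough regex-based tag matcher (countTags is a hypothetical helper, not a real HTML parser):

```javascript
// Count start tags by lowercased name, so <P> and <p> land in the same bucket.
function countTags(html) {
  const counts = new Map();
  // Rough start-tag matcher for illustration only; not a real HTML parser.
  for (const m of html.matchAll(/<([A-Za-z][A-Za-z0-9]*)\b[^>]*>/g)) {
    const name = m[1].toLowerCase();
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return counts;
}

const tagCounts = countTags("<P>Hello world!</P><p>Hello again!</p><Br>");
console.log(tagCounts.get("p"));  // 2
console.log(tagCounts.get("br")); // 1
```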

http://www.w3.org/TR/html-markup/documents.html#case-insensitivity
HTML is case-insensitive, as you can see in the documentation.

They're the same. It does not matter if it's lowercase, uppercase, or even mixed.

<p></p> is used for a new paragraph.
HTML is case-insensitive, which means you can use both spellings.

There is no difference. Inherited from SGML, HTML is not case-sensitive for elements and attributes.
I prefer to use the lower-case form... otherwise I get the impression that the coder is shouting at me ^^

Forcing a line break in a string in HTML, Equivalent to \n in HTML

This is being used in a Bootstrap Popover.
The live page under development can be viewed here
This has got to be simple, but I can't find it anywhere. Within the data-content attribute I want to force a paragraph or line break between "Date Assessed: 10-Nov-13" and "Results: CR= ...".
Using a BR or P tag doesn't work; it shows the literal tag. In JavaScript you force a line break with \n. How do you do the same in HTML within a quoted string?
<td class="setWidth concat"><div class="boldTitle"><a href="#"
class="tip" rel="popover" data-trigger="hover"
data-placement="top"
data-content="Date Assessed: 10-Nov-13 <br />
Results: Cr = 2.2 mg/dl"
data-original-title="Out of Range">
<span style="color:red"
class="glyphicon glyphicon-warning-sign"></span> Cr = 2.2 mg/dL</a></div></td>

See last update: Bootstrap gives you ability to specify that the content is HTML instead of text.
It depends entirely on bootstrap's implementation of the popover effect. If they are using $('.popover').html($(this).data('content')) then it should "just work". If they are using $('.popover').text($(this).data('content')) or otherwise escaping the results of the data-attribute first, then it probably won't.
If bootstrap's implementation isn't working the way you want it to work, you might be served better by writing your own javascript to handle the effect you're looking for.
See this fiddle for an example of a line break from a data-attribute working correctly:
http://jsfiddle.net/g32tw/1/
Update: I've updated the fiddle with a second link that produces the error you're experiencing, which is likely how bootstrap's implementation works.
UPDATE: I just looked at Bootstrap's documentation. Have you tried adding data-html="true" to the element?
Source: http://getbootstrap.com/javascript/#popovers-usage
Watch out with this - if the content is end-user-supplied using the html option might subject you to XSS attack vulnerabilities. If you trust the data it's fine. See https://www.acunetix.com/websitesecurity/cross-site-scripting/ for information about cross-site scripting.
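One way to keep the line break while limiting the XSS risk is to escape each piece of untrusted text and only then join the pieces with a trusted <br />. A minimal sketch, where escapeHtml and popoverContent are hypothetical helpers and the script payload is a made-up example of hostile input:

```javascript
// Escape the five characters that are special in HTML text and attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first, or later entities get re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical helper: escape each line, then join with a trusted <br />.
function popoverContent(lines) {
  return lines.map(escapeHtml).join("<br />");
}

const content = popoverContent([
  "Date Assessed: 10-Nov-13",
  "Results: Cr = 2.2 mg/dl <script>alert(1)</script>",
]);
console.log(content);
```

The resulting string could then be used as the popover content with the html option enabled; only the <br /> separators are interpreted as markup.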

I am not sure that you can. You could try to have two data items:
data-assessdate="Date Assessed: 10-Nov-13"
data-results="Cr = 2.2 mg/dl"
and reassemble afterward with JavaScript before displaying:
var summary = this.dataset;
var newhtml = summary.assessdate + "<br />" + summary.results;
and then write newhtml to the DOM wherever you want.

HTML tag that causes other tags to be rendered as plain text [duplicate]

This question already has answers here:
How to display raw HTML code on an HTML page
(30 answers)
Closed 3 years ago.
I'd like to add an area to a page where all of the dynamic content is rendered as plain text instead of markup. For example:
<myMagicTag>
<b>Hello</b> World
</myMagicTag>
I want the <b> tag to show up as just text and not as a bold directive. I'd rather not have to write the code to convert every "<" to an "&lt;".
I know that <textarea> will do it, but it has other undesirable side effects like adding scroll bars.
Does myMagicTag exist?
Edit: A jQuery or javascript function that does this would also be ok. Can't do it server-side, unfortunately.

You can do this with the script element (bolded by me):
The script element allows authors to include dynamic script and data blocks in their documents.
Example:
<script type="text/plain">
This content has the media type text/plain, so characters reserved in HTML have no special meaning here: <div> ← this will be displayed.
</script>
(Note that the allowed content of the script element is restricted, e.g. you can’t have </script> as text content (it would close the script element).)
Typically, script elements have display:none by default in browser’s CSS, so you’d need to overwrite that in your CSS, e.g.:
script[type="text/plain"] {display:block;}

You can use a function to escape the < and >, e.g.:
'span.name': function(){
return this.name.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
Also take a look at <plaintext>. I haven't used it myself, but it is known to render everything that follows as plain text (and by everything I mean that it ignores the closing tag, so all the following code is rendered as text).

The tag used to be <XMP>, but it was already deprecated in HTML 4. Browsers don't seem to have dropped support for it, but I would not recommend it for anything beyond quick debugging. The MDN article about <XMP> lists two other tags, <plaintext> and <listing>, that were deprecated even earlier. I'm not aware of any current alternative.
Whatever, the code to encode plain text into HTML is pretty straightforward in most programming languages.
Note: the term similar means exactly that—all three are designed to inject plain text into HTML. I'm not implying that they are synonyms or that they behave identically—they don't.

There is no specific tag except the deprecated <xmp>.
But a script tag is allowed to store unformatted data.
Here is the only solution so far showing dynamic content, as you wanted.
Run code snippet for more info.
<script id="myMagicTag" type="text/plain" style="display:block;">
<b>Hello</b> World
</script>
Use Visible Data-blocks
<script>
document.querySelector("#myMagicTag").innerHTML = "<b>Unformatted</b> dynamic content"
</script>

No, that's not possible, you need to HtmlEncode it.
If you're using a server-side language, that's not really difficult though.
In .NET you would do something like this:
string encodedtext = HttpContext.Current.Server.HtmlEncode(plaintext);

In my application, I need to prevent HTML from rendering
"if (a<b || c>100) ..."
and
"cout << ...".
Also the entire C++ code region HTML must pass through the GCC compiler with the desired effect. I've hit on two schemes:
First:
//<xmp>
#include <string>
//</xmp>}
For reasons that escape me, the <xmp> tag is deprecated. I find (2016-01-09) that Chrome and FF, at least, render the tag the way I want. While researching my problem, I saw a remark that <xmp> is required in HTML 5.
Second, in <head> ... </head>, insert:
<style type="text/css">
textarea { border: none; }
</style>
Then in <body> ... </body>, write:
//<br /> <textarea rows="4" disabled cols="80">
#include <stdlib.h>
#include <iostream>
#include <string>
//</textarea> <br />
Note: Set cols="80" to prevent following text from appearing on the right. Set rows=... to one more line than you enclose in the tag. This prevents scroll bars. This second technique has several disadvantages:
The "disabled" attribute shades the region
Incomprehensible, complex comments in the code sent to the compiler
Harder to understand
More typing
However, this method is neither obsolete nor deprecated. The gods of HTML will make their faces to shine unto you.

How to add plain text code in a webpage? [duplicate]

This question already has answers here:
How to display raw HTML code on an HTML page
(30 answers)
Closed 2 years ago.
I know it is possible because this website does it, but I tried researching how and just got a bunch of junk, so how do I add tags to a website paragraph without the browser interpreting it as code.
For example, if I have <p><div></div></p>, I want the div to display in the browser as text not have the browser interpret it as html. Is this complicated to do?
I have been writing tutorials for school, and it would be much easier if I could add the code directly to the webpage in text form instead of images, so students can copy and paste it.

Look at how this website itself achieves this:
<p>For example, if I have <code>&lt;p&gt;&lt;div&gt;&lt;/div&gt;&lt;/p&gt;</code>, I want the div to display in the browser as text not have the browser interpret it as html. Is this complicated to do?</p>
You need to replace the < and > with their HTML character entities.

There are several ways to do this:
Replace < with &lt;
`&lt;h1>This is heading &lt;/small>&lt;/h1>`
Place the code inside <xmp></xmp> tags
<xmp>
<ul>
<li>Coffee</li>
<li>Tea</li>
</ul>
</xmp>
I do not recommend other ways, such as <plaintext> or <listing>, because they do not work in all browsers.

You want to look into something called HTML Entities.
If you want the < character to appear on a website, for example, you can write this HTML code: &lt;. These are the five basic HTML Entities and their source code equivalents:
< &lt;
> &gt;
" &quot;
' &apos;
& &amp;
If you are using a programming language (such as PHP or ASP.NET), then there is probably a built-in command that will do the conversion for you (htmlspecialchars() and Server.HtmlEncode, respectively).
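Where no built-in is available, the table above translates fairly directly into a lookup. A minimal JavaScript sketch; ENTITIES and encodeEntities are hypothetical names chosen for this example:

```javascript
// The five basic entities, keyed by the character they stand for.
const ENTITIES = {
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&apos;",
  "&": "&amp;",
};

// Single pass over the text, so an & produced by one replacement
// is never escaped a second time.
function encodeEntities(text) {
  return text.replace(/[<>"'&]/g, (ch) => ENTITIES[ch]);
}

const encoded = encodeEntities("<p><div></div></p>");
console.log(encoded); // &lt;p&gt;&lt;div&gt;&lt;/div&gt;&lt;/p&gt;
```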

Use the tag <PRE> before a block of preformatted text and </PRE> after.
The text between these tags is rendered as monospaced characters with line breaks and spaces at the same points as in the original file. This may be helpful for rendering poetry without adding a lot of HTML code. Try this:
Mary had a little lamb.
Its fleece was white as snow.
And everywhere that Mary went
the lamb was sure to go.

To add plain text code in a webpage, HTML Character Escaping is needed on five characters:
< as &lt;, > as &gt;, & as &amp;, ' as &apos;, " as &quot;
(OR)
Alternatively, the <xmp> tag may be used, but this tag disturbs the style and is obsolete.
<xmp>Code with HTML Tags like <div> etc. </xmp>

Use the HTML entity (special character) for the tag, such as &lt; (for less than):
&lt;p&gt; in HTML -> <p> in browser
You could also write &lt;p> since there is no ambiguity about the opening tag.
Many languages have built-in methods to convert HTML special characters, such as PHP's htmlspecialchars.

You need to escape the HTML tags, namely the less-than sign. Write it as &lt; and it will appear as < on the HTML page.

Your HTML must not be written as real tags. If you use the < > characters directly, the browser converts them into markup, not text; if I were to write <br> in the middle of a sentence, it would produce a line break. You will need to write the code in code, so to speak, using &lt; &gt; (< >),
and then you get what you need.

I just discovered a much simpler solution at CSS-Tricks...
Just have your outermost wrapper be a 'pre' tag, followed by a 'code' tag, then inside the code tag put your code in parentheses.

The simplest way to do it without having to reformat your text using entities is to use jQuery.
<div id="container"></div>
<script>
$('#container').text("<div><h1>Hello!</h1><p>I like you.</p></div>");
</script>
If you then do alert($('#container').prop('innerHTML'));, you get &lt;div&gt;&lt;h1&gt;Hello!&lt;/h1&gt;&lt;p&gt;I like you.&lt;/p&gt;&lt;/div&gt;
How useful that technique is depends somewhat on where your material is coming from.

Use iframe and txt file:
<iframe src="html.txt"></iframe>

How do you parse a web page and extract all the href links?

I want to parse a web page in Groovy and extract all of the href links and the associated text with it.
If the page contained these links:
<a href="http://www.google.com">Google</a><br />
<a href="http://www.apple.com">Apple</a>
the output would be:
Google, http://www.google.com<br />
Apple, http://www.apple.com
I'm looking for a Groovy answer. AKA. The easy way!

Assuming well-formed XHTML, slurp the xml, collect up all the tags, find the 'a' tags, and print out the href and text.
input = """<html><body>
John
Google
StackOverflow
</body></html>"""
doc = new XmlSlurper().parseText(input)
doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
println "${it.text()}, ${it.@href.text()}"
}

A quick google search turned up a nice looking possibility, TagSoup.

I don't know java but I think that xpath is far better than classic regular expressions in order to get one (or more) html elements.
It is also easier to write and to read.
<html>
<body>
1
2
3
</body>
</html>
With the HTML above, the expression "/html/body/a" will list all a elements, and "/html/body/a/@href" selects their href attributes.
Here's a good step by step tutorial http://www.zvon.org/xxl/XPathTutorial/General/examples.html

Use XmlSlurper to parse the HTML as an XML document, then use the find method with an appropriate closure to select the a tags, and then use the list method on GPathResult to get a list of the tags. You should then be able to extract the text as children of the GPathResult.

Try a regular expression. Something like this should work:
(html =~ /<a.*href='(.*?)'.*>(.*?)<\/a>/).each { match, url, text ->
// do something with url and text
}
Take a look at Groovy - Tutorial 4 - Regular expressions basics and Anchor Tag Regular Expression Breaking.
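The same regex idea, sketched in JavaScript rather than Groovy for illustration. The page fragment is a made-up sample modeled on the question, and the pattern is deliberately rough: it accepts either quote style but still assumes each link sits on one line.

```javascript
// Hypothetical page fragment modeled on the question's example.
const html =
  '<a href="http://www.google.com">Google</a><br />\n' +
  "<a href='http://www.apple.com'>Apple</a>";

// Rough pattern: href in single or double quotes, then the link text.
// The backreference \1 makes the closing quote match the opening one.
const anchor = /<a\b[^>]*\bhref=(["'])(.*?)\1[^>]*>(.*?)<\/a>/gi;

// m[2] is the href value, m[3] is the link text.
const links = [...html.matchAll(anchor)].map((m) => `${m[3]}, ${m[2]}`);
console.log(links.join("\n"));
// Google, http://www.google.com
// Apple, http://www.apple.com
```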

Parsing using XmlSlurper only works if the HTML is well-formed.
If your HTML page has non-well-formed tags, then use a regex for parsing the page.
Ex: <a href="www.google.com">
Here, 'a' is not closed and thus not well-formed.
new URL(url).eachLine{
(it =~ /.*<A HREF="(.*?)">/).each{
// process hrefs
}
}

Html parser + Regular expressions
Any language would do it, though I'd say Perl is the fastest solution.

We Keep Coding
