How can I convert URLs in text to HTML links?

How can I convert URLs in text to HTML links? - html

I'm writing a forum-type discussion board in Perl and would like to change automatically http://www.google.com to be an HTML link. This should also be safe, and err on the side of security. Is there a quick, easy, and safe way to add links automatically?

Try something like this:
use Regexp::Common qw /URI/;
$text =~ s|($RE{URI}{HTTP})(?!</a>)|$1|g
The key here is using Regexp::Common::URI which probably has a more thorough url matcher than anything I could come up with. Also I do a negative lookahead assertion at the end to make sure that the url is not already in a link. That last part isn't exactly thorough, since it's possible that somebody could do something like this:
http://www.mysite.com is my website
To do this correctly you'd need to parse the entire submission text and only substitute out urls that are not already part of a link.

Related

Finding a specific link from a site

I'm trying to find a specific link from a web page using windows command line and tools. I think Xidel can do what I want to do.
In the page, the link is used like this:
file: 'http://link.link/index.txt'
Note: there's only one line like this. Now if I can set something like
file: '{%link}'
then I'll be able to extract the link. Also if I want to change the word index.txt to something like root.txt and then use aria2 to download the link as http://link.link/root.txt , what do I need to do?
(I don't have any experience with any of these tools/command like scripts, I just wanted to make something that does this (some alternatives are already available but I want to do it myself) and this only. So I did search for it and have an idea on how can I do it but extrating the exact url seems to be the hardest part since I couldn't find anything that might help me in xidel's docs)

Xidel is meant to extract data from HTML/XML/json files, but it can also extract from CSV's and TXT if you know how to use the $raw variable and xidel/xquery functions, like extract(), tokenize() and replace().
Post the URL or the source (or part thereof) of the webpage and I'll see how I can help you.

Direct link to MediaWiki page section

In my Wikipedia page, I have a section called subtitleA. Before arriving at this point when reading, I have one sentence that has a link that jumps to the content of that section.
To be more clear, this is a simple illustration:
To do this, you will need `this` (link to subtitleA).
To do that, you will do another thing..
== SubtitleA ==
this is how you do it....
I found the following solution:
To do this, you will need [http://wikisite.com/pageName#SubtitleA this].
This has already been proven correct; however, one of my subtitles contains spaces, brackets and directory like the following:
== SubtitleA (balabalaA\balabalaB\balabala....) ==
I can no longer use the solution I found because of those spaces... Can anyone provide me an alternative solutions? Thanks.

To do this, you will need [[pageName#SubtitleA|this]].
Use the exact same format as in the section title.
Anchor encoding is similar to percent encoding (with a . instead of a %) but not exactly the same (e.g. spaces are collapsed and encoded to _). If you really, really need to do it directly, you can use {{anchorencode|original title}}.

I found the solution:
URL encoder is the key, but not using standard %xx as the replacements for special characters. Use .xx (e.g. .5C .28) would work in the mediawiki framework.

Label text ignoring html tags

<label for="abc" id="xyz">http://abc.com/player.js</xref>?xyz="foo" </label>
is ignoring
</xref> tag
value in the browser. So, the displayed output is
http://abc.com/player.js?xyz="foo"
but i want the browser to display
http://abc.com/player.js</xref>?xyz="foo"
Please help me how to achieve this.

It isn't being ignored. It is being treated as an end tag (for a non-HTML element that has no start tag). Use < if you want a < character to appear as data instead of as "start of tag".
That said, this is a URL and raw <, > and " characters shouldn't appear in URIs anyway. So encode it as http://abc.com/player.js%3C/xref%3E?xyz=%22foo%22

You should do it like this
"http://abc.com/player.js%3C/xref%3E?xyz=foo"
Url should be encoded properly to work as valid URL
Use encodeURI for encoding URLs for a valid one
var ValidURL = encodeURI("http://abc.com/player.js</xref>?xyz=foo");
See this answer on encodeURI for better knowledge.

I misunderstood the question, I thought the URI was to be used elsewhere within JavaScript. But the question pretty clearly states that the URI is to just be rendered as text.
If the text being displayed is being passed in from a server, then your best bet is to encode it before printing it on the page (or if you're using a template engine, then you can most likely just encode it on the template). Pretty much any web framework/templating engine should have this functionality.
However, if it is just static HTML, just manually encode the the characters. If you don't know the codes off the top of your head, you can just use some online converter to help, such as something like:
HTML Encode/Decode:
http://htmlentities.net/
Old Answer:
Try encoding the URI using the JavaScript function encodeURI before using it:
encodeURI('http://abc.com/player.js</xref>?xyz="foo"');
You can also decode it using decodeURI if need be:
decodeURI(yourEncodedURI);
So ultimately I don't think you'll be able to get the browser to display the </xref> tag as is, but you will be able to preserve it (using encodeURI/decodeURI) and use it in your code, if this is what you need.
Fiddle:
http://jsfiddle.net/rk8nR/3/
More info:
When are you supposed to use escape instead of encodeURI / encodeURIComponent?

Perl AJAX stripping html characters out of string?

I have a Perl program that is reading html tags from a text file. (im pretty sure this is working because when i run the perl program on the command line it prints out the HTML like it should be.)
I then pass that "html" to the web page as the return to an ajax request. I then use innerHTML to stick that string into a div.
Heres the problem:
all the text information is getting to where it needs to be. but the "<" ">" and "/" are getting stripped.
any one know the answer to this?

The question is a bit unclear to me without some code and data examples, but if it is what it vaguely sounds like, you may need to HTML-encode your text (e.g. using HTML::Entities).
I'm kind of surprized that's an issue with inserting into innerHTML, but without specific example, that's the first thing which comes to mind

There could be a mod on the server that is removing special characters. Are you running Apache? (I doubt this is what's happening).
If something is being stripped on the client-side, it is most likely in the response handler portion of the AJAX call. Show your code where you stick the string in the div.

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.

If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g

In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
simple_replace {
my $n = shift;
my $text = $n->textContent;
$text =~ s/foo/<foo>/g;
return XML::LibXML::Text->new($text);
} '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)

The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.

Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags

To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008