I'm trying to find links in the following format:
http://subdomain.subdomain.domain.tld/subfolder/randomstring.html
Basically, I need a regex that looks for http:// and stops looking when it finds .html. Everything in between shouldn't matter. I.e., more/less subdomains, variable TLD and variable folder.
Is this possible?
What I've got so far (not functional) is this:
((http://)?=(.html))
I'm really not familiar with lookahead assertions, so I might be on the wrong track.
Anyways, any help is going to be greatly appreciated!
Lookahead? You only need a non-greedy match-everything:
/http:\/\/.*?\.html/
I would use something like: /http:\/\/[^<>\s]+?\.html/
It can be enhanced, but at least it won't match stuff like:
http://something.com/ has a lot of .html files
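To see the difference between the two patterns above, here is a small Python sketch. The sample text is made up for illustration, and Python's re module stands in for whatever regex engine you are using (the leading and trailing slashes are dropped because Python has no regex literal syntax).

```python
import re

# Hypothetical sample text: one real link, plus the problem sentence from above.
text = (
    "See http://a.b.example.com/dir/page.html for details. "
    "http://something.com/ has a lot of .html files"
)

# Non-greedy "match everything": stops at the first ".html", even across spaces.
loose = re.compile(r"http://.*?\.html")

# Restricted class: the match may not contain whitespace or angle brackets.
strict = re.compile(r"http://[^<>\s]+?\.html")

loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(loose_matches)   # two matches, the second one spans a whole phrase
print(strict_matches)  # only the real link
```

The loose pattern also returns "http://something.com/ has a lot of .html", which is exactly the false positive the second answer warns about.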
I'm trying to find a specific link from a web page using windows command line and tools. I think Xidel can do what I want to do.
In the page, the link is used like this:
file: 'http://link.link/index.txt'
Note: there's only one line like this. Now if I can set something like
file: '{%link}'
then I'll be able to extract the link. Also, if I want to change index.txt to something like root.txt and then use aria2 to download the link as http://link.link/root.txt, what do I need to do?
(I don't have any experience with these tools or command-line scripts. I just want to make something that does this and only this; some alternatives are already available, but I want to do it myself. I searched and have an idea of how to do it, but extracting the exact URL seems to be the hardest part, since I couldn't find anything helpful in Xidel's docs.)
Xidel is meant to extract data from HTML/XML/JSON files, but it can also extract from CSV and TXT files if you know how to use the $raw variable and Xidel/XQuery functions like extract(), tokenize() and replace().
Post the URL or the source (or part thereof) of the webpage and I'll see how I can help you.
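Until the real page source is posted, here is the extract-then-replace idea as a rough Python sketch. This is not a Xidel command, just the same two steps done with Python's re module. The page_source string is a made-up stand-in for the real page; only the file: '...' line comes from the question.

```python
import re

# Hypothetical page fragment; only the "file:" line is taken from the question.
page_source = """
var player = {
    file: 'http://link.link/index.txt'
};
"""

# Step 1: capture whatever sits between the quotes after "file:".
match = re.search(r"file:\s*'([^']+)'", page_source)
link = match.group(1) if match else None

# Step 2: swap the trailing index.txt for root.txt.
new_link = re.sub(r"index\.txt$", "root.txt", link) if link else None

print(link)
print(new_link)
```

The resulting URL is what you would then hand to aria2.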
In my header and footer template I have links like /pagina/test.
This works, but on other pages Bolt generates links like pagina/pagina/test. What am I doing wrong here?
To prevent wrong links I now type in the whole URL, www.test.nl/pagina/test, but this is not an ideal situation.
Normally I'd say that you'd forgotten the first slash, but since you've specifically stated that you put that one in...
Could you post a more complete chunk of code?
The following regex does match what I am looking for, but it will also match all file extensions (just the file extensions) of anything ending with gif|jpg|png
webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.gif|png|jpg"\s
I am using it on the source of the following page, which is a webcomic that is updated daily:
http://www.explosm.net/comics/
Today, the end goal would be the following, and only the following:
webcomic" src="http://www.explosm.net/db/files/Comics/Kris/lawyer.gif"
I'm just getting my feet wet with regex, have browsed a few websites but can't figure this one out. I don't get why just the file extensions are getting matched, when their file paths/urls do not match the rest of my pattern.
Any help appreciated
Well, the problem that jumps right out at me is the end there. gif|png|jpg should really be (gif|jpg|png) - with what you have now, the string can match webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.gif, or it can match just png or jpg"\s. With the parentheses, it will match webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\. followed by (gif or jpg or png), and then followed by "\s.
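The precedence issue is easy to reproduce. The sketch below uses Python's re module and a made-up HTML snippet (the logo.png image is hypothetical). It uses (?:...), a non-capturing variant of the parentheses suggested above, so that findall returns whole matches.

```python
import re

# Hypothetical snippet: the comic image plus an unrelated PNG elsewhere on the page.
html = (
    '<img alt="webcomic" src="http://www.explosm.net/db/files/Comics/Kris/lawyer.gif" /> '
    '<img src="/images/logo.png" />'
)

# Original pattern: "|" splits the WHOLE regex into three alternatives.
ungrouped = re.compile(
    r'webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.gif|png|jpg"\s'
)

# Grouped pattern: the alternation only covers the file extension.
grouped = re.compile(
    r'webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.(?:gif|png|jpg)"\s'
)

ungrouped_matches = ungrouped.findall(html)
grouped_matches = grouped.findall(html)

print(ungrouped_matches)  # the comic URL, plus a bare "png" from logo.png
print(grouped_matches)    # only the comic URL
```

The bare "png" in the first result is the stray file-extension match described in the question.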
That last bit
gif|png|jpg
means "match any of the three". If you want it to match just gif, write just gif.
I'd try a regex like this:
webcomic"\ssrc="http://www\.explosm\.net/[a-zA-Z/]+\.(gif|png|jpg|jpeg)"
Is it possible to do something like the URL below?
myurl.co.uk/#myAnchor?myQry=this
I'm trying to pass tracking codes while also being able to have multiple links from an email go to relevant parts of my page.
This currently seems to do nothing as it is. Is it actually possible?
The query goes before the anchor, so:
http://example.com/page.php?parameter=value#anchor
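You can check how a URL gets split with a standard URL parser. Here is a quick sketch with Python's urllib.parse; the http:// scheme is added to the question's URL only so that it parses as an absolute URL.

```python
from urllib.parse import urlsplit, parse_qs

# Anchor first, as in the question: everything after "#" becomes the fragment.
wrong = urlsplit("http://myurl.co.uk/#myAnchor?myQry=this")

# Query first, then anchor: both parts are recognised.
right = urlsplit("http://myurl.co.uk/?myQry=this#myAnchor")

print(repr(wrong.query), repr(wrong.fragment))
print(repr(right.query), repr(right.fragment))
print(parse_qs(right.query))
```

In the first form the query string is empty and "?myQry=this" ends up inside the fragment, which is why the tracking code appears to do nothing.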
Though the query always comes before any anchors, here is the solution for this if required.
One solution for this could be to use Apache HTACCESS and declare #myAnchor in rewrite rules. Then you can use something like this:
myurl.co.uk/#myAnchor?myQry=this
But please remember that in .htaccess the # also starts a comment, so you will have to escape it in the rewrite rule.
I'm writing a forum-type discussion board in Perl and would like to automatically turn http://www.google.com into an HTML link. This should also be safe, and err on the side of security. Is there a quick, easy, and safe way to add links automatically?
Try something like this:
use Regexp::Common qw /URI/;
$text =~ s|($RE{URI}{HTTP})(?!</a>)|<a href="$1">$1</a>|g;
The key here is using Regexp::Common::URI, which probably has a more thorough URL matcher than anything I could come up with. I also use a negative lookahead assertion at the end to make sure that the URL is not already in a link. That last part isn't exactly thorough, since it's possible that somebody could do something like this:
<a href="http://www.mysite.com">http://www.mysite.com is my website</a>
To do this correctly you'd need to parse the entire submission text and only substitute out urls that are not already part of a link.
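Here is a rough sketch of that "skip what is already linked" idea, written in Python rather than Perl to keep it short. URL_RE is a deliberately simplified stand-in for $RE{URI}{HTTP}, and linkify and _wrap are hypothetical helper names. It splits the text on existing anchor elements and only substitutes URLs in the pieces between them.

```python
import html
import re

# Simplified URL pattern for illustration; far less thorough than Regexp::Common.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Existing anchor elements; the capturing group makes re.split keep them.
ANCHOR_RE = re.compile(r"(<a\b[^>]*>.*?</a>)", re.IGNORECASE | re.DOTALL)


def _wrap(match):
    """Turn one matched URL into an anchor, escaping it for HTML."""
    url = match.group(0)
    return '<a href="{0}">{1}</a>'.format(html.escape(url, quote=True), html.escape(url))


def linkify(text):
    """Link bare URLs, leaving text inside existing <a>...</a> untouched."""
    parts = ANCHOR_RE.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # odd indexes are the captured anchors
        else:
            out.append(URL_RE.sub(_wrap, part))
    return "".join(out)


sample = (
    "Visit http://www.google.com or "
    '<a href="http://www.mysite.com">http://www.mysite.com is my website</a>'
)
result = linkify(sample)
print(result)
```

This is still regex-based, so it has gaps: URLs inside other tags' attributes would be wrapped, and existing anchors are passed through unchecked. To err on the side of security, a real forum would run submissions through a proper HTML parser or sanitizer first.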