What does "BS warning" mean? - warnings

In a recent mail, I saw the phrase "BS warning". Although this mail is on the Emacs mailing list, I don't think this phrase is Emacs-specific.
I've searched the web, but didn't get anything that looked relevant. Any ideas?

In the context of the linked email, it means that the warnings are not real, that they can (and should) be disregarded.


The Difference Between Deprecated, Depreciated and Obsolete [closed]

There is a lot of confusion about this and I'd like to know, what exactly is the difference between depreciated, deprecated and obsolete, in a programming context, but also in general.
I know I could just look at an online dictionary, and I have, even at many, but they don't all agree, or there are differences in what they say. So I decided to just ask here, considering I also want an answer in a programming context.
If I understand right, deprecated means it shouldn't be used anymore, because it has been replaced by a better alternative, or just because it has been abandoned. Obsolete means it doesn't work anymore, was removed, or doesn't work as it should anymore. And depreciated, if I understand right, once more, has completely nothing to do with programming and just means something has a lowered value, or was made worse.
Am I right, or am I wrong, and if I am wrong, what exactly do each of these mean?
You are correct.
Deprecated means that it is still in use, but only for historical purposes and it will be removed probably in the next big release. It is recommended that you do not use deprecated functions or features - even if they are present in the current library for example.
Obsolete means that it is already out of use.
Depreciated means the monetary value of something has decreased over time. E.g., cars typically depreciate in value.
Also for more precise definitions of the terms in the context of the English language I recommend using https://english.stackexchange.com/.
Records are obsolete, CDs are deprecated, and the music industry is depreciated.
In the context of describing APIs and such, "depreciated" is a misreading, misspelling, and mispronunciation of "deprecated".
I'm thinking people have just seen "depreciated" so often in other contexts, and "deprecated" so rarely, that they don't even register the "i" or lack thereof. It doesn't exactly help that their definitions are similar either.
Obsolete: should not be used any more
Deprecated: should be avoided in new code, and likely to become obsolete in a later version of the API
Depreciated: usually a typo for deprecated (depreciation is where the value of goods goes down over time, e.g. if you buy a new computer its resale value goes down month by month)
With all due respect, this is a slight pet peeve of mine and the selected answer for this is actually wrong.
Granted language evolves, e.g., "google" is now a verb, apparently. Through what's known as "common use", it has earned its way into official dictionaries. However, "google" was a new word representing something heretofore non-existent in our speech.
Common use does not cover blatantly changing the meaning of a word just because we didn't understand its definition in the first place, no matter how many people keep repeating it.
The entire English-speaking computer industry seems to use "deprecate" to mean some feature that is being phased out or no longer relevant. Not bad, just not recommended. Usually, because there is a new and better replacement.
The actual definition of deprecate is to put down, or speak negatively about, or to express disapproval, or make fun of someone or something through degradation.
It comes from Latin de- (against) precari (to pray). To "pray against" to a 21st century person probably conjures up thoughts of warding off evil spirits or something, which is probably where the disconnect occurs with people. In fact, to pray or to pray for something meant to wish good upon, to speak about in a positive way. To pray against would be to speak ill of or to put down or denigrate. See this excerpt from the Oxford English Dictionary.
Express disapproval of:
(as adjective deprecating) he sniffed in a deprecating way
another term for depreciate ( sense 2).
he deprecates the value of children’s television
What people generally mean to convey when using deprecate, in the IT industry anyway, and perhaps others, is that something has lost value. Something has lost relevance. Something has fallen out of favor. Not that it has no value, it is just not as valuable as before (probably due to being replaced by something new). We have two words that deal with this concept in English and the first is "depreciate". See this excerpt from the Oxford English Dictionary.
Diminish in value over a period of time:
the pound is expected to depreciate against the dollar
Disparage or belittle (something):
Notice that definition 2 sounds like deprecate. So, ironically, deprecate can mean depreciate in some contexts, just not the one commonly used by IT folk.
Also, just because currency depreciation is a nice common use of the word depreciate, and therefore easy to cite as an example, doesn't mean it's the only context in which the word is relevant. It's just an example. ONE example.
The correct transitive verb for this is "obsolete". You obsolete something because its value has depreciated.
See this excerpt from the Oxford English Dictionary.
Verb - Cause something to be or become obsolete by replacing it with something new.
It bugs me, it just bugs me. I don't know why. Maybe because I see it everywhere. In every computer book I read, every lecture I attend, and on every technical site on the internet, someone invariably drops the d-bomb sooner or later. If this one ends up in the dictionary at some point, I will concede, but conclude that the gatekeepers of the English lexicon have become weak and have lost their way... or at the very least, lost their nerve. Even Wikipedia espouses this misuse, and indeed, defends it. I've already edited the page thrice, and they keep removing my edits.
Something is depreciated until it is obsolete. Deprecate, in the context of IT, makes no sense at all, unless you're putting down someone's performance or work or product or the fact that they still wear parachute pants.
Conclusion: The entire IT industry uses deprecate incorrectly. It may be common use. It may be some huge mis-understanding. But it is still, completely, wrong.
In computer software standards and documentation, the term deprecation is used to indicate discouragement of usage of a particular software feature, usually because it has been superseded by a newer/better version. The deprecated feature still works in the current version of the software, but it may raise error messages or warnings recommending an alternate practice.
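The behaviour described above (the old feature keeps working but emits a warning that points to the replacement) can be sketched in Python with the standard `warnings` module. The names `old_api` and `new_api` are hypothetical and exist only for this illustration.

```python
import warnings


def new_api(x):
    """The recommended replacement (hypothetical name)."""
    return x * 2


def old_api(x):
    """Deprecated: still works, but warns and forwards to new_api()."""
    warnings.warn(
        "old_api() is deprecated; use new_api() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to old_api
    )
    return new_api(x)
```

Calling `old_api(3)` still returns the same result as `new_api(3)`; whether the warning is actually displayed depends on the active warning filters.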
The Obsolete attribute marks a program entity as one that is no longer recommended for use. Each use of an entity marked obsolete will subsequently generate a warning or an error, since the entity is no longer in use or no longer exists.
EDIT:
depreciated : Not sure how this relates to programming
I wouldn't say obsolete means it doesn't work anymore. In my mind obsolete just means there are better alternatives. A thing becomes obsolete because of something else. Deprecated means you shouldn't use it, although there might not be any alternatives. A thing becomes deprecated because someone says it is -- it is prescriptive.
"Obsolete" means "has been replaced".
"Depreciated" means "has less value than its original value".
"Deprecated" means to expressly disapprove of and was popularised due to misspellings in two technical articles where the authors used deprecated without an "i". One in 1999 and the other in 2002 are referenced in the dictionaries as origins.
Prior to that time frame we were reading comments like // depreciated in API documentation including the MS MSDN.
The use of deprecated in the tech industry is therefore completely incorrect and evidence of how a technical writer can produce a bug that can live in a language and someone should finally put the bug to rest.

How does google determine the date a thread was posted?

When you search in google, when searching for a term, you can click "Discussion" on the left hand side of the page. This will lead you to forum based discussions which you can select. I was in the process of designing a discussion board for a usergroup and I would like for google to index my data with post time.
You can filter the results by "Any Time" - "Past Hour" - "Past 24 Hours" - "Past Week" - etc.
What is the best way to ensure that the post date is communicated to google? RSS feed for thread? Special HTML label tag with particular id? Or some other method?
Google continually improves their heuristics and as such, I don't think there are any (publicly known) rules for what you describe. In fact, I just did a discussion search myself and found the resulting pages to have wildly differing layouts, and not all of them have RSS feeds or use standard forum software. I would just guess that Google looks for common indicators such as Post #, Author, Date.
Time-based filtering is mostly based on how frequently Google indexes your page and identifies new content (although discussion pages could also be filtered based on individual post dates, which is once again totally up to Google). Just guessing, but it might also help to add Last-Modified headers to your pages.
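As a rough illustration of the Last-Modified suggestion, the sketch below formats a thread's most recent post time as an HTTP date using only the Python standard library. The function name `last_modified_header` is hypothetical, it assumes a timezone-aware datetime as input, and nothing here is known to affect how Google dates a thread.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(last_post_time):
    """Return a (name, value) header pair in HTTP date format.

    Assumes last_post_time is a timezone-aware datetime.
    """
    utc_time = last_post_time.astimezone(timezone.utc)
    # usegmt=True produces the "GMT" suffix that HTTP dates use
    return ("Last-Modified", format_datetime(utc_time, usegmt=True))
```

The returned pair can be attached to the response by whatever web framework the discussion board uses.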
I believe Google will simply look at when the content appeared. No need for parsing there, and no special treatment required on your end.
i once read a paper from a googler (a paper i sadly can't find anymore, if somebody finds it, please give me a note) where it was outlined. a lot of formulas and so on, but the bottom line was: google has analyzed the structure of the top forum systems on the web. it does not use a page metaphor to analyse it, but breaks the forum down into topics, threads and posts.
so basically, if you use a standard, popular forum system, google knows that it is a forum and puts you into the discussion segment. if you build your own forum software it is probably best to use existing, established forum conventions (topics, threads, posts, authors....).

How to report a bug in an open-source-app? [closed]

What if I think that I have found a bug in an open-source app? What steps can I take to provide as much helpful information for the programmers as possible? And how do I report it best, so that I avoid annoying the programmers?
Addition: Some here say that the OS programmers will love the report, but some projects are very picky about bug reports. They say that they are non-bugs, or that the problem is non-reproducible, or that the behaviour is intended, or similar things. Some of that criticism of the bug reports may be justified, but often it isn't. I want to 'optimize' the bug report to get the best feedback (preferably a fix) out of it.
The minimal information that I as a FOSS developer would like to get from someone submitting a bug report is:
software version
platform
brief description of the bug
sample input that you think is correct
sample output that you think is incorrect (and why you think this)
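The checklist above could be turned into a small helper that assembles the report text. This is only a sketch: the function name `format_bug_report` and its field labels are assumptions made for illustration, and each project will have its own preferred format.

```python
import platform


def format_bug_report(app_version, description, sample_input,
                      actual_output, why_wrong):
    """Assemble the minimal fields from the checklist into plain text."""
    lines = [
        f"Software version: {app_version}",
        f"Platform: {platform.platform()}",  # OS name and version of this machine
        f"Description: {description}",
        f"Sample input: {sample_input}",
        f"Actual output: {actual_output}",
        f"Why I think it is wrong: {why_wrong}",
    ]
    return "\n".join(lines)
```

The platform line is filled in automatically, which avoids the common omission of the operating system details.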
Exactly how you go about supplying the information will vary enormously from app to app. Before posting the bug, you should take a look at the support newsgroups or mailing lists to see how this kind of thing is handled.
Edit: If the bug is non-reproducible or intended behaviour, I don't think you will be getting a fix, no matter how you optimise the report. But you do always have the option of fixing it yourself if you are utterly convinced it is a bug.
First, go on the project page and check for information on how to report bugs. They might have a preferred way of doing it.
Most projects have mailing lists. Most of them have a user and a developer mailing list. Start by searching the lists to see if the bug you have discovered was already discussed. Maybe it's not a bug and the product simply does not support what you try to do.
If you have already dug into the code and found the cause of the bug (and maybe the fix), subscribe to the developer list and post a message describing the problem. Include a complete description of the problem, the version you use (and the versions of other software if needed, e.g. web server, OS, ...), a test case, what you found in the code and the patch you made. If it's a bug, they will tell you to report it in their bug tracking software (Bugzilla, Mantis, Redmine, Trac, ...).
If you don't find anything in the code, subscribe to the user list and post your problem.
Avoid saying things like "please, I really need to fix or I ...". Open source developers are not your employees. If you want something fixed, you can always do it yourself. Avoid ultimatums and rants about the software.
If the bug was already reported, the only thing you can do is watch it or vote it up. Avoid adding comments like "me too!" or "we need this fixed!" or "why is this still not fixed?!?". That is annoying.
Find the bug system (for example, https://bugzilla.mozilla.org/ for Firefox). If you can't find any links off the main page or from Google, you may have to use one of the project's mailing lists or forums. Poke around a bit and find the most appropriate one to use.
Once you have figured out where bugs should be reported, do a search to see if your bug has already been reported. If it has, see if there is anything you can add that would be helpful ("me too!" comments are not helpful; additional information is very helpful).
When it comes to what to report, first list your environment (operating system, what version you are using, where you got it from, etc.). Describe the bug (what is going wrong), and give detailed steps on how to reproduce it.
For general advice about reporting bugs (what information to provide, etc) I recommend Simon Tatham's paper: http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
A. They will love to hear from you, this is not annoying.
B. describe exactly how you can reproduce the bug, what steps, what OS, what else is running on the system.
C. look at the open source project's site - it probably has an address to submit this kind of info.
Find the application's website. There is usually information there about bug reporting procedures, as well as bugs that have already been submitted (so that you don't submit a duplicate). Error messages, screenshots, and steps to reproduce are what I always like to have when I am trying to track down/fix a bug.

What different terms mean the same thing (or don't, but people think they do)? [closed]

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
One of the pitfalls I run into on a daily basis is customers saying one thing while meaning another. Usually, this is just due to a miscommunication somewhere, but occasionally they are, in fact, saying the same thing I am, just using a different term.
For example, one of my customers the other day mentioned a feature he called, "find as you type." Being a little confused, I asked him what he meant, and he described the feature in Google where, once you start typing a search query, Google suggests other, popular queries that match the letters you have typed.
Click! He meant AutoComplete! He was not wrong, it is just that I had never heard that term before.
In the spirit of reducing confusion, what terms can you think of that are different but mean, essentially, the same thing?
Also, what terms do people think mean the same thing, but don't. Please differentiate between the two.
Please only one set of terms per answer, so we can vote on the best ones.
parameter == argument
Parameter is the variable in the declaration of function or method.
Argument is the actual value of this variable that gets passed to function.
I like this one because it happens even to programmers
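A short Python sketch of the distinction; the function `greet` is a made-up example.

```python
import inspect


def greet(name):            # `name` is a parameter: it appears in the declaration
    return f"Hello, {name}!"


message = greet("Ada")      # "Ada" is an argument: the value passed in this call

# The parameters belong to the function itself and can be listed without calling it.
parameter_names = list(inspect.signature(greet).parameters)
```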
I've seen this a few times on this site:
Authentication != Authorization
Authentication: Your identity
Authorization: Your privileges
Users often confuse "web browser" with "the Internet." I'll hear them say "I'm going to the Internet," which means "I'm launching a web browser."
"CPU" = tower
A favorite term I have heard customers use.
AJAX and Javascript.
A lot of times I hear semi-technical people interchanging the two terms. Like: "Can't you animate that image using AJAX". Which is of course just plain javascript.
"Client" is the big, perennial classic term that means so many things, all within the context of almost every development project.
Hard drive space != RAM
Verification == Validation
From wikipedia:
It is sometimes said that validation can be expressed by the query "Are you building the right thing?" and verification by "Are you building the thing right?". "Building the right thing" refers back to the user's needs, while "building it right" checks that the specifications be correctly implemented by the system. In some contexts, it is required to have written requirements for both as well as formal procedures or protocols for determining compliance.
"open source" == "free software"
If you watch Revolution OS, you'll hear Richard Stallman use the term "free software" and others like Linus Torvalds and Bruce Perens use "open source." After watching the film, I think they're talking about the same thing, but disagreeing (vehemently in some cases) on what to call it.
(I hope none of them are reading this.)
"Inconceivable"
I do not think it means what you think it means.
I once heard a junior dev use NULL and VOID interchangeably.
Scariest thing I'd ever heard.
Drop down = Combo box
Wiki != Wikipedia. (As in, "Well I looked it up on Wiki, and it says...")
This one is not really programming related, but it could cause a problem for someone working at a company that had their own internal wiki.
Wiki: http://en.wikipedia.org/wiki/Wiki
Wikipedia: http://en.wikipedia.org/wiki/Wikipedia:About
Some wikis that are not Wikipedia: http://en.wikipedia.org/wiki/List_of_wikis
Java == Javascript
Winchester == hard disk drive.
It ain't!
[Image: Winchester Model 1873 short rifle] http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Winchester_Model_1873_Short_Rifle_1495.jpg/300px-Winchester_Model_1873_Short_Rifle_1495.jpg
Scope != Lifetime
Scope :: is the collection of statements where a variable can be referenced. Those statements are called the referencing environment of that variable.
Lifetime :: is the period during which a variable (the name) stays associated with its place of storage in memory (address).
Closure == lambda. In reality, they are distinct things: lambda is any anonymous function, and may or may not close over some variables; closure is any function that closes over some variables, and may or may not be anonymous. For example, the original Pascal had no lambdas, but it had closures (in form of nested functions).
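The two cases that get conflated can be shown side by side in Python. The names `double`, `make_counter` and `advance` are invented for this sketch.

```python
# A lambda that is not a closure: anonymous, and it uses only its own parameter.
double = lambda x: x * 2


# A closure that is not a lambda: a named nested function that captures
# `step` and `total` from the enclosing call.
def make_counter(step):
    total = 0

    def advance():
        nonlocal total
        total += step
        return total

    return advance


counter = make_counter(5)
```

In CPython the difference is visible on the function objects: `double.__closure__` is `None`, while `counter.__closure__` holds the captured cells.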
deprecate != depreciate
Seriously people. Features are not depreciated from upcoming releases of software. They are deprecated.
hard disk drive = computer
There are 180 pages of preferred terms in the "Microsoft Manual of Style for Technical Publications," which is a great book for technical writers, but I think programmers should have it too.
Many of the entries mention unacceptable (or outdated) equivalents.
Example: "system tray Do not use. Use notification area instead."
PowerPoint != the projector
It really bothers me when people say "I'll just put it up on the PowerPoint" and then they go to Microsoft Word or something instead.
Bug - Incident - Failure - Error - Defect - Problem - Issue
Some users will use the term "downloading" to generally mean "transferring" instead of distinguishing between "downloading" and "uploading." So, if they say "The error happened right after I downloaded the data," it might refer to another part of the process than what a tech person would take it to mean.
System == Library == Framework == Program == Application == Software
One that really turned my head around was someone in my QA department referring to a null value and a blank value as being one and the same. I smiled and asked if they were serious and they said, "of course they're the same." I tried to explain as simply as I could that they were not the same and it just didn't register with them.
/matt
PC != Windows
PC means personal computer. Apple invented the PC. But, now it's taken a life of its own as anything that has Windows on it.
In this same vein, people tend to compare "Mac" or "PC" when it should be "OS X" or "Windows"... or "Mac vs. ThinkPad/Satellite"
Of course, that would be more difficult to put into an ad.
computer == system == workstation == machine == box
Whenever dealing with Departments of Education you must learn that "system" means software and "technology" means hardware.
Host == Server
.. Which is untrue :)
Value Object == Value Type
Value Objects are classes representing immutable attributes, as in Domain Driven Design.
Value Types are variables whose values are held on the stack (int, bool, struct, etc). These are spoken of in relation to Reference Types, which live on the heap and have memory pointers.

Detecting a (naughty or nice) URL or link in a text string

How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.
Some objectives:
The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
Any other funny business
Of course, I am blocking spam, but the same process could be used to auto-link text.
Ideas
Here are some things I'm thinking.
The content is native-language prose so I can be trigger-happy in detection
Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
Maybe multiple passes is a better strategy, with scans for:
Well-formed URLs
All non-whitespace followed by '.' followed by any valid TLD
Anything else?
Related Questions
I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.
replace URL with HTML Links javascript
What is the best regular expression to check if a string is a valid URL
Getting parts of a URL (Regex)
Update and Summary
Wow, there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:
#Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
For those suspicious strings, replace the dot with a dot-looking character as per #capar
A good dot-looking character is #Sharkey's subscripted &middot; (i.e. "·"). &middot; is also a word boundary so it's harder to casually copy & paste.
That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:
Strip out all dotted-quads (#Sharkey's comment to his own answer)
#Sporkmonger's requirement for client-side Javascript which inserts a required hidden field into the form.
Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter as per #Nathan..)
Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will therefore be actively trying to contravene your check and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal is something else.
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:
\S\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(/|\b)
Things this will get:
buyfunkypharmaceuticals.it
google.com
http://stackoverflo**w.com/**questions/700163/
It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
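The TLD idea above can be tried out with a minimal Python sketch. Two assumptions are worth flagging: in Python's `re`, `\b` inside square brackets means a backspace character rather than a word boundary, so the sketch keeps the boundary outside any character class; and the TLD list is simply the one quoted in this answer, which is not current. The helper name `looks_like_url` is hypothetical.

```python
import re

# Assumption: TLD list taken from the answer above; it is not up to date.
GENERIC_TLDS = (
    "aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|"
    "museum|name|net|org|pro|tel|travel"
)

# A non-space character, a dot, a two-letter or listed TLD,
# then either a slash or a word boundary.
TLD_PATTERN = re.compile(
    r"\S\.(?:[a-zA-Z]{2}|" + GENERIC_TLDS + r")(?:/|\b)"
)


def looks_like_url(text):
    """Return True if the text contains something shaped like host.tld."""
    return TLD_PATTERN.search(text) is not None
```

As the answer says, this is a chokepoint rather than a validator: any word followed by a dot and two letters (a file name, for instance) is flagged too.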
I'm not sure if detecting URLs with a regex is the right way to solve this problem. Usually you will miss some sort of obscure edge case that spammers will be able to exploit if they are motivated enough.
If your goal is just to filter spam out of comments then you might want to think about Bayesian filtering. It has proved to be very accurate in flagging email as spam, it might be able to do the same for you as well, depending on the volume of text you need to filter.
I know this doesn't help with auto-link text but what if you search and replaced all full-stop periods with a character that looks like the same thing, such as the unicode character for hebrew point hiriq (U+05B4)?
The following paragraph is an example:
This might workִ The period looks a bit odd but it is still readableִ The benefit of course is that anyone copying and pasting wwwִgoogleִcom won't get too farִ :)
Well, obviously the low hanging fruit are things that start with http:// and www. Trying to filter out things like "www . g mail . com" leads to interesting philosophical questions about how far you want to go. Do you want to take it the next step and filter out "www dot gee mail dot com" also? How about abstract descriptions of a URL, like "The abbreviation for world wide web followed by a dot, followed by the letter g, followed by the word mail followed by a dot, concluded with the TLD abbreviation for commercial".
It's important to draw the line of what sorts of things you're going to try to filter before you continue with trying to design your algorithm. I think that the line should be drawn at the level where "gmail.com" is considered a url, but "gmail. com" is not. Otherwise, you're likely to get false positives every time someone fails to capitalize the first letter in a sentence.
Since you are primarily looking for invitations to copy and paste into a browser address bar, it might be worth taking a look at the code used in open source browsers (such as Chrome or Mozilla) to decide if the text entered into the "address bar equivalent" is a search query or a URL navigation attempt.
Ping the possible URL
If you don't mind a little server side computation, what about something like this?
urls = []
for possible_url in extracted_urls(comment):
    if pingable(possible_url):
        urls.append(possible_url)  # you could do this as a list comprehension, but OP may not know python
Here:
extracted_urls takes in a comment and uses a conservative regex to pull out possible candidates
pingable actually uses a system call to determine whether the hostname exists on the web. You could have a simple wrapper parse the output of ping.
[ramanujan:~/base]$ping -c 1 www.google.com
PING www.l.google.com (74.125.19.147): 56 data bytes
64 bytes from 74.125.19.147: icmp_seq=0 ttl=246 time=18.317 ms
--- www.l.google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 18.317/18.317/18.317/0.000 ms
[ramanujan:~/base]$ping -c 1 fooalksdflajkd.com
ping: cannot resolve fooalksdflajkd.com: Unknown host
The downside is that if the host gives a 404, you won't detect it, but this is a pretty good first cut -- the ultimate way to verify that an address is a website is to try to navigate to it. You could also try wget'ing that URL, but that's more heavyweight.
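One way the two helpers might look is sketched below. Note the substitution: instead of shelling out to `ping` and parsing its output, this version only checks whether the host name resolves in DNS, which is what the "Unknown host" case above detects. The candidate regex and the injectable `resolve` parameter are assumptions added for illustration and testing.

```python
import re
import socket

# Hypothetical conservative pattern: dotted names ending in two or more letters.
CANDIDATE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def extracted_urls(comment):
    """Pull out host-name-shaped candidates from a comment."""
    return CANDIDATE.findall(comment)


def pingable(hostname, resolve=socket.gethostbyname):
    """Return True if the host name resolves (DNS lookup, not an ICMP ping)."""
    try:
        resolve(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True


def live_hosts(comment, resolve=socket.gethostbyname):
    """Candidates from the comment whose host names resolve."""
    return [host for host in extracted_urls(comment) if pingable(host, resolve)]
```

Passing a fake `resolve` function makes the check testable without network access; in production the default `socket.gethostbyname` is used.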
Having made several attempts at writing this exact piece of code, I can say unequivocally, you won't be able to do this with absolute reliability, and you certainly won't be able to detect all of the URI forms allowed by the RFC. Fortunately, since you have a very limited set of URLs you're interested in, you can use any of the techniques above.
However, the other thing I can say with a great deal of certainty, is that if you really want to beat spammers, the best way to do that is to use JavaScript. Send a chunk of JavaScript that performs some calculation, and repeat the calculation on the server side. The JavaScript should copy the result of the calculation to a hidden field so that when the comment is submitted, the result of the calculation is submitted as well. Verify on the server side that the calculation is correct. The only way around this technique is for spammers to manually enter comments or for them to start running a JavaScript engine just for you. I used this technique to reduce the spam on my site from 100+/day to one or two per year. Now the only spam I ever get is entered by humans manually. It's weird to get on-topic spam.
Of course you realize that if spammers decide to use TinyURL or similar services to shorten their URLs, your problem just got worse. You might have to write some code to look up the actual URLs in that case, using a service like TinyURL decoder.
Consider incorporating the OWASP AntiSAMY API...
I like capar's answer the best so far, but dealing with unicode fonts can be a bit fraught, with older browsers often displaying a funny thing or a little box ... and the location of the U+05B4 is a bit odd ... for me, it appears outside the pipes here |ִ| even though it's between them.
There's a handy &middot; (·) though, which breaks cut and paste in the same way. Its vertical alignment can be corrected by <sub>ing it, eg:
stackoverflow·com
Perverse, but effective in FF3 anyway, it can't be cut-and-pasted as a URL. The <sub> is actually quite nice as it makes it visually obvious why the URL can't be pasted.
Dots which aren't in suspected URLs can be left alone, so for example you could do
s/\b\.\b/<sub>·<\/sub>/g
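The same substitution can be sketched in Python with `re.sub`. The function name `defuse_dots` is made up for this example; the pattern is the one from the sed-style expression above.

```python
import re

# A dot with a word character on both sides, as in the expression above.
SUSPECT_DOT = re.compile(r"\b\.\b")


def defuse_dots(text):
    """Replace dots inside words with a subscripted middle dot (U+00B7)."""
    return SUSPECT_DOT.sub("<sub>\u00b7</sub>", text)
```

Sentence-ending dots are left alone because they are followed by a space or the end of the text, but dots between digits (such as in "3.14") are replaced as well, so a real filter would probably restrict this to suspected URLs.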
Another option is to insert some kind of zero-width entity next to suspect dots, but things like &zwj; and &zwnj; and &zwsp; don't seem to work in FF3.
There are already some great answers in here, so I won't post more. I will give a couple of gotchas though. First, make sure to test for known protocols; anything else may be naughty. As someone whose hobby concerns telnet links, I'd say you will probably want to include more than http(s) in your search, but may want to prevent, say, aim: or some other URLs. Second, many people will delimit their links in angle brackets (gt/lt) like <http://theroughnecks.net> or in parens "(url)", and there's nothing worse than clicking a link and having the closing > or ) go along with the rest of the URL.
P.S. sorry for the self-referencing plugs ;)
I needed just the detection of simple http urls with/out protocol, assuming that either the protocol is given or a 'www' prefix. I found the above mentioned link quite helpful, but in the end I came up with this:
http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+
This does, obviously, not test compliance to the dns standard.
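For reference, here is how that pattern behaves when used from Python. The wrapper `find_simple_url` is a hypothetical helper; the regex itself is copied unchanged from above.

```python
import re

# Pattern copied from the text above; it does not check DNS or RFC validity.
SIMPLE_URL = re.compile(r"http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+")


def find_simple_url(text):
    """Return the first matching substring, or None if nothing matches."""
    match = SIMPLE_URL.search(text)
    return match.group(0) if match else None
```

Because `\S+` runs up to the next whitespace, trailing punctuation such as a closing parenthesis ends up inside the match, and a host without any dot (or a bare "www.com") is not matched at all.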
Given the messes of "other funny business" that I see in Disqus comment spam in the form of look-alike characters, the first thing you'll want to do is deal with that.
Luckily, the Unicode people have you covered. Dig up an implementation of the TR39 Skeleton Algorithm for Unicode Confusables in your programming language of choice and pair it with some Unicode normalization and Unicode-aware upper/lower-casing.
The skeleton algorithm uses a lookup table maintained by the Unicode people to do something conceptually similar to case-folding.
(The output may not use sensible characters, but, if you apply it to both sides of the comparison, you'll get a match if the characters are visually similar enough for a human to get the intent.)
Here's an example from this Java implementation:
// Skeleton representations of unicode strings containing
// confusable characters are equal
skeleton("paypal").equals(skeleton("paypal")); // true
skeleton("paypal").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("paypal").equals(skeleton("ρ⍺у𝓅𝒂ן")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
// The skeleton representation does not transform case
skeleton("payPal").equals(skeleton("paypal")); // false
// The skeleton representation does not remove diacritics
skeleton("paypal").equals(skeleton("pàỳpąl")); // false
(As you can see, you'll want to do some other normalization first.)
Given that you're doing URL detection for the purpose of judging whether something's spam, this is probably one of those uncommon situations where it'd be safe to start by normalizing the Unicode to NFKD and then stripping codepoints declared to be combining characters.
(You'd then want to normalize the case before feeding them to the skeleton algorithm.)
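The pre-processing described above (NFKD, strip combining characters, normalize case) can be sketched with Python's standard `unicodedata` module. This covers only the normalization that would run before a skeleton comparison; the TR39 skeleton algorithm itself is not part of the standard library and is not shown. The function name `normalize_for_matching` is an assumption.

```python
import unicodedata


def normalize_for_matching(text):
    """NFKD-decompose, drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch)  # nonzero class means a combining mark
    )
    return stripped.casefold()
```

With this step in front, inputs that differ only by diacritics, case, or compatibility forms (such as fullwidth letters) compare equal; look-alike letters from other scripts still need the confusables table.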
I'd advise that you do one of the following:
1. Write your code to run a confusables check both before and after the characters get decomposed, in case things are considered confusables before being decomposed but not after, and check both uppercased and lowercased strings in case the confusables tables aren't symmetrical between the upper and lowercase forms.
2. Investigate whether #1 is actually a concern (no need to waste CPU time if it isn't) by writing a little script to inspect the Unicode tables and identify any codepoints where decomposing or lowercasing/uppercasing a pair of characters changes whether they're considered confusable with each other.