Is it safe to use numbers in your web page file names? - html

Someone recently told me that using numbers in web page file names is not good practice. For example, say I was making a website about Samara Morgan and I had a file named 7days.html - would it be bad to start the file name with a number? Is it riskier than having numbers put later in the file name (ie. day7.html)?
I'm just a tad confused on whether it's generally discouraged to use numbers in file names or not.
EDIT: After asking them to explain a bit more, this is what they said to me:
.... the simplest way I can explain it is that certain programming
languages and operating systems might be confused by putting the
number as the first character. In other words, it has a higher
potential for error, so it's not recommended. That being said, it IS
acceptable to use a number AFTER the first character. By the way, a
domain name (like 4chan.org) is a little different because it's not a
file.
Here are some more tips/best practices (you'll see it as #3):
https://ed.fnal.gov/lincon/tech_web_naming.shtml

I think you need to go back to this someone and ask them for more information - are they saying there's a security problem? a usability problem because of something users might want to do with it? a Search Engine Optimisation trick you're missing that would make it easier for people to find?
I can't actually think of why numbers in URLs would matter for any of these, however. It seems most likely they were thinking of SEO, because that's a constant battle between search engines (who want users to get the results they want) and publishers (who want to get their brand higher up the results) and full of half-understood experiments and dodgy advice.
It's also worth noting that URLs don't exactly have "filenames" at all - they're just a string that the browser sends to the server, and the server may or may not map to a file on disk. Look at the URL of this page, for instance - it contains enough information for the server to look up the right question in a database, plus some human-readable text which is mostly for SEO.
Your server has filenames, of course, but I can't think of any reason why having numbers in those would be a problem, let alone why it would apply particularly to web pages.
Edit based on additional information supplied:
Two things I notice about the link you've added: one, it's twenty years old; two, it includes detailed reasoning for every single point, except point 3. I can't think of any "programming languages and operating systems" that would have a problem with a leading digit. It's actually quite common in some (non-web) contexts, as a way of forcing files to be listed or run in the desired order (e.g. 01-contents.txt, 02-introduction.txt, etc).
I can imagine problems if you began the filename with a ., -, or _, because sometimes there are entrenched conventions that those are hidden, or backups, etc. Either the advice made sense 20 years ago, or the author was being overly conservative to keep the rule simple.

To be precise . Your question refers to whether it is permissible or appropriate to begin the name of a file with an o or more numeric characters .. and according to convenzini on die files used by (main operating system names) this type of naming is allowed and does not present any problem we use it to enterpretazione ..
windows https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
linux https://www.cyberciti.biz/faq/linuxunix-rules-for-naming-file-and-directory-names/
the situation slightly different for the programming languages ​​and the most common case is that of C / C ++ where the use of variables with completely numeric characters or compound nouns that begin with numeric characters can be confusing, and therefore this practice is by some not recommended.
(See this SO for C/C++ vars naming samples and problem Is it safe to use numbers in your web page file names?)
Therefore, in your case that refers to names of files .. the limitations that you have been inidicate are not reflected.

No just keep it like that it doesn't effect anything

Related

If I have a collection of random websites, how do I get specific information from each?

Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking about approaching the email, is just to search for all mailto a tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
Another issue that is more obvious is detecting the partner(s) names just from the markup. I was initially thinking I could just pull all the header tags and text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I cant find any partners at all so in my opinion you better stand to gain from this if not you better look for some other invention.
I have done similar datascraping from time to time, and in norway we have laws - or should I say "laws" - that you are not allowed to email people however you are allowed to email the company - so in a way the same problem from another angle.
I wish I knew maths and algorythms by heart because I am sure there is a fascinating sollution hidden in AI and machine learning, but in my mind the only sollution I can see is building a rule set that over time probably gets quite complex. Maby you could apply some bayesian filtering - it works very well for email.
But - to be a little more productive here. One thing i know is inmportant, you could start by creating the crawler environment and building the dataset. Have the database for URLS so you can add more at any time, and start the crawling on what you have already so that you do your testing querying your own data with a 100% copy. This will save you enormous time instead of live scraping while tweaking.
I did my own search engine some years ago, scraping all NO domains however I needed only the index file that time. Took over a week alone just to scrape it down and I think it was 8GB of data just for that single file, and I had to use several proxyservers aswell to make it work due to problems with to much DNS traffik. Lots of problems that needed being taken care of. I guess I am only saying - if you are crawling a large scale you might aswell start getting the data down if you want to work efficient with the parsing later.
Good luck, and do post if you get a sollution. I do not think it is posible without an algorythm or AI though - people design websites the way they like and they pull templates out of their arse so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so its simpler. Then you could just crawl each site, and make a profile for each site. You could employ someone cheap to manual go through the parsed data and remove all the errors. This is probably how most people does it, unless someone already have done it and the database is for sale / available from webservice so it can be scraped.
The links you provide are mainly US site, so I guess you are focusing on English names. In that case, instead of parsing from html tags, I would just search the whole webpage for name. (There are free database of first name and last name) This may also work if you are donig this for some other Europe company, but it would be a problem for company from some countries. Take Chinese as an example, while there is a fix set of last name, one may use basically any combination of Chinese character as first name, so this solution won't work for Chinese site.
It is easy to find email from a webpage as there is a fixed format of (username)#(domain name) with no space in between. Again I won't treat it as html tags but just as normal string so that the email can be found no matter it is in mailto tag or in plain text. Then, to determine what email is it:
Only one email in page?
Yes -> catch-all email.
No -> Is name found in that page as well?
No -> catch-all email (can have more than one catch-all email, maybe for different purpose like info + employment)
Yes -> Email should be attached to the name found right before it. It is normal that the name should appear before the email.
Then, it should be safe to assume the name appear first belongs to more important member, e.g. Chairman or partner.
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler to sort of auto find the information, it will be difficult. However, the high level looks something like this.
For each site you check, look for element patterns. Divs will often have labels, ID's, and classes which will easily let you grab information. Perhaps you find that many divs will have a particular class name. Check for this first.
It is often better to grab too much data from a particular page, and boil it down on your side afterwards. You could, perhaps, look for information which comes up on a screen by utilizing type (is link) or regex (is email) to look for formatted text. Names and occupation will be harder to find by this method, but might be related positionally on many pages to other well formatted items.
Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.) You could come up with a bank of those, and check against them for any page you end up on.
Finally, if you really wanted to make this process general purpose, you could do some heuristics to improve your methods based off of expected information; names, for example, are most often within a particular list. If it was worth your time, you could check certain text for whether it matches a list of more common names.
What you mentioned in your initial question seems that you would have a lot of benefit with a general purpose Regular Expressions crawler, and you could make improvements on it as you know more about the sites which you interact with.
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the examined applications are working in macOS.

what data storage model is used to store articles in wikipedia

Articles in wikipedia get edited. They can grow/shrink/updated etc. What file system/database storage layout etc is used underneath to support it. In database course, I had read a bit on variable length record, but that seemed like more for small strings and not for whole document. Like in file system, files can grow/shrink etc, and I think its done by chaining blocks together. each time, we update a file, not the whole file is rewritten. Perhaps something similar would be done here.
I am looking for specific names,terminologies, may be even how the schema in mysql is defined. (I think wikipedia uses mysql).
Below are links to some writeup on wikipedia architecture, but I am not being able to answer my question from these:
http://swe.web.cs.unibo.it/twiki/pub/WikiFactory/AntonelloDiMuroThesis/Wikipedia-cheapandexplosivescalingwithLAMP.pdf
http://dom.as/uc/workbook2007.pdf
Thanks,
See:
http://www.mediawiki.org/wiki/Manual:Database_layout

Is internationalizing later really more expensive? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
Most people would agree that internationalizing an existing app is more expensive than developing an internationalized app from scratch.
Is that really true? Or when you write an internationalized app from scratch the cost of doing I18N is just being amortized over multiple small assignments and nobody feels on his shoulders the whole weight of the internationalization task?
You can even claim that a mature app has many and many LOC that where deleted during the project's history, and that they don't need to be I18Ned if internationalization is made as an after thought, but would have been I18N if the project was internationalized from the very beggining.
So do you think a project starting today, must be internationalized, or can that decision be deferred to the future based on the success (or not) the software enjoys and the geographic distribution of the demand.
I am not talking about the ability to manipulate unicode data. That you have for free in most mainstream languages, databases and libraries. I am talking specifically of supporting your own software's user interface in multiple languages and locales.
"when you write an internationalized app from scratch the cost of doing I18N is ... amortized"
However, that's not the whole story.
Retroactively tracking down every message to the users is -- in some cases -- impossible.
Not hard. Impossible.
Consider this.
theMessage = "Some initial part" + some_function() + "some following part";
You're going to have a terrible time finding all of these kinds of situations. After all, some_function just returns a String. You don't know if it's a database key (never shown to a person) or a message which must be translated. And when it's translated, grammar rules may reveal that a 3-part string concatenation was a dumb idea.
You can't simply GREP every String-valued function as containing a possible I18N message that must be translated. You have to actually read the code, and possibly rewrite the function.
Clearly, when some_function has any complexity to it at all, you're stumped as to why one part of your application is still in Swedish while the rest was successfully I18N'd into other languages. (Not to pick on Swedes in particular, replace this with any language used for development different from final deployment.)
Worse, of course, if you're working in C or C++, you might have some of this split between pre-processor macros and proper C-language syntax.
And in a dynamic language -- where code can be built on the fly -- you'll be paralyzed by a design in which you can't positively identify all the code. While dynamically generating code is a bad idea, it also makes your retroactive I18N job impossible.
I'm going to have to disagree that it costs more to add it to an existing application than from scratch with a new one.
A lot of the time i18n is not required until the application gets 'big'. When you do get big, you will likely have a bigger development team to devote to i18n so it will be less of a burden.
You may not actually need it. A lot of small teams put great effort to support internationalization when you have no customers who require it.
Once you have internationalized, it makes incremental changes more time consuming. It doesn't take a lot of extra time but every time you need to add a string to the product, you need to add it to the bundle first and then add a reference. No it is not a lot of work but it is effort and does take a bit of time.
I prefer to 'cross that bridge when we come to it' and internationalize only when you have a paying customer looking for it.
Yes, internationalizing an existing app is definitely more expensive than developing the app as internationalized from day one. And it's almost never trivial.
For instance
Message = "Do you want to load the " & fileType() & " file?"
cannot be internationalised without some code alterations because many languages have grammatical rules like gender agreement. You often need a different message string for loading every possible file type, unlike in English when it's possible to bolt together substrings.
There are many other issues like this: you need more UI space because some languages need more characters than English to express the same concept, you need bigger fonts for East Asia, you need to use localised date/times in the user interface but perhaps English US when communicating with databases, you need to use semicolon as a delimeter for CSV files, string comparisons and sorting are cultural, phone numbers & addresses...
So do you think a project starting
today, must be internationalized, or
can that decision be deferred to the
future based on the success (or not)
the software enjoys and the geographic
distribution of the demand?
It depends. How likely is the specific project to be internationalised? How important it is to get a first version fast?
If you truly think you get "unicode handling" "for free", you may have a surprise coming your way when you try.
Unless you use a framework that has proven i18n ability beyond languages with the ANSI or very similar character sets, you will find several niggles and more major issues where the unicode handling isn't quite right, or simply unavailable. Even with relatively common languages (e.g. German) you can run into difficulty with shrinking or expanding letter counts and APIs that don't support unicode.
And then think of languages with different reading-ordering!
This is one of the reasons you should really plan it in from the beginning, and test the stuff to destruction on the set of languages you plan to support.
The concept of i18n and l10n is broader than merely translating strings to and fro some languages.
Example: Consider the input of date and time by users. If you haven't internationalization in mind when you design
a) the interface for the user and
b) the storage, retrieval and display mechanism
you will get a really bad time, when you want to enable other input schemes.
Agreed, in most cases i18n is not necessary in the first place. But, and that is my point, if you don't spend a thought on some areas, that must be touched for i18n, you will find yourself ending up rewriting large portions of the original code. And then, adding i18n is a lot more expensive than having spent some thought beforehand.
One thing that seems like it can be a big issue is the different character counts for a message in various languages. I do some work on iPhone apps and especially on a small screen if you design the UI for a message that has 10 characters and then you try to internationalize later and find you need 20 characters to display the same thing you now have to redo your UI to accommodate. Even with desktop apps this can still be a large PITA.
It depends on your project and how your team is organised.
I've been involved in the internationalization of a website, and it was one developer full-time for a year, probably about 6-8 months part-time for me to handle installation impacts when needed (reorganising files, etc), and other developers getting involved from time to time when their projects needed heavy refactoring. This was in an application that was at v3.
So that's definitely expensive. What you have to ask is how expensive is it to provide a localization system from the start, and how will that impact the project in the early stages. Your project at v1 may not be able to survive delays and setbacks caused by issues with a hastily-designed internationalization framework, while a stable v3 project with a wide customer base may have the capital to invest in doing that properly.
It also depends on whether you want to internationalize everything including log messages, or just the UI strings, and how many of those UI strings there are, and who you have available to do localization and the QA that goes with it, and even what languages you want to support - for example, does your system need to support unicode strings (which is a requirement for Asian languages).
And don;t forget that changing the database backend to support internationalized data can be costly as well. Just try to change that varchar field to nvarchar when you already have 20,000,000 records.
I think it depends on a language. Every j2ee(java web) app is i18n, because its very easy(even IDE can extract strings for you and you just name them).
In j2ee its cheaper to add it later, however the culture is to add them as soon as possible. I think its because j2ee uses a lot of open-source and almost all open-source libs are i18n. its great idea for them, but not for most j2ee app. most enterprise apps are just for one company that speak one language.
Plus if you have bad testers putting it too soon makes them give you bug reports about labels and translations(I only once saw translations done NOT by developers). After testers are done with it you have buggy app with excellent i18n support. However it might be fun for users to switch language and see if they can use it. However using your app its just boring work for them, so they wont even do that. The only users of i18n are the testers.
Weird string joining is not in j2ee culture since you know that one day someone might want to make it i18n. Only problem is extracting labels from html templates.
I cant say what is expensive but, i can tell you that a clean API lets you internationalise your Aplication at very low cost.

Detecting a (naughty or nice) URL or link in a text string

How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.
Some objectives:
The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
Any other funny business
Of course, I am blocking spam, but the same process could be used to auto-link text.
Ideas
Here are some things I'm thinking.
The content is native-language prose so I can be trigger-happy in detection
Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
Maybe multiple passes is a better strategy, with scans for:
Well-formed URLs
All non-whitespace followed by '.' followed by any valid TLD
Anything else?
Related Questions
I've read these and they are now documented here, so you can just references the regexes in those questions if you want.
replace URL with HTML Links javascript
What is the best regular expression to check if a string is a valid URL
Getting parts of a URL (Regex)
Update and Summary
Wow, I there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:
#Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
For those suspicious strings, replace the dot with a dot-looking character as per #capar
A good dot-looking character is #Sharkey's subscripted · (i.e. "·"). · is also a word boundary so it's harder to casually copy & paste.
That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:
Strip out all dotted-quads (#Sharkey's comment to his own answer)
#Sporkmonger's requirement for client-side Javascript which inserts a required hidden field into the form.
Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter as per #Nathan..)
Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will therefore be actively trying to contravene your check and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal is something else.
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparitively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:
[^\b]\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)[\b/]
Things this will get:
buyfunkypharmaceuticals.it
google.com
http://stackoverflo**w.com/**questions/700163/
It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a miniscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
I'm not sure if detecting URLs with a regex is the right way to solve this problem. Usually you will miss some sort of obscure edge case that spammers will be able to exploit if they are motivated enough.
If your goal is just to filter spam out of comments then you might want to think about Bayesian filtering. It has proved to be very accurate in flagging email as spam, it might be able to do the same for you as well, depending on the volume of text you need to filter.
I know this doesn't help with auto-link text but what if you search and replaced all full-stop periods with a character that looks like the same thing, such as the unicode character for hebrew point hiriq (U+05B4)?
The following paragraph is an example:
This might workִ The period looks a bit odd but it is still readableִ The benefit of course is that anyone copying and pasting wwwִgoogleִcom won't get too farִ :)
Well, obviously the low hanging fruit are things that start with http:// and www. Trying to filter out things like "www . g mail . com" leads to interesting philosophical questions about how far you want to go. Do you want to take it the next step and filter out "www dot gee mail dot com" also? How about abstract descriptions of a URL, like "The abbreviation for world wide web followed by a dot, followed by the letter g, followed by the word mail followed by a dot, concluded with the TLD abbreviation for commercial".
It's important to draw the line of what sorts of things you're going to try to filter before you continue with trying to design your algorithm. I think that the line should be drawn at the level where "gmail.com" is considered a url, but "gmail. com" is not. Otherwise, you're likely to get false positives every time someone fails to capitalize the first letter in a sentence.
Since you are primarily looking for invitations to copy and paste into a browser address bar, it might be worth taking a look at the code used in open source browsers (such as Chrome or Mozilla) to decide if the text entered into the "address bar equivalent" is a search query or a URL navigation attempt.
Ping the possible URL
If you don't mind a little server side computation, what about something like this?
urls = []
for possible_url in extracted_urls(comment):
if pingable(possible_url):
urls.append(url) #you could do this as a list comprehension, but OP may not know python
Here:
extracted_urls takes in a comment and uses a conservative regex to pull out possible candidates
pingable actually uses a system call to determine whether the hostname exists on the web. You could have a simple wrapper parse the output of ping.
[ramanujan:~/base]$ping -c 1 www.google.com
PING www.l.google.com (74.125.19.147): 56 data bytes
64 bytes from 74.125.19.147: icmp_seq=0 ttl=246 time=18.317 ms
--- www.l.google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 18.317/18.317/18.317/0.000 ms
[ramanujan:~/base]$ping -c 1 fooalksdflajkd.com
ping: cannot resolve fooalksdflajkd.com: Unknown host
The downside is that if the host gives a 404, you won't detect it, but this is a pretty good first cut -- the ultimate way to verify that an address is a website is to try to navigate to it. You could also try wget'ing that URL, but that's more heavyweight.
Having made several attempts at writing this exact piece of code, I can say unequivocally, you won't be able to do this with absolute reliability, and you certainly won't be able to detect all of the URI forms allowed by the RFC. Fortunately, since you have a very limited set of URLs you're interested in, you can use any of the techniques above.
However, the other thing I can say with a great deal of certainty, is that if you really want to beat spammers, the best way to do that is to use JavaScript. Send a chunk of JavaScript that performs some calculation, and repeat the calculation on the server side. The JavaScript should copy the result of the calculation to a hidden field so that when the comment is submitted, the result of the calculation is submitted as well. Verify on the server side that the calculation is correct. The only way around this technique is for spammers to manually enter comments or for them to start running a JavaScript engine just for you. I used this technique to reduce the spam on my site from 100+/day to one or two per year. Now the only spam I ever get is entered by humans manually. It's weird to get on-topic spam.
Of course you realize if spammers decide to use tinuyrl or such services to shorten their URLs you're problem just got worse. You might have to write some code to look up the actual URLs in that case, using a service like TinyURL decoder
Consider incorporating the OWASP AntiSAMY API...
I like capar's answer the best so far, but dealing with unicode fonts can be a bit fraught, with older browsers often displaying a funny thing or a little box ... and the location of the U+05B4 is a bit odd ... for me, it appears outside the pipes here |ִ| even though it's between them.
There's a handy · (·) though, which breaks cut and paste in the same way. Its vertical alignment can be corrected by <sub>ing it, eg:
stackoverflow·com
Perverse, but effective in FF3 anyway, it can't be cut-and-pasted as a URL. The <sub> is actually quite nice as it makes it visually obvious why the URL can't be pasted.
Dots which aren't in suspected URLs can be left alone, so for example you could do
s/\b\.\b/<sub>·<\/sub>/g
Another option is to insert some kind of zero-width entity next to suspect dots, but things like ‍ and ‌ and &ampzwsp; don't seem to work in FF3.
There's already some great answers in here, so I won't post more. I will give a couple of gotchas though. First, make sure to test for known protocols, anything else may be naughty. As someone whose hobby concerns telnet links, you will probably want to include more than http(s) in your search, but may want to prevent say aim: or some other urls. Second, is that many people will delimit their links in angle-brackets (gt/lt) like <http://theroughnecks.net> or in parens "(url)" and there's nothing worse than clicking a link, and having the closing > or ) go allong with the rest of the url.
P.S. sorry for the self-referencing plugs ;)
I needed just the detection of simple http urls with/out protocol, assuming that either the protocol is given or a 'www' prefix. I found the above mentioned link quite helpful, but in the end I came out with this:
http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+
This does, obviously, not test compliance to the dns standard.
Given the messes of "other funny business" that I see in Disqus comment spam in the form of look-alike characters, the first thing you'll want to do is deal with that.
Luckily, the Unicode people have you covered. Dig up an implementation of the TR39 Skeleton Algorithm for Unicode Confusables in your programming language of choice and pair it with some Unicode normalization and Unicode-aware upper/lower-casing.
The skeleton algorithm uses a lookup table maintained by the Unicode people to do something conceptually similar to case-folding.
(The output may not use sensible characters, but, if you apply it to both sides of the comparison, you'll get a match if the characters are visually similar enough for a human to get the intent.)
Here's an example from this Java implementation:
// Skeleton representations of unicode strings containing
// confusable characters are equal
skeleton("paypal").equals(skeleton("paypal")); // true
skeleton("paypal").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("paypal").equals(skeleton("ρ⍺у𝓅𝒂ן")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
// The skeleton representation does not transform case
skeleton("payPal").equals(skeleton("paypal")); // false
// The skeleton representation does not remove diacritics
skeleton("paypal").equals(skeleton("pàỳpąl")); // false
(As you can see, you'll want to do some other normalization first.)
Given that you're doing URL detection for the purpose of judging whether something's spam, this is probably one of those uncommon situations where it'd be safe to start by normalizing the Unicode to NFKD and then stripping codepoints declared to be combining characters.
(You'd then want to normalize the case before feeding them to the skeleton algorithm.)
I'd advise that you do one of the following:
Write your code to run a confusables check both before and after the characters get decomposed, in case things are considered confusables before being decomposed but not after, and check both uppercased and lowercased strings in case the confusables tables aren't symmetrical between the upper and lowercase forms.
Investigate whether #1 is actually a concern (no need to waste CPU time if it isn't) by writing a little script to inspect the Unicode tables and identify any codepoints where decomposing or lowercasing/uppercasing a pair of characters changes whether they're considered confusable with each other.

Internationalization in your projects

How have you implemented Internationalization (i18n) in actual projects you've worked on?
I took an interest in making software cross-cultural after I read the famous post by Joel, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). However, I have yet to able to take advantage of this in a real project, besides making sure I used Unicode strings where possible. But making all your strings Unicode and ensuring you understand what encoding everything you work with is in is just the tip of the i18n iceberg.
Everything I have worked on to date has been for use by a controlled set of US English speaking people, or i18n just wasn't something we had time to work on before pushing the project live. So I am looking for any tips or war stories people have about making software more localized in real world projects.
It has been a while, so this is not comprehensive.
Character Sets
Unicode is great, but you can't get away with ignoring other character sets. The default character set on Windows XP (English) is Cp1252. On the web, you don't know what a browser will send you (though hopefully your container will handle most of this). And don't be surprised when there are bugs in whatever implementation you are using. Character sets can have interesting interactions with filenames when they move to between machines.
Translating Strings
Translators are, generally speaking, not coders. If you send a source file to a translator, they will break it. Strings should be extracted to resource files (e.g. properties files in Java or resource DLLs in Visual C++). Translators should be given files that are difficult to break and tools that don't let them break them.
Translators do not know where strings come from in a product. It is difficult to translate a string without context. If you do not provide guidance, the quality of the translation will suffer.
While on the subject of context, you may see the same string "foo" crop up in multiple times and think it would be more efficient to have all instances in the UI point to the same resource. This is a bad idea. Words may be very context-sensitive in some languages.
Translating strings costs money. If you release a new version of a product, it makes sense to recover the old versions. Have tools to recover strings from your old resource files.
String concatenation and manual manipulation of strings should be minimized. Use the format functions where applicable.
Translators need to be able to modify hotkeys. Ctrl+P is print in English; the Germans use Ctrl+D.
If you have a translation process that requires someone to manually cut and paste strings at any time, you are asking for trouble.
Dates, Times, Calendars, Currency, Number Formats, Time Zones
These can all vary from country to country. A comma may be used to denote decimal places. Times may be in 24hour notation. Not everyone uses the Gregorian calendar. You need to be unambiguous, too. If you take care to display dates as MM/DD/YYYY for the USA and DD/MM/YYYY for the UK on your website, the dates are ambiguous unless the user knows you've done it.
Especially Currency
The Locale functions provided in the class libraries will give you the local currency symbol, but you can't just stick a pound (sterling) or euro symbol in front of a value that gives a price in dollars.
User Interfaces
Layout should be dynamic. Not only are strings likely to double in length on translation, the entire UI may need to be inverted (Hebrew; Arabic) so that the controls run from right to left. And that is before we get to Asia.
Testing Prior To Translation
Use static analysis of your code to locate problems. At a bare minimum, leverage the tools built into your IDE. (Eclipse users can go to Window > Preferences > Java > Compiler > Errors/Warnings and check for non-externalised strings.)
Smoke test by simulating translation. It isn't difficult to parse a resource file and replace strings with a pseudo-translated version that doubles the length and inserts funky characters. You don't have to speak a language to use a foreign operating system. Modern systems should let you log in as a foreign user with translated strings and foreign locale. If you are familiar with your OS, you can figure out what does what without knowing a single word of the language.
Keyboard maps and character set references are very useful.
Virtualisation would be very useful here.
Non-technical Issues
Sometimes you have to be sensitive to cultural differences (offence or incomprehension may result). A mistake you often see is the use of flags as a visual cue choosing a website language or geography. Unless you want your software to declare sides in global politics, this is a bad idea. If you were French and offered the option for English with St. George's flag (the flag of England is a red cross on a white field), this might result in confusion for many English speakers - assume similar issues will arise with foreign languages and countries. Icons need to be vetted for cultural relevance. What does a thumbs-up or a green tick mean? Language should be relatively neutral - addressing users in a particular manner may be acceptable in one region, but considered rude in another.
Resources
C++ and Java programmers may find the ICU website useful: http://www.icu-project.org/
Some fun things:
Having a PHP and MySQL Application that works well with German and French, but now needs to support Russian and Chinese. I think I move this over to .net, as PHP's Unicode support is - in my opinion - not really good. Sure, juggling around with utf8_de/encode or the mbstring-functions is fun. Almost as fun as having Freddy Krüger visit you at night...
Realizing that some languages are a LOT more Verbose than others. German is a LOT more verbose than English usually, and seeing how the German Version destroys the User Interface because too little space was allocated was not fun. Some products gained some fame for their creative ways to work around that, with Oblivion's "Schw.Tr.d.Le.En.W." being memorable :-)
Playing around with date formats, woohoo! Yes, there ARE actually people in the world who use date formats where the day goes in the middle. Sooooo much fun trying to find out what 07/02/2008 is supposed to mean, just because some users might believe it could be July 2... But then again, you guys over the pond may believe the same about users who put the month in the middle :-P, especially because in English, July 2 sounds a lot better than 2nd of July, something that does not neccessarily apply to other languages (i.e. in German, you would never say Juli 2 but always Zweiter Juli). I use 2008-02-07 whenever possible. It's clear that it means February 7 and it sorts properly, but dd/mm vs. mm/dd can be a really tricky problem.
Anoter fun thing, Number formats! 10.000,50 vs 10,000.50 vs. 10 000,50 vs. 10'000,50... This is my biggest nightmare right now, having to support a multi-cultural environent but not having any way to reliably know what number format the user will use.
Formal or Informal. In some language, there are two ways to address people, a formal way and a more informal way. In English, you just say "You", but in German you have to decide between the formal "Sie" and the informal "Du", same for French Tu/Vous. It's usually a safe bet to choose the formal way, but this is easily overlooked.
Calendars. In Europe, the first day of the Week is Monday, whereas in the US it's Sunday. Calendar Widgets are nice. Showing a Calendar with Sunday on the left and Saturday on the right to a European user is not so nice, it confuses them.
I worked on a project for my previous employer that used .NET, and there was a built in .resx format we used. We basically had a file that had all translations in the .resx file, and then multiple files with different translations. The consequence of this is that you have to be very diligent about ensuring that all strings visible in the application are stored in the .resx, and anytime one is changed you have to update all languages you support.
If you get lazy and don't notify the people in charge of translations, or you embed strings without going through your localization system, it will be a nightmare to try and fix it later. Similarly, if localization is an afterthought, it will be very difficult to put in place. Bottom line, if you don't have all visible strings stored externally in a standard place, it will be very difficult to find all that need to be localized.
One other note, very strictly avoid concatenating visible strings directly, such as
String message = "The " + item + " is on sale!";
Instead, you must use something like
String message = String.Format("The {0} is on sale!", item);
The reason for this is that different languages often order the words differently, and concatenating strings directly will need a new build to fix, but if you used some kind of string replacement mechanism like above, you can modify your .resx file (or whatever localization files you use) for the specific language that needs to reorder the words.
I was just listening to a Podcast from Scott Hanselman this morning, where he talks about internationalization, especially the really tricky things, like Turkish (with it's four i's) and Thai. Also, Jeff Atwood had a post:
Besides all the previous tips, remember that i18n it's not just about changing words for their equivalent on other languages, especially for non-latin languages alphabets (korean, Arabic) which written right to left, so the whole UI will have to conform, like
item 1
item 2
item 3
would have to be
arabic text 1 -
arabic text 2 -
arabic text 3 -
(reversed bullet list doesn't seem to work :P)
which can be a UI nightmare if your system has to apply changes dinamically once the user changes the language being used.
Another very hard thing is to test different languages, not just for the correctness of word, but since languages like Korean usually have bigger font type for their characters this may lead to language specific bugs (like "SAVE" text on a button being larger than the button itself for some language).
One of the funnier things to discover: italics and bold text makrup does not work with CJK (Chinese/Japanese/Korean) characters. They simply become unreadable. (OK, I couldn't really read them before either, but especially bolding just creates ink blots)
I think everyone working in internationalization should be familiar with the Common Locale Data Repository, which is now a sub-project of Unicode:
Common Locale Data Repository
Those folks are working hard to establish a standard resource for all kinds of i18n issues: currency, geographical names, tons of stuff. Any project that's maintaining its own core local data given that this project exists is pretty bonkers, IMHO.
I suggest to use something like 99translations.com to maintain your translations . Otherwise you won't be able to tell what of your translations are up to date in every language.
Another challenge will be accepting input from your users. In many cases, this is eased by the input processing provided by the operating system, such as IME in Windows, which works transparently with common text widgets, but this facility will not be available for every possible need.
One website I use has a translation method the owner calls "wiki + machine translation". This is a community based site so is obviously different to the needs of companies.
http://blog.bookmooch.com/2007/09/23/how-bookmooch-does-its-translations/
One thing no one have mentioned yet is strings with some warying part as in "The unit will arive in 5 days" or "On Monday something happens." where 5 and Monday will change depending on state. It is not a good idea to split those in two and concatenate them. With only one varying part and good documentation you might get away with it, with two varying parts there will be some language that preferes to change the order of them.