Simple Regex Question - I've sat here for close to 4 hours trying to figure this out, and for most of you, it's simple, but not for me

I know what this site is for, and I've read people griping at others for asking a question like this, but I truly need help, and I've been working on this for a very long time.
I have a huge sheet of passwords and usernames in a .csv file, which I'm trying to format so it can be read by my Google profile (through Google Password Manager). I've tried all different kinds of ways, but Google is picky, and I've had a hell of a time getting it to register anything except "failed." I've searched through so many damn forum posts on this site and others. I know I'm almost there, as I finally got an error message, but I just want this to be done, so I figured I'd finally ask all you smart people for help. I've put forth the effort, and I just can't figure it out, nor can I figure out how to convert any of the examples I find on this site to what I need. I've tried many, many times.
I've used a regex tester to crawl and scrape my way to this:
/https:\/\/w{3}\.(...+.*\.com|net),?/g
I'm working on this in Notepad++, so it may be the wrong format. I've been using regex for a while without really understanding it, though figuring out the expression above has helped my understanding a lot.
Below is an edited version of how my document looks, which illustrates most of the issues I'm dealing with. I imported it into a spreadsheet program to get all the columns correct and to remove the excess (which is how I finally got Google to register an error message).
In Notepad++, I fixed all of the "https://" and all of the variations I had. But the hardest thing by far has been figuring out how to get the "www." onto the remaining lines. My code above seems close, but when the URL and the email address are on the same line, separated only by a comma, it combines them into one group, and for the life of me I cannot figure out how to fix it. So, please help.
The top line below is the header; the data rows themselves are in the order URL, username, password.
username,password,url,
https://www.nvidia.com,myemail@mail.com,password#1,
https://www.google.com,myemail@mail.com,password#2,
https://www.firefox.com,username,password#3,
https://na.alienwarearena.com,myemail@mail.com,password#4,
https://www.pinterest.cl,myemail@mail.com,password#5,
https://www.cplusplus.com,Username,password#6,
https://myaccount.google.com,myemail@gmail.com,password#7,
https://twitter.com,username,password#8,
Please, just a simple example of what I'm doing incorrectly, and if you feel like it, a simple explanation, as this has given me a massive headache. But any help is very much appreciated.

I might be over-simplifying this a bit, but it sounds like you just want to re-order the columns in the file, for which Excel would be the better option (you can just cut/paste the columns around).
But if you want to do this via regex, here's the simplest regex I can think of:
^(http[^,]+),([^,]+),([^,]+),?
I'm making the assumption that a , will never appear in your data. If it does, then the regex will get much more complex.
^ - anchors the match to the start of the line
(http[^,]+), - matches the URL. Requiring it to start with http conveniently skips the header row. Ends with a comma
([^,]+), - grabs whatever is between the previous , and the next one (which should be the username)
([^,]+),? - grabs the last part, which is the password
Each one of the () is called a group. When you do a replacement, you can reference these groups. They're numbered left to right starting at 1.
This allows you to do a find/replace in notepad++. The replacement line will look something like this:
$2,$3,$1
I'm using sublime and not notepad++. So the replacement might be something like \2,\3,\1 instead. Can't quite remember how notepad++ works on the replacement.
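If you'd rather script the whole transformation than run the find/replace by hand, here's a minimal Python sketch of the same regex (the file names are placeholders):
import re

# Reorder url,username,password -> username,password,url.
# The header row doesn't start with "http", so the pattern skips it automatically.
pattern = re.compile(r'^(http[^,]+),([^,]+),([^,]+),?')

with open('passwords.csv') as src, open('reordered.csv', 'w') as dst:
    dst.write('username,password,url\n')  # header row from the question
    for line in src:
        m = pattern.match(line)
        if m:
            url, username, password = m.groups()
            dst.write(f'{username},{password},{url}\n')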
Best of luck.


Search HTML Tables on Multiple Pages

Hello Stack Overflow Community!
I am making a directory of many thousands of custom mods for a game using HTML tables. When I started this project, I thought one HTML page would be slow but adequate for the ~4k files I was expecting. As I progressed, I realized there are tens of thousands of files I need to have in these tables, and to let the user search through to find what they are missing to load up a new scenario. Each entry has about 20 text fields and a small image (~3KB). I only need to be able to search through one column.
I'm thinking of dividing the tables across several pages on my website to help loading speeds and improve overall organization. But then a user would have to navigate to each page, and perform a search there. This could take a while and be very cumbersome.
I'm not great at website programming. Can someone advise a way to allow the user to search through several web pages and tables from one location? Ideally this would jump to the location in the table on the new webpage, or maybe highlight the entry like the browser's search function does.
You can see my current setup here: https://www.loco-dat-directory.site/
Hopefully someone can point me in the right direction, as I'm quite confused now :-)
These would be my steps:
Copy all the info into an Excel spreadsheet, convert that to JSON, load it as a JavaScript array (myarray), then add an input field and, on click, check whether input == myarray[0].propertyName.
If you want something more than an exact match, you'd need something like https://lodash.com/ in your project.
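For the spreadsheet-to-JSON step, a minimal Python sketch, assuming the sheet is exported as CSV with a header row (file names are hypothetical):
import csv
import json

# Each row becomes an object keyed by the header row.
with open('mods.csv', newline='') as f:
    rows = list(csv.DictReader(f))

with open('mods.json', 'w') as f:
    json.dump(rows, f)  # load this on the page as `myarray`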
Hacky Solution
There is a browser tool called TableCapture for capturing data from HTML tables and loading it into Excel/spreadsheets - where you are basically deferring the searching to spreadsheet software.
You would have to see if:
This type of tool would solve your problem - maybe you can pull each HTML page's contents manually, merge the pages into a document with multiple "sheets", and then let people download the "spreadsheet" from your website.
If you do not take on that labor yourself and just tell other people to do it, you'd have to see if you can teach your users to perform the method on their own, e.g. "download this plugin, use it on these pages, search".
Why your question is difficult to answer
The reason it will be hard for people to answer this on stackoverflow.com (which usually deals in code solutions) is that, in my opinion, you need a more complicated solution than hard-coded tables and HTML/CSS/JavaScript.
This type of situation is exactly why people use databases and APIs to accept requests ("term": "something") for information and deliver responses ( "results": [...] ).
Thank you everyone for your great advice. I wasn't aware most of these potential solutions existed, and it was good to see how other people were tackling problems of similar scope.
I've decided to go with DataTables for their built-in sorting and filtering: https://datatables.net/
I'm also going to use a javascript array with an input field on the main page to allow users to search for which pack their mod is in. This will lead them to separate pages on my site, each with a unique datatable for a mod pack. Separate pages will load up much quicker than one gigantic page trying to show everything.

If I have a collection of random websites, how do I get specific information from each?

Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking about approaching the email, is just to search for all mailto a tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
Another issue that is more obvious is detecting the partner(s) names just from the markup. I was initially thinking I could just pull all the header tags and text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners listed at all, so first make sure you actually stand to gain from this; if not, you'd better look for some other idea.
I have done similar data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to email people, though you are allowed to email the company - so in a way it's the same problem from another angle.
I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but in my mind the only solution I can see is building a rule set that over time probably gets quite complex. Maybe you could apply some Bayesian filtering - it works very well for email.
But - to be a little more productive here. One thing I know is important: you could start by creating the crawler environment and building the dataset. Have a database for URLs so you can add more at any time, and start the crawling on what you already have, so that you do your testing by querying your own 100% local copy. This will save you enormous time compared with live scraping while tweaking.
I did my own search engine some years ago, scraping all .no domains, though I needed only the index file at the time. It took over a week alone just to scrape it down, and I think it was 8GB of data for that single file, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems needed taking care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down now if you want to work efficiently with the parsing later.
Good luck, and do post if you get a solution. I do not think it is possible without an algorithm or AI, though - people design websites the way they like and they pull templates out of their arse, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler. Then you could just crawl each site and make a profile for each. You could employ someone cheap to manually go through the parsed data and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale / available as a web service so it can be queried.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole page for names (there are free databases of first names and last names). This may also work if you are doing this for other European companies, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
It is easy to find an email in a page, as there is a fixed format of (username)@(domain name) with no space in between. Again, I would not treat the page as HTML tags but just as a normal string, so that the email can be found whether it is in a mailto tag or in plain text. Then, to determine what email it is:
Only one email in page?
Yes -> catch-all email.
No -> Is name found in that page as well?
No -> catch-all email (there can be more than one catch-all email, maybe for different purposes like info + employment)
Yes -> the email should be attached to the name found right before it; it is normal for the name to appear before the email.
Then it should be safe to assume that the name appearing first belongs to a more important member, e.g. a chairman or partner.
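A rough Python sketch of that decision logic, assuming you already have the page as plain text and a set of known names (both inputs are hypothetical):
import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def classify_emails(page_text, known_names):
    """Attach each email to the nearest preceding known name, else call it catch-all."""
    lowered = page_text.lower()
    results = []
    for m in EMAIL_RE.finditer(page_text):
        # last occurrence of each known name before this email
        positions = {n: lowered.rfind(n.lower(), 0, m.start()) for n in known_names}
        name = max(positions, key=positions.get) if positions else None
        if name is None or positions[name] < 0:
            results.append((m.group(), 'catch-all'))
        else:
            results.append((m.group(), name))
    return results

print(classify_emails('Partner John Smith - jsmith@firm.com', {'John Smith'}))
# [('jsmith@firm.com', 'John Smith')]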
I have done similar scraping on these types of pages, and it varies wildly from site to site. If you are trying to make one crawler that automatically finds the information, it will be difficult. However, at a high level it looks something like this.
For each site you check, look for element patterns. Divs will often have labels, IDs, and classes which will easily let you grab information. Perhaps you'll find that many divs share a particular class name. Check for this first.
It is often better to grab too much data from a particular page, and boil it down on your side afterwards. You could, perhaps, look for information which comes up on a screen by utilizing type (is link) or regex (is email) to look for formatted text. Names and occupation will be harder to find by this method, but might be related positionally on many pages to other well formatted items.
Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.). You could come up with a bank of those, and check any page you end up on against it.
Finally, if you really wanted to make this process general purpose, you could do some heuristics to improve your methods based off of expected information; names, for example, are most often within a particular list. If it was worth your time, you could check certain text for whether it matches a list of more common names.
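A small sketch of that honorific-bank idea (the title list and name pattern are illustrative only, not exhaustive):
import re

# Illustrative honorific bank; extend as needed.
HONORIFICS = r'(?:Mr|Mrs|Ms|Dr)\.?|CPA|JD|MD'
NAME_RE = re.compile(
    rf'(?:(?:{HONORIFICS})\s+)?([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)(?:,\s*(?:{HONORIFICS}))?'
)

def candidate_names(text):
    """Capitalized word runs, optionally flanked by an honorific."""
    return [m.group(1) for m in NAME_RE.finditer(text)]

print(candidate_names('Partners: Mr. John Smith and Jane Doe, CPA'))
# ['John Smith', 'Jane Doe']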
What you mentioned in your initial question suggests that you would benefit a lot from a general-purpose regular-expressions crawler, and you could improve it as you learn more about the sites you interact with.
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the applications reviewed there also work on macOS.

How can I summarize the updates to a table on a page I browse?

I am a student at a university. With the placement process going on, we have an internal placement website that shows updates and statuses for the various companies I have applied to. Since the number of companies is so large, it becomes cumbersome to scroll through the complete list to find information, and sometimes I just miss things. To tackle this problem, here is what I want to do:
The data is in an HTML table. Each row shows information about one company: some dates, a status (Not/Shortlisted/Applied), some yes/no options, etc., each in a different column. Once I open the page, I want to be able to extract information about which companies I got shortlisted for, and which ones I didn't make it into.
What is the right technology to do this? I am thinking of writing a Greasemonkey user script (I have never actually written one, but how hard could it be?). What other options do I have?
Edit: I don't quite understand why this question has been voted to be closed.
I just presented a use case for something general: on opening a web page, automatically extract information from the page and display it to the user. What is the easiest and sufficiently powerful way to achieve this?
Since you can't get access to the website's database, Greasemonkey would be your best automation approach. However, this task is likely to be over before you can get a decent script up from scratch.
Your best practical approach is to save the pages and/or copy and summarize the data in MS Excel, or equivalent.
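If you do save the pages, a short script can do the summarizing for you. A minimal sketch, assuming BeautifulSoup is available (pip install beautifulsoup4) and inventing the column layout, which you'd adjust to the real table:
from bs4 import BeautifulSoup

with open('placement_page.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

for row in soup.select('table tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    # hypothetical layout: company name in column 0, status in column 2
    if len(cells) > 2 and cells[2] == 'Shortlisted':
        print(cells[0])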
~~~~~~~~~
Here at SO, we will not develop any but the simplest Greasemonkey scripts for you from scratch (unless they are fun somehow ;) ). But you can sometimes get such help in the "Script requests forum" at userscripts.org.
In order for someone to help you, they will need:
A clear idea of exactly what data gets manipulated, and how.
Access to the target site. Or access to saved snapshots of the target pages. GM scripts are extremely dependent on the details of the target page.
"other option":
ctrl + F
enter shortlisted
enter
ctrl + G <--repeat last search

Detecting what changed in an HTML Textfield

For a major school project I am implementing a real-time collaborative editor. For a little background, basically what this means is that two (or more) users can type into a document at the same time, and their changes are automatically propagated to one another (similar to Etherpad).
Now my problem is as follows:
I want to be able to detect what changes a user made to an HTML textfield. They could:
Insert a character
Delete a character
Paste a string of characters
Cut a string of characters
I want to be able to detect which of these changes happened and then notify other clients similar to "insert character 'c' at position 2" etc.
Anyway, I was hoping to get some advice on how I would go about implementing the detection of these changes.
My first attempt was to compare the caret position before and after a change occurred, but this failed miserably.
For my second attempt I was thinking about doing a diff on the textfield's old and new values. Am I missing anything obvious with this solution? Is there something simpler?
It is really hard to make this work today, for several reasons, and maybe you will need to restrict support to some browsers. Read https://developer.mozilla.org/en/XUL/Attribute/oninput - the alternative to "oninput" is listening to all input events (keyboard, mouse, drag-drop), but I suggest using "oninput".
HTML is not perfect, even HTML5: inputs and textareas support only single-range selections. You can solve this by using designMode/contentEditable instead of textareas/textfields.
Detecting the offsets of what changed is hard work; read:
-- https://developer.mozilla.org/en/Document_Object_Model_%28DOM%29/window.getSelection
-- http://www.quirksmode.org/dom/range_intro.html
-- http://msdn.microsoft.com/en-us/library/ms535869%28v=VS.85%29.aspx
-- http://msdn.microsoft.com/en-us/library/ms535872%28v=VS.85%29.aspx
You may need a "diff" algorithm written in JavaScript: http://ejohn.org/projects/javascript-diff-algorithm/
One personal note: detecting word- or character-level changes may be total nonsense and not useful; detect paragraph changes instead, or in the case of an Excel-like worksheet, the single cell.
I hope this helps
feel free to correct my English!
My pseudocode/written out response would be (if I understand your question exactly) to use jQuery to detect keyup events and then save the input to the server via ajax, then also take the response and post it back to the input. This isn't very efficient, but basically the idea is that you're constantly posting and checking what else has been posted. If you want to see what someone else is doing in real time, you can ping the server every second or so and update with the response.
All of this of course can be optimized, but it still is kind of taxing for a server. You could also see if you can implement Google Topeka Wave for your project, or get in touch with Google Topeka to see how they do it :)

Detecting a (naughty or nice) URL or link in a text string

How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.
Some objectives:
The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
Any other funny business
Of course, I am blocking spam, but the same process could be used to auto-link text.
Ideas
Here are some things I'm thinking.
The content is native-language prose so I can be trigger-happy in detection
Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
Maybe multiple passes is a better strategy, with scans for:
Well-formed URLs
All non-whitespace followed by '.' followed by any valid TLD
Anything else?
Related Questions
I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.
replace URL with HTML Links javascript
What is the best regular expression to check if a string is a valid URL
Getting parts of a URL (Regex)
Update and Summary
Wow, there are some very good heuristics listed here! For me, the best bang-for-the-buck is a synthesis of the following:
@Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
For those suspicious strings, replace the dot with a dot-looking character as per @capar
A good dot-looking character is @Sharkey's subscripted · (i.e. "·"). · is also a word boundary, so it's harder to casually copy & paste.
That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:
Strip out all dotted-quads (@Sharkey's comment to his own answer)
@Sporkmonger's requirement for client-side JavaScript which inserts a required hidden field into the form.
Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.)
Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
Calling out to OWASP AntiSamy or other web services for spam/malware detection.
I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will be actively trying to contravene your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal were something else.
I think your best bet is going to be the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to catch "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:
[^\b]\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)[\b/]
Things this will get:
buyfunkypharmaceuticals.it
google.com
http://stackoverflow.com/questions/700163/
It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
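One porting caveat: inside a character class, \b means backspace rather than a word boundary in most regex flavors, so the boundaries are better expressed outside the class. A hedged Python rendering of the same idea, with the TLD list abbreviated:
import re

# Dot + known TLD, followed by a slash or a word boundary.
TLD_RE = re.compile(r'\.(?:[a-zA-Z]{2}|com|net|org|edu|gov|info|biz)(?:/|\b)')

for text in ('buyfunkypharmaceuticals.it',
             'google.com',
             "I tried again. it doesn't work"):
    print(bool(TLD_RE.search(text)), text)
# True, True, False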
I'm not sure if detecting URLs with a regex is the right way to solve this problem. Usually you will miss some sort of obscure edge case that spammers will be able to exploit if they are motivated enough.
If your goal is just to filter spam out of comments, then you might want to think about Bayesian filtering. It has proved to be very accurate in flagging email as spam, and it might be able to do the same for you, depending on the volume of text you need to filter.
I know this doesn't help with auto-linking text, but what if you searched for and replaced all full-stop periods with a character that looks the same, such as the Unicode character for Hebrew point hiriq (U+05B4)?
The following paragraph is an example:
This might workִ The period looks a bit odd but it is still readableִ The benefit of course is that anyone copying and pasting wwwִgoogleִcom won't get too farִ :)
Well, obviously the low hanging fruit are things that start with http:// and www. Trying to filter out things like "www . g mail . com" leads to interesting philosophical questions about how far you want to go. Do you want to take it the next step and filter out "www dot gee mail dot com" also? How about abstract descriptions of a URL, like "The abbreviation for world wide web followed by a dot, followed by the letter g, followed by the word mail followed by a dot, concluded with the TLD abbreviation for commercial".
It's important to draw the line of what sorts of things you're going to try to filter before you continue with trying to design your algorithm. I think that the line should be drawn at the level where "gmail.com" is considered a url, but "gmail. com" is not. Otherwise, you're likely to get false positives every time someone fails to capitalize the first letter in a sentence.
Since you are primarily looking for invitations to copy and paste into a browser address bar, it might be worth taking a look at the code used in open source browsers (such as Chrome or Mozilla) to decide if the text entered into the "address bar equivalent" is a search query or a URL navigation attempt.
Ping the possible URL
If you don't mind a little server side computation, what about something like this?
urls = []
for possible_url in extracted_urls(comment):
    if pingable(possible_url):
        urls.append(possible_url)  # you could do this as a list comprehension, but OP may not know Python
Here:
extracted_urls takes in a comment and uses a conservative regex to pull out possible candidates
pingable actually uses a system call to determine whether the hostname exists on the web. You could have a simple wrapper parse the output of ping.
[ramanujan:~/base]$ping -c 1 www.google.com
PING www.l.google.com (74.125.19.147): 56 data bytes
64 bytes from 74.125.19.147: icmp_seq=0 ttl=246 time=18.317 ms
--- www.l.google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 18.317/18.317/18.317/0.000 ms
[ramanujan:~/base]$ping -c 1 fooalksdflajkd.com
ping: cannot resolve fooalksdflajkd.com: Unknown host
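A minimal pingable along those lines, assuming a Unix-style ping where exit status 0 means the host resolved and answered:
import subprocess

def pingable(hostname):
    """One ping, output suppressed; exit status 0 means the host answered."""
    result = subprocess.run(
        ['ping', '-c', '1', hostname],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

print(pingable('www.google.com'))      # True, network permitting
print(pingable('fooalksdflajkd.com'))  # False: unknown host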
The downside is that if the host gives a 404, you won't detect it, but this is a pretty good first cut -- the ultimate way to verify that an address is a website is to try to navigate to it. You could also try wget'ing that URL, but that's more heavyweight.
Having made several attempts at writing this exact piece of code, I can say unequivocally, you won't be able to do this with absolute reliability, and you certainly won't be able to detect all of the URI forms allowed by the RFC. Fortunately, since you have a very limited set of URLs you're interested in, you can use any of the techniques above.
However, the other thing I can say with a great deal of certainty, is that if you really want to beat spammers, the best way to do that is to use JavaScript. Send a chunk of JavaScript that performs some calculation, and repeat the calculation on the server side. The JavaScript should copy the result of the calculation to a hidden field so that when the comment is submitted, the result of the calculation is submitted as well. Verify on the server side that the calculation is correct. The only way around this technique is for spammers to manually enter comments or for them to start running a JavaScript engine just for you. I used this technique to reduce the spam on my site from 100+/day to one or two per year. Now the only spam I ever get is entered by humans manually. It's weird to get on-topic spam.
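A toy sketch of that idea, with the arithmetic and the hidden-field name invented purely for illustration:
import secrets

def make_challenge():
    """Server side: pick two numbers; the emitted JS fills a hidden 'proof' field."""
    a, b = secrets.randbelow(1000), secrets.randbelow(1000)
    js = f"document.getElementById('proof').value = {a} * {b};"
    return (a, b), js  # keep (a, b) in the session for verification

def verify(challenge, submitted_proof):
    """On submit: accept the comment only if the browser actually ran the JS."""
    a, b = challenge
    return submitted_proof == str(a * b)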
Of course, you realize that if spammers decide to use tinyurl or similar services to shorten their URLs, your problem just got worse. You might have to write some code to look up the actual URLs in that case, using a service like TinyURL decoder.
Consider incorporating the OWASP AntiSamy API...
I like capar's answer the best so far, but dealing with unicode fonts can be a bit fraught, with older browsers often displaying a funny thing or a little box ... and the location of the U+05B4 is a bit odd ... for me, it appears outside the pipes here |ִ| even though it's between them.
There's a handy middle dot (·), though, which breaks cut-and-paste in the same way. Its vertical alignment can be corrected by <sub>ing it, e.g.:
stackoverflow·com
Perverse, but effective in FF3 at least: it can't be cut-and-pasted as a URL. The <sub> is actually quite nice, as it makes it visually obvious why the URL can't be pasted.
Dots which aren't in suspected URLs can be left alone, so for example you could do
s/\b\.\b/<sub>·<\/sub>/g
Another option is to insert some kind of zero-width entity next to suspect dots, but things like &zwj; and &zwnj; and &zwsp; don't seem to work in FF3.
There are already some great answers in here, so I won't post more. I will give a couple of gotchas, though. First, make sure to test for known protocols; anything else may be naughty. As someone whose hobby concerns telnet links, you will probably want to include more than http(s) in your search, but may want to block, say, aim: or some other URLs. Second, many people will delimit their links in angle brackets (gt/lt), like <http://theroughnecks.net>, or in parens "(url)", and there's nothing worse than clicking a link and having the closing > or ) go along with the rest of the URL.
P.S. sorry for the self-referencing plugs ;)
I needed just the detection of simple http URLs with/without protocol, assuming that either the protocol is given or there is a "www" prefix. I found the above-mentioned links quite helpful, but in the end I came out with this:
http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+
This does not, obviously, test compliance with the DNS standard.
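A quick check of the pattern in Python:
import re

URL_RE = re.compile(r'http(s?)://(\S+\.)+\S+|www\d?\.(\S+\.)+\S+')

for s in ('https://example.com/page',
          'www.example.com',
          'no link in this sentence'):
    print(bool(URL_RE.search(s)), s)
# True, True, False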
Given the messes of "other funny business" that I see in Disqus comment spam in the form of look-alike characters, the first thing you'll want to do is deal with that.
Luckily, the Unicode people have you covered. Dig up an implementation of the TR39 Skeleton Algorithm for Unicode Confusables in your programming language of choice and pair it with some Unicode normalization and Unicode-aware upper/lower-casing.
The skeleton algorithm uses a lookup table maintained by the Unicode people to do something conceptually similar to case-folding.
(The output may not use sensible characters, but, if you apply it to both sides of the comparison, you'll get a match if the characters are visually similar enough for a human to get the intent.)
Here's an example from this Java implementation:
// Skeleton representations of unicode strings containing
// confusable characters are equal
skeleton("paypal").equals(skeleton("paypal")); // true
skeleton("paypal").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("paypal").equals(skeleton("ρ⍺у𝓅𝒂ן")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
skeleton("ρ⍺у𝓅𝒂ן").equals(skeleton("𝔭𝒶ỿ𝕡𝕒ℓ")); // true
// The skeleton representation does not transform case
skeleton("payPal").equals(skeleton("paypal")); // false
// The skeleton representation does not remove diacritics
skeleton("paypal").equals(skeleton("pàỳpąl")); // false
(As you can see, you'll want to do some other normalization first.)
Given that you're doing URL detection for the purpose of judging whether something's spam, this is probably one of those uncommon situations where it'd be safe to start by normalizing the Unicode to NFKD and then stripping codepoints declared to be combining characters.
(You'd then want to normalize the case before feeding them to the skeleton algorithm.)
I'd advise that you do one of the following:
Write your code to run a confusables check both before and after the characters get decomposed, in case things are considered confusables before being decomposed but not after, and check both uppercased and lowercased strings in case the confusables tables aren't symmetrical between the upper and lowercase forms.
Investigate whether #1 is actually a concern (no need to waste CPU time if it isn't) by writing a little script to inspect the Unicode tables and identify any codepoints where decomposing or lowercasing/uppercasing a pair of characters changes whether they're considered confusable with each other.
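The normalization step described above is straightforward with Python's unicodedata; the skeleton step itself would still come from a TR39 implementation:
import unicodedata

def normalize_for_skeleton(s):
    """NFKD-decompose, strip combining marks, then fold case."""
    decomposed = unicodedata.normalize('NFKD', s)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

print(normalize_for_skeleton('pàỳpąl'))  # paypal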