Related
Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking about approaching the email, is just to search for all mailto a tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
Another issue that is more obvious is detecting the partner(s) names just from the markup. I was initially thinking I could just pull all the header tags and text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I cant find any partners at all so in my opinion you better stand to gain from this if not you better look for some other invention.
I have done similar datascraping from time to time, and in norway we have laws - or should I say "laws" - that you are not allowed to email people however you are allowed to email the company - so in a way the same problem from another angle.
I wish I knew maths and algorythms by heart because I am sure there is a fascinating sollution hidden in AI and machine learning, but in my mind the only sollution I can see is building a rule set that over time probably gets quite complex. Maby you could apply some bayesian filtering - it works very well for email.
But - to be a little more productive here. One thing i know is inmportant, you could start by creating the crawler environment and building the dataset. Have the database for URLS so you can add more at any time, and start the crawling on what you have already so that you do your testing querying your own data with a 100% copy. This will save you enormous time instead of live scraping while tweaking.
I did my own search engine some years ago, scraping all NO domains however I needed only the index file that time. Took over a week alone just to scrape it down and I think it was 8GB of data just for that single file, and I had to use several proxyservers aswell to make it work due to problems with to much DNS traffik. Lots of problems that needed being taken care of. I guess I am only saying - if you are crawling a large scale you might aswell start getting the data down if you want to work efficient with the parsing later.
Good luck, and do post if you get a sollution. I do not think it is posible without an algorythm or AI though - people design websites the way they like and they pull templates out of their arse so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so its simpler. Then you could just crawl each site, and make a profile for each site. You could employ someone cheap to manual go through the parsed data and remove all the errors. This is probably how most people does it, unless someone already have done it and the database is for sale / available from webservice so it can be scraped.
The links you provide are mainly US site, so I guess you are focusing on English names. In that case, instead of parsing from html tags, I would just search the whole webpage for name. (There are free database of first name and last name) This may also work if you are donig this for some other Europe company, but it would be a problem for company from some countries. Take Chinese as an example, while there is a fix set of last name, one may use basically any combination of Chinese character as first name, so this solution won't work for Chinese site.
It is easy to find email from a webpage as there is a fixed format of (username)#(domain name) with no space in between. Again I won't treat it as html tags but just as normal string so that the email can be found no matter it is in mailto tag or in plain text. Then, to determine what email is it:
Only one email in page?
Yes -> catch-all email.
No -> Is name found in that page as well?
No -> catch-all email (can have more than one catch-all email, maybe for different purpose like info + employment)
Yes -> Email should be attached to the name found right before it. It is normal that the name should appear before the email.
Then, it should be safe to assume the name appear first belongs to more important member, e.g. Chairman or partner.
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler to sort of auto find the information, it will be difficult. However, the high level looks something like this.
For each site you check, look for element patterns. Divs will often have labels, ID's, and classes which will easily let you grab information. Perhaps you find that many divs will have a particular class name. Check for this first.
It is often better to grab too much data from a particular page, and boil it down on your side afterwards. You could, perhaps, look for information which comes up on a screen by utilizing type (is link) or regex (is email) to look for formatted text. Names and occupation will be harder to find by this method, but might be related positionally on many pages to other well formatted items.
Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.) You could come up with a bank of those, and check against them for any page you end up on.
Finally, if you really wanted to make this process general purpose, you could do some heuristics to improve your methods based off of expected information; names, for example, are most often within a particular list. If it was worth your time, you could check certain text for whether it matches a list of more common names.
What you mentioned in your initial question seems that you would have a lot of benefit with a general purpose Regular Expressions crawler, and you could make improvements on it as you know more about the sites which you interact with.
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the examined applications are working in macOS.
We found some code being inserted into emails sent by our proprietary email system and have no idea of its provenance.
My company sends a lot of bulk email for clients. (We follow all the best practice protocols to ensure we're not spammers.) The system is proprietary, based on open source code. Customers have a GUI to enter content, similar to the big guys like MailChimp and the like.
A staff member brought a UI challenge with the GUI to me, using a client's bulk email as an example. I dug into the source to see if they had some exotic CSS that might be affecting my interface, when I noticed the following tag:
<custom name="opencounter" type="tracking"> </custom>
My interface certainly doesn't insert that code into an email.
What is opencounter? Who's technology is it? Does it have a valid reason for being used on our (proprietary) email system?
It appears that "opencounter" is a proprietary counting mechanism used by Exact Target's bulk mailing system. Apparently, the client was copy/pasting from an old campaign done on ExactTarget to move the design to our system. It is therefore safe for me to remove.
My best guess is that it is something that is auto-substituted to put in some tracking information into the individual e-mails. I'd suggest doing some tests on "bulk" e-mails you've set up just to yourself. Put some known content immediately either side of it and then send yourself this e-mail and view the source to see if its been substituted with anything. e.g.
XXX<custom name="opencounter" type="tracking"> </custom>YYY
If the final output has XXXYYY or something then you'll know its a tracker in the bulk e-mailer. If it outputs as is you can probably safely assume you can get rid of it. If it gets rid of it completely then it may be used for some kind of processing on the server but I'm not sure what that might be...
The other thing you can do is to do a search of your entire codebase for "opencounter" to see if there are any references to it.
One final thought: Does your customer interface allow them to put in HTML directly or is it just a gui? It occurs to me that if they used a previous bulk e-mailer then it might be something specific to that one that got copied over if its not in yours.
We had a similar situation and traced it back to a user who was utilizing a third-party WYSIWYG tool to develop code that they then pasted into our CMS. It's a harmless issue, but it points to the need to improve our tool so that others don't feel like they have to use another editor.
We have a service where we literally give away free money.
Naturally said service is ripe for abuse. To defend against this we do the following:
log ip address
use unique email addresses (only 1 acct/email addy)
collect more info like st. address, phone number, etc.
use signup captcha
BHOs (I've seen poker rooms use these)
Now, let's get real here -- NONE of this will stop a determined user.
Obviously ip addresses can be changed via a proxy (which could be blacklisted via akismet) but change anyways if the user has a dynamic ip or if more than one user is behind a NAT'd network (can we say almost everyone?)
I can sign up for thousands of unique email addresses each hour -- this is no defense.
I can put in fake information taken from lists for street addresses and phone numbers.
I can buy captchas from captcha solving services (1k for $5).
bhos seem only effective for downloadable software -- this is a website
What are some other ways to prevent multiple users from abusing the service? How do all the PPC people control click fraud?
I know we could actually call the person but I don't think we are trying to do that anytime soon.
Thanks,
It's pretty difficult to generate lots of fake phone numbers that can send and receive SMS messages. SMS verification could go a long way towards cutting down on fraud. Of course, it also limits you to giving away free money to cell phone owners.
I think only way is to bind your users accounts to 'real world' information, like his/her passport number, for instance. Of course, you'll need to make sure that information is securely stored and to find some way to validate it.
Re: signing up for new email accounts...
A user doesn't even need to do that. Please feel free to send your mail to brian_s#mailinator.com, or feydr.asks.a.question#spamherelots.com, or stackoverflow#safetymail.info, or my_arbitrary_username#zippymail.info. I haven't registered any of those email addresses, but all of them will work.
Those domains are owned by ManyBrain, and they (and probably others as well) set the domain to accept any email user. ManyBrain in particular then makes the inboxes for those emails publicly accessible without any registration (stripping everything by text from the email and deleting old mail). Check it out: admin#mailinator.com's email inbox!
Others have mentioned ways to try and keep user identities unique. This is just one more reason to not trust email addresses.
First, I suppose (hope) that you don't literally give away free money but rather give it to use your service or something like that.
That matters as there is a big difference between users trying to just get free money from you they can spend on buying expensive cars vs only spending on your service which would be much more limited.
Obviously many more user will try to fool the system in the former than in the latter case.
Why it matters? Because it is all about the balance between your control vs your user annoyance. I see many answers concentrating on the control part, so let's go through annoyance, shall we?
Log IP address. What if I am the next guy on the computer in say internet shop and the guy before me already used that IP? The other guy left your hot page that I now see but I am screwed because the IP is blocked. Yes, I can go to another computer but it is annoyance and I may have other things to do.
Collecting physical Adresses. For what??? Are you going to visit me? Or start sending me spam letters? Let me guess, more often than not you get addresses with misprints at best and fake ones at worst. In fact, it is much less hassle for me to give you fake address and not dealing with whatever possible spam letters I'll have to recycle in environment-friendly way. :)
Collecting phone numbers. Again, why shall I trust your site? This is the real story. I gave my phone nr to obscure site, then later I started receiving occasional messages full of nonsense like "hit the fly". That I simply deleted. Only later and by accident to discover that I was actually charged 2 euros to receive each of those messages!!! Do I want to get those hassles? Obviously not! So no, buddy, sorry to disappoint but I will not give your site my phone number unless your company is called Facebook or Google. :)
Use signup captcha. I love that :). So what are we trying to achieve here? Will the user who is determined to abuse your service, have problems to type in a couple of captchas? I doubt it. But what about the "good user"? Are you aware how annoying captchas are for many users??? What about users with impaired vision? But even without it, most captchas are so bad that they make you feel like you have impaired vision! The best advice I can give - if you care about user experience, avoid captchas as plague! If you have any doubts, do your online research first!
See here more discussion about control vs annoyance and here some more thoughts about being user-friendly.
You have to bind their information to something that is 'real world', as Rubens says. Of course, you also need to be able to verify this information (I can just make up passport numbers all day if you don't check to make sure they're correct).
How do you deliver the money? Perhaps you can index this off the paypal account, mailing address, or whatever you're sending the money to?
Sometimes the only way to prevent people abusing a system is to not have the system in the first place.
If you're doing what you say you're doing, "giving away money to people", then surprise surprise, there will be tons of people with more time available to try to find ways to game the system than you will have to fix it.
I guess it will never be possible to have an identification system which identifies fake identities that is:
cheap to run (I think it's called "operational cost"?)
cheap to implement (ideally one time cost - how do you call that?)
has no Type-I/Type-II errors
is scalable
But I think you could prevent users from having too many (to say a quite random number: more than 50) accounts.
You might combine the following approaches:
IP address: can be bypassed with VPN
CAPTCHA: can be bypassed with human farms (see this article, for example - although they claim that their test can't be that easily passed to other humans, I doubt this is true)
Ability-based identification: can be faked when you know what is stored and how exactly the identification works by randomly (but with a given distribution) acting (example: brainauth.com)
Real-world interaction: Although this might be the best one, but I guess it is expensive and not many users will accept it. Also, for some users/countries it might not be possible. (example: Postident in Germany, where the Post wants to see your identity card. I guess this can only be faced in massive scale by the government.)
Other sites/resources: This basically transforms the problem for other sites. You can use services, where it is not allowed/uncommon/expensive to have much more than 1 account
Email
Phone number: e.g. by using SMS, see Multi-factor authentication
Bank account: PayPal; transfer not much money or ask them to transfer a random (small) amount to you (which you will send back).
Social based
When you take the social graph (vertices are people, edges are connections), you will expect some distribution. You know that you are a single human and you know some other people. So you have a "network of trust" (in quotes, because I think this might be used in other context as well). Now you might not trust people / networks how interact heavily with your service, but are either isolated (no connection) or who connect a large group with another large group ("articulation points"). You also might not trust fast growing, heavily interacting new, isolated graphs.
When a user provides content that is liked by many other users (who you trust), this might be an indicator that there is a real human creating it.
We had a similar issue recently on our website, it is really a hassle to solve this issue if you are providing a business over one time or monthly recurring free credits system.
We are using a fraud detection solution https://fraudradar.io for a while and that helped us a lot to clean out most of the spam activities. It is pretty customizable with:
IP checks
Email domain validity
Regex rules
Whitelisting options per IP, email domain etc.
Simple API to communicate through
I would suggest to check that out.
I've gotten into a habit of using the standard register->send activation email->activate account process for every site that supports user authentication and free registration without questioning if I really need this.
What are your thoughts on this? If I have captcha on the registration form is the email confirmation process really necessary?
EDIT:
OK, so the general consensus seems to be that by getting the users to confirm the email they entered I'll keep them away from putting someone else's email in there.
What about when I let users edit their profile/settings and they enter another email?
If I need to keep them away from entering other people's addresses then I'd need to confirm that email address (by temporarily deactivating their accoun)t every time they change it.
Captcha+activation prevents bots AND spoofed people
Well basically it is since each part prevents one problematic scenario:
Captcha prevents (if you use strong captcha like reCaptcha) bots from registering new users
Email activation prevents people from registering other people (by their email address)
I guess this is a valid everyday pattern for registration that's widely acknowledged by IT community.
EDIT
Yes. When you want to prevent users from changing their email address, you'd have to repeat email activation procedure to make it robust.
But you don't have to deactivate their account while doing it. All you have to do is having a pending email-change email activation active. If it gets activated, you change email address at that point (not when they change it), otherwise the old one is still used.
If you don't confirm an e-mail, you're supposing that the user registering that service owns that email account. How can you start sending a lot of system e-mails, reset passwords and etc to a person that has nothing to do with your system? I would be really pissed of if it was my e-mail.
Another scenario: what if the register mispelled his e-mail when registering? Suppose he doesn't check his "account settins" in your application, doesn't change his email, and needs to reset his password. If the e-mail is registered in a wrong way, it's your fault for not checking it before.
Of course, I'm just saying this to services that would REALLY demand an account to be created. Avoid the login barrier when possible, or use openid when your service isn't so critical.
You should give serious consideration to supporting OpenID. http://openid.net/get-an-openid/what-is-openid/
The key benefit for OpenID is that it reduces the complexity for your user. There is no reason to force people to remember login credentials for hundreds of sites when a viable alternative exists. There is no worldwide netizen database - and there likely never will be - but OpenID simplifies the situation greatly.
I know that as a user I found the registration process for Stack Overflow to be painless and easy. I wish more sites used OpenID.
It's the lowest-level attempt at identity validation. It encourages users to re-use the same account when they return (by having a common, shared identifier you and they can use to reconnect), and it prevents impersonation, because it requires access to the claimed identity as proof.
It's not perfect, but something by definition works infinitely better than nothing.
If identity doesn't matter on your site (e.g. your service is throwaway after each use) then you don't need email activation. Otherwise, you probably want it.
On my site, I let users sign-up and do everything non-public until they confirm their email address. Because I run a gaming website, it means users can earn medals, post scores, just not post in the forum or post comments in the blog until they verify their email address.
I find it works pretty well. I have 16,000 registered users.
I find it both unnecessary and annoying. If I can, I avoid doing this.
However, I do do this if 1) email will be sent by the program, so I can test if the email address is valid, or 2) this is a very large, public-facing website, in which case I want to filter out as many potential problems as possible.
For most basic sites, I don't bother with either. Both email activation and captcha are relatively easy for dedicated spammers to bypass and overcome and do little but cause an annoyance to most of the users, driving away at least a certain percentage who might have otherwise signed up. I've found in my experience, focusing more on spam filters for member posted content has a better ROI overall.
For sites with more serious content, you'll typically have more serious users. In cases like that, I'll throw everything I've reasonably got available at it to counter the spam.
I find it useful when an email is sent for confirmation. This makes sure that I am the one who has registered with that email address.
Even with captcha you can register someone else email address although he may or may not approve that confirmation.
You only seem to need e-mail confirmation to confirm identity, not to send useful content by e-mail. But e-mail confirmation is only one means to an end. You may consider others, preferablly less intrusive ones.
Generally you can check something that
you are (e.g. fingerprint, iris scan)
you have (e.g. token, creditcard, key, access to an e-mail account)
you know (e.g. PIN, password, your mom's weight, name of your favorite deceased pet, the optimistic length of your most private bodyparts measured in inches)
Also, you can delegate the check to others; the creditcard company, phone company, someone's friends.
Example: GoogleMail could not ask
for a confirmation e-mail address
upon creation of your GMail account.
Instead, the early adopters had a limited supply of
"invites" to share with friends.
So - unless you actually need me to receive information you'd e-mail, which I generally hate anyway - you might be inclined to resort to more creative/fun means.
My organization has a form to allow users to update their email address with us.
It's suggested that we have two input boxes for email: the second as an email confirmation.
I always copy/paste my email address when faced with the confirmation.
I'm assuming most of our users are not so savvy.
Regardless, is this considered a good practice?
I can't stand it personally, but I also realize it probably isn't meant for me.
If someone screws up their email, they can't login, and they must call to sort things out.
I've seen plenty of people type their email address wrong and I've also looked through user databases full of invalid email address.
The way I see it you've got two options. Use a second box to confirm the input, or send an authentication/activation email.
Both are annoyances so you get to choose which you think will annoy your users less.
Most would argue that having to find an email and click on a link is more annoying, but it avoids the copy/paste a bad address issue, and it allows you to do things like delete or roll back users if they don't activate after say 48 hours.
I would just use one input box. The "Confirm" input is a remnant form the "Confirm Password" method.
With passwords, this is useful because they are usually typed as little circles. So, you can't just look at it to make sure that you typed it correctly.
With a regular text box, you can visually check your input. So, there is no need for a confirmation input box.
I agree with you in that it is quite an annoyance to me (I also copy and paste my address into the second input).
That being said, for less savvy users, it is probably a good idea. Watching my mother type is affirmation that many users do not look at the screen when they type (when she's using her laptop she resembles Linus from Peanuts when he's playing the piano). If it's important for you to have the user's correct email address then I would say having a confirmation input is a very good idea (one of these days I'll probably type my email address wrong in the first box and paste it wrong into the second box and then feel like a complete idiot).
While the more tech-savvy people tend to copy and paste, not technical people find it just as annoying to have to type something twice. During a lot of user testing I've down, the less tech-savvy - the more annoyed they seem with something like this... They struggle to type as it is, when they see they have to type their email in again it's usually greeted with a strong sign.
I would suggested a few things.
Next to the input box write the style of the information you are looking for so something like (i.e. user#domain.com). The reason this is important is you would be surprised how many of the less tech-savvy don't really understand the different between a website and an email address, so let them know visually the format you want.
Run strong formatting test in real time, and visually show a user that the format is good or bad. A green check box if everything is okay comes to mind.
Lastly, depending on your system architecture I often use a library to actually wrong a domain in the background. I don't necessarily try to run a VRFY on the server - I often use a library to check to make sure the domain they entered has MX records in it's DNS record.
I agree with Justin, while most technical folks will use the copy, paste method, for the less savvy users it is a good practice.
One more thing that I would add is that the second field should have the auto-complete feature disabled. This ensures that there is human input from either method on at least one of the fields.
Typing things twice is frustrating and doesn't prevent copy&paste errors or even some typos.
I would use an authenticate/activate schema with a roll back to the old address if the activation is not met within 48 hours or if the email bounces.
As long as a field is viewable, you do not need a confirm box. As long as you do some form validation to be sure that it is at least in valid format for an email address let the user manage the rest of the issues.
I'd say that this is ok but should only be reserved for forms where the email is essential. If you mistype your email for your flight booking then you have severed the two-way link between yourself and the other party and risk not getting the confirmation number, here on StackOverflow it would only mean your Gravatar would not be loaded ...
I'd consider myself fairly techie but I always fill in both fields /wo cut-paste if I regard it to be important enough.
I tend to have it send a verification code to the email address specified (and only ask for it once), and not change the email address until the user has entered the code I sent them.
This has the advantage that if they try to set it to a dozen different addresses in quick succession, you'll know which ones work by which verification code they put in.
Plus, if I am presented with a "confirm email address" box, I just copy and paste from the previous one, and if I'm guilty of that, I'm sure that other less careful users will do the same.