Most sites that use an auto-incrementing primary key display it openly in the URL, e.g.
example.org/?id=5
This makes it very easy for anyone to spider the site and collect all its information by simply incrementing the value of id. I can understand that in some cases this is a bad thing, if permissions/authentication are not set up correctly and anyone could view anything by simply guessing the id, but is it ever a good thing?
example.org/?id=e4da3b7fbbce2345d7772b0674a318d5
Is there ever a situation where hashing the id to prevent crawling is bad practice (besides the time it takes to set up this functionality)? Or is this all a moot point because, by putting something on the web, you accept the risk of it being stolen/mined?
Generally with websites you're trying to make them easy to crawl and to give access to all the information, so that you get good search rankings and drive traffic to your site. Good web developers design their HTML with search engines in mind, and often also provide things like RSS feeds and site maps to make content easier to crawl. So if you're trying to make crawling more difficult by not using sequential identifiers, then (a) you aren't making it more difficult, because crawlers work by following links, not by guessing URLs, and (b) you're trying to make something more difficult that you also spend time trying to make easier, which makes no sense.
If you need security then use actual security. Use checks of the principal to authorize or deny access to resources. Obfuscating URLs is no security at all.
So I don't see any problem with using numeric identifiers, or any value in trying to obfuscate them.
Using a hash like MD5 or SHA on the ID is not a good idea:
There is always the possibility of collisions: two different IDs could hash to the same value.
How are you going to unhash it back to the actual ID?
A better approach, if you're set on avoiding incrementing IDs, would be to use a GUID, or just a random value generated when you create the ID.
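For illustration, a minimal Python sketch of that approach (the URL shape is just an example):

```python
import uuid

# Generate an unguessable public identifier instead of exposing the row number.
public_id = uuid.uuid4()
print(f"https://example.org/orders?id={public_id}")
```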
That said, if your application security relies on people not guessing an ID, that shows some flaws elsewhere in the system. My advice: stick to the plain and easy auto-incrementing ID and apply some proper access control.
I think hashing publicly accessible IDs is not a bad thing, and showing sequential IDs will in some cases be a bad thing. Even better, use GUIDs/UUIDs for all your IDs. In a lot of technologies you can even use sequential GUIDs, which are faster at the insert stage (though not as good in a distributed environment).
Hashing or randomizing identifiers or other URL components can be a good practice when you don't want your URLs to be traversable. This is not security, but it will discourage the use (or abuse) of your server resources by crawlers, and can help you to identify when it does happen.
In general, you don't want to expose application state, such as which IDs will be allocated in the future, since it may allow an attacker to use a prediction in ways that you didn't foresee. For example, BIND's sequential transaction IDs were a security flaw.
If you do want to encourage crawling or other traversal, a more rigorous way would be to provide links, rather than by providing an implementation detail which may change in the future.
Using sequential integers as IDs can make many things cheaper on your end, and might be a reasonable tradeoff to make.
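One way to get that tradeoff (a sketch in Python, with a hypothetical `items` table) is to keep the cheap sequential ID internal and expose only a random token in URLs:

```python
import secrets
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,  -- cheap sequential key, internal only
    public_token TEXT UNIQUE NOT NULL,     -- unguessable value used in URLs
    title TEXT)""")

def create_item(title):
    token = secrets.token_urlsafe(16)
    db.execute("INSERT INTO items (public_token, title) VALUES (?, ?)", (token, title))
    return token

print(f"https://example.org/items/{create_item('annual report')}")
```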
My opinion is that if something is on the web, and is served without requiring authorization, it was put with the intention that it should be publicly accessible. Actively trying to make it more difficult to access seems counter-intuitive.
Often, spidering a site is a Good Thing. If you want your information available as much as possible, you want sites like Google to gather data on your site, so that others can find it.
If you don't want people to read through your site, use authentication, and deny access to people who don't have access.
Random-looking URLs only give the impression of security, without the reality. If you put account information (hidden) in a URL and a web spider finds it, everyone will have access to that account.
My general rule is to use a GUID if I'm showing something that has to be displayed in a URL and also requires credentials to access, or is unique to a particular user (like an order ID): http://site.com/orders?id=e4da3b7fbbce2345d7772b0674a318d5
That way another user won't be able to "peek" at the next order by hacking the URL. They may already be denied access to someone else's order, but throwing a zillion letters and numbers at them is a pretty clear way to say "don't mess with this".
If I'm showing something that's public and not tied to a particular user, then I may use the integer key. For example, for displaying pictures, you might wish to allow your users to hack the URL to see the next picture:
http://example.org/pictures?id=4, http://example.org/pictures?id=5, etc.
(I actually wouldn't do either as a simple GET parameter; I'd use mod_rewrite (or something) to make readable URLs. Something like http://example.org/pictures/4 -> /pictures.php?picture_id=4, etc.)
Hashing an integer is a poor implementation of security by obscurity, so if that's the goal, a true GUID or even a "sequential" GUID (whether via NEWSEQUENTIALID() or COMB algorithm) is much better.
Either way, no one types URLs anymore, so I don't see much sense in worrying about the difference in length.
I am writing a REST API. However, one of the requirements is to allow the caller to determine if an action may be performed (so that, for example, a button can be enabled or disabled, etc.)
The action might not be allowed for several reasons: perhaps user permissions, but possibly because, for example, you can't delete a shared object, or you can't create an item with the same name as another item, or any number of other business rules.
All the logic to determine if something can be deleted should be determined in the back end, but the front end must show this in the GUI.
I am trying to find the right pattern for this in REST, and am coming up a bit short. I could create a parallel API, so that for every entity endpoint there is an EntityPermissions endpoint, but that seems like overkill. I could also do something like add an HTTP header indicating that the request is only to check permissions, not perform the action, but that seems a bit dubious and likely to mess up the HTTP cache.
Can anyone point me to a common pattern for doing something like this? Does it have a name? Or a web page that discusses it? I'm sure everyone has their own ideas on this (like my dumb ideas), but this seems like a common enough requirement that I figure there must be an established pattern for it. Google didn't help much, though.
There are going to be multiple opinionated answers to this. I'll share mine; it might not be the best fit for your problem, but it's a valid solution.
If you followed the real definition of REST, you would be building a hypermedia/HATEOAS-style web service. URLs would not be hardcoded; they would be discovered, and actions would be discovered through the existence of links.
If an action may not be performed, you simply omit the link. When a user fetches the resource, they see all the available actions right there.
A popular format for hypermedia APIs is HAL. You might decorate the links further with more information from HTTP link hints.
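As a rough sketch of what this looks like (Python, with a hypothetical order resource and a permission check supplied elsewhere; the `allow` hint follows the link-hints idea mentioned above):

```python
def order_representation(order, can_delete):
    """Build a HAL-style body; the delete link is present only when the action is allowed."""
    links = {"self": {"href": f"/orders/{order['id']}"}}
    if can_delete:
        # The client enables its Delete button simply because this link exists.
        links["delete"] = {"href": f"/orders/{order['id']}", "hints": {"allow": ["DELETE"]}}
    return {"_links": links, "status": order["status"]}
```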
If this is the first time you've heard of hypermedia APIs, there may be a bit of a learning curve, but the results of learning this can be very beneficial.
Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each one and get the names and emails of the partners. How should I approach this problem, at a high level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and/or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking of approaching the email is just to search for all mailto links (a tags) and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
Another, more obvious issue is detecting the partners' names just from the markup. I was initially thinking I could just pull all the header tags and the text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners at all, so in my opinion you'd better be sure you stand to gain from this; if not, you'd better look for some other venture.
I have done similar data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to email individuals, but you are allowed to email the company - so in a way it's the same problem from another angle.
I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but in my mind the only solution I can see is building a rule set that over time probably gets quite complex. Maybe you could apply some Bayesian filtering - it works very well for email.
But, to be a little more productive here: one thing I know is important is that you should start by creating the crawler environment and building the dataset. Keep a database of URLs so you can add more at any time, and start crawling what you already have, so that you do your testing by querying your own 100% copy of the data. This will save you an enormous amount of time compared to live scraping while tweaking.
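For example, a minimal sketch of that "crawl once, parse from your own copy" setup in Python, assuming a local SQLite cache (the table name and layout are just illustrative):

```python
import sqlite3
import urllib.request

db = sqlite3.connect("crawl_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

def fetch_once(url):
    """Download a page only if we don't already have a stored copy."""
    row = db.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row:
        return row[0]
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    db.execute("INSERT INTO pages (url, html) VALUES (?, ?)", (url, html))
    db.commit()
    return html

# Parsing experiments then run against the cached copies, not the live sites.
```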
I built my own search engine some years ago, scraping all .no domains, though I needed only the index file at the time. It took over a week just to scrape it down, and I think it was 8 GB of data just for that single file, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems needed taking care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down now if you want to work efficiently with the parsing later.
Good luck, and do post if you find a solution. I do not think it is possible without an algorithm or AI, though - people design websites the way they like and they pull templates out of their arse, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler: you could just crawl each site and make a profile for each one, then employ someone cheap to manually go through the parsed data and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale or available from a web service so it can be scraped.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole web page for names. (There are free databases of first names and last names.) This may also work if you are doing this for other European companies, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
It is easy to find an email address in a web page, as there is a fixed format of (username)@(domain name) with no space in between. Again, I wouldn't treat it as HTML tags but just as a normal string, so that the email can be found whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is:
Is there only one email on the page?
  Yes -> treat it as a catch-all email.
  No -> Is a name found on that page as well?
    No -> catch-all email (there can be more than one catch-all email, perhaps for different purposes like info and employment).
    Yes -> the email should be attached to the name found right before it; it is normal for the name to appear before the email.
Then it should be safe to assume that the name appearing first belongs to a more important member, e.g. a chairman or partner.
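A rough Python illustration of that decision logic (the regex and the tiny name list are simplified assumptions, not a complete solution):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
KNOWN_FIRST_NAMES = {"John", "Mary", "Rick"}  # stand-in for a real first-name database

def classify_emails(page_text):
    """Pair each email with the nearest preceding known name; otherwise call it catch-all."""
    results = []
    for match in EMAIL_RE.finditer(page_text):
        # Only look a short distance back, since the name normally sits right before the email.
        window = page_text[max(0, match.start() - 80):match.start()]
        names = [w for w in re.findall(r"[A-Z][a-z]+", window) if w in KNOWN_FIRST_NAMES]
        results.append((match.group(), names[-1] if names else "catch-all"))
    return results
```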
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler that sort of auto-finds the information, it will be difficult. However, the high-level approach looks something like this.
For each site you check, look for element patterns. Divs will often have labels, IDs, and classes that let you grab information easily. Perhaps you find that many divs have a particular class name; check for this first.
It is often better to grab too much data from a particular page and boil it down on your side afterwards. You could, perhaps, look for information that comes up on screen by using type (is it a link?) or a regex (is it an email?) to find formatted text. Names and occupations will be harder to find by this method, but on many pages they might be related positionally to other well-formatted items.
Names will often be prefixed or suffixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.). You could come up with a bank of those and check against it on any page you end up on.
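As a small sketch of that honorific check in Python (the bank here is deliberately tiny and meant to be extended):

```python
import re

# A small bank of honorific prefixes; suffixes (JD, MD, CPA, ...) could be handled the same way.
HONORIFIC_NAME_RE = re.compile(r"(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)")

def names_near_honorifics(text):
    """Return name-like strings that directly follow an honorific."""
    return HONORIFIC_NAME_RE.findall(text)

print(names_near_honorifics("Our partners are Dr. Jane Roe and Mr. John Smith."))  # ['Jane Roe', 'John Smith']
```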
Finally, if you really wanted to make this process general-purpose, you could add some heuristics to improve your methods based on expected information; names, for example, are most often drawn from a fairly limited list. If it were worth your time, you could check certain text against a list of the more common names.
From what you mentioned in your initial question, it seems you would benefit a lot from a general-purpose regular-expression crawler, and you could improve it as you learn more about the sites you interact with.
There are excellent posts on this topic, with a lot of useful links, on these web pages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the applications reviewed there also work on macOS.
Let's say we have a site with a list of items. On each of these items you can start a couple of different processes that will result in some kind of output related to the item in question. How should you design this for the most appropriate use of the HTTP verbs? What I would like is multiple links per item, with each link triggering one of the actions, but that doesn't match the HTTP verb GET, which is what links use. On the other hand, I don't want buttons that each sit in a separate form with a different action.
It's somewhat hard to explain, but hopefully you understand; there should be some best practices to apply here.
You should NOT use GET. GET requests should be safe, which means they are intended only for information retrieval and should not change the state of the server. (i.e. things like logging are OK, but things that actually update the state of the application are a no-no.) Think of a crawler going over your application. Anything you wouldn't mind a crawler going through is fine for GET, but that doesn't sound like your situation (because you said "start a couple of different processes", though I could be misinterpreting your use case).
That leaves PUT, DELETE and POST. PUT and DELETE must be idempotent, meaning that multiple identical requests should have the same effect as a single request. So if you had a request that updated a person's name, for example, if you called it once or 100 times, the person's name would still be the same, so it is idempotent.
POST is the most flexible verb. If the processes you are kicking off are not safe or idempotent (or even if they are) you can use POST, which simply doesn't guarantee anything about safety or idempotency. The disadvantages there are:
If you use POST when GET is more semantically correct, it is less communicative of the intent of your request, since POST usually means you are sending a payload.
You can't take advantage of the web's caching infrastructure, which is part of what makes it so scalable.
In the past, I have used POST with query args to specify custom actions. It made sense in my use case because the majority of my custom actions needed to pass a payload. Since you do not want to use buttons, you can use GET with query args to specify the different actions, but you have to be very careful that the action you are taking does not have any side effects and is idempotent. As noted in the comment by @jhericks below, there are many things in the network that assume that GETs are safe and may repeat them.
From a pure RESTful perspective, though, this is not ideal. Your items will have a specific URI, and GET on that URI will return the item's representation. Running actions on the item is effectively a change in the state of the item's representation, and that should be done with a POST (or a PUT, depending on who you ask and whether your web server supports PUT). In real life, though, using query args is an easy workaround and it may make sense for your use case.
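For example, a minimal Flask sketch of that split in Python (the item resource and the "archive" action are hypothetical, purely to show GET vs. POST):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# GET stays safe: it only returns the item's representation and changes nothing.
@app.route("/items/<int:item_id>", methods=["GET"])
def show_item(item_id):
    return jsonify({"id": item_id, "status": "active"})

# The state-changing action goes through POST rather than a plain link.
@app.route("/items/<int:item_id>/archive", methods=["POST"])
def archive_item(item_id):
    # ... kick off the (possibly non-idempotent) process here ...
    return jsonify({"id": item_id, "status": "archiving"}), 202
```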
I'm not sure I fully understand your question.
But here's a quick paragraph which might help you.
REST is about making smart clients and simple servers. GET, PUT, DELETE represent the basic operations of file access at the lowest level. What you should be doing is completely ignoring anything the server can offer and offloading that work onto clients.
So the question is: why is the server being triggered to do many things? Why can't the client do all of these things itself?
Mike Brown
What practical benefits can my client get if I use microformats on his site for every possible thing?
How can I explain these benefits to a non-technical client?
Sometimes it seems like the practical benefits are hard to quantify.
Search engines already pick up and parse microformats (see e.g. https://support.google.com/webmasters/answer/99170). I believe hCard and hCalendar are fairly well supported, and if not, plenty of sites are using them, including places like MySpace.
The idea is that adding CSS classes and specified IDs makes your existing content easier to parse in a machine-readable manner.
hReview is starting to make some inroads, and hResume looks like it will take off too.
I heavily use rel="nofollow" on uncontrolled links (third-party sources), which is actually a microformat.
Check the microformats wiki for a decent starting point.
It just means your viewers can share a few generic "formats". You can generalize stylesheets and parsing mechanisms. Rather than having a web page consist of one "HTML document", you have a web page that consists of "10 formatted micro-documents".
If you need a real-world analogy: think of it like handing over a formatted invoice, a receipt, and a business card, rather than writing it all down on notebook paper with your left hand.
Overall the site becomes easier to digest for the rest of the internet. The data can be reused, combined, cross-referenced, and saved.
A simple example would be to have a latitude and a longitude marked up anywhere on the site (geo). With microformats, anybody that searches for that latitude and longitude can easily be referred to the website, increasing traffic and awareness of that person/company, and allowing users to easily save that information. (I've encountered little of this personally; it's more "the future" of things than current practice, but it's always good to stay up to date.)
A second example would be a business card (hCard), which a browser can easily save and transfer to an address book, so that after just one visit to the site the visitor has the information saved locally. This is especially useful if they're visiting from a cell phone.
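For illustration, a minimal Python sketch of how a consumer could pull an hCard out of a page (the markup snippet is made up, but the class names follow the hCard convention):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<div class="vcard">
  <span class="fn">Jane Doe</span>
  <a class="email" href="mailto:jane@example.com">jane@example.com</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find(class_="vcard")
name = card.find(class_="fn").get_text(strip=True)
email = card.find(class_="email")["href"].replace("mailto:", "")
print(name, email)  # -> Jane Doe jane@example.com
```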
I wouldn't recommend using microformats for "every possible thing". Use them for things where you get some benefit, in exchange for the effort of using them.
The main practical benefit I'm aware of is customised search engine results:
https://support.google.com/webmasters/answer/99170
Technically, Google now prefers this to be implemented using microdata (i.e. itemprop attributes) rather than microformats, but it's the same idea.
Having a microformat can be better than no format, since it lets you capture every possible thing in the application.
A microformat for every possible thing can be better than a standard format only because it's quicker to create, so it costs less, and it takes less space than some standard formats, like XML.
But all this depends on the context of the application and so you must explain it to the client in that context.
Microformatting your content extends its reach in every way possible. Your site's structure becomes its "API", and the possibilities are limited only by where you set your limits.
I'm specifically thinking about the BugMeNot service, which provides user name and password combos to a good number of sites. Now, I realize that pay-for-content sites might be worried about this (and I would suspect that most watch for shared accounts), but how about other sites? Should administrators be on the lookout for these accounts? Should web developers do anything differently to take them into account (and perhaps prevent their use)?
I think it depends on the aim of your site. If usage analytics are all-important, then this is something you'd have to watch out for. If advertising is your only revenue stream, then does it really matter which username someone uses?
Probably the best way to discourage the use of bugmenot accounts is to make it worthwhile to have an actual account. For example, no one would use one here, since we all want rep and a profile; or, if you're sending out useful emails, people will want to receive them.
Ask yourself the question "Why do we require users to register to access my site?" Once you have a business reason for this requirement, you can try to work out the effect of having some part of it bypassed by suspect account information.
Work on the basis that at least 10 to 15 percent of account information will be rubbish - and if people using the site can't see any benefit to them personally for registering, and if the registration process is even remotely tedious or an imposition, then accept that you will be either driving more potential visitors away, or increasing your "crap to useful information" ratio.
How about not making registration mandatory just to read something? That is, ask people to register only when you are providing some functionality for them that saves settings, data, etc. I would imagine a site like Stack Overflow gets fewer fake registrations (reading questions doesn't require an account) than, say, the New York Times, where you need an account to read articles.
If that is not within your control, you might consider removing dormant accounts, i.e. removing accounts after a certain period of inactivity.
That entirely depends.
Most sites that find themselves listed on bugmenot.com tend to be the ones that require registration in order to access otherwise-free content.
If registration is required in order to interact with the site (i.e. add comments, posts, etc.), then chances are most people would rather create their own account than use one that has been made public.
So before considering whether to do things like automatically checking bugmenot, think about whether there are problems with your business model.
There are a few situations where pay-to-access content sites (I'm thinking things like, ahem, 'adult' sites) end up with a few user accounts being published publicly (usually because someone has brute-forced some account details), and in that case there may be an argument for putting significant effort into it.
From an administrator's viewpoint, absolutely. That registration is required for a reason, even if it's something as simple as user tracking or profile maintenance. Several thousand people using the same login entirely defeats the purpose. IP tracking could help mitigate this problem, but it would definitely be hard to eliminate entirely.
No need to worry about BugMeNot: http://www.bugmenot.com/report.php
With bugmenot, keep in mind that this service is not actually there to harm sites, but rather to make using them easier. You can request that your site be blocked if it is pay-per-view, community-based (i.e. a forum or wiki) or the account includes sensitive information (like banking). This means that in virtually all situations where you would think bugmenot is a bad thing, bugmenot does not want to be used. So maybe things are not as bad as you might think.