Building an index of URLs, what features to include? - data-analysis

I am working towards building an index of URLs. The objective is to build and store a data structure whose key is a domain URL (e.g. www.nytimes.com) and whose value is a set of features associated with that URL. I am looking for your suggestions for this set of features. For example, I would like to store www.nytimes.com as follows:
[www.nytimes.com: [lang: en, alexa_rank: 96, content_type: news, spam_probability: 0.0001, etc.]]
Why am I building this? The ultimate goal is to do some interesting things with this index; for example, I may do clustering on it and find interesting groups. I have with me a whole lot of text generated by a whole lot of URLs over a whole lot of time :) so data is not a problem.
Any kind of suggestions are very welcome.

Make it work first with what you've already suggested. Then start adding features suggested by everybody else.
ideas are worth nothing unless executed.
-- http://www.codinghorror.com/blog/2010/01/cultivate-teams-not-ideas.html

I would maybe start here:
Google white papers on IR
Then also search Google more broadly for white papers on IR.
Also a few things to add to your index:
Subdomains associated with the domain
IP addresses associated with the domain
Average page speed
Links pointing at the domain, e.g. a link:nytimes.com search on Yahoo
Number of pages on the domain - site:nytimes.com on Google
Traffic numbers on compete.com or Google Trends
WHOIS info, e.g. age of domain, length of time registered, etc.
Some other places to research: http://www.majesticseo.com/, http://www.opensearch.org/Home and http://www.seomoz.org; they all have their own indexes.
I'm sure there's plenty more, but hopefully the IR material will get the cogs whirring :)
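To make the target concrete, here is a minimal sketch of one way the feature record and index could be represented (Python; the field names simply mirror the question's example and the suggestions above, and are illustrative rather than a fixed schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DomainFeatures:
    """Illustrative feature record for one domain in the index."""
    lang: str                                  # dominant language of the crawled text
    alexa_rank: Optional[int] = None           # traffic rank, if available
    content_type: str = ""                     # e.g. "news", "blog", "ecommerce"
    spam_probability: float = 0.0              # output of whatever spam classifier you use
    subdomains: list[str] = field(default_factory=list)
    ip_addresses: list[str] = field(default_factory=list)
    avg_page_speed_ms: Optional[float] = None
    inbound_links: Optional[int] = None        # e.g. from a link: query
    page_count: Optional[int] = None           # e.g. from a site: query
    domain_age_years: Optional[float] = None   # from WHOIS

# The index itself maps a domain to its feature record.
index: dict[str, DomainFeatures] = {
    "www.nytimes.com": DomainFeatures(
        lang="en", alexa_rank=96, content_type="news", spam_probability=0.0001
    ),
}
```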

Related

If I have a collection of random websites, how do I get specific information from each?

Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking about approaching the email problem is just to search for all mailto anchor tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
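For what it's worth, that mailto-first pass might look roughly like this (sketched in Python with BeautifulSoup rather than Ruby/Oga purely for illustration; the same CSS-selector logic carries over to Oga):

```python
from bs4 import BeautifulSoup

def extract_mailto_addresses(html: str) -> list[str]:
    """Collect every address that appears in a mailto: link on a page."""
    soup = BeautifulSoup(html, "html.parser")
    addresses = []
    for a in soup.select('a[href^="mailto:"]'):
        # Strip the scheme and any ?subject=... query string.
        addr = a["href"][len("mailto:"):].split("?")[0].strip()
        if addr:
            addresses.append(addr.lower())
    return addresses
```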
Another, harder issue is detecting the partner(s)' names from the markup alone. I initially thought I could just pull all the header tags and the text in them, but I have stumbled across a few sites that put the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners at all, so in my opinion make sure you actually stand to gain from this; if not, you had better look for some other project.
I have done similar data scraping from time to time, and in Norway we have laws (or should I say "laws") saying that you are not allowed to email individuals, though you are allowed to email the company, so in a way it is the same problem from another angle.
I wish I knew the maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but to my mind the only solution I can see is building a rule set that will probably get quite complex over time. Maybe you could apply some Bayesian filtering; it works very well for email.
But, to be a little more productive here: one thing I know is important is to start by creating the crawler environment and building the dataset. Have a database for the URLs so you can add more at any time, and start crawling what you already have, so that you do your testing by querying your own 100% local copy. This will save you enormous time compared to live scraping while tweaking.
I built my own search engine some years ago, scraping all .no domains, though I only needed the index file at the time. It took over a week just to scrape it all down, I think it was 8 GB of data just for that single file, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems needed taking care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down now if you want to work efficiently with the parsing later.
Good luck, and do post if you find a solution. I do not think it is possible without an algorithm or AI, though; people design websites the way they like and they pull templates out of their arse, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler: you could just crawl each site and make a profile for each, then employ someone cheap to go through the parsed data manually and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale or available via a web service it can be pulled from.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole webpage for names (there are free databases of first and last names). This may also work if you are doing this for other European companies, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
It is easy to find an email address on a webpage, as there is a fixed format of (username)@(domain name) with no spaces in between. Again, I would not treat it as an HTML tag but just as a normal string, so that the email can be found whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is (a rough code sketch follows the tree below):
Only one email in page?
Yes -> catch-all email.
No -> Is a name found on that page as well?
No -> catch-all email (there can be more than one catch-all email, perhaps for different purposes such as info and employment)
Yes -> the email should be attached to the name found right before it; it is normal for the name to appear before the email.
Then it should be safe to assume that the name appearing first belongs to the more important member, e.g. a chairman or partner.
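A rough sketch of that decision tree (Python; page_text and names_on_page stand in for whatever extraction you already have, and the "nearest preceding name" rule is just the heuristic described above):

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def classify_emails(page_text: str, names_on_page: list[str]) -> dict[str, str]:
    """Label each email on a page, following the decision tree above.

    names_on_page is whatever list of person names your own detector produced.
    """
    matches = list(EMAIL_RE.finditer(page_text))

    # Only one email on the page -> treat it as a catch-all address.
    if len(matches) == 1:
        return {matches[0].group(): "catch-all"}

    labels: dict[str, str] = {}
    for m in matches:
        # Attach the email to the nearest name appearing before it, if any.
        positions = {n: page_text.rfind(n, 0, m.start()) for n in names_on_page}
        positions = {n: p for n, p in positions.items() if p >= 0}
        labels[m.group()] = max(positions, key=positions.get) if positions else "catch-all"
    return labels
```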
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler that automatically finds the information, it will be difficult. However, at a high level it looks something like this.
For each site you check, look for element patterns. Divs will often have labels, IDs, and classes that let you grab information easily. Perhaps you will find that many divs share a particular class name; check for this first.
It is often better to grab too much data from a particular page and boil it down on your side afterwards. You could, for instance, look for information by element type (is it a link?) or by regex (is it an email?) to pick out formatted text. Names and occupations are harder to find by this method, but on many pages they sit positionally near other well-formatted items.
Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.). You could come up with a bank of those and check any page you end up on against it.
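A small sketch of such an honorific/credential check (Python; the regex and the list of honorifics are deliberately minimal and would need extending):

```python
import re

# Starter bank of honorifics and credentials; extend it as you meet more variants.
HONORIFIC_NAME_RE = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"    # prefix: "Dr. Jane Smith"
    r"|([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+),\s*(?:CPA|JD|MD|Esq\.?)\b"  # suffix: "Jane Smith, CPA"
)

def candidate_names(text: str) -> list[str]:
    """Return name candidates that sit next to an honorific or credential."""
    return [prefix or suffix for prefix, suffix in HONORIFIC_NAME_RE.findall(text)]
```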
Finally, if you really wanted to make this process general purpose, you could use some heuristics to improve your methods based on expected information; names, for example, usually come from a fairly common pool, so if it were worth your time you could check candidate text against a list of common names.
From what you mentioned in your initial question, it seems you would benefit a lot from a general-purpose regular-expression crawler, and you could improve it as you learn more about the sites you interact with.
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the applications reviewed there also run on macOS.

alternative to Google maps

My client wants some of the functionality of Google Maps, namely:
- geocoding
- generating maps with points based on postal code or long/lat
- optimal trip mapping
Their issues with Google Maps:
- cannot control outages
- postal codes are sometimes inaccurate or not updated frequently for Canada/UK
- they have no way to correct inaccurate information
They would prefer to host the mapping application themselves, but will require postal code updates.
Can anyone suggest such a product?
thanks
"cannot control outages - postal codes are sometimes inaccurate or not updated frequently for Canada/UK - they have no way to correct inaccurate information"
Outages
Hosting your own mapping is the only way to control this, but you would be very, very hard pressed to beat Google Maps / Bing Maps uptime over the last 5 years. Take a look at the following:
OpenStreetMap for the road data; this is open-source data, very good in the UK (I'm not sure about Canada), and you can make your own changes and submit them (or just change the data you have downloaded).
GeoServer, Mapnik or MapServer will read OpenStreetMap data and create the image tiles needed to render your own maps in whatever style you wish. If you don't need all countries and all zoom levels, these products can create all the tiles you will need in advance, but usually tiles have to be created in real time and cached. You need a BIG, fast server to manage the tile crunching.
OpenLayers and Leaflet are open-source JavaScript mapping libraries that will display your tiles for you.
Obviously this is just for road maps; aerial imagery would cost you an absolute fortune.
Post Code Data
Many people do not realize that UK postcode coordinate data is now completely free and available to download every quarter from the official source, Ordnance Survey: http://www.ordnancesurvey.co.uk/oswebsite/products/code-point-open/index.html.
This is the same data source Google will use and there is none better but it will always contain inaccuracies and always be a few months out of date.
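If you go the Code-Point Open route, note that the download is a set of per-area CSV files with OSGB36 eastings/northings rather than latitude/longitude, so one coordinate transform is needed. A minimal sketch (Python with pyproj; the column positions are an assumption from the layout I recall and should be checked against the documentation shipped with the download):

```python
import csv
from pyproj import Transformer

# OSGB36 National Grid (EPSG:27700) -> WGS84 longitude/latitude (EPSG:4326)
to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

def load_postcodes(csv_path: str) -> dict[str, tuple[float, float]]:
    """Map postcode -> (lat, lon) from one Code-Point Open CSV file.

    Assumed columns: 0 = postcode, 2 = eastings, 3 = northings; check the
    documentation shipped with the download before relying on this.
    """
    lookup: dict[str, tuple[float, float]] = {}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            postcode = row[0].replace(" ", "").upper()
            lon, lat = to_wgs84.transform(float(row[2]), float(row[3]))
            lookup[postcode] = (lat, lon)
    return lookup

# Example usage (file name is illustrative):
# postcodes = load_postcodes("codepo_gb/Data/CSV/ab.csv")
# print(postcodes.get("AB101AB"))
```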
Finally
Hopefully that answers the question you asked and gives you information to pass on to your client. Now for the question you didn't ask: "Is this approach good value for my client?"
I won't presume to know your business or client; however, what I described above is possible, but it involves one to many months of work to get it all working together, and even then it won't have anywhere near the performance or uptime of something like Google/Bing Maps, and it only offers a small subset of their features.
I think you're looking for something like Caliper. It's a very custom, and I would expect expensive, solution; not suggested.
http://www.caliper.com/GISMappingSoftwareDevelopment.htm
One solution could be to use two different mapping services and compare their results; this way there's a much better chance the data is accurate. You can also fix inaccurate data by creating a system that acts as a barrier between the API and your user, where data you know is inaccurate is corrected before it's displayed. I'm not sure exactly what you're doing, though, so this might not work for you.
Is trip mapping/routing the basic functionality you want to do?
Before rushing into rolling your own, I'd suggest a good think about the consequences of doing so. The first that springs to mind is that whilst the pro is that you now control your data, the con is that you now control your data.
So you are going to have to consider where and when you get updates and the processes you will have to employ to keep your maps in sync with the rest of the world. There are a lot of headaches involved in these things, which is why so many people use externally hosted solutions such as Google's.

How does google determine the date a thread was posted?

When you search Google for a term, you can click "Discussions" on the left-hand side of the page. This leads you to forum-based discussions which you can select. I am in the process of designing a discussion board for a user group, and I would like Google to index my data with the post time.
You can filter the results by "Any Time" - "Past Hour" - "Past 24 Hours" - "Past Week" - etc.
What is the best way to ensure that the post date is communicated to Google? An RSS feed for the thread? A special HTML label tag with a particular id? Or some other method?
Google continually improves their heuristics and as such, I don't think there are any (publicly known) rules for what you describe. In fact, I just did a discussion search myself and found the resulting pages to have wildly differing layouts, and not all of them have RSS feeds or use standard forum software. I would just guess that Google looks for common indicators such as Post #, Author, Date.
Time-based filtering is mostly based on how frequently Google indexes your page and identifies new content (although discussion pages could also be filtered based on individual post dates, which is once again totally up to Google). Just guessing, but it might also help to add Last-Modified headers to your pages.
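If you want to try the Last-Modified suggestion, a minimal sketch of emitting the header for a thread page might look like this (Python/WSGI; the timestamp source is made up here, and how much weight Google gives the header is, again, up to Google):

```python
from datetime import datetime, timezone
from email.utils import formatdate

def thread_page(environ, start_response):
    """Minimal WSGI view that serves a thread page with a Last-Modified header."""
    # In a real forum this would be the timestamp of the newest post in the thread.
    last_post = datetime(2011, 3, 14, 9, 30, tzinfo=timezone.utc)

    body = b"<html><body>...thread posts...</body></html>"
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        ("Last-Modified", formatdate(last_post.timestamp(), usegmt=True)),
    ])
    return [body]
```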
I believe Google will simply look at when the content appeared. No need for parsing there, and no special treatment required on your end.
I once read a paper from a Googler (a paper I sadly can't find anymore; if somebody finds it, please give me a note) where this was outlined. A lot of formulas and so on, but the bottom line was: Google has analyzed the structure of the top forum systems on the web. It does not use a page metaphor to analyse a forum, but breaks it down into topics, threads and posts.
So basically, if you use a standard, popular forum system, Google knows that it is a forum and puts you into the discussion segment. If you build your own forum software, it is probably best to follow existing, established forum conventions (topics, threads, posts, authors, ...).

Anyone ever tried to use Twitter to replace comments sections on web apps?

Here's the scenario I'm imagining.
Simple blog: users typically post comments in a comments form at the bottom of each blog article. Instead of that, using the Twitter API, pull tweets based on a hashtag. Base the hashtag on the article id (e.g. #site10201), where "site" is a prefix and the number is the article id.
Then provide a link to post a tweet using that hashtag, which would then get picked up by your Twitter API pull.
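For reference, the hashtag pull might look roughly like this (Python; this assumes Twitter's current v2 recent-search endpoint and a bearer token, which is not the API that existed when this was written, so treat it as a sketch of the idea rather than the exact call):

```python
import requests

def fetch_comment_tweets(article_id: int, bearer_token: str) -> list[str]:
    """Pull recent tweets carrying the per-article hashtag, e.g. #site10201."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        params={"query": f"#site{article_id}", "max_results": 100},
        headers={"Authorization": f"Bearer {bearer_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [tweet["text"] for tweet in resp.json().get("data", [])]
```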
I'm imagining horrible spam issues, but other than that, bad idea?
This has some drawbacks compared to more run-of-the-mill database-backed comments:
Additional network overhead. Most self-hosted blogs typically have the database and the blog on the same server (physical or virtual), so a DB lookup is fast (and reliable) compared to Twitter API requests.
Caching issues. One host is only allowed X requests to Twitter per time window (requests beyond that will fail), so how are you going to manage that from your website for a scenario which becomes steadily more complex as articles are added? Presumably you need to authenticate, so the easy way out is a security liability. (The easy way out being to use JavaScript in the browser to perform the actual request, which neatly circumvents the problem in 20/80 fashion.) Granted, most blogs don't get that kind of traffic. ;)
You tie your precious (or not so precious) comments to the mercy of the fail whale, which is kind of odd considering that a self-hosted blog basically means you wanted that kind of control in the first place by not using a service like Blogger.
Is it possible to ensure uniqueness of hashtags in the general case? What are you going to do if someone has the same bright idea and takes the tag 5 ms before you do? Would you end up pulling the drivel of someone else's blog comments rather than the brilliance you have come to expect from yours? ;)
Lesser point: you rely on others having a Twitter account. Anonymous replies are off the table.
Terms of service and other considerations that may be imposed on you by Twitter, either now or in the future. Caching (point 2 above) is actually a major item of Twitter's TOS.

SEO Problem for new dictionary site, google hasn't indexed content

I loaded about 15,000 pages, letters A & B of a dictionary, and submitted a text sitemap to Google. I'm planning to use Google's custom search (with advertising) as the mechanism for navigating my site. Google Webmaster Tools accepted the sitemaps as good but then did not index the pages. My index page has been indexed by Google, but at this point it does not link to any other pages.
So to get Google's search to work I need to get all my content indexed. It appears Google will not index from the sitemap alone, so I was thinking of adding spiderable pages that link out from the main index page. But I don't want to create a bunch of pages that programmatically link to all of the content without knowing whether this has a chance of working. Eventually I plan on having about 150,000 pages, each page being a word or phrase being defined; I wrote a program that pulls this from a dictionary database. I would like to show the content I have to anyone interested, to demonstrate the value of the dictionary in relation to the dictionary software I'm completing. Any suggestions for getting the entire site indexed by Google so I can appear in the search results?
Thanks
First off, you have to have legitimate, interesting content and not be some sort of attempt to scam or fool Google (see below).
Assuming that's true, Google has some guidelines for getting your site indexed. I highly recommend reading this document.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769
The first recommendation is:
Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link.
This means that, yes, you have to have links to each page.
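For a dictionary site, one cheap way to satisfy that is to generate static alphabetical index pages linking to every entry and link those from the home page. A rough sketch (Python; load_words() and the /define/ URL scheme are hypothetical stand-ins for your own database and routing):

```python
from collections import defaultdict
from html import escape
from urllib.parse import quote

def write_index_pages(words: list[str], out_dir: str = ".") -> None:
    """Emit one static index-<letter>.html per initial letter, linking every entry."""
    by_letter = defaultdict(list)
    for word in sorted(words):
        by_letter[word[0].upper()].append(word)

    for letter, entries in by_letter.items():
        links = "\n".join(
            f'<li><a href="/define/{quote(w)}">{escape(w)}</a></li>' for w in entries
        )
        page = (f"<html><body><h1>Words starting with {letter}</h1>"
                f"<ul>\n{links}\n</ul></body></html>")
        with open(f"{out_dir}/index-{letter.lower()}.html", "w", encoding="utf-8") as f:
            f.write(page)

# words = load_words()       # hypothetical: pull every headword from your dictionary database
# write_index_pages(words)   # then link each index page from your home page
```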
My experience is with a bookseller with about 90,000 titles, some of which are rare and hard to obtain elsewhere. Humans will access the books pages through the search interface, mostly. Each book page has a fair amount of information: title, publisher, date published, etc - about 20 fields in all.
The site and its ever-changing catalog has been live for more than a decade, but the titles were not accessible to search engines since they were only reachable through the search interface. So Google only indexed the CMS pages and not the book catalog.
We started by having a sitemap. This didn't get Google to index the book catalog. Then, after reading the guidelines, we added static text links to all the titles. Then we added accurate meta tags. (The URLs remained normal IIS/PHP cruft, i.e. book.php?id=123546.) There is no duplicate content.
At one point we got Google to index 17,000 pages, but since then the number of pages in Google's index has dropped to about 900: all the non-book pages and a few books that are featured on the site.
Unfortunately the client didn't want to pay for us to keep trying to get Google to index the titles, so that's how it stands today.
Finally, it's not really clear from your post what you are trying to accomplish. If you are trying to put an existing, boring old dictionary up on the web and then get ad revenue, Google is going to ignore you.
Google crawling takes time, especially for new sites/domains. The sitemaps you submit are just "informational" for Google and don't necessarily mean that everything in them will be indexed.
Some notes:
Make sure you display unique content or Google will penalize you for duplication.
Avoid pages with little text; Google likes pages with a reasonable amount of content.
Keep adding content progressively, a couple of times every day if possible.
You should start to see your pages crawled hourly/daily (depending on the update frequency) and appearing in search engines in about 2 weeks to 2 months (mine usually take about a month to be considered "crawled"). Something you can do meanwhile is get backlinks from other websites and keep checking Webmaster Tools to see Google's crawl rate.
Good luck! :)