Cloud-based "knowledge base" approach with links, snippets and excerpts for Google Chrome - google-chrome

I'm working as a web developer and have lots (hundreds) of links with hacks, tutorials and code snippets that I don't want to memorize. I am currently using Evernote to save the content of my links as snippets and to have them searchable and always available (even if the source site is down).
I spend a lot of time on tagging, sorting, evaluating and saving stuff to Evernote and I'm not quite happy with the outcome. I've ended up with a multitude of tags and keep reordering and renaming tags while retagging saved articles.
My Requirements
web based
saving web content as snippets with rich styling (code sections, etc.)
interlinked entries possible
chrome plugin for access to content
chrome plugin for content generation
web app or desktop client for faster sorting / tagging / batch processing
good and flexible search mechanism
(bonus) Google search integration (search results from the knowledge base shown within Google search results)
I had a look at Kippt but that doesn't seem to be a solution for me. If I don't find a better solution, I'm willing to stay with Evernote as it meets nearly all my needs, but I need a good plan to sort through my links/snippets once and get them in order.
Which solutions do you use and how do you manage your knowledge base?

I'm a big Evernote fan but a stern critic of all my tools. I've stuck with Evernote because I'm happy enough with its fundamental information structures. I am, however, currently working on some apps to provide visualisations and hopefully better ways to navigate complex sets of notes.
A few tips, based on years of using Evernote and wikis for collaboration and software project management:
you can't get away from the need to curate things, regardless of your tool
don't over-think tagging: tags in combination with words are a great way to search (you do know you can say tag:blah in a search to combine it with word searches?)
build index pages for different purposes - I'm using a lot more of the internal note links to treat Evernote like a wiki
refactor into smaller notebooks if you use mobile clients a lot, allowing you to choose to have different collections of content with you at different times

Related

Scraper: distinguishing meaningful text from meaningless items, hadoop

I'm trying to build a crawler and scraper in Apache Nutch to find all the pages containing a section talking about a particular word-topic (e.g. "election", "elections", "vote", etc.).
Once I have crawled, Nutch strips stop words and tags from the HTML, but it doesn't take out menu items (which appear on every page of the website).
So it could happen that when you look for all the pages talking about elections, you retrieve a whole website, because it has the word "elections" in its menu and therefore on every page.
I was wondering whether techniques exist that analyze multiple pages of a website to work out what the main template of a page is. Useful papers and/or implementations/libraries would be appreciated.
I was thinking about creating some kind of Hadoop job that analyzes similarities between multiple pages to extract a template (a toy sketch of that idea follows the example below). But the same website could have multiple templates, so it is hard to think of an effective way to do that.
E.G.
WEBPage 1:
MENU HOME VOTE ELECTION NEWS
meaningful text... elections ....
WebPage 2:
MENU HOME VOTE ELECTION NEWS
meaningful text... talking about swimming pools ....
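
To make the Hadoop-job idea concrete, here is a toy sketch in plain Java (no Hadoop, with the two example pages above hard-coded as input) of treating lines that repeat across pages of the same site as template text. It only illustrates the approach and is not meant as production code:

    import java.util.*;

    public class TemplateDetector {
        // Given the visible text of several pages from the same site (hypothetical input),
        // treat a line as "template" if it appears on most of the pages.
        public static Set<String> templateLines(List<String> pages) {
            Map<String, Integer> lineCounts = new HashMap<>();
            for (String page : pages) {
                // dedupe lines within a single page so one page counts a line only once
                for (String line : new HashSet<>(Arrays.asList(page.split("\n")))) {
                    lineCounts.merge(line.trim(), 1, Integer::sum);
                }
            }
            Set<String> template = new HashSet<>();
            for (Map.Entry<String, Integer> e : lineCounts.entrySet()) {
                if (e.getValue() >= pages.size() * 0.8) {   // appears on >= 80% of the pages
                    template.add(e.getKey());
                }
            }
            return template;
        }

        public static void main(String[] args) {
            List<String> pages = Arrays.asList(
                    "MENU HOME VOTE ELECTION NEWS\nmeaningful text... elections ....",
                    "MENU HOME VOTE ELECTION NEWS\nmeaningful text... talking about swimming pools ....");
            System.out.println(templateLines(pages));   // prints only the shared menu line
        }
    }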
You didn't mention which branch of Nutch (1.x/2.x) you are using, but at the moment I can think of a couple of approaches:
Take a look at NUTCH-585, which will be helpful if you are not crawling many different sites and if you can specify which nodes of your HTML content you want to exclude from the indexed content.
If you're working with many different sites and the previous approach is not feasible, take a look at NUTCH-961, which uses the boilerpipe support inside Apache Tika to guess which text matters in your HTML content. The library applies several heuristics and provides several extractors; you could try them and see what works for you. In my experience I've had some issues with news sites that had a lot of comments, and some of the comments ended up being indexed along with the main article content, but that was a minor issue after all. In any case this approach could work very well for a lot of cases.
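For illustration, outside of Nutch itself, this is roughly how the boilerpipe library is used on raw HTML. It is a minimal sketch assuming the boilerpipe jar is on the classpath; on a tiny document like this the heuristics have little to work with, but on real pages they do much better:

    import de.l3s.boilerpipe.BoilerpipeProcessingException;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;
    import de.l3s.boilerpipe.extractors.DefaultExtractor;

    public class BoilerplateDemo {
        public static void main(String[] args) throws BoilerpipeProcessingException {
            // A page whose menu repeats the word "election" but whose body is about swimming pools.
            String html = "<html><body>"
                    + "<ul><li>HOME</li><li>VOTE</li><li>ELECTION</li><li>NEWS</li></ul>"
                    + "<p>Meaningful text about swimming pools and maintenance schedules.</p>"
                    + "</body></html>";

            // ArticleExtractor is tuned for news/article pages; DefaultExtractor is more generic.
            System.out.println("Article extractor: " + ArticleExtractor.INSTANCE.getText(html));
            System.out.println("Default extractor: " + DefaultExtractor.INSTANCE.getText(html));
        }
    }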
You can also take a peek at NUTCH-1870, which lets you specify XPath expressions to extract specific parts of the webpage as separate fields; using this with the right boost parameters in Solr could improve your precision.
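
Purely as an illustration of the XPath idea (this is not the actual NUTCH-1870 plugin code), here is a small standalone Java sketch that pulls only a specific div out of well-formed markup as its own field:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    public class XPathFieldDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical page with a content div and a menu div.
            String xhtml = "<html><body>"
                    + "<div class=\"content\"><p>Meaningful text about elections.</p></div>"
                    + "<div class=\"menu\"><a href=\"/vote\">VOTE</a></div>"
                    + "</body></html>";

            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Pull only the "content" div into its own field, ignoring the menu div.
            NodeList hits = (NodeList) xpath.evaluate("//div[@class='content']//p/text()", doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                System.out.println("content field: " + hits.item(i).getNodeValue().trim());
            }
        }
    }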

Tree view navigation, good idea?

I'm thinking of using a tree view for page navigation in my web application, similar to Windows Explorer. There are a lot of things for administrators to configure in the application so I figured listing all links in a single page in tree form would keep things organized. Related page links are grouped in a "folder", and all folders will show closed initially.
Obviously, this page is for administrators only, so they'd be provided with some training. That being said, is this a good design from the user's point of view? Do you see any usability or potential implementation issues?
The best answer involves empirical evidence. A yes or no answer could really vary based on the specific task and your intended audience. Try doing a simple 5-minute usability test with your users. Draw out your page layouts on paper and have a couple of users pretend to use the site (see Paper Prototyping). Give them a few simple tasks to complete using your interface and observe what they do.
If they get confused or have trouble with the concept, then it's probably best to find another way to provide navigation.
It totally depends on how your users are using your site. If they're often jumping from one part of the site to a completely different, unrelated place in the site, a tree may be the best way to let them quickly find that "other page" they were looking for.
However, for the vast majority of websites I've ever seen or used, I'd prefer to find what I'm looking for either via Search functionality, or by links on the page I'm looking at that lead me to related data.

Crawling data or using API

How do these sites gather all the data - questionhub, bigresource, thedevsea, developerbay?
Is it legal to show data in a frame as bigresource does?
#amazed
EDITED : fixed some spelling issues 20110310
How do these sites gather all the data - questionhub, bigresource ...
Here's a very general sketch of what is probably happening in the background at a website like questionhub.com (a toy code sketch follows this outline):
Spider program (google "spider program" to learn more)
a. configured to start reading web pages at stackoverflow.com (for example)
b. run the program so it goes to the home page of stackoverflow.com and starts visiting all links that it finds on those pages
c. returns the HTML data from all of those pages
Search index program
Reads the HTML data returned by the spider and creates a search index, storing the words that it found AND the URLs those words were found at
User interface web page
Provides a feature-rich user interface so you can search the sites that have been spidered.
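Here is a toy version of the spider and index steps in Java, assuming the jsoup library for fetching and parsing pages. A real spider would add politeness delays, robots.txt handling and a proper index such as Lucene/Solr:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.*;

    public class TinySpider {
        // word -> set of URLs where the word was seen (the "search index")
        private final Map<String, Set<String>> index = new HashMap<>();
        private final Set<String> visited = new HashSet<>();

        public void crawl(String url, int depth) {
            if (depth < 0 || !visited.add(url)) return;
            try {
                Document doc = Jsoup.connect(url).get();          // (1) fetch the page
                for (String word : doc.text().toLowerCase().split("\\W+")) {
                    index.computeIfAbsent(word, k -> new HashSet<>()).add(url); // (2) index word -> URL
                }
                for (Element link : doc.select("a[href]")) {      // (3) follow links found on the page
                    crawl(link.absUrl("href"), depth - 1);
                }
            } catch (Exception e) {
                // skip pages that fail to download or parse
            }
        }

        public Set<String> search(String word) {                  // (4) the user-facing search
            return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
        }

        public static void main(String[] args) {
            TinySpider spider = new TinySpider();
            spider.crawl("https://stackoverflow.com/", 1);
            System.out.println(spider.search("questions"));
        }
    }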
Is it legal to show data in a frame as bigresource does?
To be technical, "it all depends" ;-)
Normally, websites want to be visible in Google, so why not in other search engines too.
Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML OR changing the formatting to fit their standard visual styling.
A remote site can 'request' that spiders do NOT go through some or all of its web pages by adding rules to a well-known file called robots.txt. Spiders do not have to honor the robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those IP addresses from looking at anything on the website. You can find plenty of information about robots.txt here on Stack Overflow OR by running a query on Google.
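For illustration only, here is a very rough Java sketch of how a polite spider might read the "User-agent: *" rules from robots.txt before fetching pages. It ignores large parts of the real spec (Allow lines, wildcards, per-bot sections), so treat it as a sketch rather than a correct implementation:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            // Collect the Disallow paths listed under "User-agent: *".
            List<String> disallowed = new ArrayList<>();
            boolean inStarSection = false;
            URL robots = new URL("https://stackoverflow.com/robots.txt");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        inStarSection = line.substring(11).trim().equals("*");
                    } else if (inStarSection && line.toLowerCase().startsWith("disallow:")) {
                        disallowed.add(line.substring(9).trim());
                    }
                }
            }
            // Check a candidate path against the collected rules before fetching it.
            String candidate = "/users/login";
            boolean blocked = disallowed.stream().anyMatch(p -> !p.isEmpty() && candidate.startsWith(p));
            System.out.println(candidate + (blocked ? " is disallowed" : " is allowed") + " for generic spiders");
        }
    }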
There are several industries (besides Google) built around what you are asking. There are tags on Stack Overflow for search-engine and search; read some of those questions/answers. Lucene/Solr are open-source search engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
I hope this helps.
P.S. As you appear to be a new user, if you get an answer that helps you, please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)

Distinguishing features of a blog, i.e. difference between a blog and a normal site

I'm looking at things that can distinguish a blog from a normal website. These are things that a program needs to be able to identify from the HTML of a website, or particular features that a site supports, for example pings. The same goes for news websites.
I'm working on a blog/news monitor program; it will index sites to automatically determine whether each one is a blog or a news site, and then monitor user feedback in comments etc. on posts from sites that it determines to be of a blog or news nature.
So what I'm really after is suggestions on what I can use or look out for in identifying these sites.
It's going to be a desktop app written in Java, so if you have any code specifics in Java that'll be great.
Thanks in advance.
You can search the page for the word "blog", as this will probably be present. Specifically, you can look for it in parts of the HTML page, or exclude parts - like links. This will give you a decent starting point.
Ultimately, though, this is something that will have to be done manually. You should construct an interface for people to specify if it's a blog or news site, or different features of it, when the site is submitted. Then you should create a database of sites and features, and flag them so that you or another administrator can review them and make changes. Once you do this for a site, you'll never need to do it again, so for example http://*.wordpress.com/ is all going to be blogs.
Some features you can automatically detect or get a pretty good chance of detecting, but ultimately you will need a manual review.
Look for a discoverable RSS or Atom feed, which should be present on a blog or serially-updated news site.
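Since you asked for Java specifics, here is a small sketch, assuming the jsoup library and a hypothetical target URL, of the automatic signals mentioned above: an advertised RSS/Atom feed, a generator meta tag (e.g. WordPress), and the crude "does the page say blog" check:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class BlogSniffer {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point this at whatever site you want to classify.
            Document doc = Jsoup.connect("https://someblog.example.com/").get();

            // 1. Discoverable RSS/Atom feeds are advertised with <link> tags in the <head>.
            for (Element feed : doc.select("link[rel=alternate][type~=(rss|atom)]")) {
                System.out.println("feed: " + feed.attr("type") + " -> " + feed.absUrl("href"));
            }

            // 2. Many blog engines identify themselves in a generator meta tag (e.g. WordPress, Blogger).
            Element generator = doc.selectFirst("meta[name=generator]");
            if (generator != null) {
                System.out.println("generator: " + generator.attr("content"));
            }

            // 3. Crude fallback: does the visible text mention "blog" at all?
            System.out.println("mentions 'blog': " + doc.text().toLowerCase().contains("blog"));
        }
    }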

Generating a static website from a set of content data (possibly with webgen, webby or a similar toolkit)

My company (an engineering firm) is looking to redesign their website with some dynamic content. We have a nice portfolio of projects that we'd like to present on our site by category.
To elaborate, I'd like to have a "Projects Category" menu, where you can choose a sub-project category (such as churches, schools, etc) which links to a page with images of all projects which have been tagged with that category attribute. Clicking on an image would then take you to a detailed page for that project.
I have done a good bit of asp and jsp page development, but I've always worked on the front end in an enterprise environment - I've never built a production site from the back end. The advice I've gotten so far is that a full-blown CMS solution would be somewhat overkill, as we won't have a large hit count, and we'll be displaying a few hundred projects at most.
One big-picture choice I appear to have is whether to dynamically generate the pages (with ASP or JSP) or to use a tool to generate a set of static HTML pages. The tool would build the menus, project summary pages, and individual project pages based on a set of data I could provide (in the form of a database or text file).
I'm leaning towards trying to use a tool like webgen or webby to statically generate the site due to our current web hosting situation. Any thoughts on which approach is more appropriate? Is webgen or webby capable of doing what I am trying to do? Or can anyone recommend other web authoring tools better equipped to accomplish this?
Thanks for any feedback!
You could always use Template Toolkit :)
Jekyll may be worth a look.
Refer: https://github.com/jekyll/jekyll/wiki/
I've been told that webgen can't do what I'm trying to do (without manually coding some extensions myself) but that nanoc can.
http://nanoc.stoneship.org/
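
To make the static-generation idea concrete, here is a minimal Java sketch of what such a tool does under the hood: it reads project data from a hypothetical projects.txt file (one "name|category|description" line per project) and writes one listing page per category plus one detail page per project. Real tools like nanoc or Jekyll layer templating languages, layouts and asset handling on top of this:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class PortfolioGenerator {
        public static void main(String[] args) throws IOException {
            // Hypothetical data file: one project per line, "name|category|description"
            List<String[]> projects = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get("projects.txt"))) {
                projects.add(line.split("\\|"));
            }

            Path out = Paths.get("site");
            Files.createDirectories(out);

            // Group projects by category and emit one listing page per category...
            Map<String, List<String[]>> byCategory = new TreeMap<>();
            for (String[] p : projects) {
                byCategory.computeIfAbsent(p[1], k -> new ArrayList<>()).add(p);
            }
            for (Map.Entry<String, List<String[]>> e : byCategory.entrySet()) {
                StringBuilder html = new StringBuilder("<h1>" + e.getKey() + "</h1><ul>");
                for (String[] p : e.getValue()) {
                    html.append("<li><a href=\"").append(p[0]).append(".html\">").append(p[0]).append("</a></li>");
                }
                html.append("</ul>");
                Files.writeString(out.resolve("category-" + e.getKey() + ".html"), html.toString());
            }

            // ...and one detail page per project.
            for (String[] p : projects) {
                Files.writeString(out.resolve(p[0] + ".html"),
                        "<h1>" + p[0] + "</h1><p>" + p[2] + "</p>");
            }
        }
    }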