How can I get started on programmatically analyzing web site content? - language-agnostic

I've been looking for a new hobby programming project, and I think it would be interesting to dabble in ways to programmatically gather information from websites and then analyze that data to do things like aggregate or filter it. For example, if I wanted to write an application that could take Craiglist listings and then do something like display only the ones matching a specific city not just a geographical area. That's just a simple example, but you could go as advanced and sophisticated as how Google analyzes a site's content to know how to rank it.
I know next to nothing about that subject and I think it would be fun to learn more about it, or hopefully do a very modest programming project in that topic. My problem is, I know so little that I don't even know how to find more information about the subject.
What are these types of programs called? What are some useful keywords to use when searching on Google? Where can I get some introductory reading material? Are there interesting papers I should read?
All I need is someone to disabuse me of my ignorance, so that I can do some research on my own.

cURL (http://en.wikipedia.org/wiki/CURL) is a good tool to fetch a website's contents and hand it off to a processor.
If you are proficient with a particular language, see if it supports cURL. If not, PHP (php.net) may be a good place to start.
When you have retrieved a website's content via cURL, you can use the language's text processing functionality to parse the data. You can use regular expressions (http://www.regular-expressions.info/) or functions such as PHP's strstr() to find and extract the particular data you seek.

Programs that "scan" other sites are usually called web crawlers or spiders.

I recently completed a project that uses Google Search Appliance that basically crawls the whole .com domain of the web server.
GSA is very powerful tool that pretty much indexes all the urls it encounters and serves the results.
http://code.google.com/apis/searchappliance/documentation/60/xml_reference.html

Related

GEDCOM to HTML and RDF

I was wondering if anyone knew of an application that would take a GEDCOM genealogy file and convert it to HTML format for viewing and publishing on the web. I'd like to have separate html files for each individual and perhaps additional files for other content as well. I know there are some tools out there but I was wondering if anyone used any tools and could advise on this. I'm not sure what format to look for such applications. They could be Python or php files that one can edit, or even JavaScript (maybe) or just executable files.
The next issue might be appropriate for a topic in itself. Export of GEDCOM to RDF. My interest here would be to align the information with specific vocabularies, such as BIO or REL which both are extended from FOAF.
Thanks,
Bruce
Like Rob Kam said, Ged2Html was the most popular such program for a long time.
GRAMPS can also create static HTML sites and has the advantage of being free software and having a native XML format which you could easily modify to fit your needs.
Several years ago, I created a simple Java program to turn gedcom into xml. I then used xslt to generate html and rdf. The html I generate is pretty rudimentary, so it would probably be better to look elsewhere for that, but the rdf might be useful to you:
http://jay.askren.net/Projects/SemWeb/
There are a number of these. All listed at http://www.cyndislist.com/gedcom/gedcom-to-web-page-conversion/
Ged2html used to be the most popular and most versatile, but is now no longer being developed. It's an executable, with output customisable through its own scripting syntax.
Family Historian http://www.family-historian.co.uk will create exactly what you are looking for, eg one file per person using the built in Web Site creator. As will a couple of the other Major genealogy packages. I have not seen anything for the RDF part of your question.
I have since tried to produce a Genealogy application using Semantic MediaWiki - MediaWiki, the software behind Wikipedia, and Semantic MediaWiki includes various extensions related to the Semantic Web. I thought it is very easy to use with the forms and the ability to upload a GEDCOM but some feedback from people into genealogy said that it appeared too technical and didn't seem to offer anything new.
So, now the issue is whether to stay with MediaWiki and make it more user friendly or create an entirely new application that allows for adding and updating data in a triple store as well as displaying. I'm not sure how to generate a family tree graphical view of the data, like on sites like ancestry.com, where one can click on a box to see details about the person and update that info or one could click on a right or left arrow around a box to navigate the tree. The data comes from SPARQL queries sent to the data set/triple store both when displaying the initial view and when navigating the tree, where an Ajax call is needed to get more data.
Bruce

MySQL-based wiki that is suitable for custom applications?

I develop an online, Flash-based multiplayer game. It is a complex game, and requires a lot of documentation to fully explain it to our users. Ideally, I would like to find MySQL-based wiki software that can provide these editable documentation pages outside of Flash (in the HTML realm) but also within Flash for convenience, and so that players can refer to the information without interrupting their game or having to switch back-and-forth between browser tabs. I am expecting that I would need to do a lot of the work on the Flash side myself, as far as formatting, for example, but I would like to feel comfortable in querying the wiki's database to get info directly. I guess this means that I need a wiki that is structured relatively "flat" or intuitively so that I can do things like:
Run a MySQL query that returns a list of all the articles (their titles and IDs) in the wiki
For each article ID in the wiki, return the associated content
This may mean that I have to limit the kinds of formatting I put into the wiki -- things like tables would probably be omitted since they would be very difficult, if not impossible, for me to do on the Flash side. And that is fine!
Basically I am just looking for suggestions for wiki software that is pretty easy to use, but mostly is technically simple enough on the back-end that interfacing with it directly via MySQL is not difficult. When interfacing with the database directly, I only need to READ data. Any time the wiki would be edited or added to would be done via the wiki's actual front-end application.
Thanks for any suggestions!
MediaWiki is the best-known and best-supported MySQL-based Wiki, used for plenty of complex game documentation projects like MinecraftWiki. The database is not all that simple, but it's well documented and basic read operations aren't too hard. For example, here's how to fetch the current content of the page "MyPage":
SELECT old_text FROM page,revision,text WHERE page.page_title="MyPage" AND
page.page_id=revision.rev_page AND revision.rev_text_id=text.old_id;
(And yes, old_text is the current content of the page. Don't ask me why!)
Your main problem will be figuring out how to parse MediaWiki markup, there are plenty of parsers for it but I'm not aware of anything that would work in Flash.

What are the practical benefits of using microformats for every possible thing?

What practical benefits can my client get if I use microformats on his site for every possible thing?
How can I explain these benefits to a non-technical client?
Sometimes it seems like the practical benefits are hard to quantify.
Search engines already pick up and parse microformats (see e.g. https://support.google.com/webmasters/answer/99170). I believe hCard and hCalendar are fairly well supported--and if not, plenty of sites are using it, including places like MySpace.
It's the idea that adding CSS classes and specified IDs make your existing content easier to parse in a machine-readable manner.
hReview is starting to make some inroads, and hResume looks like it take off too.
I heavily use rel="nofollow" on uncontrolled links (3rd party sources) which is actually a microformat.
Check the microformats wiki for a decent starting point.
It just means your viewers can share a few generic "formats". You can generalize stylesheets, and parsing mechanisms. Rather than having a webpage consist of one "html document," you have a webpage that consist of "10 formatted micro-documents".
If you need a real world analog: think of it like attaching a formatted invoice, to a receipt, and a business card, rather than writing it all down on notebook paper with your left hand.
Overall the site becomes easier to digest for the rest of the internet. The data can be reused, combined, cross-referenced, and saved.
A simple example would be to have anywhere on the site a latitude and a longitude (geo). With Microformats, anybody that searches for that latitude and longitude can be easily referenced to their website, increasing traffic, awareness of that person / company, and allow users to easily save that information. (Although I've encountered little of this personally, this is more of 'the future' of things than it is current. But always good to stay up to date).
A second example would be a business card (hCard) where a browser can easily save and transfer it to an address book, so that just one visit to the site and the visitor has the information saved locally. Especially useful if they're getting hits from a cell phone.
I wouldn't recommend using microformats for "every possible thing". Use them for things where you get some benefit, in exchange for the effort of using them.
The main practical benefit I'm aware of is customised search engine results:
https://support.google.com/webmasters/answer/99170
Technically, Google now prefers this to be implemented using microdata (i.e. itemprop attributes) rather than microformats, but it's the same idea.
Having a micro-format can be better than no format since it lets you save every possible thing in the application.
A micro-format for every possible thing can be better than a standard format only because: it's quicker to create so it costs less and it take less space than some standard formats, like XML.
But all this depends on the context of the application and so you must explain it to the client in that context.
microformatting your content extends its reach in every, which way possible. using your sites structure as its "api" the possibilities are what you set your limits too

How do you database access (I/O) to/from Magento Commerce?

So, I want to import, export and modify the database. I have read that I have to do that by XML, but I don't really understand their doc system and I haven't found any good tutorials out there that explain this. I am slowly reading the very expensive and short book which is somewhat answering my questions, but I crave more.
As a second question, I want to have a order system where I can send out information or emails with my own code. I assume this would be some type of plug-in that would override or be called at a certain time. Any info would be helpful.
Some parts of the magento data can be imported/exported via the backend (System->Import/Export), namely products and customers.
If you want to deal with the complete DB - use your DB tool of choice (I prefer mysqldump).
When dealing with exported CSV.. use OpenOffice, from my experience it deals better with the separation characters than Excel.
As for your second question - as far as I understood, you will have to develop a module if you want to do something different than the existing functionality and keep the original mail functions. If you don't want to/have to keep the original functions, you can opt to overwrite the module, which is much easier as far as I can see. Google search for "overriding magento module" should turn up atleast one decent tutorial.
I found what I was looking for here:
(on magento site: Resources -> Magento Core API -> Product API or whichever API you want)
The problem is there is no Order API yet (or none that I've seen)
http://www.magentocommerce.com/wiki/doc/webservices-api/api/catalog_product#examples
This details how you'd write an external php script and obtain,edit or delete products (or anything else with an API).
Modules still look daunting, but I am reading through the (very thin) magento book (the only one available).
I hope this helps someone else.

Best open source, extendable crawler to use for image crawling

We are in the starting phase of a project, and we are currently wondering
whether which crawler is the best choice for us.
Our project:
Basically, we're going to set up Hadoop and crawl the web for images.
We will then run our own indexing software on the images stored in HDFS
based on the Map/Reduce facility in Hadoop. We will not use other indexing
than our own.
Some particular questions:
Which crawler will handle crawling for images best?
Which crawler will best adapt to a distributed crawling system, in which we
use many servers conducting crawling together?
Right now these look like the 3 best options-
Nutch: Known to scale. Doesn't look like the best option because it seems that is it tied closely to their text searching software.
Heritrix: Also scales. This one currently looks like the best option.
Scrapy: Has not been used on a large scale (not sure though). I dont know if it has the basic stuff like URL canonicalization. I would like to use this one because it is a python framework (I like python more than java), but I don't know if they have implemented the advanced features of a web crawler.
Summary:
We need to get as many images as possible from the web. Which existing crawling framework is both scalable and efficient , but also the one which will be the easiest to modify to get only images?
Thanks!
http://lucene.apache.org/nutch/
I would think going with something with the broadest use and support (community support) would be the better approach.
Nutch may be a good option because you want to end up on HDFS. It may be useful to look into the HBase integration that are currently in the works (NUTCH-650).
You may be able to get the data you need by skipping the index step at the end and instead look at the segments themselves.
However for flexibility another option may be Droids: http://incubator.apache.org/droids/. It's still in the incubator phase at apache, but worth looking at.
You may get some ideas by looking at the SimpleRuntime example in the org.apache.droids.examples. Perhaps by replacing the Sysout handler with one that stores the images onto HDFS that may give you what you want.