Automatic web crawler - JSON

I'm writing a crawler which needs to get data from many websites. The problem is that every website has a different structure. How can I easily write a crawler which correctly downloads data from many different websites? If the structure of a website changes, will I need to rewrite the crawler, or are there other methods?
What conceptual and practical tools can be used to improve the quality of data mined by an automatic web crawler, given that many websites with different structures are involved?
Thank you!

I presume you want to query the data in some way, in which case you should store it in a flexible data store. A relational database is not fit for purpose because it imposes a strict schema, but something like MongoDB lets you store semi-structured data without having to define a schema up front, while still providing a powerful query language.
The same goes for how you represent the data in the crawler code. Don't map the data to classes where the structure is defined up front; use flexible data structures that can change at runtime. If you are using Java, de-serialise the data into HashMaps. In other languages these might be called dictionaries or hashes.
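For example, a minimal sketch in Java using the Jackson library (my choice for the example; any JSON library with a Map/tree mode works the same way), with a made-up payload:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class FlexibleRecord {
    public static void main(String[] args) throws Exception {
        // JSON scraped from some site; the fields will differ from site to site
        String json = "{\"title\": \"Example\", \"price\": 9.99, \"tags\": [\"a\", \"b\"]}";

        ObjectMapper mapper = new ObjectMapper();
        // Deserialise into a Map instead of a fixed class, so new or missing
        // fields don't require any code changes
        Map<String, Object> data =
                mapper.readValue(json, new TypeReference<Map<String, Object>>() {});

        System.out.println(data.get("title"));  // Example
        System.out.println(data.keySet());      // whatever fields this site provided
    }
}

The same map can then be written straight into a schemaless store such as MongoDB.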

If you're scraping data from websites that actually want to allow you to do that, chances are they will provide some sort of webservice to allow you to query their data in a structured way.
Otherwise, you're on your own, and you might even be violating their terms of use.

If the websites provide no APIs, then you're out of luck and have to write a separate extraction module for each data format you encounter. If a website changes its format, you have to update the corresponding module. A standard approach is to have a plugin for every website you crawl, plus a testing framework that does regression testing against data you've already collected. When a test fails you know something has changed, and you can investigate whether you need to update the plugin or whether there is another issue.
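To illustrate the plugin-plus-regression-test idea, here is a rough Java sketch; the interface and class names are invented for the example:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical plugin contract: one implementation per website you crawl
public interface SiteExtractor {
    boolean canHandle(String url);               // does this plugin know the site?
    Map<String, Object> extract(String html);    // pull the fields out of a page
}

// A tiny regression harness: re-run a plugin against pages saved earlier and
// compare a known field. If a site changes its markup, this is what fails.
class RegressionCheck {
    static boolean stillWorks(SiteExtractor plugin, Path savedPage,
                              String field, Object expected) throws Exception {
        String html = Files.readString(savedPage);
        return expected.equals(plugin.extract(html).get(field));
    }
}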
Without knowing what kind of data you're collecting, it's very difficult to hypothesize about ways to improve the "quality" of the mined data.

You could also find out whether the website offers API access to its data; if so, you can pull that structured data into your own site directly. If not, you may need site-specific plugins. Alternatively, you could turn to a crawling service that exposes an API, such as Octoparse, and consume that API from your own web crawler.

Related

Best practice to use several APIs or data sources for one application

I want to build an application that uses data from several endpoints.
Let's say I have:
JSON API for getting cinema data
XML Export for getting data about ???
Another JSON API for something else
A csv-file for some more shit ...
In my application I want to bring all this data together and build views for it and so on ...
My idea was to set up a database and create schemas for all these data sources, so I can write some kind of "import scripts" which I can call whenever I want to get the latest data.
I thought of schemas because I want to be able to easily adopt a new API with any kind of schema.
Please enlighten me of the possibilities and best practices out there (theory and practice if possible :P)
You are right to set up a database. But the real problem is probably not going to be how to store your data; it's going to be how to make it fit together logically and semantically.
I suggest you first take a good look at what your endpoints can provide. Get several samples from every source and analyze them if you can. How will you know which data is new? How can you match it against existing data and against data from other sources? If existing data changes or gets deleted, how will you detect and handle that? What if sources disagree on something? How and when should you run the synchronization? What will you do if one of your sources goes down? Etc.
It is extremely difficult to make data consistent if your data sources are not. As a rule, if the sources are different, they are not consistent; hence the proverb "garbage in, garbage out". We humans have no problem dealing with small inconsistencies, but algorithms cannot work correctly if there are discrepancies. Even if everything fits together on paper, one usually forgets that data can change over time...
At least that's my experience in such cases.
I'm not sure whether you want to display all the data in the same view in the application or create different views for each source. If you want to display the data in the same view, like a grid, I would recommend using inheritance or an interface, depending on your data and needs (see the sketch below). I would also set this structure up in the database, using different tables for the different sources and a parent table, carrying a type, related to all of them.
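As a rough sketch of the interface approach (in Java rather than C#, and with invented names, but the shape is the same in either language):

import java.time.LocalDate;
import java.util.List;

// Hypothetical common shape for anything the grid can display,
// regardless of which source it came from
interface GridItem {
    String title();
    LocalDate date();
    String sourceType();   // e.g. "cinema-json", "xml-export", "csv"
}

// One record type per source implements the same interface
record CinemaShowing(String title, LocalDate date) implements GridItem {
    public String sourceType() { return "cinema-json"; }
}

record CsvRow(String title, LocalDate date) implements GridItem {
    public String sourceType() { return "csv"; }
}

class Grid {
    // The view only ever sees GridItem, so adding a new source means adding
    // one new record type, not touching the grid code
    static void render(List<GridItem> items) {
        items.forEach(i -> System.out.println(i.sourceType() + ": " + i.title()));
    }
}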
Here's a good thread with discussion about choosing an interface or inheritance.
Inheritance vs. interface in C#
And here are some examples of representing inheritance in a database.
How can you represent inheritance in a database?

Creating a REST API for static hosting

I know this sounds crazy, but I had a thought and was willing to try it out. I use GitLab Pages for all my online projects, but a lot of them are ASP.NET MVC, which is an issue, as I don't think you can run ASP.NET MVC sites on GitLab Pages. Then I thought: what if I make a site using something like Angular or Node.js, and have a central API for all my web projects? I thought that was a great idea, until I realized I couldn't use a database either. I guess what I'm asking is: would it be possible to create a REST API that uses JSON files for storage and Node.js to serve the requests, so as to create an API without a database?
Of course.
If you think about a database from the perspective of your application code, it is basically just a place to store and retrieve data.
Imagine the database library you are using has two simple methods, store and retrieve. In your application code, you could write db.store('here is the item') and then, later on, db.retrieve().
However, those store and retrieve methods could be implemented in many different ways to provide the same effective behavior from the perspective of your application. Some examples:
Send/query the data to/from an external data store, such as PostgreSQL
Write it to a file on disk and read it back later
Store the data in memory
Make HTTP requests to an external system to store the data
Some of these options will be more or less appropriate depending on your exact requirements, however, the general idea is that given a database API, you could implement the exact same method signatures with a completely different approach.
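As a rough sketch of the idea (Java here purely for illustration; the same shape carries straight over to Node.js), with an invented one-item-per-line file backend:

import java.nio.file.*;
import java.util.*;

// Hypothetical storage abstraction: the application only sees store/retrieve
interface SimpleStore {
    void store(String item) throws Exception;
    List<String> retrieve() throws Exception;
}

// One possible backend: a plain file on disk, one item per line.
// Fine for a small static-hosted project; no database involved.
class FileStore implements SimpleStore {
    private final Path file;

    FileStore(Path file) { this.file = file; }

    public void store(String item) throws Exception {
        Files.writeString(file, item + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public List<String> retrieve() throws Exception {
        return Files.exists(file) ? Files.readAllLines(file) : List.of();
    }
}

Swapping FileStore for a real database later only means writing another implementation of the same interface; the application code doesn't change.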

Using CouchDB and SQL Server side by side

We currently have a nicely relational SQL Server 2008 database that is our master application database. We are looking to replace an existing document storage mechanism, which uses XML data types, with something more schemaless that can handle similar but not identical documents, and we thought CouchDB would be a good fit.
The idea is that the common metadata about the documents is stored in SQL Server for ease of display/aggregation/reporting, while the actual documents are stored in CouchDB to handle the subtle differences between them, making the most of the two different technologies.
For example, the status, type, related person and creation date would all be common across documents and stored in SQL, but an email and a letter (which obviously have different fields) would be stored in CouchDB.
We can then display our document grid for all types of document (thousands of docs), which can be queried through SQL, while the display of a document gets its data from CouchDB when the user requests to view it.
Something to bear in mind is that some document types are generated from templates that are also documents themselves (think mail merge/find and replace).
The application layer is ASP.NET 4.5, C#, the repository pattern, Windsor for IoC, and JavaScript.
So, to the question...
Is this approach a sensible way to make the most of the two differing data storage paradigms?
Are we making our programming lives needlessly complex in the desire to "use the most appropriate technology for the problem"?
Does anyone have any experiences of trying something similar and if so, how did it go?
It's really not uncommon to use two different storage formats for a document: One for searchable aspects and metadata and another for presentation.
Looking at it in a more general way, the approach is somewhat similar to the one we developed at the Royal Danish Library and pushed in the Planets EU project:
http://www.researchgate.net/publication/221176211_Archive_Design_Based_on_Planets_Inspired_Logical_Object_Model
Here's another paper that discusses this approach in a more general way:
"Opening Schrödingers Library"
The goal was archiving. We recognized that when converting documents for archiving or preservation, no single storage format was superior in all aspects of preserving the attributes, formats, looks, contents, etc. of the original document. The solution: convert to several formats, and use a sophisticated digital object to track the conversions and record which aspects of the original were best preserved in which conversion.
So in my opinion the approach is theoretically and practically sound.
Practical issues: you will probably need some sort of digital object that keeps track of the various parts of a document, e.g. whether it occurs in one system only (and if so, which one), or in both. It seems that you are going to use SQL Server for this aspect, and that sounds sensible.
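A minimal sketch of what that tracking object might look like (Java syntax for brevity, although the application layer here is C#; all field names are invented):

import java.time.LocalDateTime;

// Hypothetical shape of the row kept in SQL Server: only the common, queryable
// metadata, plus a pointer to the full document body stored in CouchDB
record DocumentMetadata(
        long id,                 // SQL primary key
        String status,
        String type,             // e.g. "email", "letter"
        String relatedPerson,
        LocalDateTime created,
        String couchDocId) {}    // _id of the corresponding CouchDB document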
We actually did implement the object model we describe in the paper, and last I heard they are still using it.

Best way to create a SPARQL endpoint for an RDBMS (MySQL database)

I am doing (want to do) some experiments with Linked Open Datasets particularly those put out by governments.
I have an RDBMS (more specifically, MySQL). I designed it with semantic web ideas in mind, i.e. I have information stored as objects, predicates and classes which define objects. In turn, all objects are related to each other through statements of the form subject --> predicate --> object (where the subjects come from the objects table).
I want to be able to query other RDF triple stores from my application and let other triple stores query my data. Is it possible to "set something up" so that this is possible?
I have looked at Jena. Using Jena seems to mean I have to use it as the storage layer rather than MySQL. The only problem with this is that I include a new concept called a category (which I don't think is part of the semantic web languages). I will use categories to help with displaying information (they don't have any other meaning), but using Jena seems to mean that I can't organise predicates under categories for more convenient viewing.
I am using Java, so a Java API is preferred.
It's also possible I misunderstood the purpose of Jena, and maybe that can be of use, but I am not sure how.
I am sure four days from now this question will seem rather silly, but at the moment I am somewhat confused about how to proceed.
I'm not sure what you mean by "a new concept called category", perhaps you can give an example?
If you mean that you want to add additional metadata, perhaps as a way of organizing information in the user interface, there is no need to extend the semantic web languages or storage systems - they can already do what you want.
Suppose you have data for a school from the UK Government schools dataset (using Turtle encoding for brevity):
@prefix sch-ont: <http://education.data.gov.uk/def/school/> .

<http://education.data.gov.uk/id/school/135412>
    a sch-ont:School ;
    sch-ont:establishmentStatus
        <http://education.data.gov.uk/def/school/EstablishmentStatus_Open> ;
    sch-ont:MSOA <http://statistics.data.gov.uk/id/msoa/E02000001> ;
    sch-ont:establishmentName "Guildhall School of Music and Drama" ;
    ...
You can directly query that data from the SPARQL end-point, or you can download the data and store it locally in your own triple store. Either way, you're perfectly at liberty to add extra information that's useful to your users. For example:
@prefix ankurs-app: <http://ankur.org/example/app/vocab/display#> .

<http://education.data.gov.uk/id/school/135412>
    ankurs-app:category ankurs-app:wkdCool .
You can store this new triple in the same graph as the downloaded data, or you can store it in a separate named-graph to indicate that it's information that has a different provenance than the source data. Either way, it's then simple to query it either programmatically from Jena, or via a SPARQL query.
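For instance, a sketch using Apache Jena (package names assume a reasonably recent Jena release; the category property and value are the made-up ones from above):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class CategoryExample {
    public static void main(String[] args) {
        String APP = "http://ankur.org/example/app/vocab/display#";

        Model model = ModelFactory.createDefaultModel();
        // In practice you would already have loaded the downloaded school
        // data here, e.g. model.read("schools.ttl")

        // Add the application-specific category triple
        Resource school = model.createResource(
                "http://education.data.gov.uk/id/school/135412");
        school.addProperty(model.createProperty(APP, "category"),
                           model.createResource(APP + "wkdCool"));

        // Query it back with SPARQL
        String q = "PREFIX app: <" + APP + "> " +
                   "SELECT ?s WHERE { ?s app:category app:wkdCool }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("s"));
            }
        }
    }
}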
Doing a layout for efficiently querying schemaless triple-centric data is a well-studied, and hard, problem. Most of the RDF platforms, including Jena, have well-optimised code for querying and updating triples from their own database schemes. You would have to have very good reasons for embarking on your own relational table layout :)
If you really do need to take an existing relational table scheme and map it to a Jena RDF model, look at D2RQ.
Why didn't you just use a triple store to store all of your data? If you use a triple store with SPARQL endpoint capability then you would have a SPARQL-accessible web api. Similarly, many other data sets on the web are exposed as SPARQL endpoints and accessible via HTTP.
There are many triple stores available with persistent storage both in a db and otherwise (Jena + SDB, Mulgara, Virtuoso, Oracle, etc). You could certainly extend Mulgara through their resolvers to support queries against your custom db but I think that's probably a lot of work for not too much real value.
I'm sure you could use existing concepts to handle your notion of categories in RDF or perhaps by layering something over Jena.

Can I run an HTTP GET directly in SQL under MySQL?

I'd love to do this:
UPDATE table SET blobCol = HTTPGET(urlCol) WHERE whatever LIMIT n;
Is there code available to do this? I know this should be possible, as the MySQL docs include an example of adding a function that does a DNS lookup.
MySQL / Windows / preferably without having to compile stuff, but I can.
(If you haven't heard of anything like this but would expect to have if it existed, a "proly not" would be fine.)
EDIT: I know this would open a whole can of worms security-wise; however, in my case the only access to the DB is via the mysql console app. It is not a world-accessible system. It is not a web back end. It is only a local data-logging system.
No, thank goodness — it would be a security horror. Every SQL injection hole in an application could be leveraged to start spamming connections to attack other sites.
You could, I suppose, write it in C and compile it as a UDF. But I don't think it really gets you anything in comparison to just SELECTing in your application layer and looping over the results doing HTTP GETs and UPDATEing. If we're talking about making HTTP connections, the extra efficiency of doing it in the database layer will be completely dwarfed by the network delays anyway.
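For what it's worth, a hedged sketch of that application-layer loop in Java (the column names blobCol and urlCol come from the question; the table name, id column, JDBC URL and credentials are placeholders, and a MySQL JDBC driver is assumed to be on the classpath):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.*;

public class FetchIntoBlob {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
             Statement sel = conn.createStatement();
             ResultSet rs = sel.executeQuery(
                     "SELECT id, urlCol FROM mytable WHERE blobCol IS NULL");
             PreparedStatement upd = conn.prepareStatement(
                     "UPDATE mytable SET blobCol = ? WHERE id = ?")) {

            while (rs.next()) {
                // Do the HTTP GET in the application layer...
                HttpRequest req = HttpRequest.newBuilder(
                        URI.create(rs.getString("urlCol"))).GET().build();
                byte[] body = http.send(req,
                        HttpResponse.BodyHandlers.ofByteArray()).body();

                // ...and write the result back with a plain UPDATE
                upd.setBytes(1, body);
                upd.setLong(2, rs.getLong("id"));
                upd.executeUpdate();
            }
        }
    }
}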
I don't know of any function like that as part of MySQL.
Are you just trying to retrieve HTML data from many URLs?
An alternative solution might be to use Google spreadsheet's importHtml function.
Google Spreadsheets Lets You Import Online Data
Proly not. Best practice in a web environment is to have database servers isolated from the outside world, in both directions, meaning that the DB server wouldn't be allowed to fetch stuff from the internet.
Proly not.
If you're absolutely determined to get web content from within an SQL environment, there are, as far as I know, two possibilities:
Write a custom MySQL UDF in C (as bobince mentioned). This could potentially be a huge job, depending on your experience with C, how much security you want, and how complete you want the UDF to be: e.g. just GET requests? What about POST? HEAD? etc.
Use a different database which can do this. If you're happy with SQL you could probably do this with PostgreSQL and one of the snap-in languages such as Python or PHP.
If you're not too fussed about sticking with SQL you could use something like eXist. You can do this type of thing relatively easily with XQuery, and would benefit from being able to easily modify the results to fit your schema (rather than just lumping it into a blob field) or store the page "as is" as an xhtml doc in the DB.
Then you can run queries very quickly across all documents to, for instance, get all the links or quotes or whatever. You could even apply XSL to such a result with very little extra work. Great if you're storing the pages for reference and want to adapt the results into a personal "intranet"-style app.
Also, since eXist is document-centric it has lots of great methods for fuzzy-text searching and near-word searching, and it has a great full-text index (much better than MySQL's). Perfect if you want to do some data mining on the content, e.g.: find all documents where a word like "burger" appears within 50 words of "hotdog" and the word isn't in a UL list. Try doing that natively in MySQL!
As an aside, and with no malice intended: I often wonder why eXist is overlooked when people build CMSs. It's a database that can store content in its native format (XML, or its subset (X)HTML), query it with ease in its native format, and translate it from its native format with a powerful templating language which looks and acts like its native format. Sometimes SQL is just plain wrong for the job!
Sorry. Didn't mean to waffle! :-$