Best way to create a SPARQL endpoint for an RDBMS (MySQL database)

I am doing (want to do) some experiments with Linked Open Datasets particularly those put out by governments.
I have an RDBMS (more specifically MySQL). I designed it with semantic web ideas in mind, i.e. I have information stored as objects, predicates and classes which define objects. In turn, all objects are related to each other through statements of the form subject --> predicate --> object (where the subjects are from the objects table).
I want to be able to query other RDF triple stores from my application and let other triple stores query my data. Is it possible to "set something up" so that this is possible?
I have looked at Jena. Using Jena seems to mean I have to use it as the storage layer rather than MySQL. The only problem with this is that I include a new concept called a category (which I don't think is part of the semantic web languages). I will use categories to help with displaying information (they don't have any other meaning), but using Jena seems to mean that I can't organise predicates under categories for more convenient viewing.
I am using Java, so a Java API is preferred.
It's also possible I misunderstood the purpose of Jena, and maybe that can be of use, but I am not sure how.
I am sure four days from now this question will seem rather silly, but at the moment I am somewhat confused about how to proceed.

I'm not sure what you mean by "a new concept called category", perhaps you can give an example?
If you mean that you want to add additional metadata, perhaps as a way of organizing information in the user interface, there is no need to extend the semantic web languages or storage systems - they can already do what you want.
Suppose you have data for a school from the UK Government schools dataset (using Turtle encoding for brevity):
@prefix sch-ont: <http://education.data.gov.uk/def/school/> .

<http://education.data.gov.uk/id/school/135412>
    a sch-ont:School ;
    sch-ont:establishmentStatus
        <http://education.data.gov.uk/def/school/EstablishmentStatus_Open> ;
    sch-ont:MSOA <http://statistics.data.gov.uk/id/msoa/E02000001> ;
    sch-ont:establishmentName "Guildhall School of Music and Drama" ;
    ...
You can directly query that data from the SPARQL end-point, or you can download the data and store it locally in your own triple store. Either way, you're perfectly at liberty to add extra information that's useful to your users. For example:
@prefix ankurs-app: <http://ankur.org/example/app/vocab/display#> .

<http://education.data.gov.uk/id/school/135412>
    ankurs-app:category ankurs-app:wkdCool .
You can store this new triple in the same graph as the downloaded data, or you can store it in a separate named-graph to indicate that it's information that has a different provenance than the source data. Either way, it's then simple to query it either programmatically from Jena, or via a SPARQL query.
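For concreteness, here is a minimal Jena sketch of that workflow, using the hypothetical ankurs-app vocabulary from the Turtle above (package names are for recent Apache Jena releases):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class CategoryExample {
    public static void main(String[] args) {
        // In practice you would load the downloaded school data into this model.
        Model model = ModelFactory.createDefaultModel();
        String app = "http://ankur.org/example/app/vocab/display#"; // hypothetical namespace

        Resource school = model.createResource("http://education.data.gov.uk/id/school/135412");
        Property category = model.createProperty(app, "category");

        // Add the display-only category triple alongside the source data.
        model.add(school, category, model.createResource(app + "wkdCool"));

        // Query it back with SPARQL.
        String q = "PREFIX app: <" + app + "> SELECT ?s WHERE { ?s app:category app:wkdCool }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("s"));
            }
        }
    }
}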
Doing a layout for efficiently querying schemaless triple-centric data is a well-studied, and hard, problem. Most of the RDF platforms, including Jena, have well-optimised code for querying and updating triples from their own database schemes. You would have to have very good reasons for embarking on your own relational table layout :)
If you really do need to take an existing relational table scheme and map it to a Jena RDF model, look at D2RQ.

Why didn't you just use a triple store to store all of your data? If you use a triple store with SPARQL endpoint capability then you would have a SPARQL-accessible web api. Similarly, many other data sets on the web are exposed as SPARQL endpoints and accessible via HTTP.
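As a sketch of what that looks like from Java, Jena's ARQ can send a query to any HTTP SPARQL endpoint; the endpoint URL below is a hypothetical placeholder:

import org.apache.jena.query.*;

public class RemoteQuery {
    public static void main(String[] args) {
        // Substitute the real SPARQL endpoint of the dataset you care about.
        String endpoint = "http://example.org/sparql";
        String q = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        // sparqlService sends the query over HTTP using the SPARQL protocol.
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, q)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                System.out.println(rs.next());
            }
        }
    }
}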
There are many triple stores available with persistent storage both in a DB and otherwise (Jena + SDB, Mulgara, Virtuoso, Oracle, etc.). You could certainly extend Mulgara through their resolvers to support queries against your custom DB, but I think that's probably a lot of work for not too much real value.
I'm sure you could use existing concepts to handle your notion of categories in RDF or perhaps by layering something over Jena.

Related

Is it possible to create a SQL database using a Core Data schema

I have a Core Data schema file with relationships between the entities.
I need to create a SQL database and would like to know if it can be created automatically (MySQL or MS-SQL) using only this file.
Looking at the SQLite DB I see that the relationships are not mapped in any logical way.
First, your assessment that the relationships are "not mapped in any logical way" is not correct. If you look carefully at the Core Data generated database you will discover that the relationships are mapped exactly as in any other old relational database scheme, i.e. with foreign keys referring to rows in other tables.
Also, the naming conventions in these SQLite databases are very transparent (e.g., entity and attribute names start with Z, etc.).
That being said, I would strongly discourage you from hacking the Core Data generated database file, or even from using it to inform another database scheme, the reason being that these are undocumented features that could change at any time without notice and thus break any code you write based on them.
IMO, the most practical thing to do is to rewrite the model quickly in the usual MySQL schema format, and then update it manually whenever you change the managed object model.
If you would like to automate the process, there is a rich set of APIs provided for interpreting and parsing NSManagedObjectModel, including classes like NSEntityDescription, NSAttributeDescription, etc. You could write a framework that iterates through your entities and attributes and generates a text file that is a readable schema for MySQL, complete with information about indexing, versions, etc.
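A rougher, cross-platform alternative to the NSManagedObjectModel route: the .xcdatamodeld source bundle contains a plain-XML "contents" file, so the schema can also be generated outside of Cocoa. The sketch below is in Java purely for illustration, and the XML element/attribute names reflect the undocumented format as I understand it, so verify them against your own model file before relying on this:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class ModelToMySql {
    public static void main(String[] args) throws Exception {
        // Path into the model bundle; the "contents" file is XML (undocumented format).
        File f = new File("MyModel.xcdatamodeld/MyModel.xcdatamodel/contents");
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(f);

        NodeList entities = doc.getElementsByTagName("entity");
        for (int i = 0; i < entities.getLength(); i++) {
            Element entity = (Element) entities.item(i);
            StringBuilder sql = new StringBuilder("CREATE TABLE ")
                    .append(entity.getAttribute("name"))
                    .append(" (id BIGINT PRIMARY KEY");
            NodeList attrs = entity.getElementsByTagName("attribute");
            for (int j = 0; j < attrs.getLength(); j++) {
                Element a = (Element) attrs.item(j);
                sql.append(", ").append(a.getAttribute("name"))
                   .append(" ").append(toMySqlType(a.getAttribute("attributeType")));
            }
            System.out.println(sql.append(");"));
        }
    }

    // Very rough type mapping; extend for decimals, binary data, relationships, etc.
    static String toMySqlType(String coreDataType) {
        switch (coreDataType) {
            case "String":     return "VARCHAR(255)";
            case "Integer 32": return "INT";
            case "Integer 64": return "BIGINT";
            case "Boolean":    return "TINYINT(1)";
            case "Date":       return "DATETIME";
            default:           return "TEXT";
        }
    }
}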
If you go down that route, please make sure to notify us and do post your framework on Github for the benefit of others.
If you use Core Data you can create a SQL-based database using a schema file, but its structure is entirely controlled by the Core Data framework. Apple specifically tell us as developers to leave it alone and not to edit it using libsqlite or any other method. If you do, then Core Data won't have anything to do with you!
In terms of making your own DB using one of Apple's schema files, I'm sure it is possible, but you'd have to know the inner workings of the Core Data framework to even attempt it.
In terms of making your own SQLite DB then you have to sort out all the relationships and mapping yourself.
I think that mixing and matching Core Data resources and custom-built SQLite databases is probably a headache waiting to happen. I have used both methods and find that Core Data is brilliant (especially with iCloud), as long as you're OK with your app being limited to Apple platforms.

How to create triple store from RDFa?

I have implemented RDFa on a shopping website.
Now, how do I create a triple store from that structured data?
There are thousands of products on the website, so manually visiting each and every page and extracting RDF is not a good solution. Are there any automatic tools for this?
The answer depends on how you "implemented RDFa". It is unlikely that the majority of your content is expressed as static information, so it is also unlikely that the majority of your content requires scraping.
There are tools, such as D2R Server, that give you facilities for exposing your underlying datastore as a read-only SPARQL endpoint. The only trick will be if you do have static content and wish to expose that as automatically generated RDF as well. That would require some finessing.
The data which is in RDFa format on your website probably comes from a database, where it is in relational form, since you probably didn't add the RDF triples to the HTML manually. So the easiest way to get the data into the triple store would not be from the HTML, but by some kind of transformation of the original data in the database. In the end, RDF triples can be seen as a ternary relation that can well be stored in any relational database.
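A minimal sketch of that transformation with JDBC and Jena; the table, columns and vocabulary URI are hypothetical stand-ins for your shop's schema:

import java.sql.*;
import org.apache.jena.rdf.model.*;

public class ProductsToRdf {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        String voc = "http://example.org/shop/vocab#"; // hypothetical vocabulary

        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/shop", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name, price FROM products")) {
            while (rs.next()) {
                // One resource per row, one triple per column.
                Resource product = model.createResource(
                        "http://example.org/shop/product/" + rs.getLong("id"));
                product.addProperty(model.createProperty(voc, "name"), rs.getString("name"));
                product.addProperty(model.createProperty(voc, "price"),
                        model.createTypedLiteral(rs.getBigDecimal("price")));
            }
        }
        model.write(System.out, "TURTLE"); // load this output into your triple store
    }
}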
GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a way of using XSLT to extract the RDF triples from the HTML, in case you do not have access to a relational database that stores the data; a skeletal example follows. Hope this helps.
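A skeletal GRDDL-style extraction in Java, assuming your pages are well-formed XHTML and that you have an RDFa-to-RDF/XML stylesheet on hand (rdfa2rdfxml.xsl is a hypothetical filename; the W3C GRDDL materials point to such stylesheets):

import javax.xml.transform.*;
import javax.xml.transform.stream.*;

public class RdfaExtract {
    public static void main(String[] args) throws Exception {
        // The XSLT does the actual RDFa-to-RDF work; this just applies it.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("rdfa2rdfxml.xsl"));
        // Input must be well-formed XML (XHTML); plain tag-soup HTML needs tidying first.
        t.transform(new StreamSource("product-page.xhtml"),
                    new StreamResult(System.out));
    }
}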

Using CouchDB and SQL Server side by side

We currently have a nicely relational SQL Server 2008 database that is our master application database. We are looking to improve an existing document storage mechanism which uses XML data types with something more schemaless that can handle similar but not identical documents, and thought that CouchDB would be a good fit.
The idea is that the common metadata about the documents would be stored in SQL Server for ease of display/aggregation/reporting, while the actual documents are stored in Couch to handle the subtle differences between them, making the most of the two different technologies.
For example, the status, type, related person and date created would all be common across all documents and stored in SQL, but an email and a letter (obviously with different fields) would be stored in Couch.
Then we can display our document grid for all types of document (thousands of docs), which can be queried through SQL, but the display of the doc gets its data from Couch when the user requests to view it.
Something to bear in mind is that some document types are generated from templates that are also documents themselves (think mail merge/find and replace).
Application layer is ASP.NET 4.5, C#, repository pattern, Windsor for IoC, JavaScript.
So, to the question...
Is this approach a sensible way to make the most of the two differing data storage paradigms?
Are we making our programming lives needlessly complex in the desire to "use the most appropriate technology for the problem"?
Does anyone have any experiences of trying something similar and if so, how did it go?
It's really not uncommon to use two different storage formats for a document: one for searchable aspects and metadata, and another for presentation.
Looking at it in a more general way, the approach is somewhat similar to the one we developed at the Royal Danish Library and pushed in the Planets EU project:
http://www.researchgate.net/publication/221176211_Archive_Design_Based_on_Planets_Inspired_Logical_Object_Model
Here's another paper that discusses this approach in a more general way:
"Opening Schrödingers Library"
The goal was archiving. We recognized that when converting documents for archiving or preservation, no single storage format was superior in all aspects of preserving the attributes, formats, looks, contents, etc. of the original document. Solution: convert to several formats, and use a sophisticated digital object to track the conversions and which aspects of the original were best preserved in which conversion.
So in my opinion the approach is theoretically and practically sound.
Practical issues: you will probably need some sort of digital object that keeps track of the various parts of a document, e.g. whether it occurs in one system only (and if so, which one), or in both. It seems that you are going to use SQL Server for this aspect, and that sounds sensible.
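For concreteness, that linking object can be as small as a row holding the Couch document id next to the queryable metadata. The sketch below is in Java purely for illustration (the question's stack is .NET) and uses CouchDB's plain HTTP API; the table and database names are hypothetical:

import java.net.URI;
import java.net.http.*;
import java.sql.*;

public class DocumentViewer {
    public static void main(String[] args) throws Exception {
        // 1. Metadata lives in the relational store for grids/reporting.
        String couchId;
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=app", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT couch_doc_id FROM documents WHERE id = ?")) {
            ps.setLong(1, 42);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                couchId = rs.getString(1);
            }
        }

        // 2. The full, schemaless document body lives in CouchDB (GET /db/docid).
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(
                        URI.create("http://localhost:5984/documents/" + couchId)).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // JSON document to render for the user
    }
}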
We actually did implement the object model we describe in the paper, and last I heard they are still using it.

automatic web crawler

I'm writing a crawler which needs to get data from many websites. The problem is that every website has a different structure. How can I easily write a crawler which (correctly) downloads data from (many) different websites? If the structure of a website changes, will I need to rewrite the crawler, or are there other methods?
What tools and techniques can be used to improve the quality of the data mined by an automatic web crawler (when many websites with different structures are involved)?
Thank You!
I presume you want to query the data in some way, in which case you should store it in a flexible data store. A relational database would not be fit for purpose as it has a strict schema, but something like MongoDB lets you store semi-structured data without having to define a schema up front, while still providing a powerful query language.
The same goes for how you represent the data in the crawler code. Don't map the data to classes where the structure is defined up front; use flexible data structures that can change at runtime. If you are using Java then deserialise the data into HashMaps. In other languages these might be called dictionaries or hashes.
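A sketch of both suggestions together, using Jackson to deserialise scraped JSON into plain maps and the MongoDB Java driver to persist them without a fixed schema (the database, collection and field names are hypothetical):

import java.util.Map;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.mongodb.client.*;
import org.bson.Document;

public class FlexibleStore {
    public static void main(String[] args) throws Exception {
        // Whatever fields this particular site happened to have end up as map entries.
        String scraped = "{\"title\": \"Widget\", \"price\": 9.99, \"extra\": {\"colour\": \"red\"}}";
        Map<String, Object> record = new ObjectMapper()
                .readValue(scraped, new TypeReference<Map<String, Object>>() {});

        // No up-front schema: each crawled item can have a different shape.
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> items =
                    client.getDatabase("crawler").getCollection("items");
            items.insertOne(new Document(record));
        }
    }
}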
If you're scraping data from websites that actually want to allow you to do that, chances are they will provide some sort of webservice to allow you to query their data in a structured way.
Otherwise, you're on your own, and you might even be violating their terms of use.
If the websites provide no APIs, then you have to write a separate extraction module for each data format you encounter. If a website changes its format, then you have to update the corresponding module. A standard thing to do is to have plugins for every website you're crawling, plus a testing framework which does regression testing with data you've already collected. When a test fails you know something went wrong, and you can investigate whether you have to update your format plugin or whether there is another issue.
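The plugin approach can be as simple as one interface implemented per site, with the regression suite replaying saved copies of previously crawled pages through each implementation; a minimal sketch with hypothetical names:

import java.util.Map;

// One implementation of this per crawled site; fix or swap the
// implementation when that site's markup changes.
interface Extractor {
    boolean supports(String url);
    Map<String, Object> extract(String html);
}

class ExampleShopExtractor implements Extractor {
    @Override
    public boolean supports(String url) {
        return url.contains("exampleshop.com"); // hypothetical site
    }

    @Override
    public Map<String, Object> extract(String html) {
        // Real code would parse the DOM (e.g. with jsoup); this is a stub.
        return Map.of("title", html.replaceAll("(?s).*<title>(.*?)</title>.*", "$1"));
    }
}

The regression tests then re-run each extractor against stored fixture pages and fail as soon as the extracted fields change.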
Without knowing what kind of data you're collecting it will be very difficult to try to hypothesize about ways to improve the "quality" of the data that was mined.
Maybe you could find out whether the website offers API access to its data; if so, you could consume that structured data directly. If not, you may need per-site plugins. Alternatively, you could turn to other web crawlers with API access, like Octoparse, and find a way to wire their API into your own web crawler.

Linq 2 Sql and Dynamic table schemas

First, some background. Our application is built on ASP.NET MVC3, .NET 4.0, and uses Linq-to-Sql (PLINQO) as its primary means of data access. Our web application is a multi-tenant/multi-client system where each client gets their own SQL Server database. Each SQL Server database up to now has had exactly the same schema.
Oftentimes, clients will ask us to track custom fields in their DB that other clients don't track. The way we've handled this is by reserving a number of custom fields in our main tables. For example, our Widget table may have CustomText1, CustomText2, ..., CustomText10 fields and CustomDate1, CustomDate2, ..., CustomDate10 fields. Again, all our schemas across clients are the same, so Linq-to-Sql handles these fields just as easily as any other field.
Now we are running into an issue where a client wants several hundred CustomBool fields, but doesn't need the others. So, basically, we are researching ways to still use Linq-to-Sql, but have it work against potentially different schemas depending on the database it is connected to (although they are different in a very specific way).
Too much code has already been built on Linq-to-Sql, and on accessing the Widget classes generated by it, for me to just fall back to straight SQL.
I've seen answers here and on the web on ways for Linq to Sql to access different tables that have the same schema, but I have not found a good answer for the same table name across different DBs with different columns.
Is this possible?
If the main objective is to store a few extra fields for existing domain objects, then why not create a generic table that can store key-value pairs? This is extremely flexible, since there is no need to change your schema if a customer requires a new property.
We do this frequently and normally have some helpers to correctly cast the properties e.g.
Service.GetProperty<bool>("SomeCustomProperty")
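A rough equivalent of such a helper over a generic key-value table, shown in Java/JDBC purely for illustration since the question's stack is .NET; the table and column names are hypothetical:

import java.sql.*;

public class PropertyStore {
    private final Connection con;

    public PropertyStore(Connection con) {
        this.con = con;
    }

    // Generic typed accessor over an (entity_id, prop_key, prop_value) table.
    public <T> T getProperty(long entityId, String key, Class<T> type) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT prop_value FROM custom_properties WHERE entity_id = ? AND prop_key = ?")) {
            ps.setLong(1, entityId);
            ps.setString(2, key);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                String raw = rs.getString(1);
                // Values are stored as text and cast on the way out.
                if (type == Boolean.class) return type.cast(Boolean.valueOf(raw));
                if (type == Integer.class) return type.cast(Integer.valueOf(raw));
                return type.cast(raw);
            }
        }
    }
}

A call like store.getProperty(42L, "SomeCustomProperty", Boolean.class) then mirrors the GetProperty<bool> helper above.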
If you are looking for a more "pluggable" domain model that can be completely different for each tenant, I think you will struggle if you are following a database driven approach and using the L2S designer to generate your code.
To achieve this you really need to be generating your database based on your code (domain driven design) which will give you much more flexibility i.e. you can load a tenant specific configuration (set of classes, business rules etc.) at runtime and use this to generate/validate your schema.
Update
It would be good if you could elaborate on exactly what design approach you have taken i.e. are you using the Linq designer and generating your model from the database?
It's clear that a generic key value pair store is not going to meet your querying requirements.
It's hard to provide a solution without suggesting a different technology. Relational SQL databases aren't really suited for dynamic domain models. You may be better off with a document database such as MongoDb or RavenDb where you are not tied to a specific schema. You could even make use of these just for your custom properties.
If that's not ideal, then another solution would be to use something like Dapper to construct your queries. Assuming you are developing against interfaces, you can have an implementation of your data service per tenant that makes use of their custom fields.
Ayende did a whole series of posts on Multitenancy and covers tenant specific domain models. It starts here and may be of some use to you.