How can I publish CSV data as Linked Data on the Web?

My work focuses on converting CSV data to RDF. Once I have the RDF data, I need to publish it as Linked Data on the Web. I want to write the CSV-to-RDF conversion myself in Java, and then publish the resulting RDF as Linked Data using existing tools. Can anyone suggest ways to do this, or point me to references? Which tools should I use? Thanks.

You can publish your RDF in a variety of ways. A commonly cited reference that explains the steps, software tools and examples is: http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf
In a nutshell, once you have your RDF data, you should think about the following:
1) Which tool or set of tools do I want to use to store my RDF data? For instance, I commonly use Virtuoso because I can use it for free and it makes it easy to set up the endpoint. But you can use Jena TDB, AllegroGraph, or many other triple stores.
2) Which tool do I use to make my data dereferenceable? For example, I use Pubby because I can configure it easily. But you can use Jena TDB (for the previous step) + Fuseki + Snorql for the same purpose. See the reference above for more information on the links and features of each tool.
3) Which datasets should I link to? (i.e., which data from other datasets do I reference, in order to make my dataset part of the Linked Data cloud?)
4) How should I link to these datasets? For example, the Silk framework can be used to find which URIs in your dataset are owl:sameAs URIs in the target dataset of your choice.
Many people just publish their RDF in their endpoints, without linking it to other datasets. Although this follows the Linked Data principles (http://www.w3.org/DesignIssues/LinkedData.html), it is always better to link to other existing URIs when possible.
This is a short summary, assuming you have already created the RDF data. I hope it helps.
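For the conversion step that precedes all of the above, here is a minimal sketch of turning CSV rows into N-Triples lines. It is written in Python rather than Java only to keep it short and dependency-free. The `http://example.org/` namespace, the `id/` and `def/` URI patterns, and the sample column names are assumptions made for illustration; a real mapping would normally reuse existing vocabularies and emit typed literals where appropriate.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace: replace with a domain you control.
BASE = "http://example.org/"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, id_column):
    """Map each CSV row to one subject URI, with one triple per non-empty column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}id/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            # Skip the key column, empty cells, and cells without a header.
            if column is None or column == id_column or not value:
                continue
            predicate = f"<{BASE}def/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines)


sample = "id,name,city\n1,Alice,Berlin\n2,Bob,\n"
print(csv_to_ntriples(sample, "id"))
```

The resulting text can be loaded into whichever triple store you pick in step 1.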

You can use Tarql (https://tarql.github.io/) or, if you want to do more advanced mapping, SparqlMap (http://aksw.org/Projects/SparqlMap).
In both cases you will end up with a SPARQL endpoint which you can make available online so that people can query your data.
Making each data item available under its own URL is a very good idea, following the Linked Data principles as mentioned by @daniel-garijo in the other answer: http://www.w3.org/DesignIssues/LinkedData.html.
So you can also publish each data item, with all its properties, in an individual file.

Related

Which database can be used to store processed data from NLP engine

I am looking at taking unstructured data in the form of files, processing it and storing it in a database for retrieval.
The data will be in natural language and the queries to get information will also be in natural language.
Ex: the data could be "Roses are red" and the query could be "What is the color of a rose?"
I have looked at several NLP systems, focusing on open-source information extraction and relation extraction systems, and the following seems apt and easy for a quick start:
https://www.npmjs.com/package/mitie
This can give data in the form of (word, type) pairs. It also gives a relation as a result of running the processing (check the example on the site).
I want to know whether SQL is a good database for saving this information. To retrieve the information, I will also need to convert the natural language query to some kind of (word, meaning) pairs, and to use SQL I will have to write a layer that converts natural language to SQL queries.
Please suggest any open-source databases that work well in this situation. I'm open to suggestions for databases that work with other open-source information extraction and relation extraction systems, if not MITIE.
SQL won't be an appropriate choice for your problem. You can use NLP or rules to extract relationships and then store those relationships in a triple store or a graph database. There are many good open-source graph databases, such as Neo4j and Titan. You can search for triple stores; I suppose Apache Jena should be a good choice. After storing your data you can query your graphs using a graph query language such as Gremlin or Cypher (much as you would use SQL). Note that the heart of your system would be a knowledge graph.
You may also set up a Lucene/Solr-based search system on your unstructured data, which may help you answer your queries in conjunction with the graph database. All of these (NLP, IR, graph databases/triple stores, etc.) would coexist to solve your problem.
It would be like an ensemble. No silver bullets :) However, to start with, look at graph databases or triple stores.
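To make the triple idea concrete, here is a minimal in-memory sketch in Python. It is only a stand-in for a real triple store or graph database, and the `TripleStore` class and the `hasColor` predicate are hypothetical names chosen for this example. It assumes the NLP step has already turned "Roses are red" into (rose, hasColor, red) and the question into a pattern with one unknown.

```python
class TripleStore:
    """A toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


store = TripleStore()
store.add("rose", "hasColor", "red")       # extracted from "Roses are red"
store.add("violet", "hasColor", "blue")

# "What is the color of a rose?" becomes the pattern (rose, hasColor, ?)
answers = [o for _, _, o in store.match(subject="rose", predicate="hasColor")]
print(answers)
```

A real store adds indexing, persistence and a query language on top of the same pattern-matching idea.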

How to create triple store from RDFa?

I have implemented RDFa on a shopping website.
Now, how do I create a triple store from that structured data?
There are thousands of products on the website, so manually visiting every page and extracting the RDF is not a good solution. Are there any automated tools for this?
The answer depends on how you "implemented RDFa". It is unlikely that the majority of your content is expressed as static information, so it is also unlikely that the majority of your content requires scraping.
There are tools, such as D2R Server, that give you facilities for exposing your underlying datastore as a read-only SPARQL endpoint. The only trick will be if you do have static content and wish to expose that as automatically generated RDF as well. That would require some finessing.
The data which is in RDFa format on your website probably comes from a database, where it is in relational form, since you probably didn't add the RDF triples to the HTML manually. So the easiest way to get the data into the triple store would not be from the HTML, but by some kind of transformation of the original data in the database. In the end, RDF triples can be seen as a ternary relation that can well be stored in any relational database.
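As a rough illustration of the "ternary relation" point, the sketch below stores triples in a single three-column table using Python's built-in sqlite3 module. The table layout, the `shop.example` product URIs and the choice of schema.org properties are assumptions for the example, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE triples (
           subject   TEXT NOT NULL,
           predicate TEXT NOT NULL,
           object    TEXT NOT NULL,
           PRIMARY KEY (subject, predicate, object)
       )"""
)

# Hypothetical product data, as it might be generated from the shop's own database.
rows = [
    ("http://shop.example/product/1", "http://schema.org/name", "Blue Mug"),
    ("http://shop.example/product/1", "http://schema.org/price", "7.50"),
    ("http://shop.example/product/2", "http://schema.org/name", "Red Mug"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)


def objects_of(subject, predicate):
    """Return all objects for a given subject and predicate."""
    cursor = conn.execute(
        "SELECT object FROM triples WHERE subject = ? AND predicate = ? ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in cursor]


print(objects_of("http://shop.example/product/1", "http://schema.org/name"))
```

A dedicated triple store does much more (indexing, SPARQL, inference), but the data model is the same.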
GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a way of using XSLT to extract the RDF triples from the HTML, in case you do not have access to a relational database that stores the data. Hope this helps.

automatic web crawler

I'm writing a crawler which needs to get data from many websites. The problem is that every website has a different structure. How can I easily write a crawler which correctly downloads data from many different websites? If the structure of a website changes, will I need to rewrite the crawler, or are there other methods?
What logical and implemented tools can be used to improve the quality of data mined by an automatic web-crawler (many websites are involved with different structure)?
Thank You!
I presume you want to query it in some way, in which case you should store the data in a flexible data store. A relational database would not be fit for purpose because it has a strict schema, whereas something like MongoDB lets you store semi-structured data without having to define a schema up front, and still provides a powerful query language.
The same goes for how you represent the data in the crawler code. Don't map the data to classes whose structure is defined up front; use flexible data structures that can change at runtime. If you are using Java, then deserialise the data into HashMaps. In other languages these might be called dictionaries or hashes.
If you're scraping data from websites that actually want to allow you to do that, chances are they will provide some sort of webservice to allow you to query their data in a structured way.
Otherwise, you're on your own, and you might even be violating their terms of use.
If the websites provide no APIs, then you're out of luck and you have to write a separate extraction module for each data format you encounter. If a website changes its format, then you have to update the corresponding module. A standard approach is to have a plugin for every website you're crawling, plus a testing framework which does regression testing against data you've already collected. When a test fails, you know something went wrong and you can investigate whether you have to update your format plugin or whether there is another issue.
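The plugin-plus-regression-test idea could look roughly like the following Python sketch. The site names, the HTML snippets and the regular expressions are all made up for illustration; a real extractor would more likely use a proper HTML parser, and the fixtures would be pages you have actually saved.

```python
import re

EXTRACTORS = {}


def extractor(site):
    """Decorator that registers an extraction plugin for one website."""
    def register(func):
        EXTRACTORS[site] = func
        return func
    return register


@extractor("shop-a.example")
def extract_shop_a(html):
    match = re.search(r'<span class="price">([^<]+)</span>', html)
    return {"price": match.group(1).strip()} if match else None


@extractor("shop-b.example")
def extract_shop_b(html):
    match = re.search(r'data-price="([^"]+)"', html)
    return {"price": match.group(1)} if match else None


# Previously collected pages with the result each plugin is expected to produce.
FIXTURES = [
    ("shop-a.example", '<div><span class="price"> 9.99 </span></div>', {"price": "9.99"}),
    ("shop-b.example", '<li data-price="12.50">Widget</li>', {"price": "12.50"}),
]


def run_regression(fixtures):
    """Return the sites whose plugin no longer reproduces the stored result."""
    return [site for site, html, expected in fixtures
            if EXTRACTORS[site](html) != expected]


# An empty list means every plugin still matches its fixture.
print(run_regression(FIXTURES))
```

When a site changes its markup, refreshing its fixture makes the corresponding plugin show up in the returned list, which tells you where to look.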
Without knowing what kind of data you're collecting it will be very difficult to try to hypothesize about ways to improve the "quality" of the data that was mined.
You could first find out whether the website exposes its data through an API; if so, you can consume that structured data directly. If not, you may need per-site plugins. Alternatively, you could look at existing web crawling services that offer API access, such as Octoparse, and call their API from your own crawler.

Best way to create a SPARQL endpoint for a RDBMS (MySQL database)

I am doing (want to do) some experiments with Linked Open Datasets particularly those put out by governments.
I have an RDBMS (more specifically MySQL). I designed it with semantic web ideas in mind, i.e. I have information stored as objects, predicates, and classes which define objects. In turn, all objects are related to each other through statements of the form subject --> predicate --> object (where the subjects are from the objects table).
I want to be able to query other RDF triple stores from my application and let other triple stores query my data. Is it possible to "set something up" so that this is possible?
I have looked at Jena. Using Jena seems to mean I have to use it as the storage application rather than MySQL. The only problem with this is that I include a new concept called a category (which I don't think is part of the semantic web languages). I will use categories to help with displaying information (they don't have any other meaning), but using Jena seems to mean that I can't organise predicates under categories for more convenient viewing.
I am using Java, so a Java API is preferred.
It's also possible I misunderstood the purpose of Jena, and maybe that can be of use, but I am not sure how.
I am sure four days from now this question will seem rather silly, but at the moment I am somewhat confused about how to proceed.
I'm not sure what you mean by "a new concept called category", perhaps you can give an example?
If you mean that you want to add additional metadata, perhaps as a way of organizing information in the user interface, there is no need to extend the semantic web languages or storage systems - they can already do what you want.
Suppose you have data for a school from the UK Government schools dataset (using Turtle encoding for brevity):
@prefix sch-ont: <http://education.data.gov.uk/def/school/>.
<http://education.data.gov.uk/id/school/135412>
a sch-ont:School;
sch-ont:establishmentStatus
<http://education.data.gov.uk/def/school/EstablishmentStatus_Open>;
sch-ont:MSOA <http://statistics.data.gov.uk/id/msoa/E02000001>;
sch-ont:establishmentName "Guildhall School of Music and Drama";
...
You can directly query that data from the SPARQL end-point, or you can download the data and store it locally in your own triple store. Either way, you're perfectly at liberty to add extra information that's useful to your users. For example:
@prefix ankurs-app: <http://ankur.org/example/app/vocab/display#>.
<http://education.data.gov.uk/id/school/135412>
ankurs-app:category ankurs-app:wkdCool.
You can store this new triple in the same graph as the downloaded data, or you can store it in a separate named-graph to indicate that it's information that has a different provenance than the source data. Either way, it's then simple to query it either programmatically from Jena, or via a SPARQL query.
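The separate-named-graph option can be sketched in plain Python by storing quads, i.e. triples tagged with the graph they belong to. This is only an illustration of the idea, not Jena's API; the two `urn:example:graph:` names and the `describe` helper are hypothetical.

```python
SCHOOL = "http://education.data.gov.uk/id/school/135412"

# Hypothetical graph names: one for downloaded data, one for app-specific annotations.
SOURCE_GRAPH = "urn:example:graph:source"
APP_GRAPH = "urn:example:graph:app-annotations"

# Each entry is (graph, subject, predicate, object).
quads = [
    (SOURCE_GRAPH, SCHOOL,
     "http://education.data.gov.uk/def/school/establishmentName",
     "Guildhall School of Music and Drama"),
    (APP_GRAPH, SCHOOL,
     "http://ankur.org/example/app/vocab/display#category",
     "http://ankur.org/example/app/vocab/display#wkdCool"),
]


def describe(subject, graphs=None):
    """Return (predicate, object, graph) for a subject, optionally limited to some graphs."""
    return [(p, o, g) for g, s, p, o in quads
            if s == subject and (graphs is None or g in graphs)]


# Everything known about the school, regardless of provenance.
for predicate, obj, graph in describe(SCHOOL):
    print(graph, predicate, obj)

# Only the annotations added by the application.
print(describe(SCHOOL, graphs={APP_GRAPH}))
```

Keeping the graph name alongside each statement is what lets you later tell the government data apart from your own display categories.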
Doing a layout for efficiently querying schemaless triple-centric data is a well-studied, and hard, problem. Most of the RDF platforms, including Jena, have well-optimised code for querying and updating triples from their own database schemes. You would have to have very good reasons for embarking on your own relational table layout :)
If you really do need to take an existing relational table scheme and map it to a Jena RDF model, look at D2RQ.
Why didn't you just use a triple store to store all of your data? If you use a triple store with SPARQL endpoint capability then you would have a SPARQL-accessible web api. Similarly, many other data sets on the web are exposed as SPARQL endpoints and accessible via HTTP.
There are many triple stores available with persistent storage both in a db and otherwise (Jena + SDB, Mulgara, Virtuoso, Oracle, etc). You could certainly extend Mulgara through their resolvers to support queries against your custom db but I think that's probably a lot of work for not too much real value.
I'm sure you could use existing concepts to handle your notion of categories in RDF or perhaps by layering something over Jena.

What is the practical purpose of XML, that MySQL does not have?

I am interested in XML. I know it from Google's CSE.
It is often a pain for me to manipulate 3000-row XML files.
This raises a question.
Why does Google use XML, not MySQL, such that I need to manipulate large XML files?
XML has at least these advantages over SQL for data interchange purposes:
It's self-describing, you don't need to have any additional information to parse it.
It's a true standard, universally interoperable.
You aren't limited to tabular-oriented data: you can also use it to model hierarchies, for instance.
Probably the best you can do with SQL is ship tables in source code form, ie, as CREATE TABLE statements followed by a lot of INSERT statements. This is fine if you have a compatible database, but since SQL never really crystallized as a standard, interoperability at this level is very poor, and Google would have to offer multiple dialects (perhaps even for incompatible versions of the same DBMS).
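As a small illustration of the self-describing and hierarchical points, the Python sketch below parses a nested XML document with the standard library alone. The element and attribute names are invented for the example and are not the actual Google CSE format.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each result carries a nested, variable-length list of tags.
DOCUMENT = """\
<results>
  <result rank="1">
    <title>Linked Data</title>
    <tags><tag>rdf</tag><tag>web</tag></tags>
  </result>
  <result rank="2">
    <title>SPARQL</title>
    <tags><tag>query</tag></tags>
  </result>
</results>
"""


def summarize(xml_text):
    """Turn the nested XML into a list of dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {
            "rank": int(result.get("rank")),
            "title": result.findtext("title"),
            "tags": [tag.text for tag in result.iter("tag")],
        }
        for result in root.findall("result")
    ]


for entry in summarize(DOCUMENT):
    print(entry)
```

No database, schema file or vendor-specific driver is needed on the receiving side, which is the interoperability argument in a nutshell.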
XML is mostly human-readable and cross-platform. How would Google send you data from MySQL alone? Would you expect them to send you a binary blob that assumes you have the proper database to insert it into? How would you use that blob if MySQL wasn't installed, or if the version of MySQL installed on your machine differed from the one at Google?
XML is often used as a transport format between systems. In CSE I would guess that Google is transferring a lot of data to you in a format that many systems can use. If they used MySQL it would be no use to me, as I don't know anything about it. However, most modern software frameworks can work with XML.
ADDITIONAL
Also, CSE (Custom Search Engine) probably expects that you don't need to do much manipulation of the XML, just transform it for rendering in a web page. You can easily apply an XSLT (Extensible Stylesheet Language Transformations) stylesheet to an XML file to turn it into an HTML fragment to use on your website.
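To show the kind of XML-to-HTML step meant here without extra dependencies, the sketch below does the transformation with Python's standard ElementTree module instead of XSLT (the standard library has no XSLT processor, so this is a swapped-in technique). The `items`/`item`/`title` element names and the `url` attribute are assumptions, not the real CSE schema.

```python
import xml.etree.ElementTree as ET


def to_html_list(xml_text):
    """Render each <item> of the input as a linked entry in an HTML <ul> fragment."""
    root = ET.fromstring(xml_text)
    ul = ET.Element("ul")
    for item in root.findall("item"):
        li = ET.SubElement(ul, "li")
        link = ET.SubElement(li, "a", href=item.get("url", "#"))
        link.text = item.findtext("title", default="")
    # method="html" serialises the tree using HTML rules and escapes special characters.
    return ET.tostring(ul, encoding="unicode", method="html")


source = (
    '<items>'
    '<item url="http://example.org/a"><title>A &amp; B</title></item>'
    '<item url="http://example.org/c"><title>C</title></item>'
    '</items>'
)
print(to_html_list(source))
```

An XSLT stylesheet would express the same mapping declaratively, which is handy when the transformation has to run in different environments.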
MySQL is a specific SQL database engine, and not one well suited to serving as the backend for the very large dataset and special needs of a search engine like Google.
I'm sure you can dig up info on Google's infrastructure, e.g. starting here
Relying on and exposing something specific like MySQL is not something you want to do when exchanging data over the internet.
XML, on the other hand, being a general, textual markup language, is ideal when you need to interface and exchange data between systems. Thus it provides an ideal way to interface with services such as Google CSE. You don't need to care about the specific implementation Google uses to provide the data, and Google doesn't need to care about the specific technology you use to manipulate the data.
In addition to @Jared, there are XML databases. If the data is stored in XML, then it can be queried, transformed into HTML on the fly, or used in applications without the need for wrapping the data.
Why does Google use XML, not MySQL, such that I need to manipulate large XML files?
access time, because there is no security check routine at the DOM level on the accessed/open port /-: