I am working on a website project. We have a MySql and a MongoDb base.
We want to add a full-text search-engine over these bases (and if it can be linked with PostgreSql it's better).
These databases contain multilingual texts but we cannot determine the language.
I saw Solr, ElasticSearch and Sphinx, but what is your advice on this topic ?
Solr and Sphinx have stemmings but I am not sure we can use it without knowledge about content language...
Elastic is full JSON that could be better if we use more and more mongoDb...
It doesn't matter what search engine you use, stemming is highly language-dependent. IMHO you'll have to somehow detect the language in order to feed the text to the proper stemmer.
There is a product from Basis Technologies called the Rosette Language Platform that does autodetection of languages that you might look into.
Solr supports JSON for results (and indexing???) if that is a key integration mechanism. I would put "JSON" support a bit further down the list of things to scorecard on, and focus on How Relevant Will Results Be From Search Engine X For My Domain.
Related
I am looking at taking unstructured data in the form of files, processing it and storing it in a database for retrieval.
The data will be in natural language and the queries to get information will also be in natural language.
Ex: the data could be "Roses are red" and the query could be "What is the color of a rose?"
I have looked at several nlp systems, focusing more on open-source information extraction and relation extraction system and the following seems apt and easy for quick start:
https://www.npmjs.com/package/mitie
This can give data in the form of (word,type) pairs. It also gives a relation as result of running the the processing (check the site example).
I want to know if sql is good database to save this information. For retrieving the information, I will need to convert the natural language query also to some kind of (word, meaning) pairs
and for using sql I will have to write a layer that converts natural language to sql queries.
Please suggest if there are any open source database that work well in this situation. I'm open to suggestions for databases that work with other open-source information extraction and relation extraction systems if not MITIE.
SQL wont be an appropriate choice for your problem. You can use NLP or rules to extract relationships and then store that relationship in a Triple Store or a Graph database. There are many good open source Graph Databases like Neo4j and Apache Titan. You can query Google for Triple-stores, I suppose Apache Jena should be a good choice. After storing your data you can query your graphs using any of the Graph Query Languages like Gremlin or Cypher etc. (like SQL). Note that the heart of your system would be a Knowledge Graph.
You may also setup a Lucene/Solr based Search System on your unstructured data which may help you with answering your queries in conjunction with Graph Databases. All of these (NLP, IR, Graph DB/Triplestores etc.) would coexist to solve your problem.
It would be like an ensemble. No silver bullets :) However to start with look at Graph DB's or Triple-stores.
I already have a Sql Server 2008 based app in production, where am using Full text search by storing the binary (along with the file extension). Which means the same column can store doc, xls, pdf, docx... etc. I went for that approach (knowing it would be insert costly) because i have varied files which can be uploaded and I don't want to run into madness of converting text from various types (xls, xlsx, doc, docx, pdf etc) of files. Also i am not aware of any free tools which can do that for me. I don't want to use filesystem as that would be insecure and maintenance will be costly.
Now am looking for the ease (or difficulty) to move to mysql. Do have some options of full text search in mysql For ex: MySql Full text
search (which does not index binary), Sphinx and Solr.
I found this Question, which is kind of closest to what i need... Although i guess Sphinx doesn't index binary data... However, by using SphinxSE i can query the mysql tables and Sphinx to get related resultset (in the same connection). I hope that understanding is correct. But am not sure of the performance. Can someone add more insight?
Of what i have heard... Integrating Lucene with Mysql is difficult.
My need is to fetch ranked results based on criterion which can be structured (stored in RDBMS) and unstructured (textual dats which
shall be indexed).
Also, is there any other option which looks like more suitable in my given situation.
Have a look at ElasticSearch (uses lucene under the hood like Solr) I think it may do what you require I haven't needed document indexing though so not tried it.
See here though for more information
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html
It uses Apache Tika to convert the documents to indexable content (same as SQL server does with IFilter plugins)
I'm writing a crawler which needs to get data from many websites. The problem is that every website has different structure. How can I easily write a crawler which downloads (correctly) data from (many) different websites? If the structure of a website will change will I need to rewrite the crawler, or are there other methods?
What logical and implemented tools can be used to improve the quality of data mined by an automatic web-crawler (many websites are involved with different structure)?
Thank You!
I presume you want to query it is some way, in which case you should store the data in a flexible data store. A relational database would not be fit for purpose as it has a strict schema, but something like mongodb which lets you store semi structured data without having to define a schema up front, but still provides a powerful query language.
The same goes for how you represent the data in the crawler code. Don't map the data to classes where the structure is defined up front, but use a flexible data structures that can change at runtime. If you are using Java then de-serialise the data into HashMaps. In other languages this might be called Dictionaries or Hashes.
If you're scraping data from websites that actually want to allow you to do that, chances are they will provide some sort of webservice to allow you to query their data in a structured way.
Otherwise, you're on your own, and you might even be violating their terms of use.
If the websites provide no APIs, then you're out cold and you have to write separate extraction module for each data format you're encountering. If the website changes the format, then you have to update your format module. A standard thing to do is to have plugins for every website you're crawling and have a testing framework which does regression testing with data you've already collected. When a test fails you know something went wrong and you can investigate whether you have to update your format plugin or if there is another issue.
Without knowing what kind of data you're collecting it will be very difficult to try to hypothesize about ways to improve the "quality" of the data that was mined.
Maybe you could find out whether the website allows you to access the data like API, if so, you could use this kind of structured data to your website directly. If not, you may need plugins for that. Or you could turn to other web crawlers with API access like Octoparse, to find the way to access their API to your own web crawler.
I am doing (want to do) some experiments with Linked Open Datasets particularly those put out by governments.
I have a RDBMS (more specifically MySQL). I designed it with semantic web ideas in mind i.e. I have a information stored as objects, predicates and classes which define objects. In turn all objects are related to each other though statements of the form subject --> predicate --> object (where the subjects are from the objects table).
I want to be able to query other RDF triple stores from my application and let other triple stores query my data. Is it possible to "set something up" so that this is possible?
I have looked at Jena. Using Jena seems to mean I have to it as a storage application rather than MySQL - the only problem with this is that I include a new concept called a category (which I don't think is part of the semantic web languages). I will use categories to help with displaying information (they don't have any other meaning) but using Jena seems to mean that I can't organise predicates under categories for more convenient viewing.
I am using Java so a JAVA API is preferred.
It's also possible I misunderstood the purpose of Jena, and maybe that can be of use, but I am not sure how.
I am sure four days from now this question will seem rather silly, but at the moment I am somewhat confused about how to proceed.
I'm not sure what you mean by "a new concept called category", perhaps you can give an example?
If you mean that you want to add additional metadata, perhaps as a way of organizing information in the user interface, there is no need to extend the semantic web languages or storage systems - they can already do what you want.
Suppose you have data for a school from the UK Government schools dataset (using Turtle encoding for brevity):
#prefix sch-ont: <http://education.data.gov.uk/def/school/>.
<http://education.data.gov.uk/id/school/135412>
a sch-ont:School;
sch-ont:establishmentStatus
<http://education.data.gov.uk/def/school/EstablishmentStatus_Open>;
sch-ont:MSOA <http://statistics.data.gov.uk/id/msoa/E02000001>;
sch-ont:establishmentName "Guildhall School of Music and Drama";
...
You can directly query that data from the SPARQL end-point, or you can download the data and store it locally in your own triple store. Either way, you're perfectly at liberty to add extra information that's useful to your users. For example:
#prefix ankurs-app: <http://ankur.org/example/app/vocab/display#>.
<http://education.data.gov.uk/id/school/135412>
ankurs-app:category ankurs-app:wkdCool.
You can store this new triple in the same graph as the downloaded data, or you can store it in a separate named-graph to indicate that it's information that has a different provenance than the source data. Either way, it's then simple to query it either programmatically from Jena, or via a SPARQL query.
Doing a layout for efficiently querying schemaless triple-centric data is a well-studied, and hard, problem. Most of the RDF platforms, including Jena, have well-optimised code for querying and updating triples from their own database schemes. You would have to have very good reasons for embarking on your own relational table layout :)
If you really do need to take an existing relational table scheme and map it to a Jena RDF model, look at D2RQ.
Why didn't you just use a triple store to store all of your data? If you use a triple store with SPARQL endpoint capability then you would have a SPARQL-accessible web api. Similarly, many other data sets on the web are exposed as SPARQL endpoints and accessible via HTTP.
There are many triple stores available with persistent storage both in a db and otherwise (Jena + SDB, Mulgara, Virtuoso, Oracle, etc). You could certainly extend Mulgara through their resolvers to support queries against your custom db but I think that's probably a lot of work for not too much real value.
I'm sure you could use existing concepts to handle your notion of categories in RDF or perhaps by layering something over Jena.
I am interested in XML. I know it from Google's CSE.
It is often a pain for me to manipulate 3000-rows XML files.
This raises a question.
Why does Google use XML, not MySQL, such that I need to manipulate large XML -files?
XML has at least these advantages over SQL for data interchange purposes:
It's self-describing, you don't need to have any additional information to parse it.
It's a true standard, universally interoperable.
You aren't limited to tabular-oriented data: you can also use it to model hierarchies, for instance.
Probably the best you can do with SQL is ship tables in source code form, ie, as CREATE TABLE statements followed by a lot of INSERT statements. This is fine if you have a compatible database, but since SQL never really crystallized as a standard, interoperability at this level is very poor, and Google would have to offer multiple dialects (perhaps even for incompatible versions of the same DBMS).
XML is mostly human readable and cross platform. How would google send you data from just MYSql? Would you expect them to send you a binary blob that assumes you have the proper database to insert it into? How would you use that blob if MYSql wasn't installed, or a different version of MYSql was installed on your machine than on google?
XML is often uses as a transport format between systems. In CSE I would guess that google is transferring a lot of data from them to you in a format that many systems can use. If they used MySQL it would be no use to me as I don't know anything about it. However, pretty much most modern software frameworks can work with XML.
ADDITIONAL
Also, CSE (Customised Search Engine) probably expects that you don't need to do a lot of manipulation to the XML, just transform if for rendering to a web page. You can very easily perform an XSLT (Extensible Stylesheet Language Transformation) to an XML file to transform it in to an HTML fragment to use on your website.
MySQL is a specific SQL database engine. One not very suitable for providing the backend for the very very large dataset and special special needs that a search engine like google have.
I'm sure you can dig up info on how google's infrastructore, e.g. starting here
Relying on and exposing something specific like MySQL is not something you want to do when exchanging data over the internet.
XML on the other hand, being a general and textual markup language is ideal when you need
to interface and exchange data between systems. Thus it provides an ideal way to interface services such as Google CSE. You don't need to care about the specific implementation google have to provide the data, and Google don't need to care about the specific technology you use to manipulate the data
In addition to #Jared, there are XML databases. If the data is stored in XML, then it can be queried, transformed into html on the fly, or used in applications without the need for wrapping the data.
Why does Google use XML, not MySQL, such that I need to manipulate large XML -files?
access time, because there is no security check routine in DOM level on the accesed/open port /-: