I want to create a semantic text analyzer. To do that I need to store a lot of word roots in the database - the basic language vocabulary, which is about one hundred thousand words.
Is there any pattern or common architecture, and what kind of database should I use - relational or NoSQL (probably MongoDB)?
There are 26 letters, and many thousands of words can start with each. If I use a relational DB, should I create 26 different tables, one per letter? Or if I use NoSQL, should I store them all together?
Oracle SPARQL loaded with WORDNET is a good start.
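For the "one table vs. 26 tables" part of the question: at roughly one hundred thousand rows, a single table with an index on the word column is generally enough in either kind of store; splitting by first letter just re-implements, by hand, what an index already does. A minimal sketch, using SQLite purely for illustration (table and column names are made up, and this is separate from the WordNet/SPARQL suggestion above):

```python
import sqlite3

# One table for the whole vocabulary; an index replaces the "26 tables" idea.
conn = sqlite3.connect("vocabulary.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS word_roots (
        id   INTEGER PRIMARY KEY,
        root TEXT NOT NULL
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_root ON word_roots(root)")

conn.executemany(
    "INSERT INTO word_roots (root) VALUES (?)",
    [("run",), ("walk",), ("speak",)],
)
conn.commit()

# Exact lookups hit the index; prefix searches work too (whether LIKE can use the
# index depends on the collation / case-sensitivity settings).
print(conn.execute("SELECT root FROM word_roots WHERE root = ?", ("run",)).fetchone())
print(conn.execute("SELECT root FROM word_roots WHERE root LIKE 'sp%'").fetchall())
```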
I know this question has been asked quite a few times, but I have not found a satisfying answer.
I have read many blogs, and most of them say that an RDBMS cannot be scaled horizontally; the only way to deal with it is by buying bigger machines.
Then I read why they can't be scaled horizontally. People say it is because they provide solid, mature services according to the ACID properties. My argument to that is: can't an RDBMS be made to provide ACID properties only for specific tables, and relax them elsewhere? Is that the only reason it can't be scaled horizontally and we have to consider NoSQL databases?
The second argument that is put up is that NoSQL databases store data as a single unit, whereas an RDBMS stores data across multiple tables. Thus one piece of data may be on one system, and another piece of data it refers to may be on another system, so scaling an RDBMS across machines becomes difficult. My question to them is: why can't we store all the related data in a single table rather than scattering it across multiple tables if the situation demands it? If NoSQL can store data as a single unit in a single collection, why can't an RDBMS store data as a single unit in a single table? (For example, why does an order have to be split into an order table, a customer table and a payment table? Why can't they be combined into a single table, the way a NoSQL database would have stored them?)
This also allows developers to develop without having to convert in-memory structures to relational structures.
In short, can we make an RDBMS behave like a NoSQL database and make it scale horizontally?
First - what do you mean by 'scaling horizontally'?
To me, scaling horizontally is what we all do in MPP (Massively Parallel Processing) databases - like Vertica, Teradata, DB2 Parallel Edition, NonStop SQL, etc.: you have a very big table, which you distribute evenly across all nodes of your MPP cluster, usually based on the hash value of the primary key or something similar. This is what Hadoop and all other Map-Reduce architectures do, too (while often being less effective, at least as of now).
(Just editing to clarify): if you have 10 nodes in your cluster, your big tables are all distributed to have one tenth of their data on each node. Scaling, now, would be to add, for example, 10 more nodes and re-distribute the data so that each table has 1/20 of its data on each node. And MPP databases scale linearly; this means that by doubling the number of nodes, with the same data volume, the queries will now run twice as fast.
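A rough illustration of that distribution scheme, with made-up keys and node counts (a real MPP engine does this internally; the point is only that hashing the primary key spreads rows evenly, and that adding nodes changes the modulus before redistribution):

```python
from collections import defaultdict
import hashlib

def node_for_key(primary_key: str, num_nodes: int) -> int:
    """Pick a node by hashing the primary key, roughly as an MPP engine would."""
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

rows = [f"order-{i}" for i in range(1000)]

# With 10 nodes, each node holds roughly 1/10 of the rows ...
placement_10 = defaultdict(list)
for key in rows:
    placement_10[node_for_key(key, 10)].append(key)

# ... and scaling out to 20 nodes means redistributing so each holds roughly 1/20.
placement_20 = defaultdict(list)
for key in rows:
    placement_20[node_for_key(key, 20)].append(key)

print({node: len(keys) for node, keys in sorted(placement_10.items())})
print({node: len(keys) for node, keys in sorted(placement_20.items())})
```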
You seem to mean something different - and I'm curious about what you might mean.
As to RDBMS having to split everything into several tables:
The 'R' in RDBMS stands for 'Relational'. Before entering a discussion of all this, you should read a basic tutorial on relational algebra. A relation is simply a set of objects that can all be described with the same attributes. Consequently, all objects have the same attributes/columns/fields. As soon as this rule is violated, it is not a relation / table anymore.
I strongly suggest that you take a training on relational theory and relational databases, even before starting to play with SQL.
It's a big, big world of its own that you will have the opportunity to explore. And it all boils down to set theory and Boolean and relational algebra. And you can do so many things with it ...
Your question here is just like asking why a bicycle has two wheels.
Or am I missing something?
Marco the Sane
I am wondering how I should design a Chinese input method (Chinese characters can be typed into the computer by how they are pronounced**) so that the user can retrieve the word he/she wants. Should I design it with a relational database such as MySQL, or should I consider something else?
Since I cannot find relevant information for my question, I tried looking into how English dictionaries are built for search, but the nearest answers I found were one about the best data structure for a dictionary implementation and another discussing where you shouldn't use a relational database. My current thought is that since I only have one huge table of data, it seems like I should consider other database management systems. Or are there other suggestions and methods?
Many Thanks!
**More on Chinese input methods, if that helps describe my question: in Chinese, typing out a character can be done by pronunciation or by the formation of the word (simplifying how the word is "composed"). Here I would like to focus on the former, where we use pronunciation. A modified example: by typing xi-an-g-3, these four elements would form a word.
Yes - you certainly can use a relational database for creating a dictionary. Languages typically have under 100,000 words. With the proper indexes, queries to the database should be very quick.
You have to understand that "big data" these days means millions of records or more, and almost any database system will be able to handle this small set of data.
The question you need to be asking is what the load on the server will be (how many lookups will be happening at once?) and whether you need to optimize it or add a cache.
I would always advise starting with the data kept in normalized form in the database and going from there with caching and optimization.
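A minimal sketch of that normalized starting point for the pronunciation lookup, assuming one table keyed by the typed pronunciation (SQLite for brevity; the table name, columns, and frequency values are made up, and a MySQL schema would look essentially the same):

```python
import sqlite3

# One table mapping a typed pronunciation to candidate characters.
conn = sqlite3.connect("ime.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS candidates (
        pronunciation TEXT NOT NULL,      -- e.g. "xiang3"
        character     TEXT NOT NULL,      -- the character/word it can produce
        frequency     INTEGER DEFAULT 0   -- used to rank candidates
    )
""")
# The index on pronunciation is what keeps lookups fast, even as one big table.
conn.execute("CREATE INDEX IF NOT EXISTS idx_pron ON candidates(pronunciation)")

conn.executemany(
    "INSERT INTO candidates (pronunciation, character, frequency) VALUES (?, ?, ?)",
    [("xiang3", "想", 9000), ("xiang3", "享", 3000), ("xiang3", "响", 2500)],
)
conn.commit()

# Return candidates for what the user typed, most frequent first.
rows = conn.execute(
    "SELECT character FROM candidates WHERE pronunciation = ? ORDER BY frequency DESC",
    ("xiang3",),
).fetchall()
print([r[0] for r in rows])
```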
I'd like to ask you for some advice on my research for my thesis.
I am building an application where I'll have 1000 articles of 200-300 words each, and a "word frequency list" of 30,000 words, each one rated according to usage across an English corpus, e.g. "of" - 20168 times, "the" - 6464684 times, "acquaintance" - 15 times and so forth...
Now I want to query the database with lists of words, and I want an article returned that contains the most of these words, the most times.
E.g.: my list: different, contemporary, persistency.
Article 1 contains contemporary 1x
Article 2 contains contemporary 3x
So the returned article would be No. 2.
Questions
Should I create any relations between words and articles in the database? I mean, for a thousand articles of 300 words each (well, not all unique), that would be quite a list. Or would an index suffice?
MySQL vs. Oracle? With MySQL I'd use Solr to index; I know that Oracle has a tool for indexing, but nothing more about it.
Is Oracle with such functionality available for free? And is it easy to handle? I've never worked with it, but if the setup were easy, I would go for it.
Thank you very much!
I recommend that you use Hadoop to perform a WordCount operation. This will be scalable later (you're a researcher!) and efficient. Moreover, creating relations between words and articles in the database doesn't look like a neat solution.
If you choose Hadoop, it will provide the functionality of MapReduce. It works like this:
It divides all input text files amongst multiple physical machines.
Each machine performs the word count algorithm.
Results are collected from all machines and then combined to give the final output.
You don't have to worry about implementing these functionalities yourself; here is a tutorial.
A WordCount job can also run locally on one machine.
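To give a feel for what that looks like, here is a minimal word-count sketch in the Hadoop Streaming style: two small Python scripts that read stdin and write tab-separated `word count` pairs. The file names and paths are illustrative; on a cluster the same scripts are handed to the Hadoop Streaming jar, which does the splitting across machines and the shuffling described above.

```python
# mapper.py - emit "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print(f"{word}\t1")
```

```python
# reducer.py - sum the counts per word (input arrives grouped/sorted by word).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally you can test the pair with `cat articles/*.txt | python mapper.py | sort | python reducer.py`, which is the "run on one machine" case mentioned above.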
I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000-10000 records with 20-30 fields on each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer if it is to be answered in all aspects. There are certainly specifics that may make one system superior to another for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr as an example of the several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a safe primary database. There is a list of reasons why, but they mostly consist of missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source, like an SQL table. In fact, Solr is perfect for storing documents that include data from several tables and relations that would otherwise require complex joins to construct.
Solr/Lucene provides mind-blowing text-analysis / stemming / full-text search scoring / fuzziness functions - things you just cannot do with MySQL. In fact, full-text search in MySQL is limited to MyISAM, and scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy, etc. ranges from very hard work to almost impossible.
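As a hedged sketch of what that looks like in practice (using the third-party pysolr client; the core name and the "title"/"body" fields are assumptions about your schema): a fuzzy term combined with per-field weighting via the eDisMax parser.

```python
import pysolr  # third-party client; assumes a running Solr core named "articles"

solr = pysolr.Solr("http://localhost:8983/solr/articles")

# "contemprary~1" tolerates one edit (fuzzy matching), and qf weights matches in
# "title" three times as heavily as matches in "body".
results = solr.search(
    "contemprary~1",
    **{"defType": "edismax", "qf": "title^3 body"},
)
for doc in results:
    print(doc.get("id"), doc.get("title"))
```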
In Solr/Lucene you have documents. You cannot really store and process relations. Well, you can of course index the keys of other documents inside a multivalued field of some document, so this way you can actually store 1:n relations, and do it both ways to get n:m, but it's data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, for a product catalog where you want to store the distributors for products and you want to search only parts that are available at certain distributors, or something). But you reach the end of the possibilities with HAS / HAS NOT. You can almost not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that had 40,000 hits, you can display that you would only get 3 hits if you refined your search to the combination of this field having this value and that field having that value. Things that would need additional queries in MySQL are done efficiently and conveniently.
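A small sketch of the two previous points together, again with pysolr and made-up core, field, and document names (the multivalued field and the facet field are assumed to exist in the schema):

```python
import pysolr  # third-party client; assumes a running Solr core named "catalog"

solr = pysolr.Solr("http://localhost:8983/solr/catalog")

# A multivalued field is how the 1:n relation (product -> distributors) gets
# flattened into each document.
solr.add([
    {"id": "p1", "name": "cordless drill", "distributor_ids": ["d1", "d2"]},
    {"id": "p2", "name": "drill press", "distributor_ids": ["d2"]},
])
solr.commit()

# One query returns both the hits and the per-distributor facet counts, so the UI
# can show "refine by distributor" numbers without extra round trips.
results = solr.search(
    "drill",
    **{
        "fq": "distributor_ids:d2",
        "facet": "true",
        "facet.field": "distributor_ids",
    },
)
for doc in results:
    print(doc["id"], doc["name"])
print(results.facets["facet_fields"]["distributor_ids"])
```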
So let's sum up
The power of Lucene is text searching/analyzing. It is also mind-blowingly fast because of the inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it's document-oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and to query. If your application is focused on text searching, you should definitely go for Solr/Lucene, unless you have good reasons, like very complex, multi-dimensional range filter queries, to do otherwise.
If you do not have text search, but rather something where the user points and clicks rather than entering text, good old relational databases are probably the better way to go.
Use Solr if:
You do not want to stress your database.
You need real full-text search.
You want lightning-fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full-text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I previously worked with one of India's best real-estate online classifieds portals, which was using Solr for search (and had previously used MySQL). The migration reduced the search times manifold.
Solr can be easily integrated with MySQL:
Solr full data import can be used for importing data from MySQL tables into Solr collections.
Solr delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections; a small scheduling sketch follows below.
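A minimal sketch of triggering those imports over HTTP on a schedule, assuming a core named "products" that already has a DataImportHandler configured against the MySQL tables (the URL, core name, and interval are illustrative):

```python
import time

import requests  # third-party HTTP client

SOLR_DIH = "http://localhost:8983/solr/products/dataimport"

def full_import():
    # One-off full rebuild of the collection from the MySQL tables.
    requests.get(SOLR_DIH, params={"command": "full-import", "commit": "true"})

def delta_import_loop(interval_seconds: int = 60):
    # Periodically pull only the rows that changed since the last import.
    while True:
        requests.get(SOLR_DIH, params={"command": "delta-import", "commit": "true"})
        time.sleep(interval_seconds)

if __name__ == "__main__":
    full_import()
    delta_import_loop()
```

In practice the delta import would more likely be a cron entry hitting the same URL rather than a long-running loop.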
How can I implement a large relational database schema on a key value store?
The following are the requirements:
1) uses no stored procedures or special database vendor specific features
2) Uses indexes
3) Uses joins
4) Many complex types in tables (VARCHAR, INT, BLOB, etc)
5) Billions of records
6) Full text search
7) Timestamped backup possible
8) Transactions not needed (atomic single row/field updates only)
You will have to, in essence, build your own relational database system. Without one, joins will be horribly slow (there is no query optimizer). You can get full-text search by grafting on Lucene.
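To make the "joins become your problem" point concrete, here is a minimal sketch of what even a simple two-table join looks like when the application has to do it against a plain key-value store (a Python dict stands in for the store; the keys and fields are made up):

```python
# A plain key-value store: string keys, opaque records as values (dicts here).
store = {
    "customer:1": {"name": "Alice"},
    "customer:2": {"name": "Bob"},
    "order:10": {"customer_id": 1, "total": 25},
    "order:11": {"customer_id": 2, "total": 40},
    "order:12": {"customer_id": 1, "total": 15},
}

# The SQL "SELECT o.total, c.name FROM orders o JOIN customers c ON o.customer_id = c.id"
# becomes a hand-written loop: scan every order, then look up its customer.
# There is no query optimizer to pick an index or a better join strategy for you.
def orders_with_customer_names():
    results = []
    for key, record in store.items():
        if key.startswith("order:"):
            customer = store.get(f"customer:{record['customer_id']}")
            results.append((record["total"], customer["name"] if customer else None))
    return results

print(orders_with_customer_names())
```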
Have you considered an open source RDBMS (e.g. Derby)?
Often, the whole point of a key-value store is to get away from the constraints of relational databases, but it sounds like you want them back here. Everybody wants to have their cake and eat it too, but I'm confused about what you're trying to accomplish. If you want the power of a relational database, you should use a relational database.
That said, you may want to take a look at MongoDB, which represents a very good compromise between the rigid, structured nature of relational databases and the more free-form approach of a key-value store.