Storing NLP corpora in databases rather than CSV?

While implementing an NLP system, I wonder why CSV files are so often used to store text corpora in academia and in common Python examples (in particular, NLTK-based ones). I have personally run into issues while using a system that generates a number of corpora automatically and accesses them later.
These are issues that come from CSV files:
- Difficult to automate back up
- Difficult to ensure availability
- Potential transaction race and thread accessing issues
- Difficult to distribute/shard over multiple servers
- Schema not clear or defined if the corpora become complicated
- Accessing via a filename is risky. It could be altered.
- File Corruption possible
- Fine grained permissions not typically used for file-access
Issues from using MySQL or MongoDB:
- Initial setup, and keeping a dedicated server running with the DB instance online
- Requires spending time creating and defining a schema
Pros of CSV:
- Theoretically easier to automate zipping and unzipping of contents
- More familiar to some programmers
- Easier to transfer to another academic researcher, via FTP or even e-mail
Looking at multiple academic articles, even more advanced NLP research, for example on Named Entity Recognition or statement extraction, seems to use CSV.
Are there other advantages to the CSV format that make it so widely used? What should an industry system use?

I will organize the answer into two parts:
Why CSV:
A dataset for an NLP task, be it classification or sequence annotation, basically requires two things for each training instance in a corpus:
- The text to be annotated (a single token, a sentence, or a document) and, optionally, pre-extracted features.
- The corresponding label or tag.
Because this simple tabular organization is consistent across different NLP problems, CSV is a natural choice. CSV is easy to learn, easy to parse, easy to serialize, and can accommodate different encodings and languages. It is also easy to work with in Python (the dominant language for NLP), and there are excellent libraries like Pandas that make it easy to manipulate and reorganize the data.
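As a minimal sketch of that tabular layout, assuming a hypothetical two-column file with `text` and `label` headers (the column names and the sample rows are made up for illustration), Python's standard `csv` module is enough to load such a corpus:

```python
import csv
import io

# Hypothetical corpus: one training instance per row, the text plus its label.
# The quoted first field shows that commas inside the text are handled.
SAMPLE_CSV = """text,label
"I loved this film, truly.",positive
"Terrible plot and worse acting.",negative
"""

def load_corpus(handle):
    """Read (text, label) pairs from a CSV file object that has a header row."""
    reader = csv.DictReader(handle)
    return [(row["text"], row["label"]) for row in reader]

# io.StringIO stands in for a real file here so the sketch is self-contained.
corpus = load_corpus(io.StringIO(SAMPLE_CSV))
for text, label in corpus:
    print(label, "->", text)
```

With a real file, the same function could be given `open(path, newline="", encoding="utf-8")`, where `path` is whatever location the corpus lives at.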
Why not a database:
A database is overkill, really. An NLP model is typically trained offline, i.e., you fit all the data at once in an ML/DL model. There are no concurrency issues. The only parallelism that exists during training is managed inside a GPU. There is no security issue during training: you train the model on your own machine and deploy only the trained model to a server.

Related

Can large scale fixed format data qualify as big data?

In a hypothetical scenario, there are hundreds of machines located worldwide.
All of them generate housekeeping data, logs, records 24x7.
One possible use of this data is to generate various kinds of reports.
All of the generated data has a fixed format and can very well be described by a corresponding relational schema.
Does that qualify as big data merely because of its sheer volume?
How should one choose between a relational and a NoSQL solution for this kind of problem?
The reason I raise this question is that the moment we move out of SQL/query land, speed issues start cropping up.
Is there a known practice for dealing with this kind of data effectively?
Wikipedia defines Big Data as "Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate". There are literally dozens of definitions of Big Data - http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours, so I would advise you not to worry about the term itself and rather to look for a solution to your problem.
There is no silver bullet for choosing NoSQL/BigData solution - "horses for courses". To get started, take a look at the following research done by Altoros’s R&D engineers - "A Vendor-independent Comparison of NoSQL Databases: Cassandra, HBase, MongoDB, Riak with sharded MySQL" - http://www.altoros.com/vendor_independent_comparison_of_nosql_databases.html. They have used "Yahoo Cloud Serving Benchmark" for benchmarking the various NoSQLs.

Can we do all the things which we can do in BizTalk using SSIS

I have been using SSIS for a while, and I have never come across BizTalk.
One of the data migration projects we are working on also involves BizTalk, in addition to SSIS.
I wondered what the need for BizTalk is if we already have SSIS as an ETL tool.
SSIS is well suited for bulk ETL batch operations where:
- You're transferring data between a SQL Server and another RDBMS, Excel, or a simple CSV file
- You do not need row-by-row processing
- Your mapping is primarily data type conversion mapping (i.e. changing VARCHAR to NVARCHAR or DATETIME to VARCHAR etc.)
- You're OK with error/fault handling for batches rather than rows
- You're doing primarily point-to-point integrations that are unlikely to change or will only be needed temporarily
BizTalk is well suited for real-time messaging needs where:
- You're transferring messages between any two endpoints
- You need a centralized hub and/or ESB for message processing
- You need fine-grained transformations of messages
- You need to work with more complicated looping file structures (i.e. not straight-up CSV)
- You need to apply analyst-manageable business rules
- You need to be able to easily swap out endpoints at run time
- You need more enhanced error/fault management for individual messages/rows
- You need enhanced B2B capabilities (EDI, HL7, SWIFT, trading partner management, acknowledgements)
Both can do the job of the other with a lot of extra work, but to see this, try to get SSIS to do a task that would require calling a stored procedure per row and have it do proper error handling/transformation of each row, and try to have BizTalk do a bulk ETL operation that requires minimal transformation. Both can do either, but it will be painful.
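To make the batch-versus-row distinction concrete, here is a rough sketch in Python. It is not SSIS or BizTalk code: an in-memory SQLite database stands in for the target, and the `customer` table, the sample rows, and the function names are all assumptions made up for illustration. The first loader treats the input as one set (one bad row rejects the whole batch); the second handles each row as its own unit of work and records individual failures.

```python
import sqlite3

# Hypothetical input: the third row violates the NOT NULL constraint below.
ROWS = [(1, "alice"), (2, "bob"), (3, None), (4, "dave")]

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    return conn

def load_as_batch(conn, rows):
    """Set-based load: one transaction, so a single bad row rejects the batch."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

def load_row_by_row(conn, rows):
    """Message-style load: each row succeeds or fails on its own."""
    failed = []
    for row in rows:
        try:
            with conn:
                conn.execute("INSERT INTO customer VALUES (?, ?)", row)
        except sqlite3.IntegrityError as exc:
            failed.append((row, str(exc)))
    return failed

def count(conn):
    return conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

batch_conn = make_db()
batch_ok = load_as_batch(batch_conn, ROWS)
batch_count = count(batch_conn)

row_conn = make_db()
row_failures = load_row_by_row(row_conn, ROWS)
row_count = count(row_conn)

print("batch load succeeded:", batch_ok, "rows stored:", batch_count)
print("row-by-row failures:", len(row_failures), "rows stored:", row_count)
```

The per-row version gives finer error reporting at the cost of one transaction per row, which is the same trade-off the comparison above describes.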
The short answer: no.
BizTalk Server and SSIS are different paradigms and are used to complement each other, not to compete. They are both part of the BizTalk Stack and are frequently used in the same app.
BizTalk is a messaging platform, and BizTalk apps tend to process one entity at a time. SSIS is set-based and works best for bulk, table-based operations.

Apache spark to store and query json data is a good use case?

Architecture: a brief description. I am working on an answering engine where people submit a query and wait for an answer (somewhat different from a search engine). The back end looks for an automated answer; if it doesn't find one directly, it sends a snippet to the interface along with a confidence score. All generated snippets and answers are stored in a MongoDB collection. Each query gets a unique URL and snippet ID, which I save in MongoDB, and whenever a user lands on the URL from another search engine, a query is made to fetch the data from the MongoDB collection. At first this architecture ran well, but now that the data is growing I seriously need a better architecture.
- Should I store the data in Hadoop and write a MapReduce program to fetch it?
- Should I preferably use Spark and Shark?
- Should I stick with MongoDB?
- Should I go for HBase or Hive?
You are confusing architecture and technology selection. Although they are related, these are separate notions. (You can find a couple of articles I wrote about it in the past here and here etc.)
Anyway, to your question: generally speaking, JSON is an expensive format that needs re-parsing every time you fetch it (unless you always want it as a "blob"). There are several other formats, like Avro, Google Protocol Buffers, ORC, and Parquet, that support schema evolution and also use binary encodings that are more efficient and faster to access.
Regarding the choice of persistent store: that depends heavily on your intended use and anticipated loads. Note that some of the options you've mentioned are aimed at completely different usages (e.g. HBase, which you can use for real-time queries, vs. Hive, which has a rich analytical interface (via SQL) but is batch-oriented).

Any drawback of building website based on JSON API for Data Access Layer

For instance, in ecommerce websites, we generally have two interfaces. One with which customer interacts and places orders and one with which company employees interact to manage orders and customers etc.
Suppose we divide this website into two different websites, that is, two separate projects that do not depend on each other. The only thing the two websites have in common is the database; both use the same one. What would then be a good option for the Data Access Layer?
- Each website has its own database access code and entities.
- Both websites are linked to a centralized layer, which exposes read/write access to the database through a JSON-based API.
In my opinion, the second option would be better, as it removes the direct dependency on the database: changes made in the database need not be made in two places. There are many other benefits as well.
My only concern is how much it could hamper the performance of the overall system, because in that case we are serializing and deserializing objects and also making HTTP connections.
Could someone please shed some light on the benefits and drawbacks of an API-backed Data Access Layer compared with each site having its own database access code?
People disagree about the best architecture for this sort of thing, but one common and popular architectural guideline suggests that you avoid integrating two products at the database layer at all costs. It is simpler to have two separate apps and databases which can change independently of each other, and if you need to reference data from one in the other, you should have some sort of event pipeline between the two, configured on the ESB.
And, you should probably have more than two back end databases anyway -- unless you have an incredibly simple system with only the two classes of objects you mentioned, you'll probably find that you have more than two bounded domains.
Also, if your performance requirements increase then you'll probably want to look at splitting the read and write sides of your services and databases, connecting the two sides through an eventing system of some sort, (maybe event-sourcing).
Before you decide what to do you should read Implementing Domain Driven Design by Vaughn Vernon. And, the paper on CQRS by Martin Fowler. And the paper on event sourcing, also from Dr Fowler. For extra points you should also read Fowler on Microservices architecture.
Finally, on JSON -- and I'm a big fan -- but you should only use it at the repository interface if you're either using javascript on the back end (which is a great idea if you're using io.js and Koa) and the front end (backbone & marionette, please), or if you're using a data-source that natively emits json. If you have to parse it then it's only going to slow you down so use some format native to the data-source and its consumers, that way you'll be as fast as possible.
An API centric approach makes more sense as the data is standardised and gives you more flexibility by being usable in any language for one or multiple interfaces.
Performance wise this would greatly depend on the quality and implementation of the technology stack behind the API. You could also look at caching certain data on the frontend to improve page load time.
The guys over at moltin have already built a platform like this and I've had great success using it. There's already a backend dashboard and the response times are pretty fast too!

Using MongoDB vs MySQL with lots of JSON fields?

I am building a microblogging type of application. The two database stores I have zeroed in on are MySQL and MongoDB.
I am planning to denormalize a lot of the data, i.e., a vote on a post is stored in a voting table and a count is also incremented in the main posts table. There are other actions involved with a post too (e.g., like, vote down).
If I use MySQL, some of the data is better suited to JSON than to a fixed schema, for faster lookups.
E.g.
POST_ID | activity_data
213423424 | { 'likes':  {'count':213, 'recent_likers':  ['john','jack',..fixed list of recent N users]},
              'smiles': {'count':345, 'recent_smilers': ['mary','jack',..fixed list of recent N users]} }
There are other components of the application as well, where usage of JSON is being proposed.
So, to update a JSON field, the sequence is:
- Read the JSON in a Python script.
- Update the JSON.
- Store the JSON back into MySQL.
It would have been a single operation in MongoDB, with atomic operators like $push, $inc, $pull etc. Also, the document structure of MongoDB suits my data well.
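The read-update-store sequence can be sketched roughly as follows. This is only an illustration under several assumptions: an in-memory SQLite database stands in for MySQL so the snippet is self-contained, and the `post_activity` table, the `add_like` helper, and the `keep_recent` parameter are hypothetical names, not part of any existing schema.

```python
import json
import sqlite3

# SQLite stands in for MySQL here; the JSON is stored as plain text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_activity (post_id INTEGER PRIMARY KEY, activity_data TEXT)"
)
initial = {"likes": {"count": 213, "recent_likers": ["john", "jack"]}}
conn.execute(
    "INSERT INTO post_activity VALUES (?, ?)", (213423424, json.dumps(initial))
)
conn.commit()

def add_like(conn, post_id, user, keep_recent=5):
    """Read the JSON, update it in Python, then write the whole blob back."""
    with conn:  # commits the UPDATE, or rolls it back if an exception is raised
        (raw,) = conn.execute(
            "SELECT activity_data FROM post_activity WHERE post_id = ?", (post_id,)
        ).fetchone()
        data = json.loads(raw)                        # step 1: read
        likes = data["likes"]
        likes["count"] += 1                           # step 2: update
        likes["recent_likers"] = ([user] + likes["recent_likers"])[:keep_recent]
        conn.execute(                                 # step 3: store
            "UPDATE post_activity SET activity_data = ? WHERE post_id = ?",
            (json.dumps(data), post_id),
        )
    return data

updated = add_like(conn, 213423424, "mary")
stored = json.loads(
    conn.execute(
        "SELECT activity_data FROM post_activity WHERE post_id = ?", (213423424,)
    ).fetchone()[0]
)
print(updated["likes"])
```

With a single connection nothing can interleave between the read and the write. With concurrent writers on a real server, two clients could both read the same blob and one like would be lost unless the row is locked for the duration (in MySQL/InnoDB, for example, with `SELECT ... FOR UPDATE` inside the transaction).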
My considerations while choosing the data store:
Regarding MySQL:
- Stable and familiar.
- Backup and restore is easy.
- Some future schema changes can be avoided by using some fields as schemaless JSON.
- May have to use a layer of memcached early.
- JSON blobs will be static in some tables like the main posts, but will be updated a lot in other tables like post votes and likes.
Regarding MongoDB:
- Better suited to storing schemaless data as documents.
- Caching might be avoided until a later stage.
- Sometimes the app may become write-intensive; MongoDB can perform better at those points where unsafe writes are not an issue.
- Not sure about stability and reliability.
- Not sure how easy it is to back up and restore.
Questions:
- Shall we choose MongoDB if half of the data is schemaless and would be stored as JSON if we used MySQL?
- Some of the data, like main posts, is critical, so it will be saved using safe writes; the counters etc. will be saved using unsafe writes. Is this policy, based on the importance of the data and write intensiveness, correct?
- How easy is it to monitor, back up and restore MongoDB compared to MySQL? We need to plan periodic backups (say, daily) and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application?
Stability, backup, snapshots, restoring, and wider adoption (i.e., database durability) are the reasons pointing me toward MySQL as RDBMS+NoSQL, even though NoSQL document storage could serve my purpose better.
Please focus your views on the choice between MySQL and MongoDB considering the database design I have in mind. I know there could be better ways to plan database design with either RDBMS or MongoDB documents. But that is not the current focus of my question.
UPDATE : From MySQL 5.7 onwards, MySQL supports a rich native JSON datatype which provides data flexibility as well as rich JSON querying.
https://dev.mysql.com/doc/refman/5.7/en/json.html
So, to directly answer the questions...
Shall we choose MongoDB if half of the data is schemaless and would be stored as JSON if we used MySQL?
Schemaless storage is certainly a compelling reason to go with MongoDB, but as you've pointed out, it's fairly easy to store JSON in a RDBMS as well. The power behind MongoDB is in the rich queries against schemaless storage.
If I might point out a small flaw in the illustration about updating a JSON field, it's not simply a matter of getting the current value, updating the document and then pushing it back to the database. The process must all be wrapped in a transaction. Transactions tend to be fairly straightforward, until you start denormalizing your database. Then something as simple as recording an upvote can lock tables all over your schema.
With MongoDB, there are no transactions (at least at the time of writing). But operations can almost always be structured in a way that allows for atomic updates. This usually involves some dramatic shifts from the SQL paradigms, but in my opinion they're fairly obvious once you stop trying to force objects into tables. At the very least, lots of other folks have run into the same problems you'll be facing, and the Mongo community tends to be fairly open and vocal about the challenges they've overcome.
Some of the data, like main posts, is critical, so it will be saved using safe writes; the counters etc. will be saved using unsafe writes. Is this policy, based on the importance of the data and write intensiveness, correct?
By "safe writes" I assume you mean the option to turn on an automatic "getLastError()" after every write. We have a very thin wrapper over a DBCollection that allows us fine grained control over when getLastError() is called. However, our policy is not based on how "important" data is, but rather whether the code following the query is expecting any modifications to be immediately visible in the following reads.
Generally speaking, this is still a poor indicator, and we have instead migrated to findAndModify() for the same behavior. On the occasion where we still explicitly call getLastError() it is when the database is likely to reject a write, such as when we insert() with an _id that may be a duplicate.
How easy is it to monitor, back up and restore MongoDB compared to MySQL? We need to plan periodic backups (say, daily) and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application?
I'm afraid I can't speak to whether our backup/restore policy is effective, as we have not had to restore yet. We're following the MongoDB recommendations for backing up; @mark-hillick has done a great job of summarizing those. We're using replica sets, and we have migrated MongoDB versions as well as introduced new replica members. So far we've had no downtime, so I'm not sure I can speak well to this point.
Stability, backup, snapshots, restoring, and wider adoption (i.e., database durability) are the reasons pointing me toward MySQL as RDBMS+NoSQL, even though NoSQL document storage could serve my purpose better.
So, in my experience, MongoDB offers storage of schemaless data with a set of query primitives rich enough that transactions can often be replaced by atomic operations. It's been tough to unlearn 10+ years worth of SQL experience, but every problem I've encountered has been addressed by the community or 10gen directly. We have not lost data or had any downtime that I can recall.
To put it simply, MongoDB is hands down the best data storage ecosystem I have ever used in terms of querying, maintenance, scalability, and reliability. Unless I had an application that was so clearly relational that I could not in good conscience use anything other than SQL, I would make every effort to use MongoDB.
I don't work for 10gen, but I'm very grateful for the folks who do.
I'm not going to comment on the comparisons (I work for 10gen and don't feel it's appropriate for me to do so), however, I will answer the specific MongoDB questions so that you can better make your decision.
Back-Up
Documentation here is very thorough, covering many aspects:
Block-Level Methods (LVM makes it very easy and quite a lot of folk do this)
With/Without Journaling
EBS Snapshots
General Snapshots
Replication (technically not back-up, however, a lot of folk use replica sets for their redundancy and back-up - not recommending this but it is done)
Until recently, there was no MongoDB equivalent of mylvmbackup, but a nice guy wrote one :) In his words:
Early days so far: it's just a glorified shell script and needs way more error checking. But already it works for me and I figured I'd share the joy. Bug reports, patches & suggestions welcome.
Get yourself a copy from here.
Restores
Formats etc
mongodump is completely documented here and mongorestore is here.
mongodump will not contain the indexes but does contain the system.indexes collection, so mongorestore can rebuild the indexes when you restore the bson file. The bson file is the actual data, whereas mongoexport/mongoimport are not type-safe, so it could be anything (technically speaking) :)
Monitoring
Documented here.
I like Cacti but afaik, the Cacti templates have not kept up with the changes in MongoDB and so rely on old syntax so post 2.0.4, I believe there are issues.
Nagios works well, but it's Nagios, so you either love it or hate it. A lot of folk use Nagios and it seems to provide them with great visibility.
I've heard of some folk looking at Zabbix, but I've never used it so can't comment.
Additionally, you can use MMS, which is free and hosted externally. Your MongoDB instances run an agent, and one of those agents communicates (using Python code) over HTTPS with mms.10gen.com. We use MMS to view all performance statistics on the MongoDB instances, and it is very beneficial for a high-level, wide view as well as offering the ability to drill down. It's simple to install and you don't have to run any hardware for it. Many customers run it, and some complement it with Cacti/Nagios.
Help information on MMS can be found here (it's a very detailed, inclusive document).
One of the disadvantages of a mysql solution with stored json is that you will not be able to efficiently search on the json data. If you store it all in mongodb, you can create indexes and/or queries on all of your data including the json.
Mongo's writes work very well, and really the only thing you lose vs mysql is transaction support, and thus the ability to rollback multipart saves. However, if you are able to commit your changes in atomic operations, then there isn't a data safety issue. If you are replicated, mongo provides an "eventually consistent" promise such that the slaves will eventually mirror the master.
MongoDB doesn't provide native enforcement or cascading of certain DB constructs such as foreign keys, so you have to manage those yourself, either through composition (which is one of Mongo's strengths) or through the use of DBRefs.
If you really need transaction support and robust 'safe' writes, yet still desire the flexibility provided by nosql, you might consider a hybrid solution. This would allow you to use mysql as your main post store, and then use mongodb as your 'schemaless' store. Here is a link to a doc discussing hybrid mongo/rdbms solutions: http://www.10gen.com/events/hybrid-applications The article is from 10gen's site, but you can find other examples simply by doing a quick google search.
Update 5/28/2019
There have been a number of changes to both MySQL and MongoDB since this answer was posted, so the pros and cons between them have become even blurrier. This update doesn't really help with the original question, but I am adding it so that new readers have somewhat more recent information.
MongoDB now supports transactions: https://docs.mongodb.com/manual/core/transactions/
MySql now supports indexing and searching json fields:
https://dev.mysql.com/doc/refman/5.7/en/json.html