Hadoop Data extraction aftermath - hadoop2

What are the ways to ship extracted data out of a Hadoop system after MapReduce? I mean, once the data has been extracted, say with Pig Latin, how would it be sent out to another application, such as a reporting application, so the interested group can make use of it? What are the specific tools for shipping user-friendly data out of Hadoop?
I know this question doesn't have much research value, but I feel it is worth learning about the aftermath of the extraction process.

Related

Storing NLP corpora in databases rather than csv?

While implementing an NLP system, I wonder why CSV files are so often used to store text corpora in academia and in common Python examples (in particular, NLTK-based ones). I have personally run into issues using a system that generates a number of corpora automatically and accesses them later.
These are issues that come from CSV files:
- Difficult to automate backups
- Difficult to ensure availability
- Potential race conditions and concurrent-access issues
- Difficult to distribute/shard over multiple servers
- Schema not clear or defined if the corpus becomes complicated
- Accessing via a filename is risky; the file could be altered
- File corruption is possible
- Fine-grained permissions are not typically used for file access
Issues from using MySQL or MongoDB:
- Initial setup: keeping a dedicated server running with the DB instance online
- Requires spending time creating and defining a schema
Pros of CSV:
- Theoretically easier to automate zipping and unzipping of contents
- More familiar to some programmers
- Easier to transfer to another academic researcher, via FTP or even e-mail
Looking at multiple academic articles, even in cases of more advanced NLP research, for example Named Entity Recognition or statement extraction, the research seems to use CSV.
Are there other advantages to the CSV format that make it so widely used? What should an industry system use?
I will organize the answer into two parts:
Why CSV:
A dataset for an NLP task, be it classification or sequence annotation, basically requires two things per training instance in a corpus:
Text (which might be a single token, sentence, or document) to be annotated, and optionally pre-extracted features.
The corresponding label/tag.
Because of this simple tabular organization of data, which is consistent across different NLP problems, CSV is a natural choice. CSV is easy to learn, easy to parse, easy to serialize, and easy to use with different encodings and languages. CSV is easy to work with in Python (the dominant language for NLP), and there are excellent libraries like Pandas that make it really easy to manipulate and re-organize the data.
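For illustration, here is a minimal sketch of loading a labeled corpus from CSV with Pandas; the file name and the "text"/"label" column names are hypothetical.

```python
import pandas as pd

# Load a labeled corpus; "text" and "label" are assumed column names.
corpus = pd.read_csv("corpus.csv", encoding="utf-8")

# Typical manipulations: inspect class balance, then split text from labels.
print(corpus["label"].value_counts())
texts, labels = corpus["text"].tolist(), corpus["label"].tolist()
```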
Why not a database
A database is overkill, really. An NLP model is always trained offline, i.e., you fit all the data at once into an ML/DL model. There are no concurrency issues; the only parallelism that exists during training is managed inside the GPU. There is no security issue during training either: you train the model on your own machine and only deploy the trained model to a server.

Alternate methods to supply data for machine learning (Other than using CSV files)

I have a question relating to machine learning applications in the real world. It might sound stupid, lol.
I've been self-studying machine learning for a while, and most of the exercises used a CSV file as the data source (both processed and raw). I would like to ask: are there any other methods, besides importing CSV files, to channel/supply data for machine learning?
Example: streaming Facebook/Twitter live feed data into machine learning in real time, rather than collecting old data and storing it in a CSV file.
The data source can be anything. Usually it's provided as a CSV or JSON file. But in the real world, say you have a website such as Twitter, as you're mentioning, you'd be storing your data in a relational DB such as a SQL database, and some data you'd be putting in an in-memory cache.
You can utilize both of these to retrieve your data and process it. The thing here is that when you have too much data to fit in memory, you can't really just query everything and process it; in that case, you'll need algorithms that process the data in chunks.
A good thing about SQL databases is that they provide a set of functions you can invoke right in your SQL query to efficiently compute over the data. For example, you can get the sum of a column across the whole table using the SUM() function, which allows for efficient and easy data manipulation.
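As a rough sketch of both ideas, aggregating in the database and reading in chunks, here is a Python example against SQLite; the database, table, and column names are made up for illustration.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("tweets.db")  # hypothetical database

# Let the database do the aggregation instead of loading everything into memory.
total_likes = conn.execute("SELECT SUM(likes) FROM tweets").fetchone()[0]
print("total likes:", total_likes)

# When the table is too large to fit in memory, stream it in chunks.
for chunk in pd.read_sql_query("SELECT text, likes FROM tweets", conn, chunksize=10_000):
    # Placeholder processing step: update running statistics per chunk.
    print(chunk["likes"].mean())
```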

Parse.com query database for business intelligence and statistics purposes

I would love to know how I can periodically run some complex queries on my app's Parse database. We are currently exporting the entire database as JSON, converting the JSON to CSV, loading it into Excel, and deriving business intelligence from that. It is not very efficient, because the database is growing every day and the process of converting the file to CSV takes longer every day. Any advice or good practices that you have used?
Parse's own "Background Jobs": http://blog.parse.com/announcements/introducing-background-jobs/
A meaningful answer would probably require more specifics, but it is normal that bigger data = longer processing times.
One way of keeping a lid on it would be to keep already-processed data as-is and just append your new results to it (differentials).
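A minimal sketch of that differential idea, assuming the exported records carry an ISO-format createdAt timestamp; the file names and fields are hypothetical.

```python
import json
from datetime import datetime, timezone

STATE_FILE = "last_processed.txt"   # high-water mark from the previous run
EXPORT_FILE = "parse_export.json"   # hypothetical JSON export
RESULTS_FILE = "daily_totals.csv"   # processed results, appended to each run

# Where did the previous run stop? (epoch start on the first run)
try:
    with open(STATE_FILE) as f:
        last_processed = f.read().strip()
except FileNotFoundError:
    last_processed = "1970-01-01T00:00:00.000Z"

with open(EXPORT_FILE) as f:
    records = json.load(f)["results"]

# Only aggregate records created since the last run (ISO timestamps sort lexically).
new_records = [r for r in records if r["createdAt"] > last_processed]
total = sum(r.get("amount", 0) for r in new_records)

with open(RESULTS_FILE, "a") as out:
    out.write(f"{datetime.now(timezone.utc).isoformat()},{len(new_records)},{total}\n")

if new_records:
    with open(STATE_FILE, "w") as f:
        f.write(max(r["createdAt"] for r in new_records))
```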

Is Apache Spark a good fit for storing and querying JSON data?

Architecture - a brief description of the architecture: I am working on an answering engine where people submit a query and wait for an answer (somewhat different from a search engine). The back-end looks for an automated answer or, if it doesn't find the answer directly, it sends a snippet to the interface along with a confidence score. Whatever snippets and answers get generated are stored in a MongoDB collection. Each query asked gets a unique URL and snippet ID; these IDs I save in MongoDB, and whenever a user lands on the URL from another search engine, a query is made to fetch the data from the MongoDB collection. At the start this architecture ran well, but now that the data is increasing I am seriously in need of a better architecture.
Should I store the data in Hadoop and write an MR program to fetch it?
Should I preferably use Spark and Shark?
Should I stick to MongoDB?
Should I go for HBase or Hive?
You are confusing architecture and technology selection. Though they are related, these are separate notions. (You can find a couple of articles I wrote about this in the past here and here, etc.)
Anyway, to your question: generally speaking, JSON is an expensive format that needs re-parsing every time you fetch it (unless you always want it as a "blob"). There are several other formats, like Avro, Google Protobuf, ORC, and Parquet, that support schema evolution but also use binary representations that are more efficient and faster to access.
Regarding the choice of persistent store: that highly depends on your intended use and anticipated loads. Note that some of the options you've mentioned are aimed at completely different usages, e.g. HBase, which you can use for real-time queries, vs. Hive, which has a rich analytical interface (via SQL) but is batch oriented.
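To illustrate the format point, here is a small PySpark sketch that converts JSON documents into Parquet once, so later queries avoid repeated JSON parsing; the paths and column names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# One-time conversion: parse the JSON snippets and persist them in a columnar format.
snippets = spark.read.json("hdfs:///answers/snippets.json")
snippets.write.mode("overwrite").parquet("hdfs:///answers/snippets.parquet")

# Later reads skip JSON parsing entirely and can prune columns.
spark.read.parquet("hdfs:///answers/snippets.parquet") \
     .select("snippet_id", "confidence") \
     .show(5)
```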

How to synchronize market data frequently and show it as historical time series data

http://pubapi.cryptsy.com/api.php?method=marketdatav2
I would like to synchronize market data on a continuous basis (e.g. Cryptsy and other exchanges). I would like to show the latest buy/sell price from the respective orders on these exchanges on a regular basis, as a historical time series.
What backend database should I use to store, and then render or plot, any parameter from the retrieved data as a historical time series?
I'd suggest you look at a database tuned for handling time series data. The one that springs to mind is InfluxDB. This question has a more general take on time series databases.
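As a hedged sketch of that approach, here is how ticks could be written with the influxdb-client Python package (targeting InfluxDB 2.x); the URL, token, org, bucket, and field names are placeholders.

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="MY_TOKEN", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One point per poll of the exchange API; tags identify the market, fields hold prices.
tick = (
    Point("market_tick")
    .tag("exchange", "cryptsy")
    .tag("pair", "BTC/USD")
    .field("buy", 423.10)
    .field("sell", 424.55)
)
write_api.write(bucket="ticks", record=tick)
```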
I think we need more detail about the requirement.
It just says "it needs to sync time series data". What is the scenario? What are the data source and destination?
Option 1.
If it is just a data synchronization issue between two databases, the easiest solution is the CouchDB family of NoSQL databases (CouchDB, Couchbase, Cloudant).
They are all based on CouchDB, and they provide a data-center-level data replication feature (XDCR). So you can replicate the data to another CouchDB in another data center, or even to CouchDB on mobile devices.
I hope it will be useful to you.
Option 2.
The other approach is a data integration approach. You can sync data by using an ETL batch job; a batch worker can copy data to the destination periodically. This is the most common way to replicate data to another destination. There are a lot of tools that support ETL, like Pentaho ETL, Spring Integration, and Apache Camel.
If you provide a more detailed scenario, I can help in more detail.
Enjoy
-Terry
I think MongoDB is a good choice. Here is why:
You can easily scale out, and thus be able to store a tremendous amount of data. When using an appropriate shard key, you might even be able to position the shards close to the exchange they follow in order to improve speed, should that become a concern.
Replica sets offer automatic failover, addressing availability, which could otherwise implicitly become an issue.
Using the TTL feature, data can be automatically deleted after its TTL expires, effectively creating a round-robin database.
Both the aggregation and the map/reduce frameworks will be helpful.
There are some free classes at MongoDB University which will help you avoid the most common pitfalls.
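A minimal pymongo sketch of the TTL and aggregation points, assuming a hypothetical ticks collection with exchange, buy, sell, and createdAt fields.

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING, DESCENDING

ticks = MongoClient("mongodb://localhost:27017")["market"]["ticks"]

# TTL index: documents expire 30 days after their createdAt timestamp,
# effectively giving a round-robin store of recent ticks.
ticks.create_index([("createdAt", ASCENDING)], expireAfterSeconds=30 * 24 * 3600)

ticks.insert_one({
    "exchange": "cryptsy",
    "buy": 423.10,
    "sell": 424.55,
    "createdAt": datetime.now(timezone.utc),
})

# Aggregation: latest buy/sell price per exchange.
latest = ticks.aggregate([
    {"$sort": {"createdAt": DESCENDING}},
    {"$group": {
        "_id": "$exchange",
        "buy": {"$first": "$buy"},
        "sell": {"$first": "$sell"},
    }},
])
for doc in latest:
    print(doc)
```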