I'm writing a script which runs a MySQL query that returns ~5 million results. I need to manipulate this data:
Formatting
Encoding
Convert some sections to JSON
etc
Then write that to a CSV file in the shortest amount of time possible.
Since the node-mysql module is able to handle streams, I figured that's my best route. However, I'm not familiar enough with writing my own streams to do this.
What would be the best way to stream the data from MySQL, through a formatting function, and into a CSV file?
We don't mind what order the data ends up in, so long as it's written to the CSV file as quickly as possible.
https://github.com/wdavidw/node-csv seems to do what you need. It can read streams, write streams, and has built-in transformers.
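For concreteness, here is a minimal sketch of such a pipeline, assuming the classic mysql (node-mysql) driver and the csv package. The connection settings, the big_table name, and the payload column that gets converted to JSON are placeholders for illustration.

    import * as fs from 'fs';
    import * as mysql from 'mysql';              // node-mysql driver
    import { transform, stringify } from 'csv';  // transformer + stringifier from the csv package

    const connection = mysql.createConnection({
      host: 'localhost',                         // placeholder credentials
      user: 'user',
      password: 'secret',
      database: 'mydb',
    });

    connection.query('SELECT * FROM big_table')  // hypothetical table
      .stream({ highWaterMark: 100 })            // readable stream of row objects
      .pipe(transform((row: any) => ({
        ...row,
        payload: JSON.stringify(row.payload),    // per-row formatting / encoding / JSON conversion
      })))
      .pipe(stringify({ header: true }))         // object rows -> CSV lines
      .pipe(fs.createWriteStream('output.csv'))
      .on('finish', () => connection.end());

Because every stage is a stream, only a small window of rows is in memory at any one time, which is what keeps 5 million rows tractable.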
Related
I am new to Go and am using the encoding/csv package's ReadAll() to read all records of a CSV file, e.g.
records, err := csv.NewReader(file).ReadAll() // file is an io.Reader, e.g. from os.Open
I just want to know whether there are any constraints I should be aware of, e.g. CSV file size.
How big a CSV file can I read using ReadAll() without issues?
The only limitation here is the hardware's available RAM, since ReadAll() holds every record in memory at once.
If you run into that limit, you can (depending on the case) resolve it with stream processing.
With stream processing, you read and handle one record at a time before moving on to the next.
Here is an example from another thread.
Which technology would be best to import a large number of large JSON Lines files (approx. 2 GB per file)?
I am thinking about Solr.
Once the data is imported, it will have to be queryable.
Which technology would you suggest for importing and then querying JSON Lines data in a timely manner?
You can start prototyping with whatever scripting language you prefer: read the lines, massage the format as needed to get valid Solr JSON, and send it to Solr via HTTP. That would be the fastest way to get going (there is a sketch of this approach after the list below).
Longer term, SolrJ will allow you to get maximum performance (if you need it), since you can:
hit the leader replica in a SolrCloud environment directly
use multiple threads to ingest and send docs (you can also use multiple processes). Not that this is hard or impossible with other technologies, but with some of them it is.
use the full flexibility of the entire SolrJ API
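As a rough illustration of the prototyping route, here is a sketch that reads a JSON Lines file line by line and posts the documents to Solr's JSON update handler in batches. The Solr URL, the "docs" collection name, the file name, and the batch size are all assumptions, and it relies on the global fetch available in Node 18+.

    import * as fs from 'fs';
    import * as readline from 'readline';

    // Assumption: Solr runs locally and the target collection is called "docs".
    const SOLR_UPDATE = 'http://localhost:8983/solr/docs/update?commit=true';

    async function send(docs: unknown[]): Promise<void> {
      // Solr's JSON update handler accepts an array of documents in the body.
      const res = await fetch(SOLR_UPDATE, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(docs),
      });
      if (!res.ok) throw new Error(`Solr returned ${res.status}`);
    }

    async function ingest(path: string, batchSize = 1000): Promise<void> {
      const rl = readline.createInterface({ input: fs.createReadStream(path) });
      let batch: unknown[] = [];

      for await (const line of rl) {
        if (!line.trim()) continue;
        batch.push(JSON.parse(line));   // massage the document here if needed
        if (batch.length >= batchSize) {
          await send(batch);
          batch = [];
        }
      }
      if (batch.length > 0) await send(batch);
    }

    ingest('data.jsonl').catch(console.error);

Committing on every batch (commit=true) keeps the example simple; for a real import you would normally commit once at the end or rely on autoCommit.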
I have a question relating to machine learning applications in the real world. It might sound stupid, lol.
I've been self-studying machine learning for a while, and most of the exercises use a CSV file as the data source (both processed and raw). I would like to ask: are there any methods other than importing a CSV file to channel/supply data for machine learning?
Example: streaming Facebook/Twitter live feed data for machine learning in real time, rather than collecting old data and storing it in a CSV file.
The data source can be anything. Usually it's provided as a CSV or JSON file. But in the real world, say you have a website such as Twitter, as you're mentioning, you'd be storing your data in a relational database such as a SQL database, and some of the data you'd be putting in an in-memory cache.
You can use both of these to retrieve your data and process it. The catch is that when you have too much data to fit in memory, you can't just query everything and process it at once; in that case, you'll need algorithms that process the data in chunks.
A good thing about databases such as SQL databases is that they provide a set of functions you can invoke right in your SQL query to calculate things efficiently. For example, you can get the sum of a column across the whole table using the SUM() function, which allows for efficient and easy data manipulation.
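As a small illustration of both points, here is a sketch using the Node mysql driver; the purchases table and its columns are made up for the example, and any other relational database and client would work the same way.

    import * as mysql from 'mysql';  // assumption: MySQL holds the raw data

    const db = mysql.createConnection({
      host: 'localhost', user: 'user', password: 'secret', database: 'mydb',
    });

    // Let the database aggregate: one number comes back instead of every row.
    db.query('SELECT SUM(amount) AS total FROM purchases', (err, rows) => {
      if (err) throw err;
      console.log('total spend:', rows[0].total);
    });

    // When the rows will not fit in memory, stream them and build features incrementally.
    db.query('SELECT user_id, amount FROM purchases')
      .stream({ highWaterMark: 500 })
      .on('data', (row: any) => {
        // accumulate per-user statistics, append to a training set file, etc.
      })
      .on('end', () => db.end());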
This is a performance question: I created a web app (in Node.js) that loads a JSON file with around 10,000 records and then displays that data to the user. I'm wondering if it would be faster to use, for example, MongoDB (or any other NoSQL database, e.g. CouchDB) instead. And how much faster would it be?
If you are looking for speed, JSON is quite specifically "not fast": it sends the keys along with the values and requires some heavy parsing on the receiving end. Reading the data from a file can also be slower than reading from the DB. I wouldn't like to say which is better, so you'll have to test it.
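If you do test it, a rough benchmark might look like the sketch below. It assumes the mongodb driver, a local MongoDB instance with the same 10,000 records loaded into app.records, and the JSON file at data.json; all of those names are placeholders.

    import * as fs from 'fs';
    import { MongoClient } from 'mongodb';

    async function main(): Promise<void> {
      // Reading and parsing the JSON file from disk.
      console.time('json file');
      const fromFile = JSON.parse(fs.readFileSync('data.json', 'utf8'));
      console.timeEnd('json file');

      // Fetching the same records from MongoDB (connection set up outside the timer).
      const client = new MongoClient('mongodb://localhost:27017');
      await client.connect();
      console.time('mongodb');
      const fromDb = await client.db('app').collection('records').find().toArray();
      console.timeEnd('mongodb');

      console.log(fromFile.length, fromDb.length);
      await client.close();
    }

    main().catch(console.error);

Run it a few times and against the access pattern your app actually uses; for a one-shot load of 10,000 records the file is often fine, and a database tends to pay off once you need filtering, partial reads, or concurrent access.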
Rephrasing the question:
I am working on HTTP REST calls and driving them via the Tcl http package.
So I ran an HTTP GET request with http::geturl, which returns an array containing the output.
When I do parray $arrName, the wish shell hangs, as the data is around 50 MB.
It works fine in the Tcl shell, since that shows only the buffered output, not the complete data.
Is there any solution for doing page-wise reading in wish.exe?
There is no way out of it -- with data of this magnitude, you need to paginate. How exactly to do that depends on your budget (not necessarily money -- the time and effort you have for this task) and on the data format.
If the data is tabular, you may wish to use the TkTable widget. If it is HTML, TkHTML may be called for. Simply dumping 50 MB worth of text to stdout is hardly useful -- a human user can't read through all that anyway. You either need to highlight the important parts or filter out the unimportant...