Read millions of small files and insert into MySQL with Node.js

I've tried many approaches but can't find an efficient, performant way to open millions of files in a folder and insert their content into a database with Node.js.
It needs to be memory efficient and asynchronous because of the SQL queries.
Any insight?

I guess you are not building an app but doing more of a one-time migration, right?
If you just let Node.js read everything at once and insert into the DB in a simple JS loop, you are probably going to run into errors.
Your DB will either hang due to insufficient memory or choke on too many connections at once.
Node.js itself is lightweight; it just reads the millions of files.
My take on this vague question is that you need to control the insertion:
You can use a module like https://caolan.github.io/async/v3/ to control whether calls run in series or in parallel, for example with async.eachSeries() or async.waterfall().
To read the files, you can use Node.js' fs module, described here: https://www.tutorialspoint.com/nodejs/nodejs_file_system.htm
If you can't control which files Node.js is reading, you can:
Read a few files and store their contents in batches as JSON arrays or objects.
Insert each batch asynchronously or synchronously using the methods mentioned above.
The implementation depends entirely on how you nest each read and write.
Cheers
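The batching idea above can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: importDirectory and insertBatch are hypothetical names, and insertBatch stands in for whatever performs one multi-row INSERT against your database. It uses fs.promises.opendir so the directory is walked lazily instead of being loaded into one huge array, and it awaits each insert before reading more files so memory stays bounded.

```javascript
const fs = require('fs');
const path = require('path');

// Walk `dir` lazily, read each file, and hand rows to `insertBatch` in groups.
// `insertBatch` is a hypothetical async callback (e.g. one multi-row INSERT).
async function importDirectory(dir, batchSize, insertBatch) {
  const handle = await fs.promises.opendir(dir); // iterates entries without listing them all
  let batch = [];
  let total = 0;
  for await (const entry of handle) {
    if (!entry.isFile()) continue;
    const content = await fs.promises.readFile(path.join(dir, entry.name), 'utf8');
    batch.push([entry.name, content]);
    if (batch.length === batchSize) {
      await insertBatch(batch); // wait for the DB before reading more files
      total += batch.length;
      batch = [];
    }
  }
  if (batch.length > 0) {
    await insertBatch(batch); // flush the final partial batch
    total += batch.length;
  }
  return total;
}
```

With the mysql package, insertBatch could wrap a bulk insert such as connection.query('INSERT INTO files (name, content) VALUES ?', [rows]); the table and column names there are assumptions for illustration.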

Related

Options for storing and retrieving small objects (node.js), is a database necessary?

I am in the process of building my first live node.js web app. It contains a form that accepts data regarding my client's current stock. When submitted, an object is made and saved to an array of current stock. This stock is then permanently displayed on their website until the entry is modified or deleted.
It is unlikely that there will ever be more than 20 objects stored at any time, and these will only be updated perhaps once a week. I am not sure if it is necessary to use MongoDB to store these, or whether there could be a simpler, more appropriate alternative. Perhaps the objects could be stored in a JSON file instead? Or would this have too big an impact on page load times?
You could potentially store in a JSON file or even in a cache of sorts such as Redis but I still think MongoDB would be your best bet for a live site.
Storing something in a JSON file is not scalable, so if you end up storing a lot more data than originally planned (this often happens) you may find you run out of storage on your server's hard drive. Also, if you end up scaling and putting your app behind a load balancer, you will need to make sure there are matching copies of that JSON file on each server. Furthermore, it is easy to run into race conditions when updating a JSON file. If two processes try to update the file at the same time, you are potentially going to lose data. Technically speaking, a JSON file would work, but it's not recommended.
Storing in memory (e.g. in Redis) has a similar implication: the data is only available on that one server. Also, unless Redis persistence is configured, the data is not persistent, so if your server restarted for whatever reason, you'd lose what was stored in memory.
For all intents and purposes, MongoDB is your best bet.
The only way to know for sure is to test it with a load test. But as you probably read HTML and JS files from the file system when serving web pages anyway, the extra load of reading a few JSON files shouldn't be a problem.
If you want to go the simpler route, i.e. a JSON file, use NeDB, which is plenty fast as well.
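For a list of around 20 objects, the JSON-file option could look something like the sketch below. The function names loadStock and saveStock are made up for illustration. The save step writes to a temporary file and then renames it, which on POSIX file systems replaces the old file in one step and reduces the chance of a half-written file; it does not protect against two processes overwriting each other's changes.

```javascript
const fs = require('fs');

// Read the stock list, or return an empty list if the file does not exist yet.
function loadStock(file) {
  try {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
  } catch (err) {
    if (err.code === 'ENOENT') return [];
    throw err;
  }
}

// Write to a temp file first, then rename it over the real file.
function saveStock(file, items) {
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(items, null, 2));
  fs.renameSync(tmp, file);
}
```

Since the data changes about once a week, the list could also be loaded once at startup and kept in memory, so page loads never touch the disk.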

EC2 suitability for synching large CSV files from an FTP

I have to execute a task twice per week. The task consists of fetching a 1.4GB CSV file from a public FTP server. Then I have to process it (apply some filters, discard some rows, make some calculations) and then sync it to a Postgres database hosted on AWS RDS. For each row I have to retrieve a SKU entry from the database and determine whether it needs an update or not.
My question is whether EC2 could work as a solution for me. My main concern is memory. I have looked at solutions such as https://github.com/goodby/csv, which handle this issue by fetching row by row instead of pulling everything into memory; however, they do not work if I try to read the .csv directly from the FTP.
Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the issue of the csv size and memory limitations?
You won't be able to stream the file directly from FTP; instead, you are going to copy the entire file and store it locally. Using curl or the ftp command is likely the most efficient way to do this.
Once you do that, you will need to write some kind of program that reads the file a line at a time, or several lines at a time if you can parallelize the work. There are ETL tools available that will make this easy. Using PHP can work, but it's not a very efficient choice for this type of work and your parallelization options are limited.
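As one illustration of reading a line at a time, here is a minimal Node.js sketch (the thread itself does not prescribe a language). processCsv, keep, and onRow are hypothetical names, the first line is assumed to be a header, and the parsing is deliberately naive: it splits on commas and does not handle quoted fields, so a real CSV parser would be needed for messy data.

```javascript
const fs = require('fs');
const readline = require('readline');

// Stream a CSV file line by line; call `onRow` for each row that passes `keep`.
// Only one line is held in memory at a time.
async function processCsv(file, keep, onRow) {
  const rl = readline.createInterface({
    input: fs.createReadStream(file),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  let header = null;
  let kept = 0;
  for await (const line of rl) {
    if (line === '') continue;
    const cells = line.split(','); // naive: no support for quoted commas
    if (header === null) {
      header = cells; // assumption: first line holds the column names
      continue;
    }
    const row = Object.fromEntries(header.map((name, i) => [name, cells[i]]));
    if (keep(row)) {
      await onRow(row); // e.g. look up the SKU and update it if needed
      kept += 1;
    }
  }
  return kept;
}
```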
Of course you can do this on an EC2 instance (you can do almost anything you can supply the code for in EC2), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.
A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.
The FTP protocol doesn't do "streaming". You cannot read a file from FTP chunk by chunk.
Honestly, downloading the file and running the job on a bigger instance is not a big deal if you only run it twice a week. Just choose an r3.large (it costs less than 0.20/hour), execute ASAP, and stop it. The internal SSD disk space should give you the best possible I/O compared to EBS.
Just make sure your OS and code are deployed on EBS for future reuse (unless you have an automated code deployment mechanism). You must also make sure RDS can handle the burst I/O, otherwise it will become the bottleneck.
Even better, on an r3.large instance you can split the CSV file into smaller chunks, load them in parallel, then shut down the instance after everything finishes. Afterwards you only pay the minimal root EBS storage cost.
I would not suggest Lambda if the process is lengthy, since Lambda is only meant for short, fast processing (it will terminate after 300 seconds).
(update):
If you open a file, the simplest way to parse it is to read it sequentially, which may not make full use of the CPU. You can split the CSV file by following the referenced answer.
Then, using the same script, you can process the pieces simultaneously by sending some runs to the background. The example below shows Python processes being put in the background under Linux.
parse_csvfile.py csv1 &
parse_csvfile.py csv2 &
parse_csvfile.py csv3 &
So instead of sequential I/O on a single file, this makes use of multiple files. In addition, splitting the file should be a snap on an SSD.
So I made it work like this.
I used Python and two great libraries. First of all, I wrote Python code to request and download the CSV file from the FTP so I could load it into memory. The first package is Pandas, which is a tool for analyzing large amounts of data. It includes methods to read CSV files easily, and I used its built-in features to filter and sort. I filtered the large CSV by a field and created about 25 new, smaller CSV files, which allowed me to deal with the memory issue. I also used Eloquent, which is a library inspired by Laravel's ORM. This library allows you to create a connection using the AWS public DNS, database name, username, and password, and make queries using simple methods, without writing a single Postgres query. Finally, I created a T2 micro AWS instance, installed Pandas and Eloquent, updated my code, and that was it.

NodeJS cache mysql data with clustering enabled

I want to cache data that I get from my MySQL DB, and for this I am currently storing the data in an object.
Before querying the database, I check whether the needed data exists in the mentioned object. If not, I query the database and insert the result.
This works quite well, and my web server now fetches the data just once and reuses it.
My concern now is: do I have to think about concurrent reads/writes for the data structures that live in the object when using Node.js's clustering feature?
Every line of JavaScript that you write in your Node.js program is thread-safe, so to speak: at any given time, only a single statement is ever executed. Async operations are implemented at a low level that is completely transparent to the programmer. To be precise, code only runs in a "truly parallel" way when you do some input/output operation, i.e. reading a file, doing TCP/UDP communication, or spawning a child process. And even then, the only code that executes in parallel with your application is Node's native C/C++ code.
Since you use a JavaScript object as a cache store, you are guaranteed that no one will ever read from or write to it at the same time.
As for cluster, every worker is created as its own process and thus has its own copy of every JavaScript variable or object that exists in your code.
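A per-process cache along these lines might look like the sketch below. getCached and loadFromDb are hypothetical names; loadFromDb stands in for the real MySQL query. The sketch stores the pending promise rather than the result, so two requests that arrive before the first query finishes share one database call. Under cluster, each worker that loads this module gets its own Map, so workers do not see each other's entries.

```javascript
// Per-process cache: each cluster worker has its own copy of this Map.
const cache = new Map();

// Return the cached promise for `key`, or start one load via `loadFromDb`.
function getCached(key, loadFromDb) {
  if (cache.has(key)) return cache.get(key);
  // Wrapping in a Promise also turns a synchronous throw into a rejection.
  const pending = new Promise((resolve) => resolve(loadFromDb(key)));
  cache.set(key, pending);
  // Drop failed lookups so a later call can retry the query.
  pending.catch(() => cache.delete(key));
  return pending;
}
```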

Live chat application using Node JS Socket IO and JSON file

I am developing a live chat application using Node JS, Socket IO and a JSON file. I am using the JSON file to read and write the chat data. Now I am stuck on one issue: when I do stress testing, i.e. pushing continuous messages into the JSON file, the JSON becomes invalid and my application crashes. I am using forever.js, which should keep the application up, but it still crashes.
Does anybody have idea on this?
Thanks in advance for any help.
It is highly recommended that you re-consider your approach for persisting data to disk.
Among other things, one really big issue is that you will likely experience data loss. If we both get the file at the exact same time - {"foo":"bar"} - we both make a change and you save it before me, my change will overwrite yours since I started with the same thing as you. Although you saved it before me, I didn't re-open it after you saved.
What you are possibly seeing now with an append-only approach is that we're both adding bits and pieces without regard to valid JSON structure (i.e. {"fo"bao":r":"ba"for"o"} from {"foo":"bar"} x 2).
Disk I/O is actually pretty slow, even with an SSD. Memory is where it's at.
As recommended, you may want to consider MongoDB, MySQL, or similar. This may be a decent use case for Couchbase, which is an in-memory key/value store based on memcached that persists things to disk ASAP. It is extremely JSON friendly (it is actually mostly based on JSON), offers great map/reduce support to query data, is super easy to scale to multiple servers, and has a node.js module.
This would allow you to very easily migrate your existing data storage routine into a database. It also provides CAS support, which will protect you from data loss in the scenarios outlined earlier.
At minimum, though, you should probably just modify an in-memory object that you save to disk every so often to prevent permanent data loss. However, this only works well with one server, and then you're back to likely needing a database.
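The "in-memory object saved every so often" fallback could be sketched as below. ChatLog and its methods are hypothetical names. All changes go to an array in memory, and only flush touches the disk, writing the whole array as one valid JSON document so concurrent messages can never interleave inside the file. As noted above, this assumes a single server process.

```javascript
const fs = require('fs');

class ChatLog {
  constructor(file) {
    this.file = file;
    // Load earlier messages if a previous run saved any.
    this.messages = fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
    this.dirty = false;
  }

  // Cheap: only touches memory.
  add(message) {
    this.messages.push(message);
    this.dirty = true;
  }

  // Write the full log to a temp file, then rename it into place.
  flush() {
    if (!this.dirty) return false;
    const tmp = `${this.file}.tmp`;
    fs.writeFileSync(tmp, JSON.stringify(this.messages));
    fs.renameSync(tmp, this.file);
    this.dirty = false;
    return true;
  }

  // Flush periodically; unref() lets the process exit despite the timer.
  start(intervalMs) {
    const timer = setInterval(() => this.flush(), intervalMs);
    timer.unref();
    return timer;
  }
}
```

Messages added since the last flush are lost if the process crashes, so the interval is a trade-off between disk load and how much data you can afford to lose.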

Caching database queries with Node.js

Is there an implementation of database (mysql) query caching written purely in Node.js?
I'm writing a Node web app and was planning on caching queries with memcached, but while considering this I realised it's probably possible to do the caching through a separate Node.js layer instead.
To explain:
You could query the database through a node server on a separate port, returning data from memory where available and loading it into memory where it isn't.
Anyone know how Node.js would compare to memcache in terms of return speed on hashed arrays? Is this a pipe-dream or something I should look at?
I went ahead and wrote a caching solution for private use that stored the data in a shared object. This wasn't really query caching, it stores specific results instead of raw sql results ordered by hashes, but it kept what I needed in memory and was ridiculously easy to write.
Since I originally asked this question, a number of node caching solutions have emerged:
ptarjan/node-cache
tcs-de/nodecache
vxtindia/node-cache
mape/node-caching
I haven't used any of these but one of them might well be of use to someone else.
There are now also redis and memcached clients for node.
You can definitely implement something like this in node, and it could be an interesting project, but it depends on your needs. If you're just doing this for a hobby project, by all means, build a caching layer in node and try it out. Let us know how it goes!
If this is for production use, then I would recommend sticking to the established caching layers (memcached, redis, etc) as they have already gone through all of the growing pains associated with building a scalable caching system.
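For the hobby-project route, a tiny in-process query cache might look like this sketch. QueryCache and cacheKey are hypothetical names, not part of any library mentioned here. Results are keyed by a hash of the SQL text plus its parameters and expire after a fixed time; there is no size limit or invalidation on writes, which is part of what the established caching layers handle for you.

```javascript
const crypto = require('crypto');

// Build a stable key from the SQL text and its parameters.
function cacheKey(sql, params = []) {
  return crypto.createHash('sha1').update(sql).update(JSON.stringify(params)).digest('hex');
}

class QueryCache {
  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Return cached rows, or undefined on a miss or an expired entry.
  get(sql, params = []) {
    const key = cacheKey(sql, params);
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.rows;
  }

  set(sql, params, rows) {
    this.entries.set(cacheKey(sql, params), { rows, expiresAt: this.now() + this.ttlMs });
  }
}
```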
I have written a node.js module that performs MySQL query caching using memcached.
The module is named Memento and is available at https://www.npmjs.com/package/memento-mysql
Enjoy!