EC2 suitability for syncing large CSV files from an FTP

I have to execute a task twice per week. The task consists of fetching a 1.4GB CSV file from a public FTP server, processing it (applying some filters, discarding some rows, making some calculations) and then syncing it to a Postgres database hosted on AWS RDS. For each row I have to retrieve the SKU entry in the database and determine whether it needs an update or not.
My question is whether EC2 could work as a solution for me. My main concern is memory. I have looked at some solutions such as https://github.com/goodby/csv which handle this issue by fetching row by row instead of pulling the whole file into memory, but they do not work if I try to read the .csv directly from the FTP.
Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the CSV size and the memory limitations?

You won't be able to stream the file directly from FTP; instead, you are going to have to copy the entire file and store it locally. Using curl or the ftp command is likely the most efficient way to do this.
Once you do that, you will need to write some kind of program that reads the file a line at a time, or several at a time if you can parallelize the work. There are ETL tools available that will make this easy. Using PHP can work, but it's not a very efficient choice for this type of work, and your parallelization options are limited.
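To make the line-at-a-time idea concrete, here is a minimal Python sketch (the file name, table, columns and connection details are hypothetical placeholders; it assumes psycopg2 is available and that there is a unique constraint on the sku column):

import csv
import psycopg2  # any Postgres driver would do; psycopg2 is assumed here

# Hypothetical connection details, file name, table and column names.
conn = psycopg2.connect(host="my-rds-endpoint.amazonaws.com", dbname="mydb",
                        user="myuser", password="mypassword")

with conn, conn.cursor() as cur, open("products.csv", newline="") as f:
    reader = csv.DictReader(f)      # yields one row at a time; the file is never fully in memory
    for row in reader:
        if not row["sku"]:          # example filter: skip rows without a SKU
            continue
        # Update the existing SKU entry or insert a new one
        # (assumes a unique constraint on sku).
        cur.execute(
            "INSERT INTO products (sku, price) VALUES (%s, %s) "
            "ON CONFLICT (sku) DO UPDATE SET price = EXCLUDED.price",
            (row["sku"], row["price"]),
        )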

Of course you can do this on an EC2 instance (you can do almost anything in EC2 that you can supply the code for), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.
A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.

The FTP protocol doesn't do "streaming". You cannot read the file from FTP chunk by chunk.
Honestly, downloading the file and running a bigger instance is not a big deal if you only run the task twice a week: just choose an r3.large (it costs less than $0.20/hour), execute ASAP and stop it. The internal SSD disk should give you the best possible I/O compared to EBS.
Just make sure your OS and code are deployed on EBS for future reuse (unless you have an automated code deployment mechanism). And you must make sure RDS will handle the burst I/O, otherwise it will become the bottleneck.
Even better, using an r3.large instance, you can split the CSV file into smaller chunks, load them in parallel, then shut down the instance after everything finishes. You just need to pay the minimal root EBS storage cost afterwards.
I would not suggest Lambda if the process is lengthy, since Lambda is only meant for short and fast processing (it will terminate after 300 seconds).
Update:
If you open a file, the simple way to parse it is to read it sequentially, which may not put the CPU to full use. You can split up the CSV file following the approach in this answer.
Then, using the same script, you can run them simultaneously by sending some to background processes; the example below shows putting Python processes in the background under Linux.
parse_csvfile.py csv1 &
parse_csvfile.py csv2 &
parse_csvfile.py csv3 &
So instead of single-file sequential I/O, it will make use of multiple files. In addition, splitting the file should be a snap on the SSD.
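A minimal sketch of the splitting step in Python (input name and chunk count are placeholders; the output names match the csv1/csv2/csv3 used above, and each chunk gets a copy of the header so the parse script can treat it as a normal CSV):

import csv

N = 3                                    # number of chunks, matching csv1..csv3 above
with open("big.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    outs = [open(f"csv{i + 1}", "w", newline="") for i in range(N)]
    writers = [csv.writer(f) for f in outs]
    for w in writers:
        w.writerow(header)               # repeat the header in every chunk
    for i, row in enumerate(reader):
        writers[i % N].writerow(row)     # round-robin the rows across the chunks
    for f in outs:
        f.close()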

So I made it work like this.
I used Python and two great libraries. First of all I wrote a Python script to request and download the CSV file from the FTP so I could load it into memory. The first package is Pandas, a tool for analyzing large amounts of data which includes methods for reading CSV files easily. I used its built-in features to filter and sort: I filtered the large CSV by a field and created about 25 new, smaller CSV files, which allowed me to deal with the memory issue. I also used Eloquent, a library inspired by Laravel's ORM. It lets you create a connection using the AWS public DNS, database name, username and password, and make queries using simple methods without writing a single Postgres query. Finally I created a T2 micro AWS instance, installed Pandas and Eloquent, updated my code, and that was it.
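The download/filter/split part looked roughly like the sketch below (simplified; the FTP host, file names, filter condition and grouping column are placeholders for whatever your data actually uses):

import pandas as pd
from ftplib import FTP

# Fetch the CSV from the public FTP server (host and path are placeholders).
ftp = FTP("ftp.example.com")
ftp.login()
with open("data.csv", "wb") as f:
    ftp.retrbinary("RETR /pub/data.csv", f.write)
ftp.quit()

# Load, filter, and split into smaller per-category files (roughly 25 in my case).
df = pd.read_csv("data.csv")
df = df[df["price"] > 0]                          # example filter: drop invalid rows
for category, group in df.groupby("category"):    # placeholder grouping column
    group.to_csv(f"chunk_{category}.csv", index=False)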

SSIS Cache Manager

Is it possible to use an SSIS Cache Manager with anything other than a Lookup? I would like to use similar data across multiple data flows.
I haven't been able to find a way to cache this data in memory in a cache manager and then reuse it in a later flow.
Nope, the Cache Connection Manager was specific to solving Lookup tasks, which originally only allowed an OLE DB connection to be used.
However, if you have a set of data you want to be static for the life of a package run and able to be used across data flows, or even other packages, as a table-like entity, perhaps you're looking for a Raw File. It's a tight, binary implementation of the data stored to disk. Since it's stored to disk, you will pay a write and subsequent read performance penalty but it's likely that the files are right sized such that any penalty is offset by the specific needs.
The first step is to define the data that will go into a Raw File and connect a Raw File Destination. This will involve creating a Raw File Connection Manager, where you define where the file lives and the rules about the data in it (recreate, append, etc.). At this point, run the Data Flow Task so the file is created and populated.
The next step is that everywhere you want to use the data, you patch in a Raw File Source. It will behave much like any other data source in your toolkit at that point.

Using Consul for dynamic configuration management

I am working on designing a small project where I need to use Consul to manage application configuration dynamically, so that all my app machines can get the configuration at the same time without any inconsistency issue. We are already using Consul for service discovery, so I was reading more about it, and it looks like it has a Key/Value store which I can use to manage my configurations.
All our configurations are JSON files, so we make a zip file with all our JSON config files in it and store a reference from which this zip file can be downloaded in a particular key in the Consul Key/Value store. All our app machines need to download the zip file from that reference (mentioned in a key in Consul) and store it on disk on each machine. Now I need all the app machines to switch to the new config at approximately the same time to avoid any inconsistency issue.
Let's say I have 10 app machines, and all 10 machines need to download the zip file with all my configs and then switch to the new configs atomically at the same time to avoid any inconsistency (since they are taking traffic). Below are the steps I came up with, but I am confused about how loading the new files into memory and switching to the new configs will work:
All 10 machines are already up and running with the default config files, which are also on disk.
Some outside process will update the key in my Consul key/value store with the latest zip file reference.
All 10 machines have a watch on that key, so once someone updates the value of the key, the watch will be triggered and all 10 machines will download the zip file onto disk and uncompress it to get all the config files.
(..)
(..)
(..)
Now this is where I am confused about how the remaining steps should work.
How should the apps load these config files into memory and then all switch at the same time?
Do I need to use leader election with Consul, or anything else, to achieve any of this?
What would the logic be here, since all 10 apps are already running with the default configs in memory (which are also stored on disk)? Do we need two separate directories, one for the default and one for the new configs, and then work with those two directories?
Let's say this is the node I have in Consul, just a rough design (which could be wrong):
{"path":"path-to-new-config", "machines":"ip1:ip2:ip3:ip4:ip5:ip6:ip7:ip8:ip9:ip10", ...}
Here path would hold the new zip file reference, and machines could be a key holding the list of all machines, so that each machine can add its IP address to that key as soon as it has downloaded the file successfully. Once the machines list has 10 entries, I can say we are ready to switch. If that's the approach, how can I atomically update the machines key in that node? Maybe this logic is wrong, but I just wanted to throw something out. I would also need to clean up the machines list after the switch, since I'd need to do a similar exercise for the next config update.
Can someone outline the logic for how I can efficiently manage configuration on all my app machines dynamically while also avoiding inconsistency issues? Maybe I need one more node as a status key, with details about each machine's config: when it downloaded, when it switched, and so on?
I can think of several possible solutions, depending on your scenario.
The simplest solution is not to store your config in memory and files at all; just store the config directly in the Consul KV store. And I'm not talking about a single key that maps to the entire JSON (I'm assuming your JSON is big, otherwise you wouldn't zip it), but extracting smaller key/value sets from the JSON (this way you won't need to pull the whole thing every time you make a query to Consul).
If you get the config directly from Consul, your consistency guarantees match Consul's consistency guarantees. I'm guessing you're worried about performance if you lose your in-memory config; that's something you need to measure. If you can tolerate the performance loss, though, this will save you a lot of pain.
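For illustration, reading individual settings straight from the KV store over Consul's HTTP API is only a few lines (the key names are hypothetical; values come back base64-encoded):

import base64
import requests

CONSUL = "http://localhost:8500"

def get_config_value(key, default=None):
    # Each setting lives under its own key, e.g. myapp/config/db_pool_size (hypothetical).
    resp = requests.get(f"{CONSUL}/v1/kv/{key}")
    if resp.status_code == 404:
        return default
    return base64.b64decode(resp.json()[0]["Value"]).decode()

db_pool_size = int(get_config_value("myapp/config/db_pool_size", "10"))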
If performance is a problem here, a variation on this might be to use fsconsul. With this, you'll still extract your json into multiple key/value sets in consul, and then fsconsul will map that to files for your apps.
If that's off the table, then the question is how much inconsistency you are willing to tolerate.
If you can stand a few seconds of inconsistency, your best bet might be to put a TTL (time-to-live) on your in-memory config. You'll still have the watch on Consul, but you combine it with evicting your in-memory cache every few seconds, as a fallback in case the watch fails (or stalls) for some reason. This should give you a worst case of a few seconds of inconsistency (depending on the value you set for your TTL), but the normal case should (I think) be fast.
If that's not acceptable (does downloading the zip take a lot of time, maybe?), you can go down the route you mentioned. To update a value atomically you can use Consul's cas (check-and-set) operation. It will give you an error if an update happened between the time you sent the request and the time Consul tried to apply it. Then you pull the list of machines again, apply your change, and retry (until it succeeds).
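Here is a rough sketch of that check-and-set loop against Consul's KV HTTP API (the key name and value format are hypothetical; Consul rejects the PUT if the key's ModifyIndex changed since it was read):

import base64
import json
import requests

CONSUL = "http://localhost:8500"
KEY = "myapp/deploy/ready-machines"   # hypothetical key holding the list of ready machines

def register_machine(ip):
    while True:
        # Read the current value together with its ModifyIndex.
        resp = requests.get(f"{CONSUL}/v1/kv/{KEY}")
        if resp.status_code == 404:
            index, machines = 0, []   # cas=0 means "create only if the key does not exist"
        else:
            entry = resp.json()[0]
            index = entry["ModifyIndex"]
            machines = json.loads(base64.b64decode(entry["Value"])) if entry["Value"] else []
        if ip not in machines:
            machines.append(ip)
        # Check-and-set: the PUT returns false if someone else updated the key first.
        ok = requests.put(f"{CONSUL}/v1/kv/{KEY}",
                          params={"cas": index},
                          data=json.dumps(machines)).json()
        if ok:
            return machines           # caller can check len(machines) == 10 to trigger the switch
        # Lost the race; re-read and retry.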
I don't see why you would need two directories, but maybe I'm misunderstanding the question: when your app starts, before you do anything else, you check whether there's a new config, and if there is, you download it and load it into memory. So you shouldn't have a "default config" if you want to be consistent. After you have downloaded the config on startup, you're up and alive. When your watch signals a key change, you can download the config and directly override your old config. This assumes you're running the watch-triggered code on a single thread, so you're not going to be downloading the file multiple times in parallel. If the download failed, it's not as if you're going to load the corrupt file into memory. And if you crashed mid-download, you'll download again on startup, so you should be fine.

loading 20 million records from SSIS to SNOWFLAKE through ODBC

I am trying to load around 20 million records from SSIS to Snowflake using an ODBC connection, and this load is taking forever to complete. Is there any faster method than using ODBC? I can think of loading it into a flat file and then using the flat file to load into Snowflake, but I'm not sure how to do it.
Update:
I generated a text file using bcp, then put that file on a Snowflake stage using the ODBC connection, and then used the COPY INTO command to load the data into the tables.
Issue: the generated txt file is 2.5GB, and ODBC is struggling to send the file to the Snowflake stage. Any help on this part?
It should be faster to write compressed objects to the cloud provider's object store (AWS S3, Azure blob, etc.) and then COPY INTO Snowflake. But also more complex.
You aren't, by chance, writing one row at a time, i.e. 20,000,000 database calls?
ODBC is slow on a database like this; Snowflake (and similar columnar warehouses) also wants to eat shredded files, not single large ones. The problem with your original approach is that no method of ODBC usage is going to be particularly fast on a system designed to load its nodes in parallel across shredded, staged files.
The problem with your second approach is that no shredding took place. Non-columnar databases with a head node (say, Netezza) would happily eat and shred your single file, but Snowflake or Redshift is basically going to ingest it as a single thread into a single node. Thus your ingest of a single 2.5 GB file is going to take the same amount of time on an XS 1-node Snowflake warehouse as on an L 8-node cluster. The single node itself is not saturated and has plenty of CPU cycles to spare, doing nothing. Snowflake appears to use up to 8 write threads per node for an extract or ingest operation. You can see some tests here: https://www.doyouevendata.com/2018/12/21/how-to-load-data-into-snowflake-snowflake-data-load-best-practices/
My suggestion would be to make at least 8 files of size 2.5 GB / 8, i.e. about 8 files of ~315MB each. For 2 nodes, at least 16. This likely involves some effort in your file-creation process if it does not natively shred and scale horizontally, although as a bonus it breaks your data into bite-sized pieces that are easier to abort/resume/etc. should any problems occur.
Also note that once the data is bulk-inserted into Snowflake, it is unlikely to be optimally placed to take advantage of micro-partitions, so I would recommend something like rebuilding the table with the loaded data and at least sorting it on an often-filtered column; e.g. a fact table I would at least rebuild and sort by date. https://www.doyouevendata.com/2018/03/06/performance-query-tuning-snowflake-clustering/
Generate the file and then use the SnowSQL CLI to PUT it in an internal stage. Use COPY INTO for stage -> table. There is some coding to do, and you can never avoid transporting gigabytes over the net, but PUT can compress the file and transfer it in chunks.
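A rough sketch of that flow through the Python connector (connection parameters, stage path and table name are placeholders; the same PUT and COPY INTO statements can be run from the SnowSQL CLI instead):

import snowflake.connector  # snowflake-connector-python, assumed available

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    user="MY_USER", password="...", account="my_account",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Upload the (ideally pre-split) files to an internal stage; PUT compresses them
# and uploads with several parallel threads.
cur.execute("PUT file:///tmp/extract_*.csv @~/my_load AUTO_COMPRESS=TRUE PARALLEL=8")

# Bulk load from the stage into the target table.
cur.execute("""
    COPY INTO my_table
    FROM @~/my_load
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")

cur.close()
conn.close()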

Backing up DynamoDB tables via data pipeline vs manually creating a json for dynamoDB

I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables another team uses/works on, not me. These backups need to happen once a week and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a Data Pipeline, which I'm guessing you can schedule to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering if there is a significant cost difference between backing the tables up via the pipeline and keeping it open, versus creating something like a PowerShell script, scheduled to run on an EC2 instance that already exists, which would manually create a JSON mapping file and upload it to S3.
Also, I guess another question is more of a practicality question: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
Are you working under the assumption that Data Pipeline keeps the server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server will terminate after the activity completes. (You may manually set termination protection. Ref.)
Since you only run a pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up when you want to run the backup, and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample on how to export data from DynamoDB.
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.

Looking to make MySQL DB records readable as CSV-like lines in a text file via a standard file I/O interface Linux device driver (working with legacy code)

I want to be able to read from a real, live, proper MySQL database using standard file access routines. I don't mean reading the MySQL database's own underlying private files. What I mean is implementing a file-based Linux device driver that "presents" a MySQL database as a file. In other words, the text file is a "view" of the MySQL database. The MySQL records are presented in our homegrown custom variation of the CSV format that the legacy code was originally written to understand.
Background
I have some legacy code that reads from a text file containing a very large table of data, each line being a separate record. New records (lines) need to be added, but there is contention for the file among the team, and there is also an overhead in deploying the legacy code and this file to many systems when releasing the software to them. The text file itself also needs to be version controlled.
Rather than modify the legacy code to query a MySQL database version of these records directly, I thought it would be better to leave it untouched. This avoids the risks of modifying the code and eases deployment; moreover, modifying the code would incur much overhead in de-risking, design discussions, extra testing, etc.
So what I'm looking to do is write a file-based device driver such that this makes the MySQL database appear as a file to the legacy code, with the data within the format that the legacy code expects. That way the legacy code is not changed and can work oblivious that the file is really an underlying database. Contention is removed because the individual records in the database can now be updated/added to separately (via MySQL, or even better a separate web admin interface that guides and validates data entry from the user for individual records) and deployment effort is much reduced without having to up-issue the whole file on all the systems that use it.
The device driver would contain routines to internally translate standard file read operations into MySQL queries to the MySQL database and contain routines to return the MySQL results and translate these into the text format for returning back to the file read operation.
This is for a Linux/Unix platform.
Has this been done and what are your thoughts?
This kind of thing has been done before - an obvious example being the dynamic view filing system in ClearCase which provided (maybe still does?) a virtualised view onto a version control repository. Behind the scenes it implemented an object cache and used RPC to fetch objects from other hosts if necessary, and made extensive use of both local and remote databases.
It's fairly clear that you are going to implement the bulk of your filing system in user space, but you will need a (small) kernel-resident portion. Unless there's a really good reason to do otherwise, FUSE is what you're looking for: it will provide the kernel-resident part for you. All you'll need to write is the glue to turn file operations into SQL requests.
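As an illustration of how small that glue can be, here is a very rough read-only sketch using fusepy and PyMySQL (both assumed available; the table, columns, credentials and mount point are placeholders, and a real version would need caching, your homegrown CSV variant, and a smarter strategy than rendering the whole table on every read):

import errno
import stat
import time
import pymysql
from fuse import FUSE, FuseOSError, Operations

class MySQLAsFile(Operations):
    def __init__(self):
        self.filename = "/records.csv"   # the single "file" the legacy code will read

    def _render(self):
        # Query the database and render the rows as CSV-like lines (placeholder schema).
        conn = pymysql.connect(host="localhost", user="app",
                               password="secret", database="legacy")
        with conn.cursor() as cur:
            cur.execute("SELECT sku, price, qty FROM records")
            rows = cur.fetchall()
        conn.close()
        # The homegrown CSV variation would go here; plain commas for the sketch.
        return "".join(",".join(str(c) for c in row) + "\n" for row in rows).encode()

    def getattr(self, path, fh=None):
        now = time.time()
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2,
                    "st_ctime": now, "st_mtime": now, "st_atime": now}
        if path == self.filename:
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(self._render()),
                    "st_ctime": now, "st_mtime": now, "st_atime": now}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", self.filename[1:]]

    def read(self, path, size, offset, fh):
        return self._render()[offset:offset + size]

if __name__ == "__main__":
    FUSE(MySQLAsFile(), "/mnt/legacy", foreground=True, ro=True)

Mounted at /mnt/legacy (a placeholder path), the legacy code could then open /mnt/legacy/records.csv exactly as if it were the original text file.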