FileAppender with maxHistory - logback

I am writing a batch application and want a new log file created at each application launch. I am tempted to use logback's uniquely named file feature, so running the batch application 100 times would create 100 log files. I also want to keep at most 30 archive files and have older files deleted asynchronously. Another Stack Overflow question asks the same thing. I would rather not write code for this; is there an easy way to achieve it using logback alone?

I ended up using log4j2's RollingFileAppender. It has an "OnStartup Triggering Policy", which is perfect for batch applications.

Using Consul for dynamic configuration management

I am designing a small project in which I need to use Consul to manage application configuration dynamically, so that all my app machines pick up the configuration at the same time without any inconsistency. We already use Consul for service discovery, so I read more about it, and it looks like it has a Key/Value store that I can use to manage my configuration.
All our configuration is in JSON files, so we make a zip file containing all of them and store a reference to where this zip file can be downloaded under a particular key in the Consul Key/Value store. Every app machine needs to download the zip file from that reference and store it on its local disk. I then need all app machines to switch to the new config at approximately the same time to avoid inconsistency.
Let's say I have 10 app machines, and all 10 need to download the zip file containing my configs and then switch to the new configs at the same time, atomically, to avoid any inconsistency (since they are taking traffic). Below are the steps I came up with, but I am confused about how loading the new files into memory and switching to the new configs should work:
All 10 machines are already up and running with the default config files, which are also stored on disk.
Some outside process will update the key in my Consul Key/Value store with the latest zip file reference.
All 10 machines have a watch on that key, so once someone updates its value the watch is triggered and each machine downloads the zip file to disk and uncompresses it to get the config files.
(..)
(..)
(..)
Now this is where I am confused about how the remaining steps should work.
How should the apps load these config files into memory and then all switch at the same time?
Do I need to use leader election with Consul, or anything else, to achieve any of this?
What should the logic look like, given that all 10 apps are already running with the default configs in memory (which are also stored on disk)? Do we need two separate directories, one for the default configs and one for the new configs, and then work with both?
Let's say this is the node I have in Consul; it is just a rough design and could be wrong:
{"path":"path-to-new-config", "machines":"ip1:ip2:ip3:ip4:ip5:ip6:ip7:ip8:ip9:ip10", ...}
Here path holds the new zip file reference, and machines is a key holding a list of machines, so that each machine can add its IP address to it as soon as it has downloaded the file successfully. Once the machines list has 10 entries, can I say we are ready to switch? If yes, how can I atomically update the machines key in that node? Maybe this logic is wrong, but I wanted to throw out an idea. I would also need to clear the machines list after the switch, since the next config update requires the same exercise.
Can someone outline how I can efficiently manage configuration on all my app machines dynamically while avoiding inconsistency? Maybe I need one more node, a status node, holding details about each machine's config: when it was downloaded, when the machine switched, and so on.
I can think of several possible solutions, depending on your scenario.
The simplest solution is not to store your config in memory and files at all, just store the config directly in the consul kv store. And I'm not talking about a single key that maps to the entire json (I'm assuming your json is big, otherwise you wouldn't zip it), but extracting smaller key/value sets from the json (this way you won't need to pull the whole thing every time you make a query to consul).
If you get the config directly from consul, your consistency guarantees match consul consistency guarantees. I'm guessing you're worried about performance if you lose your in-memory config, that's something you need to measure. If you can tolerate the performance loss, though, this will save you a lot of pain.
If performance is a problem here, a variation on this might be to use fsconsul. With this, you'll still extract your json into multiple key/value sets in consul, and then fsconsul will map that to files for your apps.
If that's off the table, then the question is how much inconsistency you are willing to tolerate.
If you can stand a few seconds of inconsistency, your best bet might be to put a TTL (time-to-live) on your in-memory config. You'll still have the watch on Consul, but you combine it with evicting your in-memory cache every few seconds as a fallback in case the watch fails (or stalls) for some reason. The worst case is then a few seconds of inconsistency (depending on the value you set for your TTL), but the normal case (I think) should be fast.
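The TTL idea can be sketched roughly as follows. This is a minimal illustration in Python, not tied to any Consul client: the class name TTLConfigCache and the loader callback (whatever function actually fetches the config) are hypothetical, and the clock is injectable only so the expiry logic is easy to exercise.

```python
import time
from typing import Any, Callable, Dict, Optional


class TTLConfigCache:
    """Keeps a config dict in memory and reloads it once it is older than ttl_seconds."""

    def __init__(self, loader: Callable[[], Dict[str, Any]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # hypothetical: fetches the config, e.g. from Consul
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._config: Optional[Dict[str, Any]] = None
        self._loaded_at = 0.0

    def invalidate(self) -> None:
        """Call this from the watch handler so the next read reloads immediately."""
        self._config = None

    def get(self) -> Dict[str, Any]:
        """Return the cached config, reloading it first if it is missing or expired."""
        expired = self._clock() - self._loaded_at >= self._ttl
        if self._config is None or expired:
            self._config = self._loader()
            self._loaded_at = self._clock()
        return self._config
```

The watch calls invalidate() in the normal case, and the TTL only matters when the watch is late or broken. The sketch is not thread-safe; a lock around get() and invalidate() would be needed if several threads share the cache.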
If that's not acceptable (does downloading the zip take a lot of time, maybe?), you can go down the route you mentioned. To update a value atomically you can use Consul's CAS (check-and-set) operation. It rejects your write if another update happened between the time you read the value and the time Consul tried to apply your change. When that happens, you pull the list of machines again, apply your change to it, and retry until it succeeds.
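The read, modify, check-and-set, retry loop looks roughly like this. To keep the sketch runnable without a Consul agent, FakeKV is a hypothetical in-memory stand-in that imitates the semantics (every write bumps a modify index, and a write is refused when the caller's index is stale). It is not a Consul client, and the function name and key name are made up. With the real HTTP API, as far as I know, the equivalent is a PUT with the cas parameter set to the ModifyIndex you read, which returns false when the write was not applied.

```python
import threading
from typing import Dict, Optional, Tuple


class FakeKV:
    """In-memory stand-in for a KV store with check-and-set (not a Consul client)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, int]] = {}
        self._index = 0
        self._lock = threading.Lock()

    def get(self, key: str) -> Tuple[Optional[str], int]:
        """Return (value, modify_index); a missing key is reported as (None, 0)."""
        with self._lock:
            return self._data.get(key, (None, 0))

    def cas_put(self, key: str, value: str, index: int) -> bool:
        """Write only if the key's modify index still equals `index`."""
        with self._lock:
            _, current = self._data.get(key, (None, 0))
            if current != index:
                return False
            self._index += 1
            self._data[key] = (value, self._index)
            return True


def register_machine(kv: FakeKV, key: str, ip: str) -> None:
    """Add `ip` to the colon-separated machine list, retrying on lost races."""
    while True:
        value, index = kv.get(key)
        machines = value.split(":") if value else []
        if ip in machines:
            return
        machines.append(ip)
        if kv.cas_put(key, ":".join(machines), index):
            return
        # Another machine wrote first: loop, re-read the list, and try again.
```

Each machine would call register_machine once its download has finished. With 10 machines racing, some writes lose and are retried, but no IP is ever overwritten.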
I don't see why you would need 2 directories, but maybe I'm misunderstanding the question: when your app starts, before you do anything else, you check if there's a new config and if there is you download it and load it to memory. So you shouldn't have a "default config" if you want to be consistent. After you downloaded the config on startup, you're up and alive. When your watch signals a key change you can download the config to directly override your old config. This is assuming you're running the watch triggered code on a single thread, so you're not going to be downloading the file multiple times in parallel. If the download failed, it's not like you're going to load the corrupt file to your memory. And if you crashed mid-download, then you'll download again on startup, so should be fine.

TIMEOUT in Laravel

So, I have to read an Excel file in which each row contains some data that I want to write to my database. I pass the whole file to Laravel, it reads the file and formats it into an array, and then I make a new insertion (or update) in my database.
The thing is, the input Excel file can contain thousands of rows, so it takes a while to complete and in some cases gives a timeout error.
When I do this locally I call set_time_limit(0); so the timeout doesn't occur, and it works pretty well. But on the remote server this function is disabled for security reasons, and my code crashes because of a timeout.
Can somebody help me solve this problem? Or maybe suggest a better way to approach it?
A nice way to handle tasks that take a long time is to make use of so-called jobs.
You can make a job called ImportExcel and dispatch it when someone sends you a file.
Take a good look at the docs, they have some great examples on how to do this.
You can take care of this using the following steps:
1. Take the CSV file and store it temporarily in storage:
You can store the large CSV file when the user uploads it. If it is not uploaded from the frontend, just make sure it is saved somewhere so it can be processed in the next step.
2. Then dispatch a job which can be queued:
You can create a job that handles this asynchronously. You can use Supervisor to manage the queue workers, timeouts, and so on.
3. Use a package like thephpleague:
Using this package (or a similar one), you can chunk the records or read them one at a time. This really helps keep your memory usage under the limit. It also offers several different methods for reading data from files.
4. Once the file is processed, you can delete it from temporary storage:
This is just teardown and cleanup.

EC2 suitability for synching large CSV files from an FTP

I have to execute a task twice per week. The task consists of fetching a 1.4GB CSV file from a public FTP server. Then I have to process it (apply some filters, discard some rows, make some calculations) and sync it to a Postgres database hosted on AWS RDS. For each row I have to retrieve a SKU entry from the database and determine whether it needs an update or not.
My question is whether EC2 could work as a solution for me. My main concern is memory. I have looked at solutions such as https://github.com/goodby/csv, which handle this by fetching row by row instead of pulling everything into memory; however, they do not work if I try to read the .csv directly from the FTP server.
Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the issue of the csv size and memory limitations?
You won't be able to stream the file directly from FTP; instead, you are going to copy the entire file and store it locally. Using curl or the ftp command is likely the most efficient way to do this.
Once you do that, you will need to write some kind of program that reads the file one line at a time, or several at a time if you can parallelize the work. There are ETL tools available that make this easy. Using PHP can work, but it's not a very efficient choice for this type of work, and your parallelization options are limited.
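As a rough illustration of reading the downloaded file a bit at a time, here is a minimal Python sketch using only the standard csv module. The function name read_batches and the default batch size are assumptions for the example, and it assumes the file has a header row. Only one batch of rows is held in memory at any moment, regardless of the file size.

```python
import csv
from typing import Dict, Iterator, List


def read_batches(path: str, batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most batch_size rows; the file is never fully loaded."""
    with open(path, newline="", encoding="utf-8") as handle:
        batch: List[Dict[str, str]] = []
        for row in csv.DictReader(handle):   # rows are parsed lazily, one by one
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:                            # last, possibly shorter, batch
            yield batch
```

Each batch can then be filtered and compared against the database in one round trip (for example, a single query fetching all SKUs in the batch) rather than one query per row.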
Of course you can do this on an EC2 instance (you can do almost anything you can supply the code for in EC2), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.
A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.
The FTP protocol doesn't do "streaming". You cannot read a file from FTP chunk by chunk.
Honestly, downloading the file and running the job on a bigger instance is not a big deal if you only run twice a week: just choose an r3.large (it costs less than 0.20/hour), run the job as quickly as possible, and stop the instance. The instance's internal SSD storage should give you the best possible I/O compared to EBS.
Just make sure your OS and code are deployed on EBS for future reuse (unless you have an automated code deployment mechanism). You must also make sure RDS can handle the burst of I/O; otherwise it will become the bottleneck.
Even better, on an r3.large instance you can split the CSV file into smaller chunks, load them in parallel, and shut down the instance after everything finishes. Afterwards you only pay the minimal root EBS storage cost.
I would not suggest Lambda if the process is lengthy, since Lambda is only meant for short and fast processing (at the time of writing, a function is terminated after 300 seconds).
(update):
If you open a file, the simplest way to parse it is to read it sequentially, which may not put the CPU to full use. You can split the CSV file into pieces, following the answer referenced here.
Then, using the same script, you can process the pieces simultaneously by sending each one to a background process. The example below puts Python processes in the background under Linux:
parse_csvfile.py csv1 &
parse_csvfile.py csv2 &
parse_csvfile.py csv3 &
So instead of sequential I/O on a single file, this makes use of multiple files. In addition, splitting the file should be a snap on an SSD.
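The splitting step itself can be sketched like this. It is a minimal Python example under a few assumptions: the function name split_csv and the part_NNNN.csv naming are made up for illustration, and the input is assumed to have a header row, which is repeated in every part so each piece can be parsed on its own.

```python
import csv
import os
from typing import List


def split_csv(path: str, out_dir: str, rows_per_file: int) -> List[str]:
    """Split a CSV into numbered part files, repeating the header in each part."""
    os.makedirs(out_dir, exist_ok=True)
    parts: List[str] = []
    out = None
    writer = None
    try:
        with open(path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = next(reader)               # assumes the first row is a header
            for i, row in enumerate(reader):
                if i % rows_per_file == 0:      # time to start a new part file
                    if out is not None:
                        out.close()
                    name = "part_%04d.csv" % (len(parts) + 1)
                    part_path = os.path.join(out_dir, name)
                    out = open(part_path, "w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    parts.append(part_path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return parts
```

The returned paths are what would be handed to the background parse_csvfile.py processes shown above, one path per process.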
So I made it work like this.
I used Python and two great libraries. First of all, I wrote Python code to request and download the CSV file from the FTP server so I could load it into memory. The first package is Pandas, a tool for analyzing large amounts of data. It includes methods to read CSV files easily, and I used its built-in features to filter and sort. I filtered the large CSV by one field and created about 25 new, smaller CSV files, which allowed me to deal with the memory issue. I also used Eloquent, a library inspired by Laravel's ORM. This library lets you create a connection using the AWS public DNS, database name, username, and password, and make queries using simple methods, without writing a single Postgres query. Finally, I created a T2 micro AWS instance, installed Pandas and Eloquent, updated my code, and that was it.

Ruby process to index files on a linux filesystem

I'm developing a small photo sharing Rails app which will read and display photos from a library of photos on the local filesystem.
In order to avoid scanning the filesystem every time the user loads the page, I want to set up an hourly cron job that indexes all files and stores it in a local MySQL table.
What's the best way to scan the local filesystem and store metadata about local files (e.g. size, file type, modified date, etc..)? Is there a convenient ruby-based library? I'd also like to be able to "watch" the filesystem to know when files have disappeared since the last scan so that they can be deleted from my table.
Thanks!
You will want to look into inotify.
https://github.com/nex3/rb-inotify
You can set a watch (register a callback in the Linux kernel) on a file or a directory, and every time something changes in that file or directory, the kernel will notify you immediately with a list of what has changed.
Common events are listed here: https://en.wikipedia.org/wiki/Inotify
You will notice that IN_CREATE + IN_DELETE are the events you are looking for.
Side note: IN_CREATE only tells you that the file has been created (it may still be empty). You will need to wait for the IN_CLOSE_WRITE event to know that the data has finished being written to the file.
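For the periodic scan described in the question, which also catches changes made while no watcher was running, the logic is a walk of the tree plus a comparison against the previous index. The sketch below shows that logic in Python with only the standard library rather than Ruby. The function names and the choice of stored fields (size, modified time, extension) are assumptions for illustration, and a real version would read and write the index in the MySQL table instead of a dict.

```python
import os
from typing import Dict, Set, Tuple

FileInfo = Tuple[int, float, str]   # (size in bytes, modified time, extension)


def scan_photos(root: str) -> Dict[str, FileInfo]:
    """Walk the tree under root and collect metadata for every regular file."""
    index: Dict[str, FileInfo] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            ext = os.path.splitext(name)[1].lower()
            index[path] = (stat.st_size, stat.st_mtime, ext)
    return index


def diff_index(old: Dict[str, FileInfo],
               new: Dict[str, FileInfo]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (added, removed, changed) paths between two scans."""
    old_paths, new_paths = set(old), set(new)
    added = new_paths - old_paths
    removed = old_paths - new_paths        # rows to delete from the table
    changed = {p for p in old_paths & new_paths if old[p] != new[p]}
    return added, removed, changed
```

The removed set is exactly the list of files that have disappeared since the last scan.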

Is there a way to watch a mysql database for changes using perl?

I'm looking for a solution similar to the inotify method of watching files for changes. I'm aware that I could watch the binlog file of the mysql database and run queries to pick out the new results but that seems very inefficient and inelegant; as does simply doing masses of queries in a loop waiting for new results.
If you add a TRIGGER to the table(s) you're interested in, you can use that to alert the watching application. You could do that in a number of ways:
Create an audit table in the database, and have the trigger write the relevant info there; and have your watching application poll the audit table for new entries. You're still polling, but in a controlled way which won't hit the server too hard.
Have the trigger call an external app through a UDF.
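The audit-table variant can be sketched as follows. To keep the example self-contained it uses Python with the standard-library sqlite3 module as a stand-in for MySQL and Perl, so the trigger syntax here is SQLite's and would need adapting for MySQL. The orders and audit tables and the poll_audit function are hypothetical names chosen for illustration. The key idea is that the watcher remembers the highest audit id it has seen and asks only for newer rows.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")   # stand-in database, not MySQL
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE audit (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT,
    row_id INTEGER,
    action TEXT
);
CREATE TRIGGER orders_insert AFTER INSERT ON orders
BEGIN
    INSERT INTO audit (table_name, row_id, action) VALUES ('orders', NEW.id, 'insert');
END;
CREATE TRIGGER orders_update AFTER UPDATE ON orders
BEGIN
    INSERT INTO audit (table_name, row_id, action) VALUES ('orders', NEW.id, 'update');
END;
""")


def poll_audit(db: sqlite3.Connection,
               last_seen: int) -> Tuple[int, List[Tuple[str, int, str]]]:
    """Return the new high-water mark and the audit rows added after last_seen."""
    rows = db.execute(
        "SELECT id, table_name, row_id, action FROM audit WHERE id > ? ORDER BY id",
        (last_seen,),
    ).fetchall()
    if rows:
        last_seen = rows[-1][0]
    return last_seen, [row[1:] for row in rows]
```

The watching application calls poll_audit on a timer and keeps the returned mark between calls. Each poll is a single indexed range query on a small table, which is why this is much lighter than re-querying the data tables themselves.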
As far as MyISAM tables go you can watch information_schema.TABLES.UPDATE_TIME. That would save you from polling all tables you're interested in. For InnoDB, watching binlog is the best I can think of.
Another approach is to do a push/signal instead of a DB poll. Have whatever process updates the database, notify your Perl code that an update was done via any IPC you pick (heck, a log file being appended to with name of table being changed might do the trick well enough).
This is especially effective if the updates are fairly rare or low volume, yet the reaction time to them must be quick.
An additional benefit is portability - works for any MySQL backend or for any other DB engine.