Optimisation of the export function - mysql

I am wondering what would be the best way to speed up the company export function in my application:
function export() {
    ini_set('memory_limit', '-1');
    ini_set('max_execution_time', 0);
    $conditions = $this->getConditions($this->data);
    $resultCompanies = $this->Company->find('all', array('conditions' => $conditions));
    $this->set(compact('resultCompanies'));
}
So what it does is search the database for the companies that match certain conditions. The results are then set so that they can be displayed in the corresponding view.
How can I speed this function up? The more results you export, the longer it takes, but is it possible to optimize it in some way? It currently takes around 30 seconds to export just 4000 results, so I can't imagine it exporting, say, 40000 results, which it should be able to do. Is it a matter of the server?
Thanks.

This is not a matter of the server but of the program architecture.
You do not want to fetch and render this huge amount of information on the fly, for the obvious reasons you have already encountered.
I don't know enough about the requirements of your app, but I assume that you need to download a report. Assuming it always has to be up to date, here is what I would do:
The user clicks the link to download the report. Using JS and AJAX, the user is shown a loading indicator and a message that the report export is being prepared. On the server side a task is triggered to build the report.
A background service, a simple CakePHP shell that runs in a loop, will notice that there is a new report to build. It will build the report reading the db records in chunks to avoid running out of memory and write it to a file. When it's done the report download request is flagged as done and the file can be downloaded. On the client side the long polling JS script notices that the file is ready and downloads it.
Another solution, assuming the data does not have to be up to date, would be to generate the report files once per day, for example, and have them available for download without any waiting time for the user. On the server side the task stays the same: read and write the data in chunks.
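As a rough illustration of reading and writing in chunks, here is a minimal sketch in Python rather than as a CakePHP shell, with sqlite3 standing in for MySQL. The companies table, its columns, and the export_in_chunks name are assumptions made up for the example, not part of the original app:

```python
import csv
import sqlite3


def export_in_chunks(conn, out_path, chunk_size=1000):
    """Write every row of the (hypothetical) companies table to a CSV file.

    Rows are pulled from the cursor chunk_size at a time, so only one
    chunk is held in memory at once.
    """
    cursor = conn.execute("SELECT id, name, city FROM companies ORDER BY id")
    written = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "name", "city"])  # header line
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break  # no more data
            writer.writerows(rows)
            written += len(rows)
    return written
```

The return value is the number of exported rows, which a background task could store next to the "done" flag of the report request.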
About the performance part of the question:
This does not make it faster, but it gives the user feedback; you could even estimate the remaining time based on the chunks already processed. It also prevents the script from crashing because it runs out of memory. Instead of writing the file to disk you could stream it directly to the client: as soon as the first chunk has been read, you start sending the data. But reading the data from the database... Well, throw money and hardware at it. I suggest something like a RAID5 array with SSDs if you have the money. Expect to spend a few thousand dollars on it.
But even the fastest DB read is limited by the bandwidth you can send and the user can receive. For speeding up the DB I recommend asking on superuser.com. I'm not an expert on DB hardware, but a properly set up SSD configuration gives a huge speed boost to your DB.
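The remaining-time estimate mentioned above could be as simple as the following sketch. It assumes every chunk takes roughly the same time, which is only an approximation; the function name is made up for the example:

```python
def estimate_remaining_seconds(elapsed_seconds, chunks_done, chunks_total):
    """Estimate the time left from the average time per processed chunk."""
    if chunks_done <= 0:
        return None  # nothing processed yet, so no basis for an estimate
    seconds_per_chunk = elapsed_seconds / chunks_done
    return seconds_per_chunk * (chunks_total - chunks_done)
```

For example, if 10 of 100 chunks took 30 seconds, the estimate for the remaining 90 chunks is 270 seconds.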
Fetch only data you need:
I don't see any contain() or recursive setting in your find: Make sure you only fetch data that is really needed in the report!

Related

Connect Adobe Analytics to MYSQL

I am trying to connect the data collected from Adobe Analytics to my local instance of MySQL. Is this possible? If so, what would be the method of doing so?
There isn't a way to directly connect your mysql db with AA, make queries or whatever.
The following is just some top level info to point you in a general direction. Getting into specifics is way too long and involved to be an answer here. But below I will list some options you have for getting the data out of Adobe Analytics.
Which method is best largely depends on what data you're looking to get out of AA and what you're looking to do with it, within your local db. But in general, I listed them in order of level of difficulty of setting something up for it and doing something with the file(s) once received, to get them into your database.
The first option is to schedule data, within the AA interface, to be FTP'd to you on a regular basis. This can be a scheduled report from the report interface or from Data Warehouse, and it can be delivered in a variety of formats, most commonly as a CSV file. This exports data that has already been processed by AA, meaning aggregated metrics, etc. Overall, it is pretty easy to set up and to parse the exported CSV files. There are a number of caveats/limitations, though how much they matter largely depends on what specifically you're aiming to do.
Second option is to make use of their API endpoint to make requests and receive response in JSON format. Can also receive it in XML format but I recommend not doing that. You will get similar data as above, but it's more on-demand than scheduled. This method requires a lot more effort on your end to actually get the data, but it gives you a lot more power/flexibility for getting the data on-demand, building interfaces (if relevant to you), etc. But it also comes with some caveats/limitations same as first option, since the data is already processed/aggregated.
Third option is to schedule Data Feed exports from the AA interface. This will send you CSV files with non-aggregated, mostly non-processed, raw hit data. This is about the closest you will get to the data sent to Adobe collection servers without Adobe doing anything to it, but it's not 100% like a server request log or something. Without knowing any details about what you are ultimately looking to do with the data, other than put it in a local db, at face value, this may be the option you want. Setting up the scheduled export is pretty easy, but parsing the received files can be a headache. You get files with raw data and a LOT of columns with a lot of values for various things, and then you have these other files that are lookup tables for both columns and values within them. It's a bit of a headache piecing it all together, but it's doable. The real issue is file sizes. These are raw hit data files and even a site with moderate traffic will generate files many gigabytes large, daily, and even hourly. So bandwidth, disk space, and your server processing power are things to consider if you attempt to go this route.

TIMEOUT in Laravel

So, I have to read an Excel file in which each row contains some data that I want to write to my database. I pass the whole file to Laravel, it reads the file and formats it into an array, and then I make a new insertion (or update) in my database.
The thing is, the input Excel file can contain thousands of rows and it takes a while to complete, giving a timeout error in some cases.
When I try this locally I use the set_time_limit(0); function so the timeout doesn't occur, and it works pretty well. But on a remote server this function is disabled for security reasons and my code crashes because of a timeout.
Can somebody help me solve this problem? Maybe another idea for how to better solve it?
A nice way to handle tasks that take a long time is by making use of so-called jobs.
You can make a job called ImportExcel and dispatch it when someone sends you a file.
Take a good look at the docs, they have some great examples on how to do this.
You can take care of this using the following steps :
1. Take the csv file and store it temporarily in storage :
You can store the large csv when user uploads. If it's something which is not uploaded from frontend, just make sure you have it saved to be processed in next step.
2. Then dispatch a job which can be queued :
You can create a job which can handle this asynchronously. You can use Supervisor to manage queues and timeouts etc.
3. Use a package like thephpleague :
Using this package(or similar), you can chunk the records or read one at a time. It is really really helpful to keep your memory usage under limit. Plus it has different options of methods available to read the data from files.
4. Once file is processed, you can delete it from the temporary storage :
Just some teardown cleanup activity.
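To make the four steps concrete, here is a language-neutral sketch in Python; it is not Laravel code. A queue.Queue plus a worker thread stands in for Laravel's queue and Supervisor, sqlite3 stands in for the real database, and the items table, the two-column file layout and all function names are assumptions for the example:

```python
import csv
import os
import queue
import sqlite3
import threading

jobs = queue.Queue()  # stand-in for a real job queue


def import_file(path, db_path, batch_size=500):
    """Steps 3 and 4: read the stored file in batches, upsert, then clean up."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (sku TEXT PRIMARY KEY, qty INTEGER)"
    )
    with open(path, newline="") as handle:
        batch = []
        for row in csv.reader(handle):
            batch.append((row[0], int(row[1])))
            if len(batch) >= batch_size:
                # insert new rows, overwrite rows whose sku already exists
                conn.executemany("INSERT OR REPLACE INTO items VALUES (?, ?)", batch)
                conn.commit()
                batch = []
        if batch:  # flush the last, partially filled batch
            conn.executemany("INSERT OR REPLACE INTO items VALUES (?, ?)", batch)
            conn.commit()
    conn.close()
    os.remove(path)  # teardown: delete the temporary file


def worker():
    """Step 2: process queued jobs outside the request that received the file."""
    while True:
        job = jobs.get()
        try:
            if job is None:  # sentinel value: stop the worker
                return
            import_file(*job)
        finally:
            jobs.task_done()
```

The request handler would only save the upload (step 1) and call jobs.put((path, db_path)), so it returns long before the import finishes and never hits the web server's time limit.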

Which method improves the performance?

Suppose I have multiple users.
Which of the following method will improve the performance if I have to process all user's data?
Database hit for every user to fetch the filtered data of that user or
Fetch data of all user in the single database hit and the filter the data using expression/loops like LAMBDA or LINQ.
It depends on your data.
If it is small, then load it all and process it in a loop/stream.
If it is very big, then it is best to combine the two:
load a chunk of data, process it, load a new chunk, process it, and so on.
The issue is that loading from a DB takes time (opening a connection etc.), but loading the entire DB into memory is also problematic (if it is big), so you need to combine the two options.
Hope it helps.
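A minimal sketch of the combined approach, written in Python with sqlite3 as a stand-in database (the question mentions LINQ, so treat this as pseudocode for the idea). The orders table, its columns and the function name are assumptions for the example:

```python
import sqlite3


def process_users_in_chunks(conn, user_ids, chunk_size=100):
    """Sum the order amounts per user, one database hit per chunk of users."""
    totals = {}
    for start in range(0, len(user_ids), chunk_size):
        chunk = user_ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT user_id, amount FROM orders WHERE user_id IN ({placeholders})",
            chunk,
        )
        # the filtering/grouping for this chunk happens in memory
        for user_id, amount in rows:
            totals[user_id] = totals.get(user_id, 0) + amount
    return totals
```

With 250 users and a chunk size of 100 this makes 3 queries instead of 250, while never holding more than one chunk of rows in memory.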

EC2 suitability for synching large CSV files from an FTP

I have to execute a task twice per week. The task consists of fetching a 1.4GB csv file from a public ftp server. Then I have to process it (apply some filters, discard some rows, make some calculations) and then sync it to a Postgres database hosted on AWS RDS. For each row I have to retrieve a SKU entry from the database and determine whether it needs an update or not.
My question is whether EC2 could work as a solution for me. My main concern is the memory. I have searched for some solutions, such as https://github.com/goodby/csv, which handle this issue by fetching row by row instead of pulling it all into memory; however, they do not work if I try to read the .csv directly from the FTP.
Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the issue of the csv size and memory limitations?
You won't be able to stream the file directly from FTP; instead, you will have to copy the entire file and store it locally. Using curl or the ftp command is likely the most efficient way to do this.
Once you do that, you will need to write some kind of program that reads the file a line at a time, or several at a time if you can parallelize the work. There are ETL tools available that will make this easy. Using PHP can work, but it's not a very efficient choice for this type of work and your parallelization options are limited.
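A line-at-a-time version of the sync could look roughly like this. It is only a sketch: sqlite3 stands in for Postgres on RDS, and the products table, the sku and price columns, and the function name are assumptions, since the question does not show the real schema:

```python
import csv
import sqlite3


def sync_skus(conn, csv_path):
    """Read the downloaded CSV row by row and update only what changed."""
    inserted = updated = unchanged = 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):  # one row in memory at a time
            sku, price = row["sku"], float(row["price"])
            found = conn.execute(
                "SELECT price FROM products WHERE sku = ?", (sku,)
            ).fetchone()
            if found is None:
                conn.execute(
                    "INSERT INTO products (sku, price) VALUES (?, ?)", (sku, price)
                )
                inserted += 1
            elif found[0] != price:
                conn.execute(
                    "UPDATE products SET price = ? WHERE sku = ?", (price, sku)
                )
                updated += 1
            else:
                unchanged += 1  # row is already up to date, skip the write
    conn.commit()
    return inserted, updated, unchanged
```

Memory use stays flat regardless of the file size; the cost is one lookup per row, which is where batching or parallel workers would help.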
Of course you can do this on an EC2 instance (you can do almost anything you can supply the code for in EC2), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.
A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.
The FTP protocol doesn't do "streaming". You cannot read a file from FTP chunk by chunk.
Honestly, downloading the file and running the job on a bigger instance is not a big deal if you only run it twice a week. Just choose r3.large (it costs less than 0.20/hour), execute ASAP and stop it. The internal SSD disk space should give you the best possible I/O compared to EBS.
Just make sure your OS and code are deployed on EBS for future reuse (unless you have an automated code deployment mechanism). And you must make sure RDS will handle the burst I/O, otherwise it will become the bottleneck.
Even better, using an r3.large instance, you can split the CSV file into smaller chunks, load them in parallel, then shut down the instance after everything finishes. You just need to pay the minimal root EBS storage cost afterwards.
I would not suggest Lambda if the process is lengthy, since Lambda is only meant for short and fast processing (it will terminate after 300 seconds).
(update):
If you open a file, the simplest way to parse it is to read it sequentially, which may not put the whole CPU to full use. You can split up the CSV file by following this answer here.
Then, using the same script, you can process the pieces simultaneously by sending some to background processes; the example below shows putting Python processes in the background under Linux.
parse_csvfile.py csv1 &
parse_csvfile.py csv2 &
parse_csvfile.py csv3 &
So instead of sequential I/O on a single file, it will make use of multiple files. In addition, splitting the file should be a snap on an SSD.
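If you prefer to do the splitting in Python as well, a minimal sketch could look like this. The function name and the numbered output names (csv1.csv, csv2.csv, ...) are assumptions for the example; each piece repeats the header so it can be parsed on its own:

```python
import csv


def split_csv(src_path, out_prefix, rows_per_file):
    """Split a CSV into numbered files of at most rows_per_file data rows."""
    paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return paths  # empty input, nothing to split
        out = None
        writer = None
        for index, row in enumerate(reader):
            if index % rows_per_file == 0:
                if out is not None:
                    out.close()
                path = f"{out_prefix}{len(paths) + 1}.csv"
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)  # every piece gets the header
                paths.append(path)
            writer.writerow(row)
        if out is not None:
            out.close()
    return paths
```

The returned paths can then be handed to the parsing script, one background process per file.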
So I made it work like this.
I used Python and two great libraries. First of all I wrote Python code to request and download the csv file from the FTP so I could load it into memory. The first package is Pandas, which is a tool to analyze large amounts of data. It includes methods to read csv files easily. I used the included features to filter and sort. I filtered the large csv by a field and created about 25 new smaller csv files, which allowed me to deal with the memory issue. I also used Eloquent, which is a library inspired by Laravel's ORM. This library allows you to create a connection using the AWS public DNS, database name, username and password and make queries using simple methods, without writing a single Postgres query. Finally I created a T2 micro AWS instance, installed Pandas and Eloquent, updated my code and that was it.

mysql huge operations

I am currently importing a huge CSV file from my iPhone to a rails server. In this case, the server will parse the data and then start inserting rows of data into the database. The CSV file is fairly large and the operation takes a lot of time to finish.
Since I am doing this asynchronously, my iPhone is then able to go to other views and do other stuff.
However, when it sends another query against another table, the request will HANG because the first operation is still trying to insert the CSV's information into the database.
Is there a way to resolve this type of issue?
As long as the phone doesn't care when the database insert is complete, you might want to try storing the CSV file in a tmp directory on your server and then have a script write from that file to the database. Or simply store it in memory. That way, once the phone has posted the CSV file, it can move on to other things while the script handles the database inserts asynchronously. And yes, @Barmar is right about using the InnoDB engine rather than MyISAM (which may be the default in some configurations).
Or, you might want to consider enabling "low-priority updates" which will delay write calls until all pending read calls have finished. See this article about MySQL table locking. (I'm not sure what exactly you say is hanging: the update, or reads while performing the update…)
Regardless, if you are posting the data asynchronously from your phone (i.e., not from the UI thread), it shouldn't be an issue as long as you don't try to use more than the maximum number of concurrent HTTP connections.