I have a site where a CSV of racehorse data is to be uploaded once a week. The CSV contains the details of about 19,000 racehorses currently registered in the UK and is about 1.3MB in size, on average. I have a script that processes that CSV and either updates the horse if it exists and the ratings data has changed, or adds it if it doesn't exist. If a horse is unchanged, it skips to the next one. The script works: it ran successfully on the host I use for testing. It took 5 or 6 minutes to run (less than ideal, I know), but it worked.
However, we're now testing on the staging version of the client's host, and it's running for 15 minutes and then returning a 504 timeout. We've tweaked .htaccess and php.ini settings as much as we're able ... no joy.
The host is in a shared environment, so they tell me that MySQL's LOAD DATA is unavailable to us.
What other alternative approaches would you try? Or is there a way of splitting the CSV into chunks and running a process on each one in turn, for example?
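A minimal sketch of the chunked approach asked about above. It is written in Python with the standard-library sqlite3 module standing in for MySQL so that it runs anywhere; the table name horses, the columns horse_id, name and rating, and the function import_horses are assumptions for illustration, not taken from the actual script. The idea is one lookup query and one transaction per chunk, rather than one of each per horse.

```python
import csv
import io
import sqlite3
from itertools import islice


def chunks(rows, size):
    """Yield lists of at most `size` rows from any iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def import_horses(conn, csv_file, chunk_size=500):
    """Insert new horses, update changed ratings, skip unchanged rows."""
    counts = {"inserted": 0, "updated": 0, "skipped": 0}
    reader = csv.DictReader(csv_file)
    for batch in chunks(reader, chunk_size):
        ids = [row["horse_id"] for row in batch]
        marks = ",".join("?" * len(ids))
        # One SELECT per chunk instead of one per horse.
        existing = dict(conn.execute(
            f"SELECT horse_id, rating FROM horses WHERE horse_id IN ({marks})",
            ids,
        ))
        with conn:  # one transaction per chunk
            for row in batch:
                hid, rating = row["horse_id"], int(row["rating"])
                if hid not in existing:
                    conn.execute(
                        "INSERT INTO horses (horse_id, name, rating) VALUES (?, ?, ?)",
                        (hid, row["name"], rating),
                    )
                    counts["inserted"] += 1
                elif existing[hid] != rating:
                    conn.execute(
                        "UPDATE horses SET rating = ? WHERE horse_id = ?",
                        (rating, hid),
                    )
                    counts["updated"] += 1
                else:
                    counts["skipped"] += 1
                # Remember the row so a duplicate later in the chunk is handled.
                existing[hid] = rating
    return counts


# Hypothetical data: h1 is unchanged, h2 has a new rating, h3 is new.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE horses (horse_id TEXT PRIMARY KEY, name TEXT, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO horses VALUES (?, ?, ?)",
    [("h1", "Alpha", 80), ("h2", "Bravo", 90)],
)
conn.commit()
sample = io.StringIO(
    "horse_id,name,rating\nh1,Alpha,80\nh2,Bravo,95\nh3,Charlie,70\n"
)
result = import_horses(conn, sample, chunk_size=2)
print(result)
```

Whether this alone gets under the timeout on a shared host is not something the sketch can promise; it only shows where the per-row round trips can be removed.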
This is my first time working with MySQL besides a few basic queries on an existing DB, so I'm not great at troubleshooting this.
I have a CSV with 125,000 records that I want to load into MySQL. I installed version 8 along with Workbench. I used the Import Wizard to load my CSV and it started importing. The problem is that it took ~5 hours to get to 30,000 records. From what I read, this is a long time and there should be a faster way.
I tried LOAD DATA INFILE but got an error regarding secure-file-priv, so I went looking to solve that. The configuration appears to be off for secure-file-priv, but it keeps popping up as the error. Now I'm getting "Access denied" errors, so I'm just stuck.
I am the admin on this machine and this data doesn't mean anything to anyone so security isn't a concern. I just want to learn how to do this.
Is LOAD DATA INFILE the best way to load this amount of data?
Is 20 hours too long for 125000 records?
Anyone have any idea what I'm doing wrong?
You don't need to set secure-file-priv if you use LOAD DATA LOCAL INFILE. This allows the client to read the file content on the computer where the client runs, so you don't have to upload the file to the designated directory on the database server. This is useful if you don't have access to the database server.
But the LOCAL option is disabled by default. You have to enable it on both sides: on the server with the local-infile option in my.cnf, and in the MySQL client by running mysql --local-infile.
In addition, if you use LOAD DATA INFILE without LOCAL, your user must be granted the FILE privilege, because the server itself reads the file. The LOCAL form does not require it. See https://dev.mysql.com/doc/refman/8.0/en/privileges-provided.html
Once it's working, LOAD DATA INFILE should be the fastest way to bulk-load data. I did a bunch of comparative speed tests for a presentation, "Load Data Fast!".
You may also have some limiting factors with respect to MySQL Server configuration options, or even performance limitations with respect to the computer hardware.
I think the 5 hours for 30k records is way too long even on modest hardware.
I tested on a MacBook with built-in SSD storage. Even in my test designed to be as inefficient as possible (open connection, save one row using INSERT, disconnect), I was still able to insert 290 rows/second, or 10k rows in 34 seconds. The best result was using LOAD DATA INFILE, at a rate of close to 44k rows/second, loading 1 million rows in 22 seconds.
So something is severely underpowered on your database server, or else the Import Wizard is doing something so inefficient I cannot even imagine what it could be.
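To put those rates side by side, here is a small worked calculation using only the figures quoted in this thread (30,000 rows in 5 hours from the Import Wizard, 290 rows/second for single-row INSERTs, 44k rows/second for LOAD DATA INFILE). The function and variable names are just for illustration.

```python
def seconds_to_load(rows, rows_per_second):
    """Time needed to load `rows` rows at a steady rate."""
    return rows / rows_per_second


TOTAL_ROWS = 125_000

# Import Wizard as observed in the question: 30,000 rows in 5 hours.
wizard_rate = 30_000 / (5 * 3600)  # rows per second
wizard_hours = seconds_to_load(TOTAL_ROWS, wizard_rate) / 3600

# Rates measured in the answer's own tests.
single_insert_minutes = seconds_to_load(TOTAL_ROWS, 290) / 60
load_data_seconds = seconds_to_load(TOTAL_ROWS, 44_000)

print(f"Import Wizard:     {wizard_rate:.2f} rows/s, {wizard_hours:.1f} hours")
print(f"Single-row INSERT: {single_insert_minutes:.1f} minutes")
print(f"LOAD DATA INFILE:  {load_data_seconds:.1f} seconds")
```

The wizard works out to under 2 rows per second, which is where the roughly 20-hour estimate in the question comes from. Even the deliberately inefficient single-row test would finish the same 125,000 rows in a few minutes.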
I am wondering what would be the best way to speed up the company export function in my application:
    function export() {
        ini_set('memory_limit', '-1');
        ini_set('max_execution_time', 0);
        $conditions = $this->getConditions($this->data);
        $resultCompanies = $this->Company->find('all', array('conditions' => $conditions));
        $this->set(compact('resultCompanies'));
    }
So what it does is search the database for the companies that match certain conditions. The results are then set so that they can be displayed in the corresponding view.
How can I speed this function up? The more results you want to export, the more time it takes, but is it possible to optimize it in some way? It currently takes around 30 seconds to export just 4000 results, so I can't imagine it exporting, say, 40000 results, and it should be able to. Is it a matter of the server?
Thanks.
This is not a matter of the server but of the program architecture.
You do not want to fetch and render this huge amount of information on the fly for obvious reasons you already have encountered.
I don't know enough about the requirements of your app but I assume that you need to download a report. Assuming it has to be always up to date here is what I would do:
The user clicks the link to download the report. Using JS and AJAX, the user is shown a loading indicator and a message that the report export is being prepared. On the server side a task is triggered to build the report.
A background service, a simple CakePHP shell that runs in a loop, will notice that there is a new report to build. It will build the report reading the db records in chunks to avoid running out of memory and write it to a file. When it's done the report download request is flagged as done and the file can be downloaded. On the client side the long polling JS script notices that the file is ready and downloads it.
Another solution, assuming the data does not have to be up to date, would be to generate the report files once per day, for example, and have them available for download without any waiting time for the user. On the server side the task stays the same: read and write the data in chunks.
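A rough sketch of the "read and write in chunks" part, in Python with the standard-library sqlite3 module standing in for the real database so it can be run as-is. The companies table, its id and name columns, and the function export_in_chunks are assumptions for illustration. Each query fetches only the rows after the last id already written, so memory use stays bounded by the chunk size.

```python
import csv
import io
import sqlite3


def export_in_chunks(conn, out_file, chunk_size=1000):
    """Write all companies to out_file as CSV, one chunk of rows at a time."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "name"])
    last_id = 0
    total = 0
    while True:
        # Fetch the next chunk, starting after the last id already written.
        rows = conn.execute(
            "SELECT id, name FROM companies WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        writer.writerows(rows)
        last_id = rows[-1][0]
        total += len(rows)
    return total


# Hypothetical data set of 2500 companies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [(f"Company {i}",) for i in range(2500)],
)
conn.commit()

buffer = io.StringIO()  # stands in for the report file on disk
exported = export_in_chunks(conn, buffer, chunk_size=1000)
print(exported, "rows exported")
```

In a CakePHP shell the same loop would page through find() results instead; that part is not shown here.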
About the performance part of the question:
Processing in chunks does not make the export faster, but it lets you give the user feedback: you could even estimate the remaining time based on the chunks already processed. It also prevents the script from crashing because it ran out of memory. Instead of writing the file to disk you could stream it directly to the client, so that you start sending data as soon as the first chunk has been read. As for reading the data from the database faster: well, throw money and hardware at it. I suggest something like a RAID5 array with SSDs if you have the money. Expect to spend a few thousand dollars on it.
But even the fastest DB read is limited by the bandwidth you can send and the user can receive. For speeding up the DB I recommend asking on superuser.com; I'm not an expert on DB hardware, but a properly set up SSD configuration gives a huge speed boost to your DB.
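The remaining-time estimate can be as simple as assuming every chunk takes about as long as the average of the chunks processed so far. A minimal sketch, with a hypothetical function name:

```python
def estimate_remaining_seconds(chunks_done, chunks_total, elapsed_seconds):
    """Estimate time left, assuming each chunk takes the average time so far."""
    if chunks_done <= 0:
        return None  # nothing processed yet, so there is nothing to base it on
    seconds_per_chunk = elapsed_seconds / chunks_done
    return seconds_per_chunk * (chunks_total - chunks_done)


# 10 of 40 chunks done after 30 seconds.
remaining = estimate_remaining_seconds(10, 40, 30.0)
print(remaining)
```

The background task would store chunks_done somewhere the polling request can read it, and the client would display the estimate next to the loading indicator.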
Fetch only data you need:
I don't see any contain() or recursive setting in your find: Make sure you only fetch data that is really needed in the report!
I have read a few/lots of things on this but they don't seem to help much.
I have an app (it's called "TieUp", but that is irrelevant) that I run manually every day to collate data from several locations.
It is using as sources:
A) Data from a remote SOAP source, loaded into an in-memory TClientDataset via an XMLTransform setup.
B) CSV files downloaded daily and loaded into an in-memory TClientDataset
C) A MySQL database on the same computer as the program (it's a restored backup of the live source)
D) A remote MS-SQL (SQL Server 2008) database
E) A MySQL database on a remote server
Data is only read from sources A, B, C and D
Data source E is updated with the consolidated data.
There are between 800 and 2000 records daily, so the datasets are not vast, although the target (E) has grown to around 150,000 records and is increasing daily.
I can normally run this all happily and everything works as expected (if a little slowly because of all the individual remote lookups to the MS-SQL system), but some days it really screws up and the error is always "Catastrophic Failure!".
The failure does not occur during any particular phase or operation that I can see. The steps are:
1) Get the SOAP(A) data first.
2) Tie in with CSV/In Memory data(B).
3) Lookup References data on Sources C and D to collate
4) Write the consolidated data to source E
After reading the data into the in-memory datasets, everything is in TClientDatasets accessed via DatasetProviders linked to TSQLQueries (they are all on the same servers currently, but I did it that way to keep some flexibility in future, where it might go true three-tier). All queries are contained within the SQLQuery components as they are actually quite simple - it's just a matter of tying things together.
I am using completely standard components from Delphi 2009 Enterprise. All updates and database update packs have been applied. Each data source has its own DataModule; these are auto-created at startup.
There is obviously quite a lot of data access going on here, but when it crashes (with catastrophic failure) it gets stuck, completely stuck. Windows can't end the task from the normal "TieUp has stopped working" dialog; I have to go to the process and kill it.
There is so much going on and as this only happens once a week or so I really don't know where to start looking.
The reason for asking the question is twofold: 1) I am trying to eliminate any manual stuff and fully automate it, but I can't rely on it if it bombs every week or so. 2) If it happens in the update phase to E, I have to manually delete the new records for the day and start again, as I do not have (or haven't written yet) a mechanism to restart from an arbitrary point, and I would still have to query the DB manually to establish that point for certain.
My next step is to install Delphi on another computer and always run it under the debugger until I can catch it, if it does not freeze first. But that introduces yet another different network connection (instead of the local host one).
So: "Is there a definite answer?" or what is the most likely offending component/connection? Where is the favoured place to start looking?
Thanks in advance...
I've been asked for a quick turnaround on this. The group I'm assisting has a .MDB database used by offsite workers who don't have internet access all the time. For that reason, way back, the team implemented an Access DB, which allows for synchronization.
As their team grew bigger they started running into the following issues:
Remote synching – when a user tries to synch from a worksite, more often than not the database will crash, either due to loss of wireless signal, the program timing out, or the Inspector manually shutting it down because it is taking too long (i.e., 30 or more minutes)
Multiple synchers – we are unable to synch multiple users at one time (there are currently 34 users in 3 different territories). If someone is synching and another person tries to synch at the same time, the second user will end up with an error message. They will have to shut down their DB and try to synch at a later time.
Incomplete synchs – sometimes when a worker synchs his/her DB, not all the line items will copy over to the Master file, which can cause confusion during review.
Are there any workarounds or items I can look into to resolve these?
I have little resources and time so anything involving a new server might not work.
Thanks
It sounds as though you are mainly adding new data from different field operatives, rather than everyone updating existing data. If this is the case then that's good, and you could try the following:
Ensure all the tables have "Replication IDs" for the Primary Keys, as this will ensure no two operatives create conflicting records.
The synchronisation process should then be amended to take a snapshot of said table/tables to a .txt file on the operative's machine, and this file should then be transferred back to the source machine.
Then at the end of the day, or more often if required, the master copy should be set up to import the new data from all the text files it has received. As there will be no conflicting Primary Keys you should be OK; just remember to insert only those rows where the Primary Key is not already in the table.
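To illustrate the import step, here is a small sketch in Python using the standard-library sqlite3, csv and uuid modules rather than Access, purely so it can be run anywhere. The inspections table, its columns and the function names are made up for the example. Random UUIDs play the role of the Replication ID keys, and SQLite's INSERT OR IGNORE stands in for an append query that skips rows whose key is already in the master table.

```python
import csv
import io
import sqlite3
import uuid


def make_snapshot(rows):
    """Build an in-memory text snapshot (CSV) like an operative would send back."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "site", "note"])
    writer.writerows(rows)
    buf.seek(0)
    return buf


def import_snapshot(conn, snapshot_file):
    """Add rows whose primary key is not already in the master table."""
    rows = [(r["id"], r["site"], r["note"]) for r in csv.DictReader(snapshot_file)]
    count_sql = "SELECT COUNT(*) FROM inspections"
    before = conn.execute(count_sql).fetchone()[0]
    with conn:  # one transaction per snapshot file
        conn.executemany(
            "INSERT OR IGNORE INTO inspections (id, site, note) VALUES (?, ?, ?)",
            rows,
        )
    return conn.execute(count_sql).fetchone()[0] - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (id TEXT PRIMARY KEY, site TEXT, note TEXT)")

# Hypothetical records created offline by two operatives.
a1 = (str(uuid.uuid4()), "Site A", "first visit")
a2 = (str(uuid.uuid4()), "Site A", "follow-up")
b1 = (str(uuid.uuid4()), "Site B", "first visit")

first = import_snapshot(conn, make_snapshot([a1, a2]))
# The second file repeats a2, as would happen if a snapshot were re-sent.
second = import_snapshot(conn, make_snapshot([a2, b1]))
print(first, second)
```

Because each file is imported on the master machine on its own schedule, a dropped connection in the field only delays a file transfer; it never interrupts a write to the master database.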
Hope all that makes sense : )
Ok, so basically I've got a script in CakePHP where I'm putting over 7 million records into a database. Seeing as how there are that many records, I'm running into some issues with timeouts. This is on a personal server so the memory limit is set to 2000MB so that's not really an issue with how I'm wanting to do it.
The database rows are coming from a huge file. The file was too big for the memory limit, so I've split it up into 101 pieces with 10000 lines in each file.
I want the page to refresh after 10 records, and when it comes back, restart inserting records where it left off.
Any ideas?
I've tried the $this->redirect() route, but it's created never-ending scripts that had to be stopped by manually restarting the server.
Why are you not using a shell for that?
To avoid the redirect loop you could try to redirect between two actions, or try to attach a timestamp to the URL. I'm not sure that will work; the shell would be the much better approach anyway.
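Whichever way the import is driven, the "restart where it left off" part comes down to storing a checkpoint in the same transaction as the inserted rows. A minimal sketch in Python, with sqlite3 standing in for the real database; the records and import_progress tables and the import_batch function are assumptions for illustration, not CakePHP API.

```python
import sqlite3


def import_batch(conn, lines, batch_size):
    """Import the next batch after the saved checkpoint; return rows imported."""
    done = conn.execute("SELECT lines_done FROM import_progress").fetchone()[0]
    batch = lines[done:done + batch_size]
    if not batch:
        return 0
    # Rows and checkpoint are committed together, so a crash between batches
    # can neither skip rows nor insert them twice.
    with conn:
        conn.executemany(
            "INSERT INTO records (value) VALUES (?)",
            [(line,) for line in batch],
        )
        conn.execute(
            "UPDATE import_progress SET lines_done = ?",
            (done + len(batch),),
        )
    return len(batch)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (value TEXT)")
conn.execute("CREATE TABLE import_progress (lines_done INTEGER NOT NULL)")
conn.execute("INSERT INTO import_progress VALUES (0)")
conn.commit()

lines = [f"row-{i}" for i in range(25)]  # stands in for one of the split files

# A shell-style loop: each pass could just as well be a separate run.
sizes = []
while True:
    imported = import_batch(conn, lines, 10)
    if imported == 0:
        break
    sizes.append(imported)
print(sizes)
```

Run from the command line there is no web timeout to work around, and if the process is killed, the next run simply picks up from lines_done.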