Best way to process a large database with Laravel and MySQL

The database
I'm working with a database that has pretty big tables, and it's causing me problems. One table in particular has more than 120k rows.
What I'm doing with it
I'm looping over this table in a MakeAverage.php file to merge its rows into about 1k rows in a new table in my database.
What doesn't work
Laravel doesn't let me process it all at once, even if I try DB::disableQueryLog() or a take(1000) limit, for example. It returns a blank page every time, even though my error reporting was enabled, and there was no Laravel log file for this. I had to look in my php_error.log (I'm using MAMP) to realize that it was actually a memory_limit problem.
What I did
I increased the amount of available memory before executing my code with ini_set('memory_limit', '512M'). (It's bad practice; I should do it in php.ini.)
What happened?
It worked! However, Laravel then threw an error because the page didn't finish loading within 30 seconds, due to the large amount of data.

What I will do
After spending some time on this issue and looking at other people having similar problems (see: Laravel forum, 19453595, 18775510 and 12443321), I thought that maybe PHP isn't the right tool for this.
Since I'm only creating a Table B from the average values of Table A, I believe SQL is going to fit my needs best, as it's clearly faster than PHP for that type of operation (see: 6449072), and I can use functions and clauses such as SUM, AVG, COUNT and GROUP BY (Reference).
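As a rough sketch of that approach (the grouping column and value column are invented here, since the actual schema isn't shown), the whole merge can happen inside MySQL in one statement:

-- Hypothetical columns: table_a(group_key, measure), ~120k rows in Table A.
-- Builds the ~1k-row Table B directly in the database, no PHP loop needed.
CREATE TABLE table_b AS
SELECT group_key,
       AVG(measure) AS avg_measure,
       SUM(measure) AS sum_measure,
       COUNT(*)     AS sample_count
FROM table_a
GROUP BY group_key;

Because this runs entirely inside MySQL, PHP's memory_limit and the 30-second execution limit never come into play.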

Related

Rails Writes Take 100% Longer After Postgres Migration

I'm working on a migration from MySQL to Postgres on a large Rails app; most operations are performing at a normal rate. However, we have a particular operation that generates job records every 30 minutes or so. There are usually about 200 records generated and inserted, after which separate workers pick up the jobs and work on them from another server.
Under MySQL it takes about 15 seconds to generate the records, and then another 3 minutes for the worker to perform and write back the results, one at a time (so 200 more updates to the original job records).
Under Postgres it takes around 30 seconds, and then another 7 minutes for the worker to perform and write back the results.
The table being written to has roughly 2 million rows and one sequence column for the ID.
I have tried tweaking checkpoint timeouts and sizes with no luck.
The table is heavily indexed and really shouldn't be any different than it was before.
I can't post code samples as it's a huge codebase, and without posting pages and pages of code it wouldn't make sense.
My question is, can anyone think of why this would possibly be happening? There is nothing in the Postgres log and the process of creating these objects has not changed really. Is there some sort of blocking synchronous write behavior I'm not aware of with Postgres?
I've added all sorts of logging in my code to spot errors or transaction failures but I'm coming up with nothing, it just takes twice as long to run, which doesn't seem correct to me.
The Postgres instance is hosted on AWS RDS, on an M3.Medium instance type.
We also use New Relic, and it's showing nothing of interest here, which is surprising.
Why does your job queue contain 2 million rows? Are they all live, or have you just not moved them to an archive table to keep your reporting simpler?
Have you used EXPLAIN on your SQL from a psql prompt or your preferred SQL IDE/tool?
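For example, one of the generated inserts can be wrapped like this from psql to see the actual plan and timing (the statement below is only illustrative, since the real schema isn't posted):

-- Run inside a transaction so the test write can be rolled back afterwards.
BEGIN;
EXPLAIN ANALYZE
INSERT INTO jobs (job_type, payload, created_at)
VALUES ('example', '{}', now());
ROLLBACK;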
Postgres is a completely different RDBMS than MySQL. It allocates and manipulates space differently, so it may need to be indexed differently.
Additionally, there's a tool called pgtune that will suggest configuration changes.
edit: 2014-08-13
Also, Rails comes with a profiler that might add some insight. Here's a StackOverflow thread about Rails profiling.
You also want to watch your DB server at the disk IO level. Does your job fulfillment lead to a large number of updates? Postgres creates a new row version when you update an existing row and marks the old row as dead, instead of overwriting the existing row in place. So you may be seeing a lot more IO as a result of your RDBMS switch.
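One cheap way to check whether those dead row versions are piling up on the job table (the table name below is a guess) is the statistics view Postgres already maintains:

-- A high n_dead_tup relative to n_live_tup points to update churn / vacuum lag.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'jobs';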

Does MySQL return data on demand, or all at once? Why is looping over the data slower than fetching it?

I was wondering: when we call the Perl DBI APIs to query a database, are all the results returned at once? Or do we get part of the result set and, as we iterate, retrieve more and more rows from the database?
The reason I am asking is that I notice the following in a perl script.
I did a query against a database which returns a really large number of records. After getting these records, I did a for loop over the results and created a hash of the data.
What I noticed is that the actual query against the database returned in a reasonable amount of time (even though the result set was large), but the big delay was in looping over the data to create the hash.
I don't understand this. I would expect that the query would be the slow part since the for loop and the construction of the hash would be in-memory and would be cheap.
Any explanation/idea why this happens? Am I misunderstanding something basic here?
Update
I understand that MySQL caches data, so when I run the same query multiple times it is faster from the second time on. But I still would not expect the for loop over the data set in memory to take as long as (or longer than) the query to the MySQL DB.
Assuming you are using DBD::mysql, the default is to pull all the results from the server at once and store them in memory. This avoids tying up the server's resources and works fine for the majority of result sets as RAM is usually plentiful.
That answers your original question, but if you would like more assistance, I suggest pasting code - it's possible your hash building code is doing something wrong, or unnecessary queries are being made. See also Speeding up the DBI for tips on efficient use of the DBI API, and how to profile what DBI is doing.

MySQL - Restrict (Forcibly) the number of rows returned for ANY QUERY

I have a special security need with MySQL. I need to forcibly restrict the number of rows a query returns, issuing an error if the returned rows would exceed, say, a million rows. Here is the setup -
Need - The data has hundreds of millions of rows, and we don't want a client to run down the server or do a complete extraction (they would never need all the rows, just aggregations). The idea is that if they do need that many, they run into the error or barrier and come to us with an explanation of why they need to pull so many rows with a single query.
System - Clients can use any query tool, so we have no control over what query is generated. Thus, we cannot use LIMIT x, which seems to be the solution suggested everywhere.
I have tried searching for a solution, and for now it seems that the only way to do it is at the application level (which we do not own).
Is there any way to achieve this?
Setting
1- We need to have SSL enabled.
2- MySQL 5.5
Thanks!
J
It seems like you might be able to get close with MySQL Proxy.
https://launchpad.net/mysql-proxy
See this page for manipulating results. Not sure if it does a buffered or unbuffered read, or if you can cancel the reading of results or not...
http://dev.mysql.com/doc/refman/5.1/en/mysql-proxy-scripting-read-query-result.html
It's open source, so you might be able to hire someone to tweak it if needed as well.
There may be other ways to restrict the overloading of your database server. Take a look at this link for more info:
MySQL - can I limit the maximum time allowed for a query to run?
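The usual workaround on MySQL 5.5 (which has no per-query timeout built in) is to police the server yourself: find queries that have been running too long and kill them. A rough sketch, where the 600-second threshold and the id are placeholders:

-- Queries that have been running for more than 10 minutes.
SELECT id, user, time, info
FROM information_schema.PROCESSLIST
WHERE command = 'Query' AND time > 600;

-- Then terminate an offender by its id, e.g.:
KILL QUERY 12345;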

Run analytics on huge MySQL database

I have a MySQL database with a few (five, to be precise) huge tables. It is essentially a star-schema data warehouse. The table sizes range from 700GB (the fact table) down to 1GB, and the whole database comes to about 1 terabyte. Now I have been given the task of running analytics on these tables, which might even include joins.
A simple analytical query on this database could be "find the number of smokers per state and display it in descending order". This requirement can be converted into a simple query like:
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many others of the same nature) takes a lot of time to execute on this database; the time taken is on the order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked into Cassandra, which seemed easy to implement, but I am not sure it will be as easy for running analytical queries on the database, especially when I have to use WHERE clauses and GROUP BY constructs.
I have also looked into Hadoop, but I am not sure how I can implement RDBMS-type queries. I am not too sure I want to invest right away in getting at least three machines for the name node, ZooKeeper and the data nodes! Above all, our company prefers Windows-based solutions.
I have also thought of pre-computing all the data in simpler summary tables, but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option, but the requirements for this kind of ad-hoc aggregated value keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
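For instance, prefixing the query from the question with EXPLAIN will show whether an index on smokingStatus/state is being used or whether the 700GB fact table is being scanned:

EXPLAIN
SELECT state, COUNT(smokingStatus) AS smokers
FROM abc
WHERE smokingStatus = 'current smoker'
GROUP BY state;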
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too). A sketch of this follows the list.
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
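A sketch of that third point, with invented table and column names since the real schema isn't shown:

-- Small dimension table holding the handful of distinct status strings.
CREATE TABLE smoking_status_dim (
  status_id   TINYINT UNSIGNED NOT NULL PRIMARY KEY,
  status_name VARCHAR(32)      NOT NULL
);

-- The fact table then stores only the integer key (smoking_status_id),
-- and queries join back to the dimension when the label is needed:
SELECT f.state, COUNT(*) AS smokers
FROM abc f
JOIN smoking_status_dim d ON d.status_id = f.smoking_status_id
WHERE d.status_name = 'current smoker'
GROUP BY f.state;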
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ & based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the Infobright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done), or at Vertica if it ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - it's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system, it may take more care & feeding than other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! I can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate the number of new smokers every day? (A sketch of this follows these questions.)
Can you use Hadoop for the aggregation process above? Hadoop is quite good at that kind of work. Basically, use Hadoop just for daily or batch processing and store the results in the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on string columns? Can you make them ints, etc.?
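A sketch of that pre-aggregation idea: roll each day's rows into a small summary table once, then report from the summary instead of the 700GB fact table. The summary table and the date column are invented here, since the real schema isn't shown.

-- Run once per day, e.g. from a cron job or a MySQL event.
INSERT INTO daily_smokers_by_state (report_date, state, smokers)
SELECT CURRENT_DATE, state, COUNT(*)
FROM abc
WHERE smokingStatus = 'current smoker'
  AND created_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND created_at <  CURRENT_DATE
GROUP BY state;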
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios: a lack of parallel query capability - it cannot utilize multiple CPUs in a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application capable of translating your queries, written in HiveQL (an SQL-like language), into MapReduce jobs. Since it is actually a small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating the daily data to it, and running the heavy aggregations against it. That will offload a serious part of the load from MySQL. You still need MySQL for the short interactive queries, usually backed by indexes; you need those because Hive is inherently non-interactive - each query will take at least a few dozen seconds.
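For reference, the aggregation from the question translates to HiveQL almost verbatim (assuming the fact table has been replicated into Hive under the same name):

-- HiveQL: executed as MapReduce jobs across the cluster.
SELECT state, COUNT(*) AS smokers
FROM abc
WHERE smokingstatus = 'current smoker'
GROUP BY state
ORDER BY smokers DESC;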
Cassandra is built for key-value style access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema into Cassandra, and you still won't get the flexibility and indexing capabilities of an RDBMS.
As a bottom line - Hive alongside MySQL should be a good solution.

MySQL query fast only the first time it is run

I have a MySQL SELECT query which is fast (<0.1 sec), but only the first time I run it. It joins 3 tables together (using indices) and has a relatively simple WHERE clause. When I run it by hand in phpMyAdmin (always changing the numbers in the WHERE clause so that it isn't cached) it is always fast, but when I have PHP run several copies of it in a row, the first one is fast and the others hang for ~400 seconds. My only guess is that somehow MySQL is running out of memory for the connection and then has to do expensive paging.
My general question is how I can fix this behavior. My specific questions are: without actually closing and restarting the connection, how can I make these queries coming from PHP be seen as separate, just like the queries coming from phpMyAdmin? How can I tell MySQL to flush any memory when the request is done? And does this sound like a memory issue to you?
Well, I found the answer, at least in my case, and I'm putting it here for anyone in the future who runs into a similar issue. The query I was running returned a lot of results, and MySQL's query cache was causing a lot of overhead. When you run a query, MySQL saves it and its output so that it can answer future identical requests quickly. All I had to do was add SQL_NO_CACHE and the speed was back to normal. Just watch out if your incoming query or its results are very large, because it can take considerable resources for MySQL to decide what to keep in the cache and what to kick out.
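For anyone looking for the exact syntax, the hint goes right after SELECT. The tables and columns below are placeholders, not the asker's actual three-table join:

-- Bypass the query cache for this statement only.
SELECT SQL_NO_CACHE o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items i ON i.order_id = o.id
WHERE o.created_at > '2014-01-01';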