Large MySQL tmp files using Joomla/VirtueMart

I am using Joomla 1.5 and VirtueMart 1.1.3.
There is an issue where tmp files that are 1.6 GB are created every time a certain query is executed. is this normal? I think virtuemart is using a huge join statement to pull the whole products table and several other tables. I found the file that builds the query but i don't know where to begin to optimize this. even if i did virtuemart seems to use this one file to build all sql statements so i could end up breaking something.

You could look at the MySQL slow query log (and/or enable it) to see the particular query taking time and space. With that in hand, you can use MySQL's EXPLAIN functionality to see why the query is slow.
If you're lucky, the VirtueMart developers simply haven't added the right indexes to their tables, which forces MySQL to do things the slow way (i.e. filesorts, etc.). If you're unlucky, changing the schema won't help and you'll have to take this up with the VirtueMart developers, or fix it yourself.
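A minimal sketch of that workflow (the 2-second threshold and the table names below are only illustrative, not VirtueMart's actual statement):

-- Enable the slow query log at runtime (MySQL 5.1+; on older servers set
-- log-slow-queries in my.cnf and restart) and log anything slower than 2 seconds.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;

-- Once the offending statement shows up in the log, ask MySQL how it plans
-- to execute it; watch for full table scans and "Using filesort" in the output.
EXPLAIN SELECT p.*
FROM jos_vm_product p
JOIN jos_vm_product_price pp ON pp.product_id = p.product_id;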
In any case, if you find a solution, you probably should let the VirtueMart team know.
Best of luck!

Related

Best way to process large database with Laravel

The database
I'm working with a database that has some pretty big tables, and it's causing me problems. One in particular has more than 120k rows.
What I'm doing with it
I'm looping over this table in a MakeAverage.php file to merge its rows down to about 1k rows in a new table in my database.
What doesn't work
Laravel doesn't let me process it all at once, even if I try DB::disableQueryLog() or a take(1000) limit, for example. It returns a blank page every time, even though my error reporting was enabled (kind of like this). Also, there was no Laravel log file for this; I had to look in my php_error.log (I'm using MAMP) to realize it was actually a memory_limit problem.
What I did
I increased the amount of memory before executing my code by using ini_set('memory_limit', '512M'). (It's bad practice, I should do it in php.ini.)
What happened?
It worked! However, Laravel threw an error because the page didn't finish loading within 30 seconds, due to the large amount of data.
What I will do
After spending some time on this issue and looking at other people having similar problems (see: Laravel forum, 19453595, 18775510 and 12443321), I thought that maybe PHP isn't the solution.
Since I'm only creating a Table B from the average values of Table A, I believe SQL is going to fit my needs best, as it's clearly faster than PHP for that type of operation (see: 6449072) and I can use functions such as SUM, AVG, COUNT and GROUP BY (Reference).
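For what it's worth, a minimal sketch of that approach, assuming a hypothetical source table measurements averaged per sensor (the table and column names are placeholders for the real schema):

-- Build Table B directly from Table A inside MySQL; PHP's memory_limit and
-- execution-time limits never come into play for a single statement like this.
CREATE TABLE measurements_avg AS
SELECT sensor_id,
       AVG(value) AS avg_value,
       COUNT(*)   AS sample_count
FROM measurements
GROUP BY sensor_id;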

Modify database files

I have a system that a client designed and the table was originally not supposed to get larger than 10 gigs (maybe 10 million rows) over a few years. Well, they've imported a lot more information than they were thinking and within a month, the table is now up to 208 gigs (900 million rows).
I have very little experience with MySQL and a lot more experience with Microsoft SQL. Is there anything in MySQL that would allow the client to have the database span multiple files so the queries that are run wouldn't have to use the entire table and index? There is a field on the table that could easily be split on, but I wasn't sure how to do this.
The main issue I'm trying to solve is a retrieval query against this table. Inserts aren't a big deal at all, since they're all done by a back-end service. I have a test system where the table is about 2 gigs (6 million rows) and my query takes less than a second. When the same query is run on the production system, it takes 20 seconds. I have a feeling that the query itself is fine; it's just the size of the table that's causing the issue. There is an index on this table created specifically for this query, and an EXPLAIN shows it is being used.
If you have any other suggestions/questions, please feel free to ask.
Use partitioning, and in particular the part of CREATE TABLE that sets DATA DIRECTORY and INDEX DIRECTORY.
With these options you can put partitions on separate drives if needed. Usually, though, it's enough to partition on a key that appears in every query, typically time.
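A minimal sketch of what that can look like (the table, columns and directory paths are made up for illustration; DATA DIRECTORY / INDEX DIRECTORY support also depends on the storage engine and server settings, e.g. InnoDB ignores INDEX DIRECTORY):

CREATE TABLE events (
    id         BIGINT NOT NULL,
    created_at DATE   NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (id, created_at)  -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2010 VALUES LESS THAN (2011)
        DATA DIRECTORY = '/disk1/mysql/data' INDEX DIRECTORY = '/disk1/mysql/index',
    PARTITION p2011 VALUES LESS THAN (2012)
        DATA DIRECTORY = '/disk2/mysql/data' INDEX DIRECTORY = '/disk2/mysql/index',
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

Queries that filter on created_at then only touch the relevant partitions (partition pruning) instead of walking the whole table.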
In addition to the partitioning that has been mentioned, you might also want to run the tuning-primer script to ensure your MySQL configuration is optimal.

Run analytics on huge MySQL database

I have a MySQL database with a few (five, to be precise) huge tables. It is essentially a star-schema data warehouse. The table sizes range from 700 GB (the fact table) down to 1 GB, and the whole database comes to about 1 terabyte. Now I have been given the task of running analytics on these tables, which might even include joins.
A simple analytical query on this database might be "find the number of smokers per state and display it in descending order". This requirement could be converted into a simple query like:
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many others of the same nature) takes a very long time to execute on this database, on the order of tens of hours.
The database is also under heavy insert load, which means thousands of rows are added every few minutes.
In such a scenario how can I tackle this querying problem?
I have looked into Cassandra, which seemed easy to implement, but I am not sure it is going to be as easy for running analytical queries, especially when I have to use WHERE clauses and GROUP BY constructs.
I have also looked into Hadoop, but I am not sure how I could implement RDBMS-type queries. I am also not sure I want to invest right away in at least three machines for the name node, ZooKeeper and data nodes. On top of that, our company prefers Windows-based solutions.
I have also thought of pre-computing all the data into simpler summary tables, but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the MySQL environment setup:
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are InnoDB with file-per-table enabled
5) indexes on string as well as int columns
Pre-calculating values is an option, but the requirements for these kinds of ad-hoc aggregated values keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column hold string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table and keep integer keys in the fact table (this will help the sizes of the indexes too); see the sketch below this list.
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
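A sketch of that string-to-integer split (table and column names are illustrative, not the asker's real schema):

-- Small dimension table holding each status string once.
CREATE TABLE smoking_status (
    status_id   TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    status_name VARCHAR(32) NOT NULL
);

INSERT INTO smoking_status VALUES
    (1, 'current smoker'),
    (2, 'recently quit'),
    (3, 'quit 1+ years'),
    (4, 'never smoked');

-- The fact table then carries the small integer key instead of the string,
-- which shrinks both the rows and any index that includes the column.
-- (Backfill smoking_status_id from the old string column, then drop the string.)
ALTER TABLE abc
    ADD COLUMN smoking_status_id TINYINT UNSIGNED,
    ADD INDEX idx_status_state (smoking_status_id, state);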
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ and based on MySQL, so it's probably the easiest to slot into your existing system (make sure the data is going to the Infobright DB, then point your analytical queries at the Infobright server, keep the rest of the system as it is, job done) - or at Vertica, if it ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - it's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system it may take more care and feeding than the other options.
1 TB is not that big. MySQL should be able to handle that; at least, simple queries like that shouldn't take hours! I can't be very helpful without knowing the larger context, but I can suggest some questions you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day, and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate the number of new smokers every day?
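For example, a sketch of pre-calculating the smoker counts once a day into a small summary table (the recordDate column is a hypothetical timestamp on the fact table; names are illustrative):

CREATE TABLE IF NOT EXISTS daily_smokers (
    stat_date DATE        NOT NULL,
    state     VARCHAR(32) NOT NULL,
    smokers   INT         NOT NULL,
    PRIMARY KEY (stat_date, state)
);

-- Run once per day for yesterday's rows; the analytical query then reads
-- this tiny summary table instead of the 700 GB fact table.
INSERT INTO daily_smokers (stat_date, state, smokers)
SELECT DATE(recordDate), state, COUNT(*)
FROM abc
WHERE smokingStatus = 'current smoker'
  AND recordDate >= CURDATE() - INTERVAL 1 DAY
  AND recordDate <  CURDATE()
GROUP BY DATE(recordDate), state;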
Can you use Hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically, use Hadoop just for daily or batch processing and store the results in the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indexes on string columns? Can you make them ints, etc.?
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios: it lacks parallel query capability, so it cannot use multiple CPUs within a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application that translates queries written in Hive QL (an SQL-like language) into MapReduce jobs. Since it is a fairly small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating daily data to it and running the heavy aggregations against it. That will offload a serious part of the load from MySQL. You still need MySQL for the short interactive queries, usually backed by indexes, since Hive is inherently non-interactive - each query will take at least a few dozen seconds.
Cassandra is built for key-value access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema onto Cassandra, and you still won't get the flexibility and indexing capabilities of an RDBMS.
As a bottom line, Hive alongside MySQL should be a good solution.
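For illustration, the smoker count from the question translates almost verbatim into Hive QL once the fact table has been replicated into Hive (assuming it keeps the same name there); the difference is that Hive runs it as parallel MapReduce jobs across the cluster rather than on a single CPU:

SELECT state, COUNT(*) AS smokers
FROM abc
WHERE smokingStatus = 'current smoker'
GROUP BY state
ORDER BY smokers DESC;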

mysql optimization script file

I'm looking at having someone do some optimization on a database. If I gave them a similar version of the db with different data, could they create a script file to run all the optimizations on my database (i.e. create indexes, etc.) without ever seeing or touching the actual database? I'm looking at MySQL but would be open to other DBs if necessary. Thanks for any suggestions.
EDIT:
What if it were an identical copy with transformed data? Along with a couple of sample queries that approximated what the db was used for (i.e. OLAP vs. OLTP)? Would a script be able to contain everything, or would they need hands-on access to the actual db?
EDIT 2:
Could I create a copy of the db, transform the data to make it unrecognizable, create a backup file of the db, give it to the vendor, and have them give me a script file to run on my db?
Why are you concerned that they should not access the database? You will get better optimization if they have the actual data, as they can consider table sizes, which queries run the slowest, whether to denormalise if necessary, putting small tables completely in memory, and so on.
If it is an issue of confidentiality, you can always make the data anonymous by replacing the names.
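A minimal sketch of that kind of replacement, run against a copy of the database before handing it over (the customers table and its columns are hypothetical):

-- Overwrite personally identifiable fields with generated placeholders,
-- keeping row counts and data distribution roughly intact.
UPDATE customers
SET name  = CONCAT('Customer ', id),
    email = CONCAT('user', id, '@example.invalid');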
If it's just adding indices, then yes. However, there are a number of things to consider when "optimizing". Which are the slowest queries in your database? How large are certain tables? How can certain things be changed/migrated to make those certain queries run faster? It could be harder to see this with sparse sample data. You might also include a query log so that this person could see how you're using the tables/what you're trying to get out of them, and how long those operations take.

mysql stored routine vs. mysql-alternative?

We are using a MySQL database with about 150,000 records (names) in total. Our searches on the 'names' field are done through an autocomplete function in PHP. We have the table indexed but still feel that the searching is a bit sluggish (a few full seconds vs. something like Google Finance with its near-instant response). We came up with two possibilities, but wanted to get more insight:
Can we create a bunch (many thousands or more) of stored procedures to speed up searches, or will creating that many stored procedures bog down the db?
Is there a faster alternative to MySQL for SELECT statements? (Speed on inserting and updating rows isn't too important, so we can sacrifice that if necessary.) I've vaguely heard of BigTable and others that don't support JOIN statements... we need JOINs for some of our other queries.
thx
Forget about stored procedures. They won't do any good for you.
MySQL is a good choice; it's often considered one of the fastest RDBMSs, and there is no need to look for a 'faster alternative to SELECT statements'.
The abnormal query execution time you mentioned is the result of server misconfiguration, a poor database schema, or both. Please read this response on Server Fault, or update your question here: provide your server configuration, the relevant part of the database schema and the problem query, along with its explain select ... output.
You need to cache the information in memory to avoid making repeated calls to the database.
Yes, you need to expire the cache if you change the data, but as you said, that's not common, so you can even do that on a semi-automated basis and not worry about it if necessary. You should check out this MySQL.com article, as well as perhaps explore the MEMORY storage engine (sorry, new and can't post more than one hyperlink per post?!), which takes a little bit of coding to use but can be extremely efficient.
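A sketch of the MEMORY-engine idea, assuming a hypothetical people table: copy the names into a RAM-resident table and point the autocomplete lookups at it (the copy has to be rebuilt after a restart and refreshed whenever the underlying data changes):

CREATE TABLE people_cache (
    id   INT          NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    INDEX idx_name (name) USING BTREE  -- MEMORY defaults to HASH indexes, which don't help prefix searches
) ENGINE = MEMORY
SELECT id, name FROM people;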
What's the actual query time (vs. page time)? On a reasonably modern server that's not loaded to hell, MySQL should be able to do an autocomplete query on 150k rows much, much faster than two seconds. Missing some indexes?
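To sanity-check that, a sketch assuming a hypothetical people table: an ordinary index on the name column lets a left-anchored LIKE use it, which is usually all an autocomplete query needs:

ALTER TABLE people ADD INDEX idx_name (name);

EXPLAIN
SELECT id, name
FROM people
WHERE name LIKE 'smi%'  -- left-anchored prefix can use the index
ORDER BY name
LIMIT 10;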