MySQL Queries Pegging Server Resources -- Indexes aren't being used

I took on a volunteer project a few years ago. The site is set up with Joomla, but most of the articles are rendered with php scripts that pull info from non-Joomla tables. The database is now almost 50MB and several of the non-Joomla tables have 60,000+ rows -- I had no idea it would get this big. Even just pulling up the list of the articles that contain these scripts takes a long time -- and right now there are only about 30 of them. I initially thought the problem was because I'm on dial-up, so everything is slow, but then we started getting "resources exceeded" notices, so I figured I better find out what's going on. It's not a high traffic site -- we get less than 2,000 unique visitors in any given month.
In one particular instance, I have one table where the library holdings (books, etc.) are listed by title, author, pub date, etc. The second table contains the names mentioned in those books. I have a Joomla! article for each publication that lists the names found in that book. I also have an article that lists all of the names from all of the books. That is the query below -- but even the ones for the specific books that pull up only 1,000 or so entries are very slow.
I originally set up indexes for these tables (MyISAM), but when I went back to check, they weren't there. So I thought re-configuring the indexes would solve the problem. It didn't -- and according to EXPLAIN, they aren't even being used.
One of my problematic queries is as follows:
SELECT *
FROM pub_surnames
WHERE pub_surname_last REGEXP '^[A-B]'
ORDER BY pub_surname_last, pub_surname_first, pub_surname_middle
EXPLAIN gave:
id 1
select_type SIMPLE
table pub_surnames
type ALL
possible_keys NULL
key NULL
key_len NULL
ref NULL
rows 56422
Extra Using where; Using filesort
Also, phpMyAdmin says "Current selection does not contain a unique column."
All of the fields are required for this query, but I read here that it would help if I listed them individually, so I did. The table contains a primary key, and I added a second unique index containing the primary key for the table, as well as the primary key for the table that holds the information about the publication itself. I also added an index for the ORDER BY fields. But I still get the same results when I use EXPLAIN and the performance isn't improved at all.
I set these tables up within the Joomla! database that the site uses for connection purposes and it makes it easier to back everything up. I'm wondering now if it would help if I used a separate database for our non-Joomla tables? Or would that just make it worse?
I'm not really sure where to go from here.

I think you are probably approaching this the wrong way. It was probably the quick way to get it done when you first set it up, but now that the data has grown you are paying the price.
It sounds like you are recreating a massive list "inside" an article each time a page is rendered. Even though the source data is constantly being updated, you would probably be better off storing the results. (Assuming I understand your data structure correctly.) Not knowing exactly what your PHP scripts are doing makes it a little complicated -- it could be that it would make more sense to write a very simple component to read the data from the other tables, but I'll assume that doesn't make sense here.
Here's what I think you might want to do.
Create a cron job (really easy to make a script using Joomla; go take a look at the jacs repository) and use it to run whatever your PHP is doing. You can schedule it once a day, once an hour, or every 10 minutes, whatever makes sense.
Save the results. These could go into a database table, or you could cache them in the file system. Or both. Or possibly have the script update the articles, since they seem to be fixed (you aren't adding new ones, etc.).
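As a rough sketch of the database-table option, with hypothetical names (cached_name_list, pub_id); the SELECT would really be whatever your PHP scripts currently run:

CREATE TABLE cached_name_list (
  pub_id         INT UNSIGNED NOT NULL,  -- hypothetical key of the publication
  surname_last   VARCHAR(100) NOT NULL,
  surname_first  VARCHAR(100),
  surname_middle VARCHAR(100),
  INDEX idx_pub (pub_id, surname_last, surname_first, surname_middle)
);

-- the cron job rebuilds the cache, e.g. once an hour
TRUNCATE TABLE cached_name_list;
INSERT INTO cached_name_list (pub_id, surname_last, surname_first, surname_middle)
SELECT pub_id, pub_surname_last, pub_surname_first, pub_surname_middle
FROM pub_surnames;

The article pages then read from cached_name_list with a simple indexed query instead of re-running the heavy query on every page load.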
Then, when a user visits, you either read the article (if you stored the results there), have a component that renders the results, or make a plugin that manages the queries for you. You should not be running queries directly from inside an article layout -- it's just wrong, even if no one knows it's there. If you have to run queries, use a content plugin (similar to, say, the profile plugin), which does the queries in the right place architecturally.
Not knowing the exact purpose of what you are doing, it's hard to advise more, but if you are managing searches for people you'd likely be better off creating a way to use Finder (Joomla's Smart Search) to index and search the results.

Check out the suggestions below.
Try changing your storage engine to InnoDB, which generally handles large datasets better.
Also, replace the REGEXP in the WHERE part of the query: a regular expression can't use an index, so it forces a full scan and hugely affects execution time. Rewrite it as an indexable range condition instead (see the sketch below).
Instead of selecting all of the columns with "*", select only the columns you need.
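For instance, a minimal sketch of the indexable rewrite, assuming the table and column names from the question and a case-insensitive collation:

ALTER TABLE pub_surnames
  ADD INDEX idx_surname (pub_surname_last, pub_surname_first, pub_surname_middle);

SELECT pub_surname_last, pub_surname_first, pub_surname_middle
FROM pub_surnames
WHERE pub_surname_last >= 'A' AND pub_surname_last < 'C'  -- same rows as REGEXP '^[A-B]'
ORDER BY pub_surname_last, pub_surname_first, pub_surname_middle;

With that index in place, the range condition and the ORDER BY can both be satisfied from the index, so EXPLAIN should no longer show type ALL with Using filesort.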

Related

Seeking a performant solution for accessing unique MySQL entries

I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (of a region the size of which I haven't decided yet, possibly globally) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate against any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
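A minimal sketch of that, assuming a hypothetical entries table and the usual 36-character textual form of a .NET GUID:

CREATE TABLE entries (
  guid    CHAR(36) NOT NULL PRIMARY KEY,  -- e.g. '0f8fad5b-d9cb-469f-a165-70867728950e'
  payload TEXT
) ENGINE=InnoDB;

-- a lookup by GUID uses the primary key index
SELECT payload FROM entries
WHERE guid = '0f8fad5b-d9cb-469f-a165-70867728950e';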
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234.
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use them, be aware that GUIDs perform poorly once the table is bigger than RAM, because their randomness scatters inserts and lookups across the whole index.
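One common mitigation, sketched below with the same hypothetical entries table: store the GUID as BINARY(16) rather than a 36-character string, which keeps the primary key (and every secondary index, since InnoDB appends the PK to them) much smaller:

CREATE TABLE entries (
  guid    BINARY(16) NOT NULL PRIMARY KEY,
  payload TEXT
) ENGINE=InnoDB;

-- convert the textual GUID to 16 bytes on the way in...
INSERT INTO entries (guid, payload)
VALUES (UNHEX(REPLACE('0f8fad5b-d9cb-469f-a165-70867728950e', '-', '')), 'some data');

-- ...and when looking it up
SELECT payload FROM entries
WHERE guid = UNHEX(REPLACE('0f8fad5b-d9cb-469f-a165-70867728950e', '-', ''));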

MySQL partitioning or NoSQL like AWS DynamoDB?

Business logic:
My application crawls a lot (hundreds or sometimes thousands) of webpages every few hours and stores all the links (i.e. all anchor tags) on each webpage in a MySQL database table, say links. This table is growing very big day by day (already around 20 million records as of now).
Technical:
I have a combined unique index on [webpage_id, link] in the links table. I also have a column crawl_count in the same table.
Now whenever I crawl a webpage, I already know webpage_id (the foreign key to the webpages table), and I get the links on that webpage (i.e. an array of link values), which I just insert or update without worrying about what is already in the table:
INSERT INTO ........ ON DUPLICATE KEY UPDATE crawl_count=crawl_count+1
Problem:
The table grows big every day & I want to optimize the table for performance. Options I considered are,
Partitioning: Partition the table by domain. All webpages belong to a particular domain. For example, the webpage https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals belongs to the domain https://www.amazon.in/
NoSQL like DynamoDB: I have other application tables in the MySQL DB which I do not want to migrate to DynamoDB unless it's absolutely required. I have also considered a change in application logic (e.g. changing the structure of the webpages table to something like
{webpage: "http://example.com/new-brands", links: [link1, link2, link3]}
and migrating this table to DynamoDB so I don't have a links table at all. But again, there is a limit on every record in DynamoDB (400 KB). What if it exceeds this limit?
I have read the pros and cons of either approach. As far as my understanding goes, DynamoDB doesn't seem to be a good fit for my situation, but I still wanted to post this question so I can make a good decision for this scenario.
PARTITION BY domain -- No. There won't be any performance gain. Anyway, you will find that one domain dominates the table, and a zillion domains show up only once. (I'm speaking from experience.)
The only concept of an "array" is a separate table. It would have, in your case, webpage_id and link as a 2-column PRIMARY KEY (which is 'unique').
Normalize. This is to avoid having lots of copies of each domain and each link. This saves some space.
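A sketch of that layout, with hypothetical table names (the question only names webpages and links):

CREATE TABLE links (
  link_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url     VARCHAR(500) NOT NULL,   -- very long URLs may need a hash column instead
  UNIQUE KEY uk_url (url)
);

CREATE TABLE webpage_links (       -- one row per (webpage, link) pair
  webpage_id  INT UNSIGNED NOT NULL,
  link_id     INT UNSIGNED NOT NULL,
  crawl_count INT UNSIGNED NOT NULL DEFAULT 1,
  PRIMARY KEY (webpage_id, link_id)
);

INSERT INTO webpage_links (webpage_id, link_id)
VALUES (?, ?)
ON DUPLICATE KEY UPDATE crawl_count = crawl_count + 1;

Each URL string is stored once in links; the big table carries only two integers per pair, which keeps it and its unique index far smaller.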
I assume you have two categories of links -- the ones for pages you have scanned, and the ones for pages waiting to be scanned. And probably the two sets are similar in size. I don't understand the purpose of crawl_count, but it adds to the cost.
I may be able to advise further if I could see the queries -- both inserting and selecting. Also, how big are the tables (GB) and what is the value of innodb_buffer_pool_size? Putting these together, we can discuss likely points of sluggishness.
The slow query log would also help.
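For reference, a sketch of how to gather those numbers with standard MySQL commands (nothing here is specific to your schema):

-- approximate size of each table, data plus indexes, in GB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY size_gb DESC;

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'slow_query_log%';   -- is the slow log on, and where is it written?
SHOW VARIABLES LIKE 'long_query_time';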
Are you dealing with non-ASCII URLs? URLs too long to index? Do you split URLs into domain + path? Do you strip off "#..."? And "?..."?

MySQL Best Practice for adding columns

So I started working for a company where they had 3 to 5 different tables that were often queried through either a complex join or a double or triple query (I'm probably the 4th person to start working here; it's very messy).
Anyhow, I created a table that, whenever the other 3 to 5 tables are queried together, gets that data inserted into it along with whatever information normally got inserted there. It has drastically sped up page loads for many applications, and I'm wondering if I made a mistake here.
I'm hoping in the future to stop inserting into those other tables, simply insert all that information into the table I've created, and switch the applications over to that one table. It's just a lot faster.
Could someone tell me why it's much faster to group all the information into one massive table and if there is any downside to doing it this way?
If the joins are slow, it may be because the tables did not have FOREIGN KEY relationships and indexes properly defined. If the tables had been properly normalized before, it is probably not a good idea to denormalize them into a single table unless they were not performant with proper indexing. FOREIGN KEY constraints require indexing on both the PK table and the related FK column, so simply defining those constraints if they don't already exist may go a long way toward improving performance.
The first course of action is to make sure the table relationships are defined correctly and the tables are indexed before you begin denormalizing them.
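As a sketch, with hypothetical orders / customers tables standing in for the real ones (the actual statement depends on your column names):

-- requires InnoDB; MyISAM parses FOREIGN KEY clauses but ignores them
ALTER TABLE orders
  ADD INDEX idx_orders_customer (customer_id),
  ADD CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES customers (id);

The index on the FK column is what actually speeds up the join; the constraint keeps the data consistent.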
There is a concept called materialized views, which serve as a sort of cache for views or queries whose result sets are deterministic, by storing the results of a view's query into a temporary table. MySQL does not support materialized views directly, but you can implement them by occasionally selecting all rows from a multi-table query and storing the output into a table. When the data in that table is stale, you overwrite it with a new rowset. For simple SELECT queries which are used to display data that doesn't change often, you may be able to speed up your pageloads using this method. It is not advisable to use it for data which is constantly changing though.
A good use for materialized views might be constructing rows to populate your site's dropdown lists or to store the result of complicated reports which are only run once a week. A bad use for them would be to store customer order information, which requires timely access.
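A minimal sketch of that pattern, using a hypothetical weekly report as the "view":

-- build the cache table once from the expensive multi-table query
CREATE TABLE report_cache AS
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;

-- refresh it when the data goes stale (e.g. from a scheduled job)
TRUNCATE TABLE report_cache;
INSERT INTO report_cache
SELECT c.name, COUNT(o.id)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;

Pages then SELECT from report_cache instead of re-running the join.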
Without seeing the table structures, etc it would be guesswork. But it sounds like possibly the database was over-normalized.
It is hard to say exactly what the issue is without seeing it. But you might want to look at adding indexes, and foreign keys to the tables.
If you are adding a table with all of the data in it, you might be denormalizing the database.
There are some cases where de-normalizing your tables has its advantages, but I would be more interested in finding out if the problem really lies with the table schema or with how the queries are being written. You need to know if the queries utilize indexes (or whether indexes need to be added to the table), whether the original query writer did things like using subselects when they could have been using joins to make a query more efficient, etc.
I would not just denormalize because it makes things faster unless there is a good reason for it.
Having a separate copy of the data in your newly defined table is a valid performance-enhancing practice, but on the other hand it might become a total mess when it comes to keeping the data in your table and the other ones in sync. You essentially have two sources of truth, without a good idea of how to invalidate this "cache" when it comes to updates/deletes.
Read more about "normalization" and read more about "EXPLAIN" in MySQL -- it will tell you why the other queries are slow, and you might get away with a few proper indexes and foreign keys instead of copying the data.

Is there any reason not to use auto_increment on an index for a database table?

I've inherited the task of maintaining a very poorly-coded e-commerce site and I'm working on refactoring a lot of the code and trying to fix ongoing bugs.
Every database insert (adding an item to cart, etc.) begins with a grab_new_id function which COUNTs the number of rows in the table, then, starting with that number, queries the database to find an unused index number. In addition to being terrible performance-wise (there are 40,000+ rows already, and indexes are regularly deleted, so sometimes it takes several seconds just to find a new id), this breaks regularly when two operations are performed simultaneously, as two entries are added with duplicate id numbers.
This seems idiotic to me - why not just use auto-increment on the index field? I've tested it both ways, and adding rows to the table without specifying an index id is (obviously) many times faster. My question is: can anyone think of any reason the original programmer might have done this? Is there some school of thought where auto_increment is somehow considered bad form? Are there databases that don't have auto-increment capabilities?
I've seen this before from someone that didn't know that feature existed. Definitely use the auto-increment feature.
Some people take the "roll your own" approach to everything, often because they haven't taken the time to see if that is an available feature or if someone else had already come up with it. You'll often see crazy workarounds or poor performing/fragile code from these people. Inheriting a bad database is no fun at all, good luck!
Well, Oracle has sequences but not auto-generated ids, as I understand it. Usually, though, this kind of thing is done by devs who don't understand database programming and who hate to see gaps in the data (as you get from rollbacks). There are also people who like to create the id themselves so they have it available beforehand to use for child tables, but most databases with autogenerated ids also have a way to return that id to the user at the time of creation.
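In MySQL that looks like the sketch below (hypothetical parent/child tables); LAST_INSERT_ID() returns the id generated by the current connection's last insert:

CREATE TABLE parent (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE child (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parent_id INT UNSIGNED NOT NULL,
  note      VARCHAR(100)
);

INSERT INTO parent (name) VALUES ('example');

-- the generated id is immediately available for the child rows
INSERT INTO child (parent_id, note) VALUES (LAST_INSERT_ID(), 'first child');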
The only issue I've found partially reasonable (but totally avoidable!) against auto_inc fields is that some backup tools by default include the auto_inc counter in the table definition even if you don't include the data in the dump, which may be inconvenient.
Depending on the specific situation, there are clearly many reasons for not using consecutive numbers as a primary key.
However, given that I do want consecutive numbers as a primary key, I see no reason not to use the built-in auto_increment functionality MySQL offers.
It was probably done that way for historical reasons; i.e. earlier versions didn't have autoinc variables. I've written code that uses manual autoinc fields on databases that don't support autoinc types, but my code wasn't quite as inefficient as pulling a count().
One issue with using autoinc fields as a primary key is that moving records in and out of tables may result in the primary key changing. So, I'd recommend designing in a "LegacyID" field up front that can be used as future storage for the primary key for times when you are moving records in and out of the table.
They may just have been inexperienced and unfamiliar with auto increment. One reason I can think of, but doesn't necessarily make much sense, is that it is difficult (not impossible) to copy data from one environment to another when using auto increment id's.
For this reason, I have used sequential Guids as my primary key before for ease of transitioning data, but counting the rows to populate the ID is a bit of a WTF.
Two things to watch for:
1. How your RDBMS sets the auto-increment value upon restart. Our engineers were rolling their own auto-increment key to get around the auto-increment field jumping by an order of 100,000s whenever the server restarted. However, at some point Sybase added an option to set the size of the auto-increment step.
2. The other place where auto-increment can get nasty is if you are replicating databases and are using a master-master configuration. If you write to both databases (NOT ADVISED), you can run into identity collisions.
I doubt either of these were the case, but things to be aware of.
I could see it if the ids were generated on the client and pushed into the database (this is common practice when speed is necessary), but what you described seems over the top and unnecessary. Remove it and use an auto-incrementing id.

Big tables and analysis in MySql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose to about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 million rows is still a relatively small size for a table (and therefore faster performance is typically possible).
As you found out, leading wildcards are particularly inefficient, and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes typically improve DBMS performance with SELECT statements of all kinds, they systematically have a negative effect on the performance of "CUD" operations (i.e. the SQL INSERT, UPDATE and DELETE verbs, i.e. the queries which write to the database rather than just read from it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERT queries get slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database into the database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
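A rough sketch of that staging pattern, with hypothetical table names:

-- the application writes to a small, index-light staging table;
-- a scheduled job periodically moves the rows into the indexed reporting copy
INSERT INTO acts_reporting
SELECT * FROM acts_staging;

DELETE FROM acts_staging;
-- note: in production you would rotate tables (RENAME TABLE) or remember the
-- highest imported id, so rows arriving between the two statements are not lost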
Another concern/question is the width of a row in the acts table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to rows that are too wide, resulting in too few rows per leaf node of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
In view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made with at least these two keys, in this order). A full analysis of the index situation, and frankly a possible review of the database schema, cannot be done here (not enough info for that), but the general process is to list the most common use cases and see which database indexes could help with them. One can gather insight into how particular queries are handled, initially or after index(es) are added, with the MySQL command EXPLAIN.
Normalization or denormalization (or indeed a combination of both!) is often a viable idea for improving performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
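Something along these lines, sketched with the note form of the query (check the plan with EXPLAIN either way):

SELECT COUNT(DISTINCT a.ip)
FROM (SELECT DISTINCT ip
      FROM acts
      WHERE note = 'some_marketing_page') AS a
JOIN users ON users.ip = a.ip;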
Secondly, that LIKE with a leading wildcard is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.note, if an index on note doesn't provide sufficient improvement, I'd consider calculating and storing an integer hash of note and searching for that.
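A sketch of that hash approach, using MySQL's built-in CRC32() (any stable integer hash would do):

ALTER TABLE acts
  ADD COLUMN note_hash INT UNSIGNED,
  ADD INDEX idx_note_hash (note_hash);

UPDATE acts SET note_hash = CRC32(note);

-- the extra comparison on note guards against hash collisions
SELECT COUNT(DISTINCT ip)
FROM acts
WHERE note_hash = CRC32('some_marketing_page')
  AND note = 'some_marketing_page';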
Try running EXPLAIN on your query and look to see whether there are any full table scans.
Should this be a LEFT JOIN?
Maybe this site can help.