Big tables and analysis in MySql - mysql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?

While getting towards 'respectable', 2 Millions rows is still a relatively small size for a table. (And therefore a faster performance is typically possible)
As you found out, the front-ended wildcard are particularly inefficient and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve the DBMS performance with SELECT statements of all kinds, it systematically has a negative effect on the performance of "CUD" operations (i.e. with the SQL CREATE/INSERT, UPDATE, DELETE verbs, i.e. the queries which write to the database rather than just read to it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERTs queries get to be slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database to a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
Another concern/question is about the width of a row in the cast table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
in view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made at least with these two keys in this order). A full analysis of the index situation, and frankly a possible review of the database schema cannot be done here (not enough info for one...) but the general process for doing so is to make the list of the most common use case and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with mySQL command EXPLAIN.
Normalization OR demormalization (or indeed a combination of both!), is often a viable idea for improving performance during mining operations as well.

Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.

Try running 'EXPLAIN PLAN' on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.

Related

MySQL self join performance: fact or just bad indexing?

As an example: I'm having a database to detect visitor (bots, etc) and since not every visitor have the same amount of 'credential' I made a 'dynamic' table like so: see fiddle: http://sqlfiddle.com/#!9/ca4c8/1 (simplified version).
This returns me the profile ID that I use to gather info about each profile (in another DB). Depending on the profile type I query the table with different nameclause (name='something') (ei: hostname, ipAddr, userAgent, HumanId, etc).
I'm not an expert in SQL but I'm familiar with indexes, constraints, primary, unique, foreign key etc. And from what I saw from these search results:
Mysql Self-Join Performance
How to tune self-join table in mysql like this?
Optimize MySQL self join query
JOIN Performance Issue MySQL
MySQL JOIN performance issue
Most of them have comments about bad performance on self-join but answers tend to go for the missing index cause.
So the final question is: is self joining a table makes it more prone to bad performance assuming that everything is indexed properly?
On a side note, more information about the table: might be irrelevant to the question but is well in context for my particular situation:
column flag is used to mark records for deletion as the user I use from php don't have DELETE permission over this database. Sorry, Security is more important than performance
I added the 'type' that will go with info I get from the user agent. (ie: if anything is (at least seems to be) a bot, we will only search for type 5000.
Column 'name' is unfortunately a varchar indexed in the primary key (with profile and type).
I tried to use as much INT and filtering (WHERE) in the SELECT query to reduce eventual lost of performance (if that even matters)
I'm willing to study and tweak the thing if needed unless someone with a high background in mysql tells me it's really not a good thing to do.
This is a big project I have in development so I cannot test it with millions of records for now but I wonder if performance will be an issues as this grows. Any input, links, references, documentation or test procedure (maybe in comments) will be appreciated.
A self-join is no different than joining two different tables. The optimizer will pick one 'table', usually based on the WHERE, then do a Nested Loop Join into the other. In your case, you have implied, via LEFT, that it should work only one way. (The Optimizer will ignore that if it sees no need for it.
Your keys are find for that Fiddle.
The real problem is "Entity-Attribute-Value", which is a messy way to lay out data in tables. Your query seems to be saying "find a (limit 1) profile (entity) that has a certain pair of attributes (name = Googlebot AND addr = ...).
It would be so much easier, and faster, to have two columns (name and addr) and a "composite" INDEX(name, addr).
I recommend doing that for the common "attributes", then put the rest into a single column with a JSON string. See here.

Is it faster to only query specific columns?

I've heard that it is faster to select colums manually ("col1, col2, col3, etc") instead of querying them all with "*".
But what if I don't even want to query all columns of a table? Would it be faster to query, for Example, only "col1, col2" insteaf of "col1, col2, col3, col4"?
From my understanding SQL has to search through all of the columns anyway, and just the return-result changes. I'd like to know if I can achieve a gain in performance by only choosing the right columns.
(I'm doing this anyway, but a backend API of one of my applications returns more often than not all columns, so I'm thinking about letting the user manually select the columns he want)
In general, reducing the number of columns in the select is a minor optimization. It means that less data is being returned from the database server to the application calling the server. Less data is usually faster.
Under most circumstances, this a minor improvement. There are some cases where the improvement can be more important:
If a covering index is available for the query, so the index satisfies the query without having to access data pages.
If some fields are very long, so records occupy multiple pages.
If the volume of data being retrieved is a small fraction (think < 10%) of the overall data in each record.
Listing the columns individually is a good idea, because it protects code from changes in underlying schema. For instance, if the name of a column is changed, then a query that lists columns explicitly will break with an easy-to-understand error. This is better than a query that runs and produces erroneous results.
You should try not to use select *.
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
I got this from this answer.
I believe this topic has already been covered here:
select * vs select column
I believe it covers your concerns as well. Please take a look.
All the column labels and values occupy some space. Sending them to the issuer of the request instead of a subset of the columns means sending more data. More data is sent slower.
If you have columns, like
id, username, password, email, bio, url
and you want to get only the username and password, then
select username, password ...
is quicker than
select * ...
because id, email, bio and url are sent as well for the latter, which makes the response larger. But the main problem with select * is different. It might be the source of inconsistencies if, for some reason the order of the columns changed. Also, it might retrieve data you do not want to retrieve. It is always better to have a whitelist with the columns you actually want to retrieve.

What's wrong with using a wildcard in your mySQL query? [duplicate]

This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 9 years ago.
Basically what's the difference in terms of security and speed in these 2 queries?
SELECT * FROM `myTable`
and
SELECT `id`, `name`, `location`, `place` etc... FROM `myTable`
Would using * increase the benchmark on my query and perform slower than static rows?
There won't be much appreciable difference in performance if you also select all columns individually.
The idea is to select only the data you require and no more, which can improve performance if there is alot of unneeded columns in your query, for example, when you join several tables.
Ofc, on the other side of the coin, using * makes life easier when you make changes to the table.
Security-wise, the less you select, the less potentially sensitive data can be inadvertently dumped to the user's browser. Imagine if * included the column social_security_number and somewhere in your debug code it gets printed out as an HTML comment.
Performance-wise, in many cases your database is on another server, so requesting the entire row when you only need a small part of it means a lot more data going over the network.
There is not a single, simple answer, and your literal question cannot fully be answered without more detail of the specific table structure, but I'm going with the assumption that you aren't actually talking about a specific query against a specific table, but rather about selecting columns explicitly or using the *.
SELECT * is always wasteful of something unless you are actually going to use every column that exists in the rows you're reading... maybe network bandwidth, or CPU resources, or disk I/O, or a combination, but something is being unnecessarily used, though that "something" may be in very small and imperceptible quantities ... or it may not ... but it can add up over time.
The two big examples that come to mind where SELECT * can be a performance killer are cases like...
...tables with long VARCHAR and BLOB columns:
Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages
— http://dev.mysql.com/doc/refman/5.6/en/innodb-row-format-overview.html
So if * includes columns that weren't stored on-page with the rest of the row data, you just took an I/O hit and/or wasted space in your buffer pool with accesses that could have been avoided had you selected only what you needed.
...also cases where SELECT * prevents the query from using a covering index:
If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data.
— http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
When one or more columns are indexed, copies of the column data are stored, sorted, in the index tree, which also includes the primary key, for finding the rest of the row data. When selecting from a table, if all of the columns you are selecting can be found within a single index, the optimizer will automatically choose to return the data to you by reading it directly from the index, instead of going to the time and effort to read in all of the row data... and this, some cases, is a very significant difference in the performance of a query, because it can mean substantially smaller resource usage.
If EXPLAIN SELECT does not reveal the exact same query plan when selecting the individual columns you need compared with the plan used when selecting *, then you are looking at some fairly hard evidence that you are putting the server through unnecessary work by selecting things you aren't going to use.
In additional cases, such as with the information_schema tables, the columns you select can make a particularly dramatic and obvious difference in performance. The information_schema tables are not actually tables -- they're server internal structures exposed via the SQL interface... and the columns you select can significantly change the performance of the query because the server has to do more work to calculate the values of some columns, compared to others. A similar situation is true of FEDERATED tables, which actually fetch data from a remote MySQL server to make a foreign table appear logically to be local. The columns you select are actually transferred across the network between servers.
Explicitly selecting the columns you need can also lead to fewer sneaky bugs. If a column you were using in code is later dropped from a table, the place in your code's data structure -- in some languages -- is going to contain an undefined value, which in many languages is the same think you would see if the column still existed but was null... so the code thinks "okay, that's null, so..." a logical error follows. Had you explicitly selected the columns you wanted, subsequent executions of the query would throw a hard error instead of quietly misbehaving.
MySQL's C-client API, which some other client libraries are built on, supports two modes of fetching data, one of which is mysql_store_result, which buffers the data from the server on the client side before the application actually reads it into its internal structures... so as you are "reading from the server" you may have already implicitly allocated a lot of memory on the client side to store that incoming result-set even when you think you're fetching a row at a time. Selecting unnecessary columns means even more memory needed.
SELECT COUNT(*) is an exception. The COUNT() function counts the number of non-null values seen, and * merely means "count the rows"... it doesn't examine column data, so if you want a star there, go for it.
As a favor to your future self, unless you want to go back later and rewrite all of those queries when you're trying to get more performance out of your server, you should bite the bullet and do the extra typing, now.
As a bonus, when other people see your code, they won't accuse you of laziness or inexperience.

Using MySQL to search through large data sets?

Now I'm a really advanced PHP developer and heavily knowledged on small-scale MySQL sets, however I'm now building a large infrastructure for a startup I've recently joined and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know what is the best way to search through large data sets (it currently resides at 84.9 million) rows with a database size of 394.4 gigabytes. It is hosted using Amazon RDS so it does not have any downtime or slowness, it's just that I want to know what's the best way to access large data sets internally.
For example, if I wanted to search through a database of 84 million rows it takes me 6 minutes. Now, if I made a direct request to a specific id or title it would serve it instantly. So how would I search through a large data set.
Just to remind you, it's fast to find information through database by passing in one variable but when searching it performs VERY slow.
MySQL query example:
SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100
That query under 84m rows is sigificantly slow. Specifically 47.41 seconds to perform this query standalone, any ideas guys?
All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't very good for large data sets and something like Oracle or MSSQL however I've been told to rebuild it on MySQL rather than other database solutions at this moment.
LIKE is VERY slow for a variety of reasons:
Unless your LIKE expression starts with a constant, no index will be used.
E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any indexes on "user_name" field.
String matching is complex (algorythmically) business compared to regular operators.
To resolve:
Make sure your LIKE expression starts with a constant, not a wildcard, if you have an index on that field you might be able to use.
Consider making an index table (in the literature/library context of the word "index", not a database index context) if you search for whole words. Or a substring lookup table if searching for random often repeating substrings.
E.g. if all user names are of the form "FN LN" or "LN, FN" - split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query.
LIKE ('%james%smith%')
Avoid these things like the plague. They are impossible for a general DBMS to optimise.
The right way is to calculate things like this (first and last names) at the time where the data is inserted or updated so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
Or, if you want all words in the column, have the trigger break the data into words then have an application-level index table to find relevant records, something like:
main_table:
id integer primary key
blah blah blah
text varchar(60)
appl_index:
id index
word varchar(20)
primary key (id,word)
index (word)
Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable like '%...'. You could also break the actual words out to a separate table and use word IDs but that's a matter of taste - it's effect on performance would be questionable.
You may well have a similar problems with f.friend_from||f.friend_to but I've not seen that syntax before (if, as it seems to be, the context is u.user_id can be one or the other).
Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
And, as with all optimisation questions, measure, don't guess!

mysql index optimization for a table with multiple indexes that index some of the same columns

I have a table that stores some basic data about visitor sessions on third party web sites. This is its structure:
id, site_id, unixtime, unixtime_last, ip_address, uid
There are four indexes: id, site_id/unixtime, site_id/ip_address, and site_id/uid
There are many different types of ways that we query this table, and all of them are specific to the site_id. The index with unixtime is used to display the list of visitors for a given date or time range. The other two are used to find all visits from an IP address or a "uid" (a unique cookie value created for each visitor), as well as determining if this is a new visitor or a returning visitor.
Obviously storing site_id inside 3 indexes is inefficient for both write speed and storage, but I see no way around it, since I need to be able to quickly query this data for a given specific site_id.
Any ideas on making this more efficient?
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct? Because I considered having the site_id being the second column of the index for both ip_address and uid but I think that would make the index less efficient since the IP and UID are going to vary more than the site ID will, because we only have about 8000 unique sites per database server, but millions of unique visitors across all ~8000 sites on a daily basis.
I've also considered removing site_id from the IP and UID indexes completely, since the chances of the same visitor going to multiple sites that share the same database server are quite small, but in cases where this does happen, I fear it could be quite slow to determine if this is a new visitor to this site_id or not. The query would be something like:
select id from sessions where uid = 'value' and site_id = 123 limit 1
... so if this visitor had visited this site before, it would only need to find one row with this site_id before it stopped. This wouldn't be super fast necessarily, but acceptably fast. But say we have a site that gets 500,000 visitors a day, and a particular visitor loves this site and goes there 10 times a day. Now they happen to hit another site on the same database server for the first time. The above query could take quite a long time to search through all of the potentially thousands of rows for this UID, scattered all over the disk, since it wouldn't be finding one for this site ID.
Any insight on making this as efficient as possible would be appreciated :)
Update - this is a MyISAM table with MySQL 5.0. My concerns are both with performance as well as storage space. This table is both read and write heavy. If I had to choose between performance and storage, my biggest concern is performance - but both are important.
We use memcached heavily in all areas of our service, but that's not an excuse to not care about the database design. I want the database to be as efficient as possible.
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
There is one important property of B-tree indices you need to be aware of: It is possible (efficient) to search for an arbitrary prefix of the full key, but not a suffix. If you have an index site_ip(site_id, ip), and you ask for where ip = 1.2.3.4, MySQL will not use the site_ip index. If you instead had ip_site(ip, site_id), then MySQL would be able to use the ip_site index.
The is a second property of B-tree indices you should be aware of as well: they are sorted. A b-tree index can be used for queries like where site_id < 40.
There is also an important property of disk drives to keep in mind: sequential reads are cheap, seeks are not. If there are any columns used that are not in the index, MySQL must read the row from the table data. That's generally a seek, and slow. So if MySQL believes it'd wind up reading even a small percent of the table like this, it'll instead ignore the index. One big table scan (a sequential read) is usually faster than random reads of even a few percent of the rows in a table.
The same, by the way, applies to seeks through an index. Finding a key in a B-tree actually potentially requires a few seeks, so you'll find that WHERE site_id > 800 AND ip = '1.2.3.4' may not use the site_ip index, becuase each site_id requires several index seeks to find the start of the 1.2.3.4 records for that site. The ip_site index, however, would be used.
Ultimately, you're going to have to make liberal use of benchmarking and EXPLAIN to figure out the best indices for your database. Remember, you can freely add and drop indices as needed. Non-unique indices are not part of your data model; they are merely an optimization.
PS: Benchmark InnoDB as well, it often has better concurrent performance. Same with PostgreSQL.
First of all, if you are using ip as a string than change it to INT UNSIGNED column and use INET_ATON(expr) and INET_NTOA(expr) function to deal with this. Indexing on integer value is more efficient than indexing on strings of variable length.
Well indexes trade storage for performance. Its hard if you want both. Its hard to optimize this any further without know all the queries you run and their quantities per interval.
What you have will work. If you're running into a bottleneck, you'll need to find out whether its cpu,ram,disk and/or network and adjust accordingly. Its hard and wrong to prematurely optimize.
You probably want to switch to innodb if you have any updates, other wise myisam is good for insert/select. Also since your row size is small, you could look into mysql cluster (nbd). There is also an archive engine that can help with storage requirements but partitioning in 5.1 is probably a better thing to look into.
Flipping the order of your index doesn't make any sense, if these indexes are already used in all of your queries.
but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
not sure but I haven't heard this before. Doesn't seem true to me for this application. The index order matters for sorting and by having multiple unique 1st most index fields, allows more possible queries to use index.