Can anyone recommend a strategy for aggregating raw 'click' and 'impression' data stored in a MySQL table with over 100,000,000 rows?
Here is the table structure.
CREATE TABLE `clicks` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`companyid` int(11) DEFAULT '0',
`type` varchar(32) NOT NULL DEFAULT '',
`contextid` int(11) NOT NULL DEFAULT '0',
`period` varchar(16) NOT NULL DEFAULT '',
`timestamp` int(11) NOT NULL DEFAULT '0',
`location` varchar(32) NOT NULL DEFAULT '',
`ip` varchar(32) DEFAULT NULL,
`useragent` varchar(64) DEFAULT NULL,
`processed` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `type` (`type`),
KEY `companyid` (`companyid`),
KEY `period` (`period`),
KEY `contextid` (`contextid`)
) ENGINE=MyISAM AUTO_INCREMENT=21189 DEFAULT CHARSET=latin1;
What I want to do is make this data easier to work with. I want to extract weekly and monthly aggregates from it, grouped by type, companyid and contextid.
Ideally, I'd like to take this data off the production server, aggregate it and then merge it back.
I'm really in a bit of a pickle and wondered whether anyone had any good starting points or strategies for actually aggregating the data so that it can be queried quickly using MySQL. I do not require 'real time' reporting for this data.
I've tried batch PHP scripts in the past, but they seemed quite slow.
You can implement a simple PHP script containing the whole monthly/weekly aggregation logic and run it via a cron job at a set time. Depending on the software context, it could be scheduled to run at night. Additionally, you could pass a GET parameter in the request to identify the request source.
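For illustration, here is a rough sketch of what that script could execute. The summary table name, the weekly granularity, the use of the existing processed flag, and the assumption that timestamp holds a Unix epoch are all mine, not from the question.

-- Hypothetical weekly summary table (a monthly one would look the same with a yearmonth column).
CREATE TABLE clicks_weekly (
  `yearweek`  int NOT NULL,           -- value of YEARWEEK(), e.g. 201137
  `type`      varchar(32) NOT NULL,
  `companyid` int NOT NULL,
  `contextid` int NOT NULL,
  `clicks`    int unsigned NOT NULL,
  PRIMARY KEY (`yearweek`, `type`, `companyid`, `contextid`)
) ENGINE=InnoDB;

-- Statement the cron job could run: fold unprocessed rows into the summary, then flag them.
-- In production you would bound the batch by id so rows inserted between the two
-- statements are not marked as processed without being counted.
INSERT INTO clicks_weekly (yearweek, type, companyid, contextid, clicks)
SELECT YEARWEEK(FROM_UNIXTIME(`timestamp`)), `type`, IFNULL(companyid, 0), contextid, COUNT(*)
FROM clicks
WHERE processed = 0
GROUP BY 1, 2, 3, 4
ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks);

UPDATE clicks SET processed = 1 WHERE processed = 0;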
You might be interested in MySQL replication... set up a 2nd server whose sole job is to run the reports on the replicated copy of the data set, so you can tune it specifically for that job. If you set up your replication scheme as master-master, then when the report server updates its own tables based on the report findings, those database changes will automatically replicate back to the production server.
Also, I would highly recommend you read High Performance MySQL, 3rd Ed., and take a look at http://www.mysqlperformanceblog.com/ for further info on working with massive datasets in MySQL.
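For reference, pointing a freshly loaded report server at production looks roughly like the sketch below. The host, user, password and log coordinates are placeholders, and server-id/log-bin must already be configured in each server's my.cnf; the MySQL manual covers the full master-master setup.

-- On the report server, after loading a snapshot of the production data.
-- The coordinates come from SHOW MASTER STATUS on the production server.
CHANGE MASTER TO
  MASTER_HOST     = 'prod-db.example.com',   -- placeholder
  MASTER_USER     = 'repl',                  -- placeholder replication account
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;

-- Check that replication is running:
SHOW SLAVE STATUS\G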
I'm trying to denormalize a few MySQL tables I have into a new table that I can use to speed up some complex queries with lots of business logic. The problem I'm having is that there are 2.3 million records I need to add to the new table, and to do that I need to pull data from several tables and do a few conversions too. Here's my query (with names changed):
INSERT INTO database_name.log_set_logs
(offload_date, vehicle, jurisdiction, baselog_path, path,
baselog_index_guid, new_location, log_set_name, index_guid)
(
select STR_TO_DATE(logset_logs.offload_date, '%Y.%m.%d') as offload_date,
logset_logs.vehicle, jurisdiction, baselog_path, path,
baselog_trees.baselog_index_guid, new_location, logset_logs.log_set_name,
logset_logs.index_guid
from
(
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 7), '/', -1) as offload_date,
SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) as vehicle,
SUBSTRING_INDEX(path, '/', 9) as baselog_path, index_guid,
path, log_set_name
FROM database_name.baselog_and_amendment_guid_to_path_mappings
) logset_logs
left join database_name.log_trees baselog_trees
ON baselog_trees.original_location = logset_logs.baselog_path
left join database_name.baselog_offload_location location
ON location.baselog_index_guid = baselog_trees.baselog_index_guid);
The query itself works because I was able to run it using a filter on log_set_name. However, that filter's condition only covers less than 1% of the total records, because one value of log_set_name accounts for 2.2 million records, the vast majority of the table. So there is nothing else I can see that would let me break this query up into smaller chunks. The problem is that the query takes too long to run on the remaining 2.2 million records: it times out after a few hours, the transaction is rolled back, and nothing is added to the new table for those records. Only the 0.1 million records could be processed, and that was only because I could add a filter saying where log_set_name != 'value with the 2.2 million records'.
Is there a way to make this query more performant? Am I trying to do too many joins at once, and should I perhaps populate the row's columns in their own individual queries? Or is there some way I can page this type of query so that MySQL executes it in batches? I already got rid of all my indexes on the log_set_logs table because I read that those slow down inserts. I also bumped my RDS instance up to a db.r4.4xlarge write node. I am also using MySQL Workbench, so I increased all of its timeout values to their maximums, giving them all nines. All three of these steps helped and were necessary for me to get the 1% of the records into the new table, but it still wasn't enough to get the 2.2 million records in without timing out. I'd appreciate any insights, as I'm not adept at this type of bulk insert from a select.
CREATE TABLE `log_set_logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`purged` tinyint(1) NOT NULL DEFAULT '0',
`baselog_path` text,
`baselog_index_guid` varchar(36) DEFAULT NULL,
`new_location` text,
`offload_date` date NOT NULL,
`jurisdiction` varchar(20) DEFAULT NULL,
`vehicle` varchar(20) DEFAULT NULL,
`index_guid` varchar(36) NOT NULL,
`path` text NOT NULL,
`log_set_name` varchar(60) NOT NULL,
`protected_by_retention_condition_1` tinyint(1) NOT NULL DEFAULT '1',
`protected_by_retention_condition_2` tinyint(1) NOT NULL DEFAULT '1',
`protected_by_retention_condition_3` tinyint(1) NOT NULL DEFAULT '1',
`protected_by_retention_condition_4` tinyint(1) NOT NULL DEFAULT '1',
`general_comments_about_this_log` text,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1736707 DEFAULT CHARSET=latin1
CREATE TABLE `baselog_and_amendment_guid_to_path_mappings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`path` text NOT NULL,
`index_guid` varchar(36) NOT NULL,
`log_set_name` varchar(60) NOT NULL,
PRIMARY KEY (`id`),
KEY `log_set_name_index` (`log_set_name`),
KEY `path_index` (`path`(42))
) ENGINE=InnoDB AUTO_INCREMENT=2387821 DEFAULT CHARSET=latin1
...
CREATE TABLE `baselog_offload_location` (
`baselog_index_guid` varchar(36) NOT NULL,
`jurisdiction` varchar(20) NOT NULL,
KEY `baselog_index` (`baselog_index_guid`),
KEY `jurisdiction` (`jurisdiction`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `log_trees` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`baselog_index_guid` varchar(36) DEFAULT NULL,
`original_location` text NOT NULL, -- This is what I have to join everything on, and since it's text I cannot index it; the largest value is over 255 characters, so I cannot change it to a varchar and index it either.
`new_location` text,
`distcp_returncode` int(11) DEFAULT NULL,
`distcp_job_id` text,
`distcp_stdout` text,
`distcp_stderr` text,
`validation_attempt` int(11) NOT NULL DEFAULT '0',
`validation_result` tinyint(1) NOT NULL DEFAULT '0',
`archived` tinyint(1) NOT NULL DEFAULT '0',
`archived_at` timestamp NULL DEFAULT NULL,
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`dir_exists` tinyint(1) NOT NULL DEFAULT '0',
`random_guid` tinyint(1) NOT NULL DEFAULT '0',
`offload_date` date NOT NULL,
`vehicle` varchar(20) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `baselog_index_guid` (`baselog_index_guid`)
) ENGINE=InnoDB AUTO_INCREMENT=1028617 DEFAULT CHARSET=latin1
baselog_offload_location has no PRIMARY KEY; what's up?
GUIDs/UUIDs can be terribly inefficient. A partial solution is to convert them to BINARY(16) to shrink them. More details here: http://mysql.rjweb.org/doc.php/uuid (MySQL 8.0 has similar functions built in).
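As a sketch of the migration, assuming MySQL 8.0's UUID helpers are available (on older versions a REPLACE/UNHEX expression does the same job), the column and table names here are taken from the question but the steps are illustrative:

-- Add a binary twin of the GUID column, backfill it, then switch the joins and
-- indexes over to it and drop the old VARCHAR(36) column.
ALTER TABLE database_name.log_trees
  ADD COLUMN baselog_index_guid_bin BINARY(16);

UPDATE database_name.log_trees
   SET baselog_index_guid_bin = UUID_TO_BIN(baselog_index_guid);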
It would probably be more efficient to have a separate (optionally redundant) column for vehicle rather than needing to do
SUBSTRING_INDEX(SUBSTRING_INDEX(path, '/', 8), '/', -1) as vehicle
Why JOIN baselog_offload_location? There seems to be no reference to columns in that table. If there are, be sure to qualify them so we know what is where. Preferably use short aliases.
The lack of an index on baselog_index_guid may be critical to performance.
Please provide EXPLAIN SELECT ... for the SELECT in your INSERT and for the original (slow) query.
SELECT MAX(LENGTH(original_location)) FROM .. -- to see if it really is too big to index. What version of MySQL are you using? The limit increased recently.
For the above item, if original_location really is too big to index, we can talk about indexing a 'hash' of it instead.
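A sketch of that 'hash' idea, assuming MySQL 5.7+ generated columns are available (on older versions, maintain the extra column from triggers or application code):

-- Index a 16-byte hash of the long text column instead of the column itself.
ALTER TABLE database_name.log_trees
  ADD COLUMN original_location_hash BINARY(16)
      AS (UNHEX(MD5(original_location))) STORED,
  ADD INDEX idx_original_location_hash (original_location_hash);

-- The join condition then becomes an indexed equality:
--   ON bt.original_location_hash = UNHEX(MD5(logset_logs.baselog_path))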
"paging the query". I call it "chunking". See http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks . That talks about deleting, but it can be adapted to INSERT .. SELECT since you want to "chunk" the select. If you go with chunking, Javier's comment becomes moot. Your code would be chunking the selects, hence batching the inserts:
Loop:
INSERT .. SELECT .. -- of up to 1000 rows (see link)
End loop
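Adapted to this query, one chunk might look like the sketch below. It drives the ranges off the mapping table's auto-increment id, the 10,000-row batch size is arbitrary, and the pair of statements would be repeated (from application code or a stored procedure) until the id range is exhausted.

SET @lo := 0, @hi := 10000;

-- One chunk of the original INSERT ... SELECT, restricted to an id range.
INSERT INTO database_name.log_set_logs
  (offload_date, vehicle, jurisdiction, baselog_path, path,
   baselog_index_guid, new_location, log_set_name, index_guid)
SELECT STR_TO_DATE(SUBSTRING_INDEX(SUBSTRING_INDEX(m.path, '/', 7), '/', -1), '%Y.%m.%d'),
       SUBSTRING_INDEX(SUBSTRING_INDEX(m.path, '/', 8), '/', -1),
       loc.jurisdiction,
       SUBSTRING_INDEX(m.path, '/', 9),
       m.path,
       bt.baselog_index_guid,
       bt.new_location,
       m.log_set_name,
       m.index_guid
FROM database_name.baselog_and_amendment_guid_to_path_mappings m
LEFT JOIN database_name.log_trees bt
       ON bt.original_location = SUBSTRING_INDEX(m.path, '/', 9)
LEFT JOIN database_name.baselog_offload_location loc
       ON loc.baselog_index_guid = bt.baselog_index_guid
WHERE m.id > @lo AND m.id <= @hi;

-- Advance the window for the next iteration.
SET @lo := @hi, @hi := @hi + 10000;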
I'm trying to implement a way to track changes to a table named user and another named report_to. Below are their definitions:
CREATE TABLE `user`
(
`agent_eid` int(11) NOT NULL,
`agent_id` int(11) DEFAULT NULL,
`agent_pipkin_id` int(11) DEFAULT NULL,
`first_name` varchar(45) NOT NULL,
`last_name` varchar(45) NOT NULL,
`team_id` int(11) NOT NULL,
`hire_date` date NOT NULL,
`active` bit(1) NOT NULL,
`agent_id_req` bit(1) NOT NULL,
`agent_eid_req` bit(1) NOT NULL,
`agent_pipkin_req` bit(1) NOT NULL,
PRIMARY KEY (`agent_eid`),
UNIQUE KEY `agent_eid_UNIQUE` (`agent_eid`),
UNIQUE KEY `agent_id_UNIQUE` (`agent_id`),
UNIQUE KEY `agent_pipkin_id_UNIQUE` (`agent_pipkin_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `report_to`
(
`agent_eid` int(11) NOT NULL,
`report_to_eid` int(11) NOT NULL,
PRIMARY KEY (`agent_eid`),
UNIQUE KEY `agent_eid_UNIQUE` (`agent_eid`),
KEY `report_to_report_fk_idx` (`report_to_eid`),
CONSTRAINT `report_to_agent_fk` FOREIGN KEY (`agent_eid`) REFERENCES `user` (`agent_eid`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `report_to_report_fk` FOREIGN KEY (`report_to_eid`) REFERENCES `user` (`agent_eid`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8
The values that can change and need to be tracked are user.team_id, user.active and report_to.report_to_eid. What I currently have implemented is a table that is populated via an update trigger on user and tracks team changes. That table is defined as:
CREATE TABLE `user_team_changes`
(
`agent_id` int(11) NOT NULL,
`date_changed` date NOT NULL,
`old_team_id` int(11) NOT NULL,
`begin_date` date NOT NULL,
PRIMARY KEY (`agent_id`,`date_changed`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
This works fine for just tracking team changes. I'm able to use joins and a union to populate a history view that tracks that change over time for the individual users. The complexity increases when I try to implement tracking for the other two change types.
I have thought about creating additional tables similar to the one tracking changes for teams, but I worry about performance hits due to the joins that will be required.
Another way I have considered is creating a table similar to a view I have that details the current user state (it joins all the necessary user data together from 4 tables), then inserting a record on each update with a valid-until date field added. My concern with that is the amount of space this could take.
We will be using the user change history quite a bit as we will be running YTD, MTD, PMTD and time interval reports with it on an almost daily basis.
Out of the two options I am considering, which would be the best for my given situation?
The options you've presented:
using triggers to populate transaction-log tables.
including a new table with an effective-date column in the schema and tracking changes by inserting new rows.
Either one of these will work. You can add logging triggers to other tables without causing any trouble.
What distinguishes these two choices? The first one is straightforward, once you get your triggers debugged.
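For instance, tracking report_to changes with the same pattern you already use for teams might look like the sketch below. The table and trigger names are made up for illustration, and (as with your existing table) multiple changes on the same day collapse into one row.

CREATE TABLE `user_report_to_changes`
(
  `agent_eid`         int(11) NOT NULL,
  `date_changed`      date    NOT NULL,
  `old_report_to_eid` int(11) NOT NULL,
  PRIMARY KEY (`agent_eid`, `date_changed`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER //
CREATE TRIGGER `report_to_track_changes`
AFTER UPDATE ON `report_to`
FOR EACH ROW
BEGIN
  IF NEW.report_to_eid <> OLD.report_to_eid THEN
    INSERT INTO user_report_to_changes (agent_eid, date_changed, old_report_to_eid)
    VALUES (OLD.agent_eid, CURDATE(), OLD.report_to_eid)
    -- keep the first 'old' value recorded for the day
    ON DUPLICATE KEY UPDATE old_report_to_eid = old_report_to_eid;
  END IF;
END//
DELIMITER ;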
The second choice seems to me like it will create denormalized, redundant data. That is never good. I would opt not to do that. It is possible, with judicious combinations of views and effective-date columns, to create history tables that are viewable as the present state of the system. To learn about this, look at Prof. R.T. Snodgrass's excellent book on developing time-oriented applications: http://www.cs.arizona.edu/~rts/publications.html. If you have time to do an excellent engineering (over-engineering?) job on this project, you might consider this approach.
The data volume you've mentioned will not cause intractable performance problems on any modern server hardware platform. If you do get slowdowns on JOIN operations, it's almost certain that the addition of appropriate indexes will completely fix them, as long as you declare all your DATE, DATETIME, and TIMESTAMP fields NOT NULL. (NULL values can mess up indexing and searching).
Hope this helps.
I'm creating a membership directory for an organization, and I'm trying to figure out a nice way to keep everyone's details in an orderly and updatable manner. I have 3 tables:
Person table handles the actual person
CREATE TABLE `person` (
`personid` BIGINT PRIMARY KEY AUTO_INCREMENT,
`personuuid` CHAR(32) NOT NULL,
`first_name` VARCHAR(50) DEFAULT '',
`middle_name` VARCHAR(50) DEFAULT '',
`last_name` VARCHAR(50) DEFAULT '',
`prefix` VARCHAR(32) DEFAULT '',
`suffix` VARCHAR(32) DEFAULT '',
`nickname` VARCHAR(50) DEFAULT '',
`username` VARCHAR(32) ,
`created_on` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
`created_by` CHAR(33) DEFAULT '000000000000000000000000000000000',
`last_updated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`last_updated_by` CHAR(33) DEFAULT '000000000000000000000000000000000'
) ENGINE=InnoDB, COMMENT='people';
Information about a person, such as school, phone number, email, Twitter name, etc. All of these values are stored in 'value' as JSON, and my program handles everything. On each update by the user, a new entry is created to show the transition of changes.
CREATE TABLE `person_info` (
`person_infoid` BIGINT PRIMARY KEY AUTO_INCREMENT,
`person_infouuid` CHAR(32) NOT NULL,
`person_info_type` INT(4) NOT NULL DEFAULT 9999,
`value` TEXT,
`created_on` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
`created_by` CHAR(33) DEFAULT '000000000000000000000000000000000',
`last_updated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`last_updated_by` CHAR(33) DEFAULT '000000000000000000000000000000000'
) ENGINE=InnoDB, COMMENT="Personal Details";
A map between person and person_info tables
CREATE TABLE `person_info_map` (
`personuuid` CHAR(32),
`person_infouuid` CHAR(32) ,
`created_on` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`created_by` CHAR(33) DEFAULT '000000000000000000000000000000000',
`is_active` INTEGER(1)
) ENGINE=InnoDB, COMMENT="Map between person and person info";
So, given that I am creating a new entry in person_info every time there is an update, I'm wondering if I should worry about I/O errors, tables getting too big, etc. And if so, what are possible solutions? I've never really worked with database schemas like this, so I figure I should ask for help rather than get screwed in the future.
I'm sure some might ask how often the updates might occur. Truthfully I'm not expecting too much. We currently have 2k members in our directory and I don't expect us to ever have more than 10k active members at any time. I'm also thinking that we will have at most 50 different option types, but for safety and future purposes I allow up to 1000 different option types.
Considering this small piece, does anyone have any advice as to how I should proceed from here?
The person-to-person_info relationship seems like it should be modeled as a one-to-many relationship (i.e. one person record has many person_info records). If that is true, the person_info_map table can be dropped as it serves no purpose.
Do not use the UUID fields to establish relationships between your tables. Use the primary keys and create foreign key constraints on them. This will enforce data integrity and improve the performance of joins when querying.
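A minimal sketch of that change, assuming person_info rows each belong to exactly one person and that the new column is backfilled (or the tables are still empty) before the constraint is added:

-- Link person_info directly to person via the integer primary key.
ALTER TABLE `person_info`
  ADD COLUMN `personid` BIGINT NOT NULL,
  ADD CONSTRAINT `fk_person_info_person`
      FOREIGN KEY (`personid`) REFERENCES `person` (`personid`);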
I would personally merge the 2 tables into one that represents the view of the person at the current moment, and have another table where you write the changes (like an event store).
That will save you from joining the 2 tables every single time you need to fetch a person.
But that's from the application point of view. I can already hear the DBA shouting in my ears :)
No one really answered the question, so I'll post what I ended up doing. I don't know if it will work yet as our systems aren't active, but I think it should.
I went ahead and kept the system described above. I'm hoping that we won't ever hit so many rows that it becomes an issue, but if we do, I plan on archiving a chunk of the 'old' data. I think that will help, and machine capacity will likely outpace our usage anyway.
Hi, and thanx for reading my post. I am having a little trouble learning my database in MySQL.
I have it set up already, but recently another person told me my members table is slow and useless if I intend to have lots of members!
I have looked it over a lot of times and did some Google searches, but I don't see anything wrong with it; maybe that's because I am new at it? Can one of you SQL experts look it over and tell me what's wrong with it, please :)
--
-- Table structure for table `members`
--
CREATE TABLE IF NOT EXISTS `members` (
`userid` int(9) unsigned NOT NULL AUTO_INCREMENT,
`username` varchar(20) NOT NULL DEFAULT '',
`password` longtext,
`email` varchar(80) NOT NULL DEFAULT '',
`gender` int(1) NOT NULL DEFAULT '0',
`ipaddress` varchar(80) NOT NULL DEFAULT '',
`joinedon` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`acctype` int(1) NOT NULL DEFAULT '0',
`acclevel` int(1) NOT NULL DEFAULT '0',
`birthdate` date DEFAULT NULL,
`warnings` int(1) NOT NULL DEFAULT '0',
`banned` int(1) NOT NULL DEFAULT '0',
`enabled` int(1) NOT NULL DEFAULT '0',
`online` int(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`userid`),
UNIQUE KEY `username` (`username`),
UNIQUE KEY `emailadd` (`email`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=19 ;
It's going to be a site with FAQs/tips for games. I do expect to get lots of members at some point later on, but I thought I would ask to make sure it's all OK. Thanx again, peace.
Did the other person explain why they think it is slow and useless?
Here are a few things that I think could be improved:
email should be longer - off the top of my head, 320 should be long enough for most email addresses, but you might want to look that up.
If the int(1) fields are simple on/off fields, then they could be tinyint(1) or bool instead.
As @cularis points out, the ipaddress field might not be the appropriate type. INT UNSIGNED is better than varchar for IPv4; you can use INET_ATON() and INET_NTOA() for conversion (see the sketch after this list). See:
Best Field Type for IP address?
How to store IPv6-compatible address in a relational database
As @Delan Azabani points out, your password field is too long for the value you are storing. MD5 produces a 32-character string, so varchar(32) would be sufficient. You could switch to the more secure SHA-2 and use the MySQL SHA2() function.
Look into using the InnoDB storage engine instead of MyISAM. It offers foreign key constraints, row-level locking and transactions, amongst other things. See "Should you move from MyISAM to InnoDB?".
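A quick illustration of the IPv4 conversion mentioned above (the values are just examples):

SELECT INET_ATON('192.168.0.10');   -- 3232235530, fits in INT UNSIGNED
SELECT INET_NTOA(3232235530);       -- '192.168.0.10'

-- So the column could be declared as:
-- `ipaddress` int(10) unsigned NOT NULL DEFAULT 0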
I don't think it's necessarily slow, but I did notice that among all the other text fields where you used varchar, you used longtext for the password field. It looks like you are going to store the password itself in the database -- don't do this!
Always take a fixed-length cryptographic hash (using, for example, SHA-1 or SHA-2) of the user's password, and put that into the database. That way, if your database server is compromised, the users' passwords are not exposed.
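As a sketch of what that looks like against the schema in the question (the exact hash and column size are up to you, and a salted, deliberately slow hash computed in the application is better still than a bare SHA-2):

-- SHA2(..., 256) returns a 64-character hex string.
ALTER TABLE members MODIFY `password` char(64) NOT NULL DEFAULT '';

-- e.g. at registration time (the salt and password here are placeholders):
INSERT INTO members (username, `password`, email)
VALUES ('alice', SHA2(CONCAT('per-user-salt', 'plaintext-password'), 256), 'alice@example.com');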
Apart from what @Delan said, I noted that:
The joinedon column is defined with ON UPDATE CURRENT_TIMESTAMP. If you need to keep only the date joined, you should not update the field when the record is updated (see the sketch after this list).
The ipaddress column is VARCHAR(80). If you store IPv4-type IP addresses, this is much longer than necessary.
Empty string ('') as DEFAULT for NOT NULL columns: not good if the intention is to always have a real value (other than '') in the field.
Empty string ('') as DEFAULT for UNIQUE fields: this contradicts the constraint being enforced if your intention is to have a unique value (other than '').
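For the first point, the fix is simply to drop the ON UPDATE clause; a sketch:

ALTER TABLE members
  MODIFY `joinedon` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP;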
I have a high CPU problem with MySQL: "top" (Linux) shows CPU peaks of 90%.
I was trying to find the source of the problem, so I turned on the general log and the slow query log.
The slow query log did not catch anything.
The DB contains a few small tables and one large table with almost 100k rows; the database engine is MyISAM. The strange thing I have noticed is that on the large table, SELECT and INSERT are very fast, but UPDATE takes 0.2 - 0.5 secs.
I have already used OPTIMIZE and REPAIR, with no improvement.
The table is being updated frequently; could this be the source of the high CPU%?
What can I do to improve this?
Does your MySQL server have ganglia setup on it? Regular ganglia metrics along with the mysql_stats plugin for ganglia might reveal what's going on.
I found mytop extremely helpful.
First of all, could you identify which query is overloading the server? If so, please paste it here and maybe we can give you a hand with it.
Also, please look at the table structure. Tables with many indexes are likely to have slow updates.
I also recommend you give us more data about the problem.
Hope that helps,
The first thing that pops into mind is indexing but that doesn't fit since your selects and inserts are fast. It's usually inserts and updates that will slow down on an "overindexed" table. That leaves triggers... do you have an update trigger on that table that could be doing a lot of work and causing the spike?
A query that takes 0.5 secs won't show up in top as 100% CPU. It's too small.
Also try SHOW FULL PROCESSLIST; verify your my.cnf and even try reducing the slow query timeout. The slow query log can catch anything that is slow for long enough.
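For example, from the MySQL prompt (assuming MySQL 5.1+, where these variables can be changed at runtime; on 5.0 the slow log is configured in my.cnf instead):

SHOW FULL PROCESSLIST;

-- Catch the 0.2-0.5 s UPDATEs in the slow query log:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.2;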
Any update statement on that table based on the table's key is slow.
for example UPDATE customers SET CustMoney = 1 WHERE CustUID = 'someid'
CREATE TABLE IF NOT EXISTS `customers` (
`CustFullName` varchar(45) NOT NULL,
`CustPassword` varchar(45) NOT NULL,
`CustEmail` varchar(128) NOT NULL,
`SocialNetworkId` tinyint(4) NOT NULL,
`CustUID` varchar(64) character set ascii NOT NULL,
`CustMoney` bigint(20) NOT NULL default '0',
`LastIpAddress` varchar(45) character set ascii NOT NULL,
`LastLoginTime` datetime NOT NULL default '1900-10-10 10:10:10',
`SmallPicURL` varchar(120) character set ascii default '',
`LargePicURL` varchar(120) character set ascii default '',
`LuckyChips` int(10) unsigned NOT NULL default '0',
`AccountCreationTime` datetime NOT NULL default '2009-11-11 11:11:11',
`AccountStatus` tinyint(4) NOT NULL default '1',
`CustLevel` int(11) NOT NULL default '0',
`City` varchar(32) NOT NULL default '',
`State` varchar(32) NOT NULL default '0',
`Country` varchar(32) NOT NULL default '',
`Zip` varchar(16) character set ascii NOT NULL,
`CustExp` bigint(20) NOT NULL default '0',
PRIMARY KEY (`CustUID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Again, I'm not sure that this is the cause of the high CPU usage, but it seems to me that it's not normal for an UPDATE statement to take that long (0.5 sec).
The table is being updated up to 5 times a second at the moment, and in the future it will be updated more frequently.
What kind of server is this? I've seen slooow writes and relatively fast reads on virtual machines. What does http://en.wikipedia.org/wiki/Hdparm have to say?
What CPU/RAM do you have on it? What is the load average?