How to design a Cassandra Schema for a User Actions Log? - mysql

I have a table like this in MySQL to log user actions:
CREATE TABLE `actions` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`module` VARCHAR(32) NOT NULL,
`controller` VARCHAR(64) NOT NULL,
`action` VARCHAR(64) NOT NULL,
`date` TIMESTAMP NOT NULL,
`userid` BIGINT(20) NOT NULL,
`ip` VARCHAR(32) NOT NULL,
`duration` DOUBLE NOT NULL,
PRIMARY KEY (`id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=MyISAM
AUTO_INCREMENT=1
I have a MySQL query like this to find the count of a specific action per day:
SELECT COUNT(*) FROM actions
WHERE actions.action = "join"
AND YEAR(date) = 2017 AND MONTH(date) = 06
GROUP BY YEAR(date), MONTH(date), DAY(date)
This takes 50-60 seconds to return the list of days with the count of the "join" action, with only 5 million rows and indexes on date and action.
So I want to log actions using Cassandra instead. How can I design the Cassandra schema, and how should I query it so that such a request takes less than 1 second?

CREATE TABLE actions (
id timeuuid,
module varchar,
controller varchar,
action varchar,
date_time timestamp,
userid bigint,
ip varchar,
duration double,
year int,
month int,
dt date,
PRIMARY KEY ((action,year,month),dt,id)
);
Explanation:
With the above table definition,
SELECT COUNT(*) FROM actions WHERE action = 'join' AND year = 2017 AND month = 6 GROUP BY action, year, month, dt;
will hit a single partition.
The dt column holds only the date part; you could change it to just the day number, with int as the datatype. Since id is a timeuuid, it keeps each row unique.
Note: GROUP BY is supported by Cassandra 3.10 and above.
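For completeness, a hedged sketch of how the application would write a row into this table; year, month, and dt are denormalized from the event time and must be filled in by the client, and all literal values here are illustrative:
-- now() generates a timeuuid; toTimestamp(now()) converts it to a timestamp (Cassandra 2.2+)
INSERT INTO actions (id, module, controller, action, date_time, userid, ip, duration, year, month, dt)
VALUES (now(), 'user', 'auth', 'join', toTimestamp(now()), 12345, '10.0.0.1', 0.42, 2017, 6, '2017-06-15');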

Related

Speed up mysql SQL query but with a huge dataset

I have a table that has over 2.5 million rows and I would like to run the following SQL statement to get the count:
select count(*)
from workflow
where action_name = 'Workflow'
and release_date >= '2019-12-01 13:24:22'
and release_date <= '2019-12-31 13:24:22'
and project_name = 'Web'
group by page_id, headline, release_full_name, release_date
The problem is that it takes over 2.7 seconds to return 0 rows, as expected. Is there a way to speed it up more? I have 6 more similar SQL statements, so that will take almost (2.7 seconds * 6) = 17 seconds at least.
Here is my table schema
CREATE TABLE workflow (
id int(11) NOT NULL AUTO_INCREMENT,
action_name varchar(100) NOT NULL,
project_name varchar(30) NOT NULL,
page_id int(11) NOT NULL,
headline varchar(200) NOT NULL,
create_full_name varchar(200) NOT NULL,
create_date datetime NOT NULL,
change_full_name varchar(200) NOT NULL,
change_date datetime NOT NULL,
release_full_name varchar(200) NOT NULL,
release_date datetime NOT NULL,
reject_full_name varchar(200) NOT NULL,
reject_date datetime NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=2948271 DEFAULT CHARSET=latin1
What I'm looking for in this query is the count of the pages that were released last month with project_name = 'Web' and action_name = 'Workflow'.
This is a bit too big for a comment:
Using GROUP BY with the COUNT function doesn't make much sense here; usually you need to count the actual rows in the DB, not groups after aggregation. I'm not sure whether the grouping is an actual requirement, but the GROUP BY is what makes the query slow.
Use a composite index on (project_name, release_date), since project_name seems to be the most selective column.
For further analysis, please share the EXPLAIN plan.
Assuming that you need counts for the groups you listed, it is better to include the group fields in the select, like:
select page_id, headline, release_full_name, release_date, count(*)
from ...
Adding an index on (page_id, headline) would optimize that well.
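A hedged sketch of both suggested indexes as DDL; the index names are illustrative:
ALTER TABLE workflow
  ADD INDEX idx_project_release (project_name, release_date),
  ADD INDEX idx_page_headline (page_id, headline);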

MySQL: best practice for querying last value before a certain date in a time series

I have the following table in MySQL:
CREATE TABLE `history` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`code` CHAR(32) NOT NULL,
`value` FLOAT NULL DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `timestamp_code` (`timestamp`, `code`),
INDEX `code` (`code`),
INDEX `timestamp` (`timestamp`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
I would like to know the best practice for accessing the last available value before a certain date, for a certain set of codes, as efficiently as possible.
So far I came up with the following query:
SELECT h.* FROM history h
JOIN (
SELECT code, MAX(timestamp) as 'last_ts'
FROM history WHERE
timestamp < '2015-09-04 13:50:00' AND
code IN ('119813249', '12087792', '12087797',
'127012151', '131014335', '131014378',
'132757371', '15016059', '15016062',
'150250238', '153462747', '155802712',
'156974389', '162277696', '166330444',
'166483001', '167220356', '167264923',
'167867931', '172283682', '177539478',
'177583937', '177648754', '177649011',
'187532416', '189230667', '70273253',
'70342790', '79342386', '82460282',
'98693280', '98693380')
GROUP BY code) last_price
ON last_price.last_ts = h.timestamp
AND last_price.code = h.code
The query above works, but becomes slow as the number of entries in the table grows (100'000'000 rows).
You can download sample data to populate the table.
Create an index on (code, timestamp) - rather than (timestamp, code). This lets MySQL narrow down the codes before looking for the max timestamp per code, and it should be much faster. Use EXPLAIN to verify that the index is used.
If you create that index, you should not have to modify your query.
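A sketch of that index as DDL (the index name is illustrative):
ALTER TABLE `history` ADD INDEX `code_timestamp` (`code`, `timestamp`);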

How to optimize a MySQL query on a large dataset

I have two tables with the following schema,
CREATE TABLE `open_log` (
`delivery_id` varchar(30) DEFAULT NULL,
`email_id` varchar(50) DEFAULT NULL,
`email_activity` varchar(30) DEFAULT NULL,
`click_url` text,
`email_code` varchar(30) DEFAULT NULL,
`on_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `sent_log` (
`email_id` varchar(50) DEFAULT NULL,
`delivery_id` varchar(50) DEFAULT NULL,
`email_code` varchar(50) DEFAULT NULL,
`delivery_status` varchar(50) DEFAULT NULL,
`tries` int(11) DEFAULT NULL,
`creation_ts` varchar(50) DEFAULT NULL,
`creation_dt` varchar(50) DEFAULT NULL,
`on_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The email_id and delivery_id columns in both tables make up a unique key.
The open_log table has 2.5 million records, whereas the sent_log table has 0.25 million records.
I want to filter the records from the open_log table based on the unique key (email_id and delivery_id).
I'm writing the following query.
SELECT * FROM open_log
WHERE CONCAT(email_id,'^',delivery_id)
IN (
SELECT DISTINCT CONCAT(email_id,'^',delivery_id) FROM sent_log
)
The problem is that the query takes too much time to execute; I waited an hour for it to complete, without success.
Kindly suggest what I can do to make it fast, given the amount of data in the tables.
Thanks,
Faisal Nasir
First, rewrite your query using EXISTS:
SELECT *
FROM open_log ol
WHERE EXISTS (SELECT 1
FROM sent_log sl
WHERE sl.email_id = ol.email_id AND sl.delivery_id = ol.delivery_id
);
Then, add an index so this query will run faster:
create index idx_sentlog_emailid_deliveryid on sent_log(email_id, delivery_id);
Your query is slow for a variety of reasons:
The use of string concatenation makes it impossible for MySQL to use an index.
The select distinct in the subquery is unnecessary.
Exists can be faster than in.
If this query runs often, you can speed it up greatly by adding a bigint hash column, even if it is not unique.
For example, you can add the column like this:
alter table sent_log add column for_get bigint;
After that, create a trigger (or run an UPDATE) to put a hash into that bigint column:
for_get = CONV(SUBSTR(MD5(CONCAT(email_id, delivery_id)), 1, 10), 16, 10)
If you have such a column in both tables, with an index on it, the query will look like:
SELECT *
FROM open_log ol
LEFT JOIN sent_log sl ON sl.for_get = ol.for_get
WHERE sl.email_id IS NOT NULL AND sl.email_id = ol.email_id AND sl.delivery_id = ol.delivery_id;
That query will be fast.
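To make the trigger approach concrete, here is a hedged sketch of a BEFORE INSERT trigger plus the matching index; the names are illustrative, and an equivalent pair would be needed on open_log:
-- single-statement trigger body, so no DELIMITER change is needed
CREATE TRIGGER sent_log_hash BEFORE INSERT ON sent_log
FOR EACH ROW
SET NEW.for_get = CONV(SUBSTR(MD5(CONCAT(NEW.email_id, NEW.delivery_id)), 1, 10), 16, 10);
CREATE INDEX idx_sentlog_forget ON sent_log (for_get);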

mysql select distinct date takes FOREVER on database w/ 374 million rows

I have a MYSQL DB with table definition like this:
CREATE TABLE `minute_data` (
`date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`open` decimal(10,2) DEFAULT NULL,
`high` decimal(10,2) DEFAULT NULL,
`low` decimal(10,2) DEFAULT NULL,
`close` decimal(10,2) DEFAULT NULL,
`volume` decimal(10,2) DEFAULT NULL,
`adj_close` varchar(45) DEFAULT NULL,
`symbol` varchar(10) NOT NULL DEFAULT '',
PRIMARY KEY (`symbol`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
It stores 1-minute data points from the stock market. The primary key is a combination of the symbol and date columns. This way I always have only one data point per symbol at any given time.
I am wondering why the following query takes so long that I can't even wait for it to finish:
select distinct date from test.minute_data where date >= "2013-01-01"
order by date asc limit 100;
However I can select count(*) from minute_data; and that finishes very quickly.
I know that it must have something to do with the fact that there are over 374 million rows of data in the table, and my desktop computer is pretty far from a super computer.
Does anyone know something I can try to speed up with query? Do I need to abandon all hope of using a MySQL table this big??
Thanks a lot!
When you have a composite index on 2 columns, like your (symbol, date) primary key, searching and grouping by a prefix of the key will be fast. But searching for something that doesn't include the first column of the index requires scanning all rows or using some other index.
You can either change your primary key to (date, symbol), if you don't usually need to search for a symbol without a date, or add an additional index on date:
alter table minute_data add index (date)
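If you instead reorder the primary key, a hedged sketch (note that this rebuilds the entire table, which will take a long time on 374 million rows):
alter table minute_data drop primary key, add primary key (date, symbol);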

MySQL Stored Procedure Getting distinct count, displaying latest records only

I have the following inside my stored procedure; it counts the unique player names that have a faction of Neutral:
SELECT
COUNT(DISTINCT Name) INTO @neutcount
FROM
dim5players
WHERE
Faction = 'Neutral';
UPDATE dim5stats
SET Value = @neutcount
WHERE
`Function` = 'Neutral';
This works fine and dandy. The problem is that I also have a field called Date.
I want the count to consider only the latest record for each unique "Name", instead of a random record per name.
This is a history table: it records daily changes, and a Name can appear several times. I need to count only the players whose latest record has a faction of Neutral. Some people change factions from time to time, and I only care about their latest faction.
This is the structure:
CREATE TABLE `dim5players` (
`id` CHAR(64) NOT NULL,
`name` VARCHAR(45) NOT NULL,
`rank_name` VARCHAR(20) NULL DEFAULT NULL,
`level` INT(11) NOT NULL,
`defender_rank_id` INT(11) NULL DEFAULT NULL,
`Faction` VARCHAR(15) NOT NULL,
`Organization` VARCHAR(100) NULL DEFAULT NULL,
`Date` DATE NOT NULL,
`Updated` BIT(1) NOT NULL,
UNIQUE INDEX `id_UNIQUE` (`id`) USING HASH,
INDEX `name_index` (`name`) USING HASH,
INDEX `date_index` (`Date`) USING HASH,
INDEX `updated_index` (`Updated`) USING HASH,
INDEX `faction_index` (`Faction`) USING HASH
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
After a discussion with Michael, I think I figured out what he needs:
"I want the Last Updated record of each name"
SELECT
name ,
MAX(Date) as last_date
FROM
dim5players
WHERE
Faction ='Neutral'
GROUP BY
name
"I just want to count the latest date on each NAME that still holds the faction of Neutral"
SELECT
COUNT(last_date)
FROM (
SELECT
name ,
MAX(Date) as last_date
FROM
dim5players
WHERE
Faction ='Neutral'
GROUP BY
name
) as tmp
@Michael: Let me know if I understood your requirements correctly.
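One caveat, as a hedged sketch: the inner query above finds the latest Neutral record per name, so a player who later switched away from Neutral would still be counted. If the requirement is strictly "players whose most recent record is Neutral", something like this joins each name's overall latest record back to the table:
SELECT COUNT(*)
FROM dim5players p
JOIN (
    -- latest record per name, regardless of faction
    SELECT name, MAX(`Date`) AS last_date
    FROM dim5players
    GROUP BY name
) latest ON latest.name = p.name AND latest.last_date = p.`Date`
WHERE p.Faction = 'Neutral';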