MySQL UPDATE with JOIN, SELECT and GROUP BY? - mysql

PRETEXT:
I'm currently trying to create detailed and correct call statistics for our customers. Previously, Asterisk stored call details in one row per call. One row is not enough to store all the possible things that can happen to a call, and the data would often be misleading or wrong. They now have a new table format that stores one row per event in the call. This table will easily contain millions of records for our larger customers.
We tried selecting directly from this table on a test server with only 200k records, but the more advanced queries ended up taking forever. We decided to create a summary table (so we're back to one row per call, but now with more details available). There are many things I'd like to do with this data, but if I can solve this "simple" problem, I'm sure I'll solve all the others as well.
PROBLEM:
The field linkedid is the same for all rows in one call. The field eventtype can have zero, one or multiple occurrences of the same event for one linkedid.
I fill the summary table with some data:
INSERT INTO astcel_summary
(linkedid, starttime, endtime, callfrom, callto, direction)
SELECT
linkedid, MIN(eventtime), MAX(eventtime), cid_num, exten, IF ((context = 'from-extension'), 1, 0)
FROM astcel
GROUP BY linkedid;
The BRIDGE_START event is especially important as it indicates that a person answered the call. The call can be unanswered, answered or even answered multiple times (transfer, conference). I want to UPDATE my summary table with several fields from the first (if any) BRIDGE_START event for each call.
I've been able to update one field at a time like this:
UPDATE astcel_summary, astcel
SET astcel_summary.answertime =
(
SELECT eventtime
FROM astcel
WHERE astcel.linkedid = astcel_summary.linkedid
AND astcel.eventtype = 'BRIDGE_START'
GROUP BY astcel.linkedid
)
WHERE astcel.linkedid = astcel_summary.linkedid
AND astcel.eventtype = 'BRIDGE_START';
I've tried many variations with different joins and subqueries to update multiple fields, but can't make it work. If this operation could also be merged with the original insert somehow, that would be amazing.
Even better would be a way to select, without using summary tables and without it taking too long, for instance: the average time callers waited before being answered (plus several other similar pieces of data) during business hours last month, grouped by the numbers called.
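One way to fold the "first BRIDGE_START per call" lookup into the original INSERT is to join the per-call aggregate to a second derived table that holds the earliest BRIDGE_START time per linkedid, and then join back to astcel to pick up other columns from that event. The following is only a sketch, written with Python's sqlite3 module so that it can be run as-is. The answered_by column and the sample rows are assumptions made for illustration, and MySQL may plan the same joins differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE astcel (linkedid TEXT, eventtype TEXT, eventtime TEXT, cid_num TEXT);
-- answered_by is a hypothetical extra field taken from the first BRIDGE_START row
CREATE TABLE astcel_summary (
    linkedid TEXT PRIMARY KEY, starttime TEXT, endtime TEXT,
    answertime TEXT, answered_by TEXT);
INSERT INTO astcel VALUES
    ('c1', 'CHAN_START',   '2015-01-01 10:00:00', '100'),
    ('c1', 'BRIDGE_START', '2015-01-01 10:00:07', '201'),
    ('c1', 'BRIDGE_START', '2015-01-01 10:01:30', '202'),
    ('c1', 'HANGUP',       '2015-01-01 10:02:00', '100'),
    ('c2', 'CHAN_START',   '2015-01-01 11:00:00', '101'),
    ('c2', 'HANGUP',       '2015-01-01 11:00:20', '101');
""")

# c: one row per call; f: earliest BRIDGE_START time per call (if any);
# b: the full event row for that earliest BRIDGE_START.
conn.execute("""
INSERT INTO astcel_summary (linkedid, starttime, endtime, answertime, answered_by)
SELECT c.linkedid, c.starttime, c.endtime, b.eventtime, b.cid_num
FROM (SELECT linkedid, MIN(eventtime) AS starttime, MAX(eventtime) AS endtime
      FROM astcel
      GROUP BY linkedid) AS c
LEFT JOIN (SELECT linkedid, MIN(eventtime) AS first_bridge
           FROM astcel
           WHERE eventtype = 'BRIDGE_START'
           GROUP BY linkedid) AS f
       ON f.linkedid = c.linkedid
LEFT JOIN astcel AS b
       ON b.linkedid = f.linkedid
      AND b.eventtype = 'BRIDGE_START'
      AND b.eventtime = f.first_bridge
""")

rows = conn.execute("""
SELECT linkedid, starttime, endtime, answertime, answered_by
FROM astcel_summary ORDER BY linkedid
""").fetchall()
```

Unanswered calls keep NULL in the answer fields because both joins are LEFT JOINs. If two BRIDGE_START rows of one call can share the exact same eventtime, the last join would return both, so a tie-breaker (for example a unique event id) would be needed.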

Related

How to efficiently program a big/big left join query in mySQL?

Our problem lies in performing a left join on two large tables (both having millions of entries).
The first one is a table that contains input supplied by the end-user of our program. It contains answers to a variety of questions. Every question belongs to a certain questionnaire. The most important columns are an identifier for the given response, an identifier for the questionnaire form, the datetime the answer is given and an identifier for the user that supplied the answer.
The second table contains information on daily progress of the users regarding the completion of questionnaires. It contains information on the amount of answers a certain user has given on a certain day for a given activity. The most important columns in this table are the user id, the questionnaire id and the date.
The second table is updated right after a new answer enters the first table. Updating is performed by code (workers) that runs on a different server. We would like to make the system robust against failure of this other server. An important step to ensure that the table with the results ('responses') remains in sync with the progress ('progress_questionnaires') table is to be able to check whether a combination of user_id, questionnaire_id and datetime from the 'responses' table is also present in the 'progress_questionnaires' table. A query that captures the required results, but does not perform well on large tables (NxN, in which N is a couple of million entries), is displayed below.
A query that captures the required results is:
SELECT r.chapter_id, r.user_id, CAST(first_created as date) as date, 1 as original
FROM responses r
LEFT JOIN progress_questionnaires pq ON r.questionnaire_id = pq.questionnaire_id AND r.user_id = pq.user_id AND CAST(r.first_created as date) = pq.date
WHERE pq.activity_id IS NULL
GROUP BY r.questionnaire_id, r.user_id, CAST(r.first_created as date)
As stated before, this query does capture the required results, but does not perform well on large tables. All key columns are properly indexed as far as we know.
We would be very happy if someone could help us out.
P.S. We are using MariaDB, SQL version 5.5.43. I hope I supplied all necessary information, but I would of course be happy to supply additional information where necessary.
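The check described above is an anti-join: find (questionnaire, user, day) combinations in responses that have no counterpart in progress_questionnaires. A minimal sketch of the same check written with NOT EXISTS and backed by one composite index on the three lookup columns is shown below. It uses Python's sqlite3 module so that it is runnable. The index name and the sample data are assumptions, and whether this form is faster than the LEFT JOIN on MariaDB 5.5 would have to be confirmed with EXPLAIN on the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE responses (
    id INTEGER PRIMARY KEY, questionnaire_id INTEGER,
    user_id INTEGER, first_created TEXT);
CREATE TABLE progress_questionnaires (
    questionnaire_id INTEGER, user_id INTEGER, date TEXT);
-- one composite index covering the whole lookup (name is illustrative)
CREATE INDEX idx_pq_lookup
    ON progress_questionnaires (questionnaire_id, user_id, date);
INSERT INTO responses (questionnaire_id, user_id, first_created) VALUES
    (1, 10, '2015-06-01 09:00:00'),
    (1, 10, '2015-06-01 17:30:00'),
    (1, 11, '2015-06-02 08:15:00'),
    (2, 10, '2015-06-03 12:00:00');
INSERT INTO progress_questionnaires VALUES
    (1, 10, '2015-06-01'),
    (2, 10, '2015-06-03');
""")

# Combinations present in responses but missing from the progress table.
missing = conn.execute("""
SELECT DISTINCT r.questionnaire_id, r.user_id, date(r.first_created) AS day
FROM responses AS r
WHERE NOT EXISTS (
    SELECT 1
    FROM progress_questionnaires AS pq
    WHERE pq.questionnaire_id = r.questionnaire_id
      AND pq.user_id = r.user_id
      AND pq.date = date(r.first_created))
ORDER BY 1, 2, 3
""").fetchall()
```

Restricting the outer query to a recent window of first_created values would also keep each sync check from rescanning the whole responses table.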

How to create tables for an unknown number of events with SQLite

I'm about to start developing an iPhone application, but I have a question/problem.
I was thinking about storing all the data in one single, huge table, but while I was drawing a schema I noticed that I'll be storing events: trigger events, like ones placed in IBActions or in viewDidLoad. I'll keep the count, but the real question is that I need to store the dates and timestamps of these events as well. For example, one user may trigger "home screen appeared" 100 times. Keeping the count is easy, but how can I store the dates? Should I create a separate table to keep each event and its timestamps?
If that's the case, I don't know how many events there will be. Wouldn't that leave me with a lot of garbage tables?
In the end I'll send this SQLite data to my back-end, so it should be neat.
Can this be done in one single table? Am I missing something?
To do this in one table, you would need a record ( row ) for each event. You could
select count(1) from events ....
to get the count, order by date_created with a LIMIT N clause to get the N most recent, etc. If you insisted on keeping just one row per event type, then no, I can't think of a clean way to keep track of all event dates without a second table.
To answer your other questions though, you can automatically assign the date of a record's entry by defining the column like this:
DATE_CREATED DATE DEFAULT CURRENT_TIMESTAMP
and not including that field on your insert statement. That is really your cleanest solution.
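The one-row-per-occurrence layout described above can be sketched as follows. This uses Python's sqlite3 module only to keep the example runnable; the SQL itself is what the app would issue. The table name, column names and event names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per event occurrence; SQLite fills in date_created automatically
# because the column is left out of the INSERT.
conn.execute("""
CREATE TABLE events (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    date_created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)
""")

for name in ["home_screen_appeared", "button_tapped", "home_screen_appeared"]:
    conn.execute("INSERT INTO events (name) VALUES (?)", (name,))
conn.commit()

# The count per event type is derived from the rows, not stored separately.
home_count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE name = ?",
    ("home_screen_appeared",),
).fetchone()[0]

# The N most recent occurrences, newest first (N = 2 here).
latest = conn.execute(
    "SELECT name, date_created FROM events ORDER BY id DESC LIMIT 2"
).fetchall()
```

With this layout a new kind of event needs no schema change; it is just another value in the name column.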

How to retrieve the new rows of a table every minute

I have a table, to which rows are only appended (not updated or deleted) with transactions (I'll explain why this is important), and I need to fetch the new, previously unfetched, rows of this table, every minute with a cron.
How am I going to do this? In any programming language (I use Perl but that's irrelevant.)
I list the ways I thought of how to solve this problem, and ask you to show me the correct one (there HAS to be one...)
The first way that popped into my head was to save (in a file) the largest auto_incrementing id of the rows fetched, so in the next minute I can fetch with: WHERE id > $last_id. But that can miss rows. Because new rows are inserted in transactions, it's possible that the transaction that saves the row with id = 5 commits before the transaction that saves the row with id = 4. It's therefore possible that the cron script retrieves row 5 but not row 4, and when row 4 gets committed a split second later, it will never get fetched (because 4 is not greater than 5, which is the $last_id).
Then I thought I could make the cron job fetch all rows that have a date field in the last TWO minutes, check which of these rows were already retrieved in the previous run of the cron job (to do this I would need to save somewhere which row ids were retrieved), compare, and process only the new ones. Unfortunately this is complicated, and also doesn't solve the problem that will occur if a certain inserting transaction takes TWO AND A HALF minutes to commit for some weird database reason, which will cause the date to be too old for the next iteration of the cron job to fetch.
Then I thought of installing a message queue (MQ) like RabbitMQ or any other. The same process that does the inserting transaction, would notify RabbitMQ of the new row, and RabbitMQ would then notify an always-running process that processes new rows. So instead of getting a batch of rows inserted in the last minute, that process would get the new rows one-by-one as they are written. This sounds good, but has too many points of failure - RabbitMQ might be down for a second (in a restart for example) and in that case the insert transaction will have committed without the receiving process having ever received the new row. So the new row will be missed. Not good.
I just thought of one more solution: the receiving processes (there are 30 of them, doing the exact same job on exactly the same data, so the same rows get processed 30 times, once by each receiving process) could write in another table that they have processed row X when they process it, then when the time comes they can ask for all rows in the main table that don't exist in the "have_processed" table with an OUTER JOIN query. But I believe (correct me if I'm wrong) that such a query will consume a lot of CPU and HD on the DB server, since it will have to compare the entire list of ids of the two tables to find new entries (and the table is huge and getting bigger each minute). It would have been fast if there were only one receiving process - then I would have been able to add an indexed field named "have_read" in the main table that would make looking for new rows extremely fast and easy on the DB server.
What is the right way to do it? What do you suggest? The question is simple, but a solution seems hard (for me) to find.
Thank you.
I believe the 'best' way to do this would be to use one process that checks for new rows and delegates them to the thirty consumer processes. Then your problem becomes simpler to manage from a database perspective and a delegating process is not that difficult to write.
If you are stuck with communicating to the thirty consumer processes through the database, the best option I could come up with is to create a trigger on the table, which copies each row to a secondary table. Copy each row to the secondary table thirty times (once for each consumer process). Add a column to this secondary table indicating the 'target' consumer process (for example a number from 1 to 30). Each consumer process checks for new rows with its unique number and then deletes those. If you are worried that some rows are deleted before they are processed (because the consumer crashes in the middle of processing), you can fetch, process and delete them one by one.
Since the secondary table is kept small by continuously deleting processed rows, INSERTs, SELECTs and DELETEs would be very fast. All operations on this secondary table would also be indexed by the primary key (if you place the consumer ID as first field of the primary key).
In MySQL statements, this would look like this:
CREATE TABLE `consumer`(
`id` INTEGER NOT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `consumer`(`id`) VALUES
(1),
(2),
(3)
-- all the way to 30
;
CREATE TABLE `secondaryTable` LIKE `primaryTable`;
ALTER TABLE `secondaryTable` ADD COLUMN `targetConsumerId` INTEGER NOT NULL FIRST;
-- alter the secondary table further to allow several rows with the same primary key (by adding targetConsumerId to the primary key)
DELIMITER //
CREATE TRIGGER `mark_to_process` AFTER INSERT ON `primaryTable`
FOR EACH ROW
BEGIN
-- by doing a cross join with the consumer table, this automatically inserts the correct amount of rows and adding or deleting consumers is just a matter of adding or deleting rows in the consumer table
INSERT INTO `secondaryTable`(`targetConsumerId`, `primaryTableId`, `primaryTableField1`, `primaryTableField2`) SELECT `consumer`.`id`, `primaryTable`.`id`, `primaryTable`.`field1`, `primaryTable`.`field2` FROM `consumer`, `primaryTable` WHERE `primaryTable`.`id` = NEW.`id`;
END//
DELIMITER ;
-- loop over the following statements in each consumer until the SELECT doesn't return any more rows
START TRANSACTION;
SELECT * FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID LIMIT 1;
-- here, do the processing (so before the COMMIT so that crashes won't let you miss rows)
DELETE FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID AND primaryTableId = PRIMARY_TABLE_ID_OF_ROW_JUST_SELECTED;
COMMIT;
I've been thinking about this for a while. So, let me see if I got it right. You have a HUGE table to which N processes write (let's call them producers), where N may vary over time. Then there are M other processes, where M may also vary over time, that each need to process every added record at least once (let's call them consumers).
The main issues detected are:
Making sure the solution will work with dynamic N and M
It is needed to keep track of the unprocessed records for each consumer
The solution has to scale as well as possible due to the huge amount of records
In order to tackle those issues I thought on this. Create this table (PK in bold):
PENDING_RECORDS(ConsumerID, HugeTableID)
Modify the producers so that each time they add a record to the HUGE_TABLE they also add M records to the PENDING_RECORDS table, one for each ConsumerID that exists at that time, each carrying the new HugeTableID. Each time a consumer runs it will query the PENDING_RECORDS table and will find a small amount of matches for itself. It will then join against the HUGE_TABLE (note it will be an inner join, not a left join) and fetch the actual data it needs to process. Once the data is processed, the consumer will delete the fetched records from the PENDING_RECORDS table, keeping it decently small.
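A rough sketch of the PENDING_RECORDS idea follows, using Python's sqlite3 module so that it can be run directly. Here the fan-out is done by a trigger instead of by the writers themselves, which is an assumption borrowed from the trigger-based answer; the lower-case table names, the trigger name and the take_pending helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE consumer (id INTEGER PRIMARY KEY);
CREATE TABLE huge_table (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE pending_records (
    consumer_id   INTEGER NOT NULL,
    huge_table_id INTEGER NOT NULL,
    PRIMARY KEY (consumer_id, huge_table_id));

-- For every new row, queue one pending entry per registered consumer.
CREATE TRIGGER fan_out AFTER INSERT ON huge_table
BEGIN
    INSERT INTO pending_records (consumer_id, huge_table_id)
    SELECT id, NEW.id FROM consumer;
END;

INSERT INTO consumer (id) VALUES (1), (2), (3);
INSERT INTO huge_table (payload) VALUES ('a'), ('b');
""")


def take_pending(conn, consumer_id):
    """Fetch this consumer's pending rows, then remove them from the queue."""
    with conn:  # commits on success, rolls back the deletes on an exception
        rows = conn.execute("""
            SELECT h.id, h.payload
            FROM pending_records AS p
            JOIN huge_table AS h ON h.id = p.huge_table_id
            WHERE p.consumer_id = ?
            ORDER BY h.id""", (consumer_id,)).fetchall()
        # ... the real processing of `rows` would happen here ...
        conn.executemany(
            "DELETE FROM pending_records"
            " WHERE consumer_id = ? AND huge_table_id = ?",
            [(consumer_id, row[0]) for row in rows])
    return rows


first = take_pending(conn, 1)        # both rows, first time around
second = take_pending(conn, 1)       # nothing left for consumer 1
for_consumer_2 = take_pending(conn, 2)  # consumer 2 still sees both rows
```

Because the pending entry is written in the same transaction as the row itself, a row that commits late simply shows up in a later run; there is no "last id" that it could fall behind.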
Interesting, i must say :)
1) First of all - is it possible to add a field to the table that has rows only added (let's call it 'transactional_table')? I mean, is it a design paradigm and you have a reason not to do any sort of updates on this table, or is it "structurally" blocked (i.e. user connecting to db has no privileges to perform updates on this table) ?
Because then the simplest way to do it is to add a "have_read" column to this table with default 0, and update this column to 1 on fetched rows (even if 30 processes do this simultaneously, you should be fine, as it would be very fast and it won't corrupt your data). Even if 30 processes mark the same 1000 rows as fetched - nothing is corrupted. Although if you do not operate on InnoDB, this might not be the best way as far as performance is concerned (MyISAM locks whole tables on updates, InnoDB only the rows that are updated).
2) If this is not what you could use - I would surely check out the solution you gave as your last one, with a little modification. Create a table (let's say: fetched_ids), and save fetched rows' ids in that table. Then you could use something like :
SELECT tt.* from transactional_table tt
LEFT JOIN fetched_ids fi ON tt.id = fi.row_id
WHERE fi.row_id IS NULL
This will return the rows from your transactional table that have not been saved as already fetched. As long as both (tt.id) and (fi.row_id) have (ideally unique) indexes, this should work just fine even on large sets of data. MySQL handles JOINs on indexed fields pretty well. Do not be afraid to try it out - create a new table, copy the ids to it, delete some of them and run your query. You'll see the results and you'll know if they are satisfactory :)
P.S. Of course, adding rows to this 'fetched_ids' table should be done carefully so as not to create unnecessary duplicates (30 simultaneous processes could write 30 times the data you need - and if you need performance, you should watch out for this case).
How about a second table with a structure like this:
source_fk - this would hold an ID of the data rows you want to read.
process_id - This would be a unique id for one of the 30 processes.
then do a LEFT JOIN and exclude items from your source that have entries matching the specified process_id.
once you get your results, just go back and add the source_fk and process_id for each result you get.
One plus about this is you can add more processes later on with no problem.
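The two-column tracking table described above could look roughly like this. The sketch uses Python's sqlite3 module so that it can be run as-is; the table names source and processed and the fetch_new helper are assumptions, and the primary key puts process_id first so that each process only touches its own slice of the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE processed (
    source_fk  INTEGER NOT NULL,
    process_id INTEGER NOT NULL,
    PRIMARY KEY (process_id, source_fk));
INSERT INTO source (payload) VALUES ('a'), ('b'), ('c');
""")


def fetch_new(conn, process_id):
    """Return rows this process has not seen yet and record them as seen."""
    with conn:  # one transaction: select the new rows, then mark them
        rows = conn.execute("""
            SELECT s.id, s.payload
            FROM source AS s
            LEFT JOIN processed AS p
                   ON p.source_fk = s.id AND p.process_id = ?
            WHERE p.source_fk IS NULL
            ORDER BY s.id""", (process_id,)).fetchall()
        conn.executemany(
            "INSERT INTO processed (source_fk, process_id) VALUES (?, ?)",
            [(row[0], process_id) for row in rows])
    return rows


run1 = fetch_new(conn, 7)                  # first run sees everything
conn.execute("INSERT INTO source (payload) VALUES ('d')")
conn.commit()
run2 = fetch_new(conn, 7)                  # second run sees only the new row
other = fetch_new(conn, 8)                 # another process starts from scratch
```

Since rows are selected by "not yet recorded" instead of "id greater than the last one seen", a row whose transaction commits late is still picked up on the next run. The processed table does grow with the source table, so this trades storage for simplicity.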
I would try adding a timestamp column and use it as a reference when retrieving new rows.

MySQL Query: Return all rows with a certain value in one column when value in another column matches specific criteria

This may be a little difficult to answer given that I'm still learning to write queries and I'm not able to view the database at the moment, but I'll give it a shot.
The database I'm trying to acquire information from contains a large table (TransactionLineItems) that essentially functions as a store transaction log. This table currently contains about 5 million rows and several columns describing products which are included in each transaction (TLI_ReceiptAlias, TLI_ScanCode, TLI_Quantity and TLI_UnitPrice). This table has a foreign key which is paired with a primary key in another table (Transactions), and this table contains transaction numbers (TRN_ReceiptNumber). When I join these two tables, the query returns one row for every item we've ever sold, and each row has a receipt number. 16 rows might have the same receipt number, meaning that all of these items were sold in a single transaction. Below that might be 12 more rows, each sharing another receipt number. All transactions are broken down into multiple rows like this.
I'm attempting to build a query which returns all rows sharing a single receipt number where at least one row with that receipt number meets certain criteria in another column. For example, three separate types of gift cards all have values in the TLI_ScanCode column that begin with "740000." I want the query to return rows with values beginning with these six digits in the TLI_ScanCode column, but I would also like to return all rows which share a receipt number with any of the rows which meet the given scan code criteria. Essentially, I need the query to return all rows for every receipt number which is also paired in at least one row with a gift card-related scan code.
I attempted to use a subquery to return a column of all receipt numbers paired with gift card scan codes, using "WHERE A.TRN_ReceiptAlias IN (subquery..." to return only those rows with a receipt number which matched one of the receipt numbers returned by the subquery. This appeared to run without issue for five minutes before the server ground to a halt for another twenty while it processed the query. The query appeared to complete successfully, but given that I was working with IT to restore normal store operations during this time I failed to obtain the results of the query (apart from the associated shame and embarrassment).
I'd like to know if there is a way to write a query to obtain this information without causing the server to hang. I'm assuming that either: a) it wasn't very smart to use a subquery in this manner on such a large table, or b) I don't know enough about SQL to obtain the information I need. I'm assuming the answer is both A and B, but I'd very much like to learn how to do this the right way. Any help would be greatly appreciated. Thanks!
SELECT *
FROM a as a1
JOIN b
ON b.id = a1.id
JOIN a as a2
ON a2.id = b.id
WHERE b.some_criteria = 'something';
Include an index on (b.id,b.some_criteria)
You aren't the first person, nor will you be the last to bring down your system with an inefficient query.
The most important lesson is that "Decision Support" and "Analytics" really don't co-exist with a transaction system. You really want to pull the data into a datamart or datawarehouse or some other database that isn't your transaction database, so that you don't take the business offline.
In terms of understanding why your initial query was so inefficient, you want to familiarize yourself with the EXPLAIN EXTENDED syntax, which returns query plan information that should help you debug your query and work on making it perform acceptably. If you update your question with the actual explain plan output for it, that would be helpful in determining what the issue is.
Just from the outline you provided, it does sound like a self join would make sense rather than the subquery.
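To make the join-based approach concrete for the gift card case, here is a small sketch that first collects the distinct transactions containing a "740000..." scan code and then joins back to fetch every line item of those transactions. It is written with Python's sqlite3 module so that it is runnable. The key columns TRN_ID, TLI_ID and TLI_TRN_ID, the index names and the sample data are assumptions, since the question does not name them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Transactions (
    TRN_ID INTEGER PRIMARY KEY, TRN_ReceiptNumber TEXT);
CREATE TABLE TransactionLineItems (
    TLI_ID INTEGER PRIMARY KEY, TLI_TRN_ID INTEGER,
    TLI_ScanCode TEXT, TLI_Quantity INTEGER);
-- indexes on the join column and on the filtered column
CREATE INDEX idx_tli_trn  ON TransactionLineItems (TLI_TRN_ID);
CREATE INDEX idx_tli_scan ON TransactionLineItems (TLI_ScanCode);

INSERT INTO Transactions VALUES (1, 'R-1001'), (2, 'R-1002'), (3, 'R-1003');
INSERT INTO TransactionLineItems (TLI_TRN_ID, TLI_ScanCode, TLI_Quantity) VALUES
    (1, '7400001234', 1),
    (1, '0123456789', 2),
    (2, '0999999999', 1),
    (3, '7400009999', 1),
    (3, '7400001234', 1),
    (3, '0555555555', 4);
""")

# g: each transaction that contains at least one gift card line, listed once.
# Joining back to the line items returns every row of those transactions.
rows = conn.execute("""
SELECT t.TRN_ReceiptNumber, i.TLI_ScanCode, i.TLI_Quantity
FROM (SELECT DISTINCT TLI_TRN_ID
      FROM TransactionLineItems
      WHERE TLI_ScanCode LIKE '740000%') AS g
JOIN Transactions AS t ON t.TRN_ID = g.TLI_TRN_ID
JOIN TransactionLineItems AS i ON i.TLI_TRN_ID = t.TRN_ID
ORDER BY t.TRN_ReceiptNumber, i.TLI_ID
""").fetchall()
```

The DISTINCT in the derived table matters: without it, a receipt with two gift cards would have all of its rows returned twice.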

Where to store users visited pages?

I have a project, where I have posts for example.
The task is this: I must show each user the posts he visited most recently.
This is my solution: every time a user visits a topic that is new (for him), I create a new record in the table visits.
The visits table has the following structure: id, user_id, post_id, last_visit.
Now my visits table has ~14,000,000 records and it's still growing every day.
Maybe my solution isn't optimal and there is another way to store users' visits?
It's important to save every visit as a standalone record, because I also have a feature to select and use users' visits. And I can't purge this table, because the data could be needed a month or a year later. How could I optimize this situation?
Nope, you don't really have much choice other than to store your visit data in a table with columns for (at a bare minimum) user id, post id, and timestamp if you need to track the last time that each user visited each post.
I question whether you need an id field in that table, rather than using a composite key on (user_id, post_id), but I'd expect that to have a minor effect, provided that you already have a unique index on (user_id, post_id). (If you don't have an index on that pair of fields, adding one should improve query performance considerably and making it a unique index or composite key will protect against accidentally inserting duplicate records.)
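A small sketch of the composite-key layout, assuming one row per (user_id, post_id) whose last_visit is overwritten on each repeat visit. It uses Python's sqlite3 module and SQLite's INSERT OR REPLACE so that it is runnable; in MySQL the corresponding statement would be INSERT ... ON DUPLICATE KEY UPDATE. The record_visit helper and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE visits (
    user_id    INTEGER NOT NULL,
    post_id    INTEGER NOT NULL,
    last_visit TEXT NOT NULL,
    PRIMARY KEY (user_id, post_id))
""")


def record_visit(conn, user_id, post_id, when):
    """Insert the visit, or overwrite last_visit if the pair already exists."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO visits (user_id, post_id, last_visit)"
            " VALUES (?, ?, ?)",
            (user_id, post_id, when))


record_visit(conn, 1, 42, "2015-01-01 10:00:00")
record_visit(conn, 1, 43, "2015-01-02 10:00:00")
record_visit(conn, 1, 42, "2015-01-03 10:00:00")  # repeat visit, same row

# The user's most recently visited posts, newest first.
recent = conn.execute(
    "SELECT post_id, last_visit FROM visits"
    " WHERE user_id = ? ORDER BY last_visit DESC LIMIT 10",
    (1,),
).fetchall()
```

The composite primary key both prevents duplicate (user_id, post_id) rows and serves the "has this user seen this post" lookup without a separate id column.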
If performance is still an issue despite proper indexing, you should be able to improve it a bit by segmenting the table into a collection of smaller tables, but segment it by user_id or post_id (rather than by date as previous answers have suggested). If you break it up by user or post id, then you will still be able to determine whether a given user has previously viewed a given post and, if so, on what date with only a single query. If you segment it by date, then that information will be spread across all tables and, in the worst-case scenario of a user who has never previously viewed a post (which I expect to be fairly common), you'll need to separately query each and every table before having a definitive answer.
As for whether to segment it by user id or by post id, that depends on whether you will more often be looking for all posts viewed by a user (segment by user_id to get them all in one query) or all users who have viewed a post (segment by post_id).
If it doesn't need to be long lasting, you could store it in session instead. If it does, you could either break the records apart by table, like say 1 per month, or you could only store the last 5-10 pages visited, and delete old ones as new ones come in. You could also change it to pages visited today, this week, etc.
If you do need all 14 million records, I would create another historical table to archive the visits that are not the most relevant for the day-to-day site operation.
At the end of the month (or week, or quarter, etc.) have some scheduled logic archive records beyond a certain cutoff point to the historical table and reduce the number of records in the "live" table. This should help increase the query speed on the "live" table, since you would have fewer records in it.
If you do need to query all of the data, you can use both tables and have all of the data available to you.
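The scheduled archiving step could be sketched like this, using Python's sqlite3 module so that the example can be run directly. The visits_archive table name, the archive_before helper and the cutoff value are assumptions; the copy and the delete run in one transaction so that a failure cannot lose or duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (
    id INTEGER PRIMARY KEY, user_id INTEGER, post_id INTEGER, last_visit TEXT);
CREATE TABLE visits_archive (
    id INTEGER PRIMARY KEY, user_id INTEGER, post_id INTEGER, last_visit TEXT);
INSERT INTO visits (user_id, post_id, last_visit) VALUES
    (1, 10, '2014-11-05 08:00:00'),
    (1, 11, '2015-01-20 09:00:00'),
    (2, 10, '2014-12-31 23:59:59');
""")


def archive_before(conn, cutoff):
    """Move visits older than `cutoff` into the archive; return rows moved."""
    with conn:  # copy and delete commit together or not at all
        conn.execute(
            "INSERT INTO visits_archive"
            " SELECT * FROM visits WHERE last_visit < ?", (cutoff,))
        moved = conn.execute(
            "DELETE FROM visits WHERE last_visit < ?", (cutoff,)).rowcount
    return moved


moved = archive_before(conn, "2015-01-01 00:00:00")
live = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]

# When the full history is needed, both tables can be read together.
total = conn.execute("""
SELECT COUNT(*) FROM (
    SELECT id FROM visits
    UNION ALL
    SELECT id FROM visits_archive)
""").fetchone()[0]
```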
You could delete the ones you don't need. If you only want to show the last 10 visited posts, then run
DELETE FROM visits WHERE user_id = ? AND id NOT IN (SELECT id FROM (SELECT id FROM visits WHERE user_id = ? ORDER BY last_visit DESC LIMIT 10) AS keep_rows);
(I think that's the best way to do that query; any MySQL guru can tell me otherwise. The extra derived table is there because MySQL rejects LIMIT directly inside an IN subquery. You can ORDER BY in DELETE, but the LIMIT there only takes 1 parameter, so you can't do LIMIT 10, 100.)
after inserting/updating each new row, or every few days if you like.
Having a structure like (id, user_id, post_id, last_visit) for your visits table makes it appear as though you are saving all posts, not just the last post per topic. Don't you need a topic ID in there somewhere so that you can determine what their last post PER TOPIC was, and so you know which row to replace when they post in the same topic more than once?
Store the post_ids in $_SESSION, and then with one SELECT query using MySQL's IN you will be able to show his visited posts. All those ids will be destroyed once the member closes his browser, but this is still much faster than using the database.
Edit: sorry, I didn't notice that you must store those records in the database and use them months later. In that case I have no idea how to optimize it, but with 14 million records you should definitely use indexes.