Our problem lies in performing a left join on two large tables (both having millions of entries).
The first one is a table that contains input supplied by the end users of our program. It contains answers to a variety of questions; every question belongs to a certain questionnaire. The most important columns are an identifier for the given response, an identifier for the questionnaire form, the datetime the answer was given, and an identifier for the user that supplied the answer.
The second table contains information on the daily progress of users regarding the completion of questionnaires: the number of answers a certain user has given on a certain day for a given activity. The most important columns in this table are the user id, the questionnaire id and the date.
The second table is updated right after a new answer enters the first table. Updating is performed by code (workers) that runs on a different server. We would like to make the system robust against failure of this other server. An important step in ensuring that the table with the results ('responses') remains in sync with the progress table ('progress_questionnaires') is being able to check whether a combination of user_id, questionnaire_id and datetime from the 'responses' table is also present in the 'progress_questionnaires' table.
A query that captures the required results, but does not perform well on large tables (N×N, where N is a couple of million rows), is:
SELECT r.questionnaire_id, r.user_id, CAST(r.first_created AS date) AS date, 1 AS original
FROM responses r
LEFT JOIN progress_questionnaires pq
  ON r.questionnaire_id = pq.questionnaire_id
  AND r.user_id = pq.user_id
  AND CAST(r.first_created AS date) = pq.date
WHERE pq.activity_id IS NULL
GROUP BY r.questionnaire_id, r.user_id, CAST(r.first_created AS date)
As stated before, this query does capture the required results, but does not perform well on large tables. All key columns are properly indexed as far as we know.
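For reference, the same anti-join can also be written with NOT EXISTS; the sketch below assumes the same table and column names and is untested against our schema. With a composite index on progress_questionnaires (user_id, questionnaire_id, date), either form should be resolvable by index lookups rather than an N×N comparison.
SELECT r.questionnaire_id, r.user_id, CAST(r.first_created AS date) AS date, 1 AS original
FROM responses r
WHERE NOT EXISTS (
    SELECT 1
    FROM progress_questionnaires pq
    WHERE pq.user_id = r.user_id
      AND pq.questionnaire_id = r.questionnaire_id
      AND pq.date = CAST(r.first_created AS date)
)
GROUP BY r.questionnaire_id, r.user_id, CAST(r.first_created AS date)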
We would be very happy if someone could help us out.
P.S. We are using MariaDB, version 5.5.43. I hope I have supplied all necessary information, but of course I would be happy to supply additional details where necessary.
I am looking at a point-system DB design. My question is quite similar to the question that I found here: Database design - Approach for storing points for users.
In this system, a user earns points when any of these actions happens:
User registers on the website (i.e. an active entry in the User table)
User writes an answer to another user's question (i.e. an entry in the Answer table)
User's answers are rated by other users (i.e. an entry in the Answer_Rating table for that user)
User invites other users to join the platform
From the DB design side, I have created these two tables:
Action_Master table, and
User_Action_Point table.
The Action_Master table contains: (id, action_name, action_point)
The User_Action_Point table stores the history of each action, so it looks like this:
(id, action_master_id, action_done, created_by, created_at, updated_by, updated_at, deleted_at)
Now the problem here is the User_Action_Point table: it duplicates data from the User table, the Answer table and the Answer_Rating table.
This problem is very well addressed by Jeffrey in the first answer to the linked question. According to his answer, we would use views or stored procedures to sum up the points from the different tables every time (something like the sketch below). This approach is nice because we do not have to handle the overhead of data deletion or any other changes that may affect the user points.
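For concreteness, a view along those lines could look roughly like this. It is only a sketch: the column names (Answer.created_by, Answer_Rating.answer_id) and the action_name values are assumptions, not my actual schema.
CREATE VIEW user_points AS
SELECT u.id AS user_id,
       -- registration points (every user in the table has registered)
       (SELECT action_point FROM Action_Master WHERE action_name = 'REGISTER')
       -- points for answers written; created_by is an assumed column name
     + (SELECT COUNT(*) FROM Answer a WHERE a.created_by = u.id)
       * (SELECT action_point FROM Action_Master WHERE action_name = 'ANSWER')
       -- points for ratings received on those answers; column names assumed
     + (SELECT COUNT(*) FROM Answer_Rating r
          JOIN Answer a2 ON a2.id = r.answer_id
        WHERE a2.created_by = u.id)
       * (SELECT action_point FROM Action_Master WHERE action_name = 'RATING_RECEIVED')
       AS total_points
FROM User u;
MySQL re-runs those counts every time the view is read, which is exactly what my concern below is about.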
But is that a good approach when we need user points very frequently? Don't you think it could increase the DB response time or the load on the MySQL server?
Or do I need to store the aggregated user points in a separate table, with the overhead of handling the duplicated data (i.e. if anything gets deleted, we also have to subtract those points in the points table)?
Please Suggest.
This may be a little difficult to answer given that I'm still learning to write queries and I'm not able to view the database at the moment, but I'll give it a shot.
The database I'm trying to acquire information from contains a large table (TransactionLineItems) that essentially functions as a store transaction log. This table currently contains about 5 million rows and several columns describing products which are included in each transaction (TLI_ReceiptAlias, TLI_ScanCode, TLI_Quantity and TLI_UnitPrice). This table has a foreign key which is paired with a primary key in another table (Transactions), and this table contains transaction numbers (TRN_ReceiptNumber). When I join these two tables, the query returns one row for every item we've ever sold, and each row has a receipt number. 16 rows might have the same receipt number, meaning that all of these items were sold in a single transaction. Below that might be 12 more rows, each sharing another receipt number. All transactions are broken down into multiple rows like this.
I'm attempting to build a query which returns all rows sharing a single receipt number where at least one row with that receipt number meets certain criteria in another column. For example, three separate types of gift cards all have values in the TLI_ScanCode column that begin with "740000." I want the query to return rows with values beginning with these six digits in the TLI_ScanCode column, but I would also like to return all rows which share a receipt number with any of the rows which meet the given scan code criteria. Essentially, I need the query to return all rows for every receipt number which is also paired in at least one row with a gift card-related scan code.
I attempted to use a subquery to return a column of all receipt numbers paired with gift card scan codes, using "WHERE A.TRN_ReceiptAlias IN (subquery..." to return only those rows with a receipt number which matched one of the receipt numbers returned by the subquery. This appeared to run without issue for five minutes before the server ground to a halt for another twenty while it processed the query. The query appeared to complete successfully, but given that I was working with IT to restore normal store operations during this time I failed to obtain the results of the query (apart from the associated shame and embarrassment).
I'd like to know if there is a way to write a query to obtain this information without causing the server to hang. I'm assuming that either: a) it wasn't very smart to use a subquery in this manner on such a large table, or b) I don't know enough about SQL to obtain the information I need. I'm assuming the answer is both A and B, but I'd very much like to learn how to do this the right way. Any help would be greatly appreciated. Thanks!
SELECT *
FROM a AS a1
JOIN b
  ON b.id = a1.id
JOIN a AS a2
  ON a2.id = b.id
WHERE b.some_criteria = 'something';
Include an index on (b.id,b.some_criteria)
You aren't the first person, nor will you be the last to bring down your system with an inefficient query.
The most important lesson is that "Decision Support" and "Analytics" really don't co-exist with a transaction system. You really want to pull the data into a data mart, data warehouse, or some other database that isn't your transaction database, so that you don't take the business offline.
In terms of understanding why your initial query was so inefficient, you want to familiarize yourself with the EXPLAIN EXTENDED syntax, which returns plan information that should help you debug your query and work on making it perform acceptably. If you update your question with the actual explain-plan output, that would be helpful in determining what the issue is.
Just from the outline you provided, it does sound like a self join would make sense rather than the subquery.
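Applied to the tables you described, a self-join version might look roughly like the sketch below. I'm guessing that TLI_ReceiptAlias is the column that links line items to their receipt, so treat the column names as placeholders for whatever your schema actually uses.
SELECT DISTINCT tli_all.*
FROM TransactionLineItems AS tli_gift          -- the rows that are gift cards
JOIN TransactionLineItems AS tli_all           -- every row on the same receipt
  ON tli_all.TLI_ReceiptAlias = tli_gift.TLI_ReceiptAlias
WHERE tli_gift.TLI_ScanCode LIKE '740000%';
With an index on TLI_ScanCode (to find the gift-card rows) and one on TLI_ReceiptAlias (to find the other rows on the same receipts), this should avoid scanning the whole 5-million-row table twice.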
I am a reasonably competent SQL programmer but my skills are still pretty much in the domain of simple INSERT, SELECT, UPDATE statements with an occasional LIKE etc thrown in. What I am currently trying to do is rather more complex. Here is the scenario.
I have three tables.
Table 1, *users*, identifies users via a user ID, uid. Users can have one or more subaccounts.
Table 2, *accounts*, keeps a record of subaccounts for each user with, amongst other things, the columns uid and sid, where uid is the one defined in the *users* table.
Table 3, *data*, currently stores some data in a data column that is associated with a particular subaccount, sid.
The thing I have just realized is that there is no particular reason to block users from using those data across subaccounts. No problem - I can change my data subset search SQL to work with the uid instead. However, given the frequency of such searches, it seems well worth while simply sticking in a uid column in *data*.
To do that I would need to write some smart SQL that would get uid,sid pairs from the *accounts* table and use that information to update the newly created uid column in the data table. This I have to admit is beyond my knowledge of SQL.
I should mention that the system using these data is now in production and has several hundred users, so the option of just acting like they are not there is not available. Not terribly relevant, I think, but I should also mention that uid and sid are alphanumeric strings, with both columns being indexed.
I would be most grateful to anyone here who might be able to help out with it.
MySQL can do updates based on joins, and based on my reading of your schema, here's what I'd do...
UPDATE accounts a, data d
SET d.uid = a.uid
WHERE a.sid = d.sid
  AND d.uid IS NULL
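If the uid column doesn't exist in *data* yet, you'd add it (and, given how often you search on it, an index) first. A sketch, assuming uid is a VARCHAR; match the type and length to what users.uid actually is:
ALTER TABLE data ADD COLUMN uid VARCHAR(32) NULL;  -- assumed type/length
ALTER TABLE data ADD INDEX idx_data_uid (uid);
Then the UPDATE above fills it in for the existing rows.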
I've researched related questions on the site but failed to find a solution. What I have is a user activity table in MySQL. It lists all kinds of events of a user within a photo community site. Depending on the event that took place, I need to query certain data from other users.
I'll explain it in a more practical way by using two examples. First, a simple event, where the user joined the site. This is what the row in the activity table would look like:
event: REGISTERED
user_id: 19 (foreign key to user table)
date: current date
image_id: null, since this event has nothing to do with images
It is trivial to query this. Now an event for which extra data needs to be queried. This event indicates a user that uploaded an image:
event: IMAGEUPLOAD
user_id: 19 (foreign key to user table)
date: current date
image_id: 12
This second event needs to do a join to the image table to get the image URL column from that table. A third event could be about a comment vote, where I would need to do a join to the comments table to get extra columns.
In essence, I need a way to conditionally select extra columns (not rows) per row based on the event type. This is easy to do when the columns come from the same table, but I'm struggling to do this using joins from other tables. I hope to do this in one, conditional query without the use of a stored procedure.
Is this possible?
You could make the joins depend on the event type, like:
select *
from Events e
left join Image i
on e.event = 'IMAGEUPLOAD'
and e.image_id = i.id
left join comments c
on e.event = 'COMMENT'
and e.comment_id = c.id
If there's one column that is shared among all linked tables, for example create_date, you can coalesce to select the one that's not NULL:
select coalesce(i.create_date, c.create_date, ...) as create_date
Doing precisely what you want to do is not possible. A SELECT is designed to return a list of tuples/rows, and each has the same number of elements/columns.
What you are really doing here is collecting two different kinds of information, and you're going to have to process the two kinds separately anyway, which should be a hint that you're doing something slightly wrong. Instead, pull the various event types out individually, perform whatever additional operations you need to convert them to your common output type (e.g. HTML if this is for a website), and then interleave them together at that stage.
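In practice that usually means one query per event type, merged in application code. A rough sketch, reusing the table names from the answer above (the image URL column name is an assumption):
-- one query per event type; merge and order by date in application code
SELECT e.event, e.user_id, e.date
FROM Events e
WHERE e.event = 'REGISTERED';
SELECT e.event, e.user_id, e.date, i.url AS image_url   -- url column name assumed
FROM Events e
JOIN Image i ON i.id = e.image_id
WHERE e.event = 'IMAGEUPLOAD';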
I have three tables called: users, facilities, and staff_facilities.
users contains average user data, the most important fields in my case being users.id, users.first, and users.last.
facilities also contains a fair amount of data, but none of it is necessarily pertinent to this example except facilities.id.
staff_facilities consists of staff_facilities.id (int, auto_inc, NOT NULL), staff_facilities.users_id (int, NOT NULL), and staff_facilities.facilities_id (int, NOT NULL). (That's a mouthful!)
staff_facilities references the ids for the other two tables, and we are calling this table to look up users' facilities and facilities' users.
This is my select query in PHP:
SELECT users.id, users.first, users.last FROM staff_facilities LEFT JOIN users ON staff_facilities.users_id = users.id WHERE staff_facilities.facilities_id = $id ORDER BY users.last
This query works great on our development server, but when I drop it into the client's production environment, blank rows often appear in the result set. Our development server is using replicated tables and data that already exist on the client's production server, but the hardware and software vary quite a bit.
These rows are devoid of any information, including the three id fields that require NOT NULL values to be entered into the database. Running the query through the MySQL management tools on the backend returns the same results. Searching the table for NULL fields has not turned up anything.
The other strange thing is that the number of empty rows is changing based on the varying results caused by the WHERE clause id check. It's usually around one to three empty rows, but they are consistent when using the same parameter.
I've many times dealt with the returning of nearly duplicate rows due to LEFT JOINS, but I've never had this happen before. As far as displaying the information goes, I can easily hide it from the end user. My concern is primarily that this problem will be compounded as time passes and the number of records grows larger. As it sits, this system has just been installed, and we already have 2000+ records in the staff_facilities table.
Any insight or direction would be appreciated. I can provide further more detailed examples and information as well.
You are only selecting columns from the table on the right side of the join. Of course some of them are completely null, you did a left join. So those records match to an id in the table on the left side of the join but not to any data on the right side of the join. Since you aren't returning any columns from the left table, you see no data.
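A minimal fix along those lines (just a sketch of your same query) is to use an inner join, or to filter out the unmatched rows explicitly, so staff_facilities rows without a matching user are dropped:
SELECT users.id, users.first, users.last
FROM staff_facilities
INNER JOIN users ON staff_facilities.users_id = users.id
WHERE staff_facilities.facilities_id = $id
ORDER BY users.last
It would still be worth finding out why some staff_facilities.users_id values have no matching users row, since those are the records producing the blank rows.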