PL/SQL rownum updates - mysql

I am working on a database with a couple of tables:
districts table
  PK district_id
student_data table
  PK study_id
  FK district_id
ga_data table
  PK study_id
  district_id
The ga_data table is data that I am adding in. Both the student_data table and ga_data have 1.3 million records. The study_ids are 1 to 1 between the two tables, but the ga_data.district_ids are NULL and need to be updated. I am having trouble with the following PL/SQL:
update ga_data
set district_id = (select district_id from student_data
where student_data.study_id = ga_data.study_id)
where ga_data.district_id is null and rownum < 100;
I need to do it incrementally, so that's why I need rownum. But am I using it correctly? After running the query a bunch of times, it only updated about 8,000 records of the 1.3 million (it should be about 1.1 million updates, since some of the district_ids are null in student_data). Thanks!

ROWNUM just chops off the query after the first n rows. You have some rows in STUDENT_DATA which have a NULL for DISTRICT_ID. So after a number of runs your query is liable to get stuck in a rut, returning the same 100 QA_DATA records, all of which match one of those pesky STUDENT_DATA rows.
So you need some mechanism for ensuring that you are working your way progressively through the QA_DATA table. A flag column would be one solution. Partitioning the query so it hits a different set of STUDENT_IDs is another.
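For example, a rough sketch of the partitioning idea, stepping through ga_data by key range (the BETWEEN bounds are illustrative and assume study_id is numeric; neither comes from the original post):
update ga_data
set district_id = (select district_id from student_data
                   where student_data.study_id = ga_data.study_id)
where ga_data.district_id is null
and ga_data.study_id between 1 and 100000;  -- next run: between 100001 and 200000, and so on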
It's not clear why you have to do this in batches of 100, but perhaps the easiest way of doing this would be to use BULK PROCESSING (at least in Oracle: this PL/SQL syntax won't work in MySQL).
Here is some test data:
SQL> select district_id, count(*)
2 from student_data
3 group by district_id
4 /
DISTRICT_ID COUNT(*)
----------- ----------
7369 192
7499 190
7521 192
7566 190
7654 192
7698 191
7782 191
7788 191
7839 191
7844 192
7876 191
7900 192
7902 191
7934 192
8060 190
8061 193
8083 190
8084 193
8085 190
8100 193
8101 190
183
22 rows selected.
SQL> select district_id, count(*)
2 from qa_data
3 group by district_id
4 /
DISTRICT_ID COUNT(*)
----------- ----------
4200
SQL>
This anonymous block uses the Bulk processing LIMIT clause to batch the result set into chunks of 100 rows.
SQL> declare
2 type qa_nt is table of qa_data%rowtype;
3 qa_recs qa_nt;
4
5 cursor c_qa is
6 select qa.student_id
7 , s.district_id
8 from qa_data qa
9 join student_data s
10 on (s.student_id = qa.student_id);
11 begin
12 open c_qa;
13
14 loop
15 fetch c_qa bulk collect into qa_recs limit 100;
16 exit when qa_recs.count() = 0;
17
18 for i in qa_recs.first()..qa_recs.last()
19 loop
20 update qa_data qt
21 set qt.district_id = qa_recs(i).district_id
22 where qt.student_id = qa_recs(i).student_id;
23 end loop;
24
25 end loop;
26 end;
27 /
PL/SQL procedure successfully completed.
SQL>
Note that this construct allows us to do additional processing on the selected rows before issuing the update. This is handy if we need to apply complicated fixes programmatically.
As you can see, the data in QA_DATA now matches that in STUDENT_DATA:
SQL> select district_id, count(*)
2 from qa_data
3 group by district_id
4 /
DISTRICT_ID COUNT(*)
----------- ----------
7369 192
7499 190
7521 192
7566 190
7654 192
7698 191
7782 191
7788 191
7839 191
7844 192
7876 191
7900 192
7902 191
7934 192
8060 190
8061 193
8083 190
8084 193
8085 190
8100 193
8101 190
183
22 rows selected.
SQL>
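If no per-row processing is needed, the inner FOR loop in the block above could likely be replaced with FORALL, so each chunk of updates is sent to the SQL engine in a single call. A sketch along the same lines (this variant is not from the original answer; the collection names are illustrative):
declare
    type t_ids   is table of qa_data.student_id%type;
    type t_dists is table of qa_data.district_id%type;
    l_ids   t_ids;
    l_dists t_dists;

    cursor c_qa is
        select qa.student_id
             , s.district_id
        from   qa_data qa
        join   student_data s
          on  (s.student_id = qa.student_id);
begin
    open c_qa;
    loop
        fetch c_qa bulk collect into l_ids, l_dists limit 100;
        exit when l_ids.count() = 0;

        -- one bulk-bound UPDATE per chunk instead of 100 single-row updates
        forall i in 1..l_ids.count()
            update qa_data qt
            set    qt.district_id = l_dists(i)
            where  qt.student_id  = l_ids(i);
    end loop;
    close c_qa;
end;
/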

It is kind of an odd requirement to only update 100 rows at a time. Why is that?
Anyway, since district_id in student_data can be null, you might be updating the same 100 rows over and over again.
If you extend your query to make sure a non-null district_id exists, you might end up where you want to be:
update ga_data
set district_id = (
select district_id
from student_data
where student_data.study_id = ga_data.study_id
)
where ga_data.district_id is null
and exists (
select 1
from student_data
where student_data.study_id = ga_data.study_id
and district_id is not null
)
and rownum < 100;
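If the batching really is required (say, to keep undo small and commit between chunks), a hedged sketch of repeating that update until nothing is left (the loop-and-commit wrapper is an assumption, not part of the original answer):
begin
  loop
    update ga_data
    set district_id = (
      select district_id
      from student_data
      where student_data.study_id = ga_data.study_id
    )
    where ga_data.district_id is null
    and exists (
      select 1
      from student_data
      where student_data.study_id = ga_data.study_id
      and district_id is not null
    )
    and rownum < 100;

    exit when sql%rowcount = 0;
    commit;
  end loop;
end;
/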

If this is a one-time conversion you should consider a completely different approach. Recreate the table as the join of your two tables. I promise you will laugh out loud when you realise how fast it is compared to all kinds of funny 100-rows-at-a-time updates.
create table new_table as
select study_id
,s.district_id
,g.the_remaining_columns_in_ga_data
from student_data s
join ga_data g using(study_id);
-- create indexes, constraints etc
drop table ga_data;
alter table new_table rename to ga_data;
Or, if it isn't a one-time conversion, or you can't re-create/drop tables, or you just feel like spending a few extra hours on data loading:
merge
into ga_data g
using student_data s
on (g.study_id = s.study_id)
when matched then
update
set g.district_id = s.district_id;
The last statement can also be rewritten as an updatable join view, but I personally never use them.
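For reference, such an updatable join view would look roughly like this (a sketch, assuming study_id is declared as the primary key of student_data so the view is key-preserved):
update (select g.district_id as ga_district_id
             , s.district_id as student_district_id
        from   ga_data g
        join   student_data s
          on  (s.study_id = g.study_id))
set ga_district_id = student_district_id;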
Dropping/disabling the indexes and constraints on ga_data.district_id before running the merge, and recreating them afterwards, will improve performance.

Related

Efficient database schema

I have two MySQL tables storing some data like below:
table INFO:
the "key" must be unique in this INFO table, and "group" can be duplicate for each key.
info_id (pk) | group | key
1            | GrA   | aaa
2            | GrA   | bbb
3            | GrB   | ccc
4            | GrC   | ddd
table HISTORY: if the product "product_name" does not yet have a row for an info_id (checked with a SELECT query), then a row with that info_id is inserted for the product.
index (pk) | product_name | group | info_id
1          | ProductA     | GrA   | 1
2          | ProductA     | GrB   | 3
3          | ProductA     | GrA   | 2
4          | ProductB     | GrA   | 1
5          | ProductC     | GrA   | 1
6          | ProductC     | GrA   | 2
7          | ProductD     | GrC   | 4
8          | ProductD     | GrA   | 2
9          | ProductE     | GrB   | 3
The client running the SQL queries is Python.
The tables above work, but the INFO table has over 600,000 records and the HISTORY table has over 5,000,000.
Query performance is really slow; a single query takes about 5 seconds to return.
To get faster results for each query, I want to rebuild this schema.
Edit:
I'm using the queries below:
SELECT COUNT(group) FROM INFO : to get count of specific group
SELECT * FROM INFO WHERE group = "GrB" and key = "EEE"
INSERT INTO INFO(group, key) VALUES("GrB", "EEE") : insert if query 1 result is None
SELECT * FROM HISTORY WHERE product_name = "ProductA" and info_id = "4"
INSERT INTO HISTORY(product_name, group, info_id) VALUES("ProductA", "GrC", "4") : insert if query 4 result is None
I could better explain the problem if you had provided the CREATE TABLEs and the SELECT. Meanwhile, I will guess that key is either not indexed or indexed by itself. Based on how you described the table, this would be much faster:
CREATE TABLE info (
key VARCHAR(...) ...,
grp VARCHAR(...) ...,
PRIMARY KEY(key),
INDEX(grp) -- needed if you ever look up all the keys for a grp
) ENGINE=InnoDB
and replace info_id by key in the other table.
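Following that suggestion, the HISTORY table might end up looking something like this (a sketch only; the column lengths, the info_key name, and the composite index are assumptions, not from the answer):
CREATE TABLE history (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    product_name VARCHAR(64) NOT NULL,   -- length is an assumption
    info_key     VARCHAR(64) NOT NULL,   -- replaces info_id; matches info.key
    INDEX(product_name, info_key)        -- supports the "does this pair already exist?" lookup
) ENGINE=InnoDB;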
But then why have grp in both tables? Show us the schema and query; I may come up with a better way.

query executing too much time

My query is taking around 2,800 seconds to return its output.
We have indexes as well, but no luck.
My target is to get the output within 2 to 3 seconds.
If possible, please rewrite the query.
query:
select ttl.id, ttl.url, ttl.canonical_url_id
from t_target_url ttl
where ttl.own_domain_id=476 and ttl.type != 10
order by ttl.week_entrances desc
limit 550000;
Explain Plan:
+----+-------------+-------+------+--------------------------------+---------------------------+---------+-------+----------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+--------------------------------+---------------------------+---------+-------+----------+-----------------------------+
| 1 | SIMPLE | ttl | ref | own_domain_id_type_status,type | own_domain_id_type_status | 5 | const | 57871959 | Using where; Using filesort |
+----+-------------+-------+------+--------------------------------+---------------------------+---------+-------+----------+-----------------------------+
1 row in set (0.80 sec)
mysql> show create table t_target_url\G
*************************** 1. row ***************************
Table: t_target_url
Create Table: CREATE TABLE `t_target_url` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`own_domain_id` int(11) DEFAULT NULL,
`url` varchar(2000) NOT NULL,
`create_date` datetime DEFAULT NULL,
`friendly_name` varchar(255) DEFAULT NULL,
`section_name_id` int(11) DEFAULT NULL,
`type` int(11) DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`week_entrances` int(11) DEFAULT NULL COMMENT 'last 7 days entrances',
`week_bounces` int(11) DEFAULT NULL COMMENT 'last 7 days bounce',
`canonical_url_id` int(11) DEFAULT NULL COMMENT 'the primary URL ID, NOT allow canonical of canonical',
KEY `id` (`id`),
KEY `urlindex` (`url`(255)),
KEY `own_domain_id_type_status` (`own_domain_id`,`type`,`status`),
KEY `canonical_url_id` (`canonical_url_id`),
KEY `type` (`type`,`status`)
) ENGINE=InnoDB AUTO_INCREMENT=227984392 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (`type`)
(PARTITION p0 VALUES LESS THAN (0) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (1) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (2) ENGINE = InnoDB,
PARTITION pEOW VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
1 row in set (0.00 sec)
Your query itself looks fine; however, the ORDER BY over a possible half-million records is probably your killer. I would add an index to help optimize that portion:
( own_domain_id, week_entrances, type )
This way you are first hitting your critical key "own_domain_id" and then getting everything already in order. The type column is only filtered with != 10, so almost any type qualifies; leaving it in the second index position would appear to cause more problems.
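In MySQL that index could be added roughly like this (the index name is an assumption, and adding an index to a table this size will take a while):
ALTER TABLE t_target_url
  ADD INDEX own_domain_week_type (own_domain_id, week_entrances, type);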
Comment Feedback.
For simplicity, your critical key per the WHERE clause is "ttl.own_domain_id=476": you only care about data for domain ID 476. Now, let's assume you have 15 "types" that span all different week entrances, such as:
own_domain_id type week_entrances
476 1 1000
476 1 1700
476 1 850
476 2 15000
476 2 4250
476 2 12000
476 7 2500
476 7 5300
476 10 1250
476 10 4100
476 12 8000
476 12 3150
476 15 5750
476 15 27000
This is obviously not at the scale of your half-million rows, but it shows sample data.
With type != 10, the engine STILL has to blow through all the records for id=476 and exclude only those with type = 10. It then has to sort all that data by week entrances, which takes more time. By putting week entrances in the second position of the key, and THEN the type, the returned result set can already be in the proper order, while rows with type = 10 are still skipped quickly as they are encountered. Here is the revised index data per the sample above:
own_domain_id week_entrances type
476 850 1
476 1000 1
476 1250 10
476 1700 1
476 2500 7
476 3150 12
476 4100 10
476 4250 2
476 5300 7
476 5750 15
476 8000 12
476 12000 2
476 15000 2
476 27000 15
So, as you can see, the data is already pre-sorted per the index, and applying DESCENDING order is no problem for the engine; it just pulls the records in reverse order and skips the 10s as they are found.
Does that help?
Additional comment feedback per Salman.
Think of this another way: a store with 10 different branch locations, each with its own sales. The transaction receipts are stored in boxes (literally). Think of how you would want to go through the boxes if you were looking for all transactions on a given date.
Box 1 = Store #1 only, and transactions sorted by date
Box 2 = Store #2 only, and transactions sorted by date
Box ...
Box 10 = Store #10 only, sorted by date.
You have to go through 10 boxes, pulling out all for a given date... Or in the original question, every transaction EXCEPT for one date, and you want them in order by dollar amount of transaction, regardless of date... What a mess that could be.
If instead you had the boxes pre-sorted by amount, regardless of store:
Box 1 = Sales from $1 - $1000 (all properly sorted by amount)
Box 2 = Sales from $1001 - $2000 (properly sorted)
Box ...
Box 10... same...
You STILL have to go through all the boxes, but the contents are already in order by amount, and as you look through the transactions you can simply throw out the ones with the excluded date.
Indexes help pre-organize how the engine can best go through them for your criteria.

Expanding a row value into individual rows

I need some help with a SQL Query. I know this should be redesigned, but it is a small fix I am doing on a larger system :(.
The system has a table called sales, containing parts sold.
id | date | idpart
1 |unixtime| 227
2 |unixtime| 256
And so on..
The table Orderdetails contains the content of the order; the parts are listed by id, along with the amount of each unique part ordered by a customer.
id |idpart | amount
1 | 255 | 4
2 | 265 | 2
Now, my problem is that I have to run a query to populate the sales table, adding the idpart as a new row as many times as that part's amount value in the order.
I need the result to be:
id | date | idpart
1 |unixtime| 227
2 |unixtime| 256
3 |unixtime| 255
4 |unixtime| 255
5 |unixtime| 255
6 |unixtime| 255
7 |unixtime| 265
8 |unixtime| 265
Is there anyone who could give me some help with this problem?
This is easy if you have a table with numbers. You can do it as:
select id, date, idpart
from sales
union all
select id, date, idpart
from orders o join
     numbers n
     on n.number <= o.amount;
This is assuming that date is coming from the orders table.
Now, you just need to generate the numbers table. This is a pain in MySQL. Perhaps you have another table you can use, such as a calendar table. Otherwise, you can create one by inserting into a table with an auto increment column.
In other databases, you can use the row_number() function to create such a table on the fly. But MySQL does not support this.
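A hedged sketch of building such a numbers table in MySQL by repeated doubling (the table and column names match the query above; none of this is from the original answer):
-- seed a one-row numbers table, then double it until it covers the largest amount
CREATE TABLE numbers (number INT UNSIGNED AUTO_INCREMENT PRIMARY KEY);
INSERT INTO numbers () VALUES ();
INSERT INTO numbers SELECT NULL FROM numbers;  -- run repeatedly: 2, 4, 8, 16, ... rows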

MySQL Query to Match value from one column to the same value in another column

I have a transactional table that has two id reference fields, NEW_REF and Original_REF. So in this transactional table I can have multiple transactions that actually relate to one event. When a new event is added, it gets a NEW_REF and the Original_REF field is null. If something changes about this event, a new record is created, and the new record has its Original_REF updated to the previous record's NEW_REF ID.
So as an example, in my table, I have:
REF1 | Original_Ref
956 | 200
960 | null
967 | 960
980 | 967
990 | 600
991 | 700
992 | 670
998 | 343
1000 | 980
1001 | 778
1010 | 787
1020 | 565
As an example, if an event has more than one related transaction, I want to be able to have a query that would pull out all related transactions, per event. In the above example, I would expect to see:
REF1 | Original_Ref
960 | null
967 | 960
980 | 967
1000 | 980
Here record 960 is the original record and has been updated 3 times. Is there a way of querying my table to identify and group together related transactions per event?
The way your table is structured, you end up having to do nested subqueries. If your tree gets any deeper than about 3 nodes, it becomes grisly. You might consider a table structure like this:
id int unsigned auto_increment primary key,
parent_id int unsigned,
root_id int unsigned not null
An initial event record might look like this: 200, null, 200. A first child might be 547, 200, 200. An nth child might be 1038, 986, 200.
So a query for all the records for an event is simple:
SELECT * FROM mytable WHERE root_id= ?
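A hedged sketch of how a new child record would get its root_id under that structure (the table name mytable is from the answer; the parent id 986 is just illustrative):
-- insert a child of row 986, copying root_id from the parent
INSERT INTO mytable (parent_id, root_id)
SELECT p.id, p.root_id
FROM mytable p
WHERE p.id = 986;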
This is probably about the fastest query you can do that meets your requirements. Note that it will not group result records by "transaction group" -- in this case, such an ordering will be very difficult to achieve using SQL alone. (I'm assuming you already have a separate index on both columns -- if not, make sure that you do or this query will perform very poorly.)
SELECT a.REF1, a.Original_Ref
FROM txn AS a
LEFT JOIN txn AS b
ON a.Original_Ref = b.REF1
LEFT JOIN txn AS c
ON c.Original_Ref = a.REF1
WHERE b.REF1 IS NOT NULL
OR c.Original_Ref IS NOT NULL;
You can also do this with correlated subqueries, but MySQL is notoriously poor at optimizing those as joins.
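For comparison, the correlated-subquery form of the same filter would look roughly like this (a sketch; as noted, the join version above is usually the faster one in MySQL):
SELECT a.REF1, a.Original_Ref
FROM txn AS a
WHERE EXISTS (SELECT 1 FROM txn AS b WHERE b.REF1 = a.Original_Ref)
   OR EXISTS (SELECT 1 FROM txn AS c WHERE c.Original_Ref = a.REF1);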

How to select all users for which given parameters are always true

I have a table containing users and locations where they were seen:
user_id | latitude | longitude | date_seen
-------------------------------------------
1035 | NULL | NULL | April 25 2010
1035 | 127 | 35 | April 28 2010
1038 | 127 | 35 | April 30 2010
1037 | NULL | NULL | May 1 2010
1038 | 126 | 34 | May 21 2010
1037 | NULL | NULL | May 24 2010
The dates are regular timestamps in the database; I just simplified them here.
I need to get a list of the users for whom latitude and longitude are always null. So in the above example, that would be user 1037: user 1035 has one row with lat/lon information, and 1038 has two rows with lat/lon information, whereas for user 1037 the information is null in both rows.
What query can I use to achieve this result?
select distinct user_id
from table_name t
where not exists(
select 1 from table_name t1
where t.user_id = t1.user_id and
t1.latitude is not null and
t1.longitude is not null
)
You can read this query as: give me all users that do not have lat and long set to something other than null in any row of the table. In my opinion, EXISTS is preferred in such a case (here, NOT EXISTS) because even if a table scan is used (not the optimal way to find a row), it stops as soon as it finds a matching row (there is no need to count all the rows).
Read more about this topic: Exists Vs. Count(*) - The battle never ends... .
Try this, it should work.
SELECT user_id, count(latitude), count(longitude)
FROM user_loc
GROUP BY user_id HAVING count(latitude)=0 AND count(longitude)=0;
tested in MySQL.
Try:
SELECT * FROM user WHERE latitude IS NULL AND longitude IS NULL;
-- Edit --
2nd try (untested, but constructed from a query I have used before):
SELECT user_id,
       CASE WHEN MIN(latitude) IS NULL AND MAX(latitude) IS NULL THEN 1 ELSE 0 END AS noLatLong
FROM user
GROUP BY user_id
HAVING noLatLong = 1;
This works:
SELECT DISTINCT user_id
FROM table
WHERE latitude IS NULL
AND longitude IS NULL
AND NOT user_id IN
(SELECT DISTINCT user_id
FROM table
WHERE NOT latitude IS NULL
AND NOT longitude IS NULL)
result:
1037
(syntax validated with SQLite here)
BUT: Even though it does not use COUNT, my statement has to scan all of the table's rows, so Michał Powaga's statement is more efficient.
rationale:
get the list of user_ids that do have lat/lon records, to compare against (you want to EXCLUDE these from the final result) - optimization: use EXISTS here, as in the sketch below
get the list of user_ids without lat/lon records (the ones you're interested in)
reduce it by all IDs that exist in the first list - optimization: use EXISTS here as well
make the user_ids DISTINCT, because the example shows multiple entries per user_id (but you want just the unique IDs)
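A hedged sketch of that EXISTS optimization applied to the statement above (using the table_name placeholder from the first answer in this thread):
SELECT DISTINCT t.user_id
FROM table_name t
WHERE t.latitude IS NULL
  AND t.longitude IS NULL
  AND NOT EXISTS
      (SELECT 1
       FROM table_name t1
       WHERE t1.user_id = t.user_id
         AND t1.latitude IS NOT NULL
         AND t1.longitude IS NOT NULL);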