I'm working on an online registry which was created by a previous programmer. I have to fix a bunch of data integrity issues revolving around postal codes and cities. I am trying to do a large update query using data from our table of Canadian postal codes and our table of registrants, but the query seems to run forever on my development environment, and I'm not sure why.
Create Temporary Table RegistrantToChange AS (
SELECT
intID, vcCity, vcPostalCode
FROM
tblRegistrantWebsiteSignUps
WHERE
vcPostalCode NOT LIKE '00%' AND vcPostalCode != ''
AND (vcCity = '' OR vcCity = 'unspecified')
);
UPDATE RegistrantToChange, tblPostalCodes
SET
vcPostalCode = tblPostalCodes.PostalCode
WHERE
vcCity = tblPostalCodes.CityName;
Pardon the horrific and inconsistent naming. I just recently took over this project and am still in the process of refactoring the whole thing.
vcCity in your temporary table is not indexed, and if tblPostalCodes.CityName is not indexed either, then the JOIN in the UPDATE has a lot of work to do and may take some time.
I would suggest creating the temporary table first with an index on vcCity, then perform an INSERT...SELECT to populate it. Ensure that tblPostalCodes.CityName is indexed and then perform your update.
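For example, a sketch of that approach (the column types are guesses, and the index names are mine):

CREATE TEMPORARY TABLE RegistrantToChange (
    intID INT,
    vcCity VARCHAR(100),       -- type/length assumed
    vcPostalCode VARCHAR(10),  -- type/length assumed
    INDEX idx_city (vcCity)
);

INSERT INTO RegistrantToChange (intID, vcCity, vcPostalCode)
SELECT intID, vcCity, vcPostalCode
FROM tblRegistrantWebsiteSignUps
WHERE vcPostalCode NOT LIKE '00%' AND vcPostalCode != ''
    AND (vcCity = '' OR vcCity = 'unspecified');

-- and make sure the other side of the join is indexed as well:
CREATE INDEX idx_cityname ON tblPostalCodes (CityName);

With both join columns indexed, the UPDATE shouldn't need a full scan of tblPostalCodes for every row.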
Ok, strange one here
I have a database for customer data. My customers are businesses with their own customers.
I have 3000 tables (one for each business) with several thousand email addresses in each. Each table is identical, save the name.
I need to find a way to find where emails cross over between businesses (i.e. appear in multiple tables) and the name of the table that they sit in.
I have tried collating all entries and table names into one table and using a "group by", but the volume of data is too high to run this without our server keeling over...
Does anyone have a suggestion on how to accomplish this without running 3000 sets of joins?
Also, I cannot change the data structure AT ALL.
Thanks
EDIT: In response to those "helpful" restructure comments, not my database, not my system, I only started a couple of months ago to analyse the data
Multiple tables of identical structure almost never make sense; all it would take is a business field to fix this structure. If at all possible you should fix it. If it has been foisted upon you and you cannot change it, you can still work with it.
Select the distinct emails and the table name from each table, either with UNION ALL or by pulling them into a new table, then use GROUP BY and HAVING to find the emails that appear under more than one table.
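For example, the collation step might look like this (the table names are invented, and in practice you would generate the 3000 SELECTs from information_schema.tables rather than type them out):

CREATE TABLE Combined_Table AS
SELECT DISTINCT email, 'business1' AS source_table FROM business1
UNION ALL
SELECT DISTINCT email, 'business2' AS source_table FROM business2
-- ... one SELECT per business table
;

Then find the cross-over emails: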
SELECT email, GROUP_CONCAT(source_table) AS tables_found_in
FROM Combined_Table
GROUP BY email
HAVING COUNT(source_table) > 1
So, you say you can't change the data structure, but you might be able to provide a compatible upgrade.
Provide a new mega table:
CREATE TABLE business_email (
id_business INT(10) NOT NULL,
email VARCHAR(255) NOT NULL,
PRIMARY KEY (id_business, email) -- composite key: the same email may legitimately appear under several businesses
) ENGINE = MYISAM;
The MyISAM engine means you don't have to worry about transactions.
Add triggers to every single business table to duplicate the email into the new one. MySQL has no combined INSERT OR UPDATE trigger, so you need an AFTER INSERT trigger (below) plus a matching AFTER UPDATE one:
DELIMITER \\
CREATE TRIGGER TRG_COPY_EMAIL_BUSINESS1 AFTER INSERT ON business1 FOR EACH ROW
BEGIN
-- the business id is hard-coded per table, since the source tables
-- carry no business column of their own
INSERT INTO `business_email` (`id_business`, `email`) VALUES (1, NEW.`email`)
ON DUPLICATE KEY UPDATE `id_business` = `id_business`; -- no-op: the pair already exists
END;
\\
DELIMITER ;
The remaining problem is adding the trigger dynamically whenever a new table is created. That shouldn't be hard, since apparently there's already dynamic DDL in your application code.
Copy all existing data to the new table:
INSERT INTO `business_email` (`id_business`, `email`)
SELECT 1, email FROM business1
UNION
SELECT 2, email FROM business2
...
;
Proceed with your query on the new business_email table, which becomes greatly simplified:
SELECT `email`, GROUP_CONCAT(`id_business`) AS businesses
FROM `business_email`
GROUP BY `email`
HAVING COUNT(*) > 1;
This query should be easy to cope with. If not, please detail the issue; properly indexed tables shouldn't be a problem even for millions of rows (and I doubt you have that many, since we're talking about email addresses).
The advantage of this solution is that you stay up to date all the time, while you don't change the way your application works. You just add another layer to provide additional business value.
I am struggling to create a table that sets table parameters as well as creating the columns.
I am using MySQL server.
I require that the table meets the following criteria:
The table should be Called CUSTOMER with the columns CUST, LOCX, LOCY.
The column CUST will be a 1-up serial starting at 1001 and will be the primary key.
LOCX and LOCY will contain X and Y integers no greater than ±11, and will be foreign keys to other tables.
For info: I then intend to add my data to the table using the INSERT INTO function in a separate query that I already have.
Any direction on the construction of a query to create a table meeting the requirements above will be greatly appreciated
You can create a new table with a MySQL GUI if you are having trouble with the syntax.
These GUI tools usually provide a New Table button that lets you define your table without writing any code. They are often limited, but should be more than sufficient for your needs. There are one-month trials of the paid tools and even completely free GUIs, so you don't have to buy anything.
After that, use the following statement to retrieve "perfect" SQL from MySQL:
SHOW CREATE TABLE your_schema_name.your_table_name;
Do that a few times and study the output. Soon you will be able to write CREATE TABLE statements, including more complex column definitions, on your own. It will also make the MySQL documentation easier to follow, which can be confusing and somewhat intimidating in its completeness for beginners.
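For this specific table, a first attempt might look like the following sketch. The parent tables for the foreign keys aren't named in the question, so loc_x_ref and loc_y_ref are placeholders, and note that MySQL only enforces CHECK constraints from 8.0.16 onwards:

CREATE TABLE CUSTOMER (
    CUST INT NOT NULL AUTO_INCREMENT,
    LOCX INT NOT NULL,
    LOCY INT NOT NULL,
    PRIMARY KEY (CUST),
    CHECK (LOCX BETWEEN -11 AND 11),  -- enforced only on MySQL >= 8.0.16
    CHECK (LOCY BETWEEN -11 AND 11),
    FOREIGN KEY (LOCX) REFERENCES loc_x_ref (x_value),  -- placeholder parent table/column
    FOREIGN KEY (LOCY) REFERENCES loc_y_ref (y_value)   -- placeholder parent table/column
) ENGINE = InnoDB AUTO_INCREMENT = 1001;  -- the serial starts at 1001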
A simplified version of my MySQL db looks like this:
Table books (ENGINE=MyISAM)
id <- KEY
publisher <- LONGTEXT
publisher_id <- INT <- This is a new field that is currently null for all records
Table publishers (ENGINE=MyISAM)
id <- KEY
name <- LONGTEXT
Currently books.publisher holds values that keep getting repeated, whereas publishers.name holds each value exactly once.
I want to get rid of books.publisher and instead populate the books.publisher_id field.
The straightforward SQL code that describes what I want done, is as follows:
UPDATE books
JOIN publishers ON books.publisher = publishers.name
SET books.publisher_id = publishers.id;
The problem is that I have a big number of records, and even though it works, it's taking forever.
Is there a faster solution than using something like this in advance?:
CREATE INDEX publisher ON books (publisher(20));
Your question title says ".. optimize ... query without using an index?"
What have you got against using an index?
You should always examine the execution plan if a query is running slowly. I would guess it's having to scan the publishers table for each row in order to find a match. It would make sense to have an index on publishers.name to speed the lookup of an id.
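For example, a prefix index (a LONGTEXT column can't be indexed in full; the index name and the 20-character prefix are my choice):

CREATE INDEX idx_publishers_name ON publishers (name(20));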
You can drop the index later, but it wouldn't harm to leave it in, since you say the process will have to run for a while until other changes are made. I imagine the publishers table doesn't get updated very frequently, so INSERT and UPDATE performance on that table should not be an issue.
There are a few problems here that might be helped by optimization.
First of all, a few thousand rows doesn't count as "big" ... that's "medium."
Second, in MySQL saying "I want to do this without indexes" is like saying "I want to drive my car to New York City, but my tires are flat and I don't want to pump them up. What's the best route to New York if I'm driving on my rims?"
Third, you're using a LONGTEXT item for your publisher. Is there some reason not to use a fully indexable datatype like VARCHAR(200)? If you do that your WHERE statement will run faster, index or none. Large scale library catalog systems limit the length of the publisher field, so your system can too.
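For example, a sketch of that change (200 is an arbitrary but generous length):

ALTER TABLE books MODIFY publisher VARCHAR(200);
ALTER TABLE publishers MODIFY name VARCHAR(200);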
Fourth, from one of your comments this looks like a routine data maintenance update, not a one-time conversion. So you need to figure out how to avoid repeating the whole deal over and over. I am guessing here, but it looks like newly inserted rows in your books table get a publisher_id of zero (if they default to NULL instead, use publisher_id IS NULL below), and your query updates that column to a valid value.
So here's what to do. First, put an index on books.publisher_id.
Second, run this variant of your maintenance query (MySQL doesn't allow LIMIT in a multi-table UPDATE, so the publisher lookup moves into a correlated subquery):
UPDATE books
SET publisher_id = (SELECT id FROM publishers WHERE name = books.publisher)
WHERE publisher_id = 0
LIMIT 100;
This will limit your update to rows that haven't yet been updated, and it will update 100 rows at a time. In your weekly data-maintenance job, re-issue this query until MySQL announces that it affected zero rows (look at mysqli::$affected_rows or the equivalent in your PHP-to-MySQL interface). That's a great way to monitor database update progress and keep your update operations from getting out of hand.
Your update query has invalid syntax but you can fix that later. The way to get it to run faster is to add a where clause so that you are only updating the necessary records.
I am a bit rusty with MySQL and trying to jump in again, so sorry if this is too easy a question.
I basically created a data model that has a table called "Master" with required fields of a name and an IDcode, and then a "Details" table with a foreign key of IDcode.
Now here's where it's getting tricky. I am entering:
INSERT INTO Details (Name, UpdateDate) Values (name, updateDate)
I get an error saying IDcode on Details doesn't have a default value; so I add one, and then it complains that field 'Master_IDcode' doesn't have a default value.
It all makes sense, but I'm wondering if there's an easy way to do what I am trying to do. I want to add data into Details, and if no IDcode exists, I want to add an entry into the Master table. The problem is I have to first add the name to Master, wait for a unique ID to be generated (for IDcode), then figure that out and include it when I insert the Details row. As you can imagine, the queries are probably going to get quite long, since I have many tables.
Is there an easier way, where every time I add something it searches by name whether a foreign key exists, and if not adds it to all the tables it's linked to? Is there a standard way people do this? I can't imagine that with all the complex databases out there, people have not figured out an easier way.
Sorry if this question doesn't make sense. I can add more information if needed.
P.S. This may be a different question, but I have heard of Django for Python and that it helps create queries. Would it help my situation?
Thanks so much in advance :-)
(decided to expand on the comments above and put it into an answer)
I suggest creating a set of staging tables in your database (one for each data set/file).
Then use LOAD DATA INFILE (or insert the rows in batches) into those staging tables.
Make sure you drop indexes before the load, and re-create what you need after the data is loaded.
You can then make a single pass over the staging table to create the missing master records. For example, let's say that one of your staging tables contains a country code that should be used as a master ID. You could add the missing master records by doing something along the lines of:
insert
into master_table(country_code)
select distinct s.country_code
from staging_table s
left join master_table m on(s.country_code = m.country_code)
where m.country_code is null;
Then you can proceed and insert the rows into the "real" tables, knowing that all detail rows reference a valid master record.
If you need to get reference information along with the data (such as translating some code) you can do this with a simple join. Also, if you want to filter rows by some other table this is now also very easy.
insert
into real_table_x(
`key`
,colA
,colB
,colC
,computed_column_not_present_in_staging_table
,understandableCode
)
select x.`key`
,x.colA
,x.colB
,x.colC
,(x.colA + x.colB) / x.colC
,c.understandableCode
from staging_table_x x
join code_translation c on(x.strange_code = c.strange_code);
This approach is a very efficient one and it scales very nicely. Variations of the above are commonly used in the ETL part of data warehouses to load massive amounts of data.
One caveat with MySQL is that it doesn't support hash joins, a join mechanism well suited to fully joining two tables. MySQL uses nested loops instead, which means that you need to index the join columns very carefully.
InnoDB tables with their clustering feature on the primary key can help to make this a bit more efficient.
One last point. When you have the staging data inside the database, it is easy to add some analysis of the data and put aside "bad" rows in a separate table. You can then inspect the data using SQL instead of wading through csv files in your editor.
I don't think there's a one-step way to do this.
What I do is issue a
INSERT IGNORE INTO master (..) VALUES (..)
against the master table, which will either create the row if it doesn't exist or do nothing, and then issue a
SELECT id FROM master WHERE someUniqueAttribute = ..
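With the tables from your question, that might look something like this ('Some Fund' is a made-up value, and the pattern assumes Name has a UNIQUE index so that IGNORE has something to trip on):

INSERT IGNORE INTO Master (Name) VALUES ('Some Fund');  -- requires a UNIQUE index on Name
SELECT IDcode FROM Master WHERE Name = 'Some Fund';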
The other option would be stored procedures/triggers, but they are still pretty new in MySQL and I doubt whether this would help performance.
I have a requirement to store all versions of an entity in an easily indexed way and was wondering if anyone has input on what system to use.
Without versioning the system is simply a relational database with a row per, for example, person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database this would come for free, and I would be able to ask 'what is the state of all people as of yesterday at 2pm living in Dublin and aged 30'. Unfortunately there don't seem to be any mature open source projects that can do temporal.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields but only one changing per update. It is also then quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron here's the query we currently use (in mysql). It's definitely slow on our table with >200k rows. (id = table key, person_id = id per person, duplicated if the person has many revisions)
select name
from person p
where p.id = (select max(id)
              from person
              where person_id = p.person_id
                and timestamp <= :timestamp)
Update
It looks like the best way to do this is with a temporal db but given that there aren't any open source ones out there the next best method is to store a new row per update. The only problem is duplication of unchanged columns and a slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
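A minimal sketch of that first approach (the column names are mine):

CREATE TABLE person_history (
    person_id INT NOT NULL,
    version INT NOT NULL,
    name VARCHAR(100),
    created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (person_id, version)
);

-- state of one person as of a point in time:
SELECT name
FROM person_history
WHERE person_id = 42
  AND version = (SELECT MAX(version)
                 FROM person_history
                 WHERE person_id = 42
                   AND created <= :timestamp);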
The next approach uses links in the database. Instead of a composite key, you give every row its own key, and you add a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
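And a sketch of the linked variant (again the names are mine; ids 42 and 17 are purely illustrative):

CREATE TABLE person_linked (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id INT NOT NULL,        -- stable identity across versions
    name VARCHAR(100),
    created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    replacedBy INT NULL,           -- key of the next version; NULL = current
    previousVersion INT NULL       -- back pointer to the prior version
);

-- current version of everyone:
SELECT * FROM person_linked WHERE replacedBy IS NULL;

-- publishing a new version: insert first, then link the old row using the id
-- that LAST_INSERT_ID() hands back:
INSERT INTO person_linked (person_id, name, previousVersion) VALUES (42, 'New name', 17);
UPDATE person_linked SET replacedBy = LAST_INSERT_ID() WHERE id = 17;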
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!