Compare MANY columns in MySQL? - mysql

I have a table with some denormalized data for a specific purpose (don't ask), so it has several hundred columns. There is a primary key.
This table is updated weekly, but most IDs will have the same data as the week before.
Now, I need to store all record versions in a history table, i.e. if record with id X is added week N, no changes week N+1 but some data changed week N+2 and N+3, then the history table should contain three records: Those from weeks N, N+2 and N+3.
It's technically easy to write the appropriate insert query, but it would involve comparison of each column, so it will be a very long SQL query. I'm sure it would work, but...
Is there any way in MySQL to compare ALL columns without explicitly writing ...or t1.col1 <> t2.col1... for each column? I.e. something like ...t1.allcolumns <> t2.allcolumns..., like comparing the entire row in one go?
I'm pretty sure the answer is no, but... :-)

You can write a program (in your favourite programming language) to build the query. The program would look in the schema for the database, find all the columns of the table, and construct the query from that. I don't think it is possible to do that in plain SQL, but even if possible, plain SQL is probably the wrong tool.
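A minimal sketch of such a generator, in Python. It assumes the column names have already been fetched, for example with SELECT COLUMN_NAME FROM information_schema.COLUMNS for the table in question; the function name and the t1/t2 aliases are made up for illustration. It uses MySQL's NULL-safe equality operator <=> so that two NULLs count as "unchanged":

```python
def build_change_filter(columns, new_alias="t1", old_alias="t2"):
    """Return a WHERE-clause fragment that is true when any listed column differs.

    MySQL's <=> operator treats NULL <=> NULL as true, so a column that is
    NULL in both rows is not reported as a change.
    """
    if not columns:
        raise ValueError("need at least one column to compare")
    parts = [
        f"NOT ({new_alias}.`{col}` <=> {old_alias}.`{col}`)"
        for col in columns
    ]
    return "\n   OR ".join(parts)


if __name__ == "__main__":
    # Hypothetical column names; in practice they would come from
    # information_schema.COLUMNS, ordered by ORDINAL_POSITION.
    print(build_change_filter(["col1", "col2", "col3"]))
```

The generated fragment can be pasted into the insert query's WHERE clause, so the long comparison never has to be maintained by hand.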

You can use the row-values syntax, but you still have to name all columns:
(t1.col1, t1.col2, ...) <> (t2.col1, t2.col2, ...)

Update 1
Check this out: https://www.techonthenet.com/mysql/intersect.php
Intersect t1 and t2. Result = rows on both tables.
Select all rows from t1 that are not in your intersect result.
Sorry for the lack of code, I don't have time to elaborate, but that's the idea.
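A rough illustration of that idea, run against an in-memory SQLite database as a stand-in for MySQL. "Rows of t1 that are not in the intersection" is the same set that EXCEPT returns directly, so the sketch uses EXCEPT. The table names this_week and last_week are made up, and whether INTERSECT/EXCEPT are available in MySQL depends on the server version (as far as I know they arrived in 8.0.31; older versions need the emulation described on the linked page).

```python
import sqlite3


def changed_rows(conn):
    """Rows of this_week that have no fully identical row in last_week."""
    # EXCEPT compares whole rows, so no column has to be named,
    # and it treats two NULLs in the same column as equal.
    return conn.execute(
        "SELECT * FROM this_week EXCEPT SELECT * FROM last_week ORDER BY id"
    ).fetchall()


def except_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE last_week (id INTEGER PRIMARY KEY, a TEXT, b TEXT);
        CREATE TABLE this_week (id INTEGER PRIMARY KEY, a TEXT, b TEXT);
        INSERT INTO last_week VALUES (1, 'x', 'y'), (2, 'p', NULL);
        INSERT INTO this_week VALUES (1, 'x', 'y'), (2, 'p', 'q'), (3, 'n', 'm');
    """)
    return changed_rows(conn)


if __name__ == "__main__":
    # id 1 is unchanged, id 2 changed, id 3 is new.
    print(except_demo())
```

Both tables must have the same columns in the same order for this to work.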

Related

MS Access SQL Unequal join for 3 or more tables

I'm thinking of switching to using temp tables and vba.
Here is what I want to do. I have multiple tables whose fields may or may not hold items with a one-to-many or one-to-one relationship to items in the other tables. I know what those relationships are (and will create multiple queries accordingly). What I'm hunting for is each value that DOES NOT EXIST in any other table. To make an example:
Say we have 3 single column tables, table 1 is {x, y, z}, table 2 is {a, x, z}, and table 3 is {a,b,x,y,z}, the result will be b for t3 (yes I need to know where the error is). Pretty much, I want to use the unequal wizard but for 3 or more tables.
I may want to look for any item that exists in some but not all other tables. If you want to speak on that, it would be helpful, but I think that is strictly in the vba realm.
I think the challenge here is the open-endedness of the problem you are trying to solve. Varying column names, table names, and uniqueness thresholds across all tables would make it a bit more difficult. In the way I show below, I don't think it would be the most efficient, query-wise, but would be relatively easy to script. The following code assumes values in the tables are unique within each table.
There are 3 queries total:
qry_001_TableValues_ALL
SELECT Table1.MyValue, "Table1" AS Source
FROM Table1
UNION
SELECT Table2.MyValue, "Table2" AS Source
FROM Table2
UNION SELECT Table3.MyValue, "Table3" AS Source
FROM Table3;
qry_002_TableValues_Unique:
SELECT qry_001_TableValues_ALL.MyValue
FROM qry_001_TableValues_ALL
GROUP BY qry_001_TableValues_ALL.MyValue
HAVING (((Count(qry_001_TableValues_ALL.MyValue))=1));
qry_003_TableValues_UniqueWithSource:
SELECT qry_002_TableValues_Unique.MyValue, qry_001_TableValues_ALL.Source
FROM qry_002_TableValues_Unique INNER JOIN qry_001_TableValues_ALL
ON qry_002_TableValues_Unique.MyValue = qry_001_TableValues_ALL.MyValue;
The first query is the one you would need to script out if the columns or tables changed. It looks across all tables and builds a unique list of (value, source table) pairs from the specified field. The second query aggregates that list and keeps only the values with a count of 1, meaning that across all tables involved there is only one instance of each value returned. You can script a change to the HAVING clause here to find values that appear in exactly x tables. The final query joins those values back against the first query to look up the source table name; it is the one you run to get the final report of the values you are looking for and where they reside.
Hope this is in the ballpark of what you are trying to do.
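To check the logic on the example from the question, here is a small sketch that runs the same three steps as one statement against an in-memory SQLite database (standing in for Access; the CTE names all_values and unique_values are made up and play the role of qry_001 and qry_002):

```python
import sqlite3

QUERY = """
WITH all_values AS (              -- qry_001: every (value, source) pair
    SELECT MyValue, 'Table1' AS Source FROM Table1
    UNION
    SELECT MyValue, 'Table2' AS Source FROM Table2
    UNION
    SELECT MyValue, 'Table3' AS Source FROM Table3
),
unique_values AS (                -- qry_002: values seen in exactly one table
    SELECT MyValue FROM all_values GROUP BY MyValue HAVING COUNT(*) = 1
)
SELECT u.MyValue, a.Source        -- qry_003: attach the source table
FROM unique_values AS u
JOIN all_values AS a ON a.MyValue = u.MyValue
ORDER BY u.MyValue
"""


def values_in_one_table_only():
    conn = sqlite3.connect(":memory:")
    # The sample data from the question: {x,y,z}, {a,x,z}, {a,b,x,y,z}.
    tables = {"Table1": "xyz", "Table2": "axz", "Table3": "abxyz"}
    for name, values in tables.items():
        conn.execute(f"CREATE TABLE {name} (MyValue TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(v,) for v in values])
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    print(values_in_one_table_only())
```

With the sample data only b survives the HAVING filter, and it is reported with Table3 as its source.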

optimizing particular query mysql

So I've been searching for a solution and reading books, and haven't been able to figure it out. The question is rather simple. I have 2 tables:
table_1: "chromosome" and "position", both of them being integers.
table_2: "chromosome", "start" and "end", all being integers as well.
I want a query that gives me back all rows from table_1 that are between the start and end of table_2. The query looks like this:
SELECT
table_1.*
FROM
table_1,
table_2
WHERE
table_1.chromosome = table_2.chromosome
AND table_1.position > table_2.start
AND table_1.position < table_2.end;
So this query works fine, but my tables have many millions of rows (7092713 and 215909 respectively). I indexed (chromosome, position) on table_1 and (chromosome, start, end) on table_2. The weird part is that if I do the query one by one (perl DBI, one statement for every row of table_2), it runs a lot faster. I'm not sure where I am going wrong.
Any help would be appreciated.
Jorge Kageyama
For the sake of clarity, let's start by recasting your query using the standard JOIN syntax. The query is equivalent but easier to read.
SELECT table_1.*
FROM table_1
JOIN table_2 ON ( table_1.chromosome = table_2.chromosome
AND table_1.position > table_2.start
AND table_1.position < table_2.end)
Second, it's smart when searching large tables (or any tables for that matter) to avoid * in your SELECT clauses. Using * denies useful data to the optimizer about what you do, or don't, need in your result set. So let us say
SELECT table_1.chromosome, table_1.position
for SELECT.
So, it becomes clear that your result set, and your join, need chromosome and position, and nothing else, from your larger table. Try creating a compound BTREE index on that table, as follows.
CREATE INDEX idx_t1_chr_pos ON table_1 (chromosome, position) USING BTREE;  -- the index name is arbitrary
Similarly, try creating an index on table_2 as follows.
CREATE INDEX idx_t2_chr_start_end ON table_2 (chromosome, start, end) USING BTREE;  -- the index name is arbitrary
These are called covering indexes. They contain enough columns that the query can be satisfied from the index without having to bounce back to the original table.
BTREE indexes (the default by the way) are inherently ordered. Appropriate records in table_1 can be found by range scans on the index starting with (chromosome,start) and ending with (chromosome,end).
Third, it's possible you're getting a massive combinatorial explosion of rows from table_1 in your result set. You'll get a row for every combination of rows in the two tables that matches your ON() clause. It's hard to know whether that's the case without knowing a lot about your data.
You could try to reduce that combinatorial explosion using
SELECT DISTINCT table_1.chromosome, table_1.position
Give this a try. If you're still not getting anywhere, maybe another question with complete table definitions and the results of EXPLAIN will be helpful.
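A toy-sized sketch of the duplicate-row effect, using an in-memory SQLite database rather than MySQL (so it says nothing about MySQL's query plans or timings). The data and index names are invented; "start" and "end" are quoted because they are keywords in SQLite. One position that falls inside two overlapping ranges comes back twice from the plain join and once with DISTINCT:

```python
import sqlite3

JOIN_SQL = """
SELECT {distinct} t1.chromosome, t1.position
FROM table_1 AS t1
JOIN table_2 AS t2
  ON t1.chromosome = t2.chromosome
 AND t1.position > t2."start"
 AND t1.position < t2."end"
ORDER BY t1.chromosome, t1.position
"""


def run_join(distinct):
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table_1 (chromosome INTEGER, position INTEGER);
        CREATE TABLE table_2 (chromosome INTEGER, "start" INTEGER, "end" INTEGER);
        CREATE INDEX idx_t1_chr_pos ON table_1 (chromosome, position);
        CREATE INDEX idx_t2_chr_start_end ON table_2 (chromosome, "start", "end");
        INSERT INTO table_1 VALUES (1, 5), (1, 15), (2, 7);
        -- Two overlapping ranges on chromosome 1, one range on chromosome 2.
        INSERT INTO table_2 VALUES (1, 1, 10), (1, 3, 8), (2, 10, 20);
    """)
    sql = JOIN_SQL.format(distinct="DISTINCT" if distinct else "")
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(run_join(distinct=False))  # position 5 matches both ranges
    print(run_join(distinct=True))
```

On real data, running EXPLAIN on the MySQL query is still the way to see whether the indexes are actually used.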
Interesting question. Without knowing more about the quantities contained in "position," I would still approach it generally in this way:
Select by position from table_1 (with about 7 million rows) so that each intermediate result is a smaller bin of data. Let's say, for instance, that "position" takes discrete integer values from 2 to 9. Select from table_1 where position is equal to 2, then select from table_2 where "start" is less than 2 and "end" is greater than 2. Repeat this selection 8 times, once per position value, adding the results to a new table_3.
I am assuming here that table_2 is unique on chromosome, and table_1 is not. Therefore, you end up with chromosomes that could have multiple positions within the same range (a chromosome has one range, but can appear anywhere within that range). You also, then, can't tell how large the resulting join table is going to be, but it could be quite large, as each of the 7 million rows in table_1 could fall within all ranges in table_2.
Iterating would allow you to "grow" your results while observing the quality at each point experimentally before committing to the entire loop.
Here is an idea of the query I have in mind (untested):
SELECT table_1.chromosome, table_1.position, table_2.start, table_2.end
FROM
(SELECT table_1.chromosome, table_1.position
FROM table_1 WHERE table_1.position = 2) AS table_1
JOIN
(SELECT table_2.chromosome, table_2.start, table_2.end
FROM table_2 WHERE table_2.start < 2 AND table_2.end > 2) AS table_2
ON
table_1.chromosome = table_2.chromosome;
Good luck, and I hope you find your answer!

SQL. Performance of count on many columns

I need to get the number of distinct values in every column of a table. So I wonder whether a query like
select count(col1), count(col2),.., count(colN) from table;
will scan the whole table N times to get all these counts. If so, would it be better to use whatever objects/procedures the specific DBMS offers to create an array 1..N holding the count for every column, then loop over the table records and increment the array elements?
I understand that this depends entirely on the DBMS implementation, so I'd like to know the answer specifically for MySQL (but information about other popular systems is interesting too).
You will need to do:
select count(distinct col1),
count(distinct col2),
...
from table;
and the database should just do a single full-table scan to calculate this.
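As a small sketch, the query can be generated for any column list and checked on toy data. The helper name, the n_ prefix for the result aliases, and the sample table are made up, and SQLite is used here only as a convenient stand-in; it shows the result of the query, not how many scans MySQL performs.

```python
import sqlite3


def distinct_count_query(table, columns):
    """Build one SELECT that counts the distinct values of every listed column."""
    cols = ", ".join(f"COUNT(DISTINCT {c}) AS n_{c}" for c in columns)
    return f"SELECT {cols} FROM {table}"


def distinct_count_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [("a", "x"), ("a", "y"), ("b", "y"), ("c", None)],
    )
    # COUNT(DISTINCT ...) ignores NULLs, so col2 has only x and y.
    return conn.execute(distinct_count_query("t", ["col1", "col2"])).fetchone()


if __name__ == "__main__":
    print(distinct_count_query("t", ["col1", "col2"]))
    print(distinct_count_demo())
```

Note that plain count(col) counts non-NULL rows, not distinct values, which is why the DISTINCT keyword is needed.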

JOIN - Is it a good practice to add table name before column name?

In an Oracle book I read that when we perform a SELECT joining 2 or more tables, the SELECT works faster if the table name is used before the column name.
Eg:
SELECT table1.name, table1.dob.... instead of SELECT name, dob....
Is it the same way in MySQL?
EDIT
I know that this is good practice when there are identical field names. What I was wondering about was the performance point of view, even when there are no identical field names.
I dunno about performance, but it is a good practice, especially when joining tables. Joined tables could have for example identical field names, and the query will then fail. You can also use aliases if your table names are too long:
SELECT t1.name, t2.dob FROM table1 t1 JOIN table2 t2 ON ...
From the efficiency point of view, Oracle and MySQL compile the SQL to an internal representation before executing it, so I don't think there should be a significant difference in execution time; if the tables are not specified, the database will work them out from the field names. Any time difference will be at compilation time, when it deduces the table for each field.
In fact, I personally doubt the fact that Oracle executes faster if the table names are specified!
It's not good practice when the field names are the same, it's good practice all the time. It's not about efficiency, it's about your query not breaking when someone else adds fields to one of the join'd tables with overlapping names, so your stuff works in six months time, not just today.
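A short sketch of that failure mode, using an in-memory SQLite database as a stand-in (MySQL reports a similar "ambiguous" error, with different wording). The tables and rows are invented: both tables have a name column, so the unqualified query is rejected while the qualified one keeps working.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (id INTEGER, name TEXT);
        CREATE TABLE table2 (id INTEGER, name TEXT, dob TEXT);
        INSERT INTO table1 VALUES (1, 'Ada');
        INSERT INTO table2 VALUES (1, 'Lovelace', '1815-12-10');
    """)
    return conn


def unqualified_error(conn):
    """Return the error message raised by the unqualified query, or None."""
    try:
        conn.execute(
            "SELECT name, dob FROM table1 JOIN table2 ON table1.id = table2.id"
        )
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


def qualified_rows(conn):
    """The same query with every column qualified by a table alias."""
    return conn.execute(
        "SELECT t1.name, t2.dob FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id"
    ).fetchall()


if __name__ == "__main__":
    db = make_db()
    print(unqualified_error(db))
    print(qualified_rows(db))
```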

Select columns by numbers - SELECT [m:n] FROM table_name

I have a database with many columns and sometimes I need to select quite a few.
Selecting all columns would be too much data. So let's say that:
DESC table_name
gives ordered column names, for example (A,B,C,D,E,F,G,H,I,J....). Is it possible that instead:
SELECT C,D,E,F FROM table_name;
I do something like this:
SELECT [3:6] FROM table_name
I know it makes no difference in this example, but I need to select over 40 columns with long names.
No, you can't SELECT [3:6] FROM table_name. What do you think this is, some kind of modern computer language with sequences and ranges as first-class data types? :-) This is SQL.
You can, as a commenter pointed out, fetch the names of the columns in the table and then programmatically generate your SQL queries. This is, of course, something a bunch of different data-access-object packages do automatically.
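A minimal sketch of that approach. It runs against SQLite, where PRAGMA table_info lists the columns in table order; for MySQL the names would come from DESC table_name or information_schema.COLUMNS instead. The helper names are made up, and the range is 1-based and inclusive to mirror the [3:6] in the question.

```python
import sqlite3


def select_by_position(conn, table, first, last):
    """Build a SELECT for columns first..last (1-based, inclusive)."""
    # PRAGMA table_info returns one row per column; index 1 is the name.
    names = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if not 1 <= first <= last <= len(names):
        raise ValueError("column range out of bounds")
    chosen = ", ".join(names[first - 1:last])
    return f"SELECT {chosen} FROM {table}"


def make_wide_table():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_name (A, B, C, D, E, F, G, H, I, J)")
    return conn


if __name__ == "__main__":
    print(select_by_position(make_wide_table(), "table_name", 3, 6))
```

The generated string is ordinary SQL, so it can be passed straight back to the database.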