Load a pivoted CSV file to MySQL database

I have a csv file with identifying and date information followed by several dozen data columns (sample below).
In the spirit of relational DBs (and also to avoid creating a table with 100+ columns), it seems preferable to load the dataset after pivoting it, so that the names of the data columns become row entries instead (sample below; the data values aren't consistent with the first table and are included only to demonstrate the desired layout).
It occurs to me that I could load the data into a placeholder table in MySQL, pivot within MySQL, and insert into a new table, but I wonder if there's a more efficient way to do this.

If you want to do this with MySQL, loading the data into an intermediate table and then unpivoting it into the target table is the relevant strategy.
To load the file into the staging table, you can use the LOAD DATA INFILE syntax.
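A minimal sketch of the load step, assuming a comma-separated file with a header row and a staging table named mytable (adjust the path, delimiters, and table name to your setup; use LOAD DATA LOCAL INFILE instead if the file lives on the client side):
LOAD DATA INFILE '/path/to/data.csv'
INTO TABLE mytable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES; -- skip the header row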
Then, you can unpivot with union all:
insert into targettable (
ticker, dimension, item, calendardate, datekey, reportperiod, astupdated, value
)
select ticker, dimension, 'accoci', calendardate, datekey, reportperiod, astupdated, accoci
from mytable
union all
select ticker, dimension, 'assets', calendardate, datekey, reportperiod, astupdated, assets
from mytable
union all
select ticker, dimension, 'assetsavg', calendardate, datekey, reportperiod, astupdated, assetsavg
from mytable
...
Disclaimer: I am not convinced that unpivoting this dataset will ultimately make things better or easier. That depends on how you want to consume it, which you did not describe.

Related

How to apply transformation on the lookup target during lookup?

My source task has the following columns read from a csv file:
Sale_id, sale_date, order_no, sale_amt
This is followed by a lookup task that looks into the sales sql table (having the same column names); the join is on the order_no column.
The issue is that the order_no data in the sql sale table has values like 'ABC-12345' and 'WXYZ-32111' (there are a couple of characters prepended to the order number),
whereas in the csv there is '12345' without any characters prepended.
Hence I cannot do a lookup, as there is no direct match. Is there any way to remove the characters and the hyphen from the sale sql table data (temporarily) for performing the lookup join?
1st Data Flow Task - use a Flat File Source for the csv data, and an OLE DB Source via SQL Command with the following query to feed the Lookup:
select Sale_id, sale_date, order_no, sale_amt,
    substring(order_no, charindex('-', order_no) + 1, len(order_no)) as [key]
from dbo.Sales -- your sql sale table
Use [Key] for your lookup transformation.
The expression strips everything up to and including the hyphen, leaving the numeric values you are looking for (note the + 1: without it, the substring would start at the hyphen itself).
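A quick sanity check of that expression against a literal value (hypothetical, T-SQL):
select substring('ABC-12345', charindex('-', 'ABC-12345') + 1, len('ABC-12345')) as [key];
-- returns 12345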
Problem restatement
In your source, you have order numbers coming in that look like numbers. You need to be able to look up against a table that has a text string prepended to each order number. We can assume the numeric part of the database's order number is unique.
Setup
I created a simplified version of your table and populated it with data:
DROP TABLE IF EXISTS dbo.so_66446302;
CREATE TABLE dbo.so_66446302
(
sales_id int
, order_no varchar(20)
);
INSERT INTO dbo.so_66446302
(sales_id, order_no)
VALUES
(1, 'ABC-12345')
, (2, 'WXYZ-32111')
, (3, 'REALLY_LONG-654321');
A critical piece in using Lookup components is getting the data types to agree. I'm going to assume that the order number from the file is a DT_STR and not an integer.
By default, people pick a table on the Lookup component's Connection tab (dbo.so_66446302 here), but if you choose "Use results of an SQL query", you'll have what you're looking for.
This is similar to the query Jatin shows above, but I find "showing my work" along the way helps me debug when things go sideways. In this query, that's why I have the intermediate cross apply steps.
SELECT
S.*
, D0.order_no_as_string_digits
FROM
dbo.so_66446302 AS S
CROSS APPLY
(
SELECT
-- Length of the string less where we find the first dash
LEN(S.order_no) - CHARINDEX('-', S.order_no, 1)
)D(dash_location)
CROSS APPLY
(
SELECT RIGHT(S.order_no, D.dash_location)
)D0 (order_no_as_string_digits);
The results of that query are:
sales_id  order_no            order_no_as_string_digits
--------  ------------------  -------------------------
1         ABC-12345           12345
2         WXYZ-32111          32111
3         REALLY_LONG-654321  654321
Now you can match the derived order number in the database to the one in your file by dragging the columns together. Check any/all columns that you need to retrieve from the database and send the data to the intended destination.

Can you list data from two identical tables, one after the other in a select statement?

I have two tables that are structured identically, but one is a log of older data and one tracks newer data entries. They share the same structure, but the data entries are different (one holds data before date x, one holds data after date x).
My question is, is there a select statement (using MySQL) that can select all the data from both tables and list it as though it is one table? Essentially just listing the contents from table A, then listing the contents from table B in the same columns as though they are one table.
I could create another table that does this, but that would involve doubling the data size which isn't a scalable solution.
Thanks for any time you take on this!
Yes: use UNION.
SELECT * FROM LOG UNION SELECT * FROM LOG_OLD;
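One caveat, in case it matters here: plain UNION removes duplicate rows across the two SELECTs. Since the goal is simply to list the contents of one table after the other, UNION ALL is likely the better fit; it keeps every row and skips the deduplication pass:
SELECT * FROM LOG
UNION ALL
SELECT * FROM LOG_OLD;
Add an ORDER BY on a shared column at the end if you need a global ordering rather than table-by-table output.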

Display the contents of a VIEW in MySQL

This is actually a two part question.
First: I was wondering if there is a way to display the information in a view I just created. I couldn't find anything online similar to the SHOW TABLES statement that could be used for views.
The query to create my view is:
CREATE VIEW View1 AS
SELECT *
FROM CustOrder
WHERE shipToName = 'Jim Bob'
Secondly, once I find out how to display that specific view from above, how do I go about finding the highest "paidPrice" (a column in the CustOrder table)?
Thank you all in advance!
In answer to Question 1:
SHOW CREATE VIEW view_name;
Note that this displays the statement that defines the view rather than its rows; to list the rows, SELECT from the view as shown in the next answer.
Reference: http://dev.mysql.com/doc/refman/5.7/en/show-create-view.html
A view is little more than a stored SELECT statement, but from the perspective of the client, they are mostly equivalent to real tables. To interact with a view you have created, you may simply issue SELECT statements against it.
-- Regular SELECT statements, no different
-- from querying against a real table.
-- Get all rows from the view
SELECT * FROM View1
-- Get the MAX() value from a column
SELECT MAX(paidPrice) AS maxprice FROM View1
You may also create views which represent multiple joined tables. This is a common use case, wherein many tables are frequently joined for querying. You may use a view to handle the joins, and expose only certain columns to certain database users rather than grant full access to your schema.
CREATE VIEW joinview AS (
SELECT
t1.id,
t1.col1,
t1.col2,
-- The view will only expose the alias
t1.col3 AS aliased_name,
-- Use an alias to avoid a column name collision
t2.col1 AS t2c1,
-- The view will expose the column name without the table name
t2.col99
FROM
t1 INNER JOIN t2 ON t1.id = t2.t1_id
);
Now the view will only expose columns as a SELECT query would. You will no longer need to reference the individual tables, since it produces a flat output.
-- Retrieve 2 columns from the join view
SELECT col99, aliased_name FROM joinview
Finally, because views act just like normal tables, you can join them to other tables or views too. Take care when assembling views with joins though, to be sure that the underlying tables are appropriately indexed. Otherwise, the views may perform poorly (just as they would for normal SELECT queries executed without appropriate indexing).
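For instance (t3 and its columns are hypothetical, just to illustrate the join):
-- A view can sit on either side of a join, like any table
SELECT jv.aliased_name, t3.status
FROM joinview AS jv
INNER JOIN t3 ON t3.t1_id = jv.id;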

Sum of counts from multiple tables

I have a dozen tables with the same structure. All of their names match question_20%. Each table has an indexed column named loaded, which can have values of 0 and 1.
I want to count all of the records where loaded = 1. If I had only one table, I would run select count(*) from question_2015 where loaded = 1.
Is there a query I can run that finds the tables in INFORMATION_SCHEMA.TABLES, sums over all of these counts, and produces a single output?
You can do what you want with dynamic SQL.
However, you have a problem with your data structure. Having multiple parallel tables is usually a very bad idea. SQL supports very large tables, so having all the information in one table is a great convenience, from the perspective of querying (as you are now learning) and maintainability.
SQL offers indexes and partitioning schemes for addressing performance issues on large tables.
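For instance, the per-year tables could be collapsed into one partitioned table; a minimal sketch with hypothetical columns (MySQL requires the partitioning column to appear in every unique key, hence the composite primary key):
CREATE TABLE question (
  id INT NOT NULL,
  yr SMALLINT NOT NULL,
  loaded TINYINT NOT NULL DEFAULT 0,
  PRIMARY KEY (id, yr),
  KEY idx_loaded (loaded)
)
PARTITION BY RANGE (yr) (
  PARTITION p2015 VALUES LESS THAN (2016),
  PARTITION p2016 VALUES LESS THAN (2017),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);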
Sometimes, separate tables are necessary, to meet particular system requirements. If so, then a view should be available to combine all the tables:
create view v_tables as
select t1.*, 'table1' as which from table1 union all
select t2.*, 'table2' as which from table2 union all
. . .
If you had such a view, then your query would simply be:
select which, count(*)
from v_tables
where loaded = 1
group by which;
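If the tables must stay separate and you don't want to maintain the view by hand, here is a sketch of the dynamic-SQL route in MySQL, building one UNION ALL query from INFORMATION_SCHEMA.TABLES and running it as a prepared statement (this assumes the tables all live in the current schema):
SET SESSION group_concat_max_len = 1000000; -- the default (1024) could truncate the generated SQL

SELECT CONCAT(
         'SELECT SUM(cnt) AS total FROM (',
         GROUP_CONCAT(
           CONCAT('SELECT COUNT(*) AS cnt FROM `', table_name, '` WHERE loaded = 1')
           SEPARATOR ' UNION ALL '),
         ') AS t')
  INTO @sql
  FROM information_schema.tables
 WHERE table_schema = DATABASE()
   AND table_name LIKE 'question\_20%'; -- \_ keeps the underscore literal

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;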

Will a SQL UPDATE query using a WHERE IN clause containing potentially a lot of row IDs that won't match be a performance issue?

I have a list of inactive_products in an XML file. I need to update the "active" (boolean) field of the products I have in a DB based on the list of inactive products. However, the list of inactive products contains a lot of products that are not even in my DB. Will it be faster to just run an UPDATE statement using a WHERE IN clause with the entire list of inactive products, or should I compare the list of inactive products with a CSV file of active products first to obtain the relevant inactive products, and then run an UPDATE statement with this filtered list instead?
My initial thought is that I should just let SQL do its thing and handle the unfiltered list, but what is the technically correct answer?
One way to work around a long list of values in an IN operator is to JOIN against a temporary table instead. This was the standard workaround back when Oracle didn't support more than 1000 values in an IN list.
If you have a query like this:
SELECT * FROM mytable WHERE column IN (value1, value2, ..., valueN);
then replace it with:
CREATE TEMPORARY TABLE temp_table AS
SELECT column FROM <source that contains value1, value2, ..., valueN>;
SELECT * FROM mytable JOIN temp_table ON (mytable.column = temp_table.column);
Asymptotically this should perform better than using IN (at least in Oracle).
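Applied to the UPDATE in the question, a minimal sketch in MySQL syntax (products, product_id, and active are hypothetical names; the IDs parsed from the XML file are bulk-loaded into the temporary table first):
CREATE TEMPORARY TABLE inactive_ids (product_id INT PRIMARY KEY); -- hypothetical schema

-- bulk-insert the IDs parsed from the XML file, e.g.
-- INSERT INTO inactive_ids (product_id) VALUES (101), (102), (103);

UPDATE products p
JOIN inactive_ids i ON i.product_id = p.product_id
SET p.active = 0;
IDs in inactive_ids that match no row in products simply update nothing, so the unmatched entries cost little beyond the join itself.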
I would recommend letting SQL do the filtering, which is essentially letting the database do the heavy lifting, since that's what it does best; however, make sure that you have indexes on the id columns for better performance.
If this becomes too slow, then you can look at filtering the list first.