Load Redshift Spectrum external table with CSVs with differing column order

A related question got no answer, and I have a similar one, though I'll expand on it.
Suppose I have 3 CSV files in s3://test_path/. I want to create an external table and populate it with the data in these CSVs. However, not only does column order differ across CSVs, but some columns may be missing from some CSVs.
Is Redshift Spectrum capable of doing what I want?
a.csv:
id,name,type
a1,apple,1
a2,banana,2
b.csv:
type,id,name
1,b1,orange
2,b2,lemon
c.csv:
name,id
kiwi,c1
I create the external database/schema and table by running this in Redshift query editor v2 on my Redshift cluster:
CREATE EXTERNAL SCHEMA test_schema
FROM DATA CATALOG
DATABASE 'test_db'
REGION 'region'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
CREATE EXTERNAL TABLE test_schema.test_table (
"id" VARCHAR,
"name" VARCHAR,
"type" SMALLINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://test_path/'
TABLE PROPERTIES ('skip.header.line.count'='1')
;
I expect SELECT * FROM test_schema.test_table to yield:
id | name   | type
---+--------+------
a1 | apple  | 1
a2 | banana | 2
b1 | orange | 1
b2 | lemon  | 2
c1 | kiwi   | NULL
Instead I get:
id   | name   | type
-----+--------+------
a1   | apple  | 1
a2   | banana | 2
1    | b1     | NULL
2    | b2     | NULL
kiwi | c1     | NULL
It seems Redshift Spectrum cannot match columns by name across files the way pandas.concat() can with DataFrames that have differing column order.

No, your data needs to be transformed to align the columns between files. Your Spectrum DDL specifies that the first row of each CSV is skipped, so the header information you need isn't even being read.
If you want these files usable as one Spectrum table, you will need to transform them to align the columns and store new files to S3. You can do this with Redshift if you also have a supporting piece of code reading the column order from each file. You could write a Lambda to do this easily, or if your CSV files are fairly simple, a Glue crawler will work. Almost any ETL tool can do this as well. Lots of choices, but these files are not Spectrum-ready as is.
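If the number of files is small, one option for doing the alignment inside Redshift itself is a plain (non-Spectrum) staging table loaded with a per-file COPY column list. This is only a sketch: the table name staging_table is an assumption, the IAM role is the same placeholder as above, and the column lists are what your supporting code would read from each file's header.
CREATE TABLE staging_table (
  "id" VARCHAR,
  "name" VARCHAR,
  "type" SMALLINT
);
-- each column list mirrors that file's header row
COPY staging_table (id, name, type) FROM 's3://test_path/a.csv'
IAM_ROLE 'iam_role' CSV IGNOREHEADER 1;
COPY staging_table (type, id, name) FROM 's3://test_path/b.csv'
IAM_ROLE 'iam_role' CSV IGNOREHEADER 1;
COPY staging_table (name, id) FROM 's3://test_path/c.csv'
IAM_ROLE 'iam_role' CSV IGNOREHEADER 1;
From there you could query the staging table directly, or UNLOAD it back to S3 in a single, consistent column order for Spectrum.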

Related

SQL Multi-Location inventory DB

I am creating a DB where one table holds store location information like location ID, address info, and a column for the inventory:
Location_ID | paint_inventory | address info
------------+-----------------+--------------
1001        | red,blue,black  | address stuff
1002        | blue,orange     | address stuff
The database also has a table that gives each paint an ID:
paint_ID | color_name
---------+-----------
1        | red
2        | blue
3        | purple
4        | black
5        | orange
How can I efficiently store this information where the location table looks like this (using an ID to reference the color info in the other table):
Location_ID | paint_inventory | address info
------------+-----------------+--------------
1001        | 1,2,4           | address stuff
1002        | 2, 5            | address stuff
NOTE:
I have seen posts where people say storing information in a delimited string is poor practice; however, I am basically recreating an already existing DB, and the first location table is how the data is formatted.
UPDATE:
To rephrase a little: how could I efficiently store the paint_inventory column? I was given the data as it is in the first table, and my only thought of how to store it is the way I showed (once again, I understand you should never store data as a delimited string, but this is how I was given the data).
You should create a joining table -- that table would look like this
id int -- unique generated id for relationship
paint_id int -- pointer to the paint
location_id int -- pointer to the location
I like to name join tables after the things they join, ordered alphabetically, so I would call this table location_paint.
There might be additional information for a can of paint at a location that you could put in this table -- how full it is, when it was purchased, when it was used up, etc.
There could also be metadata in the table -- when the record was created, when it was edited, who created it, etc.
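A minimal sketch of that join table in MySQL (the table names locations and paints and their key columns are assumptions, since the original table names aren't given):
CREATE TABLE location_paint (
  id INT AUTO_INCREMENT PRIMARY KEY,  -- unique generated id for the relationship
  location_id INT NOT NULL,           -- pointer to the location
  paint_id INT NOT NULL,              -- pointer to the paint
  FOREIGN KEY (location_id) REFERENCES locations (location_id),
  FOREIGN KEY (paint_id) REFERENCES paints (paint_id)
);
Location 1001's inventory (red, blue, black) then becomes three rows: (1001, 1), (1001, 2), (1001, 4).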

Delete JSON data from a MySQL table

Hi, I need to remove from a JSON column everything that contains the name WEAPON_PISTOL50. This is one of my values in MySQL:
{"weapons":[{"ammo":74,"name":"WEAPON_PISTOL50"},{"ammo":118,"name":"WEAPON_PISTOL50"},{"ammo":54,"name":"WEAPON_PISTOL"}]}
The table is named: datastore_data
and the column that contains json format is called data.
I want to update all the rows by deleting this from the JSON: '{"ammo":118,"name":"WEAPON_PISTOL50"}'
I haven't tested many approaches yet, but I need to do the above.
Here's a solution tested on MySQL 8.0:
update datastore_data
cross join json_table(data, '$.weapons[*]'
  columns(
    i for ordinality,
    ammo int path '$.ammo',
    name varchar(20) path '$.name'
  )
) as j
set data = json_remove(data, concat('$.weapons[', j.i-1, ']'))
where j.ammo = 118 and j.name = 'WEAPON_PISTOL50';
If you are using a version of MySQL too old to support JSON_TABLE(), then it's a lot harder.
Frankly, it would be far easier if you didn't use JSON. Instead, store one weapon per row in a second table with normal columns named ammo and name.
create table weapons(
id serial primary key,
owner_id int,
ammo int,
name varchar(20)
);
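For illustration, the example document would map to rows like this (a sketch; owner_id 1 is an assumed key pointing back to the original datastore_data row):
-- hypothetical normalized rows for the example JSON document
insert into weapons (owner_id, ammo, name) values
  (1, 74, 'WEAPON_PISTOL50'),
  (1, 118, 'WEAPON_PISTOL50'),
  (1, 54, 'WEAPON_PISTOL');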
Then you could do this task much more simply:
delete from weapons
where ammo = 118 and name = 'WEAPON_PISTOL50';
Storing data in JSON documents might seem like a convenient way to load complex data into a single row of a single table, but virtually every task you have to do with that data afterwards becomes a lot harder than if you had used normal tables and columns.

Filter objects inside array in MySQL JSON column [duplicate]

MySQL 5.7.24
Let's say I have 3 rows like this:
ID (PK) | Name (VARCHAR) | Data (JSON)
--------+----------------+-------------------------------------
1 | Admad | [{"label":"Color", "value":"Red"}, {"label":"Age", "value":40}]
2 | Saleem | [{"label":"Color", "value":"Green"}, {"label":"Age", "value":37}, {"label":"Hoby", "value":"Chess"}]
3 | Daniel | [{"label":"Food", "value":"Grape"}, {"label":"Age", "value":47}, {"label":"State", "value":"Sel"}]
Rule #1: The JSON column is dynamic. That means not every row will have the same structure.
Rule #2: Assuming I can't modify the data structure
My question: is it possible to query so that I can get the IDs of records where the Age is >= 40? In this case, 1 & 3.
Additional info (after being pointed to a duplicate): if you look at my data, the parent container is an array. If I store my data like
{"Age":"40", "Color":"Red"}
then I can simply use
Data->>'$.Age' >= 40
My current thinking is to use a stored procedure to loop over the array, but I hope I don't have to take that route. The second option is to use regex (which I also hope to avoid). If you think "JSON search" is the solution, kindly point me to which one (or give some sample for a noob like me). The documentation is too general for my specific needs.
Here's a demo:
mysql> create table letsayi (id int primary key, name varchar(255), data json);
mysql> insert into letsayi values
-> (1, 'Admad', '[{"label":"Color", "value":"Red"}, {"label":"Age", "value":"40"}]'),
-> (2, 'Saleem', '[{"label":"Color", "value":"Green"}, {"label":"Age", "value":"37"}, {"label":"Hoby", "value":"Chess"}]');
mysql> select id, name from letsayi
where json_contains(data, '{"label":"Age","value":"40"}');
+----+-------+
| id | name |
+----+-------+
| 1 | Admad |
+----+-------+
I have to say this is the least efficient way you could store your data. There's no way to use an index to search for your data, even if you use indexes on generated columns. You're not even storing the integer "40" as an integer — you're storing the numbers as strings, which makes them take more space.
Using JSON in MySQL when you don't need to is a bad idea.
Is it still possible to query age >= 40?
Not using JSON_CONTAINS(). That function is not like an inequality condition in a WHERE clause. It only matches exact equality of a subdocument.
To do an inequality, you'd have to upgrade to MySQL 8.0 and use JSON_TABLE(). I answered another question recently about that: MySQL nested JSON column search and extract sub JSON
In other words, you have to convert your JSON into a format as if you had stored it in traditional rows and columns. But you have to do this every time you query your data.
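For example, against the letsayi demo above, a MySQL 8.0 query might look roughly like this (a sketch; the cast is needed because the values are stored as strings):
select l.id, l.name
from letsayi as l
cross join json_table(l.data, '$[*]'
  columns(
    label varchar(20) path '$.label',
    value varchar(20) path '$.value'
  )
) as j
where j.label = 'Age' and cast(j.value as unsigned) >= 40;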
If you need to use conditions in the WHERE clause, you're better off not using JSON. It just makes your queries much too complex. Listen to this old advice about programming:
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
— Brian Kernighan
Regarding how people tackle dynamically added form fields:
You could create a key/value table for the dynamic form fields:
CREATE TABLE keyvalue (
user_id INT NOT NULL,
label VARCHAR(64) NOT NULL,
value VARCHAR(255) NOT NULL,
PRIMARY KEY (user_id, label),
INDEX (label)
);
Then you can add key/value pairs for each user's dynamic form entries:
INSERT INTO keyvalue (user_id, label, value)
VALUES (123, 'Color', 'Red'),
(123, 'Age', '40');
This is still a bit inefficient in storage compared to real columns, because the label names are stored every time you enter a user's data, and you still store integers as strings. But if the users are really allowed to store any labels of their own choosing, you can't make those real columns.
With the key/value table, querying for age >= 40 is simpler:
SELECT user_id FROM keyvalue
WHERE label = 'Age' AND value >= 40;

SQL database structure for time series data type

I wonder if someone could take a minute out of their day to give some suggestions on my database structure design.
I have sensor data (e.g. temperature, humidity, ...) in time-series format (10 Hz) from sensors installed on different floors of different houses in different cities. So let's say something like this:
City Paris --> House A --> Floor 1 --> Sensor Humidity & temp --> CSV file with time series for hours, days, years
City Paris --> House B --> Floor 3 --> Sensor Humidity --> CSV file with time series for hours, days, years
So now I would like to answer these questions:
1- What would be the most efficient method to store the data in a SQL database?
2- Would it make sense to have the timestamp data stored in the SQL database but the sensor data in CSV files, and then link them to the SQL database?
3- What about the scalability and the possibility to add new sensors?
Many thanks in advance for your help and suggestions.
If your objective is to run time-series analytics, I would recommend breaking down your data so that each reading is in one row, and using a time-series database.
The schema proposed earlier is good. But I personally find storing the data in 3 tables too complex as you need to write / check constraints across 3 different tables, and most of your queries will require JOIN clauses.
There are ways to make this schema simpler, for example by leveraging the symbol type in QuestDB. Symbol stores repetitive strings as a map of integers. On the surface, you are manipulating strings, but the storage cost and operation complexity is that of an int.
This means you can store all your data in a single, simpler table, with no performance or storage penalty. It would simplify both ingestion, as you write into only one table, and queries, by removing the need to perform multiple joins.
Here is what the DDL would look like.
CREATE TABLE measurements (
  id INT,
  ts TIMESTAMP,
  sensor_name SYMBOL,
  floor_name SYMBOL,
  building_name SYMBOL,
  building_city SYMBOL,
  type SYMBOL,
  value DOUBLE
) timestamp (ts);
If you want to add more sensors or buildings, all you need to do is write to the same table.
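For example, a typical query then stays flat and needs no joins (a sketch, assuming a 'humidity' sensor type and using QuestDB's SAMPLE BY for hourly aggregation):
SELECT ts, avg(value) AS avg_humidity
FROM measurements
WHERE building_city = 'Paris' AND type = 'humidity'
SAMPLE BY 1h;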
At the very least, you should not save the CSV in the database as a varchar or text all at once. You should break everything down into parts as small as possible. My suggestion is that you first create a table like this:
CREATE TABLE measurements (measurement_id INT PRIMARY KEY, floor_id INT, type VARCHAR(50), value FLOAT)
Then you create a table for floors
CREATE TABLE floors (floor_id INT PRIMARY KEY, building_id INT, floor_name INT)
And finally the connection to the building:
CREATE TABLE buildings (building_id INT PRIMARY KEY, building_name VARCHAR(200), building_city VARCHAR(200))
You should create foreign keys from measurements.floor_id to floors.floor_id and from floors.building_id to buildings.building_id.
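A sketch of those foreign key constraints in MySQL:
ALTER TABLE measurements ADD FOREIGN KEY (floor_id) REFERENCES floors (floor_id);
ALTER TABLE floors ADD FOREIGN KEY (building_id) REFERENCES buildings (building_id);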
You can even break things down into more tables to keep cities and/or addresses in their own tables if you like.

Importing Excel CSV to MySQL relational database?

I have two tables, baskets and fruits; a basket has many different fruits, so it's a one-to-many relation.
The tables are as follows:
baskets
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
basket_name varchar(20)
fruits
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
fruit_name varchar(20),
basket_id INT FOREIGN KEY REFERENCES baskets(id)
I have an Excel sheet with column names and data organized like this:
Basket   | Fruits
---------+------------------------------------------
Basket_1 | Apple, Mango, Banana, Pear
Basket_2 | Mango, Strawberry, Plums, Banana, Grapes
Baskt_3  | Raspberry, Apple, Pear
What would be the best possible way to store all these fruits for a single basket in the relational architecture modeled above?
I have converted my XLS file to CSV for better parsing using scripts (in Ruby) and found that the table has been messed up such that each fruit is now in a separate cell and the number of cells varies per basket (row) in the spreadsheet.
My question is some what similar to this: Import Excel Data to Relational Tables at MySQL
But not the same in data.
All suggestions are welcome!
These kinds of problems always seem to take a fair amount of steps.
If the fruits are in separate cells, use the Paste Special - Transpose tool in Excel to turn rows into columns: copy all cells in one sheet, then use Paste Special - Transpose to paste them into a new sheet.
In a third sheet I would make a formula that concatenates the value of each cell, starting in the second row, with the value in the first row. If sheetA has the freshly transposed values and sheetB is the new sheet, the formula would be =sheetA!a$1 & "," & sheetA!a2; copy that formula from a2:zz999 or however far it needs to go.
Then the columns need to be concatenated, which can be done manually or via UNION statements in MySQL after the sheet is imported:
SELECT COLUMN1 FROM SHEETB
UNION
SELECT COLUMN2 FROM SHEETB
UNION
...
You get the idea. Once the combined basket-fruit field is in one column, it's easy to separate out into two columns.
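For that last step, here's a rough sketch (the table name sheetb_combined and column pair are assumptions for the imported sheet, holding strings like 'Basket_1,Apple'):
SELECT
  SUBSTRING_INDEX(pair, ',', 1)  AS basket_name,
  SUBSTRING_INDEX(pair, ',', -1) AS fruit_name
FROM sheetb_combined
WHERE SUBSTRING_INDEX(pair, ',', -1) <> '';  -- skip blanks from baskets with fewer fruits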