ClickHouse: Want to extract data from Array(Tuple) column in ClickHouse - mysql

Query used to create the table:
CREATE TABLE default.ntest2(job_name String, list_data Array(Tuple(s UInt64, e UInt64, name String))) ENGINE = MergeTree ORDER BY (job_name) SETTINGS index_granularity = 8192;
Table Data:
┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name────┐
│ job1     │ [19,22]     │ [38,92]     │ ['test1','test2'] │
│ job2     │ [28,63]     │ [49,87]     │ ['test3','test4'] │
└──────────┴─────────────┴─────────────┴───────────────────┘
Expected Output:
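┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name─┐
│ job1     │ 19          │ 38          │ 'test1'        │
│ job1     │ 22          │ 92          │ 'test2'        │
│ job2     │ 28          │ 49          │ 'test3'        │
│ job2     │ 63          │ 87          │ 'test4'        │
└──────────┴─────────────┴─────────────┴────────────────┘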
How can I achieve this with less query time?

Use ARRAY JOIN: https://clickhouse.com/docs/en/sql-reference/statements/select/array-join/
SELECT
job_name,
`list_data.s`,
`list_data.e`,
`list_data.name`
FROM
(
SELECT
c1 AS job_name,
c2 AS list_data
FROM values(('job1', ([19, 22], [38, 92], ['test1', 'test2'])), ('job2', ([28, 63], [49, 87], ['test3', 'test4'])))
) AS T
ARRAY JOIN
list_data.1 AS `list_data.s`,
list_data.2 AS `list_data.e`,
list_data.3 AS `list_data.name`
┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name─┐
│ job1 │ 19 │ 38 │ test1 │
│ job1 │ 22 │ 92 │ test2 │
│ job2 │ 28 │ 49 │ test3 │
│ job2 │ 63 │ 87 │ test4 │
└──────────┴─────────────┴─────────────┴────────────────┘
SELECT
job_name,
list_data.s,
list_data.e,
list_data.name
FROM
(
SELECT
c1 AS job_name,
c2 AS `list_data.s`,
c3 AS `list_data.e`,
c4 AS `list_data.name`
FROM values(('job1', [19, 22], [38, 92], ['test1', 'test2']), ('job2', [28, 63], [49, 87], ['test3', 'test4']))
) AS T
ARRAY JOIN
`list_data.s` AS `list_data.s`,
`list_data.e` AS `list_data.e`,
`list_data.name` AS `list_data.name`
┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name─┐
│ job1 │ 19 │ 38 │ test1 │
│ job1 │ 22 │ 92 │ test2 │
│ job2 │ 28 │ 49 │ test3 │
│ job2 │ 63 │ 87 │ test4 │
└──────────┴─────────────┴─────────────┴────────────────┘
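To make the row-unfolding behaviour of ARRAY JOIN concrete, here is a minimal Python sketch of the same semantics. It is only an illustration of what the queries above compute, not how ClickHouse implements it; the function name array_join and the in-memory rows are made up for this example.

```python
# Hypothetical in-memory copy of the sample data: one tuple per table row,
# with the three parallel arrays list_data.s, list_data.e, list_data.name.
rows = [
    ("job1", [19, 22], [38, 92], ["test1", "test2"]),
    ("job2", [28, 63], [49, 87], ["test3", "test4"]),
]


def array_join(rows):
    """Unfold parallel arrays so that each array position becomes its own row."""
    out = []
    for job_name, s_list, e_list, name_list in rows:
        # Arrays joined together are expected to have the same length.
        if not (len(s_list) == len(e_list) == len(name_list)):
            raise ValueError("arrays must have equal length")
        for s, e, name in zip(s_list, e_list, name_list):
            out.append((job_name, s, e, name))
    return out


flattened = array_join(rows)
for row in flattened:
    print(row)
```

Each input row yields as many output rows as its arrays have elements, with job_name repeated on every one of them.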

Related

Add identifier of first created record to select statement with group_by

I have the following payments table
┌─name───────────────────────────┬─type────────────────────────────┐
│ payment_id │ UInt64 │
│ factory │ String │
│ user_id │ UInt64 │
│ amount_cents │ Int64 │
│ action │ String │
│ success │ UInt8 │
│ country │ FixedString(2) │
│ created_at │ DateTime │
│ finished_at │ Nullable(DateTime) │
└────────────────────────────────┴─────────────────────────────────┘
With sample data
┌─factory───┬─────────finished_at─┬─payment_id─┬─country─┬─action──┬─amount_cents─┬─user_id───┬
│ 0_factory │ 2021-01-18 00:00:01 │ 1 │ BY │ payment │ 1 │ 1 │
│ 0_factory │ 2021-01-18 00:00:02 │ 2 │ BY │ payment │ 1 │ 1 │
│ 1_factory │ 2021-01-18 00:00:02 │ 2 │ PL │ win │ 4 │ 1 │
│ 1_factory │ 2021-01-18 00:00:03 │ 3 │ PL │ win │ 7 │ 1 │
│ 2_factory │ 2021-01-18 00:00:01 │ 4 │ PL │ win │ 7 │ 1 │
│ 2_factory │ 2021-01-18 00:00:02 │ 1 │ PL │ payment │ 7 │ 1 │
│ 2_factory │ 2021-01-18 00:00:03 │ 2 │ PL │ win │ 7 │ 1 │
│ 2_factory │ 2021-01-18 00:00:04 │ 3 │ GR │ win │ 2 │ 1 │
└───────────┴─────────────────────┴────────────┴─────────┴─────────┴─────────┴────────────────┘
This is the query I have right now, with its output:
SELECT
factory,
user_id,
payment_id,
action,
created_at
FROM payments_all
WHERE (payments_all.action = 'payment') AND (payments_all.factory IN ('0_factory', '1_factory', '2_factory')) AND isNotNull(payments_all.created_at)
GROUP BY
factory,
user_id,
payment_id,
action
HAVING (min(created_at) >= toDate('2019-01-01 00:00:00')) AND (min(created_at) < toDate('2021-10-01 00:00:00'))
ORDER BY user_id
┌─factory───┬─user_id─┬─payment_id─┬─action──┬──────────created_at─┐
│ 1_factory │ 1 │ 1 │ payment │ 2021-02-04 09:00:00 │
│ 0_factory │ 1 │ 1 │ payment │ 2021-01-17 00:00:01 │
│ 0_factory │ 1 │ 2 │ payment │ 2021-01-17 00:00:06 │
└───────────┴─────────┴────────────┴─────────┴─────────────────────┘
I need to add a new column, first_payment.
first_payment is 1 if the action is payment and it is the user's first payment. Otherwise it is 0.
first_payment should be determined over the whole history, not only the queried period.
So the expected result is:
┌─factory───┬─────────finished_at─┬─payment_id─┬─country─┬─action──┬─amount_cents─┬─user_id───┬first_payment─┐
│ 0_factory │ 2021-01-18 00:00:01 │ 1 │ BY │ deposit │ 1 │ 1 │ 1 │
│ 0_factory │ 2021-01-18 00:00:02 │ 2 │ BY │ deposit │ 1 │ 1 │ 0 │
│ 1_factory │ 2021-01-18 00:00:02 │ 2 │ PL │ win │ 4 │ 1 │ 0 │
│ 1_factory │ 2021-01-18 00:00:03 │ 3 │ PL │ win │ 7 │ 1 │ 0 │
│ 2_factory │ 2021-01-18 00:00:01 │ 4 │ PL │ win │ 7 │ 1 │ 0 │
│ 2_factory │ 2021-01-18 00:00:02 │ 1 │ PL │ deposit │ 7 │ 1 │ 1 │
│ 2_factory │ 2021-01-18 00:00:03 │ 2 │ PL │ win │ 7 │ 1 │ 0 │
│ 2_factory │ 2021-01-18 00:00:04 │ 3 │ GR │ win │ 2 │ 1 │ 0 │
└───────────┴─────────────────────┴────────────┴─────────┴─────────┴─────────┴────────────────┘
I couldn't find much about ClickHouse, but it doesn't appear to support window functions.
Your example output also seems to be exactly the same as your sample table, plus one additional column, so I'm not sure what your GROUP BY was meant to achieve.
So, I'd use a LEFT JOIN onto a subquery.
SELECT
payments_all.*,
CASE WHEN user_summary.user_id IS NOT NULL THEN 1 ELSE 0 END AS first_payment
FROM
payments_all
LEFT JOIN
(
SELECT
user_id,
factory,
MIN(created_at) AS first_created_at
FROM
payments_all
WHERE
action = 'payment'
GROUP BY
user_id,
factory
)
AS user_summary
ON payments_all.user_id = user_summary.user_id
AND payments_all.factory = user_summary.factory
AND payments_all.created_at = user_summary.first_created_at
WHERE
(payments_all.factory IN ('0_factory', '1_factory', '2_factory'))
AND (payments_all.created_at >= toDate('2019-01-01 00:00:00'))
AND (payments_all.created_at < toDate('2021-10-01 00:00:00'))
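The LEFT JOIN approach can be tried out on a few rows with Python's built-in sqlite3 module. This is a sketch under assumptions: the rows are a hypothetical subset of the sample data, the timestamps are stored as created_at text values, and SQLite stands in for ClickHouse, so only the join logic carries over. The extra join condition on action is an addition that is not in the query above; it keeps a non-payment row that shares the first payment's timestamp from being flagged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments_all ("
    "payment_id INTEGER, factory TEXT, user_id INTEGER, "
    "action TEXT, created_at TEXT)"
)
# Hypothetical subset of the sample rows from the question.
conn.executemany(
    "INSERT INTO payments_all VALUES (?, ?, ?, ?, ?)",
    [
        (1, "0_factory", 1, "payment", "2021-01-18 00:00:01"),
        (2, "0_factory", 1, "payment", "2021-01-18 00:00:02"),
        (2, "1_factory", 1, "win", "2021-01-18 00:00:02"),
        (4, "2_factory", 1, "win", "2021-01-18 00:00:01"),
        (1, "2_factory", 1, "payment", "2021-01-18 00:00:02"),
    ],
)

query = """
SELECT p.factory, p.payment_id, p.action,
       CASE WHEN s.user_id IS NOT NULL THEN 1 ELSE 0 END AS first_payment
FROM payments_all AS p
LEFT JOIN (
    -- earliest payment per user and factory, over the whole history
    SELECT user_id, factory, MIN(created_at) AS first_created_at
    FROM payments_all
    WHERE action = 'payment'
    GROUP BY user_id, factory
) AS s
  ON  p.user_id = s.user_id
  AND p.factory = s.factory
  AND p.created_at = s.first_created_at
  AND p.action = 'payment'  -- added guard, not part of the original answer
ORDER BY p.factory, p.created_at
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because the subquery is not restricted by date, the flag reflects the first payment over all time, while any date filter on the outer query only limits which rows are shown.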
From the sample data, the first payment always seems to have payment_id = 1. If that holds, you can use CASE WHEN payment_id = 1 THEN 1 ELSE 0 END AS first_payment. Please check the query below:
WITH CTE AS
(SELECT
factory,
user_id,
payment_id,
action,
created_at
FROM payments_all
WHERE (payments_all.action = 'payment') AND (payments_all.factory IN ('0_factory', '1_factory', '2_factory')) AND isNotNull(payments_all.created_at)
GROUP BY
factory,
user_id,
payment_id,
action
HAVING (min(created_at) >= toDate('2019-01-01 00:00:00')) AND (min(created_at) < toDate('2021-10-01 00:00:00'))
)
SELECT *, CASE WHEN payment_id = 1 THEN 1
ELSE 0 END AS first_payment
FROM CTE
ORDER BY user_id
NOTE: Query is written in SQL Server. Please check and let me know.

ClickHouse Aggregates - GROUP BY DAY/MONTH/YEAR(timestamp)?

Is there a way in ClickHouse to do a GROUP BY DAY/MONTH/YEAR() on a timestamp value? I am having a hard time figuring it out while rewriting MySQL queries for ClickHouse. My MySQL queries look like this:
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate GROUP BY DAY(stamp)
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate GROUP BY MONTH(stamp)
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate GROUP BY YEAR(stamp)
Quite simple AND SLOW in MySQL, but I do not know how to do the aggregates in ClickHouse.
Thanks!
To extract part of a date, use the functions toYear, toMonth and toDayOfMonth, like this:
SELECT
toMonth(time) AS month,
count()
FROM
(
SELECT
number,
addDays(now(), number) AS time
FROM numbers(8)
)
GROUP BY month
/*
┌─month─┬─count()─┐
│ 1 │ 7 │
│ 2 │ 1 │
└───────┴─────────┘
*/
To get several grouping levels in one query, use the WITH ROLLUP modifier:
SELECT
toYear(time) AS year,
toMonth(time) AS month,
toDayOfMonth(time) AS day,
count()
FROM
(
SELECT
number,
addDays(now(), number) AS time
FROM numbers(8)
)
GROUP BY
year,
month,
day
WITH ROLLUP
/*
┌─year─┬─month─┬─day─┬─count()─┐
│ 2021 │ 2 │ 1 │ 1 │ // day
│ 2021 │ 1 │ 29 │ 1 │ // day
│ 2021 │ 1 │ 31 │ 1 │ // day
│ 2021 │ 1 │ 26 │ 1 │ // day
│ 2021 │ 1 │ 25 │ 1 │ // day
│ 2021 │ 1 │ 28 │ 1 │ // day
│ 2021 │ 1 │ 30 │ 1 │ // day
│ 2021 │ 1 │ 27 │ 1 │ // day
│ 2021 │ 1 │ 0 │ 7 │ // month
│ 2021 │ 2 │ 0 │ 1 │ // month
│ 2021 │ 0 │ 0 │ 8 │ // year
│ 0 │ 0 │ 0 │ 8 │
└──────┴───────┴─────┴─────────┘
*/
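To see where the day, month, year and grand-total rows of the ROLLUP output come from, the same counts can be reproduced in plain Python. This is a sketch only: the start date 2021-01-25 is an assumption chosen to match the eight days in the output above, and the variable names are made up for the example.

```python
from collections import Counter
from datetime import date, timedelta

# Assumed start date; the original query used now() plus 0..7 days.
start = date(2021, 1, 25)
days = [start + timedelta(days=n) for n in range(8)]

# One counter per grouping level, mirroring GROUP BY year, month, day WITH ROLLUP.
by_day = Counter((d.year, d.month, d.day) for d in days)
by_month = Counter((d.year, d.month) for d in days)
by_year = Counter(d.year for d in days)
total = len(days)

print(by_month)
print(by_year)
print(total)
```

ROLLUP returns all of these levels in a single result set, filling the columns that are not part of a given level with default values (0 in the output above).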

SQL count occurrences of entries

I'm having some problems in counting some entries in my database.
I have following query:
SELECT ref_training_date, training_date FROM fynslund.users_training
INNER JOIN training
ON training.id = users_training.ref_training_date
WHERE attendance = 1
Which gives me following results:
┌──────┬───────────────┐
│ ID │ DATE │
├──────┼───────────────┤
│ '55' │ '2018-01-09' │
│ '55' │ '2018-01-09' │
│ '54' │ '2018-02-03' │
│ '54' │ '2018-02-03' │
│ '54' │ '2018-02-03' │
│ '54' │ '2018-02-03' │
└──────┴───────────────┘
How do I count how many times the date with ID '55' appears?
You need to GROUP BY ref_training_date and then COUNT(training_date):
SELECT ref_training_date, COUNT(training_date) AS date_count
FROM fynslund.users_training
INNER JOIN training ON training.id = users_training.ref_training_date
WHERE attendance = 1
GROUP BY ref_training_date
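The query can be checked against the sample result with Python's sqlite3 module. This is a rough sketch: the table layouts are guessed from the column names in the question, and one extra row with attendance = 0 is added (an assumption) to show that the WHERE clause filters it out before counting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schemas; the real tables likely have more columns.
conn.executescript("""
CREATE TABLE training (id INTEGER PRIMARY KEY, training_date TEXT);
CREATE TABLE users_training (ref_training_date INTEGER, attendance INTEGER);
INSERT INTO training VALUES (54, '2018-02-03'), (55, '2018-01-09');
INSERT INTO users_training VALUES
    (55, 1), (55, 1), (54, 1), (54, 1), (54, 1), (54, 1), (54, 0);
""")

counts = dict(conn.execute("""
SELECT ref_training_date, COUNT(training_date) AS date_count
FROM users_training
INNER JOIN training ON training.id = users_training.ref_training_date
WHERE attendance = 1
GROUP BY ref_training_date
""").fetchall())

print(counts)
```

To get the count for a single ID only, add AND ref_training_date = 55 to the WHERE clause.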

How to read a non-standard space delimited data into a DataFrame and build a GLM model using it?

I am trying to read a tab-delimited file, with no missing data, into Julia. All the columns are stored as NullableArrays.NullableArray{Int64,1} even though I specified the types:
data = CSV.read("../datasets/baby.dat"; delim='\t', types=[Int, Float64, Float64, Float64, Float64, Float64])
The dataset is from http://stat.ethz.ch/Teaching/Datasets/baby.dat
I want to run a regression on the dataset, but the GLM.jl package gives an error with NullableArrays ...
Any ideas?
The complete error message is:
fit(GeneralizedLinearModel, @formula(Survival2 ~
Weight+Age+X1.Apgar+X5.Apgar+pH), data, Binomial(), ProbitLink())
ERROR: Non-call expression encountered
Stacktrace:
[1] dospecials(::Expr) at
/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:97
[2] collect_to!(::Array{Symbol,1},
::Base.Generator{Array{Any,1},DataFrames.#dospecials}, ::Int64,::Int64) at
./array.jl:508
[3] collect_to_with_first!(::Array{Symbol,1}, ::Symbol,
::Base.Generator{Array{Any,1},DataFrames.#dospecials}, ::Int64) at
./array.jl:495
[4] _collect(::Array{Any,1},
::Base.Generator{Array{Any,1},DataFrames.#dospecials}, ::Base.EltypeUnknown,
::Base.HasShape) at ./array.jl:489
[5] map(::Function, ::Array{Any,1}) at ./abstractarray.jl:1868
[6] dospecials(::Expr) at
.julia/v0.6/DataFrames/src/statsmodels/formula.jl:101
[7] DataFrames.Terms(::DataFrames.Formula) at
.julia/v0.6/DataFrames/src/statsmodels/formula.jl:209
[8] #ModelFrame#127(::Array{Any,1}, ::Type{T} where T, ::DataFrames.Formula, ::DataFrames.DataFrame) at .julia/v0.6/DataFrames/src/statsmodels/formula.jl:333
[9] (::Core.#kw#Type)(::Array{Any,1}, ::Type{DataFrames.ModelFrame}, ::DataFrames.Formula, ::DataFrames.DataFrame) at ./<missing>:0
[10] #fit#153(::Dict{Any,Any}, ::Array{Any,1}, ::Function, ::Type{GLM.GeneralizedLinearModel}, ::DataFrames.Formula, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::Vararg{Any,N} where N) at .julia/v0.6/DataFrames/src/statsmodels/statsmodel.jl:52
[11] fit(::Type{GLM.GeneralizedLinearModel}, ::DataFrames.Formula, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::GLM.ProbitLink) at .julia/v0.6/DataFrames/src/statsmodels/statsmodel.jl:52
[12] eval(::Module, ::Any) at ./boot.jl:235
[13] eval(::Any) at ./boot.jl:234
[14] macro expansion at .julia/v0.6/Atom/src/repl.jl:186 [inlined]
[15] anonymous at ./<missing>:?
I assume that you want to get a DataFrame. Unfortunately your file is not tab-delimited. This is how you can load it into a DataFrame:
using DataFrames
data = split.(readlines("baby.dat"))
types = [Int, Float64, Float64, Float64, Float64, Float64]
df = DataFrame([parse.(t, getindex.(data[2:end], i)) for (i, t) in enumerate(types)],
Symbol.(replace.(data[1], ".", "")))
Note that I remove the . from the column names, because the GLM package has problems with them later.
Now you can check that all is as desired:
julia> showcols(df)
247×6 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values │
├───────┼──────────┼─────────┼─────────┼──────────────────┤
│ 1 │ Survival │ Int64 │ 0 │ 1 … 0 │
│ 2 │ Weight │ Float64 │ 0 │ 1350.0 … 790.0 │
│ 3 │ Age │ Float64 │ 0 │ 32.0 … 27.0 │
│ 4 │ X1Apgar │ Float64 │ 0 │ 4.0 … 4.0 │
│ 5 │ X5Apgar │ Float64 │ 0 │ 7.0 … 8.0 │
│ 6 │ pH │ Float64 │ 0 │ 7.25 … 7.35 │
julia> head(df)
6×6 DataFrames.DataFrame
│ Row │ Survival │ Weight │ Age │ X1Apgar │ X5Apgar │ pH │
├─────┼──────────┼────────┼──────┼─────────┼─────────┼──────┤
│ 1 │ 1 │ 1350.0 │ 32.0 │ 4.0 │ 7.0 │ 7.25 │
│ 2 │ 0 │ 725.0 │ 27.0 │ 5.0 │ 6.0 │ 7.36 │
│ 3 │ 0 │ 1090.0 │ 27.0 │ 5.0 │ 7.0 │ 7.42 │
│ 4 │ 0 │ 1300.0 │ 24.0 │ 9.0 │ 9.0 │ 7.37 │
│ 5 │ 0 │ 1200.0 │ 31.0 │ 5.0 │ 5.0 │ 7.35 │
│ 6 │ 0 │ 590.0 │ 22.0 │ 9.0 │ 9.0 │ 7.37 │
julia> tail(df)
6×6 DataFrames.DataFrame
│ Row │ Survival │ Weight │ Age │ X1Apgar │ X5Apgar │ pH │
├─────┼──────────┼────────┼──────┼─────────┼─────────┼──────┤
│ 1 │ 1 │ 1120.0 │ 28.0 │ 7.0 │ 7.0 │ 7.33 │
│ 2 │ 1 │ 1020.0 │ 28.0 │ 5.0 │ 7.0 │ 7.34 │
│ 3 │ 1 │ 1320.0 │ 28.0 │ 6.0 │ 6.0 │ 7.24 │
│ 4 │ 0 │ 900.0 │ 27.0 │ 5.0 │ 6.0 │ 7.37 │
│ 5 │ 1 │ 1150.0 │ 27.0 │ 4.0 │ 7.0 │ 7.37 │
│ 6 │ 0 │ 790.0 │ 27.0 │ 4.0 │ 8.0 │ 7.35 │
Now the GLM part (notice the correct way to call GLM):
julia> using GLM
julia> glm(@formula(Survival ~ Weight+Age+X1Apgar+X5Apgar+pH), df, Binomial(), ProbitLink())
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.ProbitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: Survival ~ 1 + Weight + Age + X1Apgar + X5Apgar + pH
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) -0.563327 8.36692 -0.0673279 0.9463
Weight 0.00213458 0.000479601 4.45074 <1e-5
Age 0.0996481 0.0444713 2.24073 0.0250
X1Apgar 0.0698717 0.0646315 1.08108 0.2797
X5Apgar 0.0371294 0.0703724 0.527614 0.5978
pH -0.624956 1.11015 -0.562946 0.5735
You can check that the results are the same as in R for this model.
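The loading step (split every line on runs of whitespace, then convert each column to its type) is not specific to Julia. As a rough cross-check, the same idea looks like this in Python. The three data rows are copied from the head(df) output above; the exact spacing and the dotted header names are assumptions about the raw file.

```python
# Hypothetical excerpt of baby.dat with irregular spacing between fields.
raw = """Survival Weight Age X1.Apgar X5.Apgar pH
1   1350 32 4 7  7.25
0    725 27 5 6  7.36
0   1090 27 5 7  7.42
"""
types = [int, float, float, float, float, float]

lines = raw.splitlines()
# str.split() with no argument splits on any run of whitespace.
header = [name.replace(".", "") for name in lines[0].split()]
rows = [line.split() for line in lines[1:] if line.strip()]

# Build one typed list per column, keyed by the cleaned column name.
columns = {
    name: [convert(row[i]) for row in rows]
    for i, (name, convert) in enumerate(zip(header, types))
}

print(header)
print(columns["Weight"])
```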

MySQL - Left Join with Case select sets NULL value in first CASE

Scenario
I have a users table with two columns that hold the iso_code_2 of the user's country of residence and of their nationality. Another table holds all the countries in different languages. I want to get the country name for both the user's residence and nationality. I know the problem is the GROUP BY, but I do not know how to solve it.
Tables
/* Users table */
╔══════╦═════════════╦════════════╦═════════════╦═══════════════╗
║ id ║ firstname ║ lastname ║ residence ║ nationality ║
╚══════╩═════════════╩════════════╩═════════════╩═══════════════╝
│ 1 │ Joe │ Doe │ JP │ PH │
├──────┼─────────────┼────────────┼─────────────┼───────────────┤
│ 2 │ Lisa │ Simpson │ US │ AR │
├──────┼─────────────┼────────────┼─────────────┼───────────────┤
│ 3 │ Homer │ Simpson │ JP │ JP │
└──────┴─────────────┴────────────┴─────────────┴───────────────┘
/* Countries table */
╔══════╦═══════════════╦══════════════╦═════════════════════╗
║ id ║ language_id ║ iso_code_2 ║ country ║
╚══════╩═══════════════╩══════════════╩═════════════════════╝
│ 1 │ 1 │ JP │ Japan │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 2 │ 2 │ JP │ 日本 │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 3 │ 1 │ PH │ Philippines │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 4 │ 2 │ PH │ フィリピン │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 5 │ 1 │ US │ United States │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 6 │ 2 │ US │ 米国 │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 7 │ 1 │ AR │ Argentina │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 8 │ 2 │ AR │ アルゼンチン │
└──────┴───────────────┴──────────────┴─────────────────────┘
/* Expected results */
╔══════╦═════════════╦════════════╦════════════════════════╦═══════════════════════╗
║ id ║ firstname ║ lastname ║ residence_country ║ nationality_country ║
╚══════╩═════════════╩════════════╩════════════════════════╩═══════════════════════╝
│ 1 │ Joe │ Doe │ Japan │ Philippines │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 1 │ Lisa │ Simpson │ United States │ Argentina │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 1 │ Homer │ Simpson │ Japan │ Japan │
└──────┴─────────────┴────────────┴────────────────────────┴───────────────────────┘
Current Query
SELECT
u.id,
u.firstname,
u.lastname,
CASE c.iso_code_2
WHEN u.nationality THEN c.country
END AS nationality_country,
CASE c.iso_code_2
WHEN u.residence THEN c.country
END AS residence_country
FROM
users AS u
LEFT JOIN
countries AS c ON c.language_id = 1 WHERE c.iso_code_2 IN (u.nationality, u.residence)
GROUP BY u.id
ORDER BY u.created_at DESC
LIMIT 15
Wrong results
╔══════╦═════════════╦════════════╦════════════════════════╦═══════════════════════╗
║ id ║ firstname ║ lastname ║ residence_country ║ nationality_country ║
╚══════╩═════════════╩════════════╩════════════════════════╩═══════════════════════╝
│ 1 │ Joe │ Doe │ NULL │ Philippines │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 1 │ Lisa │ Simpson │ NULL │ Argentina │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 1 │ Homer │ Simpson │ NULL │ Japan │
└──────┴─────────────┴────────────┴────────────────────────┴───────────────────────┘
Your GROUP BY retains only one row per user in the result. Depending on which joined row MySQL happens to keep, it will contain either the residence_country or the nationality_country.
You need to join the countries table twice, once per column, to get your desired result (this also makes the query simpler):
SELECT
u.id,
u.firstname,
u.lastname,
cn.country AS nationality_country,
cr.country AS residence_country
FROM
users AS u
LEFT JOIN countries AS cn ON cn.language_id = 1 AND cn.iso_code_2 = u.nationality
LEFT JOIN countries AS cr ON cr.language_id = 1 AND cr.iso_code_2 = u.residence
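The double-join idea can be verified against the sample tables with Python's sqlite3 module. This is a sketch that assumes SQLite behaves like MySQL for this query. The language filter is placed in each ON clause so that the LEFT JOIN still keeps users without a matching country, and the ORDER BY created_at and LIMIT from the question are left out because the sample table has no created_at column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, firstname TEXT, lastname TEXT,
                    residence TEXT, nationality TEXT);
CREATE TABLE countries (id INTEGER, language_id INTEGER,
                        iso_code_2 TEXT, country TEXT);
INSERT INTO users VALUES
    (1, 'Joe', 'Doe', 'JP', 'PH'),
    (2, 'Lisa', 'Simpson', 'US', 'AR'),
    (3, 'Homer', 'Simpson', 'JP', 'JP');
INSERT INTO countries VALUES
    (1, 1, 'JP', 'Japan'), (2, 2, 'JP', '日本'),
    (3, 1, 'PH', 'Philippines'), (4, 2, 'PH', 'フィリピン'),
    (5, 1, 'US', 'United States'), (6, 2, 'US', '米国'),
    (7, 1, 'AR', 'Argentina'), (8, 2, 'AR', 'アルゼンチン');
""")

# One join per country column; each alias resolves a different ISO code.
rows = conn.execute("""
SELECT u.id, u.firstname, u.lastname,
       cr.country AS residence_country,
       cn.country AS nationality_country
FROM users AS u
LEFT JOIN countries AS cr
       ON cr.language_id = 1 AND cr.iso_code_2 = u.residence
LEFT JOIN countries AS cn
       ON cn.language_id = 1 AND cn.iso_code_2 = u.nationality
ORDER BY u.id
""").fetchall()

for row in rows:
    print(row)
```

No GROUP BY is needed, since each join matches at most one countries row per user for a given language_id.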