How to make a stacked bar chart side by side (grouped) with this dataset in Julia? - vega-lite

I have this dataset in Julia:
julia> import Downloads
julia> using DLMReader, VegaLite, InMemoryDatasets
julia> data = Downloads.download("https://raw.githubusercontent.com/akshdfyehd/salary/main/ds_salaries.csv")
julia> ds = filereader(data, emptycolname=true)
julia> new = filter(ds, :employment_type, by = ==("FT"))
julia> select!(new, :job_title, :work_year, :experience_level, :salary_in_usd)
588×4 Dataset
Row │ job_title work_year experience_level salary_in_usd
│ identity identity identity identity
│ String? Int64? String? Int64?
─────┼────────────────────────────────────────────────────────────────────────
1 │ Data Scientist 2020 MI 79833
2 │ Machine Learning Scientist 2020 SE 260000
3 │ Big Data Engineer 2020 SE 109024
4 │ Product Data Analyst 2020 MI 20000
5 │ Machine Learning Engineer 2020 SE 150000
6 │ Data Analyst 2020 EN 72000
7 │ Lead Data Scientist 2020 SE 190000
8 │ Data Scientist 2020 MI 35735
9 │ Business Data Analyst 2020 MI 135000
10 │ Lead Data Engineer 2020 SE 125000
11 │ Data Scientist 2020 EN 51321
12 │ Data Scientist 2020 MI 40481
13 │ Data Scientist 2020 EN 39916
14 │ Lead Data Analyst 2020 MI 87000
⋮ │ ⋮ ⋮ ⋮ ⋮
576 │ Data Analytics Manager 2022 SE 150260
577 │ Data Analytics Manager 2022 SE 109280
578 │ Data Scientist 2022 SE 210000
579 │ Data Analyst 2022 SE 170000
580 │ Data Scientist 2022 MI 160000
581 │ Data Scientist 2022 MI 130000
582 │ Data Analyst 2022 EN 67000
583 │ Data Analyst 2022 EN 52000
584 │ Data Engineer 2022 SE 154000
585 │ Data Engineer 2022 SE 126000
586 │ Data Analyst 2022 SE 129000
587 │ Data Analyst 2022 SE 150000
588 │ AI Scientist 2022 MI 200000
561 rows omitted
I have tried the following graph with code like this:
new |> @vlplot(:bar, columns=2, wrap="experience_level:o", x={:job_title}, y={"salary_in_usd", sort="-x"}, color=:work_year, width=600, height=150)
Are there any suggestions on how to make the stacked parts appear side by side? Any other package will do as long as it produces a good graph. Thanks in advance!

Related

ClickHouse: Want to extract data from an Array(Tuple) column in ClickHouse

Query used to create the table:
CREATE TABLE default.ntest2(job_name String, list_data Array(Tuple(s UInt64, e UInt64, name String))) ENGINE = MergeTree ORDER BY (job_name) SETTINGS index_granularity = 8192;
Table Data:
┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name────┐
│ job1     │ [19,22]     │ [38,92]     │ ['test1','test2'] │
│ job2     │ [28,63]     │ [49,87]     │ ['test3','test4'] │
└──────────┴─────────────┴─────────────┴───────────────────┘
Expected Output:
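┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name─┐
│ job1     │ 19          │ 38          │ 'test1'        │
│ job1     │ 22          │ 92          │ 'test2'        │
│ job2     │ 28          │ 49          │ 'test3'        │
│ job2     │ 63          │ 87          │ 'test4'        │
└──────────┴─────────────┴─────────────┴────────────────┘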
How can I achieve this with less query time?
Use ARRAY JOIN (https://clickhouse.com/docs/en/sql-reference/statements/select/array-join/), which expands each array element into its own row:
SELECT
job_name,
`list_data.s`,
`list_data.e`,
`list_data.name`
FROM
(
SELECT
c1 AS job_name,
c2 AS list_data
FROM values(('job1', ([19, 22], [38, 92], ['test1', 'test2'])), ('job2', ([28, 63], [49, 87], ['test3', 'test4'])))
) AS T
ARRAY JOIN
list_data.1 AS `list_data.s`,
list_data.2 AS `list_data.e`,
list_data.3 AS `list_data.name`
┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name─┐
│ job1 │ 19 │ 38 │ test1 │
│ job1 │ 22 │ 92 │ test2 │
│ job2 │ 28 │ 49 │ test3 │
│ job2 │ 63 │ 87 │ test4 │
└──────────┴─────────────┴─────────────┴────────────────┘
SELECT
job_name,
list_data.s,
list_data.e,
list_data.name
FROM
(
SELECT
c1 AS job_name,
c2 AS `list_data.s`,
c3 AS `list_data.e`,
c4 AS `list_data.name`
FROM values(('job1', [19, 22], [38, 92], ['test1', 'test2']), ('job2', [28, 63], [49, 87], ['test3', 'test4']))
) AS T
ARRAY JOIN
`list_data.s` AS `list_data.s`,
`list_data.e` AS `list_data.e`,
`list_data.name` AS `list_data.name`
┌─job_name─┬─list_data.s─┬─list_data.e─┬─list_data.name─┐
│ job1 │ 19 │ 38 │ test1 │
│ job1 │ 22 │ 92 │ test2 │
│ job2 │ 28 │ 49 │ test3 │
│ job2 │ 63 │ 87 │ test4 │
└──────────┴─────────────┴─────────────┴────────────────┘
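To make the row expansion concrete, here is a minimal Python sketch of what ARRAY JOIN over several parallel arrays does conceptually. It is an illustration only, not how ClickHouse implements it; the helper name array_join is hypothetical, and the equal-length check reflects the assumption that arrays joined together must have matching sizes.

```python
# Rows as stored: one row per job, carrying three parallel arrays.
rows = [
    ("job1", [19, 22], [38, 92], ["test1", "test2"]),
    ("job2", [28, 63], [49, 87], ["test3", "test4"]),
]


def array_join(rows):
    """Emit one output row per array position (hypothetical helper)."""
    out = []
    for job_name, starts, ends, names in rows:
        # Assumption: arrays expanded together must have the same length.
        if not (len(starts) == len(ends) == len(names)):
            raise ValueError(f"array sizes do not match for {job_name}")
        # Position i of every array ends up in the same output row.
        for s, e, name in zip(starts, ends, names):
            out.append((job_name, s, e, name))
    return out


flattened = array_join(rows)
for row in flattened:
    print(row)
```

A row whose arrays are empty produces no output rows in this sketch, which is also how a plain ARRAY JOIN behaves.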

ClickHouse Aggregates - GROUP BY DAY/MONTH/YEAR(timestamp)?

Is there a way in ClickHouse to do a GROUP BY DAY/MONTH/YEAR() with a timestamp value? I am having a hard time figuring it out while rewriting MySQL queries for ClickHouse. My MySQL queries look like this:
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate GROUP BY DAY(stamp)
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate GROUP BY MONTH(stamp)
SELECT COUNT(this), COUNT(that) FROM table WHERE something = x AND stamp BETWEEN startdate AND enddate GROUP BY YEAR(stamp)
Quite simple AND SLOW in MySQL, but I do not know how to do the aggregates in ClickHouse.
Thanks!
To extract part of a date, use the functions toYear, toMonth, and toDayOfMonth as follows:
SELECT
toMonth(time) AS month,
count()
FROM
(
SELECT
number,
addDays(now(), number) AS time
FROM numbers(8)
)
GROUP BY month
/*
┌─month─┬─count()─┐
│ 1 │ 7 │
│ 2 │ 1 │
└───────┴─────────┘
*/
To get multiple grouping sets in a single query, use the WITH ROLLUP modifier:
SELECT
toYear(time) AS year,
toMonth(time) AS month,
toDayOfMonth(time) AS day,
count()
FROM
(
SELECT
number,
addDays(now(), number) AS time
FROM numbers(8)
)
GROUP BY
year,
month,
day
WITH ROLLUP
/*
┌─year─┬─month─┬─day─┬─count()─┐
│ 2021 │ 2 │ 1 │ 1 │ // day
│ 2021 │ 1 │ 29 │ 1 │ // day
│ 2021 │ 1 │ 31 │ 1 │ // day
│ 2021 │ 1 │ 26 │ 1 │ // day
│ 2021 │ 1 │ 25 │ 1 │ // day
│ 2021 │ 1 │ 28 │ 1 │ // day
│ 2021 │ 1 │ 30 │ 1 │ // day
│ 2021 │ 1 │ 27 │ 1 │ // day
│ 2021 │ 1 │ 0 │ 7 │ // month
│ 2021 │ 2 │ 0 │ 1 │ // month
│ 2021 │ 0 │ 0 │ 8 │ // year
│ 0 │ 0 │ 0 │ 8 │
└──────┴───────┴─────┴─────────┘
*/
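The rollup rows above can be reproduced with a short Python sketch, which may help when reading the output: each date is counted once at every level (day, month, year, grand total), and key parts dropped at a level are shown as 0. The start date 2021-01-25 is an assumption chosen to match the sample output, and rollup_counts is a hypothetical helper, not a ClickHouse API.

```python
from collections import Counter
from datetime import date, timedelta

# Eight consecutive days; the start date is assumed so the result
# lines up with the sample output (seven days in January, one in February).
days = [date(2021, 1, 25) + timedelta(days=n) for n in range(8)]


def rollup_counts(dates):
    """Count dates per (year, month, day) plus every rollup prefix."""
    counts = Counter()
    for d in dates:
        key = (d.year, d.month, d.day)
        # level 3 = day rows, 2 = month rows, 1 = year rows, 0 = grand total
        for level in range(3, -1, -1):
            # Pad the dropped key parts with 0, as in the ROLLUP output.
            counts[key[:level] + (0,) * (3 - level)] += 1
    return counts


totals = rollup_counts(days)
for key in sorted(totals, reverse=True):
    print(key, totals[key])
```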

How to get a SQL Query to combine two tables

I have some questions about a SQL query. I've a table called CUSTOMER with the following fields (name, city):
Juan New York
Louis Madrid
And another table called ASESOR with the following fields (name, city):
Michael New York
Peter Zurich
Dan Madrid
I need a query that combines both tables where the city is the same.
Expected result:
Juan New York
Louis Madrid
Michael New York
Dan Madrid
Peter has to be out of the results.
Thanks
You could use UNION ALL combined with EXISTS:
SELECT name , city
FROM CUSTOMER c
WHERE EXISTS (SELECT 1 FROM ASESOR a WHERE c.city= a.city)
UNION ALL
SELECT name , city
FROM ASESOR a
WHERE EXISTS (SELECT 1 FROM CUSTOMER c WHERE c.city= a.city);
Output:
┌──────────┬──────────┐
│ name │ city │
├──────────┼──────────┤
│ Juan │ New York │
│ Louis │ Madrid │
│ Michael │ New York │
│ Dan │ Madrid │
└──────────┴──────────┘
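The query can be tried against the sample rows with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for whatever database the question uses, and the TEXT column types are an assumption, since the question gives only the field names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (name TEXT, city TEXT);
CREATE TABLE ASESOR (name TEXT, city TEXT);
INSERT INTO CUSTOMER VALUES ('Juan', 'New York'), ('Louis', 'Madrid');
INSERT INTO ASESOR VALUES ('Michael', 'New York'), ('Peter', 'Zurich'), ('Dan', 'Madrid');
""")

# Keep a customer only if some advisor shares the city, and vice versa.
query = """
SELECT name, city
FROM CUSTOMER c
WHERE EXISTS (SELECT 1 FROM ASESOR a WHERE c.city = a.city)
UNION ALL
SELECT name, city
FROM ASESOR a
WHERE EXISTS (SELECT 1 FROM CUSTOMER c WHERE c.city = a.city)
"""
result = conn.execute(query).fetchall()
for name, city in result:
    print(name, city)
```

Peter is filtered out because no customer lives in Zurich. UNION ALL keeps duplicates, so a person present in both tables would appear twice; UNION would collapse such rows.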

How to read a non-standard space delimited data into a DataFrame and build a GLM model using it?

I am trying to read a tab delimited file with all data present into julia. It saves all the columns as NullableArrays.NullableArray{Int64,1} although I specified the type:
data = CSV.read("../datasets/baby.dat"; delim='\t', types=[Int, Float64, Float64, Float64, Float64, Float64])
The dataset is from http://stat.ethz.ch/Teaching/Datasets/baby.dat
I want to run a regression on the dataset, but the GLM.jl package gives an error with NullableArrays.
Any ideas?
The complete error message is:
fit(GeneralizedLinearModel, @formula(Survival2 ~ Weight+Age+X1.Apgar+X5.Apgar+pH), data, Binomial(), ProbitLink())
ERROR: Non-call expression encountered
Stacktrace:
[1] dospecials(::Expr) at
/.julia/v0.6/DataFrames/src/statsmodels/formula.jl:97
[2] collect_to!(::Array{Symbol,1},
::Base.Generator{Array{Any,1},DataFrames.#dospecials}, ::Int64,::Int64) at
./array.jl:508
[3] collect_to_with_first!(::Array{Symbol,1}, ::Symbol,
::Base.Generator{Array{Any,1},DataFrames.#dospecials}, ::Int64) at
./array.jl:495
[4] _collect(::Array{Any,1},
::Base.Generator{Array{Any,1},DataFrames.#dospecials}, ::Base.EltypeUnknown,
::Base.HasShape) at ./array.jl:489
[5] map(::Function, ::Array{Any,1}) at ./abstractarray.jl:1868
[6] dospecials(::Expr) at
.julia/v0.6/DataFrames/src/statsmodels/formula.jl:101
[7] DataFrames.Terms(::DataFrames.Formula) at
.julia/v0.6/DataFrames/src/statsmodels/formula.jl:209
[8] #ModelFrame#127(::Array{Any,1}, ::Type{T} where T, ::DataFrames.Formula, ::DataFrames.DataFrame) at .julia/v0.6/DataFrames/src/statsmodels/formula.jl:333
[9] (::Core.#kw#Type)(::Array{Any,1}, ::Type{DataFrames.ModelFrame}, ::DataFrames.Formula, ::DataFrames.DataFrame) at ./<missing>:0
[10] #fit#153(::Dict{Any,Any}, ::Array{Any,1}, ::Function, ::Type{GLM.GeneralizedLinearModel}, ::DataFrames.Formula, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::Vararg{Any,N} where N) at .julia/v0.6/DataFrames/src/statsmodels/statsmodel.jl:52
[11] fit(::Type{GLM.GeneralizedLinearModel}, ::DataFrames.Formula, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::GLM.ProbitLink) at .julia/v0.6/DataFrames/src/statsmodels/statsmodel.jl:52
[12] eval(::Module, ::Any) at ./boot.jl:235
[13] eval(::Any) at ./boot.jl:234
[14] macro expansion at .julia/v0.6/Atom/src/repl.jl:186 [inlined]
[15] anonymous at ./<missing>:?
I assume that you want to get a DataFrame. Unfortunately your file is not tab-delimited. This is how you can load it into a DataFrame:
using DataFrames
data = split.(readlines("baby.dat"))
types = [Int, Float64, Float64, Float64, Float64, Float64]
df = DataFrame([parse.(t, getindex.(data[2:end], i)) for (i, t) in enumerate(types)],
Symbol.(replace.(data[1], ".", "")))
Note that I remove the . from the column names, because the GLM package has problems with names that contain it.
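For readers who want to see the parsing idea in isolation, here is the same technique sketched in Python: split each line on runs of whitespace, convert each column with its declared type, and strip the dots from the header names. The raw excerpt is hypothetical (modelled on the first two rows shown by head(df) further down, not copied from the real file), and read_whitespace_table is a made-up helper name.

```python
# Hypothetical excerpt: a header line plus two data rows, separated by
# runs of spaces rather than tabs.
raw = """Survival Weight   Age X1.Apgar X5.Apgar   pH
1  1350  32  4  7  7.25
0   725  27  5  6  7.36
"""

# One converter per column, mirroring the `types` vector in the Julia code.
types = [int, float, float, float, float, float]


def read_whitespace_table(text, types):
    """Parse whitespace-separated text into a dict of typed columns."""
    # str.split() with no argument splits on any run of whitespace.
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, body = lines[0], lines[1:]
    # Drop dots from column names, as the Julia answer does.
    names = [name.replace(".", "") for name in header]
    return {
        name: [convert(row[i]) for row in body]
        for i, (name, convert) in enumerate(zip(names, types))
    }


table = read_whitespace_table(raw, types)
print(table)
```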
Now you can check that all is as desired:
julia> showcols(df)
247×6 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values │
├───────┼──────────┼─────────┼─────────┼──────────────────┤
│ 1 │ Survival │ Int64 │ 0 │ 1 … 0 │
│ 2 │ Weight │ Float64 │ 0 │ 1350.0 … 790.0 │
│ 3 │ Age │ Float64 │ 0 │ 32.0 … 27.0 │
│ 4 │ X1Apgar │ Float64 │ 0 │ 4.0 … 4.0 │
│ 5 │ X5Apgar │ Float64 │ 0 │ 7.0 … 8.0 │
│ 6 │ pH │ Float64 │ 0 │ 7.25 … 7.35 │
julia> head(df)
6×6 DataFrames.DataFrame
│ Row │ Survival │ Weight │ Age │ X1Apgar │ X5Apgar │ pH │
├─────┼──────────┼────────┼──────┼─────────┼─────────┼──────┤
│ 1 │ 1 │ 1350.0 │ 32.0 │ 4.0 │ 7.0 │ 7.25 │
│ 2 │ 0 │ 725.0 │ 27.0 │ 5.0 │ 6.0 │ 7.36 │
│ 3 │ 0 │ 1090.0 │ 27.0 │ 5.0 │ 7.0 │ 7.42 │
│ 4 │ 0 │ 1300.0 │ 24.0 │ 9.0 │ 9.0 │ 7.37 │
│ 5 │ 0 │ 1200.0 │ 31.0 │ 5.0 │ 5.0 │ 7.35 │
│ 6 │ 0 │ 590.0 │ 22.0 │ 9.0 │ 9.0 │ 7.37 │
julia> tail(df)
6×6 DataFrames.DataFrame
│ Row │ Survival │ Weight │ Age │ X1Apgar │ X5Apgar │ pH │
├─────┼──────────┼────────┼──────┼─────────┼─────────┼──────┤
│ 1 │ 1 │ 1120.0 │ 28.0 │ 7.0 │ 7.0 │ 7.33 │
│ 2 │ 1 │ 1020.0 │ 28.0 │ 5.0 │ 7.0 │ 7.34 │
│ 3 │ 1 │ 1320.0 │ 28.0 │ 6.0 │ 6.0 │ 7.24 │
│ 4 │ 0 │ 900.0 │ 27.0 │ 5.0 │ 6.0 │ 7.37 │
│ 5 │ 1 │ 1150.0 │ 27.0 │ 4.0 │ 7.0 │ 7.37 │
│ 6 │ 0 │ 790.0 │ 27.0 │ 4.0 │ 8.0 │ 7.35 │
Now the GLM part (notice the correct way to call GLM):
julia> using GLM
julia> glm(@formula(Survival ~ Weight+Age+X1Apgar+X5Apgar+pH), df, Binomial(), ProbitLink())
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.ProbitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: Survival ~ 1 + Weight + Age + X1Apgar + X5Apgar + pH
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) -0.563327 8.36692 -0.0673279 0.9463
Weight 0.00213458 0.000479601 4.45074 <1e-5
Age 0.0996481 0.0444713 2.24073 0.0250
X1Apgar 0.0698717 0.0646315 1.08108 0.2797
X5Apgar 0.0371294 0.0703724 0.527614 0.5978
pH -0.624956 1.11015 -0.562946 0.5735
You can check that the results are the same as in R for this model.

MySQL - Left Join with Case select sets NULL value in first CASE

Scenario
I have a users table with columns holding the iso_code_2 of each user's country of residence and nationality, and another table that lists all the countries in different languages. I want to get the country name for each user's residence and nationality. I know the problem is the GROUP BY, but I do not know how to solve it.
Tables
/* Users table */
╔══════╦═════════════╦════════════╦═════════════╦═══════════════╗
║ id ║ firstname ║ lastname ║ residence ║ nationality ║
╚══════╩═════════════╩════════════╩═════════════╩═══════════════╝
│ 1 │ Joe │ Doe │ JP │ PH │
├──────┼─────────────┼────────────┼─────────────┼───────────────┤
│ 2 │ Lisa │ Simpson │ US │ AR │
├──────┼─────────────┼────────────┼─────────────┼───────────────┤
│ 3 │ Homer │ Simpson │ JP │ JP │
└──────┴─────────────┴────────────┴─────────────┴───────────────┘
/* Countries table */
╔══════╦═══════════════╦══════════════╦═════════════════════╗
║ id ║ language_id ║ iso_code_2 ║ country ║
╚══════╩═══════════════╩══════════════╩═════════════════════╝
│ 1 │ 1 │ JP │ Japan │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 2 │ 2 │ JP │ 日本 │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 3 │ 1 │ PH │ Philippines │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 4 │ 2 │ PH │ フィリピン │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 5 │ 1 │ US │ United States │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 6 │ 2 │ US │ 米国 │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 7 │ 1 │ AR │ Argentina │
├──────┼───────────────┼──────────────┼─────────────────────┤
│ 8 │ 2 │ AR │ アルゼンチン │
└──────┴───────────────┴──────────────┴─────────────────────┘
/* Expected results */
╔══════╦═════════════╦════════════╦════════════════════════╦═══════════════════════╗
║ id ║ firstname ║ lastname ║ residence_country ║ nationality_country ║
╚══════╩═════════════╩════════════╩════════════════════════╩═══════════════════════╝
│ 1 │ Joe │ Doe │ Japan │ Philippines │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 2 │ Lisa │ Simpson │ United States │ Argentina │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 3 │ Homer │ Simpson │ Japan │ Japan │
└──────┴─────────────┴────────────┴────────────────────────┴───────────────────────┘
Current Query
SELECT
u.id,
u.firstname,
u.lastname,
CASE c.iso_code_2
WHEN u.nationality THEN c.country
END AS nationality_country,
CASE c.iso_code_2
WHEN u.residence THEN c.country
END AS residence_country
FROM
users AS u
LEFT JOIN
countries AS c ON c.language_id = 1 WHERE c.iso_code_2 IN (u.nationality, u.residence)
GROUP BY u.id
ORDER BY u.created_at DESC
LIMIT 15
Wrong results
╔══════╦═════════════╦════════════╦════════════════════════╦═══════════════════════╗
║ id ║ firstname ║ lastname ║ residence_country ║ nationality_country ║
╚══════╩═════════════╩════════════╩════════════════════════╩═══════════════════════╝
│ 1 │ Joe │ Doe │ NULL │ Philippines │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 2 │ Lisa │ Simpson │ NULL │ Argentina │
├──────┼─────────────┼────────────┼────────────────────────┼───────────────────────┤
│ 3 │ Homer │ Simpson │ NULL │ Japan │
└──────┴─────────────┴────────────┴────────────────────────┴───────────────────────┘
Your GROUP BY keeps only one row per user in the result. Depending on which joined row MySQL happens to pick, that row carries either the residence_country or the nationality_country, and the other column is NULL.
You need to join the countries table twice, once per column, to get the desired result (this also makes the query simpler):
SELECT
u.id,
u.firstname,
u.lastname,
cn.country AS nationality_country,
cr.country AS residence_country
FROM
users AS u
LEFT JOIN countries AS cn ON cn.language_id = 1 AND cn.iso_code_2 = u.nationality
LEFT JOIN countries AS cr ON cr.language_id = 1 AND cr.iso_code_2 = u.residence
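The two-join approach can be checked against the sample data from the question with Python's built-in sqlite3 module. Treat it as a sketch under stated assumptions: SQLite stands in for MySQL, the column types are guessed, and the language filter sits inside each ON clause (joined with AND) so that the statement is valid SQL and the LEFT JOIN still returns users whose country code has no match. The ORDER BY u.id is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, firstname TEXT, lastname TEXT,
                    residence TEXT, nationality TEXT);
CREATE TABLE countries (id INTEGER, language_id INTEGER,
                        iso_code_2 TEXT, country TEXT);
INSERT INTO users VALUES
  (1, 'Joe', 'Doe', 'JP', 'PH'),
  (2, 'Lisa', 'Simpson', 'US', 'AR'),
  (3, 'Homer', 'Simpson', 'JP', 'JP');
INSERT INTO countries VALUES
  (1, 1, 'JP', 'Japan'),         (2, 2, 'JP', '日本'),
  (3, 1, 'PH', 'Philippines'),   (4, 2, 'PH', 'フィリピン'),
  (5, 1, 'US', 'United States'), (6, 2, 'US', '米国'),
  (7, 1, 'AR', 'Argentina'),     (8, 2, 'AR', 'アルゼンチン');
""")

# One join per lookup column; each alias resolves a different code.
query = """
SELECT u.id, u.firstname, u.lastname,
       cr.country AS residence_country,
       cn.country AS nationality_country
FROM users AS u
LEFT JOIN countries AS cr
       ON cr.language_id = 1 AND cr.iso_code_2 = u.residence
LEFT JOIN countries AS cn
       ON cn.language_id = 1 AND cn.iso_code_2 = u.nationality
ORDER BY u.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because each user matches at most one countries row per alias for a given language_id, no GROUP BY is needed.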