I have a MySQL table "prop" with a column "details" that contains a dict (JSON) field.
fnum details
55 '{"a":"3"},{"b":"2"},{"d":"1"}'
I have tried to convert this to a table using this:
SELECT p.fnum, deets.*
FROM prop p
JOIN JSON_TABLE( p.details,
'$[*]'
COLUMNS (
idx FOR ORDINALITY,
a varChar(10) PATH '$.a',
b varchar(20) PATH '$.b'
d varchar(45) PATH '$.d',
)
) deets
I have tried various paths including $.*. I am expecting the following:
fnum a b d
55 3 2 1
Also, if I have two rows such as
fnum details
55 '{"a":"3"},{"b":"2"},{"d":"1"}'
56 '{"c":"car"}'
it should generate the following:
fnum a b d c
55 3 2 1 null
56 null null null car
Your details data is not valid JSON, because it doesn't have [ ] delimiting the array.
Demo:
mysql> create table prop (fnum int, details json);
mysql> insert into prop select 55, '{"a":"3"},{"b":"2"},{"d":"1"}';
ERROR 3140 (22032): Invalid JSON text: "The document root must
not be followed by other values." at position 9 in value for column
'prop.details'.
mysql> insert into prop select 55, '[{"a":"3"},{"b":"2"},{"d":"1"}]';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
It's worth using the JSON data type instead of storing JSON in a text column, because the JSON data type ensures that the document is valid JSON format. It must be valid JSON to use the JSON_TABLE() function or any other JSON function.
Also your query has some syntax mistakes with respect to commas:
SELECT p.fnum, deets.*
FROM prop p
JOIN JSON_TABLE( p.details,
'$[*]'
COLUMNS (
idx FOR ORDINALITY,
a varChar(10) PATH '$.a',
b varchar(20) PATH '$.b' <-- missing comma
d varchar(45) PATH '$.d', <-- extra comma
)
) deets
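For reference, here is the query with the commas fixed (a sketch, assuming the [ ]-wrapped details value from the demo above). Note that '$[*]' produces one output row per array element, so a, b, and d land on separate rows with NULLs elsewhere; to collapse them into the single row the question expects, aggregate by fnum:
SELECT p.fnum, deets.*
FROM prop p
JOIN JSON_TABLE( p.details,
         '$[*]'
         COLUMNS (
           idx FOR ORDINALITY,
           a varchar(10) PATH '$.a',
           b varchar(20) PATH '$.b',
           d varchar(45) PATH '$.d'
         )
       ) deets;

SELECT p.fnum,
       MAX(deets.a) AS a,
       MAX(deets.b) AS b,
       MAX(deets.d) AS d
FROM prop p
JOIN JSON_TABLE( p.details,
         '$[*]'
         COLUMNS (
           a varchar(10) PATH '$.a',
           b varchar(20) PATH '$.b',
           d varchar(45) PATH '$.d'
         )
       ) deets
GROUP BY p.fnum;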
When I run my pgloader script, it fails this way:
load database
from mysql://xxx:yyy@127.0.0.1/zzz
into pgsql://xxx:yyy@localhost/zzz
including only table names matching 'TABLE'
with
data only,
create no tables, preserve index names,
batch rows = 1000,
batch size = 500 MB,
prefetch rows = 1000
-- on error stop,
set work_mem to '2048 MB', maintenance_work_mem to '4096 MB';
-- before load do $$ drop schema if exists jobs cascade; $$;
I get this error message intermittently (it doesn't always happen), and I'm not sure what parameters to set. I have plenty of RAM, and the records are about 50 KB each.
2020-05-25T04:28:14.194000Z INFO Incomplete Foreign Key definition: constraint "fk_job_currency1" on table "jobs2_beta2.job" referencing table NIL
2020-05-25T04:28:14.194000Z INFO Incomplete Foreign Key definition: constraint "fk_job_job_category1" on table "jobs2_beta2.job" referencing table NIL
2020-05-25T04:28:14.194000Z INFO Incomplete Foreign Key definition: constraint "fk_job_organization1" on table "jobs2_beta2.job" referencing table NIL
2020-05-25T04:28:14.194000Z INFO Incomplete Foreign Key definition: constraint "fk_job_resource1" on table "jobs2_beta2.job" referencing table NIL
2020-05-25T04:28:14.194000Z INFO Incomplete Foreign Key definition: constraint "fk_job_user1" on table "jobs2_beta2.job" referencing table NIL
2020-05-25T04:28:14.198000Z SQL MySQL: sending query: -- params: db-name
-- table-type-name
-- only-tables
-- only-tables
-- including
-- filter-list-to-where-clause incuding
-- excluding
-- filter-list-to-where-clause excluding
SELECT table_name, index_name, index_type,
sum(non_unique),
cast(GROUP_CONCAT(column_name order by seq_in_index) as char)
FROM information_schema.statistics
WHERE table_schema = 'jobs2_beta2'
and (table_name = 'job')
GROUP BY table_name, index_name, index_type;
2020-05-25T04:28:14.225000Z INFO Processing source catalogs
2020-05-25T04:28:14.272000Z NOTICE Prepare PostgreSQL database.
2020-05-25T04:28:14.275000Z DEBUG CONNECTED TO #<PGLOADER.PGSQL:PGSQL-CONNECTION pgsql://ziprecruiter#localhost:5432/ziprecruiter {1006A2A683}>
2020-05-25T04:28:14.275000Z DEBUG SET client_encoding TO 'utf8'
2020-05-25T04:28:14.275000Z DEBUG SET work_mem TO '1024 MB'
2020-05-25T04:28:14.276000Z DEBUG SET maintenance_work_mem TO '4096 MB'
2020-05-25T04:28:14.276000Z DEBUG SET application_name TO 'pgloader'
2020-05-25T04:28:14.280000Z DEBUG BEGIN
2020-05-25T04:28:14.314000Z SQL DROP TABLE IF EXISTS jobs2_beta2.job CASCADE;
2020-05-25T04:28:15.316000Z SQL CREATE TABLE jobs2_beta2.job
(
id bigserial not null,
resource_id bigint not null,
url text not null,
job_title text,
html_job_description text,
text_job_description text,
last_crawl_date timestamptz,
first_indexed_date timestamptz,
job_category_id bigint not null,
currency_id bigint not null,
salary_exact double precision,
salary_is_range smallint,
salary_range_start double precision,
salary_range_end double precision,
address text,
source_program varchar(255),
organization varchar(255),
organization_count bigint,
expiration_date timestamptz,
is_sponsored smallint,
is_hidden smallint,
educational_requirements text,
experience_requirements text,
destination text,
organization_id bigint not null,
user_id bigint not null,
is_expired smallint,
salary_periodicity varchar(255),
json_schema text,
clean_job_description text,
estimated_job_category varchar(255)
);
2020-05-25T04:28:15.326000Z SQL -- params: table-names
select n, n::regclass::oid
from (values ('jobs2_beta2.job')) as t(n);
2020-05-25T04:28:15.392000Z NOTICE COPY jobs2_beta2.job
2020-05-25T04:28:15.392000Z DEBUG Reader started for jobs2_beta2.job
2020-05-25T04:28:15.405000Z DEBUG start jobs2_beta2.job 1400
2020-05-25T04:28:15.407000Z INFO COPY ON ERROR STOP
2020-05-25T04:28:15.408000Z DEBUG CONNECTED TO #<MYSQL-CONNECTION mysql://jobs_client2#127.0.0.1:3306/jobs2_beta2 {100A4E61B3}>
2020-05-25T04:28:15.408000Z SQL MySQL: sending query: SELECT `id`, `resource_id`, `url`, `job_title`, `html_job_description`, `text_job_description`, `last_crawl_date`, `first_indexed_date`, `job_category_id`, `currency_id`, `salary_exact`, `salary_is_range`, `salary_range_start`, `salary_range_end`, `address`, `source_program`, `organization`, `organization_count`, `expiration_date`, `is_sponsored`, `is_hidden`, `educational_requirements`, `experience_requirements`, `destination`, `organization_id`, `user_id`, `is_expired`, `salary_periodicity`, `json_schema`, `clean_job_description`, `estimated_job_category` FROM `job`
2020-05-25T04:28:15.416000Z DEBUG CONNECTED TO #<PGLOADER.PGSQL:PGSQL-CONNECTION pgsql://ziprecruiter#localhost:5432/ziprecruiter {100A8C11A3}>
2020-05-25T04:28:15.416000Z DEBUG SET client_encoding TO 'utf8'
2020-05-25T04:28:15.416000Z DEBUG SET work_mem TO '1024 MB'
2020-05-25T04:28:15.416000Z DEBUG SET maintenance_work_mem TO '4096 MB'
2020-05-25T04:28:15.416000Z DEBUG SET application_name TO 'pgloader'
2020-05-25T04:28:15.416000Z SQL SET search_path TO jobs2_beta2;
2020-05-25T04:28:15.417000Z INFO pgsql:copy-rows-from-queue[0]: jobs2_beta2.job (id resource_id url job_title
html_job_description
text_job_description
last_crawl_date
first_indexed_date
job_category_id currency_id
salary_exact salary_is_range
salary_range_start
salary_range_end address
source_program organization
organization_count
expiration_date is_sponsored
is_hidden
educational_requirements
experience_requirements
destination organization_id
user_id is_expired
salary_periodicity json_schema
clean_job_description
estimated_job_category)
Gen Boxed Unboxed LgBox LgUnbox Pin Alloc Waste Trig WP GCs Mem-age
0 0 0 0 0 0 0 0 42949672 0 0 0.0000
1 577 42719 0 36 5 1301487024 118415952 753961944 0 1 1.3187
2 1776 83008 25 60 65 2579123472 201863920 2000000 668 0 0.8672
3 0 0 0 0 0 0 0 2000000 0 0 0.0000
4 0 0 0 0 0 0 0 2000000 0 0 0.0000
5 0 0 0 0 0 0 0 2000000 0 0 0.0000
6 1593 1278 0 0 0 90993120 3083808 2000000 1501 0 0.0000
7 0 0 0 0 0 0 0 2000000 0 0 0.0000
Total bytes allocated = 3971603616
Dynamic-space-size bytes = 4294967296
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = true
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 9514(tid 0x7ffff492f700):
Heap exhausted, game over.
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>
I tried fiddling with the memory parameters; can anybody provide assistance with this issue?
Found a GitHub issue related to "Heap exhausted, game over" that might be helpful:
https://github.com/dimitri/pgloader/issues/327
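Two suggestions that come up in that issue and in the pgloader docs: give the embedded SBCL runtime a bigger heap, and shrink how much data pgloader buffers at once. A rough sketch (the --dynamic-space-size value is in megabytes as far as I know, and the batch numbers are guesses to tune for ~50 KB rows):
pgloader --dynamic-space-size 8192 my_script.load
and, in the load file, smaller batches:
with
     data only,
     create no tables, preserve index names,
     batch rows = 100,
     batch size = 64 MB,
     prefetch rows = 100
The log above shows Dynamic-space-size bytes = 4294967296 (4 GB) with almost all of it allocated, so the default heap was already nearly exhausted when the COPY started.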
I get this error and have tried everything available on the internet and Stack Overflow to solve it. I am trying to run a query after connecting to a MySQL DB using the sqlx package and then scan through the results. I have tried the solutions shared for similar questions, but nothing has worked for me.
type Trip struct {
ID int `db:"id"`
Type int `db:"type"`
DID int `db:"did"`
DUID int `db:"duid"`
VID int `db:"vid"`
Sts string `db:"sts"`
AM int `db:"am"`
Sdate null.Time `db:"sdate"`
}
func GetTripByID(db sqlx.Queryer, id int) (*Trip, error) {
row := db.QueryRowx("select ID,Type,DID,DUID,VID,Sts,AM,Sdate from mytbl where ID=?", id)
var t Trip
err := row.StructScan(&t)
if err != nil {
fmt.Println("Error during struct scan")
return nil, err
}
return &t, nil
}
The exact error that I get is
panic: sql: Scan error on column index 6, name "sdate": null:
cannot scan type []uint8 into null.Time: [50 48 49 56 45 49 50 45 48
55 32 48 50 58 48 56 58 53 49]
Syntax-wise the query works perfectly fine and I get results when I run it in SQL Workbench. I have also tried ParseTime=true as suggested by one of the links.
Try using the special types for null values in the "database/sql" package.
For example, when a text or varchar column can be NULL in the DB, use sql.NullString as the variable type.
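For instance, a minimal sketch with database/sql's null types (not the asker's exact schema; it assumes the go-sql-driver/mysql driver, and sql.NullTime additionally needs Go 1.13+ plus parseTime=true in the DSN so DATETIME columns arrive as time.Time):
import (
    "database/sql"
    "fmt"
)

// fetchSts is a minimal sketch: scan a possibly-NULL varchar column with sql.NullString.
func fetchSts(db *sql.DB, id int) error {
    var sts sql.NullString
    if err := db.QueryRow("SELECT Sts FROM mytbl WHERE ID = ?", id).Scan(&sts); err != nil {
        return err
    }
    if sts.Valid {
        fmt.Println("sts:", sts.String)
    } else {
        fmt.Println("sts is NULL")
    }
    return nil
}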
As suggested above, I did null handling for the column "Sdate"
// NullTime defines a nullable time type based on mysql.NullTime
type NullTime mysql.NullTime

// Scan implements the Scanner interface for NullTime
func (nt *NullTime) Scan(value interface{}) error {
    var t mysql.NullTime
    if err := t.Scan(value); err != nil {
        return err
    }
    // if the raw value is nil, mark Valid as false
    if reflect.TypeOf(value) == nil {
        *nt = NullTime{t.Time, false}
    } else {
        *nt = NullTime{t.Time, true}
    }
    return nil
}
and made these changes in the struct:
type Trip struct {
ID int `db:"id"`
Type int `db:"type"`
DID int `db:"did"`
DUID int `db:"duid"`
VID int `db:"vid"`
Sts string `db:"sts"`
AM int `db:"am"`
Sdate NullTime `db:"sdate"`
}
So the solution is not just defining a struct that handles NULL, but also implementing the Scanner interface.
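With that in place, callers can check Valid before touching the time value (a small hypothetical usage fragment; assumes fmt and log are imported):
trip, err := GetTripByID(db, 123)
if err != nil {
    log.Fatal(err)
}
if trip.Sdate.Valid {
    fmt.Println("start date:", trip.Sdate.Time)
} else {
    fmt.Println("start date is NULL")
}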
BigQuery has facilities to parse JSON in real-time interactive queries: Just store the JSON encoded object as a string, and query in real time, with functions like JSON_EXTRACT_SCALAR.
However, I can't find a way to discover all the keys (properties) in these objects.
Can I use a UDF for this?
Here's something that uses Standard SQL:
CREATE TEMP FUNCTION jsonObjectKeys(input STRING)
RETURNS Array<String>
LANGUAGE js AS """
return Object.keys(JSON.parse(input));
""";
WITH keys AS (
SELECT
jsonObjectKeys(myColumn) AS keys
FROM
myProject.myTable
WHERE myColumn IS NOT NULL
)
SELECT
DISTINCT k
FROM keys
CROSS JOIN UNNEST(keys.keys) AS k
ORDER BY k
The version below fixes some "issues" in the original answer, such as:
1. only the first level of keys was emitted
2. having to manually compile and then run a final query to extract info based on the discovered keys
SELECT type, key, value, COUNT(1) AS weight
FROM JS(
(SELECT json, type
FROM [fh-bigquery:openlibrary.ol_dump_20151231#0]
WHERE type = '/type/edition'
),
json, type, // Input columns
"[{name: 'type', type:'string'}, // Output schema
{name: 'key', type:'string'},
{name: 'value', type:'string'}]",
"function(r, emit) { // The function
x = JSON.parse(r.json);
processKey(x, '');
function processKey(node, parent) {
if (parent !== '') {parent += '.'};
Object.keys(node).map(function(key) {
value = node[key].toString();
if (value !== '[object Object]') {
emit({type:r.type, key:parent + key, value:value});
} else {
processKey(node[key], parent + key);
};
});
};
}"
)
GROUP EACH BY type, key, value
ORDER BY weight DESC
LIMIT 1000
The result is as below
Row type key value weight
1 /type/edition type.key /type/edition 25140209
2 /type/edition last_modified.type /type/datetime 25140209
3 /type/edition created.type /type/datetime 17092292
4 /type/edition languages.0.key /languages/eng 14514830
5 /type/edition notes.type /type/text 11681480
6 /type/edition revision 2 8714084
7 /type/edition latest_revision 2 8704217
8 /type/edition revision 3 5041680
9 /type/edition latest_revision 3 5040634
10 /type/edition created.value 2008-04-01T03:28:50.625462 3579095
11 /type/edition revision 1 3396868
12 /type/edition physical_format Paperback 3181270
13 /type/edition revision 4 3053266
14 /type/edition latest_revision 4 3053197
15 /type/edition revision 5 2076094
16 /type/edition latest_revision 5 2076072
17 /type/edition publish_country nyu 1727347
18 /type/edition created.value 2008-04-30T09:38:13.731961 1681227
19 /type/edition publish_country enk 1627969
20 /type/edition publish_places London 1613755
21 /type/edition physical_format Hardcover 1495864
22 /type/edition publish_places New York 1467779
23 /type/edition revision 6 1437467
24 /type/edition latest_revision 6 1437463
25 /type/edition publish_country xxk 1407624
The answers above don't work well in the current (2021) version: they fail if either the JSON field is null or the JSON has null entries, they don't aggregate well (we're trying to get structure, not content), and so on.
So, here's an improved version based on Felipe Hoffa's answer.
It's fully recursive; checks for null and Array types; suppresses array indices (as []); flagged deterministic so it'll get cached; and groups, sorts, & counts the results.
Sample output:
key type n
"" null 213
avatar string 1046
blinking boolean 1046
created_at string 1046
deprecated_fields Array 1046
display_name string 1046
fields Array 1046
fields.[] Object 31
fields.[].name string 31
fields.[].value string 31
fields.[].verified_at null 27
fields.[].verified_at string 4
friends_count number 1046
Note:
the empty string key means that the field itself is actually null
the deprecated_fields key is one where all examples in the JSON are ..., deprecated_fields: [], ...
null is returned as the string "null", like other types (not SQL null)
It could be improved to detect different types of number (int, bigint, float, decimal), dates, numbers stored as strings, or the like. But eh, this was good enough for my purposes, and that'd require more processing.
Just change the your-* bits in the last couple lines:
CREATE TEMP FUNCTION jsonParsed(input STRING)
RETURNS Array<Struct<key STRING, type STRING>>
DETERMINISTIC LANGUAGE js AS
"""
function processKey(node, parent) {
var ary = [];
if (parent !== '') {
parent += '.';
}
if (node == null) {
ary.push({
key: parent,
type: 'null'
})
} else {
Object.keys(node).map(function(key) {
var v = node[key];
if (node.constructor.name == "Array") {
keytouse = '[]'
} else {
keytouse = key
}
if ((v == null) || (typeof(v) !== 'object')) {
if (v == null) { typetouse = 'null';} else {typetouse = typeof(v);}
ary.push({
key: parent + keytouse,
type: typetouse
});
} else {
ary.push({
key: parent + keytouse,
type: v.constructor.name
});
ary = [].concat(ary, processKey(v, parent + keytouse));
}
});
}
return ary;
}
return processKey(JSON.parse(input), '');
""";
with keys as (SELECT jsonParsed(your-json-field) as keys FROM `your-project-id.your-database-id.your-table-id`)
select key, type, count(*) as n from keys k cross join unnest(k.keys) as kk group by key, type order by key asc;
How to extract all of a JSON object's keys using a JavaScript UDF in BigQuery:
SELECT type, key
FROM (
SELECT * FROM
js(
(SELECT json, type FROM [fh-bigquery:openlibrary.ol_dump_20151231]
),
// Input columns.
json, type,
// Output schema.
"[{name: 'key', type:'string'},
{name: 'type', type:'string'}]",
// The function.
"function(r, emit) {
x=JSON.parse(r.json)
Object.keys(x).forEach(function(entry) {
emit({key:entry, type:r.type,});
});
}"
)
)
LIMIT 100
Grouped and counted:
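(A rough sketch of that step, reusing the same js() call as above and just adding the aggregation; untested against the current dataset.)
SELECT type, key, COUNT(*) AS n
FROM (
  SELECT * FROM
    js(
      (SELECT json, type FROM [fh-bigquery:openlibrary.ol_dump_20151231]),
      json, type,
      "[{name: 'key', type:'string'},
        {name: 'type', type:'string'}]",
      "function(r, emit) {
        x = JSON.parse(r.json);
        Object.keys(x).forEach(function(entry) {
          emit({key: entry, type: r.type});
        });
      }"
    )
)
GROUP BY type, key
ORDER BY n DESC
LIMIT 100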
Once you've found all the keys, you can use JSON_EXTRACT_SCALAR in a normal SQL query to extract all the information known for a type:
SELECT JSON_EXTRACT_SCALAR(json, '$.key') key,
JSON_EXTRACT_SCALAR(json, '$.type.key') type,
JSON_EXTRACT(json, '$.revision') revision,
JSON_EXTRACT_SCALAR(json, '$.last_modified.value') last_modified,
JSON_EXTRACT_SCALAR(json, '$.title') title,
JSON_EXTRACT_SCALAR(json, '$.publish_date') publish_date,
JSON_EXTRACT(json, '$.publishers') publishers,
JSON_EXTRACT(json, '$.latest_revision') latest_revision,
JSON_EXTRACT(json, '$.languages') languages,
JSON_EXTRACT(json, '$.authors') authors,
JSON_EXTRACT(json, '$.works') works,
JSON_EXTRACT(json, '$.number_of_pages') number_of_pages,
JSON_EXTRACT(json, '$.publish_places') publish_places,
JSON_EXTRACT(json, '$.publish_country') publish_country,
JSON_EXTRACT(json, '$.subjects') subjects,
JSON_EXTRACT_SCALAR(json, '$.created.value') created,
JSON_EXTRACT_SCALAR(json, '$.pagination') pagination,
JSON_EXTRACT_SCALAR(json, '$.by_statement') by_statement,
JSON_EXTRACT(json, '$.isbn_10') isbn_10,
JSON_EXTRACT_SCALAR(json, '$.isbn_10[0]') isbn_10_0,
JSON_EXTRACT(json, '$.notes') notes,
JSON_EXTRACT(json, '$.lc_classifications') lc_classifications,
JSON_EXTRACT_SCALAR(json, '$.subtitle') subtitle,
JSON_EXTRACT(json, '$.lccn') lccn,
JSON_EXTRACT(json, '$.identifiers') identifiers,
JSON_EXTRACT(json, '$.contributions') contributions,
JSON_EXTRACT(json, '$.isbn_13') isbn_13,
JSON_EXTRACT_SCALAR(json, '$.isbn_13[0]') isbn_13_0,
JSON_EXTRACT(json, '$.physical_format') physical_format,
JSON_EXTRACT(json, '$.oclc_numbers') oclc_numbers,
JSON_EXTRACT(json, '$.series') series,
JSON_EXTRACT(json, '$.source_records') source_records,
JSON_EXTRACT(json, '$.covers') covers,
JSON_EXTRACT(json, '$.dewey_decimal_class') dewey_decimal_class,
JSON_EXTRACT_SCALAR(json, '$.edition_name') edition_name,
# ...
FROM [fh-bigquery:openlibrary.ol_dump_20151231]
WHERE type='/type/edition'
LIMIT 10
(sample data taken from an Open Library data dump https://openlibrary.org/developers/dumps, based on a reddit conversation)
This is what I came up with (specifically for Standard SQL). Not sure if accumulating in a list is the best method. Also, I simplified it for my case, where I'm just concerned with keys.
CREATE TEMPORARY FUNCTION Foo(infoo STRING)
RETURNS Array<String>
LANGUAGE js AS """
blah = [];
function processKey(node, parent) {
if (parent !== '') {parent += '.'};
Object.keys(node).forEach(function(key) {
value = node[key].toString();
if (value !== '[object Object]') {
blah.push(parent+key)
} else {
processKey(node[key], parent + key);
};
});
};
try {
x = JSON.parse(infoo);
processKey(x,'');
return blah;
} catch (e) { return null }
"""
OPTIONS ();
WITH x as(
select Foo(jsonfield) as bbb from clickstream.clikcs
)
select distinct arr_item from (SELECT arr_item FROM x, UNNEST(bbb) as arr_item)